R
tidyverse
Kubernetes
FastR
Statistical Process Control
- https://rpubs.com/anhoej/controlcharts
- https://link.springer.com/chapter/10.1007/978-1-4614-3652-2_12
- https://stackoverflow.com/questions/38661660/spc-control-charts-by-group-in-r
- http://blog.yhat.com/posts/quality-control-in-r.html
Packages
spc
qicharts
qcc
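The packages above automate control charts. As a rough base-R sketch of what an individuals (XmR) chart computes, the following uses made-up measurements; the variable names and data are illustrative assumptions, not taken from any of the linked sources.

```r
# Hypothetical process measurements (made-up numbers for illustration only)
x <- c(10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7, 10.1, 12.5)

# Moving ranges: absolute differences between consecutive observations
moving_range <- abs(diff(x))

# Centre line and sigma estimate; 1.128 is the d2 constant for subgroups of size 2
centre <- mean(x)
sigma_hat <- mean(moving_range) / 1.128

# Three-sigma control limits
ucl <- centre + 3 * sigma_hat
lcl <- centre - 3 * sigma_hat

# Indices of points outside the limits (possible special-cause variation)
out_of_control <- which(x > ucl | x < lcl)

# Draw the chart only in an interactive session
if (interactive()) {
  plot(x, type = "b", ylim = range(c(x, ucl, lcl)), ylab = "measurement")
  abline(h = c(lcl, centre, ucl), lty = c(2, 1, 2))
}
```

Packages such as qcc wrap these steps and add further run rules; the sketch only shows the basic limit calculation.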
About
- The documentation on how to use R is straight up crystal clear spotless perfect, easy to understand.
- You don’t even need to install any libraries to perform powerful statistical and visualization tests.
- It’s all built-in, and the dataframe structure is completely genius and simple!
- In other languages you would have to make multi-dimensional arrays, where referring to each object in the array makes you want to cry.
- The only time that I don’t use R is when my data is an absolute mess.
- Also, in R you don’t need to use your standard loops very much, getting lost in loop-ception because R has simple wrapper loops that you can use.
- Even though python is very powerful for a lot of things including web-development, I can’t be bothered to use it because white-space is an error.
- In R, white-space isn’t necessary if you have newlines so it’s not an error.
Source: Unknown
R User Conference
2017
- pool connections with shiny proxy
- dataMaid package: clean() function
- Opening the Publication Process with Executable Research dlib.org
- docker-reproducible-research
- containerit Generating dockerfiles for reproducible research with R
- drake
- remake … %>% htmlwidgets::widgetFrame()
- time series imputation: multivariate, ts imputation: Amelia, mtsdi; in case of high correlation between cross-section variables, use those packages
Graphical User Interface (GUI)
Archlinux version
Dependencies
boost
sudo dnf install boost
cython
sudo dnf install -y python*-Cython*
Geospatial Data
Update Packages
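update.packages(ask = FALSE, dependencies = c('Suggests'))
# reinstall packages that were built under a different R version
all.packages <- installed.packages()
r.version <- paste(version[['major']], '.', version[['minor']], sep = '')
for (i in seq_len(nrow(all.packages)))
{
  package.name <- all.packages[i, 'Package']
  # the 'Built' column holds the R version the package was built under
  # (column 3 is 'Version', the package's own version)
  package.built <- all.packages[i, 'Built']
  if (package.built != r.version)
  {
    print(paste('Installing', package.name))
    install.packages(package.name)
  }
}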
Code Style
Text Mining
- Text Mining with R - A Tidy Approach, Julia Silge and David Robinson, 2017-05-22
- wikibooks: R Programming/Text Processing
Class System
RStudio IDE
Updating RStudio
- check https://www.rstudio.com/products/rstudio/download/ for updates
install using rpm in Fedora
cd ~/Downloads
wget https://download1.rstudio.org/rstudio-1.0.153-x86_64.rpm
rpm -e rstudio
rpm -ivh rstudio-1.0.153-x86_64.rpm
Building RStudio from source
C/C++ Tools
- Qt Creator: download Qt Creator 4.2.2 for Linux 64-bit (94 MB), allow executing and double-click the run file
- Boost: navigate to ./dependencies/common and execute bash install-boost
after building CMakeLists.txt using Qt Creator
cd ./src/build-cpp-Desktop-Default
make
# run RStudio server on http://localhost:8787
./rserver-dev
# run desktop application
export QT_QPA_PLATFORM_PLUGIN_PATH=/home/xps13/Qt5.4.0/5.4/gcc_64/plugins/platforms
desktop/rstudio
GWT
- gwt: Eclipse Plugin Download describes how to install the GWT plugin for Eclipse
monitor java files for changes and recompile upon change
ant devmode
Eclipse (CDT) + StatET
Learning R
- JMM 2016 Minicourse: Teaching Statistics with R and RStudio
- f.briatte.org: Introduction to Data Analysis
- Statistics in Action with R
Connect to DB
Microsoft SQL Server
- github: imanuelcostigan/RSQLServer
- cran: RSQLServer: SQL Server R Database Interface (DBI) and ‘dplyr’ SQL Backend
library(DBI)
con <- dbConnect(RSQLServer::SQLServer(), server="localhost", port=1401, properties=list(useNTLMv2="true", user="SA", password=Sys.getenv("MSSQLPW")))
con <- dbConnect(RSQLServer::SQLServer(), server = "TEST", database = "TestDB")
dbWriteTable(con, "band_members", dplyr::band_members)
dbWriteTable(con, "band_instruments", dplyr::band_instruments)
dbListTables(con)
dbReadTable(con, 'band_members')
dbListFields(con, 'band_instruments')
dplyr usage
library(dplyr, warn.conflicts = FALSE)
members <- tbl(con, "band_members")
instruments <- tbl(con, "band_instruments")
members %>%
left_join(instruments) %>%
filter(band == "Beatles")
collect(members)
clean up
dbRemoveTable(con, "band_instruments")
dbRemoveTable(con, "band_members")
dbDisconnect(con)
PostgreSQL
- Postgresql + R Sandbox free online PostgreSQL database
- My advice on dplyr::mutate() using dplyr
- ElephantSQL: PostgreSQL as a Service
- github: rstats-db/RPostgres
- Introduction to dbplyr
system requirements
sudo dnf install -y postgresql-devel
Rscript -e 'install.packages("RPostgreSQL")'
connect to database
con <- dbConnect(drv = "PostgreSQL",
dbname = "szgdlszd",
host = "horton.elephantsql.com",
port = 5432,
user = "szgdlszd",
password = Sys.getenv("ELEPHANTSQLPW"))
RMySQL
- github: rstats-db: RMySQL
- requires mariadb-devel (Fedora <= 25), MariaDB-devel (Fedora 26) or libmariadb-client-lgpl-dev (Debian)
using MariaDB 10.2.8, need to copy C header files to location specified in RMariaDB/configure
sudo cp /usr/include/mysql/server/mysql_version.h /usr/include/mysql/mysql/mysql_version.h
sudo cp /usr/include/mysql/server/mysql_com.h /usr/include/mysql/mysql/mysql_com.h
Install R on Linux
Debian Jessie
- add to /etc/apt/sources.list
deb http://cran.univ-paris1.fr/bin/linux/debian jessie-cran3/
Uninstall version installed from source
- download source code from cran
- extract and execute
sudo make uninstall
RStudio using Docker
- running locally
$ sudo docker run -d -p 8787:8787 rocker/hadleyverse
will download and install
RStudio in AWS
Popularity
When/Why R is Better than Excel
- Data Manipulation.
- R allows you to manipulate (e.g., subset, recode, merge) data quickly. Some R packages have been designed specifically for these purposes, e.g., plyr. Typically, a majority of the time spent on an analysis project is spent before the analysis—preparing the data. R is much more adept and efficient in data preparation than Excel. Fantasy data scraped from websites often require many steps in data processing to be ready for analysis, so R is ideal.
- Easier Automation
- R uses a scripting language rather than a GUI, so it’s much easier to automate things in R than in Excel. This can save you loads of time, especially when you plan to re-run the same analysis multiple times (e.g., every new fantasy season).
- Faster Computation
- Because of the automation provided by R scripts, many operations are much faster to perform in R than Excel.
- It Reads Any Type of Data
- R can basically read any type of data (.txt, .csv, .dat, etc.). There are also R packages specifically designed to read JSON, SPSS, Excel, SAS, and STATA files. You can also scrape data from websites and execute SQL queries. Scraping websites can be useful for downloading fantasy projections from ESPN and other websites for data analysis.
- Easier Project Organization
- In Excel, projects are often organized in different tabs of the same file. This can make the Excel file slow, clunky, and difficult to navigate. It is easier to keep a project organized when dealing with R scripts because different tasks or sub-projects can be stored in separate files stored in the same folder and linked together in the same project with RStudio. For an example folder structure for R projects, see here.
- It Supports Larger Data Sets
- Excel has restrictions on how large your data can be. Even if your data don’t exceed this maximum size, Excel can become slow with large data sets (especially after you add tabs, formulas, and references). R supports larger data, and can support big data through interfaces to frameworks such as Hadoop.
- Reproducibility
- R has features that make it much easier to reproduce the findings of your analysis, which is important for detecting errors. First, it’s easy to add comments to your scripts to make it clear what you’re doing. Commenting your code is crucial, and can serve as a translation for someone else looking at your code, or as a reminder of what you did 6 months ago! It is difficult to document steps you’ve done in Excel. Second, data and analysis are separated in R, allowing you to see the logical progression for data analysis in the R code. In Excel, however, data and formulas are together, and it can be difficult to follow the data analyst’s train of logic. Third, you can use version control with git a) to track (and revert) changes you make over time and b) to share your scripts with others to collaborate on projects as a community. Having more people examining your work can help find and fix errors and make other improvements. Excel files are binary files, so you can’t track changes to Excel files. The github site hosting the R scripts for this site is located here. Feel free to use the scripts and suggest improvements!
- Accuracy
- Researchers have shown that Excel and other spreadsheets show important inaccuracies for basic analyses like linear regression. R was specifically designed for statistical analysis, so it is more precise and accurate for data analysis.
- Easier to Find and Fix Errors
- Because R uses scripting rather than clicking, and allows comments and version control, one can see a history of the actions taken to achieve the result. This makes it easier to find and troubleshoot errors. In Excel, however, errors can be hidden in formulas in cells that can be difficult to find. Spreadsheet errors have led to widely-publicized mistakes, including disastrous financial losses, faulty government policies, and the wrong drugs being given to cancer patients. Humans make mistakes and mistakes in data analysis are inevitable, whether with spreadsheets or with R code. The bottom line is that it’s easier to find and fix these mistakes in R than it is in Excel, making it more likely that you’re getting an accurate result in R.
- It’s Free
- Enough said.
- It’s Open Source
- Unlike Excel and other proprietary software used for data analysis, R is not a black box. You can examine the code for any function or computation you perform. You can even modify and improve these functions by changing the code.
- Advanced Statistics
- R has many more (and more advanced) statistics capabilities than Excel does. They also tend to be faster and more flexible. Part of the advanced capabilities of R owes to the fact that R is open source and many users have contributed packages for performing specialized functions. For example, this fantasy football draft optimizer uses the Rglpk package to find your optimal starting lineup of players that maximizes the team’s projected points while minimizing its downside risk.
- State-of-the-Art Graphics
- R has advanced graphics capabilities (see here for examples and code for how to create them). You can create beautiful graphics using the base R package, or with the lattice or ggplot packages. People like to digest and understand statistics visually, and R provides a better tool for creating pretty visualizations than Excel does.
- It Runs on Many Platforms
- You can use R on Windows, Mac, Linux, and Unix.
- Anyone (Including You) Can Contribute Packages to the Community to Improve its Functionality
- In the chance there isn’t an R package that does what you need to do, you can write a function to perform the task and can contribute it as a package to the community for others to use and improve. The number of R packages contributed to the community is increasing at a rapid rate. Chances are, if there’s an analysis you need to do, an R package exists to do it.
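The data manipulation point above (subset, recode, merge) can be sketched in a few lines of base R. The data frames, column names and thresholds below are made up for illustration and do not come from the original article.

```r
# Hypothetical player statistics (illustrative data only)
players <- data.frame(
  player = c("A", "B", "C", "D"),
  position = c("QB", "RB", "WR", "RB"),
  points = c(310, 245, 198, 120),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table; player "D" is deliberately missing
teams <- data.frame(
  player = c("A", "B", "C"),
  team = c("X", "Y", "X"),
  stringsAsFactors = FALSE
)

# Subset: keep players above an arbitrary points threshold
starters <- subset(players, points > 150)

# Recode: derive a categorical tier from the numeric points column
starters$tier <- ifelse(starters$points >= 250, "high", "mid")

# Merge: left join on the shared key column
merged <- merge(starters, teams, by = "player", all.x = TRUE)
print(merged)
```

Because the steps live in a script, re-running them on next season's data only requires swapping the input tables.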
CRAN
CGE Computable General Equilibrium (and DSGE)
Image Processing
read_image("frink.png")
- vignette with TOC: vector images: layers, frames, other composition etc.
Data Cleaning
Debian Linux
Syntax Highlighter
- Pretty R syntax highlighter www.inside-r.org/pretty-r
Commands
- start help server
help.start()
- debug
options(error = recover)  # enter the debugger on error; options(error = NULL) restores the default
traceback()
.Rprofile: R init functions
- select CRAN mirror
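current_repo <- getOption("repos")
current_repo["CRAN"] <- "http://cran.us.r-project.org"
options(repos = current_repo)
- choose a CRAN mirror interactively from the list of mirrors
chooseCRANmirror()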
Rcpp
rscala
- copy Github repository
$ git clone https://github.com/cran/rscala
- navigate to folder containing JARs
$ cd rscala/inst/java
- start scala instance with specified classpath
$ scala -cp rscala_2.10-1.0.6.jar
- instantiate an R interpreter
scala> val R = org.ddahl.rscala.callback.RClient()
- evaluate R expression
scala> val side = R.evalS0("sample(c('heads', 'tails'), 1)")
- print statement
scala> println(s"Your coin landed $side.")
rJava
- clone from github: s-u: rJava
- R Tutorial: How to integrate R with Java using rJava
- rforge.net: JRI
- slideshare.net: rcuprak: Combining R With Java For Data Analysis (Devoxx UK 2015 Session)
Set CLASSPATH
- add rJava JARs to Java
CLASSPATH
$ nano ~/.profile
add
export CLASSPATH="$CLASSPATH:/home/xps13/R/x86_64-redhat-linux-gnu-library/3.2/rJava/jri/JRI.jar"
Link rJava libraries
$ sudo ln -s /home/xps13/R/x86_64-redhat-linux-gnu-library/3.2/rJava/jri/libjri.so /usr/lib64/libjri.so
- run examples in rJava/jri/examples: ./run rtest, ./run rtest2
- add packaged example in subfolder pkg
[... jri]$ javac -cp JRI.jar:. examples/pkg/Temp.java
[... jri]$ ./run pkg.Temp
Web queries
curl
RCurl
Conferences
- EARL: Effective Applications of the R Language
- eRum 2016
- satRday #1 September 3 2016 MTA TT, Budapest, Hungary
Web applications
DeployR
SAP Hana integration
Parallel Computation
- Keynote UseR 2017 Norman Matloff: Obstacles to performance in parallel programming
- CRAN Task View: High-Performance and Parallel Computing
- PDF R-core: Package parallel
- r-bloggers: How-to go parallel in R – basics + tips
- cran.r-project.org: doParallel
- Using R for HPC Data Science
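As a minimal sketch of the parallel package listed above, the following compares a forked parallel map with its sequential equivalent. The function slow_square and the core count are illustrative assumptions; forking is not available on Windows, so the sketch falls back to one core there.

```r
library(parallel)

# Hypothetical workload: a deliberately slow function
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

# mclapply relies on forking, which Windows does not support
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Parallel map over the inputs
parallel_result <- mclapply(1:8, slow_square, mc.cores = n_cores)

# Sequential reference for comparison
sequential_result <- lapply(1:8, slow_square)

# Both approaches should return the same list
print(identical(parallel_result, sequential_result))
```

For cross-platform code, makeCluster() with parLapply() is the usual alternative, at the cost of explicitly exporting objects to the workers.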
Map Reduce
RHadoop
sparklyr
SparkR
- github.com: amplab-extras: SparkR-pkg
- amplab-extras.github.io: SparkR
- amplab-extras.github.io: SparkR docs: R frontend for Spark
- spark.apache.org: SparkR (R on Spark)
Installation
- download Apache Spark
- set SPARK_HOME environment variable and add $SPARK_HOME/bin to PATH environment variable
- from GitHub
devtools::install_github("apache/spark", subdir = file.path("R", "pkg"))
Examples
- download Spark 1.4 from http://spark.apache.org/downloads.html
- download the nyc flights dataset as a CSV from https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv
- launch SparkR using
./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
web scraping / harvesting
rvest
- github: hadley: rvest
- blog.rstudio.org: rvest easy webscraping with R
- blog.datacamp.com: Scraping Javascript Generated Data with R
- phantomjs.org: headless WebKit
RStudio Cheat sheets
ggplot
rmarkdown
dplyr, tidyr
- rstudio.com: dplyr and tidyr cheat sheet
- dplyr and tidyr cheat sheet (local)
- rstudio-pubs-static: Brad Boehmke: Data Processing with dplyr & tidyr
Documentation
Visualization
lattice
mosaic
yarrr
$ ./configure && make && sudo make install
$ sudo mkdir /usr/local/lib64/pkgconfig
$ sudo cp ~/Downloads/JAGS-4.2.0/etc/jags.pc /usr/local/lib64/pkgconfig/
ggplot2
ggedit
htmlwidgets
- htmlwidgets
- rstudio.github.io: dygraphs for R
- github: jjallaire/sigma
- github: rstudio/d3heatmap
- github: timelyportfolio: rcdimple
- github: bart6114: dimple
- Plotly R Library 2.0
- plotly cheat sheet
- github: mattflor/chorddiag
                 +------------------+
                 |   htmlwidgets    |
+------------+   |                  |   +-----------------+
| R function +-->|    JS library    +-->| SVG for website |
+------------+   |                  |   +-----------------+
                 | custom user data |
                 +------------------+
rCharts
shiny
shinytest
Color input
Google Sheets
Mail lists
R devel
Links
- wikibooks: Statistical Analysis: an Introduction using R
- R Tutorial - An R Introduction to Statistics
- Comparing R and Stata, Oscar Torres-Reyna
- Awesome R A curated list of awesome R packages and tools
- R for Data Science r4ds.had.co.nz
- github.com/hadley/r4ds
- Advanced R adv-r.had.co.nz
- Packaging with R r-pkgs.had.co.nz
- Capital of Statistics (in Chinese) cos.name
- kaggle Tutorials
- microsoft-r-open-training-series
- Microsoft advanced-analytics
- Scalable Machine Learning and Data Science with Microsoft R Server and Spark, Ali Zaidi, Machine Learning and Data Science, Microsoft, 2016-06-01
- David Robinson: Data Analysis and Visualization Using R (training course)
Videos
- H. Wickham - Expressing yourself in R (Stanford Seminar)
- blog.revolutionanalytics.com: John Chambers recounts the history of S and R
Articles
- Ihaka R, Gentleman R (1996). “R: A Language for Data Analysis and Graphics.” Journal of Computational and Graphical Statistics, 5(3), 299–314.
- R Development Core Team (2010). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
- Gentleman R (2008). R Programming for Bioinformatics. CRC Press.
- R Development Core Team (2008). R: Regulatory Compliance and Validation Issues A Guidance Document for the Use of R in Regulated Clinical Trial Environments. R Foundation for Statistical Computing, Vienna, Austria.
Youtube
Meetup
- R Addicts Paris meetup
Books
- The Art of R Programming, Norman Matloff, 2011
- All of Statistics, A Concise Course in Statistical Inference, by Larry Wasserman
- Principles of Econometrics with R, Constantin Colonescu, 2016-09-01
- R Programming for Data Science, Roger D. Peng, 2016-12-22
- Exploratory Data Analysis with R, Roger D. Peng, 2016-09-14
- Efficient R Programming, C. Gillespie & R. Lovelace, 2017-04-10
- Mastering Software Development in R, Roger D. Peng, Sean Kross, and Brooke Anderson, 2017-01-11
- The S Language (Blue Book)
- Chambers J.M., Hastie T.J. Statistical Models in S (White Book)
- Machine Learning for Hackers
- Author: Drew Conway and John Myles White
Subtitle: Case Studies and Algorithms to Get You Started
Publisher: O’Reilly
Year: 2012
Configuration
Environment Variables
Windows
R_LIBS
D:\R\R-3.2.1\library
R_LIBS_USER
D:\R\R-3.2.1\library
PATH
add D:\R\R-3.1.1\bin\x64 to select the default R version
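To check how these variables are picked up, a running session can be inspected from R itself. This is a small sketch using base functions only; the printed values depend entirely on the local installation.

```r
# Library-related environment variables as seen by this R session
# (empty strings mean the variable is not set)
print(Sys.getenv(c("R_LIBS", "R_LIBS_USER")))

# Library search path that results from those settings
lib_paths <- .libPaths()
print(lib_paths)

# Directory of the R executables for the version that is actually running
r_bin <- R.home("bin")
print(r_bin)
```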
Packages
Building
.Rbuildignore
# Extra material related but not to be included in the package
./inst/extras/eurostat.Rcheck/*
./inst/extras/..Rcheck/*
./inst/extras/...Rcheck/*
./inst/extras/*
./inst/extras/.*
./inst/extras/*.Rmd
./inst/extras/.RData
./inst/extras/*.RData
inst/extras/*.RData
inst/extras/.RData
sandbox/*.R
sandbox/
README.md
# Git related
./.git*
# Travis scripts
.travis.yml
# Extra Vignette materials
vignettes/pxweb.md
# Misc
./.*~
./*~
^.*\.Rproj$
^\.Rproj\.user$
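Entries in .Rbuildignore are treated as Perl-style regular expressions matched case-insensitively against paths relative to the package root, not as shell globs. The sketch below approximates that matching with grepl; the helper is_ignored and the example paths are hypothetical, and the real logic in R CMD build may differ in detail.

```r
# A few patterns taken from the file above
patterns <- c("^.*\\.Rproj$", "^\\.Rproj\\.user$", "README.md")

# Hypothetical paths relative to the package root
paths <- c("mypkg.Rproj", ".Rproj.user", "README.md",
           "R/readme_md_helper.R", "R/script.R")

# Hypothetical helper: TRUE if any pattern matches the path
is_ignored <- function(path, patterns) {
  any(vapply(patterns, grepl, logical(1),
             x = path, perl = TRUE, ignore.case = TRUE))
}

ignored <- vapply(paths, is_ignored, logical(1), patterns = patterns)
print(ignored)

# An anchored, escaped pattern only matches the intended file
strict <- grepl("^README\\.md$", paths, perl = TRUE, ignore.case = TRUE)
print(strict)
```

The unanchored README.md pattern also matches R/readme_md_helper.R, because the dot matches any character; anchoring with ^ and $ and escaping the dot avoids that.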
Testing
Travis CI
- docs.travis-ci.com: Building an R Project
- jtleek.com: Routinely testing your R package with Travis
- github: craigcitro: r-travis
Unit tests
Each new feature should be accompanied with unit tests, by using the testthat R package.
- to set up your package to use testthat, run
usethis::use_testthat()  # formerly devtools::use_testthat()
For each R script file named script.R, a corresponding test file should be created in the tests/testthat directory, using the naming convention test_<script>.R
The test_<script>.R file should have the following structure:
require(rsdmx, quietly = TRUE) #load the rsdmx package
require(testthat) # load the testthat package
context("script") # create a unit test context for the given script file
#unit test 1
test_that("Test1",{
...
})
#unit test 2
test_that("Test2",{
...
})
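As a filled-in sketch of that template, the following tests a made-up helper function. The function celsius_to_fahrenheit and the file names are hypothetical and unrelated to rsdmx; the guard around testthat is only there so the snippet also runs where the package is not installed.

```r
# Hypothetical function under test; in a package it would live in R/script.R
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}

# Hypothetical content of tests/testthat/test_script.R
if (requireNamespace("testthat", quietly = TRUE)) {
  library(testthat)

  # unit test 1: known reference points
  test_that("freezing and boiling points convert correctly", {
    expect_equal(celsius_to_fahrenheit(0), 32)
    expect_equal(celsius_to_fahrenheit(100), 212)
  })

  # unit test 2: the function works on vectors
  test_that("conversion is vectorised", {
    expect_equal(celsius_to_fahrenheit(c(-40, 37)), c(-40, 98.6))
  })
}
```

Inside a package, the function definition and the requireNamespace guard would be dropped, since the test runner loads the package and testthat itself.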
Build tests
After any modification of the source code (bug fix, enhancement, added feature), the developer should test a package build with the command R CMD check (requires an R installation and, on Windows, Rtools). The option --as-cran should be enabled to help ensure the updated package will later be accepted by CRAN. This command runs the set of checks required for a proper package build, including the unit tests. To guarantee a proper R package build, R CMD check is also performed automatically after each commit through Travis Continuous Integration (see https://travis-ci.org/opensdmx/rsdmx). This second build test is required to ensure users will be able to successfully install the package from GitHub.
Test coverage
Excel interaction
- github: kassambara: r2excel
- www.sthda.com: r2excel : Créer et formater facilement un document Excel avec le logiciel R
Open Office
- Options - ROOo - Path Settings
- R-Home: /usr/lib/R
- Proxy: /usr/local/lib/R/site-library/rscproxy/libs