Inaugural Meeting: Las Vegas R Users Group

Dennis Murphy
2 September 2014

Welcome!

Tonight's agenda:

  • History of S and R (very brief!)
  • Why R? Why not R?
  • Essential features of R
  • Discussion of future topics

History of S and R

Why R?

R is a statistical programming language. Its original target audience was statisticians and applied scientists. Big Data has changed that.

Features:

  • Open source, freely available
  • Versions for Windows, Linux and Mac OS
  • Interactive and interpreted language (similar to Python)
  • Object-oriented (but not like Java or C++)
  • Scripting language
  • Has most features of a functional programming language
  • Well documented (manuals, books, web sites)
  • User-extensible
  • Most functions in base R perform vectorized computation
  • Plays nice with other open-source software
  • Has an active and growing user base
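
A minimal sketch of two of the features above (vectorized computation and functional-style programming), using base R only. The names add_one and make_multiplier are hypothetical helpers made up for illustration.

```r
# Vectorized arithmetic: the operation applies to every element at once,
# with no explicit loop.
x <- c(1, 2, 3, 4)
squares <- x^2                      # 1 4 9 16

# Functions are ordinary objects and can be passed to other functions.
# `add_one` is a hypothetical helper used only for illustration.
add_one <- function(v) v + 1
shifted <- sapply(x, add_one)       # 2 3 4 5

# Functions can also create and return other functions (closures).
make_multiplier <- function(k) function(v) k * v
double <- make_multiplier(2)
doubled <- double(x)                # 2 4 6 8
```

Note that add_one(x) would give the same result as the sapply() call, because the addition itself is vectorized; sapply() is shown only to illustrate passing a function as an argument.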

What does R do well?

  • graphics
  • data manipulation
  • statistical modeling and diagnosis thereof
  • communication with other software
  • with Sweave and knitr, documentation is attractive and easy
  • RStudio lets you use Sweave/knitr in a way similar to IPython notebooks

What doesn't R do well?

R was written in the early 1990s, before the advent of Big Data. Some design decisions made at the time limit its effectiveness (at least without help from contributed packages):

  • R is in-memory software, which means that the amount of data capable of being processed is limited by available RAM.
  • R does not natively perform parallel processing, although several packages exist to enable it on various platforms.
  • R does not natively interface with databases, but several packages enable communications with various DBMS.
  • Contributed packages sometimes don't communicate well with other packages having the same functionality.
  • It has no native way to scrape data from web sites (but a few packages enable it to a limited extent).
  • Probably some others that escape me at the moment…

Things you need to know about R

  • Everything in R is an object
  • R's evaluator searches for objects in a hierarchical fashion
  • R is lexically scoped (affects how the evaluator searches for objects)
  • R applies lazy evaluation.
  • Most functions in base R apply vectorized computation.
  • Almost 6000 user-contributed packages exist on CRAN in a wide variety of application areas. See the Task Views page at CRAN for an overview.
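
A small sketch of what lexical scoping and lazy evaluation mean in practice, using base R only. The function names make_counter and ignore_second are hypothetical, chosen for illustration.

```r
# Lexical scoping: a function looks up free variables in the environment
# where it was defined, not where it is called.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates `count` in the enclosing environment
    count
  }
}
counter <- make_counter()
first  <- counter()       # 1
second <- counter()       # 2

# Lazy evaluation: an argument is evaluated only if the function uses it.
# The second argument is never touched, so stop() is never called.
ignore_second <- function(a, b) a * 10
result <- ignore_second(2, stop("never evaluated"))   # 20, no error

# Everything is an object, functions included.
fn_is_object <- is.function(ignore_second)            # TRUE
```

Each call to make_counter() creates a fresh environment, so two counters built this way would keep independent counts.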

Graphics

Two graphics engines:

  • base graphics (package graphics)
  • grid graphics (package grid)
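
To make the distinction concrete, here is a rough sketch that draws once with each engine. It uses the built-in mtcars data set and opens a null PDF device so that it runs without a display; the axis labels and text are arbitrary choices for illustration.

```r
library(grid)

# Open a PDF device with no output file, so nothing is written to disk.
pdf(NULL)

# Base graphics: high-level functions draw directly onto the device,
# and later calls add to the existing plot.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(lm(mpg ~ wt, data = mtcars))

# Grid graphics: the picture is built from low-level graphical objects
# placed in viewports. lattice and ggplot2 are built on top of this.
grid.newpage()
grid.rect(gp = gpar(col = "grey"))
grid.text("Hello from grid", x = 0.5, y = 0.5)

# Close the device.
closed <- dev.off()
```

In an interactive session you would drop the pdf(NULL) and dev.off() calls and let the plots appear in the default graphics window.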

Some graphics packages of note

  • lattice and latticeExtra (grid graphics)
  • ggplot2 (grid graphics)
  • plotrix (base graphics)
  • rgl, scatterplot3d (3D graphics)
  • iplots (interactive graphics, Java-based interface)
  • ggvis, rCharts (web-based, interactive graphics)
  • shiny (web-based R applications with a focus on graphics)

Data manipulation/munging

The realm of the data scientist…

Some useful base functions:

  • apply family of functions
  • aggregate()
  • sub(), gsub(), grep(), etc. (regular expression handling)
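
A quick sketch of the base functions listed above, using the built-in mtcars data set. The particular columns and patterns are arbitrary choices for illustration.

```r
# sapply(): apply a function to each column of a data frame.
col_means <- sapply(mtcars[, c("mpg", "hp", "wt")], mean)

# apply(): apply a function over the rows (1) or columns (2) of a matrix.
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)                      # 9 12

# aggregate(): group-wise summaries via a formula interface.
# Here: mean miles per gallon for each number of cylinders.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Regular expressions.
car_names <- rownames(mtcars)
mercs <- grep("^Merc", car_names, value = TRUE)   # names starting with "Merc"
cleaned <- gsub(" +", "_", "Hornet 4 Drive")      # "Hornet_4_Drive"
first_only <- sub("a", "A", "banana")             # "bAnana"
```

sub() replaces only the first match in each string, while gsub() replaces every match.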

Data manipulation, cont.

Packages:

  • doBy (a good starter package, very well documented)
  • plyr and reshape2
  • dplyr (the next generation of plyr)
  • data.table (almost requisite for munging big data in R)
  • stringr (simplifies regex applications in R)
  • lubridate (simplifies date handling)
  • sqldf (allows one to use SQL code on R data frames)
  • ff (package to manage I/O of big data into R)
  • bigmemory and friends (big data processing in Unix)
  • pbdR library (large-scale parallel programming: Unix-based)

RStudio

An excellent IDE for R, useful for novices and developers alike. Some features:

  • available on all supported R platforms
  • more or less the R GUI on steroids
  • easy to document your work:
    • code scripts
    • reports/documents (Markdown, HTML, pdf, Word via pandoc)
    • presentations (like this one)
    • conversion of code scripts to notebooks
  • code and package development

What Next?

I have given you a basic introduction to what R can do, but the important part is to learn HOW to do it…which leads to the question of which topics you are interested in learning about.

Most of this group consists of R novices, so I suggest that this year's topics be devoted to learning the basics (e.g., graphics and data manipulation). This is a request for discussion (RFD), not a statement!

Some possible topics

  • Sessions on R graphics:
    • base graphics
    • lattice graphics
    • ggplot2 graphics
    • grid graphics
    • ggvis
  • Sessions on data manipulation:
    • base R functions
    • plyr, reshape2
    • data.table
    • dplyr

Possible topics, cont.

  • overview of RStudio
  • overview of shiny
  • DBMS and R communication
  • basics of statistical modeling in R

Which topics interest you that are not mentioned above?

Opinion

By providing the foundations first, more people will be able to appreciate and apply the more interesting and trendy features of R.

Several of the topics above require some prior understanding; e.g., you need to be familiar with RStudio before you can appreciate shiny. It helps to know plyr before diving into dplyr, because many concepts in dplyr are extensions of concepts introduced in plyr and reshape2.

Coda

  • Suggestions and participation are welcome.
  • Feel free to volunteer to speak on a topic of your choice; contact Daniel to arrange the details.
  • I have no problem developing a basic R course if there is sufficient interest; my conception is that it would have the length, and perhaps the workload, of a one-term course at the upper undergraduate level.
  • Since several members are fluent in other programming languages, presentations comparing R with language X would be valuable.