Keywords: data-wrangling, high-performance, big-data, in-memory

data.table for beginners

By Arun Srinivasan, data.table co-author

Description:

data.table is one of the fastest open-source in-memory data manipulation packages available today. Its syntax has a learning curve, but once you understand the philosophy behind it, it just clicks! First released to CRAN by Matt Dowle in 2006, it continues to grow in popularity. Over 300 CRAN and Bioconductor packages now import or depend on data.table. Its Stack Overflow tag has attracted ~5,000 questions from users in many fields, making it one of the three most asked-about R packages. It is the 8th most starred R package on GitHub.

This three-hour tutorial starts with basic queries and goes all the way to advanced topics. At the beginning, you will be asked to solve a few commonly occurring data manipulation tasks of varying complexity using your favourite R package, or even your favourite programming language (~10-15 min). After a short discussion, we will move on to learning data.table (see Outline). You will solve a few exercises after each section to internalise each concept. Finally, we will return to the tasks from the start and solve them with data.table.

Outline:

  • Efficiently reading files in - fread
  • Introducing general form of a data.table query - DT[i, j, by] (or for those familiar with SQL: DT[where, select|update, group by])
  • Fast, parallelised subsets, secondary indices and automated indexing for even faster subsets
  • Extending subsets to joins - exploring similarities between subsets and joins is key to understanding data.table’s philosophy
  • Fast and flexible grouped aggregations and updates
  • Powerful joins - equi, rolling, overlapping/range/interval and non-equi joins
  • Unique data.table feature - by = .EACHI
  • Quick look at other new and useful features in recent releases, including the new parallel file writer, fwrite
  • Using data.table in your own package and alongside other packages, and understanding its optimisations to get maximum performance
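
To give a flavour of the outline, here is a minimal sketch of the DT[i, j, by] form, an update by reference, a join written as a subset, by = .EACHI, and an fwrite/fread round trip. The flights and carriers tables and their column names are hypothetical toy data invented for illustration; they are not the tutorial's actual exercises.

```r
library(data.table)

# Hypothetical toy data, invented purely for illustration
flights <- data.table(
  carrier = c("AA", "AA", "UA", "UA", "DL"),
  origin  = c("JFK", "LGA", "JFK", "JFK", "LGA"),
  delay   = c(10, -5, 30, 0, 15)
)

# Round trip through a temporary CSV file with fwrite and fread
tmp <- tempfile(fileext = ".csv")
fwrite(flights, tmp)
from_disk <- fread(tmp)

# DT[i, j, by]: subset rows in i, compute in j, group with by
# (SQL analogy: where, select, group by)
jfk_mean <- flights[origin == "JFK", .(mean_delay = mean(delay)), by = carrier]

# Update by reference with := (adds a column without copying the table)
flights[, late := delay > 0]

# A join written as a subset: look up the rows of 'flights'
# that match each row of 'carriers'
carriers <- data.table(carrier = c("AA", "DL"))
matched <- flights[carriers, on = "carrier"]

# by = .EACHI evaluates j once for each row of the table passed in i
per_carrier <- flights[carriers, .(total_delay = sum(delay)),
                       on = "carrier", by = .EACHI]
```

In this sketch the join and the plain subset share the same DT[i, ...] shape; only what is passed in i differs, which is the similarity the outline refers to.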

Background knowledge:

Familiarity with base R and/or SQL is useful but not essential.

In base R, a good understanding of the list data structure is a plus, in particular manipulating lists with mapply, lapply, Map, Reduce etc.
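
As a rough self-check of that background, the following sketch exercises the list functions named above and then shows one place where they tend to appear in data.table, since j accepts a list whose elements become columns. The scores list and the DT table are hypothetical examples made up for illustration.

```r
library(data.table)

# Hypothetical list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(4, 5, 6))

# lapply: apply a function to each element, always returning a list
means <- lapply(scores, mean)

# Map: walk over several vectors in parallel, returning a list
pair_sums <- Map(`+`, scores$a, scores$b)

# mapply: like Map, but simplifies the result to a vector where possible
diffs <- mapply(function(x, y) y - x, scores$a, scores$b)

# Reduce: fold a list into a single value with a binary function
total <- Reduce(`+`, scores)

# The same idea inside a data.table query: lapply over .SD
# (the subset of data for each group) returns a list of columns
DT <- data.table(g = c("x", "x", "y"), a = c(1, 3, 5), b = c(2, 4, 6))
group_means <- DT[, lapply(.SD, mean), by = g]
```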

Requirements:

You will need a laptop with the latest version of R and the latest stable (CRAN) version of data.table already installed.

Instructor biography:

LinkedIn | Twitter | GitHub

Package Authors:

Matt Dowle (main author, @MattDowle), Arun Srinivasan (co-author, @arun_sriniv) and many other contributors.