From base R functions to dplyr

After using R for 5 years, I want to learn a new approach to handling my data frames.

Yijie Wang
2 min read · Jan 15, 2022
Photo by Azzedine Rouichi on Unsplash

Ever since the first R class I took as an undergraduate, I have enjoyed coding in R because of its straightforward syntax and the classic design of RStudio. I had to use Python during most of my internship, but whenever I have a choice, I choose R for data science projects. I am comfortable with the base functions that R provides, like head, cbind, and summary, and I am used to processing a data frame step by step, one function per line.

Last year, I started working on some large, messy datasets and noticed that my code was becoming less readable and the data preparation too time-consuming. That was when I decided to find a faster and tidier way to manipulate data. The first package that came to mind was data.table: fread and fwrite are fast like magic compared with read.csv. However, I found data.table's syntax complicated and the learning curve very steep. As a result, I turned to dplyr for the following reasons:

  1. it can handle large-scale data at high speed,
  2. its syntax is accessible,
  3. I have previous experience with other tidyverse packages such as ggplot2, and I like how Hadley Wickham embeds his own thinking and theories in the programming syntax.

In this post, I would like to rewrite one of my previous scripts using dplyr.

The old script

This project is about scoring auto-insurance customers for cross-selling. By analyzing the data from the pilot test, we want to predict each customer's probability of accepting a cross-selling offer based on their demographics and purchase history. Before running classification models, we carry out a basic data cleaning process that includes column filtering, creating dummy variables, outlier detection, and log/sqrt transformation.
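The original script is not reproduced here, so the following is only a minimal sketch of what those cleaning steps can look like in base R, one function per line. The toy data, the column names (gender, income, premium, notes), and the 1.5 * IQR outlier rule are all assumptions made for illustration, not the project's real schema or method.

```r
# Hypothetical toy data: column names and values are illustrative only.
customers <- data.frame(
  id      = 1:6,
  gender  = c("F", "M", "M", "F", "M", "F"),
  income  = c(32000, 45000, 51000, 38000, 400000, 47000),
  premium = c(300, 420, 510, 350, 4800, 460),
  notes   = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Column filtering: keep only the columns used for modelling.
keep <- c("id", "gender", "income", "premium")
df <- customers[, keep]

# Dummy variable: 1 for "M", 0 otherwise, then drop the original column.
df$gender_m <- ifelse(df$gender == "M", 1, 0)
df$gender <- NULL

# Outlier detection on income (assumed rule: outside 1.5 * IQR of the quartiles).
q <- unname(quantile(df$income, c(0.25, 0.75)))
iqr <- q[2] - q[1]
df$income_outlier <- df$income < q[1] - 1.5 * iqr | df$income > q[2] + 1.5 * iqr

# Log and square-root transformations for right-skewed columns.
df$log_income <- log(df$income)
df$sqrt_premium <- sqrt(df$premium)
```

Every step overwrites or extends df on its own line, which is easy to follow for a few steps but gets harder to read as the cleaning grows.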

The new script

With dplyr and ggplot2, the whole data preparation and visualization process becomes much more readable.
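As a rough illustration of that readability, here is a sketch of a dplyr pipeline covering column filtering, dummy encoding, outlier flagging, and log/sqrt transformation. It is not the actual project script: the toy data, the column names, and the 1.5 * IQR outlier rule are assumptions made for the example.

```r
library(dplyr)

# Hypothetical toy data: column names and values are illustrative only.
customers <- data.frame(
  id      = 1:6,
  gender  = c("F", "M", "M", "F", "M", "F"),
  income  = c(32000, 45000, 51000, 38000, 400000, 47000),
  premium = c(300, 420, 510, 350, 4800, 460),
  notes   = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

df <- customers %>%
  # Column filtering: keep only the columns used for modelling.
  select(id, gender, income, premium) %>%
  mutate(
    # Dummy variable: 1 for "M", 0 otherwise.
    gender_m = if_else(gender == "M", 1, 0),
    # Outlier flag on income (assumed rule: outside 1.5 * IQR of the quartiles).
    income_outlier =
      income < unname(quantile(income, 0.25)) - 1.5 * IQR(income) |
      income > unname(quantile(income, 0.75)) + 1.5 * IQR(income),
    # Log and square-root transformations for right-skewed columns.
    log_income   = log(income),
    sqrt_premium = sqrt(premium)
  ) %>%
  # Drop the original categorical column once it is encoded.
  select(-gender)
```

The whole preparation reads top to bottom as one chain of verbs, and the data frame name appears only once.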

The new code is also faster. The base R version took 0.06s, while the dplyr version needed only 0.03s to finish the data transformation and dummy variable encoding, about half the time. On a bigger dataset, the advantage of dplyr may be more significant.
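For readers who want to run a similar comparison, one possible way to time the two styles is base R's system.time(). The sketch below uses simulated data with assumed column names (gender, income), so it will not reproduce the numbers above, and a single run on a small dataset is noisy.

```r
library(dplyr)

# Simulated data: size and column names are assumptions for illustration.
set.seed(1)
n <- 1e5
toy <- data.frame(
  gender = sample(c("F", "M"), n, replace = TRUE),
  income = rlnorm(n, meanlog = 10, sdlog = 1),
  stringsAsFactors = FALSE
)

# Base R: one assignment per step.
base_time <- system.time({
  base_df <- toy
  base_df$gender_m <- ifelse(base_df$gender == "M", 1, 0)
  base_df$log_income <- log(base_df$income)
})

# dplyr: the same two steps in a single mutate() call.
dplyr_time <- system.time({
  dplyr_df <- toy %>%
    mutate(
      gender_m   = if_else(gender == "M", 1, 0),
      log_income = log(income)
    )
})

# Elapsed wall-clock seconds for each version; values vary by machine and run.
print(base_time["elapsed"])
print(dplyr_time["elapsed"])
```

Repeating each version several times and comparing the medians gives a steadier picture than one run.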

I am still new to dplyr and ggplot2, and I will surely use more of the tidyverse in my future projects.

Yijie Wang

Business analytics student at Fuqua, Duke University