From base R functions to dplyr
After using R for five years, I want to learn a new approach to handling my data frames.
Since the first R class I took as an undergrad, I have always enjoyed coding in R because of its straightforward syntax and the classic design of RStudio. I have to use Python during most of my internship, but whenever I have a choice, I choose R for data science projects. I am comfortable with the base functions that R provides, like head, cbind, summary, etc. I am also familiar with processing a data frame step by step, one function per line.
Last year, I started to work on some large and messy datasets, and I noticed that my code was becoming less readable and that data preparation was too time-consuming. That was the moment I wanted to find a faster and tidier way to manipulate data. The first package that came to mind was data.table. fread and fwrite are fast like magic compared with read.csv and write.csv. However, I found data.table's syntax complicated and the learning curve very steep. As a result, I turned to dplyr for the following reasons:
- capability of handling large-scale data at high speed,
- accessible syntax,
- I have previous experience with other packages in the tidyverse, like ggplot2, and I also like how Hadley Wickham embeds his personal thinking and theories in programming syntax.
In this post, I would like to rewrite one of my previous scripts using dplyr.
The old script
This project is about auto-insurance customer scoring for cross-selling. By analyzing the data from the pilot test, we want to predict each customer's probability of accepting a cross-selling offer based on their demographics and purchase history. Before running classification models, we conduct a basic data cleaning process, including column filtering, creating dummy variables, outlier detection and log/sqrt transformation.
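The original script is not reproduced here. As a rough illustration of the cleaning steps just listed, a base R version might look like the sketch below. The data frame and its columns (customer_id, age, gender, premium, accepted) are invented for the example and are not the pilot-test data, and the 1.5 * IQR rule is just one common way to flag outliers.

```r
# Hypothetical stand-in for the pilot-test data (all columns are assumptions).
customers <- data.frame(
  customer_id = 1:8,
  age         = c(23, 35, 47, 52, 29, 61, 38, 44),
  gender      = c("F", "M", "M", "F", "F", "M", "F", "M"),
  premium     = c(420, 510, 480, 5300, 390, 610, 450, 530),
  accepted    = c(0, 1, 0, 1, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Column filtering: keep only the modelling columns, dropping the identifier.
clean <- customers[, c("age", "gender", "premium", "accepted")]

# Dummy variable: 1 for male, 0 otherwise, then drop the text column.
clean$gender_m <- ifelse(clean$gender == "M", 1, 0)
clean$gender <- NULL

# Outlier detection: drop rows above Q3 + 1.5 * IQR (an assumed rule).
q <- quantile(clean$premium, c(0.25, 0.75), names = FALSE)
upper <- q[2] + 1.5 * (q[2] - q[1])
clean <- clean[clean$premium <= upper, ]

# Log transformation of the skewed premium column.
clean$log_premium <- log(clean$premium)

summary(clean)
```

Each step overwrites or extends clean on its own line, which is the one-function-per-line style described earlier.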
The new script
Using dplyr and ggplot2, the whole data preparation and visualization process can be much more readable.
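To give a feel for the difference, here is a minimal sketch of the data preparation part written as one dplyr pipeline. It is not the actual project code: the data frame and its columns (customer_id, age, gender, premium, accepted) are made up for illustration, and the 1.5 * IQR cutoff is an assumed outlier rule.

```r
library(dplyr)

# Hypothetical stand-in for the pilot-test data (all columns are assumptions).
customers <- data.frame(
  customer_id = 1:8,
  age         = c(23, 35, 47, 52, 29, 61, 38, 44),
  gender      = c("F", "M", "M", "F", "F", "M", "F", "M"),
  premium     = c(420, 510, 480, 5300, 390, 610, 450, 530),
  accepted    = c(0, 1, 0, 1, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)

clean <- customers %>%
  # Column filtering: drop the identifier.
  select(-customer_id) %>%
  # Dummy variable: 1 for male, 0 otherwise.
  mutate(gender_m = if_else(gender == "M", 1, 0)) %>%
  select(-gender) %>%
  # Outlier detection: keep rows at or below Q3 + 1.5 * IQR.
  filter(premium <= quantile(premium, 0.75, names = FALSE) + 1.5 * IQR(premium)) %>%
  # Log transformation of the skewed premium column.
  mutate(log_premium = log(premium))

clean
```

The pipe reads top to bottom as a list of verbs, so the intermediate assignments of a step-by-step script disappear.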
The new code also runs in half the time: the base R version took 0.06s, while the dplyr version needed only 0.03s to finish the data transformation and dummy variable encoding. With a bigger dataset, the advantage of dplyr may be more significant.
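Timings like these can be taken with system.time from base R. The sketch below shows one way to set up such a comparison. The simulated data and the two helper functions, prep_base and prep_dplyr, are hypothetical and only cover the dummy encoding and log transformation. Single runs that last hundredths of a second are noisy, so each version is repeated in a loop, and the numbers will vary from machine to machine.

```r
library(dplyr)

# Simulated data, assumed for illustration only.
set.seed(1)
n <- 1e5
sim <- data.frame(
  gender  = sample(c("F", "M"), n, replace = TRUE),
  premium = rlnorm(n, meanlog = 6, sdlog = 0.5),
  stringsAsFactors = FALSE
)

# Base R version: one assignment per step.
prep_base <- function(df) {
  df$gender_m <- ifelse(df$gender == "M", 1, 0)
  df$log_premium <- log(df$premium)
  df
}

# dplyr version: the same two columns created in a single mutate().
prep_dplyr <- function(df) {
  df %>%
    mutate(gender_m = if_else(gender == "M", 1, 0),
           log_premium = log(premium))
}

# Repeat each version 20 times; "elapsed" is wall-clock seconds.
base_time  <- system.time(for (i in 1:20) prep_base(sim))["elapsed"]
dplyr_time <- system.time(for (i in 1:20) prep_dplyr(sim))["elapsed"]
print(c(base = unname(base_time), dplyr = unname(dplyr_time)))
```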
I am still new to dplyr and ggplot2, and I will surely use more of the tidyverse in my future projects.