- R Data Cleaning Cheat Sheet Pdf
- R Data Cleaning Cheat Sheet Printable
- R Data Manipulation Cheat Sheet
- Cheat Sheet Dplyr
- R Data Wrangling Cheat Sheet
By Anna Kayfitz, CEO of StrategicDB Corp
All data needs to be clean before you can explore and create models. Common sense, right. Cleaning data can be tedious but I created a function that will help. The function do the following: Clean Data from NA’s and Blanks Separate the clean data – Integer dataframe, Double dataframe, Factor. Mark van der Loo A systematic approach to data cleaning with R. The statistical value chain From raw to technically correct data From technically correct to. Note: When cleaning data, always maintain both the raw data and the cleaned version(s). The raw data should be kept intact and preserved for future use. Any type of data cleaning/analysis should be done on a copy of the raw data. Data Cleaning Feature engineering is the process of using domain knowl-edge to create features or input variables.
As millions or billions of data elements come into your business each day, it is almost inevitable that some of it will lack the necessary quality to create efficient business models. Ensuring that your data is clean should always be the first and arguably most important part of a Data Science workflow as without it, you will have difficulty seeing what is important and potentially make the wrong decisions due to duplicates, anomalies or missing information.
One of the most common and powerful data programming tools is R, an open source language and environment for statistical computing and graphics. R provides uses with all the tools needed to create data science projects but with anything, it is only as good as the data that feeds into it. With that, there are a number of libraries within the R environment that help with data cleaning and manipulation before the start of any project.
Exploring the data
Most of the tools for exploring a set of data that you’ve imported already exist within the R platform.
Summary(data)
This handy command simply gives an overview of all your data attributes, displaying the min, max, median, mean and category splits for each. It is a good method for quickly spotting any potential data anomalies.
Following on from this, you can use a histogram to better understand the distribution of your data. This will visualise show any outliers within the dataset or any numeric column that you are particularly looking to observe.
R Data Cleaning Cheat Sheet Pdf
The plyr package
You will need to install the plyr package to create your Histogram, using the standard R functionality for installing libraries
This will create a visualisation of your data to spot any anomalies quickly for. A boxplot visualisation uses the same package but splits into quartiles for outlier detection. Both of these combined will quickly tell you if you need to limit the dataset or only use certain segments of it within any algorithms or statistical modelling.
Correcting errors
R has a number of pre-built methods for correcting data errors such as converting values as you might do in Excel or SQL with simple logic e.g. as.charater() converts the column to a character string.
However, if you want to start correcting the errors that you saw in your histogram or boxplot, there are additional packages that have the capability of doing just that.
The stringr package
There are a few different ways in which stringr can help cleanse your data including trimming white spaces and replacing certain unnecessary words. These are quite standard bits of code structured as str_trim(YOUR_DATA_FIELD) which simply removes the white space.
However, what about removing the anomalies that our histogram told us we had? It would need a bit more complexity than this but as a basic example, we can tell R to replace all the outliers in our field with the median value of that field. This will move everything in together and take away anomaly bias.
Missing values
It is very simple in R to check for incomplete data and perform and action with that field. For example, this function will eliminate missing vales completely from your chosen data column.
There are similar options to replace blank values with 0’s or N/A depending on the field type and improve the consistency of the dataset.
The tidyr package
The tidyr package is designed to tidy your data. It works by identifying the variables in your dataset and using the tools provided to move them into columns with three main functions or gather (), separate () and spread().
The gather() function takes multiple columns and gathers them into key value pairs. A an example, say you have exam score data like.
Name | Exam A | Exam B |
John | 55 | 80 |
Mike | 76 | 90 |
Sam | 45 | 75 |
The gather functions work by transforming that into usable columns like this.
Name | Exam | Score |
John | A | 55 |
Mike | A | 76 |
Sam | A | 45 |
John | B | 80 |
Mike | B | 90 |
Sam | B | 75 |
Now we are truly able to analyse the exam scores. The separate and spread functions do similar things which you can explore once you have the package but ultimately theyalig your data as needed.
R Data Cleaning Cheat Sheet Printable
Here are a few other packages of note that may be useful for data cleansing in R
- The purr package
The purr package is designed for data wrangling. It is quite similar to the plyr package, albeit older and some users simply find it easier to use and more standardised in its functionality.
- The sqldf package
A lot of R users are more comfortable coding in SQL language rather than R. This function allows you to write SQL code within R studio to select your data elements
- The janitor package
This package is able to find duplicates by multiple columns and make friendly columns with ease from your dataframe. It even has a get_dupes() function for finding duplicate values amongst multiple rows of data. If you are looking to dedupe your data in a more advanced manner, for example, finding different combinations or using fuzzy logic, you may want to look into a deduping tool instead.
- The splitstackshape package
R Data Manipulation Cheat Sheet
This is an older package that can work with comma separated values in a dataframe column. Useful for survey or text analysis preparation.
R has a huge number of packages and this article only really touches the surface of what it can do. With new libraries popping up all the time it is important to do your research and get the right ones for you before starting any new project.
Bio: Anna Kayfitz is a CEO of StrategicDB Corp, a data cleansing and analytics company. She holds an MBA from a Schulich School of Business, and has spent over 10 years working in data analytics and marketing roles prior to founding StrategicDB.
Cheat Sheet Dplyr
Resources:
R Data Wrangling Cheat Sheet
Related: