Sunday, 1 February 2015

Data manipulation in R

dplyr Package Basics

dplyr is a powerful package for data manipulation in R. Its 6 main verbs are:
1. Select
2. Filter
3. Arrange
4. Group (the group_by function)
5. Mutate
6. Summarise

Together, these verbs can be used to perform powerful data analysis.
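
As a rough sketch of how the six verbs chain together, the example below runs them on a small made-up data frame. The flights table, its values, and the Speed and AvgSpeed column names are assumptions for illustration only; they are not taken from hflights.

```r
library(dplyr)

# Toy data frame standing in for a flights table (hypothetical values)
flights <- data.frame(
  Carrier  = c("AA", "AA", "UA", "UA", "UA"),
  Distance = c(100, 300, 200, 400, 600),
  AirTime  = c(30, 60, 40, 80, 100),
  stringsAsFactors = FALSE
)

result <- flights %>%
  select(Carrier, Distance, AirTime) %>%       # keep only these columns
  filter(Distance > 100) %>%                   # keep rows with Distance above 100
  mutate(Speed = Distance / AirTime * 60) %>%  # new column: distance per hour
  group_by(Carrier) %>%                        # one group per carrier
  summarise(AvgSpeed = mean(Speed)) %>%        # collapse each group to one row
  arrange(desc(AvgSpeed))                      # sort by average speed, highest first

print(result)
```

Each step receives the data frame produced by the step before it, so the pipeline reads top to bottom in the order the work is done.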


All you need to know about the select function.

Shown below are 2 ways to select a set of columns. Notice how much easier it becomes with the select function. The "%>%" is the pipe operator: it passes the result on its left to the function on its right as the first argument, which is most useful when several steps need to be chained.

> # Difficult way (base R indexing by column name)
> hflights[c("Year","Month","DayOfWeek","DepTime","ArrTime")]
Source: local data frame [227,496 x 5]
   Year Month DayOfWeek DepTime ArrTime
1  2011     1         6    1400    1500
2  2011     1         7    1401    1501
3  2011     1         1    1352    1502
4  2011     1         2    1403    1513
5  2011     1         3    1405    1507
6  2011     1         4    1359    1503
7  2011     1         5    1359    1509
8  2011     1         6    1355    1454
9  2011     1         7    1443    1554
10 2011     1         1    1443    1553
..  ...   ...       ...     ...     ...
> # Easy way: take the range Year to ArrTime, then drop DayofMonth
> hflights %>%
          select(Year:ArrTime, -DayofMonth)
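
To check that the two approaches pick the same columns without loading hflights, here is a minimal sketch on a toy data frame. The toy table reuses the first six hflights column names, but its values are made up. The last line uses ends_with, one of the selection helpers dplyr provides, as a third option.

```r
library(dplyr)

# Toy data frame with hflights-style column names (values are made up)
toy <- data.frame(
  Year = c(2011, 2011), Month = c(1, 1), DayofMonth = c(1, 2),
  DayOfWeek = c(6, 7), DepTime = c(1400, 1401), ArrTime = c(1500, 1501)
)

# Base R: index by a character vector of column names
base_way <- toy[c("Year", "Month", "DayOfWeek", "DepTime", "ArrTime")]

# dplyr: take the range Year to ArrTime, then drop DayofMonth
dplyr_way <- toy %>% select(Year:ArrTime, -DayofMonth)

# Both approaches return the same columns in the same order
same_columns <- identical(names(base_way), names(dplyr_way))
print(same_columns)

# Selection helper: keep only the columns whose names end in "Time"
time_columns <- toy %>% select(ends_with("Time"))
print(names(time_columns))
```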


Mutate is used to create new columns. A column created inside a mutate call can be reused later in the same call to create more columns. Remember that mutate returns a modified copy: the original dataset stays unchanged unless you assign the result back to it.
Let's write code to create 2 new columns: TotalTaxiTime and GroundTime.

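# Creating new columns
# GroundTime = total elapsed minutes minus minutes in the air (AirTime);
# ArrTime is a clock time, not a duration, so it is not used here
hflights %>%
          mutate(TotalTaxiTime = TaxiIn + TaxiOut, GroundTime = ActualElapsedTime - AirTime)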
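
The two points about mutate made above (reusing a freshly created column, and working on a copy) can be sketched on a toy data frame. The values below are made up, and the TaxiShare column is a hypothetical name introduced only for this illustration.

```r
library(dplyr)

# Toy data frame with hflights-style column names (values are made up)
toy <- data.frame(
  TaxiIn = c(7, 6), TaxiOut = c(13, 9),
  ActualElapsedTime = c(60, 60), AirTime = c(40, 45)
)

with_new <- toy %>%
  mutate(
    TotalTaxiTime = TaxiIn + TaxiOut,
    # TaxiShare reuses TotalTaxiTime, which was created one line earlier
    TaxiShare = TotalTaxiTime / ActualElapsedTime
  )

print(names(toy))       # the original still has only its 4 columns
print(names(with_new))  # the result carries the 2 new columns as well
```

To keep the new columns, assign the result to a name, as done with with_new here.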
