Monday, 13 April 2015

Data manipulation in R - summarise & group_by

Summarise and group by are 2 very powerful functions. Together they can be used to provide an over arching view of our data.

Summarise Function













Syntax 

Let us understand the summarise function through a sample question.

Question

Find out the average Taxi In & Taxi Out time for all American Airlines flights.
> s2 <- hflights %>%
 filter(UniqueCarrier == "American",!is.na(TaxiIn), !is.na(TaxiOut)) %>%
 summarise(avg_In = mean(TaxiIn), avg_out = mean(TaxiOut))
> s2
Source: local data frame [1 x 2]
 
    avg_In  avg_out
1 8.032349 15.08229
is.na() is used to filter out the NA values, otherwise we won't get the right result.

Aggregate functions in dplyr


















Let us understand on how to use these functions in summarise by solving a problem statement.

Question

1. Write a code to return
a. count of distinct values of UniqueCarriers
b. the min and max TaxiIn of Continental Airlines
c. Summarising statistics of Continental Airlines

Solution

Part a
a <- n_distinct(hflights$UniqueCarrier)
a
Part b
a <-  hflights %>%
 filter(UniqueCarrier == "Continental",!is.na(ArrTime), !is.na(DepTime)) %>%
 summarise(max_Taxi = max(TaxiIn), min_Taxi = min(TaxiIn))
> a
Source: local data frame [1 x 2]
 
  max_Taxi min_Taxi
1      140        1
Part c

> a <- hflights %>%
                 filter(UniqueCarrier == "Continental") %>%
               summarise(n_obs = n(),
                           distinct_flightnum = n_distinct(FlightNum),
                           avg_Airtime = mean(AirTime, na.rm = TRUE),
                           n_canc = sum(Cancelled == 1),
                           p_canc = (n_canc/n_obs)*100)
> a
Source: local data frame [1 x 5]
 
  n_obs distinct_flightnum avg_Airtime n_canc    p_canc
1 70032                927    145.4579    475 0.6782614
Group By Function















Group by lets you define groups within your data set. In conjunction with summarise, we can perform powerful analysis on our data.

Question

Find out which Airline had the highest average  Taxi Time (Taxi In + Taxi out) and use the percentage of cancelled flights as a deciding parameter in case of a tie.

Solution


> a <- hflights %>%
         mutate(TotalTaxiTime = TaxiIn + TaxiOut) %>%
           group_by(UniqueCarrier) %>%
                 summarise(n_obs = n(),
                   n_canc = sum(Cancelled == 1),
                   n_per = (n_canc/n_obs)*100,
                  avg_taxi = round(mean(TotalTaxiTime, na.rm= TRUE),2)) %>%
         arrange(desc(avg_taxi),n_per)
> a
Source: local data frame [15 x 5]
 
        UniqueCarrier n_obs n_canc     n_per avg_taxi
1                Mesa    79      1 1.2658228    29.45
2               Delta  2641     42 1.5903067    28.70
3              United  2072     34 1.6409266    26.04
4         Continental 70032    475 0.6782614    25.48
5          US_Airways  4082     46 1.1268986    24.15
6             SkyWest 16061    224 1.3946828    23.6
Is this the most optimized way to write the code ?
Notice that, we created a new column "TotalTaxiTime" especially for this calculation. What if, mutate function was to be avoided ?

Thankfully, summarise does allow us to do small mathematical operations within it. Please see below how we can succinctly write the same code below  -

Optimized Code
> a <- hflights %>%
               group_by(UniqueCarrier) %>%
                 summarise(n_obs = n(),
                   n_canc = sum(Cancelled == 1),
                   n_per = (n_canc/n_obs)*100,
                  avg_taxi = round(TotalTaxiTime = mean(TaxiIn + TaxiOut, na.rm= TRUE),2)) %>%
         arrange(desc(avg_taxi),n_per)
> a
Source: local data frame [15 x 5]
        UniqueCarrier n_obs n_canc     n_per avg_taxi
1                Mesa    79      1 1.2658228    29.45
2               Delta  2641     42 1.5903067    28.70
3              United  2072     34 1.6409266    26.04
4         Continental 70032    475 0.6782614    25.48
5          US_Airways  4082     46 1.1268986    24.15
6             SkyWest 16061    224 1.3946828    23.6

No comments:

Post a Comment