Summarise and group by are 2 very powerful functions. Together they can be used to provide an over arching view of our data.
Summarise Function
Syntax
Let us understand the summarise function through a sample question.
Question
Find out the average Taxi In & Taxi Out time for all American Airlines flights.
Aggregate functions in dplyr
Let us understand on how to use these functions in summarise by solving a problem statement.
Question
1. Write a code to return
a. count of distinct values of UniqueCarriers
b. the min and max TaxiIn of Continental Airlines
c. Summarising statistics of Continental Airlines
Solution
Part a
> a <- hflights %>%
filter(UniqueCarrier == "Continental") %>%
summarise(n_obs = n(),
distinct_flightnum = n_distinct(FlightNum),
avg_Airtime = mean(AirTime, na.rm = TRUE),
n_canc = sum(Cancelled == 1),
p_canc = (n_canc/n_obs)*100)
Group by lets you define groups within your data set. In conjunction with summarise, we can perform powerful analysis on our data.
Question
Find out which Airline had the highest average Taxi Time (Taxi In + Taxi out) and use the percentage of cancelled flights as a deciding parameter in case of a tie.
Solution
Notice that, we created a new column "TotalTaxiTime" especially for this calculation. What if, mutate function was to be avoided ?
Thankfully, summarise does allow us to do small mathematical operations within it. Please see below how we can succinctly write the same code below -
Optimized Code
Summarise Function
Syntax
Let us understand the summarise function through a sample question.
Question
Find out the average Taxi In & Taxi Out time for all American Airlines flights.
> s2 <- hflights %>% filter(UniqueCarrier == "American",!is.na(TaxiIn), !is.na(TaxiOut)) %>% summarise(avg_In = mean(TaxiIn), avg_out = mean(TaxiOut)) > s2 Source: local data frame [1 x 2] avg_In avg_out 1 8.032349 15.08229is.na() is used to filter out the NA values, otherwise we won't get the right result.
Aggregate functions in dplyr
Let us understand on how to use these functions in summarise by solving a problem statement.
Question
1. Write a code to return
a. count of distinct values of UniqueCarriers
b. the min and max TaxiIn of Continental Airlines
c. Summarising statistics of Continental Airlines
Solution
Part a
a <- n_distinct(hflights$UniqueCarrier) aPart b
a <- hflights %>% filter(UniqueCarrier == "Continental",!is.na(ArrTime), !is.na(DepTime)) %>% summarise(max_Taxi = max(TaxiIn), min_Taxi = min(TaxiIn)) > a Source: local data frame [1 x 2] max_Taxi min_Taxi 1 140 1Part c
> a <- hflights %>%
filter(UniqueCarrier == "Continental") %>%
summarise(n_obs = n(),
distinct_flightnum = n_distinct(FlightNum),
avg_Airtime = mean(AirTime, na.rm = TRUE),
n_canc = sum(Cancelled == 1),
p_canc = (n_canc/n_obs)*100)
Group By Function> a Source: local data frame [1 x 5] n_obs distinct_flightnum avg_Airtime n_canc p_canc 1 70032 927 145.4579 475 0.6782614
Group by lets you define groups within your data set. In conjunction with summarise, we can perform powerful analysis on our data.
Question
Find out which Airline had the highest average Taxi Time (Taxi In + Taxi out) and use the percentage of cancelled flights as a deciding parameter in case of a tie.
Solution
> a <- hflights %>% mutate(TotalTaxiTime = TaxiIn + TaxiOut) %>% group_by(UniqueCarrier) %>% summarise(n_obs = n(), n_canc = sum(Cancelled == 1), n_per = (n_canc/n_obs)*100, avg_taxi = round(mean(TotalTaxiTime, na.rm= TRUE),2)) %>% arrange(desc(avg_taxi),n_per) > a Source: local data frame [15 x 5] UniqueCarrier n_obs n_canc n_per avg_taxi 1 Mesa 79 1 1.2658228 29.45 2 Delta 2641 42 1.5903067 28.70 3 United 2072 34 1.6409266 26.04 4 Continental 70032 475 0.6782614 25.48 5 US_Airways 4082 46 1.1268986 24.15 6 SkyWest 16061 224 1.3946828 23.6Is this the most optimized way to write the code ?
Notice that, we created a new column "TotalTaxiTime" especially for this calculation. What if, mutate function was to be avoided ?
Thankfully, summarise does allow us to do small mathematical operations within it. Please see below how we can succinctly write the same code below -
Optimized Code
> a <- hflights %>% group_by(UniqueCarrier) %>% summarise(n_obs = n(), n_canc = sum(Cancelled == 1), n_per = (n_canc/n_obs)*100, avg_taxi = round(TotalTaxiTime = mean(TaxiIn + TaxiOut, na.rm= TRUE),2)) %>% arrange(desc(avg_taxi),n_per) > a
Source: local data frame [15 x 5]
UniqueCarrier n_obs n_canc n_per avg_taxi 1 Mesa 79 1 1.2658228 29.45 2 Delta 2641 42 1.5903067 28.70 3 United 2072 34 1.6409266 26.04 4 Continental 70032 475 0.6782614 25.48 5 US_Airways 4082 46 1.1268986 24.15 6 SkyWest 16061 224 1.3946828 23.6
No comments:
Post a Comment