Wednesday, 21 January 2015

R Basics: Graphics in R

R is a powerful tool that can be used to make amazing graphics for representing your data. This post just scratches the surface of the most commonly used graphics.

BOXPLOT: A boxplot is used for quantitative variables, and its goal is to give you an idea of the distribution of the data. The upper and lower bounds of the box represent the 75th and 25th percentiles respectively, and the black line inside the box is the median. In the plot of the India column the black line sits towards the top of the box, which means the distribution is asymmetric.
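As a rough sketch of what the box encodes, the numbers that boxplot() draws can be inspected without plotting anything. The vector x is made up for illustration and is not taken from the dataset used in this post. R places the box edges at the hinges returned by fivenum(), which are close to, but not always identical to, the 25th and 75th percentiles.

```r
# Hypothetical sample (not from the infant mortality dataset)
# with a long tail of low values
x <- c(1, 2, 4, 7, 8, 8, 9, 9, 9, 10)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
print(five)

# plot = FALSE returns the numbers boxplot() would draw instead of drawing them
box <- boxplot(x, plot = FALSE)
print(box$stats[, 1])

# Distance from the median to each edge of the box
lower_gap <- five[3] - five[2]
upper_gap <- five[4] - five[3]

# A median much closer to one edge points to an asymmetric distribution
print(c(lower_gap = lower_gap, upper_gap = upper_gap))
```

Here the median lies closer to the upper edge of the box, which is the same pattern described for the India column.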
Dataset Source: http://datamarket.com/data/set/12sc/infant-mortality-rate#!

#Graphical Plots in R 
install.packages('knitr')
library(knitr)
infant_mortality<-read.csv("infant-mortality-rate.csv")
dim(infant_mortality)
[1] 92  4
#Reading first few lines of data
head(infant_mortality)
  Year Australia India Pakistan
1 2010     4.437 50.58    71.35
2 2011     4.437 50.58    71.35
3 2012     4.437 50.58    71.35
4 2013     4.437 50.58    71.35
5 2014     4.437 50.58    71.35
6 2015     4.437 50.58    71.35
#Boxplot
boxplot(infant_mortality$India,col="red")
[Figure: boxplot]
#Seeing the summary statistics of the dataset
summary(infant_mortality)
      Year      Australia        India         Pakistan   
 2010   : 1   Min.   :4.44   Min.   :50.6   Min.   :71.3  
 2011   : 1   1st Qu.:4.44   1st Qu.:50.6   1st Qu.:71.3  
 2012   : 1   Median :4.44   Median :50.6   Median :71.3  
 2013   : 1   Mean   :4.44   Mean   :50.6   Mean   :71.3  
 2014   : 1   3rd Qu.:4.44   3rd Qu.:50.6   3rd Qu.:71.3  
 2015   : 1   Max.   :4.44   Max.   :50.6   Max.   :71.3  
 (Other):86   NA's   :1      NA's   :1      NA's   :1  
#Boxplot for comparisons
#Compares mortality rate vs. year
boxplot(infant_mortality$Pakistan ~ as.factor(infant_mortality$Year),col="blue")
[Figure: boxplot1]
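The formula form value ~ group draws one box per group, so it is most informative when each group holds several observations. A minimal sketch with a made-up data frame follows. The decade column and the rate values are hypothetical and are not taken from the dataset above.

```r
# Hypothetical data: five made-up rates per decade
df <- data.frame(
  decade = rep(c("1990s", "2000s"), each = 5),
  rate   = c(90, 88, 85, 83, 80, 78, 75, 74, 72, 71)
)

# plot = FALSE returns the per-group statistics without drawing
grouped <- boxplot(rate ~ decade, data = df, plot = FALSE)

# One name per box, in the order the boxes would be drawn
print(grouped$names)

# One column per group; the third row holds the medians
print(grouped$stats)
```

Dropping plot = FALSE and adding col="blue" would draw the two boxes side by side.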

Dataset Source: http://datamarket.com/data/set/vih/retail-prices-of-some-commodities-and-services-1996-2013#!
#Barplot
retail_prices<-read.csv("retail-prices-of-some-commoditie.csv")
head(retail_prices)
    Month Dairy.cheese Dark.chocolate Eggs Grapes
1 1996-11          701            190  348    447
2 1997-02          736            188  360    593
3 1997-05          742            189  360    381
4 1997-08          747            191  359    375
5 1997-11          758            194  360    479
6 1998-02          786            192  341    378
barplot(table(retail_prices$Eggs),col="red")

[Figure: barplot]
#Histogram
hist(retail_prices$Eggs,col="blue",breaks=50)

[Figure: histogram]
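The two plots summarise the same column differently: barplot(table(...)) draws one bar per distinct value, while hist() first groups the values into intervals. The sketch below uses only the six Eggs prices visible in the head() output. The break points are chosen by hand for illustration.

```r
# First six Eggs prices as printed by head(retail_prices)
eggs <- c(348, 360, 360, 359, 360, 341)

# table() counts how often each distinct value occurs;
# barplot() would draw one bar per distinct value
value_counts <- table(eggs)
print(value_counts)

# hist() groups the values into intervals first.
# plot = FALSE returns the bin counts without drawing
bins <- hist(eggs, breaks = c(340, 350, 360), plot = FALSE)
print(bins$counts)
```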
#Density Plots

#Density plots are smoothed histograms
dens<-density(retail_prices$Eggs)
#A density plot shows the estimated probability density (the area under the curve is 1) instead of absolute counts
#lwd is for line thickness
plot(dens,lwd=3,col="blue")
#Density plots - Multiple Distributions
dens_grapes<-density(retail_prices$Grapes)
lines(dens_grapes,lwd=3,col="orange")
[Figure: density]
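To see what the y-axis of a density plot means, the object returned by density() can be inspected directly. The sketch below uses simulated values rather than the retail prices and approximates the area under the curve with a simple sum. The simulated data and the variable names are assumptions made for illustration.

```r
# Simulated values; set.seed() only makes the sketch reproducible
set.seed(1)
values <- rnorm(200, mean = 350, sd = 10)

dens <- density(values)

# density() evaluates the curve on an evenly spaced grid of x values
step <- diff(dens$x[1:2])

# Approximate the area under the curve with a Riemann sum
area <- sum(dens$y) * step
print(area)

# The y values are densities, not counts
print(range(dens$y))
```

The area comes out very close to 1, which is why two columns with different numbers of observations can still be compared on the same density plot.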
#Timeseries 

#ts() and plot() used below are part of base R; the astsa package is optional and adds extra time series tools
install.packages("astsa")
library(astsa)

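#start gives the time of the first observation and frequency the number of observations per year
#The Month column starts at 1996-11 and moves in steps of three months, so the series is treated as quarterly from the 4th quarter of 1996
retail_prices_timeseries<-ts(retail_prices$Eggs,start=c(1996,4),frequency=4)
#ylim is set to the range of all four series so that the overlaid lines stay inside the plot
price_range<-range(retail_prices[,c("Eggs","Dark.chocolate","Grapes","Dairy.cheese")],na.rm=TRUE)
plot(retail_prices_timeseries,col="blue",lwd=3,ylim=price_range)

#To plot multiple timeseries on same graph for comparison
retail_prices_timeseries1<-ts(retail_prices$Dark.chocolate,start=c(1996,4),frequency=4)
#lines() is used to overlay additional series on the existing plot
lines(retail_prices_timeseries1,col="red",lwd=3)
retail_prices_timeseries2<-ts(retail_prices$Grapes,start=c(1996,4),frequency=4)
lines(retail_prices_timeseries2,col="green",lwd=3)
retail_prices_timeseries3<-ts(retail_prices$Dairy.cheese,start=c(1996,4),frequency=4)
lines(retail_prices_timeseries3,col="yellow",lwd=3)
[Figure: Timeseries]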
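The start and frequency arguments of ts() decide which time each observation is assigned on the x-axis. The Month column in the head() output begins at 1996-11 and advances three months per row, so one reasonable reading is a quarterly series that starts in the 4th quarter of 1996. The sketch below applies that assumption to the six Eggs prices visible in the head() output.

```r
# First six Eggs prices as printed by head(retail_prices)
eggs <- c(348, 360, 360, 359, 360, 341)

# Assumption: rows are quarterly and 1996-11 is treated as
# the 4th quarter of 1996
eggs_ts <- ts(eggs, start = c(1996, 4), frequency = 4)

# time() shows the x-axis position assigned to each observation
print(time(eggs_ts))

# start() and end() return the (year, quarter) of the first and last value
print(start(eggs_ts))
print(end(eggs_ts))
```

With frequency left at its default of 1, each row would instead be treated as one full year.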

Wednesday, 14 January 2015

All about Correlation - Part 1

What is Correlation ? 

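Correlation indicates how strongly two variables are associated. Correlation, however, does not imply causation. The value of the correlation coefficient varies between -1 and 1: values near 1 mean the variables rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is little linear association.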
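A small sketch with made-up vectors shows the range of values that cor(), which computes Pearson's coefficient by default, can return, including negative ones. It also checks the result against the definition: covariance divided by the product of the standard deviations. All vectors are invented for illustration.

```r
# Made-up vectors for illustration only
x       <- c(1, 2, 3, 4, 5)
y_up    <- 2 * x + 1          # rises exactly in line with x
y_down  <- -3 * x             # falls exactly in line with x
y_mixed <- c(2, 1, 4, 3, 5)   # rises with x, but not perfectly

r_up    <- cor(x, y_up)
r_down  <- cor(x, y_down)
r_mixed <- cor(x, y_mixed)

# Pearson's r is the covariance scaled by both standard deviations
r_manual <- cov(x, y_mixed) / (sd(x) * sd(y_mixed))

print(c(r_up, r_down, r_mixed, r_manual))
```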

[Figure: Correlation does not mean causation]


Using an R dataset, mtcars, to understand correlation in more depth.

> #To read the first few lines of the dataset
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> 
> #To generate a correlation matrix
> correlations <- cor(mtcars[2:6])
> #Round function can also be used
> # correlations <- round(cor(mtcars[2:6]),2) - Will round to 2 decimal digits
> #Print the result
> correlations
            cyl       disp         hp       drat         wt
cyl   1.0000000  0.9020329  0.8324475 -0.6999381  0.7824958
disp  0.9020329  1.0000000  0.7909486 -0.7102139  0.8879799
hp    0.8324475  0.7909486  1.0000000 -0.4487591  0.6587479
drat -0.6999381 -0.7102139 -0.4487591  1.0000000 -0.7124406
wt    0.7824958  0.8879799  0.6587479 -0.7124406  1.0000000
> 
> #To create a visual plot of the correlations (needs the corrplot package)
> library(corrplot)
> corrplot(correlations)
> 

[Figure: Correlation matrix]
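Besides plotting, the matrix can be searched for the most strongly related pair of variables. The sketch below is one possible way to do that with base R only. The helper variable names are assumptions made for illustration.

```r
# Same matrix as in the session above; mtcars ships with R
correlations <- cor(mtcars[2:6])

# Keep each pair once by blanking the diagonal and the upper triangle
pairs <- correlations
pairs[upper.tri(pairs, diag = TRUE)] <- NA

# Position of the largest absolute correlation among the remaining cells
strongest <- which(abs(pairs) == max(abs(pairs), na.rm = TRUE), arr.ind = TRUE)

var_a <- rownames(pairs)[strongest[1, "row"]]
var_b <- colnames(pairs)[strongest[1, "col"]]
print(c(var_a, var_b))
print(pairs[var_a, var_b])
```

For these five columns the strongest pair is disp and cyl, matching the 0.9020329 entry in the printed matrix.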