Thursday, 4 December 2014

R Basics - Reading and summarizing data (Part 1)

Image

This is a screenshot of RStudio, which can be used for all the programming activities. Download RStudio. (R needs to be installed on the system before this. Download R.)

To help everyone who is interested in R get a jump start, I have decided to write down what I have learned for the benefit of all. This tutorial is the first of a 7-part series.

For this tutorial, I have used the following file: https://www.dropbox.com/s/vm4zdoxgbgvfvc3/Marks.csv

The main purpose of this post is to help you understand how to perform basic operations such as reading a file, summarizing data, converting data and cleaning data, since every analysis begins with the first step of getting the data ready for further processing.

Right! So, here goes the code; the explanation follows it.
install.packages('knitr')
Installing package into 'C:/Users/Shanky/Documents/R/win-library/3.0'
(as 'lib' is unspecified)
library(knitr)
#To know the working directory 
getwd()
[1] "C:/Users/Shanky/Documents"
#Reading the file
marks<-read.csv("Marks.csv")
#Checking the dimensions of the file 
dim(marks)
[1] 10  5
#Can be done separately as follows
nrow(marks)
[1] 10
ncol(marks)
[1] 5
#To check the heading of the file
names(marks)
[1] "Id"           "Student.Name" "English"      "Maths"       
[5] "History.1"   
#To check first few rows of the data frame
head(marks,2)
  Id Student.Name English Maths History.1
1  1            A      91   100        45
2  2            B      12    25        39
#To check summary statistics 
summary(marks)
       Id         Student.Name    English         Maths      
 Min.   : 1.00   A      :1     Min.   :12.0   Min.   : 25.0  
 1st Qu.: 3.25   B      :1     1st Qu.:64.0   1st Qu.: 67.0  
 Median : 5.50   C      :1     Median :71.0   Median : 86.0  
 Mean   : 5.50   D      :1     Mean   :69.9   Mean   : 78.7  
 3rd Qu.: 7.75   E      :1     3rd Qu.:90.0   3rd Qu.: 99.0  
 Max.   :10.00   F      :1     Max.   :96.0   Max.   :100.0  
                 (Other):4     NA's   :1      NA's   :1      
   History.1    
 Min.   :  2.0  
 1st Qu.: 45.0  
 Median : 59.0  
 Mean   : 64.7  
 3rd Qu.: 98.0  
 Max.   :100.0  
 NA's   :1      
#As you can see, Student.Name is being treated as a factor
#To check the type of data 
class(marks$English)
[1] "integer"
class(marks$Student.Name)
[1] "factor"
#To check for missing values
is.na(marks$English)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#Alternate way to know total no. of missing values
sum(is.na(marks$English))
[1] 1
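
The "factor" class reported for Student.Name above comes from read.csv converting strings to factors, which was the default before R 4.0.0. Here is a minimal sketch of how the stringsAsFactors argument controls this. It assumes a small made-up CSV passed inline through the text argument instead of the Marks.csv file, and the names csv_text, as_factor and as_text are only for illustration.

```r
# Hypothetical two-row sample; NOT the contents of Marks.csv
csv_text <- "Id,Student.Name,English\n1,A,91\n2,B,12"

# Ask explicitly for factors (this was the default before R 4.0.0)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_factor$Student.Name)

# Keep names as plain character strings (the default from R 4.0.0 onwards)
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_text$Student.Name)
```

Setting the argument explicitly keeps the result the same whichever R version runs the script.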

The following steps need to be followed to read and process data in R

1. Depending on the type of file, read the data into R. For example, in this case the file was in CSV format, hence read.csv was used.

2. After reading the data, it is important to check the dimensions. This will help us know the size of the data frame that we are dealing with.

3. Always check the summary statistics to gauge how the values in each column are distributed (minimum, quartiles, mean, maximum and the number of NA's).

4. Any data that we get may not always be clean. Thus, always check for missing values. These missing values need to be taken into account while processing later on.
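
The four steps above can be sketched end to end as follows. This is only an illustration: the inline data is made up so that the snippet runs without Marks.csv, the name sample_marks is hypothetical, and using na.rm = TRUE is just one possible way of taking missing values into account.

```r
# Hypothetical sample with one missing English mark; NOT the real Marks.csv
csv_text <- "Id,Student.Name,English,Maths
1,A,91,100
2,B,12,25
3,C,,86"

# Step 1: read the data (read.csv also accepts inline text)
sample_marks <- read.csv(text = csv_text)

# Step 2: check the dimensions (rows, columns)
dim(sample_marks)

# Step 3: summary statistics for every column
summary(sample_marks)

# Step 4: count the missing values in each column
colSums(is.na(sample_marks))

# A missing value makes mean() return NA unless it is dropped explicitly
mean(sample_marks$English)
mean(sample_marks$English, na.rm = TRUE)
```

colSums(is.na(...)) checks every column at once, which saves repeating sum(is.na(...)) for each subject.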

Cleaning of Data

As you can see, the heading "History.1" carries an unwanted ".1" suffix. Data cleaning takes up a large chunk of the data analysis time. (Approx. 90% of the time is spent on this not-so-glamorous part!)
#Reading the file
marks<-read.csv("Marks.csv")
#Checking the header names
names(marks)
[1] "Id"           "Student.Name" "English"      "Maths"       
[5] "History.1"   

#1st way to change header names
names(marks)<-c("Id","Student.Name","English","Maths","History") 
#c() combines its arguments into a vector
#Checking if the header names have been changed
names(marks)
[1] "Id"           "Student.Name" "English"      "Maths"       
[5] "History"     
#To view all header names in lowercase/uppercase (this only prints the result; assign it back to names(marks) to keep it)
tolower(names(marks))
[1] "id"           "student.name" "english"      "maths"       
[5] "history"     
toupper(names(marks))
[1] "ID"           "STUDENT.NAME" "ENGLISH"      "MATHS"       
[5] "HISTORY"     
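#Removing the '.1' from the History.1 header
#Re-reading the file so that the original "History.1" header is back
marks<-read.csv("Marks.csv")
#Using strsplit and specifying the pattern to split at: a '.' followed by
#digits at the end of the name. The '.' is escaped as "\\." because a bare
#'.' matches any character in a regular expression.
stringsplit<-strsplit(names(marks),"\\.[0-9]+$")
#Creating a function which keeps only the first element of each vector
functionsplit<-function(x){x[1]}
#Using sapply we can apply a given function to every element of a list
names(marks)<-sapply(stringsplit,functionsplit)
names(marks)
[1] "Id"           "Student.Name" "English"      "Maths"       
[5] "History"       
#We have now cleaned the headers of the file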
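
As an alternative to strsplit and sapply, a single call to sub can strip the suffix. The sketch below works on a stand-in vector of header names (the variables headers, clean_headers and lower_headers are hypothetical) instead of the data frame. The pattern assumes the unwanted suffix is always a dot followed by digits at the end of the name.

```r
# Stand-in for names(marks) as originally read from the file
headers <- c("Id", "Student.Name", "English", "Maths", "History.1")

# Remove a trailing ".<digits>" suffix; dots elsewhere are left alone
clean_headers <- sub("\\.[0-9]+$", "", headers)
clean_headers

# tolower() only returns a new vector, so assign the result to keep it
lower_headers <- tolower(clean_headers)
lower_headers
```

On the data frame itself, the same idea would be written as names(marks) <- sub("\\.[0-9]+$", "", names(marks)).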
