Wednesday, 31 December 2014

Market Basket Analysis

Ever wondered why stores like Easyday and Big Bazaar place confectioneries near the checkout counter? Or why the dairy section is always tucked into one corner of the store? While some of the answers may seem highly intuitive, most of them are not!

These stores make use of analytics, analyzing your transactions to work out which items go together in the shopping basket. Retailers can use this information in countless ways to increase their revenue.

For example -

1. Store layout planning (placing products that go together near each other to increase spend per shopping basket and to improve the shopping experience)

2. Targeted marketing (using the data collected about customers to inform them of the latest offerings that might interest them)

This post tries to explain the basic concept behind this form of retail analytics. I have performed the market basket analysis using the Apriori algorithm in R.

The dataset for this can be downloaded from here: https://www.dropbox.com/s/n09hu3mr63pq4bw/skusold.csv

An intro to the Apriori algorithm:

Apriori algorithm (image courtesy: http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput499/slides/Lect10/img054.jpg)

Follow the comments alongside the code to understand the flow of logic.

*Text in white is the code.

*Text in blue is the output.
#Market Basket Analysis 

#Installing package  'arules' for apriori algorithm
install.packages("arules")
library(arules)
install.packages("knitr")
library(knitr)
#Checking working directory
getwd()
[1] "C:/Users/Shanky/Documents"
#Reading the file
tt<-read.csv("skusold.csv")
head(tt)

#It can be seen that there are multiple items under the same Order ID.
#We now need to arrange the data so that all items belonging to one transaction are listed together.
  Order       SKU
1  9305     Bread
2  9305 Tropicana
3  9305      Eggs
4 11020 Tropicana
5 11020  Gilette 
6 11020      Eggs

#Our objective is to group or aggregate the items together based on the Order ID. We can do that as follows:
Aggdata<-split(tt$SKU,tt$Order)
head(Aggdata)

#All items bought under the same Order ID are now listed together
$`9305`
[1] Bread     Tropicana Eggs     
Levels: AfterShave Bread Eggs Gilette  Hide Jam Tomatoes Tropicana

$`11013`
[1] Bread     Tropicana Jam      
Levels: AfterShave Bread Eggs Gilette  Hide Jam Tomatoes Tropicana

$`11015`
[1] Eggs  Bread
Levels: AfterShave Bread Eggs Gilette  Hide Jam Tomatoes Tropicana

$`11017`
[1] Tomatoes   AfterShave Tropicana  Gilette    Eggs      
Levels: AfterShave Bread Eggs Gilette  Hide Jam Tomatoes Tropicana

$`11018`
[1] Jam        AfterShave Tropicana  Tomatoes  
Levels: AfterShave Bread Eggs Gilette  Hide Jam Tomatoes Tropicana

$`11019`
[1] AfterShave Jam        Tomatoes   Tropicana  Hide       Gilette   
Levels: AfterShave Bread Eggs Gilette  Hide Jam Tomatoes Tropicana
#To use the Apriori algorithm, we need to coerce the data into a 'transactions' object
abc = as(Aggdata,"transactions")
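As a side note, the same result can be reached without the manual split() step. A minimal sketch (assuming the same two-column CSV with a header row and a reasonably recent version of 'arules') reads the order/SKU pairs straight into a transactions object with read.transactions:
#Alternative: build the transactions object directly from the file
#(format = "single" means one item per row; column 1 = transaction id, column 2 = item)
abc2 <- read.transactions("skusold.csv", format = "single", sep = ",",
                          cols = c(1, 2), skip = 1)   #skip = 1 drops the header row
summary(abc2)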
#Checking the summary statistics of the data
summary(abc)
transactions as itemMatrix in sparse format with
 13 rows (elements/itemsets/transactions) and
 8 columns (items) and a density of 0.4712 

most frequent items:
     Bread  Tropicana AfterShave       Eggs   Gilette     (Other) 
         8          8          7          7          7         12 

element (itemset/transaction) length distribution:
sizes
2 3 4 5 6 
2 4 3 3 1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    3.00    4.00    3.77    5.00    6.00 

includes extended item information - examples:
      labels
1 AfterShave
2      Bread
3       Eggs
includes extended transaction information - examples:
  transactionID
1          9305
2         11013
3         11015
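The density of 0.4712 reported above is easy to verify by hand: the itemMatrix has 13 transactions x 8 items = 104 cells, and the transaction sizes listed above (2x2 + 4x3 + 3x4 + 3x5 + 1x6 = 49 item occurrences) fill 49 of them, giving 49/104 ~ 0.471. A quick sanity check in R using the size() helper from arules:
#Density = occupied cells of the item matrix / total cells
sum(size(abc)) / (nrow(abc) * ncol(abc))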

#The transactions object (an itemMatrix) forms the input for the Apriori algorithm
Rules<-apriori(abc,parameter=list(supp=0.3,conf=0.8,target="rules",minlen=2))

parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen
        0.8    0.1    1 none FALSE            TRUE     0.3      2     10
 target   ext
  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm

#To view the Rules generated
inspect(Rules)
  lhs            rhs          support confidence  lift
1 {Tomatoes}  => {AfterShave}  0.3077     1.0000 1.857
2 {Jam}       => {Tropicana}   0.3846     0.8333 1.354
3 {Eggs}      => {Bread}       0.4615     0.8571 1.393
4 {Eggs,                                              
   Tropicana} => {Bread}       0.3077     0.8000 1.300
5 {Bread,                                             
   Tropicana} => {Eggs}        0.3077     0.8000 1.486
#To inspect items by frequency
itemFrequencyPlot(abc,topN=5,col="red")
(Bar plot of the five most frequent items)

Interpretation of the Rules
lhs                rhs          support   confidence   lift
{Eggs}          => {Bread}      0.4615    0.8571       1.393

The support is the proportion of all transactions in which the items of the rule occur together.
The confidence is the likelihood of purchasing the item on the right-hand side given the items on the left-hand side, i.e. how reliable the prediction is. The lift compares that likelihood with the overall purchase rate of the right-hand side item.
In this example, if a customer buys eggs, the likelihood that they will also buy bread is ~86%, and ~46% of all transactions contain both eggs and bread.
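To see where these numbers come from, here is a small back-of-the-envelope check using the counts from the summary output above (13 transactions in total; eggs appear in 7 of them, bread in 8, and both together in 6, since 0.4615 x 13 = 6):
#Recomputing support, confidence and lift for the rule {Eggs} => {Bread}
n_total <- 13   #total number of transactions
n_eggs  <- 7    #transactions containing eggs
n_bread <- 8    #transactions containing bread
n_both  <- 6    #transactions containing both eggs and bread
n_both / n_total                          #support    = 6/13 ~ 0.4615
n_both / n_eggs                           #confidence = 6/7  ~ 0.8571
(n_both / n_eggs) / (n_bread / n_total)   #lift ~ 1.39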

References
1. Association Rule Mining using R package
2. Wikipedia - Apriori Algorithm
3. Machine Learning

Wednesday, 24 December 2014

Wordcloud on Ashes Series

ASHES - Desire for Domination

"England have only three major problems. They can't bat, they can't bowl and they can't field." - Martin Jonson (England's tour of Australia 1986-7)

With the ongoing Ashes series gathering steam, I decided to marry analytics with data to gather some interesting insights. For this, I have picked up data in the form of tweets from Twitter with the hashtag #ashes. The data processing has been done using a statistical software called R.

A wordcloud or tag cloud highlights the frequency of occurrence of words in a text document using a very intuitive and easy visualization: the larger the text size, the greater the frequency. Also, words with the same color and size have the same occurrence rate.

(Wordcloud of #ashes tweets)


Technical Details & Code

 I have broken down the overall process into numerous steps for ease of reading.



  1. The text mining program makes use of five important R packages, namely ROAuth, twitteR, tm, wordcloud and RJSONIO. Install the requisite packages and get authorized to access content from Twitter.

  2. The authorization process gets completed when the program asks you to enter the 'token'.

  3. Next, pull the tweets with the specified hashtag, setting the number of tweets that you want.

  4. The data then needs to be cleaned, after which the minimum frequency and maximum word limits can be set to plot the wordcloud.

Please note: Though the code specifies 1500 tweets, only 799 were returned by the Twitter API.
#Installing Packages
install.packages("ROAuth")
install.packages("twitteR")
install.packages("RJSONIO")
install.packages("tm")
install.packages("wordcloud")
install.packages("knitr")
#Loading Packages
library("knitr")
library("ROAuth")library("twitteR")library("RJSONIO")library("tm")
library("wordcloud")load("twitter_auth.Rdata")
#Registering on twitter API 
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "http://api.twitter.com/oauth/access_token"
authURL <- "http://api.twitter.com/oauth/authorize"
#Important step for Windows users
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
#Follow the link:https://twitter.com/apps/new to get your consumer key and secret.
consumerKey <- "Enter your Consumer Key"
consumerSecret <- "Enter your consumer secret key"
Cred <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret,
                         requestURL = reqURL, accessURL = accessURL, authURL = authURL)
Cred$handshake(cainfo = "cacert.pem") 
#When complete, record the PIN given to you and provide it on the console
save(Cred, file = "twitter_auth.Rdata")
registerTwitterOAuth(Cred)
#Extracting tweets
Ashes <- searchTwitter('#ashes', n = 1500, lang = 'en', cainfo = "cacert.pem")
#Extracting the text of each tweet
Ashes <- sapply(Ashes, function(x) x$getText())
#Create a corpus 
Ashes_corpus <- Corpus(VectorSource(Ashes))
#Cleaning of data: convert to lower case, strip punctuation and remove stopwords
Ashes_corpus <- tm_map(Ashes_corpus, tolower)
Ashes_corpus <- tm_map(Ashes_corpus, removePunctuation)
Ashes_corpus <- tm_map(Ashes_corpus, function(x) removeWords(x, stopwords()))
 #Selecting color palettes for wordcloud
library(RColorBrewer)
pal2 <- brewer.pal(8,"Pastel2")
wordcloud(Ashes_corpus, scale = c(4,1), min.freq = 5, random.order = T, random.color = T, colors = pal2)
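If you want to see the raw counts behind the cloud, a minimal sketch (assuming the cleaned Ashes_corpus from above) is to build a term-document matrix with the tm package and sort the word totals:
#Optional check: list the most frequent words in the corpus
tdm <- TermDocumentMatrix(Ashes_corpus)
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(term_freq, 10)   #the ten most frequent words across all tweets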

Acknowledgements

The following resources have been used for this post.

1. Tweetsent

2. Mining Twitter with R

3. One R tip a day

Thursday, 4 December 2014

R Basics - Reading and summarizing data (Part 1)

(Screenshot of the RStudio interface)

This is a screenshot of RStudio, which can be used for all the programming activities. Download RStudio. (R needs to be installed on the system first. Download R.)

In order to help everyone who is interested in R get a jump start, I have decided to pen down my learnings for the benefit of all. This tutorial is the first of a 7-part series.

For this tutorial, I have used the following file: https://www.dropbox.com/s/vm4zdoxgbgvfvc3/Marks.csv

The main purpose of this post is to help you understand how to perform basic operations such as reading a file, summarizing data, converting data and cleaning data, since every analysis begins with getting the data ready for further processing.

Right! So, here goes the code; the explanation will follow it.
install.packages('knitr')
Installing package into 'C:/Users/Shanky/Documents/R/win-library/3.0'
(as 'lib' is unspecified)
library(knitr)
#To know the working directory 
getwd()
[1] "C:/Users/Shanky/Documents"#Reading the file
marks<-read.csv("Marks.csv")
#Checking the dimensions of the file 
dim(marks)
[1] 10  5
#Can be done separately as follows
nrow(marks)
[1] 10
ncol(marks)
[1] 5
#To check the column headers of the file
names(marks)
[1] "Id"           "Student.Name" "English"      "Maths"       
[5] "History.1"   
#To check first few rows of the data frame
head(marks,2)
  Id Student.Name English Maths History.1
1  1            A      91   100        45
2  2            B      12    25        39
#To check summary statistics 
summary(marks)
       Id         Student.Name    English         Maths      
 Min.   : 1.00   A      :1     Min.   :12.0   Min.   : 25.0  
 1st Qu.: 3.25   B      :1     1st Qu.:64.0   1st Qu.: 67.0  
 Median : 5.50   C      :1     Median :71.0   Median : 86.0  
 Mean   : 5.50   D      :1     Mean   :69.9   Mean   : 78.7  
 3rd Qu.: 7.75   E      :1     3rd Qu.:90.0   3rd Qu.: 99.0  
 Max.   :10.00   F      :1     Max.   :96.0   Max.   :100.0  
                 (Other):4     NA's   :1      NA's   :1      
   History.1    
 Min.   :  2.0  
 1st Qu.: 45.0  
 Median : 59.0  
 Mean   : 64.7  
 3rd Qu.: 98.0  
 Max.   :100.0  
 NA's   :1      
#As you can see, Student.Name is being treated as a factor
#To check the type of data 
class(marks$English)
[1] "integer"
class(marks$Student.Name)
[1] "factor"
#To check for missing values
is.na(marks$English)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#Alternate way to know total no. of missing values
sum(is.na(marks$English))
[1] 1

The following steps need to be followed to read and process data in R

1. Depending on the type of file, read the data into R. For example, in this case the file was in CSV format, hence read.csv was used.

2. After reading the data, it is important to check the dimensions. This will help us know the size of the data frame that we are dealing with.

3. It is always necessary to check the summary statistics in order to gauge the descriptive statistics of the data that we are dealing with.

4. Any data that we get may not always be clean. Thus, always check for missing values; they need to be taken into account during later processing (a short sketch follows below).
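As a quick illustration of point 4 (a minimal sketch using the marks data read above), most summary functions can be told to skip missing values, or you can keep only the complete rows:
#Computing a statistic while ignoring the missing value in English
mean(marks$English, na.rm = TRUE)
#Keeping only the rows that have no missing values in any column
marks_complete <- marks[complete.cases(marks), ]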
Cleaning of Data

As you can see, the header "History.1" is unwanted and complicated. Data cleaning forms a big chunk of the data analysis time (approx. 90% of the time is spent on this not-so-glamorous part!).
#Reading the file
marks<-read.csv("Marks.csv")
#Checking the header names
names(marks)
[1] "Id"           "Student.Name" "English"      "Maths"       
[5] "History.1"   

#1st way to change header names
names(marks)<-c("Id","Student.Name","English","Maths","History") 
#c -> concatenation 
#Checking if the header names have been changed
names(marks)
[1] "Id"           "Student.Name" "English"      "Maths"       
[5] "History"     
#To convert all header names to lowercase/uppercase
tolower(names(marks))
[1] "id"           "student.name" "english"      "maths"       
[5] "history"     
toupper(names(marks))
[1] "ID"           "STUDENT.NAME" "ENGLISH"      "MATHS"       
[5] "HISTORY"     
#Removing the '.1' from the History.1 header
stringsplit<-strsplit(names(marks),"\\.")
#Using the strsplit function and specifying the character to split at
#(the dot has to be escaped as "\\." because it is a regex metacharacter)
functionsplit<-function(x){x[1]}
#Creating a function which keeps only the first element
sapply(stringsplit,functionsplit)
#Using sapply we can apply the function to every element of the list
[1] "Id"           "Student.Name" "English"      "Maths"       
[5] "History"       
#We have now cleaned the headers of the file
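As an aside (a shorter route, not part of the walkthrough above), the offending header can also be fixed directly by position, or the '.1' suffix stripped with a regular expression via sub():
#Directly renaming the fifth column
names(marks)[5] <- "History"
#Or stripping a trailing ".<number>" from every header
names(marks) <- sub("\\.[0-9]+$", "", names(marks))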