Wednesday, 24 December 2014

Wordcloud on Ashes Series

ASHES - Desire for Domination

"England have only three major problems. They can't bat, they can't bowl and they can't field." - Martin Jonson (England's tour of Australia 1986-7)

With the ongoing Ashes series gathering steam, I decided to apply some analytics to the conversation around it and pull out a few interesting insights. For this, I collected tweets from Twitter carrying the hashtag #ashes. The data processing has been done in R, a statistical software environment.

A wordcloud (or tag cloud) highlights the frequency of words in a text document using a simple, intuitive visualization: the larger the text, the greater the frequency. Words of the same size occur at roughly the same rate, and depending on the settings, colour can also be mapped to frequency.
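To make the size-frequency relationship concrete, here is a minimal, self-contained R sketch; the words and counts are made up for illustration and are not drawn from the Ashes tweets:

#Toy wordcloud: the freq vector drives the size of each word
library(wordcloud)
words <- c("ashes", "england", "australia", "cricket", "bowled", "wicket")
freqs <- c(60, 35, 35, 20, 10, 10)
wordcloud(words = words, freq = freqs, min.freq = 1, random.order = FALSE)

Here "england" and "australia" share a count of 35, so they are drawn at the same size.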


Technical Details & Code

I have broken the overall process down into a few steps for ease of reading.



  1. The text mining program makes use of five R packages, namely ROAuth, twitteR, RJSONIO, tm and wordcloud. Install the requisite packages and get authorized to access content from Twitter.

  2. The authorization process is complete once you enter the PIN (token) that Twitter gives you into the R console.

  3. Next, pull the tweets carrying the specified hashtag, setting the number of tweets you want to retrieve.

  4. Finally, the data needs to be cleaned, after which the minimum frequency and maximum word limits can be set to plot the wordcloud.

Please note: though the code requests 1,500 tweets, only 799 were returned by the Twitter API.
#Installing Packages
install.packages("ROAuth")
install.packages("twitteR")
install.packages("RJSONIO")
install.packages("tm")
install.packages("wordcloud")
install.packages("knitr")
#Loading Packages
library("knitr")
library("ROAuth")library("twitteR")library("RJSONIO")library("tm")
library("wordcloud")load("twitter_auth.Rdata")
#Registering on twitter API 
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "http://api.twitter.com/oauth/access_token"
authURL <- "http://api.twitter.com/oauth/authorize"
#Important step for Windows users
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
#Follow the link:https://twitter.com/apps/new to get your consumer key and secret.
consumerKey <- "Enter your Consumer Key"
consumerSecret <- "Enter your consumer secret key"
Cred <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret,
                         requestURL = reqURL, accessURL = accessURL, authURL = authURL)
Cred$handshake(cainfo = "cacert.pem")
#When complete, record the PIN given to you and enter it in the console
save(Cred, file = "twitter_auth.Rdata")
registerTwitterOAuth(Cred)
#Extracting tweets
Ashes <- searchTwitter('#ashes', n = 1500, lang = 'en', cainfo = "cacert.pem")
Ashes <- sapply(Ashes, function(x) x$getText())
#Create a corpus
Ashes_corpus <- Corpus(VectorSource(Ashes))
#Cleaning of data
Ashes_corpus <- tm_map(Ashes_corpus, tolower)
Ashes_corpus <- tm_map(Ashes_corpus, removePunctuation)
Ashes_corpus <- tm_map(Ashes_corpus, function(x) removeWords(x, stopwords()))
 #Selecting color palettes for wordcloud
library(RColorBrewer)
pal2 <- brewer.pal(8,"Pastel2")
wordcloud(Ashes_corpus, scale = c(4,1), min.freq = 5,
          random.order = T, random.color = T, colors = pal2)
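If you want to sanity-check the min.freq = 5 cutoff before plotting, the term frequencies can be inspected directly. This is an optional sketch, assuming the cleaned Ashes_corpus built above (with the version of the tm package current at the time); it is not part of the original workflow:

#Optional: look at the most frequent terms before choosing min.freq
tdm <- TermDocumentMatrix(Ashes_corpus)
word_freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(word_freqs, 10)  #the ten most common words across the tweets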

Acknowledgements

The following resources have been used for this post.

1. Tweetsent

2. Mining Twitter with R

3. One R tip a day
