Minute by Minute Twitter Sentiment Timeline from the VP debate


Background

The data for this graph was collected automatically every ~60 seconds during the VP debate on 10/11/2012, for an aggregate sample of 363,163 tweets. Duplicate tweets were then removed (because of bots), leaving a final dataset of 81,124 unique tweets (52,303 for Biden, 28,821 for Ryan). Each point in the graph is the mean sentiment of the tweets gathered during that minute: the farther a point sits above zero, the more positive that minute's tweets were, and the farther it falls below zero, the more negative. It would be very interesting to compare this to the transcript for inference. The most noticeable takeaway is the jump in sentiment as soon as the debate ended at 22:30.

R Code for this data collection and graphing

To collect this data I updated my original code from the presidential debate as follows:

# needs the twitteR (searchTwitter) and plyr (laply) packages, plus the
# score.sentiment() function and word lists from the presidential-debate post below
vp<-function(){
  Ryan=searchTwitter('@PaulRyanVP', n=1500)
  Biden=searchTwitter('@JoeBiden', n=1500)
  # pull the text out of the returned status objects
  textRyan=laply(Ryan, function(t) t$getText())
  textBiden=laply(Biden, function(t) t$getText())
  # score each tweet and tag it with its candidate
  resultRyan=score.sentiment(textRyan, positive.words, negative.words)
  resultRyan$candidate='Ryan'
  resultBiden=score.sentiment(textBiden, positive.words, negative.words)
  resultBiden$candidate='Biden'
  result<-merge(resultBiden, resultRyan, all=TRUE)
  result$candidate<-as.factor(result$candidate)
  result$time<-date()   # timestamp of this pull
  return(result)
}

Then, to have R automatically collect the data every 60 seconds in an endless loop (I wasn't sure when I would want to stop it at the time), you just run a repeat loop.

debate<-vp()
repeat {
  startTime<-Sys.time()
  x<-vp()
  debate<-merge(x, debate, all=TRUE)
  # sleep out the remainder of the 60-second interval (skip if the pull took longer)
  sleepTime<-as.numeric(startTime + 60 - Sys.time(), units="secs")
  if (sleepTime > 0)
    Sys.sleep(sleepTime)
}

At 10:56pm I got bored and the debate was over, so I just hit stop and ran the following to get the graph:
# ggplot2 for the plot, reshape2 for melt()
library(ggplot2)
library(reshape2)
x<-subset(debate, !duplicated(text))    # drop duplicate tweets (bots)
x$minute<-strptime(x$time, "%a %b %d %H:%M:%S %Y")
x$minute1<-format(x$minute,"%H:%M")
x<-subset(x, minute1>="21:00")          # keep only the debate window
period<-unique(x$minute1)
period<-period[order(period)]
# per-minute mean score for each candidate
Biden<-tapply(x$score[x$candidate=="Biden"], x$minute1[x$candidate=="Biden"], mean)
Ryan<-tapply(x$score[x$candidate=="Ryan"], x$minute1[x$candidate=="Ryan"], mean)
mean<-data.frame(period, Biden, Ryan)
dfm<-melt(mean, id.vars="period")       # long format: one row per minute per candidate
ggplot(dfm, aes(period, value, colour=variable, group=variable))+
  geom_point()+geom_line()+xlab("time")+ylab("score")+
  opts(axis.text.x=theme_text(angle=45),
       axis.ticks = theme_blank(), axis.title.y=theme_blank())

I have to admit, doing this actually made watching the debate kind of fun.

For cleaner access to the code, please go to my GitHub.


Presidential Candidate Sentiment Analysis

After watching the presidential debates and hearing all the opinions on how the candidates performed, I got the harebrained idea of creating a simple function that would automate pulling down tweets for each candidate, analyzing how positive or negative each tweet is, and then graphing the results. This project turned out to be a lot easier than I thought, even after playing the debate drinking game.

I started out reading a SlideShare presentation from Jeffrey Breen on airline sentiment analysis, from which I ended up using his score.sentiment() function with only a very minor tweak (line 6 removes foreign characters). The other things you need are the Opinion Lexicon compiled by Minqing Hu and Bing Liu, an amazing collection of 6,800 words that gauge sentiment, and the twitteR R package. All code can be found here.
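
For readers who don't want to click through, this is roughly what the adapted function looks like. It is a sketch based on Breen's published score.sentiment(); the iconv() call stands in for the foreign-character tweak mentioned above, and the exact version I ran is the one on GitHub.

# Sketch of Jeffrey Breen's score.sentiment(), with one tweak: strip
# non-ASCII ("foreign") characters before scoring each tweet.
score.sentiment<-function(sentences, pos.words, neg.words, .progress='none'){
  require(plyr)
  require(stringr)
  scores=laply(sentences, function(sentence, pos.words, neg.words) {
    sentence=gsub('[[:punct:]]', '', sentence)      # drop punctuation
    sentence=gsub('[[:cntrl:]]', '', sentence)      # drop control characters
    sentence=gsub('\\d+', '', sentence)             # drop digits
    sentence=iconv(sentence, to='ASCII', sub='')    # remove foreign characters
    sentence=tolower(sentence)
    words=unlist(str_split(sentence, '\\s+'))
    # score = positive-word matches minus negative-word matches
    sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
  }, pos.words, neg.words, .progress=.progress)
  return(data.frame(score=scores, text=sentences))
}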

While the lexicon is a pretty complete collection, you will need to add politics-specific words. After loading the two files you can easily add to them.

positive.words <- scan("~/Downloads/opinion-lexicon-English/positive-words.txt", what='character', comment.char=';')
negative.words <- scan("~/Downloads/opinion-lexicon-English/negative-words.txt", what='character', comment.char=';')

Add new positive or negative words by simply appending them to the original vector, like so:

negative.words<-c(negative.words, "one percent")

Once you've added all the words, loaded the functions from my GitHub (the .R files), and loaded the required packages, all you have to do is type the following and you're done:

data<-president()

This will give you a histogram with a mean line (dotted) and a data frame of all the tweets and the score for each one.

Obviously the higher the score the more positive the tweets.
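
If you want to redraw the histogram yourself from the returned data frame, something like the following works. It assumes president() returns the same score and candidate columns that vp() builds in the post above; check the .R files on GitHub for the exact names.

library(ggplot2)
library(plyr)
# per-candidate mean score, used for the dotted reference line
means<-ddply(data, .(candidate), summarise, m=mean(score))
ggplot(data, aes(x=score))+
  geom_histogram(binwidth=1)+
  geom_vline(data=means, aes(xintercept=m), linetype="dotted")+
  facet_wrap(~candidate)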

It would be really interesting to track sentiment over time (you can only pull down the 1,500 most recent tweets at a time) and connect it with other variables like macro-economic indicators, poll results, and ad spending, but I just can't devote that much time to a side project. If you add to this project, let me know how it turns out.

Querying a database from within R

For a while now I have been contemplating pulling data from our PostgreSQL db directly into R, but I never actually pulled the trigger until today. What I found was that it was a lot easier than I ever could have imagined. My laptop was already on the VPN, so I decided to try it locally before deploying our RStudio server. After a bit of research, I decided to use the RPostgreSQL CRAN package. It literally only takes two lines of code and you are ready to go.

library(RPostgreSQL)  # loads the PostgreSQL driver and the DBI functions used below

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="database", host="", port="", user="", password="")

The first line loads the driver, in my case the PostgreSQL driver; the second line actually connects you to your database. All you have to do is fill in the name of the db, host, port, username, and password. After that, you're ready to query!

ba <- dbGetQuery(con,"select * from badass limit 100")
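
The result comes back as an ordinary data.frame, so from here it is just regular R. When you are finished, close things out with the standard DBI calls:

head(ba)             # the query result is just a data.frame
dbDisconnect(con)    # close the connection
dbUnloadDriver(drv)  # unload the PostgreSQL driver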

Fun with geocoding and mapping in JGR

For a recent project I had to do some mapping of addresses, but I didn't have their lat/lons to use with the Deducer and DeducerSpatial packages in JGR. After frustrating myself trying to adapt this code from stackoverflow.com, I found a much easier way of geocoding using the dismo and XML packages in R.

First you need to have the complete address in one column or vector, so that Google will give you the right coordinates. Since Address, City, and State were all in different columns, I used the paste function to put them in a new column named "location" (this is strategic, for reasons you will see later) using the following code:

data$location<-paste(data$address, data$city, data$state, sep = ",")

Now you can open and run the geocode on the new variable, like this:

library(dismo)  # provides geocode()
library(XML)    # used by dismo to parse the geocoding response
loc<-geocode(data$location, oneRecord=TRUE, progress = "text")

The new data.frame, loc, will be structured like this:

  ID       lon      lat    lonmin    lonmax   latmin   latmax  location
1  1 -75.17598 39.93971 -75.17733 -75.17463 39.93836 39.94106  1017 S 20th St B,Philadelphia,PA
2  2 -75.12525 40.04373 -75.12660 -75.12390 40.04238 40.04508  267 W Spencer St,Philadelphia,PA
3  3 -75.11745 40.03415 -75.11880 -75.11610 40.03280 40.03550  305 E Gale St,Philadelphia,PA
4  4 -75.15250 39.93793 -75.15385 -75.15115 39.93658 39.93928  522 Queen St,Philadelphia,PA
5  5 -74.98711 40.04627 -74.98846 -74.98576 40.04492 40.04762  45304 Delaire Landing Rd 304,Philadelphia,PA

Because the full address variable is named the same in both data$location and loc$location, you can merge the two data.frames by the location variable using the merge function:

data<-merge(data, loc, all=TRUE)

Now you can map the observations in JGR (downloaded from here) after installing and loading the Deducer and DeducerSpatial packages. First you need to convert the data.frame under the Spatial tab, and then go to the spatial plot builder. Once in the plot builder, you can plot based on any variable in your dataset. When you're happy with your map and plot, just hit run and it will save the finished map for you.
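
If you just want a quick sanity check of the geocoded points without the JGR plot builder, a plain ggplot2 scatter of the lon/lat columns that geocode() returned works too (this is not the Deducer spatial plot described above, just a rough stand-in):

library(ggplot2)
# quick scatter of the geocoded points in the merged data frame
ggplot(data, aes(x=lon, y=lat))+
  geom_point()+
  coord_fixed()   # keep longitude and latitude on the same scale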

Converting cross-sectional data with dates to weekly averages in R

I was recently confronted with a problem where I had to compare two very different datasets. One was cross-sectional data observed on dates spanning three months; the other was weekly averages over those same three months. After a bit of research, I discovered that there is a very simple way to convert the data in R.

First we'll create some sample data with randomly generated dates within our time frame:

first <- as.Date("2012/01/25", "%Y/%m/%d")  # start date
last  <- as.Date("2012/05/11", "%Y/%m/%d")  # end date
dt <- last - first
nSamples <- 1000
set.seed(1)
date <- as.Date(round(first + runif(nSamples) * as.numeric(dt)))

Then we will combine with randomly generated values:

value<-sample(1:10, size=1000, replace=TRUE)
data<-data.frame(value, date)

Now that we have our observations, we can move on to finding the weekly averages. However, our weekly-average data starts with the week ending 1/30/2012, which is a Monday, so you have to assign that date to every day in that week using the lubridate package:

library("lubridate")
data$week<-floor_date(data$date, "week") + 8

The “+8” is because floor_date goes back to the previous Sunday, and we need the date to land on the Monday that ends the reporting week, eight days later.
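
For example, taking one of the dates from the sample data:

floor_date(as.Date("2012-01-25"), "week")       # "2012-01-22", the previous Sunday
floor_date(as.Date("2012-01-25"), "week") + 8   # "2012-01-30", the Monday ending that reporting week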

Now we can use the ddply function from the plyr package to find the average for each week:

library("plyr")
x<-ddply(data, .(week), function(z) mean(z$value))

The ddply function finds the averages of all values within each particular week in the data.

The hard work is now all done; we just need to rename the columns:

colnames(x) <- c("week", "value")
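
For what it's worth, base R's aggregate() gets to the same result in one step and names the columns for you:

# same per-week mean without plyr
x<-aggregate(value ~ week, data=data, FUN=mean)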

You now have the weekly averages to compare to the other dataset.