Opening Large CSV Files in R

Before heading home for the holidays, I had a large data set (1.6 GB, with over 1.25 million rows) with columns of text and integers ripped out of the company (Kwelia) database and put into a .csv file, since I was going to be offline a lot over the break. I tried opening the csv file in the usual way:

all <- read.csv("file.csv")

However, it never finished, even after letting it run all night. I also tried reading it into a SQLite database first and pulling it out of that, but the file was so messy it kept coming back with errors. I finally got it read into R by using the ff package and the following code:

library(ff)
x <- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE, first.rows=10000, next.rows=50000, colClasses=NA)

Because the file was so messy, I had to turn off column classes (colClasses=NA) so the read wouldn't try to assign each column a class from the first 10,000 rows. After reading the first 10,000 rows, the script reads in chunks of 50,000 so as not to completely overload the RAM in my laptop. I also turned on VERBOSE because it would drive me nuts not to be able to follow the progress.
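For what it's worth, the same chunked idea can be sketched in plain base R with an open connection, reading a fixed number of rows per call (the file name and chunk size here are just placeholders):

```r
# Sketch: read a big CSV in fixed-size chunks over an open connection,
# so only one chunk is parsed at a time before the pieces are combined
read_in_chunks <- function(path, chunk_size = 50000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # header line gives the column names (strip the quotes write.csv adds)
  cols <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
  chunks <- list()
  repeat {
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk_size, col.names = cols),
      error = function(e) NULL)  # read.csv errors once the file is exhausted
    if (is.null(chunk)) break
    chunks[[length(chunks) + 1]] <- chunk
    if (nrow(chunk) < chunk_size) break
  }
  do.call(rbind, chunks)
}
```

On a 1.6 GB file ff is still the better choice, since an ffdf keeps the data on disk instead of holding everything in RAM the way rbind does here.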

Mapping Current Average Price Per Sqft for Rentals by Zip in San Fran

My company, Kwelia, is sitting on mountains of data, so I decided to try my hand at mapping.  I have played around with JGR, but it’s just too buggy, at least on my Mac, so I went looking for other alternatives and found a good write-up here.  I decided on mapping price per sqft for apartment rentals by zip code in the Bay Area because we are launching our services there shortly.

First, transforming the data was easy using ddply from the plyr package, after I had queried all the right zip codes out of the database into a data frame.

library(plyr)
w <- ddply(sanfran, .(zip), summarise, pricepersqft=mean(price/sqft))
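On a toy data frame (all numbers made up), you can check what that summary computes; aggregate is the base-R equivalent of the ddply call:

```r
# Toy data (made-up numbers) showing the per-zip average price per sqft;
# aggregate() is the base-R equivalent of the ddply call above
sanfran <- data.frame(
  zip   = c("94110", "94110", "94117"),
  price = c(3000, 3600, 2500),
  sqft  = c(1000, 1200, 1000))
w <- aggregate(pricepersqft ~ zip,
               data = transform(sanfran, pricepersqft = price / sqft),
               FUN = mean)
# pricepersqft for 94110 is 3.0, for 94117 is 2.5
```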

Then it’s a matter of loading the shape file after downloading it from here.



Because ddply sorts the zip codes, I transformed the zip spatial data into a regular data frame, merged it with “w”, added pricepersqft back to the ordered “zip” data frame, and finally subset out zips without data.

##transform to regular data frame
##merge with the ddply data
r=merge(a, w, all=TRUE)
##order zips in the spatial poly data
##merge price per sqft with spatial data
##subset out zips with missing data
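A minimal sketch of those commented steps, using plain data frames and made-up numbers (the real object is spatial polygon data, so the names here are assumptions):

```r
# Stand-ins: "zip" plays the shapefile attribute table, "w" the ddply output
zip <- data.frame(zip = c("94117", "94110", "94102"))
w   <- data.frame(zip = c("94110", "94117"),
                  pricepersqft = c(3.0, 2.5))
## merge with the ddply data (merge sorts the result by zip)
r <- merge(zip, w, all = TRUE)
## order zips so the rows line up with the merged data
zip <- zip[order(zip$zip), , drop = FALSE]
## attach price per sqft to the ordered data
zip$pricepersqft <- r$pricepersqft
## subset out zips with missing data
zip <- zip[!is.na(zip$pricepersqft), ]
```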

Finally comes the plotting, which, luckily, is almost exactly the same as in the example I found above.

# load the packages for palettes, class breaks, and spatial plotting
library(sp)
library(RColorBrewer)
library(classInt)
library(maptools)
# select the color palette and the number of colors (price-per-sqft classes) to represent on the map
colors=brewer.pal(9, "YlOrRd")
# set quantile breaks for the 9 colors
brks=classIntervals(zip$pricepersqft, n=9, style="quantile")$brks
# plot the map
plot(zip, col=colors[findInterval(zip$pricepersqft, brks, all.inside=TRUE)], axes=F)
# add a title
title("SF Bay Area Price Per SQFT for rentals by Zip")
# add a legend
legend(x=6298809, y=2350000, legend=leglabs(round(brks)), fill=colors, bty="n", x.intersp=.5, y.intersp=.5)
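The color assignment hinges on findInterval mapping each value to its class; here is a tiny self-contained check with made-up values and three classes instead of nine:

```r
# Made-up values, three quantile classes instead of nine, showing how
# findInterval picks a color index for each area
vals <- c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)
brks <- quantile(vals, probs = seq(0, 1, length.out = 4))  # 3 classes -> 4 breaks
idx  <- findInterval(vals, brks, all.inside = TRUE)
# all.inside=TRUE clamps the top value into the last class, so idx is 1 1 2 2 3 3
```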

Here are the actual current averages of rental prices by zip:






Building a Simple Web App using R

I’ve been interested in building a web app using R for a while, but never put any time into it until I was informed of the Shiny package.  It looked too easy, so I absolutely had to try it out.

First you need to install the package from the R console:

options(repos=c(RStudio="http://rstudio.org/_packages", getOption("repos")))
install.packages("shiny")

The version in the tutorial uses two R files, one for the front end (ui.R) and the other for the server side (server.R).  However, I wanted a refresher in HTML, so I built it with one HTML file and one R file.  The structure is defined here.

For the data, I’m using MLS (home sales) data for Philadelphia, which I’ve been sitting on for quite a while.  The thought behind the app is to be able to compare monthly average listing prices to monthly average sale prices, and in the end it turned out rather well.

At first, I had server.R try to do everything inside the reactive function, but it literally took minutes to load a new graph. The solution was to run the data through the following ahead of time and just have the reactive function pull from the saved data frames:

## raw data (file name is a placeholder)
d <- read.csv("mls.csv")
## convert to numeric from factored variables
d$listprice <- as.numeric(as.character(d$listprice))
d$soldprice <- as.numeric(as.character(d$soldprice))
# round the listing date to the first of the month
listed<-paste(format(as.POSIXlt(d$listdate), format="%Y-%m"), "01", sep="-")
# round the sales date to the first of the month
solded<-paste(format(as.POSIXlt(d$solddate), format="%Y-%m"), "01", sep="-")
# create the time period
period<-seq(as.Date("2010-02-01"), length=24, by="1 month")
# create empty lists for average monthly listing and sales data
listing <- list()
sold <- list()
## list of all the zip codes in Philly that we'll examine
a <- unique(as.character(d$zip))
# find the average monthly listing and sales figures for each zip code
for(z in a){
listing[[z]] <- sapply(period, function(x) mean(d[x >= listed & x <= solded & d$zip==z, 'listprice']))
sold[[z]] <- sapply(period, function(x) mean(d[x == solded & d$zip==z, 'soldprice']))
}
## save both objects because the server will have to load them
save(sold, listing, file="shiny.RData")
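The month-rounding trick (format to %Y-%m, then glue on a "01") is easy to sanity-check on its own:

```r
# Round arbitrary dates down to the first of their month by formatting
# to year-month and appending "01"
dates   <- as.POSIXlt(c("2010-02-14", "2010-02-28", "2010-03-01"))
monthly <- paste(format(dates, format = "%Y-%m"), "01", sep = "-")
# monthly is "2010-02-01" "2010-02-01" "2010-03-01"
```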

Once the data is saved, the structure of the UI and server can be defined.  The UI in HTML was very easy (found here).  All I did was change the font, add a background pic to the body, add more zip code choices to the drop-down box (<select>), and make sure that the plot was properly defined in a <div>.

The server side was a little more complicated, but once all the data was defined outside of the reactive function it went very smoothly.  First, call any packages you need, load the data, and, finally, call the shinyServer function.

## load all necessary packages
library(shiny)
library(ggplot2)
library(reshape2)
## load the data prepared above
load("shiny.RData")
period <- seq(as.Date("2010-02-01"), length=24, by="1 month")
# Define server logic required to plot various zips
shinyServer(function(input, output) {
# Generate a plot of the requested zip code's sale and listing averages
output$plot <- reactivePlot(function() {
average <- data.frame(period, sold=sold[[input$zip]], listing=listing[[input$zip]])
dfm <- melt(average, id = "period", measure = c("sold","listing"))
c <- qplot(period, value, data=dfm, geom="smooth", colour = variable,
           xlim = as.Date(c("2010-02-01","2012-01-01")), xlab="Date", ylab="Average Price")
print(c)
})
})

The first line inside the shinyServer function sets up the data frame, with columns for the months of the two-year window, the average monthly sales for the selected zip, and the average monthly listings for that zip. The next line melts the data, which the line after uses for plotting. For the plot, I chose qplot and a smoothed line because the sales data was so stochastic.

I didn’t deploy this to the interweb, but the final product looks like this:


For a cleaner version of the final code, go to my GitHub here.

Top Facebook Posts During the US Presidential Debate

The following data was collected during the Presidential Debate on the 22nd of October by tapping into the Facebook social graph API using R.

The top three posted links during the debate for each candidate are:


#1     http://bit.ly/QCODJg

#2     http://bit.ly/RXstnm

#3    http://bit.ly/P8MmJ1


#1    http://bit.ly/zDdsKf

#2    http://bit.ly/SjFbKx

#3    http://bit.ly/SjJaXv

The top five videos (for both candidates) of the debate are:

#1    http://bit.ly/WearzC

#2    http://bit.ly/QTEibe

#3    http://bit.ly/VT9xXE

#4    http://bit.ly/SeIGDJ

#5   http://bit.ly/PcKJV5

The top five shared photos are:

#1    http://on.fb.me/TOTPIY

#2    http://on.fb.me/RYF7m6

#3    http://on.fb.me/QCQLkj

#4    http://on.fb.me/RSkXXv

#5    http://on.fb.me/T7Lr6h

This data was collected by tapping into the Facebook API using the following R code (adapted from cloudstat.org), run in a loop:


fb.base<-paste("http://graph.facebook.com/search?q=",Facebook,"&limit=10",sep="")


library(RCurl)
library(rjson)
for(a in candidate){
## build the search URL for this candidate and pull down the results
fburl <- paste("http://graph.facebook.com/search?q=", a, "&limit=10", sep="")
fbdata <- fromJSON(getURL(fburl))$data
fbdata.length <- length(fbdata)
fbid = facebookers = fbmessage = fbpic = fblink = fbname = fbcaption =
fbdescription = fbicon = fbtype = fbcreated = fbupdated = fblikecount = 0
## pull each field off the i-th post (reconstructed from the variable names)
for (i in 1:fbdata.length){
post <- fbdata[[i]]
fbid[i] <- post$id; facebookers[i] <- post$from$name
fbmessage[i] <- post$message; fbpic[i] <- post$picture
fblink[i] <- post$link; fbname[i] <- post$name
fbcaption[i] <- post$caption; fbdescription[i] <- post$description
fbicon[i] <- post$icon; fbtype[i] <- post$type
fbcreated[i] <- post$created_time; fbupdated[i] <- post$updated_time
fblikecount[i] <- post$likes$count
}
#for(j in 1:fblikecount[i]){ fblikename[i] }
assign(paste0("fbtable.", a), as.data.frame(cbind(fbid, facebookers, fbmessage, fbpic, fblink, fbname, fbcaption, fbdescription, fbicon, fbtype, fbcreated, fbupdated, fblikecount, candidate=a)))
}
debate<-merge(fbtable.Romney, fbtable.Obama, all=TRUE)
debate$fbupdated<-format(as.POSIXct(strptime(as.character(debate$fbupdated), "%Y-%m-%dT%H:%M:%S+0000"), tz="GMT"), tz="America/New_York", usetz=TRUE)
debate$fbcreated<-format(as.POSIXct(strptime(as.character(debate$fbcreated), "%Y-%m-%dT%H:%M:%S+0000"), tz="GMT"), tz="America/New_York", usetz=TRUE)
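The GMT-to-Eastern timestamp conversion at the end can be checked in isolation:

```r
# Convert a Graph-API-style UTC timestamp to US Eastern time
ts  <- "2012-10-23T01:30:00+0000"
gmt <- strptime(ts, "%Y-%m-%dT%H:%M:%S+0000", tz = "GMT")
est <- format(as.POSIXct(gmt), tz = "America/New_York", usetz = TRUE)
# 01:30 UTC on Oct 23 is 9:30pm EDT on Oct 22
```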

Twitter Analysis of the US Presidential Debate

The following are word clouds of tweets for each candidate from the October 16, 2012 debate; the bigger the word, the more often it was used in tweets (click on each word cloud to enlarge):

And the net-negative posts for each candidate:


The R code for creating the word clouds

The following code was adapted from here and is an extension of previous work:

library(tm)
library(wordcloud)
library(RColorBrewer)
ap.corpus <- Corpus(DataframeSource(data.frame(as.character(romneypositive[,3]))))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, tolower)
# r holds extra, debate-specific stopwords to drop along with the standard English ones
ap.corpus <- tm_map(ap.corpus, function(x) removeWords(x, c(r, stopwords("english"))))
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m), decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v), freq = ap.v)
pal2 <- brewer.pal(8, "Dark2")
png("romneypositive.png", width=1280, height=800)
wordcloud(ap.d$word, ap.d$freq, scale=c(8,.2), min.freq=3,
          max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal2)
dev.off()
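Underneath the tm pipeline, the frequency step is just word counting, which you can sanity-check in base R on made-up tweets:

```r
# Base-R sketch of the frequency step: split tweets into words, count,
# and sort so the biggest words come first (tweets are made up)
tweets <- c("debate was great", "great debate tonight", "great night")
words  <- unlist(strsplit(tolower(tweets), "\\s+"))
freqs  <- sort(table(words), decreasing = TRUE)
# "great" appears 3 times, "debate" 2 times, everything else once
```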

Disclaimer: Any errors can be attributed to the fact that I was drinking heavily, that I was in Dallas, and that it was half four in the morning when I finished writing this.

Minute by Minute Twitter Sentiment Timeline from the VP debate

Click on the graph above to enlarge.


The data for this graph was collected automatically every ~60 seconds during the VP debate on 10/11/2012, for an ending aggregate sample of 363,163 tweets.  Duplicate tweets (the result of bots) were then removed, leaving a final dataset of 81,124 unique tweets (52,303 for Biden, 28,821 for Ryan).  Every point in the graph is the mean sentiment of the tweets gathered during that minute: the farther above zero, the more positive the tweets, and the farther below zero, the more negative.  It would be very interesting to compare this to the transcript for inference.  The one very noticeable takeaway is the jump in sentiment as soon as the debate ended at 22:30.

R Code for this data collection and graphing

To collect this data I updated my original code from the presidential debate as follows:

library(twitteR)
library(plyr)
Ryan=searchTwitter('@PaulRyanVP', n=1500)
Biden=searchTwitter('@JoeBiden', n=1500)
textRyan=laply(Ryan, function(t) t$getText())
textBiden=laply(Biden, function(t) t$getText())
resultRyan=score.sentiment(textRyan, positive.words, negative.words)
resultBiden=score.sentiment(textBiden, positive.words, negative.words)
result<-merge(resultBiden, resultRyan, all=TRUE)

Then, to have R automatically collect the data every 60 seconds in an endless loop (I wasn’t sure when I wanted to stop it at the time), you just run a repeat block.

repeat {
startTime <- Sys.time()
x<-vp()
debate<-merge(x, debate, all=TRUE)
# sleep out the rest of the minute before the next pull
sleepTime <- 60 - as.numeric(Sys.time() - startTime, units="secs")
if (sleepTime > 0) Sys.sleep(sleepTime)
}
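The timing logic (do the work, then sleep out whatever is left of the interval) can be factored into a little helper and tried with a short interval; vp() and the 60 seconds are replaced by stand-ins here:

```r
# Run a task every `interval` seconds, sleeping out whatever time the
# task itself didn't use; a bounded loop stands in for the endless repeat
poll <- function(task, interval = 60, times = 3) {
  results <- list()
  for (i in seq_len(times)) {
    startTime <- Sys.time()
    results[[i]] <- task()
    sleepTime <- interval - as.numeric(Sys.time() - startTime, units = "secs")
    if (sleepTime > 0) Sys.sleep(sleepTime)
  }
  results
}
```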

At 10:56pm I got bored and the debate was over, so I just hit stop and ran the following to get the graph:
x<-subset(debate, !duplicated(text))
x$minute<-strptime(x$time, "%a %b %d %H:%M:%S %Y")
# keep only tweets from after the debate started
x<-subset(x, format(minute, "%H:%M") >= "21:00")
# per-minute mean score for each candidate (assumes candidate and score columns)
period<-sort(unique(format(x$minute, "%H:%M")))
Biden<-sapply(period, function(m) mean(x$score[x$candidate=="Biden" & format(x$minute, "%H:%M")==m]))
Ryan<-sapply(period, function(m) mean(x$score[x$candidate=="Ryan" & format(x$minute, "%H:%M")==m]))
mean<-data.frame(period, Biden, Ryan)
dfm<-melt(mean, id="period")
ggplot(dfm, aes(period, value, colour=variable, group=variable)) + geom_line() +
xlab("time") + opts(axis.ticks = theme_blank(), axis.title.y = theme_blank())

I have to admit, doing this actually made watching the debate kind of fun.

For cleaner access to the code, please go to my GitHub.

Presidential Candidate Sentiment Analysis

After watching the Presidential debates and hearing all the opinions on how the candidates performed, I got the harebrained idea of creating a simple function that would automate pulling down tweets for each candidate, analyze the positivity or negativity of the tweets, and then graph them out. This project turned out to be a lot easier than I thought, even after playing the debate drinking game.

I started out reading a slide share from Jeffrey Breen on airline sentiment analysis, from which I ended up using his score.sentiment() function with only a very minor tweak (line 6 removes foreign characters). The other things you need are the Opinion Lexicon written by Minqing Hu and Bing Liu, which is an amazing collection of around 6,800 words that gauge sentiment, and the twitteR R package.  All code can be found here.
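To give the flavor of the approach, here is a stripped-down stand-in for a Breen-style scorer (my own minimal sketch, not his exact function): each text scores positive-word matches minus negative-word matches.

```r
# Minimal stand-in for a Breen-style score.sentiment(): for each text,
# score = (# positive-word matches) - (# negative-word matches)
score_sentiment <- function(texts, pos.words, neg.words) {
  sapply(texts, function(t) {
    t <- tolower(gsub("[[:punct:]]", "", t))  # strip punctuation, lowercase
    words <- unlist(strsplit(t, "\\s+"))
    sum(words %in% pos.words) - sum(words %in% neg.words)
  }, USE.NAMES = FALSE)
}
```

With the real lexicons loaded as below, you would pass positive.words and negative.words in place of toy word lists.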

While the Lexicon is a pretty complete collection, you will need to add politics-specific words.  After loading up the two files, you can easily add to them.

positive.words <- scan("~/Downloads/opinion-lexicon-English/positive-words.txt", what='character', comment.char=';')
negative.words <- scan("~/Downloads/opinion-lexicon-English/negative-words.txt", what='character', comment.char=';')

Add new positive or negative words by simply concatenating them onto the original list, like:

negative.words<-c(negative.words, "one percent")

Once you’ve added all the words, loaded the functions from my GitHub (.R files), and loaded the packages, all you have to do is type the following and you’re done:


This will give you a histogram with a mean line (dotted) and a data frame of all the tweets with the score for each one.

Obviously the higher the score the more positive the tweets.

It would be really interesting to track sentiment over time (you can only pull down the 1500 most recent tweets at a time) and connect it with other variables like macro-economic indicators, poll results, and ad spending, but I just can’t devote that much time to a side project.  If you add to this project, let me know how it turns out.

Querying a database from within R

For a while now I have been contemplating pulling data from our PostgreSQL database directly into R, but I never actually pulled the trigger until today.  What I found was that it was a lot easier than I ever could have imagined.  My laptop was already on the VPN, so I decided to try it locally before deploying our RStudio server.  After a bit of research, I decided to use the RPostgreSQL CRAN package.  It literally takes only two lines of code and you are ready to go.

library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="database", host="", port="", user="", password="")

The first line loads the driver (in my case a PostgreSQL driver); the second line actually connects you to your database.  All you have to do is input the name of the db, host, port, username, and password.  After that, you're ready to query!

ba <- dbGetQuery(con,"select * from badass limit 100")

Fun with geocoding and mapping in JGR

For a recent project I had to do some mapping of addresses, but I didn’t have their lat/lons to use with the Deducer and DeducerSpatial packages in R JGR.  After frustrating myself trying to adapt this code from stackoverflow.com, I found a much easier way of geocoding using the dismo and XML packages in R.

First you need to have the complete address in one column or vector, so that Google will give you the right coordinates.  Since address, city, and state were all in different columns, I used the paste function to put them in a new column named “location” (this is strategic, for reasons you will see later), using the following code:

data$location<-paste(data$address, data$city, data$state, sep = ",")

Now you can load the packages and run the geocode on the new variable, like this:

library(dismo)
library(XML)
loc<-geocode(data$location, oneRecord=TRUE, progress = "text")

The new data.frame, loc, will be structured like this:

  ID       lon      lat    lonmin    lonmax   latmin   latmax
1  1 -75.17598 39.93971 -75.17733 -75.17463 39.93836 39.94106
2  2 -75.12525 40.04373 -75.12660 -75.12390 40.04238 40.04508
3  3 -75.11745 40.03415 -75.11880 -75.11610 40.03280 40.03550
4  4 -75.15250 39.93793 -75.15385 -75.15115 39.93658 39.93928
5  5 -74.98711 40.04627 -74.98846 -74.98576 40.04492 40.04762
                                      location
1            1017 S 20th St B,Philadelphia,PA
2            267 W Spencer St,Philadelphia,PA
3               305 E Gale St,Philadelphia,PA
4                522 Queen St,Philadelphia,PA
5 45304 Delaire Landing Rd 304,Philadelphia,PA

Because the full address variable is named the same in both data frames (data$location and loc$location), you can merge the two data.frames on that variable using the merge function:

data<-merge(data, loc, all=TRUE)
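merge joins on all shared column names automatically, which is why the matching names matter; here is a toy version with made-up addresses and coordinates:

```r
# merge() joins on every column name the two frames share -- here, "location"
data <- data.frame(location = c("a st,phila,pa", "b st,phila,pa"),
                   price    = c(100, 200))
loc  <- data.frame(location = c("a st,phila,pa", "b st,phila,pa"),
                   lon = c(-75.1, -75.2), lat = c(39.9, 40.0))
merged <- merge(data, loc, all = TRUE)
# merged has columns location, price, lon, lat, with rows matched by address
```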

Now you can map observations in R JGR, downloaded from here, after installing and loading the Deducer and DeducerSpatial packages.  First you need to convert the data.frame under the spatial tab, and then go to the spatial plot builder.  Once in the plot builder, you can plot based on any variable in your dataset.  When you’re happy with your map and plot, just hit run and it will save it and give you an output like this:

Converting cross sectional data with dates to weekly averages in R

I was recently confronted with a problem where I had to compare two very different data sets.  One data set was cross sectional data observed on dates over the course of three months, and the other was weekly averages over those same three months.  After a bit of research, I discovered that there is a very simple way to convert the data in R.

First we’ll create some sample data with randomly generated dates within our time frame:

first <- as.Date("2012/01/25", "%Y/%m/%d") ## start date
last <- as.Date("2012/05/11", "%Y/%m/%d") ## end date
dt <- as.numeric(last - first)
nSamples <- 1000
date <- first + sample(0:dt, nSamples, replace = TRUE) ## random dates in the window

Then we will combine the dates with randomly generated values:

value<-sample(1:10, size=1000, replace=TRUE)
data<-data.frame(value, date)

Now that we have our observations, we can move on to finding the weekly averages. However, our weekly average data starts with the week ending 1/30/2012, which is a Monday, so we have to assign that date to every day in that week using the lubridate package:

library(lubridate)
data$week<-floor_date(data$date, "week") + 8

The “+8” is because floor_date goes back to the previous Sunday, and we need the label to land on the following Monday.

Now we can use the ddply function from the plyr package to find the averages for every week:

x<-ddply(data, .(week), function(z) mean(z$value))

The ddply function finds the averages of all values within each particular week in the data.

The hard work is now done; we just need to rename the columns before calling it finished:

colnames(x) <- c("week", "value")
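If you want to double-check the result without plyr or lubridate, the same weekly averaging can be reproduced in base R on a tiny made-up sample:

```r
# Base-R check of the weekly averaging: label each date with a week-ending
# date, then average the values per label (numbers are made up)
data <- data.frame(
  date  = as.Date(c("2012-01-25", "2012-01-26", "2012-02-01")),
  value = c(2, 4, 9))
# same idea as floor_date(date, "week") + 8: previous Sunday, plus 8 days
prev_sunday <- data$date - as.integer(format(data$date, "%w"))  # %w: 0 = Sunday
data$week <- prev_sunday + 8
x <- aggregate(value ~ week, data = data, FUN = mean)
colnames(x) <- c("week", "value")
# week 2012-01-30 averages to 3, week 2012-02-06 to 9
```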

You now have the weekly averages to compare to the other dataset.