Scheduling R Tasks with Crontabs to Conserve Memory

One of R’s biggest pitfalls is that it eats up memory without letting it go. This becomes a real problem if you are running very big jobs, have a lot of tasks to run, or share a local machine or R server with other users. When I run huge jobs on my Mac, I can pretty much forget about doing anything else, like watching a movie or RAM-intensive gaming. At my company, Kwelia, I run a few servers, a couple of them dedicated solely to R jobs with multiple users, and I really don’t want to up the size of a server just for the few times that memory is exhausted by multiple large jobs or by everyone logging on at once. To solve this problem, I borrowed a tool from the Linux world, crontab (we use an Ubuntu server, but it works on my Mac as well), to schedule my Rscripts to run at off hours (between 2am and 8am), and the result is that I can almost cut the size of the server in half.

Installing a cron scheduler is easy in a Linux environment (I used this tutorial and this video), and the steps are similar on a Mac; Windows needs its own task scheduler instead. From the command line, enter the following to install:

sudo apt-get install gnome-schedule

Then create a new task. If you are the root user or an admin, edit the crontab with:

sudo crontab -e

or, for a specific user:

crontab -u yourusername -e

You must then choose your preferred text editor. I chose nano, but vim works just as well. This opens a file that looks like this:
[Screenshot: the crontab file opened in the editor]

A cron job is laid out in this format: minute (0-59), hour (0-23, 0 = midnight), day of month (1-31), month (1-12), weekday (0-6, 0 = Sunday), command. To run an R script, the command is simply “Rscript” followed by the file path. An example:

0 0 * * * Rscript Dropbox/rstudio/dbcode/loop/loop.R

This runs the loop.R file at midnight (the zero minute of the zero hour) every day of every week of every month, because the stars mean “all”. I have run endless repeat loops in previous posts, but R consumes memory and never frees it. Running cron jobs instead is like opening and closing R for every run, so the memory is freed (probably not totally) after each job is done.
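For instance, to push a couple of jobs into that 2am-8am window, the crontab entries could look like the following (the paths and file names here are just placeholders for illustration); redirecting the output to a log file also makes it easy to check in the morning whether anything failed:

0 2 * * * Rscript /home/chris/jobs/pull_data.R >> /home/chris/logs/pull_data.log 2>&1
30 5 * * * Rscript /home/chris/jobs/build_models.R >> /home/chris/logs/build_models.log 2>&1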

As an example, I ran the same job in a repeat loop every twelve hours (left side of the black vertical line) and then as a cron job called at 8pm and 8am (right side). Here’s the memory usage as seen through munin:

[Munin memory graphs: the repeat loop on the left, the scheduled cron jobs on the right]

I don’t have to worry nearly as much about my server overloading now, and I could actually downsize the server.

QED


Heatmapping Washington, DC Rental Price Changes using OpenStreetMap

Percentage change of median price per square foot from July 2012 to July 2013:

[Map: percentage change in median price per square foot, stamen-toner tiles]

Percentage change of median price from July 2012 to July 2013:

[Map: percentage change in median price, waze tiles]

Last November I made a choropleth of median rental prices in the San Francisco Bay Area using data from my company, Kwelia. I have wanted to figure out how to plot a similar heat map over an actual map tile, so I once again took some Kwelia data and plotted both the percentage change in median price and the percentage change in price per square foot from July 2012 to this past month (yep, we have real-time data).

How it’s made:

While the Google Maps API through R is very good, I decided to use the OpenStreetMap package because I am a complete supporter of open source projects (which is why I love R).

First, you have to download the shape files; in this case I used census tracts from the US Census TIGER/Line files. Then you need to read them into R with the maptools package and merge your data with the shape file:

library("maptools")
zip=readShapeSpatial( "tl_2010_11001_tract10.shp" )

##merge data with shape file
 zip$geo_id=paste("1400000US", zip$GEOID10, sep="")
 zip$ppsqftchange <- dc$changeppsqft[match(zip$geo_id,dc$geo_id , nomatch = NA )]
 zip$pricechange <- dc$changeprice[match(zip$geo_id,dc$geo_id , nomatch = NA )]

Then you pull down the map tile from OpenStreetMap. I used the max and min coordinates from the shape file as the two corners of the tile; the two maps above use the “waze” and “stamen-toner” tile types:

map = openproj(openmap(c(lat = max(as.numeric(as.character(zip$INTPTLAT10))),
                         lon = min(as.numeric(as.character(zip$INTPTLON10)))),
                       c(lat = min(as.numeric(as.character(zip$INTPTLAT10))),
                         lon = max(as.numeric(as.character(zip$INTPTLON10)))),
                       type = "stamen-toner"))

Finally, the plotting. The one thing different from plotting the Bay Area choropleths is adjusting the transparency of the colors. To adjust the transparency you append two extra hex digits to the end of each color (00 is fully transparent and FF is fully opaque; I use 60 below), as you will see in the comments:

## RColorBrewer for the palette, classInt for the breaks
library("RColorBrewer")
library("classInt")

## grab nine colors
colors = brewer.pal(9, "YlOrRd")
## make nine breaks in the values
brks = classIntervals(zip$pricechange, n=9, style="quantile")$brks
## apply the breaks to the colors
cols <- colors[findInterval(zip$pricechange, brks, all.inside=TRUE)]
## change the tract colors to an alpha (transparency) of 60%
cols <- paste0(cols, "60")
is.na(cols) <- grepl("NA", cols)
## do the same for the legend colors
colors <- paste0(colors, "60")

## plot the open street map
plot(map)
## add the shape file with the percentage changes on top of the OSM tile
plot(zip, col = cols, axes=F, add=TRUE)
## add the legend, with labels rounded to one decimal, text at 75% size (cex) and no border (bty)
legend('right', legend = leglabs(round(brks, 1)), fill = colors, bty="n", cex=.75)

Getting started with twitteR in R

I have been asked by a few people lately to help walk them through using the Twitter API in R, and I’ve always just directed them to the blog post I wrote last year during the US presidential debates, not knowing that Twitter had changed a few things. Having had my interest piqued by a potential project at work, I tried using some of my old code, only to be confronted with errors.

First of all, you now need a consumer key and secret from Twitter itself. After some research, I found it really easy to get one by going to Twitter and creating a new application. Don’t be discouraged, anyone can get one. Here is what the page looks like:

[Screenshot: Twitter’s create-an-application page]

Enter your name, a brief description, and a website (you can use your blog or a placeholder), and once you agree it will give you a screen like this where you get your consumer key and secret:

[Screenshot: the application page showing the consumer key and secret]

You now have to authenticate within R by inserting your consumer key and secret into this code:

library("twitteR")
getTwitterOAuth(consumer_key, consumer_secret)

It should spit out some text and a URI where you get a PIN to enter back into R, like:

To enable the connection, please direct your web browser to:
https://api.twitter.com/oauth/authorize?oauth_token=xpf0KGiALpjeChEQvWfP6HqV31VnpZKSs
When complete, record the PIN given to you and provide it here:

You are now ready to use the searchTwitter() function. Since I work in real estate software at Kwelia, I wanted to do sentiment analysis for apartment hunting in Manhattan, so I wrote the following:

searchTwitter('apartment hunting', geocode='40.7361,-73.9901,5mi',  n=5000, retryOnRateLimit=1)

where “apartment hunting” is what I am searching for, geocode is a lat/long plus a five-mile radius that the tweets must come from (centered on Union Square, Manhattan), n is the number of tweets I want, and retryOnRateLimit controls how many times to retry when n runs up against Twitter’s rate limit, so you simply get back however many tweets are available if n is too high. In this case, I got back 177 tweets.
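searchTwitter() returns a list of status objects; to actually work with them, I find it easiest to flatten the list into a data frame with twitteR’s twListToDF() and go from there. A minimal sketch, assuming the search above is saved to a variable:

## save the search results instead of just printing them
tweets <- searchTwitter('apartment hunting', geocode='40.7361,-73.9901,5mi', n=5000, retryOnRateLimit=1)
## flatten the list of status objects into a data frame
df <- twListToDF(tweets)
## the tweet text is now in df$text, ready for sentiment scoring
head(df$text)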

QED

Tapping the FourSquare Trending Venues API with R

I came up with the following function to tap into the FourSquare trending venues API:

library("RCurl", "RJSONIO")
 
foursquare<-function(x,y,z){
    ## build the API call: lat/long, 2km radius, your OAuth token, and the version date
    w<-paste("https://api.foursquare.com/v2/venues/trending?ll=",x,"&radius=2000&oauth_token=",y,"&v=",z,sep="")
    u<-getURL(w)
    test<-fromJSON(u)
    ## empty vectors to hold each venue's fields
    locationname=""
    lat=""
    long=""
    zip=""
    herenowcount=""
    likes=""
    ## loop over the venues returned and pull out each field
    for(n in 1:length(test$response$venues)) {
        locationname[n] = test$response$venues[[n]]$name
        lat[n] = test$response$venues[[n]]$location$lat
        long[n] = test$response$venues[[n]]$location$lng
        zip[n] = test$response$venues[[n]]$location$postalCode
        herenowcount[n]<-test$response$venues[[n]]$hereNow$count
        likes[n]<-test$response$venues[[n]]$likes$count
        xb<-as.data.frame(cbind(locationname, lat, long, zip, herenowcount, likes))
    }
    ## timestamp the pull
    xb$pulled=date()
    return(xb)
}

where x = ”lat,long”, y = your OAuth token, and z = the version date. You can find your OAuth token by signing into FourSquare, going to https://developer.foursquare.com/docs/venues/trending, clicking the “try it out” button, and copying the code that appears where the redacted box is below.

[Screenshot: the FourSquare API explorer with the OAuth token redacted]

An example:

philly<-foursquare("39.9572,-75.1691","XXXXDSAFAEWRFAEFRAAFDASDFASFD","20130304")

or you can keep scraping by calling it inside a repeat loop.
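A minimal sketch of that kind of scrape, assuming you are happy pulling every 15 minutes and appending everything to one growing data frame (the token and coordinates below are placeholders):

trending <- data.frame()
repeat {
    ## pull the venues currently trending around center city Philly
    pull <- foursquare("39.9572,-75.1691", "XXXXTOKENXXXX", format(Sys.Date(), "%Y%m%d"))
    trending <- rbind(trending, pull)
    ## wait 15 minutes before hitting the API again
    Sys.sleep(60 * 15)
}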

QED

UPDATE Multiple PostgreSQL Table Records in Parallel

Unfortunately, the RPostgreSQL package (and, I’m pretty sure, the other SQL drivers as well) has no provision for UPDATEing multiple records (say, a whole data.frame) at once, nor does it allow placeholders, making the UPDATE a one-row-at-a-time ordeal. So I built a workaround hack to do the job in parallel. The big catch is that you have to open and close the connection inside every iteration; since the loop touches every row, leaving connections open would quickly exceed the database’s max connections.

First the function for connecting, updating, and closing the DB:

update <- function(i) {
    ## open a fresh connection for this row
    drv <- dbDriver("PostgreSQL")
    con <- dbConnect(drv, dbname="db_name", host="localhost", port="5432", user="chris", password="password")
    ## build the UPDATE statement for row i of the data frame
    txt <- paste("UPDATE data SET column_one=",data$column_one[i],",column_two=",data$column_two[i]," where id=",data$id[i])
    dbGetQuery(con, txt)
    ## close the connection so open connections don't pile up
    dbDisconnect(con)
}

Then run the updates in parallel:

library("foreach")
library("doMC")

registerDoMC()

foreach(i = 1:length(data$column_one), .inorder=FALSE,.packages="RPostgreSQL")%dopar%{
    update(i)
}
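If the database still complains about too many simultaneous connections, one option is to cap the number of parallel workers explicitly rather than letting doMC pick for you (just a suggestion, not something I needed in the end):

## limit the workers so the database never sees more than four connections at once
registerDoMC(cores = 4)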

QED

Opening Large CSV Files in R

Before heading home for the holidays, I had a large data set (1.6 GB, over 1.25 million rows) with columns of text and integers ripped out of the company (Kwelia) database and put into a .csv file, since I was going to be offline a lot over the break. I tried opening the csv file in the usual way:

all <- read.csv("file.csv")

However, it never finished, even after letting it run all night. I also tried reading it into a SQLite database first and reading it out of that, but the file was so messy it kept coming back with errors. I finally got it read into R using the ff package and the following code:

library("ff")
x<- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE, first.rows=10000, next.rows=50000, colClasses=NA)

Because the file was so messy, I had to turn off column classes (colClasses=NA) so the read wouldn’t lock each column into a class based only on the first 10,000 rows. After reading the first 10,000 rows, the script then reads in chunks of 50,000 so as to not completely overload the RAM in my laptop. I also turned on VERBOSE because it would drive me nuts not to be able to follow the progress.
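The result is an ffdf object rather than a regular data frame, so the data stays out of RAM until you ask for it; you can still inspect it and pull manageable chunks back into normal data frames when needed. A quick sketch (the chunk size is arbitrary):

## the ffdf behaves like a data frame for basic inspection
dim(x)
names(x)
## pull a slice back into a regular in-memory data frame for analysis
chunk <- x[1:100000, ]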

Mapping Current Average Price Per Sqft for Rentals by Zip in San Fran

My company, Kwelia, is sitting on mountains of data, so I decided to try my hand at mapping. I have played around with JGR, but it’s just too buggy, at least on my Mac, so I went looking for other alternatives and found a good write-up here. I decided on mapping price per square foot for apartment rentals by zip code in the Bay Area because we are launching our services there shortly.

First, transforming the data was easy using ddply from the plyr package, after I had queried all the right zip codes into a data frame from the database.

library("plyr")
w=ddply(sanfran, .(zip), summarise, pricepersqft=mean(price/sqft))

Then it’s a matter of loading the shape file after downloading it from here.

library(maptools)
library(RColorBrewer)
library(classInt)

zip=readShapePoly("bayarea_zipcodes.shp")

The ddply output is sorted by zip code, so I transformed the zip spatial data into a regular data frame, merged it with “w”, added pricepersqft to the zip-ordered spatial data, and finally subset out zips without data.

##transform the spatial data to a regular data frame
a=as.data.frame(zip)
##merge with the ddply data (the shape file calls the column ZIP, the ddply output calls it zip)
r=merge(a, w, by.x="ZIP", by.y="zip", all=TRUE)
##order zips in the spatial polygon data
d=zip[order(zip$ZIP),]
##merge price per sqft with the spatial data
d@data$pricepersqft=r$pricepersqft
##subset out zips with missing data
yy=d[!is.na(d$pricepersqft),]

Finally comes the plotting, which, luckily, is almost exactly the same as in the example I found above.

#select the color palette and the number of colors (price-per-sqft buckets) to represent on the map
colors=brewer.pal(9, "YlOrRd")
#set breaks for the 9 colors using the price per sqft values
brks=classIntervals(yy$pricepersqft, n=9, style="quantile")
brks=brks$brks
#plot the map
plot(yy, col=colors[findInterval(yy$pricepersqft, brks, all.inside=TRUE)], axes=F)
#add a title
title(paste("SF Bay Area Price Per SQFT for Rentals by Zip"))
#add a legend
legend(x=6298809, y=2350000, legend=leglabs(round(brks)), fill=colors, bty="n", x.intersp=.5, y.intersp=.5)

Here are the actual current averages of rental prices by zip:

[Map: current average price per square foot for rentals by zip in the SF Bay Area]

QED

Building a Simple Web App using R

I’ve been interested in building a web app using R for a while, but never put any time into it until I was informed of the Shiny package.  It looked too easy, so I absolutely had to try it out.

First you need to install the package from the R console:

options(repos=c(RStudio="http://rstudio.org/_packages", getOption("repos")))
install.packages("shiny")

The version in the tutorial uses two R files, one for the front end (ui.R) and the other for the server side (server.R). However, I wanted a refresher in HTML, so I built it with one HTML file and one R file. The structure is defined here.

For the data, I’m using MLS (home sales) data for Philadelphia, which I’ve been sitting on for quite a while. The idea behind the app is to compare monthly average listing prices to monthly average sale prices, and in the end it turned out rather well.

At first, I had server.R try to do everything inside the reactive function, but it literally took minutes to load a new graph. The solution was to precompute the data with the following script and have the reactive function simply pull from the saved data frames:

## d is the raw MLS data frame pulled from the database
## convert the zip code to numeric from a factored variable
zip<-as.numeric(as.character(d$area))
#round the listing date to the first of the month
listed<-paste(format(as.POSIXlt(d$listdate), format="%Y-%m"), "01", sep="-")
#round the sale date to the first of the month
solded<-paste(format(as.POSIXlt(d$solddate), format="%Y-%m"), "01", sep="-")
#create the time period
period<-seq(as.Date("2010-02-01"), length=24, by="1 month")
#create empty data frames for the average monthly listing and sale data
listing<-data.frame(period=period)
sold<-data.frame(period=period)
##list of all zip codes in Philly that we'll examine
a<-list("19103","19111","19114","19115","19116","19119","19120",
"19124","19125","19128","19130","19131","19134","19135","19136",
"19138","19142","19143","19144","19145","19146","19147","19148",
"19149","19152","19154")
#find the average monthly listing and sale figures for each zip code
for(z in a){
listing[[z]]<- sapply(period, function(x) mean(d[x >= listed & x <= solded & zip==z, 'listprice']))
sold[[z]]<-sapply(period, function(x) mean((d[x == solded & zip==z, 'soldprice'])))
}
##save both data frames because the server will have to load them
save(sold, listing, file="shiny.RData")

Once the data is saved, the structure of the UI and the server can be defined. The UI in HTML was very easy (found here). All I did was change the font, add a background picture to the body, add more zip code choices to the drop-down box (<select>), and make sure the plot was properly defined in a <div>.

The server side was a little more complicated, but once all the data was defined outside of the reactive function it went very smoothly. First, load any packages you need and the precomputed data, and, finally, call the shinyServer function.

## load all necessary packages
library("shiny")
library("zoo")
library("reshape2")
library("ggplot2")
library("plyr")
library("gridExtra")
## load the precomputed data
load("~/Documents/shiny.RData")

# Define server logic required to plot the various zips
shinyServer(function(input, output) {
# Generate a plot of the requested zip code's sale and listing averages
output$plot <- reactivePlot(function() {
average<-data.frame(period, sold=sold[[input$zip]], listing=listing[[input$zip]])
dfm <- melt(average, id = "period", measure = c("sold","listing"))
c<-qplot(period, value, data=dfm, geom="smooth", colour = variable, xlim = as.Date(c("2010-02-01","2012-01-01")), xlab="Date", ylab="Average Price")+
scale_colour_manual(values=c("#0000CC","#000000"))
print(c)
})
})

The first line inside the reactive function sets up the data frame, with columns for the months of the two-year period, the average monthly sale prices for the selected zip, and the average monthly listing prices for that zip. The next line melts the data, which the following line uses for plotting. For the plot, I chose qplot with a smoothed line because the raw sales data was so noisy.

I didn’t deploy this to the interweb, but the final product looks like this:

QED

For a cleaner version of the final code, go to my GitHub here.

Top Facebook Posts During the US Presidential Debate

The following data was collected during the Presidential Debate on the 22nd of October by tapping into the Facebook social graph API using R.

The top three posted links during the debate for each candidate are:

Obama-

#1     http://bit.ly/QCODJg

#2     http://bit.ly/RXstnm

#3    http://bit.ly/P8MmJ1

Romney-

#1    http://bit.ly/zDdsKf

#2    http://bit.ly/SjFbKx

#3    http://bit.ly/SjJaXv

The top five video (for both candidates) of the debate are:

#1    http://bit.ly/WearzC

#2    http://bit.ly/QTEibe

#3    http://bit.ly/VT9xXE

#4    http://bit.ly/SeIGDJ

#5   http://bit.ly/PcKJV5

The top five shared photos are:

#1    http://on.fb.me/TOTPIY

#2    http://on.fb.me/RYF7m6

#3    http://on.fb.me/QCQLkj

#4    http://on.fb.me/RSkXXv

#5    http://on.fb.me/T7Lr6h

This data was collected by tapping into the Facebook API with the following R code (adapted from cloudstat.org) run in a loop:

library(RCurl)
library(RJSONIO)

## query the Facebook Graph search API for a keyword and parse the JSON response
Result<-function(Facebook){
fb.base<-paste("http://graph.facebook.com/search?q=",Facebook,"&limit=10",sep="")
fb.url<-getURL(fb.base)
fb.parse<-fromJSON(fb.url)
return(fb.parse)
}

FBdebate<-function(x){
candidate<-list("Romney","Obama")
for(a in candidate){
## search for posts mentioning the candidate and grab the data element of the response
fbresult<-Result(a)
fbdata<-fbresult$data
fbdata.length<-length(fbdata)
fbid = facebookers = fbmessage = fbpic = fblink = fbname = fbcaption =
fbdescription = fbicon = fbtype = fbcreated = fbupdated = fblikecount = 0
## pull each field out of the Graph API search response
for (i in 1:fbdata.length){
fbid[i]<-fbdata[[i]]$id
facebookers[i]<-fbdata[[i]]$from[["name"]]
fbmessage[i]<-fbdata[[i]]$message
fbpic[i]<-fbdata[[i]]$picture
fblink[i]<-fbdata[[i]]$link
fbname[i]<-fbdata[[i]]$name
fbcaption[i]<-fbdata[[i]]$caption
fbdescription[i]<-fbdata[[i]]$description
fbicon[i]<-fbdata[[i]]$icon
fbtype[i]<-fbdata[[i]]$type
fbcreated[i]<-fbdata[[i]]$created_time
fbupdated[i]<-fbdata[[i]]$updated_time
fblikecount[i]<-fbdata[[i]]$likes$count
}
## one data frame per candidate
assign(paste0("fbtable.", a),as.data.frame(cbind(fbid, facebookers, fbmessage, fbpic, fblink, fbname, fbcaption, fbdescription, fbicon, fbtype, fbcreated, fbupdated, fblikecount, candidate=a)))
}
## combine both candidates' tables and convert the GMT timestamps to Eastern time
debate<-merge(fbtable.Romney,fbtable.Obama, all=TRUE)
debate$fbupdated<-format(as.POSIXct(strptime(as.character(debate$fbupdated), "%Y-%m-%dT%H:%M:%S+0000"), tz="GMT"), tz="America/New_York",usetz=TRUE)
debate$fbcreated<-format(as.POSIXct(strptime(as.character(debate$fbcreated), "%Y-%m-%dT%H:%M:%S+0000"), tz="GMT"), tz="America/New_York",usetz=TRUE)
return(debate)
}
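To collect posts over the course of the whole debate, the function just gets called over and over and the batches stacked up. A minimal sketch of that loop (the five-minute interval and the accumulating data frame are one way to do it, not the exact script I ran):

allposts <- data.frame()
repeat {
    ## pull the latest batch of posts for both candidates and stack them up
    allposts <- rbind(allposts, FBdebate(x = NULL))
    ## wait five minutes before hitting the API again
    Sys.sleep(60 * 5)
}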

Twitter Analysis of the US Presidential Debate

The following are word clouds of tweets about each candidate from the October 16, 2012 debate (click on each word cloud to enlarge):

And the net-negative posts for each candidate:

Please note that the bigger the word is in the word cloud the more often it was used.

The R code for creating the word clouds:

The following code was adapted from here and is an extension of previous work:


## tm for text mining, wordcloud for plotting, RColorBrewer for the palette
library("tm")
library("wordcloud")
library("RColorBrewer")

## romneypositive is the set of positively scored Romney tweets from the earlier sentiment step
## (column 3 holds the tweet text); r is a vector of extra words to strip out along with the stopwords
ap.corpus <- Corpus(DataframeSource(data.frame(as.character(romneypositive[,3]))))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, tolower)
ap.corpus <- tm_map(ap.corpus, function(x) removeWords(x, c(r,stopwords("english"))))
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
## sort words by how often they appear across all tweets
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)
pal2 <- brewer.pal(8,"Dark2")
## write the word cloud out to a png
png("romneypositive.png", width=1280,height=800)
wordcloud(ap.d$word,ap.d$freq, scale=c(8,.2),min.freq=3,
max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal2)
dev.off()

Disclaimer: Any errors can be attributed to the fact that I was drinking heavily, that I was in Dallas, and that it was half four in the morning when I finished writing this.