IPython notebook and Spark setup for Windows 10

I recently took a new job as a Senior Data Scientist at a Consulting firm, Clarity Solution Group, and as part of the switch into consulting I had to switch to a Windows (10) environment. I have been a loyal Mac and Linux user up until now, so it was a bit of a jump setting up all the tools the way I like. Since I will be using Python, IPython notebooks, and Spark for the majority of the work that I do I wanted to be able to initiate it with a single keyword from the command line. As a pre-requisite you may need to download an unzip tool like 7z

Install Java jdk

You can download and install from Oracle here. Once installed I created a new folder called Java in program files moved the JDK folder into it.  Copy the path and set JAVA_HOME within your system environment variables.  The add %JAVA_HOME%\bin to Path variable. To get the click start, then settings, and then search environment variables. This should bring up ‘Edit your system environment variables’, which should look like this:

Screenshot (2)

Installing and setting up Python and IPython

For simplicity I downloaded and installed Anaconda with the python 2.7 version from Continuum Analytics (free) using the built-in install wizard. Once installed you need to do the same thing you did with Java.  Name the variable as PYTHONPATH, where my paths were C:\User\cconnell\Anaconda2. You will also need to add %PYTHONPATH% to the Path variable

Installing Spark

I am using 1.6.0 (I like to go back at least one version) from Apache Spark here. Once unzipped do the same thing as before setting Spark_Home variable to where your spark folder is and then add %Spark_Home%\bin to the path variable.

Installing Hadoop binaries

Download and unzip the Hadoop common binaries and set a HADOOP_HOME variable to that location.

Getting everything to work together

  1. Go into your spark/conf folder and rename log4j.properties.template to log4j.properties
  2. Open log4j.properties in a text editor and change log4j.rootCategory to WARN from INFO
  3. Add two new environment variables like before: PYSPARK_DRIVER_PYTHON to jupyter (edited from ‘ipython’ in the pic) and PYSPARK_DRIVER_PYTHON_OPTS to notebook

Screenshot (1)

Now to launch your PySpark notebook just type pyspark from the console and it will automatically launch in your browser.