IPython notebook and Spark setup for Windows 10

I recently took a new job as a Senior Data Scientist at a consulting firm, Clarity Solution Group, and as part of the switch into consulting I had to move to a Windows 10 environment. I have been a loyal Mac and Linux user up until now, so setting up all the tools the way I like them was a bit of a jump. Since I will be using Python, IPython notebooks, and Spark for the majority of my work, I wanted to be able to launch everything with a single keyword from the command line. As a prerequisite, you may need to download an unzip tool such as 7-Zip.

Install the Java JDK

You can download and install it from Oracle here. Once it was installed, I created a new folder called Java in Program Files and moved the JDK folder into it. Copy that path and set JAVA_HOME within your system environment variables, then add %JAVA_HOME%\bin to the Path variable. To get there, click Start, then Settings, and search for environment variables. This should bring up ‘Edit your system environment variables’, which should look like this:

[Screenshot: the ‘Edit your system environment variables’ dialog]
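If you want to confirm the variable took effect, a quick check from a fresh console can help. This is a minimal sketch; `check_java_home` is a hypothetical helper, not part of any library:

```python
import os

def check_java_home(env=os.environ):
    """Return the java.exe path implied by JAVA_HOME, or None if it is unset."""
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return None
    return os.path.join(java_home, "bin", "java.exe")

# After opening a fresh console, this should print the path to java.exe
# inside the JDK folder you set above.
print(check_java_home())
```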

Installing and setting up Python and IPython

For simplicity I downloaded and installed the Python 2.7 version of Anaconda from Continuum Analytics (free) using the built-in install wizard. Once it is installed, you need to do the same thing you did with Java: create a variable named PYTHONPATH (in my case the path was C:\User\cconnell\Anaconda2). You will also need to add %PYTHONPATH% to the Path variable.
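Later, once a notebook is up, a two-line sanity check confirms the kernel is actually the Anaconda interpreter rather than some other Python on the Path (the printed path is only an example of what to expect):

```python
import sys

# Confirm the kernel is the Anaconda interpreter you installed, not some
# other Python that happens to be on the Path.
print(sys.executable)                   # e.g. C:\...\Anaconda2\python.exe
print("%d.%d" % sys.version_info[:2])   # this walkthrough assumes 2.7
```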

Installing Spark

I am using 1.6.0 (I like to stay at least one version behind the latest) from Apache Spark here. Once it is unzipped, do the same thing as before: set a SPARK_HOME variable to the location of your Spark folder, then add %SPARK_HOME%\bin to the Path variable.
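As a side note, the pyspark Python package that the notebook driver imports lives under %SPARK_HOME%\python, with the bundled py4j zip under python\lib. A small sketch of where those paths sit in a standard Spark download (`pyspark_paths` is a hypothetical helper):

```python
import glob
import os

def pyspark_paths(spark_home):
    """Paths that must be importable for pyspark to load:
    the python directory of the Spark download plus the bundled py4j zip.
    The py4j file name varies by Spark release, so we glob for it."""
    python_dir = os.path.join(spark_home, "python")
    py4j = glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip"))
    return [python_dir] + py4j
```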

Installing Hadoop binaries

Download and unzip the Hadoop common binaries and set a HADOOP_HOME variable to that location.
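The reason Spark needs these on Windows is winutils.exe, which Spark uses for local filesystem operations; it must sit under %HADOOP_HOME%\bin. A quick existence check (a sketch; the install location in the comment is just an example):

```python
import os

def winutils_path(hadoop_home):
    # Spark on Windows expects winutils.exe under the bin folder of
    # HADOOP_HOME; if it is missing you will see "Could not locate
    # executable ...\bin\winutils.exe" errors at startup.
    return os.path.join(hadoop_home, "bin", "winutils.exe")

# Hypothetical install location -- adjust to wherever you unzipped:
# print(os.path.exists(winutils_path(r"C:\hadoop")))
```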

Getting everything to work together

  1. Go into your spark/conf folder and rename log4j.properties.template to log4j.properties
  2. Open log4j.properties in a text editor and change log4j.rootCategory from INFO to WARN
  3. Add two new environment variables like before: set PYSPARK_DRIVER_PYTHON to jupyter (edited from ‘ipython’ in the screenshot) and PYSPARK_DRIVER_PYTHON_OPTS to notebook

[Screenshot: the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables]
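If you would rather not set the two driver variables globally, you can also set them for a single launch from Python. This is a sketch, assuming pyspark is already on your Path from the earlier steps:

```python
import os
import subprocess

# Copy the current environment and override just the two driver settings
# from step 3 for this one launch.
env = dict(os.environ)
env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"

# Uncomment to launch the notebook-backed PySpark session:
# subprocess.call("pyspark", env=env, shell=True)
```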

Now, to launch your PySpark notebook, just type pyspark at the command prompt and Jupyter will automatically open in your browser. The kernel starts with a SparkContext already created as sc (note that the spark session object only exists in Spark 2.0 and later).

 


8 thoughts on “IPython notebook and Spark setup for Windows 10”

  1. The notebook does open, but the “spark” command is not recognized. How can I solve it?

    spark

    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    in ()
    ----> 1 spark

    NameError: name ‘spark’ is not defined

  2. Thanks for your guidance. I have followed all of the above procedure and my notebook opens when I type ‘pyspark’, but when I enter ‘sc’ I get empty output, and the command prompt shows:

    [I 16:20:44.788 NotebookApp] Accepting one-time-token-authenticated connection from ::1
    [I 16:20:59.105 NotebookApp] Creating new notebook in /Untitled Folder
    [I 16:21:00.055 NotebookApp] Kernel started: 8bbbba3c-932d-46f1-94cb-9c1eeb9810cc
    [IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file C:\spark\bin\..\python\pyspark\shell.py:
