I recently took a new job as a Senior Data Scientist at a consulting firm, Clarity Solution Group, and as part of the switch into consulting I had to move to a Windows (10) environment. I have been a loyal Mac and Linux user up until now, so it was a bit of a jump setting up all the tools the way I like. Since I will be using Python, IPython notebooks, and Spark for the majority of my work, I wanted to be able to launch everything with a single keyword from the command line. As a prerequisite you may need to download an unzip tool like 7-Zip.
Install the Java JDK
You can download and install it from Oracle here. Once it was installed I created a new folder called Java in Program Files and moved the JDK folder into it. Copy that path and set JAVA_HOME within your system environment variables, then add %JAVA_HOME%\bin to the Path variable. To get there, click Start, then Settings, and search for "environment variables"; this should bring up 'Edit your system environment variables'.
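If you prefer the command line to the Settings dialog, the same two variables can be set from a Command Prompt with setx; a sketch, assuming a hypothetical install path of C:\Program Files\Java\jdk1.8.0 (substitute the folder you actually created):

```shell
:: Set JAVA_HOME for the current user (hypothetical path -- use your own JDK folder)
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0"

:: Append %JAVA_HOME%\bin to the user Path so java.exe resolves on the command line
setx Path "%Path%;%JAVA_HOME%\bin"
```

Note that setx only affects consoles opened afterwards, so open a fresh window and run java -version to confirm.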
Installing and setting up Python and IPython
For simplicity I downloaded and installed Anaconda with the Python 2.7 version from Continuum Analytics (free) using the built-in install wizard. Once installed, you need to do the same thing you did with Java. Name the variable PYTHONPATH; my path was C:\Users\cconnell\Anaconda2. You will also need to add %PYTHONPATH% to the Path variable.
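After reopening the console you can sanity-check which interpreter the Path edit resolves to; a small sketch (works the same on Python 2 or 3):

```python
import sys

# Which python does the shell now launch? With the setup above this should
# point inside the Anaconda2 folder, e.g. C:\Users\cconnell\Anaconda2\python.exe
interpreter = sys.executable
major = sys.version_info.major
print(interpreter)
print("Python major version:", major)
```

If the printed path points at some other Python install, your Anaconda entry is being shadowed by an earlier entry in Path.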
Installing Spark
I am using Spark 1.6.0 (I like to stay at least one version back), downloaded from Apache Spark here. Once unzipped, do the same thing as before: set a SPARK_HOME variable to the location of your Spark folder and then add %SPARK_HOME%\bin to the Path variable.
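The same setx pattern works here; a sketch, assuming Spark was unzipped to a hypothetical C:\spark-1.6.0-bin-hadoop2.6 folder (substitute your actual location):

```shell
:: Point SPARK_HOME at the unzipped Spark folder (hypothetical path)
setx SPARK_HOME "C:\spark-1.6.0-bin-hadoop2.6"

:: Make spark-submit, pyspark, etc. available from any console
setx Path "%Path%;%SPARK_HOME%\bin"
```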
Installing Hadoop binaries
Download and unzip the Hadoop common binaries and set a HADOOP_HOME variable to that location.
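For the curious: on Windows, Spark mainly needs HADOOP_HOME so it can find winutils.exe under %HADOOP_HOME%\bin. A sketch, assuming a hypothetical C:\hadoop folder:

```shell
:: HADOOP_HOME must be the folder that CONTAINS bin\winutils.exe,
:: not the bin folder itself (hypothetical path -- use your own)
setx HADOOP_HOME "C:\hadoop"
```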
Getting everything to work together
- Go into your spark/conf folder and rename log4j.properties.template to log4j.properties
- Open log4j.properties in a text editor and change log4j.rootCategory from INFO to WARN
- Add two new environment variables like before: set PYSPARK_DRIVER_PYTHON to jupyter (edited from 'ipython' in the pic) and PYSPARK_DRIVER_PYTHON_OPTS to notebook
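The two driver variables can likewise be set from the command line (again, only consoles opened afterwards will see them):

```shell
:: Tell pyspark to start Jupyter as its driver front end, in notebook mode
setx PYSPARK_DRIVER_PYTHON "jupyter"
setx PYSPARK_DRIVER_PYTHON_OPTS "notebook"
```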
Now, to launch your PySpark notebook just type pyspark from the console and it will automatically open in your browser.
Great tutorial!!! Thanks a lot!!
Great, fantastic. Straight to the point on my Win7 SP1.
Good article. I was struggling with PySpark on windows, this one comes just in time.
Notebook does open. But “spark” command is not recognized. How can I solve it?
spark
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
in ()
----> 1 spark
NameError: name 'spark' is not defined
Thanks for your guidance. I have followed all of the above procedure and my notebook opens on typing 'pyspark', but on writing 'sc' I get empty output, and on the command prompt:
[I 16:20:44.788 NotebookApp] Accepting one-time-token-authenticated connection from ::1
[I 16:20:59.105 NotebookApp] Creating new notebook in /Untitled Folder
[I 16:21:00.055 NotebookApp] Kernel started: 8bbbba3c-932d-46f1-94cb-9c1eeb9810cc
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file C:\spark\bin\..\python\pyspark\shell.py:
Worked like a charm – thanks a lot