Upgrading Spark Standalone Cluster in WSL2
The Problem
Since 2019 I’ve been using PySpark 2.4.4 for my local big data development, and in the meantime my system’s Python has migrated from 3.6 to 3.8. This version drift caused me some headaches when the following error was thrown during the initialization of the SparkSession
object from the pyspark.sql module in my code:
TypeError: an integer is required (got type bytes)
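For context, the code that triggered it was nothing exotic; a minimal reproduction (the master URL and app name are just illustrative) is simply standing up a local session:
python3 - <<'EOF'
from pyspark.sql import SparkSession

# On PySpark 2.4.4 under Python 3.8 this fails with:
#   TypeError: an integer is required (got type bytes)
spark = SparkSession.builder.master("local[*]").appName("repro").getOrCreate()
print(spark.version)
spark.stop()
EOF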
The error occurs because PySpark 2.4.4 does not support Python 3.8. Most of the recommendations I’ve found on the internet suggest working around the issue either by downgrading to Python 3.7 or by upgrading PySpark to a later version, for example by running pip3 install --upgrade pyspark.
However, I am running a Spark standalone cluster locally, installed the “installing from source” way, so the above command did nothing to my PySpark installation: the version stayed at 2.4.4. More steps are needed.
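To see why the pip route has no effect here, it helps to compare what pip manages with what the shell actually launches; a quick check might look like this (output will obviously differ per machine):
$ pip3 show pyspark      # the pip-managed package, if one is installed at all
$ which pyspark          # the launcher the shell actually picks up
$ pyspark --version      # the version of the standalone installation it points to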
Furthermore, my WSL2 setup is a spaghetti maze of binaries and distributions. Some of the challenges:
- Some binaries were installed by brew, while others came in via sudo apt-get
- I forgot how I installed PySpark and Apache Spark in the first place
Unsurprisingly, running brew uninstall pyspark
and apt-get remove apache-spark
did nothing. Starting from a clean slate is not as simple as uninstall-and-reinstall.
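What did help untangle the maze was asking each package manager directly what it owns. A rough check (a sketch, assuming a Debian-based distro with dpkg plus Homebrew on Linux):
$ BIN="$(readlink -f "$(which spark-shell)")"     # resolve symlinks to the real file
$ dpkg -S "$BIN" 2>/dev/null || echo "not owned by apt/dpkg"
$ brew list --versions | grep -i spark || echo "no spark formula installed via brew"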
The Solution
First, I need to locate the installation paths of both pyspark and spark-shell:
$ which pyspark && which spark-shell
/home/linuxbrew/.linuxbrew/bin/pyspark
/home/linuxbrew/.linuxbrew/bin/spark-shell
Those files are actually bash scripts that brew uses to launch the application. Looking at the code of /home/linuxbrew/.linuxbrew/bin/pyspark
we can see the following PYTHONPATH
definition:
# Add the PySpark classes to the Python path:
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.2-src.zip:$PYTHONPATH"
Next, I need to check which path $SPARK_HOME
resolves to:
$ echo $SPARK_HOME
/opt/spark
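That variable isn’t something Spark sets for you; in my case it comes from the shell profile. A minimal sketch of how it is typically defined (an assumption about my original setup, the exact file may vary):
# somewhere in ~/.bashrc or ~/.profile
export SPARK_HOME=/opt/spark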
I went to /opt
and voila! That’s where Spark was installed. To “upgrade” the Spark version, the next step is to delete the folder containing all the Spark binaries and replace it with the newer release (make sure the Hadoop version matches your installation - I was still using 2.7
at the time of writing):
cd /opt
sudo rm -rf /opt/spark
sudo wget https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz
sudo tar xvzf spark-3.2.1-bin-hadoop2.7.tgz
sudo mv spark-3.2.1-bin-hadoop2.7 spark
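As a side note, instead of renaming the extracted folder, an alternative is to keep versioned directories and point a symlink at the active one; rolling back then becomes a one-liner. A sketch (not what I actually ran):
cd /opt
sudo tar xvzf spark-3.2.1-bin-hadoop2.7.tgz
sudo ln -sfn /opt/spark-3.2.1-bin-hadoop2.7 /opt/spark   # $SPARK_HOME keeps pointing at /opt/spark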
After the above steps have been done, check your PySpark version:
$ pyspark --version
22/03/26 15:51:40 WARN Utils: Your hostname, PMIIDIDNL13144 resolves to a loopback address: 127.0.1.1; using 172.17.50.37 instead (on interface eth0)
22/03/26 15:51:40 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 1.8.0_292
Branch HEAD
Compiled by user hgao on 2022-01-20T20:15:47Z
Revision 4f25b3f71238a00508a356591553f2dfa89f8290
Url https://github.com/apache/spark
Type --help for more information.
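Version 3.2.1, as expected. As a final sanity check, a trivial job can be pushed through the new installation; a minimal sketch (the script path and DataFrame contents are just illustrative):
cat > /tmp/smoke_test.py <<'EOF'
from pyspark.sql import SparkSession

# The kind of code that used to fail on PySpark 2.4.4 + Python 3.8
spark = SparkSession.builder.appName("smoke-test").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
spark.stop()
EOF
spark-submit --master 'local[*]' /tmp/smoke_test.py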
Takeaways
- Always be consistent with your package management. Stick to brew or apt-get but don’t mix both.
- Try to use venv to manage Python packages for different projects. Admittedly, this is advice I’ve heard many times, but at this point it feels too late to retrofit into my current setup. A sketch of what it could look like is below.
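For completeness, a minimal sketch of the venv approach (the project directory and version pin are just examples):
cd ~/projects/my-spark-project          # hypothetical project directory
python3 -m venv .venv
source .venv/bin/activate
pip install pyspark==3.2.1              # pinned per project instead of system-wide
python -c "import pyspark; print(pyspark.__version__)"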