How To Run A PySpark Notebook#

Introduction#

Jupyter is provided as a service in Hopsworks, with the same user experience and features as when it runs on your laptop.

Supports JupyterLab and the classic Jupyter front-end
Configured with Python3, PySpark and Ray kernels

Step 1: Jupyter dashboard#

The image below shows the Jupyter service page in Hopsworks, which is accessed by clicking Jupyter in the sidebar.

From this page, you can configure the options and settings that Jupyter starts with, as described in the sections below.

Step 2: A Spark environment must be configured#

The PySpark kernel is only available if Jupyter is configured to use the spark-feature-pipeline environment or an environment cloned from it. The green ticks indicate which kernels are available in each environment.

Select an environment with PySpark kernel enabled

Step 3 (Optional): Configure Spark properties#

The next step is to configure the Spark properties to be used in Jupyter. Click edit configuration to get to the configuration page and select Spark.

Resource and compute#

Resource allocation for the Spark driver and executors can be configured, as well as the number of executors and whether dynamic allocation should be enabled.

Driver memory: Number of MBs to allocate for the Spark driver
Driver virtual cores: Number of cores to allocate for the Spark driver
Executor memory: Number of MBs to allocate for each Spark executor
Executor virtual cores: Number of cores to allocate for each Spark executor
Dynamic/Static: Run the Spark application in static or dynamic allocation mode (see the Spark documentation for details).

Resource configuration for the Spark kernels
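Once the notebook is running, it can be useful to check which resource settings the session actually received. The following is a minimal sketch, not an official Hopsworks API: it assumes the form fields above map to the standard Spark properties listed in `RESOURCE_KEYS` (that mapping is an assumption), and that the PySpark kernel provides a session named `spark`. The helper `summarize_resources` is a hypothetical name introduced for illustration.

```python
# Standard Spark property names. Mapping them to the Hopsworks form
# fields is an assumption made for this sketch.
RESOURCE_KEYS = {
    "Driver memory": "spark.driver.memory",
    "Driver virtual cores": "spark.driver.cores",
    "Executor memory": "spark.executor.memory",
    "Executor virtual cores": "spark.executor.cores",
    "Dynamic allocation": "spark.dynamicAllocation.enabled",
}


def summarize_resources(get_property):
    """Return {label: value} using a getter with the signature get(key, default)."""
    return {
        label: get_property(key, "not set")
        for label, key in RESOURCE_KEYS.items()
    }


# The PySpark kernel is expected to provide `spark`; the guard keeps
# this cell harmless when it is run anywhere else.
if "spark" in globals():
    conf = spark.sparkContext.getConf()
    for label, value in summarize_resources(conf.get).items():
        print(f"{label}: {value}")
```

Properties that were never set explicitly are reported as "not set", which usually means Spark is using its built-in default for that value.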

Attach files or dependencies#

Additional files or dependencies required for the Spark job can be configured.

Additional archives: List of .zip or .tgz files that will be locally accessible by the application
Additional jars: List of .jar files to add to the CLASSPATH of the application
Additional python dependencies: List of .py, .zip or .egg files that will be locally accessible by the application
Additional files: List of files that will be locally accessible by the application

File configuration for the Spark kernels

Additional Spark properties can be set for the Spark application as line-separated properties. For example, you can use them to change the configuration variables for the Kryo serializer or to set environment variables for the driver.
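To confirm from inside a notebook that such properties took effect, the entered text can be compared with the running session. This is a hedged sketch: it assumes each line is written as `key=value` (the exact separator accepted by the form is an assumption), it uses the standard Spark properties `spark.serializer` and `spark.kryoserializer.buffer.max` with purely illustrative values, and `parse_properties` is a hypothetical helper.

```python
def parse_properties(text):
    """Parse line-separated `key=value` entries, skipping blank lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {line!r}")
        props[key.strip()] = value.strip()
    return props


# Illustrative values only; the key=value line format is an assumption.
EXAMPLE = """
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=128m
"""

expected = parse_properties(EXAMPLE)

# The PySpark kernel is expected to provide `spark`.
if "spark" in globals():
    conf = spark.sparkContext.getConf()
    for key, value in expected.items():
        actual = conf.get(key, "not set")
        print(f"{key}: expected {value}, session has {actual}")
```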

Click Save to save the new configuration.

Step 4 (Optional): Configure root folder and automatic shutdown#

Before starting the server there are two additional configurations that can be set next to the Run Jupyter button.

The runtime of the Jupyter instance can be configured. This is useful to ensure that idle instances do not keep running and holding on to resources. If a limited runtime is not desirable, it can be disabled by setting no limit.

The root path from which to start the Jupyter instance can be configured. By default, the /Jupyter folder is used as the root.

Step 5: (Kueue enabled) Select a Queue#

Kueue is currently not supported for Spark, so you do not need to select a queue to run the notebook in.

Step 6: Start Jupyter#

Start the Jupyter instance by clicking the Run Jupyter button.

Starting Jupyter and running a Spark notebook

Step 7: Access Spark UI#

Navigate back to Hopsworks, where a Spark session will have appeared. Click the Spark UI button to open the Spark UI.

Access Spark UI and see application logs
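If several sessions are listed, the application ID printed from the notebook can help identify the right one. The sketch below assumes the kernel provides `spark`. `applicationId`, `appName` and `uiWebUrl` are standard `SparkContext` attributes, while `describe_session` is a hypothetical helper. The URL reported by Spark may not be reachable from your browser, so the Spark UI button in Hopsworks remains the way to open the UI.

```python
def describe_session(sc):
    """Collect identifying details from a SparkContext-like object."""
    return {
        "application_id": sc.applicationId,
        "app_name": sc.appName,
        "ui_url": sc.uiWebUrl,
    }


# The PySpark kernel is expected to provide `spark`.
if "spark" in globals():
    for name, value in describe_session(spark.sparkContext).items():
        print(f"{name}: {value}")
```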

Accessing project data#

Read directly from the filesystem (recommended)#

To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named data.csv located in the Resources dataset of a project called my_project:

df = spark.read.csv("/Projects/my_project/Resources/data.csv", header=True, inferSchema=True)
df.show()
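The path in the example follows the pattern /Projects/<project>/<dataset>/<file>. A small helper can build such paths consistently. This is a sketch under that assumption: `project_path` is a hypothetical helper, not part of Hopsworks, and the project, dataset and file names are the ones from the example above.

```python
import posixpath


def project_path(project, dataset, *parts):
    """Build a filesystem path of the form /Projects/<project>/<dataset>/..."""
    return posixpath.join("/Projects", project, dataset, *parts)


csv_path = project_path("my_project", "Resources", "data.csv")
print(csv_path)

# The PySpark kernel is expected to provide `spark`.
if "spark" in globals():
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.printSchema()
```

Pass path components without a leading slash, because `posixpath.join` discards everything before a component that starts with "/".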

Additional files#

Different files can be attached to the Jupyter session and are made available in the /srv/hops/artifacts folder when the PySpark kernel is started. This configuration is mainly useful when you need to add additional configuration, such as jars that need to be added to the CLASSPATH.

When reading data in your Spark application, it is recommended to use the Spark read API as previously demonstrated, since it reads from the filesystem directly. The Additional files configuration options download each file in its entirety, which does not scale to large datasets.
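To check which attached files are present, the folder can be listed from a notebook cell. This is a minimal sketch using only the Python standard library. `list_artifacts` is a hypothetical helper, and it returns an empty list when the folder does not exist, for example when the cell runs outside a Hopsworks session.

```python
from pathlib import Path

# Folder named in the text above.
ARTIFACTS_DIR = "/srv/hops/artifacts"


def list_artifacts(directory=ARTIFACTS_DIR):
    """Return the sorted names of entries in the artifacts folder."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())


print(list_artifacts())
```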

Going Further#

You can learn how to install a library so that it can be used in a notebook.