How To Run A Spark Job#

Introduction#

All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:

  • Python (Hopsworks Enterprise only)
  • Apache Spark

Launching a job of any type follows a very similar process; what mostly differs between job types is the set of configuration parameters each one comes with. After following this guide you will be able to create a Spark job.

UI#

Step 1: Jobs overview#

The image below shows the Jobs overview page in Hopsworks, which you access by clicking Jobs in the sidebar.

Jobs overview

Step 2: Create new job dialog#

To configure a job, click Advanced options, which will open up the advanced configuration page for the job.

Create new job dialog

Step 3: Set the jar#

The next step is to select the program to run. You can either select From project, if the file was previously uploaded to Hopsworks, or Upload new file, which lets you select a file from your local filesystem as demonstrated below. After that, set the name for the job.

Configure program

Step 4: Set the job type#

The next step is to set the job type to SPARK to indicate that it should be executed as a Spark job. Then specify additional configuration or click Create New Job to create the job.

Set the job type

Step 5: Set the main class#

The next step is to set the main class for the application. Then specify advanced configuration or click Create New Job to create the job.

Set the main class

Step 6 (optional): Advanced configuration#

Resource allocation for the Spark driver and executors can be configured, as well as the number of executors and whether dynamic allocation should be enabled.

  • Driver memory: Number of MBs to allocate for the Spark driver

  • Driver virtual cores: Number of cores to allocate for the Spark driver

  • Executor memory: Number of MBs to allocate for each Spark executor

  • Executor virtual cores: Number of cores to allocate for each Spark executor

  • Dynamic/Static: Run the Spark application in static or dynamic allocation mode (see the Spark docs for details).

Resource configuration for the Spark job
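The resource fields above have counterparts among Spark's standard configuration properties. The following is a minimal sketch of that correspondence, assuming the UI values translate to the documented Spark property names; the helper `resource_properties` and the example values are hypothetical, and Hopsworks may store these settings under different keys internally.

```python
def resource_properties(driver_memory_mb, driver_cores,
                        executor_memory_mb, executor_cores,
                        dynamic=True, num_executors=1):
    """Build standard Spark resource properties from UI-style values.

    Hypothetical helper for illustration only; not part of the Hopsworks API.
    """
    props = {
        # Memory is given in MBs, so the Spark size suffix "m" is appended.
        "spark.driver.memory": f"{driver_memory_mb}m",
        "spark.driver.cores": str(driver_cores),
        "spark.executor.memory": f"{executor_memory_mb}m",
        "spark.executor.cores": str(executor_cores),
        "spark.dynamicAllocation.enabled": str(dynamic).lower(),
    }
    if not dynamic:
        # In static mode the number of executors is fixed up front.
        props["spark.executor.instances"] = str(num_executors)
    return props


# Example values are assumptions, chosen only to show the output shape.
props = resource_properties(2048, 1, 4096, 2, dynamic=False, num_executors=3)
for key, value in props.items():
    print(f"{key}={value}")
```

In dynamic mode Spark scales the number of executors with the workload, which is why the sketch only sets a fixed executor count in static mode.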

Additional files or dependencies required for the Spark job can be configured.

  • Additional archives: Archives to be extracted into the working directory of each executor

  • Additional jars: Jars to be made available on the driver and executor classpaths

  • Additional python dependencies: Python files and archives to be added to the Python path of the application

  • Additional files: Files to be placed in the working directory of each executor

File configuration for the Spark job
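For readers familiar with `spark-submit`, the four fields above resemble its `--archives`, `--jars`, `--py-files` and `--files` options. The sketch below only illustrates that analogy; the helper `spark_submit_file_args` and the paths are hypothetical, and it is an assumption that Hopsworks passes these fields on in this way.

```python
def spark_submit_file_args(archives=(), jars=(), py_files=(), files=()):
    """Return spark-submit options for the given dependency lists.

    Hypothetical helper for illustration; each option takes a
    comma-separated list of paths.
    """
    args = []
    for flag, paths in (("--archives", archives), ("--jars", jars),
                        ("--py-files", py_files), ("--files", files)):
        if paths:  # skip options that have nothing to pass
            args += [flag, ",".join(paths)]
    return args


# Hypothetical paths, used only to show the resulting command-line options.
example_args = spark_submit_file_args(
    jars=["libs/a.jar", "libs/b.jar"],
    files=["conf/app.properties"],
)
print(" ".join(example_args))
# prints: --jars libs/a.jar,libs/b.jar --files conf/app.properties
```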

Additional properties for the Spark application can be set as line-separated entries. For example, to change the configuration variables for the Kryo serializer or to set environment variables for the driver, you can set the properties as shown below.

Additional Spark configuration
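As a rough illustration of line-separated properties, the sketch below renders a dictionary of Spark properties as one entry per line. The `key=value` format and the helper `to_property_lines` are assumptions made for this example; the property names are standard Spark settings for the Kryo serializer, and the buffer size is an arbitrary example value.

```python
def to_property_lines(properties):
    """Render a dict of Spark properties as one key=value entry per line.

    Hypothetical helper; the key=value line format is an assumption.
    """
    return "\n".join(f"{key}={value}" for key, value in properties.items())


# Standard Spark property names; the values are examples only.
kryo_properties = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryoserializer.buffer.max": "256m",
}

property_text = to_property_lines(kryo_properties)
print(property_text)
```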

Step 7: Execute the job#

Now click the Run button to start the execution of the job, and then click on Executions to see the list of all executions.

Start job execution

Step 8: Application logs#

To monitor logs while the execution is running, click Spark UI to open the Spark UI in a separate tab.

Once the execution has finished, you can click on Logs to see the full logs for the execution.

Access Spark logs

Code#

Step 1: Upload the Spark jar#

This snippet assumes the Spark program is in the current working directory and named sparkpi.jar. It uploads the jar to the Resources dataset in your project.

import hopsworks

# Connect to Hopsworks and get a handle to the project
connection = hopsworks.connection()

project = connection.get_project()

dataset_api = project.get_dataset_api()

# Upload the jar and keep the returned path for the job configuration
uploaded_file_path = dataset_api.upload("sparkpi.jar", "Resources")

Step 2: Create SPARK job#

In this snippet we get the JobsApi object, fetch the default job configuration for a SPARK job, set the application path and main class, and create the Job object.

jobs_api = project.get_jobs_api()

spark_config = jobs_api.get_configuration("SPARK")

spark_config['appPath'] = uploaded_file_path
spark_config['mainClass'] = 'org.apache.spark.examples.SparkPi'

job = jobs_api.create_job("pyspark_job", spark_config)
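If you create several jobs from the same defaults, it can help to leave the default configuration untouched and derive a copy per job. The sketch below shows one way to do that, assuming the configuration is a plain dictionary as in the snippet above; `build_spark_job_config` is a hypothetical helper, and the stand-in dictionary and path replace what `jobs_api.get_configuration("SPARK")` and the upload step would return.

```python
import copy


def build_spark_job_config(default_config, app_path, main_class):
    """Return a copy of the default config with the program and main class set.

    Hypothetical helper; only the 'appPath' and 'mainClass' keys from the
    guide are touched.
    """
    config = copy.deepcopy(default_config)  # leave the defaults unmodified
    config["appPath"] = app_path
    config["mainClass"] = main_class
    return config


# Stand-in for jobs_api.get_configuration("SPARK"); the real dictionary
# contains more keys than shown here.
default_config = {"appPath": None, "mainClass": None}

spark_config = build_spark_job_config(
    default_config,
    app_path="Resources/sparkpi.jar",  # hypothetical uploaded path
    main_class="org.apache.spark.examples.SparkPi",
)
print(spark_config)
```

The resulting dictionary can then be passed to `jobs_api.create_job` in the same way as in the snippet above.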

Step 3: Execute the job#

In this snippet we execute the job synchronously, that is, we wait until it reaches a terminal state, and then download and print the logs.

execution = job.run(await_termination=True)

out, err = execution.download_logs()

# Print stdout and stderr logs; the with-blocks close the files afterwards
with open(out, "r") as f_out:
    print(f_out.read())

with open(err, "r") as f_err:
    print(f_err.read())
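Spark logs can be long, so printing the whole file is not always practical. The sketch below shows one way to print only the last lines of a downloaded log file; `tail_lines` is a hypothetical helper, and a temporary file stands in for the paths returned by `execution.download_logs()`.

```python
import tempfile
from collections import deque


def tail_lines(path, n=50):
    """Return the last n lines of a text file without loading it all at once."""
    with open(path, "r", errors="replace") as log_file:
        # A bounded deque keeps only the most recent n lines while iterating.
        return list(deque(log_file, maxlen=n))


# Stand-in for a downloaded log file, created only for this example.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("".join(f"line {i}\n" for i in range(100)))

print("".join(tail_lines(tmp.name, n=3)), end="")
```

With the real paths, the same helper could be called as `tail_lines(out)` or `tail_lines(err)`.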

API Reference#

Jobs

Executions

Conclusion#

In this guide you learned how to create and run a Spark job.