hopsworks.spark

SparkSession builder for Hopsworks.

In Spark Connect mode, Hopsworks reads and writes offline feature groups through Delta Lake. (The terminal-spark image's bashrc sets SPARK_REMOTE, and PySpark also sets SPARK_CONNECT_MODE_ENABLED once a Connect session is created.) A bare SparkSession.builder neither loads the Delta extensions nor rewires the default catalog to DeltaCatalog, so every fg.read() / fg.insert(df) then silently misbehaves. build_spark returns a Connect session with both configs pre-applied, which rules out this class of failure by construction.

For non-Connect runs (spark-submit, classic spark-on-yarn, an external cluster) the cluster's spark-defaults.conf is the source of truth and we do not override its session config from here. The Delta defaults are applied only in Spark Connect mode.

build_spark

build_spark(
    app_name: str = "hopsworks",
    extra_configs: dict[str, str] | None = None,
) -> SparkSession

Return a SparkSession configured for the Hopsworks runtime.

In Spark Connect mode the returned session has Delta Lake's SQL extension and DeltaCatalog wired in, so every Hopsworks offline-feature-group read or write works without further user setup. extra_configs is layered on top, so callers can still add their own configs (e.g. spark.sql.shuffle.partitions).
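The layering described above can be sketched as a plain dictionary merge. This is a minimal illustration, not the library's implementation: `DELTA_DEFAULTS` and `resolve_configs` are hypothetical names, and the two config keys are the standard Delta Lake session settings, assumed here to be the ones build_spark applies.

```python
# Standard Delta Lake session settings; assumed to match what build_spark
# applies in Spark Connect mode (the source does not list the exact keys).
DELTA_DEFAULTS = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def resolve_configs(connect_mode, extra_configs=None):
    """Hypothetical helper: return the configs a session would be built with.

    Delta defaults are included only in Spark Connect mode; extra_configs
    is applied last, so caller-supplied values win on any key collision.
    """
    configs = dict(DELTA_DEFAULTS) if connect_mode else {}
    configs.update(extra_configs or {})
    return configs


connect = resolve_configs(True, {"spark.sql.shuffle.partitions": "64"})
classic = resolve_configs(False, {"spark.sql.shuffle.partitions": "64"})
```

In this sketch `connect` carries both Delta settings plus the shuffle setting, while `classic` carries only the shuffle setting, which mirrors the rule that the Delta defaults apply only in Spark Connect mode.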

Outside Spark Connect, the helper does not inject the Delta defaults — classic Spark deployments expect their cluster's spark-defaults.conf to be authoritative, and overriding it from here can mask real misconfigurations. Only app_name and extra_configs flow through.

Spark Connect detection delegates to hopsworks_common.spark_connect_utils._is_spark_connect_env, which checks SPARK_CONNECT_MODE_ENABLED and falls back to pyspark.sql.utils.is_remote(). The fallback also covers the SPARK_REMOTE env var that the terminal-spark image sets. The helper never calls .remote(...) itself, because a hard-coded URI would break as soon as the script runs anywhere other than that one terminal pod.
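As a rough sketch of the detection order described above, the function below checks the two environment variables directly. It is an approximation, not the real `_is_spark_connect_env`: the name `looks_like_spark_connect` is hypothetical, the check for mere presence of each variable is an assumption, and the `pyspark.sql.utils.is_remote()` fallback is replaced by a direct SPARK_REMOTE lookup so the example runs without PySpark installed.

```python
import os


def looks_like_spark_connect(env=None):
    """Hypothetical approximation of the Spark Connect environment check.

    env defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    # First signal: set by PySpark once a Connect session has been created.
    if "SPARK_CONNECT_MODE_ENABLED" in env:
        return True
    # Stand-in for the is_remote() fallback: the terminal-spark image
    # sets SPARK_REMOTE in its bashrc.
    return "SPARK_REMOTE" in env


in_terminal_pod = looks_like_spark_connect({"SPARK_REMOTE": "sc://example:15002"})
on_classic_cluster = looks_like_spark_connect({})
```

The `sc://example:15002` value is a placeholder URI for illustration only. Because detection reads the environment rather than a hard-coded address, the same script can run in the terminal pod and on a classic cluster without modification.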

PARAMETER	DESCRIPTION
`app_name`	Spark application name; surfaces in the Spark UI. TYPE: `str` DEFAULT: `'hopsworks'`
`extra_configs`	Additional Spark configs as `spark.<key>` to `<value>` pairs, layered on top of the defaults. TYPE: `dict[str, str] \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`SparkSession`	A configured `SparkSession`.

RAISES	DESCRIPTION
`ImportError`	If PySpark is not installed. The terminal-spark image ships PySpark; outside it, install `pyspark[connect]`.

Example

from hopsworks import build_spark

spark = build_spark("my_pipeline")
df = spark.read.format("delta").load("/hopsfs/.../my_table")