hsfs.storage_connector #

StorageConnector #

Bases: ABC

description `property` #

description: str | None

User provided description of the storage connector.

id `property` #

id: int | None

Id of the storage connector uniquely identifying it in the Feature store.

name `property` #

name: str

Name of the storage connector.

type `property` #

type: str | None

Type of the connector as a string, e.g. "HOPSFS", "S3", "ADLS", "REDSHIFT", "JDBC" or "SNOWFLAKE".

connector_options #

connector_options() -> dict[str, Any]

Return prepared options to be passed to an external connector library.

Not implemented for this connector type.

RETURNS	DESCRIPTION
`dict[str, Any]`	An empty dictionary.

get_data #

get_data(
    data_source: ds.DataSource, use_cached=True
) -> DataSourceData

Retrieve the data from the data source.

Example

# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("conn_name")

tables = sc.get_tables("database_name")

data = sc.get_data(tables[0])

PARAMETER	DESCRIPTION
`data_source`	The data source to retrieve data from. TYPE: `ds.DataSource`
`use_cached`	Whether to use cached data if available. Only supported for CRM, Google Sheets, and REST connectors. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`DataSourceData`	An object containing the data retrieved from the data source.

get_data_batch #

get_data_batch(
    data_sources: list[ds.DataSource], use_cached=True
) -> dict[str, DataSourceData]

Retrieve the data of several data sources with a single schema-fetch job.

Only supported for CRM, Google Sheets, and REST connectors. The backend starts ONE job that fetches the schemas of all resources sequentially in one container, instead of one job per resource. This call blocks until every resource has been fetched.

Example

# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("conn_name")

tables = sc.get_tables()

data_by_resource = sc.get_data_batch(tables[:3])

PARAMETER	DESCRIPTION
`data_sources`	The data sources to retrieve data for; each needs a table name in `data_source.table`. TYPE: `list[ds.DataSource]`
`use_cached`	Whether to use cached data if available. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`dict[str, DataSourceData]`	A dictionary mapping each resource name to the data retrieved for it.

RAISES	DESCRIPTION
`hopsworks.client.exceptions.DataSourceException`	If the schema fetch failed for one or more resources.
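Because the call blocks until every resource has been fetched, it may be convenient to split a long list into smaller batches. The following is a minimal sketch that relies only on the documented `get_data_batch(data_sources, use_cached=...)` signature; `fetch_in_chunks` and `chunk_size` are hypothetical names, not part of hsfs, and `connector` stands for a connector obtained as in the example above.

```python
from typing import Any


def fetch_in_chunks(
    connector: Any,
    data_sources: list,
    chunk_size: int = 3,
    use_cached: bool = True,
) -> dict:
    """Call get_data_batch on consecutive slices and merge the results.

    Hypothetical helper: only get_data_batch itself is a documented call.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    merged: dict = {}
    for start in range(0, len(data_sources), chunk_size):
        chunk = data_sources[start : start + chunk_size]
        # Each call blocks until every resource in the chunk has been fetched.
        merged.update(connector.get_data_batch(chunk, use_cached=use_cached))
    return merged
```

Note the trade-off: each chunk starts its own backend job, so a smaller `chunk_size` shortens each blocking call but gives up part of the single-job benefit.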

get_databases #

get_databases() -> list[str]

Retrieve the list of available databases.

Example

# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("conn_name")

databases = sc.get_databases()

RETURNS	DESCRIPTION
`list[str]`	A list of database names available in the storage connector.

get_feature_groups #

get_feature_groups() -> list[FeatureGroup]

Get the feature groups using this storage connector, based on explicit provenance.

Only accessible feature groups are returned. To also retrieve inaccessible ones, use the base method get_feature_groups_provenance.

RETURNS	DESCRIPTION
`list[FeatureGroup]`	List of feature groups.

get_feature_groups_provenance #

get_feature_groups_provenance() -> Links | None

Get the generated feature groups using this storage connector, based on explicit provenance.

These feature groups can be accessible or inaccessible.

Explicit provenance does not track deleted generated feature group links, so `deleted` will always be empty. For inaccessible feature groups, only minimal information is returned.

RETURNS	DESCRIPTION
`Links \| None`	The feature groups generated using this storage connector or `None` if none were created.

RAISES	DESCRIPTION
`hopsworks.client.exceptions.RestAPIError`	In case the backend encounters an issue.

get_metadata #

get_metadata(data_source: ds.DataSource) -> dict

Retrieve metadata information about the data source.

Example

# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("conn_name")

tables = sc.get_tables("database_name")

metadata = sc.get_metadata(tables[0])

PARAMETER	DESCRIPTION
`data_source`	The data source to retrieve metadata from. TYPE: `ds.DataSource`

RETURNS	DESCRIPTION
`dict`	A dictionary containing metadata about the data source.

get_tables #

get_tables(
    database: str | None = None,
) -> list[ds.DataSource]

Retrieve the list of tables from the specified database.

Example

# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("conn_name")

tables = sc.get_tables("database_name")

PARAMETER	DESCRIPTION
`database`	The name of the database to list tables from. If not provided, the default database is used. Not required for Google Sheets connectors — sheet names are fetched from the connector's spreadsheet. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[ds.DataSource]`	A list of DataSource objects representing the tables. For Google Sheets connectors, each entry represents a sheet name.
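`get_databases` and `get_tables` can be combined to enumerate everything a connector exposes. A minimal sketch, assuming only those two documented calls; `list_all_tables` is a hypothetical helper and `connector` stands for a connector obtained as in the example above.

```python
from typing import Any


def list_all_tables(connector: Any) -> dict:
    """Map each database name to the data sources listed for it.

    Hypothetical helper built on get_databases() and get_tables(database).
    """
    return {
        database: connector.get_tables(database)
        for database in connector.get_databases()
    }
```

For Google Sheets connectors no database is needed, so a plain `get_tables()` call is likely the better fit there.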

get_training_datasets #

get_training_datasets() -> list[TrainingDataset]

Get the training datasets using this storage connector, based on explicit provenance.

Only accessible training datasets are returned. To also retrieve inaccessible ones, use the base method get_training_datasets_provenance.

RETURNS	DESCRIPTION
`list[TrainingDataset]`	List of training datasets.

get_training_datasets_provenance #

get_training_datasets_provenance() -> Links | None

Get the generated training datasets using this storage connector, based on explicit provenance.

These training datasets can be accessible or inaccessible. Explicit provenance does not track deleted generated training dataset links, so `deleted` will always be empty. For inaccessible training datasets, only minimal information is returned.

RETURNS	DESCRIPTION
`Links \| None`	The training datasets generated using this storage connector or `None` if none were created.

RAISES	DESCRIPTION
`hopsworks.client.exceptions.RestAPIError`	In case the backend encounters an issue.

infer_metadata #

infer_metadata(
    data_source: ds.DataSource,
    preview_data: DataSourceData | None = None,
) -> InferredMetadata

Use platform intelligence to infer feature metadata for a data source table.

Calls the same backend used by the "Infer metadata" button in the UI when creating an external feature group: an LLM proposes per-column renames, Hopsworks types, descriptions, and a suggested primary key and event time.

Example

fs = ...

sc = fs.get_storage_connector("conn_name")

tables = sc.get_tables("database_name")

inferred = sc.infer_metadata(tables[0])

PARAMETER	DESCRIPTION
`data_source`	The data source (typically a table returned by `get_tables`) to infer metadata for. TYPE: `ds.DataSource`
`preview_data`	Pre-fetched preview data to skip a server round-trip; if `None`, a preview is fetched via `get_data`. TYPE: `DataSourceData \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`InferredMetadata`	An object containing the suggested feature renames, types, descriptions, primary key, and event time.

RAISES	DESCRIPTION
`hopsworks.client.exceptions.PlatformIntelligenceException`	If platform intelligence is not enabled on the cluster, or the LLM call fails.
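When the preview is needed anyway, it can be fetched once and handed to `infer_metadata` through `preview_data`. The sketch below assumes only the documented `get_data` and `infer_metadata` signatures; `preview_and_infer` is a hypothetical helper name.

```python
from typing import Any


def preview_and_infer(connector: Any, data_source: Any) -> tuple:
    """Fetch a preview once and reuse it for metadata inference.

    Hypothetical helper: returns the preview together with the inferred
    metadata so the caller can inspect both.
    """
    preview = connector.get_data(data_source)
    # Passing preview_data means infer_metadata does not fetch a preview itself.
    inferred = connector.infer_metadata(data_source, preview_data=preview)
    return preview, inferred
```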

prepare_spark #

prepare_spark(path: str | None = None) -> str | None

Prepare Spark to use this Storage Connector.

PARAMETER	DESCRIPTION
`path`	Path to prepare for reading from cloud storage. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`str \| None`	The path to be used for reading from Spark, which may be different from the input path if the connector has a base path configured.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a query or a path into a dataframe using the storage connector.

Note, paths are only supported for object stores like S3, HopsFS and ADLS, while queries are meant for JDBC or databases like Redshift and Snowflake.

PARAMETER	DESCRIPTION
`query`	By default, the storage connector reads the table configured together with the connector, if any. To override this, pass a SQL query here. TYPE: `str \| None` DEFAULT: `None`
`data_format`	When reading from object stores such as S3, HopsFS and ADLS, specify the file format to be read, e.g., `csv`, `parquet`. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the connector. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Path to be read from within the bucket of the storage connector. Not relevant for JDBC or database based connectors such as Snowflake, JDBC or Redshift. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Defaults to "default", which maps to Spark dataframe for the Spark Engine and Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	The read dataframe.
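The mapping of `"default"` to an engine-specific dataframe can be pictured with the following sketch. It only illustrates the rule described for `dataframe_type` and is not the library's implementation; `resolve_dataframe_type` and the `engine` labels `"spark"` and `"python"` are hypothetical names. Whether every type is available on every engine is not stated here, so the sketch does not check that.

```python
ALLOWED_TYPES = ("default", "spark", "pandas", "polars", "numpy", "python")


def resolve_dataframe_type(dataframe_type: str, engine: str) -> str:
    """Resolve "default" to the dataframe type of the given engine.

    `engine` is a hypothetical label: "spark" or "python".
    """
    if dataframe_type not in ALLOWED_TYPES:
        raise ValueError(f"Unsupported dataframe_type: {dataframe_type!r}")
    if engine not in ("spark", "python"):
        raise ValueError(f"Unknown engine: {engine!r}")
    if dataframe_type != "default":
        # An explicit choice is returned unchanged.
        return dataframe_type
    return "spark" if engine == "spark" else "pandas"
```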

save #

save() -> StorageConnector

Persist this storage connector to the feature store.

Example

import hopsworks
import hsfs.storage_connector

project = hopsworks.login()
fs = project.get_feature_store()

sc = hsfs.storage_connector.S3Connector(
    id=None,
    name="my_s3_connector",
    featurestore_id=fs.id,
    bucket="my-bucket",
    region="eu-north-1",
)
sc.save()

RETURNS	DESCRIPTION
`StorageConnector`	The saved storage connector with its assigned id.

spark_options `abstractmethod` #

spark_options() -> dict[str, Any]

Return prepared options to be passed to Spark, based on the additional arguments.

RETURNS	DESCRIPTION
`dict[str, Any]`	A dictionary containing the configuration options for Spark.

update #

update() -> StorageConnector

Update this storage connector in the feature store.

Example

import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

sc = fs.get_data_source("my_s3_connector").storage_connector
sc._bucket = "new-bucket"
sc.update()

RETURNS	DESCRIPTION
`StorageConnector`	The updated storage connector.

AdlsConnector #

Bases: StorageConnector

account_name `property` #

account_name: str | None

Account name of the ADLS storage connector.

application_id `property` #

application_id: str | None

Application ID of the ADLS storage connector.

container_name `property` #

container_name: str | None

Container name of the ADLS storage connector.

directory_id `property` #

directory_id: str | None

Directory ID of the ADLS storage connector.

generation `property` #

generation: str | None

Generation of the ADLS storage connector.

path `property` #

path: str | None

The path of the connector, if the connector refers to a path (as ADLS connectors do).

service_credential `property` #

service_credential: str | None

Service credential of the ADLS storage connector.

prepare_spark #

prepare_spark(path: str | None = None) -> str | None

Prepare Spark to use this Storage Connector.

conn.prepare_spark()

spark.read.format("json").load("abfss://[container-name]@[account_name].dfs.core.windows.net/[path]")

# or
spark.read.format("json").load(conn.prepare_spark("abfss://[container-name]@[account_name].dfs.core.windows.net/[path]"))

PARAMETER	DESCRIPTION
`path`	Path to prepare for reading from cloud storage. TYPE: `str \| None` DEFAULT: `None`

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a path into a dataframe using the storage connector.

PARAMETER	DESCRIPTION
`query`	Not relevant for ADLS connectors. TYPE: `str \| None` DEFAULT: `None`
`data_format`	The file format of the files to be read, e.g. `csv`, `parquet`. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the ADLS connector. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Path within the container to be read. For example, `path="path"` reads from the container specified on the connector by constructing the URI `abfss://[container-name]@[account_name].dfs.core.windows.net/[path]`. If no path is specified, the default container path of the connector is used. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Possible values are `"default"`, `"spark"`, `"pandas"`, `"polars"`, `"numpy"` or `"python"`. Defaults to `"default"`, which maps to a Spark dataframe for the Spark engine and a Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	`DataFrame`.
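The URI pattern given for the `path` parameter can be reproduced with a small helper. This is a sketch of the documented pattern only, not the connector's own code; `build_abfss_uri` is a hypothetical name, and stripping a leading slash as well as returning the bare container URI for an empty path are assumptions.

```python
def build_abfss_uri(container_name: str, account_name: str, path: str = "") -> str:
    """Build abfss://[container-name]@[account_name].dfs.core.windows.net/[path]."""
    base = f"abfss://{container_name}@{account_name}.dfs.core.windows.net"
    # Assumption: a leading slash on the relative path is dropped.
    relative = path.lstrip("/")
    return f"{base}/{relative}" if relative else base
```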

BigQueryConnector #

Bases: StorageConnector

The BigQuery storage connector provides integration to Google Cloud BigQuery.

You can use it to run BigQuery queries on your GCP cluster and load the results into a Spark dataframe by calling the read API.

Authentication to GCP is handled by uploading the JSON keyfile for a service account to the Hopsworks Project. For more information on service accounts and creating a keyfile in GCP, read the Google Cloud documentation.

The storage connector uses the Google spark-bigquery-connector behind the scenes. To read more about the spark connector, like the spark options or usage, check Apache Spark SQL connector for Google BigQuery.

arguments `property` #

arguments: dict[str, Any]

Additional spark options.

dataset `property` #

dataset: str | None

BigQuery dataset (The dataset containing the table).

key_path `property` #

key_path: str | None

JSON keyfile for service account.

materialization_dataset `property` #

materialization_dataset: str | None

BigQuery materialization dataset (The dataset where the materialized view is going to be created, used in case of query).

parent_project `property` #

parent_project: str | None

BigQuery parent project (Google Cloud Project ID of the table to bill for the export).

query_project `property` #

query_project: str | None

BigQuery project (The Google Cloud Project ID of the table).

query_table `property` #

query_table: str | None

BigQuery table name.

connector_options #

connector_options() -> dict[str, Any]

Return options to be passed to an external BigQuery connector library.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads results from BigQuery into a spark dataframe using the storage connector.

Reading from BigQuery is done by specifying either a BigQuery table or a BigQuery query. For example, to read from a BigQuery table, set the BigQuery project, dataset and table on the storage connector and read directly from the corresponding path.

conn.read()

Or, to read the results of a BigQuery query, set the materialization dataset on the storage connector and pass your SQL to the query argument.

conn.read(query='SQL')

If the query argument is passed, it takes priority at runtime even if the table options were also set on the storage connector. This allows the same connector to read from either a query or a table, assuming all fields were set. If the table options were not set when creating the connector, you can also pass a BigQuery table path as the path argument at runtime.

conn.read(path='project.dataset.table')

PARAMETER	DESCRIPTION
`query`	BigQuery query. TYPE: `str \| None` DEFAULT: `None`
`data_format`	Spark data format. TYPE: `str \| None` DEFAULT: `None`
`options`	Spark options. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	BigQuery table path. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Possible values are `"default"`, `"spark"`, `"pandas"`, `"polars"`, `"numpy"` or `"python"`. Defaults to `"default"`, which maps to a Spark dataframe for the Spark engine and a Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RAISES	DESCRIPTION
`ValueError`	Malformed arguments.

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	A Spark dataframe.
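The precedence between the query argument, the table options on the connector, and the path argument can be summarised in a short sketch. It is an illustration of the rules described above, not the connector's implementation. `choose_bigquery_source` and `connector_table` are hypothetical names. Two points are assumptions: that the connector table wins when both it and `path` are given (the text only covers the case where table options were not set), and that a `ValueError` is raised when nothing is provided.

```python
from typing import Optional


def choose_bigquery_source(
    query: Optional[str] = None,
    path: Optional[str] = None,
    connector_table: Optional[str] = None,
) -> tuple:
    """Return ("query", sql) or ("table", "project.dataset.table")."""
    if query:
        # A query takes priority over any table options.
        return ("query", query)
    if connector_table:
        # Assumption: table options set on the connector win over `path`.
        return ("table", connector_table)
    if path:
        return ("table", path)
    # Assumption: nothing to read from is treated as malformed arguments.
    raise ValueError("Provide a query, a table path, or table options.")
```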

GcsConnector #

Bases: StorageConnector

This storage connector provides integration to Google Cloud Storage (GCS).

Once you create a connector in the FeatureStore, you can read data from a GCS bucket into a Spark dataframe by calling the read API.

Authentication to GCP is handled by uploading the JSON keyfile for a service account to the Hopsworks Project. For more information on service accounts and creating a keyfile in GCP, read the Google Cloud documentation.

The connector also supports Google's optional Customer Supplied Encryption Key method. The encryption details are stored as Secrets in the FeatureStore to keep them secure. Read more about encryption in the Google documentation.

The storage connector uses the Google gcs-connector-hadoop behind the scenes. For more information, check out Google Cloud Storage Connector for Spark and Hadoop.

algorithm `property` #

algorithm: str | None

Encryption Algorithm.

bucket `property` #

bucket: str | None

GCS Bucket.

encryption_key `property` #

encryption_key: str | None

Encryption Key.

encryption_key_hash `property` #

encryption_key_hash: str | None

Encryption Key Hash.

key_path `property` #

key_path: str | None

JSON keyfile for service account.

path `property` #

path: str | None

The path of the connector along with gs file system prefixed.

prepare_spark #

prepare_spark(path: str | None = None) -> str | None

Prepare Spark to use this Storage Connector.

conn.prepare_spark()
spark.read.format("json").load("gs://bucket/path")
# or
spark.read.format("json").load(conn.prepare_spark("gs://bucket/path"))

PARAMETER	DESCRIPTION
`path`	Path to prepare for reading from Google cloud storage. TYPE: `str \| None` DEFAULT: `None`

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads GCS path into a dataframe using the storage connector.

To read directly from the default bucket, you can omit the path argument:

conn.read(data_format='spark_formats')

Or, to read objects from the default bucket, provide the object path without the gsutil URI scheme. For example, the following reads from the path gs://bucket_on_connector/Path/object:

conn.read(data_format='spark_formats', path='Path/object')

Or, to read with a full gsutil URI path:

conn.read(data_format='spark_formats', path='gs://BUCKET/DATA')

PARAMETER	DESCRIPTION
`query`	Not relevant for GCS connectors. TYPE: `str \| None` DEFAULT: `None`
`data_format`	Spark data format. TYPE: `str \| None` DEFAULT: `None`
`options`	Spark options. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	GCS path. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Possible values are `"default"`, `"spark"`, `"pandas"`, `"polars"`, `"numpy"` or `"python"`. Defaults to `"default"`, which maps to a Spark dataframe for the Spark engine and a Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RAISES	DESCRIPTION
`ValueError`	Malformed arguments.

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	A Spark dataframe.

GlueConnector #

Bases: StorageConnector

The Glue storage connector integrates with the AWS Glue Data Catalog.

The connector points at a Glue database backed by Amazon S3. Data always lives on S3, so the connector provides the same S3 credentials (access_key, secret_key, session_token, region) that the S3Connector does. This works for any data format — Apache Iceberg, Delta Lake, Apache Hudi, as well as plain file formats such as csv and parquet.

How the Glue Data Catalog itself is used depends on the format:

- Iceberg: the catalog owns the table's current-metadata pointer, so reads and writes are mediated by the catalog (the table is addressed by <database>.<table>).
- Delta and Hudi: the on-path transaction log or timeline stays authoritative; the catalog is a discoverability mirror that is registered on create and synced on write so external engines (Athena, EMR, ...) can find the table by name.
- Plain file formats (csv, parquet, ...): the connector is used only for S3 access; nothing is registered in the catalog.

For direct Spark or PyIceberg access outside the feature group APIs, the connector supplies the matching catalog properties; see GlueConnector.catalog_options (Spark) and GlueConnector.pyiceberg_catalog_options (PyIceberg).

Feature group path is optional when the Glue database has a location.

When creating a feature group from this connector and the Glue database has a location, the feature group path is generated automatically by appending the new table to that database location, so no path needs to be set. Otherwise, the path must be set explicitly on the data source, for example:

ds = fs.get_data_source("glue")
ds.path = "s3://mybucket/iceberg-warehouse/myglue.db/fg_1/"

An explicitly set path always takes precedence over the generated one.
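The path rule described above can be sketched as follows. This is an illustration only, not the Hopsworks implementation: `resolve_feature_group_path` is a hypothetical name, and the exact layout of the generated path (database location, then table name, then a trailing slash) is an assumption.

```python
from typing import Optional


def resolve_feature_group_path(
    explicit_path: Optional[str],
    database_location: Optional[str],
    table_name: str,
) -> str:
    """Pick the feature group path following the documented precedence."""
    if explicit_path:
        # An explicitly set path always takes precedence.
        return explicit_path
    if database_location:
        # Assumed layout: append the table name to the database location.
        return f"{database_location.rstrip('/')}/{table_name}/"
    raise ValueError(
        "The Glue database has no location; set the path on the data source."
    )
```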

access_key `property` #

access_key: str | None

Access key.

arguments `property` #

arguments: dict[str, Any]

Additional Spark options for the connector, passed as a dictionary.

These are forwarded to the S3 setup the same way as for the S3Connector, so any fs.s3a.* option (e.g. {"fs.s3a.endpoint": "..."}) applies here too.

bucket `property` #

bucket: str | None

No fixed bucket; the bucket is part of the table's S3 location.

database `property` #

database: str | None

Default Glue database for the connector.

This is only a fallback: when a feature group's data source specifies a database, that one takes precedence over this connector default.

iam_role `property` #

iam_role: str | None

IAM role.

region `property` #

region: str | None

AWS region of the Glue Data Catalog and the backing S3 bucket.

secret_key `property` #

secret_key: str | None

Secret key.

server_encryption_algorithm `property` #

server_encryption_algorithm: str | None

Server-side encryption algorithm, exposed for reuse of the S3 setup.

server_encryption_key `property` #

server_encryption_key: str | None

Server-side encryption key, exposed for reuse of the S3 setup.

session_token `property` #

session_token: str | None

Session token.

table `property` #

table: str | None

Name of the table within the Glue database, if any.

catalog_options #

catalog_options(
    warehouse: str | None = None,
) -> dict[str, str]

Return Iceberg catalog properties for committing through the Glue Data Catalog.

The returned properties configure the Iceberg GlueCatalog and its S3FileIO, including the connector's S3 credentials. Pass these together with the iceberg.catalog write option (prefixed with iceberg.catalog.) to register the table in the Glue Data Catalog on write while the data stays on S3.

Hopsworks routes Glue feature groups through the Glue catalog automatically, so passing these options manually is only needed for direct Spark or PyIceberg access outside the feature group APIs.

Example

connector = fs.get_data_source("glue").storage_connector
options = {
    "iceberg.catalog": "glue_catalog",
    **{
        f"iceberg.catalog.{k}": v
        for k, v in connector.catalog_options().items()
    },
}
fg.insert(df, write_options=options)

PARAMETER	DESCRIPTION
`warehouse`	S3 warehouse location for the catalog; defaults to the catalog's configured location. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`dict[str, str]`	A dictionary of Iceberg Glue catalog properties.

connector_options #

connector_options() -> dict[str, Any]

Return options to be passed to an external S3 connector library.

prepare_spark #

prepare_spark(path: str | None = None) -> str | None

Prepare Spark to use this Storage Connector.

Sets the S3 credentials on the Spark session and rewrites the path to the s3a:// scheme, so reads and writes to the table's S3 location work, mirroring the S3Connector.

PARAMETER	DESCRIPTION
`path`	Path to prepare for reading from cloud storage. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`str \| None`	The path rewritten to the `s3a://` scheme.

pyiceberg_catalog_options #

pyiceberg_catalog_options(
    warehouse: str | None = None,
) -> dict[str, str]

Return PyIceberg catalog properties for the Glue Data Catalog.

PyIceberg identifies the catalog by type rather than by the implementation class used by the Iceberg Spark connector, and uses its own credential and region property names, so the catalog_options Spark properties cannot be reused. Use these when reading or writing a Glue table without Spark.

PARAMETER	DESCRIPTION
`warehouse`	S3 warehouse location for the catalog; defaults to the catalog's configured location. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`dict[str, str]`	A dictionary of PyIceberg Glue catalog properties.

GoogleSheetsConnector #

Bases: StorageConnector

A Google Sheets storage connector authenticated by a GCP service-account keyfile.

The connector stores the path to a service-account JSON key uploaded to HopsFS. An optional spreadsheet ID can be set at connector level; if omitted it must be provided per feature group via DataSource.spreadsheet_id.

key_path `property` `writable` #

key_path: str | None

Get or set the HopsFS path to the service-account JSON keyfile.

spreadsheet_id `property` `writable` #

spreadsheet_id: str | None

Get or set the Google Spreadsheet ID.

Optional at connector level — can be provided per feature group via DataSource.spreadsheet_id instead.

HopsFSConnector #

Bases: StorageConnector

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a path into a dataframe using the HopsFS storage connector.

PARAMETER	DESCRIPTION
`query`	Not used for HopsFS. Kept for interface consistency. TYPE: `str \| None` DEFAULT: `None`
`data_format`	The file format to be read, e.g., `csv`, `parquet`. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the connector. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Path to be read within HopsFS. If the connector has a base path configured, relative paths will be resolved against it. Absolute `hopsfs://` paths are used as-is. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Defaults to "default", which maps to Spark dataframe for the Spark Engine and Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	The read dataframe.
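The path handling described for the `path` parameter can be sketched as below. It illustrates the two documented cases (relative paths resolved against the base path, absolute `hopsfs://` paths used as-is) and is not the connector's own code; `resolve_hopsfs_path` is a hypothetical name, and the behaviour for an empty path or a missing base path is an assumption.

```python
import posixpath
from typing import Optional


def resolve_hopsfs_path(base_path: Optional[str], path: Optional[str]) -> Optional[str]:
    """Resolve `path` against the connector's base path, if there is one."""
    if not path:
        # Assumption: without a path, the base path itself is read.
        return base_path
    if path.startswith("hopsfs://"):
        # Absolute hopsfs:// paths are used as-is.
        return path
    if not base_path:
        return path
    # posixpath.join leaves paths starting with "/" untouched and joins
    # relative paths onto the base path.
    return posixpath.join(base_path, path)
```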

JdbcConnector #

Bases: StorageConnector

arguments `property` #

arguments: dict[str, Any] | None

Additional JDBC arguments.

When running hsfs with PySpark/Spark in Hopsworks, the driver is automatically provided in the classpath but you need to set the driver argument to com.mysql.cj.jdbc.Driver when creating the Storage Connector.

connection_string `property` #

connection_string: str | None

JDBC connection string.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a query into a dataframe using the storage connector.

PARAMETER	DESCRIPTION
`query`	A SQL query to be read. TYPE: `str \| None` DEFAULT: `None`
`data_format`	Not relevant for JDBC based connectors. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the JDBC connector. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Not relevant for JDBC based connectors. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Possible values are `"default"`, `"spark"`, `"pandas"`, `"polars"`, `"numpy"` or `"python"`. Defaults to `"default"`, which maps to a Spark dataframe for the Spark engine and a Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	`DataFrame`.

KafkaConnector #

Bases: StorageConnector

bootstrap_servers `property` #

bootstrap_servers: list[str] | None

Bootstrap servers string.

options `property` #

options: dict[str, Any]

Additional Kafka options.

security_protocol `property` #

security_protocol: str | None

Security protocol.

ssl_endpoint_identification_algorithm `property` #

ssl_endpoint_identification_algorithm: str | None

SSL endpoint identification algorithm.

ssl_keystore_location `property` #

ssl_keystore_location: str | None

SSL keystore location.

ssl_truststore_location `property` #

ssl_truststore_location: str | None

SSL truststore location.

confluent_options #

confluent_options() -> dict[str, Any]

Return prepared options to be passed to confluent_kafka, based on the provided Apache Spark configuration.

Right now only producer values with Importance >= medium are implemented.

See https://docs.confluent.io/platform/current/clients/librdkafka/html/md_CONFIGURATION.html.

RETURNS	DESCRIPTION
`dict[str, Any]`	A dictionary containing the configuration options for confluent_kafka.

create_pem_files #

create_pem_files(kafka_options: dict[str, Any]) -> None

Create PEM (Privacy Enhanced Mail) files for Kafka SSL authentication.

This method writes the necessary PEM files for SSL authentication with Kafka, using the provided keystore and truststore locations and passwords. The generated file paths are stored as the following instance variables:

- self.ca_chain_path: Path to the generated CA chain PEM file.
- self.client_cert_path: Path to the generated client certificate PEM file.
- self.client_key_path: Path to the generated client key PEM file.

These files are used for configuring secure Kafka connections (e.g., with Spark or confluent_kafka). The method is idempotent and will only create the files once per connector instance.

PARAMETER	DESCRIPTION
`kafka_options`	A dictionary containing the Kafka configuration options, including keystore and truststore locations and passwords. TYPE: `dict[str, Any]`
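As a rough illustration of how the three PEM paths could be consumed, the sketch below builds a librdkafka-style configuration dictionary. The helper `pem_paths_to_confluent_config` and its arguments are hypothetical and not part of hsfs; only the `bootstrap.servers`, `security.protocol` and `ssl.*` key names are standard librdkafka properties. Whether `confluent_options` produces exactly these keys is an assumption.

```python
def pem_paths_to_confluent_config(
    bootstrap_servers: list[str],
    ca_chain_path: str,
    client_cert_path: str,
    client_key_path: str,
) -> dict[str, str]:
    """Hypothetical helper: map PEM file paths onto librdkafka SSL settings."""
    return {
        # librdkafka expects a comma-separated broker list
        "bootstrap.servers": ",".join(bootstrap_servers),
        "security.protocol": "SSL",
        # CA chain used to verify the broker certificate
        "ssl.ca.location": ca_chain_path,
        # Client certificate and private key used for mutual TLS
        "ssl.certificate.location": client_cert_path,
        "ssl.key.location": client_key_path,
    }


config = pem_paths_to_confluent_config(
    ["broker-1:9092", "broker-2:9092"],
    "/tmp/ca_chain.pem",
    "/tmp/client_cert.pem",
    "/tmp/client_key.pem",
)
```

A dictionary of this shape is what a `confluent_kafka.Producer` or `Consumer` constructor accepts as its configuration.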

kafka_options #

kafka_options(distribute: bool = True) -> dict[str, Any]

Return prepared options to be passed to Kafka, based on the additional arguments.

See https://kafka.apache.org/documentation/.

PARAMETER	DESCRIPTION
`distribute`	Whether to distribute the SSL certificates to the cluster nodes. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`dict[str, Any]`	A dictionary containing the configuration options for kafka.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> None

Failure

This operation is not supported. Use read_stream instead to read a Kafka stream into a streaming Spark Dataframe.

RAISES	DESCRIPTION
`NotImplementedError`	Always, since this operation is not supported.

read_stream #

read_stream(
    topic: str,
    topic_pattern: bool = False,
    message_format: str = "avro",
    schema: str | None = None,
    options: dict[str, Any] | None = None,
    include_metadata: bool = False,
) -> TypeVar("pyspark.sql.DataFrame") | TypeVar(
    "pyspark.sql.streaming.StreamingQuery"
)

Reads a Kafka stream from a topic or multiple topics into a Dataframe.

Engine Support

Spark only

Reading from data streams using Pandas/Python as engine is currently not supported. Python/Pandas has no notion of streaming.

PARAMETER	DESCRIPTION
`topic`	Name or pattern of the topic(s) to subscribe to. TYPE: `str`
`topic_pattern`	Flag to indicate if `topic` string is a pattern. TYPE: `bool` DEFAULT: `False`
`message_format`	The format of the messages to use for decoding. Can be `"avro"` or `"json"`. TYPE: `str` DEFAULT: `'avro'`
`schema`	Optional schema to use for decoding. For the `"avro"` message format, an Avro schema string; for the `"json"` message format, a Spark StructType schema or a DDL-formatted string. TYPE: `str \| None` DEFAULT: `None`
`options`	Additional options as key/value string pairs to be passed to Spark. Defaults to `{}`. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`include_metadata`	Indicate whether to return additional metadata fields from messages in the stream. Otherwise, only the decoded value fields are returned. TYPE: `bool` DEFAULT: `False`

RAISES	DESCRIPTION
`ValueError`	Malformed arguments.

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.sql.streaming.StreamingQuery')`	A Spark streaming dataframe.
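The `topic` and `topic_pattern` arguments correspond naturally to the two subscription options of the Spark Kafka source, `subscribe` and `subscribePattern`. The sketch below shows that mapping in isolation. The helper `topic_subscription_option` is hypothetical, and it is an assumption that `read_stream` translates its arguments this way internally.

```python
def topic_subscription_option(topic: str, topic_pattern: bool = False) -> dict[str, str]:
    """Hypothetical helper: pick the Spark Kafka source option for a topic argument."""
    if not topic:
        # Mirrors the documented ValueError for malformed arguments
        raise ValueError("topic must be a non-empty string")
    # A pattern subscribes to every topic matching the regular expression,
    # a plain name (or comma-separated list of names) subscribes to exactly those topics
    key = "subscribePattern" if topic_pattern else "subscribe"
    return {key: topic}


by_name = topic_subscription_option("transactions")
by_pattern = topic_subscription_option("transactions_.*", topic_pattern=True)
```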

spark_options #

spark_options() -> dict[str, Any]

Return prepared options to be passed to Spark, based on the additional arguments.

This is done by just adding 'kafka.' prefix to kafka_options.

See https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations.
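The prefixing step described above can be sketched in a few lines. The function name `add_kafka_prefix` and the sample option values are illustrative only and not part of hsfs.

```python
from typing import Any


def add_kafka_prefix(kafka_options: dict[str, Any]) -> dict[str, Any]:
    """Illustrative helper: prefix every Kafka client option with 'kafka.' for Spark."""
    return {f"kafka.{key}": value for key, value in kafka_options.items()}


spark_ready = add_kafka_prefix(
    {
        "bootstrap.servers": "broker-1:9092",
        "security.protocol": "SSL",
    }
)
```

Spark forwards options carrying the `kafka.` prefix to the underlying Kafka client, which is why the prefix is needed.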

MongoDBConnector #

Bases: StorageConnector

MongoDB storage connector backed by the official mongo-spark-connector and pymongo.

Use this connector to register an external feature group whose data lives in a MongoDB collection. The connection_string is a MongoDB URI without embedded credentials; the username and password are kept in the Hopsworks secret store and spliced in at read time.
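To make the idea of splicing credentials into a credential-free URI concrete, here is a minimal sketch using only the standard library. The function `splice_credentials` is hypothetical and only illustrates the general technique; how the connector performs this step internally is not documented here. Percent-encoding the user and password with `quote_plus` follows the usual recommendation for MongoDB URIs.

```python
from urllib.parse import quote_plus


def splice_credentials(connection_string: str, user: str, password: str) -> str:
    """Hypothetical helper: insert percent-encoded credentials into a MongoDB URI."""
    scheme, separator, rest = connection_string.partition("://")
    if not separator or scheme not in ("mongodb", "mongodb+srv"):
        raise ValueError("expected a mongodb:// or mongodb+srv:// URI")
    # The authority is everything before the first '/', credentials would show up as '@'
    authority = rest.split("/", 1)[0]
    if "@" in authority:
        raise ValueError("connection string already embeds credentials")
    # Reserved characters such as '@' and '/' must be escaped in user and password
    return f"{scheme}://{quote_plus(user)}:{quote_plus(password)}@{rest}"


uri = splice_credentials(
    "mongodb://db.example.com:27017/features", "svc", "p@ss/word"
)
```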

auth_mechanism `property` #

auth_mechanism: str | None

MongoDB authMechanism URI parameter (e.g. SCRAM-SHA-256).

auth_source `property` #

auth_source: str | None

MongoDB authSource URI parameter (typically admin).

collection `property` #

collection: str | None

Default collection name used when none is provided at read time.

connection_string `property` #

connection_string: str | None

MongoDB connection URI (mongodb:// or mongodb+srv://) without embedded credentials.

database `property` #

database: str | None

Default database name.

options `property` #

options: dict[str, Any]

Extra options forwarded to the Spark / pymongo client.

password `property` #

password: str | None

Database password resolved from the Hopsworks secret store.

user `property` #

user: str | None

Database user.

connector_options #

connector_options() -> dict[str, Any]

Return arguments suitable for an external pymongo client.

from pymongo import MongoClient

# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("mongo_conn")
client = MongoClient(**sc.connector_options())

Forwards any persisted self.options whose key looks like a MongoClient constructor kwarg (lowercase letters, digits, and underscores) so operator-set tuning knobs — maxPoolSize, serverSelectionTimeoutMS, tlsAllowInvalidCertificates, etc. — reach the driver. Keys that look like URI parameters (camelCase, already embedded in connection_uri) and anything non-string are dropped to avoid duplicate-config errors from pymongo.

prepare_spark #

prepare_spark(path: str | None = None) -> str | None

Ensure the Spark session is wired with the mongo-spark-connector classpath.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Read a collection from MongoDB into a dataframe.

PARAMETER	DESCRIPTION
`query`	Not used for MongoDB. TYPE: `str \| None` DEFAULT: `None`
`data_format`	Not used for MongoDB. TYPE: `str \| None` DEFAULT: `None`
`options`	Extra key/value options merged into the Spark reader configuration. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Not used for MongoDB. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	Type of the returned dataframe. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	`DataFrame`.

RedshiftConnector #

Bases: StorageConnector

arguments `property` #

arguments: str | None

Additional JDBC, REDSHIFT, or Snowflake arguments.

auto_create `property` #

auto_create: bool | None

Whether to automatically create the database user if it does not exist.

cluster_identifier `property` #

cluster_identifier: str | None

Cluster identifier for redshift cluster.

database_driver `property` #

database_driver: str | None

Database driver for redshift cluster.

database_endpoint `property` #

database_endpoint: str | None

Database endpoint for redshift cluster.

database_group `property` #

database_group: str | None

Database group for redshift cluster.

database_name `property` #

database_name: str | None

Database name for redshift cluster.

database_password `property` #

database_password: str | None

Database password for redshift cluster.

database_port `property` #

database_port: int | str | None

Database port for redshift cluster.

database_user_name `property` #

database_user_name: str | None

Database username for redshift cluster.

expiration `property` #

expiration: int | str | None

Cluster temporary credential expiration time.

iam_role `property` #

iam_role: Any | None

IAM role.

table_name `property` #

table_name: str | None

Table name for redshift cluster.

connector_options #

connector_options() -> dict[str, Any]

Return options to be passed to an external Redshift connector library.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a table or query into a dataframe using the storage connector.

PARAMETER	DESCRIPTION
`query`	By default, the storage connector will read the table configured together with the connector, if any. It's possible to overwrite this by passing a SQL query here. TYPE: `str \| None` DEFAULT: `None`
`data_format`	Not relevant for JDBC based connectors such as Redshift. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the JDBC connector. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Not relevant for JDBC based connectors such as Redshift. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Possible values are `"default"`, `"spark"`, `"pandas"`, `"polars"`, `"numpy"` or `"python"`. Defaults to `"default"`, which maps to a Spark dataframe for the Spark engine and a Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	`DataFrame`.
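The precedence between an explicit `query` and the table configured on the connector can be pictured as follows. This is a sketch under assumptions: the helper `jdbc_source_option` is hypothetical, and it uses the generic Spark JDBC reader options `query` and `dbtable` to express the choice; the connector's real implementation may differ.

```python
from typing import Optional


def jdbc_source_option(query: Optional[str], table_name: Optional[str]) -> dict[str, str]:
    """Hypothetical helper: an explicit query wins over the configured table."""
    if query:
        return {"query": query}
    if table_name:
        return {"dbtable": table_name}
    raise ValueError("either a query or a configured table is required")


from_table = jdbc_source_option(None, "sales")
from_query = jdbc_source_option("SELECT id, amount FROM sales WHERE amount > 100", "sales")
```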

refetch #

refetch() -> None

Refetch storage connector in order to retrieve updated temporary credentials.

S3Connector #

Bases: StorageConnector

access_key `property` #

access_key: str | None

Access key.

arguments `property` #

arguments: dict[str, Any] | None

Additional spark options for the S3 connector, passed as a dictionary.

These are set using the Spark Options field in the UI when creating the connector. Example: {"fs.s3a.endpoint": "s3.eu-west-1.amazonaws.com", "fs.s3a.path.style.access": "true"}.

bucket `property` #

bucket: str | None

Return the bucket for S3 connectors.

iam_role `property` #

iam_role: str | None

IAM role.

path `property` #

path: str | None

If the connector refers to a path (e.g. S3) - return the path of the connector.

region `property` #

region: str | None

Return the region for S3 connectors.

secret_key `property` #

secret_key: str | None

Secret key.

server_encryption_algorithm `property` #

server_encryption_algorithm: str | None

Encryption algorithm if server-side S3 bucket encryption is enabled.

server_encryption_key `property` #

server_encryption_key: str | None

Encryption key if server-side S3 bucket encryption is enabled.

session_token `property` #

session_token: str | None

Session token.

connector_options #

connector_options() -> dict[str, Any]

Return options to be passed to an external S3 connector library.

prepare_spark #

prepare_spark(path: str | None = None) -> str | None

Prepare Spark to use this Storage Connector.

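# configure the Spark session with the connector's credentials
conn.prepare_spark()

spark.read.format("json").load("s3a://[bucket]/path")

# or pass the path through prepare_spark and read from the returned path
spark.read.format("json").load(conn.prepare_spark("s3a://[bucket]/path"))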

PARAMETER	DESCRIPTION
`path`	Path to prepare for reading from cloud storage. TYPE: `str \| None` DEFAULT: `None`

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a query or a path into a dataframe using the storage connector.

Note, paths are only supported for object stores like S3, HopsFS and ADLS, while queries are meant for JDBC or databases like Redshift and Snowflake.

PARAMETER	DESCRIPTION
`query`	Not relevant for S3 connectors. TYPE: `str \| None` DEFAULT: `None`
`data_format`	The file format of the files to be read, e.g. `csv`, `parquet`. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the S3 connector. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Path within the bucket to be read. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Defaults to "default", which maps to Spark dataframe for the Spark Engine and Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	`DataFrame`.
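The rule for the `"default"` value of `dataframe_type` can be sketched as a small resolver. The function `resolve_dataframe_type` and the engine names `"spark"` and `"python"` are assumptions made for illustration; they are not hsfs APIs.

```python
ALLOWED_TYPES = {"default", "spark", "pandas", "polars", "numpy", "python"}


def resolve_dataframe_type(dataframe_type: str, engine: str) -> str:
    """Hypothetical helper: resolve 'default' to the engine's native dataframe type."""
    if dataframe_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported dataframe_type: {dataframe_type!r}")
    if dataframe_type != "default":
        # An explicit choice is returned unchanged
        return dataframe_type
    # 'default' means Spark dataframes on the Spark engine, Pandas otherwise
    return "spark" if engine == "spark" else "pandas"


on_spark = resolve_dataframe_type("default", engine="spark")
on_python = resolve_dataframe_type("default", engine="python")
explicit = resolve_dataframe_type("polars", engine="python")
```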

SapHanaConnector #

Bases: StorageConnector

SAP HANA storage connector backed by the SAP JDBC driver.

Use this connector to register an external feature group whose data lives in SAP HANA, and to ingest data from HANA via the dlt-based ingestion job.

application `property` #

application: str | None

Optional SAP HANA application name surfaced for session tracing.

database `property` #

database: str | None

Tenant database name on the SAP HANA endpoint.

host `property` #

host: str | None

Hostname of the SAP HANA endpoint.

options `property` #

options: dict[str, Any]

Additional Spark and JDBC options merged into reads.

password `property` #

password: str | None

Database password.

port `property` #

port: int | None

Port of the SAP HANA endpoint.

schema `property` #

schema: str | None

Default schema applied to unqualified queries.

table `property` #

table: str | None

Table the connector points at when no query is provided.

user `property` #

user: str | None

Database user.

connector_options #

connector_options() -> dict[str, Any]

Return arguments suitable for an external Python HANA driver such as hdbcli.

from hdbcli import dbapi

# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("hana_conn")
conn = dbapi.connect(**sc.connector_options())

prepare_spark #

prepare_spark(path: str | None = None) -> str | None

Prepare the Spark session with the SAP HANA driver classpath when needed.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Read a table or query from SAP HANA into a dataframe.

PARAMETER	DESCRIPTION
`query`	SQL query to read; overrides any configured table. TYPE: `str \| None` DEFAULT: `None`
`data_format`	Not used for SAP HANA. TYPE: `str \| None` DEFAULT: `None`
`options`	Extra key/value options passed to the Spark JDBC reader. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Not used for SAP HANA. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	Type of the returned dataframe. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	`DataFrame`.

SnowflakeConnector #

Bases: StorageConnector

account `property` #

account: str | None

Account of the Snowflake storage connector.

application `property` #

application: Any

Application of the Snowflake storage connector.

database `property` #

database: str | None

Database of the Snowflake storage connector.

options `property` #

options: dict[str, Any] | None

Additional options for the Snowflake storage connector.

passphrase `property` #

passphrase: str | None

Passphrase for the private key file.

password `property` #

password: str | None

Password of the Snowflake storage connector.

private_key `property` #

private_key: str | None

Path to the private key file for key pair authentication.

role `property` #

role: Any | None

Role of the Snowflake storage connector.

schema `property` #

schema: str | None

Schema of the Snowflake storage connector.

table `property` #

table: str | None

Table of the Snowflake storage connector.

token `property` #

token: str | None

OAuth token of the Snowflake storage connector.

url `property` #

url: str | None

URL of the Snowflake storage connector.

user `property` #

user: Any | None

User of the Snowflake storage connector.

warehouse `property` #

warehouse: str | None

Warehouse of the Snowflake storage connector.

connector_options #

connector_options() -> dict[str, Any] | None

Prepare a Python dictionary with the needed arguments for you to connect to a Snowflake database.

It is useful for the snowflake.connector Python library.

import snowflake.connector

# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("snowflake_conn")
ctx = snowflake.connector.connect(**sc.connector_options())

RETURNS	DESCRIPTION
`dict[str, Any] \| None`	A dictionary with the needed arguments for you to connect to a Snowflake database.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a table or query into a dataframe using the storage connector.

PARAMETER	DESCRIPTION
`query`	By default, the storage connector will read the table configured together with the connector, if any. It's possible to overwrite this by passing a SQL query here. TYPE: `str \| None` DEFAULT: `None`
`data_format`	Not relevant for Snowflake connectors. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the engine. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Not relevant for Snowflake connectors. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Possible values are `"default"`, `"spark"`, `"pandas"`, `"polars"`, `"numpy"` or `"python"`. Defaults to `"default"`, which maps to a Spark dataframe for the Spark engine and a Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	`DataFrame`.

snowflake_connector_options #

snowflake_connector_options() -> dict[str, Any] | None

Alias for connector_options.

RETURNS	DESCRIPTION
`dict[str, Any] \| None`	A dictionary with the needed arguments for you to connect to a Snowflake database.

UnityCatalogConnector #

Bases: StorageConnector

Databricks Unity Catalog storage connector.

Reads Delta-formatted tables governed by Unity Catalog via the Arrow Flight query service. Direct Spark reads are not supported in this release; use the Arrow Flight path instead.

access_token `property` #

access_token: str | None

Databricks personal access token, decrypted from the Hopsworks secret store on retrieval.

account_host `property` #

account_host: str | None

Databricks account-console host, only set when oauth_endpoint is "ACCOUNT".

account_id `property` #

account_id: str | None

Databricks account ID, only set when oauth_endpoint is "ACCOUNT".

arguments `property` #

arguments: dict[str, Any]

Additional Unity Catalog connection arguments passed through to the Arrow Flight server.

auth_method `property` #

auth_method: str

Authentication method for the Databricks workspace, either "PAT" or "OAUTH_M2M".

Defaults to "PAT" for connectors created before OAuth support landed.

aws_region `property` #

aws_region: str | None

Optional explicit AWS region for the managed storage backing this Unity Catalog.

When unset, the Arrow Flight read path guesses the region from the STS session-token Databricks returns with temporary table credentials.

client_id `property` #

client_id: str | None

Databricks service principal client ID, only set when auth_method is "OAUTH_M2M".

client_secret `property` #

client_secret: str | None

Databricks service principal client secret.

Write-only on the backend: this property is only populated when the caller has just constructed the connector locally with a secret in hand. Server responses never carry it; use has_client_secret to test whether a secret is on file.

default_catalog `property` #

default_catalog: str | None

Optional default Unity Catalog catalog to use when no catalog is explicitly specified.

has_access_token `property` #

has_access_token: bool

True iff a personal access token is on file for this connector.

The server never returns the access token itself on read; this boolean lets callers tell whether one exists without exposing the secret.

has_client_secret `property` #

has_client_secret: bool

True iff a client secret is on file for this connector.

The server never returns the client secret itself on read; this boolean lets callers tell whether one exists without exposing the secret.

oauth_endpoint `property` #

oauth_endpoint: str | None

OAuth token endpoint flavour, either "WORKSPACE" or "ACCOUNT".

Only set when auth_method is "OAUTH_M2M".

workspace_url `property` #

workspace_url: str | None

Databricks workspace URL used for API calls.

connector_options #

connector_options() -> dict[str, Any]

Return UC connector options shaped for external library use.

read #

read(
    spark: Any,
    catalog: str,
    schema: str,
    table: str,
    *,
    force_vended: bool = False,
) -> Any

Read a Unity Catalog table as a Spark DataFrame.

Default behavior: if the SparkSession is connected to a Databricks cluster, dispatch to native UC access (spark.read.table()), which is faster, auth'd by the cluster identity, and skips the Hopsworks round-trip. Otherwise resolves vended S3 credentials via Hopsworks and reads the Delta path directly.

Set force_vended=True to skip detection and always use the vended path, useful if the cluster identity lacks the grants the connector's SP has.
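The dispatch rule described above reduces to a small decision function. The sketch below is illustrative only: `choose_read_path` and `qualified_table_name` are hypothetical names, and how the connector actually detects a Databricks cluster is not specified here, so detection is passed in as a plain boolean.

```python
def choose_read_path(on_databricks: bool, force_vended: bool = False) -> str:
    """Hypothetical helper: decide between native UC access and vended credentials."""
    if on_databricks and not force_vended:
        # Cluster identity handles auth, no Hopsworks round-trip needed
        return "native"
    # Either not on Databricks, or the caller explicitly asked for vended credentials
    return "vended"


def qualified_table_name(catalog: str, schema: str, table: str) -> str:
    """Three-level Unity Catalog name as it would be passed to spark.read.table()."""
    return f"{catalog}.{schema}.{table}"


path_kind = choose_read_path(on_databricks=True, force_vended=True)
full_name = qualified_table_name("main", "default", "transactions")
```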

spark_options_for #

spark_options_for(
    catalog: str, schema: str, table: str
) -> UnityCatalogSparkOptions

Resolve Spark read options for a Unity Catalog table.

Mirrors the Python / Arrow Flight architecture: Hopsworks vends the Databricks bearer; the SDK calls Databricks directly for vended S3 temp-credentials, then builds the per-bucket S3A keys + Delta path.

v1 supports AWS-backed Delta tables only. Non-AWS storage (Azure / GCP) raises here. Credentials live ~1 h; call this close to the action that triggers the read.

PARAMETER	DESCRIPTION
`catalog`	UC catalog name (e.g. `"main"`). TYPE: `str`
`schema`	UC schema name within the catalog (e.g. `"default"`). TYPE: `str`
`table`	UC table name within the schema (e.g. `"transactions"`). TYPE: `str`

RETURNS	DESCRIPTION
`UnityCatalogSparkOptions`	A typed `UnityCatalogSparkOptions` object carrying the Delta path, per-bucket S3A keys, and credential expiry metadata.

UnityCatalogSparkOptions #

Typed Spark read options vended for a Unity Catalog table.

Returned by UnityCatalogConnector.spark_options_for. Carries short-lived AWS credentials (in storage_options) plus the Delta path the table is stored at.

Use apply_to to wire the S3A credentials onto a SparkSession's Hadoop config, then spark.read.format(opts.format).load(opts.path). Or use read which does both in one call.

Credentials live ~1 h. Spark is lazy, so a long delay between spark_options_for() and the first action that actually reads from S3 can outlive the credentials. Mitigation: call close to the action, or use read().
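One way to guard against expired credentials is to check the remaining lifetime just before triggering a Spark action. The sketch below assumes the expiry is available as an ISO-8601 string, as described for `expires_at`. The helper `credentials_usable` and the five-minute safety margin are hypothetical choices, not part of hsfs.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def credentials_usable(
    expires_at: Optional[str],
    now: Optional[datetime] = None,
    margin_seconds: int = 300,
) -> bool:
    """Hypothetical helper: True if the credentials outlive `now` by the safety margin."""
    if expires_at is None:
        # Unknown expiry: be conservative and ask for fresh credentials
        return False
    # Older Python versions do not accept a trailing 'Z' in fromisoformat
    expiry = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC
        expiry = expiry.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return expiry - current > timedelta(seconds=margin_seconds)


reference = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = credentials_usable("2024-01-01T13:00:00Z", now=reference)
nearly_expired = credentials_usable("2024-01-01T12:02:00Z", now=reference)
```

If the check fails, calling `spark_options_for` again right before the read would obtain new credentials.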

cloud `property` #

cloud: str

Cloud the storage is on. Currently always "aws".

expires_at `property` #

expires_at: str | None

ISO-8601 instant when the vended credentials expire.

expires_in_seconds `property` #

expires_in_seconds: int

Seconds remaining at the moment the server built this response.

format `property` #

format: str

Spark read format. Currently always "delta".

path `property` #

path: str

S3A path to the Delta table root.

storage_options `property` #

storage_options: dict[str, str]

Per-bucket S3A Hadoop config keys + values.

apply_to #

apply_to(spark: Any) -> None

Apply the per-bucket S3A keys to spark's Hadoop config.

Per-bucket scope (fs.s3a.bucket.<bucket>.*) means adjacent reads to other buckets in the same SparkSession are unaffected. Subsequent reads of the same bucket will pick up the most-recently-applied credentials — matches AWS STS rotation semantics.

PARAMETER	DESCRIPTION
`spark`	A `SparkSession`; the per-bucket keys are written to its underlying Hadoop configuration. TYPE: `Any`
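To illustrate what per-bucket scoping looks like, the sketch below builds Hadoop S3A keys for a single bucket from temporary credentials. The helper `per_bucket_s3a_options` is hypothetical, and the exact set of keys carried in `storage_options` is an assumption. The `fs.s3a.bucket.<bucket>.` prefix and the `TemporaryAWSCredentialsProvider` class are standard Hadoop S3A features.

```python
def per_bucket_s3a_options(
    bucket: str, access_key: str, secret_key: str, session_token: str
) -> dict[str, str]:
    """Hypothetical helper: scope temporary AWS credentials to one bucket."""
    prefix = f"fs.s3a.bucket.{bucket}."
    return {
        # Session credentials require the temporary-credentials provider
        prefix + "aws.credentials.provider": (
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"
        ),
        prefix + "access.key": access_key,
        prefix + "secret.key": secret_key,
        prefix + "session.token": session_token,
    }


options = per_bucket_s3a_options(
    "uc-managed-bucket", "EXAMPLE_ACCESS", "EXAMPLE_SECRET", "EXAMPLE_TOKEN"
)
```

Because every key carries the bucket name, settings for other buckets in the same Hadoop configuration are left untouched. On a live session such keys are commonly written with `spark.sparkContext._jsc.hadoopConfiguration().set(key, value)`, though whether `apply_to` does exactly this is not stated here.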

read #

read(spark: Any) -> Any

Apply credentials, then issue the Delta read.

PARAMETER	DESCRIPTION
`spark`	A `SparkSession`; the per-bucket S3A keys are applied to its Hadoop configuration before the read. TYPE: `Any`

RETURNS	DESCRIPTION
`Any`	The Spark `DataFrame` produced by reading the Delta path.
