hsfs.storage_connector #

StorageConnector #

Bases: ABC

type `property` #

type: str | None

Type of the connector as a string, e.g. "HOPSFS", "S3", "ADLS", "REDSHIFT", "JDBC" or "SNOWFLAKE".

id `property` #

id: int | None

Id of the storage connector uniquely identifying it in the Feature store.

name `property` #

name: str

Name of the storage connector.

description `property` #

description: str | None

User provided description of the storage connector.

spark_options `abstractmethod` #

spark_options() -> dict[str, Any]

Return prepared options to be passed to Spark, based on the additional arguments.

RETURNS	DESCRIPTION
`dict[str, Any]`	A dictionary containing the configuration options for Spark.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a query or a path into a dataframe using the storage connector.

Note, paths are only supported for object stores like S3, HopsFS and ADLS, while queries are meant for JDBC or databases like Redshift and Snowflake.

PARAMETER	DESCRIPTION
`query`	By default, the storage connector will read the table configured together with the connector, if any. It's possible to override this by passing a SQL query here. TYPE: `str \| None` DEFAULT: `None`
`data_format`	When reading from object stores such as S3, HopsFS and ADLS, specify the file format to be read, e.g., `csv`, `parquet`. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the connector. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Path to be read from within the bucket of the storage connector. Not relevant for JDBC or database-based connectors such as Snowflake or Redshift. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Defaults to "default", which maps to Spark dataframe for the Spark Engine and Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	The read dataframe.
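
As a rough illustration of the path-versus-query split described above, the sketch below builds the keyword arguments for `read` from the connector's `type`. The helper `read_kwargs` and the `OBJECT_STORE_TYPES` set are hypothetical and not part of hsfs; the set only lists the object stores named in this section and assumes their type strings.

```python
# Hypothetical helper, not part of hsfs: choose `path` or `query` for read().
# Assumption: these strings match the values returned by `connector.type`.
OBJECT_STORE_TYPES = {"S3", "HOPSFS", "ADLS"}


def read_kwargs(connector_type, target, data_format=None):
    """Return keyword arguments for StorageConnector.read().

    Object stores receive `target` as a path (plus a file format);
    every other connector type receives it as a SQL query.
    """
    # `type` may be None according to the property signature.
    if (connector_type or "").upper() in OBJECT_STORE_TYPES:
        return {"path": target, "data_format": data_format}
    return {"query": target}
```

With a connector `sc` retrieved from the feature store, this could be used as `sc.read(**read_kwargs(sc.type, "SELECT * FROM my_table"))`, where `my_table` is a placeholder.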

get_feature_groups_provenance #

get_feature_groups_provenance() -> Links | None

Get the generated feature groups using this storage connector, based on explicit provenance.

These feature groups can be accessible or inaccessible.

Explicit provenance does not track deleted generated feature group links, so `deleted` will always be empty. For inaccessible feature groups, only minimal information is returned.

RETURNS	DESCRIPTION
`Links \| None`	The feature groups generated using this storage connector or `None` if none were created.
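
Because the method may return `None`, callers usually need a guard before inspecting the result. The sketch below shows one way to do that. It assumes the returned `Links` object exposes an `accessible` list and that each feature group has a `name` attribute; the helper name is hypothetical.

```python
def accessible_feature_group_names(connector):
    """Return the names of accessible feature groups linked to `connector`.

    Assumptions (not confirmed by this page): `Links.accessible` is a list
    of feature groups and each feature group has a `name` attribute.
    """
    links = connector.get_feature_groups_provenance()
    if links is None:  # documented: None when no feature groups were created
        return []
    return [fg.name for fg in links.accessible]
```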

RAISES	DESCRIPTION
`hopsworks.client.exceptions.RestAPIError`	In case the backend encounters an issue.

get_feature_groups #

get_feature_groups() -> list[FeatureGroup]

Get the feature groups using this storage connector, based on explicit provenance.

Only the accessible feature groups are returned. For more items, use the base method, get_feature_groups_provenance.

RETURNS	DESCRIPTION
`list[FeatureGroup]`	List of feature groups.

get_training_datasets_provenance #

get_training_datasets_provenance() -> Links | None

Get the generated training datasets using this storage connector, based on explicit provenance.

These training datasets can be accessible or inaccessible. Explicit provenance does not track deleted generated training dataset links, so `deleted` will always be empty. For inaccessible training datasets, only minimal information is returned.

RETURNS	DESCRIPTION
`Links \| None`	The training datasets generated using this storage connector or `None` if none were created.

RAISES	DESCRIPTION
`hopsworks.client.exceptions.RestAPIError`	In case the backend encounters an issue.

get_training_datasets #

get_training_datasets() -> list[TrainingDataset]

Get the training datasets using this storage connector, based on explicit provenance.

Only the accessible training datasets are returned. For more items use the base method, get_training_datasets_provenance.

RETURNS	DESCRIPTION
`list[TrainingDataset]`	List of training datasets.

get_databases #

get_databases() -> list[str]

Retrieve the list of available databases.

Example

# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("conn_name")

databases = sc.get_databases()

RETURNS	DESCRIPTION
`list[str]`	A list of database names available in the storage connector.

get_tables #

get_tables(
    database: str | None = None,
) -> list[ds.DataSource]

Retrieve the list of tables from the specified database.

Example

# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("conn_name")

tables = sc.get_tables("database_name")

PARAMETER	DESCRIPTION
`database`	The name of the database to list tables from. If not provided, the default database is used. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[ds.DataSource]`	A list of DataSource objects representing the tables.
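
The two discovery calls, `get_databases` and `get_tables`, can be combined to enumerate everything a connector exposes. This is a minimal sketch that assumes the connector supports both calls; the helper name `tables_by_database` is hypothetical.

```python
def tables_by_database(connector):
    """Map every database name to the DataSource objects of its tables."""
    return {
        database: connector.get_tables(database)
        for database in connector.get_databases()
    }
```

Each call to `get_tables` may contact the backend, so for connectors with many databases it may be preferable to query only the databases of interest.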

get_data #

get_data(
    data_source: ds.DataSource, use_cached=True
) -> DataSourceData

Retrieve the data from the data source.

Example

# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("conn_name")

tables = sc.get_tables("database_name")

data = sc.get_data(tables[0])

PARAMETER	DESCRIPTION
`data_source`	The data source to retrieve data from. TYPE: `ds.DataSource`
`use_cached`	Whether to use cached data if available. Only supported for CRM and REST connectors. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`DataSourceData`	An object containing the data retrieved from the data source.

get_metadata #

get_metadata(data_source: ds.DataSource) -> dict

Retrieve metadata information about the data source.

Example

# connect to the Feature Store
fs = ...

sc = fs.get_storage_connector("conn_name")

tables = sc.get_tables("database_name")

metadata = sc.get_metadata(tables[0])

PARAMETER	DESCRIPTION
`data_source`	The data source to retrieve metadata from. TYPE: `ds.DataSource`

RETURNS	DESCRIPTION
`dict`	A dictionary containing metadata about the data source.

AdlsConnector #

Bases: StorageConnector

generation `property` #

generation: str | None

Generation of the ADLS storage connector.

directory_id `property` #

directory_id: str | None

Directory ID of the ADLS storage connector.

application_id `property` #

application_id: str | None

Application ID of the ADLS storage connector.

account_name `property` #

account_name: str | None

Account name of the ADLS storage connector.

container_name `property` #

container_name: str | None

Container name of the ADLS storage connector.

service_credential `property` #

service_credential: str | None

Service credential of the ADLS storage connector.

path `property` #

path: str | None

If the connector refers to a path (e.g. ADLS) - return the path of the connector.

prepare_spark #

prepare_spark(path: str | None = None) -> str | None

Prepare Spark to use this Storage Connector.

conn.prepare_spark()

spark.read.format("json").load("abfss://[container-name]@[account_name].dfs.core.windows.net/[path]")

# or
spark.read.format("json").load(conn.prepare_spark("abfss://[container-name]@[account_name].dfs.core.windows.net/[path]"))

PARAMETER	DESCRIPTION
`path`	Path to prepare for reading from cloud storage. TYPE: `str \| None` DEFAULT: `None`
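
The `abfss://` URIs in the example above follow a fixed layout. The sketch below assembles one from its parts; `abfss_uri` is a hypothetical helper for illustration and not part of hsfs.

```python
def abfss_uri(container_name, account_name, path=""):
    """Build an ADLS URI in the layout shown in the example above."""
    relative = path.lstrip("/")  # avoid a double slash after the host
    return f"abfss://{container_name}@{account_name}.dfs.core.windows.net/{relative}"
```

The result could then be passed to `conn.prepare_spark(...)`, for instance with the connector's own `container_name` and `account_name` properties.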

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a path into a dataframe using the storage connector.

PARAMETER	DESCRIPTION
`query`	Not relevant for ADLS connectors. TYPE: `str \| None` DEFAULT: `None`
`data_format`	The file format of the files to be read, e.g. `csv`, `parquet`. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the ADLS connector. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Path within the container to be read. For example, path=`path` reads directly from the container specified on the connector by constructing the URI as 'abfss://[container-name]@[account_name].dfs.core.windows.net/[path]'. If no path is specified, the default container path from the connector is used. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Possible values are `"default"`, `"spark"`, `"pandas"`, `"polars"`, `"numpy"` or `"python"`. Defaults to `"default"`, which maps to a Spark dataframe for the Spark engine and a Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	`DataFrame`.

BigQueryConnector #

Bases: StorageConnector

The BigQuery storage connector provides integration to Google Cloud BigQuery.

You can use it to run BigQuery queries on your GCP cluster and load the results into a Spark dataframe by calling the read API.

Authentication to GCP is handled by uploading the JSON keyfile for service account to the Hopsworks Project. For more information on service accounts and creating keyfile in GCP, read Google Cloud documentation.

The storage connector uses the Google spark-bigquery-connector behind the scenes. To read more about the spark connector, like the spark options or usage, check Apache Spark SQL connector for Google BigQuery.

key_path `property` #

key_path: str | None

JSON keyfile for service account.

parent_project `property` #

parent_project: str | None

BigQuery parent project (Google Cloud Project ID of the table to bill for the export).

dataset `property` #

dataset: str | None

BigQuery dataset (The dataset containing the table).

query_table `property` #

query_table: str | None

BigQuery table name.

query_project `property` #

query_project: str | None

BigQuery project (The Google Cloud Project ID of the table).

materialization_dataset `property` #

materialization_dataset: str | None

BigQuery materialization dataset (The dataset where the materialized view is going to be created, used in case of query).

arguments `property` #

arguments: dict[str, Any]

Additional spark options.

connector_options #

connector_options() -> dict[str, Any]

Return options to be passed to an external BigQuery connector library.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads results from BigQuery into a Spark dataframe using the storage connector.

Reading from BigQuery is done by specifying either a BigQuery table or a BigQuery query. For example, to read from a BigQuery table, set the BigQuery project, dataset and table on the storage connector and read directly from the corresponding path.

conn.read()

Or, to read results from a BigQuery query, set the Materialization Dataset on the storage connector and pass your SQL to the query argument.

conn.read(query='SQL')

If the table options were also set on the storage connector, passing the query argument takes priority at runtime. This allows a user to read from either a query or a table with the same connector, assuming all fields were set. A user can also set the path argument to a BigQuery table path to read at runtime, if the table options were not set when creating the connector.

conn.read(path='project.dataset.table')

PARAMETER	DESCRIPTION
`query`	BigQuery query. TYPE: `str \| None` DEFAULT: `None`
`data_format`	Spark data format. TYPE: `str \| None` DEFAULT: `None`
`options`	Spark options. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	BigQuery table path. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Possible values are `"default"`, `"spark"`, `"pandas"`, `"polars"`, `"numpy"` or `"python"`. Defaults to `"default"`, which maps to a Spark dataframe for the Spark engine and a Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RAISES	DESCRIPTION
`ValueError`	Malformed arguments.

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	A Spark dataframe.
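
The `path` argument shown above uses the `project.dataset.table` layout. The following sketch assembles such a path from its three parts and rejects empty components early; `bigquery_table_path` is a hypothetical helper and not part of hsfs.

```python
def bigquery_table_path(project, dataset, table):
    """Join the three identifiers into a 'project.dataset.table' path."""
    parts = [project, dataset, table]
    if not all(parts):
        raise ValueError("project, dataset and table must all be non-empty")
    return ".".join(parts)
```

The result could be passed as `conn.read(path=bigquery_table_path(...))`, for example using the connector's `query_project`, `dataset` and `query_table` properties when they are set.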

GcsConnector #

Bases: StorageConnector

This storage connector provides integration to Google Cloud Storage (GCS).

Once you create a connector in the Feature Store, you can read data from a GCS bucket into a Spark dataframe by calling the read API.

Authentication to GCP is handled by uploading the JSON keyfile for service account to the Hopsworks Project. For more information on service accounts and creating keyfile in GCP, read Google Cloud documentation.

The connector also supports Google's optional Customer-Supplied Encryption Key method. The encryption details are stored as Secrets in the Feature Store to keep them secure. Read more about encryption in the Google documentation.

The storage connector uses the Google gcs-connector-hadoop behind the scenes. For more information, check out Google Cloud Storage Connector for Spark and Hadoop.

key_path `property` #

key_path: str | None

JSON keyfile for service account.

algorithm `property` #

algorithm: str | None

Encryption Algorithm.

encryption_key `property` #

encryption_key: str | None

Encryption Key.

encryption_key_hash `property` #

encryption_key_hash: str | None

Encryption Key Hash.

path `property` #

path: str | None

The path of the connector, prefixed with the gs:// file system scheme.

bucket `property` #

bucket: str | None

GCS Bucket.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads GCS path into a dataframe using the storage connector.

To read directly from the default bucket, you can omit the path argument:

conn.read(data_format='spark_formats')

Or, to read objects from the default bucket, provide the object path without the gsutil URI scheme. For example, the following reads from the path gs://bucket_on_connector/Path/object:

conn.read(data_format='spark_formats', path='Path/object')

Or, to read with a full gsutil URI path:

conn.read(data_format='spark_formats', path='gs://BUCKET/DATA')

PARAMETER	DESCRIPTION
`query`	Not relevant for GCS connectors. TYPE: `str \| None` DEFAULT: `None`
`data_format`	Spark data format. TYPE: `str \| None` DEFAULT: `None`
`options`	Spark options. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	GCS path. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Possible values are `"default"`, `"spark"`, `"pandas"`, `"polars"`, `"numpy"` or `"python"`. Defaults to `"default"`, which maps to a Spark dataframe for the Spark engine and a Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RAISES	DESCRIPTION
`ValueError`	Malformed arguments.

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	A Spark dataframe.
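
The three call styles above (no path, a relative object path, and a full `gs://` URI) can be pictured as resolving to a single location. The sketch below only illustrates that documented behaviour; it is not the library's implementation, and `gcs_uri` is a hypothetical name.

```python
def gcs_uri(bucket, path=None):
    """Resolve a read() path argument to a gs:// location (illustrative)."""
    if not path:
        return f"gs://{bucket}"  # no path: the connector's default bucket
    if path.startswith("gs://"):
        return path  # full URIs are used unchanged
    return f"gs://{bucket}/{path.lstrip('/')}"  # relative to the bucket
```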

prepare_spark #

prepare_spark(path: str | None = None) -> str | None

Prepare Spark to use this Storage Connector.

conn.prepare_spark()
spark.read.format("json").load("gs://bucket/path")
# or
spark.read.format("json").load(conn.prepare_spark("gs://bucket/path"))

PARAMETER	DESCRIPTION
`path`	Path to prepare for reading from Google cloud storage. TYPE: `str \| None` DEFAULT: `None`

HopsFSConnector #

Bases: StorageConnector

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a path into a dataframe using the HopsFS storage connector.

PARAMETER	DESCRIPTION
`query`	Not used for HopsFS. Kept for interface consistency. TYPE: `str \| None` DEFAULT: `None`
`data_format`	The file format to be read, e.g., `csv`, `parquet`. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the connector. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Path to be read within HopsFS. If the connector has a base path configured, relative paths will be resolved against it. Absolute `hopsfs://` paths are used as-is. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Defaults to "default", which maps to Spark dataframe for the Spark Engine and Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	The read dataframe.

JdbcConnector #

Bases: StorageConnector

connection_string `property` #

connection_string: str | None

JDBC connection string.

arguments `property` #

arguments: dict[str, Any] | None

Additional JDBC arguments.

When running hsfs with PySpark/Spark in Hopsworks, the driver is automatically provided in the classpath but you need to set the driver argument to com.mysql.cj.jdbc.Driver when creating the Storage Connector.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a query into a dataframe using the storage connector.

PARAMETER	DESCRIPTION
`query`	A SQL query to be read. TYPE: `str \| None` DEFAULT: `None`
`data_format`	Not relevant for JDBC based connectors. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the JDBC connector. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Not relevant for JDBC based connectors. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Possible values are `"default"`, `"spark"`, `"pandas"`, `"polars"`, `"numpy"` or `"python"`. Defaults to `"default"`, which maps to a Spark dataframe for the Spark engine and a Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	`DataFrame`.

KafkaConnector #

Bases: StorageConnector

bootstrap_servers `property` #

bootstrap_servers: list[str] | None

Bootstrap servers string.

security_protocol `property` #

security_protocol: str | None

Security protocol used to communicate with the Kafka brokers.

ssl_truststore_location `property` #

ssl_truststore_location: str | None

Location of the SSL truststore file.

ssl_keystore_location `property` #

ssl_keystore_location: str | None

Location of the SSL keystore file.

ssl_endpoint_identification_algorithm `property` #

ssl_endpoint_identification_algorithm: str | None

SSL endpoint identification algorithm.

options `property` #

options: dict[str, Any]

Additional Kafka options.

create_pem_files #

create_pem_files(kafka_options: dict[str, Any]) -> None

Create PEM (Privacy Enhanced Mail) files for Kafka SSL authentication.

This method writes the necessary PEM files for SSL authentication with Kafka, using the provided keystore and truststore locations and passwords. The generated file paths are stored as the following instance variables:

- self.ca_chain_path: Path to the generated CA chain PEM file.
- self.client_cert_path: Path to the generated client certificate PEM file.
- self.client_key_path: Path to the generated client key PEM file.

These files are used for configuring secure Kafka connections (e.g., with Spark or confluent_kafka). The method is idempotent and will only create the files once per connector instance.

PARAMETER	DESCRIPTION
`kafka_options`	A dictionary containing the Kafka configuration options, including keystore and truststore locations and passwords. TYPE: `dict[str, Any]`

kafka_options #

kafka_options(distribute: bool = True) -> dict[str, Any]

Return prepared options to be passed to Kafka, based on the additional arguments.

See https://kafka.apache.org/documentation/.

PARAMETER	DESCRIPTION
`distribute`	Whether to distribute the SSL certificates to the cluster nodes. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`dict[str, Any]`	A dictionary containing the configuration options for kafka.

confluent_options #

confluent_options() -> dict[str, Any]

Return prepared options to be passed to confluent_kafka, based on the provided Apache Spark configuration.

Right now only producer values with Importance >= medium are implemented.

See https://docs.confluent.io/platform/current/clients/librdkafka/html/md_CONFIGURATION.html.

RETURNS	DESCRIPTION
`dict[str, Any]`	A dictionary containing the configuration options for confluent_kafka.

spark_options #

spark_options() -> dict[str, Any]

Return prepared options to be passed to Spark, based on the additional arguments.

This is done by adding the 'kafka.' prefix to each key of kafka_options.

See https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations.
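
The prefixing step described above amounts to a dictionary comprehension. This is a minimal sketch of the idea rather than the library's own code; `add_kafka_prefix` is a hypothetical name.

```python
def add_kafka_prefix(kafka_options):
    """Prefix every Kafka option key with 'kafka.', as Spark expects."""
    return {f"kafka.{key}": value for key, value in kafka_options.items()}
```

For example, `bootstrap.servers` becomes `kafka.bootstrap.servers`, which is the form used by Spark's Kafka source.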

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> None

Failure

This operation is not supported. Use read_stream instead to read a Kafka stream into a streaming Spark Dataframe.

RAISES	DESCRIPTION
`NotImplementedError`	Always, since this operation is not supported.

read_stream #

read_stream(
    topic: str,
    topic_pattern: bool = False,
    message_format: str = "avro",
    schema: str | None = None,
    options: dict[str, Any] | None = None,
    include_metadata: bool = False,
) -> TypeVar("pyspark.sql.DataFrame") | TypeVar(
    "pyspark.sql.streaming.StreamingQuery"
)

Reads a Kafka stream from a topic or multiple topics into a Dataframe.

Engine Support

Spark only

Reading from data streams using Pandas/Python as engine is currently not supported. Python/Pandas has no notion of streaming.

PARAMETER	DESCRIPTION
`topic`	Name or pattern of the topic(s) to subscribe to. TYPE: `str`
`topic_pattern`	Flag to indicate if `topic` string is a pattern. TYPE: `bool` DEFAULT: `False`
`message_format`	The format of the messages to use for decoding. Can be `"avro"` or `"json"`. TYPE: `str` DEFAULT: `'avro'`
`schema`	Optional schema, to use for decoding, can be an Avro schema string for `"avro"` message format, or for JSON encoding a Spark StructType schema, or a DDL formatted string. TYPE: `str \| None` DEFAULT: `None`
`options`	Additional options as key/value string pairs to be passed to Spark. Defaults to `{}`. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`include_metadata`	Indicate whether to return additional metadata fields from messages in the stream. Otherwise, only the decoded value fields are returned. TYPE: `bool` DEFAULT: `False`

RAISES	DESCRIPTION
`ValueError`	Malformed arguments.

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.sql.streaming.StreamingQuery')`	A Spark streaming dataframe.
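
A call that decodes JSON messages with a DDL-formatted schema might look like the sketch below. It only uses the parameters listed above; the wrapper name `read_json_topic` is hypothetical, and the connector is expected to be a `KafkaConnector` running on the Spark engine.

```python
def read_json_topic(connector, topic, ddl_schema):
    """Subscribe to one topic and decode JSON messages with a DDL schema."""
    return connector.read_stream(
        topic=topic,
        topic_pattern=False,    # `topic` is a literal name, not a pattern
        message_format="json",
        schema=ddl_schema,      # e.g. a DDL string such as "id INT, name STRING"
        include_metadata=True,  # also return metadata fields from the messages
    )
```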

RedshiftConnector #

Bases: StorageConnector

cluster_identifier `property` #

cluster_identifier: str | None

Cluster identifier for redshift cluster.

database_driver `property` #

database_driver: str | None

Database driver for the Redshift cluster.

database_endpoint `property` #

database_endpoint: str | None

Database endpoint for redshift cluster.

database_name `property` #

database_name: str | None

Database name for redshift cluster.

database_port `property` #

database_port: int | str | None

Database port for redshift cluster.

table_name `property` #

table_name: str | None

Table name for redshift cluster.

database_user_name `property` #

database_user_name: str | None

Database username for redshift cluster.

auto_create `property` #

auto_create: bool | None

Whether the database user is created automatically if it does not exist.

database_group `property` #

database_group: str | None

Database group for the Redshift cluster.

database_password `property` #

database_password: str | None

Database password for redshift cluster.

iam_role `property` #

iam_role: Any | None

IAM role.

expiration `property` #

expiration: int | str | None

Cluster temporary credential expiration time.

arguments `property` #

arguments: str | None

Additional JDBC, REDSHIFT, or Snowflake arguments.

connector_options #

connector_options() -> dict[str, Any]

Return options to be passed to an external Redshift connector library.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a table or query into a dataframe using the storage connector.

PARAMETER	DESCRIPTION
`query`	By default, the storage connector will read the table configured together with the connector, if any. It's possible to override this by passing a SQL query here. TYPE: `str \| None` DEFAULT: `None`
`data_format`	Not relevant for JDBC based connectors such as Redshift. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the JDBC connector. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Not relevant for JDBC based connectors such as Redshift. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Possible values are `"default"`, `"spark"`, `"pandas"`, `"polars"`, `"numpy"` or `"python"`. Defaults to `"default"`, which maps to a Spark dataframe for the Spark engine and a Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	`DataFrame`.

refetch #

refetch() -> None

Refetch storage connector in order to retrieve updated temporary credentials.
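
When temporary credentials may have expired, one possible pattern is to refetch the connector immediately before reading. The sketch below assumes that `refetch` updates the connector in place, as its `None` return type suggests; `read_with_fresh_credentials` is a hypothetical helper.

```python
def read_with_fresh_credentials(connector, query):
    """Refresh temporary credentials, then run the query."""
    connector.refetch()  # assumption: updates the connector in place
    return connector.read(query=query)
```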

S3Connector #

Bases: StorageConnector

access_key `property` #

access_key: str | None

Access key.

secret_key `property` #

secret_key: str | None

Secret key.

server_encryption_algorithm `property` #

server_encryption_algorithm: str | None

Encryption algorithm if server-side S3 bucket encryption is enabled.

server_encryption_key `property` #

server_encryption_key: str | None

Encryption key if server-side S3 bucket encryption is enabled.

bucket `property` #

bucket: str | None

Return the bucket for S3 connectors.

region `property` #

region: str | None

Return the region for S3 connectors.

session_token `property` #

session_token: str | None

Session token.

iam_role `property` #

iam_role: str | None

IAM role.

path `property` #

path: str | None

If the connector refers to a path (e.g. S3) - return the path of the connector.

arguments `property` #

arguments: dict[str, Any] | None

Additional spark options for the S3 connector, passed as a dictionary.

These are set using the Spark Options field in the UI when creating the connector. Example: {"fs.s3a.endpoint": "s3.eu-west-1.amazonaws.com", "fs.s3a.path.style.access": "true"}.

prepare_spark #

prepare_spark(path: str | None = None) -> str | None

Prepare Spark to use this Storage Connector.

conn.prepare_spark()

spark.read.format("json").load("s3a://[bucket]/path")

# or
spark.read.format("json").load(conn.prepare_spark("s3a://[bucket]/path"))

PARAMETER	DESCRIPTION
`path`	Path to prepare for reading from cloud storage. TYPE: `str \| None` DEFAULT: `None`

connector_options #

connector_options() -> dict[str, Any]

Return options to be passed to an external S3 connector library.

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a query or a path into a dataframe using the storage connector.

Note, paths are only supported for object stores like S3, HopsFS and ADLS, while queries are meant for JDBC or databases like Redshift and Snowflake.

PARAMETER	DESCRIPTION
`query`	Not relevant for S3 connectors. TYPE: `str \| None` DEFAULT: `None`
`data_format`	The file format of the files to be read, e.g. `csv`, `parquet`. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the S3 connector. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Path within the bucket to be read. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Defaults to "default", which maps to Spark dataframe for the Spark Engine and Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	`DataFrame`.

SnowflakeConnector #

Bases: StorageConnector

url `property` #

url: str | None

URL of the Snowflake storage connector.

warehouse `property` #

warehouse: str | None

Warehouse of the Snowflake storage connector.

database `property` #

database: str | None

Database of the Snowflake storage connector.

user `property` #

user: Any | None

User of the Snowflake storage connector.

password `property` #

password: str | None

Password of the Snowflake storage connector.

token `property` #

token: str | None

OAuth token of the Snowflake storage connector.

schema `property` #

schema: str | None

Schema of the Snowflake storage connector.

table `property` #

table: str | None

Table of the Snowflake storage connector.

role `property` #

role: Any | None

Role of the Snowflake storage connector.

account `property` #

account: str | None

Account of the Snowflake storage connector.

application `property` #

application: Any

Application of the Snowflake storage connector.

options `property` #

options: dict[str, Any] | None

Additional options for the Snowflake storage connector.

private_key `property` #

private_key: str | None

Path to the private key file for key pair authentication.

passphrase `property` #

passphrase: str | None

Passphrase for the private key file.
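
The connector exposes three credential-related properties: `password`, `token` and `private_key`. The sketch below, a hypothetical helper that is not part of hsfs, reports which of them are populated without printing their values. It says nothing about which credential the connector actually uses when several are set.

```python
from typing import Any


def configured_credentials(connector: Any) -> list[str]:
    """Return the names of credential properties that hold a non-empty value."""
    candidates = ("password", "token", "private_key")
    # getattr with a default keeps the helper usable on partial stand-ins.
    return [name for name in candidates if getattr(connector, name, None)]
```

This can be handy when debugging a connector definition, since only the property names are returned and the secrets never reach the logs.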

snowflake_connector_options #

snowflake_connector_options() -> dict[str, Any] | None

Alias for connector_options.

RETURNS	DESCRIPTION
`dict[str, Any] \| None`	A dictionary with the needed arguments for you to connect to a Snowflake database.

connector_options #

connector_options() -> dict[str, Any] | None

Prepare a Python dictionary with the needed arguments for you to connect to a Snowflake database.

The dictionary can be unpacked directly into the `connect` function of the snowflake.connector Python library:

import snowflake.connector

sc = fs.get_storage_connector("snowflake_conn")
ctx = snowflake.connector.connect(**sc.connector_options())

RETURNS	DESCRIPTION
`dict[str, Any] \| None`	A dictionary with the needed arguments for you to connect to a Snowflake database.
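
Because `connector_options` may return `None`, and because connections and cursors should be closed after use, the pattern above can be wrapped in a small helper. The sketch below is illustrative only: `fetch_all` is a hypothetical name, and the `connect` callable is passed in so the helper does not import snowflake.connector itself.

```python
from typing import Any, Callable


def fetch_all(connector: Any, sql: str, connect: Callable[..., Any]) -> list:
    """Run `sql` and return all rows, closing the cursor and connection afterwards."""
    options = connector.connector_options()
    if options is None:
        raise ValueError("connector returned no connection options")
    ctx = connect(**options)
    try:
        cur = ctx.cursor()
        try:
            cur.execute(sql)
            return cur.fetchall()
        finally:
            cur.close()
    finally:
        # Always release the connection, even if the query fails.
        ctx.close()
```

A call would look like `fetch_all(sc, "SELECT 1", snowflake.connector.connect)`, where `sc` is the connector retrieved as in the example above.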

read #

read(
    query: str | None = None,
    data_format: str | None = None,
    options: dict[str, Any] | None = None,
    path: str | None = None,
    dataframe_type: Literal[
        "default",
        "spark",
        "pandas",
        "polars",
        "numpy",
        "python",
    ] = "default",
) -> (
    TypeVar("pyspark.sql.DataFrame")
    | TypeVar("pyspark.RDD")
    | pd.DataFrame
    | np.ndarray
    | pl.DataFrame
)

Reads a table or query into a dataframe using the storage connector.

PARAMETER	DESCRIPTION
`query`	By default, the storage connector will read the table configured together with the connector, if any. It's possible to overwrite this by passing a SQL query here. TYPE: `str \| None` DEFAULT: `None`
`data_format`	Not relevant for Snowflake connectors. TYPE: `str \| None` DEFAULT: `None`
`options`	Any additional key/value options to be passed to the engine. TYPE: `dict[str, Any] \| None` DEFAULT: `None`
`path`	Not relevant for Snowflake connectors. TYPE: `str \| None` DEFAULT: `None`
`dataframe_type`	The type of the returned dataframe. Possible values are `"default"`, `"spark"`, `"pandas"`, `"polars"`, `"numpy"` or `"python"`. Defaults to `"default"`, which maps to a Spark dataframe for the Spark engine and a Pandas dataframe for the Python engine. TYPE: `Literal['default', 'spark', 'pandas', 'polars', 'numpy', 'python']` DEFAULT: `'default'`

RETURNS	DESCRIPTION
`TypeVar('pyspark.sql.DataFrame') \| TypeVar('pyspark.RDD') \| pd.DataFrame \| np.ndarray \| pl.DataFrame`	`DataFrame`.
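
The two read modes described for `query` (the configured table by default, or an explicit SQL query) can be sketched with a thin wrapper. The helper name is hypothetical and the code assumes only the `read` signature documented above.

```python
from typing import Any, Optional


def read_table_or_query(connector: Any, query: Optional[str] = None) -> Any:
    """Read the connector's configured table, or `query` when one is given."""
    kwargs = {"dataframe_type": "pandas"}
    if query is not None:
        # An explicit SQL query overrides the table configured on the connector.
        kwargs["query"] = query
    # data_format and path are omitted: they are not relevant for Snowflake.
    return connector.read(**kwargs)
```

For instance, `read_table_or_query(sc)` reads the configured table, while `read_table_or_query(sc, "SELECT 1")` passes the SQL string through as `query`.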
