Datasets API#

Handle#

[source]

get_dataset_api#

Project.get_dataset_api()

Get the dataset API handle for the project.

Returns

DatasetApi: the dataset API handle


Methods#

[source]

chmod#

DatasetApi.chmod(remote_path, permissions)

Change permissions of a file or a directory in the Hopsworks Filesystem.

Arguments

  • remote_path str: path to change the permissions of.
  • permissions str: permissions string, for example "u+x".

Returns

dict: the updated dataset metadata

Raises

  • hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request
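A minimal sketch in the style of the other examples on this page. The path is hypothetical, and a running Hopsworks cluster with a valid API key is required, so the calls are wrapped in a function rather than executed at import time:

```python
def chmod_example():
    # Requires a running Hopsworks cluster; hopsworks.login() reads the
    # API key from the environment or prompts for it.
    import hopsworks

    project = hopsworks.login()
    dataset_api = project.get_dataset_api()

    # Make a script in the Resources dataset executable for its owner.
    # "Resources/my_script.sh" is a hypothetical path.
    metadata = dataset_api.chmod("Resources/my_script.sh", "u+x")
    return metadata  # dict with the updated dataset metadata
```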

[source]

copy#

DatasetApi.copy(source_path, destination_path, overwrite=False)

Copy a file or directory in the Hopsworks Filesystem.

import hopsworks

project = hopsworks.login()

dataset_api = project.get_dataset_api()

directory_path = dataset_api.copy("Resources/myfile.txt", "Logs/myfile.txt")

Arguments

  • source_path str: the source path to copy
  • destination_path str: the destination path
  • overwrite bool: overwrite the destination if it exists

Raises

  • hopsworks.client.exceptions.DatasetException: If the destination path already exists and overwrite is not set to True
  • hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request

[source]

download#

DatasetApi.download(path, local_path=None, overwrite=False, chunk_size=1048576)

Download a file from the Hopsworks Filesystem; by default it is saved to the current working directory.

import hopsworks

project = hopsworks.login()

dataset_api = project.get_dataset_api()

downloaded_file_path = dataset_api.download("Resources/my_local_file.txt")

Arguments

  • path str: path in Hopsworks filesystem to the file
  • local_path str | None: path in the local filesystem where to save the file
  • overwrite bool | None: overwrite the local file if it exists
  • chunk_size int: download chunk size in bytes. Default 1 MB

Returns

str: Path to downloaded file

Raises

  • hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request

[source]

exists#

DatasetApi.exists(path)

Check if a file exists in the Hopsworks Filesystem.

Arguments

  • path str: path to check

Returns

bool: True if exists, otherwise False
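A sketch combining exists with download, in the style of the other examples here. The remote path is hypothetical and a running Hopsworks cluster is required, so the calls are wrapped in a function:

```python
def exists_then_download():
    # Requires a running Hopsworks cluster with a valid API key.
    import hopsworks

    project = hopsworks.login()
    dataset_api = project.get_dataset_api()

    # Check for the file before downloading, to avoid a failed request.
    # "Resources/report.csv" is a hypothetical path.
    remote = "Resources/report.csv"
    if dataset_api.exists(remote):
        return dataset_api.download(remote, overwrite=True)
    return None
```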


[source]

list#

DatasetApi.list(path, offset=0, limit=1000)

List the files and directories from a path in the Hopsworks Filesystem.

import hopsworks

project = hopsworks.login()

dataset_api = project.get_dataset_api()

# list all files in the Resources dataset
files = dataset_api.list("/Resources")

# list all datasets in the project
files = dataset_api.list("/")

Arguments

  • path str: path in Hopsworks filesystem to the directory
  • offset int: the number of entities to skip
  • limit int: max number of the returned entities

Returns

list[str]: list of paths to the files and directories under the provided path

Raises

  • hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request

[source]

mkdir#

DatasetApi.mkdir(path)

Create a directory in the Hopsworks Filesystem.

import hopsworks

project = hopsworks.login()

dataset_api = project.get_dataset_api()

directory_path = dataset_api.mkdir("Resources/my_dir")

Arguments

  • path str: path to directory

Returns

str: Path to created directory

Raises

  • hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request

[source]

move#

DatasetApi.move(source_path, destination_path, overwrite=False)

Move a file or directory in the Hopsworks Filesystem.

import hopsworks

project = hopsworks.login()

dataset_api = project.get_dataset_api()

directory_path = dataset_api.move("Resources/myfile.txt", "Logs/myfile.txt")

Arguments

  • source_path str: the source path to move
  • destination_path str: the destination path
  • overwrite bool: overwrite the destination if it exists

Raises

  • hopsworks.client.exceptions.DatasetException: If the destination path already exists and overwrite is not set to True
  • hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request

[source]

read_content#

DatasetApi.read_content(path, dataset_type="DATASET")

Read the content of a file.

Arguments

  • path str: The path to the file to read.
  • dataset_type str: The type of dataset, can be DATASET or HIVEDB; defaults to DATASET. HIVEDB type is used to read files from Apache Hive.

Returns

An object whose content attribute contains the file content as bytes, or None if the file was not found.
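A sketch of reading a small text file without downloading it to disk. The path is hypothetical, decoding assumes a UTF-8 text file, and a running Hopsworks cluster is required, so the calls are wrapped in a function:

```python
def read_content_example():
    # Requires a running Hopsworks cluster with a valid API key.
    import hopsworks

    project = hopsworks.login()
    dataset_api = project.get_dataset_api()

    # "Resources/notes.txt" is a hypothetical path.
    response = dataset_api.read_content("Resources/notes.txt")
    if response is None:
        return None  # file was not found
    # The content attribute holds the raw bytes; decode assuming UTF-8 text.
    return response.content.decode("utf-8")
```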


[source]

remove#

DatasetApi.remove(path)

Remove a path in the Hopsworks Filesystem.

Arguments

  • path str: path to remove

Raises

  • hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request
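A sketch that guards remove with exists, so deleting an absent path is a no-op rather than a failed request. The path is hypothetical and a running Hopsworks cluster is required, so the calls are wrapped in a function:

```python
def remove_if_exists():
    # Requires a running Hopsworks cluster with a valid API key.
    import hopsworks

    project = hopsworks.login()
    dataset_api = project.get_dataset_api()

    # "Resources/old_dir" is a hypothetical path; remove() deletes
    # files and directories alike.
    path = "Resources/old_dir"
    if dataset_api.exists(path):
        dataset_api.remove(path)
```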

[source]

unzip#

DatasetApi.unzip(remote_path, block=False, timeout=120)

Unzip an archive in the dataset.

Arguments

  • remote_path str: path to the file or directory to unzip.
  • block bool: whether to block until the operation completes, defaults to False.
  • timeout int | None: timeout in seconds when blocking, defaults to 120; if None, block indefinitely.

Returns

bool: whether the operation completed within the specified timeout; if non-blocking, always True.

Raises

  • hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request
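A sketch of a blocking unzip with an explicit timeout. The archive path and the 300-second limit are assumptions, and a running Hopsworks cluster is required, so the calls are wrapped in a function:

```python
def unzip_blocking():
    # Requires a running Hopsworks cluster with a valid API key.
    import hopsworks

    project = hopsworks.login()
    dataset_api = project.get_dataset_api()

    # Block for up to 5 minutes while the archive is extracted.
    # "Resources/data.zip" is a hypothetical path.
    finished = dataset_api.unzip("Resources/data.zip", block=True, timeout=300)
    if not finished:
        # block=True returned False: the operation did not finish in time.
        raise TimeoutError("unzip did not complete within 300 seconds")
    return finished
```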

[source]

upload#

DatasetApi.upload(
    local_path,
    upload_path,
    overwrite=False,
    chunk_size=10485760,
    simultaneous_uploads=3,
    simultaneous_chunks=3,
    max_chunk_retries=1,
    chunk_retry_interval=1,
)

Upload a file or directory to the Hopsworks filesystem.

import hopsworks

project = hopsworks.login()

dataset_api = project.get_dataset_api()

# upload a file to Resources dataset
uploaded_file_path = dataset_api.upload("my_local_file.txt", "Resources")

# upload a directory to Resources dataset
uploaded_file_path = dataset_api.upload("my_dir", "Resources")

Arguments

  • local_path str: local path to file or directory to upload, can be relative or absolute
  • upload_path str: path to directory where to upload the file in Hopsworks Filesystem
  • overwrite bool: overwrite file or directory if exists
  • chunk_size int: upload chunk size in bytes. Default 10 MB
  • simultaneous_uploads int: number of files to upload simultaneously when uploading a directory. Default 3
  • simultaneous_chunks int: number of chunks to upload simultaneously for each file. Default 3
  • max_chunk_retries int: maximum number of retries per chunk. Default 1
  • chunk_retry_interval int: chunk retry interval in seconds. Default 1

Returns

str: Path to uploaded file or directory

Raises

  • hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request

[source]

upload_feature_group#

DatasetApi.upload_feature_group(feature_group, path, dataframe)

Upload a dataframe to a path in Parquet format using the feature group's metadata.

Note

This method is a legacy method kept for backwards-compatibility; do not use it in new code.


[source]

zip#

DatasetApi.zip(remote_path, destination_path=None, block=False, timeout=120)

Zip a file or directory in the dataset.

Arguments

  • remote_path str: path to the file or directory to zip.
  • destination_path str | None: path where to place the zip archive, defaults to None.
  • block bool: whether to block until the operation completes, defaults to False.
  • timeout int | None: timeout in seconds when blocking, defaults to 120; if None, block indefinitely.

Returns

bool: whether the operation completed within the specified timeout; if non-blocking, always True.

Raises

  • hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request
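A sketch of zipping a directory and waiting for the backend to finish. The source and destination paths are hypothetical, and a running Hopsworks cluster is required, so the calls are wrapped in a function:

```python
def zip_directory():
    # Requires a running Hopsworks cluster with a valid API key.
    import hopsworks

    project = hopsworks.login()
    dataset_api = project.get_dataset_api()

    # Zip "Resources/my_dir" and place the archive under "Logs"
    # (both paths are hypothetical), blocking until the backend finishes.
    finished = dataset_api.zip(
        "Resources/my_dir",
        destination_path="Logs",
        block=True,
        timeout=120,
    )
    return finished  # False if the operation timed out
```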