hopsworks.core.dataset_api #

DatasetApi #

For backwards compatibility, hopsworks.core.dataset_api.DatasetApi is still available as hsfs.core.dataset_api.DatasetApi and hsml.core.dataset_api.DatasetApi. Use of these aliases is discouraged, as they will be deprecated.

download #

download(
    path: str,
    local_path: str | None = None,
    overwrite: bool | None = False,
    chunk_size: int = DEFAULT_DOWNLOAD_FLOW_CHUNK_SIZE,
) -> str

Download a file from the Hopsworks Filesystem to the current working directory.

import hopsworks

project = hopsworks.login()

dataset_api = project.get_dataset_api()

downloaded_file_path = dataset_api.download("Resources/my_local_file.txt")

PARAMETER DESCRIPTION
path

Path in Hopsworks filesystem to the file.

TYPE: str

local_path

Local filesystem path where the file should be downloaded.

TYPE: str | None DEFAULT: None

overwrite

Overwrite the local file if it exists.

TYPE: bool | None DEFAULT: False

chunk_size

Download chunk size in bytes, defaults to 1 MB.

TYPE: int DEFAULT: DEFAULT_DOWNLOAD_FLOW_CHUNK_SIZE

RETURNS DESCRIPTION
str

The path to the downloaded file.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.
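
Because chunk_size is expressed in bytes, the number of chunked requests for a given file is easy to estimate. A minimal sketch (estimate_chunks is illustrative, not part of the API; the 1 MB default matches the parameter description above):

```python
import math

# Assumed value of DEFAULT_DOWNLOAD_FLOW_CHUNK_SIZE, per the description above
ONE_MB = 1024 * 1024

def estimate_chunks(file_size_bytes: int, chunk_size: int = ONE_MB) -> int:
    """Number of chunked requests needed to transfer a file of the given size."""
    return max(1, math.ceil(file_size_bytes / chunk_size))

# A 10 MB file downloads in ten 1 MB chunks
print(estimate_chunks(10 * 1024 * 1024))
```

A larger chunk_size means fewer round trips at the cost of more memory per request.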

upload #

upload(
    local_path: str,
    upload_path: str,
    overwrite: bool = False,
    chunk_size: int = DEFAULT_UPLOAD_FLOW_CHUNK_SIZE,
    simultaneous_uploads: int = DEFAULT_UPLOAD_SIMULTANEOUS_UPLOADS,
    simultaneous_chunks: int = DEFAULT_UPLOAD_SIMULTANEOUS_CHUNKS,
    max_chunk_retries: int = DEFAULT_UPLOAD_MAX_CHUNK_RETRIES,
    chunk_retry_interval: int = 1,
) -> str

Upload a file or directory to the Hopsworks filesystem.

import hopsworks

project = hopsworks.login()

dataset_api = project.get_dataset_api()

# upload a file to Resources dataset
uploaded_file_path = dataset_api.upload("my_local_file.txt", "Resources")

# upload a directory to Resources dataset
uploaded_file_path = dataset_api.upload("my_dir", "Resources")

PARAMETER DESCRIPTION
local_path

Local path to the file or directory to upload; can be relative or absolute.

TYPE: str

upload_path

Path to the directory in the Hopsworks Filesystem where the file should be uploaded.

TYPE: str

overwrite

Overwrite the file or directory if it exists.

TYPE: bool DEFAULT: False

chunk_size

Upload chunk size in bytes, defaults to 10 MB.

TYPE: int DEFAULT: DEFAULT_UPLOAD_FLOW_CHUNK_SIZE

simultaneous_uploads

Number of files to upload simultaneously when uploading a directory.

TYPE: int DEFAULT: DEFAULT_UPLOAD_SIMULTANEOUS_UPLOADS

simultaneous_chunks

Number of chunks to upload simultaneously for each file.

TYPE: int DEFAULT: DEFAULT_UPLOAD_SIMULTANEOUS_CHUNKS

max_chunk_retries

Maximum number of retries per chunk.

TYPE: int DEFAULT: DEFAULT_UPLOAD_MAX_CHUNK_RETRIES

chunk_retry_interval

Chunk retry interval in seconds.

TYPE: int DEFAULT: 1

RETURNS DESCRIPTION
str

The path to the uploaded file or directory.

RAISES DESCRIPTION
hopsworks.client.exceptions.DatasetException

If the destination path already exists and overwrite is not set to True, or if the upload fails because the HopsFS storage quota is exhausted.

hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.
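
The max_chunk_retries and chunk_retry_interval parameters describe a standard bounded retry loop. A sketch of that pattern (send_with_retries and send_chunk are illustrative stand-ins, not the actual client internals):

```python
import time

def send_with_retries(send_chunk, max_chunk_retries: int = 1, chunk_retry_interval: float = 1.0):
    """Call send_chunk(), retrying up to max_chunk_retries extra times on failure,
    sleeping chunk_retry_interval seconds between attempts."""
    attempts = 0
    while True:
        try:
            return send_chunk()
        except Exception:
            if attempts >= max_chunk_retries:
                raise  # retries exhausted: surface the last error
            attempts += 1
            time.sleep(chunk_retry_interval)
```

Raising max_chunk_retries helps with transient network failures on large uploads; the interval spaces the attempts out.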

exists #

exists(path: str) -> bool

Check if a file exists in the Hopsworks Filesystem.

PARAMETER DESCRIPTION
path

Path to check.

TYPE: str

RETURNS DESCRIPTION
bool

True if the path exists, otherwise False.
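
A common pattern is to guard a removal with exists. Sketched here with a generic fs object standing in for dataset_api (remove_if_exists is a hypothetical helper, not part of the API):

```python
def remove_if_exists(fs, path: str) -> bool:
    """Remove path via fs if it exists; return True if a removal happened."""
    if fs.exists(path):
        fs.remove(path)
        return True
    return False
```

With a real dataset_api the same call shape applies: remove_if_exists(dataset_api, "Resources/old.txt").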

remove #

remove(path: str)

Remove a path in the Hopsworks Filesystem.

PARAMETER DESCRIPTION
path

Path to remove.

TYPE: str

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

mkdir #

mkdir(path: str) -> str

Create a directory in the Hopsworks Filesystem.

import hopsworks

project = hopsworks.login()

dataset_api = project.get_dataset_api()

directory_path = dataset_api.mkdir("Resources/my_dir")

PARAMETER DESCRIPTION
path

Path to directory.

TYPE: str

RETURNS DESCRIPTION
str

Path to the created directory.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

copy #

copy(
    source_path: str,
    destination_path: str,
    overwrite: bool = False,
)

Copy a file or directory in the Hopsworks Filesystem.

import hopsworks

project = hopsworks.login()

dataset_api = project.get_dataset_api()

dataset_api.copy("Resources/myfile.txt", "Logs/myfile.txt")

PARAMETER DESCRIPTION
source_path

The source path to copy.

TYPE: str

destination_path

The destination path.

TYPE: str

overwrite

Overwrite the destination if it exists.

TYPE: bool DEFAULT: False

RAISES DESCRIPTION
hopsworks.client.exceptions.DatasetException

If the destination path already exists and overwrite is not set to True.

hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

move #

move(
    source_path: str,
    destination_path: str,
    overwrite: bool = False,
)

Move a file or directory in the Hopsworks Filesystem.

import hopsworks

project = hopsworks.login()

dataset_api = project.get_dataset_api()

dataset_api.move("Resources/myfile.txt", "Logs/myfile.txt")

PARAMETER DESCRIPTION
source_path

The source path to move.

TYPE: str

destination_path

The destination path.

TYPE: str

overwrite

Overwrite the destination if it exists.

TYPE: bool DEFAULT: False

RAISES DESCRIPTION
hopsworks.client.exceptions.DatasetException

If the destination path already exists and overwrite is not set to True.

hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

upload_feature_group #

upload_feature_group(
    feature_group: FeatureGroup,
    path: str,
    dataframe: pd.DataFrame,
)

Upload a dataframe to a path in Parquet format using the feature group's metadata.

Warning

This method is a legacy method kept for backwards-compatibility; do not use it in new code.

PARAMETER DESCRIPTION
feature_group

The feature group metadata to use for the upload.

TYPE: FeatureGroup

path

The path to upload the dataframe to.

TYPE: str

dataframe

The dataframe to upload.

TYPE: pd.DataFrame

list #

list(
    path: str, offset: int = 0, limit: int = 1000
) -> list[str]

List the files and directories from a path in the Hopsworks Filesystem.

import hopsworks

project = hopsworks.login()

dataset_api = project.get_dataset_api()

# list all files in the Resources dataset
files = dataset_api.list("/Resources")

# list all datasets in the project
files = dataset_api.list("/")

PARAMETER DESCRIPTION
path

Path in Hopsworks filesystem to the directory.

TYPE: str

offset

The number of entities to skip.

TYPE: int DEFAULT: 0

limit

Maximum number of entities to return.

TYPE: int DEFAULT: 1000

RETURNS DESCRIPTION
list[str]

List of paths to files and directories under the provided path.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.
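
The offset and limit parameters support paging through large directories. A sketch of the paging pattern, with page_fn standing in for dataset_api.list (list_all is a hypothetical helper, not part of the API):

```python
def list_all(page_fn, path: str, page_size: int = 1000) -> list:
    """Collect every entry under path by calling
    page_fn(path, offset=..., limit=...) until a short page is returned."""
    entries, offset = [], 0
    while True:
        page = page_fn(path, offset=offset, limit=page_size)
        entries.extend(page)
        if len(page) < page_size:
            return entries  # short page: no more entries
        offset += page_size
```

With a real dataset_api, page_fn would be dataset_api.list and page_size would stay within the backend's limit.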

read_content #

read_content(
    path: str, dataset_type: str = "DATASET"
) -> dict | None

Read the content of a file.

PARAMETER DESCRIPTION
path

The path to the file to read.

TYPE: str

dataset_type

The type of dataset, either DATASET or HIVEDB; defaults to DATASET. The HIVEDB type is used to read files from Apache Hive.

TYPE: str DEFAULT: 'DATASET'

RETURNS DESCRIPTION
dict | None

An object with a content attribute containing the file content as bytes, or None if the file was not found.
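
Since the result exposes the raw bytes through its content attribute, decoding to text follows the usual pattern. A sketch with a SimpleNamespace standing in for a real read_content result (content_as_text is a hypothetical helper, not part of the API):

```python
from types import SimpleNamespace

def content_as_text(result, encoding="utf-8"):
    """Decode the content bytes of a read_content-style result;
    return None if the file was not found (result is None)."""
    if result is None:
        return None
    return result.content.decode(encoding)

# SimpleNamespace stands in for a real read_content result
fake = SimpleNamespace(content=b"hello, hopsworks")
print(content_as_text(fake))  # hello, hopsworks
```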

chmod #

chmod(remote_path: str, permissions: str) -> dict

Change permissions of a file or a directory in the Hopsworks Filesystem.

PARAMETER DESCRIPTION
remote_path

Path to change the permissions of.

TYPE: str

permissions

Permissions string, for example "u+x".

TYPE: str

RETURNS DESCRIPTION
dict

The updated dataset metadata.

TYPE: dict

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.

unzip #

unzip(
    remote_path: str,
    block: bool = False,
    timeout: int | None = 120,
) -> bool

Unzip an archive in the dataset.

PARAMETER DESCRIPTION
remote_path

Path to file or directory to unzip.

TYPE: str

block

Whether the operation should block until complete.

TYPE: bool DEFAULT: False

timeout

Timeout in seconds when blocking, defaults to 120; if None, the wait is unbounded.

TYPE: int | None DEFAULT: 120

RETURNS DESCRIPTION
bool

Whether the operation completed within the specified timeout; if non-blocking, always True.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.
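
The block and timeout parameters amount to polling until completion or a deadline. A sketch of that pattern (wait_until_done and is_done are illustrative stand-ins, not the client's actual implementation):

```python
import time

def wait_until_done(is_done, timeout=120, poll_interval=1.0):
    """Poll is_done() until it returns True or the timeout elapses.
    Returns True on completion, False on timeout; waits indefinitely if timeout is None."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while not is_done():
        if deadline is not None and time.monotonic() >= deadline:
            return False  # deadline passed before completion
        time.sleep(poll_interval)
    return True
```

The same semantics apply to zip below: a non-blocking call returns immediately and reports True regardless of progress.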

zip #

zip(
    remote_path: str,
    destination_path: str | None = None,
    block: bool = False,
    timeout: int | None = 120,
) -> bool

Zip a file or directory in the dataset.

PARAMETER DESCRIPTION
remote_path

Path to file or directory to zip.

TYPE: str

destination_path

Destination path for the zip archive.

TYPE: str | None DEFAULT: None

block

Whether the operation should block until complete.

TYPE: bool DEFAULT: False

timeout

Timeout in seconds when blocking, defaults to 120; if None, the wait is unbounded.

TYPE: int | None DEFAULT: 120

RETURNS DESCRIPTION
bool

Whether the operation completed within the specified timeout; if non-blocking, always True.

RAISES DESCRIPTION
hopsworks.client.exceptions.RestAPIError

If the backend encounters an error when handling the request.