Datasets API#
Handle#
get_dataset_api#
Project.get_dataset_api()
Get the dataset API for the project.
Returns
DatasetApi: The Datasets Api handle
Methods#
chmod#
DatasetApi.chmod(remote_path, permissions)
Change permissions of a file or a directory in the Hopsworks Filesystem.
Arguments
- remote_path str: path to change the permissions of.
- permissions str: permissions string, for example "u+x".
Returns
dict: the updated dataset metadata
Raises
hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request
copy#
DatasetApi.copy(source_path, destination_path, overwrite=False)
Copy a file or directory in the Hopsworks Filesystem.
import hopsworks
project = hopsworks.login()
dataset_api = project.get_dataset_api()
directory_path = dataset_api.copy("Resources/myfile.txt", "Logs/myfile.txt")
Arguments
- source_path str: the source path to copy
- destination_path str: the destination path
- overwrite bool: overwrite destination if exists
Raises
hopsworks.client.exceptions.DatasetException: If the destination path already exists and overwrite is not set to True
hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request
download#
DatasetApi.download(path, local_path=None, overwrite=False, chunk_size=1048576)
Download a file from the Hopsworks Filesystem to the current working directory.
import hopsworks
project = hopsworks.login()
dataset_api = project.get_dataset_api()
downloaded_file_path = dataset_api.download("Resources/my_local_file.txt")
Arguments
- path str: path in the Hopsworks Filesystem to the file
- local_path str | None: path in the local filesystem where to download the file
- overwrite bool | None: overwrite the local file if it exists
- chunk_size int: download chunk size in bytes. Default 1 MB
Returns
str: Path to downloaded file
Raises
hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request
exists#
DatasetApi.exists(path)
Check if a file exists in the Hopsworks Filesystem.
Arguments
- path
str: path to check
Returns
bool: True if exists, otherwise False
list#
DatasetApi.list(path, offset=0, limit=1000)
List the files and directories from a path in the Hopsworks Filesystem.
import hopsworks
project = hopsworks.login()
dataset_api = project.get_dataset_api()
# list all files in the Resources dataset
files = dataset_api.list("/Resources")
# list all datasets in the project
files = dataset_api.list("/")
Arguments
- path str: path in the Hopsworks Filesystem to the directory
- offset int: the number of entries to skip
- limit int: the maximum number of entries to return
Returns
list[str]: List of paths to the files and directories under the provided path
Raises
hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request
mkdir#
DatasetApi.mkdir(path)
Create a directory in the Hopsworks Filesystem.
import hopsworks
project = hopsworks.login()
dataset_api = project.get_dataset_api()
directory_path = dataset_api.mkdir("Resources/my_dir")
Arguments
- path str: path to directory
Returns
str: Path to created directory
Raises
hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request
move#
DatasetApi.move(source_path, destination_path, overwrite=False)
Move a file or directory in the Hopsworks Filesystem.
import hopsworks
project = hopsworks.login()
dataset_api = project.get_dataset_api()
directory_path = dataset_api.move("Resources/myfile.txt", "Logs/myfile.txt")
Arguments
- source_path str: the source path to move
- destination_path str: the destination path
- overwrite bool: overwrite destination if exists
Raises
hopsworks.client.exceptions.DatasetException: If the destination path already exists and overwrite is not set to True
hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request
read_content#
DatasetApi.read_content(path, dataset_type="DATASET")
Read the content of a file.
Arguments
- path str: The path to the file to read.
- dataset_type str: The type of dataset, can be DATASET or HIVEDB; defaults to DATASET. The HIVEDB type is used to read files from Apache Hive.
Returns
An object with a content attribute containing the file content as bytes, or None if the file was not found.
remove#
DatasetApi.remove(path)
Remove a path in the Hopsworks Filesystem.
Arguments
- path
str: path to remove
Raises
hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request
unzip#
DatasetApi.unzip(remote_path, block=False, timeout=120)
Unzip an archive in the dataset.
Arguments
- remote_path str: path to file or directory to unzip.
- block bool: if the operation should be blocking until complete, defaults to False.
- timeout int | None: timeout in seconds for the blocking, defaults to 120; if None, the blocking is unbounded.
Returns
bool: whether the operation completed within the specified timeout; if non-blocking, always returns True.
Raises
hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request
upload#
DatasetApi.upload(
local_path,
upload_path,
overwrite=False,
chunk_size=10485760,
simultaneous_uploads=3,
simultaneous_chunks=3,
max_chunk_retries=1,
chunk_retry_interval=1,
)
Upload a file or directory to the Hopsworks filesystem.
import hopsworks
project = hopsworks.login()
dataset_api = project.get_dataset_api()
# upload a file to Resources dataset
uploaded_file_path = dataset_api.upload("my_local_file.txt", "Resources")
# upload a directory to Resources dataset
uploaded_file_path = dataset_api.upload("my_dir", "Resources")
Arguments
- local_path str: local path to file or directory to upload, can be relative or absolute
- upload_path str: path to the directory in the Hopsworks Filesystem where to upload the file
- overwrite bool: overwrite file or directory if exists
- chunk_size int: upload chunk size in bytes. Default 10 MB
- simultaneous_chunks int: number of simultaneous chunks to upload for each file upload. Default 3
- simultaneous_uploads int: number of simultaneous files to be uploaded for directories. Default 3
- max_chunk_retries int: maximum retries for a chunk. Default is 1
- chunk_retry_interval int: chunk retry interval in seconds. Default is 1 sec
Returns
str: Path to uploaded file or directory
Raises
hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request
upload_feature_group#
DatasetApi.upload_feature_group(feature_group, path, dataframe)
Upload a dataframe to a path in Parquet format using feature group metadata.
Note
This method is a legacy method kept for backwards-compatibility; do not use it in new code.
zip#
DatasetApi.zip(remote_path, destination_path=None, block=False, timeout=120)
Zip a file or directory in the dataset.
Arguments
- remote_path str: path to file or directory to zip.
- destination_path str | None: path to upload the zip, defaults to None.
- block bool: if the operation should be blocking until complete, defaults to False.
- timeout int | None: timeout in seconds for the blocking, defaults to 120; if None, the blocking is unbounded.
Returns
bool: whether the operation completed within the specified timeout; if non-blocking, always returns True.
Raises
hopsworks.client.exceptions.RestAPIError: If the backend encounters an error when handling the request