public class FeatureView extends FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
Modifier and Type | Class and Description |
---|---|
static class |
FeatureView.FeatureViewBuilder |
description, extraFilterVersion, features, featureStore, id, labels, LOGGER, name, query, type, vectorServer, version
Constructor and Description |
---|
FeatureView(@NonNull String name,
Integer version,
@NonNull Query query,
String description,
@NonNull FeatureStore featureStore,
List<String> labels) |
Modifier and Type | Method and Description |
---|---|
void |
addTag(String name,
Object value)
Add name/value tag to the feature view.
|
void |
addTrainingDatasetTag(Integer version,
String name,
Object value)
Add name/value tag to the training dataset.
|
void |
clean(FeatureStore featureStore,
String featureViewName,
Integer featureViewVersion)
Delete the feature view and all associated metadata and training data.
|
Integer |
createTrainingData(String startTime,
String endTime,
String description,
DataFormat dataFormat)
Create the metadata for a training dataset and save the corresponding training data into `location`.
|
Integer |
createTrainingData(String startTime,
String endTime,
String description,
DataFormat dataFormat,
Boolean coalesce,
StorageConnector storageConnector,
String location,
Long seed,
StatisticsConfig statisticsConfig,
Map<String,String> writeOptions,
FilterLogic extraFilterLogic,
Filter extraFilter)
Create the metadata for a training dataset and save the corresponding training data into `location`.
|
Integer |
createTrainTestSplit(Float testSize,
String trainStart,
String trainEnd,
String testStart,
String testEnd,
String description,
DataFormat dataFormat)
Create the metadata for a training dataset and save the corresponding training data into `location`.
|
Integer |
createTrainTestSplit(Float testSize,
String trainStart,
String trainEnd,
String testStart,
String testEnd,
String description,
DataFormat dataFormat,
Boolean coalesce,
StorageConnector storageConnector,
String location,
Long seed,
StatisticsConfig statisticsConfig,
Map<String,String> writeOptions,
FilterLogic extraFilterLogic,
Filter extraFilter)
Create the metadata for a training dataset and save the corresponding training data into `location`.
|
Integer |
createTrainValidationTestSplit(Float validationSize,
Float testSize,
String trainStart,
String trainEnd,
String validationStart,
String validationEnd,
String testStart,
String testEnd,
String description,
DataFormat dataFormat)
Create the metadata for a training dataset and save the corresponding training data into `location`.
|
Integer |
createTrainValidationTestSplit(Float validationSize,
Float testSize,
String trainStart,
String trainEnd,
String validationStart,
String validationEnd,
String testStart,
String testEnd,
String description,
DataFormat dataFormat,
Boolean coalesce,
StorageConnector storageConnector,
String location,
Long seed,
StatisticsConfig statisticsConfig,
Map<String,String> writeOptions,
FilterLogic extraFilterLogic,
Filter extraFilter)
Create the metadata for a training dataset and save the corresponding training data into `location`.
|
void |
delete()
Delete current feature view, all associated metadata and training data.
|
void |
deleteAllTrainingDatasets()
Delete all training datasets.
|
void |
deleteTag(String name)
Delete a tag of the feature view.
|
void |
deleteTrainingDataset(Integer version)
Delete a training dataset.
|
void |
deleteTrainingDatasetTag(Integer version,
String name)
Delete a tag of the training dataset.
|
org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> |
getBatchData()
Get all data from the feature view as a batch from the offline feature store.
|
org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> |
getBatchData(String startTime,
String endTime)
Get a batch of data from an event time interval from the offline feature store.
|
org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> |
getBatchData(String startTime,
String endTime,
Map<String,String> readOptions)
Get a batch of data from an event time interval from the offline feature store.
|
String |
getBatchQuery()
Get a query string of the batch query.
|
String |
getBatchQuery(String startTime,
String endTime)
Get a query string of the batch query.
|
HashSet<String> |
getPrimaryKeys()
Get the set of primary key names that are used as keys in the input map for the `getServingVector` method.
|
Object |
getTag(String name)
Get a single tag value of the feature view.
|
Map<String,Object> |
getTags()
Get all tags of the feature view.
|
List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> |
getTrainingData(Integer version)
Get training data created by `featureView.createTrainingData` or `featureView.trainingData`.
|
List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> |
getTrainingData(Integer version,
Map<String,String> readOptions)
Get training data created by `featureView.createTrainingData` or `featureView.trainingData`.
|
Object |
getTrainingDatasetTag(Integer version,
String name)
Get a single tag value of the training dataset.
|
Map<String,Object> |
getTrainingDatasetTags(Integer version)
Get all tags of the training dataset.
|
List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> |
getTrainTestSplit(Integer version)
Get training data created by `featureView.createTrainTestSplit` or `featureView.trainTestSplit`.
|
List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> |
getTrainTestSplit(Integer version,
Map<String,String> readOptions)
Get training data created by `featureView.createTrainTestSplit` or `featureView.trainTestSplit`.
|
List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> |
getTrainValidationTestSplit(Integer version)
Get training data created by `featureView.createTrainValidationTestSplit` or `featureView.trainValidationTestSplit`.
|
List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> |
getTrainValidationTestSplit(Integer version,
Map<String,String> readOptions)
Get training data created by `featureView.createTrainValidationTestSplit` or `featureView.trainValidationTestSplit`.
|
void |
purgeAllTrainingData()
Delete all training datasets in this feature view (data only).
|
void |
purgeTrainingData(Integer version)
Delete a training dataset (data only).
|
void |
recreateTrainingDataset(Integer version,
Map<String,String> writeOptions)
Recreate a training dataset.
|
List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> |
trainingData(String startTime,
String endTime,
String description)
Create the metadata for a training dataset and get the corresponding training data from the offline feature store.
|
List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> |
trainingData(String startTime,
String endTime,
String description,
Long seed,
StatisticsConfig statisticsConfig,
Map<String,String> readOptions,
FilterLogic extraFilterLogic,
Filter extraFilter)
Create the metadata for a training dataset and get the corresponding training data from the offline feature store.
|
List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> |
trainTestSplit(Float testSize,
String trainStart,
String trainEnd,
String testStart,
String testEnd,
String description)
Create the metadata for a training dataset and get the corresponding training data from the offline feature store.
|
List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> |
trainTestSplit(Float testSize,
String trainStart,
String trainEnd,
String testStart,
String testEnd,
String description,
Long seed,
StatisticsConfig statisticsConfig,
Map<String,String> readOptions,
FilterLogic extraFilterLogic,
Filter extraFilter)
Create the metadata for a training dataset and get the corresponding training data from the offline feature store.
|
List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> |
trainValidationTestSplit(Float validationSize,
Float testSize,
String trainStart,
String trainEnd,
String validationStart,
String validationEnd,
String testStart,
String testEnd,
String description)
Create the metadata for a training dataset and get the corresponding training data from the offline feature store.
|
List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> |
trainValidationTestSplit(Float validationSize,
Float testSize,
String trainStart,
String trainEnd,
String validationStart,
String validationEnd,
String testStart,
String testEnd,
String description,
Long seed,
StatisticsConfig statisticsConfig,
Map<String,String> readOptions,
FilterLogic extraFilterLogic,
Filter extraFilter)
Create the metadata for a training dataset and get the corresponding training data from the offline feature store.
|
FeatureView |
update(FeatureView other)
Update the description of the feature view.
|
getFeatureVector, getFeatureVector, getFeatureVectors, getFeatureVectors, initBatchScoring, initServing, initServing, validateTrainTestSplit, validateTrainValidationTestSplit
public void delete() throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// delete feature view
fv.delete();
delete
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
FeatureStoreException
- In case client is not connected to Hopsworks.
IOException
- Generic IO exception.
public void clean(FeatureStore featureStore, String featureViewName, Integer featureViewVersion) throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// delete the feature view and all associated metadata and training data
fv.clean(fs, "fv_name", 1);
clean
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
featureStore
- Feature store metadata object.
featureViewName
- Name of feature view.
featureViewVersion
- Version of feature view.
FeatureStoreException
- In case client is not connected to Hopsworks.
IOException
- Generic IO exception.
public FeatureView update(FeatureView other) throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// update with new description
fv.setDescription("Updated description");
// update the feature view
fv.update(fv);
update
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
other
- Updated FeatureView metadata object.
FeatureStoreException
- In case client is not connected to Hopsworks.
IOException
- Generic IO exception.
public String getBatchQuery() throws FeatureStoreException, IOException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// get batch query
fv.getBatchQuery();
getBatchQuery
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
FeatureStoreException
- If client is not connected to Hopsworks and/or unable to identify
date formats.
IOException
- Generic IO exception.
ParseException
- In case it's unable to parse date strings to date types.
public String getBatchQuery(String startTime, String endTime) throws FeatureStoreException, IOException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// get batch query that will fetch data from Jan 1, 2023 to Jan 31, 2023
fv.getBatchQuery("20230101", "20230131");
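The compact date patterns accepted by `startTime`/`endTime` can be checked locally before calling the API. The following is an illustrative sketch using only `java.time`; the `isValidEventTime` helper is hypothetical and not part of the Hopsworks API:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class DateFormatCheck {
    // The four compact patterns accepted by the FeatureView time-range parameters.
    static final String[] PATTERNS = {"yyyyMMdd", "yyyyMMddHH", "yyyyMMddHHmm", "yyyyMMddHHmmss"};

    // Returns true if the string fully matches one of the accepted patterns.
    static boolean isValidEventTime(String s) {
        for (String p : PATTERNS) {
            try {
                DateTimeFormatter f = DateTimeFormatter.ofPattern(p);
                if (p.equals("yyyyMMdd")) {
                    LocalDate.parse(s, f);      // date-only pattern
                } else {
                    LocalDateTime.parse(s, f);  // date-time patterns
                }
                return true;
            } catch (Exception e) {
                // not this pattern; try the next one
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isValidEventTime("20230101"));       // yyyyMMdd
        System.out.println(isValidEventTime("20220606235959")); // yyyyMMddHHmmss
        System.out.println(isValidEventTime("2023-01-01"));     // not an accepted pattern
    }
}
```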
getBatchQuery
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
startTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
endTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
FeatureStoreException
- If client is not connected to Hopsworks and/or unable to identify the format of the
provided `startTime`/`endTime` date strings.
IOException
- Generic IO exception.
ParseException
- In case it's unable to parse the provided `startTime`/`endTime` strings to date types.
public org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> getBatchData() throws FeatureStoreException, IOException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// get batch data
fv.getBatchData();
getBatchData
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
Dataset<Row>
Spark dataframe of batch data.
FeatureStoreException
- If client is not connected to Hopsworks.
IOException
- Generic IO exception.
ParseException
- In case it's unable to parse date strings to date types.
public org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> getBatchData(String startTime, String endTime) throws FeatureStoreException, IOException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// get batch data from Jan 1, 2023 to Jan 31, 2023
fv.getBatchData("20230101", "20230131");
getBatchData
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
startTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
endTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
Dataset<Row>
Spark dataframe of batch data.
FeatureStoreException
- If client is not connected to Hopsworks and/or unable to identify the format of the
provided `startTime`/`endTime` date strings.
IOException
- Generic IO exception.
ParseException
- In case it's unable to parse the provided `startTime`/`endTime` strings to date types.
public org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> getBatchData(String startTime, String endTime, Map<String,String> readOptions) throws FeatureStoreException, IOException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
Map<String, String> readOptions = new HashMap<String, String>() {{
        put("header", "true");
        put("delimiter", ",");
}};
// get batch data from Jan 1, 2023 to Jan 31, 2023 with the provided read options
fv.getBatchData("20230101", "20230131", readOptions);
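The double-brace initializer used for `readOptions` above creates an anonymous `HashMap` subclass; a plain mutable map built with explicit `put` calls is equivalent and avoids the extra class. A minimal, Hopsworks-independent sketch (the `csvReadOptions` helper name is illustrative only):

```java
import java.util.HashMap;
import java.util.Map;

public class ReadOptionsSketch {
    // Build the same key/value read options without the double-brace idiom.
    static Map<String, String> csvReadOptions() {
        Map<String, String> opts = new HashMap<>();
        opts.put("header", "true");   // first row contains column names
        opts.put("delimiter", ",");   // field separator
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(csvReadOptions());
    }
}
```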
getBatchData
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
startTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
endTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
readOptions
- Additional read options as key/value pairs.
Dataset<Row>
Spark dataframe of batch data.
FeatureStoreException
- If client is not connected to Hopsworks and/or unable to identify the format of the
provided `startTime`/`endTime` date strings.
IOException
- Generic IO exception.
ParseException
- In case it's unable to parse the provided `startTime`/`endTime` strings to date types.
public void addTag(String name, Object value) throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// attach a tag to a feature view
JSONObject value = ...;
fv.addTag("tag_schema", value);
addTag
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
name
- Name of the tag.
value
- Value of the tag. The value of a tag can be any valid JSON - primitives, arrays or JSON objects.
FeatureStoreException
- If client is not connected to Hopsworks.
IOException
- Generic IO exception.
public Map<String,Object> getTags() throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// get tags
fv.getTags();
getTags
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
Map<String, Object>
A map of tag names and values. The value of a tag can be any valid
JSON - primitives, arrays or JSON objects.
FeatureStoreException
- If client is not connected to Hopsworks.
IOException
- Generic IO exception.
public Object getTag(String name) throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// get tag
fv.getTag("tag_name");
getTag
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
name
- Name of the tag.
FeatureStoreException
- If client is not connected to Hopsworks.
IOException
- Generic IO exception.
public void deleteTag(String name) throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// delete tag
fv.deleteTag("tag_name");
deleteTag
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
name
- Name of the tag to be deleted.
FeatureStoreException
- If client is not connected to Hopsworks.
IOException
- Generic IO exception.
public Integer createTrainingData(String startTime, String endTime, String description, DataFormat dataFormat) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// create training dataset
String startTime = "20220101000000";
String endTime = "20220606235959";
String description = "demo training dataset";
fv.createTrainingData(startTime, endTime, description, DataFormat.CSV);
startTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
endTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
description
- A string describing the contents of the training dataset to improve discoverability for
Data Scientists.
dataFormat
- The data format used to save the training dataset.
FeatureStoreException
- If client is not connected to Hopsworks and/or unable to identify the format of the
provided `startTime`/`endTime` date strings.
IOException
- Generic IO exception.
ParseException
- In case it's unable to parse the provided `startTime`/`endTime` strings to date types.
public Integer createTrainingData(String startTime, String endTime, String description, DataFormat dataFormat, Boolean coalesce, StorageConnector storageConnector, String location, Long seed, StatisticsConfig statisticsConfig, Map<String,String> writeOptions, FilterLogic extraFilterLogic, Filter extraFilter) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// create training dataset
String startTime = "20220101000000";
String endTime = "20220606235959";
String description = "demo training dataset";
String location = "";
StatisticsConfig statisticsConfig = new StatisticsConfig(true, true, true, true);
fv.createTrainingData(startTime, endTime, description, DataFormat.CSV, true, null, location, null,
    statisticsConfig, null, null, null);
startTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
endTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
description
- A string describing the contents of the training dataset to improve discoverability for
Data Scientists.
dataFormat
- The data format used to save the training dataset.
coalesce
- If true the training dataset data will be coalesced into a single partition before writing.
The resulting training dataset will be a single file per split.
storageConnector
- Storage connector defining the sink location for the training dataset. If `null` is
provided, the training dataset is materialized on HopsFS.
location
- Path to complement the sink storage connector with, e.g. if the storage connector points to an
S3 bucket, this path can be used to define a sub-directory inside the bucket to place the training
dataset. If an empty string `""` is provided, the training dataset is saved at the root defined by the
storage connector.
seed
- Define a seed to create the random splits with, in order to guarantee reproducibility.
statisticsConfig
- A configuration object to enable descriptive statistics computation for
this feature group: `"correlations"` to turn on feature correlation computation,
`"histograms"` to compute feature value frequencies and `"exact_uniqueness"` to compute
uniqueness, distinctness and entropy. The values should be booleans indicating the
setting. To fully turn off statistics computation pass `statisticsConfig=null`.
writeOptions
- Additional write options as key-value pairs.
extraFilterLogic
- Additional filters (set of Filter objects) to be attached to the training dataset.
The filters will also be applied in `getBatchData`.
extraFilter
- Additional filter to be attached to the training dataset. The filter will also be applied
in `getBatchData`.
FeatureStoreException
- If client is not connected to Hopsworks and/or unable to identify the format of the
provided `startTime`/`endTime` date strings.
IOException
- Generic IO exception.
ParseException
- In case it's unable to parse the provided `startTime`/`endTime` strings to date types.
public Integer createTrainTestSplit(Float testSize, String trainStart, String trainEnd, String testStart, String testEnd, String description, DataFormat dataFormat) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// create training dataset based on time split
String trainStart = "20220101000000";
String trainEnd = "20220630235959";
String testStart = "20220701000000";
String testEnd = "20220830235959";
String description = "demo training dataset";
fv.createTrainTestSplit(null, trainStart, trainEnd, testStart, testEnd, description, DataFormat.CSV);
// or based on random split with 30% of the data in the test set
fv.createTrainTestSplit(0.3f, null, null, null, null, description, DataFormat.CSV);
testSize
- Size of test set.
trainStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
trainEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
description
- A string describing the contents of the training dataset to improve discoverability for
Data Scientists.
dataFormat
- The data format used to save the training dataset.
FeatureStoreException
- If client is not connected to Hopsworks and/or unable to identify the format of the
provided date strings.
IOException
- Generic IO exception.
ParseException
- In case it's unable to parse the provided date strings to date types.
public Integer createTrainTestSplit(Float testSize, String trainStart, String trainEnd, String testStart, String testEnd, String description, DataFormat dataFormat, Boolean coalesce, StorageConnector storageConnector, String location, Long seed, StatisticsConfig statisticsConfig, Map<String,String> writeOptions, FilterLogic extraFilterLogic, Filter extraFilter) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// create training dataset based on time split
String trainStart = "20220101000000";
String trainEnd = "20220630235959";
String testStart = "20220701000000";
String testEnd = "20220830235959";
String description = "demo training dataset";
StorageConnector.S3Connector storageConnector = fs.getS3Connector("s3Connector");
String location = "";
Long seed = 1234L;
Boolean coalesce = true;
StatisticsConfig statisticsConfig = new StatisticsConfig(true, true, true, true);
Map<String, String> writeOptions = new HashMap<String, String>() {{
        put("header", "true");
        put("delimiter", ",");
}};
// define extra filters
Filter leftFtFilter = new Filter();
leftFtFilter.setFeature(new Feature("left_ft_name"));
leftFtFilter.setValue("400");
leftFtFilter.setCondition(SqlFilterCondition.EQUALS);
Filter rightFtFilter = new Filter();
rightFtFilter.setFeature(new Feature("right_ft_name"));
rightFtFilter.setValue("50");
rightFtFilter.setCondition(SqlFilterCondition.EQUALS);
FilterLogic extraFilterLogic = new FilterLogic(SqlFilterLogic.AND, leftFtFilter, rightFtFilter);
Filter extraFilter = new Filter();
extraFilter.setFeature(new Feature("ft_name"));
extraFilter.setValue("100");
extraFilter.setCondition(SqlFilterCondition.GREATER_THAN);
// create training data
fv.createTrainTestSplit(null, trainStart, trainEnd, testStart,
    testEnd, description, DataFormat.CSV, coalesce, storageConnector, location, seed, statisticsConfig,
    writeOptions, extraFilterLogic, extraFilter);
// or based on random split with 20% of the data in the test set
fv.createTrainTestSplit(0.2f, null, null, null, null, description, DataFormat.CSV, coalesce,
    storageConnector, location, seed, statisticsConfig, writeOptions, extraFilterLogic, extraFilter);
testSize
- Size of test set.
trainStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
trainEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
description
- A string describing the contents of the training dataset to improve discoverability for
Data Scientists.
dataFormat
- The data format used to save the training dataset.
coalesce
- If true the training dataset data will be coalesced into a single partition before writing.
The resulting training dataset will be a single file per split.
storageConnector
- Storage connector defining the sink location for the training dataset. If `null` is
provided, the training dataset is materialized on HopsFS.
location
- Path to complement the sink storage connector with, e.g. if the storage connector points to an
S3 bucket, this path can be used to define a sub-directory inside the bucket to place the training
dataset. If an empty string `""` is provided, the training dataset is saved at the root defined by the
storage connector.
seed
- Define a seed to create the random splits with, in order to guarantee reproducibility.
statisticsConfig
- A configuration object to enable descriptive statistics computation for
this feature group: `"correlations"` to turn on feature correlation computation,
`"histograms"` to compute feature value frequencies and `"exact_uniqueness"` to compute
uniqueness, distinctness and entropy. The values should be booleans indicating the
setting. To fully turn off statistics computation pass `statisticsConfig=null`.
writeOptions
- Additional write options as key-value pairs.
extraFilterLogic
- Additional filters (set of Filter objects) to be attached to the training dataset.
The filters will also be applied in `getBatchData`.
extraFilter
- Additional filter to be attached to the training dataset. The filter will also be applied
in `getBatchData`.
FeatureStoreException
- If client is not connected to Hopsworks and/or unable to identify the format of the
provided date strings.
IOException
- Generic IO exception.
ParseException
- In case it's unable to parse the provided date strings to date types.
public Integer createTrainValidationTestSplit(Float validationSize, Float testSize, String trainStart, String trainEnd, String validationStart, String validationEnd, String testStart, String testEnd, String description, DataFormat dataFormat) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// create training dataset based on time split
String trainStart = "20220101000000";
String trainEnd = "20220630235959";
String validationStart = "20220701000000";
String validationEnd = "20220830235959";
String testStart = "20220901000000";
String testEnd = "20220930235959";
String description = "demo training dataset";
fv.createTrainValidationTestSplit(null, null, trainStart, trainEnd, validationStart, validationEnd, testStart,
    testEnd, description, DataFormat.CSV);
// or based on random split with 20% validation and 10% test data
fv.createTrainValidationTestSplit(0.2f, 0.1f, null, null, null, null, null, null, description, DataFormat.CSV);
validationSize
- Size of validation set.
testSize
- Size of test set.
trainStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
trainEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
validationStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
validationEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`,
`yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
description
- A string describing the contents of the training dataset to improve discoverability for
Data Scientists.
dataFormat
- The data format used to save the training dataset.
FeatureStoreException
- If client is not connected to Hopsworks and/or unable to identify the format of the
provided date strings.
IOException
- Generic IO exception.
ParseException
- In case it's unable to parse the provided date strings to date types.
public Integer createTrainValidationTestSplit(Float validationSize, Float testSize, String trainStart, String trainEnd, String validationStart, String validationEnd, String testStart, String testEnd, String description, DataFormat dataFormat, Boolean coalesce, StorageConnector storageConnector, String location, Long seed, StatisticsConfig statisticsConfig, Map<String,String> writeOptions, FilterLogic extraFilterLogic, Filter extraFilter) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// create training dataset based on time split
String trainStart = "20220101000000";
String trainEnd = "20220630235959";
String validationStart = "20220701000000";
String validationEnd = "20220830235959";
String testStart = "20220901000000";
String testEnd = "20220930235959";
String description = "demo training dataset";
StorageConnector.S3Connector storageConnector = fs.getS3Connector("s3Connector");
String location = "";
Long seed = 1234L;
Boolean coalesce = true;
StatisticsConfig statisticsConfig = new StatisticsConfig(true, true, true, true);
Map<String, String> writeOptions = new HashMap<String, String>() {{
put("header", "true");
    put("delimiter", ",");
}};
// define extra filters
Filter leftFtFilter = new Filter();
leftFtFilter.setFeature(new Feature("left_ft_name"));
leftFtFilter.setValue("400");
leftFtFilter.setCondition(SqlFilterCondition.EQUALS);
Filter rightFtFilter = new Filter();
rightFtFilter.setFeature(new Feature("right_ft_name"));
rightFtFilter.setValue("50");
rightFtFilter.setCondition(SqlFilterCondition.EQUALS);
FilterLogic extraFilterLogic = new FilterLogic(SqlFilterLogic.AND, leftFtFilter, rightFtFilter);
Filter extraFilter = new Filter();
extraFilter.setFeature(new Feature("ft_name"));
extraFilter.setValue("100");
extraFilter.setCondition(SqlFilterCondition.GREATER_THAN);
// create training data
fv.createTrainValidationTestSplit(null, null, trainStart, trainEnd, validationStart, validationEnd, testStart,
    testEnd, description, DataFormat.CSV, coalesce, storageConnector, location, seed, statisticsConfig,
    writeOptions, extraFilterLogic, extraFilter);
// or based on random split
fv.createTrainValidationTestSplit(20f, 10f, null, null, null, null, null, null, description, DataFormat.CSV,
    coalesce, storageConnector, location, seed, statisticsConfig, writeOptions, extraFilterLogic, extraFilter);
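A side note on the `writeOptions` map: the anonymous-subclass ("double brace") `HashMap` idiom used in these examples works, but on Java 9+ the same key-value options can be built more simply and immutably with `Map.of` (a general Java note, not Hopsworks-specific):

```java
import java.util.Map;

public class WriteOptionsDemo {
    public static void main(String[] args) {
        // Immutable map of CSV write options, equivalent to the double-brace HashMap idiom.
        Map<String, String> writeOptions = Map.of(
            "header", "true",
            "delimiter", ","
        );
        System.out.println(writeOptions.get("header"));    // prints "true"
        System.out.println(writeOptions.get("delimiter")); // prints ","
    }
}
```

`Map.of` also avoids the hidden anonymous subclass the double-brace idiom creates, which can leak a reference to the enclosing instance.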
validationSize
- Size of validation set.
testSize
- Size of test set.
trainStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
trainEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
validationStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
validationEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
description
- A string describing the contents of the training dataset to improve discoverability for Data Scientists.
dataFormat
- The data format used to save the training dataset.
coalesce
- If true, the training dataset data will be coalesced into a single partition before writing. The resulting training dataset will be a single file per split.
storageConnector
- Storage connector defining the sink location for the training dataset. If `null` is provided, the training dataset is materialized on HopsFS.
location
- Path to complement the sink storage connector with, e.g. if the storage connector points to an S3 bucket, this path can be used to define a sub-directory inside the bucket to place the training dataset. If an empty string `""` is provided, the training dataset is saved at the root defined by the storage connector.
seed
- Define a seed to create the random splits with, in order to guarantee reproducibility.
statisticsConfig
- A configuration object to generally enable descriptive statistics computation for this feature group: `"correlations"` to turn on feature correlation computation, `"histograms"` to compute feature value frequencies, and `"exact_uniqueness"` to compute uniqueness, distinctness and entropy. The values should be booleans indicating the setting. To fully turn off statistics computation pass `statisticsConfig=null`.
writeOptions
- Additional write options as key-value pairs.
extraFilterLogic
- Additional filters (set of Filter objects) to be attached to the training dataset. The filters will also be applied in `getBatchData`.
extraFilter
- Additional filter to be attached to the training dataset. The filter will also be applied in `getBatchData`.
FeatureStoreException
- If Client is not connected to Hopsworks and/or unable to identify the format of the provided date strings.
IOException
- Generic IO exception.
ParseException
- If the provided date strings cannot be parsed into date types.

public void recreateTrainingDataset(Integer version, Map<String,String> writeOptions) throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// define write options
Map<String, String> writeOptions = new HashMap<String, String>() {{
put("header", "true");
    put("delimiter", ",");
}};
//recreate training data
fv.recreateTrainingDataset(1, writeOptions);
version
- Training dataset version.
writeOptions
- Additional write options as key-value pairs.
FeatureStoreException
- If Client is not connected to Hopsworks.
IOException
- Generic IO exception.

public List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> getTrainingData(Integer version) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// get training data
fv.getTrainingData(1);
version
- Training dataset version.
List<Dataset<Row>>
List of DataFrames of features and labels.
FeatureStoreException
- If Client is not connected to Hopsworks and/or unable to identify date formats.
IOException
- Generic IO exception.
ParseException
- If the provided date strings cannot be parsed into date types.

public List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> getTrainingData(Integer version, Map<String,String> readOptions) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// define read options
Map<String, String> readOptions = new HashMap<String, String>() {{
put("header", "true");
    put("delimiter", ",");
}};
// get training data
fv.getTrainingData(1, readOptions);
getTrainingData
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
version
- Training dataset version.
readOptions
- Additional read options as key/value pairs.
List<Dataset<Row>>
List of DataFrames of features and labels.
FeatureStoreException
- If Client is not connected to Hopsworks and/or unable to identify date formats.
IOException
- Generic IO exception.
ParseException
- If the provided date strings cannot be parsed into date types.

public List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> getTrainTestSplit(Integer version) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// get train test split dataframe of features and labels
fv.getTrainTestSplit(1);
version
- Training dataset version.
List<Dataset<Row>>
List of DataFrames of features and labels.
FeatureStoreException
- If Client is not connected to Hopsworks and/or unable to identify date formats.
IOException
- Generic IO exception.
ParseException
- If the provided date strings cannot be parsed into date types.

public List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> getTrainTestSplit(Integer version, Map<String,String> readOptions) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// define additional readOptions
Map<String, String> readOptions = new HashMap<String, String>() {{
put("header", "true");
    put("delimiter", ",");
}};
// get train test split dataframe of features and labels
fv.getTrainTestSplit(1, readOptions);
getTrainTestSplit
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
version
- Training dataset version.
readOptions
- Additional read options as key/value pairs.
List<Dataset<Row>>
List of DataFrames of features and labels.
FeatureStoreException
- If Client is not connected to Hopsworks and/or unable to identify date formats.
IOException
- Generic IO exception.
ParseException
- If the provided date strings cannot be parsed into date types.

public List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> getTrainValidationTestSplit(Integer version) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// get train, validation, test split dataframe of features and labels
fv.getTrainValidationTestSplit(1);
version
- Training dataset version.
List<Dataset<Row>>
List of DataFrames of features and labels.
FeatureStoreException
- If Client is not connected to Hopsworks and/or unable to identify date formats.
IOException
- Generic IO exception.
ParseException
- If the provided date strings cannot be parsed into date types.

public List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> getTrainValidationTestSplit(Integer version, Map<String,String> readOptions) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// define additional readOptions
Map<String, String> readOptions = new HashMap<String, String>() {{
put("header", "true");
    put("delimiter", ",");
}};
// get train, validation, test split dataframe of features and labels
fv.getTrainValidationTestSplit(1, readOptions);
getTrainValidationTestSplit
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
version
- Training dataset version.
readOptions
- Additional read options as key/value pairs.
List<Dataset<Row>>
List of DataFrames of features and labels.
FeatureStoreException
- If Client is not connected to Hopsworks and/or unable to identify date formats.
IOException
- Generic IO exception.
ParseException
- If the provided date strings cannot be parsed into date types.

public List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> trainingData(String startTime, String endTime, String description) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// create training dataset based on time split
String startTime = "20220101000000";
String endTime = "20220630235959";
String description = "demo training dataset";
fv.trainingData(startTime, endTime, description);
startTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
endTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
description
- A string describing the contents of the training dataset to improve discoverability for Data Scientists.
List<Dataset<Row>>
List of DataFrames of features and labels.
FeatureStoreException
- If Client is not connected to Hopsworks and/or unable to identify the format of the provided date strings.
IOException
- Generic IO exception.
ParseException
- If the provided date strings cannot be parsed into date types.

public List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> trainingData(String startTime, String endTime, String description, Long seed, StatisticsConfig statisticsConfig, Map<String,String> readOptions, FilterLogic extraFilterLogic, Filter extraFilter) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// create training dataset based on time split
String startTime = "20220101000000";
String endTime = "20220630235959";
String description = "demo training dataset";
Long seed = 1234L;
StatisticsConfig statisticsConfig = new StatisticsConfig(true, true, true, true);
Map<String, String> readOptions = new HashMap<String, String>() {{
    put("header", "true");
    put("delimiter", ",");
}};
// define extra filters
Filter leftFtFilter = new Filter();
leftFtFilter.setFeature(new Feature("left_ft_name"));
leftFtFilter.setValue("400");
leftFtFilter.setCondition(SqlFilterCondition.EQUALS);
Filter rightFtFilter = new Filter();
rightFtFilter.setFeature(new Feature("right_ft_name"));
rightFtFilter.setValue("50");
rightFtFilter.setCondition(SqlFilterCondition.EQUALS);
FilterLogic extraFilterLogic = new FilterLogic(SqlFilterLogic.AND, leftFtFilter, rightFtFilter);
Filter extraFilter = new Filter();
extraFilter.setFeature(new Feature("ft_name"));
extraFilter.setValue("100");
extraFilter.setCondition(SqlFilterCondition.GREATER_THAN);
// create training data
fv.trainingData(startTime, endTime, description, seed, statisticsConfig, readOptions,
    extraFilterLogic, extraFilter);
startTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
endTime
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
description
- A string describing the contents of the training dataset to improve discoverability for Data Scientists.
seed
- Define a seed to create the random splits with, in order to guarantee reproducibility.
statisticsConfig
- A configuration object to generally enable descriptive statistics computation for this feature group: `"correlations"` to turn on feature correlation computation, `"histograms"` to compute feature value frequencies, and `"exact_uniqueness"` to compute uniqueness, distinctness and entropy. The values should be booleans indicating the setting. To fully turn off statistics computation pass `statisticsConfig=null`.
readOptions
- Additional read options as key/value pairs.
extraFilterLogic
- Additional filters (set of Filter objects) to be attached to the training dataset. The filters will also be applied in `getBatchData`.
extraFilter
- Additional filter to be attached to the training dataset. The filter will also be applied in `getBatchData`.
List<Dataset<Row>>
List of DataFrames of features and labels.
FeatureStoreException
- If Client is not connected to Hopsworks and/or unable to identify the format of the provided date strings.
IOException
- Generic IO exception.
ParseException
- If the provided date strings cannot be parsed into date types.

public List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> trainTestSplit(Float testSize, String trainStart, String trainEnd, String testStart, String testEnd, String description) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// create training dataset based on time split
String trainStart = "20220101000000";
String trainEnd = "20220630235959";
String testStart = "20220701000000";
String testEnd = "20220830235959";
String description = "demo training dataset";
// create training data
fv.trainTestSplit(null, trainStart, trainEnd, testStart, testEnd, description);
// or random split
fv.trainTestSplit(30f, null, null, null, null, description);
testSize
- Size of test set.
trainStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
trainEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
description
- A string describing the contents of the training dataset to improve discoverability for Data Scientists.
List<Dataset<Row>>
List of Spark DataFrames containing training dataset splits.
FeatureStoreException
- If Client is not connected to Hopsworks and/or unable to identify the format of the provided date strings.
IOException
- Generic IO exception.
ParseException
- If the provided date strings cannot be parsed into date types.

public List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> trainTestSplit(Float testSize, String trainStart, String trainEnd, String testStart, String testEnd, String description, Long seed, StatisticsConfig statisticsConfig, Map<String,String> readOptions, FilterLogic extraFilterLogic, Filter extraFilter) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// create training dataset based on time split
String trainStart = "20220101000000";
String trainEnd = "20220630235959";
String testStart = "20220701000000";
String testEnd = "20220830235959";
String description = "demo training dataset";
Long seed = 1234L;
StatisticsConfig statisticsConfig = new StatisticsConfig(true, true, true, true);
Map<String, String> readOptions = new HashMap<String, String>() {{
    put("header", "true");
    put("delimiter", ",");
}};
// define extra filters
Filter leftFtFilter = new Filter();
leftFtFilter.setFeature(new Feature("left_ft_name"));
leftFtFilter.setValue("400");
leftFtFilter.setCondition(SqlFilterCondition.EQUALS);
Filter rightFtFilter = new Filter();
rightFtFilter.setFeature(new Feature("right_ft_name"));
rightFtFilter.setValue("50");
rightFtFilter.setCondition(SqlFilterCondition.EQUALS);
FilterLogic extraFilterLogic = new FilterLogic(SqlFilterLogic.AND, leftFtFilter, rightFtFilter);
Filter extraFilter = new Filter();
extraFilter.setFeature(new Feature("ft_name"));
extraFilter.setValue("100");
extraFilter.setCondition(SqlFilterCondition.GREATER_THAN);
// create training data
fv.trainTestSplit(null, trainStart, trainEnd, testStart, testEnd, description, seed, statisticsConfig,
    readOptions, extraFilterLogic, extraFilter);
// or random split
fv.trainTestSplit(30f, null, null, null, null, description, seed, statisticsConfig, readOptions,
    extraFilterLogic, extraFilter);
testSize
- Size of test set.
trainStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
trainEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
description
- A string describing the contents of the training dataset to improve discoverability for Data Scientists.
seed
- Define a seed to create the random splits with, in order to guarantee reproducibility.
statisticsConfig
- A configuration object to generally enable descriptive statistics computation for this feature group: `"correlations"` to turn on feature correlation computation, `"histograms"` to compute feature value frequencies, and `"exact_uniqueness"` to compute uniqueness, distinctness and entropy. The values should be booleans indicating the setting. To fully turn off statistics computation pass `statisticsConfig=null`.
readOptions
- Additional read options as key/value pairs.
extraFilterLogic
- Additional filters (set of Filter objects) to be attached to the training dataset. The filters will also be applied in `getBatchData`.
extraFilter
- Additional filter to be attached to the training dataset. The filter will also be applied in `getBatchData`.
List<Dataset<Row>>
List of Spark DataFrames containing training dataset splits.
FeatureStoreException
- If Client is not connected to Hopsworks and/or unable to identify the format of the provided date strings.
IOException
- Generic IO exception.
ParseException
- If the provided date strings cannot be parsed into date types.

public List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> trainValidationTestSplit(Float validationSize, Float testSize, String trainStart, String trainEnd, String validationStart, String validationEnd, String testStart, String testEnd, String description) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// create training dataset based on time split
String trainStart = "20220101000000";
String trainEnd = "20220630235959";
String validationStart = "20220701000000";
String validationEnd = "20220830235959";
String testStart = "20220901000000";
String testEnd = "20220930235959";
String description = "demo training dataset";
fv.trainValidationTestSplit(null, null, trainStart, trainEnd, validationStart, validationEnd, testStart,
testEnd, description);
// or based on random split
fv.trainValidationTestSplit(20f, 10f, null, null, null, null, null, null, description);
validationSize
- Size of validation set.
testSize
- Size of test set.
trainStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
trainEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
validationStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
validationEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
description
- A string describing the contents of the training dataset to improve discoverability for Data Scientists.
List<Dataset<Row>>
List of Spark DataFrames containing training dataset splits.
FeatureStoreException
- If Client is not connected to Hopsworks and/or unable to identify the format of the provided date strings.
IOException
- Generic IO exception.
ParseException
- If the provided date strings cannot be parsed into date types.

public List<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>> trainValidationTestSplit(Float validationSize, Float testSize, String trainStart, String trainEnd, String validationStart, String validationEnd, String testStart, String testEnd, String description, Long seed, StatisticsConfig statisticsConfig, Map<String,String> readOptions, FilterLogic extraFilterLogic, Filter extraFilter) throws IOException, FeatureStoreException, ParseException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// create training dataset based on time split
String trainStart = "20220101000000";
String trainEnd = "20220630235959";
String validationStart = "20220701000000";
String validationEnd = "20220830235959";
String testStart = "20220901000000";
String testEnd = "20220930235959";
String description = "demo training dataset";
Long seed = 1234L;
StatisticsConfig statisticsConfig = new StatisticsConfig(true, true, true, true);
Map<String, String> readOptions = new HashMap<String, String>() {{
    put("header", "true");
    put("delimiter", ",");
}};
// define extra filters
Filter leftFtFilter = new Filter();
leftFtFilter.setFeature(new Feature("left_ft_name"));
leftFtFilter.setValue("400");
leftFtFilter.setCondition(SqlFilterCondition.EQUALS);
Filter rightFtFilter = new Filter();
rightFtFilter.setFeature(new Feature("right_ft_name"));
rightFtFilter.setValue("50");
rightFtFilter.setCondition(SqlFilterCondition.EQUALS);
FilterLogic extraFilterLogic = new FilterLogic(SqlFilterLogic.AND, leftFtFilter, rightFtFilter);
Filter extraFilter = new Filter();
extraFilter.setFeature(new Feature("ft_name"));
extraFilter.setValue("100");
extraFilter.setCondition(SqlFilterCondition.GREATER_THAN);
// create training data
fv.trainValidationTestSplit(null, null, trainStart, trainEnd, validationStart, validationEnd, testStart,
testEnd, description, seed, statisticsConfig,
readOptions, extraFilterLogic, extraFilter);
// or based on random split
fv.trainValidationTestSplit(20f, 10f, null, null, null, null, null, null, description, seed, statisticsConfig,
    readOptions, extraFilterLogic, extraFilter);
validationSize
- Size of validation set.
testSize
- Size of test set.
trainStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
trainEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
validationStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
validationEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testStart
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
testEnd
- Datetime string. The String should be formatted in one of the following formats `yyyyMMdd`, `yyyyMMddHH`, `yyyyMMddHHmm`, or `yyyyMMddHHmmss`.
description
- A string describing the contents of the training dataset to improve discoverability for Data Scientists.
seed
- Define a seed to create the random splits with, in order to guarantee reproducibility.
statisticsConfig
- A configuration object to generally enable descriptive statistics computation for this feature group: `"correlations"` to turn on feature correlation computation, `"histograms"` to compute feature value frequencies, and `"exact_uniqueness"` to compute uniqueness, distinctness and entropy. The values should be booleans indicating the setting. To fully turn off statistics computation pass `statisticsConfig=null`.
readOptions
- Additional read options as key/value pairs.
extraFilterLogic
- Additional filters (set of Filter objects) to be attached to the training dataset. The filters will also be applied in `getBatchData`.
extraFilter
- Additional filter to be attached to the training dataset. The filter will also be applied in `getBatchData`.
List<Dataset<Row>>
List of Spark DataFrames containing training dataset splits.
FeatureStoreException
- If Client is not connected to Hopsworks and/or unable to identify the format of the provided date strings.
IOException
- Generic IO exception.
ParseException
- If the provided date strings cannot be parsed into date types.

public void purgeTrainingData(Integer version) throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// Delete a training dataset version 1
fv.purgeTrainingData(1);
purgeTrainingData
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
version
- Version of the training dataset to be removed.
FeatureStoreException
- If Client is not connected to Hopsworks.
IOException
- Generic IO exception.

public void purgeAllTrainingData() throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// Purge all training data in this feature view.
fv.purgeAllTrainingData();
purgeAllTrainingData
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
FeatureStoreException
- If Client is not connected to Hopsworks.
IOException
- Generic IO exception.

public void deleteTrainingDataset(Integer version) throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// Delete a training dataset version 1.
fv.deleteTrainingDataset(1);
deleteTrainingDataset
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
version
- Version of the training dataset to be removed.
FeatureStoreException
- If Client is not connected to Hopsworks.
IOException
- Generic IO exception.

public void deleteAllTrainingDatasets() throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// Delete all training datasets in this feature view.
fv.deleteAllTrainingDatasets();
deleteAllTrainingDatasets
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
FeatureStoreException
- If Client is not connected to Hopsworks.
IOException
- Generic IO exception.

public void addTrainingDatasetTag(Integer version, String name, Object value) throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// add tag to training dataset version 1 in this feature view.
JSONObject json = ...;
fv.addTrainingDatasetTag(1, "tag_name", json);
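As noted below, a tag value can be any valid JSON - primitives, arrays, or JSON objects. As an illustration only (this helper is hypothetical and not part of the Hopsworks API), a check of a candidate value against those shapes before tagging might look like:

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Hopsworks API: verifies that a candidate tag
// value matches the shapes the docs allow - JSON primitives, arrays
// (represented as List), or JSON objects (represented as Map with String keys).
class TagValueCheck {
    static boolean isValidTagValue(Object value) {
        if (value == null || value instanceof String
                || value instanceof Number || value instanceof Boolean) {
            return true; // JSON primitives (and null)
        }
        if (value instanceof List) { // JSON array: every element must be valid
            for (Object item : (List<?>) value) {
                if (!isValidTagValue(item)) return false;
            }
            return true;
        }
        if (value instanceof Map) { // JSON object: String keys, valid values
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                if (!(e.getKey() instanceof String) || !isValidTagValue(e.getValue())) {
                    return false;
                }
            }
            return true;
        }
        return false; // anything else is not representable as JSON
    }
}
```

The recursion mirrors the JSON grammar: containers are valid only if all of their contents are.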
addTrainingDatasetTag
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
version
- Training dataset version.
name
- Name of the tag.
value
- Value of the tag. The value of a tag can be any valid JSON - primitives, arrays, or JSON objects.
FeatureStoreException
- If Client is not connected to Hopsworks.
IOException
- Generic IO exception.
public Map<String,Object> getTrainingDatasetTags(Integer version) throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// get tags of training dataset version 1 in this feature view.
fv.getTrainingDatasetTags(1);
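Because the returned `Map<String,Object>` holds untyped values, callers usually inspect the runtime type of each value before using it. A minimal, self-contained sketch of that pattern, operating on a stand-in map (the helper and sample tags are illustrative, not a Hopsworks API):

```java
import java.util.Map;

// Illustrative helper, not a Hopsworks API: summarizes each tag as
// "name=RuntimeType;" so callers can see what casts will be needed.
class TagInspection {
    static String describe(Map<String, Object> tags) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Object> e : tags.entrySet()) {
            sb.append(e.getKey())
              .append('=')
              .append(e.getValue() == null ? "null" : e.getValue().getClass().getSimpleName())
              .append(';');
        }
        return sb.toString();
    }
}
```

In practice the same iteration would run over the map returned by `fv.getTrainingDatasetTags(1)`, with an `instanceof` check before casting each value.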
getTrainingDatasetTags
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
version
- Training dataset version.
Map<String, Object>
A map of tag names and values. The value of a tag can be any valid JSON - primitives, arrays, or JSON objects.
FeatureStoreException
- If Client is not connected to Hopsworks.
IOException
- Generic IO exception.
public Object getTrainingDatasetTag(Integer version, String name) throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// get tag with name `"demo_name"` of training dataset version 1 in this feature view.
fv.getTrainingDatasetTag(1, "demo_name");
getTrainingDatasetTag
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
version
- Training dataset version.
name
- Name of the tag.
FeatureStoreException
- If Client is not connected to Hopsworks.
IOException
- Generic IO exception.
public void deleteTrainingDatasetTag(Integer version, String name) throws FeatureStoreException, IOException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// delete tag with name `"demo_name"` of training dataset version 1 in this feature view.
fv.deleteTrainingDatasetTag(1, "demo_name");
deleteTrainingDatasetTag
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
version
- Training dataset version.
name
- Name of the tag to be deleted.
FeatureStoreException
- If Client is not connected to Hopsworks.
IOException
- Generic IO exception.
public HashSet<String> getPrimaryKeys() throws SQLException, IOException, FeatureStoreException, ClassNotFoundException
// get feature store handle
FeatureStore fs = HopsworksConnection.builder().build().getFeatureStore();
// get feature view handle
FeatureView fv = fs.getFeatureView("fv_name", 1);
// get set of primary key names
fv.getPrimaryKeys();
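The serving-key names returned here are typically used to assemble the entry map for online feature vector lookups. A sketch of that assembly under stated assumptions - `buildEntry` is a hypothetical helper, not a Hopsworks API, and assumes the caller supplies a value for every serving key:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a Hopsworks API: keeps only the serving-key
// entries from a candidate row and fails fast if any serving key is missing.
class ServingEntry {
    static Map<String, Object> buildEntry(Set<String> servingKeys,
                                          Map<String, Object> candidate) {
        Map<String, Object> entry = new HashMap<>();
        for (String key : servingKeys) {
            if (!candidate.containsKey(key)) {
                throw new IllegalArgumentException("Missing serving key: " + key);
            }
            entry.put(key, candidate.get(key));
        }
        return entry;
    }
}
```

Failing fast on a missing key surfaces configuration errors at lookup time rather than returning partial results from the online store.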
getPrimaryKeys
in class FeatureViewBase<FeatureView,FeatureStore,Query,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
HashSet<String>
Set of serving keys.
FeatureStoreException
- In case Client is not connected to Hopsworks.
IOException
- Generic IO exception.
SQLException
- In case there is an online storage (RonDB) access error or other errors.
ClassNotFoundException
- In case class `com.mysql.jdbc.Driver` cannot be found.
Copyright © 2023. All rights reserved.