How to manage schema and feature data types#
In this guide, you will learn how to manage the feature group schema and control the data type of the features in a feature group.
Before you begin this guide, we suggest you read the Feature Group concept page to understand what a feature group is and how it fits in the ML pipeline. We also suggest you familiarize yourself with the APIs to create a feature group.
Feature group schema#
When a feature is stored in both the online and offline feature stores, it will be stored in a data type native to each store.
- Offline data type: The data type of the feature when stored on the offline feature store. The offline feature store is based on Apache Hudi and Hive Metastore, so Hive Data Types can be used.
- Online data type: The data type of the feature when stored on the online feature store. The online storage is based on RonDB, so MySQL Data Types can be used.
The offline data type is always required, even if the feature group is stored only online. On the other hand, if the feature group is not online_enabled, its features will not have an online data type.
The offline and online types for each feature are automatically inferred from the Spark or Pandas types of the input DataFrame, as outlined in the following two sections. The default mapping, however, can be overridden by using an explicit schema definition.
Offline data types#
The table below gives the default mapping to offline feature types for a Spark DataFrame registered in a PySpark environment (S) and for a Pandas DataFrame registered in a Python-only environment (PO).
|Spark Type (S)||Pandas Type (PO)||Offline Feature Type||Remarks|
|ByteType||int8||TINYINT or INT||INT when time_travel_type="HUDI"|
|ShortType||uint8, int16||SMALLINT or INT||INT when time_travel_type="HUDI"|
|LongType||int, uint32, int64||BIGINT|
|FloatType||float, float16, float32||FLOAT|
|DecimalType||decimal.Decimal||DECIMAL(PREC, SCALE)||Not supported in PO env. when time_travel_type="HUDI"|
|StringType||object (str), object(np.unicode)||STRING|
|ArrayType||object (list), object (np.ndarray)||ARRAY<TYPE>|
|StructType||object (dict)||STRUCT<NAME: TYPE, ...>|
|MapType||-||MAP<String,TYPE>||Only when time_travel_type!="HUDI"; Only string keys permitted|
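The scalar rows of the table above can be read as a simple lookup. The following is a minimal sketch of that reading, not Hopsworks' own inference code: the helper name `infer_offline_type`, the use of plain dtype-name strings, and the `"HUDI"` default argument are assumptions made for illustration, and only rows that appear in the table are covered.

```python
from typing import Optional

# Illustrative lookup built from the table above; not Hopsworks' own code.
# Keys are pandas dtype names, values are (type without HUDI, type with HUDI).
PANDAS_TO_OFFLINE = {
    "int8": ("TINYINT", "INT"),
    "uint8": ("SMALLINT", "INT"),
    "int16": ("SMALLINT", "INT"),
    "uint32": ("BIGINT", "BIGINT"),
    "int64": ("BIGINT", "BIGINT"),
    "float16": ("FLOAT", "FLOAT"),
    "float32": ("FLOAT", "FLOAT"),
}


def infer_offline_type(dtype_name: str, time_travel_type: Optional[str] = "HUDI") -> str:
    """Return the offline feature type for a pandas dtype name (hypothetical helper)."""
    if dtype_name not in PANDAS_TO_OFFLINE:
        raise ValueError(f"dtype {dtype_name!r} is not covered by this sketch")
    plain_type, hudi_type = PANDAS_TO_OFFLINE[dtype_name]
    # Small integer types are widened to INT when time travel is backed by Hudi.
    return hudi_type if time_travel_type == "HUDI" else plain_type


print(infer_offline_type("int8"))                         # INT
print(infer_offline_type("int8", time_travel_type=None))  # TINYINT
print(infer_offline_type("float32"))                      # FLOAT
```

Complex types (arrays, structs, maps) are left out on purpose, since their offline type depends on the element types.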
When registering a Pandas DataFrame in a PySpark environment (S), the Pandas DataFrame is first converted to a Spark DataFrame using Spark's default conversion. This results in a less fine-grained mapping between Python and Spark types:
|Pandas Type (S)||Spark Type||Remarks|
|int8, uint8, int16, uint16, int32, int, uint32, int64||LongType|
|float, float16, float32, float64||DoubleType|
|object (str), object(np.unicode)||StringType|
|object (list), object (np.ndarray)||-||Not supported|
Online data types#
The online data type is determined from the offline type according to the following mapping, regardless of which environment the data originated from. Only a subset of the data types can be used as a primary key; these are marked with an x in the table:
|Offline Feature Type||Online Feature Type||Primary Key||Remarks|
|INT||INT||x||Also supports: TINYINT, SMALLINT|
|DECIMAL(PREC, SCALE)||DECIMAL(PREC, SCALE)|| ||e.g. DECIMAL(38, 18)|
|STRING||VARCHAR(100)||x||Also supports: TEXT|
|ARRAY<TYPE>||VARBINARY(100)||x||Also supports: BLOB|
|STRUCT<NAME: TYPE, ...>||VARBINARY(100)||x||Also supports: BLOB|
|BINARY||VARBINARY(100)||x||Also supports: BLOB|
|MAP<String,TYPE>||VARBINARY(100)||x||Also supports: BLOB|
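As a rough illustration of the table above, the default online type can be derived from the offline type with a few string checks. This is a sketch under stated assumptions rather than the logic Hopsworks runs internally: `infer_online_type` is a hypothetical helper, and it only handles the offline types listed in the table.

```python
def infer_online_type(offline_type: str) -> str:
    """Map an offline feature type to its default online type (hypothetical helper)."""
    normalized = offline_type.strip().upper()
    if normalized == "INT":
        return "INT"
    if normalized.startswith("DECIMAL"):
        # Precision and scale carry over unchanged.
        return normalized
    if normalized == "STRING":
        return "VARCHAR(100)"
    # Binary and complex types are all stored as binary columns online.
    if normalized == "BINARY" or normalized.startswith(("ARRAY<", "STRUCT<", "MAP<")):
        return "VARBINARY(100)"
    raise ValueError(f"offline type {offline_type!r} is not covered by this sketch")


print(infer_online_type("string"))           # VARCHAR(100)
print(infer_online_type("array<double>"))    # VARBINARY(100)
print(infer_online_type("decimal(38, 18)"))  # DECIMAL(38, 18)
```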
String online data types#
String types are stored as VARCHAR(100) by default. This type has a fixed maximum length, meaning it can hold at most as many characters as specified in the argument (e.g. VARCHAR(100) can hold up to 100 unicode characters). The size should thus be at least the maximum string length of the input data. Furthermore, the VARCHAR size has to be in line with the online restrictions for row size.
If the string size exceeds 100 characters, a larger type (e.g. VARCHAR(500)) can be specified via an explicit schema definition. If the string size is unknown or if it exceeds the maximum row size, then the TEXT type can be used instead.
String data that exceeds the specified VARCHAR size will lead to an error when data gets written to the online feature store. When in doubt, use the TEXT type instead, but note that it comes with a potential performance overhead.
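One way to apply this guidance before creating a feature group is to measure the longest string in the data and pick the online type from that. The sketch below assumes a hypothetical helper `suggest_string_online_type` and an arbitrary cut-off `max_varchar_length` above which TEXT is suggested; neither is part of the Hopsworks API, and a suitable cut-off depends on the row size budget of the feature group.

```python
from typing import Iterable

DEFAULT_VARCHAR_LENGTH = 100  # default online size for strings, as stated above


def suggest_string_online_type(values: Iterable[str], max_varchar_length: int = 1000) -> str:
    """Suggest an online type for a string feature (hypothetical helper).

    max_varchar_length is an assumed cut-off chosen for illustration only.
    """
    longest = max((len(value) for value in values), default=0)
    if longest <= DEFAULT_VARCHAR_LENGTH:
        return f"VARCHAR({DEFAULT_VARCHAR_LENGTH})"
    if longest <= max_varchar_length:
        return f"VARCHAR({longest})"
    # Too long for the assumed budget: fall back to TEXT.
    return "TEXT"


print(suggest_string_online_type(["alice", "bob"]))  # VARCHAR(100)
print(suggest_string_online_type(["x" * 250]))       # VARCHAR(250)
print(suggest_string_online_type(["x" * 5000]))      # TEXT
```

The suggested type only reflects the data seen so far, so leaving some headroom above the observed maximum is usually sensible.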
Complex online data types#
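Hopsworks allows users to store complex types (e.g. ARRAY<TYPE>) in the online feature store, where they are kept in binary form. If the online feature store is queried directly, for instance with an fs.sql("SELECT ...", online=True) statement, it will return a binary blob.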
On the feature store UI, the online feature type for complex features will be reported as VARBINARY.
If the binary size exceeds 100 bytes, a larger type (e.g. VARBINARY(500)) can be specified via an explicit schema definition. If the binary size is unknown or if it exceeds the maximum row size, then the BLOB type can be used instead.
Binary data that exceeds the specified VARBINARY size will lead to an error when data gets written to the online feature store. When in doubt, use the BLOB type instead, but note that it comes with a potential performance overhead.
Online restrictions for primary key data types#
When a feature is used as a primary key, certain types are not allowed. Examples of such types are FLOAT, DOUBLE, DATE, TEXT and BLOB. Additionally, the combined storage size of the online data types of all primary key features must not exceed 4KB.
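Both restrictions can be checked up front with a few lines of code. The following sketch assumes a hypothetical helper `check_primary_key` that receives the byte size of each column from the caller, reads "4KB" as 4096 bytes, and treats the example types named above as the full deny list, which may be incomplete.

```python
from typing import Dict, List, Tuple

# Only the example types named in the text; the real list may be longer.
DISALLOWED_PRIMARY_KEY_TYPES = {"FLOAT", "DOUBLE", "DATE", "TEXT", "BLOB"}
MAX_PRIMARY_KEY_BYTES = 4 * 1024  # "4KB" read as 4096 bytes (assumption)


def check_primary_key(columns: Dict[str, Tuple[str, int]]) -> List[str]:
    """Return a list of problems for the given primary key columns.

    columns maps a feature name to (online type, storage size in bytes).
    """
    problems = []
    for name, (online_type, _size) in columns.items():
        # Strip any length argument, e.g. "VARCHAR(100)" -> "VARCHAR".
        base_type = online_type.split("(")[0].strip().upper()
        if base_type in DISALLOWED_PRIMARY_KEY_TYPES:
            problems.append(f"{name}: {base_type} cannot be used in a primary key")
    total = sum(size for _type, size in columns.values())
    if total > MAX_PRIMARY_KEY_BYTES:
        problems.append(f"primary key needs {total} bytes, limit is {MAX_PRIMARY_KEY_BYTES}")
    return problems


print(check_primary_key({"id": ("INT", 4), "name": ("VARCHAR(100)", 400)}))  # []
print(check_primary_key({"score": ("FLOAT", 4)}))
```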
Online restrictions for row size#
The online feature store supports up to 500 columns, and the combined byte size of all columns must not exceed 30000 bytes. The byte size of each column is determined by its data type and calculated as follows:
|Online Data Type||Byte Size|
|VARCHAR(LENGTH)||LENGTH * 4|
|VARCHAR(LENGTH) charset latin1;||LENGTH * 1|
|VARBINARY(LENGTH)||LENGTH / 1.4|
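To make the arithmetic concrete, the sketch below applies the formulas from the table to a list of online types and checks both limits. The helper names `column_byte_size` and `row_fits` are hypothetical, and only the three type forms shown in the table are recognized.

```python
import re
from typing import Iterable

MAX_COLUMNS = 500
MAX_ROW_BYTES = 30000

# Matches e.g. "VARCHAR(100)", "VARCHAR(100) charset latin1;", "VARBINARY(100)".
_TYPE_PATTERN = re.compile(r"^(VARCHAR|VARBINARY)\((\d+)\)(\s+CHARSET\s+LATIN1)?\s*;?$")


def column_byte_size(online_type: str) -> float:
    """Byte size of one column, following the table above (hypothetical helper)."""
    match = _TYPE_PATTERN.match(online_type.strip().upper())
    if match is None:
        raise ValueError(f"online type {online_type!r} is not covered by this sketch")
    kind, length, latin1 = match.group(1), int(match.group(2)), match.group(3)
    if kind == "VARBINARY":
        if latin1:
            raise ValueError("a charset only applies to VARCHAR columns")
        return length / 1.4
    # VARCHAR: 4 bytes per character by default, 1 byte with latin1.
    return float(length * (1 if latin1 else 4))


def row_fits(online_types: Iterable[str]) -> bool:
    """Check the column count and the combined byte size against the limits."""
    types = list(online_types)
    total = sum(column_byte_size(t) for t in types)
    return len(types) <= MAX_COLUMNS and total <= MAX_ROW_BYTES


print(column_byte_size("VARCHAR(100)"))                  # 400.0
print(column_byte_size("varchar(100) charset latin1;"))  # 100.0
print(row_fits(["VARCHAR(100)"] * 75))                   # True
print(row_fits(["VARCHAR(100)"] * 76))                   # False
```

With the default 4 bytes per character, 75 columns of VARCHAR(100) use exactly 30000 bytes, so a 76th column of the same type no longer fits.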
Explicit schema definition#
When creating a feature group, the user can control both the offline and online data type of each column. If the schema is defined explicitly, Hopsworks uses that schema to create the feature group without performing any type mapping. You can explicitly define the feature group schema as follows:
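    from hsfs.feature import Feature

    # `type` sets the offline data type, `online_type` the online data type
    features = [
        Feature(name="id", type="int", online_type="int"),
        Feature(name="name", type="string", online_type="varchar(20)"),
    ]

    # `fs` is the feature store handle, `df` the DataFrame to write
    fg = fs.create_feature_group(name="fg_manual_schema", features=features, online_enabled=True)
    fg.save(df)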
Append features to existing feature groups#
Hopsworks supports appending features to an existing feature group. Appending features is not considered a breaking change.
    from hsfs.feature import Feature

    # Features to append, each with an explicit offline and online type
    features = [
        Feature(name="id", type="int", online_type="int"),
        Feature(name="name", type="string", online_type="varchar(20)"),
    ]

    # `fs` is the feature store handle
    fg = fs.get_feature_group(name="example", version=1)
    fg.append_features(features)
When adding features to a feature group, you can provide a default value for existing entries in the feature group. You can also backfill the new features for existing entries by running an insert() operation that updates all existing combinations of primary key and event time.
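The effect of a default value followed by a backfill can be pictured with a toy in-memory model. The code below does not call Hopsworks at all: the dictionary stands in for a feature group keyed by (primary key, event time), and the local functions `append_feature` and `insert` are hypothetical stand-ins that only mimic the behavior described above.

```python
from typing import Any, Dict, Tuple

Key = Tuple[int, str]  # (primary key, event time)

# Toy stand-in for the existing entries of a feature group.
rows: Dict[Key, Dict[str, Any]] = {
    (1, "2023-01-01"): {"amount": 10.0},
    (2, "2023-01-01"): {"amount": 20.0},
}


def append_feature(table: Dict[Key, Dict[str, Any]], name: str, default_value: Any = None) -> None:
    """Add a new feature; existing entries receive the default value."""
    for row in table.values():
        row.setdefault(name, default_value)


def insert(table: Dict[Key, Dict[str, Any]], new_rows: Dict[Key, Dict[str, Any]]) -> None:
    """Upsert: entries with a matching (primary key, event time) are updated."""
    for key, values in new_rows.items():
        table.setdefault(key, {}).update(values)


append_feature(rows, "country", default_value="unknown")
insert(rows, {(1, "2023-01-01"): {"country": "SE"}})  # backfill one existing entry

print(rows[(1, "2023-01-01")])  # {'amount': 10.0, 'country': 'SE'}
print(rows[(2, "2023-01-01")])  # {'amount': 20.0, 'country': 'unknown'}
```

Entries that are not covered by the backfill keep the default value, which is why both mechanisms can be combined.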