How to manage schema and feature data types#
In this guide, you will learn how to manage the feature group schema and control the data type of the features in a feature group.
Before you begin this guide, we suggest you read the Feature Group concept page to understand what a feature group is and how it fits in the ML pipeline. We also suggest you familiarize yourself with the APIs to create a feature group.
Feature group schema#
When a feature is stored in both the online and offline feature stores, it will be stored in a data type native to each store.
- Offline data type: The data type of the feature when stored on the offline feature store. The offline feature store is based on Apache Hudi and Hive Metastore, so Hive Data Types can be used.
- Online data type: The data type of the feature when stored on the online feature store. The online storage is based on RonDB and hence, MySQL Data Types can be leveraged.
The offline data type is always required, even if the feature group is stored only online. On the other hand, if the feature group is not online_enabled, its features will not have an online data type.
The offline and online types for each feature are automatically inferred from the Spark or Pandas types of the input DataFrame as outlined in the following two sections. The default mapping, however, can be overridden by using an explicit schema definition.
Offline data types#
|Spark Type (S)||Pandas Type (P)||Offline Feature Type||Remarks|
|ByteType||int8, Int8||TINYINT or INT||INT when time_travel_type="HUDI"|
|ShortType||uint8, int16, Int16||SMALLINT or INT||INT when time_travel_type="HUDI"|
|IntegerType||uint16, int32, Int32||INT|
|LongType||int, uint32, int64, Int64||BIGINT|
|FloatType||float, float16, float32||FLOAT|
|DecimalType||decimal.Decimal||DECIMAL(PREC, SCALE)||Not supported in the Python-only environment when time_travel_type="HUDI"|
|TimestampType||datetime64[ns], datetime64[ns, tz]||TIMESTAMP||see Timestamps and Timezones|
|StringType||object (str), object(np.unicode)||STRING|
|ArrayType||object (list), object (np.ndarray)||ARRAY<TYPE>|
|StructType||object (dict)||STRUCT<NAME: TYPE, ...>|
|MapType||-||MAP<String,TYPE>||Only when time_travel_type!="HUDI"; Only string keys permitted|
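The scalar rows of the table above can be read as a simple lookup. The following is a minimal sketch of that reading, not the inference code Hopsworks runs: `infer_offline_type` and `PANDAS_TO_OFFLINE` are hypothetical names introduced here for illustration, and only the integer, float and timestamp rows are covered.

```python
# Illustrative lookup only: these names are hypothetical, not part of hsfs.
# The entries are copied from the Pandas Type (P) column of the table above.
PANDAS_TO_OFFLINE = {
    "int8": "TINYINT", "Int8": "TINYINT",
    "uint8": "SMALLINT", "int16": "SMALLINT", "Int16": "SMALLINT",
    "uint16": "INT", "int32": "INT", "Int32": "INT",
    "int": "BIGINT", "uint32": "BIGINT", "int64": "BIGINT", "Int64": "BIGINT",
    "float": "FLOAT", "float16": "FLOAT", "float32": "FLOAT",
}


def infer_offline_type(pandas_dtype, time_travel_type):
    """Return the offline feature type listed for a Pandas dtype name."""
    # Both datetime64[ns] and datetime64[ns, tz] are listed as TIMESTAMP.
    if pandas_dtype.startswith("datetime64[ns"):
        return "TIMESTAMP"
    offline = PANDAS_TO_OFFLINE[pandas_dtype]  # KeyError for rows not covered here
    # Remarks column: TINYINT and SMALLINT become INT when time_travel_type="HUDI".
    if time_travel_type == "HUDI" and offline in ("TINYINT", "SMALLINT"):
        return "INT"
    return offline
```

For instance, `infer_offline_type("int8", "HUDI")` yields `INT`, while the same dtype without Hudi time travel stays `TINYINT`.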
When registering a Pandas DataFrame in a PySpark environment (S), the Pandas DataFrame is first converted to a Spark DataFrame using Spark's default conversion. This results in a less fine-grained mapping between Python and Spark types:
|Pandas Type (S)||Spark Type||Remarks|
|int8, uint8, int16, uint16, int32, int, uint32, int64||LongType|
|float, float16, float32, float64||DoubleType|
|datetime64[ns], datetime64[ns, tz]||TimestampType||see Timestamps and Timezones|
|object (str), object(np.unicode)||StringType|
|object (list), object (np.ndarray)||-||Not supported|
Online data types#
The online data type is determined from the offline type according to the following mapping, regardless of which environment the data originated from. Only a subset of the data types can be used as a primary key; these are marked with an x in the Primary Key column:
|Offline Feature Type||Online Feature Type||Primary Key||Remarks|
|INT||INT||x||Also supports: TINYINT, SMALLINT|
|DECIMAL(PREC, SCALE)||DECIMAL(PREC, SCALE)||-||e.g. DECIMAL(38, 18)|
|TIMESTAMP||TIMESTAMP||-||see Timestamps and Timezones|
|STRING||VARCHAR(100)||x||Also supports: TEXT|
|ARRAY<TYPE>||VARBINARY(100)||x||Also supports: BLOB|
|STRUCT<NAME: TYPE, ...>||VARBINARY(100)||x||Also supports: BLOB|
|BINARY||VARBINARY(100)||x||Also supports: BLOB|
|MAP<String,TYPE>||VARBINARY(100)||x||Also supports: BLOB|
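As a rough sketch of how the rows above translate an offline type into its default online type, the helper below encodes only the rows listed in this table. `default_online_type` is a hypothetical name used for illustration and is not an hsfs function.

```python
# Hypothetical helper mirroring the table above; not part of the hsfs API.
DEFAULT_ONLINE_TYPE = {
    "INT": "INT",
    "TIMESTAMP": "TIMESTAMP",
    "STRING": "VARCHAR(100)",
    "BINARY": "VARBINARY(100)",
}
COMPLEX_PREFIXES = ("ARRAY<", "STRUCT<", "MAP<")


def default_online_type(offline_type):
    """Return the default online type listed for an offline type."""
    t = offline_type.strip().upper()
    if t.startswith("DECIMAL"):
        return t  # precision and scale are carried over unchanged
    if t.startswith(COMPLEX_PREFIXES):
        return "VARBINARY(100)"  # complex types default to a binary column
    return DEFAULT_ONLINE_TYPE[t]  # KeyError for types not listed above
```

The defaults for STRING and the complex types are the ones most likely to need an explicit override, because their size is capped at 100.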
String online data types#
String types are stored as VARCHAR(100) by default. This type has a fixed maximum size: it can hold at most as many characters as specified in the argument (e.g. VARCHAR(100) can hold up to 100 unicode characters). The size should therefore be at least the maximum string length of the input data. Furthermore, the VARCHAR size has to comply with the online restrictions for row size.
If the string size exceeds 100 characters, a larger type (e.g. VARCHAR(500)) can be specified via an explicit schema definition. If the string size is unknown or if it exceeds the maximum row size, then the TEXT type can be used instead.
String data that exceeds the specified VARCHAR size will lead to an error when data gets written to the online feature store. When in doubt, use the TEXT type instead, but note that it comes with a potential performance overhead.
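One way to apply this guidance before writing data is to measure the longest string in a column and pick the online type from it. The sketch below is an assumption-laden illustration: `suggest_online_string_type` is a hypothetical helper, and the `max_varchar_length` cutoff of 1000 is an arbitrary placeholder that you would derive from your own row size budget.

```python
def suggest_online_string_type(values, max_varchar_length=1000):
    """Suggest an online type that can hold every string in `values`.

    `max_varchar_length` is an illustrative cutoff, not a Hopsworks limit.
    """
    lengths = [len(v) for v in values if v is not None]
    if not lengths:
        return "text"  # length unknown: fall back to TEXT
    longest = max(lengths)  # len() counts unicode characters, not bytes
    if longest <= 100:
        return "varchar(100)"  # the default already fits
    if longest > max_varchar_length:
        return "text"  # too large for a sized column under this cutoff
    return f"varchar({longest})"
```

The returned string could then be passed as the online type of the feature in an explicit schema definition. Sizing to the exact observed maximum leaves no headroom, so in practice you may want to round up.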
Complex online data types#
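Hopsworks allows users to store complex types (e.g. ARRAY<TYPE>) in the online feature store, where they are held in binary form. If you query the online feature store directly, for example with an fs.sql("SELECT ...", online=True) statement, it will return a binary blob.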
On the feature store UI, the online feature type for complex features will be reported as VARBINARY.
If the binary size exceeds 100 bytes, a larger type (e.g. VARBINARY(500)) can be specified via an explicit schema definition. If the binary size is unknown or if it exceeds the maximum row size, then the BLOB type can be used instead.
Binary data that exceeds the specified VARBINARY size will lead to an error when data gets written to the online feature store. When in doubt, use the BLOB type instead, but note that it comes with a potential performance overhead.
Online restrictions for primary key data types#
When a feature is used as a primary key, certain types are not allowed, for example FLOAT, DOUBLE, TEXT and BLOB. Additionally, the combined storage requirement of the online data types of all primary key features should not exceed 4KB.
Online restrictions for row size#
The online feature store supports up to 500 columns and all column types combined should not exceed 30000 Bytes. The byte size of each column is determined by its data type and calculated as follows:
|Online Data Type||Byte Size|
|VARCHAR(LENGTH)||LENGTH * 4|
|VARCHAR(LENGTH) charset latin1;||LENGTH * 1|
|VARBINARY(LENGTH)||LENGTH / 1.4|
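The limits and per-type sizes above can be turned into a quick pre-flight check. This is a minimal sketch that covers only the three type patterns listed in the table; `column_bytes` and `fits_online_row` are hypothetical helpers, not hsfs functions, and any other type raises an error rather than guessing a size.

```python
import re

MAX_COLUMNS = 500      # column limit stated above
MAX_ROW_BYTES = 30000  # combined size limit stated above


def column_bytes(online_type):
    """Byte size of one column, using only the rules in the table above."""
    t = online_type.strip().rstrip(";").lower()
    m = re.fullmatch(r"varchar\((\d+)\)(\s+charset\s+latin1)?", t)
    if m:
        # LENGTH * 1 with the latin1 charset, LENGTH * 4 otherwise
        return int(m.group(1)) * (1 if m.group(2) else 4)
    m = re.fullmatch(r"varbinary\((\d+)\)", t)
    if m:
        return int(m.group(1)) / 1.4
    raise ValueError(f"no size rule listed for {online_type!r}")


def fits_online_row(online_types):
    """Check a list of online column types against both limits."""
    if len(online_types) > MAX_COLUMNS:
        return False
    return sum(column_bytes(t) for t in online_types) <= MAX_ROW_BYTES
```

With these rules, 75 columns of `varchar(100)` use exactly 30000 bytes and still fit, while a 76th column would exceed the limit.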
Timestamps and Timezones#
All timestamp features are stored in Hopsworks in UTC time. Also, all timestamp-based functions (such as point-in-time joins) use UTC time. This ensures consistency of timestamp features across different client timezones and simplifies working with timestamp-based functions in general. When ingesting timestamp features, the Feature Store Write API will automatically handle the conversion to UTC, if necessary. The following table summarizes how different timestamp types are handled:
|Data Frame (Data Type)||Environment||Handling|
|Pandas DataFrame (datetime64[ns])||Python-only and PySpark||interpreted as UTC, independent of the client's timezone|
|Pandas DataFrame (datetime64[ns, tz])||Python-only and PySpark||timezone-sensitive conversion from 'tz' to UTC|
|Spark (TimestampType)||PySpark and Spark||interpreted as UTC, independent of the client's timezone|
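The handling rules in the table above can be illustrated on single values with the Python standard library. This is only a sketch of the described semantics, not the Write API's implementation: `to_utc_naive` is a hypothetical helper, and plain `datetime` objects stand in for DataFrame columns.

```python
from datetime import datetime, timedelta, timezone


def to_utc_naive(ts):
    """Normalize a timestamp to a timezone-unaware UTC value."""
    if ts.tzinfo is None:
        # Timezone-unaware input is interpreted as UTC and left unchanged,
        # independent of the client's local timezone.
        return ts
    # Timezone-aware input is converted to UTC, then the tz marker is dropped.
    return ts.astimezone(timezone.utc).replace(tzinfo=None)


plus_two = timezone(timedelta(hours=2))  # a fixed UTC+02:00 offset for the example
aware = datetime(2024, 1, 1, 12, 0, tzinfo=plus_two)
naive = datetime(2024, 1, 1, 12, 0)
```

Here `to_utc_naive(aware)` shifts the value to 10:00, while `to_utc_naive(naive)` keeps 12:00 because it is already treated as UTC.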
Timestamp features retrieved from the Feature Store, e.g. using the Feature Store Read API, use a timezone-unaware format:
|Data Frame (Data Type)||Environment||Timezone|
|Pandas DataFrame (datetime64[ns])||Python-only||timezone-unaware (UTC)|
|Spark (TimestampType)||PySpark and Spark||timezone-unaware (UTC)|
Note that our PySpark/Spark client automatically sets the Spark SQL session's timezone to UTC. This ensures that Spark SQL will correctly interpret all timestamps as UTC. The setting will only apply to the client's session, and you don't have to worry about setting/unsetting the configuration yourself.
Explicit schema definition#
When creating a feature group, you can control both the offline and online data type of each column. If you explicitly define the schema for the feature group, Hopsworks uses that schema to create the feature group without performing any type mapping. You can explicitly define the feature group schema as follows:
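    from hsfs.feature import Feature

    # type is the offline data type, online_type the online data type
    features = [
        Feature(name="id", type="int", online_type="int"),
        Feature(name="name", type="string", online_type="varchar(20)"),
    ]

    # fs is the feature store handle, df the DataFrame to ingest
    fg = fs.create_feature_group(name="fg_manual_schema", features=features, online_enabled=True)
    fg.save(df)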
Append features to existing feature groups#
Hopsworks supports appending additional features to an existing feature group. Adding additional features to an existing feature group is not considered a breaking change.
    from hsfs.feature import Feature

    features = [
        Feature(name="id", type="int", online_type="int"),
        Feature(name="name", type="string", online_type="varchar(20)"),
    ]

    fg = fs.get_feature_group(name="example", version=1)
    fg.append_features(features)
When adding features to a feature group, you can provide default values for existing entries in the feature group. You can also backfill the new features for existing entries by running an insert() operation that updates all existing combinations of primary key and event time.