Query#
Query objects are generated exclusively by HSFS APIs called on Feature Group objects; users never construct a Query using the class constructor. For this reason we do not provide the full documentation of the class here.
Methods#
as_of#
Query.as_of(wallclock_time)
Perform time travel on the given Query.
This method returns a new Query object at the specified point in time.
Warning
The `wallclock_time` needs to be a time included in the Hudi active timeline. By default, Hudi keeps the last 20 to 30 commits in the active timeline. If you need to keep a longer active timeline, you can overwrite the options `hoodie.keep.min.commits` and `hoodie.keep.max.commits` when calling the `insert()` method.
The returned Query can then be read into a DataFrame, or used to perform further joins or to construct a training dataset.
Arguments
- `wallclock_time`: Datetime string. The string should be formatted in one of the following formats: `%Y%m%d`, `%Y%m%d%H`, `%Y%m%d%H%M`, or `%Y%m%d%H%M%S`.

Returns

`Query`. The query object with the applied time travel condition.
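As a quick sanity check on the accepted formats, a small helper (plain Python, not part of the HSFS API) can validate a `wallclock_time` string before passing it to `as_of()`:

```python
from datetime import datetime

# The four wallclock_time formats accepted by as_of(), per the list above.
FORMATS = ["%Y%m%d", "%Y%m%d%H", "%Y%m%d%H%M", "%Y%m%d%H%M%S"]

def parse_wallclock_time(ts: str) -> datetime:
    """Return the parsed datetime for the first matching format."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(ts, fmt)
        except ValueError:
            continue
    raise ValueError(f"{ts!r} does not match any accepted wallclock_time format")

print(parse_wallclock_time("20210317"))        # date only
print(parse_wallclock_time("20210317143059"))  # down to seconds
```

`strptime` rejects strings with trailing unconverted characters, so each input matches exactly one of the four formats.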
filter#
Query.filter(f)
Apply a filter to the feature group.

Selects all features and returns the resulting Query with the applied filter.

```python
from hsfs.feature import Feature

query.filter(Feature("weekly_sales") > 1000)
```

If you are planning to join the filtered feature group later on with another feature group, make sure to select the filtered feature explicitly from the respective feature group:

```python
query.filter(fg.feature1 == 1).show(10)
```

Composite filters require parentheses:

```python
query.filter((fg.feature1 == 1) | (fg.feature2 >= 2))
```
Arguments
- `f` `Union[hsfs.constructor.filter.Filter, hsfs.constructor.filter.Logic]`: Filter object.

Returns

`Query`. The query object with the applied filter.
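The parentheses in composite filters are needed because of Python operator precedence, not an HSFS quirk: the bitwise `|` and `&` operators bind more tightly than comparisons such as `==` and `>=`. A minimal pandas sketch (an analogy, not the HSFS Filter implementation) shows the same rule:

```python
import pandas as pd

df = pd.DataFrame({"feature1": [1, 2, 3], "feature2": [0, 5, 1]})

# Without the parentheses, Python would try to evaluate
# `1 | (df["feature2"])` first and raise or misbehave.
mask = (df["feature1"] == 1) | (df["feature2"] >= 2)
print(df[mask])  # rows where feature1 == 1 or feature2 >= 2
```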
from_cache_feature_group_only#
Query.from_cache_feature_group_only()
from_response_json#
Query.from_response_json(json_dict)
join#
Query.join(sub_query, on=[], left_on=[], right_on=[], join_type="inner", prefix=None)
Join Query with another Query.
If no join keys are specified, Hopsworks will use the maximal matching subset of the primary keys of the feature groups being joined. Only single-level joins are supported; nested joins are not.
Arguments
- `sub_query` `hsfs.constructor.query.Query`: Right-hand side query to join.
- `on` `Optional[List[str]]`: List of feature names to join on if they are available in both feature groups. Defaults to `[]`.
- `left_on` `Optional[List[str]]`: List of feature names to join on from the left feature group of the join. Defaults to `[]`.
- `right_on` `Optional[List[str]]`: List of feature names to join on from the right feature group of the join. Defaults to `[]`.
- `join_type` `Optional[str]`: Type of join to perform; can be `"inner"`, `"outer"`, `"left"` or `"right"`. Defaults to `"inner"`.
- `prefix` `Optional[str]`: User-provided prefix to avoid feature name clashes. The prefix is applied to the right feature group of the query. Defaults to `None`.

Returns

`Query`: A new Query object representing the join.
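The combined effect of the join keys and the `prefix` argument can be pictured with a plain pandas merge. This is an analogy under assumed data, not the HSFS implementation: it mimics roughly what `query.join(sub_query, on=["id"], join_type="inner", prefix="r_")` would produce when both sides carry a `sales` feature.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "sales": [100, 200]})
right = pd.DataFrame({"id": [2, 3], "sales": [20, 30]})

# Inner join on the shared key; right-hand columns are prefixed so the
# two "sales" features do not clash in the joined result.
joined = left.merge(
    right.add_prefix("r_"), left_on="id", right_on="r_id", how="inner"
)
print(joined.columns.tolist())  # ['id', 'sales', 'r_id', 'r_sales']
```

Only the row with the matching key (`id == 2`) survives the inner join; switching `how` to `"left"`, `"right"`, or `"outer"` mirrors the other `join_type` options.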
pull_changes#
Query.pull_changes(wallclock_start_time, wallclock_end_time)
read#
Query.read(online=False, dataframe_type="default", read_options={})
Read the specified query into a DataFrame.
It is possible to specify the storage (online/offline) to read from and the type of the output DataFrame (Spark, Pandas, Numpy, Python Lists).
Arguments
- `online` `Optional[bool]`: Read from online storage. Defaults to `False`.
- `dataframe_type` `Optional[str]`: DataFrame type to return. Defaults to `"default"`.
- `read_options` `Optional[dict]`: Optional dictionary with read options for Spark. Defaults to `{}`.

Returns

`DataFrame`: DataFrame depending on the chosen type.
show#
Query.show(n, online=False)
Show the first N rows of the Query.
Arguments
- `n` `int`: Number of rows to show.
- `online` `Optional[bool]`: Show from online storage. Defaults to `False`.
to_string#
Query.to_string(online=False)