Skip to content

Query#

Query objects are strictly generated by HSFS APIs called on Feature Group objects. Users will never construct a Query object using the constructor of the class. For this reason we do not provide the full documentation of the class here.

Methods#

[source]

as_of#

Query.as_of(wallclock_time)

Perform time travel on the given Query.

This method returns a new Query object at the specified point in time.

Warning

The wallclock_time needs to be a time included into the Hudi active timeline. By default Hudi keeps the last 20 to 30 commits in the active timeline. If you need to keep a longer active timeline, you can overwrite the options: hoodie.keep.min.commits and hoodie.keep.max.commits when calling the insert() method.

This can then either be read into a Dataframe or used further to perform joins or construct a training dataset.

Arguments

  • wallclock_time: Datetime string. The String should be formatted in one of the following formats %Y%m%d, %Y%m%d%H, %Y%m%d%H%M, or %Y%m%d%H%M%S.

Returns

Query. The query object with the applied time travel condition.


[source]

filter#

Query.filter(f)

Apply filter to the feature group.

Selects all features and returns the resulting Query with the applied filter.

from hsfs.feature import Feature

query.filter(Feature("weekly_sales") > 1000)

If you are planning to join the filtered feature group later on with another feature group, make sure to select the filtered feature explicitly from the respective feature group:

query.filter(fg.feature1 == 1).show(10)

Composite filters require parenthesis:

query.filter((fg.feature1 == 1) | (fg.feature2 >= 2))

Arguments

  • f Union[hsfs.constructor.filter.Filter, hsfs.constructor.filter.Logic]: Filter object.

Returns

Query. The query object with the applied filter.


[source]

join#

Query.join(sub_query, on=[], left_on=[], right_on=[], join_type="inner", prefix=None)

Join Query with another Query.

If no join keys are specified, Hopsworks will use the maximal matching subset of the primary keys of the feature groups you are joining. Joins of one level are supported, no neted joins.

Arguments

  • sub_query hsfs.constructor.query.Query: Right-hand side query to join.
  • on Optional[List[str]]: List of feature names to join on if they are available in both feature groups. Defaults to [].
  • left_on Optional[List[str]]: List of feature names to join on from the left feature group of the join. Defaults to [].
  • right_on Optional[List[str]]: List of feature names to join on from the right feature group of the join. Defaults to [].
  • join_type Optional[str]: Type of join to perform, can be "inner", "outer", "left" or "right". Defaults to "inner".
  • prefix Optional[str]: User provided prefix to avoid feature name clash. Prefix is applied to the right feature group of the query. Defaults to None.

Returns

Query: A new Query object representing the join.


[source]

pull_changes#

Query.pull_changes(wallclock_start_time, wallclock_end_time)

[source]

read#

Query.read(online=False, dataframe_type="default", read_options={})

Read the specified query into a DataFrame.

It is possible to specify the storage (online/offline) to read from and the type of the output DataFrame (Spark, Pandas, Numpy, Python Lists).

Arguments

  • online Optional[bool]: Read from online storage. Defaults to False.
  • dataframe_type Optional[str]: DataFrame type to return. Defaults to "default".
  • read_options Optional[dict]: Optional dictionary with read options for Spark. Defaults to {}.

Returns

DataFrame: DataFrame depending on the chosen type.


[source]

show#

Query.show(n, online=False)

Show the first N rows of the Query.

Arguments

  • n int: Number of rows to show.
  • online Optional[bool]: Show from online storage. Defaults to False.

[source]

to_string#

Query.to_string(online=False)

Properties#

[source]

left_feature_group_end_time#


[source]

left_feature_group_start_time#