Feature Join

Intuitions of Feature Join

The observation dataset has 3 records, shown below, and we want to use it as the ‘spine’ dataset, joining two features onto it:

1) Feature page_view_count from dataset page_view_data

2) Feature like_count from dataset like_count_data

In this case, the Feathr feature join uses the field id as the join key of the observation data and also considers the timestamp of each row during the join, making sure the joined feature values were collected before the observe_time of each row.

id	observe_time	Label
1	2022-01-01	Yes
1	2022-01-02	Yes
2	2022-01-02	No

Dataset page_view_data contains page_view_count of each user at a given time:

UserId	log_time	page_view_count
1	2022-01-01	101
1	2022-01-02	102
1	2022-01-03	103
2	2022-01-02	200
3	2022-01-02	300

Dataset like_count_data contains like_count of each user at a given time:

UserId	updated_time	like_count
1	2022-01-01	11
1	2022-01-02	12
1	2022-01-03	13
2	2022-01-02	20
3	2022-01-02	30

The expected joined output, a.k.a. the training dataset, would be:

id	observe_time	Label	f_page_view_count	f_like_count
1	2022-01-01	Yes	101	11
1	2022-01-02	Yes	102	12
2	2022-01-02	No	200	20

Note: In the above example, the features f_page_view_count and f_like_count are defined simply as references to the fields page_view_count and like_count, respectively. The timestamps in these 3 datasets are taken into account automatically.
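
The join in this example can be reproduced outside Feathr as a rough illustration. The sketch below uses pandas and its merge_asof function, not Feathr's own implementation; the helper point_in_time_join and the temporary column _feature_time are names invented here. It assumes that a feature row whose timestamp equals observe_time is eligible, because that is what the expected output table shows.

```python
import pandas as pd

# Toy data copied from the tables above.
dates = ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-02", "2022-01-02"]
observation = pd.DataFrame({
    "id": [1, 1, 2],
    "observe_time": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-02"]),
    "Label": ["Yes", "Yes", "No"],
})
page_view_data = pd.DataFrame({
    "UserId": [1, 1, 1, 2, 3],
    "log_time": pd.to_datetime(dates),
    "page_view_count": [101, 102, 103, 200, 300],
})
like_count_data = pd.DataFrame({
    "UserId": [1, 1, 1, 2, 3],
    "updated_time": pd.to_datetime(dates),
    "like_count": [11, 12, 13, 20, 30],
})


def point_in_time_join(spine, source, time_col, value_col, feature_name):
    """Attach to each spine row the latest source value at or before observe_time.

    Hypothetical helper for illustration only; it is not part of Feathr.
    """
    right = (
        source.rename(columns={"UserId": "id",
                               time_col: "_feature_time",
                               value_col: feature_name})
        [["id", "_feature_time", feature_name]]
        .sort_values("_feature_time", kind="mergesort")  # merge_asof needs sorted keys
    )
    left = spine.sort_values("observe_time", kind="mergesort")
    joined = pd.merge_asof(
        left, right,
        left_on="observe_time", right_on="_feature_time",
        by="id",               # match rows of the same entity only
        direction="backward",  # never look at feature rows from the future
    )
    return joined.drop(columns="_feature_time")


training = point_in_time_join(observation, page_view_data,
                              "log_time", "page_view_count", "f_page_view_count")
training = point_in_time_join(training, like_count_data,
                              "updated_time", "like_count", "f_like_count")
training = (training.sort_values(["id", "observe_time"], kind="mergesort")
            .reset_index(drop=True))
print(training)
```

The rows dated 2022-01-03 for user 1 and all rows for user 3 never reach the output: the former are later than every observation of user 1, and the latter have no matching id in the observation data.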

Feature join config

An example is shown below:

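from feathr import FeatureQuery, ObservationSettings

# location_id is assumed to be the key defined together with the features,
# and client an already-created FeathrClient.
feature_query = FeatureQuery(
    feature_list=["f_location_avg_fare"], key=location_id)
# The observation ('spine') dataset and its event-time column.
settings = ObservationSettings(
    observation_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv",
    event_timestamp_column="lpep_dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss")
# Run the join and write the resulting training dataset to output_path.
client.get_offline_features(observation_settings=settings,
                            feature_query=feature_query,
                            output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/output.avro")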

(See the ObservationSettings API doc and the client.get_offline_features API doc.)

After you have defined the features (as described in the Feature Definition part), you can define how you want to join them.

Observation data

The observation data is given as the path of a dataset that serves as the ‘spine’ for the to-be-created training dataset. We call this input ‘spine’ dataset the ‘observation’ dataset. Typically, each row of the observation data contains:

Column(s) representing entity id(s), which are used as the join key to look up (join) feature values.
A column representing the event time of the row. By default, Feathr makes sure the joined feature values have a timestamp earlier than it, ensuring no data leakage in the resulting training dataset.
Other columns, which are simply passed through to the output training dataset.

A feature join is therefore described by three pieces of information:

The key fields from the observation data, which are used to join with the feature data.
The list of feature names to be joined with the observation data. They must be pre-defined in the Python APIs.
The time information of the observation data, which is compared with the feature’s timestamp during the join.

FeatureQuery

After you have defined all the features, you probably don’t want to use all of them in a particular program. Instead of putting every feature into the FeatureQuery, you can list only the features you need. Note that all features in one FeatureQuery must share the same key.
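
The same-key constraint can be illustrated without Feathr itself. The sketch below is plain Python under stated assumptions: FEATURE_KEYS is a hypothetical catalogue mapping feature names to the name of their key, the key name user_id and the feature f_location_max_fare are invented for illustration, and group_by_key is a helper written for this example. It groups a selected subset of features by key, so that each group contains only features that could go into one FeatureQuery.

```python
from collections import defaultdict

# Hypothetical catalogue: feature name -> name of the key it is defined on.
FEATURE_KEYS = {
    "f_location_avg_fare": "location_id",
    "f_location_max_fare": "location_id",  # invented feature name
    "f_page_view_count": "user_id",        # invented key name
    "f_like_count": "user_id",             # invented key name
}


def group_by_key(selected, feature_keys):
    """Group the selected feature names by the key they are defined on."""
    groups = defaultdict(list)
    for name in selected:
        if name not in feature_keys:
            raise KeyError(f"feature {name!r} has not been defined")
        groups[feature_keys[name]].append(name)
    return dict(groups)


# Select only the features this program needs, not the whole catalogue.
selected = ["f_location_avg_fare", "f_like_count", "f_page_view_count"]
groups = group_by_key(selected, FEATURE_KEYS)
for key_name, names in groups.items():
    print(key_name, names)
```

Each resulting group holds features of a single key, which is the condition a feature list has to satisfy before it is passed as feature_list.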