Before diving into feature engineering, let's first take a closer look at the H&M Fashion Recommendation dataset.
The dataset consists of three main tables: articles, customers, and transactions.
Below is how you can extract and inspect the data:
from recsys.raw_data_sources import h_and_m as h_and_m_raw_data

# Extract articles data
articles_df = h_and_m_raw_data.extract_articles_df()
print(articles_df.shape)
articles_df.head()

# Extract customers data
customers_df = h_and_m_raw_data.extract_customers_df()
print(customers_df.shape)
customers_df.head()

# Extract transactions data
transactions_df = h_and_m_raw_data.extract_transactions_df()
print(transactions_df.shape)
transactions_df.head()
🔗 Full code here → Github
This is what the data looks like:
1 — Customers Table
- Customer ID: A unique identifier for each customer.
- Age: Provides demographic information, which can help predict age-related purchasing behavior.
- Membership status: Indicates whether a customer is a member, which can influence buying patterns and preferences.
- Fashion news frequency: Reflects how often customers receive fashion news, hinting at their engagement level.
- Club member status: Shows whether the customer is an active club member, which can affect loyalty and purchase frequency.
- FN (fashion news score): A numeric score reflecting the customer's engagement with fashion-related content.
2 — Articles Table
- Article ID: A unique identifier for each product.
- Product group: Categorizes products into groups like dresses, tops, or shoes.
- Color: Describes each product's color, which is essential for visual similarity recommendations.
- Department: Indicates the department to which the article belongs, providing context for the type of item.
- Product type: A more detailed classification within product groups.
- Product code: A unique identifier for each product variant.
- Index code: Represents product indexes, useful for segmenting similar items within the same category.
3 — Transactions Table
- Transaction ID: A unique identifier for each transaction.
- Customer ID: Links the transaction to a specific customer.
- Article ID: Links the transaction to a specific product.
- Price: Reflects the transaction amount, which helps analyze spending habits.
- Sales channel: Shows whether the purchase was made online or in-store.
- Timestamp: Records the exact time of the transaction, useful for time-based analysis.
🔗 Full code here → Github
The tables are linked by unique identifiers like customer and article IDs. These connections are essential for making the most of the H&M dataset:
- Customers to Transactions: By associating customer IDs with transaction data, we can create behavioral features like purchase frequency, recency, and total spending, which provide insights into customer activity and preferences (see the Polars sketch below).
- Articles to Transactions: Linking article IDs to transaction records helps us analyze product popularity, identify trends, and understand customer preferences for different types of products.
- Cross-Table Analysis: Combining data from multiple tables allows us to perform advanced feature engineering. For example, we can track seasonal product trends or segment customers based on purchasing behavior, enabling more personalized recommendations.
Table relationships provide a clearer picture of how customers interact with products, which helps improve the accuracy of the recommendation model in suggesting relevant items.
- The Customers table contains customer data, including unique customer IDs (Primary Key), membership status, and fashion news preferences.
- The Articles table stores product details like article IDs (Primary Key), product codes, and product names.
- The Transactions table links customers and articles through purchases, with fields for the transaction date, customer ID (Foreign Key), and article ID (Foreign Key).
The double-line notations between tables indicate one-to-many relationships: each customer can make multiple transactions, and each transaction can involve multiple articles.
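To make the customer-to-transactions join concrete, here is a minimal sketch of deriving behavioral features with Polars. The column names (`customer_id`, `t_dat`, `price`) follow the raw H&M schema, but the aggregation itself is an illustrative assumption, not the repo's exact code:

import polars as pl

def customer_behavior_features(transactions_df: pl.DataFrame) -> pl.DataFrame:
    # Aggregate each customer's transactions into behavioral features.
    return transactions_df.group_by("customer_id").agg(
        pl.len().alias("purchase_count"),               # purchase frequency
        pl.col("price").sum().alias("total_spent"),     # total spending
        pl.col("t_dat").max().alias("last_purchase"),   # recency anchor
    )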
The feature pipeline takes raw data as input and outputs features and labels used for training and inference.
📚 Read more about feature pipelines and their integration into ML systems [6].
Creating effective features for both retrieval and ranking models is the foundation of a successful recommendation system.
Feature engineering for the two-tower model
The two-tower retrieval model's primary objective is to learn user and item embeddings that capture interaction patterns between customers and articles.
We use the transactions table as our source of ground truth — each purchase represents a positive interaction between a customer and an article.
This is the foundation for training the model to maximize similarity between embeddings for actual interactions (positive pairs).
The notebook imports the necessary libraries and modules for feature computation.
This snippet lists the default settings used throughout the notebook, such as model IDs, learning rates, and batch sizes.
It's helpful for understanding the configuration of the feature pipeline and models.
from pprint import pprint

pprint(dict(settings))
🔗 Full code here → Github
Training objective
The goal of the two-tower retrieval model is to use a minimal, strong feature set that is highly predictive but doesn't introduce unnecessary complexity.
The model aims to maximize the similarity between customer and article embeddings for purchased items while minimizing similarity for non-purchased items.
This objective is achieved using a loss function such as cross-entropy loss for sampled softmax, or contrastive loss. The embeddings are then optimized for nearest-neighbor search, which enables efficient filtering in downstream recommendation tasks.
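To make the objective concrete, here is a minimal PyTorch sketch of an in-batch sampled-softmax loss. The tensors stand in for the tower outputs; the course's actual model is implemented differently:

import torch
import torch.nn.functional as F

def sampled_softmax_loss(query_emb: torch.Tensor, item_emb: torch.Tensor) -> torch.Tensor:
    """In-batch sampled softmax: each (customer, purchased article) pair is a
    positive; every other article in the batch acts as a negative."""
    logits = query_emb @ item_emb.T        # (B, B) similarity matrix
    labels = torch.arange(logits.size(0))  # diagonal entries are the positive pairs
    return F.cross_entropy(logits, labels)

# Usage: embeddings would come out of the two towers for a batch of transactions.
batch, dim = 32, 16
query_emb = F.normalize(torch.randn(batch, dim), dim=1)  # stand-in for QueryTower output
item_emb = F.normalize(torch.randn(batch, dim), dim=1)   # stand-in for ItemTower output
loss = sampled_softmax_loss(query_emb, item_emb)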
Feature selection
The two-tower retrieval model deliberately uses a minimal set of strong features to learn robust embeddings:
(1) Query features — used by the QueryTower
(the customer encoder from the two-tower model):
- customer_id: A categorical feature that uniquely identifies each user. This is the backbone of user embeddings.
- age: A numerical feature that can capture demographic patterns.
- month_sin and month_cos: Numerical features that encode cyclic patterns (e.g., seasonality) in user behavior (see the sketch after this section).
(2) Candidate features — used by the ItemTower
(the H&M fashion articles encoder from the two-tower model):
- article_id: A categorical feature that uniquely identifies each item. This is the backbone of item embeddings.
- garment_group_name: A categorical feature that captures high-level categories (e.g., "T-Shirts", "Dresses") to provide more context about the item.
- index_group_name: A categorical feature that captures broader item groupings (e.g., "Menswear", "Womenswear") to provide further context.
These features are passed through their respective towers to generate the query (user) and item embeddings, which are then used to compute similarities during retrieval.
The limited feature set is optimized for the retrieval stage, focusing on quickly identifying candidate items through an approximate nearest neighbor (ANN) search.
This aligns with the four-stage recommender system architecture, ensuring efficient and scalable item retrieval.
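As a quick illustration of the cyclic encoding mentioned above, month_sin and month_cos can be derived from the transaction timestamp like this (a minimal sketch; the exact column names are assumptions):

import math
import polars as pl

# Map the transaction month onto a circle so that December and January
# end up close together, which a raw 1-12 encoding would not capture.
transactions_df = transactions_df.with_columns(
    (2 * math.pi * pl.col("t_dat").dt.month() / 12).alias("month_angle")
).with_columns(
    pl.col("month_angle").sin().alias("month_sin"),
    pl.col("month_angle").cos().alias("month_cos"),
)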
This snippet computes features for articles, such as product descriptions and metadata, and displays their structure.

articles_df = compute_features_articles(articles_df)
articles_df.shape
articles_df.head(3)

compute_features_articles() takes the articles DataFrame and transforms it into a dataset with 27 features across 105,542 articles.
import polars as pl

def compute_features_articles(df: pl.DataFrame) -> pl.DataFrame:
    df = df.with_columns(
        [
            get_article_id(df).alias("article_id"),
            create_prod_name_length(df).alias("prod_name_length"),
            pl.struct(df.columns)
            .map_elements(create_article_description)
            .alias("article_description"),
        ]
    )

    # Add full image URLs.
    df = df.with_columns(image_url=pl.col("article_id").map_elements(get_image_url))

    # Drop columns with null values.
    df = df.select([col for col in df.columns if not df[col].is_null().any()])

    # Remove the 'detail_desc' columns.
    columns_to_drop = ["detail_desc", "detail_desc_length"]
    existing_columns = df.columns
    columns_to_keep = [col for col in existing_columns if col not in columns_to_drop]
    return df.select(columns_to_keep)
One standard approach when manipulating text before feeding it into a model is to embed it. This avoids the curse of dimensionality and the information loss of representations such as one-hot encoding or hashing.
The following snippet generates embeddings for article descriptions using a pre-trained SentenceTransformer model.
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
logger.info(f"Loading '{settings.FEATURES_EMBEDDING_MODEL_ID}' embedding model to {device=}")

# Load the embedding model
model = SentenceTransformer(settings.FEATURES_EMBEDDING_MODEL_ID, device=device)

# Generate embeddings for articles
articles_df = generate_embeddings_for_dataframe(
    articles_df, "article_description", model, batch_size=128
)
🔗 Full code here → Github
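For intuition, here is a simplified sketch of what a helper like generate_embeddings_for_dataframe might do (an illustrative assumption, not the repo's exact implementation):

import polars as pl
from sentence_transformers import SentenceTransformer

def generate_embeddings_for_dataframe(
    df: pl.DataFrame, text_column: str, model: SentenceTransformer, batch_size: int = 128
) -> pl.DataFrame:
    # Encode all texts in batches on the model's device.
    embeddings = model.encode(
        df[text_column].to_list(),
        batch_size=batch_size,
        show_progress_bar=True,
    )
    # Attach each embedding vector as a new list-typed column.
    return df.with_columns(pl.Series("embeddings", embeddings.tolist()))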
Feature engineering for the ranking model
The ranking model has a more complex objective: accurately predicting the likelihood of purchase for each retrieved item.
This model uses a combination of query and item features, along with labels, to predict the likelihood of interaction between users and items.
This feature set is designed to provide rich contextual and descriptive information, enabling the model to rank items effectively.
Generate features for customers:
Training objective
The model is trained to predict purchase probability, with actual purchases (from the transactions table) serving as positive labels (1) and non-purchases as negative labels (0).
This binary classification objective helps order retrieved items by their likelihood of purchase.
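As a minimal sketch of this objective, training a gradient-boosted classifier on the ranking dataset looks roughly like this (XGBoost is an assumption here; the course may use a different model, and categorical features are assumed to be numerically encoded upstream):

from xgboost import XGBClassifier

# X holds the query + item features; y holds the binary purchase labels.
# (Categorical columns are assumed to already be numerically encoded.)
X = ranking_df.drop("label").to_pandas()
y = ranking_df.get_column("label").to_pandas()

ranker = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
ranker.fit(X, y)

# At serving time, retrieved candidates are ordered by predicted purchase probability.
scores = ranker.predict_proba(X.head(10))[:, 1]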
Feature selection
(1) Query features — the same as those used in the retrieval model to encode the customer.
(2) Item features — used to represent the articles in the dataset. These features describe the items' attributes and help the model understand item properties and relationships:
- article_id: A categorical feature that uniquely identifies each item, forming the foundation of item representation.
- product_type_name: A categorical feature that describes the specific type of product (e.g., "T-Shirts", "Dresses"), providing detailed item-level granularity.
- product_group_name: A categorical feature for higher-level grouping of items, useful for capturing broader category trends.
- graphical_appearance_name: A categorical feature representing the visual style of the item (e.g., "Solid", "Striped").
- colour_group_name: A categorical feature that captures the color group of the item (e.g., "Black", "Blue").
- perceived_colour_value_name: A categorical feature describing the brightness or value of the item's color (e.g., "Light", "Dark").
- perceived_colour_master_name: A categorical feature representing the master color of the item (e.g., "Red", "Green"), providing additional color-related information.
- department_name: A categorical feature denoting the department to which the item belongs (e.g., "Menswear", "Womenswear").
- index_name: A categorical feature representing broader categories, providing a high-level grouping of items.
- index_group_name: A categorical feature that groups items into overarching divisions (e.g., "Divided", "Ladieswear").
- section_name: A categorical feature describing the specific section within the store or catalog.
- garment_group_name: A categorical feature that captures high-level garment categories (e.g., "Jackets", "Trousers"), helping the model generalize across similar items.
(3) Label — a binary feature used for supervised learning:
- `1` indicates a positive pair (the customer purchased the item).
- `0` indicates a negative pair (the customer did not purchase the item; randomly sampled).
This approach is designed for the ranking stage of the recommender system, where the focus shifts from generating candidates to fine-tuning recommendations with higher precision.
By incorporating both query and item features, the model ensures that recommendations are relevant and personalized.
Constructing the final ranking dataset
The ranking dataset is the final dataset used to train the scoring/ranking model in the recommendation pipeline.
It is computed by combining query (customer) features, item (article) features, and the interactions (transactions) between them.
compute_ranking_dataset() combines the different features from the Feature Groups:
- `trans_fg`: The transactions Feature Group, which provides the labels (`1` for positive pairs and `0` for negative pairs) and additional interaction-based features (e.g., recency, frequency).
- `articles_fg`: The articles Feature Group, which contains the engineered item features (e.g., product type, color, department, etc.).
- `customers_fg`: The customers Feature Group, which contains customer features (e.g., age, membership status, purchase behavior).
The resulting ranking dataset includes:
- Customer features: from `customers_fg`, representing the query.
- Item features: from `articles_fg`, representing the candidate items.
- Interaction features: from `trans_fg`, such as purchase frequency or recency, which capture behavioral signals.
- Label: a binary label (`1` for purchased items, `0` for negative samples).
The result is a dataset where each row represents a customer-item pair, with the features and label indicating whether the customer interacted with the item.
In practice, this looks as follows:
ranking_df = compute_ranking_dataset(
    trans_fg,
    articles_fg,
    customers_fg,
)
ranking_df.shape
Negative sampling for the ranking dataset
The ranking dataset includes both positive and negative samples.
This ensures the model learns to differentiate between relevant and irrelevant items:
- Positive samples (label = 1): derived from the transactions Feature Group (`trans_fg`), where a customer purchased a specific item.
- Negative samples (label = 0): generated by randomly sampling items the customer did not purchase. These represent items the customer is less likely to interact with and help the model better understand what is irrelevant to the user.
# Inspect the label distribution in the ranking dataset
ranking_df.get_column("label").value_counts()
Outputs:
label   count
i32     u32
1       20377
0       203770
Negative samples are constrained to make them realistic, such as sampling items from the same category or department as the customer's purchases, or including popular items the customer hasn't interacted with, simulating plausible alternatives.
For example, if the customer purchased a "T-shirt," negative samples might include other "T-shirts" they didn't buy.
Negative samples are typically kept in proportion to positive ones. For every positive sample, we might add several negative ones (this notebook uses a 1:10 ratio, as the label distribution above shows). This prevents the model from favoring negative pairs, which are far more common in real-world data.
import polars as pl

def compute_ranking_dataset(trans_fg, articles_fg, customers_fg) -> pl.DataFrame:
    ...  # More code

    # Create positive pairs
    positive_pairs = df.clone()

    # Calculate the number of negative pairs
    n_neg = len(positive_pairs) * 10

    # Create negative pairs DataFrame
    article_ids = (df.select("article_id")
        .unique()
        .sample(n=n_neg, with_replacement=True, seed=2)
        .get_column("article_id"))
    customer_ids = (df.select("customer_id")
        .sample(n=n_neg, with_replacement=True, seed=3)
        .get_column("customer_id"))
    other_features = (df.select(["age"])
        .sample(n=n_neg, with_replacement=True, seed=4))

    # Construct negative pairs
    negative_pairs = pl.DataFrame({
        "article_id": article_ids,
        "customer_id": customer_ids,
        "age": other_features.get_column("age"),
    })

    # Add labels
    positive_pairs = positive_pairs.with_columns(pl.lit(1).alias("label"))
    negative_pairs = negative_pairs.with_columns(pl.lit(0).alias("label"))

    # Concatenate positive and negative pairs
    ranking_df = pl.concat([
        positive_pairs,
        negative_pairs.select(positive_pairs.columns)
    ])

    ...  # More code
    return ranking_df
Once the ranking dataset is computed, it is uploaded to Hopsworks as a new Feature Group, with lineage information reflecting its dependencies on the parent Feature Groups (`articles_fg`, `customers_fg`, and `trans_fg`).
logger.data("Importing 'rating' Function Group to Hopsworks.")
rank_fg = feature_store.create_ranking_feature_group(
fs,
df=ranking_df,
dad and mom=[articles_fg, customers_fg, trans_fg],
online_enabled=False
)
logger.data("✅ Uploaded 'rating' Function Group to Hopsworks!!")
This lineage ensures that any updates to the parent Feature Groups (e.g., new transactions or articles) can be propagated to the ranking dataset, keeping it up-to-date and consistent.
The Hopsworks Feature Store is a centralized repository for managing features.
The following shows how to authenticate and connect to the feature store:
from recsys import hopsworks_integration

# Connect to the Hopsworks Feature Store
project, fs = hopsworks_integration.get_feature_store()
🔗 Full code here → Github
Step 1: Define Feature Groups
Feature Groups are logical groupings of related features that can be used together in model training and inference.
For example:
1 — Customer Feature Group
Includes all customer-related features, such as demographic, behavioral, and engagement metrics.
- Demographics: Age, gender, membership status.
- Behavioral features: Purchase history, average spending, visit frequency.
- Engagement metrics: Fashion news frequency, club membership status.
2 — Article Feature Group
Includes features related to articles (products), such as descriptive attributes, popularity metrics, and image features.
- Descriptive attributes: Product group, color, department, product type, product code.
- Popularity metrics: Number of purchases, ratings.
- Image features: Visual embeddings derived from product images.
3 — Transaction Feature Group
Includes all transaction-related features, such as transactional details, interaction metrics, and contextual features.
- Transactional attributes: Transaction ID, customer ID, article ID, price.
- Interaction metrics: Recency and frequency of purchases.
- Contextual features: Sales channel, timestamp of the transaction.
Adding a feature group to Hopsworks:

from recsys.hopsworks_integration.feature_store import create_feature_group

# Create a feature group for article features
create_feature_group(
    feature_store=fs,
    feature_data=article_features_df,
    feature_group_name="articles_features",
    description="Features for articles in the H&M dataset"
)
🔗 Full code here → Github
Step 2: Data ingestion
To ensure the data is correctly structured and ready for model training and inference, the next step involves loading data from the H&M dataset into the respective Feature Groups in Hopsworks.
Here's how it works:
1 — Data loading
Start by extracting data from the H&M source files, processing it into features, and loading it into the correct Feature Groups.
2 — Data validation
After loading, check that the data is accurate and matches the expected structure.
- Consistency checks: Verify that the relationships between datasets are correct.
- Data cleaning: Address any issues in the data, such as missing values, duplicates, or inconsistencies.
Fortunately, Hopsworks supports integration with Great Expectations, adding a powerful data validation layer during data loading (a sketch follows below).
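For example, a minimal Great Expectations suite attached to a Feature Group could look like this (a sketch assuming the classic `ExpectationConfiguration` API and the hsfs `save_expectation_suite` method; the expectations themselves are illustrative):

import great_expectations as ge

# Define a validation suite for the transactions data.
suite = ge.core.ExpectationSuite(expectation_suite_name="transactions_suite")
suite.add_expectation(
    ge.core.ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "customer_id"},
    )
)
suite.add_expectation(
    ge.core.ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={"column": "price", "min_value": 0, "max_value": 10_000},
    )
)

# Once attached, Hopsworks runs the suite on every insert into the Feature Group.
trans_fg.save_expectation_suite(suite, validation_ingestion_policy="STRICT")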
Step 3: Versioning and metadata management
Versioning and metadata management are essential for keeping your Feature Groups organized and ensuring models can be reproduced.
The key steps are:
- Version control: Track different versions of Feature Groups so you can recreate and validate models using specific data versions. For example, if there are significant changes to the Customer Feature Group, create a new version to reflect those changes.
- Metadata management: Document the details of each feature, including its definition, how it's transformed, and any dependencies it has on other features.
rank_fg = fs.get_or_create_feature_group(
    name="ranking",
    version=1,
    description="Derived feature group for ranking",
    primary_key=["customer_id", "article_id"],
    parents=[articles_fg, customers_fg, trans_fg],
    online_enabled=online_enabled,
)
rank_fg.insert(df, write_options={"wait_for_job": True})

for desc in constants.ranking_feature_descriptions:
    rank_fg.update_feature_description(desc["name"], desc["description"])
Defining Feature Groups, managing data ingestion, and tracking versions and metadata ensure your features are organized, reusable, and reliable, making it easier to maintain and scale your ML workflows.
View the results in Hopsworks Serverless: Feature Store → Feature Groups
Hopsworks Feature Groups are key to making machine learning workflows more efficient and organized.
Here's how they help:
1 — Centralized repository
- Single source of truth: Feature Groups in Hopsworks provide a centralized place for all your feature data, ensuring everyone on your team uses the same, up-to-date data. This reduces the risk of inconsistencies and errors when different people use outdated or divergent datasets.
- Easier management: Managing all features in one place becomes simpler. Updating, querying, and maintaining the features is streamlined, leading to increased productivity and smoother workflows.
2 — Feature reusability
- Cross-model consistency: Features stored in Hopsworks can be used across different models and projects, ensuring consistency in their definition and application. This eliminates the need to re-engineer features every time, saving time and effort.
- Faster development: Since you can reuse features, you don't have to start from scratch. You can quickly leverage existing, well-defined features, speeding up the development and deployment of new models.
3 — Scalability
- Optimized performance: The platform ensures that queries and feature updates are performed quickly, even when dealing with large amounts of data. This is essential for maintaining model performance in production.
4 — Versioning and lineage
- Version control: Hopsworks provides version control for Feature Groups, so you can keep track of changes made to features over time. This helps reproducibility, as you can go back to previous versions if needed.
- Data lineage: Tracking data lineage lets you document how features are created and transformed. This adds transparency and helps you understand the relationships between features.
Read more on feature groups [4] and how to integrate them into ML systems.
Imagine you're running H&M's online recommendation system, which delivers personalized product suggestions to millions of users.
Currently, the system uses a static pipeline: embeddings for users and products are precomputed using a two-tower model and stored in an Approximate Nearest Neighbor (ANN) index.
When users interact with the site, similar products are retrieved, filtered (e.g., excluding already-seen or out-of-stock items), and ranked by a machine learning model.
While this approach works well offline, it struggles to adapt to real-time changes, such as shifts in user preferences or the launch of new products.
Shifting to a streaming data pipeline is essential to make the recommendation system dynamic and responsive.
Step 1: Integrating real-time data
The first step is to introduce real-time data streams into your pipeline. To begin, consider the types of events your system must handle:
- User behavior: Real-time interactions such as clicks, purchases, and searches, to keep up with evolving preferences.
- Product updates: Stream data on new arrivals, price changes, and stock updates to ensure recommendations reflect the most up-to-date catalog.
- Embedding updates: Continuously recalculate user and product embeddings to maintain the accuracy and relevance of the recommendation model.
Step 2: Updating the retrieval stage
In a static pipeline, retrieval depends on a precomputed ANN index that matches user and item embeddings based on similarity.
However, as embeddings evolve, keeping the retrieval process synchronized with these changes is essential to maintain accuracy and relevance.
Hopsworks supports updating the ANN index. This simplifies embedding updates and keeps the retrieval process aligned with the latest embeddings.
Here's how to upgrade the retrieval stage:
- Upgrade the ANN index: Switch to a system capable of incremental updates, like FAISS, ScaNN, or Milvus. These libraries support real-time similarity searches and can instantly incorporate new and updated embeddings.
- Stream embedding updates: Integrate a message broker like Kafka to feed updated embeddings into the system. As a user's preferences change or new items are added, the corresponding embeddings should be updated in real time (see the sketch after this list).
- Ensure freshness: Build a mechanism to prioritize the latest embeddings during similarity searches. This ensures recommendations are always based on the most current user preferences and available content.
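Here is a minimal sketch of streaming embedding updates into an incrementally updatable index (the topic name, message schema, and embedding dimension are illustrative assumptions):

import json
import numpy as np
import faiss
from kafka import KafkaConsumer  # kafka-python

EMBEDDING_DIM = 16
index = faiss.IndexIDMap(faiss.IndexFlatIP(EMBEDDING_DIM))  # supports add/remove by id

consumer = KafkaConsumer(
    "item-embedding-updates",              # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v),
)

for message in consumer:
    item_id = np.array([message.value["article_id"]], dtype=np.int64)
    embedding = np.array([message.value["embedding"]], dtype=np.float32)
    index.remove_ids(item_id)               # drop the stale embedding, if present
    index.add_with_ids(embedding, item_id)  # insert the fresh one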
Step 3: Updating the filtering stage
After retrieving a list of candidate items, the next step is filtering out irrelevant or unsuitable options. In a static pipeline, filtering relies on precomputed data, like whether a user has already watched a video or whether it's regionally available.
However, for a real-time system, filtering must adapt instantly to new data.
Here's how to update the filtering stage:
- Track recent customer activity: Use a stream processing framework like Apache Flink or Kafka Streams to maintain a real-time record of customer interactions (a simplified in-memory version is sketched after this list).
- Dynamic stock availability: Continuously update product availability based on real-time inventory data. If an item goes out of stock, it should be filtered out immediately.
- Personalized filters: Apply personalized rules in real time, such as excluding items that don't match a customer's size, color preferences, or browsing history.
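For intuition, here is a toy in-memory version of such real-time filtering (in production, a stream processor like Flink would own this state; all names here are hypothetical):

import time
from collections import defaultdict, deque

RECENT_WINDOW_SECONDS = 3600
recent_items = defaultdict(deque)  # customer_id -> deque of (timestamp, article_id)

def record_interaction(customer_id: str, article_id: str) -> None:
    # Called for every click/purchase event coming off the stream.
    recent_items[customer_id].append((time.time(), article_id))

def filter_candidates(customer_id: str, candidates: list[str], in_stock: set[str]) -> list[str]:
    cutoff = time.time() - RECENT_WINDOW_SECONDS
    interactions = recent_items[customer_id]
    while interactions and interactions[0][0] < cutoff:
        interactions.popleft()  # evict interactions older than the window
    seen = {article_id for _, article_id in interactions}
    # Keep only unseen, in-stock candidates.
    return [c for c in candidates if c not in seen and c in in_stock]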
First, you must create an account on Hopsworks's Serverless platform. Both creating an account and running our code are free.
Then you have three main options to run the feature pipeline:
- In a local Notebook or Google Colab: access instructions
- As a Python script from the CLI: access instructions
- GitHub Actions: access instructions
View the results in Hopsworks Serverless: Feature Store → Feature Groups
We recommend using GitHub Actions if you have a poor internet connection and keep getting timeout errors when loading data to Hopsworks. This happens because we push millions of items to Hopsworks.
In this lesson, we covered the essential components of the feature pipeline, from understanding the H&M dataset to engineering features for both retrieval and ranking models.
We also introduced Hopsworks Feature Groups, emphasizing their importance in effectively organizing, managing, and reusing features.
Finally, we covered the transition to a real-time streaming pipeline, which is essential for making recommendation systems adaptive to evolving user behaviors.
With this foundation, you can manage and optimize features for high-performing machine learning systems that deliver personalized, high-impact user experiences.
In Lesson 3, we'll dive into the training pipeline, focusing on training, evaluating, and managing retrieval and ranking models using the Hopsworks model registry.
💻 Explore all the lessons and the code in our freely available GitHub repository.
If you have questions or need clarification, feel free to ask. See you in the next lesson!
The H&M Real-Time Personalized Recommender course is part of Decoding ML's open-source series of end-to-end AI courses.
For more free courses on production AI, GenAI, information retrieval, and MLOps systems, consider checking out our available courses.
Also, we offer a free weekly newsletter on AI that works in production ↓