Recommendation Engine

TensorFlow Deep Learning Recommenders on Retail Dataset

Take advantage of TensorFlow 2.0's new flexible library to deploy a recommendation engine on a retail dataset.

Taufik Azri · Published in CodeX · 15 min read · Jul 3, 2021

Personalization is the key to winning attention in consumer retail. Photo by The Creative Exchange on Unsplash.

Retail data has grown exponentially over the past few years. Covid-19 accelerated the trend, shifting a massive number of transactions from offline to online. With increasing data integration across mobile applications, notably social media, companies have gained more insight into consumers' activity, behavior, sentiment, and preferences. How can we take advantage of these inputs to produce an effective, curated, and personalized recommendation engine that caters to real-time changes in ever-shifting retail dynamics? We need not just a powerful engine that can handle massive text, time, and image data, but also a flexible library that can adapt to fluctuations in these inputs.

To address this issue, I want to shed light on TensorFlow Recommenders (TFRS), TensorFlow's new recommendation library, which has the potential to scale up to meet these challenges. The library is still a work in progress, but as of now it offers a set of tools that lets us build a hybrid engine, taking advantage of a neural network's embedding layers while simplifying input and output handling. I will demonstrate a simple application of this library on an open retail dataset, with the goal of adding another option to the existing set of recommendation tools.

Why do we need a flexible library?

Many recommendation algorithms that use simple user-based or content-based similarity are hard to scale up to consume many streams of information, resulting in lost opportunities to gain predictive accuracy. In my experience, increasing input channels with additional filters to mitigate data error and data sparsity is a complex process, hard to build and harder to maintain.

We need a library that can parallelize input streams, handling text embedding and tokenization, data normalization, and data sparsity together, so they can be processed seamlessly without exhausting computing capacity. Using pure deep learning can be computationally expensive, so we need a simpler API that processes these calculations efficiently. I believe TFRS can solve these issues, and it will likely be expanded greatly in the near future, so let us learn to build one from scratch.

Objective

The purpose of this article is to demonstrate the TFRS recommender library on customer retail data. I am grateful that TF has released the source code on its website. I adapted its code to the open Brazilian e-commerce dataset from Olist, made available on Kaggle. The TF library lets us build a flexible, modular model, with the ability to add multiple features while adjusting model complexity painlessly.

This article is not an introduction to recommendation systems. Those who would like to read about the fundamentals of user-based recommendation may read an earlier article I have written here. I have also written an in-depth article on running an engine using the TensorRec module (which uses an earlier version of TensorFlow).

I will demonstrate the model in this order, in increasing layers of complexity and practicality:
- Retrieval model
- Ranking model
- Adding text and timestamp embedding
- Multitask recommendation, combining retrieval and ranking
- Adding more features using a Cross Network

Each model adds elements from the previous ones, so at the end we will have a complete model that can cater to different input types.

Dataset

Sample rows of the cleaned dataset.

The data was provided on Kaggle by Olist, an e-commerce platform provider from Brazil. The dataset consists of customer transactions, customer information, and product information. I merged several disparate tables into one that looks like the sample above. I also refined the product ID by combining the product category with a numerical recoding of the original product ID (which was a long string), and I transformed the transaction time into a Unix timestamp.
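
A partial sketch of those merging and recoding steps, assuming the Kaggle Olist CSV file names; the exact merge keys and the quantity aggregation in the full code may differ:

import pandas as pd

orders = pd.read_csv("olist_orders_dataset.csv")
order_items = pd.read_csv("olist_order_items_dataset.csv")
products = pd.read_csv("olist_products_dataset.csv")

df = orders.merge(order_items, on="order_id").merge(products, on="product_id")

# Compact product id: category name plus a numeric recoding of the
# original long hash-like id.
df["product_id"] = (df["product_category_name"] + "_" +
                    pd.factorize(df["product_id"])[0].astype(str))

# Transaction time as Unix epoch seconds.
df["timestamp"] = (pd.to_datetime(df["order_purchase_timestamp"])
                   .astype("int64") // 10**9)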

It contains the essential information we need for the engine, namely:

- Customer ID: identified by user_id.
- Product ID: identified by product_id.
- Rating: identified by quantity.

The customer and product IDs are important: we identify a customer's preference through the products they have bought, and through similar users with the same profile who have bought the same items. The quantity of items bought is akin to ratings in a movie dataset; we can assume that if a customer has bought a lot of an item, he or she has a more positive sentiment towards it, conceptually equivalent to a user's rating of a movie.

There are other available features too:
- Time: identified by timestamp, the date and time of the transaction, here converted to Unix epoch.
- Geographical location: identified by customer_city.
- Product features: identified by product_category.
- Explicit rating: identified by review_score.

These additional features may help improve the accuracy of the model. There are many other features we could derive, such as customer segmentation or product ranking, but for now let us focus on the available ones.

I will demonstrate a very simple engine that takes user ID, product ID, and quantity, and then gradually add other features as we head towards a full-scale model. The goal is to learn the essential components of the model, so we can arrange the modules as necessary.

Retrieval and Ranking

TFRS provides two types of tasks: retrieval and ranking. The retrieval task selects an initial set of candidates from all possible choices. The objective is to eliminate candidates that a user is unlikely to be interested in. A retrieval task can deal with millions of candidates and return only a handful of items, saving computational power and memory.

Ranking, on the other hand, takes the output of the retrieval task and selects the few best possible items, ordering them from top to bottom. It normally returns a probability score for each item and sorts them from highest score to lowest.

The model building blocks are stacked in this order:
- Build the user, item, and rating (quantity) data.
- Build the lookup table and shuffle the dataset.
- Define the sequential layers and model tower.
- Fit and evaluate.

Retrieval Task — Data Preparation

Let us start by choosing the data for the retrieval task. We only need to prepare two reference tables: an item table, which lists the candidate items to be recommended, and an interaction (query) table (a ratings table, in the movie example) that records users' previous purchase history. Below is the code for building this data. The full code is available on GitHub.
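
A minimal sketch of that preparation, assuming the merged table from the previous step is a pandas DataFrame df; the "TF dictionary" is simply a tf.data.Dataset built from a Python dict of columns:

import tensorflow as tf

# Interaction (query) table: one row per purchase event.
interactions = tf.data.Dataset.from_tensor_slices({
    "user_id": df["user_id"].astype(str).values,
    "product_id": df["product_id"].astype(str).values,
    # Quantity acts as an implicit rating; cast to float32 up front
    # to avoid dtype errors later in the model.
    "quantity": df["quantity"].values.astype("float32"),
})

# Item (candidate) table: the unique products available for recommendation.
items = tf.data.Dataset.from_tensor_slices(
    df["product_id"].unique().astype(str))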

The code essentially selects the required features into a table with the data type of a TF dictionary. Notice that quantity, a numerical feature, is copied into the dictionary as a float. Very often, when you pass a feature that is not compatible with the TF requirements, you will encounter an error message such as the one below:

Tensor conversion requested dtype int64 for Tensor with dtype float32: <tf.Tensor 'IteratorGetNext:4' shape=(None,) dtype=float32>

The key to avoiding errors like this is to ensure numeric data types are properly defined, either as integer or float.

Retrieval Task — Lookup Table and Shuffle Dataset

We need a reference map (lookup table) from the raw values of categorical features to the embedding vectors in the retrieval model. Do not worry if this sounds abstract: these are vectorized elements that will be inserted and used as references in the model layers.

To do that, we need a unique vocabulary list that maps each raw feature value to an integer in a contiguous range, which is then mapped to the corresponding embedding in the embedding tables inside the model. We will also shuffle the dataset, then split it into train and test segments at an 80-20 ratio.
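
A sketch of both steps under the same assumptions as above; the vocabulary arrays are reused later when building the embedding layers:

import numpy as np

# Vocabulary lists: every unique user and product id in the data.
unique_user_ids = np.unique(np.concatenate(
    [batch["user_id"].numpy() for batch in interactions.batch(1_000)]))
unique_product_ids = np.unique(np.concatenate(
    [batch["product_id"].numpy() for batch in interactions.batch(1_000)]))

# Shuffle once, then split 80-20 into train and test.
tf.random.set_seed(42)
num_rows = int(interactions.cardinality().numpy())
shuffled = interactions.shuffle(100_000, seed=42,
                                reshuffle_each_iteration=False)
train = shuffled.take(int(num_rows * 0.8))
test = shuffled.skip(int(num_rows * 0.8))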

Retrieval Task — Define Sequential Layers and Model Tower

After we have selected the data and prepared the lookup table, we build a model tower, where the model takes the inputs, stacks them as embedding layers, and then passes them into the task and loss calculation. There are five important components of the tower: the candidate (item) model, the query (user) model, the task (retrieval), the metrics (factorized top-k), and the loss computation.

We need to define the following components in the tower (a sketch follows this list):
- Candidate model: items, embedded into sequential layers.
- Query model: users, embedded into sequential layers.
- Task: here, the retrieval task.
- Metrics: factorized top-k retrieval accuracy.
- Loss function: measured against the products users actually bought.
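
The tower below is a condensed sketch of these components following the standard TFRS pattern; the class name RetailRetrievalModel and the 32-dimensional embeddings are illustrative, not the exact code from the repository:

import tensorflow_recommenders as tfrs

embedding_dim = 32

# Query (user) model: id string -> contiguous integer -> embedding vector.
user_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(
        vocabulary=unique_user_ids, mask_token=None),
    tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dim),
])

# Candidate (item) model: same structure over product ids.
item_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(
        vocabulary=unique_product_ids, mask_token=None),
    tf.keras.layers.Embedding(len(unique_product_ids) + 1, embedding_dim),
])

class RetailRetrievalModel(tfrs.Model):
    def __init__(self):
        super().__init__()
        self.user_model = user_model
        self.item_model = item_model
        # Retrieval task scored with factorized top-k accuracy over
        # the full candidate set.
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=items.batch(128).map(item_model)))

    def compute_loss(self, features, training=False):
        user_embeddings = self.user_model(features["user_id"])
        item_embeddings = self.item_model(features["product_id"])
        return self.task(user_embeddings, item_embeddings)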

The code above may appear lengthy and intimidating, but essentially it passes the user and item embeddings into sequential layers, then defines the task and metrics that compute the loss. Notice that we pass the same column names as the features we selected during data preparation.

Retrieval Model — Fitting and Evaluation

Now we are ready to fit the model and evaluate its accuracy using the model towers we have defined above.

We take the user and item embeddings and pass them into the model tower. We use the Adagrad optimizer to minimize the loss at each iteration, but there are other options too, such as Adam and stochastic gradient descent (SGD). After that, we fit the model for ten epochs and check the accuracy at each top-k cutoff. Normally, a decent deep-learning model should run for at least 100 epochs, but here we use ten for the sake of demonstration.
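
A sketch of the training loop, reusing the datasets and tower defined above; the batch sizes mirror those used later in the article:

model = RetailRetrievalModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

cached_train = train.batch(8192).cache()
cached_test = test.batch(4096).cache()

model.fit(cached_train, epochs=10)
model.evaluate(cached_test, return_dict=True)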

The model seems decent, with the accuracy for the top 10 items nearing 90% at ten epochs. More epochs generally mean higher accuracy, but not indefinitely: you will often find that accuracy flattens after a certain epoch, so a plot of accuracy against epochs, as shown below, can visualize the rate of change.

Retrieval Model — Retrieve Top Items for Users

The final step is to retrieve the top items for each user. We can use brute force to search through all candidates and produce the top ten, here doing so for user 40, with the output shown below the code.
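
A sketch using the TFRS BruteForce layer; the user id "40" follows the article's example:

index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
# Index every candidate with its embedding so that queries can be
# scored against all of them.
index.index_from_dataset(
    items.batch(128).map(lambda pid: (pid, model.item_model(pid))))

scores, top_products = index(tf.constant(["40"]), k=10)
print(f"Top 10 recommendations for user 40: {top_products[0]}")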

We have just built a simple yet effective recommendation engine using the retrieval task. However, TFRS also provides another module called ranking. Paired with retrieval, it keeps computation manageable on datasets with millions of items: retrieval produces a short list of candidates, and ranking orders that short list from best to worst.

Ranking Model

A ranking model complements retrieval by scoring items from highest to lowest, predicting the probability that a user will like each one. Applied to the short list of candidates that the retrieval task returns, it keeps the overall pipeline computationally efficient. In this example, we will look at a very simple ranking model; after that, we will add more features and combine the ranking and retrieval models into a multitask model. The full code is available here.

Here we insert the same query and candidate models built earlier for the retrieval task into the model tower. The only differences are that we use the ranking task and calculate the accuracy metrics using RMSE instead of factorized top-k.
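
A sketch of the ranking tower: the two embeddings are concatenated and fed to a small dense network that predicts quantity, with loss and RMSE handled by tfrs.tasks.Ranking (layer sizes here are illustrative):

class RetailRankingModel(tfrs.Model):
    def __init__(self):
        super().__init__()
        self.user_model = user_model
        self.item_model = item_model
        # Dense layers mapping the concatenated embeddings to a
        # predicted quantity.
        self.rating_model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(1),
        ])
        self.task = tfrs.tasks.Ranking(
            loss=tf.keras.losses.MeanSquaredError(),
            metrics=[tf.keras.metrics.RootMeanSquaredError()])

    def compute_loss(self, features, training=False):
        user_embeddings = self.user_model(features["user_id"])
        item_embeddings = self.item_model(features["product_id"])
        predictions = self.rating_model(
            tf.concat([user_embeddings, item_embeddings], axis=1))
        return self.task(labels=features["quantity"],
                         predictions=predictions)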

The procedure is as usual: we call fit on the train data, then evaluate the metrics on the test data. However, as we can see, after four epochs the RMSE is not very good. In the next section, we shall see how to improve the model by adding more features and then combining the ranking and retrieval models.

Adding Text and Timestamp Features

TFRS can process product names with similar words. For example, the model can build a similarity relationship between boxes bearing the names “oat” or “wheat”. Photo by Franki Chamaki on Unsplash.

One powerful feature of the TFRS library is its ability to turn text and timestamps into features. The library can tokenize text into bags of words that affect similarity measures among items. For instance, given two boxes of cereal bearing the word “oat”, TF can build a network of similarities that pairs the two together. Similarly, for temporal data, TF can analyze users who bought an item around the same time. Let us explore how to add text and time features to the model, with an option to check whether adding the timestamp significantly improves accuracy. The full code is available here.

We follow the same procedure as in the retrieval task. Here, in the data preparation, we add the timestamp to the interaction table. Notice the emphasis on the float data type; you may encounter a data-type error if TF cannot process it.

Numerical features should be standardized, since doing so improves computational efficiency and accuracy. Here we bucketize the timestamp into 1,000 linearly spaced buckets, a massive reduction from two years of near-second-granularity data. There are other methods of standardizing numerical data too, such as normalization.
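
A sketch of the bucketization, assuming timestamps is a NumPy array of the Unix epochs added to the interaction table:

import numpy as np

# 1,000 equal-width bucket boundaries spanning the full time range.
max_timestamp = timestamps.max()
min_timestamp = timestamps.min()
timestamp_buckets = np.linspace(min_timestamp, max_timestamp, num=1_000)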

We split the query and candidate models into separate classes to allow more stacked embedding layers before passing them into the full model. In the user (query) model, in addition to the user embedding, we also add a timestamp embedding. Later, in the sequential tower, we add an option to toggle the timestamp, so we can compare using time against not using it.
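
An illustrative user model with that toggle, adapted from the TFRS feature-preprocessing pattern; the embedding sizes are assumptions:

class UserModel(tf.keras.Model):
    def __init__(self, use_timestamps):
        super().__init__()
        self._use_timestamps = use_timestamps
        self.user_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=unique_user_ids, mask_token=None),
            tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
        ])
        if use_timestamps:
            # Bucketized timestamp embedding plus a normalized raw value.
            self.timestamp_embedding = tf.keras.Sequential([
                tf.keras.layers.Discretization(timestamp_buckets.tolist()),
                tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32),
            ])
            self.normalized_timestamp = tf.keras.layers.Normalization(axis=None)
            self.normalized_timestamp.adapt(timestamps)

    def call(self, features):
        if not self._use_timestamps:
            return self.user_embedding(features["user_id"])
        return tf.concat([
            self.user_embedding(features["user_id"]),
            self.timestamp_embedding(features["timestamp"]),
            tf.reshape(self.normalized_timestamp(features["timestamp"]), (-1, 1)),
        ], axis=1)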

For the candidate model, we want the model to learn from text features too, by learning which words are similar to each other. It can also handle out-of-vocabulary (OOV) words, so if we are predicting a new item, the model can still score it appropriately.

Below, the item name is transformed by tokenization (splitting into constituent words), followed by vocabulary learning, followed by embedding.
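
A sketch of that text tower, assuming product_names is a dataset of product name strings (in this dataset, the recoded product ids carry the category name); TextVectorization reserves an OOV bucket by default:

max_tokens = 10_000

title_vectorizer = tf.keras.layers.TextVectorization(max_tokens=max_tokens)
title_vectorizer.adapt(product_names)  # learn the vocabulary from the names

title_text_embedding = tf.keras.Sequential([
    title_vectorizer,
    tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
    # Average the word embeddings into one vector per product name.
    tf.keras.layers.GlobalAveragePooling1D(),
])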

With both user model and item model defined, we can insert them into the full model tower and then implement our loss and metric logic inside it. Note that we also need to make sure that the query model and candidate model produce output embeddings of compatible size. Because we will be varying their sizes by adding more features, the easiest way to accomplish this is to use a dense projection layer after each query and candidate model.

We are ready to try out our first model. Let us start by not using timestamp features to establish a baseline.

model = RetailModel(use_timestamps=False)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))
model.fit(cached_train, epochs=3)
model.evaluate(cached_test, return_dict=True)

After running for three epochs, the accuracy for the top 100 is not great, and it is lower still at stricter cutoffs. What if we include the timestamp by running RetailModel(use_timestamps=True)?

As we can see, the result improves slightly. Even though we ran only three epochs, the accuracy increased, which suggests that time contributes a positive signal to the engine's recommendations.

Multi-task Recommenders with ReLU-based DNN

Since we have learned the ranking and retrieval tasks separately, we can now add them together to produce what we hope is an even better model. Adapting the guide produced by TF, I will provide an option to weigh retrieval and ranking inside the model, allowing one task to have greater influence on the calculation than the other. If we assign a large loss weight to the ranking task, our model will focus on predicting ratings (but still use some information from the retrieval task); if we assign a large loss weight to the retrieval task, it will focus on retrieval instead. In addition, we will add multiple Rectified Linear Unit (ReLU) dense layers to the model tower. Below is the model tower that stacks the user model, item model, ReLU-based DNN layers, retrieval and ranking tasks, and loss calculations. The full code is available here.
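
A condensed sketch of that tower; the joint loss is simply the weighted sum of the two task losses (the layer sizes are illustrative, and the class name matches the usage below):

class Model(tfrs.models.Model):
    def __init__(self, rating_weight, retrieval_weight):
        super().__init__()
        self.user_model = user_model
        self.item_model = item_model
        # ReLU-based DNN head that predicts the rating (quantity).
        self.rating_model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(1),
        ])
        self.rating_task = tfrs.tasks.Ranking(
            loss=tf.keras.losses.MeanSquaredError(),
            metrics=[tf.keras.metrics.RootMeanSquaredError()])
        self.retrieval_task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=items.batch(128).map(item_model)))
        self.rating_weight = rating_weight
        self.retrieval_weight = retrieval_weight

    def compute_loss(self, features, training=False):
        user_embeddings = self.user_model(features["user_id"])
        item_embeddings = self.item_model(features["product_id"])
        rating_predictions = self.rating_model(
            tf.concat([user_embeddings, item_embeddings], axis=1))
        rating_loss = self.rating_task(
            labels=features["quantity"], predictions=rating_predictions)
        retrieval_loss = self.retrieval_task(user_embeddings, item_embeddings)
        # The loss weights decide which task dominates training.
        return (self.rating_weight * rating_loss
                + self.retrieval_weight * retrieval_loss)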

Rating-specialized model

Depending on the weights we assign, the model strikes a different balance between the tasks. Let us start with a model that considers only the rating task: full weight on rating, zero weight on retrieval.

model = Model(rating_weight=1.0, retrieval_weight=0.0)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()
model.fit(cached_train, epochs=3)
metrics = model.evaluate(cached_test, return_dict=True)
print(f"Retrieval top-100 accuracy: {metrics['factorized_top_k/top_100_categorical_accuracy']:.3f}.")
print(f"Ranking RMSE: {metrics['root_mean_squared_error']:.3f}.")

The top-100 accuracy seems low, and the RMSE is not great either. What if we use retrieval only?

We can see the accuracy increases but the RMSE worsens. Let us run a model with positive weights on both tasks.

We can see that both accuracy and RMSE improve. With more epochs, the model should improve further. But so far we only have the timestamp as an additional feature. How can we incorporate all of the features into one single model?

Incorporating Various Inputs Using a Cross Network

A powerful recommendation engine should accept input from many features. Given the rich availability of user and item features, TFRS provides a cross network module that combines the layers of all features, explicitly crossing them into polynomial feature interactions before passing them into the DNN's feed-forward layers. We can also place the cross network layers after the DNN layers, or run the two in parallel. Let us see how to prepare and insert all the available features into the model tower. The full code is available here.

While this data preparation code may seem long, the only difference from the previous one is that we add all the features we want to incorporate and retrieve a unique reference table for each of them. Here, we include product category and customer city in the interaction table and create a unique lookup table for each.

Here I show the full user model, separated into two classes. First, UserModel handles tokenization for text features and bucketization for numerical features. Its stacked layers are then, in QueryModel, passed into the cross network and the ReLU DNN. A similar step follows for the item model, not shown here. We then insert both user and item models into the full model tower, with tasks copied from the multitask model.
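
A sketch of QueryModel under those assumptions, building on the UserModel defined earlier and using the DCN-v2 cross layer from tfrs.layers.dcn (projection_dim controls the low-rank approximation; None keeps the full-rank cross):

class QueryModel(tf.keras.Model):
    def __init__(self, layer_sizes, projection_dim=None):
        super().__init__()
        self.embedding_model = UserModel(use_timestamps=True)
        # Cross layer learns explicit feature interactions.
        self._cross_layer = tfrs.layers.dcn.Cross(
            projection_dim=projection_dim,
            kernel_initializer="glorot_uniform")
        # ReLU DNN layers, e.g. layer_sizes=[32].
        self._dense_layers = tf.keras.Sequential([
            tf.keras.layers.Dense(size, activation="relu")
            for size in layer_sizes
        ])

    def call(self, features):
        x = self.embedding_model(features)
        x = self._cross_layer(x)
        return self._dense_layers(x)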

Now we can fit and evaluate the model using a single 32-unit dense layer, equal weight on rating and retrieval, and three epochs.

cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

model = CrossDNNModel([32], rating_weight=0.5, retrieval_weight=0.5,
                      projection_dim=None)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))
model.fit(cached_train, validation_data=cached_test,
          validation_freq=5, epochs=3)

metrics = model.evaluate(cached_test, return_dict=True)
print(f"Retrieval top-100 accuracy: {metrics['factorized_top_k/top_100_categorical_accuracy']:.3f}.")
print(f"Retrieval top-50 accuracy: {metrics['factorized_top_k/top_50_categorical_accuracy']:.3f}.")
print(f"Retrieval top-10 accuracy: {metrics['factorized_top_k/top_10_categorical_accuracy']:.3f}.")
print(f"Retrieval top-5 accuracy: {metrics['factorized_top_k/top_5_categorical_accuracy']:.3f}.")
print(f"Retrieval top-1 accuracy: {metrics['factorized_top_k/top_1_categorical_accuracy']:.3f}.")
# print(f"Ranking RMSE: {metrics['root_mean_squared_error']:.3f}.")

The results look as follows:

With accuracy at 58%, the model is not doing a bad job; higher accuracy might be achieved with more epochs. Surprisingly, the simpler retrieval model we built first achieved much higher accuracy, but that may be specific to this dataset; more complex data should benefit from the cross network and DNN capacity.

One benefit of the cross network is that we can inspect the learned importance of interactions between features. Here, we plot the weight matrix learned during training. Darker colors show stronger learned interactions; in this case, customer location interacts strongly with product ID.

Conclusion

Having real-time weather data to predict users' sentiment would increase the relevance of a recommendation engine. Photo by Craig Whitehead on Unsplash.

My goal is to demonstrate the modularity and flexibility of TFRS, in the hope that it can assist data scientists in building and deploying recommendation engines in various retail and scientific settings. We can use this library to expand a recommendation engine to higher-complexity data, focusing less on programming and more on scaling to different data streams, such as from Spark. Imagine if we could take a live stream of weather data and predict what a user would like to purchase on a rainy day; we might increase the conversion rate of a website visit into a product purchase. Such real-time prediction requires a flexible yet powerful computing engine to receive the continuous data flow, and I believe TFRS can be a powerful tool for that.

Reference

Heng-Tze Cheng et al. “Wide & deep learning for recommender systems.” In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7–10. 2016.

Martín Abadi et al. “Tensorflow: A system for large-scale machine learning.” In 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16), pp. 265–283. 2016.

Ruoxi Wang et al. “DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems.” In Proceedings of the Web Conference 2021, pp. 1785–1797. 2021.

Ruoxi Wang et al. “Deep & cross network for ad click predictions.” In Proceedings of the ADKDD’17, pp. 1–7. 2017.

TensorFlow Recommenders. https://www.tensorflow.org/recommenders


Taufik Azri
Writer for CodeX. Data Scientist with interests in applicable solutions in the retail and consumer industry.