RECOMMENDATION ENGINE

Recommendation System for Retail Customer

A Hands-On Example Using Python TensorRec Module

Taufik Azri

--

The consumer retail industry is huge, high-turnover, low-margin, yet full of opportunity. Photo by Markus Spiske on Unsplash.

Overview

Recommendation systems are increasing in popularity. Every data analyst, data scientist, and data engineer in retail and consumer-related businesses, whether in e-commerce or in traditional offline business, will come across this inevitable and important application of machine learning. Perhaps it is not an exaggeration to call it one of the highest-impact products of ML. E-commerce has benefited a lot from machine learning in improving business operations and increasing sales. Recommendation systems have become an integral part of retailers, e-commerce sellers, and merchandisers, not only because of their simplicity but also because of their ability to unlock business value that is usually hidden within massive chunks of transaction data. They also beat traditional marketing intuition, providing solutions that rely on facts rather than mere assumptions. Their ability to surface products that usually end up in the long tail, or even introduce surprise elements into the retail portfolio, has garnered attention from C-suite managers and retail owners.

In this article, I will present a simple application for a retail business using the TensorRec module in Python, a deep-learning library that uses a TensorFlow and Keras backend with a rather simplified interface. In my opinion, it is user-friendly for beginners yet powerful and robust. The prerequisite is knowledge of Python and Pandas; the rest of the code should be easy to understand, and all of it is available in my Github repository.

Spotify Discover Weekly is built on an extensive recommendation engine. Photo by sgcdesignco on Unsplash.

Many recommendation algorithms exist, from simple association rules to slightly more complex K-Nearest Neighbors clustering, but are they robust enough to handle an actual business case with millions of historical retail records, intertwined with hundreds of product categories, customer hierarchies, and business goals? A prototype of collaborative filtering may work for a small mom-and-pop grocery store, but if you are managing a chain of supermarkets in a city, the complexity multiplies enormously across millions of transactions. Hence, you need a better algorithm that can process that complexity yet produce meaningful and relevant results for your business case. In my opinion, a hybrid deep-learning algorithm works best for large and complex datasets. Combining Pandas, Numpy, and SKLearn with the TensorRec module, we can harness the full potential of Python to produce robust, solid, and relevant item recommendations.

There are already many good articles that provide an introduction to recommendation systems. One article that provides a brief overview of TensorRec is by James Kirk, here. There is also another good article that provides an example of using TensorRec for movie recommendation with the MovieLens dataset, here. Much of the code on this page is based on the work in those two articles. Those interested in TensorRec itself can also view the original Github page.

Types of Recommendation

Two popular types of recommendation

In retail, there are two kinds of recommendation commonly used, which are:

a. Content-based recommendation

This system uses items' explicit features to represent the interactions between them. For example, if a user has purchased an item (e.g. a pair of socks), the algorithm will recommend a similar or relevant item (e.g. shoes).

b. Collaborative Filtering

This system is based on the idea that "users who like X also like Y", where users' historical transactions are chained together to build relationships among users. If user X has purchased milk, then user Y, who is similar to user X but has not yet purchased this item, will receive a recommendation to buy it.

A combination of both approaches, which I will demonstrate in this article, is simply a hybrid of the two: it processes both user similarity and item similarity together.

Example on Retail Dataset

This article uses an open dataset available on Kaggle. It contains retail transactions spanning from January 2011 to 2014, with the full dataset and description available here. I have also put the data into my Github repository for easy access here. This dataset is not a perfect example, but it is good enough for the purpose of this demonstration. The aim here is to understand how to apply a recommendation system to retailers that have many customers, or to systems that consist of multiple users.

We will utilize Pandas to fully unlock the value of the data before we feed it into the TensorRec engine. This is a good opportunity to learn data cleaning and data wrangling, so I encourage you to work alongside the code in a Jupyter Notebook or another preferred Python interface, all available in my Github. I will only explain the major concepts while leaving most of the details in the code.

As discussed above, a recommendation engine may utilize collaborative filtering, content-based filtering, or a hybrid of both. Since we have sample data that perfectly (or nearly perfectly) captures not only the interactions between users and items but also the features of the users, we can seamlessly apply the hybrid method. The transaction data (Transactions.csv) serves as the collaborative component, because it describes the users' (or customers') transactions: what and when they purchased. The customer data (Customer.csv) and product information (prod_cat_info.csv) serve as the customer features and item features respectively.

Features can be explicit or implicit, visible or hidden, taken at face value or derived through feature engineering. For a data scientist, feature engineering is an essential part of data wrangling, necessary in almost every ML algorithm, and this recommendation system is no exception. This sample data does not always tell us everything we need, but as skillful data scientists (I like to call us ninjas), we should know how to unlock hidden value. To achieve this, we will perform some unsupervised ML and some data wrangling on the transaction data to derive new features.

A TensorRec diagram that shows the flow from input data to output (product ranking). Source: TensorRec Github.

The key to TensorRec is realizing that there are three important components: user features, item features, and interactions. As shown in the diagram above, the input data consists of user features, item features, and interactions. The engine takes these three inputs and builds a model that ranks the relevant interactions for each user. The example data we obtained from Kaggle contains roughly adequate content for these requirements. What we need to do now is transform the raw retail data into these three inputs.

Usual pipeline of recommendation system in a corporate data science environment.

We can take advantage of TensorRec's powerful deep-learning backend, which handles training and testing the prediction. We only need to ensure we supply clean and relevant information in the feature matrices; any business case (e.g. eliminating long-tailed products, recommending only to active customers, customized category baskets, etc.) can be applied after the prediction through data filtering.

The first five rows of transaction data. Original data above, modified column names below.

First, we load the transaction data, shown as a snippet of a Pandas DataFrame above. We make a few changes to the column names so that they read nicely and are easy for other analysts and business users to understand. The change also allows us to standardize the column names across all the other tables, ensuring seamless merges through consistent naming.

The transaction data contains the essential transaction information: who, what, and how much. The table lists the transactions of each customer (Customer ID), the date (Transaction Date), the category (Prod Cat Code) and subcategory (Prod Subcat Code) of the item purchased, the quantity (Qty), and the sales amount (Total Amt). While unfortunately we do not have the exact materials purchased, we can approximate them by assuming that each unique combination of Prod Cat Code, Prod Subcat Code, and Store Type yields a unique material. That is perhaps the best choice for now, and we will stick with it for the purpose of this exercise. You could also generate pseudo-random materials instead.

For the purpose of this exercise, we assume that each unique combination of category, subcategory, and store type constitutes a different item (column Material), as sketched below.
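
As a rough sketch of this step (the raw column names in the rename dictionary are assumptions based on the Kaggle file; adjust them to whatever your copy of the data uses):

import pandas as pd

# Load the Kaggle transaction file (raw column names may differ slightly in your copy)
transactions = pd.read_csv("Transactions.csv")

# Standardize column names so they match the other tables
transactions = transactions.rename(columns={
    "cust_id": "Customer ID",
    "tran_date": "Transaction Date",
    "prod_cat_code": "Prod Cat Code",
    "prod_subcat_code": "Prod Subcat Code",
    "Store_type": "Store Type",
    "Qty": "Qty",
    "total_amt": "Total Amt",
})

# Treat each unique (category, subcategory, store type) combination as one item
transactions["Material"] = (
    transactions["Prod Cat Code"].astype(str) + "-"
    + transactions["Prod Subcat Code"].astype(str) + "-"
    + transactions["Store Type"].astype(str)
)
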
The first two rows of customer information, with useful demographic information.

The Kaggle repository also provides customer data, which helps us identify customer features. Is this customer information sufficient, or is there additional information that we can derive from the data? Remember that feature engineering is crucial to every successful ML algorithm, and this case is no different. Since TensorRec does not prescribe what the item and customer features must contain, we can take advantage of this freedom and use feature engineering to derive enhanced item and customer features. We can be as simple or as complex as we want, so I opt for a twist: I use a common marketing tool, RFMV, as an additional customer feature.

RFMV — Marketing Analytics into Machine Learning

Thousands of products. Thousands of customers. Which cereal box should I get? Photo by Bernard Hermant on Unsplash

A goal of feature engineering is to derive features that are not explicitly present in the data, surfacing hidden patterns or correlations that may exist between them. To achieve this, data scientists often use unsupervised learning methods such as K-Means clustering and t-SNE, hoping to reveal patterns, groups, or insights that are not explicitly visible. Marketing analytics has benefited from this by performing customer segmentation with unsupervised clustering. The application extends beyond marketing too; we can take advantage of this method as supplementary information to better understand our customers.

Among the numerous metrics in the retail and consumer goods industry, recency (R), frequency (F), and monetary value (M) are often used to evaluate customers. Marketers at times use variety (V) as well. Hence, we use all four metrics, abbreviated RFMV, in our customer feature analysis. Precisely, recency is defined as how recently a customer purchased an item, measured here in days from the current date. A customer who has not engaged with us for a long time, say a year, may need a business continuity and development evaluation. Frequency denotes how often a customer purchases from us; a high-value customer may purchase frequently. Monetary, as the word implies, shows how much the customer has paid for their purchases, while variety denotes the number of different types of items they have purchased.
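A minimal sketch of how these four metrics could be derived with Pandas, continuing from the transactions DataFrame assumed earlier (the exact aggregation choices, such as counting transactions for frequency and using the latest date in the data as the snapshot date, are my assumptions):

import pandas as pd

# Parse dates and pick a snapshot date for recency (the most recent date in the data)
# dayfirst is an assumption about the date format in the file
transactions["Transaction Date"] = pd.to_datetime(transactions["Transaction Date"], dayfirst=True)
snapshot_date = transactions["Transaction Date"].max()

# One row per customer: days since last purchase, number of purchases,
# total spend, and number of distinct items purchased
rfmv = transactions.groupby("Customer ID").agg(
    Recency=("Transaction Date", lambda d: (snapshot_date - d.max()).days),
    Frequency=("Transaction Date", "count"),
    Monetary=("Total Amt", "sum"),
    Variety=("Material", "nunique"),
)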

Recency, frequency, monetary, variety, and clusters derived from K-Means clustering.

The table above shows the first five rows of customers with their respective R, F, M, and V values calculated using Pandas (see my Github repo). Notice the column called "clusters" at the end: that is the K-Means cluster of each customer based on these four metrics. Clustering gives us a simplified window into customers' characteristics that would otherwise be difficult to visualize or comprehend across four separate dimensions. Not only does it provide a simplified customer metric, it also adds another spectrum of information to the recommendation algorithm without overloading it with too many dimensions (in this case, one feature as opposed to four).

Recency and Frequency among the four clusters.

We can visualize the clusters to see how they spread out across recency and frequency. Notice that certain clusters occupy certain ranges; for instance, the blue customers have medium-to-high recency but low frequency. See my Github repo for how I performed the clustering and plotted the chart; a rough sketch of the clustering step follows below.
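This is a minimal sketch of that clustering step under the same assumptions (standard-scaled RFMV and four clusters to match the chart; the notebook in the repo may use different settings):

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale the four metrics so Monetary does not dominate the distance calculation
scaled_rfmv = StandardScaler().fit_transform(rfmv[["Recency", "Frequency", "Monetary", "Variety"]])

# Assign each customer to one of four clusters
kmeans = KMeans(n_clusters=4, random_state=42)
rfmv["clusters"] = kmeans.fit_predict(scaled_rfmv)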

The Engine

Different businesses have different business cases. Some businesses may want to promote long-tailed SKUs, while others may want to regain lost or inactive customers. Depending on the business objective, you may want to use this algorithm for a variety of scenarios. You could recommend to active, high-potential customers (the "low-hanging fruit"), or you could try to revive long-lost customers (rather difficult, but possible). In this example, we apply the first scenario: we want to recommend to active customers, and the easiest way to determine them is to filter out customers whose recency is beyond 360 days (roughly one year). We have the data for that, and all we need to do is filter, as sketched below. This step is optional, as you could simply recommend to everyone, but you may risk losing accuracy as the data becomes convoluted with noise rather than relevant information.
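A short sketch of that filter, assuming the RFMV table from the clustering step above:

# Keep only customers who purchased within the last 360 days
active_customers = rfmv[rfmv["Recency"] <= 360].index
transactions = transactions[transactions["Customer ID"].isin(active_customers)]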

Now we transform the data into the three inputs: the interaction matrix, the user features, and the item features.
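One way to build the interaction matrix (a sketch under the same assumptions; rows are customers, columns are items, values are the total quantity purchased) is a Pandas groupby, which the scaling snippet below then operates on:

# Quantity purchased per customer and item, reshaped into a customer x item matrix
interactions = (
    transactions.groupby(["Customer ID", "Material"])["Qty"].sum()
    .unstack(fill_value=0)
)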

# Scale values of the interaction matrix using sklearn.preprocessing.MinMaxScaler
import pandas as pd
from sklearn import preprocessing

minmaxscaler = preprocessing.MinMaxScaler()
interactions_scaled = minmaxscaler.fit_transform(interactions)
interactions_scaled = pd.DataFrame(interactions_scaled)
# The scaled matrix loses the index (customer) and column (item) information,
# so we re-append the customer IDs and materials to the DataFrame's index and columns
interactions_scaled.index = interactions.index
interactions_scaled.columns = interactions.columns

The interaction matrix (see the snippets above or the code in the Github repo) basically shows the quantity of each material a customer has purchased. The code above shows how we can use Pandas' groupby to get the quantities and SKLearn's preprocessing module to scale the values. Scaled values work better in most ML algorithms, including neural networks.

For the user feature matrix (see the Github repo), we want a matrix that best describes each customer's characteristics. One feature we have already calculated is the RFMV cluster, which we can append to the matrix. Another feature we can use is the type of product they like to buy, that is, which product categories they have purchased and in what volume. Similar to the interaction matrix, we use Pandas' groupby to get the quantity purchased per category. As additional practice, you can also include other information from the customer data, such as city, gender, and date of birth. I omit that information here, but feel free to try it as an exercise.
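A sketch of that user feature matrix under the same assumptions (category volumes joined with a one-hot encoding of the RFMV cluster):

import pandas as pd

# Quantity purchased per product category for each customer
user_category = (
    transactions.groupby(["Customer ID", "Prod Cat Code"])["Qty"].sum()
    .unstack(fill_value=0)
)

# One-hot encode the RFMV cluster and join it to the category volumes
cluster_dummies = pd.get_dummies(rfmv["clusters"], prefix="cluster")
user_features_df = user_category.join(cluster_dummies, how="inner")

# Align the rows with the interaction matrix
user_features_df = user_features_df.reindex(interactions.index, fill_value=0)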

For the item feature matrix (again, see the Github repo), we use the item's category as its feature. The matrix shows which category each item belongs to, along with the magnitude of its presence (i.e., the quantity).
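A corresponding sketch of the item feature matrix (each Material here belongs to exactly one category, so this is effectively a quantity-weighted category encoding):

# Quantity sold per category for each item (Material)
item_features_df = (
    transactions.groupby(["Material", "Prod Cat Code"])["Qty"].sum()
    .unstack(fill_value=0)
)

# Align the rows with the columns of the interaction matrix
item_features_df = item_features_df.reindex(interactions.columns, fill_value=0)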

Using Scipy's coo_matrix function, we transform the user features, item features, and interactions into sparse matrices. Using sparse matrices, especially for large datasets, is always a good step as it reduces computational complexity, particularly when the matrices are fed into a neural network, which is itself a heavily complex algorithm that spends a large amount of time on forward and backward propagation.
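A sketch of that conversion, using the DataFrames assumed above:

from scipy.sparse import coo_matrix

# Convert the dense DataFrames into sparse COO matrices for TensorRec
interactions_sparse = coo_matrix(interactions_scaled.values)
user_features_sparse = coo_matrix(user_features_df.values)
item_features_sparse = coo_matrix(item_features_df.values)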

# Fit the model for 5 epochs
model.fit(interactions, user_features, item_features, epochs=5, verbose=True)

The code above is taken directly from the TensorRec guide. It shows that to fit the model, we can conveniently pass the user features, item features, and interactions in a single line (with whichever model you deem the best fit for the dataset).

We can take a step further and split the data into training and test sets. As with any good machine learning model, we perform the split conventionally at an 80-20 ratio. How can we do this efficiently and effectively here? We can use a method called masking, where we mask 20% of the interaction data at random. The mask hides those interactions, so it is as if the customer had never purchased the item, while the actual transactions of the masked items are kept aside as the test set. Jesse Steinweg-Woods provides a great elaboration on masking here.
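Here is a rough sketch of that masking split, assuming the scaled interaction DataFrame from above (the helper function name and the 20% fraction are illustrative):

import numpy as np

def train_test_mask(interactions_df, test_fraction=0.2, seed=42):
    # Randomly hide a fraction of the non-zero interactions and keep them as a test set
    rng = np.random.default_rng(seed)
    values = interactions_df.values
    train = values.copy()
    test = np.zeros_like(values)

    # Indices of all observed (non-zero) interactions
    rows, cols = values.nonzero()
    n_test = int(len(rows) * test_fraction)
    picked = rng.choice(len(rows), size=n_test, replace=False)

    # Mask the picked interactions out of the training set and move them into the test set
    train[rows[picked], cols[picked]] = 0
    test[rows[picked], cols[picked]] = values[rows[picked], cols[picked]]
    return train, test

train_interactions, test_interactions = train_test_mask(interactions_scaled)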

model = TensorRec(n_components=n_components,
                  loss_graph=WMRBLossGraph(),
                  user_repr_graph=DeepRepresentationGraph(),
                  item_repr_graph=NormalizedLinearRepresentationGraph(),
                  biased=biased)

TensorRec offers several neural-network representation graphs, varying in complexity and the depth of calculation required by the dataset. I believe there is no single blanket algorithm that works for every dataset; one graph may fit one kind of data while others work better on another. The code snippet above shows that I use DeepRepresentationGraph for the user representation, NormalizedLinearRepresentationGraph for the item representation, and WMRBLossGraph for the loss. Try several different graphs to find the best fit for the model, and tune it by trying different parameters and representation graphs. The best practice here is to mix and match different representation graphs for the user features, item features, and loss; the choices above fit this case, but you can try others, some of which are listed on the TensorRec Github page.

Testing: Recall at K

After we have run the model, we should test whether it is the best fit for the data. But what do I mean by best fit? What is the best fit for a recommendation engine? As discussed above, a recommendation engine prescribes outputs that are relevant to the user. The key here is relevant. In retail, a relevant item means the user is likely to buy it if recommended, regardless of whether the item is closest to their own past purchases or to a related user's past purchases. We do not know what is in the user's mind, whether they like it or not, but we can assume the user has some affinity to the item, or in other words, can relate to it. Remember, businesses care about selling the recommended products, so how can we ensure the recommended user will buy (or is most likely to buy) the item?

One way to ensure the generated recommendations are relevant is to measure some degree of accuracy. In most (if not all) ML, the higher the accuracy, the better the algorithm. A similar idea applies in recommendation systems, but there is a catch.

In recommendation systems, two metrics used frequently are precision@k and recall@k, conventionally defined as:

Precision@k: (# of recommended items in the top k that are relevant) / (# of recommended items, i.e. k)
Recall@k: (# of recommended items in the top k that are relevant) / (total # of relevant items)

These definitions ask for relevance, but how do we define it? What exactly does a relevant item mean to a customer? This is a problem raised by Joseph Konstan, who stated that there is no ground truth for relevancy: no recommendation system has the base truth about relevance. Kirill Alexandrovich also discusses this problem in greater detail here.

So, ideally, if a customer purchased 50 items in the past, and the engine successfully recommends 40 of those 50 items, then the recall (assuming the total number of relevant items is 50) is 40/50, or 80%. That is not too bad. But would it not be better if the accuracy were 100%, that is, if the engine recommended every product the user has purchased? Not necessarily: if we want to recommend new products that the customer has never purchased, we would not be able to fit those new items into the ranking, because the ranking would be filled only with purchased items, leaving non-purchased items out of the equation. The 20% gap in accuracy leaves room for items that can be recommended but do not yet exist in the customer's portfolio. That is why achieving 100% recall is probably not a practical goal.
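As a sketch of how recall@k could be checked once the model has been fitted on the masked training set (the predict_rank call follows the TensorRec Github examples; the small NumPy helper below is my own illustration rather than a TensorRec function, so treat the exact details as assumptions):

import numpy as np
from scipy.sparse import coo_matrix

# Fit on the masked training interactions, then rank every item for every customer
model.fit(coo_matrix(train_interactions), user_features_sparse, item_features_sparse,
          epochs=5, verbose=True)
predicted_ranks = model.predict_rank(user_features=user_features_sparse,
                                     item_features=item_features_sparse)

def recall_at_k(ranks, test, k=10):
    # Share of each customer's held-out items that land in their top-k ranks
    hits = ((ranks <= k) & (test > 0)).sum(axis=1)
    relevant = (test > 0).sum(axis=1)
    return (hits[relevant > 0] / relevant[relevant > 0]).mean()

print("recall@10:", recall_at_k(predicted_ranks, test_interactions, k=10))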

Personalized ranking of items (row index) for each customer (column)

So here we go: we accept the model at around 90% accuracy. We now have a personalized ranking of relevant products for each user, shown in the table above. Each material is ranked from 1 to n (n = total number of items) differently for each customer. If we want the system to recommend only items a customer has never purchased before, we can simply filter the output to display those, as sketched below.
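A short sketch of that final filtering step, under the same assumptions (turn the rank matrix into a DataFrame and blank out anything the customer has already purchased):

import pandas as pd

# Personalized ranks: rows = customers, columns = items, 1 = most relevant
ranks_df = pd.DataFrame(predicted_ranks,
                        index=interactions.index,
                        columns=interactions.columns)

# Keep ranks only for items the customer has never purchased
new_item_ranks = ranks_df.where(interactions == 0)

# Top 5 brand-new recommendations for one example customer
example_customer = interactions.index[0]
print(new_item_ranks.loc[example_customer].dropna().nsmallest(5))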

Final Word

Are these simply confectionery or different types of chocolate? Depends on who you talk to. Photo by Vishnu Mk on Unsplash

Very often, online platforms recommend products that make no sense to the user. The problem may lie in the engine, but more often it lies in the input data (what types of features and labels we have fed into the engine) and in the output data (how the results look at the receiver's end). These are the kinds of data wrangling and manipulation we need to consider, add, and modify. If we want a clear distinction between white chocolate and dark chocolate, then the item features should have categories that clearly show the difference between the two. However, if we do not carry much variety of chocolate, or the difference does not matter, then perhaps they can be lumped together with other products into one category, say, confectionery. These considerations all depend on the business context and requests, and they require us to run the engine several times with different features rather than simply tuning the parameters.

Often, recommendation engines are expected to recommend "long-tailed" items, meaning items that previously did not sell well, perhaps due to being overshadowed by a competitor's product, a lack of advertising and marketing effort, or other market forces. Some engines are also expected to yield "surprise" items: items that never became popular but have nearly the same features as items other customers have bought. Can we use a recommendation algorithm to up-sell and push long-tailed or hidden items to users? That is what many researchers and academics have argued: the system should be able to recommend items that users might not otherwise have known about (see Herlocker et al. 2002) [2]. But how can we achieve that in TensorRec? The neural network is a black-box process that is hard to interpret. My suggestion is to apply workarounds in data manipulation and data wrangling instead of focusing too much on fine-tuning the algorithm. In addition, we should work closely with business analysts and managers to understand the business context; only then can data scientists like us deliver relevant results that truly add value to the business.

REFERENCES:

[1] de Lima, Andre Paulino, and Sarajane Marques Peres. “Limits to Surprise in Recommender Systems.” arXiv preprint arXiv:1807.03905 (2018).

[2] Herlocker, Jon, Joseph A. Konstan, and John Riedl. An Empirical Analysis of Design Choices in Neighborhood-Based Collaborative Filtering Algorithms. Information Retrieval 5, 4 (2002), 287–310.

[3] Karaman, Baris. Customer Segmentation. Medium, (May 2019), link.

[4] Kirk, James. Getting Started with Recommendation System and TensorRec. Medium, (Jan 2019), link.

[5] Li, Susan. Building and Testing Recommender Systems With Surprise, Step-By-Step, Medium, (Dec 2018), link.

[6] Rocca, Baptiste. Introduction to Recommendation System. Medium, (June 2019), link.
