RECOMMENDATION SYSTEM

Basic Fundamentals of Recommendation System — Collaborative Recommendation

Understanding the basic technique to calculate user-based and item-based recommendation

TikTok captivates the world with its powerful video recommendation algorithm. Photo by Hello I'm Nik 🎞 on Unsplash.

Recommendations are pervasive in online consumer systems, more than ever in today's digital age. As we read news on Facebook, buy products on Amazon, search for accommodation on Airbnb, or look for interesting profiles on Instagram, the suggestions we receive are not mere sorted piles of popular (or unpopular) items; rather, they are calculated predictions made by extensive algorithms that process massive amounts of information behind the screen. Online publishers and retailers rely on recommendation algorithms to provide relevant suggestions to users, hoping to get them to buy their products or subscribe to their content. But how exactly do they do it? What sort of calculation or formula do they use? And how do they utilize vast amounts of data to arrive at a final decision that they are confident users will like?

The objective of this article is to provide a concise and practical introduction to the principles of recommendation systems. My intention is to give a clear introduction to concepts and techniques that have been extensively developed by researchers, tested, and deployed in industry. Along the way, I will briefly discuss their advantages and disadvantages, as well as their feasibility in different use cases.

This article will be of interest to data scientists and business analysts who need to understand the theoretical concepts of recommendation systems, a topic I believe is not commonly covered in academic curricula. Readers who are interested in a hybrid recommendation engine may want to read my previous article that implements one in Python.

Precedents of Recommendation System

The first ACM Recommender Systems conference took place in Minneapolis in 2007. Photo by Weston MacKinnon on Unsplash.

Recommendation systems received substantial attention in the data community after the Netflix Prize competition, which ran from 2006 to 2009, in which the company searched for the best collaborative filtering algorithm. Netflix released a huge dataset containing over 100 million ratings given by roughly 480,000 users to about 17,770 movies. Based on this historical information, they searched for the algorithm that best predicts new movies that users may like.

Apart from Netflix, social media applications such as Instagram and TikTok rely on user feedback too. They process these inputs in systems that yield suggested accounts, videos, or advertisements. Another example is Spotify, which relies on users' listening preferences to produce the Discover Weekly playlist, personalized to each listener based not only on their own listening habits, but on those of similar listeners as well.

The academic development of recommendation systems began independently in the mid-1990s, but expanded when the first ACM Recommender Systems conference was held in Minnesota in 2007 [1][2]. Since then, open datasets such as MovieLens have contributed to the development and advancement of recommendation studies.

The fundamental concept remains elementary — a recommendation dataset contains users (e.g. customers, account holders, viewers), items (e.g. movies, pictures, videos), and interactions between users and items (e.g. ratings, likes, or duration watched). This information is processed by various types of algorithms, so let us look at the simplest one: user-based nearest neighbor recommendation.

User-based nearest neighbor recommendation

Understanding users’ movie habits is key to Netflix’s recommendation algorithm. Photo by Teslariu Mihai on Unsplash.

One basis of a recommendation system is understanding the relationships among users. Precisely, we need to measure the similarity between users, i.e. which users are “closer” to one another, hence the name user-based nearest neighbor recommendation. These similarities can then be used to predict a user’s preference from existing users’ preferences.

Consider the hypothetical table below, which shows five users and their ratings for five movies on a scale from 1 to 5, 1 being really dislike and 5 being really like. We are interested in predicting whether Amy will like Movie 5.

A typical example of movie ratings given by five users on five different movies. We are interested in predicting Amy’s rating for Movie 5.

In a nearest neighbor algorithm, we identify other users (often referred to as peer users) who have preferences similar to the user of interest. Here, we take the other four users as Amy's peers, so we can predict Amy's taste from their existing ratings. The assumption is that Wong, James, Kevin, and Ali are Amy's peer users, and that their past preferences will remain consistent in the future, so we can calculate the similarity at any time. In practice, however, we may need to filter out non-similar users, and we may need to factor item trends and popularity into the calculation too. For now, we shall use this simple example as it is.

Note that some users may rate Movies 1, 2, 3 and 4 differently than Amy, so they may not be the best choice for Amy's nearest neighbors. Hence, our task is to determine the users whose preferences are closest to Amy's. After that, we will predict Amy's preference for Movie 5 based on those users' preferences for that movie. In the research literature, a widely used method to measure similarity between variables is Pearson's correlation coefficient [3]. It has been shown that this coefficient performs better than other user similarity measures [4].

Let U = {u₁, . . . , uₙ} denote the set of users, P = {p₁, . . . , pₘ} the set of products (or items), and R an n × m matrix of ratings rᵢⱼ, with i ∈ 1 . . . n, j ∈ 1 . . . m, where the possible rating values are defined on a numerical scale from 1 (strongly dislike) to 5 (strongly like). If a user i has not rated an item j, the corresponding matrix entry rᵢⱼ is empty, as we can see with the entry (marked ?) for Amy and Movie 5.

The similarity sim(a, b) of users a and b, given the rating matrix R, is the Pearson correlation coefficient:

sim(a, b) = Σ_p (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √Σ_p (r_{a,p} − r̄_a)² · √Σ_p (r_{b,p} − r̄_b)² )

where the sums run over the items p rated by both users, and r̄_a and r̄_b are the average ratings of users a and b.

We can think of this process as taking each user's rating and adjusting it by that user's average rating.
We can use the above formula to find the similarity of Wong and Amy. First, calculate the average of each of their individual ratings as below:

Similarly, we can calculate sim(Amy, James), sim(Amy, Ali), and sim(Amy, Kevin), which will yield this result:

Readers can also see the calculation on Github.
Then we can predict Amy's rating for Movie 5, using the mean-adjusted weighted sum:

pred(a, p) = r̄_a + Σ_b sim(a, b) · (r_{b,p} − r̄_b) / Σ_b |sim(a, b)|

where b ranges over Amy's peer users who have rated Movie 5.

Expanding the summation in the numerator and denominator to include all other users in the table yields 2.
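The user-based procedure above (Pearson similarity plus the mean-adjusted prediction) can be sketched in a few lines of Python. The ratings below are hypothetical stand-ins, since the article's original table is not reproduced here:

```python
import numpy as np

# Hypothetical ratings table; each user's dict maps movies to ratings.
# Amy's Movie 5 rating is the unknown we want to predict.
ratings = {
    "Wong":  {"M1": 1, "M2": 5, "M3": 4, "M4": 2, "M5": 1},
    "James": {"M1": 3, "M2": 3, "M3": 2, "M4": 2, "M5": 1},
    "Kevin": {"M1": 4, "M2": 2, "M3": 5, "M4": 5, "M5": 4},
    "Ali":   {"M1": 2, "M2": 4, "M3": 3, "M4": 5, "M5": 2},
    "Amy":   {"M1": 3, "M2": 3, "M3": 4, "M4": 4},   # M5 unknown
}

def pearson_sim(a, b):
    """Pearson correlation over the items rated by both users."""
    common = set(ratings[a]) & set(ratings[b])
    ra = np.array([ratings[a][p] for p in common], dtype=float)
    rb = np.array([ratings[b][p] for p in common], dtype=float)
    # each user's mean is taken over ALL of that user's own ratings
    ma = np.mean(list(ratings[a].values()))
    mb = np.mean(list(ratings[b].values()))
    num = np.sum((ra - ma) * (rb - mb))
    den = np.sqrt(np.sum((ra - ma) ** 2)) * np.sqrt(np.sum((rb - mb) ** 2))
    return num / den if den else 0.0

def predict(user, item):
    """Mean-adjusted weighted average over all peers who rated the item."""
    peers = [u for u in ratings if u != user and item in ratings[u]]
    sims = np.array([pearson_sim(user, u) for u in peers])
    devs = np.array([ratings[u][item] - np.mean(list(ratings[u].values()))
                     for u in peers])
    mu = np.mean(list(ratings[user].values()))
    return mu + np.sum(sims * devs) / np.sum(np.abs(sims))

pred = predict("Amy", "M5")
```

With a different ratings table the numbers will differ, but the shape of the calculation is the same as the worked example above.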

While this method works reasonably well for small datasets, the calculation becomes expensive when the data is huge. If we have 10,000 users, finding Amy's neighbors requires computing each of sim(Amy, user1), …, sim(Amy, user9999) individually, which consumes a lot of time and memory. A potential solution is a model-based algorithm such as SVD decomposition, which will be discussed later in this article.

Another potential problem is a sparse matrix — one where most of the entries are missing. Realistically, when a dataset relies on user ratings, most users only rate a small number of movies; it is almost impossible to find users who have rated every movie in the database. As a consequence of low rating activity, many entries are empty, and not enough similarity measurements can be generated. New movies suffer from this too, a problem known as cold start. These are complex cases that require different treatment, and some of the methods discussed below may help address them.

Item-based nearest neighbor recommendation

Amazon relies heavily on item interactions in its marketplace. Photo by Christian Wiediger on Unsplash.

When the data is large, it is expensive to calculate similarities across the entire universe of users. An alternative to user-based nearest neighbor is item-based nearest neighbor: measuring the similarity among items instead. If we have, say, 10,000 users but only 100 products, it is simpler to calculate similarities among items, then use the result to find the items closest to those the user of interest already likes. Consider the same dataset as above. We would like to calculate the similarity of Movie 5 with the other movies in the dataset.

A common similarity measure is cosine similarity, which measures the similarity between two n-dimensional vectors of an inner product space based on the angle between them [5]. For two items x and y, with rating vectors taken over the users who rated both, the cosine similarity is defined as:

sim(x, y) = (x · y) / (|x| · |y|)

Hence the similarity is 30 / (5.2 × 6.63) ≈ 0.87

Similarly, we can calculate the similarity of Movie 5 with the other movies. The results are shown in the table below:
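The plain cosine measure can be sketched as follows; the rating vectors are hypothetical stand-ins, since the article's table is not reproduced here:

```python
import numpy as np

# Hypothetical rating vectors for two movies, one entry per user who
# rated both; the article's exact numbers are not reproduced here.
movie_5 = np.array([1.0, 1.0, 2.0, 4.0])
movie_1 = np.array([1.0, 3.0, 4.0, 2.0])

def cosine_sim(x, y):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

sim = cosine_sim(movie_5, movie_1)
```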

A slightly more expensive but more accurate approach is adjusted cosine similarity, which deducts each user's average from that user's ratings. We calculate the similarity between movies adjusted by each user's average rating over all their mutual items. We can adapt the earlier Pearson-style formula to measure similarity not between users but between items, precisely as below:

sim(x, y) = Σ_u (r_{u,x} − r̄_u)(r_{u,y} − r̄_u) / ( √Σ_u (r_{u,x} − r̄_u)² · √Σ_u (r_{u,y} − r̄_u)² )

where u ranges over the users in U who have rated both items x and y.
Given each user's average rating below:

The adjusted cosine similarity of Movie 5 and 1 is:

Similarly, we can produce the similarities for the other movie pairs, all calculated in Python (see code on Github) and shown below:
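Adjusted cosine can be sketched the same way; the ratings and user means below are hypothetical, since the article's table is not reproduced here:

```python
import numpy as np

# Hypothetical data: position i holds user i's rating for each item and
# that user's average rating over all items; values are illustrative only.
r_x = np.array([1.0, 3.0, 4.0, 2.0])          # ratings for item x
r_y = np.array([1.0, 1.0, 2.0, 4.0])          # ratings for item y
user_means = np.array([2.6, 2.2, 4.0, 3.2])   # each user's overall average

def adjusted_cosine(x, y, means):
    """Cosine similarity after subtracting each user's mean rating."""
    dx, dy = x - means, y - means
    return float(np.dot(dx, dy) / (np.linalg.norm(dx) * np.linalg.norm(dy)))

sim = adjusted_cosine(r_x, r_y, user_means)
```

Unlike plain cosine, adjusted cosine can go negative, since users who rate an item below their own average pull the similarity down.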

We can then use these similarities to predict Amy's preference for Movie 5:

pred(u, p) = Σ_{i∈I} sim(i, p) · r_{u,i} / Σ_{i∈I} |sim(i, p)|

where I denotes the set of items already rated by user u. Therefore:

This yields 2.88. Compared to the user-based result, the item-based method gives a slightly higher rating for Amy. Which one is better? It is hard to judge from these two results alone, but their closeness suggests that either method is robust enough to generate a reasonable result. Hence, the deciding factor between user-based and item-based should come down to the size of the data and the computational capacity of the system. If a system has more users than items, it may be more feasible to measure item similarity, as Amazon reported in 2003 that item-based calculation scales to big data [6].

Probabilistic Model — Naïve Bayes

Similarity measurement relies on calculating a similarity index between items or users, which only yields a single predicted rating of a selected item for one specific user, in this case Movie 5's rating for Amy. However, we may also want the probabilities of all possible ratings: what are the probabilities of Amy rating Movie 5 as 1, 2, 3, 4, or 5?

This probabilistic approach is typical of classification tasks, where the system needs to assign one of several possible values. In machine learning, classification requires an algorithm that can predict, say, the species of a flower given the size of its petals, or whether an email is spam given its text. We can apply the same framing to this problem, where the possible classes are 1, 2, 3, 4, or 5. One way to do this is with Naïve Bayes.

Naïve Bayes is a simplification of Bayes' theorem that assumes independence among attributes (or features). We can compute the posterior probability P(Y|X) given the conditional probability P(X|Y) and the marginal probabilities P(X) and P(Y). Formally, Bayes' theorem states that:

P(Y = k | X) = P(X | Y = k) · P(Y = k) / P(X)

The posterior probability that Y takes a certain value k, given the observed attributes X, is the likelihood of X, which is P(X|Y=k), multiplied by the prior P(Y=k), divided by the overall probability of observing X. In this case, Y refers to Amy's rating of Movie 5 taking a certain value k (either 1, 2, 3, 4, or 5).

We calculate the conditional probability of each possible rating value given Amy's other ratings, and select the one with the highest probability as the prediction. We want P(Movie 5 = 1 | X), that is, the probability that Amy will rate Movie 5 as 1, given X, her ratings of Movies 1, 2, 3 and 4. We then repeat the same step for P(Movie 5 = 2 | X), and so forth, as follows:

The Naïve Bayes assumption treats each attribute (movie rating) as independent of the others, so we can simply take the product over all of Amy's attributes X, that is:

For each k = 1, 2, 3, 4, 5, we can calculate P(Movie 5 = k) directly from the table. For instance, P(Movie 5 = 1) is 2/4, or 0.5, since two of the four observed ratings for Movie 5 are 1 (this is akin to classic dice probability).

Hence, we have:

P(Movie 5 = 1) = 0.5
P(Movie 5 = 2) = 0.25
P(Movie 5 = 3) = 0
P(Movie 5 = 4) = 0.25
P(Movie 5 = 5) = 0

Clearly, since the probability of Movie 5 having rating 3 or 5 is 0, we only need to calculate P(Y=1|X), P(Y=2|X), and P(Y=4|X).
To calculate P(Y=1|X) for Amy, we take the product of the conditional probabilities of her ratings for Movies 1 to 4, given Movie 5 = 1 (remember we only take Amy's attributes X):

where P(Movie 1 = 3 | Movie 5 = 1) is 1/2: for Movie 5 = 1 there are two observations, from Wong and James, and only one of them rated Movie 1 as 3, so the probability of observing a 3 is ½.
Similar calculations follow for the other terms. Therefore,

Similarly,

And

Using Naïve Bayes, we see the highest probability (in fact the only non-zero one) is for Amy to rate Movie 5 as 1. In other words, she is more likely to rate Movie 5 as 1 than as 2, 3, 4, or 5. We can conclude that she has a low preference for this movie.
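The Naïve Bayes recipe above can be sketched directly in Python. The training rows below are hypothetical (the article's table is not fully reproduced), so the numbers are illustrative, but the calculation — prior times a product of per-item conditional frequencies — is the one described in this section:

```python
# Hypothetical training rows (one per peer user); these values are
# illustrative, not the article's original table.
train = [
    {"M1": 1, "M2": 5, "M3": 4, "M4": 2, "M5": 1},  # Wong
    {"M1": 3, "M2": 3, "M3": 2, "M4": 2, "M5": 1},  # James
    {"M1": 4, "M2": 2, "M3": 5, "M4": 5, "M5": 4},  # Kevin
    {"M1": 2, "M2": 4, "M3": 3, "M4": 5, "M5": 2},  # Ali
]
amy = {"M1": 3, "M2": 3, "M3": 2, "M4": 2}          # predict her M5 rating

def naive_bayes_scores(train, query, target):
    """Unnormalized posterior: prior * product of per-item likelihoods."""
    scores = {}
    for k in (1, 2, 3, 4, 5):
        rows = [r for r in train if r[target] == k]
        score = len(rows) / len(train)          # prior P(target = k)
        for item, value in query.items():
            if rows:
                # frequency estimate of P(item = value | target = k)
                score *= sum(r[item] == value for r in rows) / len(rows)
        scores[k] = score
    return scores

scores = naive_bayes_scores(train, amy, "M5")
best = max(scores, key=scores.get)
```

Since we only need the argmax over k, the shared denominator P(X) is dropped, which is why the scores are unnormalized.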

The advantage of this method is its simplicity — it does not exhaust computing capacity when the data is large, especially when the matrix is sparse. As you can see, missing ratings simplify the calculation, but they may also discard a lot of information. More advanced techniques may be required, for instance feature extraction that keeps only the important attributes. Other methods include clustering users with k-nearest neighbors or dimensionality reduction using PCA or SVD [7].

Slope One Prediction

Lemire and Maclachlan (2005) proposed a simplified approach to predicting a new rating called slope one prediction [8]. It is simple yet effective for quickly predicting a rating without resorting to memory-intensive calculation. The idea behind slope one prediction is to fit the prediction to a linear combination of other variables. Think of a linear equation Y = aX + b, where Y is Amy's rating for Movie 5, X is built from her ratings of Movies 1 and 2, a is a weighting constant, and b is a slope adjustment.

First, we need to calculate the co-rating difference ("distance") between Movie 5 and Movies 1 and 2 for each user.
Let M1 denote Movie 1, M2 denote Movie 2, and so on.

Wong's co-rating of M5 and M1 => 1 − 1 = 0
James' co-rating of M5 and M1 => 1 − 3 = −2
The average distance is (0 + (−2)) / 2 = −1
Amy's predicted rating for M5 based on her rating for M1 is 3 + (−1) = 2

Wong's co-rating of M5 and M2 => 1 − 5 = −4
James' co-rating of M5 and M2 => 1 − 3 = −2
The average distance is ((−4) + (−2)) / 2 = −3
Amy's predicted rating for M5 based on her rating for M2 is 3 + (−3) = 0

We can combine the two per-movie predictions into Amy's rating for Movie 5, weighting each by its number of usable co-ratings:

We weight each distance by the number of usable co-ratings, so that movies with many co-ratings count more than movies with few. For instance, if James had not rated Movie 1, there would be no co-rating of M5 and M1 for James, and the number of usable co-ratings of that pair would be only one.

But in this case, we have:

Number of usable co-ratings of M5 and M1 = 2 (both Wong and James rated M5 and M1)
Number of usable co-ratings of M5 and M2 = 2 (both Wong and James rated M5 and M2)
Plugging these numbers into the equation yields (2 × 2 + 2 × 0) / (2 + 2) = 1.
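The walk-through maps directly to code. A minimal weighted Slope One sketch, restricted to the ratings quoted in this section (Wong, James, and Amy on Movies 1, 2 and 5), so the result reflects only those co-ratings:

```python
# Ratings quoted in the walk-through above: Wong and James rated Movies
# 1, 2 and 5; Amy rated Movies 1 and 2, and we predict her Movie 5 rating.
ratings = {
    "Wong":  {"M1": 1, "M2": 5, "M5": 1},
    "James": {"M1": 3, "M2": 3, "M5": 1},
    "Amy":   {"M1": 3, "M2": 3},
}

def slope_one(ratings, user, target):
    """Weighted Slope One: average deviation between the target item and
    each item the user rated, weighted by the number of co-rating users."""
    num, den = 0.0, 0
    for item in ratings[user]:
        # users who rated both the target and this item
        co = [u for u in ratings
              if target in ratings[u] and item in ratings[u]]
        if not co:
            continue                    # no usable co-ratings for this pair
        dev = sum(ratings[u][target] - ratings[u][item] for u in co) / len(co)
        num += (dev + ratings[user][item]) * len(co)
        den += len(co)
    return num / den

pred = slope_one(ratings, "Amy", "M5")
```

Deviations can be precomputed once per item pair and updated incrementally, which is what makes Slope One attractive for quick, low-memory serving.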

The benefit of this method is its simplicity — it works well for non-complex use cases. For a simple application, the cost-to-benefit ratio of a complex algorithm may not be favorable, so a less complex method such as this may be more appropriate than user-based or item-based nearest neighbor. Think of it as a quick way to populate recommendations on the landing page of an e-commerce website, where customers just need a quick display of several recommended items, with some tolerance on accuracy and reliability, before they continue browsing other items.

Market Basket Analysis — Association Rules

The diversity of items in a supermarket may necessitate market-basket analysis. Photo by Viki Mohamad on Unsplash.

Market basket analysis is useful when we need to measure items' association with each other. Consider a different example where we want to measure which supermarket items often go together in a customer's basket. The table above shows hypothetical data with five users and the quantity of each item they purchased; 0 indicates they did not buy the product.

Assume x = milk and y = egg. For someone who buys milk, we want to measure whether egg is a good complementary association, or whether another item has a stronger association with milk. We can calculate the association of x and y using these metrics:

Support is the simplest measure: the fraction of transactions containing both x and y, i.e. support(x ⇒ y) = P(x ∧ y). Think of it as how many customers purchase both milk and egg out of all transactions in the store.

Confidence is the fraction of transactions containing x that also contain y, i.e. confidence(x ⇒ y) = P(y | x) = support(x ∧ y) / support(x). Think of it as, among the customers who buy milk, how many also buy egg.

Lift is the ratio of the confidence to the expected confidence: lift(x ⇒ y) = confidence(x ⇒ y) / support(y). Lift tells us how much more likely y is to be purchased given that x was purchased, compared to y's baseline popularity across all transactions.

We can use these formulas to calculate the associative metrics for all possible pairs of items. The full calculation can be seen on Github, but here is a snippet of the results for Milk-Egg and Milk-Flour:
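The three metrics are a few lines each in Python. The baskets below are hypothetical (the article's quantity table is not reproduced here, so simple presence/absence is used):

```python
# Hypothetical baskets, one set of items per transaction.
baskets = [
    {"milk", "egg", "flour"},
    {"milk", "egg"},
    {"milk", "flour"},
    {"egg"},
    {"milk", "flour", "butter"},
]

def support(x, y=None):
    """Fraction of baskets containing x (and y, if given)."""
    wanted = {x} | ({y} if y is not None else set())
    return sum(1 for b in baskets if wanted <= b) / len(baskets)

def confidence(x, y):
    """Of the baskets containing x, the fraction that also contain y."""
    return support(x, y) / support(x)

def lift(x, y):
    """Confidence relative to y's baseline popularity; symmetric in x, y."""
    return confidence(x, y) / support(y)
```

Note that confidence depends on the direction of the rule, while lift, being P(x ∧ y) / (P(x) · P(y)), is the same for x ⇒ y and y ⇒ x.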

Compare the associations of milk-egg and milk-flour — notice that the metrics are the same in both directions for milk-egg, but the confidence of milk-flour depends on the direction of the rule. This shows that a rule A ⇒ B does not always hold equally in reverse: a customer who buys milk may be likely to buy flour, but a customer who buys flour may not necessarily buy milk. Hence, we can suggest flour to someone who buys milk with much higher confidence than we can suggest egg. For more on market-basket association, see Larose 2014 [9].

Model-based Approach: Data Reduction via Matrix Decomposition

Often when the data is massive, calculating all the similarity indices consumes huge computing power and memory. Instead of calculating them on demand, we can pre-compute model parameters, store them aside, and use them only when needed — say, when a customer actually logs in to the e-commerce website. This approach reduces a large matrix of ratings or transaction vectors to a smaller approximation that remains highly correlated with the actual data, saving the machine from exhausting its memory. In contrast to the memory-based methods discussed previously, this approach is called model-based. In this section, we will discuss one such method, Singular Value Decomposition (SVD). For more information, see Grigorik 2007 [10].

SVD allows us to decompose a matrix M into a product of three matrices — U, S, and V — where U contains the left singular vectors, V the right singular vectors, and S is a diagonal matrix of singular values [11]:

M = U · S · Vᵀ

Again, consider the same movie ratings. We are still interested in predicting Amy's preference for Movie 5. First, we establish her similar users, which are still Wong, James, Kevin, and Ali. In Python, we can use NumPy to compute the decomposition as below:
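A minimal NumPy sketch follows; the matrix values are hypothetical stand-ins for the article's table. Note that with a users-by-items matrix, as here, folding a new user into the reduced space uses V₂ and S₂⁻¹ (with an items-by-users matrix it would be U₂ and S₂⁻¹):

```python
import numpy as np

# Hypothetical 4-users x 5-movies rating matrix (Amy held out);
# the article's exact values are not reproduced here.
M = np.array([
    [1, 5, 4, 2, 1],   # Wong
    [3, 3, 2, 2, 1],   # James
    [4, 2, 5, 5, 4],   # Kevin
    [2, 4, 3, 5, 2],   # Ali
], dtype=float)

# Full decomposition: M = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Rank-2 approximation keeps only the two largest singular values
U2, s2, Vt2 = U[:, :2], s[:2], Vt[:2, :]
M2 = U2 @ np.diag(s2) @ Vt2

# Fold a new user's rating vector into the 2-D space: r @ V2 @ inv(S2)
amy = np.array([3, 3, 4, 4, 0], dtype=float)   # unknown Movie 5 set to 0
amy_2d = amy @ Vt2.T @ np.diag(1.0 / s2)
```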

To find Amy's position in the two-dimensional space, we take Amy's rating vector and multiply it by an appropriate number of columns/dimensions, usually the first 2 or 3, commonly known as a rank-k approximation. This is legitimate because we have reduced M to a linear combination of low-rank matrices; the first two columns still capture a large percentage of the variance in the original dataset.

Say we choose rank 2. Then we take the first two columns of matrix U (denoted U₂) and the inverse of the first two entries of the diagonal matrix S (denoted S₂⁻¹). This calculation yields an approximation of Amy's position in two-dimensional space, as below:

Which gives us [-0.55, -0.16]
Similarly, we can calculate the two-coordinate reduction of other users and items, and visualize them in a two-dimensional plot as below:

A benefit of this data reduction is that we can use it not only to find the movies closest to Amy, but the users closest to her as well. In other words, it combines user-based and item-based nearest neighbor calculations, yielding both kinds of parameters of interest to the engine. There are other dimensionality reduction techniques besides SVD, such as PCA and t-SNE. For more information, see Sarwar 2000 [12].

Issues and Consideration

Neighborhood Selection:

Selecting a limited number of neighbors may be a more pragmatic approach, as it reduces calculation time. It may also eliminate noise — ratings that carry no signal for the prediction at hand — narrowing the calculation down to only the ratings we are interested in. One option is to include only neighbors with a positive similarity to the user of interest. Another is a weighting scheme that gives additional weight to neighbors who have rated the same items as the user of interest.

Alternative to Rating:

In the real world, not all data come as ratings. In many cases they come as transactions, where users are recorded by purchase quantity and amount. Some websites, such as YouTube, profile users by their viewing history. One way to translate such implicit behavior into ratings is to propose a threshold above which the activity counts as a high rating. For example, if a user views more than half of a video's total length, we can assume the user enjoyed it. Likewise, if a user stays on a webpage for a long time, say to read an article, that may mean the user enjoys the content.
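The thresholding idea can be sketched as follows; the 0.9 / 0.5 / 0.25 cut-offs and the 1-5 mapping are illustrative assumptions, not values from this article:

```python
# A sketch of converting implicit feedback (watch time) into a pseudo-rating.
# The thresholds and the 1-5 mapping below are illustrative assumptions.
def implicit_rating(watched_seconds, total_seconds):
    """Map the fraction of a video watched to a 1-5 pseudo-rating."""
    ratio = watched_seconds / total_seconds
    if ratio >= 0.9:
        return 5
    if ratio >= 0.5:       # "watched more than half": assume the user liked it
        return 4
    if ratio >= 0.25:
        return 2
    return 1

rating = implicit_rating(75, 100)   # watched three quarters of the video
```

Once implicit interactions are mapped to pseudo-ratings like this, any of the rating-based methods above can be applied unchanged.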

Data Sparsity and Cold Start:

Cold start refers to new users with little or no history, or new items entering the system. How can we estimate a user's preference before collecting any information? How can we guess which users will like a new item? In this case, we may need to exploit additional information, such as location, demographics, age, or income level, and use it to find other users who share similar attributes. Similarly, we can use additional attributes of a new item: for a movie, its genre and year of production. Exploiting additional attributes requires methods beyond user-based collaborative recommendation. That is the subject of the next article, content-based recommendation.

Conclusion

Every like and story view on Instagram is vital in digital marketing. Photo by Erik Lucatero on Unsplash.

We have seen the primary building blocks of a simple recommendation system: users and items. The task is to utilize the historical interactions between existing users and items, be they ratings, purchase quantities, or other data. Using the established relationships, we can predict new items for users who have not yet consumed them. The relationship can be user-centric, item-centric, or both, as in the case of SVD.

Complexity arises when we have many users of varying demographics, or many items with varying content. Imagine recommending hotel rooms to users: we need to tailor the recommendation considering the number of bedrooms, location, star rating, and length of stay. Likewise, recommending books requires considering not only the content of the book (genre, length, language) but also the customer (mother tongue, age, purchasing power, and more).

How can we handle these additional complexities to arrive at a more meaningful and accurate recommendation? In the next article, I will explore content-based recommendation, covering methods that incorporate item attributes, particularly those used in textual analysis and document retrieval systems.

References

[1] Ricci, F. et al. "Introduction to Recommender Systems Handbook." In Recommender Systems Handbook, 3. Springer, 2011.

[2] The ACM Conference Series on Recommender Systems. https://recsys.acm.org/.

[3] Resnick, P. et al. "GroupLens: An Open Architecture for Collaborative Filtering of Netnews." In Proceedings of the ACM 1994 Conference on Computer Supported Cooperative Work, 175–186. ACM, 1994.

[4] Herlocker, J. L. et al. "An Algorithmic Framework for Performing Collaborative Filtering." In Proceedings of the 22nd Annual International ACM SIGIR Conference, 230–237. ACM, 1999.

[5] Jannach, D. et al. Recommender Systems: An Introduction, 19. New York: Cambridge University Press, 2011.

[6] Linden, G. et al. "Amazon.com Recommendations: Item-to-Item Collaborative Filtering." IEEE Internet Computing 7, no. 1 (2003): 76–80.

[7] Chee, S. H. S. et al. "RecTree: An Efficient Collaborative Filtering Method." In Proceedings of the 3rd International Conference on Data Warehousing and Knowledge Discovery, 141–151. Munich, 2001.

[8] Lemire, D. and Maclachlan, A. "Slope One Predictors for Online Rating-Based Collaborative Filtering." In Proceedings of the 2005 SIAM International Conference on Data Mining, 471–475. SIAM, 2005.

[9] Larose, D. and Larose, C. Discovering Knowledge in Data: An Introduction to Data Mining, 2nd ed., 247–265. Wiley, 2014.

[10] Grigorik, I. "SVD Recommendation System in Ruby," 2007. http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/.

[11] Golub, G. and Kahan, W. "Calculating the Singular Values and Pseudo-Inverse of a Matrix." Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis 2, no. 2 (1965): 205–224.

[12] Sarwar, B. et al. "Application of Dimensionality Reduction in Recommender Systems — A Case Study." University of Minnesota Department of Computer Science, 2000.

Data Scientist with interests in applicable solutions in the retail and consumer industries.