What is the Alternating Least Squares method in recommendation systems?
-
I understand the SGD method, but I am unable to understand the alternating least squares (ALS) method. I have a rating matrix R and I decompose it into R = U*M'. Everywhere I read, it says to initialize M with average values and then keep M fixed and optimize U, and vice versa. But can someone explain this algorithm (plus points for working out a simple example)? And why does this algorithm work (what's the intuition behind it)? Thanks
-
Answer:
In SGD you are repeatedly picking some subset of the loss function to minimize -- one or more cells in the rating matrix -- and nudging the parameters to drive just those errors toward zero. In ALS you're minimizing the entire loss function at once, but only adjusting half the parameters. That's because the optimization has an easy algebraic solution -- if half your parameters are fixed. So you fix half, recompute the other half, and repeat. There is no gradient in the optimization step since each subproblem is convex and doesn't need an approximate approach. But each problem you're solving is not the "real" optimization problem -- you fixed half the parameters.

You usually initialize M with random unit vectors. It's in feature space, so it wouldn't quite make sense to have averages over ratings.

http://labs.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf (Collaborative Filtering for Implicit Feedback Datasets) explains ALS pretty well, along with one particular formulation for applying it called ALS-WR. (They're not quite factoring the rating matrix, but using ratings as weights in the loss function -- the algorithm and algebraic form of the solution are similar, just with that c_u weight term.)

Myrrix has a fairly clean and easy-to-follow implementation of ALS-WR in Java (there's also a Hadoop-based version): http://code.google.com/p/myrrix-recommender/source/browse/trunk/online/src/net/myrrix/online/factorizer/als/AlternatingLeastSquares.java Apache Mahout has an implementation too.
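To make the fix-half/solve-half loop concrete, here is a minimal numpy sketch on a small, fully observed rating matrix. The names (R, U, M) and the rank and regularization values are illustrative assumptions, not taken from Myrrix or Mahout.

import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(6, 5)).astype(float)   # 6 users x 5 items, ratings 1-5
k, lam = 3, 0.1                                      # latent rank, regularization (assumed values)

M = rng.normal(size=(5, k))                          # item factors, random init
U = np.zeros((6, k))                                 # user factors, solved below

for _ in range(10):
    # M fixed: U has a closed-form regularized least-squares solution.
    U = np.linalg.solve(M.T @ M + lam * np.eye(k), M.T @ R.T).T
    # U fixed: the same algebra solves for the item factors.
    M = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ R).T

print(np.round(U @ M.T, 2))                          # reconstructed ratings

Each solve only touches a k-by-k system, which is why fixing half the parameters makes the remaining subproblem cheap and exact.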
Sean Owen at Quora
Other answers
When using a matrix factorization approach to implement a recommendation algorithm, you decompose your large user/item matrix into lower-dimensional user factors and item factors. In the most simple approach you can then estimate the user rating (or, in general, preference) by multiplying those factors according to the following equation:

r'_{ui} = p_u^T q_i (1)

In order to learn those factors you need to minimize the following quadratic loss function:

min_{q,p} \sum_{u,i} (r_{ui} - p_u^T q_i)^2 (2)

Note that for simplicity I am omitting the possible biases in the first equation and the regularization in the second one.

In an SGD (Stochastic Gradient Descent) approach, for each example in the dataset you compute the error (r_{ui} - p_u^T q_i) and then you update the parameters by a factor in the opposite direction of the gradient.

Alternating Least Squares (ALS) represents a different approach to optimizing the loss function. The key insight is that you can turn the non-convex optimization problem in Equation (2) into an "easy" quadratic problem if you fix either p_u or q_i. ALS fixes each of those alternately: when one is fixed, the other one is computed, and vice versa.

There are two main benefits of this approach. First, it is very easy to parallelize. Second, whenever you are dealing with implicit datasets, which are usually not sparse, SGD is not practical (users times items can easily be in the order of billions); ALS is a much more efficient optimization technique in these cases.

References:
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=34AEEE06F0C2428083376C26C71D7CFF?doi=10.1.1.167.5120&rep=rep1&type=pdf
http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf
http://wanlab.poly.edu/recsys12/recsys/p83.pdf

Implementations:
https://mahout.apache.org/users/recommender/intro-als-hadoop.html
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
http://libfm.org/
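As a rough sketch of the per-example SGD update described above, matching the unregularized loss in Equation (2); the array names P, Q and the learning rate are illustrative assumptions:

import numpy as np

def sgd_step(P, Q, u, i, r_ui, lr=0.01):
    # One SGD update for a single observed rating r_ui.
    e_ui = r_ui - P[u] @ Q[i]                # error (r_ui - p_u^T q_i)
    p_u, q_i = P[u].copy(), Q[i].copy()      # use the pre-update values for both factors
    P[u] += lr * e_ui * q_i                  # step against the gradient w.r.t. p_u
    Q[i] += lr * e_ui * p_u                  # step against the gradient w.r.t. q_i

rng = np.random.default_rng(1)
P = rng.normal(scale=0.1, size=(4, 3))       # user factors p_u
Q = rng.normal(scale=0.1, size=(5, 3))       # item factors q_i
sgd_step(P, Q, u=0, i=2, r_ui=4.0)           # one update for observed rating r_{0,2} = 4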
Xavier Amatriain
Alternating Least Squares is a method that alternates between two matrices in a product such as Y = UV', where Y is data. Essentially it guesses U to estimate V and then alternates back and forth until U and V stop changing. By fixing one or the other, it becomes a simple least squares solution (by generalized inverses). It's been used in psychometrics and statistics for quite a long time (since the 1970s) because it allows you to deal with situations such as missing data in Y that would make the usual approach to the singular value decomposition fail. You can also impose constraints in an ALS algorithm to regularize the elements in U or V, which can be very useful. You may want to take a look at http://projecteuclid.org/euclid.ss/1028905828.
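To illustrate the missing-data point, here is a small, hedged sketch of ALS in which each row of U (and of V) is solved using only the observed cells of Y; all names (Y, mask, U, V, k, lam) are illustrative assumptions rather than any standard API.

import numpy as np

def als_missing(Y, mask, k=2, lam=0.1, iters=20, seed=0):
    # Alternating ridge solves over observed entries only.
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    U, V = rng.normal(size=(m, k)), rng.normal(size=(n, k))
    for _ in range(iters):
        for u in range(m):                     # solve each user with its observed items
            obs = mask[u]
            A = V[obs].T @ V[obs] + lam * np.eye(k)
            U[u] = np.linalg.solve(A, V[obs].T @ Y[u, obs])
        for i in range(n):                     # solve each item with its observed users
            obs = mask[:, i]
            A = U[obs].T @ U[obs] + lam * np.eye(k)
            V[i] = np.linalg.solve(A, U[obs].T @ Y[obs, i])
    return U, V

Y = np.array([[5., 3., 0.], [4., 0., 1.], [0., 2., 5.]])
mask = Y > 0                                   # treat zeros as missing in this toy example
U, V = als_missing(Y, mask)
print(np.round(U @ V.T, 2))                    # fills in the missing cells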
Jay Verkuilen
ALS is a matrix factorization algorithm. Matrix factorization decomposes a large matrix into a product of matrices:

R = U * V

For example, in recommendation systems, let us consider R as a matrix of users (rows) and items (columns), whose entries are the ratings. Matrix factorization allows us to discover the latent features that define the interactions between users and items. In other words, ALS uncovers the latent features.

R is an m*n matrix
U is an m*r matrix - the user/feature matrix
V is an r*n matrix - the item/feature matrix

So the model gives us U and V in a low-dimensional feature space, which is also space efficient for large data sets.

Here is how it works (a small sketch follows below):
1> V is initialized with the average item ratings as its first row and random numbers for the remaining rows.
2> Alternate the minimization of the cost function J(R, U, V) until the stopping criterion is met:
   - Fix V and solve for U by minimizing J(R, U, V), i.e. by solving a least-squares problem.
   - Then fix U and solve for V by minimizing J(R, U, V), again as a least-squares problem.

Why do we alternate? We turn the problem from a non-convex problem into a convex one. Optimizing U and V simultaneously is non-convex and hard to solve; if we fix U or V, the problem becomes convex, which is easy to solve.
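A hedged sketch of the recipe above, assuming a small dense rating matrix: the first row of V is seeded with average item ratings, and the alternation stops when the cost J stops improving. All names and values (R, U, V, r, lam, tol) are illustrative.

import numpy as np

rng = np.random.default_rng(0)
m, n, r, lam, tol = 8, 6, 3, 0.1, 1e-6
R = rng.integers(1, 6, size=(m, n)).astype(float)    # ratings 1-5 for m users x n items

V = rng.normal(scale=0.1, size=(r, n))               # item/feature matrix (r x n)
V[0, :] = R.mean(axis=0)                             # step 1: first row = average item ratings
U = np.zeros((m, r))                                 # user/feature matrix (m x r)

def cost(R, U, V, lam):
    # Regularized squared-error cost J(R, U, V).
    return np.sum((R - U @ V) ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2))

prev = np.inf
while True:
    # step 2a: fix V, solve for U (one regularized least-squares problem per user)
    U = np.linalg.solve(V @ V.T + lam * np.eye(r), V @ R.T).T
    # step 2b: fix U, solve for V
    V = np.linalg.solve(U.T @ U + lam * np.eye(r), U.T @ R)
    J = cost(R, U, V, lam)
    if prev - J < tol:                               # stopping criterion: J stopped improving
        break
    prev = J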
Arvind Rapaka
Before discussing ALS, let's briefly discuss the least squares problem (in particular, regularised least squares). Let's consider a feature matrix X \in \mathbb{R}^{m \times d} and target value y \in \mathbb{R}^{m \times 1}; then regularised least squares optimises

\arg\min_w \| y - Xw \|^2 + \lambda \| w \|_2^2

One of the interesting properties of linear regression is that the analytical solution for the optimal w is given by

w = (X^T X + \lambda I)^{-1} X^T y

In multi-regression, we perform regression over multiple targets Y = \{y_1, y_2, ..., y_n\} using the same feature matrix. The solution is given as

W = (X^T X + \lambda I)^{-1} X^T Y

In general, we rarely get to use this elegant analytical solution, as it involves inverting a d \times d matrix, where d ranges from thousands to millions in practice. Note that inversion of a large matrix is not only computationally expensive but also numerically unstable. However, despite being rarely used in real-world problems, the analytical solution is a boon for recommender systems, as we will show shortly.

Let's discuss the matrix factorisation (MF) framework. MF optimises

\arg\min_{U,V} \| R - U^T V \|^2 + \lambda ( \| U \|_2^2 + \| V \|_2^2 ),

where R is the partially observed user-item preference matrix, U \in \mathbb{R}^{k \times m}, and V \in \mathbb{R}^{k \times n}. Like other methods, you can optimise the MF objective using a gradient-based method, say SGD.

In general, there are two different Collaborative Filtering (CF) scenarios, namely Rating Prediction and Implicit Feedback CF (without loss of generality we consider one-class collaborative filtering). For rating prediction, R \in \mathbb{R}^{m \times n}, the gradients w.r.t. U and V are computed only using the observed entries. Typically, R is extremely sparse, and hence the algorithm scales linearly with the number of observed entries (which is acceptable for large-scale problems).

Now, in One-Class CF (OC-CF) the user-item interaction consists of positive-only preferences, such as item purchases. In this scenario, R \in \{?, 1\}^{m \times n}, where 1 indicates a purchase and ? indicates an unobserved entry. If we do SGD using only the observed entries, we learn a useless model that always predicts 1, which is out of the question. To address this, most OC-CF algorithms treat unobserved entries as 0. However, now we have too many entries in the matrix R, i.e. m \times n, for gradient computation, which makes it extremely hard if not impossible on any real-world dataset. This can be addressed by negative sub-sampling, but again sampling is expensive and doesn't work well in this setting.

You know what time it is? It's ALS time :D

As the name suggests, ALS solves alternating least squares problems, as outlined below.

ALS Algorithm: Repeat
a. Fix U and solve for V
b. Fix V and solve for U

This is actually (careful) block coordinate descent. Let's dissect this algorithm by considering step 'a'.
If you fix U, the MF objective is equivalent to a multi-regression problem with U as the feature matrix and R as the target, and the optimal value for V is given as:

V = (UU^T + \lambda I)^{-1} U R

Unlike standard regression, here we can use the analytical solution, as it involves inverting a small k \times k matrix (k generally ranges from 10 to 1000) and sparse matrix multiplication. Similarly, the solution for step 'b' is given as:

U = (VV^T + \lambda I)^{-1} V R^T

The elegance of the ALS solution is in being able to solve MF for OC-CF efficiently, which is hard if not impossible using gradient-based methods. Pretty cool, eh? So ALS is a big deal in the Implicit/OC-CF setting. ALS can also be applied to rating prediction, but since SGD scales very well in that setting, the ALS trick is less significant there.

Note that, in general, the weighted MF (WRMF) model is used for OC-CF; however, the idea of ALS remains the same. The choice of vanilla MF in this answer is just for simplicity.

In summary, ALS is a nice trick that allows an efficient solution to MF, especially for Implicit Feedback/OC-CF problems.
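A minimal numpy sketch of the two closed-form updates above on a small one-class matrix, with unobserved entries treated as 0; the names and values (R, U, V, k, lam, number of iterations) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
m, n, k, lam = 6, 8, 3, 0.1
R = (rng.random((m, n)) < 0.3).astype(float)          # 1 = purchase, 0 = unobserved (treated as 0)

U = rng.normal(scale=0.1, size=(k, m))                # user factors, k x m
V = rng.normal(scale=0.1, size=(k, n))                # item factors, k x n

for _ in range(15):
    # step a: fix U, solve V = (U U^T + lam I)^{-1} U R
    V = np.linalg.solve(U @ U.T + lam * np.eye(k), U @ R)
    # step b: fix V, solve U = (V V^T + lam I)^{-1} V R^T
    U = np.linalg.solve(V @ V.T + lam * np.eye(k), V @ R.T)

scores = U.T @ V                                      # predicted preference scores, m x n
print(np.round(scores, 2))

Note that only a k-by-k system is ever inverted, which is what keeps the analytical solution tractable even when m times n is huge.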
Suvash Sedhain