How to create a similarity matrix from a large dataset without losing precision?
The problem is to create a similarity matrix from hundreds of thousands of instances. Simple MATLAB code is not enough, since neither the data nor the n-by-n similarity matrix fits into main memory. There should be some divide-and-conquer approach for this kind of problem, which is unavoidable in large-scale retrieval and kNN-based classification. So what is the right approach for building a similarity matrix of instances from a large dataset? I should also note that MapReduce and Hadoop are out of scope.
Answer:
The similarity matrix is huge, so it's going to be problematic. 100k^2 is 10^10 entries, so the matrix itself would take at least 40 GB to store (assuming 4-byte floats). Assuming your input matrix is dense, you are basically stuck with the existing matrix multiplication algorithms. If your input matrix is sparse, you might be able to run it faster, but note that multiplying two sparse matrices usually gives a much denser result, so a sparse input does not mean the resulting similarity matrix is sparse. I'd consider approximations. You can use DISCO (https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco) or some latent factor model to find a low-dimensional representation of the space instead.
Erik Bernhardsson at Quora
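The storage estimate in the answer above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `dense_matrix_bytes` is made up for illustration, and the second figure assumes the similarity function is symmetric, so that only the entries above the diagonal need to be stored.

```python
def dense_matrix_bytes(n, bytes_per_entry=4):
    """Bytes needed to store a dense n x n matrix."""
    return n * n * bytes_per_entry


n = 100_000

# Full matrix with 4-byte floats: 10^10 entries * 4 bytes.
full = dense_matrix_bytes(n)

# Assumption: similarity is symmetric, so the strict upper triangle suffices.
upper = n * (n - 1) // 2 * 4

print(f"full matrix:    {full / 1e9:.1f} GB")   # 40.0 GB
print(f"upper triangle: {upper / 1e9:.1f} GB")  # 20.0 GB
```

Even the halved figure is far beyond typical RAM, which is why the rest of the answers focus on not materializing the matrix at all.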
Other answers
What is the bottleneck? Nothing about this problem requires you to keep the n x n similarity matrix in memory. The naive approach has two instances in memory at a time, computes the similarity, and writes it out.
Sean Owen
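The streaming idea in this answer might look roughly like the sketch below. Several things here are assumptions rather than part of the answer: cosine similarity stands in for whatever similarity function is actually used, the instances themselves are assumed to fit in memory (only the n x n output does not), and the names `cosine` and `write_pairwise` are invented for the example. Pure Python is used to keep it self-contained; a real implementation would likely vectorize the inner loop.

```python
import csv
import io
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def write_pairwise(instances, out):
    """Write one 'i,j,similarity' row per unordered pair; return the row count.

    Each similarity is written out as soon as it is computed, so the
    n x n matrix is never held in memory.
    """
    writer = csv.writer(out)
    count = 0
    for (i, a), (j, b) in combinations(enumerate(instances), 2):
        writer.writerow([i, j, f"{cosine(a, b):.6f}"])
        count += 1
    return count


# Tiny illustrative dataset; a real run would pass an open file instead of StringIO.
instances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
buf = io.StringIO()
n_pairs = write_pairwise(instances, buf)
print(n_pairs)         # 3
print(buf.getvalue())
```

This keeps memory use flat, but the running time and the size of the output file still grow quadratically with the number of instances.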
Please be more specific about:
1. What features do you have? For text features, for example, the answer is quite different than for numeric ones.
2. What is your goal? Do you really need the full 100000 x 100000 similarity matrix, or is knowing the N most similar instances for each instance enough?
Without knowing any specifics, my recommendation would be to try K-means (so don't compute the similarity between each pair of instances, only the similarity of each instance to the centroids).
Michal Illich
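If only the N most similar instances per instance are needed, the output shrinks from n x n to n x N. The following is a minimal sketch of that idea, not the answer author's method: it assumes cosine similarity and in-memory instances, and `top_n_neighbors` is a hypothetical helper name. It still compares every pair, so it saves memory rather than time.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_neighbors(instances, n_neighbors):
    """Map each instance index to its n_neighbors best (similarity, index) pairs."""
    result = {}
    for i, a in enumerate(instances):
        # Lazily score every other instance; nlargest keeps only the best few.
        scored = ((cosine(a, b), j) for j, b in enumerate(instances) if j != i)
        result[i] = heapq.nlargest(n_neighbors, scored)
    return result


# Two loose groups: instances 0 and 1 point mostly along x, 2 and 3 along y.
instances = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
neighbors = top_n_neighbors(instances, 1)
for i, best in neighbors.items():
    print(i, best)
```

The K-means suggestion goes one step further and replaces the inner loop over all instances with a loop over a small number of centroids.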
Just store all the instances and compute the similarity 'on the fly'.
Charles H Martin
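One way to read "on the fly" is an object that looks like a matrix but computes each entry only when it is asked for. The class below is a hypothetical sketch along those lines, again assuming cosine similarity; the name `LazySimilarity` is not from the answer. Only the instances and one norm per instance are stored.

```python
import math


class LazySimilarity:
    """Matrix-like view whose entries are computed on demand, never stored."""

    def __init__(self, instances):
        self.instances = instances
        # One norm per instance: O(n) extra memory instead of O(n^2).
        self.norms = [math.sqrt(sum(x * x for x in v)) for v in instances]

    def __getitem__(self, pair):
        i, j = pair
        denom = self.norms[i] * self.norms[j]
        if denom == 0.0:
            return 0.0  # convention chosen here for all-zero vectors
        dot = sum(x * y for x, y in zip(self.instances[i], self.instances[j]))
        return dot / denom


sim = LazySimilarity([[3.0, 4.0], [4.0, 3.0], [0.0, 0.0]])
print(sim[0, 1])  # 24 / 25 = 0.96
print(sim[0, 2])  # 0.0
```

The trade-off is that every lookup costs a dot product, so this suits workloads that touch only a small fraction of the pairs.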
The preferred way to compute a similarity matrix might depend on the similarity function you use. If you want Amazon-like "People who bought this also bought..." similarity on user-item tables, then a relational database might be the best option for you. Relational database indices are very efficient at joining tables like that. You also get the memory vs. disk management for free. And the best part: you can implement the whole thing in very few lines of SQL. As said, the output might be very large, so even just going over it or storing it might be an issue. Note that this issue is independent of the implementation. The good news is that in most scenarios your data will be long-tailed, so the matrix will be very sparse. Just require that the number of matches be above a certain threshold and your output will be much smaller, while most of the relations you omit are probably dissimilar or insignificant.
Dan Levin
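The "few lines of SQL" could look something like the sketch below, run here through Python's built-in `sqlite3` module so it is self-contained. The table name `purchases`, its columns, the sample rows, and the threshold `MIN_MATCHES` are all made up for illustration, and the query assumes each (user, item) pair appears at most once.

```python
import sqlite3

# Hypothetical user-item table; a real setup would use an on-disk database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, item_id INTEGER)")
conn.execute("CREATE INDEX idx_purchases_user ON purchases (user_id)")

rows = [
    (1, 10), (1, 20), (1, 30),
    (2, 10), (2, 20),
    (3, 10), (3, 30),
]
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)

# Keep only item pairs bought together by at least this many users.
MIN_MATCHES = 2

query = """
SELECT a.item_id AS item_a, b.item_id AS item_b, COUNT(*) AS matches
FROM purchases AS a
JOIN purchases AS b
  ON a.user_id = b.user_id AND a.item_id < b.item_id
GROUP BY a.item_id, b.item_id
HAVING COUNT(*) >= ?
ORDER BY matches DESC, item_a, item_b
"""
pairs = conn.execute(query, (MIN_MATCHES,)).fetchall()
print(pairs)  # [(10, 20, 2), (10, 30, 2)]
```

The pair (20, 30) is bought together by only one user, so the `HAVING` threshold drops it. That is the pruning of the long tail described in the answer.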