How to implement matching?

How would you technically implement a matching algorithm like OkCupid?

  • For those who are unfamiliar, OkCupid asks users to answer personality questions in one of several ways. It then asks how you would want your ideal match to answer, and how important the question is to you. From these questions, it builds a "match score" between you and every other user on the site (along with an Enemy and a Friend score). Their algorithm appears to be highly efficient: it's possible to almost instantly search for the people with the highest and lowest match scores to you anywhere in the world. My assumption was that, to return reliable results for this kind of search, you would need to re-calculate the current user's answers against every other user's answers every time a search was done. There are some details of how their match score is calculated here: http://www.okcupid.com/faaaq, but they do not describe the specific technology they use to do it.

    Edit: I'm specifically looking for suggestions on how a small start-up can implement this kind of feature at low to moderate implementation cost. It doesn't have to work as well as OKC or other sites that have invested in these algorithms.

    Edit 2: Since our application only deals with boolean values, one idea I had is to build a matrix where each column is an interest point and each row is a user, like so:

           i1|i2|i3|i4|i5|
        u1| 0| 1| 0| 1| 1|
        u2| 1| 1| 0| 0| 0|
        u3| 0| 0| 1| 0| 1|
        u4| 1| 1| 0| 1| 0|

    Then, in order to identify the most similar users, I would remove the current user's row and the columns where the current user has a 0 value, like so:

           i2|i4|i5|
        u2| 1| 0| 0|
        u3| 0| 0| 1|
        u4| 1| 1| 0|

    Then the values for each remaining user would be summed, and all the rows sorted by the sum (a sketch of this approach follows this question). Since all the data except for user IDs and interest IDs is boolean, it would theoretically not take up too much space, and I might be able to apply a fast bit-counting routine such as this one to it: http://gurmeetsingh.wordpress.com/2008/08/05/fast-bit-counting-routines/

    It has been a long time since I took linear algebra, so I'm not sure of the best way to implement this, or whether it would even work for larger sets of users and interests. There appear to be some decent open-source libraries that can do matrix algebra, but I'm hesitant to dive in until I'm confident this is a problem that can be solved.
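A minimal sketch of the Edit 2 idea, assuming Python, with each user's interests stored as a single integer bitmask (the sample users mirror the matrix above; int.bit_count needs Python 3.10+, otherwise use bin(x).count("1")):

    # Bit i of a user's mask is set if they have interest i
    # (bit 0 = i1, bit 1 = i2, and so on).
    users = {
        "u1": 0b11010,  # interests i2, i4, i5
        "u2": 0b00011,  # interests i1, i2
        "u3": 0b10100,  # interests i3, i5
        "u4": 0b01011,  # interests i1, i2, i4
    }

    def rank_matches(current_id, users):
        """Sort the other users by how many interests they share with us."""
        mask = users[current_id]
        ranked = []
        for uid, bits in users.items():
            if uid == current_id:
                continue
            # AND keeps only the interests both users share; bit_count()
            # is the population count the bit-counting link above discusses.
            ranked.append(((bits & mask).bit_count(), uid))
        return sorted(ranked, reverse=True)

    print(rank_matches("u1", users))  # u4 comes out on top with 2 shared

This is equivalent to the row-sum over the reduced matrix: ANDing with u1's mask zeroes out the columns where u1 has a 0, and the popcount is the row sum.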

  • Answer:

    The problem you're trying to solve is Nearest Neighbor Search. You could either use a multidimensional tree structure (like a quad tree or a k-d tree) or use locality-sensitive hashing (http://en.wikipedia.org/wiki/Locality_sensitive_hashing).

Cosmin Negruseri at Quora
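A minimal sketch of the locality-sensitive hashing idea from the answer above, assuming Python and boolean answer vectors. Bit sampling is an LSH family for Hamming distance; the table and bit counts here are illustrative guesses, not tuned values:

    import random
    from collections import defaultdict

    NUM_INTERESTS = 64   # illustrative sizes
    NUM_TABLES = 8       # more tables -> higher recall
    BITS_PER_KEY = 12    # more sampled bits -> smaller, more selective buckets

    random.seed(42)
    # Each hash table samples its own random subset of bit positions;
    # Hamming-close vectors agree on most positions, so they usually
    # land in the same bucket in at least one table.
    tables = [random.sample(range(NUM_INTERESTS), BITS_PER_KEY)
              for _ in range(NUM_TABLES)]
    buckets = [defaultdict(list) for _ in range(NUM_TABLES)]

    def index_user(uid, vector):
        for t, positions in enumerate(tables):
            buckets[t][tuple(vector[p] for p in positions)].append(uid)

    def candidates(vector):
        # Score only the users who share a bucket with us in some table,
        # instead of recomputing against the whole user base.
        found = set()
        for t, positions in enumerate(tables):
            found.update(buckets[t][tuple(vector[p] for p in positions)])
        return found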


Other answers

Don't know about OKC, but eHarmony apparently uses SVD: http://en.wikipedia.org/wiki/Singular_value_decomposition (and has a patent on their application of SVD to dating services, which is somewhat controversial).

Sean McCullough
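The answer only names SVD without saying how eHarmony applies it, so here is one hedged sketch, assuming Python with NumPy: factor the user-by-answer matrix and compare users in a low-rank latent space, which smooths over sparsely answered questions.

    import numpy as np

    A = np.array([[0, 1, 0, 1, 1],   # the question's example matrix:
                  [1, 1, 0, 0, 0],   # rows are users, columns are interests
                  [0, 0, 1, 0, 1],
                  [1, 1, 0, 1, 0]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                             # keep the 2 strongest latent factors
    latent = U[:, :k] * s[:k]         # each user as a k-dimensional vector

    def similarity(i, j):
        """Cosine similarity between users i and j in latent space."""
        a, b = latent[i], latent[j]
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(similarity(0, 3))  # u1 vs u4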

First, don't forget that they can limit the search space to people who match your preferences, like location and age. This might yield just a few thousand people, and it might be feasible to run very expensive algorithms on all of them. But let's say you're more indiscriminate and live in a big metro area, so now we have a few hundred thousand to get through.

OKCupid isn't much like a nearest neighbor search. We are not matching two people together; we are matching requirements to people. Consider that one person might be super picky and their potential partner might be less discriminating. Their score must be some combination of how well they both fit each other's requirements, not how much they are like one another. The requirement-to-person closeness might be nearest neighbor, but I doubt it. Thinking of each question as a dimension isn't helpful -- there is no such thing as "close" or "far". You either match or you don't. To me this suggests a bitwise operation.

The community has helpfully ranked all the questions for us, so we just need to place the answers in aggregate popularity order, do some bitwise operations, and boom, we already have a rough compatibility ranking. This probably doesn't even take that much data -- somewhere between 64 and 256 bits. Then, for people who match you highly, or for people you happen to be viewing, they can go for more accuracy and match the long-tail questions, weighted by the importances you entered.

By the way, I have noticed that OKCupid is pretty relaxed about giving you perfect results, which is how it should be. Try this experiment: move the location of your profile to a totally different city. Your matches will be abysmal in quality -- almost random. A few hours later, they are back to being very good. Presumably there is some batching or clustering happening asynchronously, with cached results. To keep you interested, all OKCupid has to do is keep showing you profiles. They never claim that their results are absolutely complete, or make you wait for the almighty algorithm to finish.

Neil Kandalgaonkar
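A sketch of this answer's requirements-to-people idea, assuming Python. The bitmask layout and the geometric-mean combination are my guesses for illustration, not anything OkCupid has confirmed:

    def fit(required, answers):
        """Fraction of `required` answer bits that `answers` satisfies."""
        if required == 0:
            return 1.0
        return (required & answers).bit_count() / required.bit_count()

    def match_score(a, b):
        # Combine both directions so one picky user drags the score down;
        # the geometric mean is one natural choice for "must fit both ways".
        return (fit(a["required"], b["answers"]) *
                fit(b["required"], a["answers"])) ** 0.5

    # Hypothetical 8-question profiles, most popular question in bit 0.
    alice = {"answers": 0b10110101, "required": 0b00000101}
    bob   = {"answers": 0b00100111, "required": 0b10011000}
    print(match_score(alice, bob))  # bob fits alice fully; alice, 2 of bob's 3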

Bipartite graph matching is one possible way: http://en.wikipedia.org/wiki/Matching_%28graph_theory%29 There's also likely a duality between graph algorithms (e.g., matching) and matrix algorithms (SVD), given that a graph can be expressed as a matrix.

Alex Feinberg
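A sketch of the bipartite-matching idea, assuming Python with SciPy: if the goal were to pair users off one-to-one rather than rank candidates, maximum-weight bipartite matching applies directly. The score matrix below is made up:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # scores[i][j] = match score between user i in group A and user j in group B
    scores = np.array([[0.9, 0.2, 0.4],
                       [0.3, 0.8, 0.1],
                       [0.5, 0.6, 0.7]])

    # linear_sum_assignment minimizes cost, so negate to maximize total score.
    rows, cols = linear_sum_assignment(-scores)
    for i, j in zip(rows, cols):
        print(f"A{i} <-> B{j} (score {scores[i, j]})")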

I don't really know, but I worked on a problem once where we pretty much used brute force for lack of a mathematics background. Possibly useful: the curse of dimensionality [1] means that if you use some kind of distance function, putting people in clusters becomes less useful as you add more questions. It's more efficient to start with questions that rule out most people all by themselves (e.g. geography and simple filters) before scoring the rest. [1] http://en.wikipedia.org/wiki/Curse_of_dimensionality

Brian Slesinsky
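A sketch of this answer's filter-first suggestion in Python; the field names and filters are illustrative:

    def find_matches(me, all_users, score):
        """Cheap filters first, expensive scoring only on the survivors."""
        # Each of these filters rules out most of the user base on its own.
        pool = [u for u in all_users
                if u["city"] == me["city"]
                and me["min_age"] <= u["age"] <= me["max_age"]]
        # Only the remaining few get the expensive score function.
        return sorted(pool, key=lambda u: score(me, u), reverse=True)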

A bit off topic, but if you want some OkCupid data to test your algorithms on, check out: http://www.infochimps.com/datasets/personality-insights-okcupid-questions-and-answers-by-gender-age

Winnie Hsia

You could also hold a sorted key-value store (e.g. leveldb) with ORDERED-ANSWERS|SCORE as keys and ORDERED-ANSWERS as values. There would then be zero calculation when a person changes their answer profile (answering a new question, changing an answer, etc.), zero calculation when adding or removing a person, and zero calculation and minimal IO on result look-up. This calc table would never need to change until a new question was added (you don't even need to do anything on question removal). The only costs would be calculation when adding a new question to the system (which could be pre-padded) and db disk space. The value side of the key-value store would require a quick single-join look-up to match actual people. If you were to asynchronously pre-cache a result set on any answer-profile update, you would be left with only the cache look-up IO on request. I've used this successfully for recommendation engines (e.g. sorting products in online interfaces to elevate items each individual is "most likely to buy").

Alec Hulce
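A hedged sketch of one reading of this answer, assuming Python: precompute the score between every pair of distinct answer profiles and store them in a sorted key-value store, so a prefix scan returns matches best-first. A plain sorted dict stands in for leveldb here, and the score function is a toy:

    def score(a, b):
        """Toy score: number of positions where two answer strings agree."""
        return sum(x == y for x, y in zip(a, b))

    profiles = ["10110", "01101", "10011", "11111"]

    kv = {}
    for p in profiles:
        for q in profiles:
            if p == q:
                continue
            # Invert the score inside the key so a lexicographic scan of
            # "P|..." yields best matches first; q is appended to keep
            # keys unique when two profiles tie on score.
            kv[f"{p}|{len(p) - score(p, q):03d}|{q}"] = q

    def matches(profile):
        """Simulate a leveldb prefix scan over the sorted keys."""
        prefix = profile + "|"
        return [kv[k] for k in sorted(kv) if k.startswith(prefix)]

    print(matches("10110"))  # ['10011', '11111', '01101']

Note the table is keyed by answer profiles, not by users, which is why adding or removing a person costs nothing: people map to profiles via the single-join look-up the answer mentions.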
