What is the best data model and database system to store a social graph?

What would be the ideal way to store a mid size graph data model so it could be walked in a couple of milliseconds?

  • I have a data set that represents entities and their relations in a graph. There are approximately 1.5 million nodes and 23 million edges. There is a hard requirement for graph walks to be run in no more than a couple of milliseconds. Which database architecture can accomplish this (preferably on a single very strong server)? So far Neo4j has been tested and it won't hold the load; the latency is too high. Some more details, as requested: the graph is static (it will not be updated; it represents a snapshot of the data), directional, connected, and not bipartite. The walks will span ~5 steps.

  • Answer:

    For this type of graph, there is a GDB (graph database) that suits you well: Sparksee (http://www.sparsity-technologies.com). It delivers high performance (sub-second answer times for complex analytical queries), large capacity (more than 10 billion objects stored at the answer times mentioned), and it is easy to build on top of. Other advantages of Sparksee, such as the ability to integrate different data sources, put this solution in the high-performance segment of GDB solutions. You can contact me for further information.

Josep Lluis Larriba Pey at Quora

Other answers

On a single "very strong" server, I've seen this implemented using incidence lists (http://en.wikipedia.org/wiki/Graph_(abstract_data_type)). Given that your graph is extremely sparse, it does not make sense to store it as an uncompressed matrix. To achieve the latency you're talking about, you'll need to keep the entire graph structure in memory (or on SSDs).

In a distributed environment, the most efficient way to do this is to use a distributed key-value store and keep the graph structure in adjacency lists. For a first-level traversal, you simply fetch the node's adjacency list; for subsequent levels, you do a scatter-gather. I have seen this implemented in an extremely high-load environment and achieve 1-5 ms the majority of the time. The long tail is often not as pretty. You might want to have a look at FlockDB (https://github.com/twitter/flockdb).

It's important to recognize that you are still bound by the laws of physics. I have a feeling that your graph is likely a social graph. In such graphs there is often (very) heavy skew: certain nodes will have far more edges than the average/median. In these situations you need to be realistic about performance expectations; typically, we implement heuristic business rules to deal with the issue.

One thing you haven't mentioned is whether there is metadata associated with your nodes/edges. If so, that tends to complicate things slightly, though the solutions above should still be valid with a few minor tweaks.

You might also have a look at Twitter's in-memory graph processing library, Cassovary (http://engineering.twitter.com/2012/03/cassovary-big-graph-processing-library.html). It seems to be skewed more towards data mining than low-latency queries, but it might be worth a look.
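A minimal sketch of the in-memory adjacency-list approach described above, with a level-by-level (scatter-gather style) traversal. The function names and the toy edge data are illustrative, not from any particular library:

```python
from collections import defaultdict

def build_adjacency(edges):
    """Build an in-memory adjacency list from (src, dst) edge pairs."""
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
    return adj

def walk(adj, start, depth):
    """Breadth-first walk to `depth` levels: fetch the start node's
    adjacency list, then gather the neighbours of each frontier."""
    visited = {start}
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for nbr in adj.get(node, ()):
                if nbr not in visited:
                    visited.add(nbr)
                    next_frontier.append(nbr)
        frontier = next_frontier
    return visited

# Toy graph (hypothetical data): nodes reachable within 2 steps of node 1.
adj = build_adjacency([(1, 2), (2, 3), (3, 4), (1, 5), (5, 6)])
print(sorted(walk(adj, 1, 2)))  # [1, 2, 3, 5, 6]
```

In a distributed key-value store, the inner loop over a frontier becomes a batch of parallel fetches, which is where the scatter-gather and its long tail come from.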

Chris Riccomini

DIY in-memory. Any language can provide structures that will brute-force this quite easily. You would need to give an indication of node size and the number of clients/parallel queries, since latency is your concern. It would also help if you described the problem a little more; I may well be missing some intrinsic complexity.
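One common way to do the DIY in-memory version compactly is a compressed sparse row (CSR) layout, which packs 23 million edges into two flat integer arrays. This is a sketch under that assumption; the function names are illustrative:

```python
from array import array

def to_csr(num_nodes, edges):
    """Pack (src, dst) edges into CSR form: offsets[i]:offsets[i+1]
    delimits node i's neighbours inside the flat `targets` array."""
    counts = [0] * num_nodes
    for src, _ in edges:
        counts[src] += 1
    offsets = array('l', [0] * (num_nodes + 1))
    for i in range(num_nodes):
        offsets[i + 1] = offsets[i] + counts[i]
    targets = array('l', [0] * len(edges))
    cursor = list(offsets[:num_nodes])  # next free slot per node
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

def neighbors(offsets, targets, node):
    """Neighbour lookup is a single contiguous slice: cache-friendly."""
    return targets[offsets[node]:offsets[node + 1]]

# Toy graph (hypothetical data): 4 nodes, 4 edges.
offsets, targets = to_csr(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
print(list(neighbors(offsets, targets, 0)))  # [1, 2]
```

With flat arrays instead of per-node objects, the whole structure fits in a few hundred megabytes and each hop is a contiguous scan, which is what makes millisecond multi-step walks plausible on one machine.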

Richard Henderson
