What is the best data model and database system to store a social graph?

Is it fair to benchmark the Titan database against Neo4j on a single database instance in isolation?

  • As far as I know, Neo4j is not a distributed database and runs really fast on a single server. "Neo4j is a robust (fully ACID) transactional property graph database (http://www.neo4j.org/learn/graphdatabase). Due to its graph data model, Neo4j is highly agile and blazing fast. For connected data operations, Neo4j (http://www.neo4j.org/learn/neo4j) runs a thousand times faster than relational databases." Titan is a graph database built on Cassandra (highly scalable and distributed). "Titan is a scalable graph database (http://en.wikipedia.org/wiki/Graph_database) optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users (http://thinkaurelius.com/2012/08/06/titan-provides-real-time-big-graph-data/) executing complex graph traversals (http://thinkaurelius.com/2013/05/13/educating-the-planet-with-pearson/)." (http://thinkaurelius.github.io/titan/) So, before running any benchmark, I would like to know whether it is fair to compare a single-server database (Neo4j) with a distributed database (Titan) that may only gain an advantage in sharded architectures.

  • Answer:

    If you need a distributed graph because of the amount of data you expect to store, then I think it would be fairer to build your test around that amount of data rather than a minimal subset, so that the distributed capability is actually used. On a single server, Neo4j is considerably faster; I think the Titan team even published a simple graph showing this. Index-free adjacency, which speeds up traversals, is slowed down considerably if the graph is scattered over multiple servers. In short: if you need a distributed graph, compare solutions that support that. If you don't, compare solutions that are not trying to solve that use case.

Stefan Baxter at Quora
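
The point about traversals slowing down once a graph is scattered over several servers can be illustrated with a toy model. The sketch below is only an illustration, not how Neo4j or Titan place data: the ring-shaped graph, the `vertex_id % num_shards` placement rule, and the function names `build_ring_graph` and `count_hops` are all assumptions made for this example. It counts how many edges a breadth-first traversal follows within one shard versus across shards.

```python
from collections import deque


def build_ring_graph(n):
    """Toy undirected graph: vertex i is linked to i+1 and i+2 (mod n)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in (1, 2):
            w = (v + step) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj


def count_hops(adj, start, depth, num_shards):
    """Breadth-first traversal up to `depth`; returns (local, remote) hops.

    A hop is counted as remote when its two endpoints sit on different
    shards. Shard placement is vertex_id % num_shards, which is an
    assumption of this sketch and not any real database's strategy.
    """
    def shard(v):
        return v % num_shards

    seen = {start}
    frontier = deque([(start, 0)])
    local = remote = 0
    while frontier:
        v, d = frontier.popleft()
        if d == depth:
            continue  # do not expand beyond the requested depth
        for w in sorted(adj[v]):
            if w in seen:
                continue
            seen.add(w)
            if shard(v) == shard(w):
                local += 1
            else:
                remote += 1
            frontier.append((w, d + 1))
    return local, remote


if __name__ == "__main__":
    graph = build_ring_graph(12)
    for shards in (1, 4):
        print(shards, count_hops(graph, start=0, depth=2, num_shards=shards))
```

On this toy graph, the same two-step traversal follows 8 edges in both cases: with one shard all of them are local, and with four shards all of them cross a shard boundary. In a real cluster each remote hop would typically cost a network round trip, which is why a benchmark on a single instance says little about the distributed case.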

Other answers

Benchmarks are very hard to get right. You can take a look at a few comparisons here:

  1) http://db-engines.com/en/system/Neo4j%3BTitan

  2) https://docs.google.com/spreadsheet/ccc?key=0AlHPKx74VyC5dERyMHlLQ2lMY3dFQS1JRExYQUNhdVE#gid=0

  • Neo4j is part open source, part paid (when you want an HA cluster).

  • Neo4j has its own proprietary database (file structure).

  • Titan, by contrast, is Apache licensed (completely open source).

  • Titan relies on Cassandra, HBase, etc. for its storage.

Vishvesh Deshmukh

Neo4j is faster on a single machine; Titan will scale better for very large graphs (Neo4j, as far as I know, has writes going through the master as its main bottleneck). As someone said, Cassandra/HBase makes it possible to have a fully distributed storage layer. One very important thing to keep in mind is your read/write requirements and their implications for I/O demands. If you have a lot of reads, you want to reduce random I/O by writing data sequentially (which is probably slower to write). If you have a lot of writes, I/O becomes less important. Read http://www.violin-memory.com/blog/understanding-io-random-vs-sequential/ on the topic. Interesting explanations on Titan and OrientDB here:

Flavio Graf
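
The random versus sequential I/O distinction mentioned above can be sketched with a small model. This is a toy under stated assumptions, not a description of how Neo4j, Titan, Cassandra, or HBase lay out data: the edge list and the helper names `slots_for_vertex` and `count_runs` are hypothetical. It compares an edge log kept in arrival order with the same edges clustered by source vertex, and counts how many separate contiguous runs must be read to fetch one vertex's edges.

```python
def slots_for_vertex(edge_log, vertex):
    """Positions in the log that hold edges starting at `vertex`."""
    return [i for i, (src, _dst) in enumerate(edge_log) if src == vertex]


def count_runs(positions):
    """Number of contiguous runs needed to read the given positions.

    Each run stands in for one seek followed by a sequential read.
    """
    positions = sorted(positions)
    if not positions:
        return 0
    runs = 1
    for prev, cur in zip(positions, positions[1:]):
        if cur != prev + 1:
            runs += 1  # a gap means jumping to a new place in the log
    return runs


# Hypothetical edges (source, destination) in the order they were written.
arrival_order = [(0, 1), (1, 2), (0, 2), (2, 0), (0, 3), (1, 3)]

# Same edges, rewritten so that each vertex's edges sit next to each other.
clustered = sorted(arrival_order)

if __name__ == "__main__":
    for name, log in (("arrival", arrival_order), ("clustered", clustered)):
        print(name, count_runs(slots_for_vertex(log, 0)))
```

In this example, reading the edges of vertex 0 takes three separate runs in the arrival-order log and a single run in the clustered one. Keeping the clustered layout means extra work at write time, which is the read/write trade-off the answer describes.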
