How to find connected components of a random graph?

How can I find connected components of a graph using MapReduce?

  • Hey guys, I need to find the connected components of a huge dataset (the graph is undirected). One obvious choice is MapReduce, but I'm a newbie to MapReduce and am quite short of time to pick it up and code it myself. I was wondering whether there is an existing MapReduce API for this, since it is a very common problem in social network analysis. Or, at least, is anyone aware of a reliable (tried and tested) source that I can use to get started with the implementation myself? Please also point me to any popular API for working with graphs in MapReduce.

  • Answer:

    I know very little about graph algorithms, but I do know a little about MapReduce. I think graph algorithms are in the anti-sweet spot for MapReduce. MapReduce is great for low-communication algorithms. The more global knowledge you need in order to process one cell of data, the worse your algorithm will fit into the MR paradigm. If you're following links, you destroy data locality. Further, if you have to iterate, you'll typically have to chain MR jobs, which involves at minimum reading everything off disk over and over. Try poking at Giraph [1] (what is Giraph [2]).

    [1] http://incubator.apache.org/giraph/
    [2] http://engineering.linkedin.com/open-source/apache-giraph-framework-large-scale-graph-processing-hadoop-reaches-01-milestone

Earl Hathaway at Quora
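To make the point about chained jobs concrete, here is a minimal single-process sketch (not from the answer above) of one common way to phrase connected components as map and reduce steps: every node repeatedly adopts the smallest label seen among itself and its neighbors. The function names `map_phase`, `reduce_phase`, and `connected_components_mr` are made up for illustration; on a real cluster each pass of the `while` loop would be a separate MR job that rereads the whole graph.

```python
from collections import defaultdict

def map_phase(labels, adjacency):
    """Map step: each node emits its current label to itself and to every neighbor."""
    for node, label in labels.items():
        yield node, label
        for neighbor in adjacency[node]:
            yield neighbor, label

def reduce_phase(pairs):
    """Reduce step: each node keeps the smallest label it received."""
    best = {}
    for node, label in pairs:
        if node not in best or label < best[node]:
            best[node] = label
    return best

def connected_components_mr(edges):
    """Return (labels, rounds); nodes sharing a label are in the same component.

    Assumption: the graph is given as an undirected edge list, so isolated
    nodes (which appear in no edge) are not represented.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    adjacency = dict(adjacency)

    labels = {node: node for node in adjacency}  # start: every node is its own label
    rounds = 0
    while True:
        # One iteration here stands in for one full chained MapReduce job.
        new_labels = reduce_phase(map_phase(labels, adjacency))
        rounds += 1
        if new_labels == labels:  # nothing changed, so the labels have converged
            return labels, rounds
        labels = new_labels
```

In this simple scheme the number of rounds grows with the length of the longest shortest path inside a component, so a long chain of nodes forces many full passes over the data. That is the iteration cost described above.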


Other answers

Connected components are easy to get if the neighbor list fits into memory; both igraph and NetworkX use this approach. If your network doesn't fit into one machine's memory but you have access to a cluster whose combined memory can hold it, you can actually do...by yourself. I'd then store the neighbor list in memcache, using a prefix of the node identifier as the ID of the machine where the key is stored. Then take the NetworkX code for finding connected components and modify it so that, instead of looking up neighbors in a local Python dict, it looks them up through memcache. Anyway, it should mostly be a memory problem: if the network is huge, it is very likely (an Erdős result) to have a giant component, so the search for components shouldn't take long once the graph is in memory. One core should be more than enough to compute it.

Carlos Herrera
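A rough sketch of that idea, under stated assumptions: the breadth-first search below takes the neighbor lookup as a plain callable, so a local dict can be swapped for a remote store. `ShardedNeighborStore` is a hypothetical in-process stand-in for a memcache cluster (one dict per "machine", routed by the first character of the node identifier); it is not a real memcache client, and this is not the NetworkX source.

```python
from collections import deque

class ShardedNeighborStore:
    """Hypothetical stand-in for a memcache cluster: one dict per 'machine'."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, node):
        # Route by the first character of the node identifier (its prefix).
        return self.shards[ord(str(node)[0]) % len(self.shards)]

    def add_edge(self, u, v):
        # Undirected graph: store each endpoint in the other's neighbor set.
        self._shard_for(u).setdefault(u, set()).add(v)
        self._shard_for(v).setdefault(v, set()).add(u)

    def neighbors(self, node):
        return self._shard_for(node).get(node, set())

    def nodes(self):
        for shard in self.shards:
            yield from shard

def connected_components(nodes, neighbors):
    """Yield each component as a set; `neighbors` is any callable node -> iterable."""
    seen = set()  # note: the visited set still has to fit on the machine running the search
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors(node):
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        yield component
```

With a real cluster, `neighbors` would wrap a network call, so the adjacency data is spread across machines while only the visited set lives on the node running the search.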

See the related research paper at arxiv.org/pdf/1203.5387. If you are OK with switching to the BSP model, there is precompiled code for this in Apache Giraph.

Nishant M Gandhi
