How can I find connected components of a graph using MapReduce?
Hey guys, I need to find connected components for a huge dataset (the graph is undirected). One obvious choice is MapReduce, but I'm a newbie to MapReduce and am quite short of time to pick it up and code it myself. I was wondering if there is an existing API in MapReduce for this, since it is a very common problem in social network analysis. Or, at least, is anyone aware of a reliable (tried and tested) source that I can use to get started on the implementation myself? Please also point me to any popular API for working with graphs in MapReduce.
Answer:
I know very little about graph algorithms, but I do know a little about MapReduce. I think graph algorithms are in the anti-sweet spot for MapReduce. MapReduce is great for low-communication algorithms. The more global knowledge you need in order to process one cell of data, the worse your algorithm will fit into the MR paradigm. If you're following links, you destroy data locality. Further, if you have to iterate, you'll typically have to chain MR jobs, which involves, at minimum, reading everything off disk over and over. Try poking at Giraph [1] (what is Giraph [2]).
[1] http://incubator.apache.org/giraph/
[2] http://engineering.linkedin.com/open-source/apache-giraph-framework-large-scale-graph-processing-hadoop-reaches-01-milestone
Earl Hathaway at Quora
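To make the iteration cost in the answer above concrete, here is a minimal single-process sketch of connected components as repeated map/reduce rounds using min-label propagation. This is plain Python, not Hadoop code, and not something the answer itself proposes; the function names (map_phase, reduce_phase, connected_components) are made up for illustration. Each pass through the while loop stands in for one chained MR job that re-reads the whole edge list.

```python
from collections import defaultdict


def map_phase(edges, labels):
    """Emit (vertex, candidate_label) pairs.

    Every edge sends each endpoint's current label to the other endpoint,
    and every vertex also re-emits its own label so it is never lost.
    """
    for u, v in edges:
        yield v, labels[u]
        yield u, labels[v]
    for node, label in labels.items():
        yield node, label


def reduce_phase(pairs):
    """Group candidates by vertex and keep the smallest label."""
    grouped = defaultdict(list)
    for node, label in pairs:
        grouped[node].append(label)
    return {node: min(candidates) for node, candidates in grouped.items()}


def connected_components(edges):
    """Return (labels, rounds); vertices sharing a label share a component.

    Assumption: every vertex appears in at least one edge. Isolated
    vertices would have to be added to `labels` separately.
    """
    nodes = {n for edge in edges for n in edge}
    labels = {n: n for n in nodes}  # start with each vertex as its own label
    rounds = 0
    while True:
        # One full pass over all edges, i.e. one chained "job".
        new_labels = reduce_phase(map_phase(edges, labels))
        rounds += 1
        if new_labels == labels:  # nothing changed, so we have converged
            return labels, rounds
        labels = new_labels


if __name__ == "__main__":
    labels, rounds = connected_components([(1, 2), (2, 3), (4, 5)])
    print(labels, "after", rounds, "rounds")
```

The number of rounds grows with the length of the longest chain a label has to travel, so a long, thin graph means many full passes over the data, which is the re-reading cost described above.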
Other answers
Connected components are easy to get if the neighbor list fits into memory; both igraph and NetworkX use this approach. If your network doesn't fit into one machine's memory but you have access to a cluster whose combined memory can hold it, you can actually do this... by yourself. I'd store the neighbor list in memcache, using a prefix of the node identifier to pick the machine where the key is stored. Then take the NetworkX code for finding connected components and modify it so that, instead of looking up neighbors in a local Python dict, it goes through memcache. Either way, this should only be a memory problem: if the network is huge, it is very likely (Erdős result) to have a giant component, so the search for components shouldn't take long once the data is in memory. One core should be more than enough to compute it.
Carlos Herrera
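The approach in the answer above could be sketched roughly as follows. This is an illustration under stated assumptions, not the NetworkX implementation and not real memcache client code: ShardedNeighborStore is a hypothetical stand-in in which plain dicts play the role of memcache servers, and the prefix-to-server rule is invented for the example. The search itself is an ordinary breadth-first search that asks the store for neighbors instead of reading a local dict.

```python
from collections import deque


class ShardedNeighborStore:
    """Hypothetical stand-in for a memcache cluster (dicts act as servers)."""

    def __init__(self, num_shards=4, prefix_len=2):
        self.shards = [dict() for _ in range(num_shards)]
        self.prefix_len = prefix_len

    def _shard_for(self, node_id):
        # Route by a prefix of the node identifier, as the answer suggests.
        # The checksum-style hash here is an arbitrary illustrative choice.
        prefix = str(node_id)[: self.prefix_len]
        return self.shards[sum(map(ord, prefix)) % len(self.shards)]

    def add_edge(self, u, v):
        # Undirected graph: store each endpoint in the other's neighbor set.
        for a, b in ((u, v), (v, u)):
            self._shard_for(a).setdefault(a, set()).add(b)

    def neighbors(self, node_id):
        return self._shard_for(node_id).get(node_id, set())

    def nodes(self):
        # Assumption: with a real cache, node ids would more likely be read
        # from the input edge file than enumerated from the cache itself.
        for shard in self.shards:
            yield from shard


def connected_components(store):
    """Yield each component as a set, using BFS over remote neighbor lookups."""
    seen = set()  # note: this set still has to fit on the searching machine
    for start in store.nodes():
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in store.neighbors(node):
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        yield component


if __name__ == "__main__":
    store = ShardedNeighborStore()
    for u, v in [("a1", "a2"), ("a2", "b1"), ("c1", "c2")]:
        store.add_edge(u, v)
    for component in connected_components(store):
        print(sorted(component))
```

Only the neighbor lookups are distributed here; the traversal runs on a single core, which matches the answer's point that computation is not the bottleneck once the graph is held in memory somewhere.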
See the related research paper at arxiv.org/pdf/1203.5387. If you are OK with shifting to the BSP model, there is precompiled code for this in Apache Giraph.
Nishant M Gandhi
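For readers unfamiliar with the BSP model mentioned above, here is a small Python simulation of vertex-centric connected components in that style. It is not Giraph code and does not use Giraph's API; the function name bsp_connected_components and the dict-based adjacency input are assumptions made for this sketch. Vertices exchange messages in synchronized supersteps, and only vertices whose label just improved send messages in the next superstep.

```python
from collections import defaultdict


def bsp_connected_components(adjacency):
    """Simulate BSP-style min-label propagation.

    Assumptions: `adjacency` maps every vertex to its neighbor list,
    the graph is undirected (each edge is listed in both directions),
    and vertex ids are mutually comparable.
    Returns (labels, supersteps).
    """
    labels = {v: v for v in adjacency}

    # Superstep 0: every vertex sends its own id to all its neighbors.
    inbox = defaultdict(list)
    for vertex, neighbors in adjacency.items():
        for neighbor in neighbors:
            inbox[neighbor].append(labels[vertex])
    supersteps = 1

    # Later supersteps: only vertices that received messages do any work.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < labels[vertex]:
                # Label improved: adopt it and tell the neighbors.
                labels[vertex] = best
                for neighbor in adjacency[vertex]:
                    outbox[neighbor].append(best)
            # Otherwise the vertex stays silent (it "votes to halt").
        inbox = outbox  # barrier: messages are delivered in the next superstep
        supersteps += 1
    return labels, supersteps


if __name__ == "__main__":
    graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
    labels, supersteps = bsp_connected_components(graph)
    print(labels, "in", supersteps, "supersteps")
```

Unlike a chain of MapReduce jobs, the graph stays in memory between supersteps and quiet vertices cost nothing, which is the main reason this model tends to suit iterative graph algorithms better.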