How to do grouping of analytics data?

What are cost-efficient approaches to running Big Data analytics on a large, heavily loaded MongoDB deployment?

This question is specifically about Big Data analytics. We have a large MongoDB production deployment with 7 shards, currently storing more than 4 TB of data. Because the production setup is already saturated with heavy application traffic, we cannot run analytics directly against it. We are therefore looking for a cost-optimized way to run analytics on this data without disrupting the current setup.

We could, of course, replicate the data into a separate MongoDB cluster kept in sync with production. However, running and maintaining a separate deployment 24x7 is too expensive and wasteful given our fairly modest daily reporting needs.

One proposed solution is to export the production data and the oplogs to Amazon S3, run daily MapReduce jobs that apply the oplogs to produce an updated snapshot, and then use Hive/Pig on top of that snapshot for analytics. What are the possible pitfalls of this strategy? This approach requires substantial new code to reproduce, in MapReduce, the logic of how MongoDB updates its documents from the oplog.

Can we do better? Has anyone faced a similar problem and can suggest a smarter approach?
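For illustration only, the snapshot-update step in the proposed approach can be sketched in plain Python. This is a simplified model, not MongoDB's actual replication logic: it assumes oplog entries are dictionaries with `op`, `o`, and `o2` fields, that updates use top-level `$set`/`$unset` operators or a full replacement document, and that entries arrive already sorted. The function name `apply_oplog` is hypothetical. Newer oplog formats, nested fields, array operators, and command entries are not handled, which is part of why the real rewrite is expensive.

```python
def apply_oplog(snapshot, entries):
    """Replay simplified oplog-style entries, in order, onto a {_id: doc} snapshot."""
    for entry in entries:
        op, o = entry["op"], entry["o"]
        if op == "i":
            # Insert: "o" holds the full document.
            snapshot[o["_id"]] = dict(o)
        elif op == "d":
            # Delete: "o" identifies the document by _id.
            snapshot.pop(o["_id"], None)
        elif op == "u":
            # Update: "o2" identifies the document, "o" describes the change.
            _id = entry["o2"]["_id"]
            if any(key.startswith("$") for key in o):
                doc = snapshot.get(_id)
                if doc is None:
                    continue  # assumption: skip updates to unknown documents
                doc.update(o.get("$set", {}))
                for field in o.get("$unset", {}):
                    doc.pop(field, None)
            else:
                # No operators: treat "o" as a full replacement document.
                snapshot[_id] = {"_id": _id, **o}
        # Other entry types (no-ops, commands) are ignored in this sketch.
    return snapshot
```

In a MapReduce job, the same function would typically run per `_id` in the reduce step, after the map step has grouped snapshot rows and oplog entries by document key.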
Answer:

Have you explored running your analytics against the secondaries? This is supported on a per-operation basis, including for the Aggregation Framework, when you use the appropriate read preference: http://docs.mongodb.org/manual/core/read-preference/

If the work is primarily analytical and you wish to use Hive, consider the MongoDB Connector for Hadoop: http://docs.mongodb.org/ecosystem/tools/hadoop/

The connector lets you run the analysis against the data in place in MongoDB, including reading from the secondaries only.
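As a rough sketch of the secondary-read suggestion, the following Python assumes the PyMongo driver is installed and sets the read preference through the connection string. The helper names `analytics_uri` and `daily_report`, the host names, and the database and collection names are all hypothetical placeholders, not part of the original setup.

```python
from urllib.parse import urlencode


def analytics_uri(hosts, database, read_preference="secondary"):
    """Build a MongoDB connection string that routes reads to secondaries."""
    options = urlencode({"readPreference": read_preference})
    return f"mongodb://{','.join(hosts)}/{database}?{options}"


def daily_report(uri, collection_name, pipeline):
    """Run an aggregation pipeline using the read preference from the URI."""
    # Imported here so the URI helper works without the driver installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        # get_default_database() returns the database named in the URI.
        collection = client.get_default_database()[collection_name]
        return list(collection.aggregate(pipeline))
    finally:
        client.close()
```

A call might look like `daily_report(analytics_uri(["mongos1.example.com:27017"], "reports"), "events", [{"$group": {"_id": "$type", "count": {"$sum": 1}}}])`. Reads from secondaries can lag behind the primary, so this suits reporting that tolerates slightly stale data.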

Kelly Stirman at Quora