How do I group analytics data?

What are cost-efficient approaches to running Big Data analytics on a large, heavily loaded MongoDB deployment?

  • This question is about Big Data analytics. We have a large MongoDB production deployment with 7 shards, currently storing more than 4 TB of data. Because the production setup is already saturated with heavy application traffic, we cannot run analytics directly against it, so we are looking for a way to analyze this data without disrupting the current setup. Can someone suggest a cost-optimized approach? We could, of course, replicate the data into a separate MongoDB cluster kept in sync with production, but running and maintaining a separate deployment 24x7 is too expensive and wasteful given our modest daily reporting needs. One proposed solution is to export the production data and the oplogs to Amazon S3, run daily MapReduce jobs that apply the oplogs to produce an updated snapshot, and then use Hive/Pig on top of that snapshot for analytics. What are the possible pitfalls of this strategy? It would require heavy code rewriting to bake the logic of how MongoDB applies oplog entries to documents into MapReduce. Can we do better? Has anyone faced a similar problem and found a smarter approach?

  • Answer:

    Have you explored running your analytics against the secondaries? Reads can be directed to secondaries on a per-operation basis, including for the Aggregation Framework, by setting the appropriate read preference: http://docs.mongodb.org/manual/core/read-preference/. If the work is primarily analytical and you wish to use Hive, you could consider the MongoDB Connector for Hadoop: http://docs.mongodb.org/ecosystem/tools/hadoop/. This would allow you to run the analysis against the data in place in MongoDB, including reading from the secondaries only.
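
    As a rough illustration of the read-preference suggestion, here is a minimal Python sketch, not a definitive implementation. It assumes the pymongo driver is installed; the host names, the `analytics` database, the `events` collection, and the `created_at` and `event_type` fields are all hypothetical. The URI sets `readPreference=secondary` so that the grouping aggregation is served by secondaries rather than the primaries handling application traffic. Keep in mind that secondaries can lag behind the primary, so results may be slightly stale.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def analytics_uri(hosts, database):
    """Build a MongoDB URI whose reads are routed to secondaries only."""
    options = urlencode({"readPreference": "secondary"})
    return f"mongodb://{','.join(hosts)}/{database}?{options}"


def event_counts_pipeline(since):
    """Aggregation pipeline: count documents per event type from `since` on.

    The field names (created_at, event_type) are hypothetical.
    """
    return [
        {"$match": {"created_at": {"$gte": since}}},
        {"$group": {"_id": "$event_type", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]


def run_daily_report(uri):
    """Run the grouping query for the last 24 hours and return the rows.

    pymongo is imported here so the helpers above work without it.
    """
    from pymongo import MongoClient

    client = MongoClient(uri)
    # Uses the database named in the URI; "events" is a hypothetical name.
    events = client.get_default_database()["events"]
    since = datetime.now(timezone.utc) - timedelta(days=1)
    return list(events.aggregate(event_counts_pipeline(since)))
```

    Calling `run_daily_report(analytics_uri(["mongos1.example.com:27017"], "analytics"))` would run the report through a mongos router, which forwards the read preference to each shard. The read preference could also be set per collection or per operation instead of in the URI, which is useful if the same client also serves application reads.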

Kelly Stirman at Quora
