What tools can I use to mine data from large Apache Tomcat log files?

  • I have a number of applications that generate about 1.5 GB of log data a day. I would like to analyse this data for business/operational intelligence purposes. I know Splunk would help me with this, but its license costs are too pricey for me. I am not sure whether ELSA could handle the volume of data that is generated. I have also been reading about Apache Hive, and about tailing a log file and loading the data into a NoSQL DB such as MongoDB, but I need to know whether MongoDB would scale well with that volume of data and whether this second approach is a good option. Thanks.

  • Answer:

    The first challenge you may have is how to collect a huge amount of data reliably and with ease of management. There are several open-source log collector implementations you can use (Fluentd among them :). The bigger problem is how to store and process the data: the backend infrastructure needs a lot of changes as the data volume increases. A single store may be enough at first, but at some point you end up re-architecting for massive scale. Here are some links on putting Apache logs into Amazon S3, MongoDB, or Hadoop HDFS with Fluentd:

    http://docs.fluentd.org/articles/apache-to-s3
    http://docs.fluentd.org/articles/apache-to-mongodb
    http://docs.fluentd.org/articles/http-to-hdfs

    Disclaimer: I'm a committer on the Fluentd project.
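
    As a rough illustration of the pipeline described above (and of the tail-a-log-and-load-it-into-MongoDB idea from the question), here is a minimal Python sketch. It is not Fluentd; the log path, the MongoDB address, and the assumption that the access log uses the default combined format are all placeholders, so treat it as a sketch rather than a recipe.

        # Tail a Tomcat access log and insert parsed lines into MongoDB.
        # Hypothetical log path and MongoDB address; combined log format assumed.
        import re
        import time
        from pymongo import MongoClient

        ACCESS_LOG = "/var/log/tomcat/localhost_access_log.txt"
        LINE_RE = re.compile(
            r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
            r'(?P<status>\d{3}) (?P<size>\S+)'
        )

        def follow(path):
            # Yield lines as they are appended to the file, like `tail -f`.
            with open(path) as f:
                f.seek(0, 2)              # start at the current end of the file
                while True:
                    line = f.readline()
                    if not line:
                        time.sleep(0.5)
                        continue
                    yield line

        def main():
            coll = MongoClient("mongodb://localhost:27017")["logs"]["tomcat_access"]
            for line in follow(ACCESS_LOG):
                m = LINE_RE.match(line)
                if m:
                    doc = m.groupdict()
                    doc["status"] = int(doc["status"])
                    coll.insert_one(doc)

        if __name__ == "__main__":
            main()

    A collector such as Fluentd does essentially this for you, plus buffering, retries, and output plugins for S3, MongoDB, and HDFS, which is why the links above are usually the better starting point at 1.5 GB a day.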

Kazuki Ohta at Quora

Other answers

I've experienced this situation before when running live games. Our back-end architecture consisted of multiple servers for different purposes, and when our game started to gain users we were generating many terabytes of log data weekly. It just became too much to manage. NoSQL databases like MongoDB, and frameworks like Hadoop, can scale to handle this kind of data, but only if you design a data model that facilitates it. Log data is pretty diverse, so it's easy to get this wrong and still end up in a situation where working with your data is unmanageably slow. Getting it right is basically a full-time job. For something a little cheaper than Splunk, I'd recommend Loggly (http://www.loggly.com/). It's $74/month for 2 GB of data versus Splunk's $1,000+ a month, and it doesn't require any proprietary agents. You can send all your logs to it via syslog and RESTful protocols.
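
For the "send it via syslog" part, the snippet below is a minimal sketch of shipping application log records over syslog with Python's standard logging module. The host, port, and logger name are placeholders, not values taken from any provider's documentation.

    # Ship log records to a remote syslog endpoint (placeholder host and port).
    import logging
    import logging.handlers

    handler = logging.handlers.SysLogHandler(address=("logs.example.com", 514))
    handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

    log = logging.getLogger("myapp")
    log.setLevel(logging.INFO)
    log.addHandler(handler)

    log.info("order 42 processed in 180 ms")   # arrives at the collector as a syslog message

Hosted services generally also accept the same records over an HTTP endpoint; the exact URL and token format come from the provider's docs.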

Mike Turner

Hadoop is pretty darn good at this use case, provided you want to write most of the jobs yourself. That can turn out to be bigger than a breadbox in practice once you get past the obvious parse-the-logs tasks, but YMMV. I'd suggest you model the cost of building and maintaining everything yourself versus a packaged solution. For example, some firms find it is cheaper all in to use our machine data analytics accelerators, but do the math either way.
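
To give a feel for what "writing the jobs yourself" looks like, here is a minimal Hadoop Streaming style mapper/reducer in Python that counts HTTP status codes in Tomcat access logs. The field position and the invocation shown in the comments assume the default combined access-log format and a standard streaming setup; both are assumptions, not details from this thread.

    # statuscount.py -- run the same file as mapper ("map") and reducer ("reduce"),
    # e.g. with Hadoop Streaming:
    #   hadoop jar hadoop-streaming.jar \
    #     -mapper "python3 statuscount.py map" -reducer "python3 statuscount.py reduce" \
    #     -input /logs/tomcat -output /out/status_counts
    import sys

    def mapper():
        for line in sys.stdin:
            parts = line.split()
            # In the combined format the status code is the 9th whitespace field.
            if len(parts) > 8 and parts[8].isdigit():
                print(f"{parts[8]}\t1")

    def reducer():
        # Input arrives sorted by key, so counts can be accumulated per run of keys.
        current, count = None, 0
        for line in sys.stdin:
            key, _, value = line.rstrip("\n").partition("\t")
            if key != current:
                if current is not None:
                    print(f"{current}\t{count}")
                current, count = key, 0
            count += int(value)
        if current is not None:
            print(f"{current}\t{count}")

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()

Even a simple job like this needs scheduling, monitoring, and reprocessing around it, which is the ongoing cost worth including in the build-versus-buy math.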

Tom Deutsch

Have you tried Stackify? My team is using it to aggregate errors and logs and to monitor them. I think they give you about 50 GB in their free trial, so I would just go and try it. We like it, and with their pricing it really doesn't make sense to build anything from scratch.

Gail Smith
