How to sequence events while uploading large files to Amazon S3?

Distributed Systems: How can I get my application events serialized on S3 in large files?

  • I have an application that creates numerous events across many machines. We store these events in a database, but I need to get them to Amazon S3 for additional processing with Hadoop/Elastic MapReduce. Other than ETL'ing the events from the database, how can I most easily and reliably get these events into large files on Amazon S3? I have looked at many solutions, including Kafka, Flume, and commercial offerings like Loggly. None of them makes it easy to reliably aggregate events and dump them into big files on S3. What is the shortest path here? Is there a simple way to do this?

  • Answer:

    As Eric Sammer mentioned, Flume was built to collect in-app events and upload them to various data sinks (Amazon S3 being one of them). One downside of Flume, however, is its relatively large footprint. Because Flume is written in Java, it requires a running JVM, which is known for taking up a not-so-small amount of memory. Being written in Java is not necessarily bad; for example, it makes integrating with Hadoop (which is also written in Java) pretty straightforward.

    One alternative is Fluentd[1]. Fluentd is a lightweight logger written in 3,000 lines of Ruby (plus a couple of C extensions for better performance wherever it makes sense) with a pretty small footprint. For example, we use Fluentd ourselves at Treasure Data, and it consumes ~20MB of memory. The other advantage of Fluentd is its plugin system. Thanks to open source contributions, you can send data to Fluentd from pretty much anywhere, and Fluentd can write data to pretty much any destination, including Amazon S3[2] (all the plugins are listed here[3]).

    Here's an example Fluentd configuration, which uploads the data to Amazon S3 every hour (a high-availability configuration is also available[4]):

        # HTTP Input
        <source>
          type http
          port 8888
        </source>

        # S3 Output
        <match tag>
          type s3
          aws_key_id YOUR_AWS_KEY_ID
          aws_sec_key YOUR_AWS_SECRET_KEY
          s3_bucket YOUR_S3_BUCKET_NAME
          s3_endpoint http://s3-us-west-1.amazonaws.com
          path logs/
          buffer_path /tmp/s3
          time_slice_format %Y%m%d-%H
          time_slice_wait 10m
          utc
        </match>

    The application can post events via HTTP. Fluentd supports input from various sources (e.g. multi-language libraries, TCP, Unix domain sockets, AMQP, etc.), and the input side is also pluggable.

        $ curl -X POST -d 'json={"action":"login","user":2}' \
            http://localhost:8888/tag

    Another key difference from Flume is that Fluentd handles logs as JSON streams[5]. By keeping everything semi-structured as JSON, subsequent analyses and post-processing (e.g. JSONSerde on EMR Hive[6]) become a lot easier and less error prone.

    Disclaimer: I'm a Fluentd committer and a co-founder of Treasure Data.

    [1] http://fluentd.org/
    [2] https://github.com/fluent/fluent-plugin-s3
    [3] http://fluentd.org/plugin/
    [4] http://docs.treasure-data.com/articles/td-agent-high-availability
    [5] http://blog.treasure-data.com/post/21881575472/log-everything-as-json-make-your-life-easier
    [6] http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/CreateJobFlowHive.html#emr-gsg-create-hive-script
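    As a small illustration of posting events from application code (rather than with curl), here is a minimal Python sketch assuming the host, port, and tag from the sample configuration above. It uses only the standard library; the function name and defaults are hypothetical illustrations, not part of Fluentd.

        # Minimal sketch: send one event to the Fluentd HTTP input configured above.
        # Host, port, and tag are assumptions matching the sample configuration.
        import json
        import urllib.parse
        import urllib.request

        def post_event(record, tag="tag", host="localhost", port=8888):
            # in_http accepts form-encoded bodies of the form json=<JSON object>,
            # the same shape as the curl example above.
            body = urllib.parse.urlencode({"json": json.dumps(record)}).encode("utf-8")
            req = urllib.request.Request("http://%s:%d/%s" % (host, port, tag), data=body)
            with urllib.request.urlopen(req) as resp:
                return resp.status  # 200 means Fluentd accepted and buffered the event

        if __name__ == "__main__":
            post_event({"action": "login", "user": 2})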

Kazuki Ohta at Quora


Other answers

Flume is built to handle exactly this case (although admittedly S3 isn't the most popular sink, it should work just fine). Simply getting events from A to B is trivial: run a few Flume agents with an RPC source like the AvroSource, and send app events using one of the clients, such as the included Avro RPC log4j appender, or whatever makes sense for your environment. It's best to send the events as structured records (e.g. Avro) when possible. The RPC interface supports batch or individual event calls (the former is far more efficient because of RPC overhead amortization), and each call is atomic because of the way transactions are controlled on the server side. Within the agent, pick the channel type that offers the appropriate level of durability for your app; it's a trade-off between performance and safety.

You can have Flume agents write to S3 using the HDFS sink by configuring it to use Hadoop's S3 filesystem implementation. This is the part that doesn't see a ton of testing, although some of us in the Flume community plan to treat this as a first-class deployment option going forward, with more direct S3 support. The sink consumes the queue of events from the various app servers and decides how to write those events as files; this is where you control output bucketing and file size. The sink also controls how individual events are serialized, which is how you would eliminate the ETL step (e.g. write the events as Avro records with some schema you later use in your MR jobs). It would be better to have a proper S3 sink and avoid the HDFS abstraction layer, and that is on the roadmap, but it's the current state of affairs.

The question suggests Flume doesn't easily do this today; it is possible, and if many users feel otherwise, I'd love to change that. Best of luck!
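As a rough illustration of the setup described above, here is a minimal single-agent configuration sketch. The component names, port, paths, bucket, and roll settings are illustrative assumptions rather than details from the answer; the Hadoop S3 filesystem classes must be on the Flume agent's classpath, and the credentials can live in Hadoop's core-site.xml instead of the URI.

    # Hypothetical Flume agent: Avro RPC source -> file channel -> HDFS sink writing to S3
    agent.sources = avroSrc
    agent.channels = fileCh
    agent.sinks = s3Sink

    # Receive events from app-side Avro RPC clients (e.g. the log4j appender)
    agent.sources.avroSrc.type = avro
    agent.sources.avroSrc.bind = 0.0.0.0
    agent.sources.avroSrc.port = 41414
    agent.sources.avroSrc.channels = fileCh

    # Durable on-disk channel (trades performance for safety; a 'memory' channel is the reverse)
    agent.channels.fileCh.type = file
    agent.channels.fileCh.checkpointDir = /var/flume/checkpoint
    agent.channels.fileCh.dataDirs = /var/flume/data

    # HDFS sink pointed at S3 via Hadoop's s3n filesystem; controls bucketing and file size
    agent.sinks.s3Sink.type = hdfs
    agent.sinks.s3Sink.channel = fileCh
    agent.sinks.s3Sink.hdfs.path = s3n://YOUR_AWS_KEY:YOUR_AWS_SECRET@your-bucket/logs/%Y/%m/%d
    agent.sinks.s3Sink.hdfs.fileType = DataStream
    agent.sinks.s3Sink.hdfs.useLocalTimeStamp = true
    agent.sinks.s3Sink.hdfs.rollInterval = 3600
    agent.sinks.s3Sink.hdfs.rollSize = 134217728
    agent.sinks.s3Sink.hdfs.rollCount = 0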

Eric Sammer

Have you looked at http://Logentries.com? It will take any new incoming log events that you send to the platform, package these up on a daily basis, compress them to minimize your storage requirements and then back them up to your Amazon S3 bucket. It's reliable and takes little effort to set up. Some resources that may be helpful:
  • https://blog.logentries.com/2014/01/amazon-s3-archiving-you-asked-we-delivered/
  • https://blog.logentries.com/2014/03/tracking-events-through-complex-stacks-part-1/

Trevor Parsons

What about using http://aws.amazon.com/storagegateway/? It is a paid solution. You can mount it as an iSCSI device and then share it as an NFS mount point that can be accessed from your multiple client machines. The gateway software will then automatically write the data to your S3 account.
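For concreteness, here is a rough host-side sketch of that setup on Linux. The gateway address, target name, block device, filesystem, and export network are placeholder assumptions (the actual target IQN comes from the Storage Gateway console), not details from the answer.

    $ sudo iscsiadm --mode discovery --type sendtargets --portal GATEWAY_IP:3260
    $ sudo iscsiadm --mode node --targetname iqn.1997-05.com.amazon:myvolume \
        --portal GATEWAY_IP:3260 --login
    $ sudo mkfs.ext4 /dev/sdX            # the newly attached iSCSI block device
    $ sudo mkdir -p /mnt/gateway && sudo mount /dev/sdX /mnt/gateway
    $ echo '/mnt/gateway 10.0.0.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
    $ sudo exportfs -ra                  # client machines then mount this path over NFS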

Gary Ogasawara


You can check out Amazon's Simple Workflow. People are using it for very similar situations. The Flow language is a bit tricky to learn, but I heard there might be a new, simpler version coming out. http://www.amazon.com/swf

Tom Dawson

I think you should look at MongoDB; it may solve many of your problems. Here is a sample of how it can be used instead of S3 on Amazon while providing a much easier framework to work with than Hadoop, etc.: Replace Amazon S3 with MongoDB GridFS and Grails (James Williams) http://jameswilliams.be/blog/entry/171

Avi Kapuya
