How to sequence events while uploading large files to Amazon S3?

How does Amazon EMR Hive parallelize when operating on files stored in Amazon S3?

  • In a typical HDFS setup, the raw files can be partitioned by directory structure, and an external Hive table (with partitioning defined) can be created over them. This lets the cluster administrator optimize how the data is laid out. In an Amazon EMR and S3 setup, the files are stored in a single S3 bucket, so how does Amazon EMR Hive know how best to parallelize the job?

  • Answer:

    It appears there is no way for Hive nodes to have locality of reference when using S3. The S3 bucket is treated as a distributed filesystem: every node accesses it through the same API entry point, and EMR talks to S3 over the REST API just like any other application.
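Even without HDFS data locality, the partitioned-external-table pattern from the question still applies to S3: partitions map to key prefixes under the table's `LOCATION`, so Hive can prune input by partition value and split the remaining objects across mappers. A minimal sketch of that pattern (the bucket name, columns, and partition layout below are hypothetical, not from the original answer):

```sql
-- Hypothetical external Hive table backed by S3, partitioned by date.
CREATE EXTERNAL TABLE access_logs (
  request_ip   STRING,
  request_time STRING,
  bytes_sent   BIGINT
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/logs/';

-- Register a partition; queries filtering on dt only read this key prefix.
ALTER TABLE access_logs ADD PARTITION (dt = '2024-01-01')
  LOCATION 's3://my-bucket/logs/dt=2024-01-01/';
```

With this layout, a query such as `SELECT ... WHERE dt = '2024-01-01'` lists and reads only the objects under that prefix, even though every node fetches them over the same S3 API rather than from local disks.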

Miguel Paraz at Quora
