How to efficiently store data in hive?
-
How can I efficiently store data in Hive, and also store and retrieve compressed data in Hive? Currently I am storing it as a TextFile. I was going through [Bejoy article](http://kickstarthadoop.blogspot.com/2011/10/how-to-efficiently-store-data-in-hive.html) and I found that LZO compression will be good for storing the files, and it is also splittable.

I have one HiveQL SELECT query that is generating some output, and I am storing that output somewhere so that one of my Hive tables (quality) can use that data, and I can then query that quality table. Below is the quality table into which I am loading the data from the SELECT query, overwriting one partition at a time:

```sql
create table quality (id bigint, total bigint, error bigint)
partitioned by (ds string)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/uname/quality';

insert overwrite table quality partition (ds='20120709')
SELECT id, count2, coalesce(error, cast(0 AS BIGINT)) AS count1
FROM Table1;
```

So currently I am storing it as a TextFile. Should I make this a SequenceFile and start storing the data in LZO compression format, or will a text file be fine here as well? The SELECT query will produce a few GB of data that needs to be loaded into the quality table on a daily basis. So which way is best: should I store the output in TextFile or SequenceFile format (with LZO compression), so that querying the Hive quality table is faster?

What if I store it as a SequenceFile with BLOCK compression, like below?

```sql
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
```
-
Answer:
You don't necessarily need to use LZO. Snappy is a splittable compression codec packaged with CDH and HDP that works really well. The one thing you're missing, from what I see in your described approach, is that you need to configure the table to be stored as SequenceFiles if you go with the Snappy approach.
Brian Tran at Quora
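To make that concrete, here is a minimal sketch of storing the same data as a SequenceFile with Snappy block compression. The table name quality_seq and its location are made up for the example, and the SET properties are the older mapred-era names used elsewhere on this page.

```sql
-- Sketch only: the quality data written as a Snappy block-compressed
-- SequenceFile. quality_seq is an illustrative table name, not from the question.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE quality_seq (id bigint, total bigint, error bigint)
PARTITIONED BY (ds string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/user/uname/quality_seq';

INSERT OVERWRITE TABLE quality_seq PARTITION (ds='20120709')
SELECT id, count2, coalesce(error, cast(0 AS BIGINT)) AS count1
FROM Table1;
```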
Other answers
Generally, using the RCFile format is a good choice. It organises the table into row groups and saves one or several groups (depending on your HDFS block size) in an HDFS file. Within a row group the data is stored in columnar format. You can optionally compress the RCFiles; Snappy is a good choice for compression if you want to minimise the load on your CPUs and do not require an optimal compression ratio.

Using RCFile has two benefits. Hadoop can read and process row groups in parallel thanks to data locality and redundancy in HDFS, and a task node processing a row group can skip columns that are irrelevant to a query, reducing disk IO (and decompression load on the CPU).

Alternatively, you can use text and Snappy with block compression. Snappy is splittable, so multiple mappers can read and split one large file. However, the performance of text is not ideal, especially if you store mostly numbers as your example implies. You are better off using sequence files with compression (if for some reason you do not want to use RCFile). However, Hive stores each row as a single value in sequence files and has to load and decompress every value to inspect it, no matter which query and columns are involved.

Your example indicates that you partition by date. It may (depending on your data) be a further optimisation to partition by year, month, and day rather than merely by every day.

Lastly, later this year (2013) the ORC file format will become part of most distributions and should be considered as an alternative to RCFile. ORC goes beyond RCFile and introduces columnar-optimised storage (e.g. variable-length encoding for integers), large block sizes (better disk IO and fewer files with lower namenode load), basic statistics on columns in a file, and simple file indices to skip whole row groups if they don't match a query.
Christian Prokopp
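For comparison, a minimal sketch of the RCFile variant described above; the table name quality_rc is illustrative, and Snappy is chosen per the answer's suggestion.

```sql
-- Illustrative sketch: the same quality data stored as RCFile with Snappy
-- output compression (quality_rc is a made-up table name).
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE quality_rc (id bigint, total bigint, error bigint)
PARTITIONED BY (ds string)
STORED AS RCFILE
LOCATION '/user/uname/quality_rc';

INSERT OVERWRITE TABLE quality_rc PARTITION (ds='20120709')
SELECT id, count2, coalesce(error, cast(0 AS BIGINT)) AS count1
FROM Table1;
```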
Answering the main question: compressed input or non-compressed input? If your priority is speed no matter what and memory is no issue, then why compress at all? Compressed data will obviously slow down any MR job. However, if memory is an issue, which I am assuming is an important one considering that you won't be using Hadoop unless you have a really large amount of data, then compression is the way to go.

Answering the next question, how to feed an LZO-compressed file to Hive:

```sql
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

CREATE EXTERNAL TABLE quality (id bigint, total bigint, error bigint)
PARTITIONED BY (ds string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
  INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
  OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/user/uname/quality';
```

The other query will remain the same. LZO is not bundled with either the Apache or the Cloudera Hadoop distribution due to licensing issues, so you will have to install and configure the LZO native libraries for your Hadoop cluster. You can try out the link below:

https://github.com/twitter/hadoop-lzo

Also, I am assuming you are using LZO currently. However, if you want to increase your speed, try out Snappy:

http://www.cloudera.com/blog/2011/09/snappy-and-hadoop/
Nicole Hu
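Since the answer notes that "the other query will remain the same", the daily load would simply be re-run against the LZO-backed table; here is a sketch, assuming the SET statements above are issued in the same session so the output files are written through LzopCodec.

```sql
-- Sketch only: the unchanged daily load, now producing LZO-compressed output
-- because of the session-level compression settings shown in the answer.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

INSERT OVERWRITE TABLE quality PARTITION (ds='20120709')
SELECT id, count2, coalesce(error, cast(0 AS BIGINT)) AS count1
FROM Table1;
```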
You can use the ORC file format as the underlying file format for Hive. This format has several advantages, such as better compression compared to other formats, lightweight indexes within the file, and the ability to skip rows which don't pass predicate filtering, all of which help queries run efficiently. You can also output ORC files from a MapReduce program, as shown in http://hadoopcraft.blogspot.in/2014/07/generating-orc-files-using-mapreduce.html
Gautam P Hegde
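For completeness, a minimal sketch of the same table declared as ORC (available in Hive 0.11 and later); the table name quality_orc and the choice of Snappy in TBLPROPERTIES are assumptions for the example, not from the answer.

```sql
-- Illustrative sketch: the quality data stored as ORC with Snappy compression.
CREATE TABLE quality_orc (id bigint, total bigint, error bigint)
PARTITIONED BY (ds string)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

INSERT OVERWRITE TABLE quality_orc PARTITION (ds='20120709')
SELECT id, count2, coalesce(error, cast(0 AS BIGINT)) AS count1
FROM Table1;
```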