How to improve the performance of the query?

What should the number of reducers be in the Hive configuration to improve Hive's performance?

  • I have an EC2 machine running as the Hive server, with one NameNode and two DataNodes. I also set <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>15</value></property>, but when I execute a simple Hive query it still shows "Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1", and the query takes around 70 seconds to fetch around 20,000 rows of data. Is there any way to improve the performance?

  • Answer:

    Performance depends on many variables, not only on the number of reducers. Importantly, if your query uses ORDER BY, Hive's implementation currently supports only a single reducer for that operation. In general, 70 seconds is not long for a Hive query, given the considerable overhead of generating and starting the MapReduce job. Improvements are expected with Stinger this year (http://hortonworks.com/blog/100x-faster-hive/); Tez (as a service) in particular will improve the performance of short queries.

    In your example, the 20,000 rows are probably not a very large file, so by default Hadoop will not start many mappers (and consequently reducers). You can check the MapReduce job's bytes-read counter, or the file at its storage location, to see how much data is involved. You can change mapred.min.split.size and mapred.max.split.size (both Hadoop settings) to increase the number of mappers Hadoop uses to read your data. Note that if your data cannot be split across mappers, e.g. because it is gzipped, the split size is ignored and each file is read by a single mapper.

    Alternatively, you could store your data in RCFile format (if you haven't already) to optimise the storage and reduce the amount of data that has to be read. RCFile makes it possible to skip columns that are irrelevant to the query. You can also use compression (Snappy is a good choice), compressing at block level for best results.

    You can optimise joins that involve at least one small table: Hive can load the small table into the memory of every mapper and perform the join map-side only, without reducers. Put the smallest tables on the left side of your joins and enable automatic conversion: set hive.auto.convert.join=true;

    Lastly, avoid storing your data on S3 and prefer HDFS, to reduce network I/O and delays.
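    As a rough sketch, the suggestions above could be tried in a Hive session along the following lines. All values are illustrative assumptions rather than recommendations. The table names events_text and events_rc are hypothetical. mapred.reduce.tasks and the three output-compression properties are not mentioned in the answer; they are standard Hadoop/Hive settings of that generation, added here only for illustration. Whether the split-size setting takes effect depends on the input format and Hadoop version in use, and Snappy requires the native libraries to be installed on the cluster.

```sql
-- Illustrative values only; tune them against your own data and cluster.

-- Lower the maximum split size (in bytes) so that a small, splittable file
-- is divided among more mappers. 1048576 bytes = 1 MB.
-- mapred.min.split.size must not be larger than this value.
SET mapred.max.split.size=1048576;

-- Request a fixed number of reducers for the job (assumption: 4).
-- A query with ORDER BY will still run its final sort on a single reducer.
SET mapred.reduce.tasks=4;

-- Let Hive convert joins with a small table into map-side joins.
SET hive.auto.convert.join=true;

-- Compress query output with Snappy at block level.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

-- Copy an existing text table into RCFile format.
-- events_text and events_rc are hypothetical table names.
CREATE TABLE events_rc STORED AS RCFILE AS SELECT * FROM events_text;
```

    Note that mapred.tasktracker.reduce.tasks.maximum, the property set in the question, only caps how many reduce tasks one TaskTracker runs at the same time; it does not decide how many reducers a given job gets.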

Christian Prokopp at Quora
