How to improve the performance of the query?

How exactly does a multi-level execution tree improve Impala Query performance?

  • This question stemmed from Google's Dremel paper, but I thought I'd post it for Impala since there's a lot of interest in the open source community. The paper mentions two core architectural decisions for improving query execution speed: 1) a columnar storage representation, and 2) borrowing the concept of a serving tree used in distributed search engines. Together these allow fast SQL queries on nested data in place. From this I had three more questions about the execution tree and what it means to be able to perform analyses "in place":

    1) What exactly is a multi-level execution tree, and how does it improve performance compared to a regular MapReduce job? (Is there a key concept here that enables the 50x boost over a Hive job? Columnar storage's performance boost makes sense to me, but the benefits of the execution tree were black-boxed/abstract to me.)

    2) Are the Impala worker nodes the same as the HDFS storage nodes, so as to avoid having to move data around within the cluster?

    3) This is more of a functional question: does Impala need a separate data store format in HDFS that's different from Hive's metastore/format? That is, if I am working on a system already using Hive batch processes and want to test Impala, is that datastore compatible with Impala, or will I need to retransform my data using some MPP tool?

    (I didn't quite get columnar storage at first, but I found this blog post really helpful for understanding the concept: http://the-paper-trail.org/blog/columnar-storage/.) Also, for general reasons why Impala is fast, I referred to the Cloudera FAQ's answer in the question below, but any additional input would be great!

  • Answer:

    1) What exactly is a multi-level execution tree, and how does it improve performance compared to a regular MapReduce job?

    Impala doesn't currently use a multi-level execution tree, though it may in the future. A multi-level execution tree has several benefits when you're running on a large cluster:

    1) Reduced fan-in. Imagine a query running on a cluster with 1,000 nodes, each of which may have 10 tasks running concurrently. If you run a query such as SELECT COUNT(*), each of those 10,000 tasks separately computes a local count and then sends its result to some central node to produce the final result. That central aggregation point therefore receives 10,000 inbound results, probably at approximately the same time. This burst of inbound network traffic causes a problem generally known as TCP incast. The one-sentence explanation is that the burst of inbound traffic causes some packets to be dropped, meaning that the sender nodes have to delay some number of milliseconds and retry. If the retries aren't themselves jittered, they may again collide with each other, and total query latency goes up significantly. A good reference on this subject is http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-40.html.

    Given the above problem, there's a fairly obvious solution. Instead of the 10,000:1 fan-in of the partial results, put a machine at the top of each rack that performs a partial aggregation over all the machines in that rack. With, say, 25 racks of 40 nodes each, instead of 10,000:1 we have 400:1 within each rack and 25:1 from the top-of-rack machines to the central aggregation point. With these lower fan-in ratios, the TCP incast problem is much less severe, and total latency will likely be reduced despite the "extra hop".

    2) Better parallelism. Again consider the aggregation query above, but now add a GROUP BY clause with many thousands of groups. Now the single node doing the final aggregation has a fair amount of work to do (lots of hash-table lookups, or even spilling to disk). If you partially aggregate at the top of each rack first, much of this work is spread out.

    3) Lower cross-rack network utilization. By locally combining results within each rack, you end up sending less data across the inter-rack switches. In many datacenter designs, bandwidth between racks is over-subscribed, so any reduction in cross-rack network usage is beneficial. If you're familiar with Hadoop, this is the same idea you see with a "Combiner" running at each node; here, though, you have a combiner at the top of each rack.

    2) Are the Impala worker nodes the same as the HDFS storage nodes, so as to avoid having to move data around within the cluster?

    Yes. Plan fragments ("tasks" in Hadoop parlance) are scheduled onto the machines that have local access to the data.

    3) Does Impala need a separate data store format in HDFS that's different from Hive's metastore/format? If I am working on a system already using Hive batch processes and want to test Impala, is that datastore compatible with Impala, or will I need to retransform my data?

    Impala re-uses the metadata and storage formats exposed by the Hive metastore. So if you have a table in Hive, you can almost always run Impala directly on the same data with no extra transformation step.
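    The rack-level combining described above can be sketched in a few lines of Python. This is a toy model, not Impala code; the function names and cluster shape are made up for illustration.

```python
def local_count(rows):
    """Each task computes a partial COUNT(*) over its own slice of the data."""
    return len(rows)

def rack_combine(partial_counts):
    """Top-of-rack combiner: merge the partial counts from one rack."""
    return sum(partial_counts)

def coordinator(rack_counts):
    """Final aggregation: one pre-combined inbound result per rack."""
    return sum(rack_counts)

# A toy cluster: 4 racks, 3 tasks per rack, each task holding a slice
# of the table (10, 20, and 30 rows respectively).
racks = [[[None] * 10, [None] * 20, [None] * 30] for _ in range(4)]

rack_results = [rack_combine(local_count(s) for s in rack) for rack in racks]
total = coordinator(rack_results)

# The coordinator received 4 inbound results instead of 12.
print(total)  # 240
```

    The same code without `rack_combine` would send every task's partial count straight to `coordinator`, which is exactly the fan-in burst the rack level is there to avoid.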

Todd Lipcon at Quora


Other answers

Todd's answer is almost right on the button, but there are a couple of things I want to clear up. The Dremel paper isn't explicit about what it means by a multi-level execution tree, but reading between the lines it seems clear they mean a planner that can decompose aggregation operators into execution trees of arbitrary depth. Impala doesn't have this facility, but it does have aggregation trees with more than one level, which is not very different.

When a query is submitted to Impala, it's compiled into a query plan: a tree of logical operators (like COUNT, SCAN, AGGREGATE and so on) where data flows from the leaves to the root. There are many benefits to this representation: SQL queries naturally map onto a tree, and it's a convenient form for the planner to use when transforming one plan into a more efficient one. There are also advantages when it comes to execution, as we shall see. The query plan usually has only a few nodes - most often somewhere between 3 and 20 - but the actual number is proportional to the complexity of the query.

When it comes to actually executing a query plan, Impala expands the simple tree by partitioning the work each node has to do among many machines. For example, with a simple SCAN operator that reads all the rows in a table, Impala partitions that work among many plan nodes by giving each responsibility for a subset of the table. By performing this partitioning, Impala can infer a great deal of concurrency from the tree structure (operators in different subtrees can't depend on each other, so it's fine to run them in any order), and this allows it to take advantage of any available parallelism in the cluster. (The result of this partitioning is sometimes not strictly a tree, but it still has a topological structure that makes it easy to reason about concurrency - don't worry too hard about it for now.)
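The SCAN partitioning just described can be modelled with a small sketch. The names and the round-robin assignment are illustrative assumptions, not Impala's actual scheduler (which, as Todd notes, assigns fragments based on where HDFS blocks physically live).

```python
def partition_scan(blocks, num_workers):
    """Split one logical SCAN over `blocks` into per-worker fragments.

    Toy round-robin assignment; a real scheduler would place each
    fragment on a machine that holds the block locally.
    """
    fragments = [[] for _ in range(num_workers)]
    for i, block in enumerate(blocks):
        fragments[i % num_workers].append(block)
    return fragments

table_blocks = [f"block-{i}" for i in range(10)]
fragments = partition_scan(table_blocks, 3)

# Each fragment scans its own subset concurrently; no fragment
# depends on any other, which is where the parallelism comes from.
for worker_id, frag in enumerate(fragments):
    print(worker_id, frag)
```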
This partitioning of a single node in the logical execution tree into a large number of independent nodes in the physical execution tree is the basic source of performance in MPP databases. It's possible to partition operators other than SCAN. One great example is COUNT: by adding another level below a COUNT node and making each new child node count only the rows it sees, you can partition the execution of COUNT as much as you like. When you do this, you reduce the number of rows sent to any single node in the COUNT tree, which helps avoid both incast problems (as Todd points out) and overwhelming any single node with a huge number of rows. This transformation can be repeated ad infinitum - you can make the aggregation tree as deep as you like. In general, most aggregation operators - GROUP BY, SORT, COUNT, SUM, MAX, etc. - can be decomposed into multi-level execution plans like this, and since they 'aggregate' they usually produce fewer rows than they take in, which naturally gives them the load-distributing property we're after.

So finally we get to what the Dremel paper seems to mean by 'multi-level' execution trees. The trivial COUNT implementation has two levels: the leaves that provide rows, and the root node that counts all the rows it gets from the leaves. Adding an intermediate level (Dremel's 3-level topology) means that the count can be partitioned. This is what Dremel calls a 'multi-level' tree, but the term also covers the ability to make the tree as deep as required. The main advantage of multi-level trees is that you can partition the levels below the root and then run those partitions in parallel (the Dremel paper attributes the substantial performance gains to 'increased parallelism', not to throughput improvements from avoiding incast issues). Impala does have multi-level aggregations, but they are usually fixed at 2 or 3 levels deep, and Impala currently won't generate plans that are arbitrarily deep.
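The "as deep as you like" decomposition of COUNT can be illustrated with a short sketch (hypothetical code, not Impala's): each pass groups at most `fanout` partial counts under a new parent, so no node at any level ever receives more than `fanout` inbound results.

```python
def build_tree(leaf_counts, fanout):
    """Sum partial counts level by level until a single root value remains.

    Each iteration adds one level to the aggregation tree; no node at
    any level receives more than `fanout` inbound results.
    """
    level = list(leaf_counts)
    while len(level) > 1:
        level = [sum(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]

# 16 leaf tasks, each having counted 5 rows; fanout 4 yields a
# 16 -> 4 -> 1 tree, i.e. one extra intermediate level.
print(build_tree([5] * 16, fanout=4))  # 80
```

Shrinking `fanout` makes the tree deeper, trading extra hops for lower per-node fan-in - exactly the trade-off Todd's answer describes.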
You can see from the Dremel paper that adding the second level is where the biggest win is, although there are obviously queries that would benefit significantly from deeper aggregation trees. Impala will often co-locate the second-level aggregation nodes with scans where possible, since it's convenient to, for example, count all the rows you read from disk rather than wait for rows from lots of scan nodes before sending the partial count up the tree. Similarly, for a query like SELECT 1 FROM tbl GROUP BY col1, Impala will aggregate the results from each scan node into groups before sending those groups up the tree to an intermediate aggregation node. There's nothing in Impala's execution engine that precludes arbitrarily deep aggregation trees - as soon as the planner can generate them, the backend will run them with no modifications. It's exactly like adding support for a particular optimisation to a compiler.
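The local pre-aggregation described for the GROUP BY case can be sketched like this (again a toy model with invented names, not Impala's backend): each scan node groups its own rows first, so only one partial result per group, rather than one per row, travels up the tree.

```python
from collections import Counter

def scan_and_preaggregate(rows):
    """Scan node: group locally so only per-group partials go upstream."""
    return Counter(row["col1"] for row in rows)

def merge_partials(partials):
    """Intermediate aggregation node: merge per-group partial counts."""
    merged = Counter()
    for p in partials:
        merged.update(p)
    return merged

# Two scan nodes, each pre-aggregating its own slice of the table.
node_a = scan_and_preaggregate([{"col1": "x"}, {"col1": "x"}, {"col1": "y"}])
node_b = scan_and_preaggregate([{"col1": "y"}, {"col1": "z"}])

print(dict(merge_partials([node_a, node_b])))  # {'x': 2, 'y': 2, 'z': 1}
```

Note that node_a ships 2 partials instead of 3 rows; with millions of rows per scan and only thousands of groups, the reduction is what makes the intermediate nodes cheap.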

Henry Robinson
