Skip to main content
Skip to main content

Solving Unevenly Data Distribution

In the section on using column statistics to optimize plans, we introduced the uniformity assumption employed by the optimizer. However, in real-world scenarios, data often does not satisfy this uniformity assumption. When the optimizer generates an unsatisfactory execution plan due to significant estimation errors, we can manually adjust and optimize the execution plan using hints.

Tuning Case 1: Bucket Issue

When data skew occurs on the bucket key of a table, the workload will be unevenly distributed across different BE instances, thereby prolonging the overall query execution time.

Taking the TPC-H schema as an example: Suppose the orders table uses o_orderkey as the bucket key and has two tablets. For certain reasons, one tablet contains 100 million rows of data, while the other tablet contains only 100 rows.

When executing the following query:

SELECT COUNT(*) FROM orders JOIN customer ON o_custkey = c_custkey;

The optimizer generates a Broadcast Join, with the orders table as the left table and the customer table as the right table.

The execution engine then launches a thread for each tablet of the orders table to perform the join. However, due to uneven data distribution, one thread processes 100 million rows of data, while the other thread processes only 100 rows.

Ideally, both threads should each process 50% of the data to double the query efficiency. To address this issue, we can specify the use of a Shuffle Join to redistribute the data from the orders table based on o_custkey before joining it with the customer table.

Tuning Case 2: Row Estimation Issue

The optimizer estimates the filter rate based on the uniformity assumption. Significant errors in the estimated number of filtered rows can impact the selection of subsequent SQL operators.

When estimating the filter rate, the optimizer typically relies on the assumption of uniform distribution. However, when the error in the estimated number of filtered rows is significant, it can affect the selection of subsequent SQL operators.

Considering the following SQL query:

select count() 
from orders, customer
where o_custkey = c_custkey and o_orderdate < '1920-01-02'

Under the assumption of uniform distribution, the optimizer may assume that the number of rows output after filtering with o_orderdate < '1920-01-02' will be less than the number of rows in the customer table, and therefore choose to build a hash table based on the orders table.

However, if the actual data is skewed, resulting in more orders satisfying the condition than the number of entries in the customer table, a more reasonable choice would be to build the hash table based on the customer table.

To optimize the query, we need to adjust the SQL statement based on the actual situation or prompt the optimizer to use a more suitable execution plan.

Revising the SQL as follows:

select /* leading(orders customer) */ count() 
from orders, customer
where o_custkey = c_custkey and o_orderdate < '1920-01-02'