Subquery
A Subquery is an SQL query nested within another query (usually a SELECT statement). It can be used in the SELECT, FROM, WHERE, or HAVING clauses to provide data or conditions for the outer query. The use of subqueries makes SQL queries more flexible and powerful, as they allow us to solve more complex problems within a single query.
Some important features of subqueries are as follows:
Position of Subqueries: Subqueries can be placed in multiple SQL clauses, such as the WHERE clause, HAVING clause, and FROM clause. They can be used with SELECT, UPDATE, INSERT, DELETE statements, and expression operators (such as comparison operators =, >, <, <=, as well as IN, EXISTS, etc.).
Relationship between Main Query and Subquery: A subquery is a query nested inside another query. The outer query is referred to as the main query, while the inner query is referred to as the subquery.
Execution Order: When there is no correlation between the subquery and the main query, the subquery is usually executed first. When there is a correlation, the parser decides which query to execute first in real-time as needed and uses the output of the subquery accordingly.
Use of Parentheses: Subqueries must be enclosed in parentheses to distinguish them as nested within another query.
Below, we will use tables t1 and t2 and related SQL to introduce the basic features and usage of subqueries. The table creation statements are as follows:
create table t1
(
c1 bigint,
c2 bigint
)
DISTRIBUTED BY HASH(c1) BUCKETS 3
PROPERTIES ("replication_num" = "1");
create table t2
(
c1 bigint,
c2 bigint
)
DISTRIBUTED BY HASH(c1) BUCKETS 3
PROPERTIES ("replication_num" = "1");
Classification of Subqueries
Classification Based on the Characteristics of Data Returned by Subqueries
Subqueries can be classified into scalar and non-scalar subqueries based on the characteristics of the data they return:
1. Scalar Subquery
A subquery that always returns a single value (essentially equivalent to a one-row, one-column Relation). If the subquery does not return any data, it returns a NULL value. Scalar subqueries can theoretically appear anywhere a single-value expression is allowed.
2. Non-scalar Subquery
A subquery that returns a Relation (different from the return value of a scalar subquery, this Relation can contain multiple rows and columns). If the subquery does not return any data, it returns an empty set (0 rows). Non-scalar subqueries can theoretically appear anywhere a relation (set) is allowed.
The following examples illustrate scalar and non-scalar subqueries (for the two subqueries in parentheses, when t2 is an empty table, the results returned by the two subqueries are different):
-- Scalar subquery, when t2 is an empty table, the subquery returns the scalar value null
select * from t1 where t1.c1 > (select sum(t2.c1) from t2);
-- Non-scalar subquery, when t2 is an empty table, the subquery returns an empty set (0 rows)
select * from t1 where t1.c1 in (select t2.c1 from t2);
Classification Based on Whether the Subquery References Columns from the Outer Query
Subqueries can be classified into correlated subqueries and non-correlated subqueries based on whether they reference columns from the outer query:
3. Non-correlated Subquery
A subquery that does not reference any columns from the outer query. Non-correlated subqueries can often be computed independently and return the corresponding results once for the outer query to use.
4. Correlated Subquery
A subquery that references one or more columns from the main query (also known as the outer query) (the referenced outer columns are often in the WHERE condition of the subquery). Correlated subqueries can often be seen as a filtering operation on the externally associated table, as for each row of data in the outer table, the subquery is computed and returns the corresponding result.
The following examples illustrate correlated and non-correlated subqueries:
-- Correlated subquery, the subquery internally uses the column t1.c2 from the outer table
select * from t1 where t1.c1 in (select t2.c1 from t2 where t2.c2 = t1.c2);
-- Non-correlated subquery, the subquery internally does not use any columns from the outer table t1
select * from t1 where t1.c1 in (select t2.c1 from tt2);
Subqueries Supported by Doris
Doris supports all non-correlated subqueries and provides partial support for correlated subqueries as follows:
Supports correlated scalar subqueries in the
WHERE
andHAVING
clauses.Supports correlated
IN
,NOT IN
,EXISTS
,NOT EXISTS
non-scalar subqueries in theWHERE
andHAVING
clauses.Supports correlated scalar subqueries in the
SELECT
list.For nested subqueries, Doris only supports subqueries correlated to their immediate parent query and does not support cross-level correlation to outer queries beyond the parent.
Limitations of Correlated Subqueries
Limitations of Correlated Scalar Subqueries
The correlation condition must be an equality condition.
The output of the subquery must be the result of a single aggregate function without a GROUP BY clause.
-- Single aggregate function without GROUP BY, supported
select * from t1 where t1.c1 < (select max(t2.c1) from t2 where t1.c2 = t2.c2);
-- Equivalent rewritten SQL as follows:
select t1.* from t1 inner join (select t2.c2 as c2, max(t2.c1) as c1 from t2 group by t2.c2) tx on t1.c1 < tx.c1 and t1.c2 = tx.c2;
-- Non-equality condition, not supported
select * from t1 where t1.c1 = (select max(t2.c1) from t2 where t1.c2 > t2.c2);
-- No aggregate function, not supported
select * from t1 where t1.c1 = (select t2.c1 from t2 where t1.c2 = t2.c2);
-- With aggregate function but includes GROUP BY, not supported
select * from t1 where t1.c1 = (select max(t2.c1) from t2 where t1.c2 = t2.c2 group by t2.c2);
Limitations of Correlated (NOT) EXISTS Subqueries
- The subquery cannot have both OFFSET and LIMIT.
-- With LIMIT but no OFFSET, supported
select * from t1 where exists (select t2.c1 from t2 where t1.c2 = t2.c2 limit 2);
-- Equivalent rewritten SQL as follows:
select * from t1 left semi join t2 on t1.c2 = t2.c2;
-- With OFFSET and LIMIT, not supported
select * from t1 where exists (select t2.c1 from t2 where t1.c2 = t2.c2 limit 2, 3);
Limitations of Correlated (NOT) IN Subqueries
The output of the subquery must be a single column.
The subquery cannot have LIMIT.
The subquery cannot have aggregate functions or GROUP BY clauses.
-- Supported subquery
select * from t1 where t1.c1 in (select t2.c1 from t2 where t1.c2 = t2.c2);
-- Equivalent rewritten SQL as follows:
select * from t1 left semi join t2 on t1.c1 = t2.c1 and t1.c2 = t2.c2;
-- Subquery output is multiple columns, not supported
select * from t1 where (t1.a, t1.c) in (select t2.c1, t2.c from t2 where t1.c2 = t2.c2);
-- Subquery with LIMIT, not supported
select * from t1 where t1.c1 in (select t2.c1 from t2 where t1.c2 = t2.c2 limit 3);
-- With GROUP BY clause, not supported
select * from t1 where t1.c1 in (select t2.c1 from t2 where t1.c2 = t2.c2 group by t2.c1);
-- With aggregate function, not supported
select * from t1 where t1.c1 in (select sum(t2.c1) from t2 where t1.c2 = t2.c2);
Limitations of Nested Subqueries
Currently, only subqueries that correlate directly with their immediate parent queries are supported. Correlation with outer layers of the parent query is not supported.
Assume there is another table t3
with the following creation statement:
create table t3
(
c1 bigint,
c2 bigint
)
DISTRIBUTED BY HASH(c1) BUCKETS 3
PROPERTIES ("replication_num" = "1");
Supported when the subquery only uses columns from its immediate parent query:
select
t1.c1
from
t1
where not exists (
select
t2.c1
from
t2
where not exists (
select
t3.c1
from
t3
where
t3.c2 = t2.c2
) and t2.c2 = t1.c2
);Not supported when the innermost subquery uses columns from its immediate parent query
t2.c2
and also columns from the outermost queryt1.c1
:select
t1.c1
from
t1
where not exists (
select
t2.c1
from
t2
where not exists (
select
t3.c1
from
t3
where
t3.c2 = t2.c2 and t3.c1 = t1.c1
)
);
Mark Join
In where
conditions, clauses with or
relationships composed of subqueries using (not) in
or (not) exists
and other filtering conditions require special handling to produce correct results. An example is given below:
select
t1.c1,
t1.c2
from t1
where exists (
select
t2.c1
from t2
where
t1.c2 = t2.c2
) or t1.c1 > 0;
If the exists
clause in this SQL is directly implemented using left semi join
, according to the semantics of left semi join
, only rows from t1
that satisfy t1.c2 = t2.c2
will be output. However, rows that actually satisfy the condition t1.c1 > 0
should also be output. To achieve this, the mechanism of Mark Join
is introduced.
right semi join
is similar but differs in the left and right tables. Here, we use left semi join
as an example.
Example SQL is as follows:
-- This SQL cannot be executed and is only for demonstration purposes
select
tx.c1,
tx.c2
from
(
select
t1.c1,
t1.c2,
mark_join_flag
from
t1 left (mark) semi join t2 on t1.c2 = t2.c2
) tx
where
tx.mark_join_flag or tx.c1 > 0;
The difference between Mark Join
and a regular left semi join
is that a regular left semi join
directly outputs rows from the left table that meet the condition, while Mark Join
outputs the original left table with an additional flag column (the mark_join_flag
in the example) that can be true
, false
, or null
. The value of the flag is determined by the join
condition expression t1.c2 = t2.c2
, with each row corresponding to a flag value. The calculation of flag values is shown in the table below:
t1.c1 | t2.c1 | mark_join_flag |
---|---|---|
1 | 1 | TRUE |
1 | 2 | FALSE |
1 | NULL | NULL |
NULL | 1 | NULL |
NULL | NULL | NULL |
With this flag, the where
filtering condition can be rewritten as where mark_join_flag or t1.c1 > 0
to obtain the correct results.
Common Issues
Since the output of a scalar subquery must be a single value, Doris adopts different processing methods for correlated and non-correlated scalar subqueries.
For Correlated Scalar Subqueries
Currently, Doris can only statically ensure that the subquery outputs a single value (i.e., a single aggregate function without group by
). Therefore, when using correlated scalar subqueries, a group by
-less aggregate function such as any_value
needs to be added according to requirements, so that the optimizer can recognize the single-value semantics smoothly. Users need to ensure that the subquery always returns only one value. If the subquery actually returns multiple values (other database systems would report an error at runtime), due to the added aggregate function, it will always return one value, although the result may not match expectations.
Please refer to the following SQL example:
-- Correlated scalar subquery lacking a single aggregate function without group by, currently not supported
select t1.*, (select t2.c1 from t2 where t1.c2 = t2.c2) from t1;
-- Add a single aggregate function for the optimizer to recognize smoothly
select t1.*, (select any_value(t2.c1) from t2 where t1.c2 = t2.c2) from t1;
For Non-Correlated Scalar Subqueries
Doris will add an assert num rows
operator at runtime. If the subquery returns more than one row of data, a runtime error will be reported.
Please refer to the following SQL example:
-- Non-correlated scalar subquery, will report an error if table t2 has more than 1 row of data
select t1.*, (select t2.c1 from t2) from t1;
-- Example error message
ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[CANCELLED]Expected EQ 1 to be returned by expression