Data Model
Data Model
This topic introduces the data models in Doris from a logical perspective so you can make better use of Doris in different business scenarios.
Basic conceptsβ
In Doris, data is logically described in the form of tables. A table consists of rows and columns. Row is a row of user data. Column is used to describe different fields in a row of data.
Columns can be divided into two categories: Key and Value. From a business perspective, Key and Value correspond to dimension columns and indicator columns, respectively.
Data models in Doris fall into three types:
- Aggregate
- Unique
- Duplicate
The following is the detailed introduction to each of them.
Aggregate Modelβ
We illustrate what aggregation model is and how to use it correctly with practical examples.
Example 1: Importing data aggregationβ
Assume that the business has the following data table schema:
ColumnName | Type | AggregationType | Comment |
---|---|---|---|
userid | LARGEINT | user id | |
date | DATE | date of data filling | |
City | VARCHAR (20) | User City | |
age | SMALLINT | User age | |
sex | TINYINT | User gender | |
Last_visit_date | DATETIME | REPLACE | Last user access time |
Cost | BIGINT | SUM | Total User Consumption |
max dwell time | INT | MAX | Maximum user residence time |
min dwell time | INT | MIN | User minimum residence time |
The corresponding CREATE TABLE statement would be as follows (omitting the Partition and Distribution information):
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`user_id` LARGEINT NOT NULL COMMENT "user id",
`date` DATE NOT NULL COMMENT "data import time",
`city` VARCHAR(20) COMMENT "city",
`age` SMALLINT COMMENT "age",
`sex` TINYINT COMMENT "gender",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "last visit date time",
`cost` BIGINT SUM DEFAULT "0" COMMENT "user total cost",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "user max dwell time",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "user min dwell time"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
As you can see, this is a typical fact table of user information and visit behaviors. In star models, user information and visit behaviors are usually stored in dimension tables and fact tables, respectively. Here, for the convenience of explanation, we store the two types of information in one single table.
The columns in the table are divided into Key (dimension) columns and Value (indicator columns) based on whether they are set with an AggregationType
. Key columns are not set with an AggregationType
, such as user_id
, date
, and age
, while Value columns are.
When data are imported, rows with the same contents in the Key columns will be aggregated into one row, and their values in the Value columns will be aggregated as their AggregationType
specify. Currently, their are four aggregation types:
- SUM: Accumulate the values in multiple rows.
- REPLACE: The newly imported value will replace the previous value.
- MAX: Keep the maximum value.
- MIN: Keep the minimum value.
Suppose that you have the following import data (raw data):
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00 | 20 | 10 | 10 |
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00 | 15 | 2 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Assume that this is a table recording the user behaviors when visiting a certain commodity page. The first row of data, for example, is explained as follows:
Data | Description |
---|---|
10000 | User id, each user uniquely identifies id |
2017-10-01 | Data storage time, accurate to date |
Beijing | User City |
20 | User Age |
0 | Gender male (1 for female) |
2017-10-01 06:00 | User's time to visit this page, accurate to seconds |
20 | Consumption generated by the user's current visit |
10 | User's visit, time to stay on the page |
10 | User's current visit, time spent on the page (redundancy) |
After this batch of data is imported into Doris correctly, it will be stored in Doris as follows:
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00 | 35 | 10 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
As you can see, the data of User 10000 have been aggregated to one row, while those of other users remain the same. The explanation for the aggregated data of User 10000 is as follows (the first 5 columns remain unchanged, so it starts with Column 6 last_visit_date
):
*2017-10-01 07:00
: The last_visit_date
column is aggregated by REPLACE, so 2017-10-01 07:00
has replaced 2017-10-01 06:00
.
Note: When using REPLACE to aggregate data from the same import batch, the order of replacement is uncertain. That means, in this case, the data eventually saved in Doris could be
2017-10-01 06:00
. However, for different import batches, it is certain that data from the new batch will replace those from the old batch.
*35
: The cost
column is aggregated by SUM, so the update value 35
is the result of 20
+ 15
.
*10
: The max_dwell_time
column is aggregated by MAX, so 10
is saved as it is the maximum between 10
and 2
.
*2
: The min_dwell_time
column is aggregated by MIN, so 2
is saved as it is the minimum between 10
and 2
.
After aggregation, Doris only stores the aggregated data. In other words, the detailed raw data will no longer be available.
Example 2: keep detailed dataβ
Here is a modified version of the table schema in Example 1:
ColumnName | Type | AggregationType | Comment |
---|---|---|---|
user_id | LARGEINT | User ID | |
date | DATE | Date when the data are imported | |
time stamp | DATETIME | Date and time when the data are imported (with second-level accuracy) | |
city | VARCHAR (20) | User location city | |
age | SMALLINT | User age | |
sex | TINYINT | User gender | |
last visit date | DATETIME | REPLACE | Last visit time of the user |
cost | BIGINT | SUM | Total consumption of the user |
max_dwell_time | INT | MAX | Maximum user dwell time |
min_dwell_time | INT | MIN | Minimum user dwell time |
A new column timestamp
has been added to record the date and time when the data are imported (with second-level accuracy).
Suppose that the import data are as follows:
user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|---|
10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00 | 20 | 10 | 10 |
10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00 | 15 | 2 | 2 |
10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
After importing, this batch of data will be stored in Doris as follows:
user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|---|
10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00 | 20 | 10 | 10 |
10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00 | 15 | 2 | 2 |
10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
As you can see, the stored data are exactly the same as the import data. No aggregation has ever happened. This is because, the newly added timestamp
column results in difference of Keys among the rows. That is to say, as long as the Keys of the rows are not identical in the import data, Doris can save the complete detailed data even in the Aggregate Model.
Example 3: aggregate import data and existing dataβ
Based on Example 1, suppose that you have the following data stored in Doris:
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00 | 35 | 10 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Now you need to import a new batch of data:
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 |
10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
After importing, the data stored in Doris will be updated as follows:
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00 | 35 | 10 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
As you can see, the existing data and the newly imported data of User 10004 have been aggregated. Meanwhile, the new data of User 10005 have been added.
In Doris, data aggregation happens in the following 3 stages:
- The ETL stage of each batch of import data. At this stage, the batch of import data will be aggregated internally.
- The data compaction stage of the underlying BE. At this stage, BE will aggregate data from different batches that have been imported.
- The data query stage. The data involved in the query will be aggregated accordingly.
At different stages, data will be aggregated to varying degrees. For example, when a batch of data is just imported, it may not be aggregated with the existing data. But for users, they can only query aggregated data. That is, what users see are the aggregated data, and they should not assume that what they have seen are not or partly aggregated. (See the [Limitations of Aggregate Model](#Limitations of Aggregate Model) section for more details.)
Unique Modelβ
In some multi-dimensional analysis scenarios, users are highly concerned about how to ensure the uniqueness of the Key, that is, how to create uniqueness constraints for the Primary Key. Therefore, we introduce the Unique Model. Prior to Doris 1.2, the Unique Model was essentially a special case of the Aggregate Model and a simplified representation of table schema. The Aggregate Model is implemented by Merge on Read, so it might not deliver high performance in some aggregation queries (see the [Limitations of Aggregate Model] (#Limitations of Aggregate Model) section). In Doris 1.2, we have introduced a new implementation for the Unique Model--Merge on Write, which can help achieve optimal query performance. For now, Merge on Read and Merge on Write will coexist in the Unique Model for a while, but in the future, we plan to make Merge on Write the default implementation of the Unique Model. The following will illustrate the two implementations with examples.
Merge on Read ( Same Implementation as Aggregate Model)β
ColumnName | Type | IsKey | Comment |
---|---|---|---|
user_id | BIGINT | Yes | User ID |
username | VARCHAR (50) | Yes | Username |
city | VARCHAR (20) | No | User location city |
age | SMALLINT | No | User age |
sex | TINYINT | No | User gender |
phone | LARGEINT | No | User phone number |
address | VARCHAR (500) | No | User address |
register_time | DATETIME | No | User registration time |
This is a typical user basic information table. There is no aggregation requirement for such data. The only concern is to ensure the uniqueness of the primary key. (The primary key here is user_id + username). The CREATE TABLE statement for the above table is as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`user_id` LARGEINT NOT NULL COMMENT "User ID",
`username` VARCHAR (50) NOT NULL COMMENT "Username",
`city` VARCHAR (20) COMMENT "User location city",
`age` SMALLINT COMMENT "User age",
`sex` TINYINT COMMENT "User sex",
`phone` LARGEINT COMMENT "User phone number",
`address` VARCHAR (500) COMMENT "User address",
`register_time` DATETIME COMMENT "User registration time"
)
UNIQUE KEY (`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
This is the same table schema and the CREATE TABLE statement as those of the Aggregate Model:
ColumnName | Type | AggregationType | Comment |
---|---|---|---|
user_id | BIGINT | User ID | |
username | VARCHAR (50) | Username | |
city | VARCHAR (20) | REPLACE | User location city |
age | SMALLINT | REPLACE | User age |
sex | TINYINT | REPLACE | User gender |
phone | LARGEINT | REPLACE | User phone number |
address | VARCHAR (500) | REPLACE | User address |
register_time | DATETIME | REPLACE | User registration time |
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`user_id` LARGEINT NOT NULL COMMENT "User ID",
`username` VARCHAR (50) NOT NULL COMMENT "Username",
`city` VARCHAR (20) REPLACE COMMENT "User location city",
`sex` TINYINT REPLACE COMMENT "User gender",
`phone` LARGEINT REPLACE COMMENT "User phone number",
`address` VARCHAR(500) REPLACE COMMENT "User address",
`register_time` DATETIME REPLACE COMMENT "User registration time"
)
AGGREGATE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
That is to say, the Merge on Read implementation of the Unique Model is equivalent to the REPLACE aggregation type in the Aggregate Model. The internal implementation and data storage are exactly the same.
Merge on Write (Since Doris 1.2)β
The Merge on Write implementation of the Unique Model is completely different from that of the Aggregate Model. It can deliver better performance in aggregation queries with primary key limitations.
In Doris 1.2.0, as a new feature, Merge on Write is disabled by default, and users can enable it by adding the following property:
"enable_unique_key_merge_on_write" = "true"
Take the previous table as an example, the corresponding CREATE TABLE statement should be:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`user_id` LARGEINT NOT NULL COMMENT "User ID",
`username` VARCHAR (50) NOT NULL COMMENT "Username",
`city` VARCHAR (20) COMMENT "User location city",
`age` SMALLINT COMMENT "Userage",
`sex` TINYINT COMMENT "User gender",
`phone` LARGEINT COMMENT "User phone number",
`address` VARCHAR (500) COMMENT "User address",
`register_time` DATETIME COMMENT "User registration time"
)
UNIQUE KEY (`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
"enable_unique_key_merge_on_write" = "true"
);
The table schema produced by the above statement will be different from that of the Aggregate Model.
ColumnName | Type | AggregationType | Comment |
---|---|---|---|
user_id | BIGINT | User ID | |
username | VARCHAR (50) | Username | |
city | VARCHAR (20) | NONE | User location city |
age | SMALLINT | NONE | User age |
sex | TINYINT | NONE | User gender |
phone | LARGEINT | NONE | User phone number |
address | VARCHAR (500) | NONE | User address |
register_time | DATETIME | NONE | User registration time |
On a Unique table with the Merge on Write option enabled, during the import stage, the data that are to be overwritten and updated will be marked for deletion, and new data will be written in. When querying, all data marked for deletion will be filtered out at the file level, and only the latest data would be readed. This eliminates the data aggregation cost while reading, and supports many types of predicate pushdown now. Therefore, it can largely improve performance in many scenarios, especially in aggregation queries.
[NOTE]
- The new Merge on Write implementation is disabled by default, and can only be enabled by specifying a property when creating a new table.
- The old Merge on Read cannot be seamlessly upgraded to the new implementation (since they have completely different data organization). If you want to switch to the Merge on Write implementation, you need to manually execute
insert into unique-mow- table select * from source table
to load data to new table. - The two unique features
delete sign
andsequence col
of the Unique Model can be used as normal in the new implementation, and their usage remains unchanged.
Duplicate Modelβ
In some multi-dimensional analysis scenarios, there is no need for primary keys or data aggregation. For these cases, we introduce the Duplicate Model to. Here is an example:
ColumnName | Type | SortKey | Comment |
---|---|---|---|
timstamp | DATETIME | Yes | Log time |
type | INT | Yes | Log type |
error_code | INT | Yes | Error code |
Error_msg | VARCHAR (1024) | No | Error details |
op_id | BIGINT | No | Operator ID |
op_time | DATETIME | No | Operation time |
The corresponding CREATE TABLE statement is as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`timestamp` DATETIME NOT NULL COMMENT "Log time",
`type` INT NOT NULL COMMENT "Log type",
`error_code` INT COMMENT "Error code",
`error_msg` VARCHAR(1024) COMMENT "Error details",
`op_id` BIGINT COMMENT "Operator ID",
`op_time` DATETIME COMMENT "Operation time"
)
DUPLICATE KEY(`timestamp`, `type`, `error_code`)
DISTRIBUTED BY HASH(`type`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
Different from the Aggregate and Unique Models, the Duplicate Model stores the data as they are and executes no aggregation. Even if there are two identical rows of data, they will both be retained. The DUPLICATE KEY in the CREATE TABLE statement is only used to specify based on which columns the data are sorted. (A more appropriate name than DUPLICATE KEY would be SORTING COLUMN, but it is named as such to specify the data model used. For more information, see Prefix Index.) For the choice of DUPLICATE KEY, we recommend the first 2-4 columns.
The Duplicate Mode l is suitable for storing raw data without aggregation requirements or primary key uniqueness constraints. For more usage scenarios, see the [Limitations of Aggregate Model](#Limitations of Aggregate Model) section.
Limitations of Aggregate Modelβ
This section is about the limitations of the Aggregate Model.
The Aggregate Model only presents the aggregated data. That means we have to ensure the presentation consistency of data that has not yet been aggregated (for example, two different import batches). The following provides further explanation with examples.
Suppose that you have the following table schema:
ColumnName | Type | AggregationType | Comment |
---|---|---|---|
user_id | LARGEINT | User ID | |
date | DATE | Date when the data are imported | |
cost | BIGINT | SUM | Total user consumption |
Assume that there are two batches of data that have been imported into the storage engine as follows:
batch 1
user_id | date | cost |
---|---|---|
10001 | 2017-11-20 | 50 |
10002 | 2017-11-21 | 39 |
batch 2
user_id | date | cost |
---|---|---|
10001 | 2017-11-20 | 1 |
10001 | 2017-11-21 | 5 |
10003 | 2017-11-22 | 22 |
As you can see, data about User 10001 in these two import batches have not yet been aggregated. However, in order to ensure that users can only query the aggregated data as follows:
user_id | date | cost |
---|---|---|
10001 | 2017-11-20 | 51 |
10001 | 2017-11-21 | 5 |
10002 | 2017-11-21 | 39 |
10003 | 2017-11-22 | 22 |
We have added an aggregation operator to the query engine to ensure the presentation consistency of data.
In addition, on the aggregate column (Value), when executing aggregate class queries that are inconsistent with the aggregate type, please pay attention to the semantics. For example, in the example above, if you execute the following query:
SELECT MIN(cost) FROM table;
The result will be 5, not 1.
Meanwhile, this consistency guarantee could considerably reduce efficiency in some queries.
Take the basic count (*) query as an example:
SELECT COUNT(*) FROM table;
In other databases, such queries return results quickly. Because in actual implementation, the models can get the query result by counting rows and saving the statistics upon import, or by scanning only one certain column of data to get count value upon query, with very little overhead. But in Doris's Aggregation Model, the overhead of such queries is large.
For the previous example:
batch 1
user_id | date | cost |
---|---|---|
10001 | 2017-11-20 | 50 |
10002 | 2017-11-21 | 39 |
batch 2
user_id | date | cost |
---|---|---|
10001 | 2017-11-20 | 1 |
10001 | 2017-11-21 | 5 |
10003 | 2017-11-22 | 22 |
Since the final aggregation result is:
user_id | date | cost |
---|---|---|
10001 | 2017-11-20 | 51 |
10001 | 2017-11-21 | 5 |
10002 | 2017-11-21 | 39 |
10003 | 2017-11-22 | 22 |
The correct result of select count (*) from table;
should be 4. But if the model only scans the user_id
column and operates aggregation upon query, the final result will be 3 (10001, 10002, 10003). And if it does not operate aggregation, the final result will be 5 (a total of five rows in two batches). Apparently, both results are wrong.
In order to get the correct result, we must read both the user_id
and date
column, and performs aggregation when querying. That is to say, in the count (*)
query, Doris must scan all AGGREGATE KEY columns (in this case, user_id
and date
) and aggregate them to get the semantically correct results. That means if there are many aggregated columns, count (*)
queries could involve scanning large amounts of data.
Therefore, if you need to perform frequent count (*)
queries, we recommend that you simulate count (*)
by adding a column of value 1 and aggregation type SUM. In this way, the table schema in the previous example will be modified as follows:
ColumnName | Type | AggregationType | Comment |
---|---|---|---|
user ID | BIGINT | User ID | |
date | DATE | Date when the data are imported | |
Cost | BIGINT | SUM | Total user consumption |
count | BIGINT | SUM | For count queries |
The above adds a count column, the value of which will always be 1, so the result of select count (*) from table;
is equivalent to that of select sum (count) from table;
The latter is much more efficient than the former. However, this method has its shortcomings, too. That is, it requires that users will not import rows with the same values in the AGGREGATE KEY columns. Otherwise, select sum (count) from table;
can only express the number of rows of the originally imported data, instead of the semantics of select count (*) from table;
Another method is to add a cound
column of value 1 but aggregation type of REPLACE. Then select sum (count) from table;
and select count (*) from table;
could produce the same results. Moreover, this method does not require the absence of same AGGREGATE KEY columns in the import data.
Merge on Write of Unique modelβ
The Merge on Write implementation in the Unique Model does not impose the same limitation as the Aggregate Model. In Merge on Write, the model adds a delete bitmap
for each imported rowset to mark the data being overwritten or deleted. With the previous example, after Batch 1 is imported, the data status will be as follows:
batch 1
user_id | date | cost | delete bit |
---|---|---|---|
10001 | 2017-11-20 | 50 | false |
10002 | 2017-11-21 | 39 | false |
After Batch 2 is imported, the duplicate rows in the first batch will be marked as deleted, and the status of the two batches of data is as follows
batch 1
user_id | date | cost | delete bit |
---|---|---|---|
10001 | 2017-11-20 | 50 | true |
10002 | 2017-11-21 | 39 | false |
batch 2
user_id | date | cost | delete bit |
---|---|---|---|
10001 | 2017-11-20 | 1 | false |
10001 | 2017-11-21 | 5 | false |
10003 | 2017-11-22 | 22 | false |
In queries, all data marked true
in the delete bitmap
will not be read, so there is no need for data aggregation. Since there are 4 valid rows in the above data, the query result should also be 4. This also enables minimum overhead since it only scans one column of data.
In the test environment, count(*)
queries in Merge on Write of the Unique Model deliver 10 times higher performance than that of the Aggregate Model.
Duplicate Modelβ
The Duplicate Model does not impose the same limitation as the Aggregate Model because it does not involve aggregation semantics. For any columns, it can return the semantically correct results in count (*)
queries.
Key Columnsβ
For the Duplicate, Aggregate, and Unique Models, the Key columns will be specified when the table is created, but there exist some differences: In the Duplicate Model, the Key columns of the table can be regarded as just "sorting columns", but not unique identifiers. In Aggregate and Unique Models, the Key columns are both "sorting columns" and "unique identifier columns".
Suggestions for Choosing Data Modelβ
Since the data model was established when the table was built, and irrevocable thereafter, it is very important to select the appropriate data model.
- The Aggregate Model can greatly reduce the amount of data scanned and query computation by pre-aggregation. Thus, it is very suitable for report query scenarios with fixed patterns. But this model is unfriendly to
count (*)
queries. Meanwhile, since the aggregation method on the Value column is fixed, semantic correctness should be considered in other types of aggregation queries. - The Unique Model guarantees the uniqueness of primary key for scenarios requiring unique primary key. The downside is that it cannot exploit the advantage brought by pre-aggregation such as ROLLUP in queries.
- Users who have high performance requirements for aggregate queries are recommended to use the newly added Merge on Write implementation since version 1.2.
- The Unique Model only supports entire-row updates. If you require primary key uniqueness as well as partial updates of certain columns (such as loading multiple source tables into one Doris table), you can consider using the Aggregate Model, while setting the aggregate type of the non-primary key columns to REPLACE_IF_NOT_NULL. See CREATE TABLE Manual for more details.
- The Duplicate Model is suitable for ad-hoc queries of any dimensions. Although it may not be able to take advantage of the pre-aggregation feature, it is not limited by what constrains the Aggregate Model and can give full play to the advantage of columnar storage (reading only the relevant columns, but not all Key columns).