TPCH
Usage Notes
TPCH Catalog uses the Trino Connector compatibility framework and the TPCH Connector to quickly build TPCH test sets.
This feature is supported starting from Doris version 3.0.0.
Compiling the TPCH Connector
JDK 17 is required.
git clone https://github.com/trinodb/trino.git
git checkout 435
cd trino/plugin/trino-tpch
mvn clean install -DskipTest
After compiling, you will find the trino-tpch-435/
directory under trino/plugin/trino-tpch/target/
.
You can also directly download the precompiled trino-tpch-435.tar.gz and extract it.
Deploying the TPCH Connector
Place the trino-tpch-435/
directory under the connectors/
directory in the deployment paths of all FE and BE nodes. (If it does not exist, you can create it manually).
├── bin
├── conf
├── connectors
│ ├── trino-tpch-435
...
After deployment, it is recommended to restart the FE and BE nodes to ensure the Connector is loaded correctly.
Creating the TPCH Catalog
CREATE CATALOG `tpch` PROPERTIES (
"type" = "trino-connector",
"trino.connector.name" = "tpch",
"trino.tpch.column-naming" = "STANDARD",
"trino.tpch.splits-per-node" = "32"
);
The tpch.splits-per-node
property sets the level of concurrency. It is recommended to set it to twice the number of cores per BE node to achieve optimal concurrency and improve data generation efficiency.
When "tpch.column-naming" = "STANDARD"
, the column names in the TPCH table will start with the abbreviation of the table name, such as l_orderkey
, otherwise, it is orderkey
.
Using the TPCH Catalog
The TPCH Catalog includes pre-configured TPCH datasets of different scale factors, which can be viewed using the SHOW DATABASES
and SHOW TABLES
commands.
mysql> SWITCH tpch;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW DATABASES;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| sf1 |
| sf100 |
| sf1000 |
| sf10000 |
| sf100000 |
| sf300 |
| sf3000 |
| sf30000 |
| tiny |
+--------------------+
11 rows in set (0.00 sec)
mysql> USE sf1;
mysql> SHOW TABLES;
+---------------+
| Tables_in_sf1 |
+---------------+
| customer |
| lineitem |
| nation |
| orders |
| part |
| partsupp |
| region |
| supplier |
+---------------+
8 rows in set (0.00 sec)
You can directly query these tables using the SELECT statement.
The data in these pre-configured datasets is not actually stored but generated in real-time during queries. Therefore, these datasets are not suitable for direct benchmarking. They are more appropriate for writing to other target tables (such as Doris internal tables, Hive, Iceberg, and other data sources supported by Doris) via INSERT INTO SELECT
, after which performance tests can be conducted on the target tables.
Best Practices
Quickly Build TPCH Test Dataset
You can quickly build a TPCH test dataset using the CTAS (Create Table As Select) statement:
CREATE TABLE hive.tpch100.customer PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.customer ;
CREATE TABLE hive.tpch100.lineitem PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.lineitem ;
CREATE TABLE hive.tpch100.nation PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.nation ;
CREATE TABLE hive.tpch100.orders PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.orders ;
CREATE TABLE hive.tpch100.part PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.part ;
CREATE TABLE hive.tpch100.partsupp PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.partsupp ;
CREATE TABLE hive.tpch100.region PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.region ;
CREATE TABLE hive.tpch100.supplier PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.supplier ;
On a Doris cluster with 3 BE nodes, each with 16 cores, creating a TPCH 1000 dataset in Hive takes approximately 25 minutes, and TPCH 10000 takes about 4 to 5 hours.