DML Tuning Plan
DML Plan Tuning Includes Both Import Section (INSERT INTO SELECT) and Query Section (CREATE TABLE AS SELECT - CTAS). Next, will introduce the principles and tuning practices of these two parts separately.
Loading
Principles
Apache Doris offers a variety of flexible data import solutions to meet data access needs in different scenarios. Doris supports importing data from the following data sources:
Object storage (S3) and HDFS
Local files
Kafka
Relational databases (such as MySQL, PostgreSQL, Oracle, SQLServer, etc.)
Data sources connected via JDBC
JSON format data
Doris provides the following main data import methods:
Broker Load: Imports data from external storage systems through the Broker process
Stream Load: Streams data from local files or in-memory data
Routine Load: Continuously imports data from Kafka
INSERT INTO: Imports data through SQL insert statements
S3 Load: Directly imports data from object storage that supports the S3 protocol
MySQL Load: Uses MySQL clients to import local data
Different import methods support slightly different data formats:
Broker Load: Supports Parquet, ORC, CSV, and GZip formats
Stream Load: Supports CSV, JSON, Parquet, and ORC formats
Routine Load: Supports CSV and JSON formats
MySQL Load: Supports CSV format
Data import has the following mechanisms:
Atomicity guarantee: Each import job acts as a complete transaction, ensuring atomic data writes.
Import label: Each import job is assigned a unique label to ensure At-Most-Once semantics.
Synchronous/Asynchronous mode: Synchronous mode returns results immediately, while asynchronous mode requires separate querying of job status.
Array type support: Array type data can be imported using CAST and array functions.
Execution engine: Users can choose whether to use the Pipeline engine to execute import tasks based on configuration.
In practicality, the following considerations are necessary:
Reasonable selection of import methods: Choose the most appropriate import method for different data sources.
Utilization of label mechanism: Achieve Exactly-Once semantics guarantee.
Proper configuration of parallelism: Adjust the number of parallel imports based on cluster resources.
Monitoring of import status: For asynchronous imports, check the job's progress timely.
By flexibly using the various import functions provided by Doris, data from various sources can be efficiently imported into Doris for analysis. For more details, please refer to the Data Loading Overview.
Loading Optimization
The Pipeline engine is a new query execution engine in Doris, designed to improve the efficiency of queries and data processing. During data import, the Pipeline engine can also be enabled to enhance overall performance. By default, the Pipeline engine is disabled during data import, but users can enable it through relevant configurations.
To enable the Pipeline engine during data import, configure the following variables:
1. FE Configuration Item: enable_pipeline_load
Location: In the configuration file of the FE (Frontend)
Function: When enabled, import tasks such as Stream Load will attempt to execute using the Pipeline engine
2. Session Variable: enable_nereids_dml_with_pipeline
Location: Set at session level
Function: When enabled, INSERT INTO statements will attempt to execute using the Pipeline engine
3. Session Variable: enable_pipeline_engine
Location: Set at session level
Function: Controls whether the Pipeline engine is actually enabled
Querying
For details, please refer to other sections on plan tuning.