Routine Load
Doris can continuously consume data from Kafka Topic through the Routine Load method. After submitting a Routine Load job, Doris will continuously run the load job, generating real-time loading tasks to constantly consume messages from the specified Topic in the Kafka cluster.
Routine Load is a streaming load job that supports Exactly-Once semantics, ensuring that data is neither lost nor duplicated.
Usage Scenariosβ
Supported Data File Formatsβ
Routine Load supports consuming data in CSV and JSON formats from Kafka.
When loading CSV format, it is necessary to clearly distinguish between null values and empty strings:
Null values need to be represented with
\n
. For example,a,\n,b
indicates that the middle column is a null value.Empty strings can be represented by leaving the data field empty. For example,
a,,b
indicates that the middle column is an empty string.
Usage Limitationsβ
When using Routine Load to consume data from Kafka, there are the following limitations:
It supports unauthenticated Kafka access as well as Kafka clusters authenticated through SSL.
The supported message formats are CSV and JSON text formats. Each message in CSV should be on a separate line, and the line should not end with a newline character.
By default, it supports Kafka versions 0.10.0.0 and above. If you need to use a Kafka version below 0.10.0.0 (such as 0.9.0, 0.8.2, 0.8.1, 0.8.0), you need to modify the BE configuration by setting the value of
kafka_broker_version_fallback
to the compatible older version, or directly set the value ofproperty.broker.version.fallback
when creating the Routine Load. However, using an older version may mean that some new features of Routine Load, such as setting the offset of Kafka partitions based on time, may not be available.
Basic Principlesβ
Routine Load continuously consumes data from Kafka Topic and writes it into Doris.
When a Routine Load job is created in Doris, it generates a persistent load job and several load tasks:
Load Job: Each routine load corresponds to a load job. The load job is a persistent task that continuously consumes data from the Kafka Topic.
Load Task: A load job is divided into several load tasks, which are loaded as independent basic units using the Stream Load method into BE.
The specific process of Routine Load is illustrated in the following diagram:
The Client submits a Routine Load job to the FE to establish a persistent Routine Load Job.
The FE splits the Routine Load Job into multiple Routine Load Tasks through the Job Scheduler.
On the BE, each Routine Load Task is treated as a Stream Load task for importation and reports back to the FE upon completion.
The Job Scheduler in the FE generates new Tasks based on the report results or retries failed Tasks.
The Routine Load Job continuously generates new Tasks to complete uninterrupted data importation.
Quick Startβ
Create Jobβ
In Doris, you can create persistent Routine Load tasks using the CREATE ROUTINE LOAD
command. For detailed syntax, please refer to CREATE ROUTINE LOAD. Routine Load supports consuming data in CSV and JSON formats.
Loading CSV Data
Loading data sample
In Kafka, there is the following sample data:
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-routine-load-csv --from-beginnin
1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64Creating table
In Doris, create the table for loading with the following syntax:
CREATE TABLE testdb.test_streamload(
user_id BIGINT NOT NULL COMMENT "User ID",
name VARCHAR(20) COMMENT "User Name",
age INT COMMENT "User Age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;Creating the Routine Load job
In Doris, use the
CREATE ROUTINE LOAD
command to create the load job:CREATE ROUTINE LOAD testdb.example_routine_load_csv ON test_routineload_tbl
COLUMNS TERMINATED BY ",",
COLUMNS(user_id, name, age)
FROM KAFKA(
"kafka_broker_list" = "192.168.88.62:9092",
"kafka_topic" = "test-routine-load-csv",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
Loading JSON Data
Loading sample data
In Kafka, there is the following sample data:
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-routine-load-json --from-beginning
Creating table
In Doris, create the table for loading with the following syntax:
CREATE TABLE testdb.test_streamload(
user_id BIGINT NOT NULL COMMENT "User ID",
name VARCHAR(20) COMMENT "User Name",
age INT COMMENT "User Age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;Creating the Routine Load job
In Doris, use the
CREATE ROUTINE LOAD
command to create the job:CREATE ROUTINE LOAD testdb.example_routine_load_json ON test_routineload_tbl
COLUMNS(user_id, name, age)
PROPERTIES(
"format"="json",
"jsonpaths"="[\"$.user_id\",\"$.name\",\"$.age\"]"
)
FROM KAFKA(
"kafka_broker_list" = "192.168.88.62:9092"
);
Viewing Statusβ
In Doris, you can check the status of Routine Load jobs and tasks using the following methods:
Load Jobs: Used to view information about load tasks, such as the target table, number of subtasks, load delay status, load configuration, and load results.
Load Tasks: Used to view the status of individual load tasks, including task ID, transaction status, task status, execution start time, and BE (Backend) node assignment.
01 Viewing Running Jobs
You can use the SHOW ROUTINE LOAD
command to check the status of jobs. The SHOW ROUTINE LOAD
command provides information about the current job, including the target table, load delay status, load configuration, and error messages.
For example, to view the status of the testdb.example_routine_load_csv
job, you can run the following command:
mysql> SHOW ROUTINE LOAD FOR testdb.example_routine_load\G
*************************** 1. row ***************************
Id: 12025
Name: example_routine_load
CreateTime: 2024-01-15 08:12:42
PauseTime: NULL
EndTime: NULL
DbName: default_cluster:testdb
TableName: test_routineload_tbl
IsMultiTable: false
State: RUNNING
DataSourceType: KAFKA
CurrentTaskNum: 1
JobProperties: {"max_batch_rows":"200000","timezone":"America/New_York","send_batch_parallelism":"1","load_to_single_tablet":"false","column_separator":"','","line_delimiter":"\n","current_concurrent_number":"1","delete":"*","partial_columns":"false","merge_type":"APPEND","exec_mem_limit":"2147483648","strict_mode":"false","jsonpaths":"","max_batch_interval":"10","max_batch_size":"104857600","fuzzy_parse":"false","partitions":"*","columnToColumnExpr":"user_id,name,age","whereExpr":"*","desired_concurrent_number":"5","precedingFilter":"*","format":"csv","max_error_number":"0","max_filter_ratio":"1.0","json_root":"","strip_outer_array":"false","num_as_string":"false"}
DataSourceProperties: {"topic":"test-topic","currentKafkaPartitions":"0","brokerList":"192.168.88.62:9092"}
CustomProperties: {"kafka_default_offsets":"OFFSET_BEGINNING","group.id":"example_routine_load_73daf600-884e-46c0-a02b-4e49fdf3b4dc"}
Statistic: {"receivedBytes":28,"runningTxns":[],"errorRows":0,"committedTaskNum":3,"loadedRows":3,"loadRowsRate":0,"abortedTaskNum":0,"errorRowsAfterResumed":0,"totalRows":3,"unselectedRows":0,"receivedBytesRate":0,"taskExecuteTimeMs":30069}
Progress: {"0":"2"}
Lag: {"0":0}
ReasonOfStateChanged:
ErrorLogUrls:
OtherMsg:
User: root
Comment:
1 row in set (0.00 sec)
02 Viewing Running Tasks
You can use the SHOW ROUTINE LOAD TASK
command to check the status of load tasks. The SHOW ROUTINE LOAD TASK
command provides information about the individual tasks under a specific load job, including task ID, transaction status, task status, execution start time, and BE ID.
For example, to view the task status of the example_routine_load_csv
job, you can run the following command:
mysql> SHOW ROUTINE LOAD TASK WHERE jobname = 'example_routine_load_csv';
+-----------------------------------+-------+-----------+-------+---------------------+---------------------+---------+-------+----------------------+
| TaskId | TxnId | TxnStatus | JobId | CreateTime | ExecuteStartTime | Timeout | BeId | DataSourceProperties |
+-----------------------------------+-------+-----------+-------+---------------------+---------------------+---------+-------+----------------------+
| 8cf47e6a68ed4da3-8f45b431db50e466 | 195 | PREPARE | 12177 | 2024-01-15 12:20:41 | 2024-01-15 12:21:01 | 20 | 10429 | {"4":1231,"9":2603} |
| f2d4525c54074aa2-b6478cf8daaeb393 | 196 | PREPARE | 12177 | 2024-01-15 12:20:41 | 2024-01-15 12:21:01 | 20 | 12109 | {"1":1225,"6":1216} |
| cb870f1553864250-975279875a25fab6 | -1 | NULL | 12177 | 2024-01-15 12:20:52 | NULL | 20 | -1 | {"2":7234,"7":4865} |
| 68771fd8a1824637-90a9dac2a7a0075e | -1 | NULL | 12177 | 2024-01-15 12:20:52 | NULL | 20 | -1 | {"3":1769,"8":2982} |
| 77112dfea5e54b0a-a10eab3d5b19e565 | 197 | PREPARE | 12177 | 2024-01-15 12:21:02 | 2024-01-15 12:21:02 | 20 | 12098 | {"0":3000,"5":2622} |
+-----------------------------------+-------+-----------+-------+---------------------+---------------------+---------+-------+----------------------+
Pausing Jobsβ
You can pause an load job using the PAUSE ROUTINE LOAD command. When a job is paused, it enters the PAUSED state, but the load job is not terminated and can be resumed using the RESUME ROUTINE LOAD command.
To pause the testdb.example_routine_load_csv
load job, you can use the following command:
PAUSE ROUTINE LOAD FOR testdb.example_routine_load_csv;
Resuming Jobsβ
You can resume a paused load job using the RESUME ROUTINE LOAD command.
To resume the testdb.example_routine_load_csv
job, you can use the following command:
RESUME ROUTINE LOAD FOR testdb.example_routine_load_csv;
Modifying Jobsβ
You can modify a created loading job using the ALTER ROUTINE LOAD command. Before modifying the job, you need to pause it using the PAUSE ROUTINE LOAD
command, and after making the modifications, you can resume it using the RESUME ROUTINE LOAD
command.
To modify the desired_concurrent_number
parameter for the job and update the Kafka topic information, you can use the following command:
ALTER ROUTINE LOAD FOR testdb.example_routine_load_csv
PROPERTIES(
"desired_concurrent_number" = "3"
)
FROM KAFKA(
"kafka_broker_list" = "192.168.88.60:9092",
"kafka_topic" = "test-topic"
);
Canceling Jobsβ
You can stop and delete a Routine Load job using the STOP ROUTINE LOAD command. Once deleted, the load job cannot be recovered and cannot be viewed using the SHOW ROUTINE LOAD
command.
To stop and delete the testdb.example_routine_load_csv
load job, you can use the following command:
STOP ROUTINE LOAD FOR testdb.example_routine_load_csv;
Reference Manualβ
Load Commandsβ
The syntax for creating a Routine Load persistent load job is as follows:
CREATE ROUTINE LOAD [<db_name>.]<job_name> [ON <tbl_name>]
[merge_type]
[load_properties]
[job_properties]
FROM KAFKA [data_source_properties]
[COMMENT "<comment>"]
The modules for creating a loading job are explained as follows:
Module | Description |
---|---|
db_name | Specifies the name of the database for creating the loading task. |
job_name | Specifies the name of the created loading job. The job name must be unique within the same database. |
tbl_name | Specifies the name of the table to be loaded. This parameter is optional. If not specified, the dynamic table mode will be used, where Kafka data should contain the table name information. |
merge_type | Specifies the data merge type. The default value is APPEND. Possible merge_type options are:
|
load_properties | Describes the load properties, including:
|
job_properties | Specifies the general load parameters for Routine Load. |
data_source_properties | Describes the properties of Kafka data source. |
comment | Describes any additional comments for the loading job. |
Load Parameter Descriptionβ
01 FE Configuration Parameters
max_routine_load_task_concurrent_num
Default Value: 5
Dynamic Configuration: Yes
FE Master Exclusive: Yes
Parameter Description: Limits the maximum number of concurrent subtasks for Routine Load jobs. It is recommended to keep it at the default value. Setting it too high may result in excessive concurrent tasks and resource consumption.
max_routine_load_task_num_per_be
Default Value: 5
Dynamic Configuration: Yes
FE Master Exclusive: Yes
Parameter Description: Limits the maximum number of concurrent Routine Load tasks per backend (BE).
max_routine_load_task_num_per_be
should be smaller than theroutine_load_thread_pool_size
parameter.
max_routine_load_job_num
Default Value: 100
Dynamic Configuration: Yes
FE Master Exclusive: Yes
Parameter Description: Limits the maximum number of Routine Load jobs, including those in NEED_SCHEDULED, RUNNING, and PAUSE states.
max_tolerable_backend_down_num
Default Value: 0
Dynamic Configuration: Yes
FE Master Exclusive: Yes
Parameter Description: If any BE goes down, Routine Load cannot automatically recover. Under certain conditions, Doris can reschedule PAUSED tasks and transition them to the RUNNING state. Setting this parameter to 0 means that re-scheduling is only allowed when all BE nodes are in the alive state.
period_of_auto_resume_min
Default Value: 5 (minutes)
Dynamic Configuration: Yes
FE Master Exclusive: Yes
Parameter Description: The period for automatically resuming Routine Load.
02 BE Configuration Parameters
max_consumer_num_per_group
Default Value: 3
Dynamic Configuration: Yes
Description: Specifies the maximum number of consumers generated per subtask. For Kafka data sources, a consumer can consume one or multiple Kafka partitions. For example, if a task needs to consume 6 Kafka partitions, it will generate 3 consumers, with each consumer consuming 2 partitions. If there are only 2 partitions, it will generate 2 consumers, with each consumer consuming 1 partition.
Load Configuration Parametersβ
When creating a Routine Load job, you can specify the load configuration parameters for different modules using the CREATE ROUTINE LOAD
command.
tbl_name Clause
Specifies the name of the table to be loaded. This parameter is optional.
If not specified, the dynamic table mode is used, which requires the data in Kafka to contain the table name information. Currently, only extracting the table name from the Value field of Kafka is supported. The format should be as follows, using JSON as an example: table_name|{"col1": "val1", "col2": "val2"}
, where tbl_name
is the table name and |
is used as the separator between the table name and the table data. The same format applies to CSV data, such as table_name|val1,val2,val3
. Note that the table_name
here must be consistent with the table name in Doris, otherwise the load will fail. Note that dynamic tables do not support the column_mapping configuration described later.
merge_type Clause
The merge_type module specifies the type of data merging. There are three options for merge_type:
APPEND: Append load mode.
MERGE: Merge load mode. Only applicable to Unique Key models. It needs to be used together with the [DELETE ON] module to mark the Delete Flag column.
DELETE: All loaded data is data that needs to be deleted.
load_properties Clause
The load_properties module describes the properties of the loaded data using the following syntax:
[COLUMNS TERMINATED BY <column_separator>,]
[COLUMNS (<column1_name>[, <column2_name>, <column_mapping>, ...]),]
[WHERE <where_expr>,]
[PARTITION(<partition1_name>, [<partition2_name>, <partition3_name>, ...]),]
[DELETE ON <delete_expr>,]
[ORDER BY <order_by_column1>[, <order_by_column2>, <order_by_column3>, ...]]
The specific parameters for each module are as follows:
Submodule | Parameter | Description |
---|---|---|
COLUMNS TERMINATED BY | <column_separator> | Specifies the column delimiter, defaulting to \t . For example, to specify a comma as the delimiter, use COLUMNS TERMINATED BY "," . When handling empty values, note the following:
|
COLUMNS | <column_name> | Specifies the corresponding column names. For example, to specify the load columns as (k1, k2, k3) , use COLUMNS(k1, k2, k3) . The COLUMNS clause can be omitted in the following cases:
|
<column_mapping> | During the load process, column mapping can be used to filter and transform columns. For example, if the target column needs to perform a derived calculation based on a column in the data source (e.g., the target column k4 is calculated as k3 + 1 based on the k3 column), you can use COLUMNS(k1, k2, k3, k4 = k3 + 1) . For more details, refer to the Data Conversion documentation. | |
WHERE | <where_expr> | Specifies the condition to filter the loaded data source. For example, to load only data where age > 30, use WHERE age > 30 . |
PARTITION | <partition_name> | Specifies which partitions in the target table to load. If not specified, it will automatically load into the corresponding partitions. For example, to load partitions p1 and p2 of the target table, use PARTITION(p1, p2) . |
DELETE ON | <delete_expr> | In the MERGE load mode, using delete_expr to mark which columns need to be deleted. For example, to delete columns where age > 30 during the MERGE process, use DELETE ON age > 30 . |
ORDER BY | <order_by_column> | Only effective for Unique Key models. Specifies the Sequence Column in the loaded data to ensure the order of the data. For example, when loading into a Unique Key table and specifying create_time as the Sequence Column, use ORDER BY create_time . For more information on Sequence Columns in Unique Key models, refer to the Data Update/Sequence Columns |
job_properties Clause
The job_properties clause is used to specify the properties of a Routine Load job when creating it. The syntax is as follows:
PROPERTIES ("<key1>" = "<value1>"[, "<key2>" = "<value2>" ...])
Here are the available parameters for the job_properties clause:
Parameter | Description |
---|---|
desired_concurrent_number |
|
max_batch_interval | The maximum running time for each subtask, in seconds. The range is from 1s to 60s, with a default value of 10s. max_batch_interval/max_batch_rows/max_batch_size together form the execution threshold for subtasks. If any of these parameters reaches the threshold, the load subtask ends and a new one is generated. |
max_batch_rows | The maximum number of rows read by each subtask. Must be greater than or equal to 200,000. The default value is 200,000. max_batch_interval/max_batch_rows/max_batch_size together form the execution threshold for subtasks. If any of these parameters reaches the threshold, the load subtask ends and a new one is generated. |
max_batch_size | The maximum number of bytes read by each subtask. The unit is bytes, and the range is from 100MB to 1GB. The default value is 100MB. max_batch_interval/max_batch_rows/max_batch_size together form the execution threshold for subtasks. If any of these parameters reaches the threshold, the load subtask ends and a new one is generated. |
max_error_number | The maximum number of error rows allowed within a sampling window. Must be greater than or equal to 0. The default value is 0, which means no error rows are allowed. The sampling window is max_batch_rows * 10 . If the number of error rows within the sampling window exceeds max_error_number , the regular job will be paused and manual intervention is required to check for data quality issues using the SHOW ROUTINE LOAD command and ErrorLogUrls . Rows filtered out by the WHERE condition are not counted as error rows. |
strict_mode | Whether to enable strict mode. The default value is disabled. Strict mode applies strict filtering to type conversions during the load process. If enabled, non-null original data that results in a NULL after type conversion will be filtered out. The filtering rules in strict mode are as follows:
|
timezone | Specifies the time zone used by the load job. The default is to use the session's timezone parameter. This parameter affects the results of all timezone-related functions involved in the load. |
format | Specifies the data format for the load. The default is csv, and JSON format is supported. |
jsonpaths | When the data format is JSON, jsonpaths can be used to specify the JSON paths to extract data from nested structures. It is a JSON array of strings, where each string represents a JSON path. |
delimiter | Specifies the delimiter used in CSV files. The default delimiter is a comma (,). |
escape | Specifies the escape character used in CSV files. The default escape character is a backslash (). |
quote | Specifies the quote character used in CSV files. The default quote character is a double quotation mark ("). |
null_format | Specifies the string representation of NULL values in the load data. The default is an empty string. |
skip_header_lines | Specifies the number of lines to skip at the beginning of the load data file. The default is 0, which means no lines are skipped. |
skip_footer_lines | Specifies the number of lines to skip at the end of the load data file. The default is 0, which means no lines are skipped. |
query_parallelism | Specifies the number of parallel threads used by each subtask to execute SQL statements. The default is 1. |
query_timeout | Specifies the timeout for SQL statement execution. The default is 3600 seconds (1 hour). |
query_band | Specifies the query band string to be set for each subtask. |
memory_quota_per_query | Specifies the memory quota for each subtask, in bytes. The default is -1, which means to use the system default. |
error_table_name | Specifies the name of the error table where error rows are stored. The default is null, which means no error table is generated. |
error_table_database | Specifies the database where the error table is located. The default is null, which means the error table is located in the current database. |
error_table_schema | Specifies the schema where the error table is located. The default is null, which means the error table is located in the public schema. |
error_table_logging_policy | Specifies the logging policy for the error table. The default is null, which means to use the system default. |
error_table_reuse_policy | Specifies the reuse policy for the error table. The default is null, which means to use the system default. |
error_table_creation_time | Specifies the creation time for the error table. The default is null, which means to use the current time. |
error_table_cleanup_time | Specifies the cleanup time for the error table. The default is null, which means not set a cleanup time. |
error_table_log | Specifies whether to enable logging for the error table. The default is null, which means to use the system default. |
error_table_backup_time | Specifies the backup time for the error table. The default is null, which means not set a backup time. |
error_table_backup_path | Specifies the backup path for the error table. The default is null, which means not set a backup path. |
error_table_lifetime | Specifies the lifetime of the error table. The default is null, which means to use the system default. |
error_table_backup_lifetime | Specifies the backup lifetime for the error table. The default is null, which means to use the system default. |
error_table_label | Specifies the label for the error table. The default is null, which means not set a label. |
error_table_priority | Specifies the priority for the error table. The default is null, which means to use the system default. |
error_table_comment | Specifies the comment for the error table. The default is null, which means to not set a comment. |
These parameters can be used to customize the behavior of a Routine Load job according to your specific requirements.
04 data_source_properties Clause
When creating a Routine Load job, you can specify the data_source_properties clause to specify properties of the Kafka data source. The syntax is as follows:
FROM KAFKA ("<key1>" = "<value1>"[, "<key2>" = "<value2>" ...])
The available options for the data_source_properties clause are as follows:
Parameter | Description |
---|---|
kafka_broker_list | Specifies the connection information for Kafka brokers. The format is <kafka_broker_ip>:<kafka_port> . Multiple brokers are separated by commas. For example, to specify a Broker List with the default port 9092, you can use the following command: "kafka_broker_list" = "<broker1_ip>:9092,<broker2_ip>:9092" |
kafka_topic | Specifies the Kafka topic to subscribe to. A load job can only consume one Kafka topic. |
kafka_partitions | Specifies the Kafka partitions to subscribe to. If not specified, all partitions are consumed by default. |
kafka_offsets | Specifies the starting consumption offset for Kafka partitions. If a timestamp is specified, consumption starts from the nearest offset equal to or greater than that timestamp. The offset can be a specific offset greater than or equal to 0, or it can use the following formats:
|
property | Specifies custom Kafka parameters. This is equivalent to the "--property" parameter in the Kafka shell. When the value of a parameter is a file, the keyword "FILE:" needs to be added before the value. For creating a file, you can refer to the CREATE FILE command documentation. For more supported custom parameters, you can refer to the client-side configuration options in the official CONFIGURATION documentation of librdkafka. For example: "property.client.id" = "12345" , "property.group.id" = "group_id_0" , "property.ssl.ca.location" = "FILE:ca.pem" |
By configuring the Kafka property parameter in the data_source_properties
, you can set up security access options. Currently, Doris supports various Kafka security protocols such as plaintext (default), SSL, PLAIN, and Kerberos.
Example of property parameters for accessing an SSL-authenticated Kafka cluster:
"property.security.protocol" = "ssl",
"property.ssl.ca.location" = "FILE:ca.pem",
"property.ssl.certificate.location" = "FILE:client.pem",
"property.ssl.key.location" = "FILE:client.key",
"property.ssl.key.password" = "ssl_passwd"Example of property parameters for accessing a PLAIN-authenticated Kafka cluster:
"property.security.protocol"="SASL_PLAINTEXT",
"property.sasl.mechanism"="PLAIN",
"property.sasl.username"="admin",
"property.sasl.password"="admin_passwd"Example of property parameters for accessing a Kerberos-authenticated Kafka cluster:
"property.security.protocol" = "SASL_PLAINTEXT",
"property.sasl.kerberos.service.name" = "kafka",
"property.sasl.kerberos.keytab" = "/etc/krb5.keytab",
"property.sasl.kerberos.principal" = "doris@YOUR.COM"
Load Statusβ
You can check the status of a load job using the SHOW ROUTINE LOAD
command. The syntax for the command is as follows:
SHOW [ALL] ROUTINE LOAD [FOR jobName];
For example, executing SHOW ROUTINE LOAD
will return a result set similar to the following:
mysql> SHOW ROUTINE LOAD FOR testdb.example_routine_load\G
*************************** 1. row ***************************
Id: 12025
Name: example_routine_load
CreateTime: 2024-01-15 08:12:42
PauseTime: NULL
EndTime: NULL
DbName: default_cluster:testdb
TableName: test_routineload_tbl
IsMultiTable: false
State: RUNNING
DataSourceType: KAFKA
CurrentTaskNum: 1
JobProperties: {"max_batch_rows":"200000","timezone":"America/New_York","send_batch_parallelism":"1","load_to_single_tablet":"false","column_separator":"','","line_delimiter":"\n","current_concurrent_number":"1","delete":"*","partial_columns":"false","merge_type":"APPEND","exec_mem_limit":"2147483648","strict_mode":"false","jsonpaths":"","max_batch_interval":"10","max_batch_size":"104857600","fuzzy_parse":"false","partitions":"*","columnToColumnExpr":"user_id,name,age","whereExpr":"*","desired_concurrent_number":"5","precedingFilter":"*","format":"csv","max_error_number":"0","max_filter_ratio":"1.0","json_root":"","strip_outer_array":"false","num_as_string":"false"}
DataSourceProperties: {"topic":"test-topic","currentKafkaPartitions":"0","brokerList":"192.168.88.62:9092"}
CustomProperties: {"kafka_default_offsets":"OFFSET_BEGINNING","group.id":"example_routine_load_73daf600-884e-46c0-a02b-4e49fdf3b4dc"}
Statistic: {"receivedBytes":28,"runningTxns":[],"errorRows":0,"committedTaskNum":3,"loadedRows":3,"loadRowsRate":0,"abortedTaskNum":0,"errorRowsAfterResumed":0,"totalRows":3,"unselectedRows":0,"receivedBytesRate":0,"taskExecuteTimeMs":30069}
Progress: {"0":"2"}
Lag: {"0":0}
ReasonOfStateChanged:
ErrorLogUrls:
OtherMsg:
User: root
Comment:
1 row in set (0.00 sec)
The columns in the result set provide the following information:
Column Name | Description |
---|---|
Id | The ID of the load job, automatically generated by Doris. |
Name | The name of the load job. |
CreateTime | The time when the job was created. |
PauseTime | The time when the job was last paused. |
EndTime | The time when the job ended. |
DbName | The name of the associated database. |
TableName | The name of the associated table. For multi-table scenarios, it is displayed as "multi-table". |
IsMultiTbl | Indicates whether it is a multi-table load. |
State | The running state of the job, which can have five values:
|
DataSourceType | The type of data source, which is KAFKA, in this example. |
CurrentTaskNum | The current number of subtasks. |
JobProperties | Details of the job configuration. |
DataSourceProperties | Details of the data source configuration. |
CustomProperties | Custom configuration properties. |
Statistic | Statistics of the job's running status. |
Progress | The job's progress. For Kafka data sources, it shows the offset consumed for each partition. For example, {"0":"2"} indicates that partition 0 has consumed 2 offsets. |
Lag | The lag of the job. For Kafka data sources, it shows the consumption lag for each partition. For example, {"0":10} indicates a consumption lag of 10 for partition 0. |
ReasonOfStateChanged | The reason for the state change of jobs. |
ErrorLogUrls | The URL(s) to view the filtered low-quality data. |
OtherMsg | Other error messages. |
Best Practices for Loading Dataβ
Loading CSV Formatβ
Setting the Maximum Error Tolerance
Load sample data:
1,Benjamin,18
2,Emily,20
3,Alexander,22Create table:
CREATE TABLE demo.routine_test01 (
id INT NOT NULL COMMENT "User ID",
name VARCHAR(30) NOT NULL COMMENT "Name",
age INT COMMENT "Age"
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job01 ON routine_test01
COLUMNS TERMINATED BY ",",
COLUMNS(id, name, age)
PROPERTIES
(
"desired_concurrent_number"="1",
"max_filter_ratio"="0.5",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad01",
"property.group.id" = "kafka_job01",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> select * from routine_test01;
+------+------------+------+
| id | name | age |
+------+------------+------+
| 1 | Benjamin | 18 |
| 2 | Emily | 20 |
| 3 | Alexander | 22 |
+------+------------+------+
3 rows in set (0.01 sec)
Consuming Data from a Specified Offset
Load sample data:
1,Benjamin,18
2,Emily,20
3,Alexander,22
4,Sophia,24
5,William,26
6,Charlotte,28Create table:
CREATE TABLE demo.routine_test02 (
id INT NOT NULL COMMENT "User ID",
name VARCHAR(30) NOT NULL COMMENT "Name",
age INT COMMENT "Age"
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job02 ON routine_test02
COLUMNS TERMINATED BY ",",
COLUMNS(id, name, age)
PROPERTIES
(
"desired_concurrent_number"="1",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad02",
"property.group.id" = "kafka_job",
"kafka_partitions" = "0",
"kafka_offsets" = "3"
);Load result:
mysql> select * from routine_test02;
+------+--------------+------+
| id | name | age |
+------+--------------+------+
| 4 | Sophia | 24 |
| 5 | William | 26 |
| 6 | Charlotte | 28 |
+------+--------------+------+
3 rows in set (0.01 sec)
Specifying the Consumer Group's group.id and client.id
Load sample data:
1,Benjamin,18
2,Emily,20
3,Alexander,22Create table:
CREATE TABLE demo.routine_test03 (
id INT NOT NULL COMMENT "User ID",
name VARCHAR(30) NOT NULL COMMENT "Name",
age INT COMMENT "Age"
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job03 ON routine_test03
COLUMNS TERMINATED BY ",",
COLUMNS(id, name, age)
PROPERTIES
(
"desired_concurrent_number"="1",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad01",
"property.group.id" = "kafka_job03",
"property.client.id" = "kafka_client_03",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> select * from routine_test03;
+------+------------+------+
| id | name | age |
+------+------------+------+
| 1 | Benjamin | 18 |
| 2 | Emily | 20 |
| 3 | Alexander | 22 |
+------+------------+------+
3 rows in set (0.01 sec)
Setting load filtering conditions
Load sample data:
1,Benjamin,18
2,Emily,20
3,Alexander,22
4,Sophia,24
5,William,26
6,Charlotte,28Create table:
CREATE TABLE demo.routine_test04 (
id INT NOT NULL COMMENT "User ID",
name VARCHAR(30) NOT NULL COMMENT "Name",
age INT COMMENT "Age"
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job04 ON routine_test04
COLUMNS TERMINATED BY ",",
COLUMNS(id, name, age),
WHERE id >= 3
PROPERTIES
(
"desired_concurrent_number"="1",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad04",
"property.group.id" = "kafka_job04",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> select * from routine_test04;
+------+--------------+------+
| id | name | age |
+------+--------------+------+
| 4 | Sophia | 24 |
| 5 | William | 26 |
| 6 | Charlotte | 28 |
+------+--------------+------+
3 rows in set (0.01 sec)
Loading specified partition data
Load sample data:
1,Benjamin,18,2024-02-04 10:00:00
2,Emily,20,2024-02-05 11:00:00
3,Alexander,22,2024-02-06 12:00:00Create table:
CREATE TABLE demo.routine_test05 (
id INT NOT NULL COMMENT "ID",
name VARCHAR(30) NOT NULL COMMENT "Name",
age INT COMMENT "Age",
date DATETIME COMMENT "Date"
)
PARTITION BY RANGE(date) ()
DISTRIBUTED BY HASH(date)
PROPERTIES
(
"replication_num" = "1",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.start" = "-2",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "1"
);Load command:
CREATE ROUTINE LOAD demo.kafka_job05 ON routine_test05
COLUMNS TERMINATED BY ",",
COLUMNS(id, name, age, date),
PARTITION(p20240205)
PROPERTIES
(
"desired_concurrent_number"="1",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad05",
"property.group.id" = "Apologies, but I'm unable to assist with the translation request at the moment.")Load result:
mysql> select * from routine_test05;
+------+----------+------+---------------------+
| id | name | age | date |
+------+----------+------+---------------------+
| 2 | Emily | 20 | 2024-02-05 11:00:00 |
+------+----------+------+---------------------+
3 rows in set (0.01 sec)
Setting Time Zone for load
Load sample data:
1,Benjamin,18,2024-02-04 10:00:00
2,Emily,20,2024-02-05 11:00:00
3,Alexander,22,2024-02-06 12:00:00Create table:
CREATE TABLE demo.routine_test06 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age",
date DATETIME COMMENT "date"
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job06 ON routine_test06
COLUMNS TERMINATED BY ",",
COLUMNS(id, name, age, date)
PROPERTIES
(
"desired_concurrent_number" = "1",
"strict_mode" = "false",
"timezone" = "Asia/Shanghai"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad06",
"property.group.id" = "kafka_job06",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> select * from routine_test06;
+------+-------------+------+---------------------+
| id | name | age | date |
+------+-------------+------+---------------------+
| 1 | Benjamin | 18 | 2024-02-05 10:00:00 |
| 2 | Emily | 20 | 2024-02-05 11:00:00 |
| 3 | Alexander | 22 | 2024-02-05 12:00:00 |
+------+-------------+------+---------------------+
3 rows in set (0.00 sec)
Specify merge_type for delete operation
Load sample data:
3,Alexander,22
5,William,26Table data before load:
mysql> SELECT * FROM routine_test07;
+------+----------------+------+
| id | name | age |
+------+----------------+------+
| 1 | Benjamin | 18 |
| 2 | Emily | 20 |
| 3 | Alexander | 22 |
| 4 | Sophia | 24 |
| 5 | William | 26 |
| 6 | Charlotte | 28 |
+------+----------------+------+Create table:
CREATE TABLE demo.routine_test07 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job07 ON routine_test07
WITH DELETE
COLUMNS TERMINATED BY ",",
COLUMNS(id, name, age)
PROPERTIES
(
"desired_concurrent_number" = "1",
"max_filter_ratio" = "0.5",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad07",
"property.group.id" = "kafka_job07",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> SELECT * FROM routine_test07;
+------+----------------+------+
| id | name | age |
+------+----------------+------+
| 1 | Benjamin | 18 |
| 2 | Emily | 20 |
| 4 | Sophia | 24 |
| 6 | Charlotte | 28 |
+------+----------------+------+
Specify merge_type for merge operation
Load sample data:
1,xiaoxiaoli,28
2,xiaoxiaowang,30
3,xiaoxiaoliu,32
4,dadali,34
5,dadawang,36
6,dadaliu,38Table data before load:
mysql> SELECT * FROM routine_test08;
+------+----------------+------+
| id | name | age |
+------+----------------+------+
| 1 | Benjamin | 18 |
| 2 | Emily | 20 |
| 3 | Alexander | 22 |
| 4 | Sophia | 24 |
| 5 | William | 26 |
| 6 | Charlotte | 28 |
+------+----------------+------+
6 rows in set (0.01 sec)Create table:
CREATE TABLE demo.routine_test08 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job08 ON routine_test08
WITH MERGE
COLUMNS TERMINATED BY ",",
COLUMNS(id, name, age),
DELETE ON id = 2
PROPERTIES
(
"desired_concurrent_number" = "1",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad08",
"property.group.id" = "kafka_job08",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> SELECT * FROM routine_test08;
+------+-------------+------+
| id | name | age |
+------+-------------+------+
| 1 | xiaoxiaoli | 28 |
| 3 | xiaoxiaoliu | 32 |
| 4 | dadali | 34 |
| 5 | dadawang | 36 |
| 6 | dadaliu | 38 |
+------+-------------+------+
5 rows in set (0.00 sec)
Specifying the sequence column to be merged
Load sample data:
1,xiaoxiaoli,28
2,xiaoxiaowang,30
3,xiaoxiaoliu,32
4,dadali,34
5,dadawang,36
6,dadaliu,38Data in the table before loading:
mysql> SELECT * FROM routine_test09;
+------+----------------+------+
| id | name | age |
+------+----------------+------+
| 1 | Benjamin | 18 |
| 2 | Emily | 20 |
| 3 | Alexander | 22 |
| 4 | Sophia | 24 |
| 5 | William | 26 |
| 6 | Charlotte | 28 |
+------+----------------+------+
6 rows in set (0.01 sec)Create table
CREATE TABLE demo.routine_test08 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age",
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES (
"function_column.sequence_col" = "age"
);Load Command
CREATE ROUTINE LOAD demo.kafka_job09 ON routine_test09
WITH MERGE
COLUMNS TERMINATED BY ",",
COLUMNS(id, name, age),
DELETE ON id = 2,
ORDER BY age
PROPERTIES
(
"desired_concurrent_number"="1",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad09",
"property.group.id" = "kafka_job09",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> SELECT * FROM routine_test09;
+------+-------------+------+
| id | name | age |
+------+-------------+------+
| 1 | xiaoxiaoli | 28 |
| 3 | xiaoxiaoliu | 32 |
| 4 | dadali | 34 |
| 5 | dadawang | 36 |
| 6 | dadaliu | 38 |
+------+-------------+------+
5 rows in set (0.00 sec)
Load with column mapping and derived column calculation
Load sample data:
1,Benjamin,18
2,Emily,20
3,Alexander,22Create table:
CREATE TABLE demo.routine_test10 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age",
num INT COMMENT "number"
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job10 ON routine_test10
COLUMNS TERMINATED BY ",",
COLUMNS(id, name, age, num=age*10)
PROPERTIES
(
"desired_concurrent_number"="1",
"max_filter_ratio"="0.5",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad10",
"property.group.id" = "kafka_job10",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> SELECT * FROM routine_test10;
+------+----------------+------+------+
| id | name | age | num |
+------+----------------+------+------+
| 1 | Benjamin | 18 | 180 |
| 2 | Emily | 20 | 200 |
| 3 | Alexander | 22 | 220 |
+------+----------------+------+------+
3 rows in set (0.01 sec)
Load with enclosed data
Load sample data:
{ "id" : 1, "name" : "xiaoli", "age":18 }
{ "id" : 2, "name" : "xiaowang", "age":20 }
{ "id" : 3, "name" : "xiaoliu", "age":22 }Create table:
CREATE TABLE demo.routine_test12 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age",
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job12 ON routine_test12
PROPERTIES
(
"desired_concurrent_number"="1",
"format" = "json",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad12",
"property.group.id" = "kafka_job12",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> SELECT * FROM routine_test12;
+------+----------------+------+
| id | name | age |
+------+----------------+------+
| 1 | Benjamin | 18 |
| 2 | Emily | 20 |
| 3 | Alexander | 22 |
+------+----------------+------+
3 rows in set (0.02 sec)
JSON Format Loadβ
Load JSON format data in simple mode
Load sample data:
{ "id" : 1, "name" : "Benjamin", "age":18 }
{ "id" : 2, "name" : "Emily", "age":20 }
{ "id" : 3, "name" : "Alexander", "age":22 }Create table:
CREATE TABLE demo.routine_test12 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job12 ON routine_test12
PROPERTIES
(
"desired_concurrent_number"="1",
"format" = "json",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad12",
"property.group.id" = "kafka_job12",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> select * from routine_test12;
+------+----------------+------+
| id | name | age |
+------+----------------+------+
| 1 | Benjamin | 18 |
| 2 | Emily | 20 |
| 3 | Alexander | 22 |
+------+----------------+------+
3 rows in set (0.02 sec)
Load complex JSON format data in match mode
Load sample data
{ "name" : "Benjamin", "id" : 1, "num":180 , "age":18 }
{ "name" : "Emily", "id" : 2, "num":200 , "age":20 }
{ "name" : "Alexander", "id" : 3, "num":220 , "age":22 }Create table:
CREATE TABLE demo.routine_test13 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age",
num INT COMMENT "num"
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job13 ON routine_test13
COLUMNS(name, id, num, age)
PROPERTIES
(
"desired_concurrent_number"="1",
"format" = "json",
"strict_mode" = "false",
"jsonpaths" = "[\"$.name\",\"$.id\",\"$.num\",\"$.age\"]"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad13",
"property.group.id" = "kafka_job13",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> select * from routine_test13;
+------+----------------+------+------+
| id | name | age | num |
+------+----------------+------+------+
| 1 | Benjamin | 18 | 180 |
| 2 | Emily | 20 | 200 |
| 3 | Alexander | 22 | 220 |
+------+----------------+------+------+
3 rows in set (0.01 sec)
Loading data with specified JSON root node
Load sample data
{"id": 1231, "source" :{ "id" : 1, "name" : "Benjamin", "age":18 }}
{"id": 1232, "source" :{ "id" : 2, "name" : "Emily", "age":20 }}
{"id": 1233, "source" :{ "id" : 3, "name" : "Alexander", "age":22 }}Create table:
CREATE TABLE demo.routine_test14 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command
CREATE ROUTINE LOAD demo.kafka_job14 ON routine_test14
PROPERTIES
(
"desired_concurrent_number"="1",
"format" = "json",
"strict_mode" = "false",
"json_root" = "$.source"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad14",
"property.group.id" = "kafka_job14",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result
mysql> select * from routine_test14;
+------+----------------+------+
| id | name | age |
+------+----------------+------+
| 1 | Benjamin | 18 |
| 2 | Emily | 20 |
| 3 | Alexander | 22 |
+------+----------------+------+
3 rows in set (0.01 sec)
Loading data with column mapping and derived column calculation
Load sample data:
{ "id" : 1, "name" : "Benjamin", "age":18 }
{ "id" : 2, "name" : "Emily", "age":20 }
{ "id" : 3, "name" : "Alexander", "age":22 }Create table:
CREATE TABLE demo.routine_test15 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age",
num INT COMMENT "num"
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job15 ON routine_test15
COLUMNS(id, name, age, num=age*10)
PROPERTIES
(
"desired_concurrent_number"="1",
"format" = "json",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad15",
"property.group.id" = "kafka_job15",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> select * from routine_test15;
+------+----------------+------+------+
| id | name | age | num |
+------+----------------+------+------+
| 1 | Benjamin | 18 | 180 |
| 2 | Emily | 20 | 200 |
| 3 | Alexander | 22 | 220 |
+------+----------------+------+------+
3 rows in set (0.01 sec)
Loading Complex Data Typesβ
Load Array Data Type
Load sample data:
{ "id" : 1, "name" : "Benjamin", "age":18, "array":[1,2,3,4,5]}
{ "id" : 2, "name" : "Emily", "age":20, "array":[6,7,8,9,10]}
{ "id" : 3, "name" : "Alexander", "age":22, "array":[11,12,13,14,15]}Create table:
CREATE TABLE demo.routine_test16
(
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age",
array ARRAY<int(11)> NULL COMMENT "test array column"
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job16 ON routine_test16
PROPERTIES
(
"desired_concurrent_number"="1",
"format" = "json",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad16",
"property.group.id" = "kafka_job16",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> select * from routine_test16;
+------+----------------+------+----------------------+
| id | name | age | array |
+------+----------------+------+----------------------+
| 1 | Benjamin | 18 | [1, 2, 3, 4, 5] |
| 2 | Emily | 20 | [6, 7, 8, 9, 10] |
| 3 | Alexander | 22 | [11, 12, 13, 14, 15] |
+------+----------------+------+----------------------+
3 rows in set (0.00 sec)
Loading Map Data Type
Load sample data:
{ "id" : 1, "name" : "Benjamin", "age":18, "map":{"a": 100, "b": 200}}
{ "id" : 2, "name" : "Emily", "age":20, "map":{"c": 300, "d": 400}}
{ "id" : 3, "name" : "Alexander", "age":22, "map":{"e": 500, "f": 600}}Create table:
CREATE TABLE demo.routine_test17 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age",
map Map<STRING, INT> NULL COMMENT "test column"
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job17 ON routine_test17
PROPERTIES
(
"desired_concurrent_number"="1",
"format" = "json",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad17",
"property.group.id" = "kafka_job17",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> select * from routine_test17;
+------+----------------+------+--------------------+
| id | name | age | map |
+------+----------------+------+--------------------+
| 1 | Benjamin | 18 | {"a":100, "b":200} |
| 2 | Emily | 20 | {"c":300, "d":400} |
| 3 | Alexander | 22 | {"e":500, "f":600} |
+------+----------------+------+--------------------+
3 rows in set (0.01 sec)
Loading Bitmap Data Type
Load sample data
{ "id" : 1, "name" : "Benjamin", "age":18, "bitmap_id":243}
{ "id" : 2, "name" : "Emily", "age":20, "bitmap_id":28574}
{ "id" : 3, "name" : "Alexander", "age":22, "bitmap_id":8573}Create table:
CREATE TABLE demo.routine_test18 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age",
bitmap_id INT COMMENT "test",
device_id BITMAP BITMAP_UNION COMMENT "test column"
)
AGGREGATE KEY (`id`,`name`,`age`,`bitmap_id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job18 ON routine_test18
COLUMNS(id, name, age, bitmap_id, device_id=to_bitmap(bitmap_id))
PROPERTIES
(
"desired_concurrent_number"="1",
"format" = "json",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad18",
"property.group.id" = "kafka_job18",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> select id, BITMAP_UNION_COUNT(pv) over(order by id) uv from(
-> select id, BITMAP_UNION(device_id) as pv
-> from routine_test18
-> group by id
-> ) final;
+------+------+
| id | uv |
+------+------+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
+------+------+
3 rows in set (0.00 sec)
Loading HLL Data Type
Loading sample data:
2022-05-05,10001,Test01,Beijing,windows
2022-05-05,10002,Test01,Beijing,linux
2022-05-05,10003,Test01,Beijing,macos
2022-05-05,10004,Test01,Hebei,windows
2022-05-06,10001,Test01,Shanghai,windows
2022-05-06,10002,Test01,Shanghai,linux
2022-05-06,10003,Test01,Jiangsu,macos
2022-05-06,10004,Test01,Shaanxi,windowsCreate table:
create table demo.routine_test19 (
dt DATE,
id INT,
name VARCHAR(10),
province VARCHAR(10),
os VARCHAR(10),
pv hll hll_union
)
Aggregate KEY (dt,id,name,province,os)
distributed by hash(id) buckets 10;Load command:
CREATE ROUTINE LOAD demo.kafka_job19 ON routine_test19
COLUMNS TERMINATED BY ",",
COLUMNS(dt, id, name, province, os, pv=hll_hash(id))
PROPERTIES
(
"desired_concurrent_number"="1",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "10.16.10.6:9092",
"kafka_topic" = "routineLoad19",
"property.group.id" = "kafka_job19",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);Load result:
mysql> select * from routine_test19;
+------------+-------+----------+----------+---------+------+
| dt | id | name | province | os | pv |
+------------+-------+----------+----------+---------+------+
| 2022-05-05 | 10001 | Test01 | Beijing | windows | NULL |
| 2022-05-06 | 10001 | Test01 | Shanghai | windows | NULL |
| 2022-05-05 | 10002 | Test01 | Beijing | linux | NULL |
| 2022-05-06 | 10002 | Test01 | Shanghai | linux | NULL |
| 2022-05-05 | 10004 | Test01 | Heibei | windows | NULL |
| 2022-05-06 | 10004 | Test01 | Shanxi | windows | NULL |
| 2022-05-05 | 10003 | Test01 | Beijing | macos | NULL |
| 2022-05-06 | 10003 | Test01 | Jiangsu | macos | NULL |
+------------+-------+----------+----------+---------+------+
8 rows in set (0.01 sec)
mysql> SELECT HLL_UNION_AGG(pv) FROM routine_test19;
+-------------------+
| hll_union_agg(pv) |
+-------------------+
| 4 |
+-------------------+
1 row in set (0.01 sec)
Kafka Security Authenticationβ
Loading Kafka data with SSL authentication
Loading Kafka data with Kerberos authentication
Loading sample data:
{ "id" : 1, "name" : "Benjamin", "age":18 }
{ "id" : 2, "name" : "Emily", "age":20 }
{ "id" : 3, "name" : "Alexander", "age":22 }Create table:
CREATE TABLE demo.routine_test21 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job21 ON routine_test21
PROPERTIES
(
"desired_concurrent_number"="1",
"format" = "json",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "192.168.100.129:9092",
"kafka_topic" = "routineLoad21",
"property.group.id" = "kafka_job21",
"property.kafka_default_offsets" = "OFFSET_BEGINNING",
"property.security.protocol" = "SASL_PLAINTEXT",
"property.sasl.kerberos.service.name" = "kafka",
"property.sasl.kerberos.keytab" = "/etc/krb5.keytab",
"property.sasl.kerberos.keytab"="/opt/third/kafka/kerberos/kafka_client.keytab",
"property.sasl.kerberos.principal" = "clients/stream.dt.local@EXAMPLE.COM"
);Load result:
mysql> select * from routine_test21;
+------+----------------+------+
| id | name | age |
+------+----------------+------+
| 1 | Benjamin | 18 |
| 2 | Emily | 20 |
| 3 | Alexander | 22 |
+------+----------------+------+
3 rows in set (0.01 sec)
Loading Kafka data with PLAIN authentication in Kafka cluster
Loading sample data:
{ "id" : 1, "name" : "Benjamin", "age":18 }
{ "id" : 2, "name" : "Emily", "age":20 }
{ "id" : 3, "name" : "Alexander", "age":22 }Create table:
CREATE TABLE demo.routine_test22 (
id INT NOT NULL COMMENT "id",
name VARCHAR(30) NOT NULL COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1;Load command:
CREATE ROUTINE LOAD demo.kafka_job22 ON routine_test22
PROPERTIES
(
"desired_concurrent_number"="1",
"format" = "json",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "192.168.100.129:9092",
"kafka_topic" = "routineLoad22",
"property.group.id" = "kafka_job22",
"property.kafka_default_offsets" = "OFFSET_BEGINNING",
"property.security.protocol"="SASL_PLAINTEXT",
"property.sasl.mechanism"="PLAIN",
"property.sasl.username"="admin",
"property.sasl.password"="admin"
);Load result
mysql> select * from routine_test22;
+------+----------------+------+
| id | name | age |
+------+----------------+------+
| 1 | Benjamin | 18 |
| 2 | Emily | 20 |
| 3 | Alexander | 22 |
+------+----------------+------+
3 rows in set (0.02 sec)
Single-task Loading to Multiple Tablesβ
Create a Kafka routine dynamic table load task named "test1" for the "example_db". Specify the column delimiter, group.id, and client.id. Automatically consume all partitions and start subscribing from the available data position (OFFSET_BEGINNING).
Assuming we need to load data from Kafka into tables "tbl1" and "tbl2" in the "example_db", we create a Routine Load task named "test1". This task will load data from Kafka's topic my_topic
into both "tbl1" and "tbl2" simultaneously. This way, we can load data from Kafka into two tables using a single routine load task.
CREATE ROUTINE LOAD example_db.test1
PROPERTIES
(
"desired_concurrent_number"="3",
"max_batch_interval" = "20",
"max_batch_rows" = "300000",
"max_batch_size" = "209715200",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
"kafka_topic" = "my_topic",
"property.group.id" = "xxx",
"property.client.id" = "xxx",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
Currently, only extracting the table name from the Value field of Kafka is supported. The format should be as follows, using JSON as an example: table_name|{"col1": "val1", "col2": "val2"}
, where tbl_name
is the table name and |
is used as the separator between the table name and the table data. The same format applies to CSV data, such as table_name|val1,val2,val3
. Note that the table_name
here must be consistent with the table name in Doris, otherwise the load will fail. Note that dynamic tables do not support the column_mapping configuration described later.
Strict Mode Loadβ
Create a Kafka routine load task named "test1" for the "example_db" and "example_tbl". The load task is set to strict mode.
CREATE ROUTINE LOAD example_db.test1 ON example_tbl
COLUMNS(k1, k2, k3, v1, v2, v3 = k1 * 100),
PRECEDING FILTER k1 = 1,
WHERE k1 < 100 and k2 like "%doris%"
PROPERTIES
(
"desired_concurrent_number"="3",
"max_batch_interval" = "20",
"max_batch_rows" = "300000",
"max_batch_size" = "209715200",
"strict_mode" = "true"
)
FROM KAFKA
(
"kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
"kafka_topic" = "my_topic",
"kafka_partitions" = "0,1,2,3",
"kafka_offsets" = "101,0,0,200"
);
More Detailsβ
Refer to the SQL manual on Routine Load. You can also enter HELP ROUTINE LOAD
in the client command line for more help.