Skip to main content

MULTI-LOAD

MULTI-LOAD​

Name​

MULTI LOAD

Description​

Users submit multiple import jobs through the HTTP protocol. Multi Load can ensure the atomic effect of multiple import jobs

Syntax:
curl --location-trusted -u user:passwd -XPOST http://host:port/api/{db}/_multi_start?label=xxx
curl --location-trusted -u user:passwd -T data.file http://host:port/api/{db}/{table1}/_load?label=xxx\&sub_label=yyy
curl --location-trusted -u user:passwd -T data.file http://host:port/api/{db}/{table2}/_load?label=xxx\&sub_label=zzz
curl --location-trusted -u user:passwd -XPOST http://host:port/api/{db}/_multi_commit?label=xxx
curl --location-trusted -u user:passwd -XPOST http://host:port/api/{db}/_multi_desc?label=xxx

On the basis of 'MINI LOAD', 'MULTI LOAD' can support users to import to multiple tables at the same time. The specific commands are shown above.
'/api/{db}/_multi_start' starts a multi-table import task
'/api/{db}/{table}/_load' adds a table to be imported to an import task. The main difference from 'MINI LOAD' is that the 'sub_label' parameter needs to be passed in
'/api/{db}/_multi_commit' submits the entire multi-table import task, and starts processing in the background
'/api/{db}/_multi_abort' Abort a multi-table import task
'/api/{db}/_multi_desc' can display the number of jobs submitted by a multi-table import task

Description of the HTTP protocol
Authorization Authentication Currently, Doris uses HTTP Basic authorization authentication. So you need to specify the username and password when importing
This method is to pass the password in clear text, since we are currently in an intranet environment. . .

Expect Doris needs to send the http request, it needs to have 'Expect' header information, the content is '100-continue'
why? Because we need to redirect the request, before transmitting the data content,
This can avoid causing multiple transmissions of data, thereby improving efficiency.

Content-Length Doris needs to send the request with the 'Content-Length' header. If the content sent is greater than
If the 'Content-Length' is less, then Palo thinks that there is a problem with the transmission, and fails to submit the task.
NOTE: If more data is sent than 'Content-Length', then Doris only reads 'Content-Length'
length content and import

Parameter Description:
user: If the user is in the default_cluster, the user is the user_name. Otherwise user_name@cluster_name.

label: Used to specify the label number imported in this batch, which is used for later job status query, etc.
This parameter is required.

sub_label: Used to specify the subversion number inside a multi-table import task. For loads imported from multiple tables, this parameter must be passed in.

columns: used to describe the corresponding column names in the import file.
If it is not passed in, then the order of the columns in the file is considered to be the same as the order in which the table was created.
The specified method is comma-separated, for example: columns=k1,k2,k3,k4

column_separator: used to specify the separator between columns, the default is '\t'
NOTE: url encoding is required, for example, '\t' needs to be specified as the delimiter,
Then you should pass in 'column_separator=%09'

max_filter_ratio: used to specify the maximum ratio of non-standard data allowed to filter, the default is 0, no filtering is allowed
The custom specification should be as follows: 'max_filter_ratio=0.2', which means 20% error rate is allowed
Passing in has effect when '_multi_start'

NOTE:
1. This import method currently completes the import work on one machine, so it is not suitable for import work with a large amount of data.
It is recommended that the amount of imported data should not exceed 1GB

2. Currently it is not possible to submit multiple files using `curl -T "{file1, file2}"`, because curl splits them into multiple files
The request is sent. Multiple requests cannot share a label number, so it cannot be used.

3. Supports the use of curl to import data into Doris in a way similar to streaming, but only after the streaming ends Doris
The real import behavior will occur, and the amount of data in this way cannot be too large.

Example​

1. Import the data in the local file 'testData1' into the table 'testTbl1' in the database 'testDb', and
Import the data of 'testData2' into table 'testTbl2' in 'testDb' (user is in defalut_cluster)
curl --location-trusted -u root -XPOST http://host:port/api/testDb/_multi_start?label=123
curl --location-trusted -u root -T testData1 http://host:port/api/testDb/testTbl1/_load?label=123\&sub_label=1
curl --location-trusted -u root -T testData2 http://host:port/api/testDb/testTbl2/_load?label=123\&sub_label=2
curl --location-trusted -u root -XPOST http://host:port/api/testDb/_multi_commit?label=123

2. Abandoned in the middle of multi-table import (user is in defalut_cluster)
curl --location-trusted -u root -XPOST http://host:port/api/testDb/_multi_start?label=123
curl --location-trusted -u root -T testData1 http://host:port/api/testDb/testTbl1/_load?label=123\&sub_label=1
curl --location-trusted -u root -XPOST http://host:port/api/testDb/_multi_abort?label=123

3. Multi-table import to see how much content has been submitted (the user is in the defalut_cluster)
curl --location-trusted -u root -XPOST http://host:port/api/testDb/_multi_start?label=123
curl --location-trusted -u root -T testData1 http://host:port/api/testDb/testTbl1/_load?label=123\&sub_label=1
curl --location-trusted -u root -XPOST http://host:port/api/testDb/_multi_desc?label=123

Keywords​

MULTI, MINI, LOAD

Best Practice​