S3
S3
Name
s3
description
S3 表函数(table-valued-function,tvf),可以让用户像访问关系表格式数据一样,读取并访问 S3 兼容的对象存储上的文件内容。目前支持csv/csv_with_names/csv_with_names_and_types/json/parquet/orc
文件格式。
语法
s3(
"uri" = "..",
"s3.access_key" = "...",
"s3.secret_key" = "...",
"s3.region" = "...",
"format" = "csv",
"keyn" = "valuen",
...
);
参数说明
S3 tvf 中的每一个参数都是一个 "key"="value"
对。
访问 S3 相关参数:
uri
: (必填) 访问 S3 的 uri,S3 表函数会根据use_path_style
参数来决定是否使用 path style 访问方式,默认为 virtual-hosted style 方式s3.access_key
: (必填)s3.secret_key
: (必填)s3.region
: (选填)。如果 Minio 服务设置了其他的 region,那么必填,否则默认使用us-east-1
。s3.session_token
: (选填)use_path_style
:(选填) 默认为false
。S3 SDK 默认使用 virtual-hosted style 方式。但某些对象存储系统可能没开启或没支持 virtual-hosted style 方式的访问,此时我们可以添加 use_path_style 参数来强制使用 path style 方式。比如minio
默认情况下只允许path style
访问方式,所以在访问 minio 时要加上use_path_style=true
。
注意:uri 目前支持三种 schema:http://, https:// 和 s3://
- 如果使用 http://或 https://, 则会根据 'use_path_style' 参数来决定是否使用'path style'方式访问 s3
- 如果使用 s3://, 则都使用 'virtual-hosted style' 方式访问 s3, 'use_path_style'参数无效。
- 如果 uri 路径不存在或文件都是空文件,s3 tvf 将返回空集合
详细使用案例可以参考最下方 Best Practice。
文件格式参数:
format
:(必填) 目前支持csv/csv_with_names/csv_with_names_and_types/json/parquet/orc
column_separator
:(选填) 列分割符,默认为\t
。line_delimiter
:(选填) 行分割符,默认为\n
。compress_type
: (选填) 目前支持UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE/SNAPPYBLOCK
。默认值为UNKNOWN
, 将会根据uri
的后缀自动推断类型。下面 6 个参数是用于 json 格式的导入,具体使用方法可以参照:Json Load
read_json_by_line
: (选填) 默认为"true"
strip_outer_array
: (选填) 默认为"false"
json_root
: (选填) 默认为空jsonpaths
: (选填) 默认为空num_as_string
: (选填) 默认为false
下面 2 个参数是用于 csv 格式的导入fuzzy_parse
: (选填) 默认为false
trim_double_quotes
:布尔类型,选填,默认值为false
,为true
时表示裁剪掉 csv 文件每个字段最外层的双引号skip_lines
:整数类型,选填,默认值为 0,含义为跳过 csv 文件的前几行。当设置 format 设置为csv_with_names
或csv_with_names_and_types
时,该参数会失效
其他参数:
path_partition_keys
:(选填)指定文件路径中携带的分区列名,例如/path/to/city=beijing/date="2023-07-09", 则填写path_partition_keys="city,date"
,将会自动从路径中读取相应列名和列值进行导入。
Example
读取并访问 S3 兼容的对象存储上的 csv 格式文件
select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "csv",
"use_path_style" = "true") order by c1;
可以配合desc function
使用
MySQL [(none)]> Desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "csv",
"use_path_style" = "true");
Keywords
s3, table-valued-function, tvf
Best Practice
不同 url schema 的写法 http:// 、https:// 使用示例:
// 注意URI bucket写法以及use_path_style参数设置,http同理。
// 由于设置了"use_path_style"="true", 所以将采用path style方式访问s3。
select * from s3(
"uri" = "https://endpoint/bucket/file/student.csv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv",
"use_path_style"="true");
// 注意URI bucket写法以及use_path_style参数设置,http同理。
// 由于设置了"use_path_style"="false", 所以将采用virtual-hosted style方式访问s3。
select * from s3(
"uri" = "https://bucket.endpoint/bucket/file/student.csv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv",
"use_path_style"="false");
// 阿里云oss和腾讯云cos采用virtual-hosted style方式访问s3。
// OSS
select * from s3(
"uri" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.region" = "oss-cn-beijing",
"format" = "parquet",
"use_path_style" = "false");
// COS
select * from s3(
"uri" = "https://example-bucket.cos.ap-hongkong.myqcloud.com/your-folder/file.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.region" = "ap-hongkong",
"format" = "parquet",
"use_path_style" = "false");
s3:// 使用示例:
// 注意URI bucket写法, 无需设置use_path_style参数。
// 将采用virtual-hosted style方式访问s3。
select * from s3(
"uri" = "s3://bucket.endpoint/file/student.csv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv");
csv format 由于 S3 table-valued-function 事先并不知道 table schema,所以会先读一遍文件来解析出 table schema。
csv
格式:S3 table-valued-function 读取 S3 上的文件并当作 csv 文件来处理,读取文件中的第一行用于解析 table schema。文件第一行的列个数n
将作为 table schema 的列个数,table schema 的列名则自动取名为c1, c2, ..., cn
,列类型都设置为 String
, 举例:
student1.csv 文件内容为:
1,ftw,12
2,zs,18
3,ww,20
使用 S3 tvf
MySQL [(none)]> select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
-> "s3.access_key"= "minioadmin",
-> "s3.secret_key" = "minioadmin",
-> "format" = "csv",
-> "use_path_style" = "true") order by c1;
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+
可以配合 desc function S3()
来查看 table schema
MySQL [(none)]> Desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
-> "s3.access_key"= "minioadmin",
-> "s3.secret_key" = "minioadmin",
-> "format" = "csv",
-> "use_path_style" = "true");
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| c1 | TEXT | Yes | false | NULL | NONE |
| c2 | TEXT | Yes | false | NULL | NONE |
| c3 | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+
csv_with_names format
csv_with_names
格式:解析文件的第一行作为 table schema 的列个数和列名,列类型则都设置为 String
, 举例:
student_with_names.csv 文件内容为
id,name,age
1,ftw,12
2,zs,18
3,ww,20
使用 S3 tvf
MySQL [(none)]> select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
-> "s3.access_key"= "minioadmin",
-> "s3.secret_key" = "minioadmin",
-> "format" = "csv_with_names",
-> "use_path_style" = "true") order by id;
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+
同样配合desc function S3()
可查看 table schema
MySQL [(none)]> Desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
-> "s3.access_key"= "minioadmin",
-> "s3.secret_key" = "minioadmin",
-> "format" = "csv_with_names",
-> "use_path_style" = "true");
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| id | TEXT | Yes | false | NULL | NONE |
| name | TEXT | Yes | false | NULL | NONE |
| age | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+
csv_with_names_and_types foramt
csv_with_names_and_types
格式:目前暂不支持从 csv 文件中解析出 column type。使用该 format 时,S3 tvf 会解析文件的第一行作为 table schema 的列个数和列名,列类型则都设置为 String,同时将忽略该文件的第二行。
student_with_names_and_types.csv 文件内容为
id,name,age
INT,STRING,INT
1,ftw,12
2,zs,18
3,ww,20
使用 S3 tvf
MySQL [(none)]> select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
-> "s3.access_key"= "minioadmin",
-> "s3.secret_key" = "minioadmin",
-> "format" = "csv_with_names_and_types",
-> "use_path_style" = "true") order by id;
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+
同样配合desc function S3()
可查看 table schema
MySQL [(none)]> Desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
-> "s3.access_key"= "minioadmin",
-> "s3.secret_key" = "minioadmin",
-> "format" = "csv_with_names_and_types",
-> "use_path_style" = "true");
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| id | TEXT | Yes | false | NULL | NONE |
| name | TEXT | Yes | false | NULL | NONE |
| age | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+
json format
json
格式:json 格式涉及到较多的可选参数,各个参数的意义可以参考:Json Load。S3 tvf 查询 json 格式文件时根据 json_root
和 jsonpaths
参数定位到一个 json 对象,将该对象的中的key
作为 table schema 的列名,列类型都设置为 String。举例:
data.json 文件
[{"id":1, "name":"ftw", "age":18}]
[{"id":2, "name":"xxx", "age":17}]
[{"id":3, "name":"yyy", "age":19}]
使用 S3 tvf 查询
MySQL [(none)]> select * from s3(
"uri" = "http://127.0.0.1:9312/test2/data.json",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "json",
"strip_outer_array" = "true",
"read_json_by_line" = "true",
"use_path_style"="true");
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 18 |
| 2 | xxx | 17 |
| 3 | yyy | 19 |
+------+------+------+
MySQL [(none)]> select * from s3(
"uri" = "http://127.0.0.1:9312/test2/data.json",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "json",
"strip_outer_array" = "true",
"jsonpaths" = "[\"$.id\", \"$.age\"]",
"use_path_style"="true");
+------+------+
| id | age |
+------+------+
| 1 | 18 |
| 2 | 17 |
| 3 | 19 |
+------+------+
parquet format
parquet
格式:S3 tvf 支持从 parquet 文件中解析出 table schema 的列名、列类型。举例:
MySQL [(none)]> select * from s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "parquet",
"use_path_style"="true") limit 5;
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
MySQL [(none)]> desc function s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "parquet",
"use_path_style"="true");
+---------------+--------------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-------+---------+-------+
| p_partkey | INT | Yes | false | NULL | NONE |
| p_name | TEXT | Yes | false | NULL | NONE |
| p_mfgr | TEXT | Yes | false | NULL | NONE |
| p_brand | TEXT | Yes | false | NULL | NONE |
| p_type | TEXT | Yes | false | NULL | NONE |
| p_size | INT | Yes | false | NULL | NONE |
| p_container | TEXT | Yes | false | NULL | NONE |
| p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE |
| p_comment | TEXT | Yes | false | NULL | NONE |
+---------------+--------------+------+-------+---------+-------+
orc format
orc
格式:和parquet
format 使用方法一致,将format
参数设置为 orc。
MySQL [(none)]> select * from s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.orc",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "orc",
"use_path_style"="true") limit 5;
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
avro format
avro
格式:S3 tvf 支持从 avro 文件中解析出 table schema 的列名、列类型。举例:
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/person.avro",
"ACCESS_KEY" = "ak",
"SECRET_KEY" = "sk",
"FORMAT" = "avro");
+--------+--------------+-------------+-----------------+
| name | boolean_type | double_type | long_type |
+--------+--------------+-------------+-----------------+
| Alyssa | 1 | 10.0012 | 100000000221133 |
| Ben | 0 | 5555.999 | 4009990000 |
| lisi | 0 | 5992225.999 | 9099933330 |
+--------+--------------+-------------+-----------------+
uri 包含通配符
uri 可以使用通配符来读取多个文件。注意:如果使用通配符要保证各个文件的格式是一致的 (尤其是 csv/csv_with_names/csv_with_names_and_types 算做不同的格式),S3 tvf 用第一个文件来解析出 table schema。 如下两个 csv 文件:
// file1.csv
1,aaa,18
2,qqq,20
3,qwe,19
// file2.csv
5,cyx,19
6,ftw,21
可以在 uri 上使用通配符来导入。
MySQL [(none)]> select * from s3(
"uri" = "http://127.0.0.1:9312/test2/file*.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "csv",
"use_path_style"="true");
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| 1 | aaa | 18 |
| 2 | qqq | 20 |
| 3 | qwe | 19 |
| 5 | cyx | 19 |
| 6 | ftw | 21 |
+------+------+------+
配合 insert into
和 cast
使用 S3
tvf
// 创建doris内部表
CREATE TABLE IF NOT EXISTS ${testTable}
(
id int,
name varchar(50),
age int
)
COMMENT "my first table"
DISTRIBUTED BY HASH(id) BUCKETS 32
PROPERTIES("replication_num" = "1");
// 使用S3插入数据
insert into ${testTable} (id,name,age)
select cast (id as INT) as id, name, cast (age as INT) as age
from s3(
"uri" = "${uri}",
"s3.access_key"= "${ak}",
"s3.secret_key" = "${sk}",
"format" = "${format}",
"strip_outer_array" = "true",
"read_json_by_line" = "true",
"use_path_style" = "true");