Skip to main content
Skip to main content

Approximate Deduplication Using HLL

HLL approximate deduplication

In actual business scenarios, with the increasing amount of business data, the pressure on data deduplication is also increasing. When the data reaches a certain scale, the cost of using accurate deduplication is also increasing. If it is acceptable, it is a very good way to achieve fast deduplication and reduce computational pressure through approximate algorithms. This article mainly introduces HyperLogLog (HLL for short) provided by Doris as an approximate deduplication algorithm.

The characteristic of HLL is that it has excellent space complexity O(mloglogn), time complexity is O(n), and the error of the calculation result can be controlled at about 1%-2%. The error is related to the size of the data set and the ha related to the Hierarchy function.

What is HyperLogLog

It is an upgraded version of the LogLog algorithm, and its role is to provide imprecise deduplication counts. Its mathematical basis is the Bernoulli test.

Assuming that the coin has both heads and tails, the probability of the coin being flipped up and down is 50%. Keep flipping the coin until it comes up heads, which we record as a full trial.

Then for multiple Bernoulli trials, assume that the number of times is n. It means that there are n times of heads. Suppose the number of tosses experienced per Bernoulli trial is k. For the first Bernoulli trial, the number of times is set to k1, and so on, the nth time corresponds to kn.

Among them, for these n Bernoulli trials, there must be a maximum number of tosses k. For example, after 12 tosses, a head will appear, then this is called k_max, which represents the maximum number of tosses.

Bernoulli's experiment can easily draw the following conclusions:

  • No number of throws for n Bernoulli processes is greater than k_max.
  • n Bernoulli processes with at least one throw equal to k_max

Finally, combined with the method of maximum likelihood estimation, it is found that there is an estimated correlation between n and k_max: n = 2^k_max. When we only record k_max, we can estimate how many pieces of data there are in total, that is, the cardinality.

Suppose the test results are as follows:

  • 1st trial: 3 tosses before it turns heads, at this time k=3, n=1
  • 2nd trial: Heads appear after 2 tosses, at this time k=2, n=2
  • The 3rd trial: 6 tosses before the heads appear, at this time k=6, n=3
  • The nth trial: it took 12 tosses to get heads, at this point we estimate, n = 2^12

Take the first three groups of experiments in the above example, then k_max = 6, and finally n=3, we put it into the estimation formula, obviously: 3 ≠ 2^6 . That is to say, when the number of trials is small, the error of this estimation method is very large.

These three sets of trials, we call one round of estimation. If only one round is performed, when n is large enough, the estimated error rate will be relatively reduced, but still not small enough.

Doris HLL Function

HLL is an engineering implementation based on the HyperLogLog algorithm. It is used to save the intermediate results of the HyperLogLog calculation process. It can only be used as the value column type of the table to continuously reduce the amount of data through aggregation.

To achieve the purpose of speeding up the query, based on it is an estimated result, the error is about 1%, the hll column is generated by other columns or the data in the imported data, and the hll_hash function is used when importing

To specify which column in the data is used to generate the hll column, it is often used to replace count distinct, and is used to quickly calculate uv in business by combining rollup

HLL_UNION_AGG(hll)

This function is an aggregate function that computes a cardinality estimate for all data that satisfies the condition.

HLL_CARDINALITY(hll)

This function is used to calculate the cardinality estimate for a single hll column

HLL_HASH(column_name)

Generate HLL column type for insert or import, see related instructions for import usage

How to use Doris HLL

  1. When using HLL to deduplicate, you need to set the target column type to HLL and the aggregate function to HLL_UNION in the table creation statement
  2. Columns of HLL type cannot be used as Key columns
  3. The user does not need to specify the length and default value, the length is controlled within the system according to the degree of data aggregation

Create a table with hll column

create table test_hll(
dt date,
id int,
name char(10),
province char(10),
os char(10),
pv hll hll_union
)
Aggregate KEY (dt,id,name,province,os)
distributed by hash(id) buckets 10
PROPERTIES(
"replication_num" = "1",
"in_memory"="false"
);

Import Data

  1. Stream load Import

    curl --location-trusted -u root: -H "label:label_test_hll_load" \
    -H "column_separator:," \
    -H "columns:dt,id,name,province,os, pv=hll_hash(id)" -T test_hll.csv http://fe_IP:8030/api/demo/test_hll/_stream_load

    The sample data is as follows (test_hll.csv):

    2022-05-05,10001,Testing01,Beijing,Windows  
    2022-05-05,10002,Testing01,Beijing,Linux
    2022-05-05,10003,Testing01,Beijing,MacOS
    2022-05-05,10004,Testing01,Hebei,Windows
    2022-05-06,10001,Testing01,Shanghai,Windows
    2022-05-06,10002,Testing01,Shanghai,Linux
    2022-05-06,10003,Testing01,Jiangsu,MacOS
    2022-05-06,10004,Testing01,Shaanxi,Windows

    The import result is as follows:

    # curl --location-trusted -u root: -H "label:label_test_hll_load"     -H "column_separator:,"     -H "columns:dt,id,name,province,os, pv=hll_hash(id)" -T test_hll.csv http://127.0.0.1:8030/api/demo/test_hll/_stream_load

    {
    "TxnId": 693,
    "Label": "label_test_hll_load",
    "TwoPhaseCommit": "false",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 8,
    "NumberLoadedRows": 8,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 320,
    "LoadTimeMs": 23,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 1,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 9,
    "CommitAndPublishTimeMs": 11
    }
  2. Broker Load

LOAD LABEL demo.test_hlllabel
(
DATA INFILE("hdfs://hdfs_host:hdfs_port/user/doris_test_hll/data/input/file")
INTO TABLE `test_hll`
COLUMNS TERMINATED BY ","
(dt,id,name,province,os)
SET (
pv = HLL_HASH(id)
)
);

Query data

HLL columns do not allow direct query of the original value, but can only be queried through the HLL aggregate function.

  1. Find the total PV
mysql> select HLL_UNION_AGG(pv) from test_hll;
+---------------------+
| hll_union_agg(`pv`) |
+---------------------+
| 4 |
+---------------------+
1 row in set (0.00 sec)

Equivalent to:

mysql> SELECT COUNT(DISTINCT pv) FROM test_hll;
+----------------------+
| count(DISTINCT `pv`) |
+----------------------+
| 4 |
+----------------------+
1 row in set (0.01 sec)
  1. Find the PV of each day
mysql> select HLL_UNION_AGG(pv) from test_hll group by dt;
+---------------------+
| hll_union_agg(`pv`) |
+---------------------+
| 4 |
| 4 |
+---------------------+
2 rows in set (0.01 sec)