Skip to main content

Bitmap/HLL data format

Bitmap format

Format description

The bitmap in Doris uses roaring bitmap storage, and the be side uses CRoaring. The serialization format of Roaring is compatible in languages ​​such as C++/java/go, while the serialization result of the format of C++ Roaring64Map is the same as that of Roaring64NavigableMap in Java. Not compatible. There are 5 types of Doris bimap, each of which is represented by one byte

The bitmap serialization format in Doris is explained as follows:

 | flag     | data .....|
<--1Byte--><--n bytes-->

The Flag value description is as follows:

ValueDescription
0EMPTY, empty bitmap, the following data part is empty, the whole serialization result is only one byte
1SINGLE32, there is only one 32-bit unsigned integer value in the bitmap, and the next 4 bytes represent the 32-bit unsigned integer value
2BITMAP32, 32-bit bitmap corresponds to the type org.roaringbitmap.RoaringBitmap in java. The type is roaring::Roaring in C++, and the following data is the structure after the sequence of roaring::Roaring. You can use org in java. .roaringbitmap.RoaringBitmap or roaring::Roaring in c++ directly deserialize
3SINGLE64, there is only one 64-bit unsigned integer value in the bitmap, and the next 8 bytes represent the 64-bit unsigned integer value
4BITMAP64, 64-bit bitmap corresponds to org.roaringbitmap.RoaringBitmap in java; Roaring64Map in doris in c++. The data structure is the same as the result in the roaring library, but the serialization and deserialization methods It is different, there will be 1-8 bytes of variable-length encoding uint64 in the bitmap representation of the size. The following data is a series of multiple high-order representations of 4 bytes and 32-bit roaring bitmap serialized data repeated
5SET, When the number of values in the bitmap is greater than one and less than 32, the BitmapValue actually uses a hash set to store the values. The data structure: A uint8_t of one byte represents the number of values, followed by the values themselves (8 bytes each, uint64_t). |

C++ serialization and deserialization examples are in the BitmapValue::write() method in be/src/util/bitmap_value.h and the Java examples are in the serialize() deserialize() method in fe/fe-common/src/main/java/org/apache/doris/common/io/BitmapValue.java.

HLL format description

Serialized data in HLL format is implemented in Doris itself. Similar to the Bitmap type, the HLL format is composed of a 1-byte flag followed by multiple bytes of data. The meaning of the flag is as follows

ValueDescription
0HLL_DATA_EMPTY, empty HLL, the following data part is empty, the entire serialization result is only one byte
1HLL_DATA_EXPLICIT, the next byte is explicit The number of data blocks, followed by multiple data blocks, each data block is composed of 8 bytes in length and data,
2HLL_DATA_SPARSE, only non-zero values are stored, the next 4 bytes indicate the number of registers, and there are multiple register structures in the back. Each register is composed of the index of the first 2 bytes and the value of 1 byte
3HLL_DATA_FULL, which means that all 16 * 1024 registers have values, followed by 16 * 1024 bytes of value data

C++ serialization and deserialization examples are in the serialize() deserialize() method of be/src/olap/hll.h, and the Java examples are in the serialize() deserialize() method in fe/fe-common/src/main/java/org/apache/doris/common/io/hll.java.