Hi,
I am using Hudi 0.15.0 and run the following SQL:
set hoodie.spark.sql.insert.into.operation=insert;
set hoodie.datasource.write.insert.drop.duplicates=false;
set hoodie.datasource.write.insert.dup.policy=none;
set hoodie.combine.before.insert=false;
CREATE TABLE IF NOT EXISTS hudi_cow_20260102_06 (
a INT,
b INT,
c INT
)
USING hudi
tblproperties(
type='cow',
primaryKey='a',
hoodie.datasource.write.precombine.field='c',
hoodie.index.type='BLOOM',
hoodie.index.bloom.num_entries='20',
hoodie.bloom.index.filter.dynamic.max.entries='25'
);
insert into hudi_cow_20260102_06(a,b,c) values(1,2,3),(1,4,7),(1,3,6);
There are 3 records in hudi_cow_20260102_06 after the insertion (all three records have record key 1).
When I look at the Parquet footer, I see that a bloom filter was written (hoodie_bloom_filter_type_code: DYNAMIC_V0 in the footer), which means the bloom index was created.
I had expected that no index would be created for an insert that allows duplicates, because the insert may produce duplicate keys: if two records have the same record key, which record would the index refer to?
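For context on why duplicates may not be a problem: a bloom filter only answers "might this key be in this file?" and never points at an individual record, so inserting the same record key several times just sets the same bits again. Below is a minimal illustrative sketch (my own toy code, not Hudi's implementation; class and parameter names are made up):

```python
# Toy bloom filter to illustrate membership-only semantics.
# Adding the same key multiple times is a no-op after the first add,
# so duplicate record keys cannot "confuse" the filter.
import hashlib

class TinyBloom:
    def __init__(self, num_bits=128, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector stored as a Python int

    def _positions(self, key: str):
        # Derive num_hashes deterministic bit positions from the key.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = TinyBloom()
for record_key in ["1", "1", "1"]:  # three rows, same record key, as in the insert above
    bf.add(record_key)

print(bf.might_contain("1"))    # True: key "1" may be present in the file
print(bf.might_contain("999"))  # almost certainly False (small false-positive chance)
```

So the filter stays valid regardless of how many rows share key "1"; it only narrows down which files might contain a given key.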
I am using the following command to inspect the Parquet footer:
[hadoop@hadoop ~]$ hadoop jar software/parquet-cli-1.14.1-runtime.jar meta hdfs:///user/hive/warehouse/hudi_cow_20260102_06/e6ffba4b-0b30-4333-a851-7bef6dcc9cb0-0_0-612-617_20260102160515023.parquet
File path: hdfs:///user/hive/warehouse/hudi_cow_20260102_06/e6ffba4b-0b30-4333-a851-7bef6dcc9cb0-0_0-612-617_20260102160515023.parquet
Created by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
Properties:
hoodie_bloom_filter_type_code: DYNAMIC_V0
org.apache.hudi.bloomfilter: /////wAAAB4BAAADXwAAABQAAAADAAAAAf////8AAAAeAQAAA18AAAAAEAAAAEAAAAQAEIAAAAAAAQAAAAIAAAAAIAAAAAAIAAAIAAAAAAAAAAAAyAAAgAIAAAAAAAAAAIAAABBABAAAAAAAAAQAIAAQAAAAAABAABAAAAAAAAAAAIAAAAAAAAAAAAEAAggAEAA=
hoodie_min_record_key: 1
parquet.avro.schema: {"type":"record","name":"hudi_cow_20260102_06_record","namespace":"hoodie.hudi_cow_20260102_06","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"a","type":["null","int"],"default":null},{"name":"b","type":["null","int"],"default":null},{"name":"c","type":["null","int"],"default":null}]}
writer.model.name: avro
hoodie_max_record_key: 1
Schema:
message hoodie.hudi_cow_20260102_06.hudi_cow_20260102_06_record {
optional binary _hoodie_commit_time (STRING);
optional binary _hoodie_commit_seqno (STRING);
optional binary _hoodie_record_key (STRING);
optional binary _hoodie_partition_path (STRING);
optional binary _hoodie_file_name (STRING);
optional int32 a;
optional int32 b;
optional int32 c;
}
Row group 0: count: 3 248.33 B records start: 4 total(compressed): 745 B total(uncompressed):551 B
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_hoodie_commit_time BINARY G _ R 3 37.00 B 0 "20260102160515023" / "20260102160515023"
_hoodie_commit_seqno BINARY G _ 3 26.67 B 0 "20260102160515023_0_0" / "20260102160515023_0_2"
_hoodie_record_key BINARY G _ R 3 31.67 B 0 "1" / "1"
_hoodie_partition_path BINARY G _ R 3 31.33 B 0 "" / ""
_hoodie_file_name BINARY G _ R 3 53.00 B 0 "e6ffba4b-0b30-4333-a851-7..." / "e6ffba4b-0b30-4333-a851-7..."
a INT32 G _ R 3 31.33 B 0 "1" / "1"
b INT32 G _ 3 18.67 B 0 "2" / "4"
c INT32 G _ 3 18.67 B 0 "3" / "7"
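My understanding (a hedged sketch, not Hudi's actual code; all names here are hypothetical) is that the footer's min/max record keys and the bloom filter are used only for file-level pruning during index lookup, and the real matching rows are then found by reading the candidate files, so all rows sharing key "1" are returned:

```python
# Hypothetical file-level lookup: prune by key range, then by bloom filter,
# then read the surviving files to find the actual matching rows.

class SetBloom:
    """Stand-in for a real bloom filter in this sketch (a set has no false positives)."""
    def __init__(self, keys):
        self.keys = set(keys)

    def might_contain(self, key):
        return key in self.keys

def candidate_files(key, files):
    """files: list of (path, min_key, max_key, bloom) tuples from Parquet footers."""
    out = []
    for path, min_key, max_key, bloom in files:
        if min_key <= key <= max_key and bloom.might_contain(key):
            out.append(path)
    return out

def lookup(key, files, read_rows):
    """read_rows(path) yields (record_key, row) pairs; duplicates are all returned."""
    matches = []
    for path in candidate_files(key, files):
        matches.extend(row for record_key, row in read_rows(path) if record_key == key)
    return matches

# Demo mirroring the table above: one file, three rows, all with record key "1".
files = [("file1.parquet", "1", "1", SetBloom({"1"}))]

def read_rows(path):
    return [("1", (1, 2, 3)), ("1", (1, 4, 7)), ("1", (1, 3, 6))]

print(len(lookup("1", files, read_rows)))  # 3: every duplicate row is found
print(candidate_files("2", files))         # []: pruned by the min/max key range
```

Under this reading, the index never has to pick one record among duplicates; it maps a key to candidate files, and the file scan resolves the rows.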