
Does insert operation also create index after write? #17767

@bithw1

Hi,

I am using Hudi 0.15.0 with the following SQL:

set hoodie.spark.sql.insert.into.operation=insert;
set hoodie.datasource.write.insert.drop.duplicates=false;
set hoodie.datasource.write.insert.dup.policy=none;
set hoodie.combine.before.insert=false;

CREATE TABLE IF NOT EXISTS hudi_cow_20260102_06 (
  a INT,
  b INT,
  c INT
)
USING hudi
TBLPROPERTIES (
type='cow',
primaryKey='a',
hoodie.datasource.write.precombine.field='c',
hoodie.index.type='BLOOM',
hoodie.index.bloom.num_entries='20',
hoodie.bloom.index.filter.dynamic.max.entries='25'
);


insert into hudi_cow_20260102_06(a,b,c) values(1,2,3),(1,4,7),(1,3,6);

After the insert, hudi_cow_20260102_06 contains 3 records, and all three have the record key 1.

When I look at the Parquet footer, I see that a bloom filter has been written there (hoodie_bloom_filter_type_code: DYNAMIC_V0), which I take to mean that the bloom index was created.

I had expected that no index would be created after an insert that allows duplicates, because the insert may contain duplicate keys: if two records have the same record key, which record would the index refer to?
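
To make the question concrete, here is a minimal sketch of a generic Bloom filter in Python. It is an illustration only, not Hudi's implementation; the class `TinyBloomFilter` and its parameters `num_bits` and `num_hashes` are hypothetical names chosen for this example. It shows that a Bloom filter records which keys may be present rather than pointing at one particular record, so adding key `1` three times sets the same bits as adding it once. Whether Hudi writes the footer filter on the insert path for this reason is an assumption that would need confirming against the Hudi source.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter (hypothetical; not Hudi's implementation)."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        # Setting a bit that is already set changes nothing.
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Add the record key "1" once.
once = TinyBloomFilter()
once.add("1")

# Add the same record key three times, as in the insert above.
thrice = TinyBloomFilter()
for _ in range(3):
    thrice.add("1")

same_bits = once.bits == thrice.bits
print(same_bits)
print(thrice.might_contain("1"))
```

In this sketch the filter answers only "might this file contain key X?", so duplicate keys within a file do not make the answer ambiguous.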

I inspected the footer with the following command:


[hadoop@hadoop ~]$ hadoop jar software/parquet-cli-1.14.1-runtime.jar meta hdfs:///user/hive/warehouse/hudi_cow_20260102_06/e6ffba4b-0b30-4333-a851-7bef6dcc9cb0-0_0-612-617_20260102160515023.parquet

File path:  hdfs:///user/hive/warehouse/hudi_cow_20260102_06/e6ffba4b-0b30-4333-a851-7bef6dcc9cb0-0_0-612-617_20260102160515023.parquet
Created by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
Properties:
  hoodie_bloom_filter_type_code: DYNAMIC_V0
    org.apache.hudi.bloomfilter: /////wAAAB4BAAADXwAAABQAAAADAAAAAf////8AAAAeAQAAA18AAAAAEAAAAEAAAAQAEIAAAAAAAQAAAAIAAAAAIAAAAAAIAAAIAAAAAAAAAAAAyAAAgAIAAAAAAAAAAIAAABBABAAAAAAAAAQAIAAQAAAAAABAABAAAAAAAAAAAIAAAAAAAAAAAAEAAggAEAA=
          hoodie_min_record_key: 1
            parquet.avro.schema: {"type":"record","name":"hudi_cow_20260102_06_record","namespace":"hoodie.hudi_cow_20260102_06","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"a","type":["null","int"],"default":null},{"name":"b","type":["null","int"],"default":null},{"name":"c","type":["null","int"],"default":null}]}
              writer.model.name: avro
          hoodie_max_record_key: 1
Schema:
message hoodie.hudi_cow_20260102_06.hudi_cow_20260102_06_record {
  optional binary _hoodie_commit_time (STRING);
  optional binary _hoodie_commit_seqno (STRING);
  optional binary _hoodie_record_key (STRING);
  optional binary _hoodie_partition_path (STRING);
  optional binary _hoodie_file_name (STRING);
  optional int32 a;
  optional int32 b;
  optional int32 c;
}


Row group 0:  count: 3  248.33 B records  start: 4  total(compressed): 745 B total(uncompressed):551 B
--------------------------------------------------------------------------------
                        type      encodings count     avg size   nulls   min / max
_hoodie_commit_time     BINARY    G _ R     3         37.00 B    0       "20260102160515023" / "20260102160515023"
_hoodie_commit_seqno    BINARY    G   _     3         26.67 B    0       "20260102160515023_0_0" / "20260102160515023_0_2"
_hoodie_record_key      BINARY    G _ R     3         31.67 B    0       "1" / "1"
_hoodie_partition_path  BINARY    G _ R     3         31.33 B    0       "" / ""
_hoodie_file_name       BINARY    G _ R     3         53.00 B    0       "e6ffba4b-0b30-4333-a851-7..." / "e6ffba4b-0b30-4333-a851-7..."
a                       INT32     G _ R     3         31.33 B    0       "1" / "1"
b                       INT32     G   _     3         18.67 B    0       "2" / "4"
c                       INT32     G   _     3         18.67 B    0       "3" / "7"
