[Bug] V2.1.6 Iceberg dangling deletes affect row-count statistics #42240

@liuchunhua

Search before asking

  • I have searched the issues and found no similar issues.

Version

v2.1.6

What's Wrong?

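Iceberg dangling deletes affect row-count statistics: in the test case below, count(*) returns 1 while count(id) and select * both return 2 rows.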

test case: doris/samples/datalake/iceberg_and_paimon

bash start_all.sh
bash start_doris_client.sh

spark:

> select version();
3.5.1 fd86f85e181fc2dc0f50a096855acf83a6cc5d9c

CREATE TABLE demo.db_iceberg.tb_iceberg (
  id BIGINT NOT NULL,
  val STRING)
USING iceberg
LOCATION 's3://warehouse/wh/db_iceberg/tb_iceberg'
TBLPROPERTIES (
  'current-snapshot-id' = '2047510404873857005',
  'format' = 'iceberg/parquet',
  'format-version' = '2',
  'identifier-fields' = '[id]',
  'upsert-enabled' = 'true',
  'write.delete.mode' = 'merge-on-read',
  'write.parquet.compression-codec' = 'zstd',
  'write.update.mode' = 'merge-on-read',
  'write.upsert.enabled' = 'true');


insert into demo.db_iceberg.tb_iceberg values(1, 'abd');
update demo.db_iceberg.tb_iceberg set val = 'def' where id = 1;
update demo.db_iceberg.tb_iceberg set val = 'hgk' where id = 1;
call demo.system.rewrite_data_files(table => 'demo.db_iceberg.tb_iceberg', options => map('min-input-files', '1'));
call demo.system.expire_snapshots(table => 'demo.db_iceberg.tb_iceberg', older_than => timestamp'2024-10-22 12:41:00');
insert into demo.db_iceberg.tb_iceberg values(2, 'abd');
~/mc ls minio/warehouse/wh/db_iceberg/tb_iceberg/data/
[2024-10-22 12:38:36 CST] 1.4KiB STANDARD 00000-4-c401aec0-dab0-4476-b99e-c67022be3505-00001-deletes.parquet
[2024-10-22 12:42:41 CST]   637B STANDARD 00000-624-9bb2caa4-0c97-4588-8f6b-68b72f970905-0-00001.parquet
[2024-10-22 12:40:03 CST]   646B STANDARD 00000-7-d78a7a7d-a615-429b-b437-31c66d6a00b0-0-00001.parquet
D select * from read_parquet('s3://warehouse/wh/db_iceberg/tb_iceberg/data/00000-624-9bb2caa4-0c97-4588-8f6b-68b72f970905-0-00001.parquet');
┌───────┬─────────┐
│  id   │   val   │
│ int64 │ varchar │
├───────┼─────────┤
│     2 │ abd     │
└───────┴─────────┘
D select * from read_parquet('s3://warehouse/wh/db_iceberg/tb_iceberg/data/00000-4-c401aec0-dab0-4476-b99e-c67022be3505-00001-deletes.parquet');
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────┐
│                                                file_path                                                │  pos  │
│                                                 varchar                                                 │ int64 │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ s3://warehouse/wh/db_iceberg/tb_iceberg/data/00000-2-38e5f4da-8e99-43a2-ba15-f648adc6483b-00001.parquet │     0 │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────┘
D select * from read_parquet('s3://warehouse/wh/db_iceberg/tb_iceberg/data/00000-7-d78a7a7d-a615-429b-b437-31c66d6a00b0-0-00001.parquet');
┌───────┬─────────┐
│  id   │   val   │
│ int64 │ varchar │
├───────┼─────────┤
│     1 │ hgk     │
└───────┴─────────┘
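
The listing and the DuckDB output above show why this delete file is dangling: its only row points at `00000-2-38e5f4da-8e99-43a2-ba15-f648adc6483b-00001.parquet`, which no longer appears in the data directory. A minimal Python sketch of that check follows, using only the file names from the output above. The function name `find_dangling_deletes` is hypothetical and is not part of Iceberg or Doris, and "live" here means "present in the `mc ls` listing", which is an assumption; a real check would compare against the data files referenced by the current snapshot.

```python
PREFIX = "s3://warehouse/wh/db_iceberg/tb_iceberg/data/"

# Data files still present in the data directory (taken from the `mc ls` listing).
live_data_files = {
    PREFIX + "00000-624-9bb2caa4-0c97-4588-8f6b-68b72f970905-0-00001.parquet",
    PREFIX + "00000-7-d78a7a7d-a615-429b-b437-31c66d6a00b0-0-00001.parquet",
}

# (file_path, pos) rows read from the position delete file.
position_deletes = [
    (PREFIX + "00000-2-38e5f4da-8e99-43a2-ba15-f648adc6483b-00001.parquet", 0),
]


def find_dangling_deletes(deletes, live_files):
    """Return the delete rows whose target data file is no longer live."""
    return [(path, pos) for path, pos in deletes if path not in live_files]


dangling = find_dangling_deletes(position_deletes, live_data_files)
print(len(dangling), "dangling delete row(s)")
```

Every row in the delete file is dangling in this case, so the file removes nothing from the current data.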

doris:

mysql> select version from backends();
+-----------------------------+
| version                     |
+-----------------------------+
| doris-2.1.6-rc04-653e315ba5 |
+-----------------------------+

mysql> select count(id) from iceberg.db_iceberg.tb_iceberg;
+-----------+
| count(id) |
+-----------+
|         2 |
+-----------+
1 row in set (0.10 sec)

mysql> select count(*) from iceberg.db_iceberg.tb_iceberg;  -- wrong
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.07 sec)

mysql> select * from iceberg.db_iceberg.tb_iceberg;
+------+------+
| id   | val  |
+------+------+
|    1 | hgk  |
|    2 | abd  |
+------+------+
2 rows in set (0.06 sec)
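
One explanation consistent with these three results, offered as an assumption rather than a confirmed diagnosis: count(*) is answered from the snapshot summary counters (`total-records` minus `total-position-deletes`) without reading files, while count(id) and select * scan the data files and apply only the deletes that match them. The sketch below models both paths. The summary values are hypothetical, chosen to match the output above, because the report does not include the real snapshot summary; the file names are shortened placeholders.

```python
def count_from_summary(summary):
    """Row count derived from snapshot summary counters alone (assumed shortcut)."""
    return int(summary["total-records"]) - int(summary["total-position-deletes"])


def count_from_scan(rows_per_data_file, position_deletes):
    """Row count from a scan: a delete only applies if its data file exists."""
    applied = {
        (path, pos)
        for path, pos in position_deletes
        if path in rows_per_data_file and pos < rows_per_data_file[path]
    }
    return sum(rows_per_data_file.values()) - len(applied)


# Hypothetical summary values, consistent with the results in this report.
summary = {"total-records": "2", "total-position-deletes": "1"}

# Two live data files with one row each; the delete targets a file that is gone.
rows_per_data_file = {"00000-624-...parquet": 1, "00000-7-...parquet": 1}
position_deletes = [("00000-2-...parquet", 0)]

summary_count = count_from_summary(summary)
scan_count = count_from_scan(rows_per_data_file, position_deletes)
print(summary_count, scan_count)
```

Under these assumptions the dangling delete is subtracted by the summary path but ignored by the scan path, which would explain the mismatch between 1 and 2.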

Cleaning up with rewrite_position_delete_files:
spark:

spark-sql ()> CALL demo.system.rewrite_position_delete_files(table => 'db_iceberg.tb_iceberg', options => map('rewrite-all', 'true'));
1       0       1440    0
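
The procedure prints its result as a bare row of numbers. The sketch below labels them, assuming the output columns are, in order, `rewritten_delete_files_count`, `added_delete_files_count`, `rewritten_bytes_count`, and `added_bytes_count`; that ordering is an assumption to verify against the Iceberg Spark procedures documentation. `RewriteResult` and `parse_result` are hypothetical helper names.

```python
from typing import NamedTuple


class RewriteResult(NamedTuple):
    # Assumed column order of the rewrite_position_delete_files output.
    rewritten_delete_files_count: int
    added_delete_files_count: int
    rewritten_bytes_count: int
    added_bytes_count: int


def parse_result(row: str) -> RewriteResult:
    """Split the whitespace-separated output row into named integer fields."""
    return RewriteResult(*(int(value) for value in row.split()))


result = parse_result("1       0       1440    0")
print(result)
```

Read this way, one delete file of 1440 bytes was rewritten and no new delete file was written. That is consistent with the 1.4KiB deletes file in the earlier listing containing only dangling rows.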

doris:

mysql> refresh table iceberg.db_iceberg.tb_iceberg;
Query OK, 0 rows affected (0.01 sec)

mysql> select count(*) from iceberg.db_iceberg.tb_iceberg; -- right
+----------+
| count(*) |
+----------+
|        2 |
+----------+
1 row in set (0.10 sec)

mysql> select * from iceberg.db_iceberg.tb_iceberg;
+------+------+
| id   | val  |
+------+------+
|    2 | abd  |
|    1 | hgk  |
+------+------+
2 rows in set (0.06 sec)

What You Expected?

Doris should handle Iceberg dangling deletes correctly, so that count(*) matches the rows actually returned.

How to Reproduce?

No response

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
