[Bug]: Data lost after optimizing in Mixed format #2253

jefyjiang · 2023-11-06T07:52:56Z

What happened?

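I inserted data with Spark 3.3.3 into an Iceberg 1.3.1 table managed by Amoro 0.5.1. The total data volume was 11,710 rows. Right after the insert finished, the table's row count was correct. After a single MINOR optimizing run, querying the table again returned only 7,271 rows. I have confirmed that the data IDs are not duplicated. The relevant information follows: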

Table creation statement

CREATE TABLE jid.dwd.pat_main1 (
    pid string COMMENT 'ID',
    ct timestamp COMMENT 'creation time',
    an string,
    pn string,
    ad date,
    pd date,
    db_name string,
    primary key (pid)
) USING arctic
PARTITIONED BY (db_name, bucket(8, pid))
TBLPROPERTIES (
    'format-version' = '2',
    'write.metadata.previous-versions-max' = '5',
    'write.metadata.delete-after-commit.enabled' = 'true',
    'write.upsert.enabled' = 'true',
    'self-optimizing.enabled' = 'true',
    'change.data.ttl.minutes' = '20',
    'snapshot.change.keep.minutes' = '20',
    'snapshot.base.keep.minutes' = '10',
    'table-expire.enabled' = 'true',
    'self-optimizing.max-file-count' = '1000000',
    'clean-orphan-file.min-existing-time-minutes' = '15',
    'self-optimizing.group' = 'amoro_e_flink',
    'clean-orphan-file.enabled' = 'true'
);

Querying the table and its corresponding change store gives the following results:

Looking at the corresponding HDFS data, I found that two partitions are missing from the base store.
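For anyone trying to confirm the same symptom, the checks described above could be run as Spark SQL roughly like this. It is only a sketch: it reuses the table name from the DDL above, and it assumes the Mixed-format Spark connector exposes the change store through a `.change` suffix on the table identifier, which should be verified against the Amoro version in use. Running the per-partition counts before and after an optimizing run should show which `db_name` partitions lose rows.

```sql
-- Total rows versus distinct primary keys. If pid is unique,
-- both numbers are equal, so a drop in total_rows after optimizing
-- cannot be explained by upsert deduplication.
SELECT count(*) AS total_rows,
       count(DISTINCT pid) AS distinct_pids
FROM jid.dwd.pat_main1;

-- Row count per db_name partition value in the merged table view.
-- Compare the output before and after the MINOR optimizing run.
SELECT db_name, count(*) AS row_count
FROM jid.dwd.pat_main1
GROUP BY db_name
ORDER BY db_name;

-- Same grouping against the change store.
-- ASSUMPTION: the `.change` suffix is supported by the connector in use.
SELECT db_name, count(*) AS row_count
FROM jid.dwd.pat_main1.change
GROUP BY db_name
ORDER BY db_name;
```

A partition that appears in the change store output but not in the merged table output after optimizing would be consistent with the missing base partitions seen on HDFS.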

Affects Versions

Amoro 0.5.1

What engines are you seeing the problem on?

Spark

How to reproduce

No response

Relevant log output

No response

Anything else

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

shidayang · 2023-11-06T07:55:01Z

Thank you for reporting this bug. By the way, the Amoro community encourages users to communicate in English.

shidayang · 2023-11-06T07:56:08Z

As far as I can tell, this is a Mixed-format table?

jefyjiang added the type:bug Something isn't working label Nov 6, 2023

wangtaohz mentioned this issue Nov 6, 2023

[AMORO-2253] Mixed Format optimized-sequence should not contains the skipped partitions #2249

Merged

3 tasks

shidayang closed this as completed in #2249 Nov 6, 2023

shidayang changed the title ~~[Bug]: Data lost after optimizing~~ [Bug]: Data lost after optimizing in Mixed format Nov 6, 2023

zhoujinsong mentioned this issue Dec 19, 2023

Release-0.6.1 roadmap #2448

Closed

33 tasks
