
bug: JDBC sink lost update #12026

Closed
chenzl25 opened this issue Sep 1, 2023 · 3 comments · Fixed by #12458
Labels: type/bug (Something isn't working)
Milestone: release-1.2

chenzl25 (Contributor) commented Sep 1, 2023

Describe the bug

In RisingWave:

create table jdbc_table (id int, v int);

CREATE SINK s_sink FROM jdbc_table WITH (
    connector='jdbc',
    jdbc.url='jdbc:mysql://xxxx',
    table.name='jdbc_table',
    primary_key='id',
    type='upsert'
);

insert into jdbc_table select i, i from generate_series(1, 10000) i;

-- Lost update: about 5000 rows go missing
update jdbc_table set id = id + 1;

Table definition in MySQL:

 CREATE TABLE `jdbc_table` (
  `id` int NOT NULL,
  `v` int DEFAULT NULL,
  PRIMARY KEY (`id`)
)

Even if we set streaming_parallelism = 1, we would still hit this issue, because the sink executor issues a delete + insert pair for each row individually, instead of first deleting all the before rows and then inserting all the after rows.
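
To make this concrete, here is a hypothetical interleaving (a sketch, not an actual log) of the statements the upsert JDBC sink could send to MySQL for two adjacent source rows, assuming it issues a delete by the sink primary key followed by an insert for each changed row:

-- `update jdbc_table set id = id + 1` turns row (1, 1) into (2, 1) and row (2, 2) into (3, 2)
DELETE FROM jdbc_table WHERE id = 1;   -- before-image of row (1, 1)
INSERT INTO jdbc_table VALUES (2, 1);  -- after-image (2, 1)
DELETE FROM jdbc_table WHERE id = 2;   -- before-image of row (2, 2): also wipes the (2, 1) just inserted
INSERT INTO jdbc_table VALUES (3, 2);  -- after-image (3, 2)

Whether the delete for id = 2 lands before or after the insert of (2, 1) depends on the arbitrary row order within the chunk, which is why roughly half of the rows end up missing.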

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

chenzl25 added the type/bug (Something isn't working) label on Sep 1, 2023
github-actions bot added this to the release-1.2 milestone on Sep 1, 2023
chenzl25 (Contributor, Author) commented Sep 1, 2023

Another case shows that even updating non-pk columns (from the MySQL sink table's perspective) can cause data loss.

create table jdbc_table (id int, v int);
create materialized view v as select distinct on(id, v) id, v from jdbc_table;


 CREATE SINK s_sink FROM v WITH (
    connector='jdbc',
    jdbc.url='jdbc:mysql://xxx',
    table.name='jdbc_table',
    primary_key='id',
    type='upsert'
);

insert into jdbc_table select i, i from generate_series(1, 10000) i;

-- Data loss
update jdbc_table set v = v + 1;

BugenZhao (Member) commented:

> Even if we set streaming_parallelism = 1, we would still hit this issue, because the sink executor issues a delete + insert pair for each row individually, instead of first deleting all the before rows and then inserting all the after rows.

Just FYI, we banned updates on primary key columns in #8569 as a precaution. However, since we allow users to specify the primary key columns for sinks, that does not cover the issue here.

tabVersion changed the title from "bug: JDBC sink lost upadte" to "bug: JDBC sink lost update" on Sep 4, 2023
st1page (Contributor) commented Sep 4, 2023

This is because the stream key is different from the user-defined primary key columns of the sink.

stream key: a,b 
sink pk: a

original:
(1,1) -> (1,2)
(1,2) -> (1,3)

mv fragment 1:
delete (1,1) 

mv fragment 2:
insert (1,2)
delete (1,2)

mv fragment 3:
insert (1,3)

merge to sink fragment:
insert (1,3)
insert (1,2)
delete (1,2)
delete (1,1) 

A solution is to do additional compaction in the sink executor per barrier (a sketch of the effect follows below):

  1. Compact all the changes with the stream key.
  2. Sink all the delete events first, and then sink all the insert events.
     Why is it correct to reorder the delete events in this second phase? Because after compacting on the stream key, any two events with the same user-defined sink pk must have different stream keys, so a delete event never targets a record inserted in the same barrier under our internal streaming SQL semantics.
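
As a minimal illustration of the two steps applied to the example above (the MySQL table t with primary key a is hypothetical), the merged events insert (1,3), insert (1,2), delete (1,2), delete (1,1) would be sunk as:

-- Hypothetical sink table: CREATE TABLE t (a INT PRIMARY KEY, b INT);
-- Step 1, compact on the stream key (a, b): insert (1,2) and delete (1,2) cancel out,
--         leaving only delete (1,1) and insert (1,3).
-- Step 2, sink all deletes first, then all inserts:
DELETE FROM t WHERE a = 1;            -- from delete (1,1)
INSERT INTO t (a, b) VALUES (1, 3);   -- from insert (1,3)
-- Final state of t: (1, 3), matching the expected (1,1) -> (1,2) -> (1,3).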
