Implement cataloger current diff using scanners #790

nopcoder · 2020-10-08T12:40:52Z

Use a DB entity scanner to go over parent/child branches and perform the diff operation.
The scanner interface operates as an iterator that scans a branch or a lineage.
The diff implementation uses the scanner to compare the two branches and write the diff result into a temporary table.
The iteration supports starting after a given 'path', and the diff loop can limit the output to X records to handle cases where we need pagination.
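
To illustrate the approach described above, here is a minimal Go sketch of a diff driven by two scanners, with an 'after' path and a limit. It is a simplification under stated assumptions, not the code from this PR: the `Scanner`, `Entry`, `Difference`, `sliceScanner`, and `Diff` names are hypothetical, entries come from sorted in-memory slices instead of the database, and lineage and commit handling are left out.

```go
package main

import "fmt"

// Entry is a hypothetical, minimal catalog entry.
type Entry struct {
	Path     string
	Checksum string
}

// Scanner is a hypothetical iterator over entries sorted by path.
type Scanner interface {
	Next() bool   // advances; returns false when exhausted
	Value() Entry // valid only after Next returned true
}

// sliceScanner scans a sorted slice, starting strictly after a given path.
type sliceScanner struct {
	entries []Entry
	pos     int
}

func newSliceScanner(entries []Entry, after string) *sliceScanner {
	i := 0
	for i < len(entries) && entries[i].Path <= after {
		i++
	}
	return &sliceScanner{entries: entries, pos: i - 1}
}

func (s *sliceScanner) Next() bool {
	s.pos++
	return s.pos < len(s.entries)
}

func (s *sliceScanner) Value() Entry { return s.entries[s.pos] }

// Difference describes how the child differs from the parent at a path.
type Difference struct {
	Type string // "added", "changed" or "removed"
	Path string
}

// Diff walks both scanners in path order. A limit <= 0 means no limit;
// hasMore reports whether at least one more difference exists.
func Diff(parent, child Scanner, limit int) (diffs []Difference, hasMore bool) {
	pOK, cOK := parent.Next(), child.Next()
	for pOK || cOK {
		var d *Difference
		switch {
		case !pOK || (cOK && child.Value().Path < parent.Value().Path):
			d = &Difference{Type: "added", Path: child.Value().Path}
			cOK = child.Next()
		case !cOK || parent.Value().Path < child.Value().Path:
			d = &Difference{Type: "removed", Path: parent.Value().Path}
			pOK = parent.Next()
		default: // same path on both sides
			if parent.Value().Checksum != child.Value().Checksum {
				d = &Difference{Type: "changed", Path: child.Value().Path}
			}
			pOK, cOK = parent.Next(), child.Next()
		}
		if d == nil {
			continue
		}
		if limit > 0 && len(diffs) == limit {
			return diffs, true // one extra difference found: more pages exist
		}
		diffs = append(diffs, *d)
	}
	return diffs, false
}

func main() {
	parent := []Entry{{Path: "a", Checksum: "1"}, {Path: "b", Checksum: "1"}, {Path: "c", Checksum: "1"}}
	child := []Entry{{Path: "a", Checksum: "1"}, {Path: "b", Checksum: "2"}, {Path: "d", Checksum: "1"}}

	// First page: at most two differences.
	page, more := Diff(newSliceScanner(parent, ""), newSliceScanner(child, ""), 2)
	fmt.Println(page, more)

	// Next page: restart both scanners after the last returned path.
	after := page[len(page)-1].Path
	page, more = Diff(newSliceScanner(parent, after), newSliceScanner(child, after), 2)
	fmt.Println(page, more)
}
```

Because both inputs are sorted by path, the last path of a page is enough to resume the scan, so no offset or cursor state has to be kept between calls.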

codecov-io · 2020-10-11T08:12:30Z

Codecov Report

Merging #790 into master will increase coverage by 0.18%.
The diff coverage is 78.11%.

@@            Coverage Diff             @@
##           master     #790      +/-   ##
==========================================
+ Coverage   42.92%   43.11%   +0.18%     
==========================================
  Files         135      136       +1     
  Lines       10571    10648      +77     
==========================================
+ Hits         4538     4591      +53     
- Misses       5443     5460      +17     
- Partials      590      597       +7

| Impacted Files | Coverage Δ | |
|---|---|---|
| catalog/diff.go | `18.51% <ø> (ø)` | |
| catalog/views.go | `97.01% <ø> (-1.65%)` | ⬇️ |
| catalog/cataloger_merge.go | `60.81% <66.66%> (-0.53%)` | ⬇️ |
| catalog/cataloger_diff.go | `66.66% <73.59%> (+9.43%)` | ⬆️ |
| catalog/db_lineage_scanner.go | `77.55% <77.55%> (ø)` | |
| catalog/db_branch_scanner.go | `90.32% <90.32%> (ø)` | |
| catalog/db_scanner.go | `100.00% <100.00%> (ø)` | |
| catalog/model.go | `76.92% <100.00%> (+15.38%)` | ⬆️ |
| catalog/cataloger_create_entry.go | `94.73% <0.00%> (-5.27%)` | ⬇️ |
| ... and 7 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1f3cd72...5f1f6e5.

arielshaqed

Just requests to clarify some oddities that make it hard for me to read. Nothing blocking.

catalog/cataloger_diff.go

catalog/db_scanner.go

arielshaqed

Cool, thanks!

You may want someone more qualified to approve too, I mostly used this review to clear up my ignorance

catalog/db_lineage_scanner_test.go

arielshaqed

Thanks!

We're storing CTIDs inside tables. This is very scary, because TFM warns against using them for anything long-term. I know that the current implementation only writes them to an effectively-temporary table so it might be OK, but we do need a tonne of documentation around this:

- All functions that might write a CTID to a table at some time during their execution should carry a warning, specifically when creating such a dangerous table.
- User-level documentation must warn never ever to run VACUUM FULL on the database if there is any chance of a concurrent merge. The fact that we would never do such a thing does not mean everyone follows the same ops guidelines, so we have to tell everyone they have to do it our way. E.g. user X might have their own small private lakeFS instance (we actually encourage this) and naïvely suppose that this is safe because it is fast. We should ensure that they know enough to be safe.
- If there is any chance of such tables surviving and being re-used, then we must document backup strategy etc.

Very sorry, but I am requesting changes with this as the primary blocker. It cannot be solved by code because it is not about code; it is about dangerous future code or dangerous current ops. I am treating the lack of warnings, to all devs calling these functions and to any users running VACUUM FULL, as a threat to user data integrity. So if it is not such a threat, please let me know and I will reconsider.

.all-contributorsrc

catalog/cataloger_diff.go

arielshaqed · 2020-10-18T06:14:25Z

catalog/cataloger_diff.go

+		}
+
+		diffRec := &diffResultRecord{
+			SourceBranch: parentID,


I don't understand: isn't this ID fixed on each record, and equal to a parameter that the caller passed in? (If so, can we just not include it?)

Right, it is a bug: it should take the branch ID from the parent entry's branch.
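
The fix discussed in this thread can be sketched as follows. This is only an illustration under assumptions: `lineageEntry`, `diffRecord`, and `toDiffRecords` are hypothetical names, and the sketch assumes that an entry read through the parent's lineage carries the ID of the branch that actually holds it, which may differ from the branch ID the caller passed in.

```go
package main

import "fmt"

// lineageEntry is a hypothetical entry read through the parent's lineage.
type lineageEntry struct {
	BranchID int64 // branch that actually holds this entry (assumed field)
	Path     string
}

// diffRecord is a simplified stand-in for the PR's diffResultRecord.
type diffRecord struct {
	SourceBranch int64
	Path         string
}

// toDiffRecords sets SourceBranch per entry instead of using one fixed
// branch ID for every record.
func toDiffRecords(entries []lineageEntry) []diffRecord {
	recs := make([]diffRecord, 0, len(entries))
	for _, e := range entries {
		recs = append(recs, diffRecord{SourceBranch: e.BranchID, Path: e.Path})
	}
	return recs
}

func main() {
	entries := []lineageEntry{{BranchID: 1, Path: "a"}, {BranchID: 2, Path: "b"}}
	for _, r := range toDiffRecords(entries) {
		fmt.Printf("%s comes from branch %d\n", r.Path, r.SourceBranch)
	}
}
```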

arielshaqed · 2020-10-18T06:17:44Z

catalog/cataloger_diff.go

+	}
+	ins := psql.Insert(tableName).Columns("source_branch", "diff_type", "path", "entry_ctid")
+	for _, rec := range batch {
+		ins = ins.Values(rec.SourceBranch, rec.DiffType, rec.Entry.Path, rec.EntryCtid)


I don't understand. You write a CTID of another table into a table? Then everything up to the top-level exported functions has to carry a comment saying it is only useful inside a transaction (with an appropriately high isolation level), otherwise this is unsafe in the presence of concurrent VACUUMing FULL. At the very least, add warning lines to all system documentation never to VACUUM FULL the tables.
Similarly, this imposes immediate restrictions either on any implementation of metadata retention (which deletes entries) or of all code that uses the result -- the CTID might point at nothing.
Also, please document the restrictions on restoring from backups: I assume most backup methods will have to invalidate ctid on restore, so the result of writing these diff entries must be trashed during backup/restore.

The manual has this to say:

ctid

The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. A primary key should be used to identify logical rows.

Our reliance on CTIDs is becoming a danger to data integrity. This technical debt might be acceptable because currently there is no breakage that you or I can see. But at the very least it requires precise documentation.

This is the same mechanism as today: the cataloger's Merge calls the diff implementation inside the merge transaction, and for entries that need to be updated or added as part of the diff, we use the ctid to select the full entry information.
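
The batched insert in the snippet above uses a query builder. As a rough, standard-library-only sketch of the same idea, the following builds a multi-row INSERT with PostgreSQL-style `$n` placeholders and flushes it in fixed-size batches. All names here (`diffRow`, `buildInsert`, `batchWriter`) are hypothetical, the `entry_ctid` column is left out for brevity, and the table name is assumed to be a trusted identifier because it is interpolated into the statement.

```go
package main

import (
	"fmt"
	"strings"
)

// diffRow is a hypothetical, reduced diff result row.
type diffRow struct {
	SourceBranch int64
	DiffType     int
	Path         string
}

// buildInsert returns one multi-row INSERT and its flattened arguments.
func buildInsert(table string, rows []diffRow) (string, []interface{}) {
	const cols = 3
	groups := make([]string, 0, len(rows))
	args := make([]interface{}, 0, len(rows)*cols)
	for i, r := range rows {
		base := i * cols
		groups = append(groups, fmt.Sprintf("($%d, $%d, $%d)", base+1, base+2, base+3))
		args = append(args, r.SourceBranch, r.DiffType, r.Path)
	}
	query := fmt.Sprintf("INSERT INTO %s (source_branch, diff_type, path) VALUES %s",
		table, strings.Join(groups, ", "))
	return query, args
}

// batchWriter buffers rows and executes one INSERT per full batch.
type batchWriter struct {
	table string
	size  int
	buf   []diffRow
	exec  func(query string, args []interface{}) error // e.g. wraps tx.Exec
}

// Write buffers a row and flushes once the batch is full.
func (w *batchWriter) Write(r diffRow) error {
	w.buf = append(w.buf, r)
	if len(w.buf) < w.size {
		return nil
	}
	return w.Flush()
}

// Flush writes any buffered rows; call it once after the last Write.
func (w *batchWriter) Flush() error {
	if len(w.buf) == 0 {
		return nil
	}
	query, args := buildInsert(w.table, w.buf)
	w.buf = w.buf[:0]
	return w.exec(query, args)
}

func main() {
	w := &batchWriter{
		table: "diff_results",
		size:  2,
		exec: func(query string, args []interface{}) error {
			fmt.Println(query, len(args))
			return nil
		},
	}
	for _, p := range []string{"a", "b", "c"} {
		if err := w.Write(diffRow{SourceBranch: 1, DiffType: 0, Path: p}); err != nil {
			panic(err)
		}
	}
	if err := w.Flush(); err != nil {
		panic(err)
	}
}
```

If a CTID column is written this way, the `exec` callback would have to run inside the same transaction that later reads the rows, in line with the concerns raised in this thread.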

catalog/cataloger_diff.go

catalog/db_lineage_scanner_test.go

nopcoder · 2020-10-18T09:00:25Z

> You may want someone more qualified to approve too, I mostly used this review to clear up my ignorance

Thanks for everything. I also have Tzahi going over the implementation to check that it matches the current SQL one.

arielshaqed

Thanks! This sets the R number for CTID-20 exposure to <0.8, meaning we should be able to recover from this plague. :-)

arielshaqed · 2020-10-19T05:45:21Z

catalog/cataloger_diff.go

-	Entry        Entry
-	EntryCtid    *string
+	Entry        Entry   // Partially filled. Path is always set.
+	EntryCtid    *string // CTID of the modified/added entry. Do not use outside of catalog diff-by-iterators. https://github.com/treeverse/lakeFS/issues/831


"Drop a marker and keep going" :-)

nopcoder added 16 commits October 5, 2020 20:05

wip diff from child part

d5087b5

Merge branch 'master' into feature/diff-iter

f2a1e33

lineage reader remove limit

076a032

wip

57d83b7

fix limit remove

7833dc1

make the test pretty

new branch iterator

6858946

code review updates

fix tests to reflect the new implementation

4b29e38

refactor iteration

f77dab2

consider parent lineage commits on diff

9290c5f

remove comments and extract create diff results table

bd3f4d3

encapsulate batch writer

89a7f5d

diff scan limit and after

ebd5fa7

fix merge call to diff

e9d635e

apply changes on entries from parent merge test

18ffa39

continue the test case of modify merged information from child

8409e43

merge from parent changes

3af0f68

nopcoder requested review from tzahij and guy-har October 8, 2020 12:40

nopcoder self-assigned this Oct 8, 2020

nopcoder added the area/cataloger Improvements or additions to the cataloger label Oct 8, 2020

tzahij reviewed Oct 10, 2020

filter deleted

c27f6f6

- keep original options and setting defaults if needed

arielshaqed reviewed Oct 11, 2020

nopcoder added 2 commits October 11, 2020 15:51

remove the use of nextRow

9855f9f

code review changes

5904679

nopcoder requested review from tzahij and arielshaqed October 11, 2020 14:45

nopcoder requested a review from arielshaqed October 11, 2020 19:42

make limit +1 cleaner

8d38a1c

arielshaqed approved these changes Oct 13, 2020

nopcoder changed the title ~~Implement cataloger diff using scanners~~ Implement cataloger current diff using scanners Oct 15, 2020

nopcoder added 5 commits October 15, 2020 14:57

Merge branch 'master' into feature/diff-with-scanner

cb75d73

update test code to read branch information

4d8be46

fix tests after merge with master

6fce11e

code review changes

c7ab7eb

Merge branch 'master' into feature/diff-with-scanner

40b7a01

nopcoder requested a review from arielshaqed October 15, 2020 22:25

check limit close to the increment

7ef0773

arielshaqed requested changes Oct 18, 2020

nopcoder added 2 commits October 18, 2020 11:49

update code review comments

8119b18

prefer the use of psql calling PlaceholderFormat

0be161a

tzahij and others added 10 commits October 18, 2020 12:41

when a tombstone is created in the parent - the max-commit-id is the …

2249060

…previous commit - not the current commit of the parent (where the entity does not exist)

fix source branch value for merge

4068403

Merge branch 'feature/diff-with-scanner' of github.com:treeverse/lake…

48158b6

…FS into feature/diff-with-scanner

fix commit to use parent

d1f21a4

merge

4e601c0

Revert "merge"

45b4278

This reverts commit 4e601c0.

update func name

1f73f00

fix test check error code

5f1f6e5

use next commit id for merge generated tombstone

104d18a

update api documentation

e7defca

nopcoder requested a review from arielshaqed October 19, 2020 05:43

arielshaqed approved these changes Oct 19, 2020

nopcoder merged commit 0f923de into master Oct 19, 2020

nopcoder deleted the feature/diff-with-scanner branch February 4, 2021 12:41
