-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement cataloger current diff using scanners #790
Conversation
make the test pretty
code review updates
- keep original options and setting defaults if needed
Codecov Report
@@ Coverage Diff @@
## master #790 +/- ##
==========================================
+ Coverage 42.92% 43.11% +0.18%
==========================================
Files 135 136 +1
Lines 10571 10648 +77
==========================================
+ Hits 4538 4591 +53
- Misses 5443 5460 +17
- Partials 590 597 +7
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just requests to clarify some oddities that make it hard for me to read. Nothing blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks!
You may want someone more qualified to approve too, I mostly used this review to clear up my ignorance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
We're storing CTIDs inside tables. This is very scary, because TFM warns against using them for anything long-term. I know that the current implementation only writes them to an effectively-temporary table so it might be OK, but we do need a tonne of documentation around this:
- All functions that might write a CTID to a table some time during their execution should carry a warning. Specifically when creating such a dangerous table.
- User-level documentation must warn never ever to run
VACUUM FULL
on the database if there is any chance of a concurrent merge. The fact that we would never do such a thing does not mean everyone follows the same ops guidelines, so we have to tell everyone they have to do it our way. E.g. user X might have their own small private lakeFS instance (we actually encourage this) and naïvely suppose that this is safe because it is fast. We should ensure that they know enough to be safe. - If there is any chance of such tables surviving and being re-used then we must document backup strategy etc.
Very sorry, but I am requesting changes with this as the primary blocker. It cannot be solved by code because it is not about code, it is about dangerous future code or dangerous current ops. I am treating warning all devs calling these functions, and any users running VACUUM FULL
, as a threat to user data integrity. So: if they are not such please let me know and I will reconsider.
catalog/cataloger_diff.go
Outdated
} | ||
|
||
diffRec := &diffResultRecord{ | ||
SourceBranch: parentID, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand: isn't this ID fixed on each record, and equal to a parameter that the caller passed in? (If so, can we just not include it?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, it is a bug - should take the branch ID from the parent entry's branch
} | ||
ins := psql.Insert(tableName).Columns("source_branch", "diff_type", "path", "entry_ctid") | ||
for _, rec := range batch { | ||
ins = ins.Values(rec.SourceBranch, rec.DiffType, rec.Entry.Path, rec.EntryCtid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. You write a CTID of another table into a table? Then everything up to the top-level exported functions has to carry a comment saying it is only useful inside a transaction (with an appropriately high isolation level), otherwise this is unsafe in the presence of concurrent VACUUMing FULL. At the very least, add warning lines to all system documentation never to VACUUM FULL the tables.
Similarly, this imposes immediate restrictions either on any implementation of metadata retention (which deletes entries) or of all code that uses the result -- the CTID might point at nothing.
Also, please document the restrictions on restoring from backups: I assume most backup methods will have to invalidate ctid on restore, so the result of writing these diff entries must be trashed during backup/restore.
The manual has this to say:
ctid
The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. A primary key should be used to identify logical rows.
Our reliance on CTIDs is becoming a danger to data integrity. This technical debt might be acceptable because currently there is no breakage that you or I can see. But at the very least it requires precise exact documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same mechanism as today - the cataloger's Merge calls the diff implementation inside the merge transaction and for entries that needs to update/added as part of the diff, we use the ctid to select the full entry information.
Thanks for everything - I have also Tzahi going over the implementation to check that it matches the current SQL one |
…previous commit - not the current commit of the parent (where the entity does not exist)
…FS into feature/diff-with-scanner
This reverts commit 4e601c0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This sets the R number for CTID-20 exposure to <0.8, meaning we should be able to recover from this plague. :-)
Entry Entry | ||
EntryCtid *string | ||
Entry Entry // Partially filled. Path is always set. | ||
EntryCtid *string // CTID of the modified/added entry. Do not use outside of catalog diff-by-iterators. https://github.com/treeverse/lakeFS/issues/831 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Drop a marker and keep going" :-)
Use a DB entity scanner to go over parent/child branches to perform diff operation.
The scanner interface operations as an iterator that scan a branch or a lineage.
The diff implementation uses the scanner to compare the two branches and produce diff result into a temporary table.
The iteration support starting after 'path' and the diff loop can limit the output to X records to handle cases where we need pagination.