Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lakectl log: add option to filter merge commits #8142

Conversation

ItamarYuran
Copy link
Contributor

@ItamarYuran ItamarYuran commented Sep 8, 2024

Closes #7441

Added an option to filter merge commits in the log command. The flag --no-merges will filter every commit that has more than one parent and thus also import commits. The implementation concerns only lakectl at the moment. Filtering is done at The server's side similarly to --first-parent flag.

Testing Details

Tested only manually, original test pass (after a slight modification)

Additional info

Screenshot 2024-09-08 at 10 36 14

Contact Details

itamar.yuran@treeverse.io

@ItamarYuran ItamarYuran linked an issue Sep 8, 2024 that may be closed by this pull request
@ItamarYuran ItamarYuran added the area/lakectl Issues related to lakeFS' command line interface (lakectl) label Sep 8, 2024
Copy link

github-actions bot commented Sep 8, 2024

E2E Test Results - Quickstart

11 passed

Copy link

github-actions bot commented Sep 8, 2024

E2E Test Results - DynamoDB Local - Local Block Adapter

13 passed

Copy link
Contributor

@itaiad200 itaiad200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fearful PR in one of the most key areas of Graveler - well done.

In order for me to effectively review this PR, I'm going to request 2 things:

  1. Document the new logic in the commit iterator. It's hard to follow this delicate area logic change without this.
  2. Tests. First it seems that some tests fail in our CI (not sure if these are related or not). Also - commit_iterator needs some testing for the new functionality (perhaps more testing for existing functionality too), lakectl requires tests for the new flag, probably in esti.

cmd/lakectl/cmd/log.go Outdated Show resolved Hide resolved
Comment on lines 122 to 172
for ci.queue.Len() > 0 {
ci.value = heap.Pop(&ci.queue).(*graveler.CommitRecord)

// as long as we have something in the queue we will
// set it as the current value and push the current commits parents to the queue
ci.value = heap.Pop(&ci.queue).(*graveler.CommitRecord)
parents := ci.value.Parents
if ci.firstParent && len(parents) > 1 {
parents = parents[:1]
}
for _, p := range parents {
// skip commits we already visited
if _, visited := ci.visit[p]; visited {
if ci.noMerges && len(ci.value.Parents) > 1 {
firstParent := ci.value.Parents[0]

if _, visited := ci.visit[firstParent]; visited {
continue
}

rec, err := ci.getCommitRecord(firstParent)
if err != nil {
ci.value = nil
ci.err = err
return false
}
ci.visit[rec.CommitID] = struct{}{}

if ci.since != nil && rec.Commit.CreationDate.Before(*ci.since) {
continue
}
heap.Push(&ci.queue, rec)
continue
}

rec, err := ci.getCommitRecord(p)
if err != nil {
ci.value = nil
ci.err = err
return false
parents := ci.value.Parents
if ci.firstParent && len(parents) > 1 {
parents = parents[:1]
}
ci.visit[rec.CommitID] = struct{}{}

// skip commits that are older than since time
if ci.since != nil && rec.Commit.CreationDate.Before(*ci.since) {
continue
for _, p := range parents {
if _, visited := ci.visit[p]; visited {
continue
}

rec, err := ci.getCommitRecord(p)
if err != nil {
ci.value = nil
ci.err = err
return false
}
ci.visit[rec.CommitID] = struct{}{}

// skip commits that are older than since time
if ci.since != nil && rec.Commit.CreationDate.Before(*ci.since) {
continue
}

heap.Push(&ci.queue, rec)
}

heap.Push(&ci.queue, rec)
return true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive - One of the more delicate areas in Graveler and it seems really thought through.

For me, as a reviewer, it's near impossible to follow the logic that is applied here without comments walking me through the decisions taken in every branch of this code.

@N-o-Z
Copy link
Member

N-o-Z commented Sep 11, 2024

@ItamarYuran please add missing labels to PR

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking about possible incorrectness.
Also this code needs tests, I assume they'll be in the next commit on this PR so I'm blocking but you probably already know why.

With that it is the way, I would not like this code to be in the server, and this is my primary reason to block. A client like lakectl can and should filter on its side. There is no efficiency gain from filtering on the server side, most of the slowness comes from the server needing to iterate over many commits. It is also very easy to perform on the high-level python SDK.
Just iterate over calls to log-commits in the client and filter out commits with >1 parent. It will be simpler code and allow filtering as desired.

api/swagger.yml Outdated
@@ -3540,6 +3540,11 @@ paths:
description: if set to true, follow only the first parent upon reaching a merge commit
schema:
type: boolean
- in: query
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan of parameters with negative names. They're mostly okay as commandline flag, less so in an api.

However here I have no gear alternative to suggest: the default has to remain with merges. If you do have a good alternative I'm very open to suggestions here.

Otherwise let's go with this one! If we keep on having to add flags, we'll anyway need to end up with an expression parser / GraphQL. So I'm willing to wait until we need another flag.

More importantly is why we need an API change if we just want to filter; that appears on the main comment for this review.

if ci.noMerges && len(ci.value.Parents) > 1 {
firstParent := ci.value.Parents[0]

// case parent was visited, continuing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand.

  1. Here no-merges implies first-parent. Why? I would expect no-merges to do just one thing.
  2. How does this code work when there are 2 merges one after the other? firstParent Will also be a merge and the code should skip it.
  3. (Related) Just skip the current record of its a merge and keep going. This is a filter operation, let the next iteration do its thing

@ItamarYuran ItamarYuran added the include-changelog PR description should be included in next release changelog label Sep 12, 2024
@ItamarYuran ItamarYuran force-pushed the 7441-give-an-option-to-hide-merge-commits-in-a-branch-view-in-the-ui branch from 91be003 to d2c2a5f Compare September 12, 2024 16:24
@ItamarYuran
Copy link
Contributor Author

Thank You all for your reviews.
Completely changed attitude to a client side implementation.

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, I really like this code!

There is some nastiness (an actual tech debt, even a bug) in the original code, that became bad when we add filtering. If you specify --amount (the case amount != 0 on l. 177) then the original code will never return >1000 results, even if amount == 2000. The new code compounds this: if the first page does not contain even amount results, then it will perform a single call. So it will display < amount results. For instance, say I specify --amount=10. Then the code asks for 10 log entries, and displays only those which are not merges. So it will often fail to display exactly 10 results.

I would suggest:

  1. Change the loop to reduce amount by the number of results actually output.
    • Exit the loop if we get to amount <= 0.
  2. When setting up initial pagination, ask for amount results if not filtering, or 3*amount results if we are filtering. This is a heuristic: it guesses that at most 2/3 commits will be merges, so it asks for more. But it always works even if it guessed wrong - it just calls lakeFS more times than it could have.

WDYT?

@@ -66,6 +66,19 @@ func (d *dotWriter) Write(commits []apigen.Commit) {
}
}

// filter merge commits, used for --no-merges flag
func filterMergeCommits(commits []apigen.Commit) []apigen.Commit {
var filteredCommits []apigen.Commit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var filteredCommits []apigen.Commit
filteredCommits := make([]apigen.Commit, 0, len(commits))

This is fairly idiomatic Go for when you know roughly the expected length of the result. It says to reserve space for the same number of commits, but keep the slice itself at length 0.

@ItamarYuran
Copy link
Contributor Author

@arielshaqed As a result of using the heuristic, while using both --no-merges & --amount user would usually get more results than asked

Copy link
Contributor

@itaiad200 itaiad200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Some very good progress here. Besides my comments, we're missing some tests coverage for the new flag

@@ -40,6 +40,7 @@ const (
const (
internalPageSize = 1000 // when retrieving all records, use this page size under the hood
defaultAmountArgumentValue = 100 // when no amount is specified, use this value for the argument
maxAmountNoMerges = 333 // when using --no-merges & amount, this const is the upper limit use huristic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rephrase, I don't understand what this means and why 333(?!)

cmd/lakectl/cmd/log.go Show resolved Hide resolved
@@ -100,6 +114,10 @@ var logCmd = &cobra.Command{
if amountForPagination <= 0 {
amountForPagination = internalPageSize
}
// case --no-merges & --amount, ask for more results since some will filter out
if noMerges && amountForPagination < maxAmountNoMerges {
amountForPagination *= 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I get the 333 - it's directly linked to 3. I wonder how our linter doesn't complain about this magic number.

// case --no-merges, filter commits and subtract that amount from amount desired
if noMerges {
data.Commits = filterMergeCommits(data.Commits)
amount = amount - (3 * len(data.Commits))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it tripled here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was that if we use that heuristic, in this way we stay true to it. We asked for three times the amount we needed to get ~1/3 results of that. Updating the amount this way will keep that heuristic in every request.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better if we have a single location to handle amount heuristics. Can you try to combine this block and the above one (lines 114-120) into one? Preferably where the loop starts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wrong here, works better with subtracting only the original size. subtracting X3 amount caused receiving less results than asked.

@N-o-Z
Copy link
Member

N-o-Z commented Sep 18, 2024

@ItamarYuran Can we refactor the PR title from: "7441 give an option to hide merge commits in a branch in lakectl"
to "lakectl log: add option to filter merge commits"?
More concise and there's no need to reference the linked issue as we add it in the description

@ItamarYuran ItamarYuran changed the title 7441 give an option to hide merge commits in a branch in lakectl lakectl log: add option to filter merge commits Sep 18, 2024
@itaiad200 itaiad200 dismissed their stale review September 19, 2024 14:34

OOO and don't want to block

@ItamarYuran ItamarYuran marked this pull request as draft September 26, 2024 13:57
Copy link

github-actions bot commented Sep 29, 2024

♻️ PR Preview 91df0b1 has been successfully destroyed since this PR has been closed.

🤖 By surge-preview

@ItamarYuran
Copy link
Contributor Author

@arielshaqed It works! at least it seems like. Failures occurring because of some other reason

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... Failures occurring because of some other reason

Unfortunately all Estis seem to fail at the same spot. If we pull this our CI will break. So it doesn't matter if the bad test code was added here or not, I cannot approve.

What can we do?

First, skip your added test (add t.Skip() at the beginning) and get Esti to run on it again. Does it still fail?

If it stopped failing, we need to look at the added test - or perhaps t.Skip() another test to see if it's just one extra test which breaks everything. If it still fails, we will need to understand why (whether) the trunk fails, and open an issue to fix trunk.

Sorry.

@@ -41,6 +41,10 @@ const (
internalPageSize = 1000 // when retrieving all records, use this page size under the hood
defaultAmountArgumentValue = 100 // when no amount is specified, use this value for the argument

// when using --no-merges & amount, this const is the upper limit to use huristic.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
// when using --no-merges & amount, this const is the upper limit to use huristic.
// when using --no-merges & amount, this const is the upper limit to use heuristic.

@@ -41,6 +41,10 @@ const (
internalPageSize = 1000 // when retrieving all records, use this page size under the hood
defaultAmountArgumentValue = 100 // when no amount is specified, use this value for the argument

// when using --no-merges & amount, this const is the upper limit to use huristic.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment also says "3". So that constant needs to appear here, too.

@@ -338,6 +338,59 @@ func TestLakectlMergeAndStrategies(t *testing.T) {
RunCmdAndVerifySuccessWithFile(t, Lakectl()+" fs ls lakefs://"+repoName+"/"+featureBranch+"/", false, "lakectl_fs_ls_1_file", vars)
}

func TestLakectlLogNoMergesWithCommitsAndMerges(t *testing.T) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: please remove this blank line, it does not aid readability.

@arielshaqed
Copy link
Contributor

Note that these runs on master all seem to succeed; I started another one right now, but I would expect it also to succeed.

@arielshaqed
Copy link
Contributor

I also re-ran Esti on this branch. This will give us more confidence whether the problem is on trunk or only on this branch.

@arielshaqed
Copy link
Contributor

I think this means that Esti actually discovered an issue with the code. It is getting a wrong result when logging after a merge. 🙀

@ItamarYuran ItamarYuran marked this pull request as ready for review October 1, 2024 10:43
Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last nit found... and we should be done!

}
amount -= len(data.Commits)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this still prints an incorrect number of commits. Think of the case where you get more commits after filtering than you really needed. Something like data.Commits = data.Commits[:amount] should do the job.

Please also add a test that would fail without this fix.

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, great feature!

@ItamarYuran ItamarYuran merged commit d28387c into master Oct 8, 2024
38 checks passed
@ItamarYuran ItamarYuran deleted the 7441-give-an-option-to-hide-merge-commits-in-a-branch-view-in-the-ui branch October 8, 2024 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/lakectl Issues related to lakeFS' command line interface (lakectl) include-changelog PR description should be included in next release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Give an option to hide merge commits in a branch view in the UI
4 participants