Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor GitScanner and simplify implementation #5389

Merged
merged 9 commits into from
Jun 9, 2023

Conversation

chrisd8088
Copy link
Member

The GitScanner structure was originally introduced in PR #1670 and has been expanded organically over time since then. This PR addresses a number of inconsistencies in the current design, removing unused or unnecessary logic and revising the internal implementation some of the scanning functions.

It should be easiest to review this PR commit-by-commit, as each commit has a detailed description and should be logically independent and bisectable.


This PR is motivated principally by the desire to remove the structure's Close() method, whose presence implies that it should be called to release resources opened by the GitScanner methods, such as I/O streams or channels. However, upon inspection, this method simply outputs an optional performance timing tracing metric, one whose utility is undermined by the fact that various callers performs unrelated actions between the time they create a GitScanner structure and when the call Close() (if they call it at all). And in the case of the git lfs prune command in particular, multiple Scan*() methods are called in parallel goroutines, but only the final overall timing is reported.

This PR therefore removes this method, along with the structure's internal mutex. The mutex was introduced in commit bdbca39 of PR #1670, with the apparent expectation that some of the GitScanner's elements might be read and written simultaneously by multiple concurrent goroutines. However, in practice, this is never the case, so the mutex is unneeded. As well, not all the current elements of the structure are currently protected by this mutex, as they have been introduced in subsequent PRs which ignored it. Moreover, the two main elements which are protected, namely remote and skippedRefs, are updated exclusively by using the RemoteForPush() method, and the only caller of this method updates two other elements of the structure directly, as these were added later without any mutex protection.

This sole caller of the RemoteForPush() method, the (*uploadContext).buildGitScanner() method in the commands package, is used only by the git lfs pre-push and git lfs push commands, as they need to scan for Git LFS objects which do not exist on a given remote. In this context, the FoundLockable callback and the PotentialLockables GitScannerSet are always set as well; they are also only set in this context. To help clarify these inter-dependencies we replace the RemoteForPush() method with a NewGitScannerForPush() function, which accepts all the necessary arguments for push operations in a single function, and avoids the need to export the GitScannerFoundLockable and GitScannerSet elements of the GitScanner structure.


We also remove the ScanRefsOptions structure in the lfs package, as it is not needed to encapsulate multiple options when initializing a new GitScanner structure, unlike the similar ScanRefsOptions structure of the git package, which is used to consolidate the many options passed to the NewRevListScanner() function. This allows us to make the majority of the elements of the GitScanner structure private to the lfs package.

As part of this change to remove the lfs.ScanRefsOptions structure, we also create a new, dedicated and un-exported nameMap structure, which holds the map of Git object SHAs to their pathspecs, as populated during scan operations. It is then utilized in the existing lockableNameSet structure, whose Check() method is called by the runCatFileBatch() function and also called by the runCatFileBatchCheck() function to collect the pathspecs of files locked by other users.

(Note that the intent to remove the ScanRefsOptions structure from the lfs package was discussed in several PRs in 2016, and in particular, this commit's changes achieve some of the goals of the never-merged PR #1595, but without introducing an entirely generic scanner package.)


Finally, we add an explicit close() of the lockableCh channel at the end of the anonymous function started by runCatFileBatchCheck(), we can ensure the anonymous function which reads that channel will also exit as soon as possible. This function is started by the scanRefsToChan() function, which then proceeds to read directly from the similar channel returned by the runCatFileBatch() function. That read loop would stall indefinitely if its channel was not always closed by runCatFileBatch(), whereas the anonymous function in the goroutine which handles the channel returned by runCatFileBatchCheck() does not exit until the program exits.

@chrisd8088 chrisd8088 requested a review from a team as a code owner June 7, 2023 22:45
In commit d0e950d of PR git-lfs#3978 the
ScanMultiRangeToRemote() method of the GitScanner structure was added,
and the sole caller of the existing ScanRangeToRemote() method was
revised to call the new scan method instead.

As no additional callers of the existing method have been introduced
since then, we can simplify our code by removing the unused method now.
In commit 4049928 of PR git-lfs#4209 the
scanStashed() function of the "lfs" package was revised to no longer
make use of the *GitScanner argument that was part of its original
implementation from commit 2dc718b
in the same PR.

However, the argument was never dropped from the function's signature,
so we can remove it now.
In commit d5fd152 of PR git-lfs#1745 the
"git lfs fetch" command, among others, was revised to set the Filter
element of the GitScanner structure, and no longer pass a common
GitScanner created in the main fetchCommand() function to various
other functions such as scanAll() and fetchAll().

As a result, the GitScanner structure created in the fetchCommand()
function was no longer used, but it was not fully removed at the time,
as was done in the "git lfs checkout" command's checkoutCommand()
function, for instance.  Therefore we can simply remove this unused
structure's initialization now in the fetchCommand() function.
The GitScanner structure and its methods were introduced in PR git-lfs#1670,
and in commit bdbca39 of that PR
the structure's internal sync.Mutex element was added to protect access to
the "remote" and "skippedRefs" elements, as well as the "closed" element.
The structure's methods always acquire the mutex before reading or writing
these elements, particularly the "remote" one.

This type of design is typical of other structures where we expect
concurrent access to the structure's data, such as when multiple
goroutines attempt to read and write the same data.

However, in practice the GitScanner structure's "remote" and "skippedRefs"
elements are only ever initialized once, and indeed only initialized in
one specific use case by the (*uploadContext).buildGitScanner() method
for the "git lfs push" and "git lfs pre-push" commands.  Meanwhile,
the "closed" element is only used in the Close() method, and this
method is only ever called once per structure by our existing code.

Moreover, other elements of the GitScanner structure which have been
added since its original introduction are set and retrieved without
locking the mutex, such as the FoundLockable and PotentialLockables
elements, which are checked and read directly in the scanRefsToChan()
function of the "lfs" package.

As it never the case that multiple goroutines need to read and write
to the "remote", "skippedRefs", and "closed" internal elements of the
GitScanner structure, we can remove the mutex and the associated code,
which simplifies the setup of the structure.
The GitScanner structure and its methods were introduced in PR git-lfs#1670,
and in commit bdbca39 of that PR
the structure's Close() method was introduced.  Unlike other similar
structures whose Close() methods should be called to release underlying
resources such as channels or I/O streams, the (*GitScanner).Close()
method serves only to output an optional performance timing trace metric.

This Close() method is not called consistently; for instance, it is never
called by the migrateExportCommand() function of the "git lfs migrate"
command, and will be skipped by the checkoutCommand() function of the
"git lfs checkout" command if an error is returned by the
(*GitScanner).ScanTree() method.

The utility of the performance timing metric is also undercut by the
fact that some commands perform other tasks before and after calling
the specific (*GitScanner).Scan*() method they invoke.  And in the
particular case of the "git lfs prune" command, multiple goroutines
are started, each of which runs a different Scan*() method simultaneously
with the others, so the final timing metric does not account for
their different execution times, just the overall final timing.

We can improve the value of the timing metric while also simplifying
the calling convention for the GitScanner structure's methods by
removing the Close() method, and tracing the performance of each
Scan*() method individually.

Removing the Close() method clarifies that no underlying resources
must be released for the GitScanner structure, and so callers need
not try to register a deferred call to the method.  This parallels
some other conventional Go structures, such as the Scanner structure
of the "bufio" package.

As well, running a "git lfs prune" command with the GIT_TRACE_PERFORMANCE=1
environment variable set now results in more detailed and useful output,
for example:

  12:36:51.221526 performance ScanStashed: 0.013632533 s
  12:36:51.224494 performance ScanUnpushed: 0.016570280 s
  12:36:51.240670 performance ScanTree: 0.017171717 s
The RevListScanner structure of the "git" package is constructed by
the NewRevListScanner() function, which takes a ScanRefsOptions
structure as an argument.  The git.ScanRefsOptions structure encapsulates
the large number of flags and other fields which a caller may potentially
set for the "git rev-list" scanning operation.

The "lfs" package also has a similar ScanRefsOptions structure, which
is used by the various Scan*() methods of the GitScanner structure to
pass optional flags and fields to the internal functions which implement
the scanning operations.  However, unlike the "git" package, this
ScanRefsOptions structure is not used when constructing a GitScanner,
and is moreover never used outside of the "lfs" package.

We can therefore remove this structure in favour of a more conventional
set of internal elements encapsulated directly in the GitScanner structure.
This simplifies the signatures of many of the internal scan*() functions
and helps clarify what parts of the GitScanner structure actually
need to be exported.

As part of this change, we also create a new, dedicated and un-exported
nameMap structure, which holds the map of Git object SHAs to their
pathspecs, as populated during scan operations.  It is then utilized
in the existing lockableNameSet structure, whose Check() method is
called by the runCatFileBatch() and runCatFileBatchCheck() functions
to collect the pathspecs of files locked by other users.

Note that the intent to remove the ScanRefsOptions structure from
the "lfs" package was discussed in several PRs in 2016, and in
particular, this commit's changes achieve some of the goals of the
never-merged PR git-lfs#1595, but without introducing an entirely generic
"scanner" package.  See also:

git-lfs#1670 (comment)
In commit 781c7f5 of PR git-lfs#1953 the
GitScannerFoundPointer "callback" element of the GitScanner structure
was exported from the "lfs" package by renaming it to "FoundPointer".

However, this element has never since been utilized outside of the
package, so we can simplify the structure's set of exported elements
by renaming it to "foundPointer".
The "git lfs push" and "git lfs pre-push" commands utilize the
uploadForRefUpdates() function of the "commands" package, which
calls the (*uploadContext).buildGitScanner() method to initialize
a GitScanner structure with the additional elements necessary to
exclude from the results any objects which are reachable from
refs known to exist on the given remote.

Two of these elements, the FoundLockable callback and the
PotentialLockables set of pathspecs locked by other users (as used
to report errors or warnings when attempting to push them, specifically
when they are not Git LFS objects, for which lock verification is handled
separately), were added in commit 9b8bed7
of PR git-lfs#1953, and are initialized following a call to the RemoteForPush()
method of the GitScanner structure, which sets two other internal
elements of the structure named "remote" and "skippedRefs".

We can therefore replace the RemoteForPush() method with a
NewGitScannerForPush() function, which allows us to accept all
the necessary arguments for push operations in a single function,
and to avoid exporting the GitScannerFoundLockable and
GitScannerSet elements used to identify locked pathspecs.
In commit 08c5ae6 of PR git-lfs#1953 the
runCatFileBatchCheck() and runCatFileBatch() functions of the "lfs"
package were updated to write the pathspecs of locked files which are
not Git LFS pointers to a dedicated channel created by the
catFileBatchCheck() and catFileBatch() wrapper functions, respectively.

The scanRefsToChan() function, which calls both of these with a
potentially non-nil *lockableNameSet paramter (and is the only caller
to do so), starts a goroutine with an anonymous function to read any
events on the channel returned by catFileBatchCheck(), and then reads
all events on the channel returned by catFileBatch() directly.  In the
latter case, this would cause scanRefsToChan() to stall indefinitely
unless the channel is closed, so the anonymous function started by the
runCatFileBatch() function that writes to the channel always closes
the channel upon exit.

However, the anonymous function started by the runCatFileBatchCheck()
function that writes to its lock path channel does not do the same.
While this does not cause a stalled program because scanRefsToChan()
creates its own anonymous function to read from the channel, that
function will not exit until the progam stops.

By adding an explicit close() of the channel at the end of the
anonymous function started by runCatFileBatchCheck(), we can ensure
the anonymous function which reads that channel will also exit
as soon as possible.
@chrisd8088 chrisd8088 merged commit a41dd92 into git-lfs:main Jun 9, 2023
@chrisd8088 chrisd8088 deleted the refactor-gitscanner branch June 9, 2023 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants