
Git cache breaks with pool_size > 1 if managing the same git source twice #1058

Closed · Magisus opened this issue Apr 30, 2020 · 7 comments

@Magisus (Collaborator) commented Apr 30, 2020

A bug recently surfaced in our Beaker acceptance tests. It seems valid, though it may not be a common workflow. It appeared after the merge of #1044, which updated the default pool size to 4.

If you are managing two objects (in this case, non-module objects) backed by the same git repo and are using a pool_size greater than 1, the git caching mechanism misbehaves on the initial deploy.

Puppetfile

mod 'non_module_object_1',
  :install_path => './',
  :git => 'git://github.com/puppetlabs/control-repo.git',
  :branch => 'production'
mod 'non_module_object_2',
  :install_path => '',
  :git => 'git://github.com/puppetlabs/control-repo.git',
  :branch => 'production'

Error output at --verbose=debug2 from r10k deploy environment -p:

[2020-04-30 22:28:15 - INFO] Deploying Puppetfile content /etc/puppetlabs/code/environments/production/non_module_object_1
[2020-04-30 22:28:15 - DEBUG1] Creating new git cache for "git://github.com/puppetlabs/control-repo.git"
[2020-04-30 22:28:15 - DEBUG] Module thread 23963080 exiting: queue empty
[2020-04-30 22:28:15 - DEBUG2] Starting process: ["git", "clone", "--mirror", "git://github.com/puppetlabs/control-repo.git", "/var/cache/r10k/git---github.com-puppetlabs-control-repo.git"]
[2020-04-30 22:28:15 - INFO] Deploying Puppetfile content /etc/puppetlabs/code/environments/production/non_module_object_2
[2020-04-30 22:28:15 - DEBUG1] Creating new git cache for "git://github.com/puppetlabs/control-repo.git"
[2020-04-30 22:28:15 - DEBUG2] Starting process: ["git", "clone", "--mirror", "git://github.com/puppetlabs/control-repo.git", "/var/cache/r10k/git---github.com-puppetlabs-control-repo.git"]
[2020-04-30 22:28:15 - DEBUG] Module thread 23579460 exiting: queue empty
[2020-04-30 22:28:15 - DEBUG2] Finished process:
Command: git clone --mirror git://github.com/puppetlabs/control-repo.git /var/cache/r10k/git---github.com-puppetlabs-control-repo.git
Stderr:
fatal: destination path '/var/cache/r10k/git---github.com-puppetlabs-control-repo.git' already exists and is not an empty directory.
Exit code: 128
[2020-04-30 22:28:15 - ERROR] Command exited with non-zero exit code:
Command: git clone --mirror git://github.com/puppetlabs/control-repo.git /var/cache/r10k/git---github.com-puppetlabs-control-repo.git
Stderr:
fatal: destination path '/var/cache/r10k/git---github.com-puppetlabs-control-repo.git' already exists and is not an empty directory.
Exit code: 128

I think this occurs because two threads kick off to deploy the entries in the Puppetfile and both try to cache the new repo simultaneously. The second clone fails because its destination path isn't clear -- the directory was populated between the moment the thread decided it needed to clone and when the clone actually ran. Subsequent deploys work fine, because the cache is already populated.
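
To illustrate the race described above, here is a minimal standalone Ruby sketch. This is illustrative only, not r10k's actual code; the cache-path naming, the helper methods, and the mutex-based mitigation at the end are assumptions.

# Two threads deploying Puppetfile entries that share a remote both see the
# cache as missing, and both try to create it with `git clone --mirror`; the
# slower clone fails because the destination is no longer empty.

CACHE_ROOT = '/var/cache/r10k'   # assumed cache root, matching the log above

def cache_path(remote)
  File.join(CACHE_ROOT, remote.gsub(%r{[/:]+}, '-'))
end

def ensure_cache(remote)
  path = cache_path(remote)
  if File.directory?(path)
    system('git', '-C', path, 'fetch', '--prune')   # cache exists: update it
  else
    # Race window: another thread can create `path` between the check above
    # and the clone below.
    system('git', 'clone', '--mirror', remote, path)
  end
end

remote = 'git://github.com/puppetlabs/control-repo.git'
2.times.map { Thread.new { ensure_cache(remote) } }.each(&:join)

# One way to close the window is to serialize cache creation per remote,
# for example with a mutex keyed by cache path:
LOCKS = Hash.new { |h, k| h[k] = Mutex.new }
LOCKS_GUARD = Mutex.new

def ensure_cache_safely(remote)
  lock = LOCKS_GUARD.synchronize { LOCKS[cache_path(remote)] }
  lock.synchronize { ensure_cache(remote) }
end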

This reproduces very reliably. For now we're updating our tests to no longer have two resources with the same backing repo, since we think that setup isn't terribly valid. Seeing as no one has hit this yet, it seems uncommon at least.

@justinstoller (Member)

@adrienthebo, @dhollinger, @reidmv, @bastelfreak, @binford2k, @npwalker: I'm super curious to see if any of you have concerns about how often the above situation might be encountered.

@bastelfreak (Contributor)

Hi, this sounds like an interesting use case. I haven't seen it in the wild, but my assumption is that it's a valid one and that it should work.

@npwalker (Contributor) commented May 1, 2020

While it might currently work, I'm not sure it makes a lot of sense. I'd be fine with failing if we can detect that two modules use the same git source.

If it's easy enough, we could downgrade to a pool_size of 1 instead of failing.

Or, depending on the work compared to the above, we could try to make the caching code safe against two threads trying to cache the same git source at the same time.

Or, if we can detect the same git source twice, we could just dedupe so we never try to download the same git source twice?

Lots of options, but given the low likelihood of someone trying to do this, I'd be fine failing and saying the user needs to set pool_size to 1.
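
The "detect duplicates and downgrade" option above could look roughly like the following. This is a hedged sketch only, not actual r10k behaviour; the entry structure and the effective_pool_size method name are made up for illustration.

require 'logger'

# If two Puppetfile entries share a git source, warn and fall back to a
# single-threaded deploy; otherwise keep the requested pool size.
def effective_pool_size(entries, requested_pool_size, logger = Logger.new($stderr))
  sources = entries.map { |e| e[:git] }.compact
  if sources.uniq.length < sources.length
    logger.warn('Duplicate git sources found in Puppetfile; falling back to pool_size 1')
    1
  else
    requested_pool_size
  end
end

entries = [
  { name: 'non_module_object_1', git: 'git://github.com/puppetlabs/control-repo.git' },
  { name: 'non_module_object_2', git: 'git://github.com/puppetlabs/control-repo.git' },
]
puts effective_pool_size(entries, 4)   # prints 1 and logs a warning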

@binford2k (Contributor)

I don't see the benefit of duplicate git sources. But it also seems relatively easy to detect. At that point you could either dedupe or just downgrade pool_size to 1 (with a warning).

@Magisus (Collaborator, Author) commented May 1, 2020

We're going to look into a real fix this afternoon, because if it's relatively straightforward, the best option is obviously just to fix it.

justinstoller added a commit to justinstoller/r10k that referenced this issue May 1, 2020
When we raised the default thread pool size for deploying modules above 1, we began to see issues when caching a remote locally if that remote/cache was shared by multiple modules within the Puppetfile.

To resolve this, when we create a module we group it with any other modules that share its cachedir. We then pull modules off the queue by cachedir and process each group serially.
justinstoller added a commit to justinstoller/r10k that referenced this issue May 2, 2020, with the same message
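
A rough sketch of the grouping approach described in the commit message above might look like this. It is illustrative only, not the actual r10k implementation; the entry structure and cachedir naming are assumptions. Entries are grouped by the cache directory their remote maps to, one group is queued per remote, and each worker processes a whole group serially, so a given cache is only ever created or updated by a single thread.

entries = [
  { name: 'non_module_object_1', git: 'git://github.com/puppetlabs/control-repo.git' },
  { name: 'non_module_object_2', git: 'git://github.com/puppetlabs/control-repo.git' },
  { name: 'another_module',      git: 'https://github.com/puppetlabs/puppetlabs-stdlib.git' },
]

cachedir_for = ->(remote) { remote.gsub(%r{[/:]+}, '-') }   # assumed naming scheme
groups = entries.group_by { |entry| cachedir_for.call(entry[:git]) }

# One queue item per cachedir, so entries sharing a remote stay together.
queue = Queue.new
groups.each_value { |group| queue << group }

pool_size = 4
workers = pool_size.times.map do
  Thread.new do
    loop do
      group = begin
        queue.pop(true)   # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break
      end
      # Every entry that shares this cachedir is deployed serially here.
      group.each { |entry| puts "deploying #{entry[:name]} (cache: #{cachedir_for.call(entry[:git])})" }
    end
  end
end
workers.each(&:join)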
@justinstoller (Member)

I've opened #1059 with an approach that fixes this instead of warning. For those of you I pinged who review code, I'd welcome any comments regarding the approach and unit tests. I still need to update an acceptance test and the changelog, but I think it's functionally correct.

justinstoller added a commit to justinstoller/r10k that referenced this issue May 6, 2020
After updating the default pool_size, a race condition was
found when initializing our local git cache. See
puppetlabs#1058 for the issue in detail.

When attempting a fix we realized that there were multiple copies of the
code to iterate over modules to manage. See
puppetlabs#1059 for a first attempt at this
problem and discussion.

In order to meet our current timelines, the default will need to be reverted until we can fix this issue properly. Hence this commit and a55536c.

This commit was necessary because the CHANGELOG has been refactored since
the commit referencing the default was added.
Magisus pushed a commit to Magisus/r10k that referenced this issue Jul 27, 2020, with the same message as justinstoller's commit above
Magisus added a commit to Magisus/r10k that referenced this issue Jul 27, 2020
Magisus pushed a commit to Magisus/r10k that referenced this issue Jul 27, 2020, with the same message
Magisus added a commit to Magisus/r10k that referenced this issue Jul 27, 2020
Magisus added a commit to Magisus/r10k that referenced this issue Jul 28, 2020
This commit adds a beaker acceptance test verifying that it is possible
to use the same remote for several non-module objects. This case
previously broke when the `pool_size` feature was introduced, because
there was a race condition when creating the git cache in shellgit mode,
when the same remote was used in two objects being set up concurrently.
Magisus added a commit to Magisus/r10k that referenced this issue Jul 28, 2020
This commit adds a beaker acceptance test verifying that it is possible
to use the same remote for several non-module objects. This case
previously broke when the `pool_size` feature was introduced, because
there was a race condition when creating the git cache in shellgit mode,
when the same remote was used in two objects being set up concurrently.
Magisus added a commit to Magisus/r10k that referenced this issue Jul 28, 2020
This commit adds a beaker acceptance test verifying that it is possible
to use the same remote for several non-module objects. This case
previously broke when the `pool_size` feature was introduced, because
there was a race condition when creating the git cache in shellgit mode,
when the same remote was used in two objects being set up concurrently.
Magisus added a commit to Magisus/r10k that referenced this issue Jul 28, 2020
This commit adds a beaker acceptance test verifying that it is possible
to use the same remote for several non-module objects. This case
previously broke when the `pool_size` feature was introduced, because
there was a race condition when creating the git cache in shellgit mode,
when the same remote was used in two objects being set up concurrently.
mwaggett added a commit that referenced this issue Jul 30, 2020
[#1058] Avoid concurrent cache updates when downloading modules on multiple threads
@github-actions
This issue has been marked stale because it has had no activity for 60 days. The Puppet Team is actively prioritizing existing bugs and new features; if this issue is still important to you, please comment and we will add it to our backlog. Otherwise, it will be closed in 7 days.
