
Makes sure there are no duplicate cache entries. #739

Merged: 9 commits, Jul 31, 2018

Conversation

@coollog (Contributor) commented Jul 30, 2018

Fixes #721

The solution here is, when adding a new layer, to remove any prior layer with the same digest (essentially bumping a re-added layer to the top). This should work because applying layers A and B in the order ABA produces a filesystem identical to applying just BA.
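
A minimal sketch of that approach, assuming an insertion-ordered set behind the builder (names here are illustrative, not the exact Jib code):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/** Illustrative sketch: deduplicates layers while preserving (bumped) insertion order. */
class DedupingLayerBuilder<T> {
  private final LinkedHashSet<T> layers = new LinkedHashSet<>();

  void add(T layer) {
    // Remove first so a re-added layer is bumped to the end; a plain add() of an
    // existing element is a no-op and would leave it at its original position.
    layers.remove(layer);
    layers.add(layer);
  }

  List<T> build() {
    return new ArrayList<>(layers); // iteration order matches the bumped insertion order
  }
}
```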

@coollog added this to the v0.9.8 milestone on Jul 30, 2018
@briandealwis (Member) left a comment

There could be a legitimate reason to provide the same image multiple times (e.g., A, B, A to re-apply a change made in A that was overwritten in B).

@loosebazooka (Member):

@briandealwis, as far as I can tell from how we layer in Jib, ABA == BA; I can't really think of a case where ABA != BA.

@coollog (Contributor, Author) commented Jul 30, 2018

Each layer (i.e. A, B) has a set of changes to files (i.e. x, y, z). Each change is a create, update, or delete, all of which reduce to an operation that sets the contents of a file. So each layer can be represented as a set of pairs (file, newContents) (e.g. {(x, content1), (y, content2), (z, content3)}). Applied in sequence, these changes form an independent history for each file, and within each file's history, a change replaces any prior change (only the last change takes effect). Therefore, applying layers ABA has the same effect as applying layers BA: all changes in the second A replace the changes in the first A, since the two have identical change sets.
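
A toy illustration of this argument, modeling each layer as a map from file to contents (creates and updates only; deletes behave analogously as overriding entries). This is illustrative code, not Jib's implementation:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LayerOrderDemo {
  /** Applies layers in sequence; later entries overwrite earlier ones. */
  static Map<String, String> apply(List<Map<String, String>> layers) {
    Map<String, String> filesystem = new LinkedHashMap<>();
    layers.forEach(filesystem::putAll);
    return filesystem;
  }

  public static void main(String[] args) {
    Map<String, String> a = Map.of("x", "content1", "y", "content2");
    Map<String, String> b = Map.of("y", "other", "z", "content3");

    // ABA and BA produce the same filesystem: the second A supersedes the first.
    System.out.println(apply(List.of(a, b, a)).equals(apply(List.of(b, a)))); // true
  }
}
```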

@coollog requested a review from a team on July 30, 2018 22:49
@@ -19,34 +19,36 @@
import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.annotation.Nullable;

/** Holds the layers for an image. Makes sure that each layer is only added once. */
Review comment (Member):

We should remove "Makes sure that each layer is only added once" or update the Javadoc.

layerDigestsBuilder.add(layer.getBlobDescriptor().getDigest());
layersBuilder.remove(layer);
layersBuilder.add(layer);
lastLayer = layer;
Review comment (Member):

lastLayer seems unused.

import javax.annotation.Nullable;

/** Holds the layers for an image. Makes sure that each layer is only added once. */
public class ImageLayers<T extends Layer> implements Iterable<T> {

public static class Builder<T extends Layer> {

private final ImmutableList.Builder<T> layersBuilder = ImmutableList.builder();
private final Set<T> layersBuilder = new LinkedHashSet<>();
Review comment (Member):

  • Isn't the order of layers important? (Set does not provide any ordering guarantees.)
  • Are you comfortable calling this a builder?

Reply (Contributor, Author):

I'm using LinkedHashSet to ensure that the order is the insertion order; I'll change the signature to LinkedHashSet<T> layers to make this clearer.

}
layerDigestsBuilder.add(layer.getBlobDescriptor().getDigest());
layersBuilder.remove(layer);
layersBuilder.add(layer);
Review comment (Member):

I don't think Set allows duplicates, so probably there is no need to remove before adding.

Reply (Contributor, Author):

Reinsertion into a LinkedHashSet is specified as: "Note that insertion order is not affected if an element is re-inserted into the set", so I believe the remove is necessary to move the layer to the end of the linked list.
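
A quick demonstration of that LinkedHashSet behavior (standard java.util semantics):

```java
import java.util.LinkedHashSet;
import java.util.List;

class ReinsertionDemo {
  public static void main(String[] args) {
    LinkedHashSet<String> set = new LinkedHashSet<>(List.of("A", "B"));

    set.add("A");            // re-insertion is a no-op: insertion order is unaffected
    System.out.println(set); // [A, B]

    set.remove("A");         // removing first...
    set.add("A");            // ...then re-adding moves the element to the end
    System.out.println(set); // [B, A]
  }
}
```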

public class ImageLayers<T extends Layer> implements Iterable<T> {

public static class Builder<T extends Layer> {

private final ImmutableList.Builder<T> layersBuilder = ImmutableList.builder();
private final LinkedHashSet<T> layers = new LinkedHashSet<>();
Review comment (Member):

I think explicitly marking this a LinkedHashSet is a good idea, but it warrants a bit of explanation that the set aspect is there for fast membership checks. TBH, given the need to remove-and-then-add, it would be better to just make this a List. Docker is limited to 40 layers (I believe), so O(n) vs O(log n) doesn't matter.

Reply from @coollog (Contributor, Author), Jul 31, 2018:

I think the speed here might actually matter a bit, since this class also represents the layers in the cache (this PR mostly fixes #721). With a List, the remove-then-add would be O(n), performed n times (once per layer). Since the cache could hold many layers after extensive use (it is not limited by the Docker image layer limit), O(n²) vs O(n log n) (or amortized O(n)) could be a noticeable difference when manipulating the cache metadata.
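
For comparison, a sketch of the List-based alternative suggested above (hypothetical, not the PR's code). Each add pays a linear scan, so n adds cost O(n²) in total, whereas LinkedHashSet's remove and add are amortized O(1) each:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical List-backed variant: correct, but O(n) per add. */
class ListBackedLayerBuilder<T> {
  private final List<T> layers = new ArrayList<>();

  void add(T layer) {
    layers.remove(layer); // linear scan plus element shifting on every add
    layers.add(layer);    // amortized O(1) append
  }
}
```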

lastLayer = layer;
}
layerDigestsBuilder.add(layer.getBlobDescriptor().getDigest());
layers.remove(layer);
Review comment (Contributor):

Don't LinkedHashSets already handle duplicates?

Reply (Member):

Turns out this is to re-order the element (#739 (comment)).

That said, it seems worthwhile to add a comment like "remove is necessary to rearrange the insertion order of the LinkedHashSet".

@chanseokoh (Member) commented Jul 31, 2018

ABA == BA and I can't really think of a case where ABA != BA?

I wonder if this is possible: for example, both A and B create a read-only file (with no permission to write or modify it) at the same location that, once placed, no one can overwrite. That is, whoever writes the file first wins. If that is possible, then ABA != BA.

@coollog (Contributor, Author) commented Jul 31, 2018

@chanseokoh File permissions will not affect the overwrite. In the layered container filesystem at runtime, the search for a file proceeds from the last layer backwards, so a file in a prior layer is never reached when a later layer contains the same file.
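
A sketch of that lookup order, again modeling layers as maps (simplified; real container runtimes use union filesystems, with whiteout files marking deletes):

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

class UnionLookupDemo {
  /** Searches from the last (topmost) layer backwards; the first hit wins. */
  static Optional<String> lookup(List<Map<String, String>> layers, String path) {
    for (int i = layers.size() - 1; i >= 0; i--) {
      String contents = layers.get(i).get(path);
      if (contents != null) {
        return Optional.of(contents); // earlier layers with this path are never consulted
      }
    }
    return Optional.empty();
  }
}
```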

@briandealwis (Member):

I wasn't clear in what I wrote previously: by changing ImageLayers, aren't we affecting all generated images, and not just our cache metadata in target/jib-cache and $APPDATA/google-cloud-tools-java/jib?

@coollog (Contributor, Author) commented Jul 31, 2018

@briandealwis Yep, the intention is for this fix to apply the same optimization to generated images as well.

@chanseokoh (Member):

Okay, this PR looks good to me.

@coollog merged commit b23dcdc into master on Jul 31, 2018
@coollog deleted the dedup-cache-entries branch on July 31, 2018 16:31
@coollog (Contributor, Author) commented Jul 31, 2018

Looks like this helped fix the failing integration test for the simple project, but the empty project still fails (because the second build takes longer).
