Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature request] Archive.extracted to be able to remove top level directory #54012

Closed
Oloremo opened this issue Jul 25, 2019 · 36 comments
Closed
Assignees
Labels
Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged
Milestone

Comments

@Oloremo
Copy link
Contributor

Oloremo commented Jul 25, 2019

Description of Issue

I have the following workflow in the ansible:

  1. Download archive md5 sum file.
  2. Check local archive(if exists) against md5 in 1)
  3. If it differs(or no old archive) - download the new archive on top of the old one
  4. If we downloaded archive in 3) we check for a specific directory(with archive name) and if it exists - we delete it.
  5. We unarchive archive, it will have the same name as we deleted at step 4)

The idea here is to release applications with a version like "master". It constantly updated and we can't use something like
onlyif: path/to/file

Since it can exist but it could be old. So we rely on the md5 sum and if it's changes - we overwrite old directory by deleting it first and unarchive after.

And since slots are not a thing yet in requisites I'm not particularly sure how to implement this logic in the current state of Salt.

What I tried:
clean: True argument - it's only removing the diff between current and new files based on file names.
overwrite: True - leads to the state to lose idempotency
prereq - with file.absent - leads to the state to be run every time

Versions Report

2019.02

@waynew
Copy link
Contributor

waynew commented Jul 25, 2019

So you have an archive, say master.zip that you've deployed, and then you want to change that master.zip and then deploy it, but only if there are changes from the master.zip that exists on the minion?

@Oloremo
Copy link
Contributor Author

Oloremo commented Jul 26, 2019

@waynew
The idea here is to be able to deploy a new release with the same name. For example "master". Imagine we deployed it once and now we have a directory called "master" with the contents of this archive.

Now new commit was pushed to the master branch of something and the new artifact was created and we want to deploy it again. But now we have no means to check if the update is actually required since there is no sane version, just "master". So our approach here is to create a md5 sum file for each archive and if this md5 differs from the one we already have on a minion - we do a rollout\extraction.

So basically we use a
source_hash_update: True

But the problem is that current arg clean: True is not suitable for that case since it just removes the diff between a local directory and the one that will be unarchived - based on filenames. And in our case filenames could be the same but they're could be very different,

So my proposal is to add the ability for a full cleanup for archive module. Not the diff between, but ensure that extraction will be done in the empty top directory.

@waynew
Copy link
Contributor

waynew commented Jul 26, 2019

@Oloremo have you looked into the onchanges or watch requisites? They might allow you to do what you want without having to add new functionality.

@Oloremo
Copy link
Contributor Author

Oloremo commented Jul 26, 2019

@waynew I looked at prereq with file.absent since I need to remove directory before the extraction. This leads to the state to be run every time.

It was something like this:

remove:
  file.absent:
    - name: /path/to/dir
    - prereq:
      - archive: extract
extract:
  archive.extracted:
    - source: salt://archive.tar.gz
    - name: /path/to/dir
    - source_hash_update: True

@waynew
Copy link
Contributor

waynew commented Jul 26, 2019

Try

- onchanges:
  - archive: extract

instead

@Oloremo
Copy link
Contributor Author

Oloremo commented Jul 26, 2019

@waynew what is the point?.. onchanges will be executed after, right?

remove:
  file.absent:
    - name: /tmp/delme/master
    - onchanges:
      - archive: extract

extract:
  archive.extracted:
    - name: /tmp/delme/master
    - source: http://s3-proxy:9086/behavox-builds/delme/delme.tar.gz
    - source_hash: http://s3-proxy:9086/behavox-builds/delme/delme.tar.gz.md5
    - source_hash_update: True
    - keep_source: False
    - options: "--strip-components=1"
    - enforce_toplevel: False
    - trim_output: 1
----------
          ID: extract
    Function: archive.extracted
        Name: /tmp/delme/master
      Result: True
     Comment: http://s3-proxy:9086/behavox-builds/delme/delme.tar.gz extracted to /tmp/delme/master/, due to absence of one or more files/dirs
     Started: 18:15:09.842483
    Duration: 2794.561 ms
     Changes:
              ----------
              extracted_files:
                  no tar output so far
----------
          ID: remove
    Function: file.absent
        Name: /tmp/delme/master
      Result: True
     Comment: Removed directory /tmp/delme/master
     Started: 18:15:12.637463
    Duration: 319.746 ms
     Changes:
              ----------
              removed:
                  /tmp/delme/master

@Oloremo
Copy link
Contributor Author

Oloremo commented Jul 29, 2019

We figure out why clean: True not working for us. We use a lot the - options: "--strip-components=1" feature to create our own top dir and this approach breaks the clean diff logic.

Obviously now we have 3 choises:

  1. Stop using the --strip-components=1
  2. Implement native support for --strip-components tar feature in archive(could be complicated + not cross platform)
  3. Implement clear top dir logic - proposed in the PR

@dwoz dwoz added the Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged label Jul 30, 2019
@dwoz dwoz added this to the Blocked milestone Jul 30, 2019
@waynew
Copy link
Contributor

waynew commented Jul 31, 2019

So, what's the problem with removing --strip-components=1? That seems like the correct fix to me.

@Oloremo
Copy link
Contributor Author

Oloremo commented Jul 31, 2019

That's a loaded question.
First of all, I think it's kinda a problem that some tar options break the module\state functionality.

And yes, obviously we could drop the --strip-components part but we'll have to re-pack most of our artifacts which we'd like not to do.

@waynew
Copy link
Contributor

waynew commented Jul 31, 2019

I'm confused as to why strip components is breaking anything? And why that would require re-packing your artifacts. Can you create a sample set of commands to duplicate the problem that you're seeing?

e.g.

cd /tmp
mkdir example
touch example/foo.txt
touch example/bar.txt
tar -cvf example.tar example
...
salt \* state.apply some_state

I'm not clear on what part is actually missing here, with what you really want to accomplish. From what I understand, you already have all the pieces available with your existing artifacts and Salt's existing behavior. But if that's not the case, we would like to know.

Can you create that mcve?

@Oloremo
Copy link
Contributor Author

Oloremo commented Jul 31, 2019

@waynew Sorry for the confusion.
I wrote a simple bash script to prepare a mcve:

$ cat ~/54012.sh
#!/bin/bash

rm -rf /tmp/delme/master /tmp/archive1 /tmp/archive2
mkdir -p /tmp/archive1 /tmp/archive2

echo archive1 > /tmp/archive1/archive1only
echo archive1 > /tmp/archive1/both
echo archive2 > /tmp/archive2/archive2only
echo archive2 > /tmp/archive2/both

tar -C /tmp -czf /tmp/archive1_master.tar.gz archive1
mv /tmp/archive1_master.tar.gz /tmp/archive1/master.tar.gz
md5sum /tmp/archive1/master.tar.gz > /tmp/archive1/master.tar.gz.md5

tar -C /tmp -czf /tmp/archive2_master.tar.gz archive2
mv /tmp/archive2_master.tar.gz /tmp/archive2/master.tar.gz
md5sum /tmp/archive2/master.tar.gz > /tmp/archive2/master.tar.gz.md5

This script will create 2 directories which represent the deployment approach we're using. Both these directories will be archived into the archive with the name master.tar.gz and md5 sum of the archive will be created. So as you can see there is a 1 uniq file in each archive and 1 file with the name which exists in both archives.

Now please create this state:

create_top_dir:
  file.directory:
    - name: /tmp/delme/master

extract:
  archive.extracted:
    - name: /tmp/delme/master
    - source: salt://master.tar.gz
    - source_hash: salt://master.tar.gz.md5
    - source_hash_update: True
    - keep_source: False
    - options: "--strip-components=1"
    - enforce_toplevel: False
    - clean: True

And now we're ready to test.
Please copy master.tar.gz and master.tar.gz.md5 from archive1 directory to the fileserver root of your salt installation. Now run this state. It will create a directory /tmp/delme/master and extract first archive into it. So far so good.

Now copy the master.tar.gz and master.tar.gz.md5 from archive2 directory to the fileserver root of your salt installation, overwriting the old ones. And run the state again.

And now you could see the issue. List the files in the /tmp/delme/master, they will look like this:

$ ls -1 /tmp/delme/master/
archive1only
archive2only
both

What'd I expect:

$ ls -1 /tmp/delme/master/
archive2only
both

because clean: True should revome the archive1only file but due to the - options: "--strip-components=1" it can't

@Oloremo
Copy link
Contributor Author

Oloremo commented Aug 16, 2019

@waynew Hi again, just wonder if the comment above makes sense to you

@waynew
Copy link
Contributor

waynew commented Aug 23, 2019

Hey @Oloremo sorry for the wait, it's been a bit busy around these parts. I've finally had a chance to sit down with this and think some more about it.

You're actually right here - the issue is that with the presence of strip-components, archive.extracted has an inconsistent behavior - like you've seen, the issue is that it doesn't remove anything in the toplevel like it would. I think the correct and expected behavior here isn't to have a clean-toplevel, but instead when strip-component is set, '', or the "toplevel" dir should be added to the paths to be deleted.

If you wanted to test this out in your own code, take a look at salt/states/archive.py - you can make this change from

for path in contents['top_level_dirs'] + contents['top_level_files']:

to

if strip_components:
    contents['top_level_dirs'].append('')
for path in contents['top_level_dirs'] + contents['top_level_files']:

This will make the os.path.join(name, path) return the root that things will be extracted to.

@terminalmage may have some alternate thoughts about it, but that seems like the most consistent behavior to me.

@Oloremo
Copy link
Contributor Author

Oloremo commented Sep 2, 2019

@waynew Hi, sorry for the late reply.
So you're suggesting that we need to add a separate strip_components argument to the archive.extracted state? What if it will be specified in options too?

Plus append('') look really weird to me. :-(

@oukooveu
Copy link

oukooveu commented Sep 4, 2019

@waynew solution provided above works as expected, thanks. @Oloremo are you going to create appropriate pull request?

@Oloremo
Copy link
Contributor Author

Oloremo commented Sep 5, 2019

I'll try to draft it soon

@terminalmage
Copy link
Contributor

I think the correct and expected behavior here isn't to have a clean-toplevel, but instead when strip-component is set, '', or the "toplevel" dir should be added to the paths to be deleted.

I don't think I agree with this. The intent of the feature is to only delete paths that Salt is managing. It should never be deleting the parent directory, and adding it to the list of paths to be deleted is reckless and will break things for people.

@terminalmage
Copy link
Contributor

The "parent directory" in this context is the equivalent of the directory one would cd into before extracting (e.g. when running tar -C /some/dir -xf myfile.tar.gz). The archive is being extracted into that directory; it is not what is being extracted. Salt should not be removing it.

@waynew
Copy link
Contributor

waynew commented Sep 5, 2019

I think the correct and expected behavior here isn't to have a clean-toplevel, but instead when strip-component is set, '', or the "toplevel" dir should be added to the paths to be deleted.

I don't think I agree with this. The intent of the feature is to only delete paths that Salt is managing. It should never be deleting the parent directory, and adding it to the list of paths to be deleted is reckless and will break things for people.

From what I see looking at the code it actually behaves differently with and without strip_components, which is why I suggest that change.

If you have an archive that looks like this:

$ tar -tf  one.tar.gz
archive1/
archive1/a/
archive1/a/both.txt
archive1/a/one.txt
archive1/a/b/
archive1/a/b/file.txt

If you extract this with strip components 1 you will get this directory structure:

    `-- a
        |-- b
        |   `-- file.txt
        |-- both.txt
        `-- one.txt

If you have another archive like this:

$ tar -tf two.tar.gz
archive2/
archive2/a/
archive2/a/both.txt
archive2/a/two.txt
archive2/a/b/
archive2/a/b/file.txt

You will end out with this directory structure:

    `-- a
        |-- b
        |   `-- file.txt
        |-- both.txt
        `-- two.txt

Salt is cleaning /a/, so /a/one.txt is gone.

When I say there's an inconsistency, this is what I mean. With an empty directory, change it to strip_components=2 and this is what you get instead:

    |-- b
    |   `-- file.txt
    |-- both.txt
    |-- one.txt
    `-- two.txt

What I'm seeing is that with any level --strip-components and a plain ol' tar, we get the behavior that Salt provides without clean=True.

Which is also the same behavior that we get without clean=True.

If we're providing a clean command that is supposed to delete files and folders, I'm totally fine with that not removing toplevel files by default, based on the argument that a tarfile shouldn't have any toplevel files.

But I have a problem when we're providing multiple ways to (not) do the same thing, and zero ways to accomplish one thing without doing weird hacks like deleting or copying the archive on changes or something.

If you're against the idea of deleting toplevel files with clean when you --strip-components, then what would you suggest?

@Oloremo
Copy link
Contributor Author

Oloremo commented Sep 10, 2019

/me patiently waiting for a decision on implementation

@terminalmage
Copy link
Contributor

@waynew This has nothing to do with any special handling for strip_components, and everything to do with what is considered a "top level" file or directory. With --strip-components=1, the stuff in the subdirectory becomes the top level.

The way the clean functionality works is we take any top-level paths in the archive, with strip_components taken into account, and we recursively remove those paths. Because the paths unique to the first archive are not present in the new archive, they are not among the paths that are removed.

Salt is very conservative about what it will delete, when clean=True is set.

This is by design.

Salt can't know all the ways in which people will use this state. What happens when someone is using --strip-components=1, clean=True, but has other items within the destination dir? Congratulations, your proposal just deleted those files irrevocably. Better hope they have backups!

"Why would anyone have other things in the destination dir?" you may ask. Well, why would anyone use --strip-components=1 and manually make the destination directory? Everyone has different reasons and use cases, and you can't just blindly start changing how a state fundamentally works to solve an edge case.

So, as I have said, removing the parent dir is a Bad Idea:tm:. Don't do it.

The best solution I can think of is to allow for the clean parameter to be either a bool or a string. If a string, we would treat that string as a URL for an archive, cache that archive, and use the paths from that archive as the ones we remove, rather than the paths from the new archive. This would mean running a separate archive.list on that archive, and cleaning the top-level paths in that archive instead when we check for paths to clean.

Example SLS:

extract:
  archive.extracted:
    - name: /tmp/delme/master
    - source: salt://master.tar.gz
    - source_hash: salt://master.tar.gz.md5
    - source_hash_update: True
    - keep_source: False
    - options: "--strip-components=1"
    - enforce_toplevel: False
    - clean: salt://old_master.tar.gz

Note that this would require you to keep the old archive available, and under a different filename. This would not work if the new archive simply takes the place of the old one, with the same URL but different contents.

I can work on the code for this, but I'm still in the process of moving into a new house and will not be able to work on this until sometime next week.

@waynew
Copy link
Contributor

waynew commented Sep 12, 2019

@Oloremo Would that handle your use case?

@Oloremo
Copy link
Contributor Author

Oloremo commented Sep 12, 2019

@waynew
Yes and no. :-)
I fully understand the problem here and whole "different people - different cases" point. We want a custom top directory name and we don't want to re-pack hundreds of archives to achieve this so we come up with the strip-components approach.

The proposed idea of using the archive as a source of truth for Clean is smart, but the limitation of keeping old archive is very frustrating especially since there is no really any cache control\eviction logic for archives in Salt as far as I know. And our archives are big(1Gb in general and we could deploy many times per day) plus I'm not sure how it will work in the originally described case where archive name is something like service-master.tar.gz so both new and old archives have the same name and the same URL.

And considering all of that, why'd you still think that my original approach with additional and optional argument clean_top is bad?

I mean right now we're more or less thinking how to make a change and not ruin other peoples workflows - I get it! So in that situation, a new argument feels like a best option for me since it won't affect anyone else since it will be disabled by default and could be properly documented on how it suppose to work(I can add it to PR). It may be not the cleanest approach and it's not a smart approach but it feels safe to me.

@waynew
Copy link
Contributor

waynew commented Sep 12, 2019

I only dislike it because I fundamentally disagree with terminalmage 🙃

That could be because I haven't seen a scenario in which, disregarding Salt entirely, if you were creating a tar file, deleting some files, and extracting an archive over top of it, with --strip-components that you would ever hit a point where "surprise, you just deleted all the files in that directory!"

Even using Salt I'm not sure how you would get into the situation where you're using clean=True but you don't expect files in the toplevel directory to be removed. I'm not saying that it can't happen... just that it seems like you'd have to be trying to get there.

Another approach that could work, IIUC, is just to implement a watch in the archive state (or maybe it's updating some of the other requisite implementations?). Because basically your problem is that you want to know when you're about to extract a changed archive, and you want to be able to remove everything you don't care about, similar to the behavior of git clean.

I don't really like adding the extra argument, because it seems like we should already support this behavior already.

@Oloremo
Copy link
Contributor Author

Oloremo commented Sep 12, 2019

I'm a bit confused right now.

Even using Salt I'm not sure how you would get into the situation where you're using clean=True but you don't expect files in the toplevel directory to be removed.

I actually expect them to be removed but my example above shows what clean: true don't work as expected if executed with strip-components.

"Just delete top directory" approach is just a very straightforward and simple and that's how I believe, would most of the people do if they need to unarchive new archive with something like "master" version on top of the previous one, and yes kinda like with git.

So if removing a top-level dir is not an option and patching strip-components logic against salt design and the idea of patching of the clean logic to take file list from the url is very limiting - what other options do we have? :-) Assuming you agree that the current behavior with strip-components + clean behavior is wrong right now.

@terminalmage
Copy link
Contributor

Even using Salt I'm not sure how you would get into the situation where you're using clean=True but you don't expect files in the toplevel directory to be removed. I'm not saying that it can't happen... just that it seems like you'd have to be trying to get there.

But that's not at all what's happening here, and I'm really puzzled as to how you still don't understand this. The top-level files/dirs are ones that are part of the archive. They are not the destination directory. The destination directory (i.e. the name parameter to the state) is the directory into which the archive is being extracted. The state does not have any concept of what else might be in the parent directory, so it shouldn't just be removing everything within it.

Just because you can't think of a reason someone wouldn't want to delete everything in the parent dir when using clean=True and --strip-components=1, doesn't mean it's a good idea to do so.

@Oloremo If that doesn't quite work for you (and I understand how large archives can present problems), I would prefer a separate argument called clean_parent that is only used when you are using both --strip-components and clean=True.

@Oloremo
Copy link
Contributor Author

Oloremo commented Sep 12, 2019

@terminalmage Could you please elaborate on clean_parent argument logic?

that is only used when you are using both --strip-components and clean=True.

This is a bit complicated in general since strip-components is not a separate argument of the archive.extracted state, so it's kinda both "edge case" and "magic" at once from my point of view.

@terminalmage
Copy link
Contributor

We determine the number of top-level dirs to strip via the from the value in the options argument, because that CLI flag has gone by more than one name (--strip and/or --strip-components) depending on what version of tar or GNU tar you are using. There's nothing new that would need to be done to get that information because we are already doing it here.

Basically, we would have an additional argument called clean_parent that would only have an effect if both A) we detected --strip or --strip-components being used, and B) clean is set to True.

I know it seems kinda redundant to have two "clean" arguments, but it seems like the best way of achieving the functionality you need, without changing how the clean feature fundamentally works in a way that will cause problems for others.

@moop-moop
Copy link

moop-moop commented Oct 10, 2019

It may be a bug (on Salt 2018.3.2) , but I found that the archive.extracted always reported changes when test=True, which makes the prereq and directory purge useless as it will always be triggered

To get around, I stuck a state to the get the archive local to the minion first, as file.managed works just fine with onchanges. The only breakdown, is if the archive is local but the follow states failed to run, they never will unless the local archive is purged.

jdk archive present on minion:
  file.managed:
    - name: /opt/jdk.tar.gz
    - source: salt://binaries/java/{{ jdk_path }}  
    - makedirs: True
    - show_changes: False

jdk directory purged:
  file.absent:
    - name: /opt/jdk   
    - onchanges:
      - file: jdk archive present on minion

jdk-oracle-openjdk archive extracted:
  archive.extracted:
    - name: /opt/jdk
    - trim_output: 0
    - source: /opt/jdk.tar.gz
    - options: --strip-components=1
    - enforce_toplevel: False
    - onchanges_any:
      - file: jdk directory purged
      - file: jdk archive present on minion

@Oloremo
Copy link
Contributor Author

Oloremo commented Nov 22, 2019

@terminalmage @waynew
Sorry for the delay, I'm ready to get back on this issue.

Basically, we would have an additional argument called clean_parent that would only have an effect if both A) we detected --strip or --strip-components being used, and B) clean is set to True.

This is basically what I proposed in #54013 right?
The only difference is that it's not dependant on the options and I made it conflicting with default clear arg.

@terminalmage
Copy link
Contributor

Yeah, but I personally don't like the clean_top name because top means something specific in Salt (i.e. "top file") and I'm worried that it may cause confusion, which is why I suggested clean_parent.

I'm OK with it conflicting with clean, since it seems redundant to use both clean and clean_parent.

Also, instead of exc.__str__(), can you use six.text_type(exc)? Otherwise, depending on the contents of the error, we might trigger a UnicodeDecodeError on Python 2.

@Oloremo
Copy link
Contributor Author

Oloremo commented Nov 22, 2019

@terminalmage roger that. ill rebase and update this PR

@terminalmage
Copy link
Contributor

Excellent

@Oloremo
Copy link
Contributor Author

Oloremo commented Nov 23, 2019

@terminalmage So I closed the old PR and created a new one, please take a look when you'll have time. It only includes the single very simple test for new logic and I don't really have an idea how to test it properly.

@terminalmage
Copy link
Contributor

Thanks! I just did a code review and suggested a few changes.

@Oloremo
Copy link
Contributor Author

Oloremo commented Dec 23, 2019

Feature merged!

@Oloremo Oloremo closed this as completed Dec 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged
Projects
None yet
Development

No branches or pull requests

6 participants