Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checks the ETag of the files being requested to prevent downloading a… #1383

Merged
merged 22 commits into from
Apr 24, 2020

Conversation

Tro95
Copy link
Contributor

@Tro95 Tro95 commented Feb 24, 2020

Checks the ETag of the files being requested to prevent downloading and replacing files with no updates. #1249

Description of changes: Added caching functionality to the get_url_content() helper function, that stores and compares ETag values in the response headers.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@@ -165,10 +166,52 @@
'AWS::Redshift::Cluster'
]

CACHING_DIR = os.path.join(os.path.dirname(__file__), 'cache/')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can easily be changed to a different directory or moved into a config option if desired. I'm also unsure of the consequences of using __file__ rather than cfnlint.__file__ as is used in some places, but there doesn't seem to be a consistent approach being used, and there were circular dependency issues with the latter option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if it should be under the data folder? Would we want to commit that file as we do the update-specs so the most recent version is included in the package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy for it to live under the data folder; the primary reason for using a separate cache directory was just to signify the temporary, non-essential nature of the files. It definitely could be commited and used like a lock file, and would save an initial fetch of every spec file when it's first run.

Comment on lines 14 to 15
import random
import string
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can both be removed if people were fine with hard-coding the ETag value in the tests. I just used them to generate a random string.

@Tro95
Copy link
Contributor Author

Tro95 commented Feb 24, 2020

With this pull request I'm seeing a reduction from ~33 seconds to download all the specifications to ~20 seconds when none of them need redownloading. Not exactly the kind of increase I was hoping for, but still an improvement, and I'm sure someone can make it even better.

@PatMyron
Copy link
Contributor

With this pull request I'm seeing a reduction from ~33 seconds to download all the specifications to ~20 seconds when none of them need redownloading. Not exactly the kind of increase I was hoping for, but still an improvement, and I'm sure someone can make it even better.

Especially as the number of regions continues to grow, I think updating the regions in parallel will be the main key to keeping cfn-lint --update-specs completion times within a couple seconds

https://github.com/aws-cloudformation/cfn-python-lint/blob/52864add9eac85a42cc6866cdf05c4d3b4dcd149/src/cfnlint/maintenance.py#L50

@kddejong
Copy link
Contributor

kddejong commented Mar 7, 2020

I think parallelization is a great idea here. Another option would be to combine the region and update specs together. Allowing for a user to filter down to the specs they care about. By default we would have to keep the all which may make this a little trickier to implement.

@Tro95
Copy link
Contributor Author

Tro95 commented Mar 7, 2020

I'm happy to give parallelisation a go, is there any particular parellelisation module you'd prefer to use? The constant I/O of the downloads metadata file concerns me, both from a mutex point-of-view, and the inefficiency of reading and writing a whole file for a single line change. What are your thoughts on having an individual metadata file per a canonical url, i.e. <sha256 of url>.meta.json holding data specifically related to that url, rather than a single, centralised file?

@Tro95
Copy link
Contributor Author

Tro95 commented Mar 10, 2020

I've added parallelisation and changed some basic things like the directory the caching files were being pulled into.

Updating all files speed:

$ time cfn-lint --update-specs

real	0m11.139s
user	0m22.762s
sys	0m2.131s

Subsequent update with no specs to update (as per cache check):

$ time cfn-lint --update-specs

real	0m5.726s
user	0m11.829s
sys	0m1.781s

@PatMyron
Copy link
Contributor

PatMyron commented Apr 7, 2020

Is cfn-lint --update-specs still generating src/cfnlint/data/CloudSpecs/ changes like the ones here?

pip3 install -e . --user
cfn-lint --update-specs
git status

@Tro95
Copy link
Contributor Author

Tro95 commented Apr 7, 2020

It should be making changes (assuming there are any to make), so I'm unsure why it's not generating any for you. At the very least it should be creating the src/cfnlint/data/DownloadsMetadata/ directory and files within that.

$ git show
commit 706369e0e0aacbe90c150c1a7b761d880ce601c4 (HEAD -> cfn-issue-1249, dev/cfn-issue-1249)
Merge: ab46e27a be3867c6
Author: Kevin DeJong <kddejong@amazon.com>
Date:   Sat Mar 14 07:25:39 2020 -0500

    Merge branch 'master' into cfn-issue-1249

$ git status
On branch cfn-issue-1249
Your branch is up to date with 'dev/cfn-issue-1249'.

nothing to commit, working tree clean
$ pip3 install -e . --user
$ cfn-lint --update-specs
$ git status
On branch cfn-issue-1249
Your branch is up to date with 'dev/cfn-issue-1249'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   src/cfnlint/data/CloudSpecs/ap-east-1.json
	modified:   src/cfnlint/data/CloudSpecs/ap-northeast-1.json
	modified:   src/cfnlint/data/CloudSpecs/ap-northeast-2.json
	modified:   src/cfnlint/data/CloudSpecs/ap-northeast-3.json
	modified:   src/cfnlint/data/CloudSpecs/ap-south-1.json
	modified:   src/cfnlint/data/CloudSpecs/ap-southeast-1.json
	modified:   src/cfnlint/data/CloudSpecs/ap-southeast-2.json
	modified:   src/cfnlint/data/CloudSpecs/ca-central-1.json
	modified:   src/cfnlint/data/CloudSpecs/cn-north-1.json
	modified:   src/cfnlint/data/CloudSpecs/cn-northwest-1.json
	modified:   src/cfnlint/data/CloudSpecs/eu-central-1.json
	modified:   src/cfnlint/data/CloudSpecs/eu-north-1.json
	modified:   src/cfnlint/data/CloudSpecs/eu-west-1.json
	modified:   src/cfnlint/data/CloudSpecs/eu-west-2.json
	modified:   src/cfnlint/data/CloudSpecs/eu-west-3.json
	modified:   src/cfnlint/data/CloudSpecs/me-south-1.json
	modified:   src/cfnlint/data/CloudSpecs/sa-east-1.json
	modified:   src/cfnlint/data/CloudSpecs/us-east-1.json
	modified:   src/cfnlint/data/CloudSpecs/us-east-2.json
	modified:   src/cfnlint/data/CloudSpecs/us-gov-east-1.json
	modified:   src/cfnlint/data/CloudSpecs/us-gov-west-1.json
	modified:   src/cfnlint/data/CloudSpecs/us-west-1.json
	modified:   src/cfnlint/data/CloudSpecs/us-west-2.json

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	src/cfnlint/data/DownloadsMetadata/

no changes added to commit (use "git add" and/or "git commit -a")

@PatMyron
Copy link
Contributor

PatMyron commented Apr 8, 2020

@Tro95 sorry, I forgot I was testing the homebrew version recently, was able to generate changes here after uninstalling that

@kddejong
Copy link
Contributor

kddejong commented Apr 8, 2020

@PatMyron this looking good to merge?

src/cfnlint/maintenance.py Outdated Show resolved Hide resolved
src/cfnlint/helpers.py Outdated Show resolved Hide resolved
src/cfnlint/helpers.py Outdated Show resolved Hide resolved
src/cfnlint/helpers.py Outdated Show resolved Hide resolved
src/cfnlint/helpers.py Outdated Show resolved Hide resolved
src/cfnlint/helpers.py Outdated Show resolved Hide resolved
src/cfnlint/maintenance.py Outdated Show resolved Hide resolved
src/cfnlint/maintenance.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Apr 10, 2020

Codecov Report

Merging #1383 into master will increase coverage by 0.01%.
The diff coverage is 91.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1383      +/-   ##
==========================================
+ Coverage   87.52%   87.54%   +0.01%     
==========================================
  Files         156      156              
  Lines        8731     8782      +51     
  Branches     2095     2101       +6     
==========================================
+ Hits         7642     7688      +46     
- Misses        654      657       +3     
- Partials      435      437       +2     
Impacted Files Coverage Δ
src/cfnlint/helpers.py 78.60% <87.80%> (+1.76%) ⬆️
src/cfnlint/maintenance.py 98.00% <100.00%> (+0.27%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 91b4586...5c309d6. Read the comment docs.

@Tro95 Tro95 requested a review from miparnisari April 11, 2020 11:06
Copy link
Contributor

@miparnisari miparnisari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor changes and this looks good to me :)

src/cfnlint/helpers.py Outdated Show resolved Hide resolved
src/cfnlint/maintenance.py Show resolved Hide resolved
@Tro95 Tro95 requested a review from miparnisari April 12, 2020 13:22
@kddejong kddejong merged commit 3b706b2 into aws-cloudformation:master Apr 24, 2020
@Tro95 Tro95 deleted the cfn-issue-1249 branch May 1, 2020 12:17
@PatMyron
Copy link
Contributor

PatMyron commented May 18, 2020

@kddejong @miparnisari if the resource specifications haven't changed, some of the other data sources won't be applied without modifying src/cfnlint/data/DownloadsMetadata:

pip3 install -e .
scripts/update_specs_from_pricing.py # requires Boto3 and Credentials
cfn-lint --update-specs # won't apply changes if metadata matches

@Tro95
Copy link
Contributor Author

Tro95 commented May 18, 2020

@kddejong @miparnisari if the resource specifications haven't changed, some of the other data sources won't be applied without modifying src/cfnlint/data/DownloadsMetadata:

pip3 install -e .
scripts/update_specs_from_pricing.py # requires Boto3 and Credentials
cfn-lint --update-specs # won't apply changes if metadata matches

@PatMyron Is the problem the fact that --update-specs does both the downloading of updated specs, and the merging of the specs with the patches produced by the scripts? If so, the fix should be relatively straight-forward, by replacing https://github.com/aws-cloudformation/cfn-python-lint/blob/cbf972fcb3d45dd71bdac4575e81433cbd36dcc6/src/cfnlint/maintenance.py#L46-L53 with something like

    # Check to see if we already have the latest version, otherwise download it
    if url_has_newer_version(url):
        spec_content = get_url_content(url, caching=True)
        spec = json.loads(spec_content)
    else:
        with open(filename, 'r') as f:
            spec = json.load(f)

This would allow the downloading of updated spec files to be skipped if there are no updates, whilst patching the specs everytime regardless. Alternatively, should we separate the patching of the specs, and include the patching with the scripts you are running?

@PatMyron
Copy link
Contributor

Is the problem the fact that --update-specs does both the downloading of updated specs, and the merging of the specs with the patches produced by the scripts?

Yeah pretty much. --update-specs also merges the specs with manually written patches in addition to patches produced by the scripts

Alternatively, should we separate the patching of the specs, and include the patching with the scripts you are running?

Need some way for manually written patches to be applied in addition to patches produced by the scripts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants