Not all s3 objects are returned during list s3objects #200

Closed
res0nat0r opened this issue Mar 29, 2018 · 6 comments

Comments

@res0nat0r

Hi,

It seems only a subset of my objects in s3 are being returned by listobjects. It doesn't seem to be on an even paging boundary so maybe it is something else? All objects in this bucket are archived to glacier but only a subset appear to be returning.

I'll be happy to do any troubleshooting, just let me know.

Example:

$ awless list s3objects --filter bucket=my-bucket-here -r us-east-2 --no-headers | wc -l
2980
$ aws s3 ls --recursive s3://my-bucket-here | wc -l
49511
@res0nat0r res0nat0r changed the title Not all s3 objects are listed during list s3objects Not all s3 objects are returned during list s3objects Mar 29, 2018
@simcap
Contributor

simcap commented Apr 4, 2018

@res0nat0r Sorry for the delay and thanks for reporting 👍

Let's get to the bottom of it. A few things to be aware of:

  • When using wc -l with awless, it is better to output with --format=tsv, since that format has none of the built-in line wrapping that awless applies in the default output format (markdown table). With the default format, wrapped lines would inflate the count. So awless ls s3objects --filter bucket=my-bucket-here -r us-east-2 --format=tsv --no-headers | wc -l would give the accurate number of lines.

  • The --filter on the bucket name actually matches every bucket whose name contains the string passed to the filter. For instance, --filter bucket=my-company would include all buckets starting with my-company, which could add a lot to the results.
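
To illustrate the two points above, here is a minimal Python sketch. It is a simulation only, not awless code: the keys, bucket names, column width, and the helpers `table_lines`, `tsv_lines`, and `filter_contains` are all hypothetical, and awless's real table renderer and filter logic may differ in detail.

```python
import textwrap

# Hypothetical object keys; the second one is too long for a narrow column.
keys = ["a.txt", "archive/2018/03/" + "x" * 60 + ".tar.gz", "b.txt"]

def table_lines(rows, width=40):
    """Simulate a fixed-width table: a long cell spills onto extra lines."""
    lines = []
    for row in rows:
        lines.extend(textwrap.wrap(row, width=width))
    return lines

def tsv_lines(rows):
    """Simulate tsv output: exactly one line per row, never wrapped."""
    return list(rows)

def filter_contains(names, needle):
    """Simulate a substring filter: keep every name containing the needle."""
    return [name for name in names if needle in name]

# Counting lines (what `wc -l` does) overcounts when rows are wrapped.
print(len(table_lines(keys)))  # 5 lines for 3 objects
print(len(tsv_lines(keys)))    # 3 lines for 3 objects

# A substring match pulls in more buckets than an exact match would.
buckets = ["my-company", "my-company-logs", "other"]
print(filter_contains(buckets, "my-company"))
```

In this simulation the wrapped output yields five lines for three objects, which is why a line count taken from the default table format can be larger than the real object count.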

With that in mind, I would (surprisingly) expect the larger count of s3objects to be on the awless side. Silly question, but are you sure the counts were not swapped when pasting into this issue?

@res0nat0r
Author

Hi @simcap thanks for the response,

Thanks for the tsv tip. That output makes sense and lines up better with list s3objects not paginating through the response via NextToken.

Using the below now returns:

$ awless -r us-east-2 list s3objects --format tsv --filter bucket=my-bucket-here | wc -l
1001

So 1001 is 1000 objects plus the one header line; 1000 is the default number returned by the S3 API without iterating through any NextTokens. The bucket name I am filtering on is unique in my namespace, so it should be the only bucket whose contents are returned. I definitely have far more than that in the bucket, as it is my archive: there are ~50k items there for sure.
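
The symptom described here (exactly 1000 results from a much larger bucket) can be reproduced with a small Python sketch. This is a simulation under stated assumptions, not awless or AWS SDK code: `list_page` is a hypothetical stand-in for a single list call that returns at most 1000 keys plus a token for the next page. In the real S3 API the token field is called `NextMarker` or `NextContinuationToken` depending on the API version, rather than `NextToken`.

```python
PAGE_SIZE = 1000  # S3 list calls return at most 1000 keys per response by default

def list_page(all_keys, token=None, page_size=PAGE_SIZE):
    """Hypothetical stand-in for one list call: returns (keys, next_token)."""
    start = token or 0
    end = start + page_size
    # A token is only handed back when more keys remain after this page.
    next_token = end if end < len(all_keys) else None
    return all_keys[start:end], next_token

def list_first_page_only(all_keys):
    """The buggy behaviour: the next-page token is ignored."""
    keys, _ = list_page(all_keys)
    return keys

def list_all(all_keys):
    """The fixed behaviour: keep requesting pages until no token is returned."""
    keys, token = list_page(all_keys)
    while token is not None:
        page, token = list_page(all_keys, token)
        keys.extend(page)
    return keys

# Simulated bucket with the object count reported by `aws s3 ls` in this issue.
bucket = [f"key-{i:05d}" for i in range(49511)]
print(len(list_first_page_only(bucket)))  # 1000
print(len(list_all(bucket)))              # 49511
```

Adding one header line to the first result gives the 1001 lines seen above.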

Hope this helps.

@simcap
Contributor

simcap commented Apr 4, 2018

Ok, thanks @res0nat0r. Tomorrow I will check whether pagination is done properly on our side and push a fix if needed.

@simcap
Contributor

simcap commented Apr 5, 2018

Above commit fixes the issue.

Pagination was missing when listing s3objects. The reason: you pay AWS when fetching S3 objects, so in the early days of awless we did not support pagination on this specific API endpoint since it costs users... and then we forgot about it ;) . In retrospect, not a good idea anyway!
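
To put the cost concern in perspective, the number of list calls needed for a full listing is easy to estimate. The Python sketch below is plain arithmetic and assumes the default page size of 1000 keys; `list_requests_needed` and `estimated_cost` are hypothetical helper names, and the price per 1000 requests is left as a parameter because the actual rate depends on region and current AWS pricing.

```python
import math

def list_requests_needed(object_count, page_size=1000):
    """Number of list calls needed to page through object_count keys."""
    # Even an empty bucket needs one call to find out that it is empty.
    return max(1, math.ceil(object_count / page_size))

def estimated_cost(object_count, price_per_1000_requests, page_size=1000):
    """Cost of a full listing; the price must be supplied by the caller."""
    requests = list_requests_needed(object_count, page_size)
    return requests * price_per_1000_requests / 1000

# Roughly the bucket size discussed in this issue.
print(list_requests_needed(49797))  # 50
```

So a bucket of about 50k objects takes around 50 list calls to enumerate completely.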

@res0nat0r If you confirm now that it brings back the correct count we can close this issue. (I tested it on my side)

(At the moment it is slower than the aws-cli, though only on this particular API endpoint and only given a large number of s3objects. I am going to see whether we can improve on that.)

@res0nat0r
Author

@simcap After doing a go get -u it looks good. Thanks!

Also, FYI, listing my ~50k objects takes about as long as aws s3 ls; see the timings below.

$ time aws s3 ls --recursive s3://my-bucket-here | wc -l
49797

real    0m18.126s
user    0m9.561s
sys     0m0.753s

$ time awless -r us-east-2 list s3objects --format tsv --filter bucket=my-bucket-here | wc -l
49798

real    0m17.512s
user    0m15.231s
sys     0m0.580s

@simcap
Contributor

simcap commented Apr 5, 2018

Awesome! Thanks. I will close this issue.

If you hit any other problematic issues, let us know. I will find the time to fix them before our upcoming v0.1.10 release.
