
GDAL does not refresh IAMRole creds on EC2 or ECS after 6 hours #1593

Closed
chris-bateman opened this issue May 28, 2019 · 8 comments

@chris-bateman

GDAL version 2.4.0
Using Python
Running on Amazon Linux 2 with Docker running Ubuntu 19.04
Reading and writing to S3.

After 6 hours GDAL fails to talk to S3 and the process eventually exits due to continuous failures.
The 6 hour limit appears to be built into ECS and EC2, despite the IAM role having its own session duration. This was confirmed by checking the token expiration on the instance.
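
(For reference, a minimal sketch of how the token expiration can be checked from inside the instance, assuming the IMDSv1 metadata endpoint is reachable; on ECS the container credential endpoint would be queried instead.)

```python
import json
import urllib.request

# IMDSv1 endpoint listing the role(s) attached to the EC2 instance.
METADATA = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

role = urllib.request.urlopen(METADATA, timeout=2).read().decode().strip()

# The per-role document contains the temporary keys and their "Expiration".
creds = json.loads(urllib.request.urlopen(METADATA + role, timeout=2).read().decode())
print("Token expires at:", creds["Expiration"])
```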

Haven't been able to generate useful logs at this stage but running GDAL in debug mode now.
Also pulling ECS logs.
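
(A rough sketch of the debug settings being enabled, using GDAL's standard configuration options:)

```python
from osgeo import gdal

# Turn on CPL debug output and verbose curl traces so the failing S3
# requests (e.g. 403s on an expired token) show up in the logs.
gdal.SetConfigOption("CPL_DEBUG", "ON")
gdal.SetConfigOption("CPL_CURL_VERBOSE", "YES")
```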

Confirmed the issue is not present when using environment variables with AWS keys and no IAM role attached.
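
(A minimal sketch of that configuration; the key values and bucket path are placeholders. With static keys set as GDAL config options or environment variables, the EC2/ECS instance-role lookup is bypassed entirely.)

```python
from osgeo import gdal

# Static keys bypass the instance-role credential lookup; the values and
# the bucket path below are placeholders.
gdal.SetConfigOption("AWS_ACCESS_KEY_ID", "AKIA...")
gdal.SetConfigOption("AWS_SECRET_ACCESS_KEY", "...")

ds = gdal.Open("/vsis3/my-bucket/mosaic.vrt")
```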

Expected behavior: refresh the temporary AWS credentials token when required.

@adamsteer

The process which uncovered this behaviour uses GDAL's /vsis3/ driver to open warped virtual mosaics (VRTs) held on S3, which in turn reference imagery held on S3, and performs the delayed-compute warping and clipping specified by the VRTs. The process can take a while, and as Chris mentioned, the credentials time out.
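
(Roughly what that workflow looks like; the bucket names and bounds below are hypothetical.)

```python
from osgeo import gdal

# Open a warped virtual mosaic on S3; the VRT itself references source
# imagery that also lives on S3.
src = gdal.Open("/vsis3/imagery-bucket/mosaic.vrt")

# The warping/clipping described by the VRT is only evaluated when pixels
# are requested, so a large clip like this can keep reading from S3 for
# many hours through the same handle.
gdal.Warp("clip.tif", src, outputBounds=(140.0, -38.0, 141.0, -37.0))
```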

@rouault

rouault commented Jun 7, 2019

@chris-bateman @adamsteer Can you test rouault@68ef68a? (This is against master, but applies on top of 2.4 as well.) I believe this should fix your issue, but I haven't tested it.

@rouault

rouault commented Jun 12, 2019

@chris-bateman @adamsteer ping?

@ghost

ghost commented Jun 14, 2019

Thanks for the patch. The system went in another direction so it wasn't easy to test.

I will give it a try in the next few weeks on a dev system and let you know how I go.

rouault added a commit that referenced this issue Jun 19, 2019
/vsis3/: for a long living file handle, refresh credentials coming from EC2/IAM (fixes #1593)
@rouault

rouault commented Jun 19, 2019

OK, I've merged this and backported it to the 3.0 and 2.4 branches, as I think it should be safe, so that it can be included in the coming bugfix releases. Confirmation that it does fix the issue would still be great.
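
(One way to confirm it, sketched under assumed bucket/object names: keep a single /vsis3/ handle alive past the instance-role token lifetime and keep reading from it.)

```python
import time
from osgeo import gdal

# Hypothetical object; keep one long-lived /vsis3/ handle open and read
# from it well past the ~6 hour credential expiry.
ds = gdal.Open("/vsis3/imagery-bucket/large.tif")
band = ds.GetRasterBand(1)

for hour in range(8):
    time.sleep(3600)
    data = band.ReadRaster(0, 0, 256, 256)   # should still succeed after the refresh
    print("hour", hour + 1, ": read", len(data), "bytes")
```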

@rouault rouault added this to the 2.4.2 milestone Jun 20, 2019
@jonseymour

jonseymour commented Feb 12, 2020

Update: I now have reason to believe this fix is sound and the problem I am experiencing lies elsewhere. See the following update for more information and also https://lists.osgeo.org/pipermail/gdal-dev/2020-February/051719.html

My experience with this fix is as follows:

  • I installed my own build of GDAL 2.4.2 from the source tarball.
  • I still experienced issues with GDAL over VRT files hosted on AWS S3 failing after the container in which GDAL was running had been up longer than the AWS token expiry period (~6 hours).
  • I didn't experience the issue with physical TIFF files hosted on S3, only with VRT files across physical TIFF files hosted on S3.
  • I also experienced similar issues for some files even without the 6 hour delay, but not for all files of the failing type.
  • I could not reproduce the issues with the same code and the same files running in an environment that does not use IAM role-based authentication (e.g. one that uses a credentials profile).
  • The issues with these files disappeared when I injected AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables into the containers.

Now, I can't completely rule out that I have made a dumb error somewhere along the line, but I am reasonably sure I am running code derived from the gdal-2.4.2 source. If I can produce a standalone test case for the problem, I will raise a separate issue documenting it.

I am noting these issues here for the consideration of others who have this fix and are still experiencing similar issues.

@jonseymour

jonseymour commented Feb 24, 2020

A further update to the above. I have now experienced the same symptoms ("ERROR 4:") even when using the explicit AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables in my container (i.e. not using IAM roles). The error did seem to happen after a long pause in system usage, so it still seems to be related to some kind of timeout, but it doesn't seem to be explained by the expiry of IAM role credentials, since in theory I am not using them at the moment.

I am able to call gdal.VSICurlClearCache() inside my container, and when I did so the symptom disappeared and I was able to access the previously failing file successfully.
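
(A rough workaround sketch based on that observation; the path is hypothetical.)

```python
from osgeo import gdal

def open_with_cache_reset(path):
    """Retry an open once after clearing the /vsicurl/ cache state."""
    ds = gdal.Open(path)
    if ds is None:
        gdal.VSICurlClearCache()   # drops cached connections, headers and file regions
        ds = gdal.Open(path)
    return ds

ds = open_with_cache_reset("/vsis3/imagery-bucket/mosaic.vrt")
```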

So, in summary: it could well be that the fix is sound, but there is a second issue which causes similar symptoms even when IAM role credentials are not in use.

See also: https://lists.osgeo.org/pipermail/gdal-dev/2020-February/051719.html

@jonseymour

The symptoms that this problem produced are somewhat similar to #1244 although the cause is quite different.
