S3: scrounging bandwidth for uploads #2103

Open
bemoody opened this issue Oct 10, 2023 · 7 comments

@bemoody
Collaborator

bemoody commented Oct 10, 2023

One thing that worries me a little about pull #2086 is that uploading projects will initially be very slow. At present, the upload process would have to compete for bandwidth with all of the clients currently downloading data.

I can think of a couple of workarounds:

  1. We could do the uploads from the backup (physionet-production) server. One advantage is that it's in a completely different physical location. However, it would be a messy manual process and we would probably need to manually update the database on physionet-live.

  2. We could configure the S3 client to use an HTTP proxy (via a separate network link, albeit from the same building). In fact, we could set a single proxy server for everything (GCP, DataCite, ORCID, as well as AWS), but I think it might be preferable to configure S3 separately; see the sketch below.

One thing I don't want to do is to prioritize uploads over client requests.
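
For concreteness, a rough sketch of both flavours of option 2 (nothing below exists in the codebase; the proxy URL is a placeholder, and this is only meant to show how boto3/botocore can be pointed at a proxy, either globally via environment variables or for S3 alone via a client Config):

import os

import boto3
import botocore.config

PROXY_URL = "http://proxy.example.org:3128"  # hypothetical proxy endpoint

# (a) One proxy for everything: botocore, and requests (which the DataCite and
#     ORCID calls presumably go through), honour the standard HTTP_PROXY /
#     HTTPS_PROXY environment variables, so exporting those would route all
#     outbound HTTP(S) traffic through the same proxy.
os.environ["HTTPS_PROXY"] = PROXY_URL

# (b) S3 only: pass an explicit botocore Config so that just the S3 client
#     uses the proxy and nothing else is affected.
s3 = boto3.Session().client(
    "s3",
    config=botocore.config.Config(
        proxies={"http": PROXY_URL, "https": PROXY_URL},
    ),
)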

@bemoody
Collaborator Author

bemoody commented Oct 10, 2023

It'd also be super nifty if we had a way to track the progress of background tasks in the admin console.
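
Just to make that concrete, a purely hypothetical sketch (none of these names exist yet): a small model that long-running jobs update as they go, registered read-only in the Django admin.

from django.contrib import admin
from django.db import models


class BackgroundTask(models.Model):
    """Hypothetical progress record that a background job updates as it runs."""
    name = models.CharField(max_length=100)        # e.g. "upload <project slug> to S3"
    total_bytes = models.BigIntegerField(default=0)
    sent_bytes = models.BigIntegerField(default=0)
    started_at = models.DateTimeField(auto_now_add=True)
    finished_at = models.DateTimeField(null=True, blank=True)

    def percent_done(self):
        return 100 * self.sent_bytes // self.total_bytes if self.total_bytes else 0


@admin.register(BackgroundTask)
class BackgroundTaskAdmin(admin.ModelAdmin):
    # Read-only listing so the admin console just shows progress.
    list_display = ("name", "percent_done", "started_at", "finished_at")
    readonly_fields = list_display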

@tompollard
Member

> It'd also be super nifty if we had a way to track the progress of background tasks in the admin console.

Good idea, this would be helpful!

@bemoody
Collaborator Author

bemoody commented Oct 10, 2023

Any thoughts on how we should manage the network traffic?

Uploading hundreds of projects will require some automation in any event. But I'd prefer to do so using Chrystinne's code rather than trying to script it in some other way.

@tompollard
Member

Sorry, not my area of expertise! Personally, I think I'd just take a short-term hit on our network, perhaps alongside a news item explaining why downloads are slow.

@tompollard
Member

tompollard commented Oct 10, 2023

Thinking about this a little more, my preference would be:

> We could configure the S3 client to use an HTTP proxy (via a separate network link, albeit from the same building). In fact, we could set a single proxy server for everything (GCP, DataCite, ORCID, as well as AWS), but I think it might be preferable to configure S3 separately.

This seems like an approach that may be useful in the longer term (rather than a one-off, just for the initial batch of uploads to AWS).

@bemoody
Collaborator Author

bemoody commented Oct 10, 2023

> Personally, I think I'd just take a short-term hit on our network, perhaps alongside a news item explaining why downloads are slow.

I don't think that's practical, though. When I say "competing for bandwidth", I mean that uploading to Amazon would be limited to the same speed as everyone else; uploading 30 TB would take months.
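
Back-of-envelope, with a made-up figure for the share of the link the upload could realistically get:

total_bits = 30e12 * 8   # 30 TB to upload
share_bps = 30e6         # assume the upload only gets ~30 Mbit/s of the shared link
days = total_bits / share_bps / 86400
print(round(days))       # ~93 days, i.e. roughly three months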

It's true that in theory we could monkey with the traffic control settings to prioritize certain connections over others, but that's difficult and finicky and I don't want to try to deal with it.

@bemoody
Collaborator Author

bemoody commented Oct 10, 2023

To set a custom proxy, something like this should work:

--- a/physionet-django/project/cloud/s3.py
+++ b/physionet-django/project/cloud/s3.py
@@ -58,7 +58,13 @@ def create_s3_client():
         session = boto3.Session(
             profile_name=settings.AWS_PROFILE
         )
-        s3 = session.client("s3")
+        config = botocore.config.Config()
+        if settings.AWS_HTTP_PROXY:
+            config.proxies = {
+                'http': settings.AWS_HTTP_PROXY,
+                'https': settings.AWS_HTTP_PROXY,
+            }
+        s3 = session.client("s3", config=config)
         return s3
     else:
         return None

https://stackoverflow.com/a/45492119
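
Note that the diff assumes s3.py imports botocore.config and that an AWS_HTTP_PROXY setting exists; a minimal sketch of the settings side (the name is a placeholder, and it could be read however the other settings are read):

import os

# Hypothetical new setting, alongside the existing AWS_* settings.
# Unset or empty means "no proxy", so current behaviour is unchanged.
AWS_HTTP_PROXY = os.environ.get('AWS_HTTP_PROXY', '')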
