S3: scrounging bandwidth for uploads #2103

Open
bemoody opened this issue Oct 10, 2023 · 7 comments

@bemoody
Collaborator

bemoody commented Oct 10, 2023

One thing that worries me a little about pull #2086 is that uploading projects will initially be very slow. At present, the upload process would have to compete for bandwidth with all of the clients currently downloading data.

I can think of a couple of workarounds:

  1. We could do the uploads from the backup (physionet-production) server. One advantage is that it's in a completely different physical location. However, it would be a messy manual process and we would probably need to manually update the database on physionet-live.

  2. We could configure the S3 client to use an HTTP proxy (via a separate network link, albeit from the same building). In fact, we could set a single proxy server for everything (GCP, DataCite, ORCID, as well as AWS), but I think it might be preferable to configure S3 separately; see the sketch below.

One thing I don't want to do is to prioritize uploads over client requests.
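
For concreteness, a rough sketch of both flavours of option 2 (nothing below exists in the codebase; the proxy URL is a placeholder, and this is only meant to show how boto3/botocore can be pointed at a proxy, either globally via environment variables or for S3 alone via a client Config):

import os

import boto3
import botocore.config

PROXY_URL = "http://proxy.example.org:3128"  # hypothetical proxy endpoint

# (a) One proxy for everything: botocore, and requests (which the DataCite and
#     ORCID calls presumably go through), honour the standard HTTP_PROXY /
#     HTTPS_PROXY environment variables, so exporting those would route all
#     outbound HTTP(S) traffic through the same proxy.
os.environ["HTTPS_PROXY"] = PROXY_URL

# (b) S3 only: pass an explicit botocore Config so that just the S3 client
#     uses the proxy and nothing else is affected.
s3 = boto3.Session().client(
    "s3",
    config=botocore.config.Config(
        proxies={"http": PROXY_URL, "https": PROXY_URL},
    ),
)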

@bemoody
Collaborator Author

bemoody commented Oct 10, 2023

It'd also be super nifty if we had a way to track the progress of background tasks in the admin console.
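
Just to make that concrete, a purely hypothetical sketch (none of these names exist yet): a small model that long-running jobs update as they go, registered read-only in the Django admin.

from django.contrib import admin
from django.db import models


class BackgroundTask(models.Model):
    """Hypothetical progress record that a background job updates as it runs."""
    name = models.CharField(max_length=100)        # e.g. "upload <project slug> to S3"
    total_bytes = models.BigIntegerField(default=0)
    sent_bytes = models.BigIntegerField(default=0)
    started_at = models.DateTimeField(auto_now_add=True)
    finished_at = models.DateTimeField(null=True, blank=True)

    def percent_done(self):
        return 100 * self.sent_bytes // self.total_bytes if self.total_bytes else 0


@admin.register(BackgroundTask)
class BackgroundTaskAdmin(admin.ModelAdmin):
    # Read-only listing so the admin console just shows progress.
    list_display = ("name", "percent_done", "started_at", "finished_at")
    readonly_fields = list_display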

@tompollard
Member

> It'd also be super nifty if we had a way to track the progress of background tasks in the admin console.

Good idea, this would be helpful!

@bemoody
Collaborator Author

bemoody commented Oct 10, 2023

Any thoughts on how we should manage the network traffic?

Uploading hundreds of projects will require some automation in any event. But I'd prefer to do so using Chrystinne's code rather than trying to script it in some other way.

@tompollard
Member

Sorry, not my area of expertise! Personally, I think I'd just take a short-term hit on our network, perhaps alongside a news item explaining why downloads are slow.

@tompollard
Member

tompollard commented Oct 10, 2023

Thinking about this a little more, my preference would be:

> We could configure the S3 client to use an HTTP proxy (via a separate network link, albeit from the same building). In fact, we could set a single proxy server for everything (GCP, DataCite, ORCID, as well as AWS), but I think it might be preferable to configure S3 separately.

This seems like an approach that may be useful in the longer term (rather than a one-off, just for the initial batch of uploads to AWS).

@bemoody
Collaborator Author

bemoody commented Oct 10, 2023

> Personally, I think I'd just take a short-term hit on our network, perhaps alongside a news item explaining why downloads are slow.

I don't think that's practical, though. When I say "competing for bandwidth", I mean that uploading to Amazon would be limited to the same speed as everyone else; uploading 30 TB would take months.
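
Back-of-envelope, with a made-up figure for the share of the link the upload could realistically get:

total_bits = 30e12 * 8   # 30 TB to upload
share_bps = 30e6         # assume the upload only gets ~30 Mbit/s of the shared link
days = total_bits / share_bps / 86400
print(round(days))       # ~93 days, i.e. roughly three months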

It's true that in theory we could monkey with the traffic control settings to prioritize certain connections over others, but that's difficult and finicky and I don't want to try to deal with it.

@bemoody
Collaborator Author

bemoody commented Oct 10, 2023

To set a custom proxy, something like this should work:

--- a/physionet-django/project/cloud/s3.py
+++ b/physionet-django/project/cloud/s3.py
@@ -58,7 +58,13 @@ def create_s3_client():
         session = boto3.Session(
             profile_name=settings.AWS_PROFILE
         )
-        s3 = session.client("s3")
+        config = botocore.config.Config()
+        if settings.AWS_HTTP_PROXY:
+            config.proxies = {
+                'http': settings.AWS_HTTP_PROXY,
+                'https': settings.AWS_HTTP_PROXY,
+            }
+        s3 = session.client("s3", config=config)
         return s3
     else:
         return None

https://stackoverflow.com/a/45492119
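
Note that the diff assumes s3.py imports botocore.config and that an AWS_HTTP_PROXY setting exists; a minimal sketch of the settings side (the name is a placeholder, and it could be read however the other settings are read):

import os

# Hypothetical new setting, alongside the existing AWS_* settings.
# Unset or empty means "no proxy", so current behaviour is unchanged.
AWS_HTTP_PROXY = os.environ.get('AWS_HTTP_PROXY', '')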
