
Reuse S3 session #622

Open
wouterzwerink opened this issue Mar 6, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@wouterzwerink

wouterzwerink commented Mar 6, 2024

🚀 Feature Request

Currently, when I use S3 with an IAM role, I see StreamingDataset fetch new credentials for every shard:

[screenshot: repeated credential-fetch log lines]

There is a never-ending stream of credential logs after this.

That's quite inefficient; fetching credentials from an IAM role is not fast. It would be nicer to reuse credentials until they expire.

Motivation

Faster is better!

[Optional] Implementation

I think it would work to just reuse the S3 Session object per thread
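A minimal sketch of the per-thread reuse idea. The caching helper and its names are hypothetical (not from the streaming codebase); in practice the factory would be something like `lambda: boto3.session.Session().client("s3")`, so credentials are resolved once per thread and boto3 refreshes them on expiry.

```python
import threading


class ThreadLocalClientCache:
    """Cache one client object per thread; the factory runs at most once per thread."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        # First call on this thread builds the client; later calls reuse it,
        # so credentials are not re-fetched for every shard download.
        if not hasattr(self._local, "client"):
            self._local.client = self._factory()
        return self._local.client
```

Each download worker thread would then call `cache.get()` instead of constructing a fresh session per shard.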

Additional context

@wouterzwerink wouterzwerink added the enhancement New feature or request label Mar 6, 2024
@snarayan21
Collaborator

Hey! If it's not too much of a hassle, mind submitting a PR with your proposed change? I'd be happy to review

@wouterzwerink
Author

> Hey! If it's not too much of a hassle, mind submitting a PR with your proposed change? I'd be happy to review

Sure! I made a fix for this that worked earlier, but I will need to clean it up a bit before submitting. I'll take a look sometime next week.

@snarayan21
Collaborator

Perfect, thank you @wouterzwerink! Feel free to tag me when the PR is up.

@snarayan21
Collaborator

@wouterzwerink Hey, just wanted to follow up on this, mind submitting a quick PR if/when you have some time? Thanks!!

@huxuan
Contributor

huxuan commented Jul 18, 2024

I am interested in this issue (we actually need it for a potential performance improvement). I think the question is at which level we want to keep a boto3 session. Maybe keep one session for each stream? If so, I propose creating an S3 client in the stream and reusing it whenever download_file() is triggered in Stream._download_file(). Any comments?
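A rough sketch of the per-stream variant, under the assumption that each Stream owns one lazily created client. The class and method names below are illustrative stand-ins, not the real streaming API; `_make_client()` stands in for something like `boto3.session.Session().client("s3")`.

```python
class S3StreamSketch:
    """Hypothetical stream that creates its S3 client once and reuses it."""

    def __init__(self):
        self._client = None  # created lazily on first download

    def _get_client(self):
        if self._client is None:
            # Real code would do: boto3.session.Session().client("s3")
            self._client = self._make_client()
        return self._client

    def _make_client(self):
        return object()  # stand-in for a boto3 S3 client

    def _download_file(self, remote, local):
        client = self._get_client()  # same client for every shard in this stream
        # Real code would call: client.download_file(bucket, key, local)
        return client
```

One caveat with this level of caching: a single Stream's client may be shared across download threads, and whether that is safe depends on boto3's client thread-safety, so the per-thread variant above may compose better with multi-threaded downloads.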

@karan6181
Collaborator

@huxuan Are you seeing any performance degradation with the current approach? If yes, by how much?

@huxuan
Contributor

huxuan commented Jul 23, 2024

> @huxuan Are you seeing any performance degradation with the current approach? If yes, by how much?

I have not done that yet; maybe I can implement a draft version for comparison.

@wouterzwerink
Author

@huxuan I ended up abandoning this after increasing the shard size, which made the S3 overhead negligible. Perhaps that will work for you as well?

@huxuan
Contributor

huxuan commented Jul 25, 2024

> @huxuan I ended up abandoning this after increasing the shard size, which made the S3 overhead negligible. Perhaps that will work for you as well?

Thanks for the response. We save feature vectors in the data, so the sample size is relatively large (about 12 MB per sample). We are already using 200 MB as the size_limit, resulting in approximately 16 samples per shard and a shard size of about 100 MB with zstd (default level 3) compression. IIUC, with a larger shard size, we would also need to increase sampling_granularity to avoid putting more stress on the network.

Any comments are welcome.
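A back-of-the-envelope check of the shard sizing in the comment above, using the numbers stated there (the ~2x compression ratio is inferred from the 200 MB limit and the observed ~100 MB shards, not measured):

```python
SIZE_LIMIT_MB = 200      # uncompressed size_limit per shard
SAMPLE_SIZE_MB = 12      # approximate feature-vector sample size

# About 16 samples fit in one uncompressed shard.
samples_per_shard = SIZE_LIMIT_MB // SAMPLE_SIZE_MB

OBSERVED_SHARD_MB = 100  # on-disk size with zstd level 3
compression_ratio = SIZE_LIMIT_MB / OBSERVED_SHARD_MB  # roughly 2x
```

With only ~16 samples per shard, per-shard overheads like credential fetches are amortized over very few samples, which is why the shard-size workaround helps less here than in the original poster's case.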
