If getting less than 1/2 the max bandwidth with my cluster, should I be choosing a different instance type? #286

rsignell · 2024-07-11T14:00:37Z

I ran a workflow that extracts data values from many files in object storage (a time series from a large collection of global simulation NetCDF files on AWS S3).

I have a cluster of 50 workers (200 threads), and I'm getting less than 1/2 the max bandwidth of the cluster.

Does this mean I should choose a different instance type and perhaps lower my costs?

fjetter · 2024-07-11T14:18:14Z

Changing instance types has very little impact on the network. Memory should likely be the primary decision factor for the instance type, followed by CPUs.

Unrelated to the instance types, you may want to increase the number of worker threads, since this is a primarily network-bound problem. The network throughput is likely limited by S3, which throttles at about 50 MiB/s per connection. Your cluster has 50 workers with 4 threads each, i.e. 50 MiB/s * 50 * 4 ~ 10 GiB/s.
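The back-of-the-envelope estimate above can be written out as a small sketch. It assumes one S3 connection per worker thread and takes the ~50 MiB/s per-connection figure from this comment as given. The function name and its parameters are hypothetical, chosen only for illustration.

```python
MIB_PER_GIB = 1024


def s3_throughput_ceiling_gib(n_workers, threads_per_worker, per_connection_mib_s=50.0):
    """Rough aggregate throughput ceiling in GiB/s.

    Assumes each worker thread holds one S3 connection and that every
    connection is capped at `per_connection_mib_s` (an approximation).
    """
    connections = n_workers * threads_per_worker
    return connections * per_connection_mib_s / MIB_PER_GIB


current = s3_throughput_ceiling_gib(n_workers=50, threads_per_worker=4)
doubled = s3_throughput_ceiling_gib(n_workers=50, threads_per_worker=8)

print(f"50 workers x 4 threads: {current:.2f} GiB/s")  # 9.77 GiB/s
print(f"50 workers x 8 threads: {doubled:.2f} GiB/s")  # 19.53 GiB/s
```

Under these assumptions the ceiling scales linearly with the number of connections, which is why adding threads (or workers) is the lever to try before switching instance families.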

You might get better performance if you doubled the number of threads...

```python
import coiled

cluster = coiled.Cluster(
    worker_vm_types=["m7i.xlarge"],  # pick whatever you like, of course (or use the default, but check the number of CPUs)
    worker_options={
        # make sure this is aligned with the instance type; here it is 2x the number of CPUs
        "nthreads": 8
    },
)
client = cluster.get_client()
```

Just be careful: every worker now also processes twice as many partitions at once, so memory use could blow up!
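A rough way to see the memory trade-off is to look at how much RAM each thread gets. The sketch below assumes 16 GiB per worker, which is the published memory of an m7i.xlarge (4 vCPUs). The helper name is hypothetical, and real headroom is lower because the worker process itself uses some of that memory.

```python
def memory_per_thread_gib(worker_memory_gib, nthreads):
    """Average memory available to each concurrently running task, in GiB."""
    return worker_memory_gib / nthreads


WORKER_MEMORY_GIB = 16.0  # assumption: m7i.xlarge (4 vCPUs, 16 GiB)

default_budget = memory_per_thread_gib(WORKER_MEMORY_GIB, nthreads=4)
oversubscribed_budget = memory_per_thread_gib(WORKER_MEMORY_GIB, nthreads=8)

print(f"4 threads: {default_budget:.1f} GiB per thread")         # 4.0 GiB
print(f"8 threads: {oversubscribed_budget:.1f} GiB per thread")  # 2.0 GiB
```

If a single partition plus its intermediate copies does not fit comfortably in the smaller budget, doubling the threads is likely to cause spilling or worker restarts rather than a speed-up.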

ntabris · 2024-07-11T14:28:06Z

Hi, @rsignell.

Florian and I just had a quick chat and it also probably makes sense to try using a larger number of smaller workers—e.g., 100 m7g.large workers (instead of 50 m7g.xlarge).

Depending on how much tuning you want to do, you could try both smaller workers and some oversubscription of threads (maybe 1.5x or maybe 2x; I wouldn't go higher than that).
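As a sketch of that suggestion, the helper below builds keyword arguments for `coiled.Cluster` from a worker count, an instance type, and an oversubscription factor. The helper itself (`cluster_kwargs`) is hypothetical. The 2-vCPU count for m7g.large comes from the published instance specs, and the 1x to 2x bound mirrors the advice above.

```python
import math


def cluster_kwargs(n_workers, vm_type, vcpus, oversubscription=1.5):
    """Build coiled.Cluster keyword arguments with oversubscribed threads.

    `vcpus` must match the chosen instance type; the factor is limited to
    the 1x-2x range suggested in this thread.
    """
    if not 1.0 <= oversubscription <= 2.0:
        raise ValueError("oversubscription should stay between 1x and 2x")
    nthreads = math.ceil(vcpus * oversubscription)
    return {
        "n_workers": n_workers,
        "worker_vm_types": [vm_type],
        "worker_options": {"nthreads": nthreads},
    }


# 100 smaller workers instead of 50 larger ones, with 1.5x threads per vCPU.
kwargs = cluster_kwargs(n_workers=100, vm_type="m7g.large", vcpus=2)
print(kwargs)

# To launch (requires a Coiled account):
#   import coiled
#   cluster = coiled.Cluster(**kwargs)
```

Keeping the configuration in a plain dict makes it easy to time the same workflow against a few variants (for example 1.5x versus 2x) before settling on one.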

rsignell · 2024-07-12T17:56:37Z

Bingo @ntabris!

I was a little confused by the initial response because I was already using all 4 threads on the 50 m7g.xlarge instances Coiled picked for me. I did try using all 8 threads on 25 m6g.2xlarge instances, but that took much longer -- over twice as long.

I then noticed, while perusing the characteristics of the AWS ARM instance types, that there is a free trial running until Dec 31, 2024 on the t4g.small instances.

And when I fired off 100 of these 2-vCPU t4g.small machines, I got the same performance as the default m7g.xlarge instances, but for free! (And if I use more than 650 hours per month, it will still be only 25% of the cost of the m7g.xlarge instances.)

Amazing. Goes to show you it really pays to check what instances are appropriate for your type of workflow.
For the same performance with the same workflow, I can pay $4/hour, $1/hour, or FREE (while the promotion lasts).

And that's only made possible by the Cloud and Coiled! So cool! 😎
