Replies: 9 comments 1 reply
-
Another trial today. A cold start:

```
Connected to 66.241.125.29:443 from 192.168.1.5:53615

HTTP/2 502
server: Fly/f9c163a6 (2024-01-16)
via: 2 fly.io
fly-request-id: 01HMKTKPQBGN3XRXEY95MKF2W5-bog
date: Sat, 20 Jan 2024 16:19:17 GMT

Body stored in: /var/folders/mz/91hbds1j23125yksdf67dcgm0000gn/T/tmp893dwse9

  DNS Lookup   TCP Connection   TLS Handshake   Server Processing   Content Transfer
[    45ms    |      25ms      |     203ms     |      98992ms      |       0ms      ]
             |                |               |                   |
    namelookup:45ms           |               |                   |
                        connect:70ms          |                   |
                                    pretransfer:273ms             |
                                                      starttransfer:99265ms
                                                                          total:99265ms
```

Seems that nothing is persisted on disk and that you have to download everything?

The next run is a "warm" start:

```
  DNS Lookup   TCP Connection   TLS Handshake   Server Processing   Content Transfer
[     1ms    |      27ms      |      29ms     |       485ms       |       1ms      ]
             |                |               |                   |
    namelookup:1ms            |               |                   |
                        connect:28ms          |                   |
                                     pretransfer:57ms             |
                                                      starttransfer:542ms
                                                                          total:543ms
```
-
@LuchoTurtle You load the image-captioning model in `Application.ex`, so I also need to load the Whisper model there. I tried to load the models in parallel, but for some reason this doesn't give any speedup.

```elixir
# Application.ex
@models_folder_path Application.compile_env!(:app, :models_cache_dir)

@captioning_prod_model %ModelInfo{
  name: "Salesforce/blip-image-captioning-base",
  cache_path: Path.join(@models_folder_path, "blip-image-captioning-base"),
  load_featurizer: true,
  load_tokenizer: true,
  load_generation_config: true
}

@whisper_model %ModelInfo{
  name: "openai/whisper-small",
  cache_path: Path.join(@models_folder_path, "whisper-small"),
  load_featurizer: true,
  load_tokenizer: true,
  load_generation_config: true
}

def start(_type, _args) do
  [
    @whisper_model,
    @captioning_prod_model,
    @captioning_test_model
  ]
  |> Enum.each(&App.Models.verify_and_download_models/1)

  # this "async download" isn't faster ???
  # |> Task.async_stream(&App.Models.verify_and_download_models/1, timeout: :infinity)
  # |> Enum.to_list()

  [...]
```
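For what it's worth, here is a minimal sketch of how the parallel variant would be written (same `App.Models.verify_and_download_models/1` from this repo; `max_concurrency: 3` is just an illustrative choice). Note that when all three models are cache misses, the downloads share one network link, so running them concurrently mostly overlaps latency rather than bandwidth, which would explain why it isn't noticeably faster:

```elixir
# Sketch: verify/download the three models concurrently instead of one by one.
# Task.async_stream returns a lazy stream, so it must be forced (Stream.run/1)
# for the downloads to actually happen before the supervision tree starts.
[
  @whisper_model,
  @captioning_prod_model,
  @captioning_test_model
]
|> Task.async_stream(&App.Models.verify_and_download_models/1,
  timeout: :infinity,
  max_concurrency: 3
)
|> Stream.run()
```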
-
I've documented everything regarding deployment to fly.io in https://github.com/dwyl/image-classifier/blob/main/deployment.md. I'm indeed using a volume to store the models and, the last time I deployed, everything seemed to be working: the models were downloaded after deploying, on first use, and then reused in subsequent runs. In fact, because of the way the models are being served with [...]

As you know, I've gone through this situation of persisting models quite a few times: first by changing the [...]

I can see your activity on the logs. Here's the volume being mounted:

[...]

As you know, when a model is downloaded, a message like [...] is logged.

Unless the volumes are actively being pruned when downscaling due to inactivity, I don't understand this behaviour :( Thank you for sharing.
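For readers landing here, the pattern under discussion is "check the volume first, download only on a miss". A minimal sketch of that idea, assuming Bumblebee and the `%ModelInfo{}` struct from the snippet above (an illustration, not the repo's exact code):

```elixir
# Illustrative sketch of "download only if not cached", assuming Bumblebee.
# `cache_path` points at a directory on the mounted Fly volume.
def verify_and_download_models(%ModelInfo{} = model_info) do
  if File.dir?(model_info.cache_path) and File.ls!(model_info.cache_path) != [] do
    # Cache hit: load from the volume without touching the network.
    Bumblebee.load_model(
      {:hf, model_info.name, cache_dir: model_info.cache_path, offline: true}
    )
  else
    # Cache miss: download into the volume so the next boot is a warm start.
    Bumblebee.load_model({:hf, model_info.name, cache_dir: model_info.cache_path})
  end
end
```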
-
Yes indeed, it seems that volumes are pruned when the machine is killed. Maybe we could save these 3 models into a Postgres blob field (a large object)? The DB is persisted, and a db_query/copy_if_not_exists should be a faster option? I may try this.
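To make that concrete, a rough sketch of what a `copy_if_not_exists` could look like. Everything here is hypothetical: an `App.Repo` Ecto repo and a `model_blobs` table (columns `name`, `data`) holding a gzipped tar of each model's cache directory; none of this exists in the repo today:

```elixir
# Hypothetical sketch: restore a model's cache dir from a Postgres blob
# when the (pruned) volume no longer has it. `App.Repo` and the
# `model_blobs` table are assumptions, not part of this repo.
def copy_if_not_exists(model_info) do
  unless File.dir?(model_info.cache_path) do
    %{rows: [[tar_gz]]} =
      Ecto.Adapters.SQL.query!(
        App.Repo,
        "SELECT data FROM model_blobs WHERE name = $1",
        [model_info.name]
      )

    File.mkdir_p!(model_info.cache_path)

    # Unpack the archived cache directory where Bumblebee expects it.
    :ok =
      :erl_tar.extract(
        {:binary, tar_gz},
        [:compressed, {:cwd, to_charlist(model_info.cache_path)}]
      )
  end
end
```

One caveat: Postgres caps a single `bytea` value at 1 GB, so ~1 GB models sit right at the limit; chunking the archive across rows (or using large objects, as suggested) avoids that.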
-
That seems like a plausible option (and, to be quite frank, probably the only option we have, given that we want the machines to scale down with inactivity). It sucks that we have to resort to a "hacky way" to get it to work :(

But, as much as I'd love to do that, I don't think it's pertinent (at least to my/this repo's scenario). Volumes shouldn't be pruned when downscaled :( The strategy that is documented should work fine in most cases, so I don't really feel the need to save models in a relational database; it just seems counter-intuitive and may lead beginners to think it's OK when it's not really suitable for this case.

Although I appreciate your feedback (I really, really, really do), you can try it for yourself if you want. But I don't see myself hacking my way around saving models into a database, with all the headaches that may come along with it. I'm really excited to actually get the audio-transcription PRs you've implemented and then work from there :D
-
I am curious, but you are wise, so this project does not need this.
-
@LuchoTurtle |
-
Unfortunately, I can't [...]. Since the models are usually 1GB, I can assume the volume is being cleaned up :/
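One generic way to check this from a remote IEx session on the machine would be to list what the cache directory actually holds. A sketch using only the standard library, with the `:models_cache_dir` key taken from the config shown earlier:

```elixir
# List what the models cache dir on the volume actually holds.
# If File.ls! raises :enoent, the directory (and the cached models) is gone.
# Note: File.stat! on a directory gives the entry's size, not a recursive total.
cache_dir = Application.fetch_env!(:app, :models_cache_dir)

cache_dir
|> File.ls!()
|> Enum.map(fn entry ->
  path = Path.join(cache_dir, entry)
  {entry, File.stat!(path).size}
end)
```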
-
I read about volume forks. Could a forked volume be permanent??
I understand Dwyl is a "real" customer, aren't you? Any chance of using a fork as a backup? Fly may be more responsive with "real" customers? 🤔
-
I used httpstat to get some stats on a cold start vs a warm start, to get an idea of the state of the current app (only the Image-To-Text models are loaded).
The first run: [...]
The next run is a "warm" start: [...]
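For a quick sanity check without httpstat, the total request time can also be measured from IEx. A minimal sketch, assuming the Req HTTP client is available and with the URL as a placeholder; this only gives the end-to-end total, not httpstat's per-phase breakdown:

```elixir
# Time one GET end-to-end; :timer.tc returns elapsed microseconds.
url = "https://your-app.fly.dev"  # placeholder URL

{micros, response} = :timer.tc(fn -> Req.get!(url) end)
IO.puts("status=#{response.status} total=#{div(micros, 1000)}ms")
```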