Current version is tested the following instance config:
- A10 GPU, 24Gb
- 200 Gb RAM
- 30 core intel cpu
- 1.4 Tb SSD drive
Running finetune for 7B llama works reasonably fast, but GPU utilization is bad - we can do much better.
Several immediate observations here:
- one CPU core is 100% utilized. What exactly is it doing, moving data around, supposedly?
- no disk reads are happening - we most likely just serve all files from cache.
- each burst of writes is a forward pass - we save inputs. GPU util is especially bad there.
- On backward/combined pass GPU utilization is slightly better but not much.
- The time before first forward pass is generation - we use batch of size 1 here so utilization is very low but that's to be expected
- GPU memory util is low as well - we can go with much larger batch size
What should we do?
- With this amount of memory we don't need to write to disk ever. We can just move layers back and forth between main memory and GPU
- Optionally prefetch and save async.
Let's check 7B model:
offloading to disk:
2023-09-05 22:11:56,099 starting iteration 5
2023-09-05 22:12:16,246 starting iteration 6
2023-09-05 22:12:36,387 starting iteration 7
2023-09-05 22:12:56,433 starting iteration 8
Each iteration takes ~20s
After offloading the layers to RAM rather than disk, we get considerable speedup:
2023-09-05 22:19:27,544 starting iteration 5
2023-09-05 22:19:35,563 starting iteration 6
2023-09-05 22:19:43,580 starting iteration 7
2023-09-05 22:19:51,603 starting iteration 8
Each iteration is ~8 seconds.
For 70B model on disk:
2023-09-06 00:08:59,143 starting iteration 5
2023-09-06 00:10:41,738 starting iteration 6
2023-09-06 00:12:24,215 starting iteration 7
2023-09-06 00:14:04,903 starting iteration 8
~100s / iteration
70B model on RAM:
2023-09-05 23:14:31,709 starting iteration 5
2023-09-05 23:15:46,593 starting iteration 6
2023-09-05 23:17:01,593 starting iteration 7
2023-09-05 23:18:16,730 starting iteration 8
~75s / iteration