train-text-from-scratch oom (in tokenizer?) #4300

Closed
tezlm opened this issue Dec 3, 2023 · 3 comments

tezlm commented Dec 3, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Running train-text-from-scratch with a 4 GiB file works.

Current Behavior

Running train-text-from-scratch with a 4 GiB file gets OOM-killed after allocating over 32 GiB of memory.

Environment and Context

Using commit 5a7d312.

  • Physical (or virtual) hardware you are using, e.g. for Linux:
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            10
    CPU(s) scaling MHz:  44%
    CPU max MHz:         4500.0000
    CPU min MHz:         800.0000
    BogoMIPS:            5199.98
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx f
                         xsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
                         rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx e
                         st tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer
                          aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd i
                         brs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep
                          bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dthe
                         rm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    1.5 MiB (6 instances)
  L3:                    12 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11
Vulnerabilities:
  Gather data sampling:  Mitigation; Microcode
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; IBRS
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Not affected
  • Operating System, e.g. for Linux:
Linux chorusfruit 6.1.63 #1-NixOS SMP PREEMPT_DYNAMIC Mon Nov 20 10:52:19 UTC 2023 x86_64 GNU/Linux
  • SDK version, e.g. for Linux: used nix flake

Failure Information (for bugs)

It seems like there's a bug (or unoptimized code) in the tokenizer that causes it to allocate far more memory than necessary.
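
For scale, here is a rough back-of-envelope sketch (not verified against the actual tokenize_file implementation; it assumes the whole file is buffered in memory and the output token vector is sized for the worst case of one 4-byte llama_token per input byte):

    // Back-of-envelope estimate of tokenizer memory for a 4 GiB training file.
    // ASSUMPTIONS (not checked against the code): the raw file is held in one
    // buffer, and token storage is reserved for one int32 token per input byte.
    #include <cstdio>

    int main() {
        const double GiB = 1024.0 * 1024.0 * 1024.0;

        double file_bytes   = 4.0 * GiB;         // concatenated training text
        double text_buffer  = file_bytes;        // whole file held in memory
        double token_buffer = file_bytes * 4.0;  // worst case: 4 bytes per token, 1 token per byte

        std::printf("text buffer  : %.1f GiB\n", text_buffer / GiB);
        std::printf("token buffer : %.1f GiB\n", token_buffer / GiB);
        std::printf("lower bound  : %.1f GiB (before sample index vectors or extra copies)\n",
                    (text_buffer + token_buffer) / GiB);
        return 0;
    }

Under those assumptions the floor is already around 20 GiB; any additional copy of the token data or per-sample index vectors on top of that would push the total toward the observed 32 GiB.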

Steps to Reproduce

  1. Get a large amount of data. I concatenated RedPajama-Data-1T-Sample. (A sketch for carving out a smaller test slice follows these steps.)
  2.  ./result/bin/train-text-from-scratch \
         --vocab-model ./models/ggml-vocab-llama.gguf \
         --ctx 256 --embd 256 --head 8 --layer 16 \
         --checkpoint-in  chk-shakespeare-256x16-LATEST.gguf \
         --checkpoint-out chk-shakespeare-256x16-ITERATION.gguf \
         --model-out ggml-shakespeare-256x16-f32-ITERATION.gguf \
         --train-data "/path/to/large/file.txt" \
         -b 16 --seed 1337 --adam-iter 256
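
Note: to check whether the blow-up scales with input size, the same command can be rerun against a smaller slice of the training file. A minimal sketch for producing one (file names are placeholders):

    // Write the first `slice_bytes` of the large training file to a smaller file,
    // so the same invocation can be retried at several input sizes.
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <fstream>
    #include <vector>

    int main() {
        const std::size_t slice_bytes = 256ull * 1024 * 1024;   // 256 MiB slice
        std::ifstream in("/path/to/large/file.txt", std::ios::binary);
        std::ofstream out("train-slice-256M.txt", std::ios::binary);
        if (!in || !out) { std::fprintf(stderr, "failed to open files\n"); return 1; }

        std::vector<char> buf(1 << 20);                          // 1 MiB copy buffer
        std::size_t copied = 0;
        while (copied < slice_bytes) {
            std::size_t want = std::min(buf.size(), slice_bytes - copied);
            in.read(buf.data(), static_cast<std::streamsize>(want));
            std::streamsize got = in.gcount();
            if (got <= 0) break;
            out.write(buf.data(), got);
            copied += static_cast<std::size_t>(got);
        }
        std::fprintf(stderr, "wrote %zu bytes\n", copied);
        return 0;
    }

If peak memory grows roughly in proportion to the slice size, that points at the per-byte/per-token buffers built during tokenization rather than at the model, optimizer, or compute buffers, whose sizes are all printed before the "tokenize training data" line.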

Failure Logs

main: seed: 1337
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
llama_model_loader: loaded meta data with 17 key-value pairs and 0 tensors from ./models/ggml-vocab-llama.gguf (version GGUF V3 (latest))
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  11:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  12:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  14:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  16:            tokenizer.ggml.unknown_token_id u32              = 0
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = all F32 (guessed)
llm_load_print_meta: model params     = 0.00 B
llm_load_print_meta: model size       = 0.00 MiB (-nan BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llama_model_load: vocab only - skipping tensors
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
main: init model
print_params: n_vocab: 32000
print_params: n_ctx:   256
print_params: n_embd:  256
print_params: n_head:  8
print_params: n_ff:    768
print_params: n_layer: 16
print_params: n_rot:   32
main: total train_iterations 0
main: seen train_samples     0
main: seen train_tokens      0
main: completed train_epochs 0
main: model_size = 240304416 bytes (229.2 MB)
main: opt_size  = 360288432 bytes (343.6 MB)
main: opt iter 0
main: input_size = 524304416 bytes (500.0 MB)
main: compute_size = 1639187552 bytes (1563.3 MB)
main: evaluation order = LEFT_TO_RIGHT
main: tokenize training data
./train.sh: line 10: 978503 Killed                  ./result/bin/train-text-from-scratch --vocab-model ./models/ggml-vocab-llama.gguf --ctx 256 --embd 256 --head 8 --layer 16 --checkpoint-in chk-shakespeare-256x16-LATEST.gguf --checkpoint-out chk-shakespeare-256x16-ITERATION.gguf --model-out ggml-shakespeare-256x16-f32-ITERATION.gguf --train-data "/data/RedPajama-Data-1T-Sample/data.txt" -b 16 --seed 1337 --adam-iter 256

segmond commented Jan 14, 2024

Yup, I'm getting an OOM too, and I know I have enough memory.

Ubuntu 22.04

./train-text-from-scratch --vocab-model ../models/ggml-vocab-llama.gguf --ctx 64 --embd 256 --head 8 --layer 16 --checkpoint-in chk-shakespeare-256x16-LATEST.gguf --checkpoint-out chk-shakespeare-256x16-ITERATION.gguf --model-out ggml-shakespeare-256x16-f32-ITERATION.gguf --train-data "shakespeare.txt" -t 6 -b 16 --seed 1 --adam-iter 256 --no-checkpointing -ngl 16

main: input_size = 131076128 bytes (125.0 MB)
main: compute_size = 140735674245184 bytes (134216000.0 MB)
main: evaluation order = RIGHT_TO_LEFT
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

Thread 1 "train-text-from" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352601600) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352601600) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140737352601600) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140737352601600, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007fffeec42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007fffeec287f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007fffef0a2b9e in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007fffef0ae20c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fffef0ae277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8 0x00007fffef0ae4d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00007fffef0a27ac in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0000555555579c04 in std::vector<unsigned char, std::allocator<unsigned char> >::resize(unsigned long) ()
#11 0x000055555556dd50 in main ()
(gdb) quit
A debugging session is active.

    Inferior 1 [process 6627] will be killed.

Quit anyway? (y or n) y
(base) seg@seg-HP-Z820:~/llama.cpp/training$ nvidia-smi
Sun Jan 14 17:58:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02             Driver Version: 535.146.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:05:00.0 Off |                  N/A |
|  0%   47C    P8              18W / 170W |     18MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1535      G   /usr/lib/xorg/Xorg                            9MiB |
|    0   N/A  N/A      1853      G   /usr/bin/gnome-shell                          3MiB |
+---------------------------------------------------------------------------------------+
(base) seg@seg-HP-Z820:~/llama.cpp/training$ free -m
               total        used        free      shared  buff/cache   available
Mem:          128825         957       45951           5       81916      126827
Swap:           2047           0        2047
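
One observation on the numbers above: the failing compute_size, 140735674245184 bytes, is 0x7fff93dee040 in hex, i.e. just below 2^47 and squarely in the range of typical x86-64 user-space (stack) addresses. That suggests the value fed into the std::vector::resize call in the backtrace is garbage or uninitialized rather than a genuinely computed size. A quick check:

    // Sanity-check the failing size reported above: it looks like a pointer-range
    // value (0x7fff........), not a plausible buffer size.
    #include <cstdint>
    #include <cstdio>

    int main() {
        const std::uint64_t compute_size = 140735674245184ull;  // from the failing log
        std::printf("hex        : 0x%llx\n", (unsigned long long) compute_size);
        std::printf("2^47       : %llu\n",  (unsigned long long) (1ull << 47));
        std::printf("below 2^47 : %s\n",    compute_size < (1ull << 47) ? "yes" : "no");
        return 0;
    }

That would be consistent with an uninitialized value leaking into the compute-buffer size estimation, rather than with a real 134 TB requirement.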


segmond commented Jan 14, 2024

You can train with an older tag.

It's definitely a memory issue: compare the ~134 TB of RAM it claims to need here against the ~669 MB it actually needs with the older tag (outputs below).
main: compute_size = 140735674245184 bytes (134216000.0 MB)

main: compute_size = 701759840 bytes (669.3 MB)
main: evaluation order = LEFT_TO_RIGHT
main: tokenize training data
tokenize_file: total number of samples: 27520
main: number of training tokens: 27584
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 768376 bytes (0.7 MB)
train_opt_callback: iter= 0 sample=1/

github-actions bot added the stale label on Mar 19, 2024

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 3, 2024