Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
Running train-text-from-scratch on a 4 GiB training file tokenizes the data and starts training within a reasonable memory budget.
Current Behavior
Running train-text-from-scratch on the same 4 GiB file is killed by the OOM killer after allocating over 32 GiB of memory.
Environment and Context
Tested at commit 5a7d312.
- Physical (or virtual) hardware you are using, e.g. for Linux:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
CPU family: 6
Model: 158
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 10
CPU(s) scaling MHz: 44%
CPU max MHz: 4500.0000
CPU min MHz: 800.0000
BogoMIPS: 5199.98
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 192 KiB (6 instances)
L1i: 192 KiB (6 instances)
L2: 1.5 MiB (6 instances)
L3: 12 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerabilities:
Gather data sampling: Mitigation; Microcode
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Meltdown: Mitigation; PTI
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Mitigation; IBRS
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Srbds: Mitigation; Microcode
Tsx async abort: Not affected
- Operating System, e.g. for Linux:
Linux chorusfruit 6.1.63 #1-NixOS SMP PREEMPT_DYNAMIC Mon Nov 20 10:52:19 UTC 2023 x86_64 GNU/Linux
- SDK version, e.g. for Linux: built via the project's Nix flake
Failure Information (for bugs)
There appears to be a bug (or unoptimized code) in the tokenization of the training data that causes it to allocate far more memory than necessary. The buffers reported at startup (model, optimizer, input, and compute) sum to only about 2.6 GiB, yet the process allocates over 32 GiB and is killed right after printing "tokenize training data".
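Back-of-envelope arithmetic is consistent with this. Here is a minimal sketch of the accounting, under my own assumptions (not confirmed from the code): the whole file is read into memory, one 4-byte llama_token slot is reserved per input byte as a worst case, and a vector resize briefly holds old and new storage at once:

```cpp
// Rough memory accounting for tokenizing a 4 GiB file under the assumed
// worst-case preallocation. All multipliers here are my own assumptions.
#include <cstdio>

int main() {
    const double GiB       = 1024.0 * 1024.0 * 1024.0;
    const double file_text = 4 * GiB;        // training file read into memory
    const double token_buf = file_text * 4;  // one 4-byte token slot per byte (worst case)
    const double resize    = token_buf * 2;  // old + new storage alive during a resize
    std::printf("text %.0f GiB + token buffer %.0f GiB, peak with resize ~%.0f GiB\n",
                file_text / GiB, token_buf / GiB, (file_text + resize) / GiB);
    return 0;
}
```

That alone already lands in the observed 32+ GiB range for a 4 GiB input, even though the real token count is typically several times smaller than the byte count.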
Steps to Reproduce
- Get a large amount of text data. I concatenated RedPajama-Data-1T-Sample into a single file.
- Run train-text-from-scratch on it:
./result/bin/train-text-from-scratch \
    --vocab-model ./models/ggml-vocab-llama.gguf \
    --ctx 256 --embd 256 --head 8 --layer 16 \
    --checkpoint-in chk-shakespeare-256x16-LATEST.gguf \
    --checkpoint-out chk-shakespeare-256x16-ITERATION.gguf \
    --model-out ggml-shakespeare-256x16-f32-ITERATION.gguf \
    --train-data "/path/to/large/file.txt" \
    -b 16 --seed 1337 --adam-iter 256
Failure Logs
main: seed: 1337
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
llama_model_loader: loaded meta data with 17 key-value pairs and 0 tensors from ./models/ggml-vocab-llama.gguf (version GGUF V3 (latest))
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: tokenizer.ggml.model str = llama
llama_model_loader: - kv 11: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 12: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 13: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 14: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 15: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 16: tokenizer.ggml.unknown_token_id u32 = 0
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = all F32 (guessed)
llm_load_print_meta: model params = 0.00 B
llm_load_print_meta: model size = 0.00 MiB (-nan BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llama_model_load: vocab only - skipping tensors
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
main: init model
print_params: n_vocab: 32000
print_params: n_ctx: 256
print_params: n_embd: 256
print_params: n_head: 8
print_params: n_ff: 768
print_params: n_layer: 16
print_params: n_rot: 32
main: total train_iterations 0
main: seen train_samples 0
main: seen train_tokens 0
main: completed train_epochs 0
main: model_size = 240304416 bytes (229.2 MB)
main: opt_size = 360288432 bytes (343.6 MB)
main: opt iter 0
main: input_size = 524304416 bytes (500.0 MB)
main: compute_size = 1639187552 bytes (1563.3 MB)
main: evaluation order = LEFT_TO_RIGHT
main: tokenize training data
./train.sh: line 10: 978503 Killed ./result/bin/train-text-from-scratch --vocab-model ./models/ggml-vocab-llama.gguf --ctx 256 --embd 256 --head 8 --layer 16 --checkpoint-in chk-shakespeare-256x16-LATEST.gguf --checkpoint-out chk-shakespeare-256x16-ITERATION.gguf --model-out ggml-shakespeare-256x16-f32-ITERATION.gguf --train-data "/data/RedPajama-Data-1T-Sample/data.txt" -b 16 --seed 1337 --adam-iter 256
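For what it's worth, a chunked tokenization pass would keep peak memory proportional to the actual token count instead of a worst-case preallocation. Below is a sketch under my own assumptions: tokenize_chunk is a hypothetical whitespace stand-in for the real tokenizer (only the allocation pattern matters), and a real version would also have to handle tokens that straddle chunk boundaries:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

using llama_token = int32_t;

// Hypothetical stand-in for the real tokenizer: emits one fake token per
// whitespace-separated word. Only the allocation pattern is of interest.
static void tokenize_chunk(const std::string & text, std::vector<llama_token> & out) {
    bool in_word = false;
    for (char c : text) {
        const bool ws = (c == ' ' || c == '\t' || c == '\n' || c == '\r');
        if (!ws && !in_word) {
            out.push_back((llama_token) out.size()); // fake token id
        }
        in_word = !ws;
    }
}

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    FILE * f = std::fopen(argv[1], "rb");
    if (!f) {
        std::perror("fopen");
        return 1;
    }
    const size_t chunk_size = 1u << 20; // 1 MiB read buffer
    std::vector<llama_token> tokens;
    std::string chunk(chunk_size, '\0');
    size_t n;
    // NB: a real implementation must carry over partial tokens that straddle
    // chunk boundaries; this sketch ignores that for brevity.
    while ((n = std::fread(&chunk[0], 1, chunk_size, f)) > 0) {
        chunk.resize(n);
        tokenize_chunk(chunk, tokens); // `tokens` grows only as needed
        chunk.resize(chunk_size);
    }
    std::fclose(f);
    std::printf("tokens: %zu (~%.2f GiB of token data)\n", tokens.size(),
                tokens.size() * sizeof(llama_token) / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```

With this shape, a 4 GiB input whose token count is, say, a quarter of its byte count would need roughly 4 GiB of token storage plus a 1 MiB scratch buffer, rather than tens of GiB.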