"RuntimeError: All the chunks should have been deleted." on non-Studio machine #1716

Open
rasbt opened this issue Sep 10, 2024 · 2 comments

rasbt commented Sep 10, 2024

Bug description

When running the pretraining example:

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

# 1) Download a tokenizer
litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

# 2) Pretrain the model
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-model

# 3) Test the model
litgpt chat out/custom-model/final

on a non-Studio machine, it fails with the following error.

litgpt pretrain EleutherAI/pythia-160m   --tokenizer_dir EleutherAI/pythia-160m   --data TextFiles   --data.train_data_path /home/sebastian/custom_texts/   --train.max_tokens 10_000   --out_dir out/custom-model
uvloop is not installed. Falling back to the default asyncio event loop.
/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/lightning/fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3.10 /home/sebastian/miniforge3/envs/litgp2/bin/litgp ...
Using bfloat16 Automatic Mixed Precision (AMP)
{'data': {'batch_size': 1,
          'max_seq_length': -1,
          'num_workers': 1,
          'seed': 42,
          'tokenizer': None,
          'train_data_path': PosixPath('/home/sebastian/custom_texts'),
          'val_data_path': None},
 'devices': 'auto',
 'eval': {'final_validation': True,
          'initial_validation': False,
          'interval': 1000,
          'max_iters': 100,
          'max_new_tokens': None},
 'initial_checkpoint_dir': None,
 'logger_name': 'tensorboard',
 'model_config': {'attention_logit_softcapping': None,
                  'attention_scores_scalar': None,
                  'bias': True,
                  'block_size': 2048,
                  'final_logit_softcapping': None,
                  'gelu_approximate': 'none',
                  'head_size': 64,
                  'hf_config': {'name': 'pythia-160m', 'org': 'EleutherAI'},
                  'intermediate_size': 3072,
                  'lm_head_bias': False,
                  'mlp_class_name': 'GptNeoxMLP',
                  'n_embd': 768,
                  'n_expert': 0,
                  'n_expert_per_token': 0,
                  'n_head': 12,
                  'n_layer': 12,
                  'n_query_groups': 12,
                  'name': 'pythia-160m',
                  'norm_class_name': 'LayerNorm',
                  'norm_eps': 1e-05,
                  'padded_vocab_size': 50304,
                  'padding_multiple': 128,
                  'parallel_residual': True,
                  'post_attention_norm': False,
                  'post_mlp_norm': False,
                  'rope_base': 10000,
                  'rope_condense_ratio': 1,
                  'rotary_percentage': 0.25,
                  'scale_embeddings': False,
                  'shared_attention_norm': False,
                  'sliding_window_layer_placing': None,
                  'sliding_window_size': None,
                  'vocab_size': 50254},
 'model_name': 'EleutherAI/pythia-160m',
 'num_nodes': 1,
 'optimizer': 'AdamW',
 'out_dir': PosixPath('out/custom-model'),
 'precision': None,
 'resume': False,
 'seed': 42,
 'tokenizer_dir': PosixPath('checkpoints/EleutherAI/pythia-160m'),
 'train': {'epochs': None,
           'global_batch_size': 512,
           'log_interval': 1,
           'lr_warmup_fraction': None,
           'lr_warmup_steps': 2000,
           'max_norm': 1.0,
           'max_seq_length': None,
           'max_steps': None,
           'max_tokens': 10000,
           'micro_batch_size': 4,
           'min_lr': 4e-05,
           'save_interval': 1000,
           'tie_embeddings': False}}
Seed set to 42
Time to instantiate model: 1.23 seconds.
Total parameters: 162,322,944
Create an account on https://lightning.ai/ to optimize your data faster using multiple nodes and large machines.
Storing the files under /home/sebastian/custom_texts/train
Setup started with fast_dev_run=False.
Setup finished in 0.001 seconds. Found 1 items to process.
Starting 1 workers with 1 items. The progress bar is only updated when a worker finishes.
Workers are ready ! Starting data processing...
Rank 0 inferred the following `['no_header_tensor:16']` data format. | 0/1 [00:00<?, ?it/s]
Worker 0 is terminating.
Worker 0 is done.
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.47it/s]
Workers are finished.
Traceback (most recent call last):
  File "/home/sebastian/miniforge3/envs/litgp2/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/home/sebastian/test-litgpt/litgpt/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 154, in setup
    main(
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 214, in main
    train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 409, in get_dataloaders
    data.prepare_data()
  File "/home/sebastian/test-litgpt/litgpt/litgpt/data/text_files.py", line 72, in prepare_data
    optimize(
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/functions.py", line 375, in optimize
    data_processor.run(
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 1016, in run
    result = data_recipe._done(len(user_items), self.delete_cached_files, self.output_dir)
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 736, in _done
    raise RuntimeError(f"All the chunks should have been deleted. Found {chunks}")
RuntimeError: All the chunks should have been deleted. Found ['chunk-0-0.bin']

What operating system are you using?

Linux

LitGPT Version

Latest LitGPT version on main, tested with both LitData 0.2.17 and the latest 0.2.26

rasbt added the bug label on Sep 10, 2024

rasbt commented Sep 10, 2024

Might be a LitData bug. Reported it here with a smaller self-contained example that doesn't use LitGPT: Lightning-AI/litdata#367
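
For reference, a standalone reproduction that only exercises litdata's optimize (the code path where the error is raised, per the traceback above) would look roughly like the sketch below. The actual example is in the linked issue; the fn, inputs, and output_dir here are placeholders rather than the exact code from Lightning-AI/litdata#367.

# Rough litdata-only sketch; fn, inputs, and output_dir are placeholders.
from litdata import optimize

def fn(index):
    # Trivial item producer; litgpt's TextFiles recipe tokenizes text files here.
    return index

if __name__ == "__main__":
    optimize(
        fn=fn,
        inputs=list(range(10)),
        output_dir="output_dir",
        chunk_bytes="64MB",
        num_workers=1,
    )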

maulikmadhavi commented

It worked for me with this setup after downgrading litdata:

litgpt = ">=0.5.1, <0.6"
litdata = "==0.2.10"
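
For completeness, one way to apply equivalent pins in a plain pip environment (assuming pip rather than the Poetry-style spec above) is:

pip install "litgpt>=0.5.1,<0.6" "litdata==0.2.10"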

Regards
Maulik
