"RuntimeError: All the chunks should have been deleted." on non-Studio machine #1716

Open
rasbt opened this issue Sep 10, 2024 · 2 comments

rasbt commented Sep 10, 2024

Bug description

When running the pretraining example:

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

# 1) Download a tokenizer
litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

# 2) Pretrain the model
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-model

# 3) Test the model
litgpt chat out/custom-model/final

on a non-Studio machine, it fails with the following error.

litgpt pretrain EleutherAI/pythia-160m   --tokenizer_dir EleutherAI/pythia-160m   --data TextFiles   --data.train_data_path /home/sebastian/custom_texts/   --train.max_tokens 10_000   --out_dir out/custom-model
uvloop is not installed. Falling back to the default asyncio event loop.
/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/lightning/fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3.10 /home/sebastian/miniforge3/envs/litgp2/bin/litgp ...
Using bfloat16 Automatic Mixed Precision (AMP)
{'data': {'batch_size': 1,
          'max_seq_length': -1,
          'num_workers': 1,
          'seed': 42,
          'tokenizer': None,
          'train_data_path': PosixPath('/home/sebastian/custom_texts'),
          'val_data_path': None},
 'devices': 'auto',
 'eval': {'final_validation': True,
          'initial_validation': False,
          'interval': 1000,
          'max_iters': 100,
          'max_new_tokens': None},
 'initial_checkpoint_dir': None,
 'logger_name': 'tensorboard',
 'model_config': {'attention_logit_softcapping': None,
                  'attention_scores_scalar': None,
                  'bias': True,
                  'block_size': 2048,
                  'final_logit_softcapping': None,
                  'gelu_approximate': 'none',
                  'head_size': 64,
                  'hf_config': {'name': 'pythia-160m', 'org': 'EleutherAI'},
                  'intermediate_size': 3072,
                  'lm_head_bias': False,
                  'mlp_class_name': 'GptNeoxMLP',
                  'n_embd': 768,
                  'n_expert': 0,
                  'n_expert_per_token': 0,
                  'n_head': 12,
                  'n_layer': 12,
                  'n_query_groups': 12,
                  'name': 'pythia-160m',
                  'norm_class_name': 'LayerNorm',
                  'norm_eps': 1e-05,
                  'padded_vocab_size': 50304,
                  'padding_multiple': 128,
                  'parallel_residual': True,
                  'post_attention_norm': False,
                  'post_mlp_norm': False,
                  'rope_base': 10000,
                  'rope_condense_ratio': 1,
                  'rotary_percentage': 0.25,
                  'scale_embeddings': False,
                  'shared_attention_norm': False,
                  'sliding_window_layer_placing': None,
                  'sliding_window_size': None,
                  'vocab_size': 50254},
 'model_name': 'EleutherAI/pythia-160m',
 'num_nodes': 1,
 'optimizer': 'AdamW',
 'out_dir': PosixPath('out/custom-model'),
 'precision': None,
 'resume': False,
 'seed': 42,
 'tokenizer_dir': PosixPath('checkpoints/EleutherAI/pythia-160m'),
 'train': {'epochs': None,
           'global_batch_size': 512,
           'log_interval': 1,
           'lr_warmup_fraction': None,
           'lr_warmup_steps': 2000,
           'max_norm': 1.0,
           'max_seq_length': None,
           'max_steps': None,
           'max_tokens': 10000,
           'micro_batch_size': 4,
           'min_lr': 4e-05,
           'save_interval': 1000,
           'tie_embeddings': False}}
Seed set to 42
Time to instantiate model: 1.23 seconds.
Total parameters: 162,322,944
Create an account on https://lightning.ai/ to optimize your data faster using multiple nodes and large machines.
Storing the files under /home/sebastian/custom_texts/train
Setup started with fast_dev_run=False.
Setup finished in 0.001 seconds. Found 1 items to process.
Starting 1 workers with 1 items. The progress bar is only updated when a worker finishes.
Workers are ready ! Starting data processing...
Rank 0 inferred the following `['no_header_tensor:16']` data format. | 0/1 [00:00<?, ?it/s]
Worker 0 is terminating.
Worker 0 is done.
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.47it/s]
Workers are finished.
Traceback (most recent call last):
  File "/home/sebastian/miniforge3/envs/litgp2/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/home/sebastian/test-litgpt/litgpt/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 154, in setup
    main(
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 214, in main
    train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 409, in get_dataloaders
    data.prepare_data()
  File "/home/sebastian/test-litgpt/litgpt/litgpt/data/text_files.py", line 72, in prepare_data
    optimize(
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/functions.py", line 375, in optimize
    data_processor.run(
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 1016, in run
    result = data_recipe._done(len(user_items), self.delete_cached_files, self.output_dir)
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 736, in _done
    raise RuntimeError(f"All the chunks should have been deleted. Found {chunks}")
RuntimeError: All the chunks should have been deleted. Found ['chunk-0-0.bin']

What operating system are you using?

Linux

LitGPT Version

Latest LitGPT version on main, tested with both LitData 0.2.17 and the latest 0.2.26

rasbt added the bug label on Sep 10, 2024

rasbt commented Sep 10, 2024

Might be a LitData bug. Reported it here with a smaller self-contained example that doesn't use LitGPT: Lightning-AI/litdata#367
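
For reference, a standalone reproduction that only exercises litdata's optimize (the code path where the error is raised, per the traceback above) would look roughly like the sketch below. The actual example is in the linked issue; the fn, inputs, and output_dir here are placeholders rather than the exact code from Lightning-AI/litdata#367.

# Rough litdata-only sketch; fn, inputs, and output_dir are placeholders.
from litdata import optimize

def fn(index):
    # Trivial item producer; litgpt's TextFiles recipe tokenizes text files here.
    return index

if __name__ == "__main__":
    optimize(
        fn=fn,
        inputs=list(range(10)),
        output_dir="output_dir",
        chunk_bytes="64MB",
        num_workers=1,
    )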

maulikmadhavi commented

It worked for me with this setup after downgrading litdata:

litgpt = ">=0.5.1, <0.6"
litdata = "==0.2.10"
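
For completeness, one way to apply equivalent pins in a plain pip environment (assuming pip rather than the Poetry-style spec above) is:

pip install "litgpt>=0.5.1,<0.6" "litdata==0.2.10"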

Regards
Maulik
