Accelerate Error #2216

Closed
nickjtay opened this issue Dec 5, 2023 · 13 comments
Labels
solved: The bug or feature request has been solved, but the issue is still opened

Comments

@nickjtay

nickjtay commented Dec 5, 2023

My notebook was working until I duplicated it to test a small revision that required bitsandbytes. I installed bitsandbytes into the same virtual environment, which I wouldn't expect to cause any issues, but when I went back to the original notebook it no longer ran successfully, and I'm now getting the error below. I have since uninstalled bitsandbytes and restarted the kernel. I'm not sure what happened, and I can't find anyone else reporting this issue on Stack Overflow or elsewhere.

LoRA-multi-gpu-working.zip

Error:

---------------------------------------------------------------------------
ProcessRaisedException                    Traceback (most recent call last)
File ~/Projects/llmtest1/lib/python3.10/site-packages/accelerate/launchers.py:186, in notebook_launcher(function, args, num_processes, mixed_precision, use_port, master_addr, node_rank, num_nodes)
    185 try:
--> 186     start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
    187 except ProcessRaisedException as e:

File ~/Projects/llmtest1/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:202, in start_processes(fn, args, nprocs, join, daemon, start_method)
    201 # Loop on join until it returns True or raises an exception.
--> 202 while not context.join():
    203     pass

File ~/Projects/llmtest1/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:163, in ProcessContext.join(self, timeout)
    162 msg += original_trace
--> 163 raise ProcessRaisedException(msg, error_index, failed_process.pid)

ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/accelerate/utils/launch.py", line 562, in __call__
    self.launcher(*args)
  File "/tmp/ipykernel_5874/1694138925.py", line 8, in training_loop
    accelerator = Accelerator(mixed_precision=mixed_precision)
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/accelerate/accelerator.py", line 371, in __init__
    self.state = AcceleratorState(
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/accelerate/state.py", line 758, in __init__
    PartialState(cpu, **kwargs)
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/accelerate/state.py", line 218, in __init__
    if not check_cuda_p2p_ib_support():
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/accelerate/utils/environment.py", line 71, in check_cuda_p2p_ib_support
    device_name = torch.cuda.get_device_name()
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/torch/cuda/__init__.py", line 419, in get_device_name
    return get_device_properties(device).name
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/torch/cuda/__init__.py", line 449, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/torch/cuda/__init__.py", line 284, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method


The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[3], line 2
      1 args = ("fp16", 42, 64)
----> 2 notebook_launcher(training_loop, args, num_processes=2)

File ~/Projects/llmtest1/lib/python3.10/site-packages/accelerate/launchers.py:189, in notebook_launcher(function, args, num_processes, mixed_precision, use_port, master_addr, node_rank, num_nodes)
    187 except ProcessRaisedException as e:
    188     if "Cannot re-initialize CUDA in forked subprocess" in e.args[0]:
--> 189         raise RuntimeError(
    190             "CUDA has been initialized before the `notebook_launcher` could create a forked subprocess. "
    191             "This likely stems from an outside import causing issues once the `notebook_launcher()` is called. "
    192             "Please review your imports and test them when running the `notebook_launcher()` to identify "
    193             "which one is problematic and causing CUDA to be initialized."
    194         ) from e
    195     else:
    196         raise RuntimeError(f"An issue was found when launching the training: {e}") from e

RuntimeError: CUDA has been initialized before the `notebook_launcher` could create a forked subprocess. This likely stems from an outside import causing issues once the `notebook_launcher()` is called. Please review your imports and test them when running the `notebook_launcher()` to identify which one is problematic and causing CUDA to be initialized.
@muellerzr
Collaborator

Again, please state info about your env as I asked in the other issue.

bitsandbytes and other similar libraries will initialize CUDA on import. You need to hide the import inside your training function so it only gets imported after the notebook launcher has forked its subprocesses. Later versions of accelerate will warn if this happens.
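
A minimal sketch of that pattern, using a toy training function (the names and the bitsandbytes usage here are placeholders, not taken from the notebook in this issue):

from accelerate import notebook_launcher  # safe at the top level: importing it does not initialize CUDA

def training_loop():
    # Deferred imports: these run inside each forked worker process,
    # so any CUDA initialization they trigger happens after the fork.
    import bitsandbytes as bnb  # would initialize CUDA if imported at the notebook's top level
    from accelerate import Accelerator

    accelerator = Accelerator()
    accelerator.print(f"running on {accelerator.device}, bitsandbytes {bnb.__version__}")

notebook_launcher(training_loop, num_processes=2)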

@nickjtay
Author

nickjtay commented Dec 5, 2023

Thank you, that makes sense, but I had already removed bitsandbytes and rebooted. I'm also not sure why it says there is no default config, because I walked through the config wizard and had accelerate working successfully.

  • Accelerate version: 0.25.0
  • Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Numpy version: 1.24.2
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • System RAM: 62.69 GB
  • GPU type: NVIDIA GeForce RTX 3060
  • Accelerate default config:
    Not found

@nickjtay
Author

nickjtay commented Dec 6, 2023

I reconfigured accelerate, but I'm still getting the same error.

  • Accelerate version: 0.25.0
  • Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Numpy version: 1.24.2
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • System RAM: 62.69 GB
  • GPU type: NVIDIA GeForce RTX 3060
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: MULTI_GPU
    • mixed_precision: fp8
    • use_cpu: False
    • debug: True
    • num_processes: 2
    • machine_rank: 0
    • num_machines: 1
    • gpu_ids: 0,1
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env: []

@muellerzr
Collaborator

Can you try installing accelerate from main? pip install git+https://github.com/huggingface/accelerate

@nickjtay
Author

nickjtay commented Dec 6, 2023

Still experiencing the error. I rebooted as well. I'm also not running any other notebooks or processes which would use CUDA. I don't see anything in the code that would initialize CUDA before the notebook_launcher() function runs, either.

  • Accelerate version: 0.25.0.dev0
  • Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Numpy version: 1.24.2
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • System RAM: 62.69 GB
  • GPU type: NVIDIA GeForce RTX 3060
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: MULTI_GPU
    • mixed_precision: fp8
    • use_cpu: False
    • debug: True
    • num_processes: 2
    • machine_rank: 0
    • num_machines: 1
    • gpu_ids: 0,1
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env: []

@geronimi73
Contributor

RuntimeError: CUDA has been initialized before the notebook_launcher could create a forked subprocess. This likely stems from an outside import causing issues once the notebook_launcher() is called. Please review your imports and test them when running the notebook_launcher() to identify which one is problematic and causing CUDA to be initialized.

from accelerate import Accelerator initializes CUDA, you have to move it into training_loop.
If the error persists, move all the other imports into training_loop as well, except for from accelerate import notebook_launcher.

@muellerzr
Collaborator

muellerzr commented Dec 6, 2023

from accelerate import Accelerator initializes CUDA, you have to move it into training_loop.

That shouldn't be the case/shouldn't be happening 👀

I ran the notebook launcher just fine. Can you give me the output of pip freeze, @geronimi73?

All the imports in accelerate are very CUDA-careful for exactly this reason.

@geronimi73
Contributor

geronimi73 commented Dec 6, 2023

nevermind!

from accelerate import Accelerator initializes CUDA, you have to move it into training_loop.

This was definitely the case with accelerate-0.21.0; after a pip update to accelerate-0.25.0, it's gone.

Sorry for the distraction.

Edit: I checked @nickjtay's code in my notebook, and it seems that the peft import initializes CUDA:

import torch
display(torch.cuda.is_initialized())
from peft import (
    get_peft_config,
    get_peft_model,
    get_peft_model_state_dict,
    set_peft_model_state_dict,
    LoraConfig,
    PeftType,
    PrefixTuningConfig,
    PromptEncoderConfig,
)
display(torch.cuda.is_initialized())

Output:

False
True

freeze.txt

@muellerzr
Collaborator

Yes, IIRC I opened an issue on the peft side for this. There's nothing we can do here; they have to fix it on their end :) (So just import it inside your training function.)

@BenjaminBossan
Member

Yes, we should revisit this in PEFT!

@muellerzr added the solved label (The bug or feature request has been solved, but the issue is still opened) on Dec 6, 2023
@nickjtay
Author

nickjtay commented Dec 6, 2023

Following the advice above, I moved the accelerate imports into the training loop and rebooted my machine to clear the GPU memory, but I am still getting the error message. Should I be moving peft into the loop as well?

import argparse
import os

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from peft import (
    get_peft_config,
    get_peft_model,
    get_peft_model_state_dict,
    set_peft_model_state_dict,
    LoraConfig,
    PeftType,
    PrefixTuningConfig,
    PromptEncoderConfig,
)

import evaluate
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
from tqdm import tqdm
from accelerate import notebook_launcher    

def training_loop(mixed_precision="fp16", seed:int=42, batch_size:int=32):
    from accelerate import Accelerator, DistributedType
    from accelerate.utils import set_seed
    
    set_seed(seed)
    model_name_or_path = "google/flan-t5-small"
    task = "mrpc"
    
    accelerator = Accelerator(mixed_precision=mixed_precision)
    
    if any(k in model_name_or_path for k in ("gpt", "opt", "bloom")):
        padding_side = "left"
    else:
        padding_side = "right"

    def collate_fn(examples):
        max_length = 128 if accelerator.distributed_type == DistributedType.TPU else None
        if accelerator.mixed_precision == "fp8":
            pad_to_multiple_of = 16
        elif accelerator.mixed_precision != "no":
            pad_to_multiple_of = 8
        else:
            pad_to_multiple_of = None

        return tokenizer.pad(
            examples,
            padding="longest",
            max_length=max_length,
            pad_to_multiple_of=pad_to_multiple_of,
            return_tensors="pt",
        )        
        
    def tokenize_function(examples):
        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
        return outputs

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side=padding_side)
    
    if getattr(tokenizer, "pad_token_id") is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id

    datasets = load_dataset("glue", task)
    metric = evaluate.load("glue", task)

    tokenized_datasets = datasets.map(
        tokenize_function,
        batched=True,
        remove_columns=["idx", "sentence1", "sentence2"],
    )

    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
    
    train_dataloader = DataLoader(
        tokenized_datasets["train"], 
        shuffle=True, 
        collate_fn=collate_fn, 
        batch_size=batch_size)
    eval_dataloader = DataLoader(
        tokenized_datasets["validation"], 
        shuffle=False, 
        collate_fn=collate_fn, 
        batch_size=batch_size)
    
    peft_type = PeftType.LORA
    num_epochs = 5
    
    peft_config = LoraConfig(
        task_type="SEQ_CLS", 
        inference_mode=False, 
        r=8, lora_alpha=16, 
        lora_dropout=0.1)
    lr = 3e-4
    
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name_or_path, 
        return_dict=True)
    model = model.to(accelerator.device)
    model = get_peft_model(model, peft_config)
    
    optimizer = AdamW(params=model.parameters(), lr=lr)
    
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=0.06 * (len(train_dataloader) * num_epochs),
        num_training_steps=(len(train_dataloader) * num_epochs),
    )    
    
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(tqdm(train_dataloader)):
            batch.to(accelerator.device)
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        model.eval()
        for step, batch in enumerate(tqdm(eval_dataloader)):
            batch.to(accelerator.device)
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)
            predictions, references = predictions, batch["labels"]
            metric.add_batch(
                predictions=predictions,
                references=references,
            )

Error Message:

---------------------------------------------------------------------------
ProcessRaisedException                    Traceback (most recent call last)
File ~/Projects/llmtest1/lib/python3.10/site-packages/accelerate/launchers.py:186, in notebook_launcher(function, args, num_processes, mixed_precision, use_port, master_addr, node_rank, num_nodes)
    185 try:
--> 186     start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
    187 except ProcessRaisedException as e:

File ~/Projects/llmtest1/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:202, in start_processes(fn, args, nprocs, join, daemon, start_method)
    201 # Loop on join until it returns True or raises an exception.
--> 202 while not context.join():
    203     pass

File ~/Projects/llmtest1/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:163, in ProcessContext.join(self, timeout)
    162 msg += original_trace
--> 163 raise ProcessRaisedException(msg, error_index, failed_process.pid)

ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/accelerate/utils/launch.py", line 562, in __call__
    self.launcher(*args)
  File "/tmp/ipykernel_4901/2016609644.py", line 11, in training_loop
    accelerator = Accelerator(mixed_precision=mixed_precision)
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/accelerate/accelerator.py", line 371, in __init__
    self.state = AcceleratorState(
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/accelerate/state.py", line 758, in __init__
    PartialState(cpu, **kwargs)
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/accelerate/state.py", line 218, in __init__
    if not check_cuda_p2p_ib_support():
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/accelerate/utils/environment.py", line 71, in check_cuda_p2p_ib_support
    device_name = torch.cuda.get_device_name()
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/torch/cuda/__init__.py", line 419, in get_device_name
    return get_device_properties(device).name
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/torch/cuda/__init__.py", line 449, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/nickjtay/Projects/llmtest1/lib/python3.10/site-packages/torch/cuda/__init__.py", line 284, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method


The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[3], line 2
      1 args = ("fp16", 42, 64)
----> 2 notebook_launcher(training_loop, args, num_processes=2)

File ~/Projects/llmtest1/lib/python3.10/site-packages/accelerate/launchers.py:189, in notebook_launcher(function, args, num_processes, mixed_precision, use_port, master_addr, node_rank, num_nodes)
    187 except ProcessRaisedException as e:
    188     if "Cannot re-initialize CUDA in forked subprocess" in e.args[0]:
--> 189         raise RuntimeError(
    190             "CUDA has been initialized before the `notebook_launcher` could create a forked subprocess. "
    191             "This likely stems from an outside import causing issues once the `notebook_launcher()` is called. "
    192             "Please review your imports and test them when running the `notebook_launcher()` to identify "
    193             "which one is problematic and causing CUDA to be initialized."
    194         ) from e
    195     else:
    196         raise RuntimeError(f"An issue was found when launching the training: {e}") from e

RuntimeError: CUDA has been initialized before the `notebook_launcher` could create a forked subprocess. This likely stems from an outside import causing issues once the `notebook_launcher()` is called. Please review your imports and test them when running the `notebook_launcher()` to identify which one is problematic and causing CUDA to be initialized.
  • Accelerate version: 0.25.0.dev0
  • Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Numpy version: 1.24.2
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • System RAM: 62.69 GB
  • GPU type: NVIDIA GeForce RTX 3060
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: MULTI_GPU
    • mixed_precision: fp8
    • use_cpu: False
    • debug: True
    • num_processes: 2
    • machine_rank: 0
    • num_machines: 1
    • gpu_ids: 0,1
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env: []

@BenjaminBossan
Member

Should I be moving peft to the loop as well?

Yes, please test that as well and let us know if it solves the problem.

@nickjtay
Author

nickjtay commented Dec 6, 2023

Nevermind, I see: moving both modules into the loop solved it. Thank you!
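
For reference, a minimal sketch of the import layout that resolved this (the training body is the code posted above, elided here):

from accelerate import notebook_launcher  # the only accelerate import kept at the notebook's top level

def training_loop(mixed_precision="fp16", seed: int = 42, batch_size: int = 32):
    # Anything that can initialize CUDA is imported here, inside the forked worker.
    from accelerate import Accelerator, DistributedType
    from accelerate.utils import set_seed
    from peft import LoraConfig, PeftType, get_peft_model

    set_seed(seed)
    accelerator = Accelerator(mixed_precision=mixed_precision)
    # ... build the tokenizer, datasets, model, and optimizer and run the loop as in the code above ...

notebook_launcher(training_loop, ("fp16", 42, 64), num_processes=2)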
