IsADirectoryError when training with tqdm enabled for trainer #34766

liougehooa · 2024-11-18T01:43:38Z

System Info

Error info:

**IsADirectoryError**: [Errno 21] Is a directory: '\n    <div>\n      \n      <progress value=\'2\' max=\'108\' style=\'width:300px; height:20px; vertical-align: middle;\'></progress>\n      [  2/108 : < :, Epoch 0.04/4]\n    </div>\n    <table border="1" class="dataframe">\n  <thead>\n <tr style="text-align: left;">\n      <th>Step</th>\n      <th>Training Loss</th>\n      <th>Validation Loss</th>\n    </tr>\n  </thead>\n  <tbody>\n  </tbody>\n</table><p>'

Code:

training_args = transformers.TrainingArguments(
    num_train_epochs=4,                         # Number of training epochs
    per_device_train_batch_size=batch_size,      # Batch size for training
    per_device_eval_batch_size=batch_size,       # Batch size for evaluation
    gradient_accumulation_steps=2,               # Number of steps to accumulate gradients before updating
    gradient_checkpointing=True,                 # Enable gradient checkpointing to save memory
    do_eval=True,                                # Perform evaluation during training
    save_total_limit=2,                          # Limit the total number of saved checkpoints
    evaluation_strategy="steps",                 # Evaluation strategy to use (here, at each specified number of steps)
    save_strategy="steps",                       # Save checkpoints at each specified number of steps
    save_steps=10,                               # Number of steps between each checkpoint save
    eval_steps=10,                               # Number of steps between each evaluation
    max_grad_norm=1,                             # Maximum gradient norm for clipping
    warmup_ratio=0.1,                            # Warmup ratio for learning rate schedule
    weight_decay=0.001,                          # Regularization technique to prevent overfitting
    # fp16=True,                                 # Enable mixed precision training with fp16 (enable it if Ampere architecture is unavailable)
    bf16=True,                                   # Enable mixed precision training with bf16
    logging_steps=10,                            # Number of steps between each log
    output_dir="outputs",                        # Directory to save the model outputs and checkpoints
    optim="adamw_torch",                         # Optimizer to use (AdamW with PyTorch)
    learning_rate=5e-5,                          # Learning rate for the optimizer
    lr_scheduler_type="linear",                  # Learning rate scheduler type: linear
    load_best_model_at_end=True,                 # Load the best model found during training at the end
    metric_for_best_model="rouge",               # Metric used to determine the best model
    greater_is_better=True,                      # Indicates if a higher metric score is better
    push_to_hub=False,                           # Whether to push the model to Hugging Face Hub
    run_name="finetuning",   # Name of the run for experiment tracking
    report_to="wandb"                            # For experiment tracking (login to Weights & Biases needed)
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Env info:
Jupyter version:

!jupyter --version
IPython          : 8.27.0
ipykernel        : 6.29.5
ipywidgets       : 7.7.1
jupyter_client   : 7.4.9
jupyter_core     : 5.7.2
jupyter_server   : 2.14.2
jupyterlab       : 4.0.11
nbclient         : 0.10.0
nbconvert        : 7.16.4
nbformat         : 5.10.4
notebook         : 6.5.7
qtconsole        : 5.6.0
traitlets        : 5.14.3

Python: 3.10.11
jupyter lab: 4.0.11
transformers: 4.45.2

Detailed errors:

IsADirectoryError                         Traceback (most recent call last)
Cell In[28], line 1
----> 1 trainer.train()

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.10/site-packages/transformers/trainer.py:2052, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   2050         hf_hub_utils.enable_progress_bars()
   2051 else:
-> 2052     return inner_training_loop(
   2053         args=args,
   2054         resume_from_checkpoint=resume_from_checkpoint,
   2055         trial=trial,
   2056         ignore_keys_for_eval=ignore_keys_for_eval,
   2057     )

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.10/site-packages/transformers/trainer.py:2465, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2463     self.state.global_step += 1
   2464     self.state.epoch = epoch + (step + 1 + steps_skipped) / steps_in_epoch
-> 2465     self.control = self.callback_handler.on_step_end(args, self.state, self.control)
   2467     self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
   2468 else:

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.10/site-packages/transformers/trainer_callback.py:494, in CallbackHandler.on_step_end(self, args, state, control)
    493 def on_step_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl):
--> 494     return self.call_event("on_step_end", args, state, control)

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.10/site-packages/transformers/trainer_callback.py:516, in CallbackHandler.call_event(self, event, args, state, control, **kwargs)
    514 def call_event(self, event, args, state, control, **kwargs):
    515     for callback in self.callbacks:
--> 516         result = getattr(callback, event)(
    517             args,
    518             state,
    519             control,
    520             model=self.model,
    521             tokenizer=self.tokenizer,
    522             optimizer=self.optimizer,
    523             lr_scheduler=self.lr_scheduler,
    524             train_dataloader=self.train_dataloader,
    525             eval_dataloader=self.eval_dataloader,
    526             **kwargs,
    527         )
    528         # A Callback can skip the return of `control` if it doesn't change it.
    529         if result is not None:

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.10/site-packages/transformers/utils/notebook.py:307, in NotebookProgressCallback.on_step_end(self, args, state, control, **kwargs)
    305 def on_step_end(self, args, state, control, **kwargs):
    306     epoch = int(state.epoch) if int(state.epoch) == state.epoch else f"{state.epoch:.2f}"
--> 307     self.training_tracker.update(
    308         state.global_step + 1,
    309         comment=f"Epoch {epoch}/{state.num_train_epochs}",
    310         force_update=self._force_next_update,
    311     )
    312     self._force_next_update = False

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.10/site-packages/transformers/utils/notebook.py:143, in NotebookProgressBar.update(self, value, force_update, comment)
    141     self.first_calls = self.warmup
    142     self.wait_for = 1
--> 143     self.update_bar(value)
    144 elif value <= self.last_value and not force_update:
    145     return

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.10/site-packages/transformers/utils/notebook.py:188, in NotebookProgressBar.update_bar(self, value, comment)
    185         self.label += f", {1/self.average_time_per_item:.2f} it/s"
    187 self.label += "]" if self.comment is None or len(self.comment) == 0 else f", {self.comment}]"
--> 188 self.display()

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.10/site-packages/transformers/utils/notebook.py:229, in NotebookTrainingTracker.display(self)
    227     self.html_code += self.child_bar.html_code
    228 if self.output is None:
--> 229     self.output = disp.display(disp.HTML(self.html_code), display_id=True)
    230 else:
    231     self.output.update(disp.HTML(self.html_code))

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.10/site-packages/IPython/core/display.py:432, in HTML.__init__(self, data, url, filename, metadata)
    430 if warn():
    431     warnings.warn("Consider using IPython.display.IFrame instead")
--> 432 super(HTML, self).__init__(data=data, url=url, filename=filename, metadata=metadata)

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.10/site-packages/IPython/core/display.py:327, in DisplayObject.__init__(self, data, url, filename, metadata)
    324 elif self.metadata is None:
    325     self.metadata = {}
--> 327 self.reload()
    328 self._check_data()

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.10/site-packages/IPython/core/display.py:353, in DisplayObject.reload(self)
    351 if self.filename is not None:
    352     encoding = None if "b" in self._read_flags else "utf-8"
--> 353     with open(self.filename, self._read_flags, encoding=encoding) as f:
    354         self.data = f.read()
    355 elif self.url is not None:
    356     # Deferred import

IsADirectoryError: [Errno 21] Is a directory: '\n    <div>\n      \n      <progress value=\'2\' max=\'108\' style=\'width:300px; height:20px; vertical-align: middle;\'></progress>\n      [  2/108 : < :, Epoch 0.04/4]\n    </div>\n    <table border="1" class="dataframe">\n  <thead>\n <tr style="text-align: left;">\n      <th>Step</th>\n      <th>Training Loss</th>\n      <th>Validation Loss</th>\n    </tr>\n  </thead>\n  <tbody>\n  </tbody>\n</table><p>'

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

This can be reproduced by the following code:

import time
import transformers
from transformers.utils.notebook import NotebookProgressBar

pbar = NotebookProgressBar(100)
for val in range(100):
    pbar.update(val)
    time.sleep(0.07)
pbar.update(100)
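
The traceback above ends in IPython's DisplayObject.reload() calling open() on the progress bar's HTML, so the markup string was treated as a filename at some point. A minimal sketch for checking this on an affected machine follows. It assumes (this is not confirmed anywhere in this thread) that the local filesystem reports the HTML string as an existing path; the helper name looks_like_path and the shortened sample markup are illustrative only.

```python
import os


def looks_like_path(text: str) -> dict:
    """Report how the local filesystem treats an arbitrary string."""
    return {
        "exists": os.path.exists(text),  # True would mean the string resolves to a path
        "is_dir": os.path.isdir(text),   # True would match the IsADirectoryError above
    }


# Shortened stand-in for the markup shown in the error message (illustrative).
SAMPLE_HTML = "\n    <div>\n      <progress value='2' max='108'></progress>\n    </div>"

if __name__ == "__main__":
    print("cwd:", os.getcwd())
    print("filesystem view of the HTML string:", looks_like_path(SAMPLE_HTML))

    try:
        from IPython import display as disp
    except ImportError:
        print("IPython is not installed; skipping the display check")
    else:
        try:
            # Same constructor call that fails in the traceback.
            disp.HTML(SAMPLE_HTML)
            print("IPython.display.HTML accepted the string as data")
        except OSError as exc:  # IsADirectoryError is a subclass of OSError
            print("IPython.display.HTML failed:", repr(exc))
```

If both values come back False and the HTML call succeeds, the display layer behaves normally on that machine and the cause is likely elsewhere.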

Expected behavior

Training runs with the progress bar being updated.


LysandreJik · 2024-11-19T10:20:36Z

Hello! You're running this in a notebook?

akshay9 · 2024-11-20T12:25:27Z

+1 Facing the same issue

liougehooa · 2024-11-20T15:10:29Z

Yes

Rocketknight1 · 2024-11-20T16:22:40Z

Seems like this is a real issue - if anyone wants to investigate this and maybe file a PR, feel free to take it!

Knight7561 · 2024-11-21T19:26:13Z

Would the same issue be reproducible on Colab too? I tried reproducing it in a notebook and it worked without an error. Maybe something is missing from the steps to reproduce, or it is a path error where the tracker is unable to update the progress, which might have happened only on your setup. Please share further details so that others can help.

Kulloa24 · 2024-11-22T10:54:52Z

Adjust the time.sleep value to control the speed of the progress bar.

For more complex progress bars, explore the customization options provided by the NotebookProgressBar or other libraries.

hsilva664 · 2024-11-23T04:51:13Z

Hello, I've tried reproducing this issue but could not get the reported error.

Screen recording: https://github.com/user-attachments/assets/1631ffcf-5599-44c4-a0b0-7893e14c9bb7

I tried:

- Creating a Python 3.12-based venv and running the minimal code provided above (the one that only runs the tqdm and sleep) in a Jupyter notebook. The dependencies of the environment were installed from a fork of this repo with pip install -e ".[dev-torch]". The bar displays correctly.
- Creating a minimal example calling the train method of transformers.Trainer like above. The bar also displayed correctly in the Jupyter notebook.
- Repeating the steps with a Python 3.10-based venv with dependencies installed via pip install -e ".[quality]". I also created a requirements.txt to install the exact versions of IPython, ipykernel, ipywidgets, etc. reported above. The only difference is the transformers version: mine is 4.47.0.dev0 and the reported one is 4.45.2.

This is my first attempt to contribute here, so please do tell if I should have done something else.

0xjuju · 2024-11-23T19:22:45Z

It seems as if this issue is more platform-specific. I was also unable to reproduce the issue using the following Dockerfile configuration. From the traceback, it seems that Azure and Anaconda are used. Maybe a more specific environment setup is needed.

FROM python:3.10.11-slim

WORKDIR /app

# Install system dependencies for Jupyter and other needed tools
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    git \
    build-essential && \
    rm -rf /var/lib/apt/lists/*

# Install Python dependencies including Jupyter and the Transformers library
RUN pip install --upgrade pip && \
    pip install jupyterlab==4.0.11 \
    transformers==4.45.2 \
    torch \
    ipywidgets==7.7.1 \
    ipython==8.27.0 \
    jupyter_client==7.4.9 \
    jupyter_core==5.7.2 \
    nbclient==0.10.0 \
    nbconvert==7.16.4 \
    nbformat==5.10.4 \
    notebook==6.5.7 \
    qtconsole==5.6.0 \
    traitlets==5.14.3 \
    wandb

# Set environment variables for Jupyter Notebook
ENV JUPYTER_ENABLE_LAB=yes
EXPOSE 8888

COPY . .

CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--allow-root"]

liougehooa · 2024-11-25T06:02:58Z

I tested in Colab, and it seems to work there.
I also tested with Azure ML Notebooks: it fails in some regions (us east2, us west 3), but it works in swedencentral and some other regions in Europe.

I agree this is more platform-specific.

Knight7561 · 2024-11-26T22:10:29Z

+1 for this being a platform-specific issue. More details are needed for reproducibility.
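
As a starting point for those details, here is a small sketch that collects the interpreter, platform, working directory, and package versions in one place, so that a failing region and a working one can be compared side by side. The package list and the function names are illustrative choices, not something prescribed in this thread.

```python
import os
import platform
import sys
from importlib import metadata

# Distribution names to report; adjust to whatever is relevant (illustrative list).
PACKAGES = ("transformers", "ipython", "ipykernel", "ipywidgets", "jupyterlab", "notebook")


def package_version(name: str) -> str:
    """Return the installed version of a distribution, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages=PACKAGES) -> dict:
    """Gather basic facts about the running environment."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "cwd": os.getcwd(),  # the notebook's working directory may differ per setup
        "packages": {name: package_version(name) for name in packages},
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this in a notebook cell on both a failing and a working compute instance and diffing the output would show whether the two environments really match.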

sahillihas · 2024-12-12T17:06:50Z

This is not a bug but a usage issue, likely caused by a misconfiguration in your Jupyter Notebook environment or an issue with the HTML-based NotebookProgressBar.
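
If the HTML-based progress bar is the part that fails, one possible way to keep training running in the meantime is to swap it for the tqdm-based callback. This is a sketch under the assumption that ProgressCallback does not go through the same IPython HTML code path, which has not been verified in this thread; use_console_progress_bar is a hypothetical helper name.

```python
def use_console_progress_bar(trainer):
    """Swap the HTML notebook progress callback for the tqdm-based one.

    `trainer` is expected to be a transformers.Trainer, or any object that
    exposes the same remove_callback/add_callback methods.
    """
    # Imported lazily so this snippet can be loaded without transformers installed.
    from transformers.trainer_callback import ProgressCallback
    from transformers.utils.notebook import NotebookProgressCallback

    # Drop the callback that renders through IPython.display.HTML.
    trainer.remove_callback(NotebookProgressCallback)
    # Register the tqdm-based callback in its place.
    trainer.add_callback(ProgressCallback)
    return trainer
```

Call use_console_progress_bar(trainer) before trainer.train(). Passing disable_tqdm=True to TrainingArguments is a blunter alternative that turns the progress bars off altogether.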

liougehooa added the bug label Nov 18, 2024

Rocketknight1 added the Good First Issue label Nov 20, 2024
