Using a large dataset from a TFRecord, the model doesn't train at all #2444
Comments
Hi @emmanuelol, could you please provide a dummy dataset to reproduce this issue?
Hi, one of the datasets where this issue is present is the BDD100k dataset. I downloaded it and converted it to TFRecord; in the past, I have used TFRecords like this to train models with the TensorFlow Object Detection API without issues.
https://www.kaggle.com/datasets/pa928human/bdd100k-multiclass-tfrecords-val-part-1
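For anyone trying to reproduce this with the file above, it can help to first confirm how many records a downloaded shard actually contains. The following is only a rough sketch, not something from the original report: it assumes an uncompressed TFRecord file, relies on the standard TFRecord framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC), and does not verify the CRCs. The helper name `count_tfrecord_records` and the path in the usage note are hypothetical.

```python
import os
import struct


def count_tfrecord_records(path):
    """Count records in an uncompressed TFRecord file.

    Hypothetical helper: walks the record framing only. Payloads are
    skipped and CRC checksums are not validated.
    """
    size = os.path.getsize(path)
    count = 0
    with open(path, "rb") as f:
        while f.tell() < size:
            header = f.read(8)  # little-endian uint64 payload length
            if len(header) < 8:
                raise ValueError("truncated length header")
            (length,) = struct.unpack("<Q", header)
            # Skip the length CRC (4 bytes), the payload, and the payload CRC (4 bytes).
            f.seek(4 + length + 4, os.SEEK_CUR)
            if f.tell() > size:
                raise ValueError("truncated record at end of file")
            count += 1
    return count
```

Calling `count_tfrecord_records("val-part-1.tfrecord")` (an assumed file name) returns the number of serialized examples in that shard, which can then be compared against the expected image count before starting a long training run.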
Current Behavior:
I'm using TensorFlow nightly as the backend for KerasCV nightly, running the TensorFlow Docker image from Docker Hub. I'm trying to use a larger dataset for object detection: almost 70K images for training and 10K for testing. For the past two weeks, I've been trying to train RetinaNet and YOLOV8 with this dataset, but as soon as I start the first epoch, I get the following message many times:
After that, no training happens; everything freezes. When I check the GPU, almost all of its memory is in use and there are occasional spikes of activity. I have waited for hours, but nothing progresses, and the system eventually kills the training process.
When I try YOLOV8m, I get an additional message:
Expected Behavior:
With a subset of fewer than 5K images from the dataset, training printed the same message as above, but the model did start training.
Steps To Reproduce:
Version:
Docker image: tensorflow/tensorflow:nightly-gpu
Docker 26.1.1
NVIDIA Container Toolkit
Ubuntu 22.04
NVIDIA driver 550.67
Thanks in advance. Any help is always welcome.