Training the det model hangs #14304
Comments
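Based on the problem description and the configuration file, you are training
1. File system performance: saving the model writes a large number of parameters to disk, so a file system bottleneck (such as a slow disk or I/O contention) can make the save step hang. Solution:
2. Data loader (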
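One way to test the file-system hypothesis in point 1 is to time a synced write into the checkpoint directory. The following is only a minimal sketch under that assumption: `measure_write_throughput` is a hypothetical helper, not part of PaddleOCR, and the 64 MiB probe size is arbitrary.

```python
import os
import tempfile
import time


def measure_write_throughput(directory, size_mb=64):
    """Write size_mb MiB into a temp file in `directory`, fsync it, return MiB/s."""
    os.makedirs(directory, exist_ok=True)
    chunk = b"\0" * (1024 * 1024)  # 1 MiB of zeros
    fd, path = tempfile.mkstemp(dir=directory, suffix=".probe")
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # force the data to disk, not just the page cache
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the probe file
    return size_mb / max(elapsed, 1e-9)
```

Calling `measure_write_throughput("./output/db_mv3/")` probes the `save_model_dir` from the config in this issue. A very low or wildly varying figure would support the I/O explanation; a normal figure suggests looking elsewhere.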
|
Check whether the dictionaries are the same. |
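This is usually caused by configuration parameters or model files that differ between training and inference. Possible causes and solutions:
1. Inconsistent model files
2. Inconsistent configuration files and parameters
3. Inconsistent preprocessing and postprocessing steps
4. Version compatibility issues
5. Check that the model was exported correctly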
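For the file-level checks above (same dictionary, same model file on both sides), comparing checksums is a quick first test. This is a minimal sketch; `files_match` and any paths passed to it are hypothetical, since the thread does not say which files were used.

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def files_match(train_path, infer_path):
    """True if the file used for training and the one used for inference are byte-identical."""
    return sha256_of(train_path) == sha256_of(infer_path)
```

A mismatch only shows that the two files differ, not which one is correct. Identical files rule out this cause but none of the other four.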
Suggested steps
|
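🐛 Bug (Problem description)
When I train the det model, training hangs every time at the first save-model stage after eval. It reproduces every time. Changing batch_size and num_workers makes no difference, and shrinking the dataset to as few as 35 images gives the same result. What is the cause? I am training with det_mv3_db.yml. The problem does not occur when training the rec model.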
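To see where the process is actually stuck, Python's standard `faulthandler` module can dump the stack of every thread. The sketch below assumes it is called near the top of the training entry point; `enable_hang_diagnostics` is a hypothetical helper, not something PaddleOCR provides. The dump needs a real file, because the reproduction command in this issue sends stderr to `/dev/null`.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s=600.0, log_file=sys.stderr):
    """Dump all Python thread stacks periodically and on SIGUSR1.

    log_file must be a real file object (it needs fileno()) and must stay open.
    """
    # Every timeout_s seconds, write the stack of every thread to log_file.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    # On POSIX systems, `kill -USR1 <pid>` triggers a dump on demand.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)
```

This only covers the process that calls it. If the hang is inside a data-loader worker process, the main-process trace would most likely just show it waiting on that worker, which is still a useful hint.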
Global:
  use_gpu: true
  use_xpu: false
  use_mlu: false
  epoch_num: 1200
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/db_mv3/
  save_epoch_step: 500
  # evaluation is run every 2000 iterations
  eval_batch_step: [0, 500]
  cal_metric_during_train: False
  pretrained_model: ./pretrain_models/MobileNetV3_large_x0_5_pretrained
  checkpoints:
  save_inference_dir:
  use_visualdl: False
  infer_img: doc/imgs_en/img_10.jpg
  save_res_path: ./output/det_db/predicts_db.txt

Architecture:
  model_type: det
  algorithm: DB
  Transform:
  Backbone:
    name: MobileNetV3
    scale: 0.5
    model_name: large
  Neck:
    name: DBFPN
    out_channels: 256
  Head:
    name: DBHead
    k: 50

Loss:
  name: DBLoss
  balance_loss: true
  main_loss_type: DiceLoss
  alpha: 5
  beta: 10
  ohem_ratio: 3

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    learning_rate: 0.0005
  regularizer:
    name: 'L2'
    factor: 0

PostProcess:
  name: DBPostProcess
  thresh: 0.3
  box_thresh: 0.6
  max_candidates: 1000
  unclip_ratio: 1.5

Metric:
  name: DetMetric
  main_indicator: hmean

Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/
    label_file_list:
      - ./train_data/det/train.txt
    ratio_list: [1.0]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - IaaAugment:
          augmenter_args:
            - { 'type': Fliplr, 'args': { 'p': 0.5 } }
            - { 'type': Affine, 'args': { 'rotate': [-10, 10] } }
            - { 'type': Resize, 'args': { 'size': [0.5, 3] } }
      - EastRandomCropData:
          size: [640, 640]
          max_tries: 50
          keep_ratio: true
      - MakeBorderMap:
          shrink_ratio: 0.4
          thresh_min: 0.3
          thresh_max: 0.7
      - MakeShrinkMap:
          shrink_ratio: 0.4
          min_text_size: 8
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask'] # the order of the dataloader list
  loader:
    shuffle: True
    drop_last: False
    batch_size_per_card: 16
    num_workers: 8
    use_shared_memory: True
    pin_memory: True

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/
    label_file_list:
      - ./train_data/det/val.txt
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - DetResizeForTest:
          image_shape: [736, 1280]
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: ['image', 'shape', 'polys', 'ignore_tags']
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 1 # must be 1
    num_workers: 8
    use_shared_memory: True
    pin_memory: True
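It may help to work out when the first evaluation, and therefore the first save after it, is reached with these settings. The sketch below rests on two assumptions that the issue does not confirm: that `eval_batch_step: [start, interval]` first triggers evaluation at iteration `start + interval`, and that the reduced dataset has 35 training images.

```python
import math


def iterations_per_epoch(num_samples, batch_size, drop_last):
    """Number of batches the loader yields per epoch."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


def first_eval(num_samples, batch_size, drop_last, eval_batch_step):
    """Return (global iteration, 1-based epoch) of the first evaluation.

    Assumes evaluation first fires at start + interval iterations.
    """
    start, interval = eval_batch_step
    step = start + interval
    per_epoch = iterations_per_epoch(num_samples, batch_size, drop_last)
    return step, math.ceil(step / per_epoch)


# 35 images, batch_size_per_card 16, drop_last False, eval_batch_step [0, 500]
print(first_eval(35, 16, False, [0, 500]))  # -> (500, 167)
```

Under those assumptions the hang would show up around epoch 167 of 1200, well before the first periodic save at epoch 500 (`save_epoch_step: 500`), so the save that stalls would be the one that follows evaluation.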
🏃♂️ Environment (Runtime environment)
CentOS 7.9, CUDA 11.8, cuDNN 8.6
![image](https://private-user-images.githubusercontent.com/42798227/391305386-2e511ef3-3799-4c6e-a0c1-8f8377373348.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk2MTc0NTcsIm5iZiI6MTczOTYxNzE1NywicGF0aCI6Ii80Mjc5ODIyNy8zOTEzMDUzODYtMmU1MTFlZjMtMzc5OS00YzZlLWEwYzEtOGY4Mzc3MzczMzQ4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTUlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE1VDEwNTkxN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWQzMzI2YmFmMjlhYTBiYjUzZGIwNGJhODIwMDJiM2M4N2NkNTFlNWJhZThlY2QzM2ZmZDhmZmM0NWUwMzY0ZmYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.aJXHv5wU7i48rqxrMRXtsQCRLK-8nHaikoshMdOtlv0)
Single A10 GPU with 24 GB of memory
🌰 Minimal Reproducible Example
nohup python tools/train.py -c pretrain_models/det_mv3_db.yml > /dev/null 2>&1 &
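The command above sends all output to `/dev/null`, so any warning or traceback printed before the hang is lost. A minimal sketch of a launcher that keeps the log and passes config overrides through `-o` follows. `build_train_command` and `run_logged` are hypothetical helpers, and the choice of overrides (loading data in the main process only) is just one experiment to try, not a confirmed fix.

```python
import shlex
import subprocess


def build_train_command(config_path, overrides=None):
    """Build the tools/train.py command line, appending -o KEY=VALUE overrides."""
    cmd = ["python", "tools/train.py", "-c", config_path]
    if overrides:
        cmd.append("-o")
        cmd.extend(f"{key}={value}" for key, value in overrides.items())
    return cmd


def run_logged(cmd, log_path):
    """Run cmd with stdout and stderr captured in log_path; return the exit code."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode


# Hypothetical experiment: disable data-loader worker processes.
single_process_overrides = {
    "Train.loader.num_workers": 0,
    "Eval.loader.num_workers": 0,
}
cmd = build_train_command("pretrain_models/det_mv3_db.yml", single_process_overrides)
print(shlex.join(cmd))
# -> python tools/train.py -c pretrain_models/det_mv3_db.yml -o Train.loader.num_workers=0 Eval.loader.num_workers=0
```

`run_logged(cmd, "train.log")` would then start training with the log kept in `train.log` (an arbitrary file name). If the hang disappears with zero workers, the data loader is the likely culprit; if it persists, the log should at least show the last message before the stall.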