
【Hackathon 8th No.13】Domino paper reproduction #1093


Open
wants to merge 2 commits into
base: develop
from

Conversation

xiaoyewww

PR types

New Features

PR changes

Others

Describe

support domino


paddle-bot bot commented Mar 4, 2025

Thanks for your contribution!

@xiaoyewww
Author

xiaoyewww commented Mar 4, 2025

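There are two problems with reproducing Domino:

  1. So far only the model itself has been reproduced. The pre-processing and post-processing parts need very high CPU and memory resources and cannot be run on aistudio.
  2. No official pretrained weights have been released.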

@luotao1
Collaborator

luotao1 commented Mar 5, 2025

Please submit an RFC design document first.

@wangguan1995
Contributor

Sorry, the task description here is wrong. It needs to be changed to [inference] and [training].

@xiaoyewww
Author

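@wangguan1995 The model now trains normally, and its accuracy has been verified with Padiff. The training code is non-deterministic: the data pre-processing done at each step cannot be fixed with a random seed.
Tasks that still need verification:
(1) The dataset is not yet available. Training was run on a single sample only, and the loss decreases normally over 50 epochs.
(2) The inference code has been adapted, but the pre-processing still requires point-cloud processing, which cannot be run on aistudio. This part needs further verification.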
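One possible way to make the per-step pre-processing repeatable for a framework-to-framework comparison is sketched below. It assumes the randomness comes from sampling point indices, which the thread does not confirm, and the helper names `step_rng` and `sample_point_indices` are hypothetical, not repository code. The idea is to draw indices from a generator seeded only by the seed, epoch, and step, outside of either framework, so both runs see the same subset.

```python
import random


def step_rng(base_seed, epoch, step):
    """Build a generator whose state depends only on (base_seed, epoch, step)."""
    # A string seed is hashed deterministically by random.Random,
    # so the same triple always yields the same sequence.
    return random.Random(f"{base_seed}-{epoch}-{step}")


def sample_point_indices(num_points, num_samples, base_seed, epoch, step):
    """Pick num_samples distinct indices out of num_points, reproducibly."""
    rng = step_rng(base_seed, epoch, step)
    return rng.sample(range(num_points), num_samples)


# Two calls with the same (seed, epoch, step) return the same indices.
a = sample_point_indices(1000, 8, base_seed=0, epoch=3, step=1)
b = sample_point_indices(1000, 8, base_seed=0, epoch=3, step=1)
print(a == b)
```

The same index list could then be fed to both the torch and the paddle data pipeline, so that any remaining loss difference would come from the model code rather than from sampling.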

@wangguan1995
Contributor

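  1. The code repository currently contains a large number of relative paths.
  2. The download script currently has some problems (an issue on the AWS side; note it in the documentation for now).
  3. Use run_1 as the data for validating training, and paste the 10-epoch comparison logs against torch here.
  4. What is needed is forward-loss alignment at the 1e-5 level.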
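A minimal sketch of what the 1e-5 forward-loss check in item 4 could look like. The helper name `max_abs_diff` and the dictionary layout are illustrative assumptions, not repository code. The numbers are the first-batch losses from the logs posted in this thread, taken before any weight update.

```python
# Illustrative check only: names and structure are assumptions, not repository code.
ATOL = 1e-5  # alignment target named in the review comment

# First-batch losses copied from the logs in this thread.
torch_losses = {"volume": 0.23092692, "surface": 0.06465874, "surface_area": 0.00293629}
paddle_losses = {"volume": 0.23092692, "surface": 0.06465873, "surface_area": 0.00293629}


def max_abs_diff(ref, other):
    """Return the largest absolute difference over the loss terms in ref."""
    return max(abs(ref[k] - other[k]) for k in ref)


diff = max_abs_diff(torch_losses, paddle_losses)
aligned = diff <= ATOL
print(f"max abs diff = {diff:.2e}, aligned = {aligned}")
```

An absolute tolerance is used here because the comment asks for alignment "at the 1e-5 level"; a relative tolerance would be a different criterion.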

@xiaoyewww
Author

Logs for the first 10 epochs:

torch

Device cuda:0, batch processed: 1, loss volume: 0.23092692             , loss surface: 0.06465874, loss integral: 0.00000000, loss surface area: 0.00293629
 Device cuda:0,  batch: 1, loss norm: 0.26472443
Device cuda:0 LOSS train 0.26472443 valid 0.19338508 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05669114366173744, Time taken 118.63378143310547

Device cuda:0, epoch 1:
Device cuda:0, batch processed: 1, loss volume: 0.17430124             , loss surface: 0.03637335, loss integral: 0.00000000, loss surface area: 0.00179438
 Device cuda:0,  batch: 1, loss norm: 0.19338511
Device cuda:0 LOSS train 0.19338511 valid 0.11204524 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05669114366173744, Time taken 114.84539103507996

Device cuda:0, epoch 2:
Device cuda:0, batch processed: 1, loss volume: 0.09825166             , loss surface: 0.02525518, loss integral: 0.00000000, loss surface area: 0.00233155
 Device cuda:0,  batch: 1, loss norm: 0.11204502
Device cuda:0 LOSS train 0.11204502 valid 0.10328176 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05669114366173744, Time taken 115.87338018417358

Device cuda:0, epoch 3:
Device cuda:0, batch processed: 1, loss volume: 0.09899043             , loss surface: 0.00795730, loss integral: 0.00000000, loss surface area: 0.00062173
 Device cuda:0,  batch: 1, loss norm: 0.10327994
Device cuda:0 LOSS train 0.10327994 valid 0.05417285 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05417285114526749, Time taken 116.46289348602295

Device cuda:0, epoch 4:
Device cuda:0, batch processed: 1, loss volume: 0.04868028             , loss surface: 0.01017947, loss integral: 0.00000000, loss surface area: 0.00080577
 Device cuda:0,  batch: 1, loss norm: 0.05417290
Device cuda:0 LOSS train 0.05417290 valid 0.08227389 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05417285114526749, Time taken 115.91964483261108

Device cuda:0, epoch 5:
Device cuda:0, batch processed: 1, loss volume: 0.07733711             , loss surface: 0.00911645, loss integral: 0.00000000, loss surface area: 0.00075788
 Device cuda:0,  batch: 1, loss norm: 0.08227427
Device cuda:0 LOSS train 0.08227427 valid 0.08577856 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05417285114526749, Time taken 115.41759729385376

Device cuda:0, epoch 6:
Device cuda:0, batch processed: 1, loss volume: 0.08035985             , loss surface: 0.01013019, loss integral: 0.00000000, loss surface area: 0.00070755
 Device cuda:0,  batch: 1, loss norm: 0.08577872
Device cuda:0 LOSS train 0.08577872 valid 0.06831404 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.05417285114526749, Time taken 115.05477333068848

Device cuda:0, epoch 7:
Device cuda:0, batch processed: 1, loss volume: 0.06417362             , loss surface: 0.00766032, loss integral: 0.00000000, loss surface area: 0.00062155
 Device cuda:0,  batch: 1, loss norm: 0.06831456
Device cuda:0 LOSS train 0.06831456 valid 0.04400067 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.044000666588544846, Time taken 116.03830194473267

Device cuda:0, epoch 8:
Device cuda:0, batch processed: 1, loss volume: 0.04010706             , loss surface: 0.00724047, loss integral: 0.00000000, loss surface area: 0.00054595
 Device cuda:0,  batch: 1, loss norm: 0.04400026
Device cuda:0 LOSS train 0.04400026 valid 0.04530785 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.044000666588544846, Time taken 115.86109900474548

Device cuda:0, epoch 9:
Device cuda:0, batch processed: 1, loss volume: 0.04175998             , loss surface: 0.00659478, loss integral: 0.00000000, loss surface area: 0.00050289
 Device cuda:0,  batch: 1, loss norm: 0.04530882
Device cuda:0 LOSS train 0.04530882 valid 0.06185470 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.044000666588544846, Time taken 115.73935866355896

Device cuda:0, epoch 10:
Device cuda:0, batch processed: 1, loss volume: 0.05804047             , loss surface: 0.00707586, loss integral: 0.00000000, loss surface area: 0.00055142
 Device cuda:0,  batch: 1, loss norm: 0.06185411
Device cuda:0 LOSS train 0.06185411 valid 0.04006197 Current lr 0.001Integral factor 0
Device cuda:0, Best val loss 0.04006196931004524, Time taken 116.31600141525269

paddle:

Device gpu:0, batch processed: 1, loss volume: 0.23092692             , loss surface: 0.06465873, loss integral: 0.00000000, loss surface area: 0.00293629
 Device gpu:0,  batch: 1, loss norm: 0.26472443
Loss/train: 0.2647244334220886/1
Device gpu:0 LOSS train 0.26472443 valid 0.19338842 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05531581491231918, Time taken 118.70474576950073

Device gpu:0, epoch 1:
Device gpu:0, batch processed: 1, loss volume: 0.17430471             , loss surface: 0.03637305, loss integral: 0.00000000, loss surface area: 0.00179433
 Device gpu:0,  batch: 1, loss norm: 0.19338840
Loss/train: 0.19338840246200562/2
Device gpu:0 LOSS train 0.19338840 valid 0.11218615 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05531581491231918, Time taken 115.69161486625671

Device gpu:0, epoch 2:
Device gpu:0, batch processed: 1, loss volume: 0.09839579             , loss surface: 0.02524963, loss integral: 0.00000000, loss surface area: 0.00233097
 Device gpu:0,  batch: 1, loss norm: 0.11218609
Loss/train: 0.11218608915805817/3
Device gpu:0 LOSS train 0.11218609 valid 0.10281664 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05531581491231918, Time taken 115.92429780960083

Device gpu:0, epoch 3:
Device gpu:0, batch processed: 1, loss volume: 0.09853126             , loss surface: 0.00795383, loss integral: 0.00000000, loss surface area: 0.00062137
 Device gpu:0,  batch: 1, loss norm: 0.10281885
Loss/train: 0.10281885415315628/4
Device gpu:0 LOSS train 0.10281885 valid 0.05420898 Current lr 0.001Integral factor 0

Device gpu:0, Best val loss 0.05420897901058197, Time taken 116.74323606491089

Device gpu:0, epoch 4:
Device gpu:0, batch processed: 1, loss volume: 0.04871760             , loss surface: 0.01017731, loss integral: 0.00000000, loss surface area: 0.00080539
 Device gpu:0,  batch: 1, loss norm: 0.05420895
Loss/train: 0.05420895293354988/5
Device gpu:0 LOSS train 0.05420895 valid 0.08210348 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05420897901058197, Time taken 115.5925304889679

Device gpu:0, epoch 5:
Device gpu:0, batch processed: 1, loss volume: 0.07717736             , loss surface: 0.00909785, loss integral: 0.00000000, loss surface area: 0.00075586
 Device gpu:0,  batch: 1, loss norm: 0.08210421
Loss/train: 0.08210421353578568/6
Device gpu:0 LOSS train 0.08210421 valid 0.08545748 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05420897901058197, Time taken 115.61996293067932

Device gpu:0, epoch 6:
Device gpu:0, batch processed: 1, loss volume: 0.08003174             , loss surface: 0.01014488, loss integral: 0.00000000, loss surface area: 0.00070565
 Device gpu:0,  batch: 1, loss norm: 0.08545700
Loss/train: 0.08545700460672379/7
Device gpu:0 LOSS train 0.08545700 valid 0.06784783 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.05420897901058197, Time taken 116.20984172821045

Device gpu:0, epoch 7:
Device gpu:0, batch processed: 1, loss volume: 0.06372426             , loss surface: 0.00762877, loss integral: 0.00000000, loss surface area: 0.00061806
 Device gpu:0,  batch: 1, loss norm: 0.06784768
Loss/train: 0.06784767657518387/8
Device gpu:0 LOSS train 0.06784768 valid 0.04360897 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.04360896721482277, Time taken 116.99345993995667

Device gpu:0, epoch 8:
Device gpu:0, batch processed: 1, loss volume: 0.03972780             , loss surface: 0.00721701, loss integral: 0.00000000, loss surface area: 0.00054254
 Device gpu:0,  batch: 1, loss norm: 0.04360757
Loss/train: 0.043607573956251144/9
Device gpu:0 LOSS train 0.04360757 valid 0.04554129 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.04360896721482277, Time taken 116.08153223991394

Device gpu:0, epoch 9:
Device gpu:0, batch processed: 1, loss volume: 0.04196901             , loss surface: 0.00663531, loss integral: 0.00000000, loss surface area: 0.00050686
 Device gpu:0,  batch: 1, loss norm: 0.04554009
Loss/train: 0.04554009437561035/10
Device gpu:0 LOSS train 0.04554009 valid 0.06155418 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.04360896721482277, Time taken 115.41889214515686

Device gpu:0, epoch 10:
Device gpu:0, batch processed: 1, loss volume: 0.05775697             , loss surface: 0.00704247, loss integral: 0.00000000, loss surface area: 0.00054853
 Device gpu:0,  batch: 1, loss norm: 0.06155247
Loss/train: 0.061552468687295914/11
Device gpu:0 LOSS train 0.06155247 valid 0.04058465 Current lr 0.001Integral factor 0
Device gpu:0, Best val loss 0.04058464989066124, Time taken 116.95455741882324
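
The two logs above can be compared epoch by epoch with a small script. This is a sketch, not part of the PR: the regular expression and the function name `loss_norms` are assumptions based on the line format shown in the logs, and only the first three epochs are pasted in to keep it short.

```python
import re

# Matches the value on lines such as
# " Device gpu:0,  batch: 1, loss norm: 0.19338840"
LOSS_NORM = re.compile(r"loss norm:\s*([0-9.]+)")


def loss_norms(log_text):
    """Extract every 'loss norm' value from a training log, in order."""
    return [float(m) for m in LOSS_NORM.findall(log_text)]


# First three epochs, copied from the logs above.
torch_log = """
 Device cuda:0,  batch: 1, loss norm: 0.26472443
 Device cuda:0,  batch: 1, loss norm: 0.19338511
 Device cuda:0,  batch: 1, loss norm: 0.11204502
"""
paddle_log = """
 Device gpu:0,  batch: 1, loss norm: 0.26472443
 Device gpu:0,  batch: 1, loss norm: 0.19338840
 Device gpu:0,  batch: 1, loss norm: 0.11218609
"""

diffs = [abs(t - p) for t, p in zip(loss_norms(torch_log), loss_norms(paddle_log))]
for epoch, d in enumerate(diffs):
    print(f"epoch {epoch}: |torch - paddle| = {d:.2e}")
```

On these values the first logged loss is identical in both frameworks, while the third already differs by more than 1e-5. That may reflect the unseeded per-step pre-processing mentioned earlier in the thread rather than a model mismatch, but the logs alone do not settle it.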

3 participants