feat(train): make timeout of wenet_join configurable #2123

Merged (1 commit into main, Nov 6, 2023)

Conversation

xingchensong (Member)

On some shared storage systems (jfs/hdfs), prefetching a large dataset can take a long time. With the fixed 30 s timeout, the model may exit the epoch before it has even finished the first batch of data, so large datasets need to be able to set a longer timeout.
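For context, a minimal sketch of the mechanism this PR makes configurable, assuming a wenet_join-style helper built on dist.monitored_barrier; the signature and the timeout_s parameter name are illustrative, not the exact wenet code:

```python
from datetime import timedelta

import torch.distributed as dist


def wenet_join(group_join, timeout_s=30):
    """Return True if training should leave the current epoch.

    Sketch only: the point of this PR is that `timeout_s` comes from the
    training config instead of being hard-coded to 30 seconds, so slow
    prefetch on jfs/hdfs does not trigger a premature exit.
    """
    if dist.get_world_size() == 1:
        return False
    try:
        # All ranks must reach this barrier within `timeout_s` seconds;
        # monitored_barrier requires the gloo backend.
        dist.monitored_barrier(group=group_join,
                               timeout=timedelta(seconds=timeout_s))
    except RuntimeError:
        # Some rank did not show up in time: treat it as uneven/ended data.
        return True
    return False
```

A large-dataset run on jfs/hdfs could then pass, say, several minutes instead of the default 30 s.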

@xingchensong (Member Author)

Could the new IO in #2107 possibly solve this problem more elegantly?

@robin1001 merged commit fd3803b into main on Nov 6, 2023 (6 checks passed)
@robin1001 deleted the xcsong-timeout branch on Nov 6, 2023 at 12:20
@Mddct (Collaborator) commented Nov 6, 2023

Could the new IO in #2107 possibly solve this problem more elegantly?

On this point, for large data I still think we should drop the epoch concept, keep interval-step model saving, and use list_shards * epochs. That way uneven data only shows up at the very end of training (where it can be handled with padding, see https://flax.readthedocs.io/en/latest/guides/data_preprocessing/full_eval.html), and the gloo part can then be simplified away as well.

Also, #2107 combined with the current gloo setup should have a similar problem, which is why that IO has already dropped the epoch concept.
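A rough sketch of the step-driven, epoch-less loop described above; every name here (build_dataloader, save_interval, and so on) is hypothetical rather than the #2107 implementation:

```python
def train_by_steps(build_dataloader, train_step, save_checkpoint,
                   shard_list, epochs, save_interval):
    """Step-driven training without an explicit epoch loop (sketch)."""
    # list_shards * epochs: expand the shard list up front so uneven
    # data can only appear at the very tail of training.
    dataloader = build_dataloader(shard_list * epochs)

    for step, batch in enumerate(dataloader, start=1):
        train_step(batch)                  # forward/backward/optimizer
        if step % save_interval == 0:
            # Interval-step saving replaces per-epoch checkpoints.
            save_checkpoint(step)
    # Any uneven tail at the very end can be padded away (see the Flax
    # full_eval guide linked above) instead of syncing ranks every epoch.
```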

@xingchensong (Member Author) commented Nov 6, 2023

I read the description, and I think this infinite padding is essentially the same as wenet_join. Its count_p is a synchronization operation (a sync barrier) that works like dist.monitored_barrier: as soon as it detects that n is inconsistent across ranks it breaks, just as dist.monitored_barrier breaks as soon as it detects a timeout.

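To make the analogy concrete, a sketch of how such a barrier-style check sits in the per-batch loop; wenet_join refers to the monitored_barrier wrapper sketched earlier, and the loop shape is illustrative rather than the exact wenet training code:

```python
def run_one_epoch(dataloader, train_step, group_join, timeout_s):
    for batch in dataloader:
        # Collective check once per batch: if any rank has stopped
        # arriving (out of data or timed out), every rank breaks out
        # together instead of hanging in the next all-reduce. count_p
        # in the Flax guide plays the same "is everyone still here" role.
        if wenet_join(group_join, timeout_s=timeout_s):
            break
        train_step(batch)
```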

@xingchensong (Member Author)

As for the other point, I think dropping the epoch concept is quite friendly for large-data training; I support it.

@Mddct (Collaborator) commented Nov 6, 2023

I read the description, and I think this infinite padding is essentially the same as wenet_join. Its count_p is a synchronization operation (a sync barrier) that works like dist.monitored_barrier: as soon as it detects that n is inconsistent across ranks it breaks, just as dist.monitored_barrier breaks as soon as it detects a timeout.

Right, so with a steps-based notion, at this point we are already at the end and training can simply stop; there is no need for the gloo barrier either.

@xingchensong (Member Author)

I read the description, and I think this infinite padding is essentially the same as wenet_join. Its count_p is a synchronization operation (a sync barrier) that works like dist.monitored_barrier: as soon as it detects that n is inconsistent across ranks it breaks, just as dist.monitored_barrier breaks as soon as it detects a timeout.

Right, so with a steps-based notion, at this point we are already at the end and training can simply stop; there is no need for the gloo barrier either.

What I mean is that count_p is itself a barrier, and that barrier cannot be avoided either way: you use either JAX's barrier (count_p), gloo's (dist.monitored_barrier), or nccl's.
