This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix multi worker #10096

Merged
merged 10 commits into from
Mar 16, 2018

Conversation

zhreshold
Member

@zhreshold zhreshold commented Mar 14, 2018

Description

  • Fix race condition for CPUSharedStorageManager->Free
  • Launch workers at iter init stage to avoid frequent relaunch
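To illustrate the second point, here is a minimal sketch of an iterator that starts its workers once, when the iterator is constructed, rather than relaunching them on later calls. This is not the code from this PR: `EagerWorkerIter` and `worker_loop` are hypothetical names, and threads stand in for worker processes so the example stays self-contained. Only `_key_queue` and `_batch_sampler` echo names that appear in the diff.

```python
import queue
import threading


def worker_loop(dataset, key_queue, data_queue):
    """Fetch batches until a None sentinel arrives (hypothetical helper)."""
    while True:
        item = key_queue.get()
        if item is None:
            break
        idx, indices = item
        data_queue.put((idx, [dataset[i] for i in indices]))


class EagerWorkerIter:
    """Hypothetical iterator: workers are launched once, in __init__."""

    def __init__(self, dataset, batch_sampler, num_workers=2):
        self._key_queue = queue.Queue()
        self._data_queue = queue.Queue()
        self._buffer = {}      # out-of-order results, keyed by batch index
        self._rcvd_idx = 0     # next batch index to hand to the caller
        self._sent_idx = 0     # number of jobs queued so far
        self._workers = []
        # Assumption: threads stand in for worker processes here.
        for _ in range(num_workers):
            worker = threading.Thread(
                target=worker_loop,
                args=(dataset, self._key_queue, self._data_queue),
                daemon=True)
            worker.start()
            self._workers.append(worker)
        # Queue every job up front, as in the reviewed hunk.
        for idx, batch in enumerate(batch_sampler):
            self._key_queue.put((idx, batch))
            self._sent_idx += 1

    def __iter__(self):
        return self

    def __next__(self):
        if self._rcvd_idx == self._sent_idx:
            self._shutdown()
            raise StopIteration
        # Workers may finish out of order; wait for the index we need.
        while self._rcvd_idx not in self._buffer:
            idx, batch = self._data_queue.get()
            self._buffer[idx] = batch
        batch = self._buffer.pop(self._rcvd_idx)
        self._rcvd_idx += 1
        return batch

    def _shutdown(self):
        # One sentinel per worker, then wait for all of them to exit.
        for _ in self._workers:
            self._key_queue.put(None)
        for worker in self._workers:
            worker.join()
        self._workers = []
```

Usage under these assumptions: `list(EagerWorkerIter(list(range(10)), [[0, 1, 2], [3, 4, 5]]))` yields the batches in sampler order even when workers finish out of order, because results are buffered by index.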

@piiswrong @sxjscience @yajiedesign

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@sxjscience
Member

@cjolivier01 @Jerryzcn

@Jerryzcn
Contributor

thank god

@Jerryzcn
Contributor

Jerryzcn commented Mar 14, 2018

I will patch this and test it on our speech dataset. This will take approximately 2 days.

workers.append(worker)

for idx, batch in enumerate(self._batch_sampler):
    self._key_queue.put((idx, batch))
Member


May still need to revise the logic in later PRs to dynamically push more jobs into the key_queue.
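One way the suggestion above could look, as a sketch only: instead of queueing every batch up front, the iterator keeps a bounded number of jobs in flight and pushes one more each time a batch is consumed. `LazyPushIter`, `fetch_loop`, `_push_next`, and the `prefetch` parameter are hypothetical names chosen for illustration, and threads again stand in for worker processes; none of this is taken from the merged code.

```python
import queue
import threading

_DONE = object()  # marks an exhausted sampler


def fetch_loop(dataset, key_queue, data_queue):
    """Fetch batches until a None sentinel arrives (hypothetical helper)."""
    while True:
        item = key_queue.get()
        if item is None:
            return
        idx, indices = item
        data_queue.put((idx, [dataset[i] for i in indices]))


class LazyPushIter:
    """Hypothetical iterator that refills the key queue inside __next__."""

    def __init__(self, dataset, batch_sampler, num_workers=2, prefetch=None):
        self._sampler = iter(batch_sampler)
        self._key_queue = queue.Queue()
        self._data_queue = queue.Queue()
        self._buffer = {}
        self._sent_idx = 0
        self._rcvd_idx = 0
        # Assumption: threads stand in for worker processes here.
        self._workers = [
            threading.Thread(
                target=fetch_loop,
                args=(dataset, self._key_queue, self._data_queue),
                daemon=True)
            for _ in range(num_workers)]
        for worker in self._workers:
            worker.start()
        # Prime the queue with a bounded number of jobs (at least one).
        if prefetch is None:
            prefetch = 2 * num_workers
        for _ in range(max(1, prefetch)):
            self._push_next()

    def _push_next(self):
        """Queue one more job if the sampler still has batches."""
        indices = next(self._sampler, _DONE)
        if indices is _DONE:
            return
        self._key_queue.put((self._sent_idx, indices))
        self._sent_idx += 1

    def __iter__(self):
        return self

    def __next__(self):
        if self._rcvd_idx == self._sent_idx:
            for _ in self._workers:
                self._key_queue.put(None)
            for worker in self._workers:
                worker.join()
            self._workers = []
            raise StopIteration
        while self._rcvd_idx not in self._buffer:
            idx, batch = self._data_queue.get()
            self._buffer[idx] = batch
        batch = self._buffer.pop(self._rcvd_idx)
        self._rcvd_idx += 1
        self._push_next()  # replace the job that was just consumed
        return batch
```

The trade-off in this sketch is memory versus latency: a small `prefetch` bounds how many finished batches can pile up in the data queue, while a larger one keeps workers busier when batch load times vary.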

@zhreshold
Copy link
Member Author

@Jerryzcn The latest commit should fix #10042

@piiswrong piiswrong merged commit 24a8b78 into apache:master Mar 16, 2018
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
* improve multi worker iterator

* debug

* debug

* fix python2

* fix

* update

* fix race condition in cpu shared storage free

* fix docstring

* update

* push workload in next
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
4 participants