a hidden bug in the function named update_frontier_nodes #27

Closed
hliangzhao opened this issue Dec 8, 2020 · 3 comments

@hliangzhao

I think there is a bug in the following function, defined in spark_env/job_dag.py:

def update_frontier_nodes(self, node):
    frontier_nodes_changed = False
    for child in node.child_nodes:
        if child.is_schedulable():
            if child.idx not in self.frontier_nodes:   # bug is here
                self.frontier_nodes.add(child)
                frontier_nodes_changed = True
    return frontier_nodes_changed

What self.frontier_nodes stores are the node objects themselves, not their indices, so the check child.idx not in self.frontier_nodes is always true and the child is added again on every call. That said, this does not seem to have a significant effect on the training results.
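
The fix looks straightforward: compare the node object itself rather than its index, something like this (only the membership test changes):

def update_frontier_nodes(self, node):
    frontier_nodes_changed = False
    for child in node.child_nodes:
        if child.is_schedulable():
            if child not in self.frontier_nodes:   # compare the node, not its index
                self.frontier_nodes.add(child)
                frontier_nodes_changed = True
    return frontier_nodes_changed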

@hongzimao
Owner

Very nice catch! It is indeed a mistake. Fortunately self.frontier_nodes is a set, so child won't be duplicated: the buggy check child.idx not in self.frontier_nodes is always true (an integer index is never a member of a set of node objects), but re-adding a node that is already in the set is a no-op. The effect of the bug is therefore just that frontier_nodes_changed becomes True more often than intended.
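
To make both points concrete, here is a tiny standalone illustration (Node is just a hypothetical stand-in for the real node class):

class Node:
    def __init__(self, idx):
        self.idx = idx


frontier = set()
n = Node(idx=3)
frontier.add(n)

print(n.idx in frontier)  # False: an integer index is never a member of a set of Node objects
print(n in frontier)      # True: the intended membership test
frontier.add(n)           # re-adding the same object is a no-op for a set
print(len(frontier))      # 1, so no duplicates despite the buggy check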

I guess the meta problem behind these bugs is that we didn't properly unit test all the modules. Even though this is research code, I think this level of complexity already requires proper testing. It was an oversight, and we should prevent it in future projects.
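
For example, a small regression test for this function could look roughly like the sketch below. It stubs only the attributes update_frontier_nodes actually touches and assumes the method lives on a class named JobDAG in spark_env/job_dag.py; it should pass with the corrected membership check and fail with the original one.

from types import SimpleNamespace

from spark_env.job_dag import JobDAG  # assumed class name in spark_env/job_dag.py


def make_node(idx, schedulable=True, children=()):
    # Stub exposing only the attributes update_frontier_nodes uses.
    return SimpleNamespace(
        idx=idx,
        child_nodes=list(children),
        is_schedulable=lambda: schedulable,
    )


def test_update_frontier_nodes():
    child = make_node(1)
    parent = make_node(0, children=[child])
    dag = SimpleNamespace(frontier_nodes=set())

    # First call: the schedulable child enters the frontier.
    assert JobDAG.update_frontier_nodes(dag, parent) is True
    assert child in dag.frontier_nodes

    # Second call: nothing new, so the changed flag must stay False.
    assert JobDAG.update_frontier_nodes(dag, parent) is False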

@hliangzhao
Author

hliangzhao commented Dec 13, 2020

Thanks for your reply!

Also, the training procedure takes a lot of time. I'm running on a server with 4 12-core CPUs, 256 GB of memory, and 2 Tesla P40s (48 GB of GPU memory in total). However, one epoch takes approximately 40 seconds with only one worker (it takes less time without the GPU). Currently I'm optimizing the code to improve efficiency.

If a more efficient version of Decima is released, please let me know!

@hongzimao
Owner

hongzimao commented Dec 13, 2020

We would suggest training on a smaller problem first, e.g., try reducing num_stream_dags. We didn't optimize the implementation much at the time; we just trained the final model over a few days. Here is a trained model for comparison: #12.

Let us know if you find a more efficient implementation or figure out which parts of the code are the bottleneck. Feel free to submit pull requests too. Thanks a lot!
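
If it helps narrow things down, a generic (not repo-specific) way to find the hot spots is to run one training iteration under cProfile, roughly like this, where training_step is just a placeholder for the real per-epoch call:

import cProfile
import pstats


def training_step():
    # Placeholder for one training epoch; swap in the real per-epoch call.
    total = 0
    for i in range(10 ** 6):
        total += i * i
    return total


cProfile.run("training_step()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)  # 10 most expensive call paths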
