a hidden bug in the function named update_frontier_nodes #27

Closed
hliangzhao opened this issue Dec 8, 2020 · 3 comments

@hliangzhao

I think there is a bug in the following function, defined in spark_env/job_dag.py:

def update_frontier_nodes(self, node):
    frontier_nodes_changed = False
    for child in node.child_nodes:
        if child.is_schedulable():
            if child.idx not in self.frontier_nodes:   # bug is here
                self.frontier_nodes.add(child)
                frontier_nodes_changed = True
    return frontier_nodes_changed

What self.frontier_nodes stores are the node objects themselves, not their indices, so the check child.idx not in self.frontier_nodes is always true and the child is added again on every call. That said, this does not seem to have a significant effect on the training results.
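
The fix looks straightforward: compare the node object itself rather than its index, something like this (only the membership test changes):

def update_frontier_nodes(self, node):
    frontier_nodes_changed = False
    for child in node.child_nodes:
        if child.is_schedulable():
            if child not in self.frontier_nodes:   # compare the node, not its index
                self.frontier_nodes.add(child)
                frontier_nodes_changed = True
    return frontier_nodes_changed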

@hongzimao
Owner

Very nice catch! It is indeed a mistake. Fortunately self.frontier_nodes is a set, so child won't be duplicated: the buggy check child.idx not in self.frontier_nodes is always true (an integer index is never a member of a set of node objects), but re-adding a node that is already in the set is a no-op. The effect of the bug is therefore just that frontier_nodes_changed becomes True more often than intended.
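
To make both points concrete, here is a tiny standalone illustration (Node is just a hypothetical stand-in for the real node class):

class Node:
    def __init__(self, idx):
        self.idx = idx


frontier = set()
n = Node(idx=3)
frontier.add(n)

print(n.idx in frontier)  # False: an integer index is never a member of a set of Node objects
print(n in frontier)      # True: the intended membership test
frontier.add(n)           # re-adding the same object is a no-op for a set
print(len(frontier))      # 1, so no duplicates despite the buggy check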

I guess the meta problem behind these bugs is that we didn't properly unit test all the modules. Even though this is research code, I think this level of complexity already requires proper testing. It was an oversight, and we should prevent it in future projects.
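
For example, a small regression test for this function could look roughly like the sketch below. It stubs only the attributes update_frontier_nodes actually touches and assumes the method lives on a class named JobDAG in spark_env/job_dag.py; it should pass with the corrected membership check and fail with the original one.

from types import SimpleNamespace

from spark_env.job_dag import JobDAG  # assumed class name in spark_env/job_dag.py


def make_node(idx, schedulable=True, children=()):
    # Stub exposing only the attributes update_frontier_nodes uses.
    return SimpleNamespace(
        idx=idx,
        child_nodes=list(children),
        is_schedulable=lambda: schedulable,
    )


def test_update_frontier_nodes():
    child = make_node(1)
    parent = make_node(0, children=[child])
    dag = SimpleNamespace(frontier_nodes=set())

    # First call: the schedulable child enters the frontier.
    assert JobDAG.update_frontier_nodes(dag, parent) is True
    assert child in dag.frontier_nodes

    # Second call: nothing new, so the changed flag must stay False.
    assert JobDAG.update_frontier_nodes(dag, parent) is False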

@hliangzhao
Author

hliangzhao commented Dec 13, 2020

Thanks for your reply!

Also, the training procedure takes a lot of time. I'm running on a server with 4 12-core CPUs, 256 GB of memory, and 2 Tesla P40s (48 GB of GPU memory in total). However, one epoch takes approximately 40 seconds with only one worker (it takes less time without the GPU). Currently I'm optimizing the code to improve efficiency.

If a more efficient version of Decima is released, please let me know!

@hongzimao
Owner

hongzimao commented Dec 13, 2020

We would suggest training on a smaller problem first, e.g., try reducing num_stream_dags. We didn't optimize the implementation much at the time; we just trained the final model over a few days. Here is a trained model for comparison: #12.

Let us know if you find a more efficient implementation or figure out which parts of the code are the bottleneck. Feel free to submit pull requests too. Thanks a lot!
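
If it helps narrow things down, a generic (not repo-specific) way to find the hot spots is to run one training iteration under cProfile, roughly like this, where training_step is just a placeholder for the real per-epoch call:

import cProfile
import pstats


def training_step():
    # Placeholder for one training epoch; swap in the real per-epoch call.
    total = 0
    for i in range(10 ** 6):
        total += i * i
    return total


cProfile.run("training_step()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)  # 10 most expensive call paths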
