
Integration #54

Open

DachengLi1 wants to merge 14 commits into master
Conversation

DachengLi1 (Contributor)

No description provided.

DachengLi1 requested a review from pengwu22 on December 8, 2020 at 18:19
Review comment threads (several now outdated or resolved) were opened on:
autodist/autodist.py
autodist/resource_spec.py
autodist/cluster.py
autodist/const.py
zhisbug (Contributor) commented Dec 28, 2020

@YLJALDC, could you please address the comments above?

DachengLi1 (Contributor, Author)
I have addressed the comments. Do you have any further suggestions?

DachengLi1 requested a review from pengwu22 on January 3, 2021 at 05:02
zhisbug (Contributor) commented Jan 4, 2021

> I have addressed the comments. Do you have any further suggestions?

Will review tomorrow and see whether more changes are needed.

# In AdaptDL mode, when a worker passes through this point before
# the chief has created the strategy, this should return
# nothing. Later, once the chief has created the strategy,
# the worker can load it.
I'm still not quite sure about the purpose of load or what this comment means. At L162-L163, load is always true when IS_ADAPTDL is true. Could you explain more?

DachengLi1 (Contributor, Author) commented Jan 11, 2021
It is kind of subtle. Previously, the AutoDist chief ran first and generated the strategy; only after it had built the strategy, set up the cluster, etc., did it spawn the worker instances. Now every instance runs through _build and therefore calls _build_or_load_strategy. The first time, a worker gets None from this function; the second time, it gets the strategy from the chief. This is because Kubernetes launches instances in parallel. By the second time a worker calls load, the chief is guaranteed to have generated the strategy, because there are several blocking collective calls in between.
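For readers following the thread, here is a minimal sketch of the pattern being described, assuming a shared file as the hand-off point; build_or_load_strategy, IS_AUTODIST_CHIEF, and STRATEGY_PATH are simplified stand-ins here, not the actual autodist internals:

import json
import os

# Hypothetical stand-ins for autodist internals; not the real API.
IS_AUTODIST_CHIEF = os.environ.get("AUTODIST_IS_CHIEF") == "1"
STRATEGY_PATH = "/tmp/strategy.json"  # a location visible to chief and workers

def build_strategy():
    # Placeholder for the real strategy builder.
    return {"graph_transforms": []}

def build_or_load_strategy():
    # Chief: build the strategy once and persist it for the workers.
    if IS_AUTODIST_CHIEF:
        strategy = build_strategy()
        with open(STRATEGY_PATH, "w") as f:
            json.dump(strategy, f)
        return strategy
    # Worker, first pass: the chief may not have written the strategy
    # yet, because Kubernetes launches all instances in parallel.
    if not os.path.exists(STRATEGY_PATH):
        return None
    # Worker, later pass: blocking collective calls have run in between,
    # so the chief is guaranteed to have produced the strategy by now.
    with open(STRATEGY_PATH) as f:
        return json.load(f)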

        self._coordinator.launch_clients()
else:
    if IS_AUTODIST_CHIEF:
        self._coordinator = Coordinator(strategy=strategy, cluster=self._cluster)
Would it be better to create different Coordinator classes based on the cluster mode?

DachengLi1 (Contributor, Author)
Good suggestion. I tried a format similar to what you suggest, but I think the current version, though lengthier, is more readable in autodist.py. It is also easier to maintain this way, since autodist.py is the first file a reader looks at.
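For context on the trade-off being discussed, the subclass-per-mode alternative would look roughly like the sketch below; the names AdaptDLCoordinator and make_coordinator are illustrative, not the actual autodist API:

# Illustrative sketch only; the subclass and factory names are hypothetical.
class Coordinator:
    """Default coordinator: the chief launches clients itself (e.g. over SSH)."""
    def __init__(self, strategy, cluster):
        self._strategy = strategy
        self._cluster = cluster

    def launch_clients(self):
        print("launching clients from the chief")

class AdaptDLCoordinator(Coordinator):
    """AdaptDL/Kubernetes variant: pods are started by the scheduler."""
    def launch_clients(self):
        print("pods already launched by Kubernetes; connecting to them")

def make_coordinator(strategy, cluster, is_adaptdl):
    cls = AdaptDLCoordinator if is_adaptdl else Coordinator
    return cls(strategy=strategy, cluster=cluster)

The inline if/else keeps all control flow visible in autodist.py at the cost of length; the factory keeps autodist.py short but sends the reader to another class to see what actually happens.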

if IS_ADAPTDL:
    hostname = socket.gethostname()
    local_ip = socket.gethostbyname(hostname)
    return local_ip
Since there is already an ADAPTDLCluster class that inherits from Cluster, is it necessary to put AdaptDL-related code in the base class?

DachengLi1 (Contributor, Author)
Thanks for pointing this out. I have moved it into the ADAPTDLCluster class. Thanks!
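A minimal sketch of that refactor, assuming the base class exposes a method for resolving the local address (get_local_ip is a hypothetical name; only Cluster and ADAPTDLCluster appear in the PR):

import socket

class Cluster:
    """Base class: would resolve the local address from the resource spec."""
    def get_local_ip(self):  # hypothetical method name
        raise NotImplementedError

class ADAPTDLCluster(Cluster):
    """AdaptDL/Kubernetes variant: each pod resolves its own address."""
    def get_local_ip(self):
        hostname = socket.gethostname()
        return socket.gethostbyname(hostname)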

Labels: None yet
Projects: None yet
Participants: 4