-
Notifications
You must be signed in to change notification settings - Fork 510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mods to run from Modal #3908
Mods to run from Modal #3908
Conversation
@Michaelvll could you provide some guidance on how you'd like to integrate these changes if you're open to accepting them? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting this PR @sethkimmel3! We should probably not remove the control master
for all the other cases, as that can significantly increase the time for executing commands on the remote cluster.
sky/backends/wheel_utils.py
Outdated
if not os.path.exists(os.path.join(wheel_dir, os.path.basename(wheel_path))): | ||
shutil.move(str(wheel_path), wheel_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems indicating that the temp folder we are using in the code is not available for writing wheels to. Can we debug a bit for why the /tmp
folder is not working and whether it will work with tempfile.gettempdir()
instead of using the hardcoded /tmp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I revert to the original code, the error I'm getting is: shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists
. So I think it's actually that the wheel is persisting, and shutil.move
fails to overwrite if it exists. Is it atypical that the wheel would already be found at the destination?
I think there's a number of ways to fix this that are relatively harmless and idempotent. This could be checking if the file already exists and skipping (as I did here), or running a shutil
method that overwrites instead. Just let me know what you're preference is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this is weird. The original code's destination, wheel_dir
, is a directory (supposedly to be /root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835
) but the error message you shown shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists
indicates the wheel_dir
somehow becomes the target file.
IIUC, if the destination is a dir, shutil
should correctly overwrite the file in it. Could you help check if that is the case in your case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example printout of the wheel_path, and wheel_dir:
Wheel path: /tmp/tmp1ca8zhil/skypilot-1.0.0.dev0-py3-none-any.whl
Wheel dir: /root/.sky/wheels/6b3f8d33bac9d114a388c31be0e1998e
I think it's possible that the 9p filesystem is again doing something odd here. Also from observation, it starts to fail the second time (not first time) the wheel hash changes. I probably need to do more digging here myself but let me know if that's helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the reason this is failing seems simple. It turns out that shutil.move
fails during a move across two filesystems. Since I'm moving the wheel from Modal's tmp
mount to my own .sky
mount, this makes sense. It seems like the best option is to use shutil.copy2
or check if the wheel already exists before attempting to move.
@Michaelvll I agree changing the default behavior of the control master isn't a good idea. However, modifying the behavior seems critical in this case. Do you see a good option to parameterize this, potentially in |
Another finding @Michaelvll - if I set the ssh control path to I believe this is due to Modal using a 9p filesystem under the hood, which seems to not interface well with the symlink behavior required to use ssh control master. Again, I'm not sure if this would be an acceptable change, or something that could be configurable by the user. |
sky/utils/command_runner.py
Outdated
@@ -401,6 +402,7 @@ def __init__( | |||
ssh_proxy_command: Optional[str] = None, | |||
docker_user: Optional[str] = None, | |||
disable_control_master: Optional[bool] = False, | |||
# disable_control_master: Optional[bool] = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# disable_control_master: Optional[bool] = True, |
Let's remove this unused comment : )
sky/utils/command_runner.py
Outdated
@@ -42,7 +43,7 @@ def _ssh_control_path(ssh_control_filename: Optional[str]) -> Optional[str]: | |||
if ssh_control_filename is None: | |||
return None | |||
user_hash = common_utils.get_user_hash() | |||
path = f'/tmp/skypilot_ssh_{user_hash}/{ssh_control_filename}' | |||
path = f'/dev/shm/skypilot_ssh_{user_hash}/{ssh_control_filename}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use tempfile.gettempdir()
instead of /dev/shm/
to be more general?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like tempfile.gettempdir()
returns /tmp
when run from a Modal container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems control master is only supported on some specific file system types and that seems to be why it fails on modal's container. I am thinking that we should check the file system type to decide whether we should use control master. Changing the path to /dev/shm/
feels a bit losing generality for other systems, for example
def _file_system_allow_ssh_control_master() -> bool:
file_system_type = None
partitions = psutil.disk_partitions()
for partition in partitions:
if '/' in partition.mountpoint and file_system_type is None:
file_system_type = partition.fstype
if '/tmp' in partition.mountpoint:
file_system_type = partition.fstype
break
# Allow unknown file system types, as well as ext4, xfs, btrfs, and apfs.
if file_system_type in ['ext4', 'xfs', 'btrfs', 'apfs', None]:
return True
return False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense. I'll check on my end to make sure it detects the 9p filesystem correctly. My only concern is losing some performance benefits of the ssh control master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like your function captures the 9p filesystem for /tmp
. Since it's the only erroneous case that's specifically known as of now, would something like this work?
import subprocess
def is_tmp_9p_filesystem():
try:
# Use 'df' command to get filesystem information for /tmp
result = subprocess.run(['df', '-T', '/tmp'], capture_output=True, text=True)
if result.returncode != 0:
raise subprocess.CalledProcessError(result.returncode, result.args, result.stdout, result.stderr)
# Split the output into lines and get the second line (which contains the filesystem info)
filesystem_info = result.stdout.strip().split('\n')[1]
# The second column of this line is the filesystem type
filesystem_type = filesystem_info.split()[1]
# Check if the filesystem type is '9p'
return filesystem_type.lower() == '9p'
except subprocess.CalledProcessError as e:
print(f"Error running 'df' command: {e}")
except Exception as e:
print(f"Unexpected error: {e}")
return False
If it's True, then we disable ssh_control_master
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's revert this change to the original /tmp
considering we have the check now?
sky/backends/wheel_utils.py
Outdated
if not os.path.exists(os.path.join(wheel_dir, os.path.basename(wheel_path))): | ||
shutil.move(str(wheel_path), wheel_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this is weird. The original code's destination, wheel_dir
, is a directory (supposedly to be /root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835
) but the error message you shown shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists
indicates the wheel_dir
somehow becomes the target file.
IIUC, if the destination is a dir, shutil
should correctly overwrite the file in it. Could you help check if that is the case in your case?
@Michaelvll - I proposed an implementation here. Let me know what you think! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @sethkimmel3! Sorry for the delay. The PR looks mostly good to me with some comments
sky/utils/control_master_checks.py
Outdated
return filesystem_type.lower() == '9p' | ||
|
||
except CalledProcessError as e: | ||
print(f'Error running "df" command: {e}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid printing directly to the output.
print(f'Error running "df" command: {e}') | |
logger.debug(f'Error running "df" command: {e}') |
sky/utils/command_runner.py
Outdated
@@ -42,7 +43,7 @@ def _ssh_control_path(ssh_control_filename: Optional[str]) -> Optional[str]: | |||
if ssh_control_filename is None: | |||
return None | |||
user_hash = common_utils.get_user_hash() | |||
path = f'/tmp/skypilot_ssh_{user_hash}/{ssh_control_filename}' | |||
path = f'/dev/shm/skypilot_ssh_{user_hash}/{ssh_control_filename}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's revert this change to the original /tmp
considering we have the check now?
sky/utils/command_runner.py
Outdated
self.disable_control_master = ( | ||
control_master_checks.disable_control_master_checks() or | ||
disable_control_master) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit:
self.disable_control_master = ( | |
control_master_checks.disable_control_master_checks() or | |
disable_control_master) | |
self.disable_control_master = ( | |
disable_control_master or control_master_checks.disable_control_master_checks()) |
This may make it more efficient for the case disable_control_master
is specified.
sky/utils/control_master_checks.py
Outdated
try: | ||
result = subprocess_utils.run(['df', '-T', '/tmp'], | ||
capture_output=True, | ||
text=True) | ||
|
||
if result.returncode != 0: | ||
raise CalledProcessError(result.returncode, result.args, | ||
result.stdout, result.stderr) | ||
|
||
filesystem_info = result.stdout.strip().split('\n')[1] | ||
filesystem_type = filesystem_info.split()[1] | ||
return filesystem_type.lower() == '9p' | ||
|
||
except CalledProcessError as e: | ||
print(f'Error running "df" command: {e}') | ||
|
||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making it more robust with:
try: | |
result = subprocess_utils.run(['df', '-T', '/tmp'], | |
capture_output=True, | |
text=True) | |
if result.returncode != 0: | |
raise CalledProcessError(result.returncode, result.args, | |
result.stdout, result.stderr) | |
filesystem_info = result.stdout.strip().split('\n')[1] | |
filesystem_type = filesystem_info.split()[1] | |
return filesystem_type.lower() == '9p' | |
except CalledProcessError as e: | |
print(f'Error running "df" command: {e}') | |
return False | |
result = subprocess_utils.run(['df', '-T', '/tmp'], | |
capture_output=True, | |
text=True) | |
if result.returncode != 0: | |
return False | |
filesystem_infos = result.stdout.strip().split('\n') | |
if len(filesystem_info) < 2: | |
return False | |
filesystem_types = filesystem_infos[1].split() | |
if len(filesystem_types) < 2: | |
return False | |
return filesystem_types[1].lower() == '9p' | |
sky/utils/control_master_checks.py
Outdated
return False | ||
|
||
|
||
def disable_control_master_checks() -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def disable_control_master_checks() -> bool: | |
@functools.lru_cache | |
def should_disable_control_master() -> bool: |
We should add a lru_cache to avoid performance degradation due to many duplicate checks
sky/utils/control_master_checks.py
Outdated
|
||
|
||
def disable_control_master_checks() -> bool: | ||
"""Disable ssh control master checks if the /tmp filesystem is 9p. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Disable ssh control master checks if the /tmp filesystem is 9p. | |
"""Whether disable ssh control master based on file system. |
if not os.path.exists( | ||
os.path.join(wheel_dir, os.path.basename(wheel_path))): | ||
shutil.move(str(wheel_path), wheel_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment for in what case this will not exist? Or should we remove this change for this PR?
@Michaelvll - made the changes you suggested. Also had to make a small mod based on subprocess_utils based on my testing. Not sure what's going on with the failing CI/CD test - have you seen that before? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sethkimmel3! LGTM. The CI/CD should be fixed if you merge the latest master.
sky/utils/control_master_checks.py
Outdated
return filesystem_types[1].lower() == '9p' | ||
|
||
|
||
@lru_cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: import module instead of functions
@lru_cache | |
@functools.lru_cache |
sky/utils/control_master_checks.py
Outdated
@@ -0,0 +1,49 @@ | |||
"""Utils to check if the ssh control master should be disabled.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would prefer renaming it to control_master_utils.py
sky/utils/control_master_checks.py
Outdated
capture_output=True, | ||
text=True, | ||
shell=None, | ||
check=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check=None, | |
check=False, |
Looks like it worked now |
The small changes in this PR were necessary to successfully deploy Skypilot from Modal. I believe that the root cause issue is a specialized filesystem that Modal uses and mounts to containers for faster provisioning and startup time.
I've marked it as a draft for now, as I'm assuming we want to let a user add a configuration, perhaps to the
config.yaml
advanced configurations to disable the ssh control master. There is an example inbackend_utils.py
that disables it for kubernetes, but I don't see a more general pattern to allow such an override:skypilot/sky/backends/backend_utils.py
Line 1273 in e3b9763
I believe the other change is relatively benign, but I'd be open to other suggestions to fix this too.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh