Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mods to run from Modal #3908

Merged

Conversation

sethkimmel3
Copy link
Contributor

@sethkimmel3 sethkimmel3 commented Sep 4, 2024

The small changes in this PR were necessary to successfully deploy Skypilot from Modal. I believe that the root cause issue is a specialized filesystem that Modal uses and mounts to containers for faster provisioning and startup time.

I've marked it as a draft for now, as I'm assuming we want to let a user add a configuration, perhaps to the config.yaml advanced configurations to disable the ssh control master. There is an example in backend_utils.py that disables it for kubernetes, but I don't see a more general pattern to allow such an override:

credentials['disable_control_master'] = True

I believe the other change is relatively benign, but I'd be open to other suggestions to fix this too.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@sethkimmel3 sethkimmel3 changed the title Sethkimmel3 modify ssh control path Mods to run from Modal Sep 4, 2024
@sethkimmel3
Copy link
Contributor Author

@Michaelvll could you provide some guidance on how you'd like to integrate these changes if you're open to accepting them?

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this PR @sethkimmel3! We should probably not remove the control master for all the other cases, as that can significantly increase the time for executing commands on the remote cluster.

Comment on lines 132 to 133
if not os.path.exists(os.path.join(wheel_dir, os.path.basename(wheel_path))):
shutil.move(str(wheel_path), wheel_dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems indicating that the temp folder we are using in the code is not available for writing wheels to. Can we debug a bit for why the /tmp folder is not working and whether it will work with tempfile.gettempdir() instead of using the hardcoded /tmp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I revert to the original code, the error I'm getting is: shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists. So I think it's actually that the wheel is persisting, and shutil.move fails to overwrite if it exists. Is it atypical that the wheel would already be found at the destination?

I think there's a number of ways to fix this that are relatively harmless and idempotent. This could be checking if the file already exists and skipping (as I did here), or running a shutil method that overwrites instead. Just let me know what you're preference is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is weird. The original code's destination, wheel_dir, is a directory (supposedly to be /root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835) but the error message you shown shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists indicates the wheel_dir somehow becomes the target file.

IIUC, if the destination is a dir, shutil should correctly overwrite the file in it. Could you help check if that is the case in your case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example printout of the wheel_path, and wheel_dir:
Wheel path: /tmp/tmp1ca8zhil/skypilot-1.0.0.dev0-py3-none-any.whl
Wheel dir: /root/.sky/wheels/6b3f8d33bac9d114a388c31be0e1998e

I think it's possible that the 9p filesystem is again doing something odd here. Also from observation, it starts to fail the second time (not first time) the wheel hash changes. I probably need to do more digging here myself but let me know if that's helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the reason this is failing seems simple. It turns out that shutil.move fails during a move across two filesystems. Since I'm moving the wheel from Modal's tmp mount to my own .sky mount, this makes sense. It seems like the best option is to use shutil.copy2 or check if the wheel already exists before attempting to move.

@sethkimmel3
Copy link
Contributor Author

@Michaelvll I agree changing the default behavior of the control master isn't a good idea. However, modifying the behavior seems critical in this case. Do you see a good option to parameterize this, potentially in config.yaml, so the user can change the behavior if necessary?

@sethkimmel3
Copy link
Contributor Author

Another finding @Michaelvll - if I set the ssh control path to f'/dev/shm/skypilot_ssh_{user_hash}/{ssh_control_filename}'rather than f'/tmp/skypilot_ssh_{user_hash}/{ssh_control_filename}', I can leave the default control master behavior.

I believe this is due to Modal using a 9p filesystem under the hood, which seems to not interface well with the symlink behavior required to use ssh control master. Again, I'm not sure if this would be an acceptable change, or something that could be configurable by the user.

@@ -401,6 +402,7 @@ def __init__(
ssh_proxy_command: Optional[str] = None,
docker_user: Optional[str] = None,
disable_control_master: Optional[bool] = False,
# disable_control_master: Optional[bool] = True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# disable_control_master: Optional[bool] = True,

Let's remove this unused comment : )

@@ -42,7 +43,7 @@ def _ssh_control_path(ssh_control_filename: Optional[str]) -> Optional[str]:
if ssh_control_filename is None:
return None
user_hash = common_utils.get_user_hash()
path = f'/tmp/skypilot_ssh_{user_hash}/{ssh_control_filename}'
path = f'/dev/shm/skypilot_ssh_{user_hash}/{ssh_control_filename}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use tempfile.gettempdir() instead of /dev/shm/ to be more general?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like tempfile.gettempdir() returns /tmp when run from a Modal container.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems control master is only supported on some specific file system types and that seems to be why it fails on modal's container. I am thinking that we should check the file system type to decide whether we should use control master. Changing the path to /dev/shm/ feels a bit losing generality for other systems, for example

def _file_system_allow_ssh_control_master() -> bool:
    file_system_type = None
    partitions = psutil.disk_partitions()
    for partition in partitions:
        if '/' in partition.mountpoint and file_system_type is None:
            file_system_type = partition.fstype
        if '/tmp' in partition.mountpoint:
            file_system_type = partition.fstype
            break
    # Allow unknown file system types, as well as ext4, xfs, btrfs, and apfs.
    if file_system_type in ['ext4', 'xfs', 'btrfs', 'apfs', None]:
        return True
    return False

Copy link
Contributor Author

@sethkimmel3 sethkimmel3 Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense. I'll check on my end to make sure it detects the 9p filesystem correctly. My only concern is losing some performance benefits of the ssh control master.

Copy link
Contributor Author

@sethkimmel3 sethkimmel3 Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like your function captures the 9p filesystem for /tmp. Since it's the only erroneous case that's specifically known as of now, would something like this work?

import subprocess

def is_tmp_9p_filesystem():
    try:
        # Use 'df' command to get filesystem information for /tmp
        result = subprocess.run(['df', '-T', '/tmp'], capture_output=True, text=True)
        
        if result.returncode != 0:
            raise subprocess.CalledProcessError(result.returncode, result.args, result.stdout, result.stderr)

        # Split the output into lines and get the second line (which contains the filesystem info)
        filesystem_info = result.stdout.strip().split('\n')[1]
        
        # The second column of this line is the filesystem type
        filesystem_type = filesystem_info.split()[1]

        # Check if the filesystem type is '9p'
        return filesystem_type.lower() == '9p'

    except subprocess.CalledProcessError as e:
        print(f"Error running 'df' command: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    
    return False

If it's True, then we disable ssh_control_master.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's revert this change to the original /tmp considering we have the check now?

Comment on lines 132 to 133
if not os.path.exists(os.path.join(wheel_dir, os.path.basename(wheel_path))):
shutil.move(str(wheel_path), wheel_dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is weird. The original code's destination, wheel_dir, is a directory (supposedly to be /root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835) but the error message you shown shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists indicates the wheel_dir somehow becomes the target file.

IIUC, if the destination is a dir, shutil should correctly overwrite the file in it. Could you help check if that is the case in your case?

@sethkimmel3 sethkimmel3 marked this pull request as ready for review October 27, 2024 23:51
@sethkimmel3
Copy link
Contributor Author

@Michaelvll - I proposed an implementation here. Let me know what you think!

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @sethkimmel3! Sorry for the delay. The PR looks mostly good to me with some comments

return filesystem_type.lower() == '9p'

except CalledProcessError as e:
print(f'Error running "df" command: {e}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's avoid printing directly to the output.

Suggested change
print(f'Error running "df" command: {e}')
logger.debug(f'Error running "df" command: {e}')

@@ -42,7 +43,7 @@ def _ssh_control_path(ssh_control_filename: Optional[str]) -> Optional[str]:
if ssh_control_filename is None:
return None
user_hash = common_utils.get_user_hash()
path = f'/tmp/skypilot_ssh_{user_hash}/{ssh_control_filename}'
path = f'/dev/shm/skypilot_ssh_{user_hash}/{ssh_control_filename}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's revert this change to the original /tmp considering we have the check now?

Comment on lines 446 to 448
self.disable_control_master = (
control_master_checks.disable_control_master_checks() or
disable_control_master)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit:

Suggested change
self.disable_control_master = (
control_master_checks.disable_control_master_checks() or
disable_control_master)
self.disable_control_master = (
disable_control_master or control_master_checks.disable_control_master_checks())

This may make it more efficient for the case disable_control_master is specified.

Comment on lines 13 to 29
try:
result = subprocess_utils.run(['df', '-T', '/tmp'],
capture_output=True,
text=True)

if result.returncode != 0:
raise CalledProcessError(result.returncode, result.args,
result.stdout, result.stderr)

filesystem_info = result.stdout.strip().split('\n')[1]
filesystem_type = filesystem_info.split()[1]
return filesystem_type.lower() == '9p'

except CalledProcessError as e:
print(f'Error running "df" command: {e}')

return False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making it more robust with:

Suggested change
try:
result = subprocess_utils.run(['df', '-T', '/tmp'],
capture_output=True,
text=True)
if result.returncode != 0:
raise CalledProcessError(result.returncode, result.args,
result.stdout, result.stderr)
filesystem_info = result.stdout.strip().split('\n')[1]
filesystem_type = filesystem_info.split()[1]
return filesystem_type.lower() == '9p'
except CalledProcessError as e:
print(f'Error running "df" command: {e}')
return False
result = subprocess_utils.run(['df', '-T', '/tmp'],
capture_output=True,
text=True)
if result.returncode != 0:
return False
filesystem_infos = result.stdout.strip().split('\n')
if len(filesystem_info) < 2:
return False
filesystem_types = filesystem_infos[1].split()
if len(filesystem_types) < 2:
return False
return filesystem_types[1].lower() == '9p'

return False


def disable_control_master_checks() -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def disable_control_master_checks() -> bool:
@functools.lru_cache
def should_disable_control_master() -> bool:

We should add a lru_cache to avoid performance degradation due to many duplicate checks



def disable_control_master_checks() -> bool:
"""Disable ssh control master checks if the /tmp filesystem is 9p.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Disable ssh control master checks if the /tmp filesystem is 9p.
"""Whether disable ssh control master based on file system.

Comment on lines +132 to +134
if not os.path.exists(
os.path.join(wheel_dir, os.path.basename(wheel_path))):
shutil.move(str(wheel_path), wheel_dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment for in what case this will not exist? Or should we remove this change for this PR?

@sethkimmel3
Copy link
Contributor Author

sethkimmel3 commented Nov 8, 2024

@Michaelvll - made the changes you suggested. Also had to make a small mod based on subprocess_utils based on my testing.

Not sure what's going on with the failing CI/CD test - have you seen that before?

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sethkimmel3! LGTM. The CI/CD should be fixed if you merge the latest master.

return filesystem_types[1].lower() == '9p'


@lru_cache
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: import module instead of functions

Suggested change
@lru_cache
@functools.lru_cache

@@ -0,0 +1,49 @@
"""Utils to check if the ssh control master should be disabled."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer renaming it to control_master_utils.py

capture_output=True,
text=True,
shell=None,
check=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
check=None,
check=False,

@sethkimmel3
Copy link
Contributor Author

Looks like it worked now

@Michaelvll Michaelvll added this pull request to the merge queue Nov 9, 2024
Merged via the queue into skypilot-org:master with commit 409b92f Nov 9, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants