Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Users sometimes not found #18

Closed
jmsmkn opened this issue Jan 9, 2024 · 12 comments · Fixed by #19
Closed

Users sometimes not found #18

jmsmkn opened this issue Jan 9, 2024 · 12 comments · Fixed by #19
Assignees

Comments

@jmsmkn
Copy link
Member

jmsmkn commented Jan 9, 2024

Two occurrences:

File \"sagemaker_shim/models.py\", line 302, in _get_user_info

RuntimeError: User 1000 not found

RuntimeError: User 1001 not found

First one is from @pkcakeout, have requested the Dockerfile.

@jmsmkn
Copy link
Member Author

jmsmkn commented Jan 9, 2024

The second is from a container that previously worked, so needs fixing.

@jmsmkn
Copy link
Member Author

jmsmkn commented Jan 10, 2024

The Dockerfile:

FROM python:alpine3.17
COPY main.py /app.py
USER 1000
ENTRYPOINT [ "python3", "/app.py" ]

@pkcakeout
Copy link
Member

My 2-cents, in case not obvious: _get_user_info likely tries to query the user-information for a user that was not actually created in my container. Linux allows to set the UID/GID of a process to any numeric number independent from the underlying system definitions for users. So while the container should work, get_user_info will not.

@jmsmkn
Copy link
Member Author

jmsmkn commented Jan 10, 2024

The user is indeed not created, which is a problem for switching here.

We do some lookups to find what user (and group) to switch to as we always receive a string, and are not sure if this is a username or user id:

def _get_user_info(id_or_name: str) -> pwd.struct_passwd | None:
if id_or_name == "":
return None
try:
return pwd.getpwnam(id_or_name)
except (KeyError, AttributeError):
try:
return pwd.getpwuid(int(id_or_name))
except (KeyError, ValueError, AttributeError) as error:
raise RuntimeError(f"User {id_or_name} not found") from error

We then set the user info from this:

def proc_user(self) -> UserGroup:
match = re.fullmatch(
r"^(?P<user>[0-9a-zA-Z]*):?(?P<group>[0-9a-zA-Z]*)$", self.user
)
if match:
user = self._get_user_info(id_or_name=match.group("user"))
group = self._get_group_info(id_or_name=match.group("group"))
if user is None:
uid = None
home = None
else:
uid = user.pw_uid
home = user.pw_dir
if group is None:
if user is None:
gid = None
else:
# Switch to the users primary group
gid = user.pw_gid
else:
gid = group.gr_gid
return UserGroup(uid=uid, gid=gid, home=home)
else:
return UserGroup(uid=None, gid=None, home=None)

Before passing it to create_subprocess_exec:

process = await asyncio.create_subprocess_exec(
*self.proc_args,
user=self.proc_user.uid,
group=self.proc_user.gid,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
env=self.proc_env,
)

Here the user and group will be handled by subprocess.Popen. AFAIK this does essentially the same checks with getpwuid so would fail there.

@pkcakeout
Copy link
Member

pkcakeout commented Jan 10, 2024

Docs for the disambiguation issue: https://www.gnu.org/software/coreutils/manual/coreutils.html#Disambiguating-names-and-IDs

Short summary: It might be enough to check the first character of the string to decide if we want to interpret it as a number or a name.

Not sure what can be done about associated groups yet though... researching...

@pkcakeout
Copy link
Member

Explanation about my incorrect conclusion from the previous post: It seems that newer linux-tools have discouraged the use of number as the first letter of linux usernames by generating an error. Though technically it seems to be still allowed!

@jmsmkn
Copy link
Member Author

jmsmkn commented Jan 10, 2024

Main question for me - does Popen allow passing user=1000 when that user does not exist in the db? If so, then we only need to fix our code. If not, what is the Python interface for doing this?

@pkcakeout
Copy link
Member

pkcakeout commented Jan 10, 2024

As long as you are root this seems to work as expected. You cannot do it as a "lower user" though, for obvious reasons if you think about it...

root@dingo-linux:/home/paul# python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> subprocess.check_call(["id"])
uid=0(root) gid=0(root) groups=0(root)
0
>>> subprocess.check_call(["id"], user=21322)
uid=21322 gid=0(root) groups=0(root)
0
>>> subprocess.check_call(["id"], user=21322, group=32321)
uid=21322 gid=32321 groups=32321,0(root)
0

@jmsmkn
Copy link
Member Author

jmsmkn commented Jan 10, 2024

Great! Then we just need to fix our logic, no problem.

@jmsmkn jmsmkn self-assigned this Jan 10, 2024
jmsmkn added a commit that referenced this issue Jan 10, 2024
@jmsmkn
Copy link
Member Author

jmsmkn commented Jan 10, 2024

Okay I think I have the fix, please take a look.

@jmsmkn
Copy link
Member Author

jmsmkn commented Jan 10, 2024

>>> subprocess.check_call(["id"], user=21322, group=32321)
uid=21322 gid=32321 groups=32321,0(root)
0

Is that correct? There it looks like two groups are being used, 32321,0, I thought that group would set the only group used, extra_groups would add additional ones?

@pkcakeout
Copy link
Member

pkcakeout commented Jan 10, 2024

Hey James!

Yes, you replicated what I did it seems, however, as you pointed out, we should probably still drop* the extra_groups so that the process does not have root-group permissions either.

*: Which brings me to the following headache when thinking about the state-space here: if we want to do it "perfectly well" we likely should do a lookup into the system groups config to see if there should be assigned to the user with the given uid (if it exists) and set extra_groups accordingly.

Though honestly I would only do this if we encounter any follow-up issues. I do not suspect the average deep-learning to go that much into the inner workings of Linux to get his project to work when working locally, so I suspect that all code we encounter in the wild will only need uid and gid assigned and the extra_groups=[]. So, I suggest:

subprocess.check_call(["id"], user=21322, group=32321, extra_groups=[])

... for now.

jmsmkn added a commit that referenced this issue Jan 10, 2024
jmsmkn added a commit to comic/grand-challenge.org that referenced this issue Jan 11, 2024
Adds fixes for the SageMaker Training backend:

- Metrics collection
- Handle event being called twice
- Timeouts
- Sets minimum time limits as no shorter limit can be handled by
SageMaker Training

Upgrades SageMaker shim to fix to user determination issue
(DIAGNijmegen/rse-sagemaker-shim#18)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants