Handle BentoML errors & clean up failed models #1527
base: master
Conversation
@DhanshreeA
I don't understand, what's the ...

I did not copy the full log. Here, `"batch": True` means that the API supports batch processing. This is my understanding.
ersilia/hub/fetch/fetch.py (Outdated)

```
@@ -25,6 +25,15 @@
FetchResult = namedtuple("FetchResult", ["fetch_success", "reason"])

class BentoMLError(Exception):
```
Hey @OlawumiSalaam could you move this class to this module, and implement it according to the other exceptions implemented here?
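For context, a minimal sketch of what such an exception class might look like; the base class, message text, and field names here are assumptions, since Ersilia's exceptions module defines its own conventions:

```python
# Hypothetical sketch of an Ersilia-style exception carrying a message
# attribute; the real exceptions module's base class and fields may differ.
class BentoMLException(Exception):
    def __init__(self, message: str = "BentoML failed during model fetch"):
        self.message = message
        super().__init__(self.message)
```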
Actually chuck that, we should be reusing this exception class. You can put your message string in there, right now the class is simply a placeholder.
I actually thought about that, but wanted your feedback first. Thanks
ersilia/hub/fetch/fetch.py (Outdated)

```python
except NotInstallableWithBentoML:
    raise
except Exception as e:
```
We don't need any try...excepts in this code because, if you look carefully, this part is contextualized with the decorator throw_ersilia_exception, implemented here. What this decorator does is check whether the function it decorates raises any exceptions, and then print them nicely on the terminal. The implementation itself contains a try...except, which removes the need to add one explicitly to methods decorated with this function.
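As a rough illustration of the pattern being described (not the actual throw_ersilia_exception implementation, whose formatting and behavior may differ), such a decorator might look like:

```python
import functools

def throw_ersilia_exception(func):
    # Wraps the function in a try...except: any exception is printed
    # in a readable form and re-raised, so decorated methods do not
    # need their own explicit try...except blocks.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            print(f"[ERROR] {type(e).__name__}: {e}")
            raise
    return wrapper

@throw_ersilia_exception
def failing_step():
    raise ValueError("model example failed")
```

Calling `failing_step()` prints the formatted error and then re-raises the original `ValueError`.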
ersilia/hub/fetch/fetch.py (Outdated)

```
@@ -395,4 +417,20 @@ async def fetch(self, model_id: str) -> bool:
        fetch_success=True, reason="Model fetched successfully"
    )
else:
    # Handle BentoML-specific errors here
    if isinstance(fr.reason, BentoMLError):
```
No, we shouldn't be adding this here. Look at L392: we should be catching one of StandardModelExampleError or BentoMLError. So something like:

```python
try:
    fr = await self._fetch(model_id)
    if fr.fetch_success:
        self._standard_csv_example(model_id)
except (StandardModelExampleError, BentoMLError) as err:  # <--- In the future we can account for more cases
    self.logger.debug("Standard model example failed, deleting artifacts")
    do_delete = yes_no_input(
        "Do you want to delete the model artifacts? [Y/n]",
        default_answer="Y",
    )
    if do_delete:
        md = ModelFullDeleter(overwrite=False)
        md.delete(model_id)
    return FetchResult(
        fetch_success=False,
        reason="...",  # Here modify the reason string based on the error.msg (which would be in the Exception class)
    )
```
Please don't use this code as is because the formatting might be messed up.
Hi @DhanshreeA,
Thank you for your detailed feedback. There is a lot of information to process, so I am taking it one step at a time. I have reworked the `fetch()` method to better align with your suggestions, and I would like to confirm that this approach is correct before finalizing.
Here is my understanding and the changes made:
- Structured `try...except` block: wrapped `_fetch()` and `_standard_csv_example()` in a try block, and explicitly caught `StandardModelExampleError` and `BentoMLException` to handle cleanup and error messaging.
- Dynamic error handling: used `str(err)` to populate the reason field in `FetchResult`, ensuring error messages are specific to the exception.
- Cleanup logic: moved artifact deletion into the except block to ensure cleanup occurs only when errors are encountered.
```python
async def fetch(self, model_id: str) -> bool:
    """
    Fetch a model with the given eos identifier.

    Parameters
    ----------
    model_id : str
        The eos identifier of the model.

    Returns
    -------
    bool
        True if the model was fetched successfully, False otherwise.

    Examples
    --------
    .. code-block:: python

        fetcher = ModelFetcher(config_json=config)
        success = await fetcher.fetch(model_id="eosxxxx")
    """
    try:
        fr = await self._fetch(model_id)
        if not fr.fetch_success:
            return fr
        self._standard_csv_example(model_id)
        self.logger.debug("Writing model source to file")
        model_source_file = os.path.join(
            self._model_path(model_id), MODEL_SOURCE_FILE
        )
        try:
            os.makedirs(self._model_path(model_id), exist_ok=True)
        except OSError as error:
            self.logger.error(f"Error during folder creation: {error}")
        with open(model_source_file, "w") as f:
            f.write(self.model_source)
        return FetchResult(
            fetch_success=True, reason="Model fetched successfully"
        )
    except (StandardModelExampleError, BentoMLException) as err:
        self.logger.debug(f"{type(err).__name__} occurred: {str(err)}")
        do_delete = yes_no_input(
            "Do you want to delete the model artifacts? [Y/n]",
            default_answer="Y",
        )
        if do_delete:
            md = ModelFullDeleter(overwrite=False)
            md.delete(model_id)
            self.logger.info(f"Model '{model_id}' artifacts successfully deleted.")
            print(f"✅ Model '{model_id}' artifacts have been successfully deleted.")
        reason = str(err) if str(err) else "An unknown error occurred during fetching."
        return FetchResult(fetch_success=False, reason=reason)
```
Does this structure align with your expectations and the intent of your feedback? Thank you.
Perfect @OlawumiSalaam ship it!
One more thing, avoid using print statements, please use the logger.
Hey @OlawumiSalaam, here are some examples of where we invoke bentoml through subprocess calls. Also, from the logs shared in the issue referenced by this PR, for model eos69p6, we see the error occurred after these lines:
So I would recommend looking for these logs in the codebase, and for the functions/methods that make subsequent subprocess calls to bentoml. Usually, you'd see this happening through a terminal utility we have called ... Although this here seems like a JSONDecodeError (`Expecting value: line 1 column 1 (char 0)`), so I would put the JSON reading in a try...except as well, catch the JSONDecodeError, and in turn raise a BentoMLError. Remember, the main idea is to catch the BentoMLError, like here, and delete the model artifacts that got created during fetch, so the model doesn't eventually appear in the catalog.
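A hedged sketch of the suggested pattern: wrap the JSON parsing of bentoml's subprocess output and re-raise parse failures as a BentoMLError. The function and class names here are illustrative, not Ersilia's actual API:

```python
import json

class BentoMLError(Exception):
    """Illustrative placeholder for Ersilia's BentoML-related exception."""

def parse_bentoml_output(raw: str) -> dict:
    # An empty or non-JSON string raises JSONDecodeError
    # ("Expecting value: line 1 column 1 (char 0)"); we convert it into
    # a BentoMLError so the fetch logic can trigger artifact cleanup.
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        raise BentoMLError(f"Could not parse bentoml output: {e}") from e
```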
Request for Review: Key Changes
Thank you!
Hi @DhanshreeA, following your feedback, I audited the codebase to identify all BentoML subprocess calls. I used ...
Hi @OlawumiSalaam great work. Just one comment: I am not seeing a fix that automatically resolves that specific issue if it arises. I mentioned the details in issue #1532 (comment). The error bentoml creates can now be captured by your PR, but we also need a small change to fix the issue itself. Here are the details that I discussed with Gemma: I highly suspect that this problem is what we call module shadowing. Let me explain. We have a file called ... in ersilia
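To illustrate the module-shadow theory: a local file with the same name as an installed package wins the import when its directory comes first on `sys.path`. A minimal, self-contained demonstration with a made-up module name (not the actual ersilia file in question):

```python
import os
import sys
import tempfile

# Create a throwaway directory containing a module we will "shadow" with.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "shadow_demo.py"), "w") as f:
    f.write("ORIGIN = 'local shadow file'\n")

# Putting that directory first on sys.path makes the local file win the
# import, analogous to a stray local bentoml module shadowing the real
# installed bentoml package.
sys.path.insert(0, tmp)
import shadow_demo

print(shadow_demo.ORIGIN)
```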
One solution I found was to immediately uninstall bentoml; then, when we fetch again, it works. You may take it from here.
@Abellegese Thank you for your feedback. The focus was just on the specific issue I am working on. Your solution was not captured earlier, but I will definitely look into it and update here as it goes. I hope we get these bentoml issues solved. You are always helpful. Thank you.
Issue Summary
Proposed Fix
- Update imports:
- Installation safeguards:
I want to ensure that my understanding and proposed approach meet your expectations before moving forward with these changes, and whether there are additional adjustments you would recommend.
Hi @OlawumiSalaam this looks good. Have you tried it to see if it solves the problem? If it is not solving it, let me know; I have some ideas we can go for. Update:
This is perfect, thanks @OlawumiSalaam for covering these cases.
@Abellegese I agree that this is probably what happens; however, that points to a larger issue with users, i.e. either not having conda on their system or not having it on their path, causing Ersilia to fall back to installing all model dependencies within the ersilia environment, which should not happen to begin with. While your solution works, I am afraid it's not a good long-term solution, because ersilia should always be creating standalone model environments, and not corrupt its own environment if it cannot find conda on the system path.
EDIT: Looking at logs from #1531 and #1532, my conda theory fails, because I can see that ersilia is able to create the right conda environments for the models. You're probably on to something with the symlinking/module shadow theory. If I happen to run into this error at some point, I would inspect which source ends up running for the bentoml cli.
@DhanshreeA
@OlawumiSalaam my bad, after following the discussions across the referenced issues, I think you and @Abellegese are right. Let's implement it this way.
@OlawumiSalaam - I agree with Abel here; adding handling logic for the ...
```python
cmd = ["bentoml", "get", f"{model_id}:latest", "--print-location", "--quiet"]
stdout, stderr, returncode = run_command(cmd, quiet=True)
```
Love this implementation, thanks!
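For reference, a helper with the shape used above might be sketched as follows; the real `run_command` in Ersilia's terminal utilities may differ in signature and behavior:

```python
import subprocess

def run_command(cmd, quiet=True):
    # Run a command (given as a list or a shell string), capture its
    # output as text, and return (stdout, stderr, returncode).
    result = subprocess.run(
        cmd,
        shell=isinstance(cmd, str),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    if not quiet:
        print(result.stdout)
    return result.stdout, result.stderr, result.returncode

stdout, stderr, returncode = run_command(["echo", "hello"])
```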
```python
return CatalogTable(data=[], columns=[])  # Return empty table

# Extract columns and values
columns = ["BENTO_SERVICE", "AGE", "APIS", "ARTIFACTS"]
```
We can move this to `defaults.py` as something like `BENTOML_COLS`.
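The suggestion amounts to hoisting the literal into the shared defaults module, roughly like this (the module path and constant name follow the reviewer's suggestion and are assumptions):

```python
# In defaults.py (location assumed), define the constant once:
BENTOML_COLS = ["BENTO_SERVICE", "AGE", "APIS", "ARTIFACTS"]

# At the call site, import it instead of repeating the literal, e.g.:
# from ersilia.default import BENTOML_COLS
columns = BENTOML_COLS
```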
```python
if mf.seems_installable(model_id=self.model_id):
    mf.fetch(model_id=self.model_id)
else:
    self.logger.debug("Not installable with BentoML")
```
We can retain this log line.
```python
result = subprocess.run(
    cmd,
    shell=isinstance(cmd, str),
    stdout=subprocess.PIPE,
```
Why are we piping both stdout and stderr? Why not set both to `subprocess.STDOUT`?
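A minimal sketch of the alternative the reviewer is asking about: redirecting stderr into stdout so a single pipe captures both streams interleaved:

```python
import subprocess
import sys

# Run a child process that writes to both streams; merging stderr into
# the stdout pipe means one read captures everything.
result = subprocess.run(
    [sys.executable, "-c",
     "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # merge stderr into the stdout pipe
    text=True,
)
print(result.stdout)
```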
Hi @Abellegese and @DhanshreeA, I have pushed updates to two files: ...
These changes address both the error handling and the artifact cleanup requirements. My goal was to ensure that any failures in the ...
Hi Dhanshree, I have been investigating the module shadowing issue and its impact on our BentoML integration. Here is a summary of my findings and concerns:
- Root cause:
- My observations:
- Proposed fix:
I would like to clarify whether proceeding with the renaming (and the subsequent update of imports) is acceptable, given that ...
Hi @OlawumiSalaam I think this is great. A few thoughts.
Thank you for taking the time to contribute to Ersilia; just a few checks before we proceed.
Description
This PR resolves an issue where the Ersilia CLI fails to fetch a model (especially due to BentoML-related errors) but still adds it to the catalog, leaving users unaware of the failed fetch.
Changes made
To do
Relevant tests to ensure the new error-handling mechanism works as expected.
Is this pull request related to any open issue? If yes, replace issueID below with the issue ID
Fix #1505