
Handle BentoML errors & clean up failed models #1527

Open · wants to merge 8 commits into base: master

Conversation

OlawumiSalaam
Contributor

Thank you for taking the time to contribute to Ersilia. Just a few checks before we proceed:

  • Have you followed the guidelines in our Contribution Guide
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

Description

This PR resolves an issue where the Ersilia CLI fails to fetch a model (especially due to BentoML-related errors) but still adds it to the catalog, leaving users unaware of the failed fetch.

Changes made

  1. Encapsulated BentoML-related errors:
  • Wrapped all BentoML subprocess calls in a general BentoMLError to ensure proper error handling downstream.
  2. Cleaned up failed model artifacts:
  • Implemented logic to remove model artifacts when fetching fails, similar to how artifacts are removed for StandardModelExampleError.
  3. Provided user feedback on failures:
  • Improved logging and added user prompts to notify users when a model cannot be fetched.
  • If a fetch fails, users are asked whether they want to delete the model artifacts; the default answer is yes.
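The error-encapsulation idea in point 1 can be sketched roughly as follows. This is a minimal illustration, not Ersilia's actual implementation; `BentoMLException` and `run_bentoml_command` are hypothetical names standing in for the real wrapper:

```python
import subprocess


class BentoMLException(Exception):
    """Raised when a BentoML subprocess call fails."""


def run_bentoml_command(cmd):
    # Run the command, capturing both output streams so errors are not lost.
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the stderr stream in the exception message so the
        # fetcher downstream can report it and trigger cleanup.
        raise BentoMLException(
            f"Command {' '.join(cmd)} failed: {result.stderr.strip()}"
        )
    return result.stdout
```

Any caller that sees `BentoMLException` then knows a BentoML step failed and can delete the partially fetched artifacts.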

To do

Add relevant tests to ensure the new error-handling mechanism works as expected.

Is this pull request related to any open issue? If yes, replace issueID below with the issue ID

Fix #1505

@OlawumiSalaam
Contributor Author

@DhanshreeA
Please review and let me know if any changes are needed.
Thanks for your guidance, as always!

@OlawumiSalaam
Contributor Author

OlawumiSalaam commented Jan 29, 2025

@DhanshreeA
Quick update:
I was able to fetch eos69p9 and eos2ta5 successfully without errors. These models use bentoml.

14:44:36 | DEBUG | Resolved pack method: bentoml

@OlawumiSalaam
Contributor Author

`14:18:11 | DEBUG    | Pack method is: bentoml
14:18:11 | DEBUG    | Service class: conda
14:18:11 | DEBUG    | Getting APIs from list file
14:18:11 | DEBUG    | Getting APIs from BentoML
14:18:11 | DEBUG    | Getting APIs from Bento
14:18:11 | DEBUG    | Getting info from BentoML and storing in /tmp/ersilia-v_830lch/information.json
14:18:14 | DEBUG    | Info {'name': 'eos69p9', 'version': '20250129141807_E52199', 'created_at': '2025-01-29T13:18:08.362066Z', 'env': {'conda_env': 'name: bentoml-default-conda-env\nchannels:\n- defaults\ndependencies: []\n', 'python_version': '3.7.16', 'docker_base_image': 'bentoml/model-server:0.11.0-py37', 'pip_packages': ['bentoml==0.11.0']}, 'artifacts': [{'name': 'model', 'artifact_type': 'Artifact', 'metadata': {}}], 'apis': [{'name': 'run', 'input_type': 'JsonInput', 'docs': "BentoService inference API 'run', input: 'JsonInput', output: 'DefaultOutput'", 'output_config': {'cors': '*'}, 'output_type': 'DefaultOutput', 'mb_max_latency': 10000, 'mb_max_batch_size': 2000, 'batch': 

> True}]}`

@DhanshreeA
Member

`14:18:11 | DEBUG    | Pack method is: bentoml
14:18:11 | DEBUG    | Service class: conda
14:18:11 | DEBUG    | Getting APIs from list file
14:18:11 | DEBUG    | Getting APIs from BentoML
14:18:11 | DEBUG    | Getting APIs from Bento
14:18:11 | DEBUG    | Getting info from BentoML and storing in /tmp/ersilia-v_830lch/information.json
14:18:14 | DEBUG    | Info {'name': 'eos69p9', 'version': '20250129141807_E52199', 'created_at': '2025-01-29T13:18:08.362066Z', 'env': {'conda_env': 'name: bentoml-default-conda-env\nchannels:\n- defaults\ndependencies: []\n', 'python_version': '3.7.16', 'docker_base_image': 'bentoml/model-server:0.11.0-py37', 'pip_packages': ['bentoml==0.11.0']}, 'artifacts': [{'name': 'model', 'artifact_type': 'Artifact', 'metadata': {}}], 'apis': [{'name': 'run', 'input_type': 'JsonInput', 'docs': "BentoService inference API 'run', input: 'JsonInput', output: 'DefaultOutput'", 'output_config': {'cors': '*'}, 'output_type': 'DefaultOutput', 'mb_max_latency': 10000, 'mb_max_batch_size': 2000, 'batch': 

> True}]}`

I don't understand, what's the True from the logs here?

@OlawumiSalaam
Contributor Author

OlawumiSalaam commented Jan 29, 2025

I did not copy the full log. Here, "batch": True means that the API supports batch processing. This is my understanding.

@OlawumiSalaam
Contributor Author

(ersilia) olawumi-salaam@DESKTOP-L7RDL5L:~/ersilia_internship/ersilia$ ersilia -v fetch eos69p9 --from_s3 > eos69p9_fetch.txt
18:06:40 | DEBUG    | Trying to get metadata from: /home/olawumi-salaam/eos/dest/eos69p9
18:06:41 | INFO     | Detected Python 3.12. Verifying setuptools...
18:06:41 | INFO     | Setuptools is already installed.
18:06:42 | DEBUG    | Initialized with URL: None
18:06:42 | DEBUG    | Getting model source
18:06:42 | DEBUG    | Model getting fetched from Amazon S3
18:06:42 | INFO     | Model doesn't exist on your system, fetching it now.
18:06:42 | DEBUG    | Starting fetching procedure
18:06:42 | DEBUG    | Fetching in your system, not from DockerHub
18:06:42 | DEBUG    | Deciding fetcher (BentoML or FastAPI)
18:06:44 | DEBUG    | Fetching using BentoML
18:06:44 | DEBUG    | Checking if the model is installable with BentoML
18:06:45 | DEBUG    | Starting fetching procedure
18:06:45 | INFO     | GitHub CLI is not installed. Ersilia can work without it, but we highly recommend that you install this tool.
18:06:45 | DEBUG    | Git LFS is installed
18:06:45 | DEBUG    | Git LFS has been activated
18:06:46 | DEBUG    | Connected to the internet
18:06:46 | DEBUG    | Conda is installed
18:06:46 | DEBUG    | EOS Home path exists
18:06:46 | INFO     | Starting delete of model eos69p9
18:06:46 | DEBUG    | Attempting Bento delete
18:06:49 | INFO     | Deleting conda environment eos69p9

@@ -25,6 +25,15 @@
FetchResult = namedtuple("FetchResult", ["fetch_success", "reason"])


class BentoMLError(Exception):
Member

Hey @OlawumiSalaam could you move this class to this module, and implement it according to the other exceptions implemented here?

Member

Actually chuck that, we should be reusing this exception class. You can put your message string in there, right now the class is simply a placeholder.

Contributor Author

I actually thought about that, but wanted your feedback first. Thanks

except NotInstallableWithBentoML:
raise

except Exception as e:
Member

We don't need to have any try...excepts in this code, because if you look carefully, this part is contextualized with the decorator throw_ersilia_exception, implemented here. What this decorator does is to see if there are any exceptions in the function that it decorates, and then prints that nicely on the terminal. And the implementation itself has a try...except, so it removes the need for adding that explicitly to methods decorated with this function.
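For context, the decorator pattern described above can be illustrated with a minimal stand-in. The real throw_ersilia_exception in Ersilia's codebase is more elaborate; this sketch only shows the catch, print nicely, and re-raise idea that makes explicit try...excepts in decorated methods redundant:

```python
import functools


def throw_ersilia_exception(func):
    # Simplified sketch: catch any exception raised by the wrapped
    # function, print it on the terminal, then re-raise it unchanged.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as err:
            print(f"Ersilia error in {func.__name__}: {err}")
            raise

    return wrapper
```

Because the decorator already wraps the call in a try...except, methods decorated with it do not need their own blanket exception handlers.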

@@ -395,4 +417,20 @@ async def fetch(self, model_id: str) -> bool:
fetch_success=True, reason="Model fetched successfully"
)
else:
# Handle BentoML-specific errors here
if isinstance(fr.reason, BentoMLError):
Member

No, we shouldn't be adding this here. Look at L392, we should be catching one of StandardModelExampleError, or BentoMLError. So something like:

try:
    fr = await self._fetch(model_id)
    if fr.fetch_success:
        self._standard_csv_example(model_id)
except (StandardModelExampleError, BentoMLError) as err:  # <--- In the future we can account for more cases
    self.logger.debug("Standard model example failed, deleting artifacts")
    do_delete = yes_no_input(
        "Do you want to delete the model artifacts? [Y/n]",
        default_answer="Y",
    )
    if do_delete:
        md = ModelFullDeleter(overwrite=False)
        md.delete(model_id)
    return FetchResult(
        fetch_success=False,
        reason="...",  # Here modify the reason string based on the error.msg (which would be in the Exception class)
    )

Member

Please don't use this code as is because the formatting might be messed up.

Contributor Author

Hi @DhanshreeA ,

Thank you for your detailed feedback. There is a lot of information to process so I am taking one step at a time. I have reworked the fetch() method to better align with your suggestions, and I would like to confirm if this approach is correct before finalizing.
Here is my understanding and changes made:

  1. Structured try...except block:
  • Wrapped _fetch() and _standard_csv_example() in a try block.
  • Explicitly caught StandardModelExampleError and BentoMLException to handle cleanup and error messaging.
  2. Dynamic error handling:
  • Used str(err) to populate the reason field in FetchResult, ensuring error messages are specific to the exception.
  3. Cleanup logic:
  • Moved artifact deletion into the except block to ensure cleanup occurs only when errors are encountered.

Contributor Author

async def fetch(self, model_id: str) -> bool:
    """
        Fetch a model with the given eos identifier.

        Parameters
        ----------
        model_id : str
            The eos identifier of the model.

        Returns
        -------
        bool
            True if the model was fetched successfully, False otherwise.

        Examples
        --------
        .. code-block:: python

            fetcher = ModelFetcher(config_json=config)
            success = await fetcher.fetch(model_id="eosxxxx")
        """
    try:
        fr = await self._fetch(model_id)
        if not fr.fetch_success:
            return fr  

        self._standard_csv_example(model_id)
        self.logger.debug("Writing model source to file")
        model_source_file = os.path.join(
            self._model_path(model_id), MODEL_SOURCE_FILE
        )
        try:
            os.makedirs(self._model_path(model_id), exist_ok=True)
        except OSError as error:
            self.logger.error(f"Error during folder creation: {error}")
        with open(model_source_file, "w") as f:
            f.write(self.model_source)

        return FetchResult(
            fetch_success=True, reason="Model fetched successfully"
        )

    except (StandardModelExampleError, BentoMLException) as err:
        self.logger.debug(f"{type(err).__name__} occurred: {str(err)}")
        do_delete = yes_no_input(
            "Do you want to delete the model artifacts? [Y/n]",
            default_answer="Y",
        )
        if do_delete:
            md = ModelFullDeleter(overwrite=False)
            md.delete(model_id)
            self.logger.info(f"Model '{model_id}' artifacts successfully deleted.")
            print(f"✅ Model '{model_id}' artifacts have been successfully deleted.")

        reason = str(err) if str(err) else "An unknown error occurred during fetching."
        return FetchResult(fetch_success=False, reason=reason)

Does this structure align with your expectations and the intent of your feedback? Thank you.

Member

Perfect @OlawumiSalaam ship it!

Member

One more thing, avoid using print statements, please use the logger.

@DhanshreeA
Member

DhanshreeA commented Jan 30, 2025

Hey @OlawumiSalaam some examples of where we invoke bentoml through subprocess calls.

Also from the logs shared in the referenced issue for this PR, for model eos69p6, we see the error occurred after these lines:

12:39:16 | DEBUG    | Service class: conda
12:39:16 | DEBUG    | Getting APIs from list file
12:39:16 | DEBUG    | Getting APIs from BentoML
12:39:16 | DEBUG    | Getting APIs from Bento
12:39:16 | DEBUG    | Getting info from BentoML and storing in /tmp/ersilia-jjar2a_c/information.json
🚨🚨🚨 Something went wrong with Ersilia 🚨🚨🚨

Error message:

Expecting value: line 1 column 1 (char 0)
If this error message is not helpful, open an issue at:
 - https://github.com/ersilia-os/ersilia
Or feel free to reach out to us at:
 - hello[at]ersilia.io

So I would recommend looking for these logs in the codebase, and the functions/methods that make subsequent subprocess calls to bentoml. Usually, you'd see this happening through a terminal utility we have called run_command. You would want to inspect the std_error stream from these subprocess calls.

This here seems like a JSONDecodeError (Expecting value: line 1 column 1 (char 0)), so I would put the JSON reading in a try...except as well, catch the JSONDecodeError, and in turn raise a BentoMLError.

Remember, the main idea is to catch the BentoMLError like here and delete the model artifacts that got created during fetch, so it doesn't appear in the catalog eventually.
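The suggestion above, catching the JSONDecodeError and re-raising it as a BentoML-specific error, might look roughly like this. This is a sketch with hypothetical names (`parse_bento_info`, `BentoMLException`), not the actual Ersilia code:

```python
import json


class BentoMLException(Exception):
    """Raised when BentoML output cannot be interpreted."""


def parse_bento_info(raw_output: str) -> dict:
    # If bentoml printed an error message (or nothing at all) instead of
    # JSON, json.loads fails with "Expecting value: line 1 column 1
    # (char 0)". Re-raise it as a BentoML-specific error so the fetcher
    # can clean up the model artifacts instead of cataloguing a broken model.
    try:
        return json.loads(raw_output)
    except json.JSONDecodeError as err:
        raise BentoMLException(
            f"Could not parse BentoML output as JSON: {err}"
        ) from err
```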

@@ -395,4 +417,20 @@ async def fetch(self, model_id: str) -> bool:
fetch_success=True, reason="Model fetched successfully"
)
else:
# Handle BentoML-specific errors here
if isinstance(fr.reason, BentoMLError):
Member

One more thing, avoid using print statements, please use the logger.

@OlawumiSalaam
Contributor Author

Request for Review
Hi @DhanshreeA , thank you for the guidance! Here’s a summary of the changes for your review:

Key Changes

  1. Error Handling & Cleanup
  • Added JSONDecodeError handling and BentoML command failure checks.
  • Raised BentoMLException to trigger artifact cleanup (prevents invalid models from appearing in the catalog).
  2. Files Updated:
  • utils/terminal.py: Refactored run_command to return stdout, stderr, and returncode.
  • core/base.py, hub/content/catalog.py: Replaced raw subprocess.run with run_command.
  3. Logging Improvements
  • Replaced print() statements with logger for consistency.
  • Captured stderr from BentoML subprocess calls to surface hidden errors (e.g., network failures).
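The run_command refactor described in point 2 might be sketched like this; an illustrative stand-in assuming the simplest possible signature, the real utility in utils/terminal.py may differ:

```python
import subprocess


def run_command(cmd, quiet=False):
    # Return stdout, stderr and the return code instead of discarding
    # them, so callers can inspect failures from BentoML subprocess calls.
    result = subprocess.run(
        cmd,
        shell=isinstance(cmd, str),  # accept both a string and an argv list
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    if not quiet and result.stdout:
        print(result.stdout, end="")
    return result.stdout, result.stderr, result.returncode
```

Callers can then decide what to do with a non-zero return code, for example raising a BentoMLException and cleaning up artifacts.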

Thank you!

@OlawumiSalaam
Contributor Author

Hey @OlawumiSalaam some examples of where we invoke bentoml through subprocess calls.

Also from the logs shared in the referenced issue for this PR, for model eos69p6, we see the error occurred after these lines:

12:39:16 | DEBUG    | Service class: conda
12:39:16 | DEBUG    | Getting APIs from list file
12:39:16 | DEBUG    | Getting APIs from BentoML
12:39:16 | DEBUG    | Getting APIs from Bento
12:39:16 | DEBUG    | Getting info from BentoML and storing in /tmp/ersilia-jjar2a_c/information.json
🚨🚨🚨 Something went wrong with Ersilia 🚨🚨🚨

Error message:

Expecting value: line 1 column 1 (char 0)
If this error message is not helpful, open an issue at:
 - https://github.com/ersilia-os/ersilia
Or feel free to reach out to us at:
 - hello[at]ersilia.io

So I would recommend looking for these logs in the codebase, and the functions/methods that make subsequent subprocess calls to bentoml. Usually, you'd see this happening through a terminal utility we have called run_command. You would want to inspect the std_error stream from these subprocess calls.

This here seems like a JSONDecodeError (Expecting value: line 1 column 1 (char 0)), so I would put the JSON reading in a try...except as well, catch the JSONDecodeError, and in turn raise a BentoMLError.

Remember, the main idea is to catch the BentoMLError like here and delete the model artifacts that got created during fetch, so it doesn't appear in the catalog eventually.

Hi @DhanshreeA,

Following your feedback, I audited the codebase to identify all BentoML subprocess calls. I used grep to search for "bentoml" in the Python files, which yielded several results. The main areas where BentoML commands are invoked include serving (e.g., in serve/services.py, where lines 84 and 192 execute bentoml info and bentoml serve), model deletion in hub/delete/delete.py (line 271, which calls bentoml delete via subprocess.run), and model containerization in hub/fetch/actions/toolize.py (line 62, which runs bentoml containerize).

My plan is to update these calls to use the updated run_command function from ersilia/utils/terminal.py instead of raw subprocess calls. The updated approach will check the return code and capture stderr, and in the event of a failure it will raise a BentoMLException to trigger proper artifact cleanup and provide clear user feedback. Could you confirm whether this approach aligns with your expectations? If it does, I will proceed with these changes and update you on my progress.
Thank you.

@Abellegese
Contributor

Hi @OlawumiSalaam, great work. Just one comment: I am not seeing a fix that automatically resolves that specific issue when it arises. I mentioned the details in issue #1532 (comment). The error bentoml creates can now be captured by your PR, but we also need a small change to fix the issue itself. Here are the details that I discussed with Gemma:

I highly suspect that this problem is what we call module shadowing. Let me explain. We have a file in ersilia called bentoml.py, located in ersilia/setup/requirements/. So when you install ersilia for the first time, this module (bentoml.py) is packed as an ersilia subpackage. Then you start fetching from GitHub, a local dir, or S3, which might require a bentoml installation; when bentoml gets installed at this point, it overrides the first module and everything works perfectly. But this installed bentoml package can be broken for the following possible reasons:

  1. The ersilia subpackage (bentoml.py) shadows it and breaks the dependency.
  2. It might be due to a corrupted installation or conflicts in the environment.
  3. As we know, ersilia usually gets installed in editable mode. Editable installations often use symlinks to link the ersilia package's source code to the Python environment. If these symlinks are broken, or if there are leftover artifacts from a previous installation, it could cause issues.

@Abellegese
Contributor

One solution I found was to immediately uninstall bentoml; then, when we fetch again, it works. You may take it from here.

@OlawumiSalaam
Contributor Author

@Abellegese Thank you for your feedback. The focus was just on the specific issue I am working on. Your solution was not captured earlier, but I will definitely look into it and post updates here as I go. I hope we get these bentoml issues solved. You are always helpful. Thank you.

@OlawumiSalaam
Contributor Author

Hi @Abellegese @DhanshreeA,

Issue Summary
Module Shadowing:
The Ersilia codebase has a file ersilia/setup/requirements/bentoml.py that conflicts with the official bentoml Python package. When Ersilia is installed in editable mode, Python prioritizes this local file over the actual bentoml package, leading to:

  • Broken dependencies during BentoML installation.
  • Runtime errors (e.g., crashes when code expects the official package but imports the local file).

Proposed Fix
Rename Conflicting File:
Rename bentoml.py to bentoml_requirement.py to eliminate naming conflicts.

Update Imports:
Refactor all references to the renamed file (e.g., change from ersilia.setup.requirements.bentoml import BentoMLRequirement to from ersilia.setup.requirements.bentoml_requirement import BentoMLRequirement).

Installation Safeguards:
In BentoMLRequirement.install(), add logic to auto-uninstall corrupted BentoML if the installed version isn’t Ersilia’s, then reinstall using run_command:

def install(self) -> None:  
    if self.is_installed() and not self.is_bentoml_ersilia_version():  
        run_command([sys.executable, "-m", "pip", "uninstall", "bentoml", "-y"])  
    run_command([sys.executable, "-m", "pip", "install", "-U", "git+https://github.com/ersilia-os/bentoml-ersilia.git"])  

I want to ensure that my understanding and proposed approach meet your expectations before moving forward with these changes or if there are additional adjustments you would recommend.

@Abellegese
Contributor

Abellegese commented Feb 5, 2025

Hi @OlawumiSalaam this looks good. Have you tried it to see if it solves the problem? If it does not solve it, let me know; I have another idea we can go for.

Update:
It would be nice if we first detect that the JSONDecodeError happened, and then, before exiting with a failure, automatically apply the solution you proposed so that normal fetching can proceed after that.
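The detect-repair-retry idea in this update could look roughly like the sketch below. The helper names are hypothetical: `run_info_command` stands in for the step that invokes bentoml and returns its raw output, and `reinstall_bentoml` stands in for uninstalling and reinstalling ersilia's bentoml fork:

```python
import json


def get_bento_info(run_info_command, reinstall_bentoml):
    # Try to parse the bentoml output; if it is not valid JSON (the
    # symptom of a broken bentoml install discussed above), run the
    # repair step once and retry before giving up.
    for attempt in range(2):
        raw = run_info_command()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            if attempt == 1:
                raise  # repair did not help; surface the error
            reinstall_bentoml()
```

On the second failure the JSONDecodeError propagates, so the existing cleanup path still runs when the repair does not help.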

@DhanshreeA
Member

Request for Review Hi @DhanshreeA , thank you for the guidance! Here’s a summary of the changes for your review:

Key Changes

1. Error Handling & Cleanup


* Added `JSONDecodeError` handling and `BentoML` command failure checks.

* Raised `BentoMLException` to trigger artifact cleanup (prevents invalid models from appearing in the catalog).


2. Files Updated:


* `utils/terminal.py`: Refactored `run_command` to return `stdout`, `stderr`, and `returncode`.

* `core/base.py`, `hub/content/catalog.py`: Replaced raw `subprocess.run` with `run_command`.


3. Logging Improvements


* Replaced print() statements with logger for consistency.

* Captured stderr from BentoML subprocess calls to surface hidden errors (e.g., network failures).

Thank you!

This is perfect, thanks @OlawumiSalaam for covering these cases.

@DhanshreeA
Member

DhanshreeA commented Feb 5, 2025

Hi @OlawumiSalaam, great work. Just one comment: I am not seeing a fix that automatically resolves that specific issue when it arises. I mentioned the details in issue #1532 (comment). The error bentoml creates can now be captured by your PR, but we also need a small change to fix the issue itself. Here are the details that I discussed with Gemma:

I highly suspect that this problem is what we call module shadowing. Let me explain. We have a file in ersilia called bentoml.py, located in ersilia/setup/requirements/. So when you install ersilia for the first time, this module (bentoml.py) is packed as an ersilia subpackage. Then you start fetching from GitHub, a local dir, or S3, which might require a bentoml installation; when bentoml gets installed at this point, it overrides the first module and everything works perfectly. But this installed bentoml package can be broken for the following possible reasons:

1. The ersilia subpackage (`bentoml.py`) shadows it and breaks the dependency.

2. It might be due to a corrupted installation or conflicts in the environment.

3. As we know, ersilia usually gets installed in editable mode. Editable installations often use symlinks to link the ersilia package's source code to the Python environment. If these symlinks are broken, or if there are leftover artifacts from a previous installation, it could cause issues.

@Abellegese I agree that this is probably what happens; however, that points to a larger issue with users, i.e., either not having conda on their system or on their path, causing Ersilia to fall back to installing all model dependencies within the ersilia environment, which should not happen to begin with. While your solution works, I am afraid it's not a good long-term solution, because ersilia should always create standalone model environments and not corrupt its own environment when it cannot find conda on the system path.

EDIT: Looking at logs from #1531 and #1532 my conda theory fails because I can see that ersilia is able to create the right conda environments for the models. You're probably on to something with the symlinking/module shadow theory. If I happen to run into this error at some point, I would inspect which source ends up running for the bentoml cli.

@OlawumiSalaam
Contributor Author

When Ersilia is installed in editable mode, Python prioritizes this local file over the actual bentoml package, leading to:

* Broken dependencies during BentoML installation.

* Runtime errors (e.g., crashes when code expects the official package but imports the local file).

@OlawumiSalaam we maintain our own fork of bentoml, and have stopped using the official bentoml distribution a long time ago. Please let's not resort to using the official bentoml at all, I don't even know what all that might break.

@DhanshreeA
So what do you suggest is the way forward? Your guidance is deeply appreciated.

@DhanshreeA
Member

When Ersilia is installed in editable mode, Python prioritizes this local file over the actual bentoml package, leading to:

* Broken dependencies during BentoML installation.

* Runtime errors (e.g., crashes when code expects the official package but imports the local file).

@OlawumiSalaam we maintain our own fork of bentoml, and have stopped using the official bentoml distribution a long time ago. Please let's not resort to using the official bentoml at all, I don't even know what all that might break.

@DhanshreeA So what do you suggest is the way forward? Your guidance is deeply appreciated.

@OlawumiSalaam my bad, after following the discussions across the referenced issues, I think you and @Abellegese are right. Let's implement it this way.

@DhanshreeA
Member

Update:

@OlawumiSalaam - I agree with Abel here, adding a handling logic for the JSONDecodeError, in ersilia/ersilia/serve/services.py", line 98, in _get_apis_from_bento would be ideal to make this more robust, on top of everything else that you're doing.


cmd = ["bentoml", "get", f"{model_id}:latest", "--print-location", "--quiet"]
stdout, stderr, returncode = run_command(cmd, quiet=True)

Member

Love this implementation, thanks!

return CatalogTable(data=[], columns=[]) # Return empty table

# Extract columns and values
columns = ["BENTO_SERVICE", "AGE", "APIS", "ARTIFACTS"]
Member

We can move this to defaults.py as something like BENTOML_COLS.

if mf.seems_installable(model_id=self.model_id):
mf.fetch(model_id=self.model_id)
else:
self.logger.debug("Not installable with BentoML")
Member

We can retain this log line.

result = subprocess.run(
cmd,
shell=isinstance(cmd, str),
stdout=subprocess.PIPE,
Member

Why are we piping both stdout and stderr? Why not set both to subprocess.STDOUT?
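For reference, merging the two streams as this question suggests would look like the snippet below. The trade-off is that stdout and stderr can no longer be told apart afterwards, which matters if callers want to inspect stderr separately (as the refactored run_command does):

```python
import subprocess
import sys

# With stderr=subprocess.STDOUT, stderr is merged into stdout: a single
# interleaved stream comes back in result.stdout, and result.stderr is None.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print('out'); print('err', file=sys.stderr)"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)
```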

* upgrade loguru (ersilia-os#1525)

* Bump abatilo/actions-poetry from 3.0.1 to 4.0.0

Bumps [abatilo/actions-poetry](https://github.com/abatilo/actions-poetry) from 3.0.1 to 4.0.0.
- [Release notes](https://github.com/abatilo/actions-poetry/releases)
- [Changelog](https://github.com/abatilo/actions-poetry/blob/master/.releaserc)
- [Commits](abatilo/actions-poetry@v3.0.1...v4.0.0)

---
updated-dependencies:
- dependency-name: abatilo/actions-poetry
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Dhanshree Arora <DhanshreeA@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@OlawumiSalaam
Contributor Author

Hi @Abellegese and @DhanshreeA,

I have pushed updates to two files—service.py and bentoml.py—as part of my solution for the BentoML-related fetch/serve issues. Here is a summary of the changes I made:

  1. Enhanced Error Handling in service.py:
  • Updated the logic for fetching API information from BentoML by capturing errors from the BentoML subprocess calls.
  • Added specific handling for JSONDecodeError when parsing the JSON output from BentoML.
  • If a JSON parsing error is detected (indicating a potentially corrupted BentoML installation), the code now raises a BentoMLException to trigger the artifact cleanup process.
  • This ensures that errors are not silently ignored, and invalid models do not appear in the catalog.
  2. Refined Installation and Cleanup in bentoml.py:
  • Refactored the installation logic to include safeguards: if an incompatible or corrupted version of BentoML is detected, the system will automatically uninstall it and attempt a reinstallation using our updated run_command utility.
  • Integrated thread safety by using a class-level lock to prevent concurrent installations from causing race conditions.
  • Improved logging to capture stderr from BentoML subprocess calls, which helps in debugging and ensures that all errors are properly surfaced.

These changes address both the error handling and the artifact cleanup requirements. My goal was to ensure that any failures in the BentoML subprocess calls are caught under a generalized BentoMLException and then resolved automatically by cleaning up and reinstalling the correct version. This should mitigate issues caused by module shadowing and prevent invalid models from appearing in the catalog.
I look forward to your feedback on these updates.
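The class-level lock mentioned in point 2 can be illustrated minimally as below; the actual install logic is elided, only the thread-safety pattern is shown:

```python
import threading


class BentoMLRequirement:
    # A class-level lock is shared by all instances, so concurrent
    # install() calls (even on different objects) cannot race each other.
    _lock = threading.Lock()

    def install(self) -> bool:
        with self._lock:
            # ... uninstall a corrupted bentoml and reinstall the fork ...
            return True
```

An instance-level lock created in `__init__` would not protect against two separate instances installing at the same time, hence the class attribute.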

@OlawumiSalaam
Contributor Author

Hi Dhanshree,

I have been investigating the module shadowing issue and its impact on our BentoML integration. Here is a summary of my findings and concerns:

Root Cause:
The local file ersilia/setup/requirements/bentoml.py conflicts with the official BentoML package. When Ersilia is installed (especially in editable mode), Python may prioritize this local file over the external package. This leads to dependency issues and runtime errors when code expects the official BentoML package.

My Observations:
Running the command
grep -Rn "ersilia/setup/requirements/bentoml"
mainly returned results from binary files (e.g., __pycache__), but a manual inspection confirmed that explicit import statements (such as in services.py) reference the old module path. This indicates that BentoML is deeply embedded in the codebase.

Proposed Fix:
I suggest renaming the conflicting file from ersilia/setup/requirements/bentoml.py to bentoml_requirement.py and updating all corresponding import statements
from ersilia.setup.requirements.bentoml import BentoMLRequirement
to
from ersilia.setup.requirements.bentoml_requirement import BentoMLRequirement
This change is important because it directly addresses the root cause of the module shadowing issue by eliminating the possibility that the local file will override the official BentoML package during imports.
Additional Changes:
In addition to the renaming, my PR also implements improved error handling and cleanup logic for BentoML subprocess calls. This includes capturing JSON decoding errors and other BentoML command failures, and raising a BentoMLException to trigger artifact cleanup. Together, these changes help ensure that if fetching fails, invalid models do not appear in the catalog.

I would like to clarify if proceeding with the renaming (and the subsequent update of imports) is acceptable, given that BentoML is so deeply rooted in our codebase. I want to avoid a scenario where renaming inadvertently breaks many references. Please let me know if this approach meets your expectations or if you have an alternative suggestion.
Thanks

@Abellegese
Contributor

Hi @Abellegese and @DhanshreeA,

I have pushed updates to two files—service.py and bentoml.py—as part of my solution for the BentoML-related fetch/serve issues. Here is a summary of the changes I made:

  1. Enhanced Error Handling in service.py:
  • Updated the logic for fetching API information from BentoML by capturing errors from the BentoML subprocess calls.
  • Added specific handling for JSONDecodeError when parsing the JSON output from BentoML.
  • If a JSON parsing error is detected (indicating a potentially corrupted BentoML installation), the code now raises a BentoMLException to trigger the artifact cleanup process.
  • This ensures that errors are not silently ignored, and invalid models do not appear in the catalog.
  2. Refined Installation and Cleanup in bentoml.py:
  • Refactored the installation logic to include safeguards: if an incompatible or corrupted version of BentoML is detected, the system will automatically uninstall it and attempt a reinstallation using our updated run_command utility.
  • Integrated thread safety by using a class-level lock to prevent concurrent installations from causing race conditions.
  • Improved logging to capture stderr from BentoML subprocess calls, which helps in debugging and ensures that all errors are properly surfaced.

These changes address both the error handling and the artifact cleanup requirements. My goal was to ensure that any failures in the BentoML subprocess calls are caught under a generalized BentoMLException and then resolved automatically by cleaning up and reinstalling the correct version. This should mitigate issues caused by module shadowing and prevent invalid models from appearing in the catalog.
I look forward to your feedback on these updates.

Hi @OlawumiSalaam, I think this is great. A few thoughts:

  • The JSON parsing error, I think, happens almost at the final step. Instead of cleaning up the artifacts, what if we reinstall bentoml and then continue writing the `information.json` file, since this is where it fails?
  • Second comment: could you please try the solution and report the results here? Without that, it would be impossible to know whether it is working or not.


Successfully merging this pull request may close these issues.

🐛 Bug: Ersilia fetch/serve fails but model appears on catalog