
🐛 Bug: Ersilia fetch/serve fails but model appears on catalog #1505

Open
GemmaTuron opened this issue Jan 14, 2025 · 6 comments · May be fixed by #1527
Labels
bug Something isn't working

Comments

@GemmaTuron
Member

GemmaTuron commented Jan 14, 2025

eos69p9_serve.txt

Describe the bug.

Hi,

I tried to get a model through the CLI directly using the serve command (Docker inactive, so it falls back to S3), but it crashed (see the attached error log). Nonetheless, if I run ersilia catalog --local --more immediately afterwards, the model appears:

(ersilia) gturon@pujarnol:~$ ersilia catalog --local --more
┌───────┬────────────┬──────────────────────┬──────────────────────────────────────────────────────────────────────────┬────────────────────┬─────────────┬─────────────────┬──────────────┬──────────────┐
│ Index | Identifier | Slug                 | Title                                                                    | Task               | Input Shape | Output          | Output Shape | Model Source │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 1     | eos78ao    | mordred              | Mordred chemical descriptors                                             | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 2     | eos5axz    | morgan-counts        | Morgan counts fingerprints                                               | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 3     | eos69p9    | ssl-gcn-tox21        | Toxicity prediction across the Tox21 panel with semi-supervised learning | ['Classification'] | Single      | ['Probability'] | List         |              │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 4     | eos2gw4    | eosce                | Ersilia Compound Embeddings                                              | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 5     | eos3b5e    | molecular-weight     | Molecular weight                                                         | ['Regression']     | Single      | ['Other value'] | Single       | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 6     | eos4avb    | image-mol-embeddings | Molecular representation learning                                        | ['Representation'] | Single      | ['Descriptor']  | Matrix       | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 7     | eos4u6p    | cc-signaturizer      | Chemical Checker signaturizer                                            | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 8     | eos3cf4    | molfeat-chemgpt      | ChemGPT-4.7                                                              | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 9     | eos7w6n    | grover-embedding     | Large-scale graph transformer                                            | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 10    | eos7jio    | rdkit-fingerprint    | Path-based fingerprint                                                   | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
└───────┴────────────┴──────────────────────┴──────────────────────────────────────────────────────────────────────────┴────────────────────┴─────────────┴─────────────────┴──────────────┴──────────────┘

The Model Source column is empty for this model, which indicates the fetch failed. It would be good to have a way to catch this, or to warn users that a certain model is not working.

I'll tag this as an addition since it is not critical.
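
As an illustration only, here is a minimal sketch of how the catalog could flag such entries, assuming a model source file is only written on a successful fetch; the directory layout and names below (EOS_MODELS_DIR, model_source.txt, model_source_or_warning) are assumptions for this sketch, not the actual ersilia internals:

# Hypothetical sketch: mark catalog entries whose local directory has no
# recorded model source file.
import os

EOS_MODELS_DIR = os.path.expanduser("~/eos/dest")  # assumed local model store
MODEL_SOURCE_FILE = "model_source.txt"             # assumed file name

def model_source_or_warning(model_id: str) -> str:
    """Return the recorded source for a model, or a warning if it is missing."""
    source_path = os.path.join(EOS_MODELS_DIR, model_id, MODEL_SOURCE_FILE)
    if os.path.isfile(source_path):
        with open(source_path) as f:
            return f.read().strip()
    # A missing source file usually means the fetch did not complete
    return "NOT FETCHED (check logs)"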

Describe the steps to reproduce the behavior

No response

Operating environment

Ubuntu 24.04 LTS

@GemmaTuron GemmaTuron added the bug Something isn't working label Jan 14, 2025
@DhanshreeA
Member

DhanshreeA commented Jan 15, 2025

This bug is very similar to what @Abellegese described facing with bentoml. In any case, we should simply delete the model and its artifacts if it fails to fetch. We presently only do that if the model fails to generate a Standard Model Example, as you can see in this snippet from the referenced code:

        fr = await self._fetch(model_id)
        if fr.fetch_success:
            try:
                self._standard_csv_example(model_id)
            except StandardModelExampleError:
                self.logger.debug("Standard model example failed, deleting artifacts")
                do_delete = yes_no_input(
                    "Do you want to delete the model artifacts? [Y/n]",
                    default_answer="Y",
                )
                if do_delete:
                    md = ModelFullDeleter(overwrite=False)
                    md.delete(model_id)
                return FetchResult(
                    fetch_success=False,
                    reason="Could not successfully run a standard example from the model.",
                )
            else:
                self.logger.debug("Writing model source to file")
                model_source_file = os.path.join(
                    self._model_path(model_id), MODEL_SOURCE_FILE
                )
                try:
                    os.makedirs(self._model_path(model_id), exist_ok=True)
                except OSError as error:
                    self.logger.error(f"Error during folder creation: {error}")
                with open(model_source_file, "w") as f:
                    f.write(self.model_source)
                return FetchResult(
                    fetch_success=True, reason="Model fetched successfully"
                )
        else:
            return fr

I think we should encapsulate all BentoML-related sub-process calls with a general BentoMLError, catch that in downstream code, and then do something similar to the above, deleting the model artifacts if the fetch fails for whatever reason.
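
For example, a minimal sketch of what that wrapper could look like (BentoMLError and run_bentoml_command are illustrative names for this sketch, not existing ersilia code):

# Sketch only: funnel every BentoML CLI call through one helper so that any
# failure surfaces as a single exception type the fetch code can catch.
import subprocess

class BentoMLError(RuntimeError):
    """Raised when any BentoML-related subprocess call fails."""

def run_bentoml_command(args):
    """Run a bentoml CLI command and raise BentoMLError on any failure."""
    try:
        result = subprocess.run(
            ["bentoml"] + list(args),
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError) as exc:
        raise BentoMLError(
            f"BentoML command failed: bentoml {' '.join(args)}"
        ) from exc
    return result.stdout

Downstream code could then catch BentoMLError in one place and run the same ModelFullDeleter cleanup shown in the snippet above.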

@Abellegese
Contributor

Yes, exactly @DhanshreeA @GemmaTuron. A quick fix is to do the following:

pip uninstall bentoml
# then just call this bentoml command
bentoml --version

@DhanshreeA
Member

@OlawumiSalaam this might be interesting, and definitely more of a deep dive than your current task. Please take a look when you can.

@DhanshreeA DhanshreeA moved this from On Hold to In Progress in Ersilia Model Hub Jan 27, 2025
@OlawumiSalaam
Contributor

@DhanshreeA

This is my understanding of the issue and the steps I plan to take to address it:

The issue occurs when the Ersilia CLI fails to fetch a model but still adds it to the catalog, leaving users unaware of the failed fetch. To resolve this, I will implement the following improvements:

  1. Error Handling: Wrap all BentoML-related subprocess calls with a generalized exception (BentoMLError) to ensure errors are caught and properly handled downstream.

  2. Artifact Cleanup: Implement logic to delete any model artifacts associated with a failed fetch, similar to how artifacts are removed for StandardModelExampleError.

  3. User Feedback: Provide clear feedback to users when a model fetch or standard example generation fails.

I will proceed by modifying the fetch process to include these changes and adding appropriate logging to capture and communicate failure details; a rough sketch of the intended flow is shown below.
Please let me know if I am missing anything. Thanks!
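
A rough sketch of that flow, reusing ModelFullDeleter from the snippet above; BentoMLError and fetch_model are assumed names used here only for illustration:

# Illustrative only: catch the generalized BentoML error during fetch,
# clean up partial artifacts, and tell the user what happened.
import logging

logger = logging.getLogger("ersilia.fetch")

def fetch_with_cleanup(model_id):
    try:
        fetch_model(model_id)  # assumed fetch entry point; may raise BentoMLError
    except BentoMLError as err:
        logger.error("Fetch failed for %s: %s", model_id, err)
        md = ModelFullDeleter(overwrite=False)  # same deleter used for StandardModelExampleError
        md.delete(model_id)
        print(f"Model {model_id} could not be fetched; partial artifacts were removed.")
        return False
    return True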

@DhanshreeA
Member

This looks good @OlawumiSalaam thanks!

@OlawumiSalaam
Contributor

@DhanshreeA
Here is an update on my work on the bug where Ersilia fetch/serve fails yet the model is still listed in the catalog.

What I Have Done:

  1. Error encapsulation: I encapsulated all BentoML-related sub-process calls with a general BentoMLError to standardize error handling for these operations. This ensures that any issue encountered during fetch/serve is caught and can be handled downstream.

  2. Failure handling and cleanup: I introduced logic to catch the BentoMLError and trigger a cleanup process for any partially created model artifacts. This ensures that a failed fetch, whatever the reason, does not leave invalid artifacts in the system. We now have a more robust way to handle unexpected failures during BentoML-related operations, which should prevent similar issues in the future.

Next Step:
I am preparing to open a PR to submit these changes for review. The PR will include:

  1. Encapsulation of BentoML-related sub-process calls.
  2. Cleanup logic for failed fetch/serve operations.

@OlawumiSalaam OlawumiSalaam linked a pull request Jan 29, 2025 that will close this issue