
🐛 Bug: Ersilia fetch/serve fails but model appears on catalog #1505

Open
GemmaTuron opened this issue Jan 14, 2025 · 6 comments · May be fixed by #1527
Labels
bug Something isn't working

Comments

@GemmaTuron
Member

GemmaTuron commented Jan 14, 2025

eos69p9_serve.txt

Describe the bug.

Hi,

I tried to get a model through the CLI directly using the serve command (Docker inactive, so it falls back to S3), but it crashed (see the attached error log). Nonetheless, if I run ersilia catalog --local --more immediately afterwards, the model appears:

(ersilia) gturon@pujarnol:~$ ersilia catalog --local --more
┌───────┬────────────┬──────────────────────┬──────────────────────────────────────────────────────────────────────────┬────────────────────┬─────────────┬─────────────────┬──────────────┬──────────────┐
│ Index | Identifier | Slug                 | Title                                                                    | Task               | Input Shape | Output          | Output Shape | Model Source │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 1     | eos78ao    | mordred              | Mordred chemical descriptors                                             | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 2     | eos5axz    | morgan-counts        | Morgan counts fingerprints                                               | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 3     | eos69p9    | ssl-gcn-tox21        | Toxicity prediction across the Tox21 panel with semi-supervised learning | ['Classification'] | Single      | ['Probability'] | List         |              │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 4     | eos2gw4    | eosce                | Ersilia Compound Embeddings                                              | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 5     | eos3b5e    | molecular-weight     | Molecular weight                                                         | ['Regression']     | Single      | ['Other value'] | Single       | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 6     | eos4avb    | image-mol-embeddings | Molecular representation learning                                        | ['Representation'] | Single      | ['Descriptor']  | Matrix       | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 7     | eos4u6p    | cc-signaturizer      | Chemical Checker signaturizer                                            | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 8     | eos3cf4    | molfeat-chemgpt      | ChemGPT-4.7                                                              | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 9     | eos7w6n    | grover-embedding     | Large-scale graph transformer                                            | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
├───────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼────────────────────┼─────────────┼─────────────────┼──────────────┼──────────────┤
│ 10    | eos7jio    | rdkit-fingerprint    | Path-based fingerprint                                                   | ['Representation'] | Single      | ['Descriptor']  | List         | DockerHub    │
└───────┴────────────┴──────────────────────┴──────────────────────────────────────────────────────────────────────────┴────────────────────┴─────────────┴─────────────────┴──────────────┴──────────────┘

The Model Source column is empty for this model, which indicates the fetch failed. It would be good to have a way to catch this, or to warn users that a certain model is not working.

I'll tag this as an addition since it is not critical.
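
As an illustration only, here is a minimal sketch of how the catalog could flag such entries, assuming a model source file is only written on a successful fetch; the directory layout and names below (EOS_MODELS_DIR, model_source.txt, model_source_or_warning) are assumptions for this sketch, not the actual ersilia internals:

# Hypothetical sketch: mark catalog entries whose local directory has no
# recorded model source file.
import os

EOS_MODELS_DIR = os.path.expanduser("~/eos/dest")  # assumed local model store
MODEL_SOURCE_FILE = "model_source.txt"             # assumed file name

def model_source_or_warning(model_id: str) -> str:
    """Return the recorded source for a model, or a warning if it is missing."""
    source_path = os.path.join(EOS_MODELS_DIR, model_id, MODEL_SOURCE_FILE)
    if os.path.isfile(source_path):
        with open(source_path) as f:
            return f.read().strip()
    # A missing source file usually means the fetch did not complete
    return "NOT FETCHED (check logs)"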

Describe the steps to reproduce the behavior

No response

Operating environment

Ubuntu 24.04 LTS

@GemmaTuron GemmaTuron added the bug Something isn't working label Jan 14, 2025
@DhanshreeA
Member

DhanshreeA commented Jan 15, 2025

This bug is very similar to what @Abellegese described facing with bentoml. In any case, we should simply delete the model and its artifacts if it fails to fetch. We presently only do that if the model fails to generate a Standard Model Example, as you can see in this snippet from the referenced code:

        fr = await self._fetch(model_id)
        if fr.fetch_success:
            try:
                self._standard_csv_example(model_id)
            except StandardModelExampleError:
                self.logger.debug("Standard model example failed, deleting artifacts")
                do_delete = yes_no_input(
                    "Do you want to delete the model artifacts? [Y/n]",
                    default_answer="Y",
                )
                if do_delete:
                    md = ModelFullDeleter(overwrite=False)
                    md.delete(model_id)
                return FetchResult(
                    fetch_success=False,
                    reason="Could not successfully run a standard example from the model.",
                )
            else:
                self.logger.debug("Writing model source to file")
                model_source_file = os.path.join(
                    self._model_path(model_id), MODEL_SOURCE_FILE
                )
                try:
                    os.makedirs(self._model_path(model_id), exist_ok=True)
                except OSError as error:
                    self.logger.error(f"Error during folder creation: {error}")
                with open(model_source_file, "w") as f:
                    f.write(self.model_source)
                return FetchResult(
                    fetch_success=True, reason="Model fetched successfully"
                )
        else:
            return fr

I think we should encapsulate all BentoML-related sub-process calls with a general BentoMLError, catch that in downstream code, and then do something similar to the above, deleting the model artifacts if the fetch fails for whatever reason.
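
For example, a minimal sketch of what that wrapper could look like (BentoMLError and run_bentoml_command are illustrative names for this sketch, not existing ersilia code):

# Sketch only: funnel every BentoML CLI call through one helper so that any
# failure surfaces as a single exception type the fetch code can catch.
import subprocess

class BentoMLError(RuntimeError):
    """Raised when any BentoML-related subprocess call fails."""

def run_bentoml_command(args):
    """Run a bentoml CLI command and raise BentoMLError on any failure."""
    try:
        result = subprocess.run(
            ["bentoml"] + list(args),
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError) as exc:
        raise BentoMLError(
            f"BentoML command failed: bentoml {' '.join(args)}"
        ) from exc
    return result.stdout

Downstream code could then catch BentoMLError in one place and run the same ModelFullDeleter cleanup shown in the snippet above.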

@Abellegese
Contributor

Yes, exactly @DhanshreeA @GemmaTuron. A quick fix is to do the following:

pip uninstall bentoml
# then just call this bentoml command
bentoml --version

@DhanshreeA
Member

@OlawumiSalaam this might be interesting, and definitely more of a deep dive than your current task. Please take a look when you can.

@DhanshreeA DhanshreeA moved this from On Hold to In Progress in Ersilia Model Hub Jan 27, 2025
@OlawumiSalaam
Contributor

@DhanshreeA

This is my understanding of the issue and the steps I plan to take to address it:

The issue occurs when the Ersilia CLI fails to fetch a model but still adds it to the catalog, leaving users unaware of the failed fetch. To resolve this, I will implement the following improvements:

  1. Error Handling: Wrap all BentoML-related subprocess calls with a generalized exception (BentoMLError) to ensure errors are caught and properly handled downstream.

  2. Artifact Cleanup: Implement logic to delete any model artifacts associated with a failed fetch, similar to how artifacts are removed for StandardModelExampleError.

  3. User Feedback: Provide clear feedback to users when a model fetch or standard example generation fails.

I will proceed by modifying the fetch process to include these changes and adding appropriate logging to capture and communicate failure details; a rough sketch of the intended flow is shown below.
Please let me know if I am missing anything. Thanks!
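
A rough sketch of that flow, reusing ModelFullDeleter from the snippet above; BentoMLError and fetch_model are assumed names used here only for illustration:

# Illustrative only: catch the generalized BentoML error during fetch,
# clean up partial artifacts, and tell the user what happened.
import logging

logger = logging.getLogger("ersilia.fetch")

def fetch_with_cleanup(model_id):
    try:
        fetch_model(model_id)  # assumed fetch entry point; may raise BentoMLError
    except BentoMLError as err:
        logger.error("Fetch failed for %s: %s", model_id, err)
        md = ModelFullDeleter(overwrite=False)  # same deleter used for StandardModelExampleError
        md.delete(model_id)
        print(f"Model {model_id} could not be fetched; partial artifacts were removed.")
        return False
    return True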

@DhanshreeA
Member

This looks good @OlawumiSalaam thanks!

@OlawumiSalaam
Contributor

@DhanshreeA
Here is an update on my work on the bug where Ersilia fetch/serve fails yet the model is still listed in the catalog.

What I Have Done:

  1. Error encapsulation: I encapsulated all BentoML-related sub-process calls with a general BentoMLError to standardize error handling for these operations. This ensures that any issue encountered during fetch/serve is caught and can be handled downstream.

  2. Failure handling and cleanup: I introduced logic to catch the BentoMLError and trigger a cleanup process for any partially created model artifacts. This ensures that a failed fetch, whatever the reason, does not leave invalid artifacts in the system. We now have a more robust way to handle unexpected failures during BentoML-related operations, which should prevent similar issues in the future.

Next Step:
I am preparing to open a PR to submit these changes for review. The PR will include:

  1. Encapsulation of BentoML-related sub-process calls.
  2. Cleanup logic for failed fetch/serve operations.

@OlawumiSalaam OlawumiSalaam linked a pull request Jan 29, 2025 that will close this issue