
Descriptors do not parse Nones to h5 #7

Open
GemmaTuron opened this issue Jan 23, 2025 · 21 comments

@GemmaTuron
Member

I am running validations of the Zaira pipeline with the purpose of comparing several descriptors. I have a fairly large list of model descriptors set up in vars.py, and those are correctly picked up by ZairaChem in parameters.json:

    "ersilia_hub": [
        "eos5axz",
        "eos78ao",
        "eos4u6p",
        "eos3cf4",
        "eos2gw4",
        "eos39co",
        "eos4avb",
        "eos4djh",
        "eos8aa5",
        "eos5guo",
        "eos8a4x",
        "eos3ae6"
    ],

and the reference descriptor eos7w6n, 13 descriptors in total. I have set up an automated run across TDC benchmarks with 5-fold validations. On the first dataset tested (DILI), the descriptor eos3ae6 was not calculated (done_eos.json):

[
    "eos5guo",
    "eos3cf4",
    "eos4u6p",
    "eos39co",
    "eos4avb",
    "eos5axz",
    "eos78ao",
    "eos4djh",
    "eos8a4x",
    "eos2gw4",
    "eos8aa5",
    "eos7w6n"
]

On the second dataset (AMES, currently running) we have lost more descriptors: eos4u6p, eos3ae6, eos8a4x. See the done_eos.json:

[
    "eos39co",
    "eos5guo",
    "eos4avb",
    "eos4djh",
    "eos78ao",
    "eos2gw4",
    "eos3cf4",
    "eos5axz",
    "eos8aa5",
    "eos7w6n"
]

The problem is that, in the logs, they appear to have been calculated without problems. I have attached the different logs below as examples:

  • eos39co: it works, I'll use this as a reference to compare the logs against. eos39co_run.txt
  • eos3ae6: it does run, but shows many more Status code: 200 lines than other models. Also, at the end it does not have the same model closing notices. eos3ae6_run.txt
  • eos4u6p: same as eos3ae6. eos4u6p_run.txt

It seems to me the models are running into errors that are not reported in the logs. @DhanshreeA, do you pick up what could be happening from the information I am sharing? Could it also potentially be a memory issue?
I tried the models individually through the CLI (not the Python API) before I set the pipelines to run, and they did work.

@DhanshreeA
Member

Hey @GemmaTuron I reviewed this and indeed there is nothing useful in the logs unfortunately. It could be a memory issue. Since we've both tried the models through CLI, I'll try to use the Python API for them, but in general how would you suggest I try reproducing this? I can let this run on a Linux machine with the data you are using and see if the same behavior is repeated.

@GemmaTuron
Member Author

Mmm, is there a way to get more verbosity from the Python API? Then we can maybe better see what is going on.
I tried the models with the Python API and they worked (with less data).

@GemmaTuron
Member Author

Hi @DhanshreeA and @Abellegese

I have little time next week to look at this in detail. If either of you has some spare time and can think about it or elucidate the cause of the failure, that would be super, but no pressure. To give a few more pointers:

  • I am running 5-fold validations on the train/test datasets that you can find in this folder. As you will see, some datasets are very large and others more manageable. As I reported above, for the DILI dataset (380 molecules) all descriptors except eos3ae6 worked, whereas for AMES (5822 molecules) fewer descriptors worked, which might indicate a memory issue.
  • I am running this in our workstation, which has quite a bit of memory so I am surprised it runs out
  • I automatically delete the descriptors and model checkpoints after each fold, so the system does not get more cluttered with successive runs
  • I use the Python API as explained in this step. To reproduce it you can simply take the same code. I have not yet checked them with verbosity
  • The entire descriptor list I'd like to make work is:
[
    "eos5axz",
    "eos78ao",
    "eos4u6p",
    "eos3cf4",
    "eos2gw4",
    "eos39co",
    "eos4avb",
    "eos4djh",
    "eos8aa5",
    "eos5guo",
    "eos8a4x",
    "eos7w6n",
    "eos3ae6",
]

Okay, I hope this is helpful.
I'd say:
Set a manual run of any of the failing descriptors, with verbosity, on several of the datasets, and see what is happening. Sorry I cannot do it before my trip to SF.

@Abellegese

Okay @GemmaTuron

@GemmaTuron
Member Author

Hello @DhanshreeA

From the above there is one model that certainly does not work, eos3ae6. Did you check when pushing the new Docker image that everything was correct? The output file is generated but this is what it shows:

key,input,outcome
UNPROCESSABLE_INPUT,UNPROCESSABLE_INPUT,"[None, None, None, None, None, None, None, None, None, None, None, None...]

The log error is this one:

eos3ae6_error.txt

@GemmaTuron
Member Author


LOL, I have found I was not adding the ".csv" extension to the input file name, and Ersilia was not telling me; it ran the model anyway and produced this weird output. This needs to be more informative.

But it brings me back to the starting point of why models are failing: if I run them manually, they are able to produce outputs for all the inputs I pass to them, so it does not seem to be a memory error. Do you have any suggestions on how to debug this? For the moment I will set the first two datasets (DILI and AMES) to run in verbose mode, to see if exactly the same thing happens and whether verbose mode gives some more info.
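A minimal guard of the kind suggested here (hypothetical helper, not Ersilia's actual validation logic) could fail loudly on an unrecognized input file instead of silently producing UNPROCESSABLE_INPUT rows:

```python
import os

# Hypothetical sketch: reject input paths with an unrecognized extension
# up front, rather than running the model on unprocessable input.
def check_input_extension(path, allowed=(".csv", ".tsv", ".json")):
    ext = os.path.splitext(path)[1].lower()
    if ext not in allowed:
        raise ValueError(f"Unrecognized input extension {ext!r} for {path}")
    return path

check_input_extension("molecules.csv")   # passes
# check_input_extension("molecules")     # would raise ValueError
```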

@GemmaTuron
Member Author

Hi @DhanshreeA and @Abellegese
I confirm this is reproducible (i.e. the exact same descriptors fail for AMES while working for the DILI dataset).
Verbose mode does not give more info. The only thing I notice is that the "closing" messages after the molecules have been calculated do not show up for the models that fail:

Failing model (eos3ae6): the session only gets closed once the next model (eos4u6p) is already initialized:

18:42:35 | DEBUG    | Schema available in /home/gturon/eos/dest/eos3ae6/api_schema.json
18:42:44 | DEBUG    | Status code: 200
18:42:44 | DEBUG    | Schema available in /home/gturon/eos/dest/eos3ae6/api_schema.json
18:42:51 | DEBUG    | Status code: 200
...
21:20:10 | DEBUG    | Status code: 200
21:20:19 | DEBUG    | Status code: 200
21:20:19 | DEBUG    | Done with unique posting
21:20:19 | DEBUG    | Is fetched: True
21:20:19 | DEBUG    | Schema available in /home/gturon/eos/dest/eos4u6p/api_schema.json
21:20:19 | DEBUG    | Setting BentoML AutoService for eos4u6p
21:20:19 | INFO     | Service class provided
21:20:19 | DEBUG    | Using port 49593
21:20:19 | DEBUG    | Starting Docker Daemon service
21:20:19 | DEBUG    | Creating container tmp logs folder /home/gturon/eos/sessions/session_86251/_logs/tmp and mounting as volume in container
21:20:19 | DEBUG    | Image ersiliaos/eos4u6p:latest is available locally
21:20:19 | DEBUG    | Using port 40717
21:20:19 | DEBUG    | Starting Docker Daemon service
21:20:19 | DEBUG    | Creating container tmp logs folder /home/gturon/eos/sessions/session_86251/_logs/tmp and mounting as volume in container
21:20:19 | INFO     | Done with initialization!
21:20:19 | DEBUG    | Checking rdkit and other requirements
21:20:19 | DEBUG    | Cleaning temp dir
21:20:19 | DEBUG    | Flushing temporary directory /tmp/ersilia-m23jzfaf
21:20:19 | DEBUG    | Flushing temporary directory /tmp/ersilia-5as4fziv
21:20:19 | DEBUG    | Flushing temporary directory /tmp/ersilia-3p3jifnn
21:20:19 | DEBUG    | Flushing temporary directory /tmp/ersilia-u4i9g7fj
21:20:19 | DEBUG    | Flushing temporary directory /tmp/ersilia-9fymel2c
21:20:19 | DEBUG    | Silencing docker containers if necessary
21:20:19 | DEBUG    | It is not inside docker
21:20:19 | DEBUG    | Stopping and removing container
21:20:19 | DEBUG    | Stopping all containers related to model eos4u6p
21:20:19 | DEBUG    | Closing session /home/gturon/eos/sessions/session_86251/session.json
21:20:19 | DEBUG    | Opening session /home/gturon/eos/sessions/session_86251/session.json
21:20:19 | DEBUG    | Cleaning processes before serving
21:20:19 | DEBUG    | /home/gturon/eos/sessions/session_86251/eos3ae6.pid

Working model (eos2gw4): the session seems to close properly:

...
23:48:34 | DEBUG    | Status code: 200
23:48:34 | DEBUG    | Done with unique posting
23:48:37 | DEBUG    | Cleaning temp dir
23:48:37 | DEBUG    | Flushing temporary directory /tmp/ersilia-fpkwwyst
23:48:37 | DEBUG    | Flushing temporary directory /tmp/ersilia-ircnl18s
23:48:37 | DEBUG    | Flushing temporary directory /tmp/ersilia-g9610_t3
23:48:37 | DEBUG    | Flushing temporary directory /tmp/ersilia-0v8hfiez
23:48:37 | DEBUG    | Silencing docker containers if necessary
23:48:37 | DEBUG    | It is not inside docker
23:48:37 | DEBUG    | Stopping and removing container
23:48:37 | DEBUG    | Stopping all containers related to model eos2gw4
23:48:37 | DEBUG    | Stopping and removing container eos2gw4_5d5a
23:48:37 | DEBUG    | Deleted temp file ersilia-84m9frfm from container eos2gw4_5d5a
23:48:37 | DEBUG    | Deleted temp file  from container eos2gw4_5d5a
23:48:47 | DEBUG    | Container stopped
23:48:48 | DEBUG    | Container removed
23:48:48 | DEBUG    | Closing session /home/gturon/eos/sessions/session_86251/session.json
23:48:48 | DEBUG    | Is fetched: True
23:48:48 | DEBUG    | Schema available in /home/gturon/eos/dest/eos7w6n/api_schema.json
...

@GemmaTuron
Member Author

Update:

  • eos3ae6 does not serialise to H5: Model cannot parse to H5 eos3ae6#14
  • eos4u6p works with DILI (300 mols), but when using AMES (3000 mols) it gives a very similar error to eos3ae6. Why?
Traceback (most recent call last):
  File "/home/gturon/github/ersilia-os/zairachem-docker/02_describe/development.py", line 33, in <module>
    m.run("../../zaira-chem-docker-tdc/data/AMES_train.csv", "test_ames.h5")
  File "/home/gturon/github/ersilia-os/zairachem-docker/02_describe/development.py", line 23, in run
    self.model.run(input=input_csv, output=output_h5)
  File "/home/gturon/miniconda3/envs/zairadescribe/lib/python3.12/site-packages/ersilia/core/model.py", line 800, in run
    result = self._run(
             ^^^^^^^^^^
  File "/home/gturon/miniconda3/envs/zairadescribe/lib/python3.12/site-packages/ersilia/core/model.py", line 727, in _run
    result = self.api(
             ^^^^^^^^^
  File "/home/gturon/miniconda3/envs/zairadescribe/lib/python3.12/site-packages/ersilia/core/model.py", line 596, in api
    return self.api_task(
           ^^^^^^^^^^^^^^
  File "/home/gturon/miniconda3/envs/zairadescribe/lib/python3.12/site-packages/ersilia/core/model.py", line 633, in api_task
    for r in result:
             ^^^^^^
  File "/home/gturon/miniconda3/envs/zairadescribe/lib/python3.12/site-packages/ersilia/core/model.py", line 365, in _api_runner_iter
    for result in api.post(input=input, output=output, batch_size=batch_size):
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gturon/miniconda3/envs/zairadescribe/lib/python3.12/site-packages/ersilia/serve/api.py", line 188, in post
    self.output_adapter.adapt(
  File "/home/gturon/miniconda3/envs/zairadescribe/lib/python3.12/site-packages/ersilia/io/output.py", line 855, in adapt
    adapted_result = self._adapt_when_fastapi_was_used(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gturon/miniconda3/envs/zairadescribe/lib/python3.12/site-packages/ersilia/io/output.py", line 824, in _adapt_when_fastapi_was_used
    df.write(output, delimiter=delimiters[extension])
  File "/home/gturon/miniconda3/envs/zairadescribe/lib/python3.12/site-packages/ersilia/io/output.py", line 162, in write
    self.write_hdf5(file_name)
  File "/home/gturon/miniconda3/envs/zairadescribe/lib/python3.12/site-packages/ersilia/io/output.py", line 123, in write_hdf5
    hdf5 = Hdf5Data(
           ^^^^^^^^^
  File "/home/gturon/miniconda3/envs/zairadescribe/lib/python3.12/site-packages/ersilia/utils/hdf5.py", line 22, in __init__
    self.values = np.array(values, dtype=np.float32)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5821,) + inhomogeneous part.
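The ValueError above can be reproduced in isolation. This is a sketch under the assumption that failed molecules arrive as a single cell holding a list of Nones, while successful ones arrive as flat float vectors:

```python
import numpy as np

# Reproduction sketch (assumption: this mirrors what Hdf5Data receives).
# Most molecules yield fixed-length float vectors, but a failed molecule
# comes back as one cell containing a [None, None, ...] list, so the rows
# differ in shape and np.array cannot build a homogeneous 2-D array.
good_row = [0.1, 0.2, 0.3]
bad_row = [[None, None, None]]  # list of Nones stuffed into a single cell

try:
    np.array([good_row, bad_row], dtype=np.float32)
except ValueError as e:
    print("ValueError:", e)  # inhomogeneous shape, as in the traceback

# Once every row has the same length and None is a scalar per column,
# the conversion succeeds (None becomes nan under float32):
fixed = np.array([good_row, [None, None, None]], dtype=np.float32)
print(fixed.shape)  # (2, 3)
```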

@GemmaTuron
Member Author

GemmaTuron commented Feb 4, 2025

I have identified what the problem is. I will explain using eos3ae6 as an example.

When a molecule gives "None" as a result, instead of being properly parsed into the .csv format (where under each column we would have a None), we get the first column with a list of Nones and the rest of the columns empty:

key	input	R_0	R_1	R_2	R_3	R_4	R_5	R_6	R_7	...	IR_1	IR_2	IR_3	IR_4	IR_5	IR_6	IR_7	IR_8	IR_9	IR_10
68	NLXLAEXVIDQMFP-UHFFFAOYSA-N	[Cl-].[NH4+]	[None, None, None, None, None, None, None, Non...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
70	HBBGRARXTFLTSG-UHFFFAOYSA-N	[Li+]	[None, None, None, None, None, None, None, Non...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
74	FAPWRFPIFSIZLT-UHFFFAOYSA-M	[Cl-].[Na+]	[None, None, None, None, None, None, None, Non...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

What we need is:

key	input	R_0	R_1	R_2	R_3	R_4	R_5	R_6	R_7	...	IR_1	IR_2	IR_3	IR_4	IR_5	IR_6	IR_7	IR_8	IR_9	IR_10
0	WIIZWVCIJKGZOK-UHFFFAOYSA-N	O=C(NC(CO)C(O)c1ccc([N+](=O)[O-])cc1)C(Cl)Cl	-1.894	-1.516	-1.359	-0.991	-0.582	-0.371	0.394	0.560	...	-0.194	-0.157	-0.111	-0.102	-0.068	0.035	0.066	0.120	0.137	0.262
1	YGSDEFSMJLZEOE-UHFFFAOYSA-N	O=C(O)c1ccccc1O	-2.337	-2.057	-1.918	-1.511	-1.286	-1.194	-1.115	-0.638	...	-0.370	-0.160	-0.106	-0.093	-0.084	-0.071	0.115	0.584	0.887	0.890
2	SNPPWIUOZRMYNY-UHFFFAOYSA-N	CC(NC(C)(C)C)C(=O)c1cccc(Cl)c1	None	None	None	None	None	None	None	None	...	None	None	None	None	None	None	None	None	None	None

I am 99% sure this is due to the FastAPI packaging, because this did not happen before and it is happening across all models. @DhanshreeA can you take care of fixing this?
@Abellegese do you think this is something we could catch in the tests? That is, the behaviour of the model when a molecule is not processed. The difficult part here is that we cannot know in advance which molecules will not produce predictions for any specific model.

For completeness, I have found the same issue in eos4u6p and eos8a4x. None of these models appears to be using FastAPI packaging according to their GitHub repos, but probably @DhanshreeA has pushed them manually and has yet to update the repos. I think I have done enough testing to ensure this IS the issue. How the Nones are handled needs to be fixed Ersilia-wide.

To facilitate the screening, three molecules that fail in eos3ae6 are:

[Cl-].[NH4+] NLXLAEXVIDQMFP-UHFFFAOYSA-N
[Li+] HBBGRARXTFLTSG-UHFFFAOYSA-N
[Cl-].[Na+] FAPWRFPIFSIZLT-UHFFFAOYSA-M

A molecule that fails in eos4u6p:
[O-]Cl+2O LSUVGSGHYCLVPU-UHFFFAOYSA-N

Two molecules that fail in eos8a4x:
OCPH(CO)CO PQJIXFVXQRCTKI-UHFFFAOYSA-N
OCP(Cl)(CO)(CO)CO CKNGMMSBYBSTLC-UHFFFAOYSA-N
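The desired parsing shown in the second table above can be sketched as a small normalization step (hypothetical helper, not Ersilia code): a failed row whose first cell holds a list of Nones is expanded into one scalar None per output column.

```python
# Hypothetical normalization sketch: expand a failed row whose first cell
# holds [None, None, ...] into one scalar None per expected output column.
def normalize_row(row, n_outputs):
    # row: the list of output cells for one molecule
    if len(row) == 1 and isinstance(row[0], list):
        return [None] * n_outputs
    return row

print(normalize_row([[None, None, None]], 3))     # [None, None, None]
print(normalize_row([-1.894, -1.516, -1.359], 3))  # unchanged
```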

@Abellegese

Okay, this is interesting @GemmaTuron. I think I now have a clear picture of what is going on. This is not caught by our model output content checker because of the current implementation; this is the problem I need to fix in the test command:

  1. The model output content checker runs after the run.sh checks, so we cannot even see the results of the content validation check to get some idea of what the problem might be. The run.sh step fails due to the errors you mentioned (or others), and the subprocess did not give a clear message mainly because of this.

But I will try to fix this in the easiest way possible.

@GemmaTuron GemmaTuron changed the title Probably memory issue for descriptor calculation Descriptors do not parse Nones to h5 Feb 5, 2025
@GemmaTuron
Member Author

Also, more information on this from @Abellegese: this only happens in Docker models, not when you fetch from GitHub.

@GemmaTuron
Member Author

Okay @Abellegese

Some more information on this error. I have found two types of behaviour:

  1. The None results are still None but are parsed correctly when the model is fetched from_github, though not when fetched from_dockerhub

Example: model eos3ae6
Molecules:

[Cl-].[NH4+] NLXLAEXVIDQMFP-UHFFFAOYSA-N
[Li+] HBBGRARXTFLTSG-UHFFFAOYSA-N
[Cl-].[Na+] FAPWRFPIFSIZLT-UHFFFAOYSA-M

I cannot try this with eos4u6p as it does not fetch from GitHub.

  2. The molecules that give None when run from a Docker-fetched model work with a GitHub-fetched model:

Example: eos8a4x
Molecules:

OC[PH](CO)(CO)CO, PQJIXFVXQRCTKI-UHFFFAOYSA-N
OCP(Cl)(CO)(CO)CO, CKNGMMSBYBSTLC-UHFFFAOYSA-N

@Abellegese

Yes, exactly, this is what I found as well. Thanks @GemmaTuron.

  1. Some changes have been made to the Docker source code without syncing them to GitHub.
  2. One such file could be api_schema.json, which is used to map the output.
  3. This happens when the API itself returns status code 500 (internal server error), which indicates the API crashes for those specific input requests.
  4. In all of the above cases, we need checks that either fix the problem or report it clearly in the Ersilia CLI.

I will work on that. More info is also appreciated.

@GemmaTuron
Copy link
Member Author

I am sorry I cannot provide more info as I have not found more molecules that are correct but fail to be described at the individual model level. For eos4u6p I can re-try if we solve the github fetch :)

@Abellegese

Hi @GemmaTuron, the first reason why the h5 conversion fails, for instance for eos3ae6, is given below. The logs are extracted from the model's running container.

bash /root/bundles/eos3ae6/20250116-5251bd9f-927d-454f-acdc-5d4bb17a2c0f/model/framework/run.sh /root/bundles/eos3ae6/20250116-5251bd9f-927d-454f-acdc-5d4bb17a2c0f/model/framework /tmp/ersilia-4c1em5lh/input-2ee8617a-bc66-4e88-8e54-725203863869.csv /tmp/ersilia-4c1em5lh/output-2ee8617a-bc66-4e88-8e54-725203863869.csv
INFO:     172.17.0.1:38364 - "POST /run HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/root/bundles/eos3ae6/20250116-5251bd9f-927d-454f-acdc-5d4bb17a2c0f/app/main.py", line 219, in run
    response = orient_to_json(R, header, data, orient, output_type)
  File "/root/bundles/eos3ae6/20250116-5251bd9f-927d-454f-acdc-5d4bb17a2c0f/app/utils.py", line 50, in orient_to_json
    record[columns[j]] = values_serializer([values[i][j]])[0]
  File "/root/bundles/eos3ae6/20250116-5251bd9f-927d-454f-acdc-5d4bb17a2c0f/app/utils.py", line 33, in values_serializer
    return [float(x) for x in values]
  File "/root/bundles/eos3ae6/20250116-5251bd9f-927d-454f-acdc-5d4bb17a2c0f/app/utils.py", line 33, in <listcomp>
    return [float(x) for x in values]
ValueError: could not convert string to float: ''
No computed charges.
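Based on the values_serializer shown in the traceback, which calls float(x) blindly, a more tolerant version (a sketch, not the actual fix) would map unparseable values such as empty strings to None instead of crashing the whole request:

```python
# Sketch of a tolerant serializer (assumption: modeled on the
# values_serializer in the traceback, which does float(x) unconditionally
# and so raises ValueError on the empty string '').
def values_serializer(values):
    out = []
    for x in values:
        try:
            out.append(float(x))
        except (TypeError, ValueError):  # '', None, non-numeric strings
            out.append(None)
    return out

print(values_serializer(["1.5", "", None]))  # [1.5, None, None]
```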

@Abellegese

This results in a status code 500. The GitHub repo for this model was last modified long ago, but when I check the Docker image's recent history, it was updated 3 weeks ago.

(ersilia-py3.12)  ersilia   bug-fixes-h5  docker history ersiliaos/eos3ae6
IMAGE          CREATED        CREATED BY                                      SIZE      COMMENT
cf446ace057e   3 weeks ago    RUN |1 MODEL=eos3ae6 /bin/sh -c apt-get upda…   260MB     buildkit.dockerfile.v0
<missing>      3 weeks ago    RUN |1 MODEL=eos3ae6 /bin/sh -c apt-get upda…   11.4MB    buildkit.dockerfile.v0
<missing>      3 weeks ago    COPY ./eos3ae6 /root/eos3ae6 # buildkit         216kB     buildkit.dockerfile.v0
<missing>      3 weeks ago    WORKDIR /root                                   0B        buildkit.dockerfile.v0
<missing>      3 weeks ago    ENV MODEL=eos3ae6                               0B        buildkit.dockerfile.v0
<missing>      3 weeks ago    ARG MODEL=eos3ae6                               0B        buildkit.dockerfile.v0
<missing>      6 weeks ago    ENTRYPOINT ["sh" "/root/docker-entrypoint.sh…   0B        buildkit.dockerfile.v0
<missing>      6 weeks ago    EXPOSE map[80/tcp:{}]                           0B        buildkit.dockerfile.v0
<missing>      6 weeks ago    RUN /bin/sh -c apt-get clean && apt-get auto…   52MB      buildkit.dockerfile.v0
<missing>      6 weeks ago    COPY . /ersilia-pack # buildkit                 79.2kB    buildkit.dockerfile.v0
<missing>      6 weeks ago    WORKDIR /root                                   0B        buildkit.dockerfile.v0
<missing>      2 months ago   CMD ["python3"]                                 0B        buildkit.dockerfile.v0
<missing>      2 months ago   RUN /bin/sh -c set -eux;  for src in idle3 p…   36B       buildkit.dockerfile.v0
<missing>      2 months ago   RUN /bin/sh -c set -eux;   savedAptMark="$(a…   40.9MB    buildkit.dockerfile.v0
<missing>      2 months ago   ENV PYTHON_SHA256=bfb249609990220491a1b92850…   0B        buildkit.dockerfile.v0
<missing>      2 months ago   ENV PYTHON_VERSION=3.10.16                      0B        buildkit.dockerfile.v0
<missing>      2 months ago   ENV GPG_KEY=A035C8C19219BA821ECEA86B64E628F8…   0B        buildkit.dockerfile.v0
<missing>      2 months ago   RUN /bin/sh -c set -eux;  apt-get update;  a…   2.33MB    buildkit.dockerfile.v0
<missing>      2 months ago   ENV LANG=C.UTF-8                                0B        buildkit.dockerfile.v0
<missing>      2 months ago   ENV PATH=/usr/local/bin:/usr/local/sbin:/usr…   0B        buildkit.dockerfile.v0
<missing>      2 months ago   # debian.sh --arch 'amd64' out/ 'bullseye' '…   80.7MB    debuerreotype 0.15

@Abellegese

It is straightforward to solve the h5 issue, but I first want to figure out the root cause. I will post what source code changes have been made between the GitHub repo and the Docker image.

@Abellegese

Hi @GemmaTuron. As I previously suspected, the api_schema.json file has been changed (the outcome key, to be specific) in the Dockerhub version. Here it is below:

  • Dockerhub api_schema.json
{
    "run": {
        "input": {
            "key": {
                "type": "string"
            },
            "input": {
                "type": "string"
            },
            "text": {
                "type": "string"
            }
        },
        "output": {
            "outcome": {
                "type": "numeric_array",
                "shape": [
                    33
                ],
                "meta": [
                    "R_0",
                    "R_1",
                    "R_2",
                    "R_3",
                    "R_4",
                    "R_5",
                    "R_6",
                    "R_7",
                    "R_8",
                    "R_9",
                    "R_10",
                    "I_0",
                    "I_1",
                    "I_2",
                    "I_3",
                    "I_4",
                    "I_5",
                    "I_6",
                    "I_7",
                    "I_8",
                    "I_9",
                    "I_10",
                    "IR_0",
                    "IR_1",
                    "IR_2",
                    "IR_3",
                    "IR_4",
                    "IR_5",
                    "IR_6",
                    "IR_7",
                    "IR_8",
                    "IR_9",
                    "IR_10"
                ]
            }
        }
    }
}
  • GitHub api_schema.json
{
    "run": {
        "input": {
            "key": {
                "type": "string"
            },
            "input": {
                "type": "string"
            },
            "text": {
                "type": "string"
            }
        },
        "output": {
            "whales": {
                "type": "numeric_array",
                "shape": [
                    33
                ],
                "meta": [
                    "R_0",
                    "R_1",
                    "R_2",
                    "R_3",
                    "R_4",
                    "R_5",
                    "R_6",
                    "R_7",
                    "R_8",
                    "R_9",
                    "R_10",
                    "I_0",
                    "I_1",
                    "I_2",
                    "I_3",
                    "I_4",
                    "I_5",
                    "I_6",
                    "I_7",
                    "I_8",
                    "I_9",
                    "I_10",
                    "IR_0",
                    "IR_1",
                    "IR_2",
                    "IR_3",
                    "IR_4",
                    "IR_5",
                    "IR_6",
                    "IR_7",
                    "IR_8",
                    "IR_9",
                    "IR_10"
                ]
            }
        }
    }
}

@GemmaTuron
Member Author

Okay, that is helpful @Abellegese. I see the change in the output section from "outcome" to "whales", but I do not see why this would affect the parsing of the None results. Or am I missing something here?

This model was probably updated by @DhanshreeA locally but changes were never pushed into the repository.

@Abellegese

Abellegese commented Feb 7, 2025

The issue detected
We encountered critical errors in downstream file conversions (HDF5/CSV) due to inconsistent API output structures:

  1. HDF5 Conversion Failures
    ValueError: inhomogeneous shape occurred when some entries had None values represented as lists/sequences while others used scalar None.

  2. Inconsistent CSV Structures
    Missing/extra columns appeared due to schema mismatches between API responses.

Root Cause
Some API outputs contained:

  • Missing keys (fields present in schema but not in output)
  • Extra keys (fields not defined in schema)
  • None values represented as [None, None] lists instead of scalar None

Solution: Schema-Driven Data Standardization
We implemented a robust solution using api_schema.json to enforce consistent output structures:

# api_schema.json
{
  "run": {
    "output": {
      "outcome": {
        "meta": ["R_0", "R_1", ..., "IR_10"]  # <-- Defines expected keys
      }
    }
  }
}

Implementation Flow

  1. Schema Loading
    Extract expected keys from meta definitions in the schema

  2. Mismatch Detection
    Compare actual outputs against schema requirements:

    def _detect_mismatch(data, expected_keys):
        # Identifies missing/extra keys per entry
        return mismatches
  3. Data Standardization
    Enforce schema compliance by:

    • Removing extra keys, adding missing keys as None, converting [None] lists to scalar None
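A minimal sketch of steps 2 and 3 combined (hypothetical helper names; the actual implementation lives in the bug-fixes-h5 branch):

```python
# Hypothetical sketch of the schema-driven standardization described above:
# enforce the schema's expected keys on each output entry, dropping extra
# keys, adding missing keys as None, and collapsing [None, ...] lists.
def standardize(entry, expected_keys):
    fixed = {}
    for k in expected_keys:
        v = entry.get(k)
        # collapse a list made entirely of Nones to a scalar None
        if isinstance(v, list) and all(x is None for x in v):
            v = None
        fixed[k] = v  # missing keys become None; extra keys are dropped
    return fixed

expected = ["R_0", "R_1", "IR_10"]
raw = {"R_0": [None, None], "R_1": 0.5, "junk": 1}
print(standardize(raw, expected))  # {'R_0': None, 'R_1': 0.5, 'IR_10': None}
```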

Few Advantages

  • Guaranteed Structural Consistency
    All outputs now strictly match schema-defined structures
  • Error Prevention
    Eliminates HDF5 conversion errors by ensuring homogeneous array shapes
  • Reliable CSV/H5/JSON Generation
    Maintains consistent columns across all records
  • Null Handling
    Standardizes None representation across all data types

Closing Remarks
@GemmaTuron and @DhanshreeA, this solution comprehensively addresses all reported issues with null handling and structural mismatches using the API schema, which is more general. Your comments are highly appreciated. I am also looking at other solutions until we create the PR.

If you want to try the solution use this branch: bug-fixes-h5

The mismatch detector is designed to have O(1) time complexity and is very fast when we have large data.

@GemmaTuron
Member Author

Hi @Abellegese

Many thanks this is super helpful. I will try the branch and let you know if it fixes the issue by Monday.
