Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating the test data for tools/cellpose #1495

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

SaimMomin12
Copy link
Contributor

This pull request updates the test-data for cellpose tool for an existing PR #1494 and fixes #1494

@SaimMomin12
Copy link
Contributor Author

For the tests here, should we keep the delta_frac high?

@bgruening
Copy link
Owner

Lets ask @kostrykin - my guess is we should switch to asserts. This would hopefully in less work in the next bot PR.

@kostrykin
Copy link
Contributor

Yes, using delta_frac to verify the result image is not reliable.

I'd suggest to use the img_diff comparison method https://docs.galaxyproject.org/en/master/dev/schema.html#tool-tests-test-output, because there are composite images involved (i.e. figures with multiple images). The assertions are not suitable for that kind of images.

For the other images, using assertions should be possible, but since you have the expected data generated already, I think using img_diff is more reliable and easier.

@SaimMomin12
Copy link
Contributor Author

@kostrykin Thanks for the pointers. I will replace the tests with image_diff comparison and check if the tests pass.

@SaimMomin12
Copy link
Contributor Author

Using image_diff did not help in this case. We might have to think about some other alternative.

@kostrykin
Copy link
Contributor

kostrykin commented Sep 3, 2024

As I see from the test results https://github.com/bgruening/galaxytools/actions/runs/10681249964?pr=1495, the issue is that the difference between the images generated by Cellpose on the one hand, and the expected images on the other hand, is simply too large. This is not an issue of testing, but of Cellpose generating unexpected results. The question is, why is that happening?

I'm trying to inspect the generated data, but it gets deleted even though I've used --no_cleanup for planemo test. Any idea how to get those?

@kostrykin
Copy link
Contributor

Ok I managed to generate the results using --update_test_data… kind of a workaround, but anyways…

The issue is that the generated results indeed are very different than the expected results.

Here is the expected image for img02_cp_masks_cyto.tif:
image

And here is what Cellpose generated for img02_cp_masks_cyto.tif:
image

@SaimMomin12
Copy link
Contributor Author

@kostrykin this behaviour is quite strange. Are you using --biocontainers in your planemo test command?

@kostrykin
Copy link
Contributor

@kostrykin this behaviour is quite strange. Are you using --biocontainers in your planemo test command?

It takes forever to run with --biocontainers like several hours, so I omitted it at first.

Interestingly, Test 1 fails when running with --biocontainers and --update_test_data:

Problem:
Expecting value: line 1 column 1 (char 0)
Command line:
export CELLPOSE_LOCAL_MODELS_PATH='cellpose_models' && mkdir -p segmentation && ln -s '/tmp/tmpgzgkjdjl/files/a/e/b/dataset_aebc8958-c288-417a-b3e7-b8b78370474d.dat' ./image.png && python '/home/void/Documents/SaimMomin12-galaxytools/tools/cellpose/cp_segmentation.py' --inputs '/tmp/tmpgzgkjdjl/job_working_directory/000/2/configs/tmprb58hfv0' --img_path ./image.png --img_format 'png' --output_dir ./segmentation

The output of Cellpose for img02_cp_masks_cyto.tif is also different:

image

So we have 3 different outputs, one for the expected data, one for without --biocontainers, and one another for with --biocontainers. I have absolutely no idea what is going on here. How did you generate the output for the expected data?

@SaimMomin12
Copy link
Contributor Author

@kostrykin Ideally the tools needs to be tested with --biocontainers as this simulates the exact behaviour of Planemo run on GitHub.

I updated the test data with the following command
planemo t cellpose.xml --galaxy_root /home/momins/galaxy/ --biocontainers --update_test_data

@kostrykin
Copy link
Contributor

@kostrykin Ideally the tools needs to be tested with --biocontainers as this simulates the exact behaviour of Planemo run on GitHub.

I updated the test data with the following command planemo t cellpose.xml --galaxy_root /home/momins/galaxy/ --biocontainers --update_test_data

I did planemo test --update_test_data --biocontainers.

Test 1 fail to run with the above mentioned error. I attached the other generated results. test-data.tgz

Do you obtain the same results when you re-generate them?

@SaimMomin12
Copy link
Contributor Author

The *.tif files are similar, but the generated PNGs are different from yours.

@kostrykin
Copy link
Contributor

I just tested the base branch on my computer.

Using --biocontainers still is horribly slow, but faster than with the new version in this PR (like just 1-2 hours instead of 2-3). Moreover, every test passes when using --update_test_data.

For the img02_cp_masks_cyto.tif file, I again get different results than the expected image in the test-data directory. But, the file has the same file size! Since the tests used file size comparison before, I suspect the the issue that we are facing here was already present in the base branch version.

In consequence, since the base branch version is faster, does not exhibit failing tests, and to better understand what is going on, I suggest the following strategy:

  1. First migrate the tests of the base branch to img_diff. In the course of this, we should investigate whether we get persistent results in CI and locally. Fix everything that needs to be fixed.
  2. Then re-do this PR against the fixed base branch.

@SaimMomin12
Copy link
Contributor Author

@kostrykin I will first migrate the changes of image_diff to the base branch and see if all tests pass and then compare the results of CI and locally. For testing this, we update the tool version or retain the same version?

@kostrykin
Copy link
Contributor

@kostrykin I will first migrate the changes of image_diff to the base branch and see if all tests pass and then compare the results of CI and locally. For testing this, we update the tool version or retain the same version?

I think we keep the TOOL_VERSION. Possibly it would make sense to increment the VERSION_SUFFIX.

@SaimMomin12 SaimMomin12 marked this pull request as draft September 17, 2024 08:34
@SaimMomin12
Copy link
Contributor Author

@kostrykin I was testing the base branch of cellpose and migrated the image_diff changes to it in my last commit. The tests fail here as well due to different images being generated during tests in CI.

You are correct that all the tests of the tool pass when --update_test_data is used locally, but they do fail here in CI. I am not so sure know why is this happening. Did you find anything similar?

@kostrykin
Copy link
Contributor

Nope, unfortunately, I haven't seen similar issues before. Basically there are two options, I guess: Either a dependency of the Cellpose package is resolved to different versions and this induces the diverging behavior. Or it is some internal switch of Cellpose, that causes the divergence. Are the results consistent when running locally with CPU vs. GPU utilization?

Maybe those who implemented/updated the Cellpose wrapper have an idea? @sunyi000 @pavanvidem

@bgruening
Copy link
Owner

Here are the updated test-data with docker: #1505

@sunyi000
Copy link
Contributor

Nope, unfortunately, I haven't seen similar issues before. Basically there are two options, I guess: Either a dependency of the Cellpose package is resolved to different versions and this induces the diverging behavior. Or it is some internal switch of Cellpose, that causes the divergence. Are the results consistent when running locally with CPU vs. GPU utilization?

Maybe those who implemented/updated the Cellpose wrapper have an idea? @sunyi000 @pavanvidem

sorry i have no idea why it's producing different results.
shall we try not using the py script anymore? the python script was developed for cellpose2 long time ago..
I could give it a try. @kostrykin @SaimMomin12

@pavanvidem
Copy link
Collaborator

As we are now using conda requirements, differences maybe due to some dependency. Let me test this.

@bgruening
Copy link
Owner

What should we do here? Do we assume cellpose knows what they are doing and accept that the outputs have changed so dramatically?

We all use biocontainers, hopefully, we should not use conda for testing, locally or remote only biocontainers should be used.

@SaimMomin12
Copy link
Contributor Author

What should we do here? Do we assume cellpose knows what they are doing and accept that the outputs have changed so dramatically?

We all use biocontainers, hopefully, we should not use conda for testing, locally or remote only biocontainers should be used.

ping @kostrykin

@kostrykin
Copy link
Contributor

What should we do here? Do we assume cellpose knows what they are doing and accept that the outputs have changed so dramatically?

@bgruening If I understood this comment correctly, then the problem here is that the Cellpose wrapper for one and the same version of Cellpose produces very different results. This makes me concerned regarding reproducibility. If the wrapper now yields different results when being run locally vs. in CI, it might as well yield inconsistent results in the future when run in Galaxy, as long as we do not understand why the inconsistency happens.

@SaimMomin12 You reported the diverging results when (i) using planemo test --update_test_data in comparison to (ii) the results obtained by the CI. Have you been using --biocontainers for (i)? If not, does this issue still persist when using --biocontainers, or are the results consistent with those obtained by the CI then?

IMO, if and only if the answer to this last question is "yes", we can ignore the inconsistency.

@kostrykin
Copy link
Contributor

As we are now using conda requirements, differences maybe due to some dependency. Let me test this.

@pavanvidem Did you find out something?

@bgruening
Copy link
Owner

@bgruening If I understood #1495 (comment) comment correctly, then the problem here is that the Cellpose wrapper for one and the same version of Cellpose produces very different results. This makes me concerned regarding reproducibility. If the wrapper now yields different results when being run locally vs. in CI, it might as well yield inconsistent results in the future when run in Galaxy, as long as we do not understand why the inconsistency happens.

I'm not sure we should assume that. The new tests are stricter and we don't know for sure how it was with the old version + old tests, correct?

To make it easier for all of us, we should always use --biocontainers. There is just one problem, also the --biocontainer is created on the fly and depending on which time it was created can have different dependencies. There are two workarounds:

  1. add more requirements to the tool
  2. create a container for this PR already in https://github.com/BioContainers/multi-package-containers/ then this PR will use this rebuild image and we are using all the same image (in contrast to using a on-the-fly build one)

@sunyi000
Copy link
Contributor

sunyi000 commented Nov 12, 2024

I could try rework on this tool.. I was trying to use this in one of my testing workflow, it didn't work.
Since the python code was developed with cellpose2, I was thinking of removing it and re-do the tool with pure cellpose3 headless options

@pavanvidem
Copy link
Collaborator

After trying out many possible combinations of dependencies, I found out that the discrepancies are most likely coming from opencv library. If we use pip installation of opencv-python-headless, the results are consistent. The container-based tool v3.0.8 produces the same image every time. The conda-based tool which uses opencv conda package produces a slightly different result for every run.

Surprisingly, the results from v3.0.9 and v3.0.10 containers are also not reproducible. I suspect they are also using opencv conda package but I couldn't locate the Dockerfile. Can someone please point me to the Dockerfile?

@tischi
Copy link

tischi commented Dec 17, 2024

Hi, for deploying it in Galaxy, does it have to reproducible in terms of rebuilding the container image, or would be it possible to build the image once, and then never touch it again (assuming that at least the results when running from the same image are reproducible)?

@bgruening
Copy link
Owner

@tischi that would be possible. But how do you decide which container is "correct"?

@tischi
Copy link

tischi commented Dec 17, 2024

Same result as the CellPose in BAND and BARD?
We could also ask Ruman to run the same image in JIPipe.

@kostrykin
Copy link
Contributor

assuming that at least the results when running from the same image are reproducible

As far as I remember from talking to @pavanvidem, the issue is that the results aren't even reproducible when using the same image between the runs (if opencv is installed via Conda, but it works if it is installed via pip). So the assumption is unfortunately wrong. What we need to do here is to build a container for Cellpose where opencv is installed via pip.

@bgruening
Copy link
Owner

... or find out which compile flag is used for the pip binaries that is not used for conda and produces randomness ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants