Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the GATK base image to a newer LTS ubuntu release #8610

Merged
merged 13 commits into from
Dec 12, 2023

Conversation

droazen
Copy link
Contributor

@droazen droazen commented Dec 8, 2023

No description provided.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks ok, I have a question, but the tests are the real validation

ENV CONDA_PATH /opt/miniconda
# Following the instructions from https://docs.conda.io/projects/conda/en/latest/user-guide/install/rpm-debian.html
# Note that this *does* install just miniconda, despite the package being named "conda"
RUN curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor > /tmp/conda.gpg && \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate conda.

apt -y autoclean && \
apt -y autoremove && \
. /opt/conda/etc/profile.d/conda.sh && \
conda -V
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to delete the conda trash that we were removing before?
all the .a and .pyc files it makes when it builds stuff. (including itself)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reverted all these changes

@gatk-bot
Copy link

gatk-bot commented Dec 9, 2023

Github actions tests reported job failures from actions build 7147397849
Failures in the following jobs:

Test Type JDK Job ID Logs
conda 17.0.6+10 7147397849.3 logs

@lbergelson
Copy link
Member

This h5diff error is the same thing I saw when I tried to just bump the version. We must be either getting a new version of it?

@lbergelson
Copy link
Member

Oh, it's a python problem.

 @Test(groups = {"python"}) // python environment is required to use h5diff
    public void testRunSystemCommand() {
        runSystemCommand(String.format("h5diff %s/extract.AS.indel.pos.annot.hdf5 %s/extract.AS.indel.pos.annot.hdf5",
                EXPECTED_TEST_FILES_DIR, EXPECTED_TEST_FILES_DIR));
        runSystemCommand(String.format("diff %s/extract.AS.indel.pos.vcf %s/extract.AS.indel.pos.vcf",
                EXPECTED_TEST_FILES_DIR, EXPECTED_TEST_FILES_DIR));
    }

Of course it is.

@lbergelson
Copy link
Member

  • conda-forge::h5py=2.10.0 # required by keras 2.2.4

@gatk-bot
Copy link

gatk-bot commented Dec 9, 2023

Github actions tests reported job failures from actions build 7153514898
Failures in the following jobs:

Test Type JDK Job ID Logs
conda 17.0.6+10 7153514898.3 logs

@droazen
Copy link
Contributor Author

droazen commented Dec 9, 2023

@lbergelson So, the test failures are in ScoreVariantAnnotationsIntegrationTest and TrainVariantAnnotationsModelIntegrationTest, when run within the conda environment in the new docker image. I've confirmed that they happen with both Ubuntu 22.04 and 20.04. They are assertion failures from h5diff indicating that the actual outputs did not match the expected outputs, so there are likely actual numerical differences here. Most likely the newer Ubuntu releases are shipping with newer versions of native libraries that HDF5 (or some other GATK dependency) depends on, and there was some breaking change somewhere along the line.

I think our only option is to capture the outputs of these tests, and ask @samuelklee to confirm that the differences do not indicate an actual regression.

@gatk-bot
Copy link

gatk-bot commented Dec 9, 2023

Github actions tests reported job failures from actions build 7153790937
Failures in the following jobs:

Test Type JDK Job ID Logs
conda 17.0.6+10 7153790937.3 logs

@samuelklee
Copy link
Contributor

samuelklee commented Dec 10, 2023

@droazen I hacked one of the TrainVariantAnnotationsModelIntegrationTest cases to run in your Docker (only necessary because it seems like gradlew test --tests *TrainVariantAnnotationsModelIntegrationTest doesn't recognize tests that use a DataProvider, but perhaps I did something wrong). Here are the differences:

(gatk) root@a87e0994889e:/repo# h5diff -v /repo/src/test/resources/large/org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/train/expected/extract.nonAS.snpIndel.posUn.train.snpIndel.posOnly.IF.snp.trainingScores.hdf5 /repo/extract.nonAS.snpIndel.posUn.train.snpIndel.posOnly.IF.snp.trainingScores.hdf5

file1     file2
---------------------------------------
    x      x    /              
    x      x    /data          
    x      x    /data/scores   

group  : </> and </>
0 differences found
group  : </data> and </data>
0 differences found
dataset: </data/scores> and </data/scores>
size:           [445]           [445]
position        scores          scores          difference          
------------------------------------------------------------
[ 60 ]          -0.419202       -0.419202       5.55112e-17    
1 differences found

Looks pretty negligible to me! 😝 Probably a result of the native code being called by the python/ML packages used in these tools; even minor changes in the compilers across Ubuntu versions might introduce differences like these.

A quick fix might be to replace all system calls to h5diff in these tests with h5diff --use-system-epsilon; seems to do the trick here. But if that doesn't fix all test cases, then perhaps you can relax things with h5diff -p EPSILON, where EPSILON is a relative threshold. Probably OK to pick something like 1E-6. OK if I leave it to you to try this or otherwise check the rest of the cases?

Sorry for the inconvenience! I think the exact-match test worked as intended here, but I probably could've put in better messaging originally. Unfortunately, it's a bit awkward to grab the output of system commands.

And thanks for dealing with conda again (a necessary evil, unless we want to reimplement the entire field of machine learning in Java)! I'll experiment to see if I can't get the more recent version used in #8561 (23.10) working with the current environment---probably just some minor tweak to the pip version is needed to get around the error you're seeing. You could try unpinning it to see what gets pulled in. It would be great if we could get off the old version of conda, since more recent versions using the libmamba solver are MUCH faster and would cut down all of our Docker build times substantially.

@droazen
Copy link
Contributor Author

droazen commented Dec 10, 2023

Thanks @samuelklee for the Saturday night reply -- you rock! I will try running the tests with an epsilon and see if that does the trick. If it does, I might make another attempt at the new conda version to get the fast solver, since the current one is definitely annoyingly slow.

…ationsIntegrationTest and TrainVariantAnnotationsModelIntegrationTest
@gatk-bot
Copy link

Github actions tests reported job failures from actions build 7160404020
Failures in the following jobs:

Test Type JDK Job ID Logs
conda 17.0.6+10 7160404020.3 logs

@gatk-bot
Copy link

Github actions tests reported job failures from actions build 7165443572
Failures in the following jobs:

Test Type JDK Job ID Logs
conda 17.0.6+10 7165443572.3 logs

@samuelklee
Copy link
Contributor

samuelklee commented Dec 11, 2023

@droazen see how to update to conda 23.10.0 with the correct pip over at #8614. This shaves a couple of minutes off the Docker build, but more importantly, the faster solver should make life much easier for devs doing any updates to the conda environment in the future. (For example, trying to experiment with the environment updates made in #8561 was very painful with conda 23.1.0 currently in the base image---lots of hanging/failed solves.)

23.11.0 was just released, but it seems that there were some hiccups in the associated libmamba release, so let's stick with the last version for now.

@droazen
Copy link
Contributor Author

droazen commented Dec 11, 2023

Thanks @samuelklee , I will incorporate your conda update into this branch, now that we've dealt with the test failures!

I patched the VETS test code to include the h5diff (and diff) output in the exception messages when one of these commands fails, and switched to the existing BaseTest methods for running the process and capturing the output. You can see what the output looks like (when we remove the epsilon tolerance) here:

https://storage.googleapis.com/hellbender-test-logs/build_reports/8610/merge_7165443572.3/tests/testOnPackagedReleaseJar/classes/org.broadinstitute.hellbender.tools.walkers.vqsr.scalable.ScoreVariantAnnotationsIntegrationTest.html

https://storage.googleapis.com/hellbender-test-logs/build_reports/8610/merge_7165443572.3/tests/testOnPackagedReleaseJar/classes/org.broadinstitute.hellbender.tools.walkers.vqsr.scalable.TrainVariantAnnotationsModelIntegrationTest.html

As you suspected/hoped, all the differences were tiny. When you have a chance, could you please review these changes to the VETS tests and let me know if you spot any issues?

Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests and results LGTM, just a few extra comments/questions. Thanks for cleaning up the system-command stuff too!

python3 \
python3-pip \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this system pip required for anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it might be useful to have a pip in the base environment, but I can remove it if you think it might cause problems?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fine, just curious!

exit 1
fi

IMAGE_VERSION=$1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build_docker_base.sh script currently has an (outdated) image version hardcoded in---maybe parametrize it there as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am definitely going to update the old build script before merging this, and also add a separate script to release a pre-staged base image to the official location

@gatk-bot
Copy link

Github actions tests reported job failures from actions build 7173178607
Failures in the following jobs:

Test Type JDK Job ID Logs
conda 17.0.6+10 7173178607.3 logs

@droazen droazen merged commit 85d13d4 into master Dec 12, 2023
20 checks passed
@droazen droazen deleted the dr_update_docker_base_image branch December 12, 2023 02:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants