Fix Numpy & h5py bug in 1.7.0 #562

Merged
merged 3 commits into from
Jun 25, 2024

Conversation

IanHoang
Collaborator

@IanHoang IanHoang commented Jun 20, 2024

Description

OSB 1.7.0 fails to start when h5py 3.10.0 is installed alongside numpy 2.0.0 (which was recently released). The Dockerfile currently contains a line that forces h5py 3.10.0, which was added a few months ago to unblock an issue in Jenkins. For more details, see this issue.

h5py 3.10.0 with numpy 2.0.0 results in an error

benchmark@b7e33afd6ad6:~$ pip3 list
Package                   Version
------------------------- -----------
...
h5py                      3.10.0
numpy                     2.0.0
...
benchmark@b7e33afd6ad6:~$ opensearch-benchmark --version
Traceback (most recent call last):
  File "/usr/local/bin/opensearch-benchmark", line 5, in <module>
    from osbenchmark.benchmark import main
  File "/usr/local/lib/python3.11/site-packages/osbenchmark/benchmark.py", line 37, in <module>
    from osbenchmark import version, actor, config, paths, \
  File "/usr/local/lib/python3.11/site-packages/osbenchmark/test_execution_orchestrator.py", line 33, in <module>
    from osbenchmark import actor, config, doc_link, \
  File "/usr/local/lib/python3.11/site-packages/osbenchmark/worker_coordinator/__init__.py", line 26, in <module>
    from .worker_coordinator import (
  File "/usr/local/lib/python3.11/site-packages/osbenchmark/worker_coordinator/worker_coordinator.py", line 44, in <module>
    from osbenchmark import actor, config, exceptions, metrics, workload, client, paths, PROGRAM_NAME, telemetry
  File "/usr/local/lib/python3.11/site-packages/osbenchmark/workload/__init__.py", line 25, in <module>
    from .loader import (
  File "/usr/local/lib/python3.11/site-packages/osbenchmark/workload/loader.py", line 41, in <module>
    from osbenchmark.workload import params, workload
  File "/usr/local/lib/python3.11/site-packages/osbenchmark/workload/params.py", line 42, in <module>
    from osbenchmark.utils.dataset import DataSet, get_data_set, Context
  File "/usr/local/lib/python3.11/site-packages/osbenchmark/utils/dataset.py", line 13, in <module>
    import h5py
  File "/usr/local/lib/python3.11/site-packages/h5py/__init__.py", line 25, in <module>
    from . import _errors
  File "h5py/_errors.pyx", line 1, in init h5py._errors
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
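To reproduce the failure above without going through the full opensearch-benchmark import chain, a small guarded import can report the installed versions together with the error. This is a minimal sketch, not part of OSB: the helper names `installed_version` and `check_h5py_import` are hypothetical, and the only behavior assumed is the one shown in the traceback, namely that `import h5py` raises `ValueError` when the binary layout does not match.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_h5py_import():
    """Try to import h5py and describe the outcome as a short message."""
    versions = "h5py={} numpy={}".format(
        installed_version("h5py"), installed_version("numpy")
    )
    try:
        import h5py  # noqa: F401  (imported only to trigger the check)
    except ImportError as exc:
        # h5py (or one of its dependencies) could not be imported at all.
        return "h5py not importable ({}): {}".format(versions, exc)
    except ValueError as exc:
        # The traceback above surfaces the binary mismatch as a ValueError.
        return "binary incompatibility ({}): {}".format(versions, exc)
    return "h5py imported cleanly ({})".format(versions)


if __name__ == "__main__":
    print(check_h5py_import())
```

Running this inside the container prints a single line, which is easier to paste into an issue than the full traceback.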

Issues Resolved

#545

Testing

  • Built the Docker image from the Dockerfile after removing that line. Confirmed that OSB 1.7.0 works with numpy 2.0.0 and h5py 3.11.0 (the latest version of h5py).
  • Also confirmed that numpy 1.26.4 (the numpy version OSB used before 2.0.0 was released) works with both h5py 3.10.0 and 3.11.0.
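The four combinations covered by this PR can be summarized as a small lookup table. The sketch below only records the outcomes reported here for OSB 1.7.0. It makes no claim about untested versions, and the names `OBSERVED`, `observed_outcome` and `failing_pairs` are illustrative.

```python
# Outcomes reported in this PR for OSB 1.7.0.
# Keys are (h5py version, numpy version); True means the CLI started.
OBSERVED = {
    ("3.10.0", "2.0.0"): False,   # ValueError on import, see traceback
    ("3.11.0", "2.0.0"): True,
    ("3.10.0", "1.26.4"): True,
    ("3.11.0", "1.26.4"): True,
}


def observed_outcome(h5py_version, numpy_version):
    """Return True/False for a tested pair, or None if it was not tested."""
    return OBSERVED.get((h5py_version, numpy_version))


def failing_pairs():
    """List the tested pairs that failed."""
    return sorted(pair for pair, ok in OBSERVED.items() if not ok)


if __name__ == "__main__":
    for (h5py_version, numpy_version), ok in sorted(OBSERVED.items()):
        status = "works" if ok else "fails"
        print("h5py {} + numpy {}: {}".format(h5py_version, numpy_version, status))
```

The only failing pair is h5py 3.10.0 with numpy 2.0.0, which matches the terminal output in the Description and in the sections below.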

h5py 3.11.0 with numpy 2.0.0

benchmark@b7e33afd6ad6:~$ pip3 list
Package                   Version
------------------------- -----------
...
h5py                      3.11.0
numpy                     2.0.0
...
benchmark@b7e33afd6ad6:~$ opensearch-benchmark --version
opensearch-benchmark 1.7.0
benchmark@b7e33afd6ad6:~$ opensearch-benchmark list workloads

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

Available workloads:

Name                  Description                                                                                                        Documents    Compressed Size    Uncompressed Size    Default TestProcedure         All TestProcedures
--------------------  -----------------------------------------------------------------------------------------------------------------  -----------  -----------------  -------------------  ----------------------------  ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
geoshape              Shapes from PlanetOSM                                                                                              60,523,283   13.4 GB            45.4 GB              append-no-conflicts           append-no-conflicts
eventdata             This benchmark indexes HTTP access logs generated based sample logs from the elastic.co website using a generator  20,000,000   756.0 MB           15.3 GB              append-no-conflicts           append-no-conflicts,transform
so                    Indexing benchmark using up to questions and answers from StackOverflow                                            36,062,278   8.9 GB             33.1 GB              append-no-conflicts           append-no-conflicts
noaa_semantic_search  Benchmark performance of semantic search queries based on dataset of global daily weather measurements from NOAA   33,659,481   949.4 MB           9.0 GB               hybrid-query-aggs-light       hybrid-query-aggs-light,hybrid-query-aggs-full,create-and-index,hybrid-query-aggs-no-index,search-profiling
geopoint              Point coordinates from PlanetOSM                                                                                   60,844,404   482.1 MB           2.3 GB               append-no-conflicts           append-no-conflicts,append-no-conflicts-index-only,append-fast-with-conflicts
vectorsearch          Benchmark vector search engine performance for different engine types like faiss, lucene and nmslib                0            N/A                N/A                  no-train-test                 no-train-test,no-train-test-index-only,no-train-test-index-with-merge,search-only,force-merge-index,no-train-test-aoss
pmc                   Full text benchmark with academic papers from PMC                                                                  574,199      5.5 GB             21.7 GB              append-no-conflicts           indexing-querying,append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts,append-fast-with-conflicts
noaa                  Global daily weather measurements from NOAA                                                                        33,659,481   949.4 MB           9.0 GB               append-no-conflicts           append-no-conflicts,append-no-conflicts-index-only,top_metrics,aggs
http_logs             HTTP server log data                                                                                               247,249,096  1.2 GB             31.1 GB              append-no-conflicts           append-no-conflicts,append-no-conflicts-original,append-no-conflicts-index-only,append-sorted-no-conflicts,append-index-only-with-ingest-pipeline,update,append-no-conflicts-index-reindex-only,search-pipeline
nested                StackOverflow Q&A stored as nested docs                                                                            11,203,029   663.3 MB           3.4 GB               nested-search-test-procedure  nested-search-test-procedure,index-only
geopointshape         Point coordinates from PlanetOSM indexed as geoshapes                                                              60,844,404   470.8 MB           2.6 GB               append-no-conflicts           append-no-conflicts,append-no-conflicts-index-only,append-fast-with-conflicts
nyc_taxis             Taxi rides in New York in 2015                                                                                     165,346,692  4.5 GB             74.3 GB              append-no-conflicts           searchable-snapshot,append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts-index-only,update
geonames              POIs from Geonames                                                                                                 11,396,503   252.9 MB           3.3 GB               append-no-conflicts           append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts,append-fast-with-conflicts,significant-text
big5                  Big5 workload based on synthetically generated data corpus                                                         116,000,000  5.6 GB             100.0 GB             big5                          big5,test
percolator            Percolator benchmark based on AOL queries                                                                          2,000,000    121.1 kB           104.9 MB             append-no-conflicts           append-no-conflicts

--------------------------------
[INFO] SUCCESS (took 12 seconds)
--------------------------------
...

h5py 3.10.0 with numpy 1.26.4

benchmark@b7e33afd6ad6:~$ pip3 list
Package                   Version
------------------------- -----------
...
h5py                      3.10.0
numpy                     1.26.4
...
benchmark@b7e33afd6ad6:~$ opensearch-benchmark --version
opensearch-benchmark 1.7.0
benchmark@b7e33afd6ad6:~$ opensearch-benchmark list workloads

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

Available workloads:

Name                  Description                                                                                                        Documents    Compressed Size    Uncompressed Size    Default TestProcedure         All TestProcedures
--------------------  -----------------------------------------------------------------------------------------------------------------  -----------  -----------------  -------------------  ----------------------------  ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
geoshape              Shapes from PlanetOSM                                                                                              60,523,283   13.4 GB            45.4 GB              append-no-conflicts           append-no-conflicts
eventdata             This benchmark indexes HTTP access logs generated based sample logs from the elastic.co website using a generator  20,000,000   756.0 MB           15.3 GB              append-no-conflicts           append-no-conflicts,transform
so                    Indexing benchmark using up to questions and answers from StackOverflow                                            36,062,278   8.9 GB             33.1 GB              append-no-conflicts           append-no-conflicts
noaa_semantic_search  Benchmark performance of semantic search queries based on dataset of global daily weather measurements from NOAA   33,659,481   949.4 MB           9.0 GB               hybrid-query-aggs-light       hybrid-query-aggs-light,hybrid-query-aggs-full,create-and-index,hybrid-query-aggs-no-index,search-profiling
geopoint              Point coordinates from PlanetOSM                                                                                   60,844,404   482.1 MB           2.3 GB               append-no-conflicts           append-no-conflicts,append-no-conflicts-index-only,append-fast-with-conflicts
vectorsearch          Benchmark vector search engine performance for different engine types like faiss, lucene and nmslib                0            N/A                N/A                  no-train-test                 no-train-test,no-train-test-index-only,no-train-test-index-with-merge,search-only,force-merge-index,no-train-test-aoss
pmc                   Full text benchmark with academic papers from PMC                                                                  574,199      5.5 GB             21.7 GB              append-no-conflicts           indexing-querying,append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts,append-fast-with-conflicts
noaa                  Global daily weather measurements from NOAA                                                                        33,659,481   949.4 MB           9.0 GB               append-no-conflicts           append-no-conflicts,append-no-conflicts-index-only,top_metrics,aggs
http_logs             HTTP server log data                                                                                               247,249,096  1.2 GB             31.1 GB              append-no-conflicts           append-no-conflicts,append-no-conflicts-original,append-no-conflicts-index-only,append-sorted-no-conflicts,append-index-only-with-ingest-pipeline,update,append-no-conflicts-index-reindex-only,search-pipeline
nested                StackOverflow Q&A stored as nested docs                                                                            11,203,029   663.3 MB           3.4 GB               nested-search-test-procedure  nested-search-test-procedure,index-only
geopointshape         Point coordinates from PlanetOSM indexed as geoshapes                                                              60,844,404   470.8 MB           2.6 GB               append-no-conflicts           append-no-conflicts,append-no-conflicts-index-only,append-fast-with-conflicts
nyc_taxis             Taxi rides in New York in 2015                                                                                     165,346,692  4.5 GB             74.3 GB              append-no-conflicts           searchable-snapshot,append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts-index-only,update
geonames              POIs from Geonames                                                                                                 11,396,503   252.9 MB           3.3 GB               append-no-conflicts           append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts,append-fast-with-conflicts,significant-text
big5                  Big5 workload based on synthetically generated data corpus                                                         116,000,000  5.6 GB             100.0 GB             big5                          big5,test
percolator            Percolator benchmark based on AOL queries                                                                          2,000,000    121.1 kB           104.9 MB             append-no-conflicts           append-no-conflicts

--------------------------------
[INFO] SUCCESS (took 12 seconds)
--------------------------------

h5py 3.11.0 with numpy 1.26.4

benchmark@b7e33afd6ad6:~$ pip3 list
Package                   Version
------------------------- -----------
...
h5py                      3.11.0
numpy                     1.26.4
...
benchmark@b7e33afd6ad6:~$ opensearch-benchmark --version
opensearch-benchmark 1.7.0
benchmark@b7e33afd6ad6:~$ opensearch-benchmark list workloads

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

Available workloads:

Name                  Description                                                                                                        Documents    Compressed Size    Uncompressed Size    Default TestProcedure         All TestProcedures
--------------------  -----------------------------------------------------------------------------------------------------------------  -----------  -----------------  -------------------  ----------------------------  ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
geoshape              Shapes from PlanetOSM                                                                                              60,523,283   13.4 GB            45.4 GB              append-no-conflicts           append-no-conflicts
eventdata             This benchmark indexes HTTP access logs generated based sample logs from the elastic.co website using a generator  20,000,000   756.0 MB           15.3 GB              append-no-conflicts           append-no-conflicts,transform
so                    Indexing benchmark using up to questions and answers from StackOverflow                                            36,062,278   8.9 GB             33.1 GB              append-no-conflicts           append-no-conflicts
noaa_semantic_search  Benchmark performance of semantic search queries based on dataset of global daily weather measurements from NOAA   33,659,481   949.4 MB           9.0 GB               hybrid-query-aggs-light       hybrid-query-aggs-light,hybrid-query-aggs-full,create-and-index,hybrid-query-aggs-no-index,search-profiling
geopoint              Point coordinates from PlanetOSM                                                                                   60,844,404   482.1 MB           2.3 GB               append-no-conflicts           append-no-conflicts,append-no-conflicts-index-only,append-fast-with-conflicts
vectorsearch          Benchmark vector search engine performance for different engine types like faiss, lucene and nmslib                0            N/A                N/A                  no-train-test                 no-train-test,no-train-test-index-only,no-train-test-index-with-merge,search-only,force-merge-index,no-train-test-aoss
pmc                   Full text benchmark with academic papers from PMC                                                                  574,199      5.5 GB             21.7 GB              append-no-conflicts           indexing-querying,append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts,append-fast-with-conflicts
noaa                  Global daily weather measurements from NOAA                                                                        33,659,481   949.4 MB           9.0 GB               append-no-conflicts           append-no-conflicts,append-no-conflicts-index-only,top_metrics,aggs
http_logs             HTTP server log data                                                                                               247,249,096  1.2 GB             31.1 GB              append-no-conflicts           append-no-conflicts,append-no-conflicts-original,append-no-conflicts-index-only,append-sorted-no-conflicts,append-index-only-with-ingest-pipeline,update,append-no-conflicts-index-reindex-only,search-pipeline
nested                StackOverflow Q&A stored as nested docs                                                                            11,203,029   663.3 MB           3.4 GB               nested-search-test-procedure  nested-search-test-procedure,index-only
geopointshape         Point coordinates from PlanetOSM indexed as geoshapes                                                              60,844,404   470.8 MB           2.6 GB               append-no-conflicts           append-no-conflicts,append-no-conflicts-index-only,append-fast-with-conflicts
nyc_taxis             Taxi rides in New York in 2015                                                                                     165,346,692  4.5 GB             74.3 GB              append-no-conflicts           searchable-snapshot,append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts-index-only,update
geonames              POIs from Geonames                                                                                                 11,396,503   252.9 MB           3.3 GB               append-no-conflicts           append-no-conflicts,append-no-conflicts-index-only,append-sorted-no-conflicts,append-fast-with-conflicts,significant-text
big5                  Big5 workload based on synthetically generated data corpus                                                         116,000,000  5.6 GB             100.0 GB             big5                          big5,test
percolator            Percolator benchmark based on AOL queries                                                                          2,000,000    121.1 kB           104.9 MB             append-no-conflicts           append-no-conflicts

--------------------------------
[INFO] SUCCESS (took 12 seconds)
--------------------------------

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Ian Hoang <hoangia@amazon.com>
@IanHoang
Collaborator Author

Looks like it was temporarily failing due to the latest OpenSearch Docker image. It seems to be working now.

Collaborator

@rishabh6788 rishabh6788 left a comment

I think that rather than bumping to a new major version, we should pin numpy to the last released 1.x version, which is 1.26.4, and release a patch for both PyPI and Docker.

@IanHoang
Collaborator Author

Couple of questions @rishabh6788:

  • Are you suggesting pinning in addition to the removal of the line in the dockerfile?
  • What are the benefits of pinning numpy to 1.X and would we ever remove that pin?

@rishabh6788
Collaborator

Not in the Dockerfile, but in setup.py, and do a patch release on PyPI as well.
This should require no changes to the Dockerfile, and things will remain as they are.

For 2, it is mostly about avoiding major version bumps, since they are more prone to breaking changes and we don't know what will break in the future. Pinning to the latest 1.x release keeps the experience the same as before.

Ian Hoang added 2 commits June 25, 2024 09:34
Signed-off-by: Ian Hoang <hoangia@amazon.com>
Signed-off-by: Ian Hoang <hoangia@amazon.com>
@@ -106,7 +106,7 @@ def str_from_file(name):
 "h5py>=3.10.0",
 # License: BSD
 # Required for knnvector workload
-"numpy>=1.24.2",
+"numpy>=1.24.2,<=1.26.4",
Collaborator

I believe numpy<=1.26.4 would do the job?

Collaborator Author

@IanHoang IanHoang Jun 25, 2024

Previous versions of OSB might have issues with numpy versions earlier than 1.24.2, if I recall correctly. Plus, some of the dependencies above use this same format as well (such as thespianpy).
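The difference between the two specifiers discussed in this thread can be illustrated with a simplified range check. This is a sketch under stated assumptions: it handles plain `X.Y.Z` release strings only and is not a full implementation of Python's version-specifier rules. The function names are hypothetical.

```python
def parse_release(version):
    """Turn '1.26.4' into (1, 26, 4). Plain numeric X.Y.Z releases only."""
    return tuple(int(part) for part in version.split("."))


LOWER = parse_release("1.24.2")
UPPER = parse_release("1.26.4")


def allowed_by_pin(version):
    """Approximate 'numpy>=1.24.2,<=1.26.4' for plain X.Y.Z versions."""
    return LOWER <= parse_release(version) <= UPPER


def allowed_by_upper_bound_only(version):
    """Approximate 'numpy<=1.26.4' for plain X.Y.Z versions."""
    return parse_release(version) <= UPPER


if __name__ == "__main__":
    for candidate in ("1.23.5", "1.24.2", "1.26.4", "2.0.0"):
        print(
            candidate,
            "pin:", allowed_by_pin(candidate),
            "upper bound only:", allowed_by_upper_bound_only(candidate),
        )
```

Both forms exclude numpy 2.0.0. Only the two-sided form also excludes releases older than 1.24.2, which is the concern raised in the reply above.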

@@ -15,7 +15,7 @@ RUN apt-get -y update && \
 RUN groupadd --gid 1000 opensearch-benchmark && \
     useradd -d /opensearch-benchmark -m -k /dev/null -g 1000 -N -u 1000 -l -s /bin/bash benchmark

-RUN python3 -m pip install h5py==3.10.0; if [ -z "$VERSION" ] ; then python3 -m pip install opensearch-benchmark ; else python3 -m pip install opensearch-benchmark==$VERSION ; fi
+RUN if [ -z "$VERSION" ] ; then python3 -m pip install opensearch-benchmark ; else python3 -m pip install opensearch-benchmark==$VERSION ; fi
Collaborator

Why are we removing the h5py installation from the Dockerfile? Is it because h5py is already a dependency of OSB and will be installed along with it?

Collaborator Author

Yes, h5py is already included in setup.py and is already a dependency during the OSB installation. If we decide to pin to a version of a dependency, we should do it in setup.py and not the Dockerfile in order to prevent future conflicts.

@IanHoang IanHoang merged commit 887b7e5 into opensearch-project:main Jun 25, 2024
8 of 9 checks passed
@IanHoang IanHoang deleted the fix-numpy-bug branch June 25, 2024 17:30
This was referenced Jun 25, 2024
finnroblin pushed a commit to finnroblin/opensearch-benchmark that referenced this pull request Jul 19, 2024
Signed-off-by: Ian Hoang <hoangia@amazon.com>
Co-authored-by: Ian Hoang <hoangia@amazon.com>