Upgrade Python to 3.9 #1006

Merged: 28 commits, Nov 26, 2024
Showing changes from 12 commits

Commits
1f1eff6
upgrade python, nltk, pillow
vicilliar Oct 2, 2024
3cff524
hardcoded python 3.8.20 vector regression tests
vicilliar Oct 9, 2024
7f1a977
remove commented method
vicilliar Oct 15, 2024
931a804
Merge branch 'mainline' into joshua/upgrade-to-python-3.9
vicilliar Oct 15, 2024
82a18b1
re-add open clip embeddings script
vicilliar Oct 16, 2024
7d6248d
new open clip embeddings file
vicilliar Oct 16, 2024
8a120d1
remove open clip embeddings code
vicilliar Oct 16, 2024
d00f5d8
add error message for open clip embeddings
vicilliar Oct 16, 2024
d2e88b1
raise tolerance for open clip model
vicilliar Oct 16, 2024
c2de309
Merge branch 'mainline' into joshua/upgrade-to-python-3.9
vicilliar Oct 21, 2024
391488f
bump marqo and marqo-base version
vicilliar Oct 21, 2024
3120153
remove version bump
vicilliar Oct 21, 2024
2a685d7
Merge branch 'mainline' into joshua/upgrade-to-python-3.9
vicilliar Nov 8, 2024
13e24e6
rename embeddings file, add bypass for non-existent embeddings
vicilliar Nov 8, 2024
e37f727
add script to record stella embeddings
vicilliar Nov 9, 2024
ed232bb
generate embeddings for stella
vicilliar Nov 9, 2024
fdb208a
remove embedding generation code
vicilliar Nov 9, 2024
e56c9b7
Update marqo base version to 44
vicilliar Nov 19, 2024
e5bfd14
upgrade vespa version
vicilliar Nov 19, 2024
ce64458
Merge branch 'mainline' into joshua/upgrade-to-python-3.9
vicilliar Nov 19, 2024
efa69ca
Merge branch 'mainline' into joshua/upgrade-to-python-3.9
vicilliar Nov 20, 2024
796ae58
change vespa version in test
vicilliar Nov 20, 2024
eb17341
update test - no characters fail
vicilliar Nov 21, 2024
b9b872b
Merge branch 'mainline' into joshua/upgrade-to-python-3.9
vicilliar Nov 21, 2024
a524e87
Merge branch 'mainline' into joshua/upgrade-to-python-3.9
vicilliar Nov 25, 2024
f0305b4
add hardcoded embeddings error message
vicilliar Nov 25, 2024
24fdad6
temporarily skip tests with previous regressions
vicilliar Nov 26, 2024
075454f
change snowflake omission to test_vectorise
vicilliar Nov 26, 2024
1 change: 1 addition & 0 deletions .dockerignore
@@ -139,5 +139,6 @@ local_only/
tests/cache/

cache/
src/marqo/cache/

__pycache__/
2 changes: 1 addition & 1 deletion .github/workflows/arm64_docker_marqo.yml
@@ -83,7 +83,7 @@ jobs:
 with:
 fetch-depth: 0

-- name: Set up Python 3.9 # TODO: Check if 3.9 is okay instead of 3.8. So far, so good
+- name: Set up Python 3.9
 run: |
 apt-get -y update
 apt-get -y install python3.9
4 changes: 2 additions & 2 deletions .github/workflows/cpu_docker_marqo.yml
@@ -91,10 +91,10 @@ jobs:
 with:
 fetch-depth: 0

-- name: Set up Python 3.8
+- name: Set up Python 3.9
 uses: actions/setup-python@v3
 with:
-python-version: "3.8"
+python-version: "3.9"
 cache: "pip"

 - name: Install Dependencies
4 changes: 2 additions & 2 deletions .github/workflows/cpu_local_marqo.yml
@@ -91,10 +91,10 @@ jobs:
 with:
 fetch-depth: 0

-- name: Set up Python 3.8
+- name: Set up Python 3.9
 uses: actions/setup-python@v3
 with:
-python-version: "3.8"
+python-version: "3.9"
 cache: "pip"

 - name: Install Dependencies
4 changes: 2 additions & 2 deletions .github/workflows/cuda_docker_marqo.yml
@@ -86,10 +86,10 @@ jobs:
 with:
 fetch-depth: 0

-- name: Set up Python 3.8
+- name: Set up Python 3.9
 uses: actions/setup-python@v3
 with:
-python-version: "3.8"
+python-version: "3.9"
 cache: "pip"

 - name: Install Dependencies
4 changes: 2 additions & 2 deletions .github/workflows/largemodel_unit_test_CI.yml
@@ -66,10 +66,10 @@ jobs:
 fetch-depth: 0
 path: marqo

-- name: Set up Python 3.8
+- name: Set up Python 3.9
 uses: actions/setup-python@v3
 with:
-python-version: "3.8"
+python-version: "3.9"
 cache: "pip"

 - name: Checkout marqo-base for requirements
4 changes: 2 additions & 2 deletions .github/workflows/locust_perf_test.yml
@@ -112,10 +112,10 @@ jobs:
 with:
 ref: ${{ github.event.inputs.marqo_ref }}

-- name: Set up Python 3.8
+- name: Set up Python 3.9
 uses: actions/setup-python@v3
 with:
-python-version: "3.8"
+python-version: "3.9"

 - name: Set up Docker Buildx
 if: github.event.inputs.marqo_host == 'http://localhost:8882' && github.event.inputs.image_to_test == 'marqo_docker_0'
4 changes: 2 additions & 2 deletions .github/workflows/test_documentation.yml
@@ -20,10 +20,10 @@ jobs:
 fetch-depth: 0
 path: marqo

-- name: Set up Python 3.8
+- name: Set up Python 3.9
 uses: actions/setup-python@v3
 with:
-python-version: "3.8"
+python-version: "3.9"
 cache: "pip"

 - name: Install dependencies
4 changes: 2 additions & 2 deletions .github/workflows/unit_test_200gb_CI.yml
@@ -66,10 +66,10 @@ jobs:
 fetch-depth: 0
 path: marqo

-- name: Set up Python 3.8
+- name: Set up Python 3.9
 uses: actions/setup-python@v3
 with:
-python-version: "3.8"
+python-version: "3.9"
 cache: "pip"

 - name: Checkout marqo-base for requirements
2 changes: 1 addition & 1 deletion Dockerfile
@@ -6,7 +6,7 @@ COPY vespa .
 RUN mvn clean package

 # Stage 2: Base image for Python setup
-FROM marqoai/marqo-base:36 as base_image
+FROM marqoai/marqo-base:37 as base_image

 # Allow mounting volume containing data and configs for vespa
 VOLUME /opt/vespa/var

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions tests/s2_inference/embeddings_reference/info.txt
@@ -0,0 +1,7 @@
+16/10/24 - All embeddings were generated with:
+- Marqo mainline head: 055237ae6c4a8121b4026650582f3a23bd416564 (2.12.2 release notes)
+- Python 3.8.20
+- open_clip_torch==2.24.0
+- torch==1.12.1
+- Ubuntu 22.04.4 LTS
+- g4dn.xlarge EC2 instance
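
The tests in this PR index the reference JSON as `embeddings[name][sentence]`, which suggests a two-level mapping from model name to sentence to vector. The commit list mentions embedding generation code that was added and later removed, so it is not visible here. A minimal sketch of how such a file could be recorded, assuming a stand-in encoder (`toy_encode`, `record_embeddings`, and the file name are hypothetical, not Marqo's API):

```python
import json
import os
import tempfile

import numpy as np


def toy_encode(sentence: str, dim: int = 4) -> np.ndarray:
    """Hypothetical stand-in for a model's encode(); returns a unit-length vector."""
    rng = np.random.default_rng(len(sentence))  # deterministic per sentence length
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)


def record_embeddings(model_names, sentences, path):
    """Write {model_name: {sentence: [floats]}} to a JSON file and return the mapping."""
    reference = {
        name: {s: toy_encode(s).tolist() for s in sentences}
        for name in model_names
    }
    with open(path, "w") as f:
        json.dump(reference, f)
    return reference


sentences = ["hello", "this is a test sentence. so is this."]
path = os.path.join(tempfile.mkdtemp(), "embeddings_example.json")
record_embeddings(["example-model"], sentences, path)

with open(path) as f:
    loaded = json.load(f)

# Compare a freshly computed vector with the stored one, as the tests do.
matches = bool(np.allclose(toy_encode("hello"), loaded["example-model"]["hello"], atol=1e-6))
```

Only string sentences can serve as JSON object keys, which is consistent with the tests skipping the list-valued entry via `isinstance(sentence, str)`.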
77 changes: 66 additions & 11 deletions tests/s2_inference/test_encoding.py
@@ -1,12 +1,15 @@
import unittest
import torch
import json
import numpy as np
from unittest.mock import MagicMock, patch
from marqo.s2_inference.types import FloatTensor
from marqo.s2_inference.s2_inference import clear_loaded_models, get_model_properties_from_registry
from marqo.s2_inference.model_registry import load_model_properties, _get_open_clip_properties
from marqo.s2_inference.s2_inference import _convert_tensor_to_numpy
import numpy as np
import functools
import os

from marqo.s2_inference.s2_inference import (
_check_output_type, vectorise,
@@ -17,6 +20,13 @@

_load_model = functools.partial(og_load_model, calling_func = "unit_test")


def get_absolute_file_path(filename: str) -> str:
currentdir = os.path.dirname(os.path.abspath(__file__))
abspath = os.path.join(currentdir, filename)
return abspath
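
The `get_absolute_file_path` helper above is added to both test modules. A rough `pathlib` equivalent is sketched here for comparison; `path_next_to` and its explicit `anchor_file` parameter are hypothetical names used so the example runs outside a test module:

```python
from pathlib import Path


def path_next_to(anchor_file: str, filename: str) -> str:
    """Return an absolute path for `filename` in the directory containing `anchor_file`."""
    return str(Path(anchor_file).absolute().parent / filename)


# In a test module the first argument would be __file__.
example = path_next_to(
    "tests/s2_inference/test_encoding.py", "embeddings_reference/info.txt"
)
```

Defining a helper like this once in a shared test utility would avoid the duplication, though that is a design suggestion rather than something the PR does.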


class TestEncoding(unittest.TestCase):

def setUp(self) -> None:
@@ -26,8 +36,12 @@ def tearDown(self) -> None:
clear_loaded_models()

def test_vectorize(self):
names = ["fp16/ViT-B/32", "onnx16/open_clip/ViT-B-32/laion400m_e32",
'onnx32/open_clip/ViT-B-32-quickgelu/laion400m_e32',
"""
Ensure that vectorised output from vectorise function matches both the model.encode output and
hardcoded embeddings from Python 3.8.20
"""

names = ["fp16/ViT-B/32", "onnx16/open_clip/ViT-B-32/laion400m_e32", 'onnx32/open_clip/ViT-B-32-quickgelu/laion400m_e32',
"all-MiniLM-L6-v1", "all_datasets_v4_MiniLM-L6", "hf/all-MiniLM-L6-v1", "hf/all_datasets_v4_MiniLM-L6",
"hf/bge-small-en-v1.5", "onnx/all-MiniLM-L6-v1", "onnx/all_datasets_v4_MiniLM-L6"]

@@ -43,21 +57,42 @@ def test_vectorize(self):
sentences = ['hello', 'this is a test sentence. so is this.', ['hello', 'this is a test sentence. so is this.']]
device = 'cpu'
eps = 1e-9
embeddings_file_name = get_absolute_file_path("embeddings_reference/embeddings_python_3_8.json")

# Load in hardcoded embeddings json file
with open(embeddings_file_name, "r") as f:
embeddings_python_3_8 = json.load(f)

for name in names:
model_properties = get_model_properties_from_registry(name)
model = _load_model(model_properties['name'], model_properties=model_properties, device=device)
with self.subTest(name=name):
# Add hardcoded embeddings into the variable.
model_properties = get_model_properties_from_registry(name)
model = _load_model(model_properties['name'], model_properties=model_properties, device=device)

for sentence in sentences:
output_v = vectorise(name, sentence, model_properties, device, normalize_embeddings=True)
for sentence in sentences:
with self.subTest(sentence=sentence):
output_v = vectorise(name, sentence, model_properties, device, normalize_embeddings=True)
assert _check_output_type(output_v)

assert _check_output_type(output_v)
output_m = model.encode(sentence, normalize=True)

output_m = model.encode(sentence, normalize=True)
# Embeddings must match hardcoded python 3.8.20 embeddings
if isinstance(sentence, str):
with self.subTest("Hardcoded Python 3.8 Embeddings Comparison"):
try:
self.assertEqual(np.allclose(output_m, embeddings_python_3_8[name][sentence],
atol=1e-6),
True)
except KeyError:
raise KeyError(f"Hardcoded Python 3.8 embeddings not found for "
f"model: {name}, sentence: {sentence} in JSON file: "
f"{embeddings_file_name}")

assert abs(torch.FloatTensor(output_m) - torch.FloatTensor(output_v)).sum() < eps
with self.subTest("Model encode vs vectorize"):
self.assertEqual(np.allclose(output_m, output_v, atol=eps), True)

clear_loaded_models()

clear_loaded_models()
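
The rewritten loop wraps each model and sentence in `self.subTest`, so a mismatch for one combination is reported without stopping the remaining iterations. A small self-contained sketch of that behaviour (the `ParityDemo` case is hypothetical and unrelated to Marqo):

```python
import io
import unittest


def run_subtest_demo() -> unittest.TestResult:
    class ParityDemo(unittest.TestCase):
        def test_values(self):
            for value in [1, 2, 3, 4]:
                # Each subTest is reported separately; a failure does not end the loop.
                with self.subTest(value=value):
                    self.assertEqual(value % 2, 1)

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParityDemo)
    # Send runner output to a buffer so the demo stays quiet.
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)


result = run_subtest_demo()
failed_params = [str(test) for test, _ in result.failures]
```

One test method runs, yet both even values are recorded as separate failures, each labelled with its `value=` parameter.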

def test_vectorize_normalise(self):
open_clip_names = ["open_clip/ViT-B-32/laion2b_s34b_b79k"]
@@ -120,6 +155,7 @@ def test_cpu_encode_type(self):

clear_loaded_models()


def test_load_clip_text_model(self):
names = ["fp16/ViT-B/32", "onnx16/open_clip/ViT-B-32/laion400m_e32", 'onnx32/open_clip/ViT-B-32-quickgelu/laion400m_e32',
'RN50', "ViT-B/16"]
@@ -313,6 +349,11 @@ def test_open_clip_vectorize(self):
sentences = ['hello', 'this is a test sentence. so is this.', ['hello', 'this is a test sentence. so is this.']]
device = 'cpu'
eps = 1e-9
embeddings_reference_file = get_absolute_file_path("embeddings_reference/embeddings_open_clip_python_3_8.json")

# Load in hardcoded embeddings json file
with open(embeddings_reference_file, "r") as f:
embeddings_python_3_8 = json.load(f)

for name in names:
model_properties = get_model_properties_from_registry(name)
@@ -327,7 +368,21 @@ def test_open_clip_vectorize(self):

output_m = model.encode(sentence, normalize=normalize_embeddings)

assert abs(torch.FloatTensor(output_m) - torch.FloatTensor(output_v)).sum() < eps
# Embeddings must match hardcoded python 3.8.20 embeddings
if isinstance(sentence, str):
with self.subTest("Hardcoded Python 3.8 Embeddings Comparison"):
try:
self.assertEqual(np.allclose(output_m, embeddings_python_3_8[name][sentence], atol=1e-5),
True, f"For model {name} and sentence {sentence}: "
f"Calculated embedding is {output_m} but "
f"hardcoded embedding is {embeddings_python_3_8[name][sentence]}")
except KeyError:
raise KeyError(f"Hardcoded Python 3.8 embeddings not found for "
f"model: {name}, sentence: {sentence} in JSON file: "
f"{embeddings_reference_file}")

with self.subTest("Model encode vs vectorize"):
self.assertEqual(np.allclose(output_m, output_v, atol=eps), True)

clear_loaded_models()
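
This PR replaces the summed-difference check `abs(a - b).sum() < eps` with `np.allclose(a, b, atol=eps)`. The two are not equivalent: `np.allclose` tests `|a - b| <= atol + rtol * |b|` elementwise, and `rtol` defaults to `1e-5` unless overridden. A small sketch with made-up vectors showing the difference:

```python
import numpy as np

eps = 1e-9
a = np.ones(4)
b = a + 5e-6  # every element differs by about 5e-6

# Old check: summed absolute difference must be below eps.
old_check = bool(np.abs(a - b).sum() < eps)

# New check as written in the diff: atol=eps, but rtol keeps its default of 1e-5.
new_check = bool(np.allclose(a, b, atol=eps))

# Setting rtol=0 makes atol the only tolerance.
strict_check = bool(np.allclose(a, b, atol=eps, rtol=0.0))
```

Here the old check and the `rtol=0` variant reject the pair while the default `np.allclose` call accepts it. For normalised embeddings the elements are much smaller than 1, so the relative term contributes less, but it still exceeds `eps = 1e-9` for most components.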

51 changes: 39 additions & 12 deletions tests/s2_inference/test_large_model_encoding.py
@@ -1,6 +1,7 @@
import os
import torch
import pytest
import json
from marqo.s2_inference.types import FloatTensor
from marqo.s2_inference.s2_inference import clear_loaded_models, get_model_properties_from_registry, _convert_tensor_to_numpy
from unittest.mock import patch
@@ -34,10 +35,31 @@ def remove_cached_model_files():
elif os.path.isdir(item_path):
shutil.rmtree(item_path)

def run_test_vectorize(models):

def get_absolute_file_path(filename: str) -> str:
currentdir = os.path.dirname(os.path.abspath(__file__))
abspath = os.path.join(currentdir, filename)
return abspath


def run_test_vectorize(models, model_type):

# model_type determines the filename with which the embeddings are saved/loaded
# Ensure that vectorised output from vectorise function matches both the model.encode output and
# hardcoded embeddings from Python 3.8


sentences = ['hello', 'this is a test sentence. so is this.', ['hello', 'this is a test sentence. so is this.']]
device = "cuda"
eps = 1e-9
embeddings_reference_file = get_absolute_file_path(
f"embeddings_reference/embeddings_{model_type}_python_3_8.json"
)

# Load in hardcoded embeddings json file
with open(embeddings_reference_file, "r") as f:
embeddings_python_3_8 = json.load(f)

with patch.dict(os.environ, {"MARQO_MAX_CUDA_MODEL_MEMORY": "10"}):
def run():
for name in models:
@@ -55,7 +77,16 @@ def run():
if type(output_m) == torch.Tensor:
output_m = output_m.cpu().numpy()

assert abs(torch.FloatTensor(output_m) - torch.FloatTensor(output_v)).sum() < eps
# Embeddings must match hardcoded python 3.8.20 embeddings
if isinstance(sentence, str):
try:
assert np.allclose(output_m, embeddings_python_3_8[name][sentence], atol=1e-6)
except KeyError:
raise KeyError(f"Hardcoded Python 3.8 embeddings not found for "
f"model: {name}, sentence: {sentence} in JSON file: "
f"{embeddings_reference_file}")

assert np.allclose(output_m, output_v, atol=eps)

clear_loaded_models()
torch.cuda.empty_cache()
@@ -67,6 +98,7 @@ def run():

assert run()
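
In the hunks above, the `try` block covers both the dictionary lookup and the comparison. One possible refinement, sketched under the assumption of a hypothetical helper named `lookup_reference`, is to isolate the lookup so the custom message is raised only for a missing key and the original `KeyError` is chained:

```python
def lookup_reference(reference: dict, name: str, sentence: str, source: str):
    """Return the stored embedding, or raise a KeyError that names what is missing."""
    try:
        return reference[name][sentence]
    except KeyError as err:
        raise KeyError(
            f"Hardcoded embeddings not found for model: {name}, "
            f"sentence: {sentence} in JSON file: {source}"
        ) from err


reference = {"example-model": {"hello": [0.6, 0.8]}}
found = lookup_reference(reference, "example-model", "hello", "example.json")

try:
    lookup_reference(reference, "missing-model", "hello", "example.json")
    message = ""
except KeyError as err:
    message = str(err)
```

The comparison can then run outside the `try`, so an assertion failure is never confused with a missing reference entry.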


def run_test_model_outputs(models):
sentences = ['hello', 'this is a test sentence. so is this.', ['hello', 'this is a test sentence. so is this.']]
device = "cuda"
@@ -155,8 +187,7 @@ def tearDownClass(cls) -> None:

 def test_vectorize(self):
 # For GPU Memory Optimization, we shouldn't load all models at once
-for model_name in self.models:
-run_test_vectorize(models=[model_name])
+run_test_vectorize(models=self.models, model_type="large_open_clip")

 def test_load_clip_text_model(self):
 device = "cuda"
@@ -224,8 +255,7 @@ def tearDownClass(cls) -> None:

 def test_vectorize(self):
 # For GPU Memory Optimization, we shouldn't load all models at once
-for model_name in self.models:
-run_test_vectorize(models=[model_name])
+run_test_vectorize(models=self.models, model_type="large_e5")

 def test_model_outputs(self):
 for model_name in self.models:
@@ -259,8 +289,7 @@ def tearDownClass(cls) -> None:

 def test_vectorize(self):
 # For GPU Memory Optimization, we shouldn't load all models at once
-for model_name in self.models:
-run_test_vectorize(models=[model_name])
+run_test_vectorize(models=self.models, model_type="large_bge")

 def test_model_outputs(self):
 for model_name in self.models:
@@ -294,8 +323,7 @@ def tearDownClass(cls) -> None:

 def test_vectorize(self):
 # For GPU Memory Optimization, we shouldn't load all models at once
-for model_name in self.models:
-run_test_vectorize(models=[model_name])
+run_test_vectorize(models=self.models, model_type="large_snowflake")

 def test_model_outputs(self):
 for model_name in self.models:
@@ -334,8 +362,7 @@ def tearDownClass(cls) -> None:

 def test_vectorize(self):
 # For GPU Memory Optimization, we shouldn't load all models at once
-for model_name in self.models:
-run_test_vectorize(models=[model_name])
+run_test_vectorize(models=self.models, model_type="large_multilingual_e5")

 def test_model_outputs(self):
 for model_name in self.models: