feat: Add multiprocessing #141

Merged: Pringled merged 9 commits into main from add-multiprocessing on Dec 27, 2024

Pringled (Member) commented on Dec 26, 2024

Some simple code for benchmarking:

from time import perf_counter

from model2vec import StaticModel
from datasets import load_dataset

def main(use_multiprocessing: bool) -> None:
    # Load a single shard of the English Wikipedia dump as the benchmark corpus.
    ds = load_dataset(
        "wikimedia/wikipedia",
        data_files="20231101.en/train-00000-of-00041.parquet",
    )["train"]
    texts = ds["text"]

    model = StaticModel.from_pretrained("minishlab/potion-base-8M")

    # Time only the encoding step, not data or model loading.
    start = perf_counter()
    output = model.encode(
        sentences=texts,
        show_progress_bar=True,
        use_multiprocessing=use_multiprocessing,
    )
    total_time = perf_counter() - start

    docs_per_second = len(texts) / total_time
    print(f"Processed {len(texts)} docs in {total_time:.3f}s.")
    print(f"Docs per second: {docs_per_second:.2f}")
    print("Output shape:", output.shape)


if __name__ == "__main__":
    print("Multiprocessing=False")
    main(use_multiprocessing=False)

    print("\nMultiprocessing=True")
    main(use_multiprocessing=True)

On my machine, this gives:

Multiprocessing=False
Processed 156289 docs in 34.067s.
Docs per second: 4587.69
Output shape: (156289, 256)
Multiprocessing=True
Processed 156289 docs in 8.516s.
Docs per second: 19533.39
Output shape: (156289, 256)

So multiprocessing is roughly 4x faster on this benchmark (34.067s vs. 8.516s).
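As a quick check, the speedup can be recomputed from the two wall-clock times in the output above. This is only arithmetic on the reported numbers; the variable names are illustrative.

```python
# Wall-clock times copied from the benchmark output above.
single_process_seconds = 34.067
multi_process_seconds = 8.516

# Speedup is the ratio of the slower run to the faster run.
speedup = single_process_seconds / multi_process_seconds
print(f"Speedup: {speedup:.2f}x")  # prints "Speedup: 4.00x"
```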

@Pringled Pringled requested a review from stephantul December 26, 2024 23:05

codecov bot commented Dec 26, 2024

Codecov Report

Attention: Patch coverage is 98.18182% with 1 line in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
model2vec/utils.py | 94.44% | 1 Missing ⚠️

Files with missing lines | Coverage Δ
model2vec/model.py | 96.52% <100.00%> (+0.49%) ⬆️
tests/test_model.py | 97.69% <100.00%> (+0.19%) ⬆️
model2vec/utils.py | 92.53% <94.44%> (+0.53%) ⬆️

@Pringled Pringled self-assigned this Dec 26, 2024
Pringled (Member, Author) commented:

The sample-count threshold above which multiprocessing is applied is currently set to 10k, based on the following plot. I think we can keep a safe margin, since the differences are relatively small around the inflection point of ~6k samples:

[Image: Screenshot 2024-12-27 at 17 06 41]
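A threshold like this could be applied roughly as follows. This is a minimal sketch using only the standard library, not the implementation merged in this PR: the names `MULTIPROCESSING_THRESHOLD`, `should_use_multiprocessing`, `split_into_batches`, `encode_batch`, and `encode_texts` are hypothetical, the strict `>` comparison and the batch size are assumptions, and `encode_batch` is a stand-in that returns text lengths instead of embeddings.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical cutoff mirroring the 10k figure discussed above.
MULTIPROCESSING_THRESHOLD = 10_000


def should_use_multiprocessing(n_samples, use_multiprocessing, threshold=MULTIPROCESSING_THRESHOLD):
    """Return True only when the caller opted in and the input exceeds the threshold."""
    return use_multiprocessing and n_samples > threshold


def split_into_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def encode_batch(batch):
    """Stand-in for the real per-batch encoding work (here: text lengths)."""
    return [len(text) for text in batch]


def encode_texts(texts, use_multiprocessing=True, batch_size=1024):
    """Encode texts, spreading batches over worker processes for large inputs."""
    batches = split_into_batches(texts, batch_size)
    if should_use_multiprocessing(len(texts), use_multiprocessing):
        # pool.map returns results in input order, so the output stays aligned.
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(encode_batch, batches))
    else:
        # Small inputs stay in-process to avoid worker start-up overhead.
        results = [encode_batch(batch) for batch in batches]
    return [value for batch in results for value in batch]
```

On platforms that start worker processes with the spawn method, the multiprocessing path needs to be called from under an `if __name__ == "__main__":` guard, as in the benchmark script above.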

@Pringled Pringled marked this pull request as ready for review December 27, 2024 16:33
stephantul (Member) left a comment:

Veryyyy nice.

@Pringled Pringled merged commit ecf022f into main Dec 27, 2024
6 checks passed
@Pringled Pringled deleted the add-multiprocessing branch December 27, 2024 17:43