feat: Add multiprocessing #141

Pringled · 2024-12-26T23:04:45Z

Some simple code for benchmarking:

from time import perf_counter

from model2vec import StaticModel
from datasets import load_dataset

def main(use_multiprocessing: bool) -> None:
    # Load a single shard of English Wikipedia as the benchmark corpus.
    ds = load_dataset(
        "wikimedia/wikipedia",
        data_files="20231101.en/train-00000-of-00041.parquet",
    )["train"]
    texts = ds["text"]

    model = StaticModel.from_pretrained("minishlab/potion-base-8M")

    # Time only the encode call, not dataset or model loading.
    start = perf_counter()
    output = model.encode(
        sentences=texts,
        show_progress_bar=True,
        use_multiprocessing=use_multiprocessing,
    )
    total_time = perf_counter() - start

    docs_per_second = len(texts) / total_time
    print(f"Processed {len(texts)} docs in {total_time:.3f}s.")
    print(f"Docs per second: {docs_per_second:.2f}")
    print("Output shape:", output.shape)


if __name__ == "__main__":
    print("Multiprocessing=False")
    main(use_multiprocessing=False)

    print("\nMultiprocessing=True")
    main(use_multiprocessing=True)

On my machine, this gives:

Multiprocessing=False
Processed 156289 docs in 34.067s.
Docs per second: 4587.69
Output shape: (156289, 256)

Multiprocessing=True
Processed 156289 docs in 8.516s.
Docs per second: 19533.39
Output shape: (156289, 256)

So multiprocessing is roughly 4x faster on this benchmark.
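
The ratio can be checked from the elapsed times printed above. A minimal sketch; the `speedup` helper is hypothetical and only there for illustration:

```python
def speedup(baseline_seconds: float, parallel_seconds: float) -> float:
    """Ratio of single-process time to multiprocess time."""
    return baseline_seconds / parallel_seconds


# Elapsed times copied from the benchmark output above.
single_process_seconds = 34.067
multi_process_seconds = 8.516

print(f"Speedup: {speedup(single_process_seconds, multi_process_seconds):.2f}x")
# prints "Speedup: 4.00x"
```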

codecov · 2024-12-26T23:09:36Z

Codecov Report

Attention: Patch coverage is 98.18182% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
model2vec/utils.py	94.44%	1 Missing ⚠️

Files with missing lines	Coverage Δ
model2vec/model.py	`96.52% <100.00%> (+0.49%)`	⬆️
tests/test_model.py	`97.69% <100.00%> (+0.19%)`	⬆️
model2vec/utils.py	`92.53% <94.44%> (+0.53%)`	⬆️

Pringled · 2024-12-27T16:10:23Z

The threshold on the number of samples above which multiprocessing (MP) is applied is currently set to 10k, based on a benchmark plot. I think we can keep a safe margin, since the differences are relatively small around the inflection point of ~6k samples.
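
A minimal sketch of how such a threshold check might look. The names `MULTIPROCESSING_THRESHOLD` and `should_use_multiprocessing` are hypothetical and not taken from the model2vec source, and the strict comparison is an assumption:

```python
# Hypothetical names for illustration; not the actual model2vec implementation.
MULTIPROCESSING_THRESHOLD = 10_000  # the 10k value quoted in this comment


def should_use_multiprocessing(
    n_samples: int,
    use_multiprocessing: bool,
    threshold: int = MULTIPROCESSING_THRESHOLD,
) -> bool:
    """Return True only if the caller opted in and the input is large enough."""
    # Assumption: MP is applied only for strictly more than `threshold` samples.
    return use_multiprocessing and n_samples > threshold


# The benchmark corpus (156289 docs) is well above the threshold.
print(should_use_multiprocessing(156_289, use_multiprocessing=True))
# Around the ~6k inflection point the input stays on the single-process path.
print(should_use_multiprocessing(6_000, use_multiprocessing=True))
```

Keeping the threshold above the measured inflection point means small inputs never pay the process start-up cost, at the price of a slightly slower single-process run for inputs between ~6k and 10k samples.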

stephantul

Veryyyy nice.

model2vec/model.py

Added initial setup for multiprocessed encode

8248d18

Pringled requested a review from stephantul December 26, 2024 23:05

Updated imports

8b16998

Pringled self-assigned this Dec 26, 2024

Pringled added 3 commits December 27, 2024 12:08

Added multiprocessing for encode_as_sequence

a20ce34

Added multiprocessing threshold

345f590

Added multiprocessing threshold

fe3eccb

Pringled added 2 commits December 27, 2024 17:15

Updated tests

da07a53

Update

4f64888

Pringled marked this pull request as ready for review December 27, 2024 16:33

Added disable for tokenizers mp

cdc8fbe

stephantul approved these changes Dec 27, 2024

Resolved comments

f1c3d08

Pringled merged commit ecf022f into main Dec 27, 2024
6 checks passed

Pringled deleted the add-multiprocessing branch December 27, 2024 17:43

Pringled mentioned this pull request Dec 27, 2024

Multiprocess encoding for speed #139

Closed
