
Serialization and validation are slow #1296

Closed
x-tabdeveloping opened this issue Oct 17, 2024 · 17 comments · Fixed by #1365

Comments

@x-tabdeveloping
Collaborator

Since I'm working on the leaderboard right now, I regularly need to load and manipulate result objects.
In my work branch (#1235) we have the following hierarchy for result objects:

BenchmarkResults:
    - ModelResult
        - TaskResult
        - ...
    - ...

When I filter benchmark results based on given criteria (for instance, which languages the user is interested in), I always create a new instance of the class based on the query:

from pydantic import BaseModel

class BenchmarkResults(BaseModel):
    model_results: list[ModelResult]

    def filter_results(self, **criteria) -> "BenchmarkResults":
        new_model_results = ...  # filter self.model_results by the criteria
        # re-instantiating runs full Pydantic validation every time
        return type(self)(model_results=new_model_results)

This results in these objects getting re-validated by Pydantic every time they are created, which is very slow, and it happens in real time while users are interacting with the leaderboard.

Another issue is loading time. I cache results to disk and try to only load them from disk or from our repos when needed, but even just deserializing and validating the object from a local JSON file takes multiple minutes.
This will result in very slow cold starts and, again, a subpar user experience.

I'm wondering if there is a way for us to either fix these bottlenecks or somehow avoid running into them.

For serialization at least, I'd put my money on msgspec, which deserializes and validates schematized JSON objects 6.5x faster than Pydantic v2 with orjson.
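
For illustration, here's a minimal sketch of what a msgspec-based loader could look like (the Struct fields are hypothetical, simplified stand-ins for our real result schema):

import msgspec

# Hypothetical, simplified stand-ins for the real result models
class TaskResult(msgspec.Struct):
    task_name: str
    main_score: float

class ModelResult(msgspec.Struct):
    model_name: str
    task_results: list[TaskResult]

class BenchmarkResults(msgspec.Struct):
    model_results: list[ModelResult]

with open("results.json", "rb") as f:
    # decode + validate against the Struct schema in a single pass
    all_results = msgspec.json.decode(f.read(), type=BenchmarkResults)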

@isaac-chung
Collaborator

Can you please share a small, minimal reproducible example for either loading results or filtering benchmark results? I'd love to help look into this issue.

@x-tabdeveloping
Collaborator Author

@isaac-chung of course! I will ping you once the PR is merged and give you a more concrete example of this.

@x-tabdeveloping
Collaborator Author

PR merged! Here are examples of some things that could be sped up:

import mteb

# Takes a REALLY long time (this downloads things and validates them)
all_results = mteb.load_results()

# Just serialization also takes a really long time, it's gonna be a problem
# Especially loading is problematic for cold starts
all_results.to_disk("results.json")
all_results = mteb.BenchmarkResults.load_results("results.json")

# This filters results based on a selected benchmark;
# it's slow because of validation (probably)
benchmark = mteb.get_benchmark("MTEB(Multilingual)")
benchmark_results = benchmark.load_results(base_results=all_results)

# This filters results based on criteria
# Probably also slow because of validation
filtered_results = all_results.filter_tasks(languages=["eng", "deu", "dan"])

@isaac-chung
Collaborator

So far this script fails and there are a lot of validation errors like this. Is that expected?

Validation failed for ARCChallenge in BAAI/bge-base-en-v1.5 a5beb1e3e68b9ab74eb54cfd186867f64f240e1a: name 'task_result' is not defined

@x-tabdeveloping
Collaborator Author

Sorry about that. I have a fix in a new branch, will merge tomorrow morning

@x-tabdeveloping
Collaborator Author

I merged the fix, try now.

@isaac-chung
Collaborator

Rerunning on the latest main gives a bunch of validation errors as well. Are you seeing the same thing?
log.txt

@x-tabdeveloping
Collaborator Author

Most of the validation warnings come from mteb.load_results(); those are fine. The error you're getting is because I made an error in the code I wrote above :D
It should've been BenchmarkResults.from_disk() instead of BenchmarkResults.load_results().

@isaac-chung
Collaborator

Thanks! Here is a quick confirmation of the slowness. Using line_profiler, we get a per-line breakdown of the timings.

Note:

  • Timer unit: 1e-06 s
  • pydantic==2.6.2
  • pydantic-settings==2.2.1
  • pydantic_core==2.16.3

Hot run (90.27 seconds):

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     6                                           @profile                                                                        
     7                                           def main():                                                                     
     8                                               # Takes a REALLY long time (this downloads things and validates them)
     9         1   60075993.8    6e+07     66.5      all_results = mteb.load_results()                                           
    10                                                                                                                           
    11                                               # Just serialization also takes a really long time, it's gonna be a problem
    12                                               # Especially loading is problematic for cold starts                         
    13         1    2881084.2    3e+06      3.2      all_results.to_disk("results.json")                                         
    14         1   14670458.8    1e+07     16.3      all_results = mteb.BenchmarkResults.from_disk("results.json")               
    15                                                                                                                           
    16                                               # This filters results based on a selected benchmark                        
    17                                               # it's slow because of validation (probably)                                
    18         1         14.7     14.7      0.0      benchmark = mteb.get_benchmark("MTEB(Multilingual)")                        
    19         1   11647673.9    1e+07     12.9      benchmark_results = benchmark.load_results(base_results=all_results)        
    20                                                                                                                           
    21                                               # This filters results based on criteria                                    
    22                                               # Probably also slow because of validation                                  
    23         1     998277.1 998277.1      1.1      filtered_results = all_results.filter_tasks(languages=["eng", "deu", "dan"])

Cold run (101.20 seconds, with an empty ~/.cache/mteb):

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     6                                           @profile                                                                        
     7                                           def main():                                                                     
     8                                               # Takes a REALLY long time (this downloads things and validates them)
     9         1   69979607.1    7e+07     69.1      all_results = mteb.load_results()                                           
    10                                                                                                                           
    11                                               # Just serialization also takes a really long time, it's gonna be a problem
    12                                               # Especially loading is problematic for cold starts                         
    13         1    2673161.2    3e+06      2.6      all_results.to_disk("results.json")                                         
    14         1   15440685.8    2e+07     15.3      all_results = mteb.BenchmarkResults.from_disk("results.json")               
    15                                                                                                                           
    16                                               # This filters results based on a selected benchmark                        
    17                                               # it's slow because of validation (probably)                                
    18         1         11.4     11.4      0.0      benchmark = mteb.get_benchmark("MTEB(Multilingual)")                        
    19         1   12124962.5    1e+07     12.0      benchmark_results = benchmark.load_results(base_results=all_results)        
    20                                                                                                                           
    21                                               # This filters results based on criteria                                    
    22                                               # Probably also slow because of validation                                  
    23         1     984644.2 984644.2      1.0      filtered_results = all_results.filter_tasks(languages=["eng", "deu", "dan"])

@isaac-chung
Collaborator

isaac-chung commented Oct 26, 2024

Re: cold starts, how often does that happen? Isn't the leaderboard up most of the time?

Re: validation: Could we validate these results models at write time, and not at load time? When we load results, maybe we could use the model_construct method to create models without validation. (See the sketch at the end of this comment.)

Re: nested models, maybe we could try using TypedDict instead of nested models: https://docs.pydantic.dev/latest/concepts/performance/#use-typeddict-over-nested-models
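
For the model_construct route, here's a minimal sketch, assuming validation happens at write time so loading can skip it (from_disk_unvalidated and the simplified fields are hypothetical, not the current API):

import json
from pydantic import BaseModel

class TaskResult(BaseModel):
    task_name: str
    main_score: float

class BenchmarkResults(BaseModel):
    model_results: list[TaskResult]

    @classmethod
    def from_disk_unvalidated(cls, path: str) -> "BenchmarkResults":
        with open(path) as f:
            raw = json.load(f)
        # model_construct skips validation, but it doesn't recurse into
        # nested models, so nested objects have to be constructed explicitly
        tasks = [TaskResult.model_construct(**t) for t in raw["model_results"]]
        return cls.model_construct(model_results=tasks)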

@x-tabdeveloping
Collaborator Author

Cold starts happen if we're running on some provider that shuts down servers when they're not used. In that case the whole thing has to be spun up again. I only have experience with this on Google Cloud with some of my apps; I don't know exactly how often this would happen with HF Spaces.

@x-tabdeveloping
Collaborator Author

x-tabdeveloping commented Oct 26, 2024

Thanks for looking into this!! I'm not that familiar with Pydantic, so I'll have a look at these. But if your judgement indicates that this is a good idea, then don't hesitate. They seem like very reasonable options to me at first glance.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Oct 26, 2024

How easy is it to do a load with and without validation? (We could have the validation running on push to the results repo and then avoid validation otherwise.)

edit: docs

@KennethEnevoldsen
Contributor

Seems like there are decent gains to be had from simply upgrading pydantic to >2.7 (or even 2.9).

@isaac-chung
Collaborator

isaac-chung commented Oct 27, 2024

Re: validation: Could we validate these results models at write time, and not at load time? When we load results, maybe we could use the model_construct method to create models without validation.

and

How easy is it to do a load with and without validation? (We could have the validation running on push to the results repo and then avoid validation otherwise.)
edit: docs

Yeah - that'd be a good next step.

@x-tabdeveloping
Collaborator Author

It might also be worth it to look into what's actually making the leaderboard slow. I think I will run a profiler in the afternoon.

run.py:

from mteb.leaderboard import demo

demo.launch()

Run it under cProfile with:

python3 -m cProfile -o profiler.log run.py

@x-tabdeveloping
Collaborator Author

I will also try using model_construct with filters and maps then; something like the sketch below.
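
A sketch of what that could look like for the filter path (has_language here is a hypothetical helper standing in for the real matching logic):

from pydantic import BaseModel

class BenchmarkResults(BaseModel):
    model_results: list[ModelResult]

    def filter_tasks(self, languages: list[str] | None = None) -> "BenchmarkResults":
        new_results = [
            r for r in self.model_results
            if languages is None or r.has_language(languages)  # hypothetical helper
        ]
        # These objects were validated when first loaded, so model_construct
        # can reassemble them without re-running Pydantic validation.
        return type(self).model_construct(model_results=new_results)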
