
Serialization and validation are slow #1296

Closed
x-tabdeveloping opened this issue Oct 17, 2024 · 17 comments · Fixed by #1365

Comments

@x-tabdeveloping
Collaborator

Since I'm working on the leaderboard right now, I regularly need to load and manipulate result objects.
In my work branch (#1235) we have the following hierarchy for result objects:

BenchmarkResults:
    - ModelResult
        - TaskResult
        - ...
    - ...

When I filter benchmark results based on given criteria (for instance, which languages the user is interested in), I always create a new instance of the class based on the query:

from pydantic import BaseModel

class BenchmarkResults(BaseModel):
    model_results: list[ModelResult]

    def filter_results(self, **criteria) -> "BenchmarkResults":
        new_model_results = ...  # filter self.model_results by the criteria
        # re-instantiating runs full Pydantic validation every time
        return type(self)(model_results=new_model_results)

This results in these objects getting re-validated by Pydantic every time they are created, which is very slow, and it happens in real time while users are interacting with the leaderboard.

Another issue is loading time. I cache results to disk and try to only load them from disk or from our repos when needed, but even just deserializing and validating the object from a local JSON file takes multiple minutes.
This will result in very slow cold starts and, again, a subpar user experience.

I'm wondering if there is a way for us to either fix these bottlenecks or somehow avoid running into them.

For serialization at least, I'd put my money on msgspec, which deserializes and validates schematized JSON objects 6.5x faster than Pydantic v2 with orjson.
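
For illustration, here's a minimal sketch of what a msgspec-based loader could look like (the Struct fields are hypothetical, simplified stand-ins for our real result schema):

import msgspec

# Hypothetical, simplified stand-ins for the real result models
class TaskResult(msgspec.Struct):
    task_name: str
    main_score: float

class ModelResult(msgspec.Struct):
    model_name: str
    task_results: list[TaskResult]

class BenchmarkResults(msgspec.Struct):
    model_results: list[ModelResult]

with open("results.json", "rb") as f:
    # decode + validate against the Struct schema in a single pass
    all_results = msgspec.json.decode(f.read(), type=BenchmarkResults)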

@isaac-chung
Collaborator

Can you please share a small, minimal reproducible example for either loading results or filtering benchmark results? I'd love to help look into this issue.

@x-tabdeveloping
Collaborator Author

@isaac-chung of course! I will ping you once the PR is merged and give you a more concrete example of this.

@x-tabdeveloping
Collaborator Author

PR merged! Here are examples of some things that could be sped up:

import mteb

# Takes a REALLY long time (this downloads things and validates them)
all_results = mteb.load_results()

# Just serialization also takes a really long time, it's gonna be a problem
# Especially loading is problematic for cold starts
all_results.to_disk("results.json")
all_results = mteb.BenchmarkResults.load_results("results.json")

# This filters results based on a selected benchmark;
# it's slow because of validation (probably)
benchmark = mteb.get_benchmark("MTEB(Multilingual)")
benchmark_results = benchmark.load_results(base_results=all_results)

# This filters results based on criteria
# Probably also slow because of validation
filtered_results = all_results.filter_tasks(languages=["eng", "deu", "dan"])

@isaac-chung
Collaborator

So far this script fails and there are a lot of validation errors like this. Is that expected?

Validation failed for ARCChallenge in BAAI/bge-base-en-v1.5 a5beb1e3e68b9ab74eb54cfd186867f64f240e1a: name 'task_result' is not defined

@x-tabdeveloping
Collaborator Author

Sorry about that. I have a fix in a new branch, will merge tomorrow morning

@x-tabdeveloping
Collaborator Author

I merged the fix, try now.

@isaac-chung
Collaborator

Rerunning on the latest main gives a bunch of validation errors as well. Are you seeing the same thing?
log.txt

@x-tabdeveloping
Collaborator Author

Most of the validation warnings come from mteb.load_results(); those are fine. The error you're getting is because I made an error in the code I wrote above :D
It should've been BenchmarkResults.from_disk() instead of BenchmarkResults.load_results().

@isaac-chung
Collaborator

Thanks! Here is a quick confirmation of the slowness. Using line_profiler, we get a per-line breakdown of the timings.

Note:

  • Timer unit: 1e-06 s
  • pydantic==2.6.2
  • pydantic-settings==2.2.1
  • pydantic_core==2.16.3

Hot run (90.27 seconds):

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     6                                           @profile                                                                        
     7                                           def main():                                                                     
     8                                               # Takes a REALLY long time (this downloads things and validates them)
     9         1   60075993.8    6e+07     66.5      all_results = mteb.load_results()                                           
    10                                                                                                                           
    11                                               # Just serialization also takes a really long time, it's gonna be a problem
    12                                               # Especially loading is problematic for cold starts                         
    13         1    2881084.2    3e+06      3.2      all_results.to_disk("results.json")                                         
    14         1   14670458.8    1e+07     16.3      all_results = mteb.BenchmarkResults.from_disk("results.json")               
    15                                                                                                                           
    16                                               # This filters results based on a selected benchmark                        
    17                                               # it's slow because of validation (probably)                                
    18         1         14.7     14.7      0.0      benchmark = mteb.get_benchmark("MTEB(Multilingual)")                        
    19         1   11647673.9    1e+07     12.9      benchmark_results = benchmark.load_results(base_results=all_results)        
    20                                                                                                                           
    21                                               # This filters results based on criteria                                    
    22                                               # Probably also slow because of validation                                  
    23         1     998277.1 998277.1      1.1      filtered_results = all_results.filter_tasks(languages=["eng", "deu", "dan"])

Cold run (101.20 seconds, with an empty ~/.cache/mteb):

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     6                                           @profile                                                                        
     7                                           def main():                                                                     
     8                                               # Takes a REALLY long time (this downloads things and validates them)
     9         1   69979607.1    7e+07     69.1      all_results = mteb.load_results()                                           
    10                                                                                                                           
    11                                               # Just serialization also takes a really long time, it's gonna be a problem
    12                                               # Especially loading is problematic for cold starts                         
    13         1    2673161.2    3e+06      2.6      all_results.to_disk("results.json")                                         
    14         1   15440685.8    2e+07     15.3      all_results = mteb.BenchmarkResults.from_disk("results.json")               
    15                                                                                                                           
    16                                               # This filters results based on a selected benchmark                        
    17                                               # it's slow because of validation (probably)                                
    18         1         11.4     11.4      0.0      benchmark = mteb.get_benchmark("MTEB(Multilingual)")                        
    19         1   12124962.5    1e+07     12.0      benchmark_results = benchmark.load_results(base_results=all_results)        
    20                                                                                                                           
    21                                               # This filters results based on criteria                                    
    22                                               # Probably also slow because of validation                                  
    23         1     984644.2 984644.2      1.0      filtered_results = all_results.filter_tasks(languages=["eng", "deu", "dan"])

@isaac-chung
Collaborator

isaac-chung commented Oct 26, 2024

Re: cold starts, how often does that happen? Isn't the leaderboard up most of the time?

Re: validation: Could we validate these results models at write time, and not at load time? When we load results, maybe we could use the model_construct method to create models without validation. (See the sketch at the end of this comment.)

Re: nested models, maybe we could try using TypedDict instead of nested models: https://docs.pydantic.dev/latest/concepts/performance/#use-typeddict-over-nested-models
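
For the model_construct route, here's a minimal sketch, assuming validation happens at write time so loading can skip it (from_disk_unvalidated and the simplified fields are hypothetical, not the current API):

import json
from pydantic import BaseModel

class TaskResult(BaseModel):
    task_name: str
    main_score: float

class BenchmarkResults(BaseModel):
    model_results: list[TaskResult]

    @classmethod
    def from_disk_unvalidated(cls, path: str) -> "BenchmarkResults":
        with open(path) as f:
            raw = json.load(f)
        # model_construct skips validation, but it doesn't recurse into
        # nested models, so nested objects have to be constructed explicitly
        tasks = [TaskResult.model_construct(**t) for t in raw["model_results"]]
        return cls.model_construct(model_results=tasks)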

@x-tabdeveloping
Collaborator Author

Cold starts happen if we're running on some provider that shuts down servers when they're not used. In that case the whole thing has to be spun up again. I only have experience with this on Google Cloud with some of my apps; I don't know exactly how often this would happen with HF Spaces.

@x-tabdeveloping
Collaborator Author

x-tabdeveloping commented Oct 26, 2024

Thanks for looking into this!! I'm not that familiar with Pydantic, so I'll have a look at these. But if your judgement indicates that this is a good idea, then don't hesitate. They seem like very reasonable options to me at first glance.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Oct 26, 2024

How easy is it to do a load with and without validation? (We could have the validation running on push to the results repo and then avoid validation otherwise.)

edit: docs

@KennethEnevoldsen
Contributor

Seems like there are decent gains to be had from simply upgrading pydantic to >2.7 (or even 2.9).

@isaac-chung
Collaborator

isaac-chung commented Oct 27, 2024

Re: validation: Could we validate these results models at write time, and not at load time? When we load results, maybe we could use the model_construct method to create models without validation.

and

How easy is it to do a load with and without validation? (We could have the validation running on push to the results repo and then avoid validation otherwise.)
edit: docs

Yeah - that'd be a good next step.

@x-tabdeveloping
Collaborator Author

It might also be worth it to look into what's actually making the leaderboard slow. I think I will run a profiler in the afternoon.

run.py:

from mteb.leaderboard import demo

demo.launch()

Run it under cProfile with:

python3 -m cProfile -o profiler.log run.py

@x-tabdeveloping
Collaborator Author

I will also try using model_construct with filters and maps then; something like the sketch below.
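
A sketch of what that could look like for the filter path (has_language here is a hypothetical helper standing in for the real matching logic):

from pydantic import BaseModel

class BenchmarkResults(BaseModel):
    model_results: list[ModelResult]

    def filter_tasks(self, languages: list[str] | None = None) -> "BenchmarkResults":
        new_results = [
            r for r in self.model_results
            if languages is None or r.has_language(languages)  # hypothetical helper
        ]
        # These objects were validated when first loaded, so model_construct
        # can reassemble them without re-running Pydantic validation.
        return type(self).model_construct(model_results=new_results)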
