Serialization and validation are slow #1296
Can you please share a minimal reproducible example of either loading results or filtering benchmark results? I'd love to help take a look into this issue.
@isaac-chung of course! I will ping you once the PR is merged and give you a more concrete example of this.
PR merged! Here are examples of some things that could be sped up:

```python
import mteb

# Takes a REALLY long time (this downloads things and validates them)
all_results = mteb.load_results()

# Just serialization also takes a really long time; it's going to be a problem.
# Loading is especially problematic for cold starts.
all_results.to_disk("results.json")
all_results = mteb.BenchmarkResults.load_results("results.json")

# This filters results based on a selected benchmark.
# It's slow because of validation (probably).
benchmark = mteb.get_benchmark("MTEB(multilingual)")
benchmark_results = benchmark.load_results(base_results=all_results)

# This filters results based on criteria.
# Probably also slow because of validation.
filtered_results = all_results.filter_tasks(languages=["eng", "deu", "dan"])
```
So far this script fails, and there are a lot of validation errors like this. Is that expected?
Sorry about that. I have a fix in a new branch; I will merge tomorrow morning.
I merged the fix, try now.
Rerunning on the latest |
Most of the validation warnings come from |
Thanks! Here is a quick confirmation of the slowness. Using Line Profiler, we get a per-line breakdown of the profile. Note:

- Hot run (90.27 seconds)
- Cold run (101.20 seconds, with empty
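For reference, a minimal line_profiler invocation along these lines, assuming `mteb.load_results` (from the snippet above) is the function being timed:

```python
from line_profiler import LineProfiler

import mteb

# Wrap the suspected hot function so each of its lines is timed individually.
profiler = LineProfiler()
profiled_load = profiler(mteb.load_results)

profiled_load()
profiler.print_stats()
```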
- Re: cold starts, how often does that happen? Isn't the leaderboard up most of the time?
- Re: validation, could we validate these result models at write time rather than at load time? When we load results, maybe we could use the `model_construct` method to create models without validation.
- Re: nested models, maybe we could try using TypedDicts instead of nested models: https://docs.pydantic.dev/latest/concepts/performance/#use-typeddict-over-nested-models
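A minimal sketch of the `model_construct` idea, assuming an already-trusted payload (the `ScoreEntry` model and its fields are made up for illustration, not the actual mteb result models):

```python
from pydantic import BaseModel

class ScoreEntry(BaseModel):
    task_name: str
    main_score: float

raw = {"task_name": "Banking77Classification", "main_score": 0.81}

# Normal path: model_validate runs full validation on every field.
validated = ScoreEntry.model_validate(raw)

# Trusted path: model_construct skips validation entirely, so it should
# only be used for data that was already validated at write time.
unvalidated = ScoreEntry.model_construct(**raw)
```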
Cold starts happen if we're running on a provider that shuts down servers when they're not in use; in that case the whole thing has to be spun up again. I only have experience with this on Google Cloud with some of my apps, so I don't know exactly how often this would happen with HF Spaces.
Thanks for looking into this! I'm not that familiar with Pydantic, so I'll have a look at these. But if your judgement is that this is a good idea, then don't hesitate. They seem like very reasonable options to me at first glance.
How easy is it to do a load with and without validation? (We could run validation on push to the results repo and then avoid validation otherwise.) edit: docs
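One way that split could look, as a sketch (the `load_task_results` helper and `TaskResult` model below are hypothetical, not part of mteb):

```python
import json

from pydantic import BaseModel

class TaskResult(BaseModel):
    task_name: str
    main_score: float

def load_task_results(path: str, validate: bool = True) -> list[TaskResult]:
    with open(path) as f:
        raw = json.load(f)
    if validate:
        # Strict path, e.g. run in CI on push to the results repo.
        return [TaskResult.model_validate(entry) for entry in raw]
    # Fast path for files that already passed validation upstream.
    return [TaskResult.model_construct(**entry) for entry in raw]
```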
Seems like there are decent gains to be had from simply upgrading pydantic to >2.7 (or even 2.9).
Yeah - that'd be a good next step.
It might also be worth looking into what's actually making the leaderboard slow. I think I will run a profiler in the afternoon.

run.py:

```python
from mteb.leaderboard import demo

demo.launch()
```
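As an aside, a minimal cProfile run along those lines, assuming `mteb.load_results()` (from the earlier snippet) is the call suspected of dominating startup time rather than the Gradio app itself:

```python
import cProfile
import pstats

import mteb

# Profile only the result-loading path; sort by cumulative time to see
# whether deserialization/validation dominates.
with cProfile.Profile() as profiler:
    mteb.load_results()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```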
I will also try using |
Since I'm working on the leaderboard right now, I regularly need to load and manipulate result objects.
In my work branch (#1235) we have the following hierarchy for result objects:
When I filter benchmark results based on given criteria (for instance, which languages the user is interested in), I always create a new instance of the class based on the query:
This results in these objects getting validated by Pydantic every time they are created, which is very slow, and this needs to happen in real time, while users are interacting with the leaderboard.
Another issue is loading time. I cache results to disk and try to only load them from disk or from our repos when needed, but even just deserializing and validating the object from a local JSON file takes multiple minutes.
This will result in very slow cold starts, and, again, a subpar user experience.
I'm wondering if there is a way for us to either fix these bottlenecks or somehow avoid running into them.
For serialization at least, I'd put my money on msgspec, which deserializes and validates JSON objects against a schema 6.5x faster than Pydantic v2 with orjson.
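Roughly what that would look like with msgspec (the `TaskResult` Struct below is a made-up stand-in for the real result models):

```python
import msgspec

# Hypothetical result schema; the real mteb result models carry many more fields.
class TaskResult(msgspec.Struct):
    task_name: str
    main_score: float
    languages: list[str]

payload = msgspec.json.encode(
    TaskResult(task_name="Banking77Classification", main_score=0.81, languages=["eng"])
)

# Decoding with type= both parses and validates against the Struct schema in one pass.
result = msgspec.json.decode(payload, type=TaskResult)
```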