Performance questions? #35
I tried doing some performance profiling, but I didn't find anything particularly interesting.
Hi :) First of all, thank you for Pydantic! Here are some ideas about allocations:

```rust
use std::fmt::Write;

macro_rules! truncate {
    ($out: expr, $value: expr) => {
        if $value.len() > 50 {
            write!($out, "{}...{}", &$value[0..25], &$value[$value.len() - 24..]);
        } else {
            $out.push_str(&$value);
        }
    };
}
```

Similarly, some `output.push_str(&format!(", input_type={}", type_));` could become `write!(output, ", input_type={}", type_);`. There are a few more places like:

```rust
let loc = self
    .location
    .iter()
    .map(|i| i.to_string())
    .collect::<Vec<String>>()
    .join(" -> ");
```

which could become:

```rust
let mut first = true;
for item in &self.location {
    if !first {
        output.push_str(" -> ");
    }
    first = false;
    match item {
        LocItem::S(s) => output.push_str(&s),
        LocItem::I(i) => output.push_str(itoa::Buffer::new().format(*i)),
    }
}
```

This would require the itoa crate though - it would be helpful in a few more places (ryu would help with floats as well).
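To illustrate the allocation difference, here is a self-contained sketch of the same truncation idea as a plain function. The helper name is hypothetical, and it assumes ASCII input since it slices by byte offset; real code would need to respect char boundaries:

```rust
use std::fmt::Write;

// Append `value` to `out`, truncating the middle of long strings.
// Using write! into the existing String avoids the temporary String
// that out.push_str(&format!(...)) would allocate.
fn push_truncated(out: &mut String, value: &str) {
    if value.len() > 50 {
        // 25 chars from the front, "...", 24 chars from the back.
        let _ = write!(out, "{}...{}", &value[0..25], &value[value.len() - 24..]);
    } else {
        out.push_str(value);
    }
}
```

Short values are copied verbatim; long values come out as a fixed 52-byte summary.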
Also, the signature implies an allocation, but it is not needed if e.g.

Are you ok with me submitting a PR for such things?

Static vs. dynamic dispatch: is there any particular reason for using

Inside

```rust
self.location = [location.clone(), self.location].concat();
```

one allocation could be avoided:

```rust
let mut new_location = location.clone();
new_location.extend_from_slice(&self.location);
self.location = new_location;
```

I'll take a look at other places during the weekend :)
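As a runnable illustration of that prepend pattern (the function name and `i32` element type are hypothetical stand-ins for the real location items):

```rust
// Prepend `prefix` to `tail`. Building a temporary with concat()
// allocates an extra Vec; cloning the prefix and extending it in
// place needs only one allocation.
fn prefix_in_place(prefix: &[i32], tail: &mut Vec<i32>) {
    let mut new_tail = prefix.to_vec(); // single allocation
    new_tail.extend_from_slice(tail);   // copies the old elements in
    *tail = new_tail;
}
```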
Thanks so much, this is really interesting. I didn't know allocation was particularly slow, I'll research it. My initial guess is that most of these things will have a pretty small effect - e.g. the truncate stuff is only called if you get an error and then print its repr. I don't want to add more dependencies unless they allow a significant improvement. The most significant thing is

A PR is very welcome.
Unfortunately using
I would like to help too (I've been keen on helping this project for some time 😃). I'll take a look, and if I see some low-hanging fruit I'll post (small) PRs - is that ok?
Amazing, yes please.
Yes, but my third priority (after safety and speed) is readability, particularly for Python developers - e.g. me. So if changes make the code more technically correct Rust but harder to read for novices, and have no other impact, I might be inclined to refuse them. Best to create some small PRs and I'll comment.
That's not just a stylistic choice - avoiding bounds checks means iterators can often be faster than index-based loops.
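A minimal sketch of the two styles (hypothetical functions, not from pydantic-core): both compute the same sum, but the iterator form has no index for the compiler to bounds-check.

```rust
// Index-based loop: each data[i] access is bounds-checked unless the
// optimizer can prove that i is always in range.
fn sum_indexed(data: &[u64]) -> u64 {
    let mut total = 0;
    for i in 0..data.len() {
        total += data[i];
    }
    total
}

// Iterator-based loop: no index at all, so no bounds check to elide,
// and it reads as idiomatic Rust.
fn sum_iter(data: &[u64]) -> u64 {
    data.iter().sum()
}
```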
Where is this?
If you just want to provide an interface, consider using a visitor rather than returning an iterator. This also avoids the boxing, but it doesn't really matter for performance I think. For example, you have:

```rust
impl<'data> ListInput<'data> for &'data PyList {
    fn input_iter(&self) -> Box<dyn Iterator<Item = &'data dyn Input> + 'data> {
        Box::new(self.iter().map(|item| item as &dyn Input))
    }
    // ...
}
```

But you could also accept a closure to run over each item:

```rust
impl<'data> ListInput<'data> for &'data PyList {
    fn visit(&self, visitor: impl FnMut(???)) {
        self.iter().for_each(visitor)
    }
}
```

Or if you prefer something more typed:

```rust
trait Validator {
    fn validate(???);
}

impl<'data> ListInput<'data> for &'data PyList {
    fn visit(&self, visitor: impl Validator) {
        self.iter().for_each(|item| visitor.validate(item))
    }
}
```
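To show the shape of the trade-off outside pyo3, here is a self-contained sketch using a plain `Vec`-backed stand-in for `PyList` (all names are hypothetical):

```rust
// Stand-in for the pyo3 list type: a "list input" over plain i32s.
struct FakeList(Vec<i32>);

impl FakeList {
    // Boxed-iterator style: one heap allocation per call, plus dynamic
    // dispatch on every .next() invocation.
    fn input_iter<'a>(&'a self) -> Box<dyn Iterator<Item = &'a i32> + 'a> {
        Box::new(self.0.iter())
    }

    // Visitor style: the closure is monomorphized, so there is no
    // allocation and no indirect call.
    fn visit(&self, mut visitor: impl FnMut(&i32)) {
        self.0.iter().for_each(|item| visitor(item));
    }
}
```

Both styles expose the same elements; the visitor just inverts control so the container drives the loop.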
I mainly see some papercuts, not any actual performance crimes. For example, in

```rust
let loc = if key_loc {
    vec![key.to_loc(), "[key]".to_loc()]
} else {
    vec![key.to_loc()]
};
for err in line_errors {
    errors.push(err.with_prefix_location(&loc));
}
```

But if you fix the signature, you can avoid the `Vec` allocations entirely:

```rust
if key_loc {
    let locs = [key.to_loc(), "[key]".to_loc()];
    for err in line_errors {
        errors.push(err.with_prefix_location(&locs));
    }
} else {
    let loc = key.to_loc();
    for err in line_errors {
        errors.push(err.with_prefix_location(std::slice::from_ref(&loc)));
    }
}
```

Note that the above comment holds true in general: never take `fn foo(blahs: &Vec<Blah>) {}` but this instead: `fn foo(blahs: &[Blah]) {}`.
```rust
fn set_ref(&mut self, name: &str, validator_weak: ValidatorWeak) -> PyResult<()> {
    unsafe { Arc::get_mut_unchecked(&mut self.validator_arc).set_ref(name, validator_weak) }
}
```

Functions like these are incredibly dangerous - they enable unsynchronized shared mutability. At least make
`Arc` is just for sharing things: if you want shared ownership, you will need it (or `Rc`).
Do you mean Rust's lifetime annotations?
Also, I recommend doing benchmarking as part of your CI workflows, so you can identify performance regressions (or improvements :) ) quickly.
cc @woodruffw
I got some really helpful suggestions from @hoodmane, noting them here before I forget:
This is a small thing, but

For example, this:

could become:

```rust
let new_method = class.getattr(intern!("__new__"))?;
```

...to save an allocation. I can make a small PR for these, if you'd like.

Edit: Actually it looks like one of these is already interned, so this would be a single-line change 🙂
Those are indeed micro-optimizations for some specific scenarios, but sometimes they can have a nice cumulative effect :) I.e. if there is an opportunity to avoid heap allocations I usually prefer to take it. Here is another small idea that may help a bit in some scenarios (at some cost of slowing down the others) - using a bitvector (e.g.
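A minimal sketch of the bitvector idea, assuming it would be used to track which fields have been seen during validation (the type and its role are hypothetical): a single `u64` mask is far cheaper than a `HashSet<usize>` for small, dense index sets.

```rust
// One bit per field index; supports up to 64 fields.
#[derive(Default)]
struct FieldsSeen(u64);

impl FieldsSeen {
    fn mark(&mut self, index: usize) {
        debug_assert!(index < 64);
        self.0 |= 1 << index;
    }
    fn contains(&self, index: usize) -> bool {
        self.0 & (1 << index) != 0
    }
    // How many fields have been marked so far.
    fn count(&self) -> u32 {
        self.0.count_ones()
    }
}
```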
@woodruffw thanks so much for the input. I think on this occasion this is superseded by #43. I'll look at all the other suggestions here when I have time.
posted on r/rust for visibility: https://www.reddit.com/r/rust/comments/ugniwv/creator_of_pydantic_is_looking_for_help_in_moving/
An individual allocation is pretty quick, but it can add up. You know how in Python you try to avoid patterns like:

```python
s = "Something"
for line in get_lines():
    s += "\n" + line
```

and instead do something like:

```python
s = "Something\n" + "\n".join(get_lines())
```

The main reason the first pattern is slow is that each iteration through the loop allocates a new string for `s`.

In Rust, you can mutate strings, and strings amortize reallocations (in the same way dicts do in Python), which means that

```rust
let mut s = String::from("Something\n");
for line in get_lines() {
    s.push_str(&line);
    s.push('\n');
}
```

is relatively performant, as it only has to reallocate roughly every time the string doubles in size. If you know the size of all your inputs, you can often make it even faster by allocating the right size up front for the string:

```rust
let lines = get_lines();
let mut s = String::with_capacity(lines.iter().map(|s| s.len() + 1).sum::<usize>() + "Something\n".len());
s.push_str("Something\n");
for line in lines {
    s.push_str(&line);
    s.push('\n');
}
```

(Note the `+ 1` per line to leave room for each `'\n'`.) In this case the string never needs to reallocate, since it never outgrows its initial allocation. As you said, readability for Python programmers is a high concern, so the performance boost of the
```rust
impl<'a> ValLineError<'a> {
    pub fn with_prefix_location(mut self, location: &Location) -> Self {
        if self.location.is_empty() {
            self.location = location.clone();
        } else {
            // TODO we could perhaps instead store "reverse_location" in the ValLineError, then reverse it in
            // `PyLineError` so we could just extend here.
            self.location = [location.clone(), self.location].concat();
        }
        self
    }
}
```

This one uses needless clones. Looking at the code, it seems that almost all usages involve taking a

```rust
impl<'a> ValLineError<'a> {
    pub fn with_prefix_location(mut self, location: LocItem) -> Self {
        self.location.insert(0, location);
        self
    }
}
```

The old code involved unconditionally cloning a vector, which is a strong hint that you should have taken it by value (cloning a vector also involves cloning each of its elements, and LocItem can contain a String, so that's actually 2-3 needless allocations per call). The new approach eliminates those temporary vectors entirely. If you expect that in the future you may need vectors longer than 2, or even ones of indeterminate length, you could consider taking an ArrayVec instead (see the arrayvec, smallvec or tinyvec crates; I recommend tinyvec since it has no unsafe code).

While I have not benchmarked the crate and don't know of specific issues, in general excessive allocations can significantly harm performance. For that reason I would try to avoid all allocations unless necessary: collecting into vectors, cloning `Box`es, needlessly returning `String`s. For example,
Unfortunately, there is no simple solution to this problem. It's something that should have been considered during the design phase. You are also returning a lot of boxed trait objects. Besides the extra allocations, those also cause extra indirection on method calls. Some of those trait objects probably can't be removed entirely, but some (where there are just a few implementing types, e.g. DictInput) can be converted into enums. That would let you do away both with the excessive boxing and with the indirect trait calls, and it would also somewhat simplify the API. If you want, I can make a PR with my suggestions.
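A sketch of the enum-instead-of-trait-object idea, with plain slices standing in for the real Python and JSON input types (all names are hypothetical): the match replaces the vtable, and no `Box` is needed.

```rust
// When only a couple of types implement a trait, an enum avoids both
// the Box allocation and the vtable indirection of dyn Trait.
enum DictInput<'a> {
    Python(&'a [(String, i32)]), // stand-in for a pyo3 dict
    Json(&'a [(String, i32)]),   // stand-in for parsed JSON
}

impl<'a> DictInput<'a> {
    // Static dispatch: the compiler sees both arms directly.
    fn len(&self) -> usize {
        match self {
            DictInput::Python(items) | DictInput::Json(items) => items.len(),
        }
    }
}
```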
For general performance ideas, The Rust Performance Book may be useful. It's pretty short, and full of suggestions about things like how to avoid unnecessary allocations.
From the readme:

This is not true, at least for jsonschema-rs. jsonschema-rs is also quite performant, so I think it may be worth checking how it achieves that. Also please take a look at this issue from the jsonschema-rs repo: "simd support?" (many ideas in it!)
As the author of

I'm not sure how much of it is applicable to this library, though it might be helpful.
Thanks both for your input, I'll definitely review the simd-support issue and think more about what you've said here @Stranger6667. @karolzlot, in terms of using jsonschema-rs (or any other JSON Schema library) directly in pydantic-core, I don't think it would ever be possible for the other reasons listed in the readme. Most of all:
@samuelcolvin Yes, I agree with the points you mentioned; directly using JSON Schema is not a good idea. What I mean is that jsonschema-rs may have implemented this well ("load dict and also load json string"), so it may be worth checking its code. Unfortunately I don't know Rust well enough to give any more specific tips.
Yes, agreed, thank you.
I'd say that is suboptimal there - it converts a Python object into
I'm going to close this issue as the codebase has moved a long way since this discussion. Thank you all for your input - it's been really useful and interesting. Any more feedback (probably best via new issues or PRs) is very welcome. A lot of the suggestions here have now been implemented, and from some quick checks the code base is around 2x faster than when I created this issue (as well as significantly more complex and powerful).
* improvements to `with_prefix_location` as suggested in #35
* avoid clone and better benches
I'm keen to "run onto the spike" and find any big potential performance improvements in pydantic-core while the API can be changed easily.
I'd therefore love anyone with experience of rust and/or pyo3 to have a look through the code and see if I'm doing anything dumb.
Particular concerns:

* the `cast_as` vs. `extract` issues described in "Add performance suggestions to docs" PyO3/pyo3#2278 were a bit scary, as I only found the solution by chance - are there any other similar issues with pyo3?
* is `input`, or parts of `input` (in the case of a dict/list/tuple/set etc.), copied when it doesn't need to be?
* could we use `PyObject` instead of `PyAny` (or vice versa) and improve performance?
* in `ListInput` and `DictInput` we do a totally unnecessary `map` - is this avoidable? Is this having a performance impact? Is there another way to give a general interface to the underlying datatypes that's more performant?
* I thought it was the `RwLock` that was causing the performance problems, but I managed to remove that (albeit in a slightly unsafe way) in "simplifying recursive references" #32 and it didn't make a difference. Is something else the problem? Could we remove `Arc` completely?

I'll add to this list if anything else comes to me.

More generally, I wonder if there are performance improvements that I'm not even aware of - "what you don't know, you can't optimise".
@pauleveritt @robcxyz