Store span information/line & column numbers in deserialization #1811
Comments
This is not something that serde would handle; many deserializers don't even have a concept of bytes because their input is not in the form of bytes. But data formats can provide this functionality if they want. For example the toml crate exposes a |
Hello, Thank you for the link. Unfortunately this |
@jfrimmel I think it might be possible to make a I'm trying to figure out how I could retain span information from multiple file formats so that during later consumption of the input data I could tie failures back to line and column values for an editor to show problems. Validation during deserialization, as seen in toml-rs/toml-rs#236, doesn't work for my use-case. As an illustration of why, assume I have these two structs.
#[derive(Deserialize)]
struct Rule {
id: Spanned<String>,
outcome: Spanned<CatalogId>
}
#[derive(Deserialize)]
struct CatalogItem {
id: Spanned<CatalogId>,
name: Spanned<String>,
}
I want to deserialize the rules from one file and the catalog items from another, then create diagnostic messages for each rule where Without a standard Alternatively, a
struct Rule<S> {
id: WithSpan<String, S>,
outcome: WithSpan<CatalogId, S>,
}
struct CatalogItem<S> {
id: WithSpan<CatalogId, S>,
name: WithSpan<String, S>,
}
trait IntoErrorLocation {
fn to_range(&self, file_contents: &str) -> Option<Range>;
}
impl IntoErrorLocation for toml::Span {}
impl IntoErrorLocation for serde_json::Span {}
However, for that one I'm not sure if it would be possible to support deriving the |
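To make the `IntoErrorLocation` idea a bit more concrete, here is a minimal, self-contained sketch. Everything except the trait name and the `to_range` method is an assumption for illustration: `LineCol`, `ByteSpan`, `LineColSpan`, `line_col` and `describe` are hypothetical, and `Range` is taken to be `std::ops::Range<LineCol>`. Neither `toml` nor `serde_json` is used here.

```rust
use std::ops::Range;

/// 1-based line/column position (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LineCol {
    line: usize,
    column: usize,
}

trait IntoErrorLocation {
    fn to_range(&self, file_contents: &str) -> Option<Range<LineCol>>;
}

/// Hypothetical span for formats that report byte offsets.
struct ByteSpan {
    start: usize,
    end: usize,
}

/// Hypothetical span for formats that already track lines and columns.
struct LineColSpan {
    start: LineCol,
    end: LineCol,
}

/// Maps a byte offset to a line/column pair; `None` if the offset is
/// out of bounds or not on a character boundary.
fn line_col(source: &str, offset: usize) -> Option<LineCol> {
    if !source.is_char_boundary(offset) {
        return None;
    }
    let before = &source[..offset];
    let line = before.bytes().filter(|&b| b == b'\n').count() + 1;
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let column = before[line_start..].chars().count() + 1;
    Some(LineCol { line, column })
}

impl IntoErrorLocation for ByteSpan {
    fn to_range(&self, file_contents: &str) -> Option<Range<LineCol>> {
        if self.start > self.end {
            return None;
        }
        let start = line_col(file_contents, self.start)?;
        let end = line_col(file_contents, self.end)?;
        Some(start..end)
    }
}

impl IntoErrorLocation for LineColSpan {
    // The file contents are not needed when the span is already line-based.
    fn to_range(&self, _file_contents: &str) -> Option<Range<LineCol>> {
        Some(self.start..self.end)
    }
}

/// Diagnostic code can stay generic over the span type.
fn describe<S: IntoErrorLocation>(span: &S, file_contents: &str) -> String {
    match span.to_range(file_contents) {
        Some(r) => format!(
            "{}:{}-{}:{}",
            r.start.line, r.start.column, r.end.line, r.end.column
        ),
        None => "unknown location".to_string(),
    }
}
```

With this shape, code that consumes `Rule<S>` or `CatalogItem<S>` would only need `S: IntoErrorLocation` plus the original file contents to report a position, regardless of which format produced the span.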
@dtolnay not all formats use byte locations... but a lot of them do, and even the ones that don't will often have some kind of span information. What about adding a new method The technique used by |
The implementation of I think the simplest option for support on
fn tell(&self) -> Option<usize> { None }
This prevents us from supporting spans for formats that have a position representation other than byte offsets. A more generic solution might involve adding an associated type to
trait Deserializer<'de> {
...
type Position = ();
fn tell(&self) -> Self::Position;
}
This would also require adding a type parameter to
trait Deserialize<'de, P> {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where D: Deserializer<'de, Position=P>;
}
Then, the generic implementation for
struct Spanned<T, P> {
start: P,
end: P,
value: T,
}
impl<'de, T, P> Deserialize<'de, P> for Spanned<T, P> where
T: Deserialize<'de, P>
{
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where D: Deserializer<'de, Position=P>
{
let start = deserializer.tell();
let value = T::deserialize(deserializer)?;
let end = deserializer.tell();
Ok(Spanned { start, end, value })
}
}
This still bakes in the assumption that spans have a range with a start position and an end position. I played around with possible ways to make the span type itself generic. The only way I could find to make this work is to add a
struct Spanned<T, S> {
span: S,
value: T,
}
trait Deserializer<'de> {
...
type Span = ();
// We can have a default implementation of this that just doesn't record span info
fn deserialize_with_span<T>(&self) -> Result<Spanned<T, Self::Span>, Self::Error>
where T: Deserialize<'de, Self::Span>
{
Ok(Spanned {
span: (),
value: T::deserialize(self)?,
})
}
}
impl<'de, T, S> Deserialize<'de, S> for Spanned<T, S>
where T: Deserialize<'de, S>
{
// Then the Deserialize impl for Spanned just forwards to deserialize_with_span
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where D: Deserializer<'de, Span=S>
{
deserializer.deserialize_with_span()
}
}
The only option I could find that doesn't involve a breaking change is It might be useful here to look at the current set of serde formats and categorize them based on what span representations they could have. My hunch is that almost all formats either have byte-offset spans or don't have a notion of spans at all. There are probably a few ( |
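For readers who want to see the `tell()`-with-a-default idea in isolation, here is a toy sketch that does not use serde at all. The `Reader` trait, `TextReader`, `VecReader`, `skip_trivia` and `read_spanned` are hypothetical stand-ins invented for this example; only the `tell(&self) -> Option<usize> { None }` signature comes from the comment above.

```rust
use std::ops::Range;

/// Simplified stand-in for a deserializer. This is NOT serde's `Deserializer`.
trait Reader {
    /// Skips insignificant input (e.g. whitespace). No-op by default.
    fn skip_trivia(&mut self) {}
    /// Reads the next unsigned integer, if any.
    fn read_u32(&mut self) -> Option<u32>;
    /// Current byte offset. Defaults to `None` for position-less formats.
    fn tell(&self) -> Option<usize> {
        None
    }
}

#[derive(Debug, PartialEq)]
struct Spanned<T> {
    span: Option<Range<usize>>,
    value: T,
}

/// Text-based reader that knows its byte position.
struct TextReader<'a> {
    input: &'a str,
    pos: usize,
}

impl<'a> Reader for TextReader<'a> {
    fn skip_trivia(&mut self) {
        let rest = &self.input[self.pos..];
        self.pos += rest.len() - rest.trim_start().len();
    }

    fn read_u32(&mut self) -> Option<u32> {
        let rest = &self.input[self.pos..];
        let len = rest
            .find(|c: char| !c.is_ascii_digit())
            .unwrap_or(rest.len());
        let value = rest[..len].parse::<u32>().ok()?;
        self.pos += len;
        Some(value)
    }

    fn tell(&self) -> Option<usize> {
        Some(self.pos)
    }
}

/// Reader over in-memory values; it has no notion of byte offsets,
/// so it keeps the default `tell()`.
struct VecReader {
    values: std::vec::IntoIter<u32>,
}

impl Reader for VecReader {
    fn read_u32(&mut self) -> Option<u32> {
        self.values.next()
    }
}

/// Records the position before and after reading the value.
fn read_spanned<R: Reader>(reader: &mut R) -> Option<Spanned<u32>> {
    reader.skip_trivia();
    let start = reader.tell();
    let value = reader.read_u32()?;
    let end = reader.tell();
    // A span exists only if the reader reported both positions.
    let span = start.zip(end).map(|(s, e)| s..e);
    Some(Spanned { span, value })
}
```

Note that the toy borrows the reader mutably so that `tell()` can be called after the value has been read. serde's `Deserialize::deserialize` takes the deserializer by value, so a real design would need some other way to obtain the end position.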
A potential option that allows generic spans but doesn't break backwards compatibility would be to introduce a new
impl<'de, T, S> DeserializeWithSpan<'de, T, S> for T
where T: Deserialize<'de>
The big downside of this approach is that it introduces more complexity to serde's API surface purely for backwards compatibility reasons. Every Like I mentioned in the previous comment, it's possible to add support for a hardcoded span representation to serde, and then later add support for generic spans without breaking existing @dtolnay I'd be happy to write up a PR for this |
I have a very similar need to the originally submitted issue description and wondered @jfrimmel what you did in the end, or if you have any advice for someone wanting to do something very similar? |
Another alternative might be adding something like As a side note, currently serde_spanned forces span information to precede value information due to order of branches in I implemented a simple prototype to see what it would look like. It adds a new
pub trait ContextAccess<'de> {
type Error: Error;
fn span(&mut self) -> Result<Range<usize>, Self::Error>;
fn inner_value<V>(&mut self) -> Result<V, Self::Error>
where
V: Deserialize<'de>;
}
pub trait Visitor<'de>: Sized {
// [...]
fn visit_context<A>(self, context: A) -> Result<Self::Value, A::Error>
where
A: ContextAccess<'de>,
{
let _ = context;
Err(Error::invalid_type(Unexpected::Other("contextful value"), &self))
}
}
pub trait Deserializer<'de>: Sized {
// [...]
fn deserialize_context<V>(self, visitor: V) -> Result<V::Value, Self::Error>
where
V: Visitor<'de>,
{
let _ = visitor;
Err(Error::custom("contextful values are not supported"))
}
}
The interface for downstream libraries looks like this (prototyped here):
#[derive(Debug)]
struct ContextfulTrimAccess<'de> {
de: StrDeserializer<'de, Error>,
span: Range<usize>,
}
impl<'de> ContextAccess<'de> for ContextfulTrimAccess<'de> {
type Error = Error;
fn span(&mut self) -> Result<Range<usize>, Self::Error> {
Ok(self.span.clone())
}
fn inner_value<V>(&mut self) -> Result<V, Self::Error>
where
V: Deserialize<'de>,
{
V::deserialize(self.de)
}
}
#[derive(Debug)]
struct Spanned<T> {
inner: T,
span: Range<usize>,
}
impl<'de, T> Deserialize<'de> for Spanned<T>
where
T: Deserialize<'de>,
{
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: Deserializer<'de>,
{
struct SpannedVisitor<T>(PhantomData<T>);
impl<'de, T> Visitor<'de> for SpannedVisitor<T>
where
T: Deserialize<'de>,
{
type Value = Spanned<T>;
fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
write!(formatter, "a spanned value")
}
fn visit_context<A>(self, mut context: A) -> Result<Self::Value, A::Error>
where
A: ContextAccess<'de>,
{
Ok(Spanned {
inner: context.inner_value()?,
span: context.span()?,
})
}
}
deserializer.deserialize_context(SpannedVisitor(PhantomData))
}
}
This is just a simple working prototype (full diff). There are some further details that would likely need to be ironed out, provided that this looks like a reasonable/viable direction. @dtolnay WDYT? |
Could we reopen this for tracking? |
Hello everyone,
I'm currently developing an application that accepts (manually written) input files and deserializes them. Those input files may contain a broad range of errors that go beyond syntactic checks (e.g. if this field has a specific value, another one has to have the same specific value). This checking is already implemented and works a treat.
Unfortunately, the correlation between the input file and the deserialized in-memory representation is lost, i.e. there is no line or column information. That information would be really helpful for error messages, because one could point the user to the problematic part of the (potentially large) input file.
My idea is that one could add a special field to the deserialization structures that receives the byte position where the current field starts. I think that could be achieved via an additional custom attribute, like so:
Once you have the byte position, it is easy to get the line/column number (so serde doesn't need to count those).
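As a rough sketch of that conversion (not part of serde; `line_col` is a hypothetical helper name, and the sketch assumes `\n` line endings and counts columns in characters rather than bytes), the lookup could look like this:

```rust
/// Converts a byte offset into a 1-based (line, column) pair.
/// Returns `None` if `offset` is past the end of `source` or does not
/// fall on a UTF-8 character boundary.
fn line_col(source: &str, offset: usize) -> Option<(usize, usize)> {
    if !source.is_char_boundary(offset) {
        return None;
    }
    let before = &source[..offset];
    // The line number is the count of newlines before the offset, plus one.
    let line = before.bytes().filter(|&b| b == b'\n').count() + 1;
    // The column is the character distance from the start of the current line.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let column = before[line_start..].chars().count() + 1;
    Some((line, column))
}
```

This scans the input once per lookup, which is fine for a handful of error messages. For many lookups, precomputing a table of line start offsets and binary searching it would likely be the better trade-off.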
Question: is that something that is in scope for serde? Is such a thing even possible inside serde without support from the deserializers?
Or should the deserialization rather be done via a custom implementation?
If that is something that can be helpful for others, I can try to work on it.