Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store span information/line & column numbers in deserialization #1811

Closed
jfrimmel opened this issue May 15, 2020 · 9 comments
Closed

Store span information/line & column numbers in deserialization #1811

jfrimmel opened this issue May 15, 2020 · 9 comments

Comments

@jfrimmel
Copy link

Hello together,

I'm currently developing an application, that accepts (manually written) input files and deserializes them. Those input files may contain a broad range of errors, that exceed the syntactic checks (e.g. if this field as a specific value, another one has to have the same specific value). This checking is already implemented and works like a treat.
Unfortunately the correlation between input file and the deserialized in-memory representation is lost, i.e. there is no line or column information. Those would be really helpful for error messages, because one could point the user to the problematic part of the (potentially large) input file.

My idea is, that one can add a special field to the deserialization structures, that takes the byte position, where the current field starts. I think, that could be achieved via an additional custom attribute, like so:

#[derive(Deserialize)]
struct Users(Vec<User>);

#[derive(Deserialize)]
struct User {
    name: String,
    #[serde(byte_position)] // not a field in the input, but gets the position of the first byte of the `User` input
    position: usize,
}

Once you get the byte position, it is easy to get the line/column number (so serde needn't to count those).

Question: is that something, that is in scope of serde? Is such a thing even possible inside serde without support of the deserializers?
Or should the deserialzation rather been done via a custom implementation?

If that is something, that can be helpful for others, I can try to work on it.

@dtolnay
Copy link
Member

dtolnay commented May 16, 2020

This is not something that serde would handle; many deserializers don't even have a concept of bytes because their input is not in the form of bytes. But data formats can provide this functionality if they want. For example the toml crate exposes a toml::Spanned<T> type that you can wrap a field in to grab the byte position of the data during deserialization.
https://docs.rs/toml/0.5.6/toml/struct.Spanned.html

@dtolnay dtolnay changed the title [Enhancement] Store span information/line & column numbers in deserialization Store span information/line & column numbers in deserialization May 16, 2020
@jfrimmel
Copy link
Author

Hello,
thanks for your reply. So, that feature is out of scope of serde, which is perfectly fine for me (I wasn't sure, that's why I opened this issue first before working on it). I'll close this issue then.

Thank you for the link. Unfortunately this Spanned<T> type has to be provided by "each" deserializer/data format and there is no way to write a hypothetical serde_spanned crate, that provides such a type, unless I'm mistaken. Please correct me if this would indeed be possible 😉

@TedDriggs
Copy link
Contributor

@jfrimmel I think it might be possible to make a serde_spanned crate and then submit PRs to have the different deserializers use that type?

I'm trying to figure out how I could retain span information from multiple file formats so that during later consumption of the input data I could tie failures back to line and column values for an editor to show problems. Validation during deserialization, as seen in toml-rs/toml-rs#236, doesn't work for my use-case.

As an illustration of why, assume I have these two structs.

#[derive(Deserialize)]
struct Rule {
    id: Spanned<String>,
    outcome: Spanned<CatalogId>
}

#[derive(Deserialize)]
struct CatalogItem {
    id: Spanned<CatalogId>,
    name: Spanned<String>,
}

I want to deserialize the rules from one file and the catalog items from another, then create diagnostic messages for each rule where outcome isn't in the set of id values found in the catalog.

Without a standard Spanned<T> that's supported by each deserializer which supports span information, I can't think of a way to do this that doesn't leak the file format of the rules or catalog files into the struct definition. That's unfortunate, because it means without macros I don't think it's possible to even make a usable generic over this whole thing.

Alternatively, a WithSpan<T, S> struct would work, since then I could write:

struct Rule<S> {
    id: WithSpan<String, S>,
    outcome: WithSpan<CatalogId, S>,
}

struct CatalogItem<S> {
    id: WithSpan<CatalogId, S>,
    name: WithSpan<String, S>,
}

trait IntoErrorLocation {
    fn to_range(&self, file_contents: &str) -> Option<Range>;
}

impl IntoErrorLocation for toml::Span {}

impl IntoErrorLocation for serde_json::Span {}

However, for that one I'm not sure if it would be possible to support deriving the Deserialize impl, and handwriting those impls would defeat the goal of making this easy to add.

@Diggsey
Copy link

Diggsey commented Aug 2, 2021

@dtolnay not all formats use byte locations... but a lot of them do, and even the ones that don't will often have some kind of span information.

What about adding a new method tell() that would return the current location in some format determined by the Deserializer. The default implementation could return no information. For most formats this could be a byte location.

The technique used by toml makes it impossible to abstract over different formats and still provide location information.

@Benjamin-L
Copy link

Benjamin-L commented Oct 18, 2022

The implementation of toml::Spanned is kinda horrifying. It would be nice if there was a cleaner way to do this that didn't involve encoding the span information in-band using magic string constants.


I think the simplest option for support on serde's side would be adding a tell method to the Deserializer trait, like @Diggsey suggested, but having it either return a byte offset or None:

fn tell(&self) -> Option<usize> { None }

This prevents us from supporting spans for formats that have a position representation other than byte offsets. A more generic solution might involve adding an associated type to Deserializer, like this:

trait Deserializer<'de> {
    ...

    type Position = ();
    
    fn tell(&self) -> Self::Position;
}

This would require also adding a type parameter to Deserialize, which would be a breaking change:

trait Deserialize<'de, P> {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
        where D: Deserializer<'de, Position=P>;
}

Then, the generic implementation for Spanned would look like:

struct Spanned<T, P> {
    start: P;
    end: P;
    value: T;
}

impl<'de, T, P> Deserialize<'de, P> for Spanned<T, P> where
    T: Deserialize<'de, P>
{
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
        where D: Deserializer<'de, Position=P>
    {
        let start = deserializer.tell();
        let value = T::deserialize(deserializer)?;
        let end = deserializer.tell();
        Ok(Spanned { start, end, value })
    }
}

This still bakes in the assumption that spans have a range with a start position and an end position. I played around with possible ways to make the span type itself generic. The only way I could find to make this work is to add a deserialize_with_span method to Deserializer, which seems kinda at-odds with the existing API structure:

struct Spanned<T, S> {
    span: S;
    value: T;
}

trait Deserializer<'de> {
    ...

    type Span = ();
    
    // We can have a default implementation of this that just doesn't record span info
    fn deserialize_with_span<T>(&self) -> Result<Spanned<T, Self::Span>, Self::Error>
        where T: Deserialize<'de, Self::Span>
    {
        Ok(Spanned {
            span: (),
            value: T::deserialize(self)?,
        })
    }
}

impl<'de, T, S> Deserialize<'de, S> for Spanned<T, S>
    where T: Deserialize<'de, S>
{
    // Then the Deserialize impl for Spanned just forwards to deserialize_with_span
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
        where D: Deserializer<'de, Span=S>
    {
        deserialize.deserialize_with_span()
    }
}

The only option I could find that doesn't involve a breaking change is tell() -> Option<usize>. One possible path here could be to start by adding tell() -> Option<usize> now, then implement one of the more general versions if we ever do a serde 2.0.


It might be useful here to look at the current set of serde formats and categorize them based on what span representations they could have. My hunch is that almost all formats either have byte-offset spans or don't have a notion of spans at all. There are probably a few (envy maybe?) that support spans that aren't byte-offsets.

@Benjamin-L
Copy link

A potential option that allows generic spans but doesn't break backwards compatibility would be to introduce a new DeserializeWithSpan trait, with the extra type parameter, and leave the Deserialize trait as-is. We could then have a blanket impl:

impl<'de, T, S> DeserializeWithSpan<'de, T, S> for T
    where T: Deserialize<'de>

The big downside of this approach is that it introduces more complexity to serde's API surface purely for backwards compatibility reasons. Every Deserialize impl can be trivially transformed into a DeserializeWithSpan impl. In the end state, where spans are supported everywhere that they can be supported, the Deserialize trait is entirely redundant. Either way you do this, supporting generic spans requires churn throughout large parts of the existing serde ecosystem. Breaking backwards compatibility trades off a cleaner end state for a more costly migration, while introducing a new trait has a messier end state with an easier migration.

Like I mentioned in the previous comment, it's possible to add support for a hardcoded span representation to serde, and then later add support for generic spans without breaking existing Deserializer implementations that use the hardcoded representation. I think if we're going to do anything in terms of span support, the first step should probably be this. It won't work for all formats, but currently no formats have span support without resorting to wild hacks like what toml is doing. Formats with a span representation that isn't compatible with the representation we choose will remain unsupported, while formats with a compatible representation can now support spans when they couldn't before. In the future, we still have the option to extend support for formats with alternate span representations (and deal with all the associated tradeoffs).

@dtolnay I'd be happy to write up a PR for this

@andrewdavidmackenzie
Copy link

I have a very similar need to the originally submitted issue description and wondered @jfrimmel what you did in the end, or if you have any advice for someone wanting to do something very similar?

@ozars
Copy link

ozars commented Mar 10, 2024

Another alternative might be adding something like ContextAccess, giving a first class support similar to what serde_spanned is already doing through MapAccess in an indirect way.

As a side note, currently serde_spanned forces span information to precede value information due to order of branches in MapAccess. This requires some caching or lookahead for some data formats to figure out where the span ends before parsing the value.

I implemented a simple prototype to see what it would look like.

It adds a new ContextAccess trait similar to MapAccess, SeqAccess etc.:

pub trait ContextAccess<'de> {
    type Error: Error;

    fn span(&mut self) -> Result<Range<usize>, Self::Error>;

    fn inner_value<V>(&mut self) -> Result<V, Self::Error>
    where
        V: Deserialize<'de>;
}

Visitor trait is extended to support contextful visits:

pub trait Visitor<'de>: Sized {
    // [...]

    fn visit_context<A>(self, context: A) -> Result<Self::Value, A::Error>
    where
        A: ContextAccess<'de>,
    {
        let _ = context;
        Err(Error::invalid_type(Unexpected::Other("contextful value"), &self))
    }
}

Deserializer trait is also extended for support, with a default implementation to keep backward-compatibility (and for formats which won't support this):

pub trait Deserializer<'de>: Sized {
    // [...]

    fn deserialize_context<V>(self, visitor: V) -> Result<V::Value, Self::Error>
    where
        V: Visitor<'de>,
    {
        let _ = visitor;
        Err(Error::custom("contextful values are not supported"))
    }
}

The interface for downstream libraries look like this (protyped here):

#[derive(Debug)]
struct ContextfulTrimAccess<'de> {
    de: StrDeserializer<'de, Error>,
    span: Range<usize>,
}

impl<'de> ContextAccess<'de> for ContextfulTrimAccess<'de> {
    type Error = Error;

    fn span(&mut self) -> Result<Range<usize>, Self::Error> {
        Ok(self.span.clone())
    }

    fn inner_value<V>(&mut self) -> Result<V, Self::Error>
    where
        V: Deserialize<'de>,
    {
        V::deserialize(self.de)
    }
}

#[derive(Debug)]
struct Spanned<T> {
    inner: T,
    span: Range<usize>,
}

impl<'de, T> Deserialize<'de> for Spanned<T>
where
    T: Deserialize<'de>,
{
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        struct SpannedVisitor<T>(PhantomData<T>);

        impl<'de, T> Visitor<'de> for SpannedVisitor<T>
        where
            T: Deserialize<'de>,
        {
            type Value = Spanned<T>;

            fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
                write!(formatter, "a spanned value")
            }

            fn visit_context<A>(self, mut context: A) -> Result<Self::Value, A::Error>
            where
                A: ContextAccess<'de>,
            {
                Ok(Spanned {
                    inner: context.inner_value()?,
                    span: context.span()?,
                })
            }
        }

        deserializer.deserialize_context(SpannedVisitor(PhantomData))
    }
}

This is just a simple working protoype (full diff).

There are some further details that would likely need to be wrinkled out, provieded that this looks like a reasonable/viable direction.

@dtolnay WDYT?

@ozars
Copy link

ozars commented Mar 23, 2024

Could we reopen this for tracking?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

7 participants