Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor error messages for untagged enums ("error occurred in:") #28

Closed
emk opened this issue Jun 7, 2022 · 3 comments
Closed

Poor error messages for untagged enums ("error occurred in:") #28

emk opened this issue Jun 7, 2022 · 3 comments
Labels
wontfix This will not be worked on

Comments

@emk
Copy link
Contributor

emk commented Jun 7, 2022

openapi-interfaces generates some particularly confusing error messages. These look something like:

ERROR: error: components.interfaces: expected oneOf interface: unknown field `discriminator`, expected one of `description`, `title`, `oneOf`
error occurred in:
---
(entire interface's worth of YAML here)

Sometimes, we may get several of these errors nested inside of each other, with multiple error occurred in: lines showing larger and smaller chunks of context. This is awful, but it's surprisingly hard to fix. I'm creating this bug to so I remember why, and don't need to work it out from first principles every couple of years.

Untagged enums

An untagged enum uses #[serde(untagged)] to allow several different variants.

/// Custom interface type.
///
/// This is our main extension to OpenAPI. It allows specifying an object schema
/// in way that's less "validation-like" and more "type-like".
#[derive(Clone, Debug, Eq, PartialEq, Serialize)]
#[serde(untagged)]
#[allow(clippy::large_enum_variant)]
pub enum Interface {
    /// An interface that `$includes` another interface. We can't parse this
    /// until the inclusion has been computed.
    Includes(IncludesInterface),
    /// An type-union interface. This exists so it can use `$interface:
    /// ...#SameAsInterface` in a type union.
    OneOf(OneOfInterface),
    /// A fully-resolved interface definition.
    Basic(BasicInterface),
}

#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
#[serde(rename_all = "camelCase")]
pub struct IncludesInterface {
    /// This field is unique to this variant.
    #[serde(rename = "$includes")]
    base: String,

    /// This can be literally any value, which is treated as a JSON Merge patch
    /// against the base type.
    #[serde(flatten)]
    merge_patch: Map<String, Value>,
}

#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
#[serde(rename_all = "camelCase", deny_unknown_fields)]
pub struct OneOfInterface {
    /// This field is unique to this variant.
    one_of: Vec<InterfaceRef>,

    #[serde(default)]
    description: Option<String>,

    #[serde(default, skip_serializing_if = "Option::is_none")]
    title: Option<String>,
}

#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
#[serde(rename_all = "camelCase", deny_unknown_fields)]
pub struct BasicInterface {
    #[serde(default, skip_serializing_if = "BTreeMap::is_empty")]
    members: BTreeMap<String, Member>,

    // ...more
}

On the wire, these interfaces look like:

interface:
  SampleIncludesInterface:
    $includes: "BaseInterface"
    # Arbitrary unstructured YAML
  SampleOneOfInterface:
    oneOf: [...]
  SampleBasicInterface:
    members: { ... }

We can tell them apart by checking for $includes and oneOf.

serde parsing strategy

To parse Interface, serde attempts to parse each of the three variants, and it takes the first one which matches. If we expand the #[derive(Deserializer)] macro, we get:

// Recursive expansion of Deserialize! macro
// ==========================================

#[doc(hidden)]
#[allow(non_upper_case_globals,unused_attributes,unused_qualifications)]
const _:() = {
  #[allow(unused_extern_crates,clippy::useless_attribute)]
  extern crate serde as _serde;
  #[allow(unused_macros)]
  macro_rules!try {
    ($__expr:expr) = >{
      match$__expr {
        _serde::__private::Ok(__val) = >__val,_serde::__private::Err(__err) = >{
          return _serde::__private::Err(__err);    
          }
      }
    }
  }#[automatically_derived]
  impl < 'de>_serde::Deserialize< 'de>for Interface {
    fn deserialize<__D>(__deserializer:__D) -> _serde::__private::Result<Self,__D::Error>where __D:_serde::Deserializer< 'de> ,{
      let __content = try!(<_serde::__private::de::Content as _serde::Deserialize> ::deserialize(__deserializer));
      if let _serde::__private::Ok(__ok) = _serde::__private::Result::map(<IncludesInterface as _serde::Deserialize> ::deserialize(_serde::__private::de::ContentRefDeserializer:: <__D::Error> ::new(&__content)),Interface::Includes){
        return _serde::__private::Ok(__ok);
        
      }if let _serde::__private::Ok(__ok) = _serde::__private::Result::map(<OneOfInterface as _serde::Deserialize> ::deserialize(_serde::__private::de::ContentRefDeserializer:: <__D::Error> ::new(&__content)),Interface::OneOf){
        return _serde::__private::Ok(__ok);
        
      }if let _serde::__private::Ok(__ok) = _serde::__private::Result::map(<BasicInterface as _serde::Deserialize> ::deserialize(_serde::__private::de::ContentRefDeserializer:: <__D::Error> ::new(&__content)),Interface::Basic){
        return _serde::__private::Ok(__ok);
        
      }_serde::__private::Err(_serde::de::Error::custom("data did not match any variant of untagged enum Interface"))
    }
    }
  };

In pseudocode, this is:

const content = deserialize_as::<serde::__private::de::Content>(&input)?;
if let Ok(includes) = deserialize_as::<IncludesInterface>(&content)? {
    return Ok(Interface::Includes(includes));
}
if let Ok(oneOf) = deserialize_as::<OneOfInterface>(&content)? {
    return Ok(Interface::OneOf(oneOf));
}
if let Ok(basic) = deserialize_as::<BasicInterface>(&content)? {
    return Ok(Interface::Basic(basic));
}
Err(serde::de::Error::custom("data did not match any variant of untagged enum Interface"))

The code first parses the object as a value of type serde::__private::de::Content (see below). This loses line & column number information, and it looks something like:

    pub enum Content<'de> {
        Bool(bool),

        U8(u8),
        U16(u16),
        U32(u32),
        // ...
}

There are a few unfortunate things going on in the generated deserializer:

  1. If we can't match any interface, we print a generic error.
  2. Even if we have a value containing the key "oneOf", and the actual error occurs many levels down inside, we still fail to match OneOfInterface, and we fall back to the unhelpful generic error.
  3. We lose line number information for errors, which is incredibly frustrating in a large JSON file.

Things which don't work

  1. We can't use serde::__private::de::Content directly, because it's unstable and only allowed to be used in code generated by serde_derive.
  2. Even if we could use Content, it doesn't actually buy us anything, because it still loses line numbers.
  3. We can replace Content with serde_yaml::Value, which is basically the same thing. It also lacks line numbers.
  4. In fact, we could write our own Content type, and we'd still have trouble getting line numbers, because Deserialize is normally implemented in a format-independent way, expecting an arbitrary implementation of Deserializer. And many serde formats don't have line numbers, so Deserializer offers no generic way to get the current line number.
  5. Even if we could use some trickery to reach through Deserializer to get line numbers, serde_yaml doesn't provide them, anyways.
  6. The yaml_rust library used by serde_yaml actually does have some line number support. See Marker and mark. But we'd have to patch serde_rust to make this available.

Also, if we use a tagged enum like:

#[derive(Deserialize)]
#[serde(tag = "type")]
enum Foo {
    Bar { x: String },
    Baz { y: String },
}

...the generated code still goes through Content in some cases. So serde basically always follows the same strategy with enums in YAML-like formats that aren't "self-describing": Parse them to Content, lose line numbers, then re-parse Content.

Our general approach

We use a custom Deserialize implementation that parses the input to serde_yaml:Mapping, losing line number information. Then we use contains_key to manually check for each of our special keys. Once we figure out what kind of structure we're parsing, we parse it normally using deserialize_enum_helper to consume our serde_yaml:Mapping.

impl<'de> Deserialize<'de> for Interface {
    // Manually deserialize for slightly better error messages.
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        use serde_yaml::{Mapping, Value};

        // Parse it as raw YAML.
        let yaml = Mapping::deserialize(deserializer)?;

        // Look for `$includes`.
        let includes_key = Value::String(String::from("$includes"));
        let oneof_key = Value::String(String::from("oneOf"));
        if yaml.contains_key(&includes_key) {
            Ok(Interface::Includes(deserialize_enum_helper::<D, _>(
                "`$includes` interface",
                yaml,
            )?))
        } else if yaml.contains_key(&oneof_key) {
            Ok(Interface::OneOf(deserialize_enum_helper::<D, _>(
                "oneOf interface",
                yaml,
            )?))
        } else {
            Ok(Interface::Basic(deserialize_enum_helper::<D, _>(
                "interface",
                yaml,
            )?))
        }
    }
}

Drawbacks to this approach include:

  1. We lose line number information early, and we can't get it back without ridiculous amounts of work.
  2. We may have multiple levels of untagged enums nested inside each other.
  3. All our errors are passed through serde as raw strings, not structured errors. So it's hard to coordinate between multiple levels of tagged enums.

Possible fixes

  1. Any of Better message when failing to match any variant of an untagged enum serde-rs/serde#2157, Collect errors when deserializing untagged enums serde-rs/serde#1544 or this serde fork might allow us to get rid of the custom Deserialize. But that would still leave the loss of line-number information, and the confusing errors from nested untagged enum types.
  2. We could at least improve errors somewhat if we (ugh) encoded our errors in a machine-readable format, and used that information to detect when we were inside multiple levels of tagged enums. This would allow us to clean our error messages up slightly, at least.
@emk
Copy link
Contributor Author

emk commented Jun 7, 2022

Ah, I had completely forgotten the last alternative here! Many of the various serde bindings provide a Spanned type, as discussed in dtolnay/serde-yaml#201. This has been officially rejected from serde_yaml.

See https://docs.rs/toml/latest/toml/struct.Spanned.html, which looks fairly clean. And https://docs.rs/json-spanned-value/latest/json_spanned_value/, which appears to only work for spanned::Value(???), and which relies on a hacked Reader type that uses thread-local storage and forces 1-byte-at-a-time reads.

We could implement something inspired by toml::Spanned by forking serde_yaml, if we really wanted it that badly.

Note that to implement $include, we convert two parsed data structures to JSON, merge them using JSON Merge Patch, and then parse them again. Preserving line numbers through this process would require a custom YAML SpannedValue = Spanned<UnspannedValueWithSpannedChildren> type, and a quick reimplementation of Merge Patch (which is like 5 lines). But this would further require reading spans from two kinds of input:

  1. We'd need to be able to deserialize Spanned<T> from raw YAML.
  2. We'd need to be able to deserialize Spanned<T> from our custom SpannedValue type.

So this is possible, but it's a huge project.

@emk
Copy link
Contributor Author

emk commented Jun 7, 2022

@emk
Copy link
Contributor Author

emk commented Jun 7, 2022

OK, by inspecting the text of the error messages, and by adding two new custom Deserialize routines, we can transform this:

ERROR: error parsing merged "Widget"
  caused by: expected $ref, $interface or a schema with one of allOf, oneOf, or type: expected schema: data did not match any variant of untagged enum VecOrScalar
error occurred in:
---
type: foo

error occurred in:
---
type: foo

...into this:

ERROR: error parsing merged "Widget"
  caused by: unknown variant `foo`, expected one of `string`, `number`, `integer`, `object`, `array`, `boolean`, `null`
error occurred when expecting one of "string", "number", "integer", "object", "array", "boolean" or "null" in:
  ---
  foo
  ---
...which appeared when expecting schema in:
  ---
  type: foo
  ---

The most important part of this change is unknown variant 'foo', expected one of ..., which is much better than expected $ref, $interface or a schema with one of allOf, oneOf, or type: expected schema: data did not match any variant of untagged enum VecOrScalar.

These errors may still be hard to find a in big source file, but at least they should be a lot clearer. Sadly, line numbers are not on the roadmap.

@emk emk added the wontfix This will not be worked on label Jun 7, 2022
@emk emk closed this as completed in a0c67a9 Jun 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

1 participant