Invalid unicode code point with flatten #1742

spease · 2020-02-26T11:35:54Z

It seems that if an unconsumed field's value has an invalid code point and flatten is specified it will trigger "invalid unicode code point".

I think - though I'm not sure - that it may be possible to get rid of the error for known fields by using serde_bytes. I'm still checking through the code I have. However, this leaves open the door to a new field adding invalid code points that can't be protected against by setting the field to use serde_bytes (because of course it isn't known).

Example:

use serde::{Deserialize, Serialize};

#[derive(Deserialize, Serialize)]
struct A;

#[derive(Deserialize, Serialize)]
struct B {
    #[serde(flatten)]
    a: A,
}

#[tokio::main]
pub async fn main() {
    serde_json::from_slice::<B>(b"{\"c\":\"\xc0\xaf\"}").unwrap();
}

Thanks.

The text was updated successfully, but these errors were encountered:

jonasbb · 2020-02-26T20:04:36Z

\xc0\xaf is not a valid UTF-8 sequence, however strings in JSON must always be valid UTF-8. The reason is, the encoding is not minimal. \xc0\xaf would be identical to / so the single byte encoding is required.
This problem is also not related to flatten, as the same error appears for

#[derive(Deserialize, Serialize)]
struct B {
   	c: String,
}

spease · 2020-02-28T00:52:12Z

I’m aware of this. The problem I’m trying to solve is that there is some legacy JSON I need to parse which can contain strings with URLs with attacks of this nature:
https://security.stackexchange.com/questions/48879/why-does-directory-traversal-attack-c0af-work

The advice I saw elsewhere is to use serde_bytes in conjunction with a u8-oriented type. After this I can sanitize the data into proper UTF-8. The concern/issue I have though is that I think if there’s ever a case where deserialize_any gets implicitly called it will serialized as a string rather than bytes, and consequently not return any of the JSON at all, losing what is effectively a valuable message.

I could preprocess the text before serde_json, my concern with doing that is that I only want to sanitize data within strings, outside of those it should be a legitimate parse error. This would imply writing a limited JSON parser myself, which seems a bit excessive.

spease · 2020-03-05T18:20:12Z

I think a new version just landed, because I came back to this ticket and I got an error about flatten only working with structs or maps that I don't remember.

If I change it to something like this:

use serde::{Deserialize, Serialize};

#[derive(Deserialize, Serialize)]
struct A {
    b: Option<String>,
}

#[derive(Deserialize, Serialize)]
struct B<'a> {
    #[serde(flatten)]
    a: A,
    c: &'a [u8]
}

#[tokio::main]
pub async fn main() {
    serde_json::from_slice::<B>(b"{\"c\":\"\xc0\xaf\"}").unwrap();
}

then I don't see the error anymore. So, I don't think it's flatten at least.

spease · 2020-03-05T20:44:08Z

Here we go. This doesn't work. In struct B, 'c' works fine. In struct A, 'c' fails.

use serde::{Deserialize, Serialize};

#[derive(Deserialize, Serialize)]
struct A<'a> {
    b: Option<String>,
    c: &'a [u8]
}

#[derive(Deserialize, Serialize)]
struct B<'a> {
    #[serde(borrow, flatten)]
    a: A<'a>,
}

#[tokio::main]
pub async fn main() {
    serde_json::from_slice::<B>(b"{\"c\":\"\xc0\xaf\"}").unwrap();
}

dtolnay · 2020-05-09T23:37:25Z

This is a duplicate of #1183. Allowing escape sequences that don't produce UTF-8 is a serde_json-specific feature that kicks in when running &[u8]'s Deserialize impl, and doesn't kick in when running the Deserialize impl that flatten uses.

spease · 2020-05-10T00:16:09Z

Thanks

spease closed this as completed Mar 5, 2020

spease reopened this Mar 5, 2020

dtolnay closed this as completed May 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Invalid unicode code point with flatten #1742

Invalid unicode code point with flatten #1742

spease commented Feb 26, 2020

jonasbb commented Feb 26, 2020

spease commented Feb 28, 2020 •

edited

Loading

spease commented Mar 5, 2020 •

edited

Loading

spease commented Mar 5, 2020

dtolnay commented May 9, 2020

spease commented May 10, 2020

Invalid unicode code point with flatten #1742

Invalid unicode code point with flatten #1742

Comments

spease commented Feb 26, 2020

jonasbb commented Feb 26, 2020

spease commented Feb 28, 2020 • edited Loading

spease commented Mar 5, 2020 • edited Loading

spease commented Mar 5, 2020

dtolnay commented May 9, 2020

spease commented May 10, 2020

spease commented Feb 28, 2020 •

edited

Loading

spease commented Mar 5, 2020 •

edited

Loading