Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid unicode code point with flatten #1742

Closed
spease opened this issue Feb 26, 2020 · 6 comments
Closed

Invalid unicode code point with flatten #1742

spease opened this issue Feb 26, 2020 · 6 comments

Comments

@spease
Copy link

spease commented Feb 26, 2020

It seems that if an unconsumed field's value has an invalid code point and flatten is specified it will trigger "invalid unicode code point".

I think - though I'm not sure - that it may be possible to get rid of the error for known fields by using serde_bytes. I'm still checking through the code I have. However, this leaves open the door to a new field adding invalid code points that can't be protected against by setting the field to use serde_bytes (because of course it isn't known).

Example:

use serde::{Deserialize, Serialize};

#[derive(Deserialize, Serialize)]
struct A;

#[derive(Deserialize, Serialize)]
struct B {
    #[serde(flatten)]
    a: A,
}

#[tokio::main]
pub async fn main() {
    serde_json::from_slice::<B>(b"{\"c\":\"\xc0\xaf\"}").unwrap();
}

Thanks.

@jonasbb
Copy link
Contributor

jonasbb commented Feb 26, 2020

\xc0\xaf is not a valid UTF-8 sequence, however strings in JSON must always be valid UTF-8. The reason is, the encoding is not minimal. \xc0\xaf would be identical to / so the single byte encoding is required.
This problem is also not related to flatten, as the same error appears for

#[derive(Deserialize, Serialize)]
struct B {
   	c: String,
}

@spease
Copy link
Author

spease commented Feb 28, 2020

I’m aware of this. The problem I’m trying to solve is that there is some legacy JSON I need to parse which can contain strings with URLs with attacks of this nature:
https://security.stackexchange.com/questions/48879/why-does-directory-traversal-attack-c0af-work

The advice I saw elsewhere is to use serde_bytes in conjunction with a u8-oriented type. After this I can sanitize the data into proper UTF-8. The concern/issue I have though is that I think if there’s ever a case where deserialize_any gets implicitly called it will serialized as a string rather than bytes, and consequently not return any of the JSON at all, losing what is effectively a valuable message.

I could preprocess the text before serde_json, my concern with doing that is that I only want to sanitize data within strings, outside of those it should be a legitimate parse error. This would imply writing a limited JSON parser myself, which seems a bit excessive.

@spease
Copy link
Author

spease commented Mar 5, 2020

I think a new version just landed, because I came back to this ticket and I got an error about flatten only working with structs or maps that I don't remember.

If I change it to something like this:

use serde::{Deserialize, Serialize};

#[derive(Deserialize, Serialize)]
struct A {
    b: Option<String>,
}

#[derive(Deserialize, Serialize)]
struct B<'a> {
    #[serde(flatten)]
    a: A,
    c: &'a [u8]
}

#[tokio::main]
pub async fn main() {
    serde_json::from_slice::<B>(b"{\"c\":\"\xc0\xaf\"}").unwrap();
}

then I don't see the error anymore. So, I don't think it's flatten at least.

@spease spease closed this as completed Mar 5, 2020
@spease
Copy link
Author

spease commented Mar 5, 2020

Here we go. This doesn't work. In struct B, 'c' works fine. In struct A, 'c' fails.

use serde::{Deserialize, Serialize};

#[derive(Deserialize, Serialize)]
struct A<'a> {
    b: Option<String>,
    c: &'a [u8]
}

#[derive(Deserialize, Serialize)]
struct B<'a> {
    #[serde(borrow, flatten)]
    a: A<'a>,
}

#[tokio::main]
pub async fn main() {
    serde_json::from_slice::<B>(b"{\"c\":\"\xc0\xaf\"}").unwrap();
}

@spease spease reopened this Mar 5, 2020
@dtolnay
Copy link
Member

dtolnay commented May 9, 2020

This is a duplicate of #1183. Allowing escape sequences that don't produce UTF-8 is a serde_json-specific feature that kicks in when running &[u8]'s Deserialize impl, and doesn't kick in when running the Deserialize impl that flatten uses.

@dtolnay dtolnay closed this as completed May 9, 2020
@spease
Copy link
Author

spease commented May 10, 2020

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants