Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved URIUtil.safeDecodePath(String) #9462

Closed
wants to merge 4 commits into from

Conversation

joakime
Copy link
Contributor

@joakime joakime commented Mar 3, 2023

More testing of Issue #9444 revealed other cases where we should be safer with our decoding.

This includes new tests for URIUtil.safeDecodePath(String), along with the ability to configure Utf8Appendable instances to use a replacement character instead of always throwing an exception in the case of bad UTF-8 sequences.

@joakime joakime requested review from gregw and sbordet March 3, 2023 16:46
@joakime joakime self-assigned this Mar 3, 2023
@joakime joakime added this to the 12.0.x milestone Mar 3, 2023
@joakime joakime linked an issue Mar 3, 2023 that may be closed by this pull request
buffer.append(bytes[i]);
}

assertEquals("ةة", buffer.toString());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case seems out of place for this PR.
But I thought I discovered a bug in Utf8Appendable with bidi unicode glyphs.
Turns out it works as expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this test.

Copy link
Contributor

@gregw gregw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to protect control characters.
I'm moderately sure we don't need extra protection of bad UTF-8
I'm pondering if we should ever fallback to ISO8859-1 and would like to work out why we do that or if anybody else does. Maybe time to stop it.

Comment on lines +615 to +616
if (!fallbackToIso88591)
throw new IllegalArgumentException("Not UTF-8, not decoding in ISO-8859-1", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't think we need this... unless an illegal UTF8 character can be decoded as / in iso8859-1, which I don't believe is possible... but even if it were, then we'd leave the / encoded and there would be nothing ambigous.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if anything, I'm dubious that we should ever fall back to iso8859-1

buffer.append(bytes[i]);
}

assertEquals("ةة", buffer.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this test.


@ParameterizedTest
@MethodSource("safeDecodePathSource")
public void testSafeDecodePath(String encodedPath, String decodedPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good test to have, but I think you are making too many things safe.

Copy link
Contributor

@sbordet sbordet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll defer to @gregw.

I think falling back to ISO 8859-1 is a thing of the past, as it seems to me that every new spec, everybody and everything is on UTF-8, so I won't do it.

Arguments.of("/foo%2fbar", "/foo%2Fbar"),
Arguments.of("/foo%252fbar", "/foo%252fbar"),
Arguments.of("/foo%3bbar", "/foo;bar"),
Arguments.of("/foo%3fbar", "/foo%3Fbar"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were the f in %2f capitalized?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because of our canonical path mechanism. We produce an encoded path that can be string compared. I.e. we take out all variations, leave only characters that need encoding encoded and uniform capitalization

@@ -81,6 +81,7 @@ public abstract class Utf8Appendable implements CharsetStringBuilder
};

private int _codep;
private boolean _throwOnInvalid = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this?
If we keep it, it cannot be mutable, should be set as final in the constructor and never modified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not. See my other PR that is taking another approach to solving this problem.

@joakime
Copy link
Contributor Author

joakime commented Mar 13, 2023

Closing as PR #9479 was the final version that was merged.

PR #9496 has some of the bits around removing the ISO-8859-1 fallback.

@joakime joakime closed this Mar 13, 2023
@joakime joakime deleted the fix/12.0.x/safedecodepath-testing branch March 13, 2023 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unexpected encoding in request.getPathInfo() with Jetty 12 beta 0
3 participants