Add support for wildcard routes: `/foo/bar/{rest:.*}` #110

ahl · 2021-05-28T22:09:12Z

This will match paths such as /foo/bar and /foo/bar/baz.css

We can use this to serve static assets such as css, html, and image
files.

Express uses this syntax: /foo/bar/:rest(.*)
Dropwizard / Jersey / JAX-RS does: /foo/bar/{rest:.*}
Actix does: /foo/bar/{rest:.*}

This also adds support for a tag (unpublished = true) in the #[endpoint]
macro to omit certain endpoints from the OpenAPI output. This is both
useful for non-API endpoints and necessary in that OpenAPI doesn't
support any type of multi-segment matching of routes.

david-crespo

Looks good to me.

There are a number of spots where we're doing regex-ish logic with loops, which makes me wonder if there's a point where it starts to make sense to just use regexes. Based on this tiny benchmark I wrote for valid_identifier it doesn't seem like raw performance is a big distinguishing factor, though that doesn't say anything about crate bloat or other code organization questions, or for that matter, more complex regexes (which we should, of course, avoid).

https://gist.github.com/david-crespo/1ed0857caa91d37c0a8b9b47f0f124ed

loop                    time:   [15.361 ns]                  
loop invalid            time:   [11.025 ns]                          

regex                   time:   [23.126 ns]                   
regex invalid           time:   [15.915 ns]

davepacheco · 2021-06-03T21:06:28Z

dropshot/src/router.rs

 }

 /**
 * `PathSegment` represents a segment in a URI path when the router is being
 * configured. Each segment may be either a literal string or a variable (the
- * latter indicated by being wrapped in braces.
+ * latter indicated by being wrapped in braces). Variables may consume a single
+ * /-delimited segment or several as defined by a regex (currently only `.*` is


There's a couple of things that make me uncomfortable here but these fears might be specious.

One is that we might be conflating "consumes more than one path segment" with "is matched by a regular expression". I could imagine someone wanting to use regular expressions to match just a segment, as in /instances/{instance_id:[0-9]+}/status. (I'm not really saying we should try to support that, so much as: if we're trying to leave the door open to regular expressions in path segments, it's not obvious to me now that this shouldn't be possible.)

Another is that I think there's some non-trivial work that goes into correctly separating the path segments in HTTP. This is the kind of stuff I'm thinking about

dropshot/dropshot/src/router.rs

Lines 474 to 502 in e0201cf

* RFC 7230 HTTP/1.1 Syntax and Routing

* (particularly: 2.7.3 on normalization)

* RFC 3986 Uniform Resource Identifier (URI): Generic Syntax

* (particularly: 6.2.2 on comparison)

*

* TODO-hardening We should revisit this. We want to consider a few things:

* - whether our input is already (still?) percent-encoded or not

* - whether our returned representation is percent-encoded or not

* - what it means (and what we should do) if the path does not begin with

* a leading "/"

* - whether we want to collapse consecutive "/" characters (presumably we

* do, both at the start of the path and later)

* - how to handle paths that end in "/" (in some cases, ought this send a

* 300-level redirect?)

* - are there other normalization considerations? e.g., ".", ".."

*

* It seems obvious to reach for the Rust "url" crate. That crate parses

* complete URLs, which include a scheme and authority section that does

* not apply here. We could certainly make one up (e.g.,

* "http://127.0.0.1") and construct a URL whose path matches the path we

* were given. However, while it seems natural that our internal

* representation would not be percent-encoded, the "url" crate

* percent-encodes any path that it's given. Further, we probably want to

* treat consecutive "/" characters as equivalent to a single "/", but that

* crate treats them separately (which is not unreasonable, since it's not

* clear that the above RFCs say anything about whether empty segments

* should be ignored). The net result is that that crate doesn't buy us

* much here, but it does create more work, so we'll just split it

* ourselves.

Okay, I mostly punted on that, but the goal was to ensure that there's one place where the correct code would go. These feel like the sort of things that are fine to get wrong for a while and then really hose you when you run into the case where they're wrong. I'm a little worried that if we allow variables to match multiple segments, then the consumer will try to interpret this path themselves, and it will become really easy to do something that appears to work in some cases but breaks when the input is percent-encoded or something like that. (If percent-encoding is a real issue, I could imagine this behavior resulting in a vulnerability where somebody thinks they're rejecting requests that contain /.., but they're missing requests that contain '/%2e%2e', and this lets people read /etc/passwd or whatever. Again, this is potentially already broken today because all I have is a TODO, not a correct implementation, but it would suck for this problem to sprawl.) Does this go against the principle of "make it hard to do the wrong thing"?

We could potentially fix this by saying that instead of a String, you get an array of path segments or something like that.

Am I being too paranoid?

davepacheco · 2021-06-03T21:08:42Z

dropshot_endpoint/src/lib.rs

@@ -150,6 +153,14 @@ fn do_endpoint(
 })
 .unwrap_or_default();

+ let visible = if let Some(true) = metadata.unpublished {


Is there a test case around visible/unpublished? (Sorry if I missed it!)

There is. The openapi tests include an unpublished endpoint. We can tell they're successful because no change was required for the output json.

I wasn't sure about the visible vs. unpublished disparity, but I thought that unpublished was the right setting for endpoints, but it felt weird to carry the implicit negative into the struct; let me know if you have suggestions for clarity here.

This will match paths such as `/foo/bar` and `/foo/bar/baz.css` We can use this to serve static assets such as css, html, and image files. Express uses this syntax: `/foo/bar/:rest(.*)` Dropwizard / Jersey / JAX-RS does: `/foo/bar/{rest:.*}` Actix does: `/foo/bar/{rest:.*}` This also adds support for a tag (unpublished = true) in the #[endpoint] macro to omit certain endpoints from the OpenAPI output. This is both useful for non-API endpoints and necessary in that OpenAPI doesn't support any type of multi-segment matching of routes.

modified map Deserializer to allow for compound values

ahl · 2021-06-25T07:05:46Z

Based on a discussion with @davepacheco, I changed this around quite a bit. Rather than receiving the path in a String this now receives the wild card path in a Vec<String>. To do this, I modified the from_map() deserializing function to accommodate values that may have an iterator as the value whereas previous only a String was allowed.

I observed that consumers could use types for their APIs that were impossible to serialize at runtime when the API endpoint is invoked. To address this, I added more stringent type checking. It's still at runtime, but it happens during ApiDescription::register() so very early in the program.

This exposed areas where we were using types that could never work (e.g. Vec<u16>) in some tests.

There's still more work to be done here:

more/better type checking for APIs
compile-time type checking w/ the #[endpoint ... ] macro
better coordination between path values and path parameter type checking

But this felt like a significant step forward and I wanted to unblock work that will consume the wildcard path matching.

smklein · 2021-06-29T21:14:16Z

dropshot/src/router.rs

 pub enum PathSegment {
 /** a path segment for a literal string */
 Literal(String),
 /** a path segment for a variable */
- Varname(String),
+ VarnameSegment(String),
+ /** a path segment for a variable with a regex */


IMO would be helpful to describe why there are two Strings here (what does each represent)?

dropshot/src/router.rs

smklein · 2021-06-29T21:32:51Z

dropshot/src/router.rs

+ assert!(
+ pat == ".*",
+ "Only the pattern '.*' is currently supported"
+ );
+ PathSegment::VarnameRegEx(var.to_string(), pat.to_string())


Have you had a chance to look at BurntSushi's regex crate?

Although it probably wouldn't be necessary for a single pattern match, it might be a worthwhile investment before going much further down the "stringly-typed representation of regex patterns".

Also notable, a little mention in the documentation for RegexSet:

If one has hundreds or thousands of regexes to match repeatedly (like a URL router for a complex web application or a user agent matcher), then a regex set can realize huge performance gains.

Seems like a neat use-case :)

Reading through more of this PR, it seems like we're doing a lot of special casing for "exactly the wildcard" - not just that this is the only permitted regex, but also things like checking that no segments exist after the wildcard.

Seems kinda like those special cases might be hard to undo later - as an example, is it possible to validate that the intersection of N regexes is empty for all possible input paths, to perform a similar collision detection?

Yes, I'm familiar with the regex crate. I don't follow your second comment.

ah, let me try to re-phrase.

Basically, when I see structures like this:

dropshot/dropshot/src/router.rs

Lines 94 to 101 in ee7fc26

enum HttpRouterEdges<Context: ServerContext> {

/** Outgoing edges for literal paths. */

Literals(BTreeMap<String, Box<HttpRouterNode<Context>>>),

/** Outgoing edge for variable-named paths. */

VariableSingle(String, Box<HttpRouterNode<Context>>),

/** Outgoing edge that consumes all remaining components. */

VariableRest(String, Box<HttpRouterNode<Context>>),

}

(specifically the VariableRest part)

and checks like this:

dropshot/dropshot/src/router.rs

Lines 297 to 305 in ee7fc26

| HttpRouterEdges::VariableRest(varname, _) => {

panic!(

"URI path \"{}\": attempted to register route \

for literal path segment \"{}\" when a route \

exists for variable path segment (variable \

name: \"{}\")",

path, lit, varname

);

}

I interpret that check as: "we'd like to stop someone from registering both /foo/{bar:.*} and /foo/baz" (as covered by tests below) - but in a more general sense, it seems like we could say "we'd like to prevent callers from registering multiple routes that overlap with one another".

When considering regexes beyond just wildcards, this is where my original question came up - we can maintain checks like this, where we handle only "full wildcard", but making the jump to any other type of regex seems like it'll significantly alter this implementation.

I guess I'm asking what direction the ship is steering towards:

Do we think pure wildcards will be the only supported regex for the foreseeable future?

Or do we expect other regexes to be supported? (This led to my question about "finding the intersection between multiple regexes", which seemed like the generalized solution to the "do the routes overlap" question).

I think I see. I'm going to change VarnameRegEx to VarnameWildcard to remove the false generalization.

I posted my second comment before I saw your response. I don't anticipate other regexes being supported any time soon.

dropshot/src/router.rs

dropshot/src/from_map.rs

dropshot/src/router.rs

smklein

Looks good to me - to keep up with all other major changes, maybe worth a note in the CHANGELOG?

(I don't believe this is breaking, but it is a visible change)

smklein · 2021-07-15T16:18:47Z

FYI, apparently this PR is causing test failures in Omicron, and preventing rolling forward:

oxidecomputer/omicron#154

Trivially repro'd in the Omicron repo by running ./target/debug/nexus omicron-nexus/examples/config.toml --openapi , which shows "Aborted (core dumped)".

I'll dig around a bit more, and see if I can root-cause

smklein · 2021-07-15T17:18:07Z

... weird, toggling the panic = "abort" in omicron/Cargo.toml to panic = "unwind" avoids the abort altogether (the openapi generated spec is printed without error).

smklein · 2021-07-15T21:17:40Z

Fixed by #122

ahl requested a review from davepacheco May 28, 2021 22:09

david-crespo approved these changes Jun 1, 2021

View reviewed changes

david-crespo mentioned this pull request Jun 2, 2021

Serve static files #115

Closed

davepacheco reviewed Jun 3, 2021

View reviewed changes

ahl mentioned this pull request Jun 23, 2021

Fix handling of escape codes in paths #118

Merged

ahl added 5 commits June 23, 2021 14:24

fix handling of escape codes in paths

3048aaa

cleaned up; about to start trying stuff

0924c3f

I think this approach might work

53ecfc1

much more stringent type checking

ee7fc26

modified map Deserializer to allow for compound values

ahl force-pushed the wild branch from f4eeb34 to ee7fc26 Compare June 25, 2021 06:58

ahl requested a review from smklein June 25, 2021 07:05

smklein reviewed Jun 30, 2021

View reviewed changes

ahl added 2 commits June 30, 2021 10:15

feedback from sean

96e1816

add a test with a slash after the wildcard specifier

4f12b90

smklein approved these changes Jun 30, 2021

View reviewed changes

add change to log

1426cb9

ahl merged commit 41060ad into main Jul 1, 2021

ahl deleted the wild branch July 1, 2021 14:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for wildcard routes: `/foo/bar/{rest:.*}` #110

Add support for wildcard routes: `/foo/bar/{rest:.*}` #110

ahl commented May 28, 2021

david-crespo left a comment •

edited

Loading

davepacheco Jun 3, 2021

davepacheco Jun 3, 2021

ahl Jun 4, 2021

ahl commented Jun 25, 2021

smklein Jun 29, 2021

smklein Jun 29, 2021

smklein Jun 30, 2021

ahl Jun 30, 2021

smklein Jun 30, 2021

ahl Jun 30, 2021

ahl Jun 30, 2021

smklein left a comment

smklein commented Jul 15, 2021

smklein commented Jul 15, 2021 •

edited

Loading

smklein commented Jul 15, 2021

	* RFC 7230 HTTP/1.1 Syntax and Routing
	* (particularly: 2.7.3 on normalization)
	* RFC 3986 Uniform Resource Identifier (URI): Generic Syntax
	* (particularly: 6.2.2 on comparison)
	*
	* TODO-hardening We should revisit this. We want to consider a few things:
	* - whether our input is already (still?) percent-encoded or not
	* - whether our returned representation is percent-encoded or not
	* - what it means (and what we should do) if the path does not begin with
	* a leading "/"
	* - whether we want to collapse consecutive "/" characters (presumably we
	* do, both at the start of the path and later)
	* - how to handle paths that end in "/" (in some cases, ought this send a
	* 300-level redirect?)
	* - are there other normalization considerations? e.g., ".", ".."
	*
	* It seems obvious to reach for the Rust "url" crate. That crate parses
	* complete URLs, which include a scheme and authority section that does
	* not apply here. We could certainly make one up (e.g.,
	* "http://127.0.0.1") and construct a URL whose path matches the path we
	* were given. However, while it seems natural that our internal
	* representation would not be percent-encoded, the "url" crate
	* percent-encodes any path that it's given. Further, we probably want to
	* treat consecutive "/" characters as equivalent to a single "/", but that
	* crate treats them separately (which is not unreasonable, since it's not
	* clear that the above RFCs say anything about whether empty segments
	* should be ignored). The net result is that that crate doesn't buy us
	* much here, but it does create more work, so we'll just split it
	* ourselves.

	enum HttpRouterEdges<Context: ServerContext> {
	/** Outgoing edges for literal paths. */
	Literals(BTreeMap<String, Box<HttpRouterNode<Context>>>),
	/** Outgoing edge for variable-named paths. */
	VariableSingle(String, Box<HttpRouterNode<Context>>),
	/** Outgoing edge that consumes all remaining components. */
	VariableRest(String, Box<HttpRouterNode<Context>>),
	}

	\| HttpRouterEdges::VariableRest(varname, _) => {
	panic!(
	"URI path \"{}\": attempted to register route \
	for literal path segment \"{}\" when a route \
	exists for variable path segment (variable \
	name: \"{}\")",
	path, lit, varname
	);
	}

Add support for wildcard routes: /foo/bar/{rest:.*} #110

Add support for wildcard routes: /foo/bar/{rest:.*} #110

Conversation

ahl commented May 28, 2021

david-crespo left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ahl commented Jun 25, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smklein left a comment

Choose a reason for hiding this comment

smklein commented Jul 15, 2021

smklein commented Jul 15, 2021 • edited Loading

smklein commented Jul 15, 2021

Add support for wildcard routes: `/foo/bar/{rest:.*}` #110

Add support for wildcard routes: `/foo/bar/{rest:.*}` #110

david-crespo left a comment •

edited

Loading

smklein commented Jul 15, 2021 •

edited

Loading