Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for wildcard routes: /foo/bar/{rest:.*} #110

Merged
merged 8 commits into from
Jul 1, 2021
Merged

Add support for wildcard routes: /foo/bar/{rest:.*} #110

merged 8 commits into from
Jul 1, 2021

Conversation

ahl
Copy link
Collaborator

@ahl ahl commented May 28, 2021

This will match paths such as /foo/bar and /foo/bar/baz.css

We can use this to serve static assets such as css, html, and image
files.

Express uses this syntax: /foo/bar/:rest(.*)
Dropwizard / Jersey / JAX-RS does: /foo/bar/{rest:.*}
Actix does: /foo/bar/{rest:.*}

This also adds support for a tag (unpublished = true) in the #[endpoint]
macro to omit certain endpoints from the OpenAPI output. This is both
useful for non-API endpoints and necessary in that OpenAPI doesn't
support any type of multi-segment matching of routes.

@ahl ahl requested a review from davepacheco May 28, 2021 22:09
Copy link
Contributor

@david-crespo david-crespo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

There are a number of spots where we're doing regex-ish logic with loops, which makes me wonder if there's a point where it starts to make sense to just use regexes. Based on this tiny benchmark I wrote for valid_identifier it doesn't seem like raw performance is a big distinguishing factor, though that doesn't say anything about crate bloat or other code organization questions, or for that matter, more complex regexes (which we should, of course, avoid).

https://gist.github.com/david-crespo/1ed0857caa91d37c0a8b9b47f0f124ed

loop                    time:   [15.361 ns]                  
loop invalid            time:   [11.025 ns]                          

regex                   time:   [23.126 ns]                   
regex invalid           time:   [15.915 ns]      

@david-crespo david-crespo mentioned this pull request Jun 2, 2021
}

/**
* `PathSegment` represents a segment in a URI path when the router is being
* configured. Each segment may be either a literal string or a variable (the
* latter indicated by being wrapped in braces.
* latter indicated by being wrapped in braces). Variables may consume a single
* /-delimited segment or several as defined by a regex (currently only `.*` is
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a couple of things that make me uncomfortable here but these fears might be specious.

One is that we might be conflating "consumes more than one path segment" with "is matched by a regular expression". I could imagine someone wanting to use regular expressions to match just a segment, as in /instances/{instance_id:[0-9]+}/status. (I'm not really saying we should try to support that, so much as: if we're trying to leave the door open to regular expressions in path segments, it's not obvious to me now that this shouldn't be possible.)

Another is that I think there's some non-trivial work that goes into correctly separating the path segments in HTTP. This is the kind of stuff I'm thinking about

* RFC 7230 HTTP/1.1 Syntax and Routing
* (particularly: 2.7.3 on normalization)
* RFC 3986 Uniform Resource Identifier (URI): Generic Syntax
* (particularly: 6.2.2 on comparison)
*
* TODO-hardening We should revisit this. We want to consider a few things:
* - whether our input is already (still?) percent-encoded or not
* - whether our returned representation is percent-encoded or not
* - what it means (and what we should do) if the path does not begin with
* a leading "/"
* - whether we want to collapse consecutive "/" characters (presumably we
* do, both at the start of the path and later)
* - how to handle paths that end in "/" (in some cases, ought this send a
* 300-level redirect?)
* - are there other normalization considerations? e.g., ".", ".."
*
* It seems obvious to reach for the Rust "url" crate. That crate parses
* complete URLs, which include a scheme and authority section that does
* not apply here. We could certainly make one up (e.g.,
* "http://127.0.0.1") and construct a URL whose path matches the path we
* were given. However, while it seems natural that our internal
* representation would not be percent-encoded, the "url" crate
* percent-encodes any path that it's given. Further, we probably want to
* treat consecutive "/" characters as equivalent to a single "/", but that
* crate treats them separately (which is not unreasonable, since it's not
* clear that the above RFCs say anything about whether empty segments
* should be ignored). The net result is that that crate doesn't buy us
* much here, but it does create more work, so we'll just split it
* ourselves.

Okay, I mostly punted on that, but the goal was to ensure that there's one place where the correct code would go. These feel like the sort of things that are fine to get wrong for a while and then really hose you when you run into the case where they're wrong. I'm a little worried that if we allow variables to match multiple segments, then the consumer will try to interpret this path themselves, and it will become really easy to do something that appears to work in some cases but breaks when the input is percent-encoded or something like that. (If percent-encoding is a real issue, I could imagine this behavior resulting in a vulnerability where somebody thinks they're rejecting requests that contain /.., but they're missing requests that contain '/%2e%2e', and this lets people read /etc/passwd or whatever. Again, this is potentially already broken today because all I have is a TODO, not a correct implementation, but it would suck for this problem to sprawl.) Does this go against the principle of "make it hard to do the wrong thing"?

We could potentially fix this by saying that instead of a String, you get an array of path segments or something like that.

Am I being too paranoid?

@@ -150,6 +153,14 @@ fn do_endpoint(
})
.unwrap_or_default();

let visible = if let Some(true) = metadata.unpublished {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test case around visible/unpublished? (Sorry if I missed it!)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is. The openapi tests include an unpublished endpoint. We can tell they're successful because no change was required for the output json.

I wasn't sure about the visible vs. unpublished disparity, but I thought that unpublished was the right setting for endpoints, but it felt weird to carry the implicit negative into the struct; let me know if you have suggestions for clarity here.

ahl added 5 commits June 23, 2021 14:24
This will match paths such as `/foo/bar` and `/foo/bar/baz.css`

We can use this to serve static assets such as css, html, and image
files.

Express uses this syntax: `/foo/bar/:rest(.*)`
Dropwizard / Jersey / JAX-RS does: `/foo/bar/{rest:.*}`
Actix does: `/foo/bar/{rest:.*}`

This also adds support for a tag (unpublished = true) in the #[endpoint]
macro to omit certain endpoints from the OpenAPI output. This is both
useful for non-API endpoints and necessary in that OpenAPI doesn't
support any type of multi-segment matching of routes.
modified map Deserializer to allow for compound values
@ahl
Copy link
Collaborator Author

ahl commented Jun 25, 2021

Based on a discussion with @davepacheco, I changed this around quite a bit. Rather than receiving the path in a String this now receives the wild card path in a Vec<String>. To do this, I modified the from_map() deserializing function to accommodate values that may have an iterator as the value whereas previous only a String was allowed.

I observed that consumers could use types for their APIs that were impossible to serialize at runtime when the API endpoint is invoked. To address this, I added more stringent type checking. It's still at runtime, but it happens during ApiDescription::register() so very early in the program.

This exposed areas where we were using types that could never work (e.g. Vec<u16>) in some tests.

There's still more work to be done here:

  • more/better type checking for APIs
  • compile-time type checking w/ the #[endpoint ... ] macro
  • better coordination between path values and path parameter type checking

But this felt like a significant step forward and I wanted to unblock work that will consume the wildcard path matching.

@ahl ahl requested a review from smklein June 25, 2021 07:05
pub enum PathSegment {
/** a path segment for a literal string */
Literal(String),
/** a path segment for a variable */
Varname(String),
VarnameSegment(String),
/** a path segment for a variable with a regex */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO would be helpful to describe why there are two Strings here (what does each represent)?

dropshot/src/router.rs Outdated Show resolved Hide resolved
Comment on lines 156 to 160
assert!(
pat == ".*",
"Only the pattern '.*' is currently supported"
);
PathSegment::VarnameRegEx(var.to_string(), pat.to_string())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you had a chance to look at BurntSushi's regex crate?

Although it probably wouldn't be necessary for a single pattern match, it might be a worthwhile investment before going much further down the "stringly-typed representation of regex patterns".

Also notable, a little mention in the documentation for RegexSet:

If one has hundreds or thousands of regexes to match repeatedly (like a URL router for a complex web application or a user agent matcher), then a regex set can realize huge performance gains.

Seems like a neat use-case :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading through more of this PR, it seems like we're doing a lot of special casing for "exactly the wildcard" - not just that this is the only permitted regex, but also things like checking that no segments exist after the wildcard.

Seems kinda like those special cases might be hard to undo later - as an example, is it possible to validate that the intersection of N regexes is empty for all possible input paths, to perform a similar collision detection?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm familiar with the regex crate. I don't follow your second comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, let me try to re-phrase.

Basically, when I see structures like this:

enum HttpRouterEdges<Context: ServerContext> {
/** Outgoing edges for literal paths. */
Literals(BTreeMap<String, Box<HttpRouterNode<Context>>>),
/** Outgoing edge for variable-named paths. */
VariableSingle(String, Box<HttpRouterNode<Context>>),
/** Outgoing edge that consumes all remaining components. */
VariableRest(String, Box<HttpRouterNode<Context>>),
}

(specifically the VariableRest part)

and checks like this:

| HttpRouterEdges::VariableRest(varname, _) => {
panic!(
"URI path \"{}\": attempted to register route \
for literal path segment \"{}\" when a route \
exists for variable path segment (variable \
name: \"{}\")",
path, lit, varname
);
}

I interpret that check as: "we'd like to stop someone from registering both /foo/{bar:.*} and /foo/baz" (as covered by tests below) - but in a more general sense, it seems like we could say "we'd like to prevent callers from registering multiple routes that overlap with one another".

When considering regexes beyond just wildcards, this is where my original question came up - we can maintain checks like this, where we handle only "full wildcard", but making the jump to any other type of regex seems like it'll significantly alter this implementation.

I guess I'm asking what direction the ship is steering towards:

  • Do we think pure wildcards will be the only supported regex for the foreseeable future?
  • Or do we expect other regexes to be supported? (This led to my question about "finding the intersection between multiple regexes", which seemed like the generalized solution to the "do the routes overlap" question).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I see. I'm going to change VarnameRegEx to VarnameWildcard to remove the false generalization.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I posted my second comment before I saw your response. I don't anticipate other regexes being supported any time soon.

dropshot/src/router.rs Show resolved Hide resolved
dropshot/src/from_map.rs Outdated Show resolved Hide resolved
dropshot/src/from_map.rs Outdated Show resolved Hide resolved
dropshot/src/router.rs Show resolved Hide resolved
Copy link
Contributor

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - to keep up with all other major changes, maybe worth a note in the CHANGELOG?

(I don't believe this is breaking, but it is a visible change)

@ahl ahl merged commit 41060ad into main Jul 1, 2021
@ahl ahl deleted the wild branch July 1, 2021 14:37
@smklein
Copy link
Contributor

smklein commented Jul 15, 2021

FYI, apparently this PR is causing test failures in Omicron, and preventing rolling forward:

oxidecomputer/omicron#154

Trivially repro'd in the Omicron repo by running ./target/debug/nexus omicron-nexus/examples/config.toml --openapi , which shows "Aborted (core dumped)".

I'll dig around a bit more, and see if I can root-cause

@smklein
Copy link
Contributor

smklein commented Jul 15, 2021

... weird, toggling the panic = "abort" in omicron/Cargo.toml to panic = "unwind" avoids the abort altogether (the openapi generated spec is printed without error).

@smklein
Copy link
Contributor

smklein commented Jul 15, 2021

Fixed by #122

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants