Handling URLs that end with * #14

Closed
anjackson opened this issue Jun 13, 2018 · 2 comments

Comments

@anjackson
Contributor

In a wide crawl, we appear to be hitting URLs that end with *, which leads to queries to OutbackCDX that look like:

/dc?limit=1&sort=reverse&url=https%3A%2F%2Fhips.hearstapps.com%2Ftoc.h-cdn.co%2Fassets%2F16%2F46%2F3200x1600%2Flandscape-1479498518-cindy-crawford-rande-gerber-house.jpg%3Fresize%3D1200%3A*

The trailing * forces the matchType to PREFIX. This happens even if you specify a matchType parameter explicitly, and even if the * is encoded as %2A.
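A minimal Python sketch of why percent-encoding the * is unlikely to help on its own: a standard query-string parser decodes %2A back to * before any application code sees the value. This only illustrates generic URL decoding with `urllib.parse`; it assumes, but does not show, that OutbackCDX decodes the `url` parameter in a similar way.

```python
from urllib.parse import parse_qs, quote, urlsplit

# URL from the report above; note the trailing "*".
target = (
    "https://hips.hearstapps.com/toc.h-cdn.co/assets/16/46/3200x1600/"
    "landscape-1479498518-cindy-crawford-rande-gerber-house.jpg?resize=1200:*"
)

# quote() with safe="" percent-encodes every reserved character, including "*".
encoded = quote(target, safe="")
query = f"/dc?limit=1&sort=reverse&url={encoded}"

# A standard parser reverses the encoding, so the decoded value ends in "*" again.
decoded = parse_qs(urlsplit(query).query)["url"][0]

print(encoded.endswith("%2A"))
print(decoded.endswith("*"))
```

Both checks print `True`: the encoded form ends in %2A, yet the value a server-side parser recovers still ends in a literal *.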

For now, I'll work around it, but I'd like to know how best to handle this situation in future.

Thanks!

@ato ato closed this as completed in a2c4158 Jun 13, 2018
@anjackson
Contributor Author

👍

@ato
Member

ato commented Jun 13, 2018

Oops. Looks like that's a bit of a gotcha in the design of the CDX server API.

I've implemented the solution you alluded to. Specifying matchType=exact will now stop wildcards from being expanded.
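For clients, the fix above suggests always sending matchType=exact when looking up a single captured URL. Here is a rough sketch of building such a query in Python. The helper name `build_exact_query` is hypothetical, and the `dc` collection, `limit`, and `sort` values are simply copied from the example query in the report.

```python
from urllib.parse import urlencode


def build_exact_query(collection: str, url: str, limit: int = 1) -> str:
    """Build a CDX lookup path that asks for an exact match on `url`.

    Hypothetical helper: the parameter names follow the query in the report
    plus the matchType=exact parameter described in the fix.
    """
    params = {
        "url": url,  # urlencode() percent-encodes this, including any "*"
        "matchType": "exact",  # ask the server not to expand a trailing "*"
        "limit": limit,
        "sort": "reverse",
    }
    return f"/{collection}?{urlencode(params)}"


path = build_exact_query("dc", "https://example.org/image.jpg?resize=1200:*")
print(path)
```

The resulting path can be appended to whatever host and port the OutbackCDX instance is running on.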
