validURI characters #36

Open
rodney757 opened this issue Nov 30, 2017 · 2 comments

Comments

@rodney757
Contributor

After some research on N2T and various browser tests, these should be the allowed characters for URIs. (I think we should constrain identifiers to an "allowed" set rather than looking for "excluded" characters.)

A-Z
a-z
0-9
-_.:=+

Everything else is unpredictable in an identifier, and either conflicts with ARK or DOI rules or would interfere with REST-based URL parsing systems. We should also examine this set to see how it behaves with the current set of identifiers in the system.
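
As a rough illustration of the allowlist approach, here is a minimal Python sketch that accepts an identifier only if every character is in the set listed above. The function name `is_valid_identifier` is hypothetical, and rejecting the empty string is an assumption on my part, not something decided in this issue.

```python
import re

# Allowlist proposed above: A-Z, a-z, 0-9 and -_.:=+
# (the dash is placed last in the class so it is read literally)
ALLOWED_ID = re.compile(r"[A-Za-z0-9_.:=+-]+")


def is_valid_identifier(identifier: str) -> bool:
    """Return True if every character is in the proposed allowed set.

    Hypothetical helper; the empty string is rejected by assumption.
    """
    return ALLOWED_ID.fullmatch(identifier) is not None
```

Checking against an allowed set means any character nobody thought about is rejected by default, which is the point of preferring it over an "excluded" list.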

@rodney757
Contributor Author

The following set, in addition to a-z, A-Z, and 0-9, works properly with N2T forwarding and does not interfere with ARK reserved meanings:
()_:=+

It turns out that - (dash) gets stripped by N2T (I didn't notice this before) and that () (parentheses) are actually OK.

These two characters also work with N2T but will have a mangled interpretation if the identifier is ever turned into an ARK:
/.

As for fixing previously accepted identifiers, the behaviour across N2T and any downstream REST services is pretty dicey and unpredictable when it comes to inserting encodings. From my tests using curl, it's far safer to use ONLY approved characters.
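
To see how the revised set behaves against identifiers already in the system, something like the following Python sketch could be used. It sorts the characters of an identifier into the groups described in this comment: safe, ARK-risky (`/` and `.`), and everything else. The names `audit_identifier`, `SAFE`, and `ARK_RISKY` are hypothetical, and nothing here calls N2T.

```python
import string

# Revised safe set from this comment: alphanumerics plus ()_:=+
SAFE = set(string.ascii_letters + string.digits + "()_:=+")
# Work with N2T, but would be misinterpreted in an ARK identifier
ARK_RISKY = set("/.")


def audit_identifier(identifier: str):
    """Return (risky, rejected): characters outside the safe set,
    each listed once in order of first appearance. Hypothetical helper."""
    risky, rejected = [], []
    for ch in identifier:
        if ch in SAFE:
            continue
        bucket = risky if ch in ARK_RISKY else rejected
        if ch not in bucket:
            bucket.append(ch)
    return risky, rejected
```

For example, `audit_identifier("a-b/c.d-e")` reports `/` and `.` as risky and `-` as rejected, since the dash is stripped by N2T.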

@rodney757
Contributor Author

Here is an explanation from the EZID team about dashes. In short, if we want to use dashes in suffixes through the N2T resolver we can't expect them to come out unscathed.

_Yes, it's more or less intentional in the sense that ARKs are defined so
that hyphens are "identity inert" (by analogy with phone numbers,
1-800-555-1212 should not be considered distinct from 18005551212).

I said "more or less" because I think normalization should be applied at
end points (eg, your receiving resolvers) rather than imposed by
intermediary resolvers (like n2t). The real reason it's happening in this
case is that n2t works by first looking up the identifier verbatim, and
failing to find it, it will then normalize according to the id type (eg,
ARK if it begins with "ark:") and look it up again.

Failing that second lookup, n2t applies a number of tricks to figure out
what to do with the id, but it applies them to the normalized identifier
and to its normalized parts. Hyphens are never touched in query strings,
but suffix parts to the left of a query string get normalized according to
ARK scheme rules before the suffix passthrough trick is applied.

So, unfortunately, to get ark ids working with the hyphen in them, either
your end resolvers would have to handle the hyphen-less forms (which I
think is the best long term strategy) or you'd have to register each
individual ark-with-hyphen in n2t._
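
For the first option (having our end resolvers handle the hyphen-less forms), a minimal Python sketch might look like this. It covers only the hyphen behaviour described in the email: hyphens to the left of a query string are dropped and the query string is left alone. Full ARK normalization is not attempted, and `normalize_hyphens`, `build_index`, and `resolve` are hypothetical names, not part of any existing resolver.

```python
def normalize_hyphens(identifier: str) -> str:
    """Drop hyphens left of the first '?'; leave any query string untouched."""
    head, sep, query = identifier.partition("?")
    return head.replace("-", "") + sep + query


def build_index(identifiers):
    """Map each hyphen-less form to the identifier as originally stored."""
    return {normalize_hyphens(i): i for i in identifiers}


def resolve(incoming: str, index):
    """Look up an incoming identifier by its hyphen-less form.

    Returns the stored identifier, or None if nothing matches.
    """
    return index.get(normalize_hyphens(incoming))
```

With this approach both `ark:/12345/ab-cd` and `ark:/12345/abcd` would resolve to the same stored record. One caveat under this assumption: two stored identifiers that differ only by hyphens would collide in the index, so existing data would need to be checked first.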
