validURI characters #36

rodney757 · 2017-11-30T20:37:07Z

After doing some research on N2T and running various browser tests, here are the characters that should be allowed in URIs (I think we should constrain identifiers to an "allowed" set, rather than looking for "excluded" characters):

A-Z
a-z
0-9
-_.:=+

Everything else is unpredictable in an identifier, and either conflicts with ARK or DOI rules or would interfere with REST-based URL parsing systems. We should also examine this set to see how it behaves with the current set of identifiers in the system.

rodney757 · 2017-11-30T20:37:15Z

The following set, in addition to a-z;A-Z;0-9 works properly with N2T forwarding and does not interfere with ARK reserved meanings:
()_:=+

It turns out that - (dash) gets stripped by N2T (I didn't notice this before), and () (parentheses) are actually OK.

These two characters also work in conjunction with N2T but will have mangled interpretation if ever turned into ARK identifiers:
/.

As far as fixing previously accepted identifiers goes, the behaviour across N2T and any downstream REST services is pretty dicey and unpredictable when it comes to inserting encodings. It's far safer, based on my tests using curl, to use ONLY approved characters.
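
To make the rule concrete, here is a rough sketch of how a check against the approved set could look. It is only an illustration: the function name `classify_identifier` and the returned dictionary shape are made up for this example and are not part of any existing code, and it assumes the check is applied to the identifier string as a whole.

```python
import string

# Assumption: the approved set is ASCII letters and digits plus ()_:=+
# as listed in this comment. The dash is deliberately absent because
# N2T strips it.
APPROVED_CHARS = frozenset(string.ascii_letters + string.digits + "()_:=+")

# These pass through N2T but may be misinterpreted if the identifier
# is ever turned into an ARK, so they are reported separately.
ARK_RISKY_CHARS = frozenset("/.")


def classify_identifier(identifier: str) -> dict:
    """Report which characters fall outside the approved set.

    Hypothetical helper: returns the distinct risky and rejected
    characters, plus an overall "ok" flag that is True only when the
    identifier is non-empty and uses approved characters exclusively.
    """
    risky = sorted({c for c in identifier if c in ARK_RISKY_CHARS})
    rejected = sorted(
        {
            c
            for c in identifier
            if c not in APPROVED_CHARS and c not in ARK_RISKY_CHARS
        }
    )
    ok = bool(identifier) and not risky and not rejected
    return {"ok": ok, "risky": risky, "rejected": rejected}
```

Running something like this over the identifiers already in the system would show how many of them fall outside the set, without attempting to re-encode anything.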

rodney757 · 2017-11-30T20:37:25Z

Here is an explanation from the EZID team about dashes. In short, if we want to use dashes in suffixes through the N2T resolver we can't expect them to come out unscathed.

_Yes, it's more or less intentional in the sense that ARKs are defined so
that hyphens are "identity inert" (by analogy with phone numbers,
1-800-555-1212 should not be considered distinct from 18005551212).

I said "more or less" because I think normalization should be applied at
end points (eg, your receiving resolvers) rather than imposed by
intermediary resolvers (like n2t). The real reason it's happening in this
case is that n2t works by first looking up the identifier verbatim, and
failing to find it, it will then normalize according to the id type (eg,
ARK if it begins with "ark:") and look it up again.

Failing that second lookup, n2t applies a number of tricks to figure out
what to do with the id, but it applies them to the normalized identifier
and to its normalized parts. Hyphens are never touched in query strings,
but suffix parts to the left of a query string get normalized according to
ARK scheme rules before the suffix passthrough trick is applied.

So, unfortunately, to get ark ids working with the hyphen in them, either
your end resolvers would have to handle the hyphen-less forms (which I
think is the best long term strategy) or you'd have to register each
individual ark-with-hyphen in n2t._
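
If we go with the first option (end resolvers that accept the hyphen-less forms), a minimal sketch could look like the following. This is an assumption-laden illustration, not the actual N2T or ARK normalization: it only drops hyphens to the left of the query string, as described above, while the real ARK rules do more than that. The function names and the example identifier are made up.

```python
def strip_hyphens_outside_query(identifier: str) -> str:
    """Drop hyphens left of the first '?', leaving the query string alone.

    Hypothetical helper that mimics only the hyphen part of the
    normalization described in the EZID reply.
    """
    path, sep, query = identifier.partition("?")
    return path.replace("-", "") + sep + query


def build_index(records: dict) -> dict:
    """Key stored records by their hyphen-less form."""
    return {strip_hyphens_outside_query(k): v for k, v in records.items()}


def lookup(index: dict, incoming: str):
    """Find a record whether or not the incoming id still has hyphens."""
    return index.get(strip_hyphens_outside_query(incoming))


# Made-up identifier, for illustration only.
index = build_index({"ark:/99999/fk4-ab-cd": "record A"})
print(lookup(index, "ark:/99999/fk4abcd"))    # form forwarded after stripping
print(lookup(index, "ark:/99999/fk4-ab-cd"))  # original form
```

Both lookups reach the same record, which is the behaviour an end resolver needs if the resolver in front of it may or may not have removed the hyphens. Note that two stored identifiers differing only by hyphens would collide under this scheme.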
