Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Permalinks in S3 storage identifiers: treatment of slashes #10771

Closed
vera opened this issue Aug 13, 2024 · 10 comments · Fixed by #10775
Closed

Permalinks in S3 storage identifiers: treatment of slashes #10771

vera opened this issue Aug 13, 2024 · 10 comments · Fixed by #10775
Labels
Milestone

Comments

@vera
Copy link
Contributor

vera commented Aug 13, 2024

After creating a dataset in a Dataverse configured for Permalink PIDs and S3 storage, I noticed that the dataset's storage identifier looks like this:

s3://https://example.com/mypermalinks//XYZ...

(where https://example.com/mypermalinks/ is the authority I have configured and XYZ... is the unique part of the identifier)

In this example, the Permalink PID that is displayed in the Dataverse UI is https://example.com/mypermalinks/XYZ....

Note that the storage identifier contains a double slash preceding the unique part, but the PID does not. Is this a bug in the storage identifier generation?

In my S3 management console as well as using the S3 command line tool s3cmd, the double slashes (both the "buggy" slash and the one in https://) are interpreted as a kind of "folder" structure that is weird to navigate.

Here is a snippet from the Amazon S3 docs about object key names:

The console uses the key name prefixes (Development/, Finance/, and Private/) and delimiter ('/') to present a folder structure.
...
Amazon S3 supports buckets and objects, and there is no hierarchy. However, by using prefixes and delimiters in an object key name, the Amazon S3 console and the AWS SDKs can infer hierarchy and introduce the concept of folders.
The Amazon S3 console implements folder object creation by creating a zero-byte object with the folder prefix and delimiter value as the key.
...

Characters that might require special handling

The following characters in a key name might require additional code handling and likely need to be URL encoded or referenced as HEX.
...

  • ...
  • Forward slash ("/")
  • ...

https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html

Is this creation of a folder structure in S3 intended? If yes, I think the folder structure should be created in a way that makes more sense (e.g. don't create a weird empty-name folder because of the double slash at the start of a URL (https://...)).

In any case, I think it might make sense to URL-encode some or all of the slashes inside Permalink PIDs when generating the storage identifier, as suggested by the Amazon docs (e.g. set a storage identifier like s3://https%3A%2F%2Fexample.com%2Fmypermalinks%2F%2FXYZ... (< this would be an option that wouldn't create any nested folder structure inside the S3 bucket)).

What do you think?

@vera vera added the Type: Suggestion an idea label Aug 13, 2024
@qqmyers
Copy link
Member

qqmyers commented Aug 13, 2024

FWIW: I haven't had time to look into this, but I don't think permalinks starting with https:// or having double slashes in general was considered during the design (versus having a base-url and having a URL form like DOIs with https://doi.org/ ). There are probably some parts of the code that assume the last // is a special separator in the storageidentifier, e.g. that everything before it is the bucket name, which I think would break with URLs as permalink identifiers.

@vera
Copy link
Contributor Author

vera commented Aug 14, 2024

I see. I actually didn't consider the base-url setting. Thanks for pointing that out.

I tried again to configure my Permalink PID provider using that setting. The storage identifier looks fine then (no double slashes). However, the permalink itself is now https://example.com/mypermalinks/citation?persistentId=perma:XYZ instead of https://example.com/mypermalinks/XYZ....

Is there an intended way to get permalinks that look like the latter?

@qqmyers
Copy link
Member

qqmyers commented Aug 14, 2024

Nothing implemented I think. The only use case we had for the initial development was for locally minted ones. In other places (e.g. external controlled vocab), we've set up settings with simple ways to substitute values in to create URLs. So here perhaps something like *.resolver-format={pid} (or "https://example.com/mypermalinks/{pid}" - if there are cases where always prepending the base-url doesn't work) where {pid} would be the bare numeric PID. That, or something similar, would be consistent with the earlier design discussions - I'd be happy to review any PR along those lines or discuss further if you think something else is needed.

@vera
Copy link
Contributor Author

vera commented Aug 14, 2024

I've just pushed a commit here with an idea of how the handling of the base-url setting could be changed:

vera@0ccde22

This would allow more flexible permalink formats, not limited to permalinks including the hardcoded "/citation?persistentId=" string. If a base-url is set, it is used as-is. (For example, this would allow my https://example.com/mypermalinks/{pid} permalinks.) Of course, if the intention is for the permalink to include the "/citation?persistentId=" string, it could be set as part of the base-url. If no base-url is set, this would default to the configured Dataverse host name + "/citation?persistentId=" (same as now).

I've also updated the PidUtilTest so you can see in the code how it would work.

(By the way, I think this would also fix a majority of the other issues we had with permalinks #10768 and #10769)

@qqmyers
Copy link
Member

qqmyers commented Aug 14, 2024

Thanks - looks good to me and simpler than what I suggested. Your solution assumes the PID will always be at the end of the URL to resolve it, which could be true in practice. If not, we can look for options if/when someone has that use case.

@johannes-darms
Copy link
Contributor

@vera When you create the PR could you update the documentation with a hint to not use / and most likely other special character for authority and shoulder as they are used for file paths to store datafiles.

@vera
Copy link
Contributor Author

vera commented Aug 14, 2024

Your solution assumes the PID will always be at the end of the URL to resolve it, which could be true in practice. If not, we can look for options if/when someone has that use case.

Sounds good.

The only other issue I see with my suggestion is, it requires updating the Permalink configuration for all installations that have a base-url configured. I don't know if we need to look out for that.

@qqmyers
Copy link
Member

qqmyers commented Aug 14, 2024

Yeah, probably rare at this point. Since this involves MicroProfile settings, which could be jvm-options or something else, I'd suggest just adding a release note in the PR and pointing out that anyone with a perma base-url configured needs to add ?citation... to it (or whatever the correct change is) when upgrading (and avoiding giving specific instructions for the various MicroProfile options or trying to automate the change in some way).

@qqmyers
Copy link
Member

qqmyers commented Aug 14, 2024

@vera When you create the PR could you update the documentation with a hint to not use / and most likely other special character for authority and shoulder as they are used for file paths to store datafiles.

If you want to go farther, I think it would make sense to actually consider these errors and block them.

I'd also suggest some warning about using chars that have special meaning in URLs - I think things like ? and & would break resolver URLs, XML special chars might break xml metadata exports, etc. Perhaps something vague like "In general, PermaLink authority/shoulder values should be alphanumeric. For other cases, admins may need to consider the potential impact of special characters in resolver URLs, exports, etc.

@vera
Copy link
Contributor Author

vera commented Aug 15, 2024

Just opened a PR: #10775

For now, we've just added a warning in the docs about special characters in authority/shoulder values, as you suggested.

@pdurbin pdurbin added this to the 6.4 milestone Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants