Employ Unidecode for path mangling #232

Open
mih opened this issue Mar 13, 2023 · 1 comment · May be fixed by #261

mih commented Mar 13, 2023

Rescuing #83 (comment)

If export mode should continue to be supported (#230), this is something to consider in order to deliver any meaningful outcome for filenames with non-Latin (non-ASCII) characters.

Unicode handling might even be needed outside export mode, if URL keys can contain such characters.

Confirmed:

The name field at the end has a format dependent on the backend. It is always the last field, and is prefixed with "--". Unlike other fields, it may contain "-" in its content. It should not contain newline characters or "/"; otherwise nearly anything goes. The "E" variants of hash keys include a filename extension after the hash.
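
To make the quoted key format concrete, here is a minimal sketch of isolating the name field. It assumes that the leading fields never contain "--", so the first "--" marks the start of the name. `split_key` and the example key are hypothetical and not taken from this code base.

```python
def split_key(key: str) -> tuple[str, str]:
    """Split a key into (leading fields, name) at the first "--"."""
    head, sep, name = key.partition("--")
    if not sep:
        raise ValueError(f"no name field found in key: {key!r}")
    return head, name


# Hypothetical key: the name contains "-" and a non-ASCII character.
head, name = split_key("URL--example.org-data-übung.txt")
print(head)            # leading fields
print(name)            # name field, "-" preserved
print(name.isascii())  # False: the name is not pure ASCII
```

Nothing in the format restricts the name to ASCII, which is why the handling cannot be limited to export mode.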

Unicode handling is needed uniformly.

Given that the mangle/unmangle_path() function pair aims to provide a reversible mapping, and a Unicode-to-ASCII conversion cannot possibly be reversible, we need a solution on top.

In principle this should be possible, because we never actually unmangle a path, but only use forward-mangling to match against a state reported by dataverse (code confirms no usage of unmangle_path() outside tests).
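
A small sketch of why a Unicode-to-ASCII conversion cannot be reversed. It uses only the standard library as a stand-in for Unidecode (whose output may differ in detail); `to_ascii` is a hypothetical helper, not part of this code base.

```python
import unicodedata


def to_ascii(name: str) -> str:
    """Lossy stand-in for a Unidecode-style transliteration (stdlib only)."""
    # Decompose accented characters, then drop everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


a = to_ascii("café.txt")
b = to_ascii("cafe.txt")
print(a, b, a == b)  # both inputs end up as the same ASCII name
```

Two different inputs collapse to one output, so no function can recover the original from the result. Forward-only matching still works as long as such collisions are acceptable or handled separately.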

mih changed the title from "Employ Unidecode for export mode path mangling" to "Employ Unidecode for path mangling" on Mar 13, 2023

christian-monch commented Mar 15, 2023

I think this issue should be fixed by PR #240.
PR #240 encodes all characters that are not in the supported Dataverse character set. This is done by an injective mapping, which means there are no collisions in encoded names: different unencoded names are always mapped to different encoded names.

A side effect of the injectivity of the mapping is that an encoded name could be decoded to yield the original name. As pointed out in #232 (comment), that currently has no application beyond the tests.
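
For illustration only, a minimal sketch of what an injective escape encoding can look like. This is not the scheme implemented in PR #240: the `SAFE` alphabet, the `-` escape marker, and the function names are all assumptions made for this example.

```python
import string

# Assumption: an illustrative "safe" alphabet, not the real Dataverse set.
SAFE = set(string.ascii_letters + string.digits + "._")
ESC = "-"  # escape marker; it is not in SAFE, so it is always escaped itself


def encode_name(name: str) -> str:
    """Replace every character outside SAFE by -HH per UTF-8 byte."""
    out = []
    for ch in name:
        if ch in SAFE:
            out.append(ch)
        else:
            out.extend(f"{ESC}{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


def decode_name(encoded: str) -> str:
    """Invert encode_name(); only meaningful for strings it produced."""
    buf = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == ESC:
            buf.append(int(encoded[i + 1:i + 3], 16))
            i += 3
        else:
            buf.append(ord(encoded[i]))
            i += 1
    return buf.decode("utf-8")


print(encode_name("café.txt"))  # -> caf-C3-A9.txt
print(encode_name("cafe.txt"))  # -> cafe.txt
```

Because the escape marker is itself escaped, every encoded string parses in exactly one way, so distinct names cannot collide and decoding comes for free.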

mih added this to the 1.1 release milestone on Mar 16, 2023
christian-monch linked a pull request on Mar 16, 2023 that will close this issue