Mangle directoryLabel on export as necessary to meet dataverse naming rules #83

Closed
jernsting opened this issue Jun 17, 2022 · 5 comments · Fixed by #231
Comments

@jernsting
Collaborator

When exporting data (exporttree=yes), dataverse seems to remove the leading "." from hidden directories in directoryLabel. Any ideas how to handle this?

Other leading characters, such as an underscore, work.

@mih mih changed the title from "Fix exporttree functions" to "Mangle directory names on export as necessary to meet dataverse naming rules" on Jul 4, 2022
@mih
Member

mih commented Jul 4, 2022

As stated in #17 (comment) I think we should implement directory name mangling/unmangling for exports. Even if dataverse fixes IQSS/dataverse#8807 for the immediate .datalad, we cannot force users to name their directories according to dataverse rules (they might have to meet completely unrelated naming standards).

@mih mih added the bug Something isn't working label Jul 7, 2022
@bpoldrack bpoldrack self-assigned this Jul 19, 2022
bpoldrack added a commit to bpoldrack/datalad-dataverse that referenced this issue Jul 19, 2022
Dataverse doesn't currently allow for a leading dot in directory names.
Hence, replace with `_._` on the remote end (but keep things as they are
locally).

(Closes datalad#83)
@bpoldrack
Member

bpoldrack commented Aug 14, 2022

The leading dot issue is fixed in PR #147. However, the issue is more severe really.

  1. The restrictions are more severe than just the leading dot. While (with PR #147) we are mangling directory names to address the immediate . issue, any other restriction isn't properly dealt with yet. Instead, we throw an error when we detect an invalid name. As @mih pointed out (#147 (comment)), this is not a good approach.
  2. We don't even bother checking filenames so far.

I do not have a proper solution, though. What are reliable replacement rules? Do we consider Unicode file/directory names? How can we reliably "escape" every possible character, given that we only have alphanumerics plus _, -, and . to work with?
Something like __DL-DV__ before and after? Imagine what that looks like for, say, a Cyrillic filename. You may simply not want an export in that case.

A comment on the code, though: in export mode, all paths already go through mangle_directory_name in PR #147. Any further replacements that we come up with can be put in there.
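
For illustration only, here is a minimal sketch of the behaviour described in this comment: a leading dot in each path component is replaced with `_._`, and any component that still contains characters outside alphanumerics, `_`, `-`, and `.` raises an error. The function name `mangle_leading_dot` and the exact allowed-character set are assumptions taken from this discussion; this is not the actual `mangle_directory_name` code from PR #147.

```python
import re

# Assumption from the comment above: only ASCII alphanumerics plus
# "_", "-" and "." are accepted in a directoryLabel component.
_ALLOWED_COMPONENT = re.compile(r'[A-Za-z0-9_.-]+')


def mangle_leading_dot(directory_label: str) -> str:
    """Hypothetical stand-in, not the implementation from PR #147."""
    mangled_parts = []
    for part in directory_label.split('/'):
        if part.startswith('.'):
            # Swap the leading dot for `_._`, as in the commit message.
            part = '_._' + part[1:]
        if not _ALLOWED_COMPONENT.fullmatch(part):
            # Behaviour as described: anything else is rejected, not mangled.
            raise ValueError(f'invalid directoryLabel component: {part!r}')
        mangled_parts.append(part)
    return '/'.join(mangled_parts)
```

Under these assumptions, `.datalad/config` becomes `_._datalad/config`, while a Cyrillic directory name raises `ValueError` instead of being exported.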

@mih
Member

mih commented Dec 1, 2022

It seems logical to use the same mangling as dataverse:

The following sanitizing rules will be applied to all the existing folder names in the database: any invalid characters will be replaced by the '.' character. Any sequences of dots will be further replaced with a single dot.

IQSS/dataverse#8807 (comment)

Any Cyrillic folder name would become a single . under these rules, which is again forbidden. So we'd have to add a leading _ to get _. (maybe).

Here is something that might be workable:

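import re
import unicodedata

OBSCURE_FILENAME = ' |;&%b5{}\'"<>ΔЙקم๗あ .datc '

# Decompose with NFKD, turn every remaining non-ASCII character into an
# XML character reference, then replace each disallowed character with '.'
mangled = ''.join(
    '.' if re.match(r'[^a-zA-Z0-9_\-\.]', c) else c
    for c in unicodedata.normalize(
        'NFKD', OBSCURE_FILENAME).encode(
            'ascii', 'xmlcharrefreplace').decode('ascii'))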

which would give

'.....b5........916...1048...774...1511...1605...3671...12354...datc.'

for the "most obscure name", and which we would then

re.sub(r'\.+', '.', mangled)

to get '.b5.916.1048.774.1511.1605.3671.12354.datc.'

where a leading _ could be applied.
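
Put together, the steps in this comment could look roughly like the following. This is only a sketch of the proposal: the function name `mangle_directory_label` is made up here, and prepending `_` whenever the result starts with `.` is one reading of "a leading _ could be applied".

```python
import re
import unicodedata


def mangle_directory_label(name: str) -> str:
    """Sketch of the proposed mangling; the name is hypothetical."""
    # Decompose, then express every non-ASCII character as an XML
    # character reference such as '&#916;'.
    ascii_only = (
        unicodedata.normalize('NFKD', name)
        .encode('ascii', 'xmlcharrefreplace')
        .decode('ascii')
    )
    # Replace every character outside the allowed set with a dot.
    dotted = re.sub(r'[^a-zA-Z0-9_.-]', '.', ascii_only)
    # Collapse runs of dots into a single dot.
    collapsed = re.sub(r'\.+', '.', dotted)
    # Assumption: guard a leading dot with an underscore.
    if collapsed.startswith('.'):
        collapsed = '_' + collapsed
    return collapsed


print(mangle_directory_label(' |;&%b5{}\'"<>ΔЙקم๗あ .datc '))
# → _.b5.916.1048.774.1511.1605.3671.12354.datc.
```

Note that this mapping is lossy: different inputs can collapse to the same output, so it cannot be unmangled without extra bookkeeping.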

Is this beautiful? No. But given that dataverse is forcing the non-ascii world into this corset, beauty is out of this picture.

Practically speaking: no one with non-ASCII folder names would want to export to dataverse. I think it is not for datalad to fix that.

We still have to confirm that no similar rules apply to filenames. If they do, we will also have problems uploading URL keys in non-export mode.

@mih mih reopened this Dec 1, 2022
@mih
Member

mih commented Mar 10, 2023

I think we should start using https://pypi.org/project/Unidecode/

This shrinks the problem from "all of Unicode" to 127 ASCII symbols. Those could then be mapped by hand.

This approach is not without issues, not the least of which is that people may be offended by its ignorance of language. See the project's README for some inspiration.

Still, with dataverse imposing this substantial namespace constraint, it boils down to whether dataverse can be used at all with Unicode characters.
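
As a rough illustration of how this narrows the problem: after transliteration, only ASCII characters outside the allowed set remain to be mapped by hand. The helper name `chars_needing_manual_mapping` and the allowed set are assumptions for this sketch. The `ImportError` fallback is a crude stand-in so the snippet runs without Unidecode installed; it drops characters rather than transliterating them.

```python
import re

try:
    from unidecode import unidecode  # third-party: pip install Unidecode
except ImportError:
    import unicodedata

    def unidecode(text: str) -> str:
        # Crude stand-in (assumption): drops whatever NFKD cannot
        # reduce to ASCII instead of transliterating it.
        return (
            unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('ascii')
        )

# Assumption: characters accepted as-is by dataverse.
_ALLOWED_CHAR = re.compile(r'[A-Za-z0-9_.-]')


def chars_needing_manual_mapping(name: str) -> list:
    """Return the ASCII characters left over after transliteration
    that would still need a hand-written replacement rule."""
    ascii_name = unidecode(name)
    return sorted({c for c in ascii_name if not _ALLOWED_CHAR.fullmatch(c)})
```

For a name such as `a b&c`, this returns `[' ', '&']`: the only characters that would still need a replacement rule.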

@mih mih changed the title from "Mangle directory names on export as necessary to meet dataverse naming rules" to "Mangle directoryLabel on export as necessary to meet dataverse naming rules" on Mar 10, 2023
mih added a commit to mih/datalad-dataverse that referenced this issue Mar 13, 2023
The special remote implementations no longer need to worry about this.
All API methods that accept paths take care of the mangling
automatically and internally.

Closes datalad#83
@mih mih assigned mih and unassigned bpoldrack Mar 13, 2023
@mih mih closed this as completed in #231 Mar 13, 2023