Mangle directoryLabel on export as necessary to meet dataverse naming rules #83
Comments
Hi! 👋 We are happy that you opened your first issue here! 😄 If you haven't done so already, please make sure you check out our Code of Conduct.
As stated in #17 (comment), I think we should implement directory name mangling/unmangling for exports. Even if dataverse fixes IQSS/dataverse#8807 for the immediate
Dataverse doesn't currently allow for a leading dot in directory names. Hence, replace with `_._` on the remote end (but keep things as they are locally). (Closes datalad#83)
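A minimal sketch of what such a leading-dot replacement could look like, purely for illustration. The function names are hypothetical and not taken from the actual change; the only thing taken from the description above is the `_._` stand-in. It assumes `/`-separated relative paths without `.` or `..` components.

```python
# Assumption: "_._" is the remote-side stand-in for a leading dot,
# as stated in the commit message above.
DOT_PREFIX = "_._"


def mangle_directory_label(path: str) -> str:
    """Replace the leading dot of every path component with DOT_PREFIX."""
    return "/".join(
        DOT_PREFIX + part[1:] if part.startswith(".") else part
        for part in path.split("/")
    )


def unmangle_directory_label(path: str) -> str:
    """Reverse mangle_directory_label() for paths read back from the remote."""
    return "/".join(
        "." + part[len(DOT_PREFIX):] if part.startswith(DOT_PREFIX) else part
        for part in path.split("/")
    )


if __name__ == "__main__":
    remote = mangle_directory_label(".github/workflows")
    print(remote)
    print(unmangle_directory_label(remote))
```

Note that this simple scheme does not round-trip a local directory whose name already starts with `_._`; it would come back with a leading dot.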
The leading dot issue is fixed in PR #147. However, the issue is more severe really.
I do not have a proper solution, though. What are reliable replacement rules? Do we consider Unicode file/directory names? How can we reliably "escape" every possible character, given that we only have alphanumeric +

Comment on code, though: In export mode all paths go through
It seems logical to use the same mangling as dataverse:
Any Cyrillic folder name would become a single

Here is something that might be workable:

    import re
    import unicodedata

    OBSCURE_FILENAME = ' |;&%b5{}\'"<>ΔЙקم๗あ .datc '
    # Decompose to NFKD, turn non-ASCII into XML character references,
    # then replace every character outside [a-zA-Z0-9_-.] with a dot.
    mangled = ''.join(
        '.' if re.match(r'[^a-zA-Z0-9_\-\.]', c) else c
        for c in unicodedata.normalize(
            'NFKD', OBSCURE_FILENAME).encode(
            'ascii', 'xmlcharrefreplace').decode('ascii'))

which would give

for the "most obscure name", and which we would then `re.sub(r'[\.]+\.', '.', mangled)` to get

where a leading

Is this beautiful? No. But given that dataverse is forcing the non-ASCII world into this corset, beauty is out of the picture. Practically speaking: no one with non-ASCII folder names would want to export to dataverse. I think it is not for datalad to fix that.

We still have to confirm that no similar rules apply to filenames. If that is the case anyway, we will also have problems uploading URL keys in non-export mode.
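For illustration, the mangling proposed in this comment can be wrapped into one function, with the dot-collapsing step included. This is only a sketch: the function name `mangle` is hypothetical, and the allowed character set is the one from the regular expression above, not a confirmed statement of dataverse's rules.

```python
import re
import unicodedata


def mangle(name: str) -> str:
    """Map an arbitrary name onto [a-zA-Z0-9_-.], collapsing runs of dots."""
    # NFKD-decompose, then express anything non-ASCII as an XML character
    # reference such as "&#769;" so that the intermediate form is pure ASCII.
    ascii_form = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "xmlcharrefreplace")
        .decode("ascii")
    )
    # Every character outside the allowed set becomes a dot.
    dotted = "".join(
        "." if re.match(r"[^a-zA-Z0-9_\-.]", c) else c for c in ascii_form
    )
    # Collapse two or more consecutive dots into one.
    return re.sub(r"\.{2,}", ".", dotted)


if __name__ == "__main__":
    print(mangle("my dir"))
    print(mangle("\u00e9"))
```

The mapping is lossy (a space and a `;` both end up as a dot), so names mangled this way cannot be unmangled without recording the original somewhere.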
I think we should start using https://pypi.org/project/Unidecode/ This shrinks the problem from "all of Unicode" to 127 ASCII symbols. Those could then be mapped by hand. This approach is not without issues. People being offended by its ignorance of language not being the least of them. See the project's readme for some inspiration. Still, with dataverse imposing this substantial namespace constraint, it boils down to being able to use it, or not -- with Unicode chars.
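A rough sketch of that two-step idea: transliterate to ASCII first, then map the remaining disallowed ASCII symbols through a hand-written table. Everything named here is an assumption for illustration: `HAND_MAP`, `FALLBACK`, and `to_dataverse_safe` are hypothetical, the allowed set is borrowed from the regular expression earlier in this thread, and the `except` branch is only a crude stand-in so the example runs without the Unidecode package installed.

```python
import re
import unicodedata

try:
    from unidecode import unidecode  # third-party package "Unidecode"
except ImportError:
    # Stand-in for illustration only: drop whatever NFKD cannot express
    # in ASCII. Real Unidecode transliterates far more than this.
    def unidecode(text: str) -> str:
        return (
            unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("ascii")
        )

# Assumption: allowed characters as in the regex used earlier in the thread.
ALLOWED = re.compile(r"[a-zA-Z0-9_\-.]")
# Hypothetical hand-written replacements for a few ASCII symbols.
HAND_MAP = {" ": "_", "&": "_and_", "%": "_pct_"}
# Hypothetical catch-all for ASCII symbols without a table entry.
FALLBACK = "_"


def to_dataverse_safe(name: str) -> str:
    """Transliterate to ASCII, then replace disallowed ASCII symbols."""
    return "".join(
        c if ALLOWED.fullmatch(c) else HAND_MAP.get(c, FALLBACK)
        for c in unidecode(name)
    )


if __name__ == "__main__":
    print(to_dataverse_safe("raw data"))
    print(to_dataverse_safe("a&b"))
```

Like any transliteration, this is one-way: distinct local names can collide on the remote end, so a real implementation would still need a strategy for detecting or avoiding collisions.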
The special remote implementations no longer need to worry about this. All API methods that accept paths take care of the mangling automatically and internally. Closes datalad#83
When exporting data (exporttree=yes), dataverse seems to remove leading "." chars from hidden directories in directoryLabel. Any ideas how to handle this?
Other characters like leading underscore work.