Update filenaming.rst - minor typo and url fixes #796

Merged
merged 1 commit into from
Dec 13, 2021
20 changes: 10 additions & 10 deletions docs/intro/filenaming.rst
@@ -8,7 +8,7 @@ One of the most fundamental data management tasks is naming files.
.. figure:: https://imgs.xkcd.com/comics/documents.png

This may sound mundane to some --- "yeah, I guess I can't manage data if it doesn't have a file name, but what's the big deal? I already know that `I shouldn't use spaces in file names <https://superuser.com/questions/29111/what-technical-reasons-exist-for-not-using-space-characters-in-file-names>`_".
-Other people take it to the edge of `bikeshedding <https://en.wiktionary.org/wiki/bikeshedding>`_ an beyond: "Let's start each file name with the ISO 8601 date format of its first modification for better sorting!"
+Other people take it to the edge of `bikeshedding <https://en.wiktionary.org/wiki/bikeshedding>`_ and beyond: "Let's start each file name with the ISO 8601 date format of its first modification for better sorting!"
But between the two extremes, there are a surprising amount of *interoperability* considerations --- practices and tips to ensure cross-platform compatibility --- in order to create file names that work on your own as well as on anyone else's machine.

This matters because the files, directories, or repositories that you may at some point share with friends, colleagues, or the entire internet should successfully make it to those other people's computers, and on those computers they should be usable for their intended purpose.
@@ -62,7 +62,7 @@ If you can, try to avoid trouble with unicode
Lucky are the people with boring names without accents and special characters.
The others may have an extra bit of fun in their lives when software can not handle their names.

-Even though certain names look identical across file system or operating systems, their underlying unicode character sequences can differ.
+Even though certain names look identical across file systems or operating systems, their underlying unicode character sequences can differ.
For example, the character "é" can be represented as the single Unicode character u+00E9 (latin small letter e with acute), or as the two Unicode characters u+0065 and u+0301 (the letter "e" plus a combining acute symbol).
This is called `canonical equivalence <https://en.wikipedia.org/wiki/Unicode_equivalence>`_ and can be confusing: While file names are visually indistinguishable, certain tools, operating systems, or file systems can normalize their underlying unicode differently and cause errors in the process.
It becomes a problem, potentially even leading to permanent data loss, when `one tool or filesystem won't recognize a file anymore that has been normalized by a different tool or filesystem <https://web.archive.org/web/20100109162824/http://forums.macosxhints.com/archive/index.php/t-99344.html>`_.
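
To illustrate the canonical-equivalence issue described in this hunk, here is a minimal Python sketch. The helper name ``same_name`` is hypothetical and not part of the handbook; the choice of NFC as the comparison form is an assumption made for the example, not something the text prescribes.

```python
import unicodedata

# Two spellings of the same visible name "é"
composed = "\u00e9"      # U+00E9, latin small letter e with acute
decomposed = "e\u0301"   # U+0065 followed by U+0301 (combining acute)


def same_name(a: str, b: str) -> bool:
    """Compare two file names after normalizing both to NFC (assumed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # False: the code point sequences differ
print(same_name(composed, decomposed))  # True: equal once both are normalized
```

A tool that compares raw strings would treat these as two different files, which is the kind of mismatch the paragraph warns about.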
@@ -84,8 +84,8 @@ And places that ban certain file names will not be able to create, receive, or u
Avoid illegal characters
========================

-Different operating system disallow certain characters in file names, and things will be messy if you were to share a file with a character that works on your machine with a machine that regards it as illegal.
-Let's start easy and with characters that you can actually find on your keyboard:
+Different operating systems disallow certain characters in file names, and things will be messy if you were to share a file with a character that works on your machine with a machine that regards it as illegal.
+Let's start easy, and with characters that you can actually find on your keyboard...

On Unix systems, the forward slash ``/`` can not be used in file names.
This is because this character is used to denote directory boundaries.
@@ -106,7 +106,7 @@ On Windows systems, on the other hand, there is quite a long list of characters:
And in addition, its also not possible to end a file name with a period (``.``) or a space on Windows.

Especially Unix users can thus inadvertently create files that a Windows system couldn't handle.
-But in the case of files with illegal names being committed into Git repositories, the consequences are actually more sever than just a single file that can't be used or copied.
+But in the case of files with illegal names being committed into Git repositories, the consequences are actually more severe than just a single file that can't be used or copied.
A dataset with a file with invalid characters likely fails to be cloned, because a checkout of that file will fail.
If a file with an invalid character exists on the non-default :term:`branch`, then that branch likely can't be checked out.
So while having invalid characters in your files is 1) a considerably convoluted way of keeping a Git repository private from that one co-worker who uses Windows, it is mostly 2) a `major interoperability hassle <https://dwheeler.com/essays/fixing-unix-linux-filenames.html>`_, even more so in the context of version control.
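
A check along the following lines could flag problematic names before they are committed. This is a sketch under stated assumptions: the character set is taken from Microsoft's file naming documentation rather than from the list in the handbook (which this diff does not show), and the function name ``windows_name_problems`` is hypothetical.

```python
from typing import List

# Assumption: reserved characters as documented by Microsoft for Windows file names
WINDOWS_RESERVED_CHARS = set('<>:"/\\|?*')


def windows_name_problems(name: str) -> List[str]:
    """Return human-readable reasons why `name` may fail on Windows."""
    problems = []
    bad = sorted(set(name) & WINDOWS_RESERVED_CHARS)
    if bad:
        problems.append("reserved characters: " + " ".join(bad))
    if any(ord(ch) < 32 for ch in name):
        problems.append("control characters")
    # The text above notes that names must not end with a period or a space
    if name.endswith((".", " ")):
        problems.append("ends with a period or a space")
    return problems


print(windows_name_problems("what?.txt"))      # ['reserved characters: ?']
print(windows_name_problems("data_2021.csv"))  # []
```

It only inspects a single name component, not a full path, and it does not cover reserved device names.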
@@ -149,15 +149,15 @@ Here are more examples on how to be nice to the system that receives your files:
Keep file names below 260 characters
====================================

-If you want to annoy a foe with a Windows-based machine, or convince a Windows-friend to switch to Unix, send them files that exceed Window's (default) maximum path length limitation of 260 characters, or make them copy files with acceptable length into a too long, nested directory hierarchy [#f3]_.
+If you want to annoy a foe with a Windows-based machine, or convince a Windows-friend to switch to Unix, send them files that exceed Windows' (default) maximum path length limitation of 260 characters, or make them copy files with acceptable length into a too long, nested directory hierarchy [#f3]_.
This limit exists for all machines running Windows before Windows 10 version 1607, and for all later versions of Windows if the maximum path limit has not been manually removed in the settings.
For more gotchas that Windows users may run into, take a look at the section :ref:`ohnowindows`.

Prevent paths to be interpreted as command line arguments
=========================================================

-While its not "illegal" to start a directory of file name with a hyphen (``-``) its a bad idea and disallowed by certain tools due to security risks.
-In theory, a file name with a hyphen can clash with a command line argument, and a tool called to operate on that file may then misinterpret it as an argument.
+While it's not "illegal" to start a directory of file name with a hyphen (``-``), it's a bad idea, and doing so is disallowed by certain tools due to security risks.
+In theory, a file name starting with a hyphen can clash with a command line argument, and a tool called to operate on that file may then misinterpret it as an argument name.
If you were to create a file called ``-n`` on a Unix system, an ``ls`` or ``cat`` on this file (unless you would add a ``./`` prefix to indicate a file in the current directory) would behave different than expected, parametrizing the command line tool instead of displaying any file information.
Because this can be a security hazard, for example leading to remote code execution, `Git will refuse to operate on submodules that start with a hyphen (CVE-2018-17456) <https://www.exploit-db.com/exploits/45631>`_.
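
When file names are passed to command line tools from a script, the ``./`` prefix mentioned above can be applied defensively. The following is a minimal sketch; ``safe_cli_path`` is a hypothetical helper, and it only handles relative names on Unix-like systems.

```python
def safe_cli_path(name: str) -> str:
    """Prefix names starting with '-' so a tool cannot parse them as options."""
    if name.startswith("-"):
        return "./" + name
    return name


print(safe_cli_path("-n"))         # ./-n
print(safe_cli_path("notes.txt"))  # notes.txt
```

Many command line tools also accept ``--`` to mark the end of options, which is another way to achieve the same effect, though support for it varies by tool.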

@@ -175,12 +175,12 @@ Much information and some general structure of this page is taken from `RFC 8493
The links used throughout this overview provide details and further information for particular issues.
A good general overview on how to name files can be found at `psychoinformatics-de.github.io/rdm-course/02-structuring-data/index.html <https://psychoinformatics-de.github.io/rdm-course/02-structuring-data/index.html>`_.

-Do you know more? `We're eager to include your advice <github.com/datalad-handbook/book/issues/new/>`_!
+Do you know more? `We're eager to include your advice <https://github.com/datalad-handbook/book/issues/new/>`_!

.. rubric:: Footnotes

.. [#f1] or bore them to death -- depends on your friends

.. [#f2] If you are on Windows, you can try and go out of your way to create a file with that name. Windows does everything in its power to prevent you from doing it, but you can succeed. But be mindful - should you succeed, you won't get rid of this file, nor of any folder hierarchy it may be contained in.

-.. [#f3] Copying vacation snapshots into ``C:\Users\"Bob McBobface"\Desktop\Pictures\"Vacation Pictures"\2020\Saint-Remy-en-Bouzemont-Saint-Genest-et-Isson\"From Alice and Sasha"\Camera\`` is as doomed to fail. Sorry. Better just dump those straight onto your Desktop or something...
+.. [#f3] Copying vacation snapshots into ``C:\Users\"Bob McBobface"\Desktop\Pictures\"Vacation Pictures"\2020\Saint-Remy-en-Bouzemont-Saint-Genest-et-Isson\"From Alice and Sasha"\Camera\`` is as doomed to fail. Sorry. Better just dump those straight onto your Desktop or something...