Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document and enforce what characters can be used in dataset names #497

Closed
craigds opened this issue Nov 4, 2021 · 7 comments · Fixed by #501
Closed

Document and enforce what characters can be used in dataset names #497

craigds opened this issue Nov 4, 2021 · 7 comments · Fixed by #501

Comments

@craigds
Copy link
Member

craigds commented Nov 4, 2021

It's currently possible to import a dataset with a colon (:) in its name.

This breaks at least one current feature of kart:

  • kart diff accepts filters in the format DATASET[:PRIMARY_KEY]; since string PKs can contain colons the parsing is ambiguous

We should specify that dataset names don't contain colons, and reject them during import if they do.

@craigds
Copy link
Member Author

craigds commented Nov 4, 2021

A more complete list of things we should disallow:

  • :, as above
  • Any ASCII control characters (ie bytes less than 0x20, especially \0, \r, \n), just to avoid general pitfalls.
  • any path component beginning with a .:
    • we use /.table-dataset/ as a terminator for dataset paths, and other dataset types will exist in future. Disallowing r"\/\.[a-z]+-dataset/" is too complicated a rule, so I think we should just disallow leading . instead
    • leading . components will cause the dataset to become hidden on unix systems when checked out, so that seems non-ideal
    • this also avoids security concerns from datasets containing .. or . components

In addition Windows seems to have some interesting ideas about file names; to prevent Windows checkouts from going awry perhaps we should also adhere to some/all of these:

  • two datasets in a repo shouldn't have the exact same name ignoring case (this would also be a problem on most MacOS filesystems AFAIK)
  • Windows additionally disallows these characters: <>"|?*
  • path components apparently shouldn't be any of these things (⁉️): CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9
  • path components shouldn't end with a . or (space)

@olsen232
Copy link
Collaborator

olsen232 commented Nov 5, 2021

Could just ban . altogether - it's confusing when you have a table with a fully qualified name like "a.b"."c.d" where a.b is the schema and then c.d is the table-name.

@olsen232
Copy link
Collaborator

olsen232 commented Nov 5, 2021

I'm +1 to disallowing certain characters that are likely to cause problems, but -1 to having an extra list of banned tokens due them not working as intended on certain OS's - that would mean the idiosyncrasies of that OS have now spread to another OS via kart.
I'm more +0 on whether we allow two datasets with the same name but different case. I guess probably it's good to disallow since it's probably user error.

@craigds
Copy link
Member Author

craigds commented Nov 5, 2021

yeah, that's reasonable. So let's skip the handling of COM et al.

I think disallowing . in dataset names altogether might be overkill, because it's reasonably common in dataset names I've seen (in ESRI rest services mostly) and it's not strictly necessary. But we should disallow leading and trailing . in path components because I'm pretty sure they'll come back to bite us if we don't.

@craigds
Copy link
Member Author

craigds commented Nov 5, 2021

I'm more +0 on whether we allow two datasets with the same name but different case. I guess probably it's good to disallow since it's probably user error.

this one's kind of annoying to implement, since it depends on the repo - unlike the other rules which depend on nothing except the path you're supplying them. I'll skip it for now.

@rcoup
Copy link
Member

rcoup commented Nov 5, 2021

TBC, we're talking dataset names/paths only here.

Note that macOS filesystems are case-insensitive by default. NTFS is case-sensitive, but parts of the Win32 API aren't. I think for avoiding obvious & inevitable footguns we should very much try and help people not to do that.

I don't see any particular need for using any of the reserved characters for Win32 (<>:"/\|?* and 0x00-0x1F) — they'll cause issues on unixish platforms too unless users are very adept at escaping them in the right way. Paths ending in spaces are equally ick. I know Git has some code to check for CON/PRN/AUX/etc filenames, but AFAIK it just bans checking them out on Windows.

Our primary goal with this is to help users avoid bad experiences for them and others — we will get a bug report when a repo doesn't work on a different OS, and blaming the OS isn't a helpful solution for the user (especially when it's not their repo).

If we did this checking strictly at creation time then at least it's all in one place?

@craigds
Copy link
Member Author

craigds commented Nov 8, 2021

one more restriction I discussed with @rcoup offline: we should only allow alphabet characters for the first character (no numerals or symbols). This will make it easier to support a wider variety of databases / working copies.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants