Skip to content
This repository has been archived by the owner on Oct 15, 2024. It is now read-only.

Reserved Characters in Key Names #3223

Closed
10 tasks
markus2330 opened this issue Nov 16, 2019 · 62 comments
Closed
10 tasks

Reserved Characters in Key Names #3223

markus2330 opened this issue Nov 16, 2019 · 62 comments

Comments

@markus2330
Copy link
Contributor

markus2330 commented Nov 16, 2019

I propose following characters to be reserved and escapable (to remove special meaning) within key names:

  • % (already used for empty names and contextual values)
  • # (is used for arrays, see arrays with natural sorting #2698)
  • _ (is used for globbing)
  • * (is used for globbing)
  • & (unused)
  • ? (unused)
  • @ (is used for referring to parentKey, e.g. in conditionals plugin)
  • . (used to refer to key name part itself, or one above)
  • / (for separating paths)
  • \ (for escaping)

What do you think?

@markus2330
Copy link
Contributor Author

See also #2698 for discussion which character to be used for array (globbing)

@kodebach
Copy link
Member

First of all, if you list \, you should also put / on the list. The slash is clearly the prototypical reserved character (character with special meaning) in Elektra.

Second, "used for globbing" and "used for referring to parentKey, e.g. in conditionals plugin" are not valid use cases here. These are not special meanings during unescaping. If you want to prevent a special interpretation that happens later (e.g. by a plugin or library), you either have to know that this interpretation is based on the escaped key name (which would be bad, as outlined in #2698 (comment)), or you have to ensure that the unescaped name cannot be specially interpreted.

Let's take # as an example. Our globbing library interprets the key name /a/# as meaning "match any array element in the array /a". If we don't want that, but want to match /a/# literally, we need to use some kind of escaping. However, /a/\# cannot work. What does the backslash achieve during unescaping of the key name? To prevent the globbing library's special interpretation, it would have to stay there during unescaping. But that violates the rule (°) that a backslash must be escaped, if it is not an escape character itself (it is not an escape character, if doesn't disappear during unescaping). So we need to use /a/\\#, which will be unescaped into ⊙a⊙\# (⊙ in place of NUL, namespace and terminator omitted). But in that case # has no special meaning for key name unescaping and doesn't have to be considered at all.

In other words, to give a reason why a character has to escaped (why it is reserved), you have to think about "What behaviour of elektraKeyNameCanonicalize or elektraKeyNameUnescape would escaping prevent?".


For the characters that we actually want to make "reserved", there is a decision we have to make here. Are these characters with special meaning or actual reserved characters?

In my mind, characters with special meaning are normal characters that can be used as such, but may have a special interpretation in a certain context (e.g. % if it is the whole key part).

Reserved characters on the other hand are reserved for special contexts. They cannot be used normally and are only allowed in the special contexts, in which they have a special meaning.

A character with special meaning only has to be escaped in the context, in which it has its special interpretation, if this special interpretation is not desired. Outside of this context, escaping may be allowed or disallowed, depending on our stance on unnecessary escaping (°°).

A reserved character on the other hand must be escaped whenever it shall have no special meaning. This means, in the context where it has special meaning it can be escaped, to avoid the special meaning. Outside this context, it must be escaped.

As an example, let's look at %. If a whole key part is just % (e.g. /%/), this part is unescaped into an empty part. In both cases, the backslash in /\%/ will prevent the special interpretation and it will be unescaped into a part that is %. The difference becomes noticeable, however, if look at different contexts, like /a%b/.

If we treat % as a reserved character, then /a%b/ is an illegal key name/part and we should throw an error. The correct way to achieve the unescaped part a%b with a reserved character would be using /a\%b/.

If we see % as a character with special meaning, then /a%b/ would be correct and unescape into a%b. Again, whether /a\%b/ is also allowed, depends on our stance on unnecessary escaping. If it is allowed, it also unescapes into a%b.

Note: For / there is no decision, both interpretations are equivalent, since there is no context, in which / has no special meaning. For \ we already decided (in #3115) to treat it as a reserved character and therefore require escaping, when it is not used as an escape character itself.


The decision above of course also requires us to define, in which contexts can these characters, now or in future, have special meanings.

There are two easy solutions here:
Treat everything (except / and \) as characters with special meaning, allow unecessary escaping and for the unused ones define that the have a special interpretation in all contexts. The special interpretation for these would be that they make a key name illegal. Therefore the unused ones always have to be escaped. Any other of the characters list can always be escaped, but only must be, if we want to prevent a specific special interpretation.



(°) As stated in #3115, this rule makes the whole system a lot easier to understand. In an escaped (valid) key name, a backslash always escapes something. In an unescaped (valid) key name, a backslash is always literally a backslash.

(°°) If we adopt a very strict stance on unnecessary escaping, it should be possible to get a 1:1 relation between escaped and unescaped key names. This of course requires more effort in validating key names and therefore will have a performance impact (whether this impact is in any way significant must be tested).

@markus2330
Copy link
Contributor Author

First of all, if you list , you should also put / on the list. The slash is clearly the prototypical reserved character (character with special meaning) in Elektra.

Thank you, I added it. If you find further characters, you can simply add them to the top-post.

Second, "used for globbing" and "used for referring to parentKey, e.g. in conditionals plugin" are not valid use cases here. These are not special meanings during unescaping.

Yes, of course we can also use another escape character or two \ for some of these characters. But this will be more complicated for users?

What does the backslash achieve during unescaping of the key name?

I had in mind that keyAddBaseName(key, "_") will actually add \_. So to get a _ (with the meaning of the globbing character), you would need to use keyAddName.

E.g. @sanssecours currently actually uses _ inside key names for the directoryvalue plugin. Of course there is no way to escape this, as such escaping engines are non-trivial to write. I had the hope we could have a single place to implement such escaping.

For the characters that we actually want to make "reserved", there is a decision we have to make here. Are these characters with special meaning or actual reserved characters?

If they do not have a special meaning now, they are reserved so that we can give them a special meaning later. We cannot "reserve" or "give a special meaning" at a later point of time, as then applications might already use such characters.

To make it clear: In the top post the "special meaning" and "reserved" are synonyms for the escaping engine. And I would keep them as synonyms to make escaping easier.

Reserved characters on the other hand are reserved for special contexts. They cannot be used normally and are only allowed in the special contexts, in which they have a special meaning.

To make this whole escaping engine sane, I would simply assume that the special characters are always special, except when they are escaped.

In particular, I would like to avoid situations where characters get special meaning if an escape character was added. This is an ugly hack and makes e.g. regex unnecessarily complicated.

If we see % as a character with special meaning, then /a%b/ would be correct and unescape into a%b.
Again, whether /a%b/ is also allowed, depends on our stance on unnecessary escaping. If it is allowed, it also unescapes into a%b.

Exactly. The engine cannot fully prevent from creating names with wrong meanings. So I think we should not try to do it. E.g. % in the placeholders need to have an even number of occurrences but there are exceptions. And maybe some extension of contextual values will add further exceptions.

If we adopt a very strict stance on unnecessary escaping, it should be possible to get a 1:1 relation between escaped and unescaped key names. This of course requires more effort in validating key names and therefore will have a performance impact (whether this impact is in any way significant must be tested).

I think our most important goal should be that the escaping is easy to understand and use. The previous escaping tried hard that every name is valid. Such goals mostly cause headache but actually do not improve the usability. If someone places \ at arbitrary places, it is okay to reject some names. But in front of special characters, \ should be always valid and take away any (future) special meaning.

@kodebach
Copy link
Member

I had in mind that keyAddBaseName(key, "_") will actually add \_. So to get a _ (with the meaning of the globbing character), you would need to use keyAddName.

This is super confusing to me...

The semantics of keyAddBaseName (Key * key, const char * x) are:
x is an unescaped part. It will be added directly to the end of key->ukey. Its escaped form will be added to the end of key->key.

The semantics of keyAddName (Key * key, const char * x) are:
x is an escaped name suffix (i.e. zero or more parts, no namespace). It will canonicalized and added to the end of key->key. The unescaped form will be added to the end of key->ukey.

Now, you want keyAddBaseName (key, "_") to add the part \_ to the escaped name and, the part _ to the unescaped name. And that should not have any impact on globbing. But keyAddName (key, "_") should have meaning for globbing, so what does it do? _ can't be unescaped in any way, it's just a single character. So we have to add _ to both escaped and unescaped name. But then both versions produce the same unescaped name, which means the should have the same meaning for globbing...

currently actually uses _ inside key names for the directoryvalue plugin.

AFAIK directoryvalue only uses the key part /__directoryvalue/. This in no way collides with globbing's use of /_/.

We cannot "reserve" or "give a special meaning" at a later point of time, as then applications might already use such characters.

It would just be a big breaking change, like the one we are doing to key names now.


I think we need to distinguish two things:

  1. Characters that have special meaning to keySetName (or more specifically elektraKeyNameCanonicalize and elektraKeyNameUnescape).
  2. Characters that have special meaning to a plugin, library or other part of Elektra not concerned with key name manipulation.

The first group will be some form of reserved character, improper use of which will result in a failed keySetName (and keyNew will return NULL). This is where escaping with a backslash helps, as the backslash tells keySetName to take the following character literally and ignore any meaning it might have.

The second group is entirely different. There is nothing we can do during keySetName (or keyAddName, keyAddBaseName, etc.) to change whether a character of this group has special meaning to some plugin or not. It is entirely up to the plugin to decide that and any form of escaping the plugin might allow to prevent the special meaning must make it to the plugin. Keep in mind, that we said keySetName fails and keyNew returns NULL for invalid names. In keySetName
we cannot for example say /# has special meaning, /#a is not allowed, but /\# has no special meaning an /\#a is allowed. We have no idea, how this is going to be used. If we disallow some keys, no plugin can use them. (Note: in the paragraph above: plugin = plugin, library, application, etc.)


If you really want to unify when something can have special meaning to a plugin/library, I can only see something like this working:

Define some escaping rules for %, ., / and \, as the affect keySetName. Do the array index stuff from #2698. Key name parts (escaped and unescaped) starting with @ must be of the form (regex) @[^@]+@.+. The character @ may not appear in any other context. Everything else has no special meaning during keySetName and may be used unrestricted. Plugins, libraries, etc. MAY NOT attribute special meaning to any individual character or to a key name part except, to unescaped key name parts that start with @. An unescaped key name part of the form (regex) @[^@]+@.+ can only have special meaning to the plugin named by the [^@] part of the regex. What that meaning is, is up to the plugin and determined by the part after the second @.

I recognize that this is totally different from the current approach and entirely unreasonable to implement, if we want to release 1.0 any time soon, but it is the only working approach I can see, that allows us to definitively say whether a key part has any special meaning or not.

@markus2330
Copy link
Contributor Author

I think we need to distinguish two things:

My hope was that we can discard this distinction to make the whole escaping easier to understand. Then keyAddBaseName never adds something that has a meaning to Elektra.

If this is not possible, the minimum requirement is that it is possible to pass any string to keyAddBaseName and get back the same string when asking for the base name with keyBaseName, see #2698.

Yes, it would break some occurrences of keyAddBaseName where array names were added. But it will break some cases anyway, as now not every array name is valid, so we need to break some code either:

  1. the code that assumed every string can be set or
  2. the code that assumed arrays can be added.

It is very clear to me that we should not break 1., i.e. code that assumed every string can be passed to keyAddBaseName. "2." is not so important, these are a few occurrences that can be fixed.

Another example next to SSID: fstab entries can have basically everything including /. Or the mountpoint names within Elektra.

@kodebach
Copy link
Member

Then keyAddBaseName never adds something that has a meaning to Elektra.

I see now that this once again comes done the major problem at the heart of Elektra. Everything should be coherent and well defined system, but at the same time everything should be flexible and plugins should be used as much as possible. This just can't work.

The problem specifically is the phrase "meaning to Elektra". We can ensure keyAddBaseName never adds something that has meaning to elektra-core, or in more clear terms, that keyAddBaseName adds its argument as a new unescaped key name part. But to an application "Elektra" also means any plugin of Elektra. In that interpretation it becomes very hard to prevent keyAddBaseName from adding something "with meaning". Any plugin could decide to attribute meaning to any unescaped key name at any point. Like I said, the only solution is to introduce a very restrictive system, like the one with @ proposed above. If you come up with something different, let me know, but for everything I came up with until know, I found a problem.

It is very clear to me that we should not break 1., i.e. code that assumed every string can be passed to keyAddBaseName

Why? It is all but clear to me, why this is preferable? UNIX filenames for example are not allowed to contain / and nobody ever thought that was a problem. (Not allowing / makes splitting UNIX paths into parts a lot simpler than with key names).

We are already breaking (almost) EVERY piece of code interfacing with Elektra by adding the : after the namespace. At the current state of affairs, only cascading keys without any array parts and also excluding some other special character things will continue to work without requiring change. Backwards compatibility is long out the window.

"2." is not so important, these are a few occurrences that can be fixed.

I would not be so sure about that. In fact I know of only one part of Elektra that truely assumes that keyAddBaseName escapes everything and that is elektra-kdb and the tools that create mountpoints.

fstab entries can have basically everything including /

Not sure why that is relevant?

Or the mountpoint names within Elektra.

To create the mountpoint user:/my/mount the tools put the mountpoint configuration under system:/elektra/mountpoints/user:\/my\/mount, but the could just as well use any other unique key directly below system:/elektra/mountpoints, because the actual mountpoint is the value of e.g. system:/elektra/mountpoints/user:\/my\/mount/mountpoint. AFAICT system:/elektra/mountpoints might as well be an array. I have not found any place that calls keyBaseName on a key directly below system:/elektra/mountpoints.

There might be some place that uses mountGetMountpoint or mountGetBackend to get the mountpoint and reconstruct system:/elektra/mountpoints/user:\/my\/mount/ to access some of the config, but that could easily be solved by adding the system:/elektra/mountpoints key below which the config is stored as a field to struct _Backend.


In short:

the minimum requirement is that it is possible to pass any string to keyAddBaseName and get back the same string when asking for the base name with keyBaseName

This is easy. But combining it with "keyAddBaseName shall accept all strings" is very hard or almost impossible.

@markus2330
Copy link
Contributor Author

I see now that this once again comes done the major problem at the heart of Elektra. Everything should be coherent and well defined system, but at the same time everything should be flexible and plugins should be used as much as possible. This just can't work.

Yes, but I would not call it a problem but a challenge. We want the best possible Elektra and of course sometimes contradicting goals occur.

Like I said, the only solution is to introduce a very restrictive system, like the one with @ proposed above.

Yes, I also see that it will not be possible to have everything. With array escaping we are already quite limited to a prefix escaping, i.e. we cannot escape other characters within the base names.

But this prefix escaping might be enough. And this should be extensible to all the characters proposed above. The limitation would be that either the whole key name part is escaped or nothing at all.

Of course the escaping would not be magically but applications would need to check if the meaning of the keyBaseName has been taken away or we provide a different keyUnescapedBaseName so that API users can have the string with or without prefix escaping.

Why? It is all but clear to me, why this is preferable? UNIX filenames for example are not allowed to contain / and nobody ever thought that was a problem. (Not allowing / makes splitting UNIX paths into parts a lot simpler than with key names).

It is a problem for everyone implementing a file browser. Some applications had URL encoding techniques to get / into the file names. The more modern "solution" is to use the Unicode Character 'DIVISION SLASH' (U+2215) for slashes.

Furthermore, keep in mind that the whole reason for existence of Elektra is that UNIX file system semantics are not suitable for configuration. In configuration arbitrary sequences for key part names including / are commonly used and thus should be supported. E.g. in YAML / are allowed characters in key name parts and also in TOML there are "Quoted keys".

We are already breaking (almost) EVERY piece of code interfacing with Elektra by adding the : after the namespace. At the current state of affairs, only cascading keys without any array parts and also excluding some other special character things will continue to work without requiring change. Backwards compatibility is long out the window.

Yes, and I am super-grateful that you are doing this. I would not have been able to do this alone.

I would not be so sure about that. In fact I know of only one part of Elektra that truely assumes that keyAddBaseName escapes everything and that is elektra-kdb and the tools that create mountpoints.

Another part is the SSIDs. This was not a synthetic example but Broadcom has implemented this for their bluetooth devices. So they rely on that they can add any base name.

This is easy. But combining it with "keyAddBaseName shall accept all strings" is very hard or almost impossible.

Yes, very hard but also a very nice unique feature that not even UNIX had before. We can be proud of that.

@kodebach
Copy link
Member

E.g. in YAML / are allowed characters in key name parts and also in TOML there are "Quoted keys".

This is actually what I thought we could introduce, see below.


I think all of this can be solved by introducing something similar to "quoted keys". I am going to call them literal key parts.

If a key part starts with an @ (could be another character of course), the following key part is a literal key part. A literal key part can contain ANY byte sequence including zero bytes.

Their unescaped form is as follows:

  • A literal key part starts with an @ character.
  • Further @ characters are encoded as @@.
  • Zero bytes are encoded as @0.
  • @ may not occur otherwise.
  • As always a zero byte terminates the key part.

The escaped form is a bit different, to allow easier printing:

  • A literal key part starts with an @ character.
  • The literal key part is NOT terminated by a /. Only another @ followed by a / or a zero byte (end of string), terminates the _literal key part.
  • Any byte may be encoded as @XX@, where XX is the hexadecimal form of the byte. Therefore @ characters must be encoded as @40@.

Now keySetBaseName and keyAddBaseName accept ANY valid unescaped key part as arguments. They set/add their argument unmodified as the last part of the unescaped key name. Therefore keyBaseName will return the same argument. In addition keySetBaseName and keyAddBaseName turn the unescaped part into its escaped form and set/add it as the last part of the escaped name.

To solve the problem of "we want to add anything as a key part", we introduce:

size_t keySetLiteralBaseName (Key * key, char * bytes, size_t size);
size_t keyAddLiteralBaseName (Key * key, char * bytes, size_t size);

These functions accept ANY byte sequence as their bytes argument. They then turn the argument into an unecaped literal key part by prefixing it with an @, replacing other @ with @@ and zero bytes with @0. This is then passed to keySetBaseName or keyAddBaseName.

To make decoding literal key part easier we could add decoding functions to the public API as well.

I like this much better, because it makes it very clear when the user intends to add arbitrary things to the key name and when this is not the case.

@markus2330
Copy link
Contributor Author

markus2330 commented Nov 17, 2019

keySet/AddLiteralBaseName is a proposal to add a new feature. Currently we do not have a way to encode null bytes in key names. When somebody needs this, we can get back to your proposal.

For now, however, we need good semantics for keyAdd/SetBaseName. As outlined very long, this ideally should be any string, as otherwise many plugins and applications will be buggy because they do not expect that some strings will not work. #_10 is a valid key name part, it can be used as key name part in basically all storage plugins or applications, we cannot discuss around that. It is definitely not a good idea to let a storage plugin fail because it found #_10 somewhere as key name part during parsing. And it also not a good idea to require them to use a different API with a size argument if they do not have null bytes anyway. Actually, that such cases as #_10 or / do not need special handling is the reason for the existence of keyAdd/SetBaseName.

@markus2330
Copy link
Contributor Author

Btw. maybe it best if we meet, e.g. on Tuesday?

@kodebach
Copy link
Member

Currently we do not have a way to encode null bytes in key names. When somebody needs this, we can get back to your proposal.

This is the least important point of the proposal. It only allows zero bytes, because I don't see a reason why it wouldn't.

Anyways, there already is a use case:

Another part is the SSIDs. This was not a synthetic example but Broadcom has implemented this for their bluetooth devices. So they rely on that they can add any base name.

Apart from the fact that SSIDs have nothing to do with bluetooth, EEE 802.11 clearly defines an SSID as any byte sequence up to 32 bytes long. So either Broadcom already does some pre-processing to handle zero bytes or their code is already buggy. In any case, this is clearly a use case for encoding zero bytes in key base names.


In any case I have no idea about possible future extensions to the set of special characters, but for the current set, I think the solution implemented in #3115 might work as is.

If we try the scenario from #2698 (comment) in this version, the following happens:

Calling keyAddBaseName (key, "#_10") results in both the last part of key->key and of key->ukey being #_10. Why? Because #_10 is a valid array element. Yes, if #_10 was an SSID we wouldn't see it as an array element, but globbing with # would match it, but does that really matter?

If we take an invalid array element like #10 or #abc, keyAddBaseName will escape and the unescaped name's last part will be #10 or #abc, while the escaped name's last part will be \#10 or \#abc.

I have not yet thought about the theoretical rules we need to make this work for future extensions, but for now I think we can leave it like this. It would be nice to have array elements without the starting character, but does it really matter? Storage plugins are free to encode arrays differently, if they have a problem with the # character and for the command-line we should just add kdb array* commands that take normal integer parameters, e.g. kdb arrayset /my/array 10 a.


Btw. maybe it best if we meet, e.g. on Tuesday?

Tuesday is bad for me... I will send you an email.

@markus2330
Copy link
Contributor Author

Apart from the fact that SSIDs have nothing to do with bluetooth,

I meant Blueray devices (that have wifi capability).

EEE 802.11 clearly defines an SSID as any byte sequence up to 32 bytes long. So either Broadcom already does some pre-processing to handle zero bytes or their code is already buggy. In any case, this is clearly a use case for encoding zero bytes in key base names.

Yes, you are right. But it is a new feature. Any string in keySetBaseName is not a new feature.

Because #_10 is a valid array element. Yes, if #_10 was an SSID we wouldn't see it as an array element, but globbing with # would match it, but does that really matter?

No that does not matter.

It would be nice to have array elements without the starting character, but does it really matter?

No, the most important thing is that any string can be fed to keySetBaseName, and keyBaseName returns what was fed before. With the need of start characters in arrays I can definitely (and happily) live.

[...] and for the command-line we should just add kdb array* commands that take normal integer parameters, e.g. kdb arrayset /my/array 10 a.

Sounds like a good plan.

@kodebach
Copy link
Member

After re-examining everything, I think there is only one conflict. The following to statements cannot be unified.

  1. An unescaped key part may contain any byte that is not zero.
  2. A substring of an unescaped key part can determine a property of the unescaped key part.

In our particular case. The first statement follows from the guarantees for keyBaseName et al. together with the fact the you don't want keyAddBaseName to reject an argument. In addition, the proposal "unesaped key parts starting with # are always array indices" is a special case of 2.

This essentially means to keep 1, which we want to, we becomes necessary to always re-check a full unescaped key part, if you want to test a certain property. In particular, you have to check the full unescaped key part to know that it is an array index. I find this an unfortunate situation, but it seems unavoidable.

This also means that we could still do the whole "escaped key parts consisting entirely of digits are array indices" thing. The side effect would be that in the SSID example keyAddBaseName (key, "#_10") would result in an escaped name ending in /10 instead of /#_10. But since you already accepted the fact that keyAddBaseName (key, "#abc") yields an escaped name ending in /\#abc, I assume this would not be a problem either?
Implementing this would be a bigger task, so it would be a follow-up PR to #3115. I would need to check that everything that works with array indices actually uses the unescaped name, in addition to updating everything that creates keys with array parts.

@markus2330
Copy link
Contributor Author

Actually 2. is not necessarily needed. We can also store a 3rd string which contains the basename to be returned (maybe only then when it is actually needed to not waste too much memory).

Implementing this would be a bigger task

This is not good, we also need to somehow come to an end with this. Maybe the 3rd string will rescue us, we can optimize this later (if it is needed).

Either we make a full list of prioritized requirements in a decision or we meet? The decision will be handy anyway so that people later will understand why we did it like this and which requirements we sacrificed for other requirements.

@kodebach
Copy link
Member

Actually 2. is not necessarily needed.

Like I said, it was a proposal. It would be nice to have, but unfortunately it doesn't work.

We can also store a 3rd string which contains the basename to be returned (maybe only then when it is actually needed to not waste too much memory).

We'd need to store a whole 3rd version of the key name, otherwise keySetBaseName (key, NULL) would work properly.

Right now there are not many properties a key part can have. At least not properties that are interesting across all of Elektra. "Is it an array index part?" is in fact the only one I can think of. If in future we have more properties (e.g. if we add the concept of "contextual placeholders" to the core), we can think about optimizations.

This is not good, we also need to somehow come to an end with this.

I know, which is exactly why I said I won't do it in #3115. If we decide we need it, I'll open another PR. This second PR will be a lot of work and it will be a breaking change. However, it should be a much smaller breaking change and (most) other simultaneous PRs shouldn't be affected as much as they will when we merge #3115.

Either we make a full list of prioritized requirements in a decision or we meet?

I will write documentation, including a decision, when the code of #3115 is done. The open question right now are:

  1. Other than / and \, which characters can have special meaning at the start of a key part?
    In [VeryOld] Keyname overhaul #3115 these currently are #, %, . and @.
  2. What is the special meaning? This can be different depending on the rest of the key part.
    In [VeryOld] Keyname overhaul #3115 we have the following:
    • # means array part, if the rest of the key part is only digits or first n underscores and then n + 1 digits (n >= 0) and nothing else. A part of just # has no special meaning in general (it can mean array globbing to some parts of Elektra). Any other part starting with # means the key name is invalid.
    • % means empty part if the part is just %. Any other part starting with % means the key name is invalid.
    • . has special meaning only if the part is just . or exactly ... All other parts starting with . currently have no special meaning, i.e. the key name is not invalid. (Previously a part starting with . made a key inactive)
    • @ is reserved. Any part starting with @ means the key name is invalid.
  3. Should escaping the special characters from 1 always be allowed or only when required?
    In [VeryOld] Keyname overhaul #3115 escaping is always allowed.
  4. Do we want that key parts consisting entirely of digits are interpreted as array parts? If so, what is the canonical representation? Just digits as well, starting with a # then just digits or the same as the unescaped version, i.e. # and underscores?
    In [VeryOld] Keyname overhaul #3115 we always require array parts to start with # and their canonical representation is the same as the unescaped one.

My suggested answers are:

  1. #, %, ., @, $, & and ? should be enough for possible future uses. In particular I think we should not reserve _ that will only cause problems.
  2. Like we have it in [VeryOld] Keyname overhaul #3115. $, & and ? are reserved for future use like @.
  3. Always allowed.
  4. Personally, I am against that. Having the starting character makes it easier to spot array parts when looking at key names. As suggested in arrays with natural sorting #2698, any collisions with special characters in storage formats should be solved by storage plugins not the core and the kdb tool should just have special array commands.

Slightly unrelated thing: It might be possible to turn directoryvalue into a library (or even make it part of the core) and give it a private interface that allows creating keys that would not be possible through the public API. Then we wouldn't have to worry about collisions because of the name __directoryvalue. One of the reserved characters would be used for that. I haven't thought much about it, but it might be a good idea, since most of the storage formats are good for human readability don't support keys with both a value and keys below.

@markus2330
Copy link
Contributor Author

Other than / and , which characters can have special meaning at the start of a key part?
In #3115 these currently are #, %, . and @.

This is what this issue basically is about: to collect characters which potentially have a special meaning (now or reserved).

Should escaping the special characters from 1 always be allowed or only when required?

Yes, always be allowed.

Do we want that key parts consisting entirely of digits are interpreted as array parts? If so, what is the canonical representation? Just digits as well, starting with a # then just digits or the same as the unescaped version, i.e. # and underscores?

This heavily depends on which properties of base names we lose. From user perspective it is "nice" if no # is needed, from API perspective it is very important that keySet/AddBaseName just works.

In particular I think we should not reserve _ that will only cause problems.

And to change the globbing to *?

Having the starting character makes it easier to spot array parts when looking at key names.

I do not think it is a big issue: numbers within a string are also quite easy to spot:

user/here/is/a/test/1/not/a/number/4/here/is/a/number

Then we wouldn't have to worry about collisions because of the name __directoryvalue.

If correctly implemented we do not need to worry also if implemented as plugin, the plugin could escape already present __directoryvalue.

My dream is that keySetBaseName would escape _directoryvalue (one underscore or @ would be enough), but the directoryvalue plugin would use keyAddName to get a raw _directoryvalue inside the key name. (Which would be protected because underscore or @ is a reserved character not to be used for applications).

@markus2330
Copy link
Contributor Author

Hopefully we can discuss this today, I hope this way we can come easier to conclusion.

@kodebach
Copy link
Member

This heavily depends on which properties of base names we lose.

As I said, we shouldn't loose anything compared to the current state of #3115, but we also wouldn't necessarily gain a lot. The main question is how much more work do we want to invest into the key name change. Lets talk about this today.

And to change the globbing to *?

* already has meaning to the globbing library. It means any key part. # matches just array parts and _ matches non-array parts.

I do not think it is a big issue: numbers within a string are also quite easy to spot

This might just be me, but I didn't spot the /1/ at first.

If correctly implemented we do not need to worry also if implemented as plugin, the plugin could escape already present __directoryvalue.

See #3256 for an alternative solution using metakeys.

My dream is that keySetBaseName would escape _directoryvalue

I am strongly against that. The core should not be concerned in any way with what plugins do, or what requirements they have. It also sets a bad precedent. What if another plugin uses _averyspecialkey for something? Are we going to update keySetBaseName every time this happens?

@markus2330
Copy link
Contributor Author

Outcome of talk:

  • base name properties already seem to be more or less as they were before
  • arrays will continue to start with # (but _ will not be needed)
  • characters above will be escaped if they are on the first position, except: #
  • keySetBaseName will not and cannot be updated for new characters, the above list (when finalized) is a complete list for Elektra 1.0

@kodebach
Copy link
Member

characters above will be escaped if they are on the first position, except: #

The current behaviour (in #3115) of keySetBaseName and keyAddBaseName is defined by this function:

size_t elektraKeyNameEscapePart (const char * part, char ** escapedPart)
{
// TODO (kodebach): array parts
if (!part) return 0;
size_t partLen = strlen (part);
if (partLen == 0)
{
elektraRealloc ((void **) escapedPart, 2);
**escapedPart = '%';
*(*escapedPart + 1) = '\0';
return 1;
}
if (partLen == 1 && part[0] == '.')
{
elektraRealloc ((void **) escapedPart, 3);
**escapedPart = '\\';
*(*escapedPart + 1) = '.';
*(*escapedPart + 2) = '\0';
return 2;
}
if (partLen == 2 && part[0] == '.' && part[1] == '.')
{
elektraRealloc ((void **) escapedPart, 4);
**escapedPart = '\\';
*(*escapedPart + 1) = '.';
*(*escapedPart + 2) = '.';
*(*escapedPart + 3) = '\0';
return 3;
}
if (part[0] == '#' && partLen != 1)
{
size_t underscores = 0;
while (part[1 + underscores] == '_')
{
++underscores;
}
size_t digits = 0;
while (isdigit (part[1 + underscores + digits]))
{
++digits;
}
if (partLen != 1 + underscores + digits || digits == 0 || underscores != digits - 1 || digits > 19 ||
(digits == 19 && strncmp (&part[1 + underscores], "9223372036854775807", 19) > 0) ||
(part[1 + underscores] == '0' && digits > 1))
{
// not a correct array part -> need to escape
elektraRealloc ((void **) escapedPart, partLen + 2);
**escapedPart = '\\';
memcpy (*escapedPart + 1, part, partLen);
*(*escapedPart + partLen + 1) = '\0';
return partLen + 1;
}
}
int specialStart = strchr ("%@/\\", part[0]) != NULL;
size_t special = specialStart ? 1 : 0;
const char * cur = part + 1;
while ((cur = strpbrk (cur, "/\\")) != NULL)
{
++special;
++cur;
}
elektraRealloc ((void **) escapedPart, partLen + special + 1);
char * outPtr = *escapedPart;
if (specialStart)
{
*outPtr++ = '\\';
}
const char * last = part;
while ((cur = strpbrk (last, "/\\")) != NULL)
{
size_t len = cur - last;
memcpy (outPtr, last, len);
outPtr += len;
if (cur > part)
{
// start backslash already added above
*outPtr++ = '\\';
}
*outPtr++ = *cur;
last = cur + 1;
}
size_t len = strlen (last);
memcpy (outPtr, last, len);
outPtr += len;
*outPtr++ = '\0';
return outPtr - *escapedPart - 1;
}

We only escape when it is needed to avoid unnecessary escaping in the escaped name.

Also since you asked, if there is a 1:1 relation between escaped and unescaped key name:

No, there is not. We decided to allow unnecessary escaping:

Should escaping the special characters from 1 always be allowed or only when required?

Yes, always be allowed.

This means that e.g. /key/%abc and /key/\%abc are different escaped key names for the same unescaped key name. Namely the one with cascading namespace and the two parts key and %abc.

If we want to ensure a 1:1 relation, we could either say that escaping is only allowed when required (don't like that), or remove unnecessary backslashes during the canonicalization step.
In any case the 1:1 relation requires that keySetBaseName and keyAddBaseName behave as described above, i.e. it has to know when to escaping is required.

Personally, I don't think the 1:1 relation is required. We only need a n:1 relation to make ksLookup work. It normally shouldn't be a problem, if we look up /key/\%abc and get back /key/%abc or vice versa.

@markus2330
Copy link
Contributor Author

Thank you for investigating the round-trip behavior. Yes, the lookup is not a big deal (it is already 1:n because of cascading) but that keys with different names remove themselves when appending them to the keyset is ugly. And of course that it does not round-trip is another important goal: Every KeySet should after serializing and serializing be exactly the same, and this would not be given if \% and % are "identical" in the internal representation.

Do you maybe have a solution to the ksAppend and round-tripping problem?

@kodebach
Copy link
Member

Its a matter of definition. What uniquely identifies a key?

The technical answer is the unescaped name. It is what a KeySet uses to differentiate between keys.
Users just have to get used to knowing when two key names are the same. It's kind of similar to spotting that the fractions 2/3 and 4/6 are the same.

If you still want the 1:1 relation, escaping must only be allowed when necessary. That is not more effort to implement, but might be annoying to users.

@markus2330
Copy link
Contributor Author

Its a matter of definition. What uniquely identifies a key?

Until now saying "the name identifies a key" was enough because there was 1:1 between escaped and unescaped. They were only stored both as performance improvement. (space vs. time)

I would prefer if it can stay this way.

If you still want the 1:1 relation, escaping must only be allowed when necessary. That is not more effort to implement, but might be annoying to users.

Why would it be annoying?

My idea of \#123 vs. #123 was that it is not the same key name and does not have the same meaning (one means the literal #123, the other is the 124th array entry). They both would have different escaped and unescaped names.

If they now have the same unescaped name, we can simply reject keyAddName(k, "\\#123"). There is no benefit of having it, or is there?

@kodebach
Copy link
Member

kodebach commented Nov 21, 2019

My idea of \#123 vs. #123 was that it is not the same key name and does not have the same meaning (one means the literal #123, the other is the 124th array entry). They both would have different escaped and unescaped names.

That example does not demonstrate anything. Obviously \#123 and #123 have different unescaped versions, otherwise it wouldn't be possible to get the unescaped part #123 in any way.

Arrays are also a bad example, because we said # must introduce a valid array part or be escaped, to avoid easy mistakes. Instead lets look at %.

If the escaped part is /%/, we get an empty unescaped part and if it is /\%/ we get the unescaped part %. Obviously both of these must be valid escaped key parts. So far so good.
The question now is about the parts /%abc/ and /\%abc/. Both map to the unescaped part %abc. Are they both allowed, or is one of them invalid? If both are allowed, we don't have a 1:1 relation.

This is what I meant with:

Should escaping the special characters from 1 always be allowed or only when required?

To which you replied:

Yes, always be allowed.

Which I interpreted as both /%abc/ and /\%abc/ are allowed. I realise now that you might have meant "Yes, always required", i.e. only /\%abc/ is valid.

Anyways...
We have two options here:

  1. A reserved character must introduce a valid part with special meaning, or it must be escaped.
    In the example above this means only /\%abc/ is the valid.
  2. A reserved character must only be escaped, when the part would have special meaning without the escape.
    In the example above this means only /%abc/ is valid.

The choice can be made independently for each of the reserved characters. Only one of these options can be chosen per character to retain the 1:1 relation.


In #3115 it is currently option 1 for # and option 2 for . and %. For @ both options are the same, it always requires escaping. There are no valid parts starting with @. I'll have to check everything else is correct, but I think we have a 1:1 relation in #3115 right now.

@sanssecours
Copy link
Member

@sanssecours can you look into this? If we recognize arrays just because of the structure (names with #0) the KeySet will not round-tip.

I updated the behavior of YAML CPP and the Directory Value plugin in PR 3357. The pull request also updates part of the documentation to take the mandatory array metadata into account.

@kodebach
Copy link
Member

kodebach commented Sep 28, 2020

I have now updated the proposal in #3447. I hope it is now easier to understand (it is still very long).

https://github.com/kodebach/libelektra/blob/dcec6e865bba32f6d83c40c2f711c2e70dde6d62/doc/proposals/keynames.md

@kodebach
Copy link
Member

kodebach commented Oct 2, 2020

@markus2330 I'm gonna put this here even though the previous discussion was in #3223, because I think it fits better.

I just discovered/remembered that in the current version of Elektra [1], it's already the case (and implemented that way) that "any ESCAPED Key Name part that start with @ makes the whole Key invalid". This means we could -- without implementing any of the other changes from the proposal -- use field names starting with @ for special purposes in storage plugins, like the proposal suggests with $meta and $value.

@bauhaus93 This might help you when you implement and metadata in toml (and also for non-leaf value keys, to some extent).

For those that haven't read the proposal, the idea would essentially be (for JSON):

{
   "key": {
        "@value": "127.0.0.1",
        "sub": "xyz",
        "@meta": {
            "check/ipaddr": "ipv4"
      }
}

Turns into two keys:

key = 127.0.0.1
key/sub = xyz

and key also has this metadata:

check/ipaddr = ipv4

Since both key/@value and key/@meta are reserved key names (or even invalid in #3447), this cannot cause any conflicts.


[1] master only mentions this in the documentation, but does let you create such Keys. #3447 on the other hand forbids these Keys entirely requires that @ is escaped at the start of key name parts in escaped key names. The documentation also has a similar provision for _, but that is not enforced in #3447 either.

@kodebach
Copy link
Member

kodebach commented Oct 3, 2020

Sorry for the confusion, but I had to clarify parts of the comment above. The part about the @ only applies to the escaped Key Name. The unescaped name (as a C-String) "\01\00key\00@meta\00" is still valid in #3447, like it ever was. This means @meta cannot be used for metadata without further considerations in the storage plugin.

However, since it has always been a reserved name and we will require escaping in #3447, I think it is still fair to use it in a storage plugin. IMO a user should not expect that a reserved name can always be used without issues. But of course we will need to take some precautions. One possibility would be this (JSON again):

{
   "key": {
        "@value": "127.0.0.1",
        "@sub": "xyz",
        "@meta": {
            "check/ipaddr": "ipv4"
      }
}

Turns into two keys (note the \ for escaping):

key = 127.0.0.1
key/\@sub = xyz

But this

{
   "key": {
        "@@value": "127.0.0.1",
        "@@sub": "xyz",
        "@@@sub": "abc",
        "@@meta": {
            "check/ipaddr": "ipv4"
      }
}

Turns into (again \ for escaping):

key/\@value = 127.0.0.1
key/\@sub = xyz
key/\@@sub = abc
key/\@meta/check\/ipaddr = ipv4

The rules here would be:

  • field name does not start with an @ -> use the field name and add with keyAddBaseName
  • field name starts with more than one @ -> remove a first @ and use the rest for keyAddBaseName
  • field name starts with a single @ -> check if field name has special meaning (e.g. @value or @meta):
    • if it has special meaning, process accordingly
    • otherwise just use the field name for keyAddBaseName (e.g. @sub)

(Note: keyAddBaseName takes an unescaped name and does the required escaping to keep the key name valid)

@markus2330
Copy link
Contributor Author

I do not understand the aim of the last two comments, is this a proposal about the directoryvalue plugin replacement?

(note the \ for escaping)

For @ we usually used @ for escaping, so to simply discard the first @ if there are several. Would be good if this is also documented in the key name description. We could change to \ (if it makes the escaping for key names in general easier understandable) but this needs changes at least in the base64 plugin. (null plugin gets deleted in #3491)

@kodebach
Copy link
Member

kodebach commented Oct 3, 2020

I do not understand the aim of the last two comments, is this a proposal about the directoryvalue plugin replacement?

The @value part is yes. But it is more general. @meta could be used for storing metadata inside a format with a single nested object structure, like JSON or TOML (still not sure about YAML, because tags).
Basically, because everything in the JSON file refers to some key somewhere, there is simply no place to store metadata. Unless impose some limitations on the structure of the JSON file, but then you probably can't just mount some config file. [1]

Like I said, you need to impose restrictions to fit metadata into JSON (AFAIK the same goes for TOML). So yes, my proposal in the comments above does impose a restriction on the JSON files, but applications that use @ at the start of a JSON field are probably quite rare.

For @ we usually used @ for escaping, so to simply discard the first @ if there are several. Would be good if this is also documented in the key name description. We could change to \ (if it makes the escaping for key names in general easier understandable) but this needs changes at least in the base64 plugin.

Not sure how the base64 plugin is relevant here, but like I to be very clear:

On the current master (6a1535c) keyNew ("/elektra/@abc", KEY_END) works and creates an unescaped name with the parts elektra and @abc.
In #3447 keyNew ("/elektra/@abc", KEY_END) returns NULL because the name is deemed invalid, but keyNew ("/elektra/\@abc", KEY_END) works and creates an unescaped name with the parts elektra and @abc.
Not sure what keyNew ("/elektra/\@abc", KEY_END) does on master.

keyAddBaseName (k, "@abc") works in both cases, the only difference is what escaped name it creates.

I also don't remember, why I made this change. There probably was a good reason, but in any case it is done now and I don't think changing back would be easy.


[1] As always, the problem here is that Elektra wants to do everything and please everyone. On the one hand we want to allow storage plugins that can read arbitrary files in a standard format (e.g. JSON) and turn them into KeySets. On the other hand we want to allow applications to use arbitrary strings as a part of a key name. This might work for some time, but at least when you get to metadata it breaks for certain formats.

JSON is a single tree with values only at the leaf nodes. KeySets are a tree with values at every node, and some of the node have a link to whole separate tree as well. You might be able to convert between the two, but one of them must impose limitations on the other to have a lossless conversion.

@mpranj
Copy link
Member

mpranj commented Oct 3, 2020

Like I said, you need to impose restrictions to fit metadata into

@kodebach does this mean you are in favour of restricting the formats s.t. metadata can be stored?

I never stumbled upon this, because I mostly used dump and then mmapstorage and only rarely imported/exported configs using other plugins. This is quite unfortunate. As a user I would rather be able to import/export to standard spec (losing metadata) while storing the real data (incl. metadata) using an arbitrary elektra internal format.

I think restricting the format could be confusing to users. For example, if Elektra provides a non-experimental JSON plugin, then I would expect it to be able to handle arbitrary spec-conforming inputs. It's unclear to me how the user would be able to check that his config files are conforming to the restricted spec.

@markus2330 what do you think about imposing restrictions on the format?

EDIT: I agree that "Elektra wants to do everything and please everyone" is a problem.

@kodebach
Copy link
Member

kodebach commented Oct 3, 2020

does this mean you are in favour of restricting the formats s.t. metadata can be stored?

We need to define our priorities. What exactly is an absolute must, and where can things change. As it is right now, the only way for many formats is to store metadata in comments (@bauhaus93 did that in #3500 for TOML). IMO this is not a solution, but merely a clunky work around. The worst part is that is abuses the one part that should never be touched by a parser and can drastically affect the readability of a storage file. Some formats (namely JSON) also do not have comments.

Very basic restrictions we could impose (using JSON as an example again) would be that all files must look like this:

{
    "kdb": {
       "elektra": {
          "version": 1
        }
    },
    "meta": {
       "elektra/version": {
           "type": "long"
       }
    }
}

Other "restrictions" include interpreting certain key names like @meta in special ways and providing escape methods.

The kdb object contains the actual Keys and meta contains the metadata. (Note: to solve non-leaf keys with values, you'd need another object)

Instead of a two JSON objects we could also use separate files. But that would need support from at least the resolver and maybe even the backend code/plugin. Separate files would make it possible to import standard JSON files, without any metadata. How non-leaf keys with values would be handled in this case is not clear.

As a user I would rather be able to import/export to standard spec (losing metadata) while storing the real data (incl. metadata) using an arbitrary elektra internal format.

I agree, but importing/exporting everything goes a bit against the mountpoint idea of Elektra.

I already know, nobody wants to implement this, but there is a very general solution:

If a mountpoint uses a storage plugin that doesn't not support all of Elektra's features, the backend plugin creates a secondary fallback storage file in a format that does support all features (e.g. quickdump, mmapstorage seems like a waste of disk space here). When writing such a mountpoint, we simply write with both plugins. When reading we first read the secondary fallback file and then the primary file. This would require a merge process, but we will take the primary file as a source of through so the merge is quite simple.

I think restricting the format could be confusing to users.

Definitely. But it is clear to me that currently Elektra allows to much and something has to be restricted.

Even the solution with multiple files requires some kind of restriction to config file names to avoid collisions.

Originally I thought the easiest solution would be to restrict Elektra's Key Names, since that is the part we have the most authority over. But apparently there are applications that use WiFi SSIDs [1] directly as a Key Name Part, so that doesn't work. At least not without huge effort, as my proposal showed...
IMO it would still be the best solution for all of this to just say "Key Name Parts are not arbitrary sequences of non-zero bytes anymore" and be done with it. For comparison, POSIX only allows the following characters in filenames:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 . _ -

And that seems to be enough for people. All of the file systems listed on Wikipedia (except the ones without directories) have some kind of restriction on the filenames.

I want to stress again: All of this boils down to the fact that we have taken "Key Name Parts are arbitrary sequences of non-zero bytes" as an axiom [2]. Changing this would be every easy and it would fix everything. It could even be a tiny restriction e.g. "Key Name Parts are arbitrary sequences of non-zero bytes, except those that both start with @$% and end in !%& at the same time".
Just reserving a single Key Name Part e.g. "Key Name Parts are arbitrary sequences of non-zero bytes, except !@& is not", would give us lots of possibilities. Heck, even the restriction "the last Key Name Part in a Key Name must not end with @%&" should suffice. Any kind of restriction [3] no matter how tiny would give us leeway for implementing all sorts of things.
Of course, many restrictions make solutions possible, but good and usable solution might require bigger restrictions.

Sorry, if this seems like a rant, but at some point it just gets really frustrating that everyone agrees there is a problem, but every solution is deemed to have a flaw or is seen as too complicated.


[1] According to Wikipedia until the 802.11-2012 a WiFi SSID was "zero to 32 octets (32 bytes) long" including zero-bytes and "Wireless network stacks must still be prepared to handle arbitrary values in the SSID field", even though SSIDs must now be valid UTF-8. So just keyAddBaseNameing a WiFi SSID without any additional logic would not be standards compliant anyway.

[2] That is actually, why the new documentation in #3447 starts with Key Name Parts. It is easiest (and IMO the most logical) to explain the current Key Names by starting with Key Name Parts and going from there. But the old way of

take a Unix/POSIX file system path and add a namespace, but also there is this special thing like empty parts and there are escape sequences, and then there is the unescaped version -- it is semi-hidden but should probably be used for everything other than printing -- oh and arrays are a thing, also some parts are reserved but not actually reserved reserved so we don't actually use them for anything low-level, ....

shows a lot better how insanely complex and complicated Key Names are.

[3] In case you are wondering, yes #3447 imposes some new restrictions, but those are useless, because they do not affect the Unescaped Key Name (except for the namespace) in any way.

@mpranj
Copy link
Member

mpranj commented Oct 4, 2020

Separate files would make it possible to import standard JSON files, without any metadata. How non-leaf keys with values would be handled in this case is not clear.

We discussed this regarding the cache implementation and decided against it. The current implementation of the resolver has stronger consistency guarantees than can be achieved when using two files. The resolver uses the atomic rename() function to ensure that a process either reads the old or the new file, but always a consistent one. Unfortunately this cannot (to my knowledge) be done with multiple files while having the same guarantees. I would be reluctant to use multiple files where we need to ensure consistency.

I agree, but importing/exporting everything goes a bit against the mountpoint idea of Elektra.

True.

All of this boils down to the fact that we have taken "Key Name Parts are arbitrary sequences of non-zero bytes" as an axiom [2].

I agree, this is the part where restrictions make the most sense. As you've said, filenames are quite restricted and it still works really well. Unless anyone knows a good reason not to restrict Key names (maybe @markus2330)?

Sorry, if this seems like a rant

No, and thank you for discussing this. Unless @markus2330 knows some reasons why we can not change this, I am fully open to your suggestion (like restricting !@&).

@kodebach
Copy link
Member

kodebach commented Oct 4, 2020

I would be reluctant to use multiple files where we need to ensure consistency.

That is a good reason, against multiple files yes...

I am fully open to your suggestion (like restricting !@&).

More realistically we could impose the restriction: "A Key Name must not contain the Key Name Part @elektra" (possibly a more obscure prefix, if that helps). I think using elektra might make the whole thing a bit more reasonable to users.

Personally, I think we should leave room for the future, but if restricting all Key Names is too much, we could also restrict what kdbGet can return and what kdbSet can accept. So the restriction above would only apply to what the last plugin during kdbGet returns and to the KeySet given to kdbSet. Basically we'd only restrict what Keys can be stored from an application's point of view. This would be similar to what we already have with system:/elektra (although this time we'd actually enforce the restriction).

@kodebach kodebach mentioned this issue Oct 12, 2020
16 tasks
@markus2330
Copy link
Contributor Author

Thank you very much for these valuable discussions. I started #3514 to summarize the important decisions scattered around everywhere. Please prefer to directly comment on the text in #3514, so that we can come to conclusions.

The @value part is yes. But it is more general. @meta could be used for storing metadata inside a format with a single nested object structure, like JSON or TOML (still not sure about YAML, because tags).
Basically, because everything in the JSON file refers to some key somewhere, there is simply no place to store metadata.

I do not think anymore that it should be our goal to circumvent restrictions of file formats. If JSON does not support meta-data and non-leaf values, Elektra with a JSON file mounted, simply forwards its restrictions on KeySets. We can document them, like proposed in #3504 and the storage plugins should fail on such KeySets but nothing more. (Of course someone can come up with a super-JSON file format with less restrictions but this is beyond our current man power and as such irrelevant.)

but applications that use @ at the start of a JSON field are probably quite rare.

Who knows?

keyAddBaseName (k, "@abc") works in both cases, the only difference is what escaped name it creates.

Sounds good.

but at least when you get to metadata it breaks for certain formats.

Yes, we need to drop the idea that any format can support anything.

It's unclear to me how the user would be able to check that his config files are conforming to the restricted spec.

I think that most applications will not be very demanding anyway (e.g. only have leaf-values and do not need any meta-data except of type), so all we discuss here is only relevant to exceptional applications. For these applications, I think it is best that application developers develop their own storage plugins or they use our best storage plugins, which support such things.

@markus2330 what do you think about imposing restrictions on the format?

It is not a good idea. We should support configuration file formats as standardized. It can be a big problem if some valid JSON/XML files get rejected by Elektra. But on the other hand, we do not need to support anything more than standardized. To try this was a mistake from the past (trying to please anybody).

We need to define our priorities. What exactly is an absolute must, and where can things change.

doc/GOALS.md in #3514

The worst part is that is abuses the one part that should never be touched by a parser and can drastically affect the readability of a storage file.

I agree it is not the best idea, it already had plenty of problems in the ini plugin. badcec5

Instead of a two JSON objects we could also use separate files.

See below.

For comparison, POSIX only allows the following characters in filenames:

This is the "portable Filename Character Set" and I doubt that there is any file system in use anymore which only supports this. In my experience any interpretation of the file names only calls for trouble (e.g. JFS had such features).

And that seems to be enough for people.

For Elektra's source it nearly is #1615 😉

But real applications immediately wanted more... (e.g. spaces, umlauts, ....)

Changing this would be every easy and it would fix everything.

It is not as simple as that: it makes storage plugins much more complicated. It simply moves logic to other places.

except those that both start with @$% and end in !%& at the same time

This idea is similar to CDATA. It is not a pretty solution and unhandy to use in practice.

HERE documents are a bit better, as the user can define the escape sequence.

But for what feature exactly do you think we need it? I think we should rather forget about these features and properly use meta-data to give semantics to keys.

Sorry, if this seems like a rant, but at some point it just gets really frustrating that everyone agrees there is a problem, but every solution is deemed to have a flaw or is seen as too complicated.

My point is: maybe we do not need a solution because the problem does not exist.

must now be valid UTF-8

Can they still have null-characters? Btw. I came up with the SSID example because this was a real requirement of a real customer (broadcom). Null-characters were not required, it is obvious that a char* without size cannot support this, so we did not even discuss about this.

shows a lot better how insanely complex and complicated Key Names are.

Yes, but better they are like this than to have such logic in every application and storage plugin.

In case you are wondering, yes #3447 imposes some new restriction

Yes, the key name can have any restrictions. There always was the restriction of dangling escape, so a few more restrictions do not matter. #3447 will be a huge improvement for Elektra.

Unfortunately this cannot (to my knowledge) be done with multiple files while having the same guarantees

It can be done by writing a journal log where you write everything you plan to do and in case of any problems you can recover by replaying the journal. To implement this is very much beyond our man power. It also does not really fit to a configuration system (nor file systems), it is a technique for structured databases.

I made it a non-goal in multiple_file_backends.md 59066aa

Basically we'd only restrict what Keys can be stored from an application's point of view. This would be similar to what we already have with system:/elektra (although this time we'd actually enforce the restriction).

We can enforce the restriction of system:/elektra. But why do you want to restrict what applications are storing in their part of their hierarchy?

@kodebach
Copy link
Member

I think it is best that application developers develop their own storage plugins

Thereby defeating one of the main benefits of Elektra: "The admin/user decides the config format not the application developer".

It can be a big problem if some valid JSON/XML files get rejected by Elektra

That was never my suggestion. Of course we should accept every valid JSON. The question is what KeySet do we produce. If we use "@value" for non-leaf values as suggested above, we could still accept every valid JSON. The resulting KeySet would just have a slightly different structure some of the time.

For comparison, POSIX only allows the following characters in filenames:

This is the "portable Filename Character Set" and I doubt that there is any file system in use anymore which only supports this. In my experience any interpretation of the file names only calls for trouble (e.g. JFS had such features).

And that seems to be enough for people.

For Elektra's source it nearly is #1615

But real applications immediately wanted more... (e.g. spaces, umlauts, ....)

They still don't allow / or empty names. Most also have maximum lengths for file names and for paths. The point is applications find ways to work around limitations. And in our case we can even help them with libraries.

It is not as simple as that: it makes storage plugins much more complicated. It simply moves logic to other places.

I do not agree. Storage plugins are already extremely complex, relatively speaking this would only be a minor inconvenience. (assuming we provide a good storage library with helper functions)

This idea is similar to CDATA. It is not a pretty solution and unhandy to use in practice.

HERE documents are a bit better, as the user can define the escape sequence.

Both CDATA and HERE documents need to reserve some syntax, from Elektra's current point of view there is no difference. Both versions are a restriction on key names.

Also both are much more powerful than what I proposed. My proposal would only "escape" a full key name part at a time.

I think we should rather forget about these features and properly use meta-data to give semantics to keys.

You just said, that some storage plugins might not support metadata. So it seems this is not an option.

My point is: maybe we do not need a solution because the problem does not exist.

Of course, if you just change your mind and say a storage format can limit Elektra's abilities the problem goes away. But that fundamentally changes Elektra's promise to users. We would limit all users just to allow some developer to safe a few lines of code dealing with encoding special values.

Can they still have null-characters?

Probably, NUL is a valid UTF-8 character.

without size cannot support this

Well, the maximum size of an SSID is 32 bytes, so you don't have to pass the size arround, as long as the buffer is always 32 bytes and zero-padded.

Yes, the key name can have any restrictions.

Please make up your mind. This whole time you have argued that there cannot be any restrictions.

There always was the restriction of dangling escape, so a few more restrictions do not matter.

"Dangling escape" is not a restriction. It is a syntax error in the escaped key name. Escaped key names never were arbitrary strings. This is a totally different thing.

We can enforce the restriction of system:/elektra.

We could require a flag in keyNew to allow such names. Otherwise an application that accepts key names from the user, would have to sanitize the input. See also the kdb import problem in #3299.

But why do you want to restrict what applications are storing in their part of their hierarchy?

I don't want to restrict what applications are store, but how they are storing it. This gives (storage) plugins more freedom.

@markus2330
Copy link
Contributor Author

Thereby defeating one of the main benefits of Elektra: "The admin/user decides the config format not the application developer".

It was never possible that any storage plugin can be used for any application. This does not mean that we withdraw the power of admin/user to change mountpoints. But we need to be honest and give a big disclaimer that doing so involves some risks.

That was never my suggestion. Of course we should accept every valid JSON. The question is what KeySet do we produce. If we use "@value" for non-leaf values as suggested above, we could still accept every valid JSON. The resulting KeySet would just have a slightly different structure some of the time.

On this problem we already wasted extremely many hours in about 10 years, without any progress. The first idea was keytometa, then directoryvalue, with many steps in between. Why not accepting JSON as it is? It will work for most applications. Is a slightly better structure really worth it?

The point is applications find ways to work around limitations.

Applications even find ways to work around the absence of a configuration library. Why would they use a configuration library which gives them limitations they currently do not have? It is quite trivial to implement a property file that allows any key names with any values. Why would Elektra be of any use if it does not even deliver this?

And in our case we can even help them with libraries.

First we need to clarify the goals before we make complex solutions.

I do not agree. Storage plugins are already extremely complex,

Did you look at simpleini, ni, c or mini? You can still easily write a useful storage plugin within less than a day.

relatively speaking this would only be a minor inconvenience. (assuming we provide a good storage library with helper functions)

Unfortunately the assumption does not hold. We also tried to create helper libraries since many years, with very little success. What is done in the core properly always was the best solution.

Both CDATA and HERE documents need to reserve some syntax, from Elektra's current point of view there is no difference. Both versions are a restriction on key names.

No, HERE documents do not reserve any syntax. You can decide on-the-fly (e.g. based on the data) which keyword you want to use to mark the end of the document.

I think we should rather forget about these features and properly use meta-data to give semantics to keys.
You just said, that some storage plugins might not support metadata. So it seems this is not an option.

The proper way is to specify keys in the namespace spec. The storage plugins in the other namespaces then do not need to store any metadata. See arbitrary_metadata.md decision in #3514

Of course, if you just change your mind and say a storage format can limit Elektra's abilities the problem goes away. But that fundamentally changes Elektra's promise to users. We would limit all users just to allow some developer to safe a few lines of code dealing with encoding special values.

I separated this question to #3515

Yes, the key name can have any restrictions.
Please make up your mind. This whole time you have argued that there cannot be any restrictions.

Key names aka keySetName can have restrictions (i.e. fail for some arguments), key part names aka keySetBaseName cannot have restrictions (never fails on any argument). I am not fanatic nor dogmatic, we can discuss this and everything else if you see a fundamental problem with our most important goals.

We can enforce the restriction of system:/elektra.
We could require a flag in keyNew to allow such names. Otherwise an application that accepts key names from the user, would have to sanitize the input.

I was thinking of a plugin that rejects anything written to /elektra which do not confirm to what tools are allowed to write there. I think it is low-priority, though.

@kodebach
Copy link
Member

No, HERE documents do not reserve any syntax. You can decide on-the-fly (e.g. based on the data) which keyword you want to use to mark the end of the document.

Yes, they do << is still reserved for starting HERE documents. The end-marker is user-defined, but the start obviously can't be.
Anyway, this is not important.


It was never possible that any storage plugin can be used for any application.

It's fine to me, if not all plugins support everything, but if an application uses a custom storage plugin the admin/user doesn't have any choice. There should be at least some choice, even if it is just e.g. YAML or TOML. If only a custom format tailored to Elektra supports everything, then I fear no one will use Elektra. To me one of the most important features of Elektra is that a amdin/user can always use their favourite config format no matter what application.

Did you look at simpleini, ni, c or mini? You can still easily write a useful storage plugin within less than a day.

c is write-only. mini doesn't support metadata and AFAICT deletes all comments from a file. simpleini also doesn't support metadata and lists a lot more restrictions in the README.

In general INI plugins are not a good example, because INI does not really have a specification, so there are not limitations coming from the format.

ni is based on an existing library. That's why the code is quite short and simple.

But ni is actually a very good example of what I'd like to do. It uses a very specific interpretation of INI files that might not be want application developers want. This actually raises another question.

What is the more important use case?

  1. Converting an existing application to Elektra and keeping the config files the same
  2. Writing new applications whose config format will be based on what fits best for Elektra

For the first case, we need plugins that can read any file and produce useful KeySets.
For the second case, even severe restriction like those imposed by ni may not be a problem.

The proper way is to specify keys in the namespace spec. The storage plugins in the other namespaces then do not need to store any metadata.

Yes, this is a possible solution. But it requires a lot of work to make spec work properly. At the very least, we need some guarantees of plugin ordering within the postgetstorage position.

But it causes some difficulties with metadata inferred from the storage format (e.g. type, array). Would it be an error, if we read a JSON array that has no matching array metadata in spec?

However, this doesn't solve the problem for non-leaf keys with values.

Key names aka keySetName can have restrictions (i.e. fail for some arguments), key part names aka keySetBaseName cannot have restrictions (never fails on any argument).

These are two entirely different things. keySetName accepts an escaped key name, while keySetBaseName accepts a part of an unescaped key name. Of course an escaped key name has restrictions.

However, everything around Elektra's key names is based on a key name part (see also the docu in #3447). So the only question we need to discuss is:

What is a key name part?

@markus2330
Copy link
Contributor Author

It's fine to me, if not all plugins support everything, but if an application uses a custom storage plugin the admin/user doesn't have any choice.

I do not see how this would be even possible. In Elektra you can always implement alternatives as long this alternative yields suitable KeySets for the application. The only thing fixed is the KeySet but how the KeySet is produced is always something interchangeable. For me this is the main benefit of Elektra.

That is why the KeySet (including the key names) are the most important (and most difficult) decisions of all.

There should be at least some choice, even if it is just e.g. YAML or TOML. If only a custom format tailored to Elektra supports everything, then I fear no one will use Elektra. To me one of the most important features of Elektra is that a amdin/user can always use their favourite config format no matter what application.

I do not see how any of the decisions we are talking about effects the possibility. If you put enough effort into a storage plugin for your favorite config format (including syntactical extensions or similar) you will be able to get any application running.

mini doesn't support metadata and AFAICT deletes all comments from a file. simpleini also doesn't support metadata and lists a lot more restrictions in the README.

For most applications these restrictions are irrelevant. One tool I use works with simpleini and also mini as backends.

I think we landed already nearly at the same conclusion: That whatever happens to yield the correct KeySet, need to happen within the storage plugin. Using a "storage library" is of course advantageous but in the end it is an implementation detail. The idea that "later plugins fix what the storage plugin made wrong" failed. So why do we need further syntax in the key name? Whatever syntax is needed can be eliminated within the storage plugin.

  1. Converting an existing application to Elektra and keeping the config files the same
  2. Writing new applications whose config format will be based on what fits best for Elektra

Imho "1." is more important now, as we need this for KDE. So we somehow need Elektra in a shape to not destroy the kconfig plugin.

About "2." I do not know what "fits best for Elektra" means. Elektra has no purpose for itself, its only purpose is to serve the needs of developers, users and admins.

But it causes some difficulties with metadata inferred from the storage format (e.g. type, array). Would it be an error, if we read a JSON array that has no matching array metadata in spec?

No, only the other way is an error: if the spec requires an array but there is none.

Yes, this is a possible solution. But it requires a lot of work to make spec work properly. At the very least, we need some guarantees of plugin ordering within the postgetstorage position.

Yes, I totally agree. And something needs to be done with spec anyway, we cannot leave it as is.

However, this doesn't solve the problem for non-leaf keys with values.

You argued that this is not important anyway 😜

And I agree, for nearly all applications it will not be important.

What is a key name part?

Can you make an issue that lists the open/unclear points? Or even better: add it to #3514?

@kodebach
Copy link
Member

It's fine to me, if not all plugins support everything, but if an application uses a custom storage plugin the admin/user doesn't have any choice.

I do not see how this would be even possible. In Elektra you can always implement alternatives as long this alternative yields suitable KeySets for the application. The only thing fixed is the KeySet but how the KeySet is produced is always something interchangeable. For me this is the main benefit of Elektra.

That is why the KeySet (including the key names) are the most important (and most difficult) decisions of all.

There should be at least some choice, even if it is just e.g. YAML or TOML. If only a custom format tailored to Elektra supports everything, then I fear no one will use Elektra. To me one of the most important features of Elektra is that a amdin/user can always use their favourite config format no matter what application.

I do not see how any of the decisions we are talking about effects the possibility. If you put enough effort into a storage plugin for your favorite config format (including syntactical extensions or similar) you will be able to get any application running.

I should have added more context to my response. I think we are on the page in this regard. My comment was in response to this statement from you:

I think that most applications will not be very demanding anyway (e.g. only have leaf-values and do not need any meta-data except of type), so all we discuss here is only relevant to exceptional applications. For these applications, I think it is best that application developers develop their own storage plugins or they use our best storage plugins, which support such things.

In particular the part: "I think it is best that application developers develop their own storage plugins"

While this will always be an option, I think it should never be something we recommend or even suggest as a solution. Using a custom storage plugin would not be in line with Elektra's philosophy and should be avoided as far as possible. Note: by "custom storage plugin" I mean a plugin that is designed to work with a specific application but makes no guarantees to be compatible with the rest of Elektra. It would be totally okay, if someone wants to use a custom format and develops a corresponding plugin that is compatible with the standard Elektra environment.

I think we landed already nearly at the same conclusion: That whatever happens to yield the correct KeySet, need to happen within the storage plugin. Using a "storage library" is of course advantageous but in the end it is an implementation detail. The idea that "later plugins fix what the storage plugin made wrong" failed.

Yes, that is basically it. The storage plugin cannot rely on other plugins (except maybe binary values and such things).

Although maybe a plugin is actually the solution after all. I'll have to think about it further and will post the idea, if I think it works.

But it causes some difficulties with metadata inferred from the storage format (e.g. type, array). Would it be an error, if we read a JSON array that has no matching array metadata in spec?

No, only the other way is an error: if the spec requires an array but there is none.

Let's go a step further. What if we have this JSON [ "a", "b" ] with no spec and export it into a format without metadata and without arrays (e.g. mini)? The information that this is an array would be lost.

As this is an edge case, we might just say: This is unfortunate, but it is what it is.

And I agree, for nearly all applications it will not be important.

The same can be said about using arbitrary key names. Nearly all applications wouldn't have a problem, if they can't use /@elektra/ in their key names.

Can you make an issue that lists the open/unclear points? Or even better: add it to #3514?

I will make a list somewhere, so we can discuss it on Monday.

@markus2330
Copy link
Contributor Author

It would be totally okay, if someone wants to use a custom format and develops a corresponding plugin that is compatible with the standard Elektra environment.

I was actually thinking along these lines. But yes, particularities easily creep in. E.g. the kconfig plugin the metadata can only be a single character long so it will be obviously not very compatible to our metadata (at least as long as the parser does not remap this character to the actual meaning, which at the moment does not happen). At the moment we simply pass the one-char metadata through our layer and, if needed, do in the kconfig library what would have happened if it would have directly parsed the file.

@mpranj did you check already if this actually works? At least [$e] seems to exist quite often.

Although maybe a plugin is actually the solution after all. I'll have to think about it further and will post the idea, if I think it works.

At the moment I would prefer if we fix the things where we already know how to fix them.

As this is an edge case, we might just say: This is unfortunate, but it is what it is.

Exactly, if you want "feature X" you cannot have a plugin without this feature in the import/export chain. The same with comments and so on.

non-leaf keys with values.
The same can be said about using arbitrary key names. Nearly all applications wouldn't have a problem, if they can't use /@elektra/ in their key names.

Yes. But there is also a big difference between something that does not work with some plugins and something that will never work at all with Elektra.

I will make a list somewhere, so we can discuss it on Monday.

👍

@kodebach
Copy link
Member

But there is also a big difference between something that does not work with some plugins and something that will never work at all with Elektra.

I get your point, but realistically how often will not having /@elektra/ keys actually be a problem?

Although maybe a plugin is actually the solution after all. I'll have to think about it further and will post the idea, if I think it works.

At the moment I would prefer if we fix the things where we already know how to fix them.

The idea was a plugin that is always mounted and does some encoding/decoding of key names. Then a storage plugin only has to deal with KeySets that e.g. don't contain /@elektra/ keys. But because of the encoding the application can still use any key name. The storage plugin can then use /@elektra/ keys for its own purpose, e.g. metadata and non-leaf values.

The effect would be a small limitation on storage files, but no limitation for applications. The complexity of storage plugins would also be unaffected, because the storage plugin doesn't have to use any /@elektra/ keys. Only those plugins that want to support e.g. metadata have to write additional code, but that is always the case.

@markus2330
Copy link
Contributor Author

I get your point, but realistically how often will not having /@elektra/ keys actually be a problem?

Probably no problem at all. But it will be a problem now as much as it will be later. So why reserve it now and not when we really find a use case where something like this is needed? Then we know it is not only a fantasy and can choose an appropriate name.

In programming languages we (usually) only reserve key words when we actually do syntactical extensions.

e.g. metadata

For metadata on keys, spec is the better solution as it refers to all namespaces. Non-spec metadata is basically only useful formatting preserving and similar.

because the storage plugin doesn't have to use any /@elektra/ keys

It is not that a storage plugin uses any name, the question is the behavior if it encounters @elektra = something in a file.

@kodebach
Copy link
Member

So why reserve it now and not when we really find a use case where something like this is needed?

Because this is a huge breaking change and before 1.0 is the right moment to do such things. Especially since are already changing key names.

There also is a very real use-case for it right now. At the very least, it allows storing non-leaf keys with values in any format.

For metadata on keys, spec is the better solution as it refers to all namespaces.

I'll just mention array. Think a bit about the stuff in #3514 and then tell me how to properly store an array in a Java properties file.

It would be nice, if metadata was only stored in spec:/ (that would also solve some problem with the spec plugin), but would require some very big changes to Elektra to make it work well.

the question is the behavior if it encounters @elektra = something in a file

If the plugin uses @elektra to encode something (e.g. non-leaf values) than that's what happens.
Otherwise: "ERROR: illegal key name"

@kodebach
Copy link
Member

I will make a list somewhere, so we can discuss it on Monday.

As promised in #3223 (comment), I now added #3517.

@markus2330
Copy link
Contributor Author

tell me how to properly store an array in a Java properties file.

It only works with spec.

but would require some very big changes to Elektra to make it work well.

Are there some changes needed which are not yet in the issue tracker? I think it is a must-have for 1.0 that at least the basics of spec work well (and to get rid of the not-working stuff).

@kodebach
Copy link
Member

It only works with spec.

But how? You need the array metadata in the other namespaces to know how big the array actually is.

Are there some changes needed which are not yet in the issue tracker? I think it is a must-have for 1.0 that at least the basics of spec work well (and to get rid of the not-working stuff).

It's impossible to say. We need to decide what spec should actually do. Then we need to find out, if and how that is possible. Even then it is very likely that we find some edge cases that don't work. Elektra is just so complicated that it is nearly impossible reason about and good test coverage is at least as hard. A lot of Elektra is also still based on loosely or not at all defined principles.

The biggest problem are definitely the current plugin positions. But those will take a lot of work to fix. IMO that part essentially needs a complete redesign.

@markus2330
Copy link
Contributor Author

markus2330 commented Oct 20, 2020

But how? You need the array metadata in the other namespaces to know how big the array actually is.

Yes, spec would really need to understand the semantics of the metadata, like arrays. So it would check all namespaces how the array actually looks like, and then add the proper metadata array.

Elektra is just so complicated that it is nearly impossible reason about and good test coverage is at least as hard.

I agree, so let it make more simple. Now is the perfect opportunity. We learned a lot about what cannot be done. So let us get rid of that stuff.

The biggest problem are definitely the current plugin positions. But those will take a lot of work to fix. IMO that part essentially needs a complete redesign.

I am more and more agreeing with @mpranj that hooks for specific global plugins is not such a bad idea. The requirements for such plugins are quite complicated and thus generic positioning would become too complicated.

So we simply would not have any generic global plugin positions but only ones specific for mmap, spec and notification (the only one we really need for 1.0). Generic positions would be nice, but it is clearly beyond our capacity to test this. I modified the decision in 89b2903

Further discussions please in #3514

@kodebach kodebach mentioned this issue Oct 20, 2020
16 tasks
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants