Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EQL: Replace ?" and ?' with """ #62645

Closed
costin opened this issue Sep 18, 2020 · 21 comments
Closed

EQL: Replace ?" and ?' with """ #62645

costin opened this issue Sep 18, 2020 · 21 comments
Assignees
Labels
:Analytics/EQL EQL querying >breaking Team:QL (Deprecated) Meta label for query languages team

Comments

@costin
Copy link
Member

costin commented Sep 18, 2020

As part of #61659, the topic of raw strings was brought up.
Currently ?' and ?" are used for defining a raw/non-escaped string literal. This is problematic because:

  • usage of single quote has been deprecated/removed ' and thus should ?'
  • ?" makes it hard to define regex/patterns containing double quotes
  • ? is a valid regex character and while it is not conflicting, it can be confusing to the user:
    ?"?ab"

The proposal is to replace raw-string declaration with triple quotes, """ since the same character (double quotes ") is used and the risk of clashes (and thus forcing escaping) is minimal.

Based on the info in #61659, there are 21 rules, 0.8% that use ?" and 8 rules/0.33% that use ?' from an total of 2389 rules.
That's a total of 29 rules or 1.13% that will be affected by this change.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-ql (:Query Languages/EQL)

@elasticmachine elasticmachine added the Team:QL (Deprecated) Meta label for query languages team label Sep 18, 2020
@rw-access
Copy link
Contributor

rw-access commented Sep 18, 2020

Thanks for starting this discussion on an issue @costin. I'm optimistic that we'll agree on something.

Instead of interpreting this issue as Replace ?" and ?' with """, I'm hoping to address the underlying question What syntax is ideal for representing raw strings?

Most of my thoughts boil down to three high-level questions

  1. What contexts do we think this will be used? Purely in == and != equality comparisons? Do we think users will just copy and paste a value directly into a query this way? Or will it be used within the wildcard or match function? I think we should know our use-cases first, and work from there.

  2. What alternatives have we considered, and is this the best option?
    Since this is a feature request that is user-facing, I want to make sure we find the right solution, because maintaining compatibility could pigeonhole us in the user. So we want to avoid breaking changes down the road.

A handful of alternatives that we've discussed:

  • use /.../ syntax for representing raw literals. We brought this one up, but the biggest concern is how common / is for unix paths. For that reason, it would need to be escaped a lot, and is problematic.
  • use '...' approach for literal strings. The ' character is used for strings in many languages, and using it for escaping identifiers in SQL seems to be more the exception than the rule. Personally, using'...' for literal string seems the most intuitive is my favorite approach and is consistent with several programming languages and serialization formats. However, this syntax isn't currently available to us, because we need to first deprecate/remove it before we can repurpose it.
  • use backticks `...`. But this makes more sense for escaping identifiers than escaping strings.
  • use '''...'''. This is nearly identical to """, however it could be misread in non-monospace fonts.
  1. What is the desired syntax and behavior allowed between the """ pairs? Can the string contain """ at all? How many quotes could it contain?

How do we handle these scenarios?

  • newlines between """
  • tabs between """
  • what whitespace is allowed between?
  • """""" (6 quotes): is this an empty string?
  • 7 quotes: should this parse to " or should it fail?
  • 8 quotes: will this parse to "" or should it fail?
  • 9 quotes: will this parse to """ or shoul it fail?
  • 10 quotes: will this parse to """ or should it fail?
  • """"""abc""" (6 quotes, then abc"""): will this parse as """abc or should it fail?

@astefan
Copy link
Contributor

astefan commented Sep 21, 2020

For consistency with the rest of the language and to keep symbols usage to a minimum (which adds additional complexity and potentially can confuse users), using """ is a good option. The double quote is already used, so it doesn't come as a surprise to a user.

For the other four suggested alternatives:

  • / feels regex-ish to me and for the reason @rw-access mentioned, as well, I don't think is a good alternative
  • the single quote ' could be used for identifiers and I like this idea better than using the single quote for raw strings
  • the backtick ` is not so common imo and tends to suggest escaping (see Markdown). And, again, there is no need to add more symbols to the language if we don't actually need it.
  • fully agree with @rw-access regarding '''

@matriv
Copy link
Contributor

matriv commented Sep 21, 2020

I think that this change is quite safe even for any existing users, and the chance that the unescaped string literal is used.

  • We provide an error message when using the previous syntax pointing to the new syntax
  • The introduction of """ has the exact same behaviour regarding " and ' in the unescaped string:
    • ?"asdf"asdf"
    • ?"""
    • ?""""
    • ?"'"
    • and so on

It's the same behaviour for every language out there and every unescaped syntax. What about if the unescaped syntax is marked with an #? Then expressions like ##foo#bar# will result int #foo#bar.

Imho, we should proceed with this breaking change.

@sethpayne
Copy link

sethpayne commented Sep 21, 2020

How do we handle these scenarios?

  • newlines between """
  • tabs between """

@paulewing and @MikePaquette -- can you comment on the above?

Going to add my perspective on the below:

  • what whitespace is allowed between?

Any. If """ ... """ encloses literal strings, whitespace should be interpreted literally as well.

  • """""" (6 quotes): is this an empty string?

Yes. This should be an empty string

  • 7 quotes: should this parse to " or should it fail?

This should fail since we would need to make assumptions about user intent, otherwise. I think we need to avoid this and require users to be very explicit

  • 8 quotes: will this parse to "" or should it fail?

This should be interpreted as the a string: ""

  • 9 quotes: will this parse to """ or shoul it fail?

Fail. Same reasoning as for 7 quotes

  • 10 quotes: will this parse to """ or should it fail?

Parse to """ as with 8 -- or any even numbered count of double quotes

  • """"""abc""" (6 quotes, then abc"""): will this parse as """abc or should it fail?

Parse as """abc -- again, I think we should require explicit input from users.

@costin
Copy link
Member Author

costin commented Sep 21, 2020

Folks, before getting lost in the details let's clarify one thing: this ticket is not about changing the raw string semantics.

What contexts do we think this will be used?

  1. Again, the aim of this ticket is not introduce new behavior but to simplify the grammar. In other words, the behavior of the raw string declaration does not and should not change.
    Further more the context of a definition should not matter. It does not for strings so it shouldn't for raw strings either. If it does, that's a bug.

What alternatives have we considered, and is this the best option?

  1. The issue with only picking any one character is that it leads to escaping that character inside the string.
    If we pick [x], declaring the raw string x becomes x\xx. Sure the same thing applies with a repetition of characters but it's much more unlikely.
    I'm not clear whether you propose an alternative or not...

What is the desired syntax and behavior allowed between the """ pairs? ...

  1. I'm not sure I follow - this is not a new requirement but replacing the current raw string declaration with a new one. The existing behavior will remain the same. See 1).

@rw-access
Copy link
Contributor

rw-access commented Sep 21, 2020

I believe we're on the same page. I asked these questions to make this proposal crystal clear.
This is a specification we're talking about, so I don't want to assume anything.

Further more the context of a definition should not matter. It does not for strings so it shouldn't for raw strings either. If it does, that's a bug.

  • My preference is that """ and " be fully interchangeable, but merely have different syntax (escaping vs raw). It sounds like we're agreement.

What contexts do we think this will be used?

  • My reason for this question is that I want to make sure we are also clear on why we need this syntax and what contexts it will most likely be used in. I believe there are two: (1) regular expressions and (2) copy and paste without escaping

What alternatives have we considered, and is this the best option?

  • I asked this question, because I want to make sure that we aren't simply charging forward with one option, but that we are exploring all options, because this a change we only want to make once.

I'm not clear whether you propose an alternative or not...

I recapped four alternatives and added reasons for why most of them were inadequate:

  • /.../
  • '...'
  • `...`
  • '''...'''

What is the desired syntax and behavior allowed between the """ pairs? ...

  • My reason for asking this question is to make sure that we are proposing a precise syntax and that we understand the implications of our choices.
  • We should understand exactly what is and is not allowed between """. Newlines, tabs, unprintable characters, extra quotes, etc.

@costin
Copy link
Member Author

costin commented Sep 22, 2020

If my understanding is correct, most of confusion revolves around escaping " inside a raw string.

My preference is that """ and " be fully interchangeable, but merely have different syntax (escaping vs raw). It sounds like we're agreement.

👍

My reason for this question is that I want to make sure we are also clear on why we need this syntax and what contexts it will most likely be used in. I believe there are two: (1) regular expressions and (2) copy and paste without escaping

This is where we have different goals. The aim of this ticket is to simplify the grammar by 7.10. Not to impact in anyway the semantics for unescaped string declaration. Whether what we have is useful or not, is outside the scope of this ticket and 7.10 as far as I'm concerned.

My reason for asking this question is to make sure that we are proposing a precise syntax and that we understand the implications of our choices.

Is the current syntax for ?' and ?" unclear ?

We should understand exactly what is and is not allowed between """. Newlines, tabs, unprintable characters, extra quotes, etc.

I'm not sure what's unclear. This is not a new concept, the only change in the grammar really is the defining string, from ?' and ?" to """ as you've seen in the PR: #62539

The same rules for newlines, tabs, unprintable characters, etc.. that apply for ?' and ?", apply for """.
The only difference is obviously escaping the """ sequence which seems to be confusing so before discussing that, I'd like to check that is the case.

@rw-access
Copy link
Contributor

rw-access commented Sep 22, 2020

This is where we have different goals. The aim of this ticket is to simplify the grammar by 7.10. Not to impact in anyway the semantics for unescaped string declaration. Whether what we have is useful or not, is outside the scope of this ticket and 7.10 as far as I'm concerned.

I'm referring to user behavior not syntax or semantic behavior. For the grammar, we're in agreement both on semantics and syntax: it's a string with no escape sequences between the """, and it's interchangeable with other strings. I want to make sure we enumerate the use cases of this functionality, and make sure we're asking the question "how well does it accomplish the expected user behavior?"

If we create a new syntax for raw strings, but it's awful at capturing regular expressions, then we probably did a bad job. I think """ is the best proposal we've seen for many use cases, and much better than both ?" and ?'. I just want to make sure we are keeping typical user experience and at the forefront and that """ is fulfilling their needs. One hypothetical concern is the extra verbosity.

For example, if we want to use """ strings as regular expressions, what does a regex look like? Is that something that looks overwhelming or is writing it something that's comfortable?

  • Do we like the way this looks and feels? match(process.path, """C:\\Program Files( \(x86\))?\\Microsoft\\Office.*\\.*?\.exe""")
  • What about this common example, where we are leading with an extra quote? process.command_line == """"C:\Windows\System32\cmd.exe" /c start /b whoami /all"""

My questions about how many quotes do we allow were again to make sure that this proposal successfully accomplishes its goals and use cases. I think the only logical regular expression is this: """[^\r\n\f\v]*?["]{3,}, and this means that you can have as many trailing quotes as you want, as long as there are at least three. The original text will always be recovered by removing the first 3 and last 3 characters. That means you can start with up to five quotes (3 will be dropped), and end with at least 3.

I don't think of that is simply an implementation detail, it's part of the proposal for """ in the specification and will eventually be in the user docs (but much more polished).

@rw-access
Copy link
Contributor

rw-access commented Sep 22, 2020

Screenshot of that regex and how it matches strings. I'm pleased with this behavior, and fairly confident that it's desirable and it matches our use cases. """ in the middle of a string seems very unlikely, and I can't enumerate more use cases than regex and just avoiding cumbersome escape sequences in general.
image

If we can't come up with any use cases where the syntax is confusing, inadequate, undesirable or misleading, then I think we can safely move ahead.

@costin
Copy link
Member Author

costin commented Sep 23, 2020

It looks like we're reaching consensus. I'm removing the team-discuss label but will keep the ticket open just in case there are some details popping up regarding the grammar itself.

@matriv
Copy link
Contributor

matriv commented Sep 24, 2020

For python when you have something enclosed in """ you cannot have the """ sequence inside it, since the approach is the non-greedy one and therefore, when the sequence is met again it denotes the end of string. For python you just need to escape on of the " or all of them in the sequence to make it work:

"""asd\"\"\"qwe"""
or
"""asd\"""qwer"""
or
"""asd"\""qwer"""
"""asd""\"qwer"""

@rw-access
Copy link
Contributor

One big downside for that is that you can't have trailing double quotes at all, which is problematic. Also, this is most comparable to r"""

>>> r""""abc""""
  File "<stdin>", line 1
    r""""abc""""
               ^
SyntaxError: EOL while scanning string literal

@matriv
Copy link
Contributor

matriv commented Sep 28, 2020

I tend to lean towards the greedy approach, so that once you have the leading """ you expect the trailing """ and everything in-between is unescaped raw string including any combination/series of ".

@rw-access
Copy link
Contributor

That means you can have at most one """ in an expression. Everything between the first """ and the last """ will be consumed.

@matriv
Copy link
Contributor

matriv commented Sep 28, 2020

Yes, so as many " or """ inside the expression as long as it ends with a closing """. At least that's my proposal/suggestion, but I don't have any really strong arguments against escaping each " within the expression. I'd definitely won't go with the python way that you would have to escape just one of the sequence, I find it confusing.

@rw-access
Copy link
Contributor

rw-access commented Sep 28, 2020

Maybe these examples better illustrate the problem that I see with greedy matching. What strings are in each expression?

foo == """one string""" and bar == """""another string"""""
concat("""abcdefg""", """""hijklmnop"""")

Non-greedy:

  • one string

  • ""another string""

  • abcdefg

  • ""hijklmnop"

Greedy:

  • one string""" and bar == """""another string

  • bcdefg""", """""hijklmnop"

Also the analogous python syntax is r""" for raw strings, not """ for escaped strings.

@matriv
Copy link
Contributor

matriv commented Sep 28, 2020

No, it's not like that, the greedy approach works within the expression, and, or, etc. breaks the expression anyway.

@matriv
Copy link
Contributor

matriv commented Sep 28, 2020

@matriv matriv self-assigned this Sep 28, 2020
@rw-access
Copy link
Contributor

rw-access commented Sep 28, 2020

But what if you want a \" inside the string? Now there's no way to do that.
I'm just confused why there's escaping at all in a string that's supposed to solve the problem of not needing escape sequences.

I still think """[^\r\n\f\v]*?["]{3,} is the safest and most clear specification. It forces the string to be in a single line, can end or start with a double quote inside the string, and has no escape sequences.

@costin
Copy link
Member Author

costin commented Oct 7, 2020

Closed by #62539

@costin costin closed this as completed Oct 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/EQL EQL querying >breaking Team:QL (Deprecated) Meta label for query languages team
Projects
None yet
Development

No branches or pull requests

6 participants