Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dropped support for named group captures with underscore #108

Open
kmerz opened this issue Feb 26, 2019 · 6 comments
Open

Dropped support for named group captures with underscore #108

kmerz opened this issue Feb 26, 2019 · 6 comments

Comments

@kmerz
Copy link

kmerz commented Feb 26, 2019

With #53 we lost the ability to have named group captures with underscore like (?<test_field>test).

java-grok had the support as long as we used the com.google.code.regexp.Pattern.
Now with java.util.regex.Pattern we use the java regex engine which does not support underscores:

https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#groupname

A capturing group can also be assigned a "name", a named-capturing group, and then be back-
referenced later by the "name". Group names are composed of the following characters. The first
character must be a letter.
The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
The digits '0' through '9' ('\u0030' through '\u0039'),

This broke backward compatibility with already stored patterns.

Preferable fix was to bring back com.google.code.regexp.Pattern.

@kmerz
Copy link
Author

kmerz commented Feb 28, 2019

Here a diff you can apply which tests the missing underscore support:

diff --git a/src/main/resources/patterns/patterns b/src/main/resources/patterns/patterns
index 52235d8..2eb7233 100644
--- a/src/main/resources/patterns/patterns
+++ b/src/main/resources/patterns/patterns
@@ -106,3 +106,6 @@ COMMONAPACHELOG_DATATYPED %{IPORHOST:clientip} %{USER:ident;boolean} %{USER:auth
 
 # Log Levels
 LOGLEVEL ([A|a]lert|ALERT|[T|t]race|TRACE|[D|d]ebug|DEBUG|[N|n]otice|NOTICE|[I|i]nfo|INFO|[W|w]arn?(?:ing)?|WARN?(?:ING)?|[E|e]rr?(?:or)?|ERR?(?:OR)?|[C|c]rit?(?:ical)?|CRIT?(?:ICAL)?|[F|f]atal|FATAL|[S|s]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?)
+
+# NamedGroup with underscore
+NAMEDGROUPWITHUNDERSCORE (?<test_field>test)
diff --git a/src/test/java/io/krakens/grok/api/GrokTest.java b/src/test/java/io/krakens/grok/api/GrokTest.java
index de4d714..3bff5a8 100644
--- a/src/test/java/io/krakens/grok/api/GrokTest.java
+++ b/src/test/java/io/krakens/grok/api/GrokTest.java
@@ -672,4 +672,13 @@ public class GrokTest {
     instant = (Instant) grok.match(dateWithTimeZone).capture().get("timestamp");
     assertEquals(ZonedDateTime.parse(dateWithTimeZone, dtf.withZone(ZoneOffset.ofHours(8))).toInstant(), instant);
   }
+
+  @Test
+  public void testNamedGroupWithUnderscore() {
+    String grokPatternName = "NAMEDGROUPWITHUNDERSCORE";
+    String testString = "test";
+    Grok grok = compiler.compile("%{" + grokPatternName + "}");
+    String result = (String) grok.match(testString).capture().get(grokPatternName);
+    assertEquals("test", result);
+  }
 }

@ottobackwards
Copy link
Contributor

The fact that this broke _ support was known at the time of the change and referenced in #53, so this was on purpose and accepted as far as I can tell. I don't think you have to convince anyone this is an issue.

What you would have to convince people of is that the original PR ( 2016?) was not worth dropping this support.

@kmerz
Copy link
Author

kmerz commented Feb 28, 2019

For us in Graylog it is simply the fact that we have users with Grok Patterns which do contain named groups with underscores. And dropping the support means a lot of work for the users, manually rewriting there Grok Patterns.

@ottobackwards
Copy link
Contributor

ottobackwards commented Feb 28, 2019

So, from what I can tell that support was lost because the library that provides it is for java 5/6, so it just can't be 'added back'. Maybe you can submit a PR? But there might be other reasons.....

@anthonycorbacho ?

@kmerz
Copy link
Author

kmerz commented Feb 28, 2019

I would have a PR ready when you want. But as I read in #53 performance with named-regexp was an issue?

@ottobackwards
Copy link
Contributor

So we really need @anthonycorbacho to comment. I was not involved at that time, and I am not a committer now, although I have submitted a couple of PR etc. Switching the regex engine, as I believe this change did was and is a big deal, and as I said they knew they were making this breaking change and accepted it.

I don't think there is any going back.

I would imagine if you wrote something that say did mapping between _ and some allowable character on the fly ( so that the caller can use _ but inside it is replaced ) that would be the way to go.

But @anthonycorbacho leads the project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants