Confused about the actual use case #264

emrakyz · 2024-05-11T07:45:30Z


First of all, thanks a lot for this tool.

The idea looks really good and promising, but I couldn't work out what the actual use case is. The documentation lacks a variety of interesting examples.

To test grex, I tried to reproduce some patterns I had already written by hand.

For example, I use the pattern below to extract DOIs from various inputs and/or PDFs:
sed -n -E 's/.*((DOI|doi)((\.(org))?\/?|:? *))([^: ]+[^ .]).*/doi:\6/p; q'

The actual pattern is this:
.*((DOI|doi)((\.(org))?\/?|:? *))([^: ]+[^ .]).*

The part I want to capture is the 6th group, which is the DOI itself: ([^: ]+[^ .])

As an example, it captures this part at the end: 10.36227/techrxiv.22659061.v1
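To make the behaviour of the sed expression above concrete, here is a minimal sketch that wraps it in a shell function and feeds it a few input forms. The function name `extract_doi` and the sample input lines are assumptions made for illustration; only the sed expression and the DOI value come from the post.

```shell
# Wraps the sed expression from the post in a function (the name is hypothetical).
# Reads stdin and, because of the trailing "q", only ever looks at the first line.
# Prints "doi:<identifier>" when the pattern matches, nothing otherwise.
extract_doi() {
    sed -n -E 's/.*((DOI|doi)((\.(org))?\/?|:? *))([^: ]+[^ .]).*/doi:\6/p; q'
}

# Made-up input forms around the DOI quoted in the post.
printf '%s\n' 'https://doi.org/10.36227/techrxiv.22659061.v1' | extract_doi
printf '%s\n' 'DOI: 10.36227/techrxiv.22659061.v1.' | extract_doi
printf '%s\n' 'doi:10.36227/techrxiv.22659061.v1' | extract_doi
```

All three calls should print `doi:10.36227/techrxiv.22659061.v1`. In the second one the final `[^ .]` keeps the sentence-ending period out of the captured group.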

I put many valid cases (DOIs in the different forms covered by the regex above), one per line, into a test.txt file.

I ran grex -r -g -c --no-start-anchor -f "test.txt". I knew it couldn't give me a pattern similar to my original one, but the output differed far more than I expected. I got an extremely long regex which also hard-coded unwanted parts (false-positive constants) that would break the command for my actual use case. This is understandable, but it seems impossible to avoid without a practically unlimited number of examples that differ from each other in everything except the actual constants.

I also tried to reproduce simpler patterns I had written before, using made-up examples in the test file. The pattern below is one example:
^ *([0-9]+).*\s{2,}(.+)$
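As a rough illustration of what this pattern is meant to capture, the sketch below applies it with sed to one made-up line. The function name `split_numbered_line`, the sample text, and the `|` output separator are assumptions; `[[:space:]]` stands in for `\s`, which `sed -E` only accepts as a GNU extension.

```shell
# Hypothetical helper: prints "<leading number>|<text after the last run of 2+ spaces>".
# Same pattern as in the post, with \s spelled as the portable [[:space:]].
split_numbered_line() {
    sed -n -E 's/^ *([0-9]+).*[[:space:]]{2,}(.+)$/\1|\2/p'
}

# Made-up sample: leading spaces, a number, free text, two spaces, final field.
printf '%s\n' '  42 some title  value here' | split_numbered_line
```

For the sample line this should print `42|value here`. A line with only single spaces after the number does not match and prints nothing.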

But grex always outputs a very long pattern that looks nothing like my actual one. The output is not "wrong" in a technical sense, but it is not usable for anything practical. It is not even a suitable starting point for manual editing. This was the case in my tests regardless of sample size.

Even for very basic cases, the output is not usable because we can't be expressive enough. Without a way to express intent, the tool can only produce patterns that become usable with an almost unlimited number of examples covering all possibilities.

The problem, as far as I understand, is that we can't be expressive enough, especially about constants, variables, wanted parts, unwanted parts, and the main pattern that should be captured. Without these, I could not find a proper use case, but I really want to use this tool in an actual scenario. How can we be more expressive, so that we can automate creating at least a base pattern to work on?

I also tried to build the final pattern segment by segment, but failed in the same way.

Writing regex is fairly easy for small, simple tasks. What I initially had in mind was that this tool would help us create regexes for very complex patterns more easily, efficiently, and correctly. Right now I feel like we have a very powerful and robust but useless tool. Is this just "experimental", or a kind of base that future tools will build on?

Could you please tell me what I am doing wrong? What is the best practice for solving an actual problem with grex? What kinds of problems is grex best suited to?

I probably misunderstood the tool or made a mistake regarding the intended use case.

pemistahl · 2024-09-09T07:39:17Z

Maintainer

I'm sorry for the late response. Well, the intended use case currently is just what the README says:

"The currently best use case for grex is to find an initial correct regex which should be inspected by hand if further optimizations are possible."
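One way to support that by-hand inspection is to check each hand-edited candidate against the same example file that was given to grex. The sketch below is only an illustration: the helper name `unmatched_examples`, the three sample lines, and the simplified candidate pattern are all assumptions, and it uses plain `grep -E` rather than anything grex provides.

```shell
# Hypothetical helper: list the example lines that a candidate regex does NOT match.
# $1 = candidate pattern (POSIX ERE), $2 = file with one expected match per line.
# Exit status follows grep -v: 0 if some line is unmatched, 1 if every line matches.
unmatched_examples() {
    grep -v -E -e "$1" -- "$2"
}

# Made-up example file, standing in for the test.txt from the question.
examples=$(mktemp)
printf '%s\n' \
    'https://doi.org/10.36227/techrxiv.22659061.v1' \
    'DOI: 10.36227/techrxiv.22659061.v1' \
    'doi:10.36227/techrxiv.22659061.v1' > "$examples"

# A hand-simplified candidate (assumption), checked against every example line.
candidate='(DOI|doi)((\.org)?/?|:? *)[^: ]+[^ .]'
if unmatched_examples "$candidate" "$examples"; then
    echo "candidate misses the lines listed above"
else
    echo "candidate matches every example"
fi
rm -f "$examples"
```

Running the same check after every manual simplification shows immediately when an edit stops covering one of the examples.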

The programming involved in the current state of the tool is already very complex. It is desirable to make the tool more powerful by allowing more generalizations. But this is hard to implement and it's difficult to guarantee valid and correct output for each arbitrary input.

Feel free to improve the implementation and make the tool more intelligent. PRs are always welcome.
