Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex search #6706

Merged
merged 35 commits into from
Sep 6, 2024
Merged

Regex search #6706

merged 35 commits into from
Sep 6, 2024

Conversation

Alkarex
Copy link
Member

@Alkarex Alkarex commented Aug 12, 2024

fix #3549

Support regex filters and regex search (working with PostgreSQL and MySQL and SQLite).

Works like any regular expression. Must be enclosed in / /

Currently supporting the prefix intitle: or no prefix at all (will add author: etc. soon)

Case sensitive by default, but can be made case-insensitive with i modifier like /Alice/i

Supports multiline mode with m modifier like /^Alice/m

Example of filter to keep only the entries, which title ends with .0 from https://github.com/FreshRSS/FreshRSS/releases.atom , marking as read the entries not matching this filter:
image

image

Requires PHP 8.0+ due to use of new function str_contains, as planned requirement for FreshRSS 1.25.0+ #6711

@Alkarex Alkarex added this to the 1.25.0 milestone Aug 12, 2024
@Alkarex Alkarex marked this pull request as ready for review August 13, 2024 20:29
@mtalexan
Copy link

mtalexan commented Aug 27, 2024

I'm running this branch in a custom docker-compose build based on the Docker-Alpine. I see you pretty much only have intitle: and the prefix-less (context) matching implemented so far so I'm focusing on testing regex features and combos of the regex feature with other pre-existing filter types.


I'd recommend being explicit about what type of regex patterns you support. The most common standard types are Posix (very limited), Extended, and Perl, though there are lots of variations like Java, RE2 (Golang), etc that slightly tweak the syntax, usually primarily only of more advanced features. Maybe if you add a link in the documentation to the regex engine's definition that would be helpful as well.

Working:

  • intitle with regex
  • intitle with case insensitive regex
  • content with regex
  • content with case insensitive regex
  • intitle with regex and content with non-regex
  • intitle with case insensitive regex and non-regex
  • intitle with regex ORed with intitle regex
  • intitle without regex
  • content without regex
  • intitle regex and tag (non-regex) pattern
  • character exclusion regex patterns (i.e. [^a-z])
  • Wildcard match patterns (.* between words)
  • Start of boundary patterns (/^Israel/)
  • One-or-more wildcard match patterns ([^a-z]+)
  • End of boundary match patterns (routing\.$)
  • Alternatives matching, though see the issues with grouping. messages?( routing|$)

Not working:

  • grouping in regex patterns (sort of, see below)
  • multi-line mode (see below)
  • (user error?) explicit and between patterns. i.e. intitle:/election/i and intitle:/year/i
  • (pre-existing?) inurl (non-regex) can't match anything (i.e. `inurl:'p88#')
  • (pre-existing?) matches are applied to the raw text, not rendered HTML. i.e. 'hit 100,000 downloads' where the 100,000 downloads part is a hyperlink fails to match.
  • (not supported?) Look ahead patterns (/election (?!year)/i)

Bugs

Pattern Parenthesis

It's common to use parentheses in regex patterns to form conditional groups. For example: intitle:/messages( routing|$)/i to match either messages at the end of a line, or messages routing anywhere in the title.

The parenthesis currently require manual escaping however, they're being captured by the filter logic outside the regex. This also appears to be an existing bug in how the non-regex filters work as well, but it's much more significant for regex patterns because parentheses are used so frequently in them.

For example, this fails to match anything intitle:/messages( routing|$)/i, but this intitle:/messages\( routing|$\)/i matches two different articles, one with messages routing in the middle of the title, and another that ends with messages. I also verified that literal parentheses aren't being matched by unescaped parentheses in the pattern, intitle:/(Zachary/i fails to match a title containing (Zachary Cohen/CNN) in it. In fact, matching a literal parenthesis requires escaping it from both the filter engine (\() and the regex pattern ([(]), resulting in needing to specify [\(] and [\)].

I did verify that this same parenthesis bug affects quoted non-regex patterns though too. Doing infile:'(Zachary' is also failing to match the title with (Zachary Cohen/CNN) in it, and I have to do infile:'\(Zachary' to get it to match. It's just less severe there because literal parentheses are much less common than needing to use a parenthesis in a regex pattern.

Multi-line mode

I'm not sure how to test multi-line mode exactly. It works fine if I put the m on an otherwise working case sensitive or insensitive pattern, but it doesn't add any functionality to do so. With text like this in the middle of the content of an item:

Better than Signal? Looks promising.
A few bugs and UX issues but great foundation. Love that it’s public domain.

The pattern /Looks promising\./ works, as does /looks promising\./i and I can put an m on the end of either and it still works (/Looks promising\./m or /looks promising\./im).
But if I try to match across the lines: /Looks promising\..*A few bugs and UX issues/m it fails to match. Or if I try to explicitly match the newlines it also doesn't work: /Looks promising\.(\r)?\n(\r)?A few/im (though maybe there's some escaping issues here). Here's no way to insert a literal newline in the ad-hoc search box that I've been doing most of my testing with that I'm aware of, so I'd have to have some way to match past a line ending.

I suspect you want . to match line endings as well, and probably need to set a special option for the regex engine to let . match those end-of-line terminators when multi-line mode is enabled. For some reason that's frequently something regex engines require you to separately enable from multi-line mode, even though they'd really only ever expect to be used together.

@Alkarex
Copy link
Member Author

Alkarex commented Aug 27, 2024

Hello @mtalexan and thanks for testing 👍🏻
Quick answer already:

  • there is no explicit AND operator; just use a space instead.
  • The regex engine is not fully consistent and depends on which database you are using:
  • Searches are indeed applied to the raw HTML content
  • The multiline mode allows searching for instance for a text at the start or end of a line in the raw HTML content

@mtalexan
Copy link

mtalexan commented Aug 28, 2024

Excellent, thanks for the clarification. For reference, I'm testing with PostgreSQL.

  • Lookahead is working, it just needed the parentheses escaped.
  • Literal parentheses can be escaped with \\(, i.e. an escaped \\, but for some reason a single \ not preceeding a ( never needs to be escaped.
  • Multi-line matching is working, it's just usually hard to find a use for since multi-line content usually has HTML or something included that's harder to match against. The . is non-newline-sensitive like you'd expect/want.
  • Literal / in a pattern can successfully be escaped (\/)

It seems only parentheses have weird escaping rules. Pattern escape characters like \d don't need the \ part escaped. Only a ( or ) in the regex needs to be escaped, or a \ before a ( or ) that is supposed to be in the regex pattern (i.e. \\( and \\) -> \( and \) in the pattern). That's incredibly strange behavior because it means \ isn't a consistent escape character like almost every other parser ever written, instead there's weird multi-character matching that must be going on instead of nested interpretation(?).

The super strange behavior with parenthesis and escaping definitely need to be documented. No one will ever figure out how to use it without a lot of trial-and-error, and it's such a standard feature of regex patterns to use parentheses you can guarantee almost everyone will encounter the issue. I'd suggest dropping the mention in the documentation on filtering along with the links to the regex syntaxes supported by the different databases.


Is there anything else you'd like me to test?
The existing non-regex patterns still work the same, a solid sample of the range of all supported regex features work for both case-insensitive and multi-line mode, and can be used in combination with other regex and/or non-regex patterns.

It looks like the same code path is used for all cases of filters being used, so I assume there's not much point in re-testing with per-feed filters, label filters, or saved user filters. Correct?

@mtalexan
Copy link

Oh, forgot to mention I'd tested negation of a regex filter in my original set of tests. It's working fine too.

@mtalexan
Copy link

As I work on using the regexes more, I'm finding more and more cases where it's not clear if there was a regex error, or if the pattern is matching unexpected things. For example, if I use the word-boundary escape character (\y), it doesn't seem to be working and the only way I can determine that is because the same top 10 things appear in my view as if I had no search filter.

This for example doesn't work: intitle:/\yart/i, and it causes 10 items to appear in the view that don't match anything close to what I'm looking for (the case-insensitive prefix "art", making use of the \y` constraint escape for a word boundary). Recognizing the top list of items are unchanged despite the whole list blanking and re-loading, and none of them match what I'm trying to search for isn't exactly easy to determine.


Related: it appears constraint escapes aren't working. Class-Shorthand escapes and Backreferences are working however.

@Alkarex
Copy link
Member Author

Alkarex commented Aug 28, 2024

Thanks for the additional tests and feedback, @mtalexan 👍🏻
Yes, the regex applied to other operators such as author:XXX will work identically.

Indeed, we do not have any feedback to the user when the search expression is wrong, for instance with wrong parentheses, invalid regex, etc. But we should add that.
And the whole parsing of the search query has grown from very simple rules to more complicated ones, and everything should be re-written with a proper parser. The question is how much should be done now, to avoid delaying the landing of the regex features (which we could brand as beta for now).

Adding an error feedback in particular should not be too difficult, but should still be left to a follow-up PR.

@mtalexan
Copy link

Thanks for the additional tests and feedback, @mtalexan 👍🏻 Yes, the regex applied to other operators such as author:XXX will work identically.

Indeed, we do not have any feedback to the user when the search expression is wrong, for instance with wrong parentheses, invalid regex, etc. But we should add that. And the whole parsing of the search query has grown from very simple rules to more complicated ones, and everything should be re-written with a proper parser. The question is how much should be done now, to avoid delaying the landing of the regex features (which we could brand as beta for now).

Adding an error feedback in particular should not be too difficult, but should still be left to a follow-up PR.

I'm definitely on board with splitting the work up. Having access to regex parsing at all is a major feature that shouldn't be held up by reduced usability.

@mtalexan
Copy link

Did you mean to include the \b (non-printing bell character) in the regex example in filtering Readmes? intitle:/^Lo+l\b/i

@Alkarex
Copy link
Member Author

Alkarex commented Aug 28, 2024

Added author: and inurl: and some documentation (excluding the parentheses aspects for now). Additional tests welcome.

Example : author:/^Alice Doe$/im as we split multiple authors in different lines

@Alkarex
Copy link
Member Author

Alkarex commented Aug 28, 2024

Did you mean to include the \b (non-printing bell character) in the regex example in filtering Readmes? intitle:/^Lo+l\b/i

I meant \b like in word boundary. Should be tested, by the way

@Alkarex
Copy link
Member Author

Alkarex commented Aug 28, 2024

@mtalexan
Copy link

To my future self:

I'm not familiar, is preg_match an option for all 3 database types and that's why it's used for auto-labeling and auto-mark-as-read rather than a database-specific syntax? Or is all the content being stored in an internal SQLite database during loading and auto-* matching before (possibly) being transferred to the alternative database? Or does searching on the SQLite database require PHP to do the filtering instead of the database, unlike other database types, and the auto-* matching occurs before the content reaches a database?

It's using PHP to pre-filter incoming content as it's being added to the database so the associated labels and read state can be added at the same time. SQLite happens to use a PHP function you provide for it to do the filtering with when doing a database search.

@Alkarex
Copy link
Member Author

Alkarex commented Aug 30, 2024

luckily there's no real harm for my use case since I only expect a few 10s of thousands of articles

Note that SQLite's performance is surprisingly good, even with many articles, with or without regex

@mtalexan
Copy link

mtalexan commented Sep 2, 2024

@Alkarex it appears the tag searching has the same issues as the inurl:, it doesn't work if quotes are included. #6761

@Alkarex Alkarex linked an issue Sep 2, 2024 that may be closed by this pull request
@Alkarex
Copy link
Member Author

Alkarex commented Sep 2, 2024

Tags fixed (not tested). Tests welcome

@Alkarex
Copy link
Member Author

Alkarex commented Sep 2, 2024

And regex added for tags

@Alkarex Alkarex linked an issue Sep 5, 2024 that may be closed by this pull request

Example to search entries, which title starts with the *Lol* word, with any number of *o*: `intitle:/^Lo+l/i`

As opposed to normal searches, HTML special characters are not escaped in regex searches, to allow searching HTML code, like: `/Hello <span>world<\/span>/`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also has the effect that I have to use the HTML encoding of any character literals I'm looking for in the body as well, right? So &amp; and &quot; to find the post-rendered characters & and " for example.

Might it be worth doing this only if you have an r character on the regex? Like /Hello <span>world<\/span>/r?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the search is supposed to match the HTML content verbatim. No need to extra encode anything though, just exactly at it appears in the original HTML document.

We could consider adding some modifiers, but let's see if there is a sufficient need first. Another modifier I am considering is an automated transformation from PCRE syntax (subset) to the other variants such as PostgreSQL.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might run into issues trying to translate between the regex versions. If you settle on the PCRE syntax there are a lot of advanced (?...) features that aren't present in Postgres. Postgres also has explicit start and end word boundary markers, while PCRE has only a singular word boundary marker. Then there's MySQL, which supports only a very small subset of regex features.

I could see maybe trying to do something with the constraint escapes for Postgres, but that's about the only thing I can see that could possibly be translated and it would probably just confuse the situation with the databases having different regex formats.

@Alkarex Alkarex merged commit 1a552bd into FreshRSS:edge Sep 6, 2024
2 checks passed
@Alkarex Alkarex deleted the regex branch September 6, 2024 07:36
@Alkarex Alkarex mentioned this pull request Sep 6, 2024
3 tasks
@Alkarex
Copy link
Member Author

Alkarex commented Sep 7, 2024

#6785

@arseru
Copy link

arseru commented Oct 8, 2024

Hi @Alkarex, I was trying to find issues/discussions where the regex filters were mentioned, and I ended up here. I have an issue where I'm trying to assign labels automatically to incoming feed items by using intitle regex filters, but it has some erratic behaviour that it's difficult for me to identify, and I was hoping you could give me a hint of what's happening.

So I tried the same as you did in your first comment for the FreshRSS releases feed:

image

If I use the filter intitle:/\.0$/, I get no results:

image

Any idea why is this happening in my instance? I'm using the default Docker deploy.

@Alkarex
Copy link
Member Author

Alkarex commented Oct 8, 2024

Any idea why is this happening in my instance? I'm using the default Docker deploy.

What version of FreshRSS are you running? Try freshrss/freshrss:edge and make sure it was pulled recently

@arseru
Copy link

arseru commented Oct 8, 2024

What version of FreshRSS are you running?

Version 1.24.3.

Try freshrss/freshrss:edge and make sure it was pulled recently

Awesome! With this one (1.25.0-dev), it works as expected:

image

Problem solved then, thanks a lot for the quick reply :D

EDIT: duh, I just realized from your first comment that regex search is only available from FreshRSS 1.25.0+... sorry for not checking that before.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants