Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: add ReplacingReader #153

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

bobozaur
Copy link

@bobozaur bobozaur commented Feb 5, 2025

Hello 👋 ,

I hope it's fine that I directly created a PR without opening an issue. This PR adds a ReplacingReader and methods to construct one, which allows replacing matches from an underlying reader while streaming.

The mechanism is similar to the AhoCorasick::try_stream_replace_all and AhoCorasick::try_stream_replace_all_with methods, but instead of piping data to a given writer in one go, a reader is returned which can then stream data with bits of it replaced. This can be very handy in some situations, like when you'd want to use serde to deserialize data directly from a reader.

That's in fact the use case that drove me to create this PR. I initially wanted to create a standalone library for this and was looking into string searching algorithms and aho-corasick seemed particularly suitable. Lo and behold, the library was already almost doing what I wanted! So I figured that it might be better to create a PR here which allows reusing internal bits.

src/ahocorasick.rs Outdated Show resolved Hide resolved
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be infallible variants of the try_to_replacing_reader and try_to_replacing_reader_with methods?

src/automaton.rs Outdated Show resolved Hide resolved
@BurntSushi
Copy link
Owner

Thanks for this! It's okay to just submit PRs, but I usually encourage issues to avoid wasted work. However, in this case, I do actually think this is a pretty useful feature. It's off-loading a lot of complexity that callers would have to do to achieve this same task, so I like the trade-off here.

With that said, the type signatures here are quite gnarly. And the fact that an impl Automaton is being returned and exposed from the AhoCorasick API is bothersome to me. I think you can probably get rid of that one by just using an AhoCorasick internally? Although it might require more code.

The impl FnMut is probably unavoidable. Unless we just remove that convenience API. Which I think is my preference at this point. It keeps the change smaller.

I'd also suggest looking at the rustdoc rendering of your comments. It's bad juju to document items that appear in an overview page (types and modules mostly) with a Markdown paragraph of more than ~1 line.

@bobozaur
Copy link
Author

bobozaur commented Feb 6, 2025

Thanks for this! It's okay to just submit PRs, but I usually encourage issues to avoid wasted work...

I already wrote the reader state machine when I was experimenting with naive search & replace approaches, so it wouldn't have been the end of the world if this PR ended up as an unnecessary effort. But I'm definitely glad you're on board with it 😄 !

With that said, the type signatures here are quite gnarly. And the fact that an impl Automaton is being returned...

A bit easier said than done. I went with the impl Automaton approach because ReplacingReader uses StreamChunkIter internally, which contains are reference to a type implementing Automaton. StreamChunkIter mostly gets constructed from the trait methods as well, so changing it to accommodate the removal of impl Automaton does not seem very straightforward.

We could return Arc<dyn AcAutomaton> instead, but AcAutomaton is private and that will cause issues. And casting between trait objects isn't really a thing. I looked into whether we could leverage Any, but both dyn AcAutomaton and Automaton would require modifications that might break other stuff.

If you've go some other idea, I'm all ears!

The impl FnMut is probably unavoidable. Unless we just remove that convenience API...

Fair point, agreed.

I'd also suggest looking at the rustdoc rendering of your comments. It's bad juju to document items that appear in an overview page (types and modules mostly) with a Markdown paragraph of more than ~1 line.

Oops! Will definitely look into it!

@bobozaur
Copy link
Author

bobozaur commented Feb 6, 2025

I hereby declare myself bamboozled. I went back on the "casting between trait objects isn't really a thing" because I knew I had some luck with that before just that I did not remember exactly how as it's been a while and quick searches didn't turn out fruitious.

Nevertheless, I slept on it and remembered that it had to do with declaring a conversion method in the trait itself. So I tried extending AcAutomaton:

impl<A> AcAutomaton for A
where
    A: Automaton + Debug + Send + Sync + UnwindSafe + RefUnwindSafe + 'static,
{
    fn as_dyn_automaton(self: Arc<Self>) -> Arc<dyn Automaton> {
        self
    }
}

However, an error saying "method cannot be invoked on a trait object" is returned when doing something like:

        self.aut
            .as_dyn_automaton()
            .try_to_replacing_reader_with(rdr, replace_with)

But I don't get why it works with Arc<dyn AcAutomaton> and not with Arc<dyn Automaton>. They're both trait objects! Like, I get that the Self: Sized bound in try_to_replacing_reader_with is preventing this, but why does it work with dyn AcAutomaton? I'll be looking some more into it.

@bobozaur
Copy link
Author

bobozaur commented Feb 6, 2025

Just a side note that I tried other variations of this like returning a &dyn Automaton instead of an Arc, to no avail.

@BurntSushi
Copy link
Owner

I haven't looked too closely here, but I was thinking it could be achievable with either more code or another layer of indirection (e.g., an internal trait). But I'd have to look more closely, and I'm not sure when I'll have the time for that.

@bobozaur
Copy link
Author

bobozaur commented Feb 6, 2025

No worries, I'm still looking into it. I'll do some more digging and get back.

@bobozaur
Copy link
Author

bobozaur commented Feb 6, 2025

Okay so I jumped outside the box and realized that instead of doing trait gymnastics we can do something else to avoid the impl Automaton in the type signature: a type alias, a default generic on ReplacingReader or a newtype wrapper.

While experimenting with these approaches I noticed that this is already being done for StreamFindIter using a newtype. Therefore I did the same thing here to match the library's style.

I'll look into the docs matter next.

@bobozaur bobozaur requested a review from BurntSushi February 6, 2025 14:15
Comment on lines +117 to +120
This example shows how to execute a search and replace on a stream while piping
data to a writer without loading the entire stream into memory first. This is
advantageous over combining [`AhoCorasick::try_to_replacing_reader_with`] and
something like [`std::io::copy`] because it avoids double buffering.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extended the comment here just to make it clearer that AhoCorasick::try_stream_replace_all is to be preferred when all one wants to do is pipe data between a reader and a writer.

@bobozaur
Copy link
Author

bobozaur commented Feb 6, 2025

But I don't get why it works with Arc<dyn AcAutomaton> and not with Arc<dyn Automaton>. They're both trait objects! Like, I get that the Self: Sized bound in try_to_replacing_reader_with is preventing this, but why does it work with dyn AcAutomaton? I'll be looking some more into it.

FWIW, I know now why this was the behavior.

There's an impl Automaton for Arc<dyn AcAutomaton>, so the Automaton::try_to_replacing_reader_with gets an &Arc<dyn AcAutomaton> as &self, not a &dyn AcAutomaton. The former is Sized while the latter isn't.

While casting Arc<dyn AcAutomaton> to Arc<dyn Automaton> can be achieved with some trait (or just by extending AcAutomaton), the Automaton methods take &self. That means that what would be required is casting &Arc<dyn AcAutomaton> to &Arc<dyn Automaton> which I don't think is possible. Either a referenced trait object or an arc'ed one works, but a reference to an arc'ed trait object seems like quite the stretch 🙃 .

But if that worked, then an impl Automaton for Arc<dyn Automaton or a tweak to the impl<A: Automaton> Automaton for &A so that it leverages Deref would then provide the necessary glue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants