Adds a StreamInterceptor interface to allow users to plug in custom interceptors for formats like Zstd. #930

Open; 2 commits to be merged into master

tgregg commented Aug 30, 2024

Description of changes:

This library has always had auto-detection of GZIP streams built in, meaning that when users attempt to construct an IonReader from an InputStream or byte[], the given input is checked for the GZIP format header and wrapped in a GZIPInputStream if the header is present.

Now that many users are replacing GZIP with other compression formats like Zstd, we have to decide how to make this library as user-friendly as possible for users making that transition while limiting the amount of special-case code and dependencies that we add to the library.

This PR sketches out one possibility: to define an interface (provisionally named StreamInterceptor) that can be implemented either by users directly, or by external libraries that we vend, to plug in support for any desired format. The PR demonstrates how this mechanism is used by replacing the existing GZIP detection support with support that is delivered via a new GZIPStreamInterceptor implementation.
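To make the idea concrete, here is a minimal sketch of what such an interface and a GZIP implementation might look like. The method names headerLength, matchesHeader, and newInputStream are taken from the review discussion further down; the exact signatures in the PR may differ, and StreamInterceptorSketch and GzipInterceptorSketch are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical stand-in for the interface proposed in the PR.
interface StreamInterceptorSketch {
    // Number of leading bytes needed to recognize the format.
    int headerLength();

    // True if the bytes at [offset, offset + length) begin with this format's header.
    boolean matchesHeader(byte[] candidate, int offset, int length);

    // Wraps the intercepted stream in one that decodes the format.
    InputStream newInputStream(InputStream interceptedStream) throws IOException;
}

// Illustrative GZIP implementation; not the PR's GZIPStreamInterceptor.
final class GzipInterceptorSketch implements StreamInterceptorSketch {
    static final GzipInterceptorSketch INSTANCE = new GzipInterceptorSketch();

    // The two GZIP magic bytes defined by RFC 1952.
    private static final byte[] GZIP_MAGIC = {0x1f, (byte) 0x8b};

    private GzipInterceptorSketch() {}

    @Override
    public int headerLength() {
        return GZIP_MAGIC.length;
    }

    @Override
    public boolean matchesHeader(byte[] candidate, int offset, int length) {
        if (candidate == null || length < GZIP_MAGIC.length) {
            return false; // too few bytes to contain the header
        }
        for (int i = 0; i < GZIP_MAGIC.length; i++) {
            if (candidate[offset + i] != GZIP_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    @Override
    public InputStream newInputStream(InputStream interceptedStream) throws IOException {
        return new GZIPInputStream(interceptedStream);
    }
}
```

Keeping detection (matchesHeader) separate from wrapping (newInputStream) lets the reader peek at the header without committing to a decoder.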

The ZstdStreamInterceptorTest demonstrates how a StreamInterceptor that recognizes Zstd streams can be plugged into the IonReaderBuilder and IonSystem. In summary, users that wish to support Zstd would change existing code that looks like

IonReaderBuilder readerBuilder = IonReaderBuilder.standard();

to

IonReaderBuilder readerBuilder = IonReaderBuilder.standard().addStreamInterceptor(ZstdStreamInterceptor.INSTANCE);

and code that looks like

IonSystem ION_SYSTEM = IonSystemBuilder.standard().build();

to

IonSystem ION_SYSTEM = IonSystemBuilder.standard()
    .withReaderBuilder(IonReaderBuilder.standard().addStreamInterceptor(ZstdStreamInterceptor.INSTANCE))
    .build();

Critically, this does not require code changes in every location that an IonReader is constructed, and works with all methods of constructing readers (e.g. IonSystem.newReader variants, IonSystem.singleValue, IonSystem.iterate, IonLoader.load, IonReaderBuilder.build variants, etc.).

Comments on the approach are welcomed.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@artemkach

I think a bit of working backwards would help inform your design. Thinking out loud, for a customer migrating to new formats, the desired experience would be, in order of preference:

  1. No-op.
  2. Configuration change.
  3. Trivial code change.
  4. Non-trivial code change.

Option 1 is out due to the desire to limit this library's dependencies and scope.

An example of option 2 would be someone else building a new library that wraps ion-java and injects new functionality in a completely transparent way. This could be done with intercepting proxies using an AOP library or java.lang.reflect.Proxy. Customers would add the new library as a dependency without having to make any code changes. While that's cool in theory, ion-java has multiple places where IonReader is constructed, as you mention, so this might be challenging to implement.

Option 3 is potentially a good tradeoff: if customers are willing to make a small configuration change, they might as well agree to a trivial code change. I think your proposal works well to enable this approach: they add the new library that provides custom ion-java interceptors and they make a low-risk code change where their reader is constructed.

The other aspect of migration is being able to handle both the old gzip format and the new zstd/whatever format. Your proposed design addresses that by supporting multiple interceptors and using the first one that claims a header match. One can think of use cases where chaining more than one interceptor would be beneficial, but I don't know if that's stepping into overengineering territory. Just make sure to avoid a one-way door and leave the opening for future extension.
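A rough sketch of the "first interceptor that claims a header match wins" behavior described above, under stated assumptions: HeaderInterceptor and FirstMatchDispatch are hypothetical names, and the peek-and-replay strategy using PushbackInputStream is one possible approach, not necessarily what the PR does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.List;

// Hypothetical interface mirroring the methods discussed in this PR.
interface HeaderInterceptor {
    int headerLength();
    boolean matchesHeader(byte[] candidate, int offset, int length);
    InputStream newInputStream(InputStream interceptedStream) throws IOException;
}

final class FirstMatchDispatch {
    private FirstMatchDispatch() {}

    // Peeks at the start of the stream and applies the first interceptor whose header matches.
    static InputStream intercept(InputStream source, List<HeaderInterceptor> interceptors)
            throws IOException {
        int maxHeader = 0;
        for (HeaderInterceptor interceptor : interceptors) {
            maxHeader = Math.max(maxHeader, interceptor.headerLength());
        }
        if (maxHeader == 0) {
            return source; // nothing to detect
        }
        // Read up to maxHeader bytes; the stream may be shorter than that.
        byte[] peeked = new byte[maxHeader];
        int available = 0;
        while (available < maxHeader) {
            int n = source.read(peeked, available, maxHeader - available);
            if (n < 0) {
                break;
            }
            available += n;
        }
        // Push the peeked bytes back so the chosen decoder sees the full stream.
        PushbackInputStream replay = new PushbackInputStream(source, maxHeader);
        replay.unread(peeked, 0, available);
        for (HeaderInterceptor interceptor : interceptors) {
            if (available >= interceptor.headerLength()
                    && interceptor.matchesHeader(peeked, 0, available)) {
                return interceptor.newInputStream(replay);
            }
        }
        return replay; // no match: pass the data through unchanged
    }
}
```

Because registration order decides which interceptor wins, supporting both the old GZIP format and a new format during a migration amounts to registering both. Chaining several interceptors would need a loop around this method, which this sketch deliberately leaves out.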

Another possibly-overengineering point is that not all format detection logic fits into the "fixed header" mold. Sometimes headers are not fixed length, and sometimes they are not headers at all. The fixed-header approach works 99% of the time, though, so the same advice applies: avoid one-way doors.

toddjonker commented Sep 11, 2024

Please consider automatically discovering interceptors via the services API, to make integration as easy as dropping them on the classpath. It's going to be annoying if one needs to configure these all over, when I expect most of the time a customer will want to enable something for the entire application or application suite.

[Update] In terms of @artemkach's list, this is effectively an option 1.5: a classpath-only change, and much simpler for everyone than AOP or proxy injection.

toddjonker left a comment

The new classes should be in com.amazon.ion.util, since they aren't really coupled to Ion.

* @see com.amazon.ion.system.IonReaderBuilder#addStreamInterceptor(StreamInterceptor)
* @see com.amazon.ion.system.IonSystemBuilder#withReaderBuilder(IonReaderBuilder)
*/
public interface StreamInterceptor {

"Stream" has several meanings in this package, I suggest renaming to InputStreamInterceptor to be more specific.

Comment on lines 26 to 30
/**
* The length of the byte header that identifies streams in this format.
* @return the length in bytes.
*/
int headerLength();

Since some formats (e.g., Ion itself) may have variable-length headers, I'd make this a bit more general. Maybe headerMatchLength.

* Determines whether the given candidate byte sequence matches this format.
* @param candidate the candidate byte sequence.
* @param offset the offset into the candidate bytes to begin matching.
* @param length the number of bytes (beginning at 'offset') in the candidate byte sequence.

Is there some connection between this length and the headerLength?

The class could use more docs on exactly how the matching process works. After seeing headerLength I expected this method to receive that number of bytes in candidate, and nothing else.

tgregg (author) replied:

I'll try to clarify the documentation. length is intended to be the actual length of bytes in candidate, starting at offset. It need not be the same as headerLength, but if it is less than headerLength then the candidate sequence cannot be a match.
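The relationship between length and headerLength can be illustrated with a small sketch for Zstd, whose frames begin with the magic number 0xFD2FB528 stored in little-endian byte order. ZstdHeaderMatch is a hypothetical class written for this illustration; it is not the ZstdStreamInterceptor from the PR's tests.

```java
// Hypothetical helper showing how 'length' relates to 'headerLength'.
final class ZstdHeaderMatch {
    // Zstandard frame magic number 0xFD2FB528, as it appears on the wire (little-endian).
    private static final byte[] ZSTD_MAGIC = {0x28, (byte) 0xB5, 0x2F, (byte) 0xFD};

    private ZstdHeaderMatch() {}

    static int headerLength() {
        return ZSTD_MAGIC.length;
    }

    // 'length' is the number of bytes actually available starting at 'offset'.
    // It may exceed headerLength(); if it is smaller, the candidate cannot match.
    static boolean matchesHeader(byte[] candidate, int offset, int length) {
        if (candidate == null || length < ZSTD_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < ZSTD_MAGIC.length; i++) {
            if (candidate[offset + i] != ZSTD_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Only the first headerLength() bytes are inspected; any extra bytes in the candidate are ignored.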

* @return a new InputStream.
* @throws IOException if thrown when constructing the new InputStream.
*/
InputStream newInputStream(InputStream interceptedStream) throws IOException;

I'm wondering whether this should have a sibling method for character streams, so one can accomplish transformations of text inputs.

For example, suppose I want to teach the Ion reader to ignore shebang lines atop my Fusion scripts...

tgregg (author) replied:

I can understand the use case, though I don't think it's as simple as adding one sibling method; headerLength and matchesHeader would also need different flavors to operate on text instead of bytes. At that point I think a different interface would be best. If we do this, then we could use the same pattern for registration/discovery as we establish in this PR, but I'll leave that out of scope for now.
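For illustration only, the shebang use case could be handled on a character stream roughly as follows. This is outside the scope of the PR and is not part of any proposed interface; ShebangSkipper is a hypothetical helper built on java.io.PushbackReader.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;

// Hypothetical helper: drops a leading "#!" line from a character stream.
final class ShebangSkipper {
    private ShebangSkipper() {}

    static Reader skipShebang(Reader source) throws IOException {
        PushbackReader reader = new PushbackReader(source, 2);
        int first = reader.read();
        if (first < 0) {
            return reader; // empty input
        }
        int second = reader.read();
        if (first == '#' && second == '!') {
            // Consume the rest of the shebang line, including the newline.
            int c;
            do {
                c = reader.read();
            } while (c >= 0 && c != '\n');
            return reader;
        }
        // Not a shebang: push back what was read, most recent character first.
        if (second >= 0) {
            reader.unread(second);
        }
        reader.unread(first);
        return reader;
    }
}
```

Note that the detection here works on characters rather than bytes, which is the reason given above for wanting a separate interface.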

tgregg commented Sep 26, 2024

> Please consider automatically discovering interceptors via the services API, to make integration as easy as dropping them on the classpath. It's going to be annoying if one needs to configure these all over, when I expect most of the time a customer will want to enable something for the entire application or application suite.

That sounds like a good experience. I will look into it.

tgregg force-pushed the compressed-stream-interceptor branch from 8d3aede to 4c24d36 on December 5, 2024 21:10
tgregg commented Dec 5, 2024

Revision 2:

  • Incorporates feedback from @toddjonker about naming, documentation, and class locations.
  • Allows users to add InputStreamInterceptor implementations by registering service providers on the classpath, as an alternative to adding them manually using IonReaderBuilder.addInputStreamInterceptor. This follows @toddjonker's suggestion, as well as @artemkach's suggestion to consider making the feature available via a configuration change. In addition to the added unit tests, I tried this out in a separate project and was able to register a stream interceptor simply by adding a file named com.amazon.ion.util.InputStreamInterceptor, containing the fully qualified class name of my custom implementation, to the project's META-INF/services/ directory. More information about service providers can be found at https://docs.oracle.com/javase/9/docs/api/java/util/ServiceLoader.html
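As a rough sketch of how classpath discovery with java.util.ServiceLoader generally works (not the PR's actual implementation): DiscoverableInterceptor, its formatName method, and InterceptorDiscovery are hypothetical names standing in for the real interface and whatever code performs the lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical service interface; the PR's interface is com.amazon.ion.util.InputStreamInterceptor.
interface DiscoverableInterceptor {
    String formatName(); // assumed method, for illustration only
}

final class InterceptorDiscovery {
    private InterceptorDiscovery() {}

    // Collects every provider registered for the interface on the classpath.
    static List<DiscoverableInterceptor> discover() {
        List<DiscoverableInterceptor> found = new ArrayList<>();
        for (DiscoverableInterceptor interceptor : ServiceLoader.load(DiscoverableInterceptor.class)) {
            found.add(interceptor);
        }
        return found;
    }
}
```

A provider is registered by placing a file under META-INF/services/ whose name is the fully qualified name of the interface and whose contents are the fully qualified class name of the implementation, one per line. The implementation needs a public no-argument constructor so that ServiceLoader can instantiate it. With no such file on the classpath, discover() returns an empty list.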

tgregg marked this pull request as ready for review on December 5, 2024 21:19
3 participants