Adds support for macro-aware transcoding from binary to text. #1000

tgregg · 2024-11-23T02:33:27Z

Description of changes:

Supports macro-aware transcoding from binary to text, preserving symbol tables, encoding directives, and e-expression invocations. Support for transcoding from text and to binary will be added in a future PR.

All of the API design and behavior introduced by this PR is negotiable. This is not considered a "public" API, so we have room to change it if necessary.

I deliberately did not try to build this into IonWriter.writeValues(IonReader), which could be used for system-level transcoding of Ion 1.0 streams. That works well in Ion 1.0 because symbol tables are still part of the data model, and at the system level they just look like regular structs. E-expressions are different because they occur only in the encoding, not in the data model, and as a result there are no user-exposed IonWriter APIs to preserve them. Accordingly, I've proposed a solution in this PR that hooks into the core-level reader (via the new MacroAwareIonReader interface), and can rely on encoding information to manipulate a MacroAwareIonWriter appropriately. The MacroAwareIonReader will need to be implemented by the text reader in the future to enable macro-aware transcoding from text.

The transcoding is performed using:

try (
  MacroAwareIonReader reader = ((_Private_IonReaderBuilder) IonReaderBuilder.standard()).buildMacroAware(data);
  MacroAwareIonWriter writer = (MacroAwareIonWriter) IonEncodingVersion.ION_1_1.textWriterBuilder().build(out);
) {
  reader.transcodeTo(writer);
}

And produces output like the following, when provided with the equivalent binary stream:

$ion_1_1
(:$ion::add_symbols
  (::
    "Pi"
  )
)
(:$ion::add_macros
  (
    macro
    $66
    ()
    3.14159
  )
)
$66
(:Pi)
(:$ion::set_symbols
  (::
    "Pi"
    "foo"
  )
)
(:$ion::add_macros
  (
    macro
    $2
    ()
    "bar"
  )
)
(:foo)
$1
(:$ion::set_macros
  (
    macro
    $2
    ()
    "baz"
  )
)
(:foo)
$2

When transcoded using IonWriter.writeValues(IonReader), the same input produces:

Pi
3.14159
"bar"
Pi
"baz"
foo

This can be used as a debugging tool and will be leveraged in ion-java-benchmark-cli to perform faithful re-writes of input data during write benchmarking.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

tgregg · 2024-11-23T02:35:02Z

src/main/java/com/amazon/ion/impl/_Private_IonReaderBuilder.java

+     * @param ionData the data to read.
+     * @return a new MacroAwareIonReader instance.
+     */
+    public MacroAwareIonReader buildMacroAware(byte[] ionData) {


Note: this is "hidden" in _Private_IonReaderBuilder much like newSystemReader is hidden in _Private_IonSystem.

tgregg · 2024-11-23T02:36:54Z

src/main/java/com/amazon/ion/impl/bin/IonManagedWriter_1_1.kt

+     * and updates internal state accordingly. This always appends to the current encoding context. If there is nothing
+     * to append, calling this function is a no-op.
+     */
+    private fun writeVerboseEncodingDirective() {


The "Verbose" methods added in this PR are not new; they were replaced with the equivalent system macro invocations in #987. I've added them back because we'll use them in the macro-aware transcoding when the source stream contains verbose encoding directives.

tgregg · 2024-11-23T02:38:08Z

src/main/java/com/amazon/ion/impl/macro/IonReaderFromReaderAdapter.kt

+/**
+ * An [IonReader] that delegates to a [ReaderAdapter].
+ */
+internal class IonReaderFromReaderAdapter(val reader: ReaderAdapter) : IonReader {


This was the easiest way to pass an IonReaderContinuableCore (which doesn't implement IonReader, but does have a ReaderAdapter) to IonWriter.writeValue(IonReader).

tgregg · 2024-11-23T02:39:35Z

src/main/java/com/amazon/ion/impl/IonReaderContinuableCoreBinary.java

+                expressionArgsReader.beginEvaluatingMacroInvocation(macroEvaluator);
+                macroEvaluatorIonReader.transcodeArgumentsTo(writer);


This is a full materialization of the e-expression arguments followed by a re-write. If we change to lazy evaluation of e-expression arguments, this will need to change.

tgregg · 2024-11-23T02:41:22Z

src/test/java/com/amazon/ion/Ion_1_1_RoundTripTest.kt

The diff looks large because I moved the companion object up a level so I could use it outside of the base class, resulting in a lot of whitespace changes. I'll point out what I actually added.

No worries, it's pretty easy to ignore whitespace when reviewing :)

tgregg · 2024-11-23T02:41:43Z

src/test/java/com/amazon/ion/Ion_1_1_RoundTripTest.kt

        @JvmStatic
        protected val WRITER_INLINE_DELIMITED: (OutputStream) -> IonWriter = ION_1_1.binaryWriterBuilder()
            .withSymbolInliningStrategy(SymbolInliningStrategy.ALWAYS_INLINE)
            .withLengthPrefixStrategy(LengthPrefixStrategy.NEVER_PREFIXED)::build

+        @JvmStatic
+        fun assertReadersHaveEquivalentValues(expectedDataReader: IonReader, actualDataReader: IonReader) {


This method just moved, but I didn't modify it.

tgregg · 2024-11-23T02:41:52Z

src/test/java/com/amazon/ion/Ion_1_1_RoundTripTest.kt

+        }
+
+        @JvmStatic
+        fun isIonVersionMarker(symbol: SymbolToken?): Boolean {


Moved, not modified.

tgregg · 2024-11-23T02:44:24Z

src/test/java/com/amazon/ion/Ion_1_1_RoundTripTest.kt

-    abstract val readerFn: (ByteArray) -> IonReader
-    val systemReaderFn: (ByteArray) -> IonReader = ION::newSystemReader
+    @Nested
+    inner class BinaryMacroAwareTranscode_ReaderNonContinuableBufferDefault {


This is the only thing I added in here, in order to test that a macro-aware transcode of all the test data results in data-model-equivalent results.

tgregg · 2024-11-23T02:45:04Z

src/test/java/com/amazon/ion/impl/EncodingDirectiveCompilationTest.java

+        int expectedNumberOfIonVersionMarkers,
+        int expectedNumberOfAddSymbolsInvocations,
+        int expectedNumberOfAddMacrosInvocations,
+        int expectedNumberOfSetSymbolsInvocations,
+        int expectedNumberOfSetMacrosInvocations,
+        int expectedNumberOfExplicitEncodingDirectives


This is ugly; open to suggestions.

Here's a suggestion using a custom matcher: https://gist.github.com/jobarr-amzn/b009638a868941d63c4f1cf8277a7dfc#file-encodingdirectivecompilationtest-java-L925-L1013

Thanks, done

jobarr-amzn · 2024-11-25T21:47:33Z

src/main/java/com/amazon/ion/MacroAwareIonReader.kt

+     * The following limitations should be noted:
+     * 1. Encoding directives with no effect on the encoding context may be
+     *    elided from the transcoded stream. An example would be an encoding
+     *    directive that re-exports the existing context but adds no new
+     *    macros or new symbols.
+     * 2. When transcoding from text to text, comments will not be preserved.
+     * 3. Open content in encoding directives (e.g. macro invocations that
+     *    expand to nothing) will not be preserved.
+     * 4. Granular details of the binary encoding, like inlining vs. interning
+     *    for a particular symbol or length-prefixing vs. delimiting for a
+     *    particular container, may not be preserved. It is up to the user
+     *    to provide a writer configured to match these details if important.


These are technical limitations, right?

We could code around most of these, but it's generally not worth the effort.

jobarr-amzn · 2024-11-25T23:32:11Z

src/main/java/com/amazon/ion/impl/IonReaderContinuableCoreBinary.java

+                        String macroName = macroCompiler.getMacroName();
+                        if (preserveMacroNames && macroName != null) {
+                            newMacros.put(MacroRef.byName(macroName), newMacro);
+                        }


I've been thinking about this since #996 and working with the MacroTable abstraction there... I think we ought to support both lookups all the time. That would make preserveMacroNames unnecessary, yeah?

What considerations (other than map/keyset size) am I missing here?

It's the map size and the general ugliness about the fact that there are always two keys that map to the same value. I'm actually going to make the change to support both lookups all the time though, which is how it's handled when read from the text format.
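The "both lookups all the time" idea can be illustrated with a small, self-contained sketch. This is not ion-java code: the MacroTableSketch class, its nested Ref key type, and the use of a String as the macro body are hypothetical stand-ins chosen only to show one value being registered under two keys (an address and, when present, a name).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class MacroTableSketch {
    /** Hypothetical key type: a macro is referenced either by name or by numeric address. */
    static final class Ref {
        final String name;  // non-null only for by-name references
        final int address;  // meaningful only when name is null

        private Ref(String name, int address) {
            this.name = name;
            this.address = address;
        }

        static Ref byName(String name) {
            return new Ref(Objects.requireNonNull(name), -1);
        }

        static Ref byAddress(int address) {
            return new Ref(null, address);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Ref)) {
                return false;
            }
            Ref other = (Ref) o;
            return Objects.equals(name, other.name) && address == other.address;
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, address);
        }
    }

    // The macro "body" is a plain String here purely for illustration.
    private final Map<Ref, String> macros = new HashMap<>();
    private int nextAddress = 0;

    /** Registers a macro under its address and, when it has a name, under that name as well. */
    int add(String nameOrNull, String body) {
        int address = nextAddress++;
        macros.put(Ref.byAddress(address), body);
        if (nameOrNull != null) {
            macros.put(Ref.byName(nameOrNull), body);
        }
        return address;
    }

    String get(Ref ref) {
        return macros.get(ref);
    }

    public static void main(String[] args) {
        MacroTableSketch table = new MacroTableSketch();
        int address = table.add("Pi", "3.14159");
        System.out.println(table.get(Ref.byName("Pi")));
        System.out.println(table.get(Ref.byAddress(address)));
    }
}
```

The cost mentioned above is visible here: every named macro occupies two map entries that point at the same value.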

jobarr-amzn · 2024-11-25T23:33:41Z

src/main/java/com/amazon/ion/impl/IonReaderContinuableCoreBinary.java

+        registerIvmNotificationConsumer((x, y) -> {
+            resetEncodingContext();
+            writer.startEncodingSegmentWithIonVersionMarker();
+        });


I'm assuming x and y are the major and minor versions? How does the writer know which IVM to use?

Yes, I renamed the parameters to make this obvious. Which IVM to write is inherent to the writer implementation -- we don't have a single implementation that writes both formats.
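A minimal sketch of that answer, assuming nothing about the real classes: SegmentWriter and Text11Writer below are hypothetical stand-ins, and only the method name startEncodingSegmentWithIonVersionMarker and the text marker $ion_1_1 come from this PR. The callback receives the version the reader saw but does not forward it, because the writer already knows which marker it emits.

```java
import java.util.function.BiConsumer;

public class IvmCallbackSketch {
    /** Hypothetical stand-in for a writer that can start a new encoding segment. */
    interface SegmentWriter {
        void startEncodingSegmentWithIonVersionMarker();
    }

    /** Hypothetical text writer that only ever emits the Ion 1.1 marker. */
    static final class Text11Writer implements SegmentWriter {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startEncodingSegmentWithIonVersionMarker() {
            out.append("$ion_1_1\n");
        }
    }

    public static void main(String[] args) {
        Text11Writer writer = new Text11Writer();
        // The reader reports (major, minor); the writer's own version decides what is written.
        BiConsumer<Integer, Integer> onIvm =
            (major, minor) -> writer.startEncodingSegmentWithIonVersionMarker();
        onIvm.accept(1, 1);
        System.out.print(writer.out);
    }
}
```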

jobarr-amzn · 2024-11-26T17:21:09Z

src/test/java/com/amazon/ion/Ion_1_1_RoundTripTest.kt

No worries, it's pretty easy to ignore whitespace when reviewing :)

jobarr-amzn · 2024-11-26T19:09:26Z

src/test/java/com/amazon/ion/impl/EncodingDirectiveCompilationTest.java

+        int expectedNumberOfIonVersionMarkers,
+        int expectedNumberOfAddSymbolsInvocations,
+        int expectedNumberOfAddMacrosInvocations,
+        int expectedNumberOfSetSymbolsInvocations,
+        int expectedNumberOfSetMacrosInvocations,
+        int expectedNumberOfExplicitEncodingDirectives


Here's a suggestion using a custom matcher: https://gist.github.com/jobarr-amzn/b009638a868941d63c4f1cf8277a7dfc#file-encodingdirectivecompilationtest-java-L925-L1013

tgregg · 2024-11-26T21:39:56Z

See how this is used in ion-java-benchmark-cli here: amazon-ion/ion-java-benchmark-cli#66

jobarr-amzn · 2024-11-27T16:21:31Z

src/main/java/com/amazon/ion/impl/IonReaderContinuableCoreBinary.java

+        registerIvmNotificationConsumer((major, minor) -> {
            resetEncodingContext();
            writer.startEncodingSegmentWithIonVersionMarker();


Suggested change

         registerIvmNotificationConsumer((major, minor) -> {
             resetEncodingContext();
+            // Which IVM to write is inherent to the writer implementation--
+            // We don't have a single implementation that writes both formats.
             writer.startEncodingSegmentWithIonVersionMarker();

jobarr-amzn · 2024-11-27T16:27:15Z

src/test/java/com/amazon/ion/impl/EncodingDirectiveCompilationTest.java

+            for (Matcher<String> expectation : expectations) {
+                assertThat(rewritten, expectation);
+            }


As an FYI this can be expressed as assertThat(rewritten, allOf(expectations)), but I opted not to because this approach makes a cleaner error message. allOf describes itself with essentially "${e1.describeTo} and ${e2.describeTo} and ... ", which ends up being a pain to parse manually to find the violated expectation.

... which makes me realize there's probably a good opportunity there for a PR to Hamcrest...

jobarr-amzn · 2024-11-27T16:28:46Z

src/test/java/com/amazon/ion/impl/IonReaderContinuableTopLevelBinaryTest.java

+                0xE7, 0x01, 0x63, // One FlexSym annotation, with opcode, opcode 63 = system symbol 3 = $ion_symbol_table
+                0xD7, // {
+                0x0D, // FlexUInt 6 = imports
+                0xEE, 0x03, // System symbol value 3 = $ion_symbol_table (denoting symbol table append)
+                0x0F, // FlexUInt 7 = symbols
+                0xB2, 0x91, 'a', // ["a"]
+                0xE1, SystemSymbols_1_1.size() + 1 // first user symbol = a


I know it's your common practice, but I appreciate these comments. Even small uncommented binary literals are much more of a speedbump, the comments really help.
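For readers decoding those literals by hand, here is a small sketch of the FlexUInt arithmetic behind comments such as "0x0D = FlexUInt 6" and "0x0F = FlexUInt 7". It reflects my understanding of the Ion 1.1 FlexUInt layout (little-endian, with the position of the lowest set bit giving the byte count) and is not taken from ion-java; the FlexUIntSketch class and its methods are hypothetical, and the sketch only handles encodings of up to 8 bytes.

```java
public class FlexUIntSketch {
    /** Encodes a value in [0, 2^56) as a FlexUInt of 1 to 8 bytes. */
    static byte[] encode(long value) {
        if (value < 0 || value >= (1L << 56)) {
            throw new IllegalArgumentException("value out of range for this sketch");
        }
        int bits = 64 - Long.numberOfLeadingZeros(value); // 0 when value == 0
        int n = Math.max(1, (bits + 6) / 7);              // each byte carries 7 payload bits
        // Shift the payload left by n and set bit (n - 1) as the length marker.
        long encoded = (value << n) | (1L << (n - 1));
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            out[i] = (byte) (encoded >>> (8 * i));        // little-endian byte order
        }
        return out;
    }

    /** Decodes a FlexUInt that starts at index 0 of the given array. */
    static long decode(byte[] bytes) {
        int first = bytes[0] & 0xFF;
        if (first == 0) {
            throw new IllegalArgumentException("encodings longer than 8 bytes are not handled here");
        }
        int n = Integer.numberOfTrailingZeros(first) + 1; // byte count from the lowest set bit
        long raw = 0;
        for (int i = 0; i < n; i++) {
            raw |= (long) (bytes[i] & 0xFF) << (8 * i);
        }
        return raw >>> n;                                 // drop the length marker bits
    }

    public static void main(String[] args) {
        System.out.printf("6 -> 0x%02X%n", encode(6)[0]);
        System.out.printf("7 -> 0x%02X%n", encode(7)[0]);
        System.out.println(decode(new byte[] {0x0D}));
    }
}
```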

jobarr-amzn · 2024-11-27T17:25:32Z

src/test/java/com/amazon/ion/impl/EncodingDirectiveCompilationTest.java

+            for (Matcher<String> expectation : expectations) {
+                assertThat(rewritten, expectation);
+            }


My comment made me go look at the AllOf implementation, where I realized that there is already a distinction between describeTo and describeMismatch, which AllOf uses to enumerate only the failing matches. I just had to read more of the failure message. Given that, I suggest this instead:

Suggested change

-            for (Matcher<String> expectation : expectations) {
-                assertThat(rewritten, expectation);
-            }
+            assertThat(rewritten, allOf(expectations));

This will also require the appropriate import static org.hamcrest.Matchers.allOf; in the imports section. For a failing test, it will generate a message like this:

Expected: (a String including 1 occurrences of $ion_1_1 and a String including 1 occurrences of add_symbols and a String including 0 occurrences of add_macros and a String including 0 occurrences of set_symbols and a String including 0 occurrences of set_macros and a String including 2 occurrences of $ion_encoding) but: a String including 1 occurrences of add_symbols was "$ion_1_1 $ion_encoding::((symbol_table [\"SimonSays\",\"anything\"])) $ion_encoding::((symbol_table [\"foo\"]) (macro_table (macro SimonSays (anything) (%anything)))) (:SimonSays [(:SimonSays {$1:1.23e0}),(:SimonSays 123),\"abc\"]) [] "

The "x and y and z and ..." part is not super helpful, but the "but: a String including 1 occurrences of add_symbols was ..." part is. YMMV.

Done, thanks
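The property discussed in this thread, reporting only the violated expectations rather than the full list, can be sketched without Hamcrest. The class and method names below are hypothetical and this is not the matcher from the gist or the test; it only counts substring occurrences and collects a message for each count that differs from what was expected.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OccurrenceExpectationsSketch {
    /** Counts non-overlapping occurrences of a non-empty needle in haystack. */
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        int count = 0;
        int index = haystack.indexOf(needle);
        while (index >= 0) {
            count++;
            index = haystack.indexOf(needle, index + needle.length());
        }
        return count;
    }

    /** Returns one message per violated expectation; an empty list means everything matched. */
    static List<String> mismatches(String actual, Map<String, Integer> expectedCounts) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, Integer> expectation : expectedCounts.entrySet()) {
            int found = countOccurrences(actual, expectation.getKey());
            if (found != expectation.getValue()) {
                failures.add("expected " + expectation.getValue() + " occurrences of "
                    + expectation.getKey() + " but found " + found);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Map<String, Integer> expected = new LinkedHashMap<>();
        expected.put("$ion_1_1", 1);
        expected.put("add_symbols", 1);
        expected.put("set_macros", 0);
        String rewritten = "$ion_1_1 (:$ion::set_symbols (:: \"Pi\")) (:Pi)";
        // Only the add_symbols expectation is violated, so only it is reported.
        System.out.println(mismatches(rewritten, expected));
    }
}
```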

jobarr-amzn · 2024-11-27T17:26:07Z

src/test/java/com/amazon/ion/impl/EncodingDirectiveCompilationTest.java

@@ -44,10 +48,13 @@
 import java.util.TreeMap;
 import java.util.function.Consumer;

+import static com.amazon.ion.BitUtils.bytes;
+import static org.hamcrest.MatcherAssert.assertThat;


Suggested change

 import static org.hamcrest.MatcherAssert.assertThat;
+import static org.hamcrest.Matchers.allOf;

Only if you want to use allOf below.

Adds support for macro-aware transcoding from binary to text.

a7e081d

tgregg commented Nov 23, 2024

jobarr-amzn approved these changes Nov 26, 2024

tgregg mentioned this pull request Nov 26, 2024

Supports macro-aware transcoding of Ion 1.1 streams. amazon-ion/ion-java-benchmark-cli#66

Open

Fixes gaps in macro-aware transcoding identified while integrating with ion-java-benchmark-cli.

1e0dfe0

jobarr-amzn approved these changes Nov 27, 2024

Improves the factoring of the macro-aware transcoding tests and applies minor cleanups.

b3368c8

tgregg force-pushed the ion-11-encoding-transcode branch from 2126110 to b3368c8 Compare November 27, 2024 18:04

tgregg merged commit 44da868 into ion-11-encoding Nov 27, 2024
17 checks passed

tgregg deleted the ion-11-encoding-transcode branch November 27, 2024 18:49

tgregg mentioned this pull request Dec 9, 2024

Adds support for macro-aware transcoding from text. #1010

Open
