Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds support for macro-aware transcoding from binary to text. #1000

Merged
merged 3 commits into from
Nov 27, 2024

Conversation

tgregg
Copy link
Contributor

@tgregg tgregg commented Nov 23, 2024

Description of changes:

Supports macro-aware transcoding from binary to text, preserving symbol tables, encoding directives, and e-expression invocations. Support for transcoding from text and to binary will be added in a future PR.

All of the API design and behavior introduced by this PR is negotiable. This is not considered a "public" API, so we have room to change it if necessary.

I deliberately did not try to build this into IonWriter.writeValues(IonReader), which could be used for system-level transcoding of Ion 1.0 streams. That works well in Ion 1.0 because symbol tables are still part of the data model, and at the system level they just look like regular structs. E-expressions are different because they occur only in the encoding, not in the data model, and as a result there are no user-exposed IonWriter APIs to preserve them. Accordingly, I've proposed a solution in this PR that hooks into the core-level reader (via the new MacroAwareIonReader interface), and can rely on encoding information to manipulate a MacroAwareIonWriter appropriately. The MacroAwareIonReader will need to be implemented by the text reader in the future to enable macro-aware transcoding from text.

The transcoding is performed using:

try (
  MacroAwareIonReader reader = ((_Private_IonReaderBuilder) IonReaderBuilder.standard()).buildMacroAware(data);
  MacroAwareIonWriter writer = (MacroAwareIonWriter) IonEncodingVersion.ION_1_1.textWriterBuilder().build(out);
) {
  reader.transcodeTo(writer);
}

And produces output like the following, when provided with the equivalent binary stream:

$ion_1_1
(:$ion::add_symbols
  (::
    "Pi"
  )
)
(:$ion::add_macros
  (
    macro
    $66
    ()
    3.14159
  )
)
$66
(:Pi)
(:$ion::set_symbols
  (::
    "Pi"
    "foo"
  )
)
(:$ion::add_macros
  (
    macro
    $2
    ()
    "bar"
  )
)
(:foo)
$1
(:$ion::set_macros
  (
    macro
    $2
    ()
    "baz"
  )
)
(:foo)
$2

When transcoded using IonWriter.writeValues(IonReader), the same input produces:

Pi
3.14159
"bar"
Pi
"baz"
foo

This can be used as a debugging tool and will be leveraged in ion-java-benchmark-cli to perform faithful re-writes of input data during write benchmarking.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* @param ionData the data to read.
* @return a new MacroAwareIonReader instance.
*/
public MacroAwareIonReader buildMacroAware(byte[] ionData) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this is "hidden" in _Private_IonReaderBuilder much like newSystemReader is hidden in _Private_IonSystem.

* and updates internal state accordingly. This always appends to the current encoding context. If there is nothing
* to append, calling this function is a no-op.
*/
private fun writeVerboseEncodingDirective() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "Verbose" methods added in this PR are not new; they were replaced with the equivalent system macro invocations in #987. I've added them back because we'll use them in the macro-aware transcoding when the source stream contains verbose encoding directives.

/**
* An [IonReader] that delegates to a [ReaderAdapter].
*/
internal class IonReaderFromReaderAdapter(val reader: ReaderAdapter) : IonReader {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the easiest way to be able to provide an IonReaderContinuableCore (which doesn't implement IonReader, but does have a ReaderAdapter) to IonWriter.writeValue(IonReader).

Comment on lines +1835 to +1836
expressionArgsReader.beginEvaluatingMacroInvocation(macroEvaluator);
macroEvaluatorIonReader.transcodeArgumentsTo(writer);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a full materialization of the e-expression arguments followed by a re-write. If we change to lazy evaluation of e-expression arguments, this will need to change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff looks large because I moved the companion object up a level so I could use it outside of the base class, resulting in a lot of whitespace changes. I'll point out what I actually added.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, it's pretty easy to ignore whitespace when reviewing :)

@JvmStatic
protected val WRITER_INLINE_DELIMITED: (OutputStream) -> IonWriter = ION_1_1.binaryWriterBuilder()
.withSymbolInliningStrategy(SymbolInliningStrategy.ALWAYS_INLINE)
.withLengthPrefixStrategy(LengthPrefixStrategy.NEVER_PREFIXED)::build

@JvmStatic
fun assertReadersHaveEquivalentValues(expectedDataReader: IonReader, actualDataReader: IonReader) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method just moved, but I didn't modify it.

}

@JvmStatic
fun isIonVersionMarker(symbol: SymbolToken?): Boolean {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved, not modified.

abstract val readerFn: (ByteArray) -> IonReader
val systemReaderFn: (ByteArray) -> IonReader = ION::newSystemReader
@Nested
inner class BinaryMacroAwareTranscode_ReaderNonContinuableBufferDefault {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only thing I added in here, in order to test that a macro-aware transcode off all the test data results in data-model-equivalent results.

Comment on lines 937 to 942
int expectedNumberOfIonVersionMarkers,
int expectedNumberOfAddSymbolsInvocations,
int expectedNumberOfAddMacrosInvocations,
int expectedNumberOfSetSymbolsInvocations,
int expectedNumberOfSetMacrosInvocations,
int expectedNumberOfExplicitEncodingDirectives
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ugly; open to suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done

Comment on lines +21 to +32
* The following limitations should be noted:
* 1. Encoding directives with no effect on the encoding context may be
* elided from the transcoded stream. An example would be an encoding
* directive that re-exports the existing context but adds no new
* macros or new symbols.
* 2. When transcoding from text to text, comments will not be preserved.
* 3. Open content in encoding directives (e.g. macro invocations that
* expand to nothing) will not be preserved.
* 4. Granular details of the binary encoding, like inlining vs. interning
* for a particular symbol or length-prefixing vs. delimiting for a
* particular container, may not be preserved. It is up to the user
* to provide a writer configured to match these details if important.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are technical limitations, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could code around most of these, but it's generally not worth the effort.

Comment on lines 1503 to 1506
String macroName = macroCompiler.getMacroName();
if (preserveMacroNames && macroName != null) {
newMacros.put(MacroRef.byName(macroName), newMacro);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about this since #996 and working with the MacroTable abstraction there... I think we ought to support both lookups all the time. That would make preserveMacroNames unnecessary, yeah?

What considerations (other than map/keyset size) am I missing here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the map size and the general ugliness about the fact that there are always two keys that map to the same value. I'm actually going to make the change to support both lookups all the time though, which is how it's handled when read from the text format.

Comment on lines 1777 to 1780
registerIvmNotificationConsumer((x, y) -> {
resetEncodingContext();
writer.startEncodingSegmentWithIonVersionMarker();
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming x and y are the major and minor versions? How does the writer know which IVM to use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I renamed the parameters to make this obvious. Which IVM to write is inherent to the writer implementation -- we don't have a single implementation that writes both formats.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, it's pretty easy to ignore whitespace when reviewing :)

Comment on lines 937 to 942
int expectedNumberOfIonVersionMarkers,
int expectedNumberOfAddSymbolsInvocations,
int expectedNumberOfAddMacrosInvocations,
int expectedNumberOfSetSymbolsInvocations,
int expectedNumberOfSetMacrosInvocations,
int expectedNumberOfExplicitEncodingDirectives
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tgregg
Copy link
Contributor Author

tgregg commented Nov 26, 2024

See how this is used in ion-java-benchmark-cli here: amazon-ion/ion-java-benchmark-cli#66

Comment on lines 1803 to 1807
registerIvmNotificationConsumer((major, minor) -> {
resetEncodingContext();
writer.startEncodingSegmentWithIonVersionMarker();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
registerIvmNotificationConsumer((major, minor) -> {
resetEncodingContext();
writer.startEncodingSegmentWithIonVersionMarker();
registerIvmNotificationConsumer((major, minor) -> {
resetEncodingContext();
// Which IVM to write is inherent to the writer implementation--
// We don't have a single implementation that writes both formats.
writer.startEncodingSegmentWithIonVersionMarker();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Comment on lines 978 to 980
for (Matcher<String> expectation : expectations) {
assertThat(rewritten, expectation);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an FYI this can be expressed as assertThat(rewritten, allOf(expectations)), but I opted not to because this approach makes a cleaner error message. allOf describes itself with essentially "${e1.describeTo} and ${e2.describeTo} and ... ", which ends up being a pain to parse manually to find the violated expectation.

... which makes me realize there's probably a good opportunity there for a PR to Hamcrest...

Comment on lines +6094 to +6100
0xE7, 0x01, 0x63, // One FlexSym annotation, with opcode, opcode 63 = system symbol 3 = $ion_symbol_table
0xD7, // {
0x0D, // FlexUInt 6 = imports
0xEE, 0x03, // System symbol value 3 = $ion_symbol_table (denoting symbol table append)
0x0F, // FlexUInt 7 = symbols
0xB2, 0x91, 'a', // ["a"]
0xE1, SystemSymbols_1_1.size() + 1 // first user symbol = a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's your common practice, but I appreciate these comments. Even small uncommented binary literals are much more of a speedbump, the comments really help.

Comment on lines 978 to 980
for (Matcher<String> expectation : expectations) {
assertThat(rewritten, expectation);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment made me go look at the AllOf implementation, where I realized that there is already a distinction between describeTo and describeMismatch, which AllOf uses to enumerate only the failing matches. I just had to read more of the failure message. Given that, I suggest this instead:

Suggested change
for (Matcher<String> expectation : expectations) {
assertThat(rewritten, expectation);
}
assertThat(rewritten, allOf(expectation));

This will also require the appropriate import static org.hamcrest.Matchers.allOf; in the imports section. For a failing test, it will generate a message like this:

Expected: (a String including 1 occurrences of $ion_1_1 and a String including 1 occurrences of add_symbols and a String including 0 occurrences of add_macros and a String including 0 occurrences of set_symbols and a String including 0 occurrences of set_macros and a String including 2 occurrences of $ion_encoding)
     but: a String including 1 occurrences of add_symbols was "$ion_1_1 $ion_encoding::((symbol_table [\"SimonSays\",\"anything\"])) $ion_encoding::((symbol_table [\"foo\"]) (macro_table (macro SimonSays (anything) (%anything)))) (:SimonSays [(:SimonSays {$1:1.23e0}),(:SimonSays 123),\"abc\"]) [] "

The 'x and y and z and... ' part is not super helpful, but the but: a String including 1 occurrences of add_symbols was ... part is. YMMV.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

@@ -44,10 +48,13 @@
import java.util.TreeMap;
import java.util.function.Consumer;

import static com.amazon.ion.BitUtils.bytes;
import static org.hamcrest.MatcherAssert.assertThat;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.allOf;

Only if you want to use allOf below.

@tgregg tgregg force-pushed the ion-11-encoding-transcode branch from 2126110 to b3368c8 Compare November 27, 2024 18:04
@tgregg tgregg merged commit 44da868 into ion-11-encoding Nov 27, 2024
17 checks passed
@tgregg tgregg deleted the ion-11-encoding-transcode branch November 27, 2024 18:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants