-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds support for macro-aware transcoding from binary to text. #1000
Conversation
* @param ionData the data to read. | ||
* @return a new MacroAwareIonReader instance. | ||
*/ | ||
public MacroAwareIonReader buildMacroAware(byte[] ionData) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this is "hidden" in _Private_IonReaderBuilder
much like newSystemReader
is hidden in _Private_IonSystem
.
* and updates internal state accordingly. This always appends to the current encoding context. If there is nothing | ||
* to append, calling this function is a no-op. | ||
*/ | ||
private fun writeVerboseEncodingDirective() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "Verbose" methods added in this PR are not new; they were replaced with the equivalent system macro invocations in #987. I've added them back because we'll use them in the macro-aware transcoding when the source stream contains verbose encoding directives.
/** | ||
* An [IonReader] that delegates to a [ReaderAdapter]. | ||
*/ | ||
internal class IonReaderFromReaderAdapter(val reader: ReaderAdapter) : IonReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the easiest way to be able to provide an IonReaderContinuableCore
(which doesn't implement IonReader
, but does have a ReaderAdapter
) to IonWriter.writeValue(IonReader)
.
expressionArgsReader.beginEvaluatingMacroInvocation(macroEvaluator); | ||
macroEvaluatorIonReader.transcodeArgumentsTo(writer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a full materialization of the e-expression arguments followed by a re-write. If we change to lazy evaluation of e-expression arguments, this will need to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff looks large because I moved the companion object up a level so I could use it outside of the base class, resulting in a lot of whitespace changes. I'll point out what I actually added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, it's pretty easy to ignore whitespace when reviewing :)
@JvmStatic | ||
protected val WRITER_INLINE_DELIMITED: (OutputStream) -> IonWriter = ION_1_1.binaryWriterBuilder() | ||
.withSymbolInliningStrategy(SymbolInliningStrategy.ALWAYS_INLINE) | ||
.withLengthPrefixStrategy(LengthPrefixStrategy.NEVER_PREFIXED)::build | ||
|
||
@JvmStatic | ||
fun assertReadersHaveEquivalentValues(expectedDataReader: IonReader, actualDataReader: IonReader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method just moved, but I didn't modify it.
} | ||
|
||
@JvmStatic | ||
fun isIonVersionMarker(symbol: SymbolToken?): Boolean { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved, not modified.
abstract val readerFn: (ByteArray) -> IonReader | ||
val systemReaderFn: (ByteArray) -> IonReader = ION::newSystemReader | ||
@Nested | ||
inner class BinaryMacroAwareTranscode_ReaderNonContinuableBufferDefault { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only thing I added in here, in order to test that a macro-aware transcode off all the test data results in data-model-equivalent results.
int expectedNumberOfIonVersionMarkers, | ||
int expectedNumberOfAddSymbolsInvocations, | ||
int expectedNumberOfAddMacrosInvocations, | ||
int expectedNumberOfSetSymbolsInvocations, | ||
int expectedNumberOfSetMacrosInvocations, | ||
int expectedNumberOfExplicitEncodingDirectives |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ugly; open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a suggestion using a custom matcher: https://gist.github.com/jobarr-amzn/b009638a868941d63c4f1cf8277a7dfc#file-encodingdirectivecompilationtest-java-L925-L1013
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done
* The following limitations should be noted: | ||
* 1. Encoding directives with no effect on the encoding context may be | ||
* elided from the transcoded stream. An example would be an encoding | ||
* directive that re-exports the existing context but adds no new | ||
* macros or new symbols. | ||
* 2. When transcoding from text to text, comments will not be preserved. | ||
* 3. Open content in encoding directives (e.g. macro invocations that | ||
* expand to nothing) will not be preserved. | ||
* 4. Granular details of the binary encoding, like inlining vs. interning | ||
* for a particular symbol or length-prefixing vs. delimiting for a | ||
* particular container, may not be preserved. It is up to the user | ||
* to provide a writer configured to match these details if important. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are technical limitations, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could code around most of these, but it's generally not worth the effort.
String macroName = macroCompiler.getMacroName(); | ||
if (preserveMacroNames && macroName != null) { | ||
newMacros.put(MacroRef.byName(macroName), newMacro); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about this since #996 and working with the MacroTable abstraction there... I think we ought to support both lookups all the time. That would make preserveMacroNames
unnecessary, yeah?
What considerations (other than map/keyset size) am I missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the map size and the general ugliness about the fact that there are always two keys that map to the same value. I'm actually going to make the change to support both lookups all the time though, which is how it's handled when read from the text format.
registerIvmNotificationConsumer((x, y) -> { | ||
resetEncodingContext(); | ||
writer.startEncodingSegmentWithIonVersionMarker(); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming x
and y
are the major and minor versions? How does the writer know which IVM to use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I renamed the parameters to make this obvious. Which IVM to write is inherent to the writer implementation -- we don't have a single implementation that writes both formats.
src/main/java/com/amazon/ion/impl/macro/IonReaderFromReaderAdapter.kt
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, it's pretty easy to ignore whitespace when reviewing :)
src/test/java/com/amazon/ion/impl/EncodingDirectiveCompilationTest.java
Outdated
Show resolved
Hide resolved
int expectedNumberOfIonVersionMarkers, | ||
int expectedNumberOfAddSymbolsInvocations, | ||
int expectedNumberOfAddMacrosInvocations, | ||
int expectedNumberOfSetSymbolsInvocations, | ||
int expectedNumberOfSetMacrosInvocations, | ||
int expectedNumberOfExplicitEncodingDirectives |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a suggestion using a custom matcher: https://gist.github.com/jobarr-amzn/b009638a868941d63c4f1cf8277a7dfc#file-encodingdirectivecompilationtest-java-L925-L1013
See how this is used in ion-java-benchmark-cli here: amazon-ion/ion-java-benchmark-cli#66 |
…th ion-java-benchmark-cli.
registerIvmNotificationConsumer((major, minor) -> { | ||
resetEncodingContext(); | ||
writer.startEncodingSegmentWithIonVersionMarker(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
registerIvmNotificationConsumer((major, minor) -> { | |
resetEncodingContext(); | |
writer.startEncodingSegmentWithIonVersionMarker(); | |
registerIvmNotificationConsumer((major, minor) -> { | |
resetEncodingContext(); | |
// Which IVM to write is inherent to the writer implementation-- | |
// We don't have a single implementation that writes both formats. | |
writer.startEncodingSegmentWithIonVersionMarker(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
for (Matcher<String> expectation : expectations) { | ||
assertThat(rewritten, expectation); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an FYI this can be expressed as assertThat(rewritten, allOf(expectations))
, but I opted not to because this approach makes a cleaner error message. allOf
describes itself with essentially "${e1.describeTo} and ${e2.describeTo} and ... "
, which ends up being a pain to parse manually to find the violated expectation.
... which makes me realize there's probably a good opportunity there for a PR to Hamcrest...
0xE7, 0x01, 0x63, // One FlexSym annotation, with opcode, opcode 63 = system symbol 3 = $ion_symbol_table | ||
0xD7, // { | ||
0x0D, // FlexUInt 6 = imports | ||
0xEE, 0x03, // System symbol value 3 = $ion_symbol_table (denoting symbol table append) | ||
0x0F, // FlexUInt 7 = symbols | ||
0xB2, 0x91, 'a', // ["a"] | ||
0xE1, SystemSymbols_1_1.size() + 1 // first user symbol = a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's your common practice, but I appreciate these comments. Even small uncommented binary literals are much more of a speedbump, the comments really help.
for (Matcher<String> expectation : expectations) { | ||
assertThat(rewritten, expectation); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment made me go look at the AllOf implementation, where I realized that there is already a distinction between describeTo
and describeMismatch
, which AllOf
uses to enumerate only the failing matches. I just had to read more of the failure message. Given that, I suggest this instead:
for (Matcher<String> expectation : expectations) { | |
assertThat(rewritten, expectation); | |
} | |
assertThat(rewritten, allOf(expectation)); |
This will also require the appropriate import static org.hamcrest.Matchers.allOf;
in the imports section. For a failing test, it will generate a message like this:
Expected: (a String including 1 occurrences of $ion_1_1 and a String including 1 occurrences of add_symbols and a String including 0 occurrences of add_macros and a String including 0 occurrences of set_symbols and a String including 0 occurrences of set_macros and a String including 2 occurrences of $ion_encoding)
but: a String including 1 occurrences of add_symbols was "$ion_1_1 $ion_encoding::((symbol_table [\"SimonSays\",\"anything\"])) $ion_encoding::((symbol_table [\"foo\"]) (macro_table (macro SimonSays (anything) (%anything)))) (:SimonSays [(:SimonSays {$1:1.23e0}),(:SimonSays 123),\"abc\"]) [] "
The 'x and y and z and... ' part is not super helpful, but the but: a String including 1 occurrences of add_symbols was ...
part is. YMMV.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks
@@ -44,10 +48,13 @@ | |||
import java.util.TreeMap; | |||
import java.util.function.Consumer; | |||
|
|||
import static com.amazon.ion.BitUtils.bytes; | |||
import static org.hamcrest.MatcherAssert.assertThat; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import static org.hamcrest.MatcherAssert.assertThat; | |
import static org.hamcrest.MatcherAssert.assertThat; | |
import static org.hamcrest.Matchers.allOf; |
Only if you want to use allOf
below.
…es minor cleanups.
2126110
to
b3368c8
Compare
Description of changes:
Supports macro-aware transcoding from binary to text, preserving symbol tables, encoding directives, and e-expression invocations. Support for transcoding from text and to binary will be added in a future PR.
All of the API design and behavior introduced by this PR is negotiable. This is not considered a "public" API, so we have room to change it if necessary.
I deliberately did not try to build this into
IonWriter.writeValues(IonReader)
, which could be used for system-level transcoding of Ion 1.0 streams. That works well in Ion 1.0 because symbol tables are still part of the data model, and at the system level they just look like regular structs. E-expressions are different because they occur only in the encoding, not in the data model, and as a result there are no user-exposed IonWriter APIs to preserve them. Accordingly, I've proposed a solution in this PR that hooks into the core-level reader (via the newMacroAwareIonReader
interface), and can rely on encoding information to manipulate aMacroAwareIonWriter
appropriately. TheMacroAwareIonReader
will need to be implemented by the text reader in the future to enable macro-aware transcoding from text.The transcoding is performed using:
And produces output like the following, when provided with the equivalent binary stream:
When transcoded using
IonWriter.writeValues(IonReader)
, the same input produces:This can be used as a debugging tool and will be leveraged in ion-java-benchmark-cli to perform faithful re-writes of input data during write benchmarking.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.