Add managed writer for Ion 1.1 with basic round-trip testing #830
* TODO:
* - Add proper tests
*/
internal class BufferedOutputStream(
🗺️ This allows the text writer to be buffered, like the binary writer, so that we can make sure that we buffer enough before writing out the encoding directives. We still need to add some way of specifying parts of the encoding directives ahead of time so that we don't have to do any buffering. If we can guarantee that there will be no unexpected encoding directives, then we don't need to buffer the text writer.
@@ -17,6 +17,7 @@ import java.time.Instant
* - Never writes using "long string" syntax in order to simplify the writer.
* - Uses `[: ... ]` for expression groups.
* - Does not try to resolve symbol tokens. That is the concern of the managed writer.
* - To make it easier to concatenate streams, this eagerly emits a top-level separator after each top-level syntax item.
🗺️ I think this is something that we can probably live with.
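For anyone following along, here is a minimal sketch of why the eager separator helps when streams are concatenated. It is not the writer's code: `TopLevelSeparatorDemo`, `render`, and the choice of a newline as the separator are all assumptions made for illustration.

```java
import java.util.List;

// Hypothetical demo; not part of ion-java.
final class TopLevelSeparatorDemo {
    // Assumed separator; the real writer may emit different whitespace.
    static final String SEPARATOR = "\n";

    /** Renders each top-level value's text, optionally followed by a separator. */
    static String render(List<String> topLevelValues, boolean eagerSeparator) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < topLevelValues.size(); i++) {
            sb.append(topLevelValues.get(i));
            // Lazy mode only separates *between* values, so the stream ends without one.
            if (eagerSeparator || i < topLevelValues.size() - 1) {
                sb.append(SEPARATOR);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String lazy = render(List.of("1", "2"), false) + render(List.of("3"), false);
        String eager = render(List.of("1", "2"), true) + render(List.of("3"), true);
        System.out.println(lazy.replace("\n", "\\n"));  // 1\n23
        System.out.println(eager.replace("\n", "\\n")); // 1\n2\n3\n
    }
}
```

Without the trailing separator, the last value of the first stream and the first value of the second fuse into a single token (`23`). The cost of the eager approach is one extra separator at the end of the output.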
- return sign == 0 ? -0e0 : 0e0;
+ return sign == 0 ? 0e0 : -0e0;
🗺️ I'm pretty sure that, if I'm understanding it right, when the sign bit is `0`, the float is positive. Either way, I needed to make this change to get the Ion 1.1 round trip tests to pass.
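A quick standalone check of the IEEE 754 convention, using only `java.lang`. `SignBitDemo` and its helper names are made up for this sketch, and `zeroFor` simply copies the corrected expression from the diff; the surrounding decoder code is not reproduced here.

```java
// Standalone check; unrelated to the ion-java decoder itself.
final class SignBitDemo {
    /** Returns the IEEE 754 sign bit (0 or 1) of a double. */
    static int signBit(double d) {
        return (int) (Double.doubleToRawLongBits(d) >>> 63);
    }

    /** Copies the corrected expression from the diff: sign 0 means positive zero. */
    static double zeroFor(int sign) {
        return sign == 0 ? 0e0 : -0e0;
    }

    public static void main(String[] args) {
        System.out.println(signBit(0e0));        // 0
        System.out.println(signBit(-0e0));       // 1
        System.out.println(signBit(zeroFor(0))); // 0
        System.out.println(signBit(zeroFor(1))); // 1
    }
}
```

Positive zero has a sign bit of `0` and negative zero has a sign bit of `1`, which matches the new version of the line.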
@@ -49,11 +36,11 @@
* <p>
* Instances of this class are safe for use by multiple threads.
*/
- class LocalSymbolTable
+ public class LocalSymbolTable
🗺️ Just visibility changes in this file.
src/main/java/com/amazon/ion/impl/bin/IonRawBinaryWriter_1_1.kt
Let's open an issue with a list of the classes we had to make public in this PR so we don't forget to go back and revisit them.
Approving with comments, since we already plan to iterate on this.
* TODO:
* - Add proper tests
*/
internal class BufferedOutputStream(
Can you comment a little more on why this is needed? Also, I suggest a name that doesn't clash with a built-in.
The `OutputStreamFastAppendable` buffers up to 4096 bytes before flushing to the output stream. If you have a text top-level value that is larger than 4096 bytes, then you can end up with partial or whole user values flushing to the `OutputStream` before you write the system value that it depends on.
I can change the name.
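To make the ordering problem concrete, here is a rough sketch of the "hold everything until we know the system value" idea. It is not the class from this PR: `HoldUntilFlushDemo`, `HoldingBuffer`, and `flushTo` are hypothetical names, and `<directive>` is a placeholder standing in for whatever system value has to come first.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; the class and method names are not from the PR.
final class HoldUntilFlushDemo {
    /** Accumulates user-value bytes in memory; nothing reaches the sink until flushTo. */
    static final class HoldingBuffer extends OutputStream {
        private final ByteArrayOutputStream held = new ByteArrayOutputStream();

        @Override public void write(int b) { held.write(b); }

        @Override public void write(byte[] b, int off, int len) { held.write(b, off, len); }

        /** Writes the system value first, then the held user bytes, then resets. */
        void flushTo(OutputStream sink, byte[] systemValue) throws IOException {
            sink.write(systemValue);
            held.writeTo(sink);
            held.reset();
        }
    }

    static String demo() {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            HoldingBuffer userData = new HoldingBuffer();
            // The user value is written first, but must appear after the system value.
            userData.write("foo::bar\n".getBytes(StandardCharsets.UTF_8));
            userData.flushTo(sink, "<directive>\n".getBytes(StandardCharsets.UTF_8));
            return new String(sink.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo()); // <directive> on the first line, then foo::bar
    }
}
```

Because the buffer has no fixed size limit, a large top-level value cannot spill to the sink early. The trade-off is memory: the whole value is held until the flush.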
* TODO: See if we can add other context, such as annotations that are going to be added to this container,
* or the field name (if this container is in a struct).
I wonder if there's a "WriterView" interface we could define that we could vend any desired context through.
fun writeDelimited(context: WriterView): Boolean
interface WriterView {
    fun getContainerType(): ContainerType
    fun getDepth(): Int
    // etc.
}
I'm also starting to think of the expense of this. Evaluating on every container might be prohibitive for more complicated rules. I wonder if we could somehow compile the rules into the writer so it can keep track of which options are active as it progresses in the stream, similar to how the path extractor keeps track of partial matches at its current location. If there are no potential matches, no invocations of the callback need be made.
I think we can go with something like what's proposed for the preview, and iterate once we have some performance numbers.
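A slightly fuller sketch of that shape, written in Java so it can be run standalone. None of these types exist in ion-java: `WriterViewDemo`, `DelimitedStrategy`, the `ContainerType` values, the accessor name `getContainerType`, and the example rule are all assumptions for illustration.

```java
// Hypothetical sketch of the "WriterView" idea; none of these types are in ion-java.
final class WriterViewDemo {
    enum ContainerType { LIST, SEXP, STRUCT }

    /** Read-only context the writer could vend to a strategy callback. */
    interface WriterView {
        ContainerType getContainerType();
        int getDepth();
    }

    /** Decides whether the container being stepped into is written as delimited. */
    interface DelimitedStrategy {
        boolean writeDelimited(WriterView context);
    }

    /** Example rule (an assumption): only top-level lists are delimited. */
    static final DelimitedStrategy TOP_LEVEL_LISTS = ctx ->
        ctx.getDepth() == 0 && ctx.getContainerType() == ContainerType.LIST;

    /** Minimal immutable view, standing in for the writer's real state. */
    static WriterView view(ContainerType type, int depth) {
        return new WriterView() {
            @Override public ContainerType getContainerType() { return type; }
            @Override public int getDepth() { return depth; }
        };
    }

    public static void main(String[] args) {
        System.out.println(TOP_LEVEL_LISTS.writeDelimited(view(ContainerType.LIST, 0)));   // true
        System.out.println(TOP_LEVEL_LISTS.writeDelimited(view(ContainerType.STRUCT, 0))); // false
        System.out.println(TOP_LEVEL_LISTS.writeDelimited(view(ContainerType.LIST, 2)));   // false
    }
}
```

This version still invokes the callback on every container, so it does nothing for the cost concern above. It only shows how new context could be added to the view without changing the callback signature.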
The WriterView is a good idea.
* Options that are specific to Ion 1.1 and handled in the managed writer.
* These are (mostly) generalizable to both text and binary.
*/
data class ManagedWriterOptions(
Should this have a `_1_1` suffix?
/**
* Whether the symbols in the encoding directive should be interned or not.
* For binary, almost certainly want this to be true, and for text, it's
* more readable if it's false.
*/
val internEncodingDirectiveSymbols: Boolean = false,
What would the text look like if this were true?
The symbol table would look something like this:
$3::{
$7: [ ... ]
}
vs.
$ion_symbol_table::{
symbols: [ ... ]
}
It's not super important, but when you're writing text with a symbol table, I think it helps make it easier for a human to find and read the symbol table.
/**
* Indicates whether a particular symbol text should be written inline (as opposed to writing as a SID).
*/
fun shouldWriteInline(symbolKind: SymbolKind, text: String): Boolean
Are we going to call this on every field name, annotation, and symbol value?
It does right now. Now that I've been away from this for an hour, it does seem like that could be a performance problem. We can get rid of it if we need to.
if (content == null) {
    userData.writeNull(IonType.SYMBOL)
} else {
    handleSymbolText(content, options.shouldWriteInline(SymbolKind.VALUE, content), userData::writeSymbol, userData::writeSymbol)
Should the inlining decision for a particular symbol be cached?
Maybe? A HashMap lookup might be more expensive than a simple check such as `text.length > 5`.
But also, I think that caching the decision prevents some of the interesting implementation ideas that I mentioned. (E.g. only intern the symbol after it's occurred at least N times.)
As I've thought about this more, I think that caching works well if it is implemented as part of the `SymbolInliningStrategy`. This way, the `SymbolInliningStrategy` can manage the cache invalidation in a way that works with its logic.
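As a rough illustration of a strategy that owns its own state, here is a sketch of the "only intern after N occurrences" idea. Only `shouldWriteInline`, `SymbolInliningStrategy`, and `SymbolKind.VALUE` come from this PR; the Java rendering of the interface, the other `SymbolKind` values, `InternAfterN`, and `reset` are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; only shouldWriteInline and SymbolKind.VALUE appear in the PR.
final class InliningStrategyDemo {
    // FIELD_NAME and ANNOTATION are assumed names for the other symbol kinds.
    enum SymbolKind { VALUE, FIELD_NAME, ANNOTATION }

    interface SymbolInliningStrategy {
        boolean shouldWriteInline(SymbolKind symbolKind, String text);
    }

    /** Writes a symbol inline until it has been seen `threshold` times, then interns it. */
    static final class InternAfterN implements SymbolInliningStrategy {
        private final int threshold;
        private final Map<String, Integer> seen = new HashMap<>();

        InternAfterN(int threshold) { this.threshold = threshold; }

        @Override
        public boolean shouldWriteInline(SymbolKind symbolKind, String text) {
            // Count this occurrence, then compare against the threshold.
            int count = seen.merge(text, 1, Integer::sum);
            return count < threshold;
        }

        /** The strategy owns its state, so it also decides when to discard it. */
        void reset() { seen.clear(); }
    }

    public static void main(String[] args) {
        InternAfterN s = new InternAfterN(3);
        System.out.println(s.shouldWriteInline(SymbolKind.VALUE, "foo")); // true
        System.out.println(s.shouldWriteInline(SymbolKind.VALUE, "foo")); // true
        System.out.println(s.shouldWriteInline(SymbolKind.VALUE, "foo")); // false
        s.reset();
        System.out.println(s.shouldWriteInline(SymbolKind.VALUE, "foo")); // true
    }
}
```

A writer-level cache of the first answer would have frozen `"foo"` as inline forever. Keeping the map inside the strategy lets the answer change over time, at the cost of one map lookup per symbol.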
Issue #, if available:
None
Description of changes:
This adds a "managed" writer for Ion 1.1 that is generic over whether the raw encoding is Text or Binary.
Caveats:
However, this does have some basic round-trip tests using the data in `ion-tests`, and this is enough that we can start building other things around it and incrementally filling in the gaps.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.