Add managed writer for Ion 1.1 with basic round-trip testing #830

Merged: popematt merged 3 commits into amazon-ion:ion-11-encoding from ion-11-managed-writer on May 6, 2024

Conversation

@popematt popematt commented May 3, 2024

Issue #, if available:
None

Description of changes:

This adds a "managed" writer for Ion 1.1 that is generic over whether the raw encoding is Text or Binary.

Caveats:

  • There are a lot of classes that need unit testing
  • I had to change the visibility of at least a half dozen existing classes. We will need to revisit the package organization and the visibility of those classes.
  • There is a lot of documentation that is not complete yet.
  • Does not support shared symbol tables.
  • ...probably others? I've exhausted my mental capacity for the week, and I just need to get this out.

However, this does have some basic round-trip tests using the data in ion-tests, which is enough that we can start building other things around it and incrementally fill in the gaps.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@popematt popematt requested a review from tgregg May 3, 2024 22:36
* TODO:
* - Add proper tests
*/
internal class BufferedOutputStream(
@popematt (Contributor Author) commented:

🗺️ This allows the text writer to be buffered, like the binary writer, so that we can make sure that we buffer enough before writing out the encoding directives. We still need to add some way of specifying parts of the encoding directives ahead of time so that we don't have to do any buffering. If we can guarantee that there will be no unexpected encoding directives, then we don't need to buffer the text writer.
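The buffering idea described here can be illustrated with a minimal Java sketch. It is not the class added in this PR: `DeferredTextOutput` and its method names are hypothetical, and the directive text in the usage note is only a placeholder. The sketch holds user data in memory so that a directive decided later can still be written to the sink first.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the class introduced by this PR.
final class DeferredTextOutput {
    private final OutputStream sink;
    // Unbounded in-memory buffer: nothing reaches the sink until we say so.
    private final ByteArrayOutputStream pendingUserData = new ByteArrayOutputStream();

    DeferredTextOutput(OutputStream sink) {
        this.sink = sink;
    }

    /** Buffers user data without touching the sink. */
    void writeUserData(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        pendingUserData.write(bytes, 0, bytes.length);
    }

    /** Writes the directive first, then the buffered user data, then clears the buffer. */
    void flushWithDirective(String directive) {
        try {
            sink.write(directive.getBytes(StandardCharsets.UTF_8));
            pendingUserData.writeTo(sink);
            pendingUserData.reset();
            sink.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, calling `writeUserData("$10 ")` followed by `flushWithDirective(...)` with a symbol table declaration leaves the declaration ahead of `$10` in the sink, even though the user value was written first.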

@@ -17,6 +17,7 @@ import java.time.Instant
* - Never writes using "long string" syntax in order to simplify the writer.
* - Uses `[: ... ]` for expression groups.
* - Does not try to resolve symbol tokens. That is the concern of the managed writer.
* - To make it easier to concatenate streams, this eagerly emits a top-level separator after each top-level syntax item.
@popematt (Contributor Author) commented:

🗺️ I think this is something that we can probably live with.

src/main/java/com/amazon/ion/impl/IonRawTextWriter_1_1.kt (outdated, resolved)
- return sign == 0 ? -0e0 : 0e0;
+ return sign == 0 ? 0e0 : -0e0;
@popematt (Contributor Author) commented:

🗺️ If I'm understanding it correctly, when the sign bit is 0, the float is positive. Either way, I needed to make this change to get the Ion 1.1 round-trip tests to pass.
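The sign-bit convention can be checked directly in Java. The sketch below is a standalone illustration (the class and method names are hypothetical, not code from this PR): in IEEE 754, a sign bit of 0 denotes a non-negative value and a sign bit of 1 denotes a negative one, which matches the corrected expression.

```java
// Hypothetical helper for illustration; not part of the PR.
final class SignBitDemo {
    /** Returns the IEEE 754 sign bit (0 or 1) of a double. */
    static int signBit(double value) {
        return (int) (Double.doubleToRawLongBits(value) >>> 63);
    }

    /** Mirrors the corrected expression: sign bit 0 yields +0.0, sign bit 1 yields -0.0. */
    static double zeroForSign(int sign) {
        return sign == 0 ? 0e0 : -0e0;
    }
}
```

Because `0.0 == -0.0` is true in Java, a round-trip test has to compare the raw bits (or divide by the value) to tell the two zeros apart.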

@@ -49,11 +36,11 @@
* <p>
* Instances of this class are safe for use by multiple threads.
*/
- class LocalSymbolTable
+ public class LocalSymbolTable
@popematt (Contributor Author) commented:

🗺️ Just visibility changes in this file.

@tgregg (Contributor) left a review:

Let's open an issue with a list of the classes we had to make public in this PR so we don't forget to go back and revisit them.

Approving with comments, since we already plan to iterate on this.

* TODO:
* - Add proper tests
*/
internal class BufferedOutputStream(
@tgregg (Contributor) commented:

Can you comment a little more on why this is needed? Also, I suggest a name that doesn't clash with a built-in.

@popematt (Contributor Author) commented:

The OutputStreamFastAppendable buffers up to 4096 bytes before flushing to the output stream. If a top-level text value is larger than 4096 bytes, partial or whole user values can be flushed to the OutputStream before the system value they depend on has been written.
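The ordering hazard with a fixed-size buffer can be reproduced with a small Java sketch. It uses the JDK's `java.io.BufferedOutputStream` as a stand-in for `OutputStreamFastAppendable` (an assumption made purely for illustration; the two classes are not the same), and `EarlyFlushDemo` is a hypothetical name.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo; java.io.BufferedOutputStream stands in for a fixed-size text buffer.
final class EarlyFlushDemo {
    /** Returns how many bytes reached the sink before any explicit flush. */
    static int bytesReachingSinkBeforeFlush(int bufferSize, int valueSize) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        BufferedOutputStream buffered = new BufferedOutputStream(sink, bufferSize);
        try {
            // Simulates writing one top-level value of valueSize bytes.
            buffered.write(new byte[valueSize], 0, valueSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // No flush() has been called; anything in the sink escaped on its own.
        return sink.size();
    }
}
```

With a 4096-byte buffer, a 100-byte value stays buffered, while a 5000-byte value reaches the sink without any flush call. That is the situation in which user data could land ahead of a system value that has not been written yet.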

I can change the name.

Comment on lines 17 to 18
* TODO: See if we can add other context, such as annotations that are going to be added to this container,
* or the field name (if this container is in a struct).
@tgregg (Contributor) commented:

I wonder if there's a "WriterView" interface we could define that we could vend any desired context through.

fun writeDelimited(context: WriterView): Boolean

interface WriterView {
    val containerType: ContainerType
    val depth: Int
    // etc.
}

I'm also starting to think of the expense of this. Evaluating on every container might be prohibitive for more complicated rules. I wonder if we could somehow compile the rules into the writer so it can keep track of which options are active as it progresses in the stream, similar to how the path extractor keeps track of partial matches at its current location. If there are no potential matches, no invocations of the callback need be made.

I think we can go with something like what's proposed for the preview, and iterate once we have some performance numbers.
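A minimal Java sketch of the "WriterView" idea follows. Everything in it is hypothetical: the `ContainerType` enum, the `DelimitedContainerStrategy` name, and the depth-based rule are assumptions for illustration, and the first accessor is assumed to be meant to return the container type. It only shows how a read-only view could carry context to the callback without exposing the writer itself.

```java
// All names below are hypothetical; this is a sketch, not the API in this PR.
enum ContainerType { LIST, SEXP, STRUCT }

/** Read-only view of writer state that can be handed to user-supplied rules. */
interface WriterView {
    ContainerType getContainerType();
    int getDepth();
}

/** Decides whether the container about to be written should be delimited. */
interface DelimitedContainerStrategy {
    boolean writeDelimited(WriterView context);
}

final class WriterViewDemo {
    /** Example rule: only top-level containers are written as delimited. */
    static final DelimitedContainerStrategy TOP_LEVEL_ONLY = view -> view.getDepth() == 0;

    /** Builds an immutable view for a given container type and depth. */
    static WriterView viewOf(ContainerType type, int depth) {
        return new WriterView() {
            @Override public ContainerType getContainerType() { return type; }
            @Override public int getDepth() { return depth; }
        };
    }
}
```

A rule this cheap costs little per container. The concern about evaluation cost applies to rules that inspect more context, which is where compiling rules into the writer might pay off.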

@popematt (Contributor Author) commented:

The WriterView is a good idea.

* Options that are specific to Ion 1.1 and handled in the managed writer.
* These are (mostly) generalizable to both text and binary.
*/
data class ManagedWriterOptions(
@tgregg (Contributor) commented:

Should this have a _1_1 suffix?

Comment on lines 10 to 15
/**
* Whether the symbols in the encoding directive should be interned or not.
* For binary, almost certainly want this to be true, and for text, it's
* more readable if it's false.
*/
val internEncodingDirectiveSymbols: Boolean = false,
@tgregg (Contributor) commented:

What would the text look like if this were true?

@popematt (Contributor Author) commented:

The symbol table would look something like this:

$3::{
  $7: [ ... ]
}

vs.

$ion_symbol_table::{
  symbols: [ ... ]
}

It's not super important, but when you're writing text with a symbol table, I think it makes it easier for a human to find and read the symbol table.

Comment on lines +31 to +35
/**
* Indicates whether a particular symbol text should be written inline (as opposed to writing as a SID).
*/
fun shouldWriteInline(symbolKind: SymbolKind, text: String): Boolean

@tgregg (Contributor) commented:

Are we going to call this on every field name, annotation, and symbol value?

@popematt (Contributor Author) commented:

It does right now. Now that I've been away from this for an hour, it does seem like that could be a performance problem. We can get rid of it if we need to.

if (content == null) {
userData.writeNull(IonType.SYMBOL)
} else {
handleSymbolText(content, options.shouldWriteInline(SymbolKind.VALUE, content), userData::writeSymbol, userData::writeSymbol)
@tgregg (Contributor) commented:

Should the inlining decision for a particular symbol be cached?

@popematt (Contributor Author) commented:

Maybe? A HashMap lookup might be more expensive than a simple check such as text.length > 5.

But also, I think that caching the decision prevents some of the interesting implementation ideas that I mentioned. (E.g. only intern the symbol after it's occurred at least N times.)

@popematt (Contributor Author) commented:

As I've thought about this more, I think that caching works well if it is implemented as part of the SymbolInliningStrategy. This way, the SymbolInliningStrategy can manage the cache invalidation in a way that works with its logic.
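One way to picture a strategy that owns its own state is the Java sketch below. It is an assumption-laden illustration rather than the code in this PR: the interface shape follows the `shouldWriteInline(symbolKind, text)` signature quoted earlier in the thread, the `SymbolKind` constants other than `VALUE` are guesses, and `InternAfterNOccurrences` is a hypothetical implementation of the "intern after N occurrences" idea.

```java
import java.util.HashMap;
import java.util.Map;

// VALUE appears in the PR; the other constants are assumed for illustration.
enum SymbolKind { VALUE, ANNOTATION, FIELD_NAME }

interface SymbolInliningStrategy {
    boolean shouldWriteInline(SymbolKind symbolKind, String text);
}

/** Hypothetical strategy: write a symbol inline until it has been seen `threshold` times. */
final class InternAfterNOccurrences implements SymbolInliningStrategy {
    private final int threshold;
    // The strategy owns this state, so it also decides when to invalidate it.
    private final Map<String, Integer> occurrences = new HashMap<>();

    InternAfterNOccurrences(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public boolean shouldWriteInline(SymbolKind symbolKind, String text) {
        int seen = occurrences.merge(text, 1, Integer::sum);
        return seen < threshold;
    }

    /** Clears the counts, for example when the writer starts a new symbol table context. */
    void reset() {
        occurrences.clear();
    }
}
```

Keeping the counts inside the strategy means a writer-level cache of decisions is unnecessary, and a stateless rule such as a length check pays no lookup cost at all.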

@popematt popematt merged commit 627ffb3 into amazon-ion:ion-11-encoding May 6, 2024
15 of 34 checks passed
@popematt popematt deleted the ion-11-managed-writer branch May 6, 2024 23:51
2 participants