Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue for large data / kotlinx-io #217

Open
westnordost opened this issue Jun 14, 2024 · 30 comments
Open

Performance issue for large data / kotlinx-io #217

westnordost opened this issue Jun 14, 2024 · 30 comments

Comments

@westnordost
Copy link

TLDR: Library user experiences performance issues and requests support for kotlinx-io streaming primitives.


As one part of making an app multiplatform, I replaced Java HttpUrlConnection + XML pull parsing with ktor-client + xmlutil + kotlinx-serialization. Unfortunately, the performance at least on parsing large data (tested with ~10MB) got 3-6 times worse. (see StreetComplete#5686 for a short analysis and comparison).

I suspect that using xmlutil's streaming parser instead of deserializing the whole data structure with kotlinx-serialization might improve this somewhat because theoretically, xmlutil should be able start reading the bytes as they come through the wire and in a separate thread than the thread that receives the bytes. (But ultimately not knowing the internals, I can only guess. If you have any other suspicions for the cause of the performance issue, let me know!)

The documentation is a bit thin on streaming, but I understand I need to call xmlStreaming.newGenericReader(Reader) to get an XmlReader which is a xml pull parser interface. However, I need to supply a (Java) Reader, so currently it seems there is no interface for stream parsing on multiplatform.

I understand that byte streaming support and consequently text streaming built on top of that is a bit higgledy-piggledy right now in the Kotlin ecosystem because a replacement for InputSteam etc. has never been available in the Kotlin standard library. So, every library that does something with IO implemented their own thing, if anything at all - sometimes based on okio, sometimes an own implementation.

However, now it seems like things are about to get better: Both ktor and apparently kotlinx-serialization are being migrated to use kotlinx-io and thus the common interfaces like Sink, Source and Buffer, which are replacements for InputStream et al.

So, in case you'd agree that most likely my performance issue with large data stirs from the lack of XML streaming, I guess my request would be to move with kotlinx-serialization and ktor to support a common interface for streaming bytes and text.

@westnordost
Copy link
Author

westnordost commented Jun 14, 2024

I did two experiments:

1. decoding from stream instead

In my code, I replaced
xml.decodeFromString<ApiOsm>(string)
... with ...
xml.decodeFromReader<ApiOsm>(xmlStreaming.newReader(inputStream.bufferedReader())

However, it takes about the same time, actually is marginally slower than from string. Also, my assumption or hope that the parser can start parsing the incoming bytes while the http client is receiving them in parallel is apparently wrong. No idea why, but that would be on ktor developers, anyway. (Edit: Created an issue there)

2. writing a parser manually, using xmlutil's XmlReader

A manually written parser brings the desired performance gain, cutting the parsing time to about a third. It is still a bit slower than before, but not by much.

Streaming from a stream of bytes is (again) about 25% slower than streaming from a string. When taking into account the time it takes to stream the bytes into a string, it amounts to taking exactly the same time. (And in experiment one I already observed that ktor apparently doesn't make available the input stream until it has been completely received...)


In conclusion, streaming doesn't solve anything (for my use case) and for some reason either kotlinx-serialization or xmlutil's implementation seems to be quite slow.

@pdvrieze
Copy link
Owner

The generic parser on Jvm supports reading directly from InputStreams (with encoding detection) using the KtXmlReader factory function. If you want to use a "native" (stax) reader, you can just create this and get an xml reader calling StAXReader(myReader).

I've been having an initial stab at supporting kotlinx.io which is finally in an usable state. However it is held up a bit because kotlinx.io doesn't support encodings (and I haven't had time to have a decent stab at dealing with that added complexity).

As to performance, I would very much welcome a good performance test suite. As I haven't looked at it explicitly I'm not surprised that it is not optimal. Given all that it does it would never match a hand-rolled parser, but I expect there are quite some optimization options (hence the need to create the test suite to test it - I may repurpose the xmlschema test suite for that - I already include it in the work in progress xmlschema branch - it is big).

@westnordost
Copy link
Author

I am not able to completely follow. Is this about newReader(Reader) vs newGenericReader(Reader)? I haven't understood the difference between the two.

Given all that it does it would never match a hand-rolled parser

Right. The experiment above was made with your xmlutil XmlParser, though, so it is not completely hand-rolled.

kotlinx.io doesn't support encodings

You mean, any to string decoding, or just anything else than UTF-8?

@pdvrieze
Copy link
Owner

The difference between newReader and newGenericReader is that newGeneric reader will always return a KtXmlReader instance, and thus behave the same on each platform (modulo unsupported encodings). NewReader on JVM/Android/Browser will have different implementations.

What I meant with kotlinx.io decoding is that it doesn't support the byte to character conversion (charset decoding/encoding).

Btw. I've had a look at performance. I've done some testing on parsing xml schema (I already had that going, just needed to create the performance test). I've cut of some big chunks of time already, but there is probably a lot more that can be optimized. The main issue seems to be in building the descriptors (and if you want speed, don't put in order restrictions)

@westnordost
Copy link
Author

What I meant with kotlinx.io decoding is that it doesn't support the byte to character conversion (charset decoding/encoding).

There's public fun Source.readString(byteCount: Long): String and Sink.writeString(string: String, startIndex: Int = 0, endIndex: Int = string.length) (since 0.3.5) for UTF-8 strings

https://github.com/Kotlin/kotlinx-io/blob/7c4d095d2fe208abe1eb98a5e07c9b5a95cf3dc8/core/common/src/Utf8.kt#L232

The JVM-implementation includes a readString overload where a charset can be defined. (But for my and for most other people's use cases, UTF-8 is probably all they need anyway.)

Though, what most people probably want is a way to encapsulate that in the way the Reader does it on Java and it would make sense to add such a class for kotlinx-io directly rather than leave this up for implementation by every single parser/serializer lib that does something with text.
I am not acquainted at all with kotlinx-io yet, so I am not sure if they maybe already have or plan something like that, otherwise it would make sense to create an issue ticket over there.

Btw. I've had a look at performance. I've done some testing on parsing xml schema (I already had that going, just needed to create the performance test). I've cut of some big chunks of time already

🥳 , that's great! On kotlinx-serialization in particular or even in the core?

@pdvrieze
Copy link
Owner

I have already some UTF8 support in the library as it is needed for native serialization (it actually has a FileInputStream implementation). The Source.readString and Sink.writeString are not that helpful for me as the infrastructure reads by character. Again, I have UTF* code already in the library, but need to move it and create a way for it to work with charsets.

As to performance, this is mainly in the way descriptors (that drive the serialization) are created (there was way to much "convenient" code in there). I haven't actually done anything on the xml parsing itself, only serialization. I've already cut the time used to parse 10000 xmlschema documents in half (in 10 iterations, following a warmup run - warmup is significant here).

@pdvrieze
Copy link
Owner

pdvrieze commented Jul 3, 2024

I have done quite some optimization work on KtXmlReader (shaving at least 30% off the parsing time on the xml schema test suite - from approx 270ms to 200ms). The deserialization has also been optimized, and while it was in the 12000 ms range, it is now at approx 270ms for a list of xml events, or 478ms for both). And it should be worth noting that at least half of the parsing time is taken up by deflating from the resource jar that is used for the testing. I've also added a flag unchecked to the deserialization options that removes some correctness checks.

@westnordost
Copy link
Author

Awesome!

@westnordost
Copy link
Author

westnordost commented Jul 3, 2024

I do expect that and additional most significant speed boost on parsing large XML data in context of multiplatform (no Java) will arrive when xmlutil allows to parse from kotlinx-io's Source (and HTTP libraries like Ktor will provide a Source for the response body) because then parsing can take place in parallel to reading the data from disk.

So, xmlutil would be done parsing just right after the last bytes were read.

@pdvrieze
Copy link
Owner

pdvrieze commented Jul 3, 2024

If you want to take a look, I've pushed the "optimization" branch with this work.

Btw. there is actually already a native implementation of InputStream (and FileInputStream) that works with native files.

@EMaksymenko
Copy link

Hi. When do you plan to release this optimization?

@pdvrieze
Copy link
Owner

pdvrieze commented Jul 8, 2024

I'll probably move it to dev (and snapshot release) in the next few days. I'll try to see if I can add kotlinx.io support as well. Then some soaking/testing is probably good as there are quite some detailed changes that could have broken things.

@pdvrieze
Copy link
Owner

It should now be available as a snapshot release. Please have a go at it. I'll probably make it a beta soon (there are quite significant under the hood changes that may have inadvertently broken things). You may want to use the fast configuration preset

@EMaksymenko
Copy link

Hi, I had no chance to test night build yet.
Anybody tested this fix?
When will it be in beta release?

@pdvrieze
Copy link
Owner

@ComBatVision I'm building the beta now (there were still some bugs hitting the test suite, there are possibly further bugs/regressions present).

@EMaksymenko
Copy link

@pdvrieze Thank you. I will test it when beta will be available on maven central.

@pdvrieze
Copy link
Owner

It is there now.

@EMaksymenko
Copy link

This update requires Kotlin 2.0.0 We will test it little bit later. Our project was not migrated to 2.0 yet.

@westnordost
Copy link
Author

I did a small test. I can confirm a speed boost of about 30% with the new beta.

@pdvrieze
Copy link
Owner

This update requires Kotlin 2.0.0 We will test it little bit later. Our project was not migrated to 2.0 yet.

I may be able to build a version that uses the older kotlin and serialization. At least when not beta.

@pdvrieze
Copy link
Owner

I did a small test. I can confirm a speed boost of about 30% with the new beta.

I terms of speed it is important to retain the format across the runs as that caches the building of the file structure. Warmup costs are still quite large compared to further runs. I optimised mainly on repeated runs, rather than the first one.

@EMaksymenko
Copy link

This update requires Kotlin 2.0.0 We will test it little bit later. Our project was not migrated to 2.0 yet.

I may be able to build a version that uses the older kotlin and serialization. At least when not beta.

It will be good to support Kotlin 1.9.24 still, becasue migration to 2.0 on production is still to risky and to complex.
We have Compose in the project and migration is not trivial.

@westnordost
Copy link
Author

Already 0.90 is built with Kotlin 2.0. Can you link to a source why migration to 2.0 would be risky and complex?

@EMaksymenko
Copy link

@westnordost it is just risky for our project before major LTS release. We plan to migrate on Kotln 2.0 after it.

@pdvrieze When do you plan to make a stable (not beta) release? Any issues still available?

@pdvrieze
Copy link
Owner

pdvrieze commented Aug 9, 2024

I've been doing some work on my own system and found some bugs. In any case I will be doing a release candidate (just to let people know that there may be unforeseen regressions)

@pdvrieze
Copy link
Owner

pdvrieze commented Aug 9, 2024

Btw. As to a 1.9 version, I created a fork, but it failed some tests and I will need to fix them

@EMaksymenko
Copy link

Btw. As to a 1.9 version, I created a fork, but it failed some tests and I will need to fix them

We have started migration to Kotlin 2.0.0 so you may skip this idea about supporting Kotlin 1.9. Thanks.

@hfhbd
Copy link

hfhbd commented Nov 11, 2024

Do you still have any plans to natively integrate kotlinx.io? Maybe as a separate artifact?

@pdvrieze
Copy link
Owner

@hfhbd In principle yes. It should work as extension as the existing interface already works with readers/writers/appendables. The thing is that the mapping needs to be done, including handling charset conversions (the hanger).

@pdvrieze
Copy link
Owner

I've just created this in the dev channel (it is currently building). You should be able to access it as snapshot release. This adds extension functions: encodeToSink and decodeFromSource. They should be efficient, except for JavaScript, where decoding just converts the source to string. On non-JVM targets only UTF-8/UTF-16LE/UTF-16BE/UTF32 are supported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants