Performance issue for large data / kotlinx-io #217

westnordost · 2024-06-14T10:46:51Z

TLDR: Library user experiences performance issues and requests support for kotlinx-io streaming primitives.

As one part of making an app multiplatform, I replaced Java HttpUrlConnection + XML pull parsing with ktor-client + xmlutil + kotlinx-serialization. Unfortunately, the performance at least on parsing large data (tested with ~10MB) got 3-6 times worse. (see StreetComplete#5686 for a short analysis and comparison).

I suspect that using xmlutil's streaming parser instead of deserializing the whole data structure with kotlinx-serialization might improve this somewhat because theoretically, xmlutil should be able start reading the bytes as they come through the wire and in a separate thread than the thread that receives the bytes. (But ultimately not knowing the internals, I can only guess. If you have any other suspicions for the cause of the performance issue, let me know!)

The documentation is a bit thin on streaming, but I understand I need to call xmlStreaming.newGenericReader(Reader) to get an XmlReader which is a xml pull parser interface. However, I need to supply a (Java) Reader, so currently it seems there is no interface for stream parsing on multiplatform.

I understand that byte streaming support and consequently text streaming built on top of that is a bit higgledy-piggledy right now in the Kotlin ecosystem because a replacement for InputSteam etc. has never been available in the Kotlin standard library. So, every library that does something with IO implemented their own thing, if anything at all - sometimes based on okio, sometimes an own implementation.

However, now it seems like things are about to get better: Both ktor and apparently kotlinx-serialization are being migrated to use kotlinx-io and thus the common interfaces like Sink, Source and Buffer, which are replacements for InputStream et al.

So, in case you'd agree that most likely my performance issue with large data stirs from the lack of XML streaming, I guess my request would be to move with kotlinx-serialization and ktor to support a common interface for streaming bytes and text.

The text was updated successfully, but these errors were encountered:

westnordost · 2024-06-14T15:23:11Z

I did two experiments:

1. decoding from stream instead

In my code, I replaced
xml.decodeFromString<ApiOsm>(string)
... with ...
xml.decodeFromReader<ApiOsm>(xmlStreaming.newReader(inputStream.bufferedReader())

However, it takes about the same time, actually is marginally slower than from string. Also, my assumption or hope that the parser can start parsing the incoming bytes while the http client is receiving them in parallel is apparently wrong. No idea why, but that would be on ktor developers, anyway. (Edit: Created an issue there)

2. writing a parser manually, using xmlutil's XmlReader

A manually written parser brings the desired performance gain, cutting the parsing time to about a third. It is still a bit slower than before, but not by much.

Streaming from a stream of bytes is (again) about 25% slower than streaming from a string. When taking into account the time it takes to stream the bytes into a string, it amounts to taking exactly the same time. (And in experiment one I already observed that ktor apparently doesn't make available the input stream until it has been completely received...)

In conclusion, streaming doesn't solve anything (for my use case) and for some reason either kotlinx-serialization or xmlutil's implementation seems to be quite slow.

pdvrieze · 2024-06-14T19:56:02Z

The generic parser on Jvm supports reading directly from InputStreams (with encoding detection) using the KtXmlReader factory function. If you want to use a "native" (stax) reader, you can just create this and get an xml reader calling StAXReader(myReader).

I've been having an initial stab at supporting kotlinx.io which is finally in an usable state. However it is held up a bit because kotlinx.io doesn't support encodings (and I haven't had time to have a decent stab at dealing with that added complexity).

As to performance, I would very much welcome a good performance test suite. As I haven't looked at it explicitly I'm not surprised that it is not optimal. Given all that it does it would never match a hand-rolled parser, but I expect there are quite some optimization options (hence the need to create the test suite to test it - I may repurpose the xmlschema test suite for that - I already include it in the work in progress xmlschema branch - it is big).

westnordost · 2024-06-15T11:16:18Z

I am not able to completely follow. Is this about newReader(Reader) vs newGenericReader(Reader)? I haven't understood the difference between the two.

Given all that it does it would never match a hand-rolled parser

Right. The experiment above was made with your xmlutil XmlParser, though, so it is not completely hand-rolled.

kotlinx.io doesn't support encodings

You mean, any to string decoding, or just anything else than UTF-8?

pdvrieze · 2024-06-18T08:42:39Z

The difference between newReader and newGenericReader is that newGeneric reader will always return a KtXmlReader instance, and thus behave the same on each platform (modulo unsupported encodings). NewReader on JVM/Android/Browser will have different implementations.

What I meant with kotlinx.io decoding is that it doesn't support the byte to character conversion (charset decoding/encoding).

Btw. I've had a look at performance. I've done some testing on parsing xml schema (I already had that going, just needed to create the performance test). I've cut of some big chunks of time already, but there is probably a lot more that can be optimized. The main issue seems to be in building the descriptors (and if you want speed, don't put in order restrictions)

westnordost · 2024-06-18T09:56:17Z

What I meant with kotlinx.io decoding is that it doesn't support the byte to character conversion (charset decoding/encoding).

There's public fun Source.readString(byteCount: Long): String and Sink.writeString(string: String, startIndex: Int = 0, endIndex: Int = string.length) (since 0.3.5) for UTF-8 strings

https://github.com/Kotlin/kotlinx-io/blob/7c4d095d2fe208abe1eb98a5e07c9b5a95cf3dc8/core/common/src/Utf8.kt#L232

The JVM-implementation includes a readString overload where a charset can be defined. (But for my and for most other people's use cases, UTF-8 is probably all they need anyway.)

Though, what most people probably want is a way to encapsulate that in the way the Reader does it on Java and it would make sense to add such a class for kotlinx-io directly rather than leave this up for implementation by every single parser/serializer lib that does something with text.
I am not acquainted at all with kotlinx-io yet, so I am not sure if they maybe already have or plan something like that, otherwise it would make sense to create an issue ticket over there.

Btw. I've had a look at performance. I've done some testing on parsing xml schema (I already had that going, just needed to create the performance test). I've cut of some big chunks of time already

🥳 , that's great! On kotlinx-serialization in particular or even in the core?

pdvrieze · 2024-06-18T19:23:43Z

I have already some UTF8 support in the library as it is needed for native serialization (it actually has a FileInputStream implementation). The Source.readString and Sink.writeString are not that helpful for me as the infrastructure reads by character. Again, I have UTF* code already in the library, but need to move it and create a way for it to work with charsets.

As to performance, this is mainly in the way descriptors (that drive the serialization) are created (there was way to much "convenient" code in there). I haven't actually done anything on the xml parsing itself, only serialization. I've already cut the time used to parse 10000 xmlschema documents in half (in 10 iterations, following a warmup run - warmup is significant here).

pdvrieze · 2024-07-03T16:28:42Z

I have done quite some optimization work on KtXmlReader (shaving at least 30% off the parsing time on the xml schema test suite - from approx 270ms to 200ms). The deserialization has also been optimized, and while it was in the 12000 ms range, it is now at approx 270ms for a list of xml events, or 478ms for both). And it should be worth noting that at least half of the parsing time is taken up by deflating from the resource jar that is used for the testing. I've also added a flag unchecked to the deserialization options that removes some correctness checks.

westnordost · 2024-07-03T16:31:40Z

Awesome!

westnordost · 2024-07-03T16:51:23Z

I do expect that and additional most significant speed boost on parsing large XML data in context of multiplatform (no Java) will arrive when xmlutil allows to parse from kotlinx-io's Source (and HTTP libraries like Ktor will provide a Source for the response body) because then parsing can take place in parallel to reading the data from disk.

So, xmlutil would be done parsing just right after the last bytes were read.

pdvrieze · 2024-07-03T19:14:41Z

If you want to take a look, I've pushed the "optimization" branch with this work.

Btw. there is actually already a native implementation of InputStream (and FileInputStream) that works with native files.

EMaksymenko · 2024-07-07T09:45:30Z

Hi. When do you plan to release this optimization?

pdvrieze · 2024-07-08T10:00:59Z

I'll probably move it to dev (and snapshot release) in the next few days. I'll try to see if I can add kotlinx.io support as well. Then some soaking/testing is probably good as there are quite some detailed changes that could have broken things.

pdvrieze · 2024-07-12T14:10:44Z

It should now be available as a snapshot release. Please have a go at it. I'll probably make it a beta soon (there are quite significant under the hood changes that may have inadvertently broken things). You may want to use the fast configuration preset

EMaksymenko · 2024-07-19T10:11:37Z

Hi, I had no chance to test night build yet.
Anybody tested this fix?
When will it be in beta release?

pdvrieze · 2024-07-21T16:12:25Z

@ComBatVision I'm building the beta now (there were still some bugs hitting the test suite, there are possibly further bugs/regressions present).

EMaksymenko · 2024-07-22T09:37:20Z

@pdvrieze Thank you. I will test it when beta will be available on maven central.

pdvrieze · 2024-07-22T18:01:14Z

It is there now.

EMaksymenko · 2024-07-22T18:30:49Z

This update requires Kotlin 2.0.0 We will test it little bit later. Our project was not migrated to 2.0 yet.

westnordost · 2024-07-22T21:39:01Z

I did a small test. I can confirm a speed boost of about 30% with the new beta.

pdvrieze · 2024-07-24T09:07:57Z

This update requires Kotlin 2.0.0 We will test it little bit later. Our project was not migrated to 2.0 yet.

I may be able to build a version that uses the older kotlin and serialization. At least when not beta.

pdvrieze · 2024-07-24T09:10:42Z

I did a small test. I can confirm a speed boost of about 30% with the new beta.

I terms of speed it is important to retain the format across the runs as that caches the building of the file structure. Warmup costs are still quite large compared to further runs. I optimised mainly on repeated runs, rather than the first one.

EMaksymenko · 2024-07-27T16:45:41Z

This update requires Kotlin 2.0.0 We will test it little bit later. Our project was not migrated to 2.0 yet.

I may be able to build a version that uses the older kotlin and serialization. At least when not beta.

It will be good to support Kotlin 1.9.24 still, becasue migration to 2.0 on production is still to risky and to complex.
We have Compose in the project and migration is not trivial.

westnordost · 2024-08-05T15:20:04Z

Already 0.90 is built with Kotlin 2.0. Can you link to a source why migration to 2.0 would be risky and complex?

EMaksymenko · 2024-08-09T13:13:54Z

@westnordost it is just risky for our project before major LTS release. We plan to migrate on Kotln 2.0 after it.

@pdvrieze When do you plan to make a stable (not beta) release? Any issues still available?

pdvrieze · 2024-08-09T13:16:50Z

I've been doing some work on my own system and found some bugs. In any case I will be doing a release candidate (just to let people know that there may be unforeseen regressions)

pdvrieze · 2024-08-09T13:17:37Z

Btw. As to a 1.9 version, I created a fork, but it failed some tests and I will need to fix them

EMaksymenko · 2024-08-16T13:57:45Z

Btw. As to a 1.9 version, I created a fork, but it failed some tests and I will need to fix them

We have started migration to Kotlin 2.0.0 so you may skip this idea about supporting Kotlin 1.9. Thanks.

hfhbd · 2024-11-11T20:41:20Z

Do you still have any plans to natively integrate kotlinx.io? Maybe as a separate artifact?

pdvrieze · 2024-11-13T09:15:08Z

@hfhbd In principle yes. It should work as extension as the existing interface already works with readers/writers/appendables. The thing is that the mapping needs to be done, including handling charset conversions (the hanger).

pdvrieze · 2024-11-15T17:02:50Z

I've just created this in the dev channel (it is currently building). You should be able to access it as snapshot release. This adds extension functions: encodeToSink and decodeFromSource. They should be efficient, except for JavaScript, where decoding just converts the source to string. On non-JVM targets only UTF-8/UTF-16LE/UTF-16BE/UTF32 are supported.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance issue for large data / kotlinx-io #217

Performance issue for large data / kotlinx-io #217

westnordost commented Jun 14, 2024

westnordost commented Jun 14, 2024 •

edited

Loading

pdvrieze commented Jun 14, 2024

westnordost commented Jun 15, 2024

pdvrieze commented Jun 18, 2024

westnordost commented Jun 18, 2024

pdvrieze commented Jun 18, 2024

pdvrieze commented Jul 3, 2024

westnordost commented Jul 3, 2024

westnordost commented Jul 3, 2024 •

edited

Loading

pdvrieze commented Jul 3, 2024 •

edited

Loading

EMaksymenko commented Jul 7, 2024

pdvrieze commented Jul 8, 2024

pdvrieze commented Jul 12, 2024

EMaksymenko commented Jul 19, 2024

pdvrieze commented Jul 21, 2024

EMaksymenko commented Jul 22, 2024

pdvrieze commented Jul 22, 2024

EMaksymenko commented Jul 22, 2024

westnordost commented Jul 22, 2024

pdvrieze commented Jul 24, 2024

pdvrieze commented Jul 24, 2024

EMaksymenko commented Jul 27, 2024

westnordost commented Aug 5, 2024

EMaksymenko commented Aug 9, 2024

pdvrieze commented Aug 9, 2024

pdvrieze commented Aug 9, 2024

EMaksymenko commented Aug 16, 2024

hfhbd commented Nov 11, 2024

pdvrieze commented Nov 13, 2024

pdvrieze commented Nov 15, 2024

Performance issue for large data / kotlinx-io #217

Performance issue for large data / kotlinx-io #217

Comments

westnordost commented Jun 14, 2024

westnordost commented Jun 14, 2024 • edited Loading

1. decoding from stream instead

2. writing a parser manually, using xmlutil's XmlReader

pdvrieze commented Jun 14, 2024

westnordost commented Jun 15, 2024

pdvrieze commented Jun 18, 2024

westnordost commented Jun 18, 2024

pdvrieze commented Jun 18, 2024

pdvrieze commented Jul 3, 2024

westnordost commented Jul 3, 2024

westnordost commented Jul 3, 2024 • edited Loading

pdvrieze commented Jul 3, 2024 • edited Loading

EMaksymenko commented Jul 7, 2024

pdvrieze commented Jul 8, 2024

pdvrieze commented Jul 12, 2024

EMaksymenko commented Jul 19, 2024

pdvrieze commented Jul 21, 2024

EMaksymenko commented Jul 22, 2024

pdvrieze commented Jul 22, 2024

EMaksymenko commented Jul 22, 2024

westnordost commented Jul 22, 2024

pdvrieze commented Jul 24, 2024

pdvrieze commented Jul 24, 2024

EMaksymenko commented Jul 27, 2024

westnordost commented Aug 5, 2024

EMaksymenko commented Aug 9, 2024

pdvrieze commented Aug 9, 2024

pdvrieze commented Aug 9, 2024

EMaksymenko commented Aug 16, 2024

hfhbd commented Nov 11, 2024

pdvrieze commented Nov 13, 2024

pdvrieze commented Nov 15, 2024

westnordost commented Jun 14, 2024 •

edited

Loading

westnordost commented Jul 3, 2024 •

edited

Loading

pdvrieze commented Jul 3, 2024 •

edited

Loading