Performance issue for large data / kotlinx-io #217
Comments
I did two experiments:
1. Decoding from stream instead
In my code, I replaced However, it takes about the same time, and is actually marginally slower than decoding from a string. Also, my assumption or hope that the parser can start parsing the incoming bytes while the HTTP client is receiving them in parallel is apparently wrong. No idea why, but that would be on the ktor developers, anyway. (Edit: Created an issue there)
2. Writing a parser manually, using xmlutil's XmlReader
A manually written parser brings the desired performance gain, cutting the parsing time to about a third. It is still a bit slower than before, but not by much. Streaming from a stream of bytes is (again) about 25% slower than streaming from a string. When taking into account the time it takes to stream the bytes into a string, it amounts to taking exactly the same time. (And in experiment one I already observed that ktor apparently doesn't make the input stream available until it has been completely received...)
In conclusion, streaming doesn't solve anything (for my use case) and for some reason either
The generic parser on JVM supports reading directly from InputStreams (with encoding detection) using the I've been having an initial stab at supporting kotlinx.io, which is finally in a usable state. However, it is held up a bit because kotlinx.io doesn't support encodings (and I haven't had time to have a decent stab at dealing with that added complexity). As to performance, I would very much welcome a good performance test suite. As I haven't looked at it explicitly, I'm not surprised that it is not optimal. Given all that it does, it would never match a hand-rolled parser, but I expect there are quite a few optimization options (hence the need to create the test suite to test it; I may repurpose the xmlschema test suite for that, as I already include it in the work-in-progress xmlschema branch, and it is big).
I am not able to completely follow. Is this about
Right. The experiment above was made with your xmlutil XmlParser, though, so it is not completely hand-rolled.
You mean any decoding to string, or just anything other than UTF-8?
The difference between newReader and newGenericReader is that newGenericReader will always return a KtXmlReader instance, and thus behave the same on each platform (modulo unsupported encodings). newReader on JVM/Android/Browser will have different implementations. What I meant with kotlinx.io decoding is that it doesn't support the byte-to-character conversion (charset decoding/encoding). Btw. I've had a look at performance. I've done some testing on parsing XML schema (I already had that going, just needed to create the performance test). I've cut off some big chunks of time already, but there is probably a lot more that can be optimized. The main issue seems to be in building the descriptors (and if you want speed, don't put in order restrictions).
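To make the charset-decoding gap mentioned above concrete, here is a minimal sketch of why byte-to-character conversion cannot simply be done chunk by chunk. It uses only the Kotlin standard library; `ChunkedUtf8Decoder` is a hypothetical helper written for this illustration and is not part of kotlinx.io or xmlutil. It handles UTF-8 only, which is exactly the limitation discussed in this thread.

```kotlin
// Hypothetical helper for illustration only (not a kotlinx.io or xmlutil API).
// Decodes UTF-8 chunk by chunk, carrying an incomplete trailing byte sequence
// over to the next chunk instead of decoding it prematurely.
class ChunkedUtf8Decoder {
    private var pending = ByteArray(0)

    fun decode(chunk: ByteArray): String {
        val data = pending + chunk
        val end = completePrefixLength(data)
        pending = data.copyOfRange(end, data.size)
        return data.decodeToString(0, end)
    }

    // Length of the longest prefix that does not end inside a multi-byte sequence.
    private fun completePrefixLength(data: ByteArray): Int {
        var i = data.size - 1
        var back = 0
        // Walk back over at most three continuation bytes (bit pattern 10xxxxxx).
        while (i >= 0 && back < 3 && (data[i].toInt() and 0xC0) == 0x80) {
            i--
            back++
        }
        if (i < 0) return data.size
        val lead = data[i].toInt() and 0xFF
        val needed = when {
            (lead and 0x80) == 0x00 -> 1
            (lead and 0xE0) == 0xC0 -> 2
            (lead and 0xF0) == 0xE0 -> 3
            (lead and 0xF8) == 0xF0 -> 4
            else -> 1 // malformed input: let decodeToString substitute U+FFFD
        }
        return if (data.size - i < needed) i else data.size
    }
}

fun main() {
    val bytes = "né".encodeToByteArray()           // 'é' occupies two bytes in UTF-8
    val first = bytes.copyOfRange(0, 2)            // ends in the middle of 'é'
    val second = bytes.copyOfRange(2, bytes.size)

    // Decoding each chunk on its own corrupts the split character.
    val naive = first.decodeToString() + second.decodeToString()

    val decoder = ChunkedUtf8Decoder()
    val careful = decoder.decode(first) + decoder.decode(second)

    println(naive == "né")   // false
    println(careful == "né") // true
}
```

A reader that pulls characters one at a time needs this kind of state on top of a byte source, and supporting other charsets requires a separate decoder for each of them.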
There's The JVM-implementation includes a Though, what most people probably want is a way to encapsulate that in the way the
🥳, that's great! On kotlinx-serialization in particular or even in the core?
I already have some UTF-8 support in the library as it is needed for native serialization (it actually has a FileInputStream implementation). Source.readString and Sink.writeString are not that helpful for me as the infrastructure reads by character. Again, I have UTF* code already in the library, but need to move it and create a way for it to work with charsets. As to performance, this is mainly in the way descriptors (that drive the serialization) are created (there was way too much "convenient" code in there). I haven't actually done anything on the XML parsing itself, only serialization. I've already cut the time used to parse 10000 xmlschema documents in half (in 10 iterations, following a warmup run; warmup is significant here).
I have done quite some optimization work on KtXmlReader (shaving at least 30% off the parsing time on the XML schema test suite, from approx 270ms to 200ms). The deserialization has also been optimized, and while it was in the 12000 ms range, it is now at approx 270ms for a list of XML events, or 478ms for both. It should also be noted that at least half of the parsing time is taken up by deflating from the resource jar that is used for the testing. I've also added a flag
Awesome!
I do expect that an additional, most significant speed boost on parsing large XML data in the context of multiplatform (no Java) will arrive when xmlutil allows parsing from kotlinx-io's So, xmlutil would be done parsing right after the last bytes were read.
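The expectation above, that parsing could finish right after the last bytes arrive, can be sketched with a lazily consumed sequence of chunks. This is only an illustration under stated assumptions: `receiveChunks` is a hypothetical stand-in for a network source, `countOpenAngles` is a toy consumer rather than an XML parser, and the work is interleaved on a single thread rather than run in parallel. Neither function is part of ktor, kotlinx-io, or xmlutil.

```kotlin
// Hypothetical stand-in for a network source: yields chunks one at a time and
// records when each chunk "arrives".
fun receiveChunks(payload: String, chunkSize: Int, log: MutableList<String>): Sequence<String> =
    sequence {
        for (chunk in payload.chunked(chunkSize)) {
            log += "received ${chunk.length} chars"
            yield(chunk)
        }
    }

// Toy consumer (not an XML parser): counts '<' characters chunk by chunk, so it
// never needs the whole payload in memory at once.
fun countOpenAngles(chunks: Sequence<String>, log: MutableList<String>): Int {
    var count = 0
    for (chunk in chunks) {
        count += chunk.count { it == '<' }
        log += "processed chunk"
    }
    return count
}

fun main() {
    val log = mutableListOf<String>()
    val xml = "<a><b/><c/></a>"
    val tags = countOpenAngles(receiveChunks(xml, chunkSize = 4, log = log), log)
    println(tags)
    // The log alternates between "received" and "processed" entries: each chunk
    // is handled before the next one is requested.
    log.forEach(::println)
}
```

If the source only hands over its data after the full response has been buffered, as observed with ktor in the first experiment, this interleaving cannot happen regardless of how the parser is written.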
If you want to take a look, I've pushed the "optimization" branch with this work. Btw. there is actually already a native implementation of
Hi. When do you plan to release this optimization?
I'll probably move it to dev (and snapshot release) in the next few days. I'll try to see if I can add kotlinx.io support as well. Then some soaking/testing is probably good as there are quite some detailed changes that could have broken things.
It should now be available as a snapshot release. Please have a go at it. I'll probably make it a beta soon (there are quite significant under the hood changes that may have inadvertently broken things). You may want to use the
Hi, I have had no chance to test the nightly build yet.
@ComBatVision I'm building the beta now (there were still some bugs hitting the test suite, there are possibly further bugs/regressions present).
@pdvrieze Thank you. I will test it when the beta is available on Maven Central.
It is there now.
This update requires Kotlin 2.0.0. We will test it a little bit later. Our project has not been migrated to 2.0 yet.
I did a small test. I can confirm a speed boost of about 30% with the new beta.
I may be able to build a version that uses the older Kotlin and serialization. At least when not beta.
In terms of speed it is important to retain the format across the runs as that caches the building of the file structure. Warmup costs are still quite large compared to further runs. I optimised mainly on repeated runs, rather than the first one.
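The advice above, to reuse the format and to measure after a warmup run, can be sketched as follows. This is a minimal illustration using only the Kotlin standard library. `CachedFormat` is a hypothetical stand-in for an object that caches an expensive setup step, and `benchmark` is a helper written for this example. Neither reflects xmlutil's actual API or internals.

```kotlin
import kotlin.time.Duration
import kotlin.time.measureTime

// Hypothetical stand-in for a format object whose expensive setup (for example,
// building descriptors) is cached after first use. Names are illustrative only.
class CachedFormat {
    var builds = 0
        private set

    private val structure: Map<String, Int> by lazy {
        builds++
        (0 until 50_000).associate { "field$it" to it }
    }

    fun decode(key: String): Int = structure.getValue(key)
}

// Runs `warmup` untimed iterations, then returns the timings of `runs` iterations.
fun benchmark(warmup: Int, runs: Int, block: () -> Unit): List<Duration> {
    repeat(warmup) { block() }
    return List(runs) { measureTime(block) }
}

fun main() {
    val format = CachedFormat() // created once and reused, so the setup runs once
    val timings = benchmark(warmup = 1, runs = 10) { format.decode("field42") }
    println("setup runs: ${format.builds}, fastest run: ${timings.minOrNull()}")
}
```

Creating a new `CachedFormat` inside the measured block would instead pay the setup cost on every iteration, which is the situation the comment above warns against.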
It would be good to still support Kotlin 1.9.24, because migration to 2.0 in production is still too risky and too complex.
Already 0.90 is built with Kotlin 2.0. Can you link to a source why migration to 2.0 would be risky and complex?
@westnordost it is just risky for our project before a major LTS release. We plan to migrate to Kotlin 2.0 after it. @pdvrieze When do you plan to make a stable (not beta) release? Are there any issues still open?
I've been doing some work on my own system and found some bugs. In any case I will be doing a release candidate (just to let people know that there may be unforeseen regressions).
Btw. As to a 1.9 version, I created a fork, but it failed some tests and I will need to fix them.
We have started migration to Kotlin 2.0.0 so you may skip this idea about supporting Kotlin 1.9. Thanks.
Do you still have any plans to natively integrate kotlinx.io? Maybe as a separate artifact?
@hfhbd In principle yes. It should work as an extension as the existing interface already works with readers/writers/appendables. The thing is that the mapping needs to be done, including handling charset conversions (the hanger).
I've just created this in the dev channel (it is currently building). You should be able to access it as a snapshot release. This adds extension functions:
TLDR: Library user experiences performance issues and requests support for kotlinx-io streaming primitives.
As one part of making an app multiplatform, I replaced Java HttpUrlConnection + XML pull parsing with ktor-client + xmlutil + kotlinx-serialization. Unfortunately, the performance at least on parsing large data (tested with ~10MB) got 3-6 times worse. (see StreetComplete#5686 for a short analysis and comparison).
I suspect that using xmlutil's streaming parser instead of deserializing the whole data structure with kotlinx-serialization might improve this somewhat because theoretically, xmlutil should be able to start reading the bytes as they come through the wire, on a different thread from the one that receives the bytes. (But ultimately not knowing the internals, I can only guess. If you have any other suspicions for the cause of the performance issue, let me know!)
The documentation is a bit thin on streaming, but I understand I need to call xmlStreaming.newGenericReader(Reader) to get an XmlReader, which is an XML pull parser interface. However, I need to supply a (Java) Reader, so currently it seems there is no interface for stream parsing on multiplatform.
I understand that byte streaming support, and consequently text streaming built on top of that, is a bit higgledy-piggledy right now in the Kotlin ecosystem because a replacement for InputStream etc. has never been available in the Kotlin standard library. So, every library that does something with IO implemented their own thing, if anything at all: sometimes based on okio, sometimes an own implementation.
However, now it seems like things are about to get better: both ktor and apparently kotlinx-serialization are being migrated to use kotlinx-io and thus the common interfaces like Sink, Source and Buffer, which are replacements for InputStream et al.
So, in case you'd agree that most likely my performance issue with large data stems from the lack of XML streaming, I guess my request would be to move with kotlinx-serialization and ktor to support a common interface for streaming bytes and text.