Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add forEach extensions methods to iterate over codepoints #38

Merged
merged 3 commits into from
Jun 24, 2024

Conversation

OptimumCode
Copy link
Contributor

The CharSequence has forEach and forEachIndexed extension methods in Kotlin that allow to iterate over every char in the CharSequence.
I think it would be handy to have such methods to iterate over codepoints in CharSequence without overhead from using Iterator or Sequence for that task.

Please, let me know what you think about that.
If you have any suggestions on what should be changed in the current implementation I will be glad to hear them.

Thank you!

@cketti
Copy link
Owner

cketti commented Jun 14, 2024

By using CharSequence.codePointIterator() or CharSequence.codePointSequence() you can already use all of the extension functions the standard library provides for Iterator and Sequence.

For example:

charSequence.codePointSequence().forEach { codePoint -> /* do something */ }
charSequence.codePointSequence().forEachIndexed { index, codePoint -> /* do something */ }

charSequence.codePointIterator().forEach { codePoint -> /* do something */ }
charSequence.codePointIterator().withIndex().forEach { (index, codePoint) -> /* do something */ }

@OptimumCode
Copy link
Contributor Author

Yes, you are absolutely right. However, in order to use them you need to create an instance of iterator that adds additional overhead. I believe, for the same reason CharSequence has forEach and forEachIndexed methods defined on itself rather than delegating those actions to iterator or sequence objects.

@OptimumCode
Copy link
Contributor Author

OptimumCode commented Jun 15, 2024

These are the results of a local benchmarking.

The difference is about ~30% for small strings and ~70% for long strings (10561 codepoints) for JVM platform.

jvm summary:
Benchmark                                  (codepointsCount)   Mode  Cnt          Score         Error  Units
ForEachCodepointBenchmark.forEachInline                   10  thrpt   10  111769431.035 ± 4841285.611  ops/s
ForEachCodepointBenchmark.forEachInline                  400  thrpt   10    3959141.153 ±  468546.211  ops/s
ForEachCodepointBenchmark.forEachInline                10561  thrpt   10     158886.115 ±   11195.801  ops/s
ForEachCodepointBenchmark.forEachIterator                 10  thrpt   10   72715021.234 ± 5128515.082  ops/s
ForEachCodepointBenchmark.forEachIterator                400  thrpt   10    2568445.402 ±   29863.377  ops/s
ForEachCodepointBenchmark.forEachIterator              10561  thrpt   10      88499.198 ±    3840.411  ops/s
ForEachCodepointBenchmark.forEachSequence                 10  thrpt   10   83544541.500 ± 4711449.080  ops/s
ForEachCodepointBenchmark.forEachSequence                400  thrpt   10    2332128.666 ±   68867.751  ops/s
ForEachCodepointBenchmark.forEachSequence              10561  thrpt   10      91991.857 ±    1967.597  ops/s

For native targets (I don't have a Mac to measure IOS targets) the difference is much bigger (x5-10)

linuxX64 summary:
Benchmark                                  (codepointsCount)   Mode  Cnt         Score        Error    Units
ForEachCodepointBenchmark.forEachInline                   10  thrpt   10  16125271.590 ± 406177.167  ops/sec
ForEachCodepointBenchmark.forEachInline                  400  thrpt   10    715307.089 ±   2083.055  ops/sec
ForEachCodepointBenchmark.forEachInline                10561  thrpt   10     28210.850 ±     33.570  ops/sec
ForEachCodepointBenchmark.forEachIterator                 10  thrpt   10   2692273.012 ±  46398.630  ops/sec
ForEachCodepointBenchmark.forEachIterator                400  thrpt   10     77247.037 ±   1086.558  ops/sec
ForEachCodepointBenchmark.forEachIterator              10561  thrpt   10      2996.567 ±     12.099  ops/sec
ForEachCodepointBenchmark.forEachSequence                 10  thrpt   10   2706200.711 ±  22464.958  ops/sec
ForEachCodepointBenchmark.forEachSequence                400  thrpt   10     75313.375 ±   1328.986  ops/sec
ForEachCodepointBenchmark.forEachSequence              10561  thrpt   10      2861.650 ±     25.107  ops/sec

Also, the forEachIndexed methods on the iterator and sequence provide the index corresponding to the codepoint's number. You cannot do much with that index.
In forEachCodepointIndexed implemented for CharSequence the index is a start index of the codepoint in that CharSequence. With that information, you can fetch the previous or next codepoint if you need.

@cketti
Copy link
Owner

cketti commented Jun 23, 2024

Thank you for the pull request 👍

Sorry for not properly reading your description before posting the first comment.

I was sitting on this pull request for a while because I was contemplating the question in #39. For now I think the functions to be added should not have support for providing a start or end index.

But I think the forEachCodePoint* functions could also be useful for people only using kotlin-codepoints:

  • inline fun CharSequence.forEachCodePoint(action: (codePoint: Int) -> Unit)
  • inline fun CharSequence.forEachCodePointIndexed(action: (index: Int, codePoint: Int) -> Unit)

The kotlin-codepoints-deluxe variants could then make use of the implementation from kotlin-codepoints.

Without having to support a start and end index, the implementation could also be simplified to:

inline fun CharSequence.forEachCodePointIndexed(action: (index: Int, codePoint: Int) -> Unit) {
    var index = 0
    while (index < length) {
        val codePoint = codePointAt(index)
        action(index, codePoint)
        index += CodePoints.charCount(codePoint)
    }
}

Do you have time to make the proposed changes? Otherwise I'll do it myself.

@OptimumCode
Copy link
Contributor Author

Hi, @cketti. Thank you for taking a look at the PR.

Initially, I was thinking about adding those functions to both kotlin-codepoints and kotlin-codepoints-deluxe but signature like forEachCodePointIndexed(action: (index: Int, codePoint: Int) -> Unit) makes it easier to mistake the arguments because both have type Int. This is why I added functions only to deluxe version.

But if you think this won't be a problem I am happy to make proposed changes. Will do it tomorrow, I think

@cketti
Copy link
Owner

cketti commented Jun 23, 2024

Having two Int parameters is not great. But it's not a problem unique to this library. I think it's fine.

@OptimumCode
Copy link
Contributor Author

Hi, @cketti. I made the proposed changes. Could you please take a look?

However, I noticed that using this approach for interesting over codepoints

inline fun CharSequence.forEachCodePointIndexed(action: (index: Int, codePoint: Int) -> Unit) {
    var index = 0
    while (index < length) {
        val codePoint = codePointAt(index)
        action(index, codePoint)
        index += CodePoints.charCount(codePoint)
    }
}

introduces significant overhead for native targets (~30-70% depending on string length). Please, find the benchmark results below (for JVM target both methods performs more or less the same way).

linuxX64 summary:
Benchmark                                          (codepointsCount)   Mode  Cnt         Score        Error    Units
ForEachCodepointBenchmark.forEachInlineWithMethod                 10  thrpt   10  12517823.722 ± 211698.631  ops/sec
ForEachCodepointBenchmark.forEachInlineWithMethod                400  thrpt   10    424617.142 ±   8144.597  ops/sec
ForEachCodepointBenchmark.forEachInlineWithMethod                 -1  thrpt   10     16103.181 ±    738.442  ops/sec
ForEachCodepointBenchmark.forEachInline                           10  thrpt   10  16176578.718 ± 210191.849  ops/sec
ForEachCodepointBenchmark.forEachInline                          400  thrpt   10    675047.335 ±   6743.268  ops/sec
ForEachCodepointBenchmark.forEachInline                           -1  thrpt   10     27419.144 ±    192.799  ops/sec

Because of that, I think we should keep the current version of the method without using codePointAt inside it.

@cketti cketti force-pushed the for-each-codepoint branch from 4b09bc3 to f6b0a7c Compare June 24, 2024 18:25
@cketti cketti merged commit bd63005 into cketti:main Jun 24, 2024
1 check passed
@cketti
Copy link
Owner

cketti commented Jun 24, 2024

Thanks 👍

@OptimumCode OptimumCode deleted the for-each-codepoint branch June 25, 2024 13:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants