memory performance: two-level cmap lookup #697

lsf37 · 2020-01-01T08:52:16Z

This addresses the memory issue #199. With more recent Unicode versions, the
character map took 0x110000 * 4 bytes, i.e. ~4MB. Most of this is never
accessed, but it still increases overall memory consumption, especially with
multiple scanners.

This PR:

* changes the current one-level array to a two-level table structure
* adds quickcheck tests for the table construction
* updates runtime engine to use that structure
* updates skeleton files to remove reference to old ZZ_CMAP
* removes trailing white space in generated template code
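
For readers unfamiliar with the idea, a two-level table can be sketched roughly as below. This is only an illustration, not the code in this PR: the class name, field names, and the block size of 256 code points are all assumptions. Identical blocks are stored once, which is where the saving comes from when most of the code space maps to the same class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a two-level character map; not JFlex's generated code. */
final class TwoLevelCMapSketch {
  static final int BLOCK_BITS = 8;                // assumed block size: 256 code points
  static final int BLOCK_SIZE = 1 << BLOCK_BITS;
  static final int BLOCK_MASK = BLOCK_SIZE - 1;

  final int[] top;     // (code point >> BLOCK_BITS) -> start offset of its block in `blocks`
  final int[] blocks;  // all distinct blocks of character class codes, concatenated

  /** Builds both tables from a flat map whose length is a multiple of BLOCK_SIZE. */
  TwoLevelCMapSketch(int[] flat) {
    if (flat.length % BLOCK_SIZE != 0) {
      throw new IllegalArgumentException("length must be a multiple of " + BLOCK_SIZE);
    }
    top = new int[flat.length / BLOCK_SIZE];
    Map<List<Integer>, Integer> seen = new HashMap<>();  // block contents -> offset
    List<Integer> data = new ArrayList<>();
    for (int b = 0; b < top.length; b++) {
      List<Integer> block = new ArrayList<>(BLOCK_SIZE);
      for (int i = 0; i < BLOCK_SIZE; i++) {
        block.add(flat[b * BLOCK_SIZE + i]);
      }
      Integer offset = seen.get(block);
      if (offset == null) {            // first time this block content appears: store it
        offset = data.size();
        seen.put(block, offset);
        data.addAll(block);
      }
      top[b] = offset;                 // identical blocks share one stored copy
    }
    blocks = new int[data.size()];
    for (int i = 0; i < blocks.length; i++) {
      blocks[i] = data.get(i);
    }
  }

  /** Two array reads instead of one: first the block offset, then the class code. */
  int classOf(int codePoint) {
    return blocks[top[codePoint >> BLOCK_BITS] + (codePoint & BLOCK_MASK)];
  }
}
```

With a flat map that only distinguishes ASCII letters and digits from everything else, this sketch stores two blocks in total: the first block, and one shared block for the rest of the code space.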

Typical memory consumption for the char map decreases from ~4MB to < 100KB;
for simple scanners (little Unicode use above 0xFF) it drops to ~20KB.
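
To make these orders of magnitude concrete, here is a back-of-the-envelope calculation. The 4 bytes per entry come from the description above; the block size of 256 and the idea of counting distinct blocks are assumptions for illustration only, so the real numbers depend on the actual block size and table encoding.

```java
/** Rough size estimates; BLOCK_SIZE is an assumption, not a value taken from this PR. */
final class CMapSizeSketch {
  static final int CODE_POINTS = 0x110000;  // size of the Unicode code space
  static final int ENTRY_BYTES = 4;         // 4 bytes per entry, as in the description
  static final int BLOCK_SIZE = 256;        // assumed number of code points per block

  /** One entry per code point. */
  static long oneLevelBytes() {
    return (long) CODE_POINTS * ENTRY_BYTES;
  }

  /** One top-level entry per block plus the storage for each distinct block. */
  static long twoLevelBytes(int distinctBlocks) {
    long top = (long) (CODE_POINTS / BLOCK_SIZE) * ENTRY_BYTES;
    long blocks = (long) distinctBlocks * BLOCK_SIZE * ENTRY_BYTES;
    return top + blocks;
  }
}
```

Under these assumptions the one-level table is 4,456,448 bytes, while a scanner with three distinct blocks needs 17,408 + 3,072 = 20,480 bytes, which is in the same range as the ~20KB quoted above.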

Even though this increases the number of operations in the innermost loop,
with a bit of luck performance might actually benefit because of better
cache locality. Benchmarking still to be done.

Calling getClassCode for each code point cost too much generator performance (the test suite ran 10%-20% slower). Going through the intervals incrementally speeds this up again; generator performance now shows no observable difference from the old single-level table setup.
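
The difference between the two approaches can be sketched as follows. All names here are hypothetical, and the binary search is only a stand-in for a per-code-point query; how getClassCode actually works is not shown in this conversation. The sketch assumes the intervals are sorted, disjoint, and cover the whole table.

```java
import java.util.Arrays;

/** Hypothetical sketch; the types and method names are not JFlex's. */
final class IntervalFillSketch {
  /** Code points start..end (inclusive) all belong to charClass. */
  static final class Interval {
    final int start;
    final int end;
    final int charClass;

    Interval(int start, int end, int charClass) {
      this.start = start;
      this.end = end;
      this.charClass = charClass;
    }
  }

  /** Stand-in for a per-code-point query: binary search over the sorted intervals. */
  static int classCodeOf(Interval[] sorted, int codePoint) {
    int lo = 0;
    int hi = sorted.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (codePoint < sorted[mid].start) {
        hi = mid - 1;
      } else if (codePoint > sorted[mid].end) {
        lo = mid + 1;
      } else {
        return sorted[mid].charClass;
      }
    }
    throw new IllegalArgumentException("code point not covered: " + codePoint);
  }

  /** Slower variant: one query per code point. */
  static int[] fillPerCodePoint(Interval[] sorted, int size) {
    int[] table = new int[size];
    for (int cp = 0; cp < size; cp++) {
      table[cp] = classCodeOf(sorted, cp);
    }
    return table;
  }

  /** Faster variant: walk the intervals once and fill each range in bulk. */
  static int[] fillIncrementally(Interval[] sorted, int size) {
    int[] table = new int[size];
    for (Interval iv : sorted) {
      Arrays.fill(table, iv.start, iv.end + 1, iv.charClass);  // toIndex is exclusive
    }
    return table;
  }
}
```

Both variants produce the same table; the incremental one does work proportional to the number of intervals plus the table size, with no search per code point.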

lsf37 · 2020-01-01T10:23:21Z

This solution is similar to, but (hopefully) simpler than the one implemented for the IntelliJ plugins in #199, with similar memory improvements.

It does not really address the ArrayIndexOutOfBounds problem, but this has already disappeared with @sarowe's work on more recent Unicode versions. For %unicode 2.0, you will still get an exception for code points >= 0xFFFF, but that is as it should be, because that Unicode version doesn't have these characters. The later versions that do have higher code points (including the default, plain %unicode) don't throw the exception. This is similar to %7bit and %8bit scanners throwing the same exception for higher characters.

There is the question whether this is the best behaviour, i.e. one could throw a different exception or report a more user-friendly error, but that is a separate question.

lsf37 · 2020-01-01T10:25:42Z

The trailing whitespace removal is a bit out of place here. It slipped in because my editor is set to strip trailing whitespace, which I forgot about. I think it's nicer to have it removed, though, so I left it in.

lsf37 · 2020-01-01T10:26:56Z

PS: huge kudos to @sarowe for the excellent Unicode test suite coverage. This and the new quickcheck setup helped a lot in getting this feature implemented much faster than usual.

lsf37 · 2020-01-03T05:38:48Z

Will merge this one now, because I have too many things stacking up, but am still interested in feedback if there is time.

lsf37 and others added 3 commits January 1, 2020 19:17

assert invariants in quickcheck shrinking

bab86ff

One of the shrunk instances in a failed test looked like it had overlapping classes, which shouldn't be possible. Couldn't reproduce, but if it does happen, this should catch it.
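
The kind of invariant meant here can be illustrated with a small check that no code point belongs to more than one class. This is a sketch under assumptions: the Interval type and method name are hypothetical, and it does not use the quickcheck library or reproduce the assertion added in this commit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical invariant check; not the assertion from this commit. */
final class DisjointClassesSketch {
  /** Code points start..end (inclusive). */
  static final class Interval {
    final int start;
    final int end;

    Interval(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /**
   * Returns true if no code point is covered by more than one interval,
   * whether the two intervals are in different classes or in the same class.
   */
  static boolean classesAreDisjoint(List<List<Interval>> classes) {
    List<Interval> all = new ArrayList<>();
    for (List<Interval> cls : classes) {
      all.addAll(cls);
    }
    all.sort(Comparator.comparingInt((Interval iv) -> iv.start));
    // After sorting by start, any overlap shows up between two neighbours.
    for (int i = 1; i < all.size(); i++) {
      if (all.get(i).start <= all.get(i - 1).end) {
        return false;
      }
    }
    return true;
  }
}
```

A shrinking step could call such a check on each candidate and fail loudly if it ever returns false, so that an invalid instance is caught where it is produced rather than in a later, confusing test failure.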

lsf37 requested a review from sarowe as a code owner January 1, 2020 08:52

lsf37 added this to the 1.8.0 milestone Jan 1, 2020

lsf37 added the enhancement and performance labels Jan 1, 2020

lsf37 self-assigned this Jan 1, 2020

lsf37 requested a review from regisd January 1, 2020 10:15

This was referenced Jan 1, 2020

Performance benchmarking suite #698

Open

%unicode 2.0 lexers throw IOOBE on input with surrogate chars #199

Closed

lsf37 merged commit fe59529 into master Jan 3, 2020

lsf37 deleted the cmap-block branch January 3, 2020 05:38

lsf37 mentioned this pull request Jan 9, 2020

Benchmark #711

Merged

asfimport mentioned this pull request Mar 22, 2022

upgrade jflex (1.7.0 -> 1.8.2) [LUCENE-10239] apache/lucene#11275

Closed

lsf37 mentioned this pull request Jan 22, 2023

Reduce RAM usage of the char->char class map #196

Closed
