Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

memory performance: two-level cmap lookup #697

Merged
merged 3 commits into from
Jan 3, 2020
Merged

memory performance: two-level cmap lookup #697

merged 3 commits into from
Jan 3, 2020

Conversation

lsf37
Copy link
Member

@lsf37 lsf37 commented Jan 1, 2020

This addresses the memory issue #199. With more recent Unicode versions, the
character map took 0x110000 * 4 bytes, i.e. ~4MB. Most of this is never
accessed, but it still increases overall memory consumption esp for multiple
scanners.

This PR:

  • changes the current one-level array to a two-level table structure
  • adds quickcheck tests for the table construction
  • updates runtime engine to use that structure
  • updates skeleton files to remove reference to old ZZ_CMAP
  • removes trailing white space in generated template code

Typical memory consumption for the char map decreases from ~4MB to < 100KB,
for simple scanners (little unicode use > 0xFF) to ~20KB.

Even though this increases the number of operations in the innermost loop,
with a bit of luck performance might actually benefit because of better
cache locality. Benchmarking still to be done.

lsf37 and others added 3 commits January 1, 2020 19:17
This addresses the memory issue #199. With more recent Unicode versions, the
character map took 0x110000 * 4 bytes, i.e. ~4MB. Most of this is never
accessed, but it still increases overall memory consumption esp for multiple
scanners.

This commit:
 * changes the current one-level array to a two-level table structure
 * adds quickcheck tests for the table construction
 * updates runtime engine to use that structure
 * updates skeleton files to remove reference to old ZZ_CMAP
 * removes trailing white space in generated template code

Typical memory consumption for the char map decreases from ~4MB to < 100KB,
for simple scanners (little unicode use > 0xFF) to ~20KB.

Even though this increases the number of operations in the innermost loop,
with a bit of luck performance might actually benefit because of better
cache locality. Benchmarking still to be done.
Calling getClassCode for each code point cost too much generator performance
(10%-20% slower test suite). Going through the intervals incrementally speeds
this up again and has no observable generator performance difference to the
old single-level table setup.
One of the shrunk instances in a failed test looked like it had overlapping
classes, which shouldn't be possible. Couldn't reproduce, but if it does
happen, this should catch it.
@lsf37 lsf37 requested a review from sarowe as a code owner January 1, 2020 08:52
@lsf37 lsf37 added this to the 1.8.0 milestone Jan 1, 2020
@lsf37 lsf37 added enhancement Feature requests performance labels Jan 1, 2020
@lsf37 lsf37 self-assigned this Jan 1, 2020
@lsf37 lsf37 requested a review from regisd January 1, 2020 10:15
@lsf37
Copy link
Member Author

lsf37 commented Jan 1, 2020

This solution is similar to, but (hopefully) simpler than the one implemented for the IntelliJ plugins in #199, with similar memory improvements.

It does not really address the ArrayIndexOutOfBounds problem, but this has already disappeared with @sarowe's work on more recent Unicode versions. For %unicode 2.0, you will still get an exception for code points >= 0xFFFF, but that is as it should be, because that Unicode version doesn't have these characters. The later versions (including the default just %unicode) that do have higher code points don't throw the exception. This is similar to %7bit and %8bit scanners throwing the same exception for higher characters.

There is the question whether this is the best behaviour, i.e. one could throw a different exception or report a more user-friendly error, but that is a separate question.

@lsf37
Copy link
Member Author

lsf37 commented Jan 1, 2020

The trailing whitespace removal is a bit out of place here -- it slipped in because my editor is set to stripping whitespace, which I forgot about. I think it's nicer to have them removed, though, so I left it.

@lsf37
Copy link
Member Author

lsf37 commented Jan 1, 2020

ps: huge kudos to @sarowe for the excellent unicode test suite coverage. This and the new quickcheck setup helped a lot in getting that feature implemented much faster than usual.

@lsf37
Copy link
Member Author

lsf37 commented Jan 3, 2020

Will merge this one now, because I have too many things stacking up, but am still interested in feedback if there is time.

@lsf37 lsf37 merged commit fe59529 into master Jan 3, 2020
@lsf37 lsf37 deleted the cmap-block branch January 3, 2020 05:38
@lsf37 lsf37 mentioned this pull request Jan 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant