Saving CPU and memory resources with low footprint mode

Jörn Franke edited this page Oct 28, 2018 · 7 revisions

Processing (reading or writing) Excel files requires a lot of CPU and memory, because the files store a lot of information that is relevant when working in Excel but may not be relevant if you only need the data.

Hence, since version 1.0.4 we provide a low footprint mode for Excel files (improved in version 1.2.0), based on the event and streaming APIs of Apache POI. To use it, simply define the following options:

  • reading: hadoopoffice.read.lowFootprint - set it to true (old and new Excel files)
    • Optionally, you can choose between different parsers for new Excel files, which offer different performance characteristics in terms of CPU and memory (see options)
  • writing: hadoopoffice.write.lowFootprint - set it to true (only new Excel files)
    • Optionally for writing: hadoopoffice.write.lowFootprint.cacherows, which defines how many rows are cached in memory before being flushed to disk (temporary local files, not HDFS). Default: 1000

No other changes are required!
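
As a minimal sketch of what "defining the options" can look like, the following Java snippet collects the option keys named above into plain maps. The class `LowFootprintOptions` and its methods are hypothetical helpers for illustration only and are not part of HadoopOffice; only the option keys themselves come from this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of HadoopOffice): it only gathers the
// option keys described on this page into key/value maps.
class LowFootprintOptions {

    // Options that enable low footprint mode for reading (old and new Excel files).
    static Map<String, String> readOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("hadoopoffice.read.lowFootprint", "true");
        return options;
    }

    // Options that enable low footprint mode for writing (new Excel files only).
    // cacheRows: how many rows are cached in memory before flushing to
    // temporary local files.
    static Map<String, String> writeOptions(int cacheRows) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("hadoopoffice.write.lowFootprint", "true");
        options.put("hadoopoffice.write.lowFootprint.cacherows", Integer.toString(cacheRows));
        return options;
    }

    public static void main(String[] args) {
        // Print the key/value pairs so they can be inspected or copied.
        readOptions().forEach((key, value) -> System.out.println(key + "=" + value));
        writeOptions(1000).forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

How these pairs are handed over depends on the platform you use. For example, they could be applied to a Hadoop `Configuration` with `conf.set(key, value)`, or passed as data source options in Spark.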

Since this mode does not process all the information stored in the file, there are certain limitations (cf. here for reading and here for writing).