[columnar] Adds column level caching and controls #86
Conversation
Force-pushed from 6ed367f to 565cc14
Thanks for the great write-up, this looks very promising! Can you explain the rationale for having
A couple of reasons: if a user is running large OLAP queries with multiple groupings against multiple tables that are largish (10+ GB), then they should consider enabling it. If a user is simply doing
Re-reading this today, maybe it's possible to analyze (e.g. based on the query plan / cost prediction) whether this caching should be enabled or not? Could be a future feature...
I think this overhead should be minimal if it is possible to keep the decompressed data only in the cache and hand back just a pointer to that memory when processing is done.
Analysis of Caching Compressed Columns
We know that for some workloads, when using compression for columnar tables, we can decompress the same compressed chunks many times during analytical queries. This causes a high CPU load, which can, under some circumstances, negate any gains from columnar compression.
Example Query
This effect can be demonstrated with a TPC-H-like database and TPC-H-like queries. One example of this is:
Table Sizes
Query Times
Cache Statistics
We can see from both the query times and the cache statistics that we spend a lot of time decompressing the same chunks. While the columnar store is already faster, it can be faster still by not discarding decompressed chunks. In this case, we saved more than 24,000 decompressions by keeping them in cache.
When to Cache
Even when caching is enabled, we need to make an intelligent decision about when to cache and when not to. Some of these choices are simple (when the data is not compressed, we should not cache), but others are dictated by how the cache is used.
Currently, the cache is not invalidated when the data underneath it changes. That means that if we cached during `INSERT`, `UPDATE`, or `DELETE` whenever there are supporting reads, we would need to be diligent about any changes and instantly invalidate the cache. In this iteration, the choice was made to only cache on a `SELECT`, and never cache when `INSERT`, `UPDATE`, or `DELETE` are executed.
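The decision above can be sketched as a small predicate. This is an illustrative Python sketch of the policy as described, not the extension's actual C code; the function and parameter names are hypothetical.

```python
def should_cache(command: str, chunk_is_compressed: bool) -> bool:
    """Cache a decompressed chunk only for plain SELECTs on compressed data."""
    if not chunk_is_compressed:
        # Nothing is saved by caching data that needed no decompression.
        return False
    if command != "SELECT":
        # INSERT/UPDATE/DELETE would require instant cache invalidation.
        return False
    return True
```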
Memory vs CPU Constraints
One of the biggest aspects of caching is the tradeoff between memory and CPU. When we are CPU-constrained but have extra memory to dedicate, caching the uncompressed values allows for faster queries.
Cache Methodologies
The current mechanism caches columns retrieved from chunks after they have been decompressed.
New cache entries are appended to the end of the list and copied into the caching memory context, where they are not freed until the end of the query.
Eviction
Cache eviction occurs after each cache addition. If adding a new entry pushes the total amount of memory allocated above the specified maximum, an eviction process is started. This eviction starts at the oldest cache entry and evicts entries that have not been used until the memory allocation size is below what is requested (maximum * 0.9, or 90%). If not enough memory has been released, more passes are made through the cache entries, raising the use-count threshold each pass, until enough memory is released.
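The multi-pass eviction described above can be sketched as follows. This is an illustrative Python model, assuming entries are kept in insertion (oldest-first) order; the real extension operates on PostgreSQL memory contexts, and all names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    size: int   # bytes held by the decompressed column chunk
    uses: int   # how many times this entry has been read

def evict(entries: list[CacheEntry], max_bytes: int) -> list[CacheEntry]:
    """Multi-pass eviction: walk oldest-first, evicting entries at or below
    a use-count threshold that rises each pass, until total usage drops
    below 90% of the configured maximum."""
    target = max_bytes * 0.9
    total = sum(e.size for e in entries)
    threshold = 0
    while total > target and entries:
        survivors = []
        for e in entries:                      # oldest entries come first
            if total > target and e.uses <= threshold:
                total -= e.size                # evict this entry
            else:
                survivors.append(e)
        entries = survivors
        threshold += 1                         # next pass evicts more-used entries
    return entries
```

Note how an entry's use count acts as a cheap frequency signal: the first pass only reclaims never-used entries, and hotter entries survive until later passes force them out.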
Controlling Cache
The cache can be enabled or disabled on a query-by-query basis, or at the database level.
| Setting | Default | Range |
| --- | --- | --- |
| `columnar.enable_column_cache` | `false` | `true` / `false` |
| `columnar.column_cache_size` | `200` | `20` to `20000` |
Note that this cache is per process, so total memory use can grow to a multiple of the configured size: up to one full cache per parallel process in a query.
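Since each process keeps its own cache, a rough worst case for total cache memory is the configured size times the number of processes touching the cache. A quick sketch, where the process count of 5 is just an example value:

```python
def worst_case_cache_memory(cache_size: int, processes: int) -> int:
    """Upper bound: every parallel process fills its own cache to the maximum."""
    return cache_size * processes

# With the default size of 200 and five processes:
print(worst_case_cache_memory(200, 5))  # -> 1000
```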