
[columnar] Adds column level caching and controls #86

Merged Jun 2, 2023
Conversation

@JerrySievert (Contributor) commented May 23, 2023

Analysis of Caching Compressed Columns

We know that for some workloads on compressed columnar tables, the same
compressed chunks can be decompressed many times during a single analytical
query. This causes high CPU load, which can, under some circumstances,
negate any gains from columnar compression.

Example Query

This effect can be demonstrated with a TPC-H-like database and TPC-H-like
queries. One example is:

select
	s_name,
	s_address
from
	supplier,
	nation
where
	s_suppkey in (
		select
			ps_suppkey
		from
			partsupp,
			(
				select
					l_partkey agg_partkey,
					l_suppkey agg_suppkey,
					0.5 * sum(l_quantity) AS agg_quantity
				from
					lineitem
				where
					l_shipdate >= date '1996-01-01'
					and l_shipdate < date '1996-01-01' + interval '1' year
				group by
					l_partkey,
					l_suppkey
			) agg_lineitem
		where
			agg_partkey = ps_partkey
			and agg_suppkey = ps_suppkey
			and ps_partkey in (
				select
					p_partkey
				from
					part
				where
					p_name like 'chiffon%'
			)
			and ps_availqty > agg_quantity
	)
	and s_nationkey = n_nationkey
	and n_name = 'SAUDI ARABIA'
order by
	s_name
LIMIT 1;

Table Sizes

Tables are:

| Table | Type | Size |
| --- | --- | --- |
| customer | columnar | 83 MB |
| customer_heap | heap | 290 MB |
| lineitem | columnar | 1921 MB |
| lineitem_heap | heap | 9492 MB |
| nation | columnar | 24 kB |
| nation_heap | heap | 8192 bytes |
| orders | columnar | 454 MB |
| orders_heap | heap | 2158 MB |
| part | columnar | 56 MB |
| part_heap | heap | 320 MB |
| partsupp | columnar | 282 MB |
| partsupp_heap | heap | 1432 MB |
| region | columnar | 24 kB |
| region_heap | heap | 8192 bytes |
| supplier | columnar | 5400 kB |
| supplier_heap | heap | 18 MB |

Query Times

| Type | Query Time |
| --- | --- |
| Heap | 22.898s |
| Columnar | 12.148s |
| Columnar/Cache | 5.792s |

Cache Statistics

  • Cache Hits: 24266
  • Cache Misses: 352
  • Cache Evictions: 0
  • Cache Writes: 177
  • Cache Maximum Size: 20981220
  • Cache Ending Size: 20981220
  • Total Cache Entries: 177

We can see from both the query times and the cache statistics that we spend a
lot of time decompressing the same chunks. While the columnar store is already
faster than heap, it can be faster still by not discarding decompressed
chunks. In this case, keeping them in cache saved more than 24,000
decompressions.

When to Cache

When caching is enabled, we still need to make an intelligent decision about
when to cache and when not to. Some of these choices are simple: when the
data is not compressed, we should not cache it. Other choices are dictated
by how the cache is used.

Currently, the cache is not invalidated when the data underneath it changes.
That means that if we cached during INSERT, UPDATE, or DELETE statements
that include supporting reads, we would need to be diligent about any changes
and instantly invalidate the cache. In this iteration, the choice was made to
cache only on SELECT, and never cache when INSERT, UPDATE, or DELETE
are executed.

Memory vs CPU Constraints

One of the biggest aspects of caching is the tradeoff of memory versus CPU.
When we are CPU constrained but have spare memory, dedicating that memory to
decompressed values allows for faster queries.

Cache Methodologies

The current mechanism for caching uses columns retrieved from chunks after
they have been decompressed.

New cache entries are added to the end of the list and copied into the
caching memory context, where they are not freed until the end of the
query.

Eviction

Cache eviction occurs after each cache addition. If adding a new entry
pushes the total allocated memory above the configured maximum, an
eviction process is started. Eviction begins at the oldest cache entry
and removes entries that have not been used until the allocation falls
below the requested size (maximum * 0.9, or 90%). If not enough memory
has been released, additional passes are made through the cache entries
with an increasing use-count threshold until enough memory is freed.
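The multi-pass eviction described above can be sketched as a simplified model
(in Python for readability; the names and structure are illustrative, not the
extension's actual C implementation):

```python
from collections import OrderedDict

class ColumnCache:
    """Simplified model of the column cache's multi-pass eviction.

    Entries are kept in insertion order (oldest first). Eviction removes
    entries whose use count is at or below a rising threshold until the
    cache shrinks below 90% of the configured maximum size.
    """

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.entries = OrderedDict()  # key -> (size, use count)
        self.size = 0

    def get(self, key):
        """Look up an entry; a hit bumps its use count."""
        entry = self.entries.get(key)
        if entry is None:
            return None
        size, uses = entry
        self.entries[key] = (size, uses + 1)
        return key

    def add(self, key, size):
        """Append a new entry, then evict if over the maximum."""
        self.entries[key] = (size, 0)
        self.size += size
        if self.size > self.max_bytes:
            self._evict()

    def _evict(self):
        target = self.max_bytes * 0.9
        threshold = 0
        # Pass over the entries oldest-first, raising the use-count
        # threshold each pass, until enough memory has been released.
        while self.size > target:
            for key in list(self.entries):
                entry_size, uses = self.entries[key]
                if uses <= threshold:
                    del self.entries[key]
                    self.size -= entry_size
                    if self.size <= target:
                        return
            threshold += 1
```

In this model, an entry that was read at least once survives the first
eviction pass, so recently useful chunks outlive never-used ones.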

Controlling Cache

The cache can be enabled or disabled on a query-by-query basis, or at the
database level.

| Variable | Default | Explanation |
| --- | --- | --- |
| columnar.enable_column_cache | false | Enables the column cache for columnar queries |
| columnar.column_cache_size | 200 | Number of megabytes to use per process for the columnar cache, from 20 to 20000 |

Note that this cache is per process, so total memory use can grow to a
multiple of the cache size when a query uses parallel workers.
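For example, the settings in the table above could be applied per session or
made a database default (a usage sketch; `mydb` is a placeholder database
name, and the standard PostgreSQL `SET` / `ALTER DATABASE ... SET` mechanisms
are assumed):

```sql
-- enable for the current session only
SET columnar.enable_column_cache = true;
SET columnar.column_cache_size = 2000;  -- megabytes per process

-- or make caching the default for one database
ALTER DATABASE mydb SET columnar.enable_column_cache = true;
```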

@JerrySievert JerrySievert added the enhancement New feature or request label May 23, 2023
@JerrySievert JerrySievert requested review from wuputah and mkaruza May 23, 2023 20:02
@JerrySievert JerrySievert self-assigned this May 23, 2023
@JerrySievert JerrySievert force-pushed the caching branch 2 times, most recently from 6ed367f to 565cc14 on May 23, 2023 20:09
@wuputah (Member) commented May 24, 2023

Thanks for the great write-up, this looks very promising!

Can you explain the rationale for having columnar.enable_column_cache default to false? When should the user enable it?

@JerrySievert (Contributor, Author) replied May 24, 2023

> Can you explain the rationale for having columnar.enable_column_cache default to false?

a couple of reasons:

  1. this is the first implementation; we should let it bake to make sure that no new bugs were introduced
  2. it is best for analytical workloads that re-read the same chunks or columns repeatedly; OLTP generally does not do that, but OLAP does.

> When should the user enable it?

if a user is running large OLAP queries against multiple tables, with multiple groupings, that are largish (10+ GB), then they should consider enabling it.

if a user is simply doing COUNT(*) or OLTP workloads, then turning on the cache will likely slow the queries down due to the overhead of in-memory copying.

@wuputah (Member) commented May 31, 2023

> if a user is running large OLAP queries against multiple tables, with multiple grouping, that are largish (10+GB), then they should consider enabling it.
>
> if a user is simply doing COUNT(*) or OLTP workloads, then turning on the cache will likely slow the queries down due to the overhead of in-memory copying.

Re-reading this today, maybe it's possible to analyze (i.e. based on the query plan / cost prediction) whether this caching should be enabled or not? Could be a future feature...

@mkaruza (Contributor) commented Jun 1, 2023

> if a user is running large OLAP queries against multiple tables, with multiple grouping, that are largish (10+GB), then they should consider enabling it.
>
> if a user is simply doing COUNT(*) or OLTP workloads, then turning on the cache will likely slow the queries down due to the overhead of in-memory copying.

> Re-reading this today, maybe it's possible to analyze (i.e. based on the query plan / cost prediction) whether this caching should be enabled or not? Could be a future feature..

I think this overhead should be minimal if it is possible to keep the decompressed data only in the cache and provide just a pointer to its memory location when processing is done.

@JerrySievert JerrySievert merged commit e973247 into main Jun 2, 2023
@JerrySievert JerrySievert deleted the caching branch June 2, 2023 16:51