- redis/src/ziplist.h
- redis/src/ziplist.c
- redis/src/dict.h
- redis/src/dict.c
The ziplist is a specially encoded doubly linked list that is designed to be very memory efficient. It stores both strings and integer values, where integers are encoded as actual integers instead of a series of characters. It allows push and pop operations on either side of the list in O(1) time. However, because every operation requires a reallocation of the memory used by the ziplist, the actual complexity is related to the amount of memory used by the ziplist.
From redis/src/ziplist.c
This is the overall layout of a ziplist
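The comment at the top of redis/src/ziplist.c sketches the general layout like this:

```
<zlbytes> <zltail> <zllen> <entry> <entry> ... <entry> <zlend>
```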
Let's see a simple example
127.0.0.1:6379> HSET AA key1 33
(integer) 1
127.0.0.1:6379> OBJECT ENCODING AA
"ziplist"
As we can see from the picture above
zlbytes
is the total number of bytes the ziplist occupies
zltail
is the offset to the last entry, so a pop from the tail doesn't require a full traversal
zllen
indicates how many entries the current ziplist has
zlend
is a one-byte end marker of the ziplist
This is the layout of the two newly inserted entries (key1
and 33
)
prevlen
stores the length of the previous entry, so that you are able to traverse backward. It is either 1 byte, representing a length less than 254 as a uint8_t
, or 5 bytes: 1 marker byte, followed by 4 bytes representing the length as a uint32_t
encoding
indicates how the content is encoded in the entry; there are various encodings
for an entry
, as we will see later
We now know that an entry consists of three parts
prevlen
is either 1 byte or 5 bytes; if the first byte of prevlen
is 254, the following 4 bytes are an unsigned integer representing the length
The representation of the entry data
depends on the encoding
, and there are various kinds of encoding
entry data
stores the actual data; it's either a byte buffer or an integer value, and the actual representation depends on the encoding
field
An empty ziplist
will be created the first time you call the hset
command with a given key
127.0.0.1:6379> HSET AA key1 val1
(integer) 1
127.0.0.1:6379> HSET AA key2 123
(integer) 1
127.0.0.1:6379> HGET AA key2
"123"
The hget
command will traverse the ziplist
, extract each key according to its encoding
, and compare it to see whether it matches; it's an O(n) linear search
This is the update part of the code
/* redis/src/t_hash.c */
/* Delete value */
zl = ziplistDelete(zl, &vptr);
/* Insert new value */
zl = ziplistInsert(zl, vptr, (unsigned char*)value, sdslen(value));
Let's see an example
127.0.0.1:6379> HSET AA key1 val_new
(integer) 0
The delete operation takes place first
Then comes the insert operation
The insert operation will invoke realloc
to extend the size of the ziplist
, move all the entries
after the insertion point backward, and write the new item into the gap
realloc
delegates to zrealloc
defined in redis/src/zmalloc.h
; we will cover that in another article
The delete operation will only move all the memory after the deleted item forward. But because the prevlen
field is either 1 byte or 5 bytes, a cascade update
can happen: if the length of the entry before the deleted one can't be stored in 1 byte, while the deleted item's length could be, then the next item's prevlen
field needs to be extended, and a cascade update operation must take place
For example, if we delete the second entry in the following picture
After delete and move every memory blocks forward
The 1-byte prevlen
can't store the length of the previous entry
, so the ziplist
needs to be reallocated
and the prevlen
will be extended to 5 bytes
What if the newly extended entry
now becomes 254 bytes or longer? The prevlen
of the next entry
can't store that length in 1 byte either, so we need to realloc
again
If you have a ziplist
of N entries, every entry after the deleted one needs to be checked to see whether its prevlen must be extended
This is the so-called cascade update
/* redis/src/t_hash.c */
/* in hset */
for (i = start; i <= end; i++) {
if (sdsEncodedObject(argv[i]) &&
sdslen(argv[i]->ptr) > server.hash_max_ziplist_value)
{
hashTypeConvert(o, OBJ_ENCODING_HT);
break;
}
}
/* ... */
/* Check if the ziplist needs to be converted to a hash table */
if (hashTypeLength(o) > server.hash_max_ziplist_entries)
hashTypeConvert(o, OBJ_ENCODING_HT);
If the inserted object is of type sds and the sds length is longer than hash-max-ziplist-value
(you can set it in the config file, default value is 64), the ziplist
will be converted to a hash table (OBJ_ENCODING_HT
)
Or if the length of the ziplist
after the insertion is longer than hash-max-ziplist-entries
(default 512), it will also be converted to a hash table
I've set the line hash-max-ziplist-entries 0
in my config file, so every hash is encoded as a hash table
hset AA key1 12
This is the layout after setting a value
SipHash 1-2 (defined in redis/src/siphash.c
) is used as the Redis hash function, replacing the MurmurHash used in previous Redis versions
According to the comments in the Redis source code, SipHash 1-2 is about as fast as MurmurHash2, while SipHash 2-4 would slow Redis down by roughly 4%
The seed of the hash function is initialized at Redis server startup
/* redis/src/server.c */
char hashseed[16];
getRandomHexChars(hashseed,sizeof(hashseed));
dictSetHashFunctionSeed((uint8_t*)hashseed);
Because the hashseed is randomly generated, you are not able to predict the hash table index of a key across different Redis instances, or even for the same Redis server after a restart
hset AA zzz val2
CPython uses a probing algorithm for hash collisions, while Redis uses a singly linked list
If two keys hash to the same index, they are stored in a singly linked list, with the most recently inserted item at the front
hdel AA zzz
The delete operation will find the specific entry in the hash table, and resize if necessary
The function _dictExpandIfNeeded
will be called for every dictionary insertion
/* redis/src/dict.c */
/* Expand the hash table if needed */
static int _dictExpandIfNeeded(dict *d)
{
/* Incremental rehashing already in progress. Return. */
if (dictIsRehashing(d)) return DICT_OK;
/* If the hash table is empty expand it to the initial size. */
if (d->ht[0].size == 0) return dictExpand(d, DICT_HT_INITIAL_SIZE);
/* If we reached the 1:1 ratio, and we are allowed to resize the hash
* table (global setting) or we should avoid it but the ratio between
* elements/buckets is over the "safe" threshold, we resize doubling
* the number of buckets. */
if (d->ht[0].used >= d->ht[0].size &&
(dict_can_resize ||
d->ht[0].used/d->ht[0].size > dict_force_resize_ratio))
{
return dictExpand(d, d->ht[0].used*2);
}
return DICT_OK;
}
In order not to affect server performance, Redis implements an incremental rehashing strategy: the resize is not done all at once, it happens step by step with every CRUD request
Let's see
127.0.0.1:6379> del AA
(integer) 1
127.0.0.1:6379> hset AA 1 2
(integer) 1
127.0.0.1:6379> hset AA 2 2
(integer) 1
127.0.0.1:6379> hset AA 3 2
(integer) 1
127.0.0.1:6379> hset AA 4 2
(integer) 1
127.0.0.1:6379> hset AA 5 2
(integer) 1
This time when we insert a new entry, _dictExpandIfNeeded
is called and d->ht[0].used >= d->ht[0].size
becomes true; dictExpand
simply creates a new hash table with twice the old size and stores the newly created hash table in d->ht[1]
/* redis/src/dict.c */
/* Expand or create the hash table */
int dictExpand(dict *d, unsigned long size)
{
/* the size is invalid if it is smaller than the number of
* elements already inside the hash table */
if (dictIsRehashing(d) || d->ht[0].used > size)
return DICT_ERR;
dictht n; /* the new hash table */
unsigned long realsize = _dictNextPower(size);
/* Rehashing to the same table size is not useful. */
if (realsize == d->ht[0].size) return DICT_ERR;
/* Allocate the new hash table and initialize all pointers to NULL */
n.size = realsize;
n.sizemask = realsize-1;
n.table = zcalloc(realsize*sizeof(dictEntry*));
n.used = 0;
/* Is this the first initialization? If so it's not really a rehashing
* we just set the first hash table so that it can accept keys. */
if (d->ht[0].table == NULL) {
d->ht[0] = n;
return DICT_OK;
}
/* Prepare a second hash table for incremental rehashing */
d->ht[1] = n;
d->rehashidx = 0;
return DICT_OK;
}
Because rehashidx
is no longer -1, new entries will be inserted into ht[1]
While rehashidx
is not -1, every CRUD operation invokes the rehash
function
127.0.0.1:6379> hget AA 5
"2"
You will notice that rehashidx
has now become 1, and the bucket at table index 1 has been moved down to the second table
127.0.0.1:6379> hget AA not_exist
(nil)
rehashidx
now becomes 3
127.0.0.1:6379> hget AA 5
"2"
Once the bucket at index 3 is finished, the hash table is fully transferred: rehashidx
is set to -1 again, the old table is freed
, and the new table becomes ht[0]
; the positions of the two tables are exchanged
/* redis/src/dict.c */
int dictRehash(dict *d, int n) {
/* for every HGET command, n is 1 */
int empty_visits = n*10; /* Max number of empty buckets to visit. */
if (!dictIsRehashing(d)) return 0;
while(n-- && d->ht[0].used != 0) {
/* rehash n bucket */
dictEntry *de, *nextde;
/* Note that rehashidx can't overflow as we are sure there are more
* elements because ht[0].used != 0 */
assert(d->ht[0].size > (unsigned long)d->rehashidx);
while(d->ht[0].table[d->rehashidx] == NULL) {
/* skip the empty bucket */
d->rehashidx++;
if (--empty_visits == 0) return 1;
}
de = d->ht[0].table[d->rehashidx];
/* Move all the keys in this bucket from the old to the new hash HT */
while(de) {
uint64_t h;
nextde = de->next;
/* Get the index in the new hash table */
h = dictHashKey(d, de->key) & d->ht[1].sizemask;
de->next = d->ht[1].table[h];
d->ht[1].table[h] = de;
d->ht[0].used--;
d->ht[1].used++;
de = nextde;
}
d->ht[0].table[d->rehashidx] = NULL;
d->rehashidx++;
}
/* Check if we already rehashed the whole table... */
if (d->ht[0].used == 0) {
zfree(d->ht[0].table);
d->ht[0] = d->ht[1];
_dictReset(&d->ht[1]);
d->rehashidx = -1;
return 0;
}
/* More to rehash... */
return 1;
}
As the configure file says
Active rehashing uses 1 millisecond every 100 milliseconds of CPU time in order to help rehashing the main Redis hash table (the one mapping top-level keys to values). The hash table implementation Redis uses (see dict.c) performs a lazy rehashing: the more operation you run into a hash table that is rehashing, the more rehashing "steps" are performed, so if the server is idle the rehashing is never complete and some more memory is used by the hash table.
The default is to use this millisecond 10 times every second in order to actively rehash the main dictionaries, freeing memory when possible.
If unsure: use "activerehashing no" if you have hard latency requirements and it is not a good thing in your environment that Redis can reply from time to time to queries with 2 milliseconds delay.
use "activerehashing yes" if you don't have such hard requirements but want to free memory asap when possible.
The Redis server uses a hash table to store every key and value. As we learned from the examples above, the hash table won't make rehashing progress unless you keep searching/modifying it; activerehashing
is used in the server main loop to help with that
/* redis/src/server.c */
/* Rehash */
if (server.activerehashing) {
for (j = 0; j < dbs_per_call; j++) {
/* try to use 1 millisecond of CPU time at every call of this function to perform some rehashing */
int work_done = incrementallyRehash(rehash_db);
if (work_done) {
/* If the function did some work, stop here, we'll do
* more at the next cron loop. */
break;
} else {
/* If this db didn't need rehash, we'll try the next one. */
rehash_db++;
rehash_db %= server.dbnum;
}
}
}