As the most popular NoSQL cache database, Redis has become the default caching choice in most scenarios thanks to its excellent performance and rich data structures.
Because Redis is a pure in-memory database, its memory footprint becomes huge when storing large amounts of data. In some scenarios, choosing the right data structure for storage can dramatically reduce memory consumption, in some cases by 80%-99%.
Using zipList to replace massive key-value pairs
Consider a common scenario: in DSP advertising systems and other systems with massive user bases, we often need to quickly look up a user's id by some identifier, such as a MAC address, a UUID, or the MD5 of a phone number.
The characteristics: the data volume is large, tens of millions to hundreds of millions of entries, and the key is a long string, such as a 32-character MD5 or a UUID.
Without any special handling, storing these directly as key-value pairs is easy to test: insert 10 million entries into Redis, covering the range 1550000000 - 1559999999, in the form key(md5("1550000000")) → value(1550000000).
Then run info memory to check the memory usage.
As you can see, these 10 million entries occupy about 1.17G of Redis memory in total. At 100 million entries, the measured footprint is about 8G.
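The test keys can be generated like this; a minimal Java sketch (the class and method names are just for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Key {
    // Build the 32-char hex MD5 string used as the redis key,
    // e.g. SET md5("1550000000") -> "1550000000"
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String key = md5Hex("1550000000");
        System.out.println(key + " -> 1550000000");
    }
}
```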
Now store the same data a different way, and look at the result first:
After switching to zipList, memory usage drops to 123M, a reduction of about 85%. How does this work? Let's analyze it from Redis's underlying storage.
1 Redis data structures and encodings
2 How Redis stores strings
string is the most commonly used data structure in Redis. A Redis string differs from a C string: Redis builds its own abstract type called the "simple dynamic string" (SDS).
For the underlying storage of a string, Redis commonly uses three encodings: int, embstr, and raw.
For example, set k1 abc and set k2 123 will use embstr and int respectively. When the value is longer than 44 bytes (or 39, depending on the version), raw is used.
int is a fixed-length structure occupying 8 bytes (note: equivalent to a Java long), and can only store long integers.
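This can be checked directly with the OBJECT ENCODING command; a hypothetical redis-cli session:

```
127.0.0.1:6379> set k1 abc
OK
127.0.0.1:6379> object encoding k1
"embstr"
127.0.0.1:6379> set k2 123
OK
127.0.0.1:6379> object encoding k2
"int"
```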
embstr is allocated in one block together with its object header and is effectively read-only; once a string is modified or exceeds the threshold, Redis switches it to raw, whose SDS buffer grows by doubling until it exceeds 1M, after which it grows 1M at a time.
raw is used to store strings longer than 44 bytes.
In our case, the key is a 32-byte string (embstr) and the value is a long integer (int). So if we can turn the 32-character MD5 into an int, the key alone saves 3/4 of its memory footprint.
This is the first optimization point.
3 How Redis stores hashes
The hash data structure has two encodings: hashtable and ziplist.
You already know hashtable: like Java's HashMap, it is an array plus linked lists. To reduce hash collisions, Java's HashMap sets the load factor to 0.75, and Redis's hash has a similar growth factor. Without going into detail, keep one impression: with the hashtable encoding, Redis needs at least 25% more space than the data itself in order to store it.
ziplist, the compressed list, is laid out quite differently.
The biggest characteristic of ziplist is that it is not a hash structure at all. It is one long string: the key-value pairs are laid out in it sequentially. To find a key, you scan through the whole string.
Obviously, ziplist takes up much less space than hashtable, but it costs more CPU per lookup.
So when does Redis use hashtable versus ziplist? The answer is in redis.conf:
When the number of field-value pairs inside the hash does not exceed 512, and no value exceeds 64 bytes, ziplist is used.
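The two relevant settings, at their stock defaults (in newer Redis versions they are named hash-max-listpack-entries / hash-max-listpack-value):

```
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
```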
Measurements show that with 512 values, performance is almost indistinguishable from hashtable; with up to 1024 values there is only a tiny drop in performance, which can often be ignored.
Meanwhile, ziplist's memory usage is far lower than hashtable's.
This is the second optimization point.
4 Replacing key-value with ziplist
From the above we draw two conclusions: an int key saves a lot of space compared with a string key, and a ziplist-encoded hash saves a lot of space compared with plain key-value pairs.
So let's transform the original 10 million key-value pairs.
Step 1:
Spread the 10 million key-value pairs across N buckets, where each bucket is one Redis hash, and keep each bucket under the default 512 elements (or under your modified limit if you changed the config, e.g. 1024), so that the hash's encoding never degrades from ziplist to hashtable.
10,000,000 / 512 = 19531. Since we will later hash every key to spread them across the buckets, and a hash function never distributes perfectly evenly, reserve some headroom: allocate, say, 25000 or 30000 buckets.
Step 2:
Choose a hash algorithm that decides which bucket a key goes into. Here we use crc32, a well-known algorithm that is fast and well-distributed. It turns a string into a long-typed number; taking crc32 of the MD5 key and then the remainder modulo the bucket count determines the bucket.
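A minimal sketch of the bucket routing using Java's standard java.util.zip.CRC32 (BUCKET_COUNT and the class name are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class BucketRouter {
    static final int BUCKET_COUNT = 25000; // headroom over the 19531 minimum

    // crc32 maps the 32-char md5 key to an unsigned 32-bit number;
    // modulo the bucket count picks the outer hash key.
    static long bucket(String md5Key) {
        CRC32 crc = new CRC32();
        crc.update(md5Key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue() % BUCKET_COUNT;
    }

    public static void main(String[] args) {
        System.out.println(bucket("0eac261f1c2d21e0bfdbd567bb270a68"));
    }
}
```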
Step 3:
Step 2 fixed the outer key of the Redis hash. For the inner field, choose a different hash algorithm, so that two completely different keys that happen to collide on crc32(key) % COUNT do not also collide on the field and overwrite each other's value. For the inner field we use the BKDR string hash (or simply Java's hashCode), which also yields a long-typed number. The value is stored unchanged.
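A sketch of the BKDR hash for the inner field (seed 131 is the classic choice; the class and method names are illustrative):

```java
public class BkdrHash {
    // Classic BKDR string hash: hash = hash * 131 + c for each char.
    // It is independent from crc32, so a key is only lost if BOTH
    // hashes collide at once, which is what makes the scheme safe.
    static long bkdr(String s) {
        long hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = hash * 131 + s.charAt(i);
        }
        return hash & 0x7FFFFFFFFFFFFFFFL; // keep it non-negative
    }

    public static void main(String[] args) {
        System.out.println(bkdr("0eac261f1c2d21e0bfdbd567bb270a68"));
    }
}
```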
Step 4:
Load the data. The original structure was key-value: 0eac261f1c2d21e0bfdbd567bb270a68 → 1550000000.
The new structure is a hash: key 14523, field 1927144074, value 1550000000.
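Putting the four steps together, here is an in-memory sketch of the layout (nested maps stand in for the real HSET/HGET calls against Redis; the bucket count and hash choices follow the steps above, and all names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;
import java.nio.charset.StandardCharsets;

public class HashBucketStore {
    static final int BUCKET_COUNT = 25000;
    // simulates redis: outer key (bucket id) -> inner field -> value
    final Map<Long, Map<Long, Long>> redis = new HashMap<>();

    static long crc32(String s) {
        CRC32 c = new CRC32();
        c.update(s.getBytes(StandardCharsets.UTF_8));
        return c.getValue();
    }

    static long bkdr(String s) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) h = h * 131 + s.charAt(i);
        return h & 0x7FFFFFFFFFFFFFFFL;
    }

    // equivalent of: HSET {bucket} {field} {value}
    void put(String md5Key, long userId) {
        long bucket = crc32(md5Key) % BUCKET_COUNT;
        redis.computeIfAbsent(bucket, k -> new HashMap<>())
             .put(bkdr(md5Key), userId);
    }

    // equivalent of: HGET {bucket} {field}
    Long get(String md5Key) {
        Map<Long, Long> b = redis.get(crc32(md5Key) % BUCKET_COUNT);
        return b == null ? null : b.get(bkdr(md5Key));
    }

    public static void main(String[] args) {
        HashBucketStore store = new HashBucketStore();
        store.put("0eac261f1c2d21e0bfdbd567bb270a68", 1550000000L);
        System.out.println(store.get("0eac261f1c2d21e0bfdbd567bb270a68"));
    }
}
```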
Measurements show that after the 10 million entries are spread across 25000 buckets, the hash is well balanced: each bucket holds 300-odd field-value pairs. In theory, as long as the two hash algorithms never both produce the same value for two different keys, the original value can be recovered by key-field; this can be confirmed by comparing total counts. In practice, with many buckets and few values per bucket, the probability of a double collision is extremely low: in a measured run storing 5 billion phone numbers, no noticeable collision occurred.
Query speed test:
After storing the 10 million entries, we ran a query test: fetch 1 million entries with the key-value layout and with the hash layout, and compare query speed.
key-value timings: 10653, 10790, 11318, 9900, 11270, 11029 ms
hash-field timings: 12042, 11349, 11126, 11355, 11168 ms
Overall, after switching to hash storage, querying 1 million entries takes less than 500 ms extra; the impact on performance is minimal. But memory usage drops from 1.1G to 120M, close to a 90% saving.
With a huge number of key-value pairs, the keys themselves take up too much space, and to handle hash collisions Redis needs still more space on top of the key-value data itself.
If the keys vary in length, say some 40 characters and some 10, alignment causes heavy memory fragmentation and the space overhead gets even worse. So keeping key lengths uniform (for example, a fixed 8-byte int) also helps memory usage.
A string MD5 occupies 32 bytes; hashing those 32 bytes down to an 8-byte long integer significantly reduces the space the key occupies.
ziplist reduces memory dramatically compared with hashtable; its storage is very compact, and its impact on query efficiency is small. So make good use of ziplist, and avoid storing more than 512 field-value elements in a single hash.
If the value is a string or an object, storing it as bytes can also greatly reduce memory consumption. For example, Google's Snappy compression algorithm converts a string to bytes very efficiently, with a good compression ratio.
To avoid the memory fragmentation caused by Redis's string preallocation and expansion (doubling each time), avoid append, setrange, and similar commands; use set to replace the value outright.
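The post recommends Snappy; as a self-contained illustration of the same store-as-bytes idea using only the JDK, here is a Deflater/Inflater round trip (class and method names are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ValueCodec {
    // Compress a string value to bytes before storing it in redis.
    static byte[] compress(String value) {
        Deflater deflater = new Deflater();
        deflater.setInput(value.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) out.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return out.toByteArray();
    }

    // Restore the original string after reading the bytes back.
    static String decompress(byte[] data) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!inflater.finished()) out.write(buf, 0, inflater.inflate(buf));
        inflater.end();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String v = "some long repetitive value ".repeat(40);
        byte[] packed = compress(v);
        System.out.println(v.length() + " chars -> " + packed.length + " bytes");
        System.out.println(decompress(packed).equals(v));
    }
}
```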
Drawbacks of this approach:
The hash structure does not support a timeout on a single field. Deletion can be handled in code, and for long-lived data that never needs to expire this is not a concern.
There is a small probability of hash collisions, so this kind of compression is not suitable where every record must be exactly correct.
Based on this scheme, I extended Spring Boot's RedisTemplate to provide a CompressRedisTemplate class. It can be used just like RedisTemplate, and it automatically converts key-value pairs into hash storage as described above.
Next up, we'll look at even more extreme scenarios, such as counting unique visitors, and see how Redis's more unusual data structures can shrink 20G of memory down to 5M.
Welcome to follow the official account 【Magnon blossoms】.