author | Zeng fansong （ Chasing spirit ） Senior technical expert of Alibaba cloud container platform
This paper is compiled from 《CNCF x Alibaba Cloud native technology open class 》 The first 16 speak .
Reading guide ：etcd Distributed for shared configuration and service discovery 、 The consistency of the KV The storage system . This paper starts from etcd Several important moments of project development begin , I've introduced etcd The general framework and the basic principles of its design . Hope to help you better understand and use etcd.
One 、etcd The development of the project
etcd Born in CoreOS company , It was originally used to solve the problem of cluster management system OS Upgraded distributed concurrency control and storage and distribution of configuration files . Based on this ,etcd Designed to provide high availability 、 Strong consistent small keyvalue Data storage services .
The project currently belongs to CNCF The foundation , By AWS、Google、Microsoft、Alibaba And other large Internet companies are widely used .
first , stay 2013 year 6 Month by CoreOS Company to GitHub The first version of the initial code was submitted in .
here we are 2014 Year of 6 month , Something happened in the community ,Kubernetes v0.4 Version release . It is necessary to introduce Kubernetes project , First of all, it is a container management platform , Developed by Google and contributed to the community , Because it combines Google's many years of experience in container scheduling and cluster management , Since its birth, it has attracted much attention . stay Kubernetes v0.4 In the version , It has been used. etcd 0.2 Version as the storage service of experimental core metadata , Since then etcd The community has developed rapidly .
Soon , stay 2015 year 2 month ,etcd Released the first official stable version 2.0. stay 2.0 In the version ,etcd Redesigned Raft Consistency algorithm , And provides a simple tree data view for users , stay 2.0 In the version etcd Support more than per second 1000 Second write performance , It met the needs of most of the application scenarios at that time .2.0 After the release , Through constant iteration and improvement , Its original data storage scheme has gradually become the performance bottleneck in the new era , after etcd Launched the v3 Version design .
2017 year 1 Month of the month ,etcd Released 3.1 edition ,v3 The version scheme basically marks etcd It's technically mature . stay v3 In the version etcd A whole new set of API, More efficient and consistent reading methods have been realized , And provides a gRPC Of proxy Used to extend etcd Read performance . meanwhile , stay v3 There are a lot of GC Optimize , Great progress has been made in performance optimization , In this version etcd Can support more than 10000 Second write .
2018 year ,CNCF Many projects under the foundation use etcd Data storage as its core . according to an uncompleted statistic , Use etcd More projects than 30 individual , In the same year 11 month ,etcd The project itself has become CNCF Its incubator program . Get into CNCF After the foundation ,etcd Has more than 400 Contribution groups , It contains information from AWS、Google、Alibaba etc. 8 Of a company 9 Project maintainers .
2019 year ,etcd A brand new 3.4 edition , This version is made by Google、Alibaba And so on , Will be further improved etcd Performance and stability , To meet the demanding scenario requirements in the use of super large companies .
Two 、 Architecture and internal mechanism analysis
etcd It's a distributed one 、 reliable key-value The storage system , It is used to store key data in distributed systems , This definition is very important .
One etcd colony , Usually by 3 A or 5 The nodes make up , Multiple nodes pass through Raft The completion of consistency algorithm distributed consistency collaboration , The algorithm will select a master node as leader, from leader Responsible for data synchronization and data distribution . When leader In case of failure, the system will automatically select another node to become leader, And complete the data synchronization again . The client is in multiple nodes , Just choose any one of them to read and write data , The internal state and data collaboration are etcd Its complete .
stay etcd In the whole architecture , There is a very key concept called quorum,quorum Is defined as （n+1）/2, That is to say, more than half of the nodes in the cluster form a group , stay 3 In a cluster of nodes ,etcd It is permissible to 1 Node failure , That is, as long as there is any 2 Nodes available ,etcd You can continue to provide services . Empathy , stay 5 In a cluster of nodes , As long as there is any 3 Nodes available ,etcd You can continue to provide services , This is also etcd The key to cluster high availability .
Continue to provide services after allowing some nodes to fail , We need to solve a very complex problem ： Distributed consistency . stay etcd in , The distributed consistency algorithm consists of Raft Consistency algorithm complete , This algorithm itself is more complex and has the opportunity to expand it in detail , Here is just a brief introduction to make it easy for everyone to have a basic understanding of it .Raft A key point that consistency algorithm can work is ： Any two quorum There must be an intersection between the members of （ Members of the public ）, That is to say, as long as there is any one quorum Survive , There must be a node （ Members of the public ）, It contains all the committed data in the cluster . It's based on this principle ,Raft The consistency algorithm designs a set of data synchronization mechanism , stay Leader Can resynchronize the last... After the term switch quorum All data submitted , So as to ensure the consistency of data in the process of advancing the whole cluster state .
etcd The internal mechanism is more complex , but etcd The interface provided to customers is simple and direct . As shown in the figure above , We can go through etcd The client provided to access the cluster data , It can also be directly passed through http The way （ similar curl command ） Direct access etcd. stay etcd Inside , Its data representation is also relatively simple , We can put etcd Data storage is understood as an orderly map, It stores key-value data . meanwhile etcd In order to facilitate the client to subscribe to data changes , Also supported a watch Mechanism , adopt watch Get... In real time etcd Incremental update of data in , So as to realize and etcd Data synchronization and other business logic .
Let's take a look etcd Provided interface , There will be etcd The interface is divided into 5 Group ：
- The first group is Put And Delete. As you can see in the picture above put And delete The operation is very simple , Just provide one key And a value, You can write data to the cluster , When deleting data, you only need to specify key that will do ;
- The second group is query operation .etcd Two types of queries are supported ： The first is to specify a single key Query for , The second is the designated one key The scope of the ;
- The third group is data subscription .etcd Provides Watch Mechanism , We can use watch Subscribe to... In real time etcd Medium incremental data update ,watch Support specifying individual key, You can also specify a key The prefix of , In the actual application scenario, the second situation is usually adopted ;
- The fourth group of transaction operations .etcd Provides a simple transaction support , The user can perform certain actions when a set of conditions are met , Perform another set of operations when the condition is not true , Similar to if else sentence ,etcd Ensure the atomicity of the entire operation ;
- The fifth group is Leases Interface .Leases Interface is a common design pattern in distributed system , Its usage will be expanded later .
Data version mechanism
Use... Correctly etcd Of API, You must know the basic principle of internal corresponding data version number .
First etcd There was a term The concept of , It represents the whole cluster Leader The term of office of . When the cluster happens Leader Switch ,term It's worth it +1. Failure at node , perhaps Leader There is a problem with the node network , Or stop the whole cluster and pull it up again , Will happen Leader Handoff .
The second version number is called revision,revision Represents the version of the global data . When the data changes , Including the creation of 、 modify 、 Delete , Its revision The corresponding city will +1. Special , Cross in a cluster Leader Between terms of office ,revision Will keep the global monotone increasing . It is revision This characteristic of , Make any change in the cluster correspond to a unique revision, So we can go through revision To support data MVCC, It can also support data Watch.
For each of these KeyValue Data nodes ,etcd Three versions are recorded in ：
- The first version is called create_revision, yes KeyValue At creation time revision;
- The second is called mod_revision, It corresponds to the time when the data is operated revision;
- Third version It's a counter , On behalf of KeyValue How many times has it been modified .
Here we can show you the way of pictures ：
In the same Leader During the term of office , We found all the modifications , Their corresponding term Value is always equal to 2, and revision Keep increasing monotonously . When you restart the cluster , We will find that all modification operations correspond to term Value has become 3. In the new Leader Term of office , be-all term Value is equal to 3, And it won't change , And the corresponding revision The value also keeps increasing monotonously . Look at... From a larger dimension , It can be found in term=2 and term=3 Of the two Leader Between terms of office , Data corresponds to revision The value still keeps increasing monotonously .
mvcc & streaming watch
understand etcd After version number control , How to use etcd Multiple version numbers for concurrency control and data subscription （Watch）.
stay etcd Support for the same Key Initiate multiple data changes , Each data change corresponds to a version number .etcd The corresponding data of each modification is recorded in the implementation , It means a key stay etcd There are multiple historical versions in . If the version number is not specified when querying data ,etcd Returns the Key Corresponding latest version , Of course etcd It also supports specifying a version number to query historical data .
because etcd Record every change , Use watch When subscribing to data , Can support from any historical moment （ Appoint revision） Start creating a watcher, On the client side and etcd Build a data pipeline between ,etcd Will push from specified revision All data changes started .etcd Provided watch Mechanism guarantees , The Key After the subsequent modification of the data , Push it to the client in real time through this data pipeline .
As shown in the figure below ,etcd All data in is stored in one b+tree in （ gray ）, The b+tree Save on disk , And pass mmap Map to memory to support fast access . gray b+tree Maintain in revision To value The mapping relation of , Supported by revision Query the corresponding data . because revision It's monotonous , When we go through watch To subscribe to the specified revision After the data , Just subscribe to this b+ tree Data changes can be .
stay etcd There's another one inside btree（ Blue ）, It manages. key To revision The mapping relation of . When the client uses key When querying data , First you need to go through the blue btree take key Translate into the corresponding revision, And then through the gray btree Query the corresponding data .
Careful readers will find out ,etcd Recording every change will lead to continuous data growth , This will lead to memory and disk space consumption , It also affects b+tree The query efficiency of . To solve this problem , stay etcd Will run a Periodic Compaction The mechanism of To clean up historical data , Will be a period of time before the same Key The data of multiple historical versions of is cleaned up . The end result is grey b+tree It's still monotonous , But there may be some holes .
In understanding the mvcc Mechanism and watch After the mechanism , Continue to look at etcd Provided mini-transactions Mechanism .etcd Of transaction The mechanism is relatively simple , It can be understood as a paragraph if-else Program , stay if Multiple operations can be provided in , As shown in the figure below ：
If There are two conditions in it , When Value(key1) Greater than "bar" also Version(key1) The version of is equal to 2 When , perform Then The operation specified in ： modify Key2 The data is valueX, At the same time to delete Key3 The data of . If you don't do that , Then perform another operation ：Key2 It is amended as follows valueY.
stay etcd The atomicity of the whole transaction operation is guaranteed internally . in other words If Operate all comparison conditions , The view it sees must be consistent . At the same time, it can ensure that the atomicity of multiple operations does not occur Then Only half of the operations in are performed .
adopt etcd Transaction operations provided , We can guarantee the consistency of data reading and writing in multiple competitions , For example, as mentioned before Kubernetes project , It is the use of etcd Transaction mechanism of , To achieve multiple KubernetesAPI server Consistency of changes to the same data .
lease The concept and usage of
lease Is a common concept in distributed systems , Used to represent a distributed lease . Typically , When it is necessary to detect whether a node is alive in a distributed system , We need a lease mechanism .
The code example in the example above first creates a 10s Lease of , If you don't do anything after creating a lease , that 10s after , This lease will expire automatically . And then key1 and key2 Two key value Bind to this lease , So when the lease expires etcd It will automatically clean up key1 and key2, Make nodes key1 and key2 It has the ability to delete automatically after timeout .
If you want this lease to never expire , Need to call periodically KeeyAlive Method to refresh the lease . For example, we need to detect whether a process in a distributed system is alive , You can create a lease in progress , And call periodically in this process KeepAlive Methods . If everything goes well , The lease of this node will be consistent , If the process goes down , Eventually the lease will expire automatically .
stay etcd in , Allow multiple key The connection is in the same lease above , This design is very ingenious , Can be greatly reduced lease The cost of refreshing objects . Just imagine , If there's a lot of key All need to support similar lease mechanism , every last key We need to renew the lease independently , This will give etcd There's a lot of pressure . Through multiple key Bind to the same lease The pattern of , We can make super time similar to key Come together , So the cost of lease refresh is greatly reduced , It can greatly improve the activity without failure etcd Scale of use supported .
3、 ... and 、 Typical use scenario Introduction
Kubernetes Store the state you are using in etcd in , The high availability of its status data is given to etcd To solve ,Kubernetes The system itself does not need to deal with complex distributed system state processing , Its system architecture has been greatly simplified .
Server Discovery （Naming Service）
The second scene is Service Discovery, It's also called name service . In distributed systems , A common pattern is to have multiple back ends （ It could be hundreds of processes ） To provide a set of equivalent services , For example, retrieval services 、 Recommended services .
For such a back-end service , In general, in order to simplify the operation and maintenance cost of back-end services （ Nodes are replaced at any time in case of failure ）, This process on the back end will be similar to Kubernetes Such a cluster management system is scheduled , So when the user （ Or upstream services ） When called , We need a service discovery mechanism to solve the problem of service routing . This service can be used to find problems etcd To solve , The way is as follows ：
- After starting inside the process , You can register your address with etcd;
- API The gateway is enough to go through etcd Sense the address of the back-end process in time , When the back-end process fails to migrate, it will re register to etcd in ,API The gateway can also sense the new address in time ;
- utilize etcd Provided Lease Mechanism , If there is an exception in the running process of the service provider （crash）,API Gateway can also remove its traffic to avoid call timeout .
In this framework , Service status data is etcd To take over ,API The gateway itself is stateless , Can expand horizontally to serve more customers . At the same time benefit from etcd Good performance of , Nodes that can support tens of thousands of back-end processes , So that this architecture can serve large enterprises .
Distributed Coordination: leader election
In distributed systems , A typical design pattern is Master+Slave. Usually ,Slave Provides CPU、 Memory 、 Disk and network resources , and Master Used to reconcile these nodes to provide a service to the outside world （ Like distributed storage , Distributed computing ）. Typical distributed storage services （HDFS） And distributed computing services （Hadoop） They all use design patterns like this . Such a design pattern has a typical problem ：Master Availability of nodes . When Master After the breakdown , The service of the whole cluster will be suspended , There is no way to serve the user's request .
To solve this problem , The typical way is to start multiple Master node . because Master The node will contain control logic , State synchronization between multiple nodes is very complex , The most typical way of doing this is through the way of selecting the owner , Select one of the nodes as the primary node to provide services , Another node is waiting .
adopt etcd The mechanism provided can easily realize the main function of the distributed process , For example, you can use the same key To realize the logic of the robber . generally speaking , The chosen one Leader I will take my own IP Sign up to etcd in , bring Slave Nodes can get the current Leader Address , So that the system follows the previous single Master The way the node continues to work . When Leader After node exception , adopt etcd Can select a new node to be the master node , And sign up for a new IP after ,Slave It can pull new master nodes IP, Resume service .
Distributed Coordination Distributed system concurrency control
In distributed systems , When we go to perform some tasks , For example, upgrade OS、 Or upgrade OS On the software 、 Or to perform some computing tasks , Due to the bottleneck of back-end services or the consideration of business stability , In general, you need to control the concurrency of tasks . If the task lacks a harmonious Master node , Can pass etcd To complete such distributed system work .
In this mode, through etcd To implement a distributed semaphore , And it can be used etcd leases Mechanism to automatically remove the fault nodes . During process execution , If the process runs for a long time , We can store some state data during the process running to etcd, So that when the process fails and needs to recover to other places , Can from etcd To restore some execution state , It doesn't need to complete the whole calculation logic , To speed up the efficiency of the whole task .
This paper summarizes
This is the end of the main content of this article , Here is a brief summary for you ：
- The first part , I've introduced etcd How the project was born , And in etcd Several important moments in the development process ;
- The second part , I've introduced etcd And the basic operation interface inside , Understanding etcd How to achieve high availability on the basis of , It shows etcd Some basic operations of data and its internal implementation principle ;
- The third part , Three typical etcd Use scenarios , And in the corresponding scenario , Design idea of distributed system .
“ Alibaba cloud native WeChat official account （ID：Alicloudnative） Focus on microservices 、Serverless、 Containers 、Service Mesh And other technical fields 、 The trend of primary popular technology of focus cloud 、 Large scale practice of cloud original , Do official account of cloud native developers best. .”