Hello everyone, I'm Xiaoyu.
Okay, let's get straight to the point. What I bring you today is the next chapter in the life of our old friend Kafka.
As the demand for real-time data grows, how do we guarantee fast delivery while moving enormous volumes of data? This is the problem message queues were born to solve.
A "message" is a unit of data transmitted between two computers. A message can be very simple, such as a plain text string, or more complex, perhaps containing embedded objects.
Messages are sent to a queue; a "message queue" is a container that holds messages while they are in transit. Kafka is a distributed message queue, and it is essential for us to master it.
This article covers the implementation details of Kafka's basic components and walks through their basic applications in detail. I also stayed up several nights drawing diagrams, hoping to give you a deeper understanding of Kafka's core concepts. Finally, it summarizes how Kafka is applied in real business scenarios. Follow Xiaoyu to get familiar with Kafka's little secrets:
Kafka is a high-throughput, distributed, publish/subscribe messaging system. It was originally developed at LinkedIn, is written in Scala, and is now an Apache open source project.
Kafka's main components
broker: the Kafka server, responsible for storing and forwarding messages
topic: a message category; Kafka classifies messages by topic
partition: a physical division of a topic; one topic can contain multiple partitions, and the topic's messages are spread across its partitions
offset: the position of a message in the log; it can be understood as the message's offset within its partition, and it is also the message's unique sequence number
Producer: message producer
Consumer: message consumer
Consumer Group: consumer grouping; every Consumer must belong to a group
Zookeeper: stores cluster metadata such as brokers, topics, and partitions; it is also responsible for broker failure detection, partition leader election, load balancing, and so on
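To make the relationship between these components concrete, here is a toy in-memory model of the addressing scheme: a topic is a set of partitions, each an append-only log, and an offset is a message's logical position inside its partition. This is an illustrative sketch only, not Kafka's actual implementation.

```python
# Toy model of Kafka's addressing: topic -> partitions -> offsets.
# Illustrative only; real Kafka stores partitions on disk across brokers.

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append to one partition's log and return the new offset."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1  # offset = position within the partition log

    def read(self, partition, offset):
        return self.partitions[partition][offset]

t = Topic("orders", num_partitions=2)
o0 = t.append(0, "order-1")
o1 = t.append(0, "order-2")
o2 = t.append(1, "order-3")
print(o0, o1, o2)  # offsets are per-partition: 0 1 0
```

Note that offsets are only meaningful within one partition: two different partitions of the same topic each start counting from 0.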
Advantages of message queues
Decoupling: the messaging system inserts an implicit, data-based interface layer that both sides implement. This lets you extend or modify the processing on either side independently, as long as both obey the same interface constraints.
Redundancy: message queues persist data until it has been fully processed, avoiding the risk of data loss. In the "insert - get - delete" paradigm used by many message queues, your processing system must explicitly indicate that a message has been processed before it is deleted from the queue, which ensures your data is kept safely until you are done with it.
Extensibility: because message queues decouple your processing, it is easy to increase the rate at which messages are enqueued and processed: just add more processors. No code changes, no parameter tuning. Scaling out is as simple as turning up a dial.
Flexibility & peak-handling capacity: message queues let key components withstand sudden traffic spikes instead of crashing under an unexpected flood of requests.
Recoverability: message queues reduce the coupling between processes, so even if a process handling messages dies, queued messages can still be processed once the system recovers.
Ordering guarantee: most message queues keep messages in order anyway and guarantee that data is processed in a particular sequence. Kafka guarantees ordering within a single partition.
Buffering: the queue acts as a buffer layer that helps tasks execute at their most efficient pace; writes into the queue go as fast as possible. This buffer helps control and optimize the speed at which data flows through the system.
Asynchronous communication: message queues provide an asynchronous processing mechanism, allowing a user to enqueue a message without processing it immediately. Put as many messages as you like on the queue, then process them as needed.
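The decoupling, buffering, and asynchrony described above can be sketched with Python's thread-safe standard-library queue standing in for the message queue: the producer enqueues and moves on, and the consumer drains at its own pace.

```python
# Minimal sketch of decoupled, asynchronous producer/consumer
# communication through a buffer (queue.Queue plays the message queue).
import queue
import threading

q = queue.Queue()   # the buffer sitting between the two sides
processed = []

def producer():
    for i in range(5):
        q.put(f"msg-{i}")   # enqueue and move on: asynchronous send

def consumer():
    for _ in range(5):
        processed.append(q.get())  # drain at the consumer's own pace
        q.task_done()

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
print(processed)  # messages arrive in FIFO order
```

Neither side calls the other directly; both only agree on the queue's interface, which is exactly the decoupling point made above.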
Kafka application scenarios
Activity tracking: tracking interactions between website users and front-end applications, e.g. website PV/UV analysis
Messaging: asynchronous information exchange between systems, e.g. marketing campaigns (sending a coupon code after registration)
Log collection: collecting metrics and logs from systems and applications, e.g. application monitoring and alerting
Commit log: replicating database updates into Kafka, e.g. transaction statistics
Kafka's data storage design
Partition data files
Each Message in a partition has three attributes: offset, MessageSize, and data. The offset indicates the Message's position within the partition. It is not the physical location of the Message in the partition's data file but a logical value that uniquely identifies one Message in the partition; you can think of the offset as the Message's id within the partition. MessageSize is the size of the message body, and data is the content of the Message itself.
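The three-field layout above can be sketched as a simple binary encoding: a fixed-width offset, a MessageSize field, then the payload. The field widths here are illustrative, not Kafka's exact wire format.

```python
# Sketch of the (offset, MessageSize, data) record layout described
# above. Widths are illustrative, not Kafka's real message format.
import struct

def encode(offset: int, data: bytes) -> bytes:
    # 8-byte big-endian offset, 4-byte MessageSize, then the payload
    return struct.pack(">QI", offset, len(data)) + data

def decode(buf: bytes):
    offset, size = struct.unpack(">QI", buf[:12])
    return offset, size, buf[12:12 + size]

raw = encode(42, b"hello")
print(decode(raw))  # (42, 5, b'hello')
```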
Segmented data files (segments)
Physically, a partition consists of multiple segment files of equal size, read and written sequentially. Each segment data file is named after the smallest offset it contains, with the file extension .log. So when looking for the Message at a given offset, a binary search over the segment names quickly locates the segment data file that contains it.
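Because each segment is named by its smallest (base) offset, the lookup described above reduces to a binary search over a sorted list of base offsets. A sketch, with made-up base offsets:

```python
# Locating the segment that contains a given offset: binary-search the
# sorted segment base offsets. Base offsets here are illustrative.
import bisect

segment_base_offsets = [0, 368769, 737337]  # names of the .log files

def find_segment(offset: int) -> int:
    """Return the base offset of the segment containing `offset`."""
    i = bisect.bisect_right(segment_base_offsets, offset) - 1
    return segment_base_offsets[i]

print(find_segment(368770))  # falls in the second segment: 368769
```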
Data file index
Kafka creates an index file for each segment data file, with the same name as the data file but the extension .index. The index file does not contain an entry for every Message in the data file; instead it uses sparse storage, creating an index entry for every fixed interval of bytes of data. This keeps the index file from taking up too much space, so it can be kept in memory.
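A sparse index lookup works as described above: take the greatest indexed offset less than or equal to the target, seek to its recorded file position, then scan forward. A minimal sketch, with made-up index entries:

```python
# Sketch of a sparse index lookup: the index maps only some offsets to
# file positions; the rest are reached by scanning forward from the
# nearest entry. Entry values are illustrative.
import bisect

# sparse index entries: (offset, byte position in the .log file)
index = [(0, 0), (100, 4096), (200, 8192)]

def lookup(target: int):
    offsets = [o for o, _ in index]
    i = bisect.bisect_right(offsets, target) - 1
    base_offset, position = index[i]
    # A real reader would now seek to `position` in the data file and
    # scan forward from base_offset until it reaches `target`.
    return base_offset, position

print(lookup(150))  # nearest entry at or below 150: (100, 4096)
```

This is the trade-off sparse storage buys: a little forward scanning per lookup in exchange for an index small enough to stay resident in memory.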
Zookeeper's role in Kafka
The Kafka cluster, producers, and consumers all rely on Zookeeper to guarantee system availability; the cluster stores some of its metadata there.
Kafka uses Zookeeper as its distributed coordination framework, tying together the processes of message production, message storage, and message consumption.
With Zookeeper's help, Kafka's producers, consumers, and brokers can all remain stateless while still establishing subscription relationships between producers and consumers and balancing the load between them.
Because a topic consists of multiple partitions that are evenly distributed across brokers, the producer can spread messages over the partitions (randomly, by hashing, or by other strategies) to make effective use of the broker cluster's capacity, raise message throughput, and achieve load balancing.
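The two partition-selection strategies mentioned above can be sketched as follows. Real Kafka clients hash keys with murmur2; `zlib.crc32` here is just a stand-in for "some stable hash".

```python
# Sketch of producer partition selection: keyed messages hash to a
# stable partition, unkeyed messages are spread randomly. zlib.crc32
# stands in for the client's real hash (murmur2 in the Java client).
import random
import zlib

NUM_PARTITIONS = 4

def choose_partition(key=None):
    if key is None:
        return random.randrange(NUM_PARTITIONS)       # spread load randomly
    return zlib.crc32(key.encode()) % NUM_PARTITIONS  # same key -> same partition

# The same key always lands on the same partition, which is what
# preserves per-key ordering:
print(choose_partition("user-42") == choose_partition("user-42"))  # True
```

The keyed variant matters because Kafka only guarantees ordering within a partition: routing all of one key's messages to one partition keeps them in order.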
Batching is an important way to raise message throughput. The producer can merge multiple messages in memory and send them to the broker in a single request, greatly reducing the broker's I/O frequency for storing messages. This does affect message latency to some degree: a little delay is traded for better throughput.
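The accumulate-then-send behavior described above can be sketched with a tiny batching producer, where `send_batch` stands in for the real network request to the broker:

```python
# Sketch of producer-side batching: buffer messages in memory and hand
# them over as one request once the batch fills. `send_batch` is a
# stand-in for the real broker request.
sent_batches = []

def send_batch(batch):
    sent_batches.append(list(batch))  # pretend this is one broker request

class BatchingProducer:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []

    def send(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            send_batch(self.buffer)
            self.buffer = []

p = BatchingProducer(batch_size=3)
for i in range(7):
    p.send(i)
p.flush()  # don't forget the partial final batch
print(sent_batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Seven sends become three broker requests instead of seven, which is exactly where the I/O savings come from; the cost is that message 0 waits until the batch fills before it leaves the producer.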
Kafka supports sending messages in batches, and on top of that it also supports compressing the message set; for example, the producer can compress a batch with the Snappy codec. After the producer compresses, the consumer must decompress. The benefit of compression is a smaller transfer size and less pressure on the network; in big-data processing the bottleneck is often the network rather than the CPU (though compression and decompression do consume some CPU).
So how are compressed and uncompressed messages distinguished? Kafka adds a compression-attribute byte to the message header. The last two bits of this byte indicate the codec used to compress the message; if the last two bits are 0, the message is not compressed.
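The attribute byte described above is plain bit manipulation: mask off the low two bits to read the codec, or overwrite them to set it. In the classic message format the codec ids were 0 = none, 1 = gzip, 2 = snappy.

```python
# Sketch of the compression-attribute byte: the low two bits carry the
# codec id (0 = none, 1 = gzip, 2 = snappy in the classic format).
CODEC_MASK = 0x03  # the last two bits

def codec_of(attributes: int) -> int:
    return attributes & CODEC_MASK

def set_codec(attributes: int, codec: int) -> int:
    return (attributes & ~CODEC_MASK) | (codec & CODEC_MASK)

attrs = set_codec(0, 2)   # mark the message set as snappy-compressed
print(codec_of(attrs))    # 2
print(codec_of(0))        # 0: last two bits zero -> not compressed
```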
Multiple Consumer instances in the same Consumer Group consume different partitions, which is equivalent to queue semantics. Messages within a partition are ordered, and Consumers pull messages from the broker. Kafka does not delete consumed messages; for each partition, it reads and writes disk data sequentially and provides message persistence with O(1) time complexity.
Kafka as a messaging system
Through the notion of parallelism within a topic (the partition), Kafka provides both ordering guarantees and load balancing over a pool of consumer processes. This is done by assigning the partitions of a topic to the consumers in a consumer group, so that each partition is consumed by exactly one consumer in the group. That consumer is then the partition's only reader and consumes its data in order. Since there are many partitions, load is still balanced across many consumer instances. Note, however, that a group cannot have more consumer instances than there are partitions.
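The assignment rule above, each partition to exactly one consumer in the group, can be sketched with a simple round-robin assignment (real Kafka offers several assignment strategies; round-robin is just one):

```python
# Sketch of assigning a topic's partitions to the consumers in a
# group: each partition goes to exactly one consumer. Round-robin
# here; Kafka supports several strategies (range, round-robin, etc.).
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

a = assign(partitions=[0, 1, 2, 3], consumers=["c1", "c2"])
print(a)  # {'c1': [0, 2], 'c2': [1, 3]}
```

With four partitions and two consumers each consumer owns two partitions; with a third consumer one partition would move over; a fifth consumer would sit idle, which is why a group should not have more consumers than partitions.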
Kafka as a storage system
Kafka is also a very good storage system. Data written to Kafka is written to disk and replicated for fault tolerance. Kafka allows producers to wait for acknowledgement, so a write is not considered complete until it is fully replicated, and it is guaranteed to survive even if the server it was written to fails.
Kafka's disk structures scale well: whether a server holds 50 KB or 50 TB of persistent data, Kafka behaves the same.
Because it takes storage seriously and lets clients control where they read from, you can think of Kafka as a special-purpose distributed file system dedicated to high-performance, low-latency commit-log storage, replication, and propagation.
Kafka for stream processing
For complex transformations, Kafka offers the fully integrated Streams API. It lets you build applications that perform non-trivial processing: computing aggregations over streams, or joining streams together.
This tooling helps solve the challenges such applications face: handling out-of-order data, reprocessing input when code changes, performing stateful computations, and so on.
The Streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka itself for stateful storage, and uses the same group mechanism across stream-processor instances for fault tolerance.
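To illustrate the kind of stateful computation the Streams API performs, here is a running-count aggregation over a stream of keyed records, in plain Python rather than the actual Streams API:

```python
# Minimal illustration of a stateful stream aggregation: a running
# count per key, emitting an updated aggregate downstream on each
# record. Plain Python, not the actual Kafka Streams API.
from collections import defaultdict

def count_by_key(stream):
    state = defaultdict(int)   # stands in for the "state store"
    for key, _value in stream:
        state[key] += 1
        yield key, state[key]  # emit the updated aggregate

stream = [("click", 1), ("view", 1), ("click", 1)]
print(list(count_by_key(stream)))
# [('click', 1), ('view', 1), ('click', 2)]
```

In real Kafka Streams the state store is backed by Kafka itself (a changelog topic), which is how the "stateful storage" and fault tolerance mentioned above fit together.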