This article was first published on the WeChat official account: Five Minutes to Learn Big Data.
During interviews I have found that many interviewers like to ask Kafka-related questions. That is not hard to understand: Kafka is the undisputed king of message queues in the big data field, with throughput on the order of 100,000 messages per second on a single machine and millisecond-level latency. Who could resist this natively distributed message queue?
In a recent interview, the interviewer saw Kafka mentioned in the projects on the candidate's resume and asked only about Kafka, nothing else. Let's take a look at the interviewer's eight back-to-back Kafka questions:
(The following answers were compiled from online materials; only about a third of them were actually answered in the interview.)
1. Why use Kafka?
- Buffering and peak shaving: upstream data may arrive in sudden bursts that the downstream cannot handle, or there may not be enough downstream machines to guarantee redundancy. Kafka can act as a buffer in the middle: messages are stored in Kafka temporarily, and downstream services can process them at their own pace.
- Decoupling and extensibility: at the beginning of a project, the specific requirements cannot be fully determined. A message queue can serve as an interface layer that decouples important business processes; as long as both sides program against the agreed data contract, the system gains the ability to scale out.
- Redundancy: one-to-many fan-out is possible. A producer publishes a single message, which can be consumed by multiple services subscribed to the topic, serving several unrelated businesses.
- Robustness: a message queue can absorb a backlog of requests, so even if the consumer-side business goes down for a short period, the main business keeps running normally.
- Asynchronous communication: in many cases users do not want or need to process a message immediately. A message queue provides an asynchronous processing mechanism that lets users put messages into the queue without processing them right away: put as many messages in the queue as you like and process them as needed.
2. How can Kafka re-consume messages that have already been consumed?
The consumer offsets in Kafka are recorded in ZooKeeper (in older versions). If you want to re-consume messages, you can record your own offset checkpoints (n of them) in Redis. When you need to consume messages again, read the checkpoint from Redis and reset the offset in ZooKeeper accordingly; this achieves repeated consumption of the messages.
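On newer clients, offsets live in Kafka itself rather than ZooKeeper, and re-consumption is usually done by seeking the consumer back to a saved position. A minimal sketch, assuming a local broker, a topic named demo-topic, and an offset checkpoint previously saved elsewhere (for example in Redis):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromCheckpoint {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        long savedOffset = 42L; // hypothetical checkpoint read back from Redis or another store

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("demo-topic", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, savedOffset); // rewind to the checkpoint and consume again from there
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}
```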
3. Is Kafka's data stored on disk or in memory, and why is it so fast?
Kafka stores its data on disk.
It is fast because:
- Sequential writes: a hard disk is a mechanical device, and every read or write involves seeking before writing; seeking is a "mechanical action" and is time-consuming. Hard disks therefore "hate" random I/O and "like" sequential I/O. To improve disk read/write speed, Kafka uses sequential I/O.
- Memory Mapped Files: a 64-bit operating system can generally map data files on the order of 20 GB. The mechanism is to use the operating system's pages to map a file directly into physical memory; after the mapping, writes to that memory are synchronized to disk by the operating system (see the sketch after this list).
- Kafka's efficient file storage design: Kafka splits the large file of a partition within a topic into multiple smaller segment files. With multiple small segments it is easy to periodically clean up or delete segments whose messages have already been consumed, reducing disk usage. Index information makes it possible to locate a message quickly and to determine the size of a response. Because all of the index metadata is mapped into memory (memory-mapped files), disk I/O on the segment files can be avoided during lookups; and because the index files are stored sparsely, the space taken by index metadata is significantly reduced.
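A minimal sketch of the memory-mapped-file idea using Java NIO (this is not Kafka's internal code; the file name and sizes are made up for illustration):

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("segment-00000000000000000000.log"); // illustrative segment file name
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map 4 KB of the file into memory; writes to the buffer go to the OS page cache
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buffer.put("hello kafka".getBytes(StandardCharsets.UTF_8));
            buffer.force(); // ask the OS to flush the mapped region to disk
        }
    }
}
```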
Notes:
- One way Kafka improves query efficiency is to segment the data files. Suppose there are 100 messages whose offsets run from 0 to 99, and the data file is divided into 5 segments: the first covers offsets 0-19, the second 20-39, and so on. Each segment is placed in a separate data file named after the smallest offset in that segment, so when looking for a message with a given offset, a search over the file names quickly locates which segment the message is in (see the sketch after this list).
- Index files are created for the segmented data files. Segmentation makes it possible to look for a given offset within a smaller data file, but finding the message for that offset would still require a sequential scan. To further improve search efficiency, Kafka creates an index file for each segment data file; the index file has the same name as its data file, just with the .index extension.
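An illustrative sketch (not Kafka source code) of how a binary search over segment base offsets locates the segment that contains a target offset, using the 5-segment example above:

```java
import java.util.Arrays;

public class SegmentLookup {
    /**
     * Given the sorted base offsets of the segment files (each file is named after
     * the smallest offset it contains), return the base offset of the segment that
     * holds targetOffset.
     */
    static long findSegment(long[] baseOffsets, long targetOffset) {
        int idx = Arrays.binarySearch(baseOffsets, targetOffset);
        // An exact hit means targetOffset is the first message of that segment;
        // otherwise binarySearch returns (-(insertionPoint) - 1) and the message
        // lives in the segment just before the insertion point.
        return idx >= 0 ? baseOffsets[idx] : baseOffsets[-idx - 2];
    }

    public static void main(String[] args) {
        long[] baseOffsets = {0, 20, 40, 60, 80}; // five segments, as in the example above
        System.out.println(findSegment(baseOffsets, 35)); // prints 20
    }
}
```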
4. How does Kafka prevent data loss?
There are three aspects: the producer side, the consumer side, and the broker side.
- No data loss on the producer side
Kafka's ack mechanism: each time data is sent to Kafka, there is a confirmation/feedback mechanism to make sure the message was received properly; the possible states are 0, 1, and -1.
In synchronous mode:
Setting ack to 0 is high risk and generally not recommended. Even with a setting of 1, data can still be lost if the leader goes down. So if the producer side must strictly guarantee no data loss, set it to -1.
In asynchronous mode:
The ack state is still taken into account. In addition, asynchronous mode sends data through a buffer, which is controlled by two thresholds: a time threshold and a message-count threshold. If the buffer fills up before the data has been sent out, there is a configuration option for whether to clear the buffer immediately; it can be set to -1, meaning block permanently so that no more data is produced rather than dropping it. Even with -1 in asynchronous mode, data can still be lost through unsound operations by the programmer, such as kill -9, but that is an exceptional case.
Notes:
ack=0: the producer does not wait for the broker to confirm completion and keeps sending the next (batch of) messages.
ack=1 (default): the producer waits until the leader has successfully received the data and acknowledged it before sending the next message.
ack=-1: the producer only sends the next piece of data after the followers have also confirmed receipt.
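A minimal producer sketch with acks set to all (equivalent to -1), assuming a local broker and a topic named demo-topic; this is one reasonable configuration rather than the only correct one:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.ACKS_CONFIG, "all");               // same as -1: wait for all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE); // retry transient send failures
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);   // avoid duplicates introduced by retries

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // handle or re-send on failure
                        }
                    });
            producer.flush();
        }
    }
}
```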
- No data loss on the consumer side
Data loss is prevented through offset commits: Kafka records the offset of each consumed message, and the next time consumption continues, it resumes from the last committed offset.
The offset information was saved in ZooKeeper before Kafka 0.8 and in a topic from 0.8 onwards. Even if the consumer crashes mid-way, it will find the committed offset when it starts up again, locate the position of the previously consumed messages, and continue consuming from there. Because the offset is not written after every single message is consumed, this situation can lead to repeated consumption, but messages are not lost.
The only exception is when two consumers serving different functions in the program are given the same group id via KafkaSpoutConfig.Builder.setGroupId. That causes the two groups to share the same data: group A consumes the messages in partition1 and partition2 while group B consumes the messages in partition3, so the messages each group sees are incomplete. To ensure each group gets its own complete copy of the message data, do not reuse group ids.
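A minimal consumer sketch with auto-commit disabled and a unique group id, assuming a topic named demo-topic; committing only after processing means a crash leads to re-reading rather than losing messages:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SafeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-service");  // keep group ids unique per application
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);  // commit manually, after processing succeeds
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);       // business logic first ...
                }
                consumer.commitSync();     // ... then commit the offsets
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```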
- No data loss on the brokers in the Kafka cluster
Each partition on a broker usually has a number of replicas. The producer first writes to the leader according to the distribution policy (by partition if a partition is specified, by key if a key is specified, round-robin otherwise), and the followers (replicas) then synchronize the data from the leader. With this backup in place, message data is not lost.
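A sketch of creating a replicated topic with Kafka's AdminClient; the topic name, partition count, and min.insync.replicas value are illustrative choices, not the only valid ones:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, 3 replicas; combined with acks=all on the producer,
            // min.insync.replicas=2 means a write succeeds only after at least
            // 2 replicas have the data.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```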
5. Why use Kafka for data collection?
The collection layer can mainly use Flume and Kafka.
Flume: Flume is a pipeline-style flow tool with many built-in implementations, allowing users to deploy it by configuring parameters and to extend it through its API.
Kafka: Kafka is a persistent distributed message queue. It is a very general-purpose system: you can have many producers and many consumers sharing multiple topics.
By comparison, Flume is a special-purpose tool designed to send data to HDFS and HBase. It has specific optimizations for HDFS and integrates Hadoop's security features.
Therefore, Cloudera suggests using Kafka if the data will be consumed by multiple systems, and Flume if the data is intended only for Hadoop.
6. Will restarting Kafka cause data loss?
- Kafka writes its data to disk, so in general data is not lost.
- However, during the restart, if consumers are consuming messages and the offsets are not committed in time, the data may end up inaccurate (lost or re-consumed).
7. How to handle a Kafka machine going down?
- First consider whether the business is affected
When a Kafka machine goes down, first consider whether the service it provides is affected by the failed machine. If the service is still available and the cluster's disaster-recovery mechanism is well implemented, there is no need to worry about a single machine.
- Node debugging and recovery
To restore the node to the cluster, the main steps are to find the cause of the crash through log analysis, fix it, and bring the node back up.
8. Why doesn't Kafka support read-write separation?
In Kafka, both the producer writing messages and the consumer reading messages interact with the leader replica, which implements a write-to-leader, read-from-leader production and consumption model.
Kafka does not support write-to-leader, read-from-follower, because that model has two obvious drawbacks:
- Data consistency problem: there is inevitably a delay window while data is propagated from the master node to the slave node, and this window can lead to data inconsistency between them. Suppose at some point the value of data item A is X on both the master and the slave node; A is then changed to Y on the master. Before the change is propagated to the slave node, an application reading A from the slave does not see the latest value Y, which is a data inconsistency.
- Latency problem: in a component like Redis, the path from writing data on the master to the data being synchronized to the slave goes through network → master memory → network → slave memory, and the whole process takes some time. In Kafka, master-slave synchronization is even more time-consuming than in Redis: it goes through network → master memory → master disk → network → slave memory → slave disk. For latency-sensitive applications, reading from followers is therefore not a good fit.
Kafka's write-to-leader, read-from-leader model also brings many advantages:
- It simplifies the implementation logic of the code and reduces the chance of errors;
- Load is distributed at a fine granularity and evenly; compared with write-to-master/read-from-slave, the load performance is not only better but also controllable by users;
- There is no impact from replication lag;
- When the replicas are stable, data inconsistency does not occur.