This article was first published on the WeChat official account: Five Minutes to Learn Big Data.

During interviews I have found that many interviewers love to ask Kafka-related questions. That is easy to understand: Kafka is the undisputed king of message queues in the big data field, with per-machine throughput on the order of 100,000 messages per second and millisecond-level latency. Who could resist a natively distributed message queue like that?

In a recent interview, the interviewer saw that a project on the candidate's resume mentioned Kafka and asked about nothing but Kafka. Let's take a look at the interviewer's eight Kafka questions in a row:

(The answers below are compiled from online materials; only about a third of them were actually answered in the interview.)

1. Why use Kafka?

  1. Buffering and peak shaving: upstream data can arrive in sudden bursts that the downstream may not be able to handle, or the downstream may not have enough machines to guarantee redundancy. Kafka acts as a buffer in the middle: messages are stored in Kafka temporarily, and downstream services process them at their own pace.

  2. Decoupling and extensibility: at the start of a project it is impossible to pin down every specific requirement. A message queue can serve as an interface layer that decouples the important business processes. As long as both sides follow the agreed contract and program against the data, extensibility comes for free.

  3. Redundancy: it supports one-to-many fan-out. A producer publishes a message once, and it can be consumed by multiple services that subscribe to the topic, serving multiple unrelated businesses.

  4. Robustness: a message queue can absorb a backlog of requests, so even if the consumer-side business dies for a short time, the main business keeps running normally.

  5. Asynchronous communication: in many cases users do not want or need to process a message immediately. A message queue provides an asynchronous processing mechanism: put a message on the queue without processing it right away, enqueue as many messages as needed, and process them later as required.

2. How can Kafka re-consume messages that have already been consumed?

Kafka consumer offsets used to be kept in ZooKeeper. To re-consume messages, you can record your own offset checkpoints (n of them) in Redis; when you want to replay, read a checkpoint from Redis and reset the consumer's offset in ZooKeeper to that value, so that consumption starts again from that point.
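With the newer Java consumer (which stores offsets in Kafka itself rather than in ZooKeeper), the same idea can be sketched with seek(). This is a minimal illustration only; the broker address, topic name, partition, and the checkpoint offset that would be loaded from Redis are all assumptions:

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ReplayFromCheckpoint {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "replay-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        long checkpointOffset = 1000L; // offset previously saved externally (e.g. in Redis)
        TopicPartition tp = new TopicPartition("my-topic", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp)); // manual assignment, no rebalance
            consumer.seek(tp, checkpointOffset);            // rewind to the checkpoint
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r ->
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}
```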

3. Is Kafka's data on disk or in memory, and why is it so fast?

Kafka stores its data on disk.

It is fast because:

  1. Sequential writes: a hard disk is a mechanical device, and every read or write first seeks and then transfers; the seek is a "mechanical action" and is time-consuming. Hard disks therefore "hate" random I/O and love sequential I/O. To improve disk read/write speed, Kafka uses sequential I/O.
  2. Memory-mapped files: on a 64-bit operating system a process can generally map data files of around 20 GB. Memory mapping uses the operating system's pages to map a file directly into physical memory; after the mapping, writes to that memory are synchronized to disk by the OS. (A minimal sketch of a memory-mapped write follows this list.)
  3. Efficient file storage design: Kafka splits the large file of a topic partition into multiple smaller segment files. With many small segments, files that have already been consumed can easily be cleaned up or deleted periodically, which reduces disk usage. Index information makes it possible to quickly locate a message and to bound the size of each response. Because all of the index metadata is mapped into memory (memory-mapped files), disk I/O on segment files can be avoided during index lookups, and because the index files are stored sparsely, the metadata footprint of the index files is dramatically reduced.
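As a rough illustration of point 2, here is a minimal Java sketch of writing through a memory-mapped file. The file name and mapping size are arbitrary; this only shows the mechanism, not how Kafka itself is implemented:

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapWriteDemo {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(Path.of("segment.log"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map a 1 MB region of the file into memory; writes to the buffer land in the
            // page cache and are flushed to disk by the OS (or explicitly via force()).
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024 * 1024);
            buffer.put("hello kafka".getBytes(StandardCharsets.UTF_8));
            buffer.force(); // ask the OS to flush the mapped pages to disk
        }
    }
}
```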

Notes:

  1. One of the ways Kafka improves lookup efficiency is by segmenting the data files. Suppose there are 100 messages whose offsets run from 0 to 99, and the data file is split into 5 segments: the first covers offsets 0-19, the second 20-39, and so on. Each segment is stored in a separate data file, named after the smallest offset in that segment. When searching for the message with a given offset, you can then quickly determine which segment it is in.
  2. Creating an index file for each segment. Segmenting the data files makes it possible to search for a message with a given offset within a smaller file, but a sequential scan is still needed to find the exact message. To improve lookup efficiency further, Kafka creates an index file for each segment data file; it has the same name as the data file, just with the extension .index. (A simplified lookup sketch follows.)
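To make the segment-plus-sparse-index idea concrete, here is a simplified Java sketch. The offsets, file names, and byte positions are made up, and real Kafka uses its own on-disk index format rather than in-memory maps:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Simplified sketch of Kafka-style segment + sparse-index lookup (illustrative only).
public class SegmentLookupSketch {
    public static void main(String[] args) {
        // Segment files keyed by their base offset (e.g. ...00000.log, ...00020.log, ...).
        NavigableMap<Long, String> segments = new TreeMap<>();
        segments.put(0L, "00000000000000000000.log");
        segments.put(20L, "00000000000000000020.log");
        segments.put(40L, "00000000000000000040.log");

        // Sparse index of the segment starting at offset 20: message offset -> byte position.
        NavigableMap<Long, Long> sparseIndex = new TreeMap<>();
        sparseIndex.put(20L, 0L);
        sparseIndex.put(27L, 4096L);
        sparseIndex.put(35L, 9216L);

        long target = 30L;
        // Step 1: find the segment whose base offset is the largest one <= target.
        String segmentFile = segments.floorEntry(target).getValue();
        // Step 2: in the sparse index, find the nearest indexed offset <= target.
        long startPos = sparseIndex.floorEntry(target).getValue();
        // Step 3 (not shown): scan the segment sequentially from startPos to offset 30.
        System.out.printf("offset %d -> file %s, start scanning at byte %d%n",
                target, segmentFile, startPos);
    }
}
```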

4. How does Kafka keep data from being lost?

There are three angles: the producer side, the consumer side, and the broker side.

  1. No data loss on the producer side

Kafka's ack mechanism: when sending data to Kafka, every message sent has an acknowledgement mechanism to confirm it was received properly. The possible settings are 0, 1, and -1.

In synchronous mode:

Setting ack to 0 is high risk and generally not recommended. Even with ack set to 1, data can still be lost if the leader goes down. So if we want a strict guarantee that no data is lost on the producer side, set it to -1.

In asynchronous mode:

The ack setting still matters. In addition, asynchronous mode uses a buffer to control sending, governed by two thresholds: a time threshold and a message-count threshold. If the buffer fills up before the data has been sent out, there is an option that controls whether the buffer is cleared immediately; it can be set to -1, which blocks forever and stops producing new data. Even in asynchronous mode with ack set to -1, data can still be lost through careless operations such as kill -9, but that is an exceptional case.

Notes:

ack=0: the producer does not wait for the broker to acknowledge, and simply sends the next (batch of) message(s).

ack=1 (default): the producer waits for the leader to acknowledge that the data has been received successfully, then sends the next message.

ack=-1: the producer waits for acknowledgement from the followers before sending the next message.
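As a concrete illustration of the producer side, here is a minimal sketch of a "safe" producer configuration with the Kafka Java client. The broker address, topic name, retry count, and callback handling are assumptions made for the example:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class SafeProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // equivalent to acks=-1: wait for the ISR
        props.put(ProducerConfig.RETRIES_CONFIG, 5);  // retry transient send failures

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // handle/alert instead of dropping silently
                        }
                    });
            producer.flush();
        }
    }
}
```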

  2. No data loss on the consumer side

Data loss is prevented through offset commits: Kafka records the offset of what has been consumed, so the next time consumption continues, it resumes from the last committed offset.

Before Kafka 0.8 the offsets were stored in ZooKeeper; from 0.8 onward they are stored in a Kafka topic. Even if the consumer crashes mid-way, it will find the committed offset when it restarts, locate where consumption previously stopped, and continue from there. Because the offset is not committed after every single message, this can lead to duplicate consumption, but messages are not lost.

The only exception is when two consumers serving different functions in the program are configured with the same group id (for example via KafkaSpoutConfig.Builder.setGroupId). The two then share the same stream of data: one ends up consuming the messages in partition1 and partition2 while the other consumes the messages in partition3, so each of them misses part of the messages and neither sees a complete set. To make sure each logical group gets its own complete copy of the message data, do not reuse group ids.
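To make the consumer-side guarantee concrete, here is a minimal sketch with the Kafka Java consumer that commits offsets only after processing. The broker address, group id, topic, and the process() placeholder are assumptions for the example:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class CommitAfterProcessing {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-service");           // unique per logical group
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> process(r.value())); // business processing first
                // Commit only after processing: a crash before this line means re-delivery
                // (at-least-once, possible duplicates), never lost messages.
                consumer.commitSync();
            }
        }
    }

    private static void process(String value) {
        System.out.println(value); // placeholder for real business logic
    }
}
```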

  3. No data loss on the brokers in the Kafka cluster

Each partition on a broker usually has a configured number of replicas. The producer writes to the leader first, according to the distribution policy (by partition if a partition is specified, by key if a key is specified, otherwise round-robin), and the followers (replicas) then synchronize the data from the leader. With that backup in place, message data is not lost either.
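As an illustration of the broker side, here is a minimal sketch that creates a topic with 3 replicas using the Kafka AdminClient. The broker address, topic name, and the min.insync.replicas value (which pairs with acks=all on the producer) are assumptions for the example:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("my-topic", 3, (short) 3) // 3 partitions, 3 replicas
                    // With acks=all on the producer, at least 2 replicas must persist each write.
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```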

5. Why collect data with Kafka?

The collection layer can mainly use Flume and Kafka.

Flume: Flume is a pipeline-style flow tool with many built-in implementations; users can deploy it through configuration parameters and extend it through its API.

Kafka: Kafka is a persistent, distributed message queue and a very general-purpose system. Many producers and many consumers can share multiple topics.

By comparison, Flume is a special-purpose tool designed to send data to HDFS and HBase. It has specific optimizations for HDFS and integrates with Hadoop's security features.

Therefore, Cloudera suggests using Kafka if the data will be consumed by multiple systems, and Flume if the data is destined only for Hadoop.

6. Will restarting Kafka cause data loss?

  1. Kafka writes its data to disk, so in general data is not lost.
  2. However, while Kafka is being restarted, if consumers are in the middle of consuming messages and the offsets have not been committed in time, the result may be inaccurate (messages lost or consumed twice).

7. How do you handle a Kafka machine going down?

  1. First consider whether the business is affected

When a Kafka machine goes down, the first thing to consider is whether the service it provides is affected by the failed machine. If the service is still available and the cluster's disaster-recovery mechanism is well implemented, there is nothing to worry about at this level.

  2. Node debugging and recovery

To bring the node back into the cluster, the main step is to determine the cause of the failure through log analysis, fix it, and then restore the node.

8. Why doesn't Kafka support read/write separation?

In Kafka, producers write messages to, and consumers read messages from, the leader replica, so the production/consumption model is write-to-master, read-from-master.

Kafka does not support master-write, slave-read (read/write separation) because it has two obvious drawbacks:

  1. Data consistency: there is inevitably a delay window while data propagates from the master node to the slave node, and during that window the data on the two nodes can be inconsistent. Suppose at some moment the value of field A is X on both the master and the slave, and A is then changed to Y on the master; until that change reaches the slave, an application reading A from the slave will not see the latest value Y, which is a data-inconsistency problem.

  2. Latency: in a component like Redis, the path from writing data on the master to it being readable on the slave is network → master memory → network → slave memory, and the whole process takes some time. In Kafka, master-slave synchronization is even more expensive than in Redis: it goes network → master memory → master disk → network → slave memory → slave disk. For latency-sensitive applications, reading from followers is therefore not a good fit.

Kafka's master-write, master-read model, on the other hand, has many advantages:

  1. It simplifies the implementation logic of the code and reduces the chance of bugs;
  2. Load is distributed at a fine granularity and evenly; compared with master-write/slave-read, it is not only more efficient but also controllable by the user;
  3. Reads are not affected by replication delay;
  4. When the replicas are stable, there is no data-inconsistency problem.

Search for the official account "Five Minutes to Learn Big Data" to study big data technology in depth.
