Big-company interviewers love asking about Kafka: I was stumped by eight Kafka questions in a row

Learn big data in five minutes 2021-01-14 15:46:47

This article was first published on the official account "Five minutes to learn big data".

In interviews I have found that many interviewers like to ask Kafka-related questions. That is not hard to understand: Kafka is the undisputed king of message queues in the big data field, with throughput on the order of 100,000 messages per second on a single machine and millisecond-level latency. Who could fail to love a message queue that was born distributed?

In a recent interview, one interviewer saw Kafka listed in the projects on my résumé and asked about nothing but Kafka. Here are the eight consecutive Kafka questions the interviewer asked:

(The answers below are compiled from online materials; in the actual interview I managed to answer only about a third of them.)

1. Why use Kafka?

  1. Buffering and peak shaving: When upstream data arrives in bursts, downstream services may be unable to keep up, or may not have enough machines to provide headroom. Kafka sits in the middle as a buffer: messages are held in Kafka for the time being, and downstream services consume them at their own pace.

  2. Decoupling and scalability: At the start of a project, the specific requirements cannot be pinned down. A message queue can serve as an interface layer that decouples important business processes. As long as both sides follow the agreed contract and program against the data, the system gains room to scale.

  3. Redundancy: Delivery can be one-to-many. A message published by one producer can be consumed by multiple services that subscribe to the topic, serving multiple unrelated businesses.

  4. Robustness: A message queue can accumulate requests, so even if the consumer side goes down for a short time, the main business keeps running normally.

  5. Asynchronous communication: Often the user does not want or need to process a message immediately. A message queue provides an asynchronous processing mechanism: a message can be put on the queue without being processed right away. Put as many messages on the queue as you like, then process them when needed.

2. How can Kafka re-consume messages that have already been consumed?

In older versions of Kafka, consumer offsets are stored in ZooKeeper. To consume messages again, you can record your own offset checkpoints (n of them) in Redis. When you want to replay messages, read the checkpoint from Redis and reset the offset in ZooKeeper to it; consumption then resumes from that point and the messages are consumed again. Newer consumers store offsets in Kafka itself, but the idea is the same: reset the consumer's offset to the saved checkpoint.
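
The checkpoint-and-reset idea can be sketched without a running cluster. The snippet below is a minimal in-memory model, not the Kafka client API: `InMemoryPartition`, `ReplayableConsumer`, and the checkpoint name are hypothetical, and a plain dict stands in for Redis. With a real client, the rewind step would be the consumer's seek operation or an offset reset for the group.

```python
from typing import Dict, List


class InMemoryPartition:
    """Toy stand-in for one Kafka partition: an append-only list of messages."""

    def __init__(self, messages: List[str]) -> None:
        self.messages = messages


class ReplayableConsumer:
    """Reads a partition sequentially and can rewind to a saved checkpoint."""

    def __init__(self, partition: InMemoryPartition) -> None:
        self.partition = partition
        self.position = 0  # offset of the next message to read
        self.checkpoints: Dict[str, int] = {}  # stand-in for the Redis store

    def poll(self, max_records: int) -> List[str]:
        """Return up to max_records messages and advance the position."""
        end = self.position + max_records
        batch = self.partition.messages[self.position:end]
        self.position += len(batch)
        return batch

    def save_checkpoint(self, name: str) -> None:
        """Remember the current position under a name."""
        self.checkpoints[name] = self.position

    def seek_to_checkpoint(self, name: str) -> None:
        """Reset the position so the following polls replay old messages."""
        self.position = self.checkpoints[name]


partition = InMemoryPartition([f"msg-{i}" for i in range(6)])
consumer = ReplayableConsumer(partition)

first = consumer.poll(2)
consumer.save_checkpoint("before-second-batch")
second = consumer.poll(3)
consumer.seek_to_checkpoint("before-second-batch")
replayed = consumer.poll(3)  # the same three messages as `second`
print(first, second, replayed)
```

Keeping several named checkpoints, as the answer suggests, lets you choose how far back to replay.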

3. Is Kafka's data stored on disk or in memory, and why is it so fast?

Kafka stores its data on disk.

It is fast because:

  1. Sequential writes: A hard disk is a mechanical device. Every read or write is a seek followed by the transfer, and the seek is a "mechanical movement", which is the time-consuming part. Hard disks therefore "hate" random I/O and like sequential I/O. To speed up disk reads and writes, Kafka uses sequential I/O.
  2. Memory-mapped files: On a 64-bit operating system, data files of 20 GB or more can generally be mapped. The technique uses the operating system's pages to map a file directly into physical memory. Once the file is mapped, your operations on that memory are synchronized to the hard disk by the operating system.
  3. Efficient file storage design: Kafka splits the one large file of a partition in a topic into many smaller segment files. With multiple small segments, it is easy to periodically clean up or delete files that have already been consumed, which reduces disk usage. The index makes it possible to locate a message quickly and to determine the size of the response. Mapping all of the index metadata into memory (memory-mapped files) avoids disk I/O on the segment files. The index files are stored sparsely, which greatly reduces the space taken up by index metadata.

Notes:

  1. One way Kafka improves lookup efficiency is to split the data file into segments. Suppose there are 100 messages with offsets 0 to 99, and the data file is split into 5 segments: the first holds offsets 0-19, the second 20-39, and so on. Each segment is stored in its own data file, named after the smallest offset in that segment. When looking for the message with a given offset, you can therefore determine which segment holds it.
  2. Indexing the data files: Segmentation makes it possible to look for the message with a given offset in a smaller data file, but finding it still requires a sequential scan. To speed up the search further, Kafka builds an index file for each segment. The index file has the same name as the data file, but with the extension .index.
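
To make the two-step lookup concrete, here is a small sketch under simplifying assumptions: the segment boundaries are the five from the example in note 1, the index entries and byte positions are made up for illustration, and the index stores absolute offsets (Kafka's real index stores offsets relative to the segment's base offset). It first picks the segment by base offset, then uses the sparse index to decide where the sequential scan should start.

```python
import bisect
from typing import List, Tuple

# Base offsets of the five segments in the example: 0-19, 20-39, ..., 80-99.
SEGMENT_BASE_OFFSETS = [0, 20, 40, 60, 80]

# Hypothetical sparse index for the segment starting at offset 20:
# (message offset, byte position in the data file). Only some offsets appear.
SPARSE_INDEX_20: List[Tuple[int, int]] = [(20, 0), (26, 612), (32, 1240), (38, 1855)]


def segment_file_name(base_offset: int) -> str:
    """Segment files are named after their smallest offset, padded to 20 digits."""
    return f"{base_offset:020d}.log"


def find_segment(offset: int) -> int:
    """Return the base offset of the segment that holds `offset` (binary search)."""
    if offset < SEGMENT_BASE_OFFSETS[0]:
        raise ValueError(f"offset {offset} is before the first segment")
    i = bisect.bisect_right(SEGMENT_BASE_OFFSETS, offset) - 1
    return SEGMENT_BASE_OFFSETS[i]


def scan_start_position(index: List[Tuple[int, int]], offset: int) -> int:
    """Return the byte position of the last index entry at or before `offset`.

    The reader scans sequentially from there, so only a short stretch of the
    segment is read instead of the whole file.
    """
    offsets = [entry_offset for entry_offset, _ in index]
    i = bisect.bisect_right(offsets, offset) - 1
    return index[i][1] if i >= 0 else 0


base = find_segment(30)
print(segment_file_name(base))
print(scan_start_position(SPARSE_INDEX_20, 30))
```

Because the index is sparse, offset 30 has no entry of its own; the scan starts at the entry for offset 26 and reads forward a few messages.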

4. How does Kafka guarantee that data is not lost?

There are three sides to this: the producer side, the consumer side, and the broker side.

  1. No data loss on the producer side

Kafka's ack mechanism: when Kafka sends data, every send has an acknowledgment mechanism that confirms the message was received properly. The possible settings are 0, 1, and -1.

In synchronous mode:
Setting ack to 0 is high risk and generally not recommended. Even with 1, data can be lost if the leader goes down. To strictly guarantee that producer-side data is not lost, set it to -1.

In asynchronous mode:
The ack setting matters here too. In addition, asynchronous mode has a buffer that controls how data is sent, governed by two thresholds: a time threshold and a message-count threshold. If the buffer is full and the data has not been sent yet, an option controls whether the buffer is cleared immediately. It can be set to -1, which blocks permanently so that no more data is produced. Even with -1, data can still be lost in asynchronous mode through careless operation by the programmer, such as kill -9, but that is a special exception.

Notes:
ack=0: the producer does not wait for any confirmation from the broker before sending the next message (or batch).
ack=1 (the default before Kafka 3.0): the producer waits until the leader has successfully received the data and confirmed it, then sends the next message.
ack=-1: the producer sends the next message only after the followers (the in-sync replicas) have confirmed as well.
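
The difference between the three settings is easiest to see in a toy model. The sketch below is not the producer API: `Replica`, `send`, `survives_leader_failure`, and `demo` are hypothetical names, and the model assumes that followers have copied the message before the acknowledgment only when ack is -1. It reports two things for each setting: whether the producer believes the send succeeded, and whether the message would survive if the leader failed right afterwards.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Replica:
    """Toy replica: just the list of messages it has stored."""
    log: List[str] = field(default_factory=list)


def send(message: str, acks: int, leader: Replica, followers: List[Replica],
         leader_up: bool = True) -> bool:
    """Return True if the producer believes the send succeeded."""
    if acks not in (0, 1, -1):
        raise ValueError("acks must be 0, 1 or -1")
    if leader_up:
        leader.log.append(message)
    if acks == 0:
        return True   # fire and forget: no confirmation is awaited
    if not leader_up:
        return False  # no ack arrives, so the producer knows it must retry
    if acks == -1:
        # Assumption of this model: followers catch up before the ack is sent.
        for follower in followers:
            follower.log.append(message)
    return True


def survives_leader_failure(message: str, followers: List[Replica]) -> bool:
    """If the leader dies now, does any follower still hold the message?"""
    return any(message in follower.log for follower in followers)


def demo(acks: int, leader_up: bool = True) -> Tuple[bool, bool]:
    """Send one message and return (acknowledged, survives_leader_failure)."""
    leader, followers = Replica(), [Replica(), Replica()]
    acknowledged = send("order-1", acks, leader, followers, leader_up)
    return acknowledged, survives_leader_failure("order-1", followers)


fire_and_forget = demo(acks=0, leader_up=False)  # looks sent, stored nowhere
leader_only = demo(acks=1)                       # acked, but only on the leader
all_replicas = demo(acks=-1)                     # acked and replicated
print(fire_and_forget, leader_only, all_replicas)
```

In this model only ack=-1 combines a positive acknowledgment with a copy that outlives the leader, which is why the answer recommends it when loss is unacceptable.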

  2. No data loss on the consumer side

Data is protected from loss through offset commits. Kafka records the offset of what has been consumed, and the next time consumption continues, it picks up from the last committed offset.

In early versions of Kafka (around 0.8 and before), offset information was stored in ZooKeeper; later versions store it in an internal Kafka topic, __consumer_offsets. Even if a consumer goes down mid-run, it finds the stored offset when it restarts, locates the position of the last consumed message, and continues from there. Because the offset is not written after every single message, this can lead to repeated consumption, but messages are not lost.
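
The "duplicates but no loss" behaviour follows directly from committing offsets only every so often. Here is a minimal simulation, assuming a hypothetical `run_consumer` helper and an in-memory list as the log; it is a sketch of the reasoning, not of the consumer client.

```python
from typing import List, Optional, Tuple


def run_consumer(log: List[str], committed_offset: int, commit_every: int,
                 crash_after: Optional[int] = None) -> Tuple[List[str], int]:
    """Process messages from committed_offset and return (processed, committed).

    The offset is committed after every `commit_every` processed messages.
    If `crash_after` is set, the consumer dies after processing that many
    messages in this run, and any progress since the last commit is forgotten.
    """
    processed: List[str] = []
    position = committed_offset
    while position < len(log):
        processed.append(log[position])
        position += 1
        if crash_after is not None and len(processed) == crash_after:
            return processed, committed_offset  # crash before the next commit
        if len(processed) % commit_every == 0:
            committed_offset = position
    return processed, position  # a clean shutdown commits the final position


log = [f"m{i}" for i in range(10)]

# First run: commits after m3, then crashes after processing m5.
first_run, committed = run_consumer(log, 0, commit_every=4, crash_after=6)

# Restart: resume from the last committed offset, so m4 and m5 are seen again.
second_run, committed_after_restart = run_consumer(log, committed, commit_every=4)
print(committed, second_run[:2])
```

Every message is processed at least once across the two runs; the price is that m4 and m5 are processed twice, so downstream processing should tolerate duplicates.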

The one exception is when two consumers that serve different purposes in our program are given the same group id, for example by calling KafkaSpoutConfig.builder.setGroupid with the same groupid. The two then share a single stream of data: consumer A might consume the messages in partition1 and partition2 while consumer B consumes those in partition3. Each of them misses the messages consumed by the other, so both see incomplete data. To make sure each group gets its own complete copy of the messages, do not reuse a groupid.
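
Why a shared group id splits the data can be shown with a small model of partition assignment. The sketch assumes a range-style split within each group; `range_assign`, `assign_by_group`, and the application and group names are hypothetical, and the real assignment is done by the group coordinator and the configured assignor.

```python
from typing import Dict, List


def range_assign(partitions: List[int], members: List[str]) -> Dict[str, List[int]]:
    """Split partitions into contiguous ranges, one range per group member."""
    members = sorted(members)
    per_member, extra = divmod(len(partitions), len(members))
    assignment: Dict[str, List[int]] = {}
    start = 0
    for i, member in enumerate(members):
        count = per_member + (1 if i < extra else 0)
        assignment[member] = partitions[start:start + count]
        start += count
    return assignment


def assign_by_group(partitions: List[int],
                    group_of: Dict[str, str]) -> Dict[str, List[int]]:
    """Assign partitions independently within each consumer group."""
    groups: Dict[str, List[str]] = {}
    for consumer, group_id in group_of.items():
        groups.setdefault(group_id, []).append(consumer)
    result: Dict[str, List[int]] = {}
    for members in groups.values():
        result.update(range_assign(partitions, members))
    return result


partitions = [1, 2, 3]

# Same group id: the two applications split the partitions between them.
shared = assign_by_group(partitions, {"app-A": "same-group", "app-B": "same-group"})

# Distinct group ids: each application receives every partition.
separate = assign_by_group(partitions, {"app-A": "group-a", "app-B": "group-b"})
print(shared, separate)
```

With the shared id, app-A ends up with partitions 1 and 2 and app-B with partition 3, which matches the situation described above.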

  3. No data loss on the brokers in the Kafka cluster

Each partition on a broker is usually configured with a replication factor (the number of replicas). The producer first chooses a partition according to the distribution policy (use the partition if one is specified; otherwise hash the key if one is given; otherwise round-robin) and writes to that partition's leader. The followers (replicas) then synchronize the data from the leader. With these backups in place, the message data is also protected from loss.
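
The distribution policy in parentheses is a three-way decision that fits in a few lines. The sketch below uses a hypothetical `ToyPartitioner` class and `zlib.crc32` as a stand-in hash; the Java client hashes keys with murmur2, and recent clients batch keyless records with a sticky strategy rather than strict round-robin, so treat this as an illustration of the rule only.

```python
import itertools
import zlib
from typing import Optional


class ToyPartitioner:
    """Chooses a partition: explicit partition, then key hash, then round-robin."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._counter = itertools.count()  # drives the round-robin fallback

    def choose(self, key: Optional[bytes] = None,
               partition: Optional[int] = None) -> int:
        if partition is not None:
            if not 0 <= partition < self.num_partitions:
                raise ValueError(f"partition {partition} does not exist")
            return partition  # the caller named the partition
        if key is not None:
            # Stand-in hash: the same key always lands on the same partition.
            return zlib.crc32(key) % self.num_partitions
        return next(self._counter) % self.num_partitions  # no key: take turns


partitioner = ToyPartitioner(num_partitions=3)
print(partitioner.choose(partition=2))
print(partitioner.choose(key=b"user-42") == partitioner.choose(key=b"user-42"))
print([partitioner.choose() for _ in range(4)])
```

Keyed messages keep their relative order because they always go to the same partition, while keyless messages are spread evenly.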

5. Why choose Kafka for data collection?

The collection layer can mainly use Flume or Kafka.

Flume: Flume is a pipeline-style tool. It ships with many default implementations that users deploy through parameters, and it offers an API for extension.

Kafka: Kafka is a persistent distributed message queue. It is a very general-purpose system: many producers and many consumers can share multiple topics.

By comparison, Flume is a special-purpose tool designed to send data to HDFS and HBase. It has specific optimizations for HDFS and integrates Hadoop's security features.

Therefore, Cloudera recommends using Kafka if the data will be consumed by multiple systems, and Flume if the data is destined for Hadoop.

6. Can restarting Kafka cause data loss?

  1. Kafka writes data to disk, so data is generally not lost.
  2. However, if consumers are consuming messages while Kafka restarts and the offsets are not committed in time, the data may end up inaccurate (messages lost or consumed again).

7. How do you handle a Kafka outage?

  1. First consider whether the business is affected

When Kafka goes down, first consider whether the services it provides are affected by the failed machine. If the services are still available and the cluster's disaster recovery mechanism is well implemented, there is no need to worry on this front.

  2. Troubleshoot and recover the node

To restore a cluster node, the main step is to analyze the logs to find out why the node went down, fix the cause, and bring the node back.

8. Why doesn't Kafka support read/write separation?

In Kafka, both the producer writing messages and the consumer reading messages interact with the leader replica, which makes it a leader-write, leader-read production and consumption model.
Kafka does not support leader-write, follower-read, because that model has 2 obvious drawbacks:

  1. Data consistency: There is inevitably a delay window while data travels from the master node to the slave node, and during this window the data on the two nodes can be inconsistent. Suppose that at some moment the value of item A is X on both the master and the slave, and the master then changes A to Y. Until the change is propagated to the slave, an application reading A from the slave gets a value that is not the latest Y, which is a data inconsistency.

  2. Latency: In a component like Redis, going from writing data on the master to having it synchronized on the slave passes through these stages: network → master memory → network → slave memory. The whole process takes some time. In Kafka, master-slave synchronization is even more time-consuming than in Redis, because it passes through network → master memory → master disk → network → slave memory → slave disk. For latency-sensitive applications, reading from the slave is not a good fit.
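
The consistency drawback in point 1 can be replayed step by step. This is a deliberately tiny model: `Node` and `replicate` are hypothetical, and an explicit function call stands in for the delayed synchronization between the two nodes.

```python
from typing import Dict


class Node:
    """Toy key-value replica."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}


def replicate(master: Node, slave: Node) -> None:
    """Copy the master's current state to the slave (the delayed sync step)."""
    slave.data = dict(master.data)


master, slave = Node(), Node()
master.data["A"] = "X"
replicate(master, slave)      # both nodes now hold A = X

master.data["A"] = "Y"        # the write lands on the master only
stale_read = slave.data["A"]  # a read from the slave inside the delay window
replicate(master, slave)      # the change finally reaches the slave
fresh_read = slave.data["A"]
print(stale_read, fresh_read)
```

Any read served by the slave between the write and the next synchronization returns the old value, and the longer replication path described in point 2 makes that window wider.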

Kafka's leader-write, leader-read model, by contrast, has many advantages:

  1. It simplifies the implementation logic of the code and reduces the chance of errors;
  2. The load is split at a fine granularity and spread evenly; compared with master-write, slave-read, it not only balances load better but is also controllable by the user;
  3. There is no replication-delay effect;
  4. As long as the replicas are stable, there is no data inconsistency.

This article was written by [Learn big data in five minutes]. Please include a link to the original when reposting. Thank you.
