[big data beeps 20210122] the interviewer asked me if HDFS lost data? I slapped the article in his face

Wang Zhiwu 2021-01-23 18:41:20
big data beeps interviewer hdfs


Data consistency

HDFS As a distributed file system, how to ensure data consistency in a distributed environment .HDFS in , The stored files will be divided into several uniform size block Distributed storage on different machines , need NameNode Node to manage the data , Store these block The node of is called DataNode,NameNode It's used to manage this metadata .

NameNode Ensure metadata consistency

When the client uploads the file ,NameNode First of all edits log The operation log of recording metadata in the file . meanwhile ,NameNode A persistence will be done on the disk (fsimage file ): It corresponds to the data in memory , How to ensure the consistency with the data in memory ? stay edits logs Before full on memory and fsimage We need to synchronize our data , Merge edits logs and fsimage The data on the , then edits logs The data on can be cleared . And when edits logs When it's full , File upload cannot be interrupted , So it's going to be a new file edits.new Write the data on , And the old one edits logs The merge operation of will be performed by secondNameNode To complete , That is to say checkpoint operation .

checkpoint There are generally two kinds of restrictions on the triggering of , One is edits logs Size limit for , namely fs.checkpoint.size To configure ; One is the designated time , namely fs.checkpoint.period To configure . According to the regulation , The size limit is the priority , Regulations edits Once the file exceeds the threshold , Whether the maximum time interval is reached or not , Will force checkpoint.

SecondaytNameNode yes HA(High Available High availability ) A solution for , But hot standby is not supported , The configuration can be .SecondaryNameNode Execution process : from NameNode Download metadata information (fsimage、edits), And then combine the two , Generate a new fsimage, Keep it locally , And push it to NameNode, Replace old fsimage.( notes :SecondaryNameNode Is only found in Hadoop1.0 in ,Hadoop2.0 Not in the above version , But in the pseudo distribution pattern, there are SecondaryNameNode Of , In cluster mode, there is no SecondaryNameNode Of )

SecondaryNameNode Workflow steps :

  • SecondaryNameNode notice NameNode Switch edits file
  • SecondaryNameNode from NameNode get fsimage and edits( adopt http)
  • SecondaryNameNode take fsimage Load memory , And then start merging edits. Also merge edits The operation needs to meet certain conditions , There are two conditions :
    1)fs.checkpoint.period Specify twice checkpoint The maximum time interval , Default 3600 second
    2)fs.checkpoint.size Regulations edits Maximum value of file , Once the value is exceeded ( The default size is 64M) Then force checkpoint, Whether the maximum time interval is reached or not .)
  • SecondaryNameNode New fsimage Send back to NameNode
  • NameNode With the new fsimage Replace old fsimage

The checksum

HDFS The checksums are calculated for all data written (checksum), And verify the checksums when reading the data . Computes the check sum for the specified number of bytes . The default number of bytes is 512 byte , Can pass io.bytes.per.checksum Property settings . adopt CRC-32 After coding is 4 byte .

DataNode Responsible for validation before saving data checksum.client The data and the checksums are sent together to a group of DataNode In a queue of , the last one DataNode Responsible for verifying checksum. If validation fails , Will throw out a ChecksumException. The client needs to handle this exception . The client from DataNode When reading data , It will also verify checksum. Every DataNode All saved a validation checksum Log . Every time the client successfully verifies a data block , Will tell DataNode ,DataNode The log will be updated .

Every DataNode It will also run one in a background thread DataBlockScanner, Verify this regularly DataNode All data blocks on . In use hadoop fs get Command to read a file , It can be used -ignoreCrc Ignore validation . If it's through FileSystem API When reading , Can pass setVerifyChecksum(false), Ignore validation .Hadoop Medium LocalFileSystem Client side verification and , When writing files , Will create a directory named .filename.crc Hidden files , If you want to disable the checksum function , It can be used RawLocalFileSystem Instead of LocalFileSystem .

Configuration conf = ...
FileSystem fs = new RawLocalFileSystem();
fs.initialize(null, conf);

Or set it directly fs.file.impl The attribute is org.apache.hadoop.fs.RawLocalFileSystem, This will globally disable checksum.LocalFileSystem Internal use ChecksumFileSystem complete checksum Work . adopt ChecksumFileSystem You can add checksums .

HA High availability

Redundant copies

HDFS One way to deal with node failure is data redundancy , That is to do multiple backup of data , stay HDFS You can set the number of backups through the configuration file , If you don't set it , The default number is 3. Pay attention to the parameters dfs.replication.min and dfs.replication The difference between : While a block is being written , As long as at least dfs.replication.min Number of copies ( The default is 1), The write operation will succeed , Until its target number of copies is reached dfs.replication( The default is 3)
HDFS How copies are distributed on the rack ?

HDFS A strategy called frame awareness is used to improve the reliability of data 、 Availability and utilization of network bandwidth . in the majority of cases ,HDFS The copy factor is set to... By default 3(dfs.replication),HDFS The storage strategy is to store a copy on the local rack node , A copy is stored on another node in the same rack , The last copy is on a node in a different rack . This strategy reduces data transfer between racks , Improve the efficiency of write operation . Rack errors are far less than node errors , So this strategy will not affect the reliability and availability of data . meanwhile , Because data blocks are only stored in two different racks , Therefore, this strategy reduces the total network transmission bandwidth when reading data .

Rack perception

Usually , large Hadoop Clusters are organized in the form of racks , The network condition between different nodes in the same rack is better than that between different racks .Namenode Try to keep block copies in different racks to improve fault tolerance
HDFS How to know which Datanode On which rack ?

HDFS A method called “ Rack perception ” The strategy of .HDFS It can't automatically determine each Datanode Network topology of , It has to be configured dfs.network.script Parameter to determine the rack where the node is located . The document provides IP->rackid Translation ,NameNode Through this, we can get each Datanode Node rackid.


Detect node failure using “ heartbeat ”. Every DataNode The nodes periodically move toward NameNode Send a heartbeat . Network partitioning may lead to some DataNode Follow NameNode Lost contact .NameNode This is detected by the absence of a heartbeat signal , And will no longer send these heartbeat signals in the near future DataNode Marked as down , No more new IO Ask to send them .

Any storage down DataNode The data on will no longer be valid .DataNode The outage of may cause the copy coefficient of some data blocks to be lower than the specified value ,NameNode Constantly detect these data blocks that need to be copied , Once found, start the replication operation .

In the following cases , It may need to be reproduced :
a) Some DataNode Node failure
b) A copy is damaged
c) DataNode Hard disk error on
d) The redundancy factor of files increases

DataNode And NameNode Heartbeat report content

A data block in DataNode Stored on disk as a file , There are two files , One is the data itself , One is that the metadata includes the length of the data block 、 Checksums and timestamps of block data . When DataNode Read block When , It will also have a checksum, If you calculate the checksum, And block The values are different when created , Explain the block Has been damaged . If the block is damaged ,Client Will read other DataNode Upper block.NameNode Mark that the block is damaged , And then copy it block Achieve the expected number of file backups .

DataNode Starts to NameNode register , The heartbeat is every 3 Seconds at a time , The purpose is to tell namenode Your own survival and available space . Heartbeat returns results with NameNode To the DataNode Commands such as copying block data to another machine , Or delete a block of data .
confirm datanode Downtime
(1) Stop sending heartbeat reports
Default continuous 10 A heartbeat is continuous if it cannot be received 10*3=30s Continuous . this 10 In the middle of the second time, as long as there is 1 I received a new heartbeat recording .

(2)namenode Send check

In succession 10 Time NameNode Did not receive DataNode After the heart rate report ,NameNode conclude DataNode Maybe it's down ,NameNode Active direction DataNode Send check NameNode Will turn on the backstage guard ( Blocking ) process Waiting for the results of the inspection .NameNode Check DataNode Time for : Default 5min.

Default check 2 Every time 5min continuity 2 Secondary inspection (10min) There's no response. Confirm DataNode It's down. ( Send a wait 5 minute ).NameNode Confirm one DataNode The total time required for downtime : 103s+300s2=630s.

When DataNode When it starts , It will traverse the local file system , Produce a copy HDFS List of correspondence between data block and local file , This is the report block (BlockReport), The report block contains DataNode A list of all the blocks on the . Every hour DataNode Default direction NameNode send out block report. report DataNode The data nodes of .

safe mode

  • NameNode After startup, it will enter a special state called safe mode . In safe mode NameNode It's read-only to the client .NameNode From all of DataNode Receive heartbeat and block status reports (blockreport).

  • Each data block has a specified minimum number of copies (dfs.replication.min), When NameNode The detection confirms that the number of copies of a data block reaches this minimum value , Then the data block will be considered as replica security (safely replicated) Of .

  • In a certain percentage ( This parameter is configured in dfs.safemode.threshold.pct, The default value is 99.9%) The data block of is NameNode After checking and confirming that it is safe , In a few minutes ( This parameter is configured in dfs.safemode.extension, The default value is 30 second ),NameNode Will exit safe mode state . Next NameNode It will determine which data blocks have not reached the specified number of copies , And copy these data blocks to other DataNode On .

The checksum

  • HDFS The checksums are calculated for all data written (checksum), And verify when reading data .Datanode After receiving data from the client or copying other Datanode When the data is , After verifying the data, the checksums are stored . The client that is writing data sends the data and its checksums to a series of Datanode A pipeline made up of , The last one in the pipeline Datanode Responsible for verifying the checksums . If Datanode Error detected , The client will receive a ChecksumException
  • The client from Datanode When reading data , The checksums are also verified , Connect them with Datanode Compare the checksums stored in . Every Datanode A check sum log for validation is persisted , So it knows the last validation time of each data block . After the client successfully verifies a data block , I'll inform this Datanode Update the secondary log
  • Besides , Every Datanode It will also run in a background called DataBlockScanner The process periodically validates stored in this Datanode All data blocks on . When an error is detected ,Namenode Mark the damaged block as damaged , And then from the other Datanode Copy a copy of this data , Finally, make the number of copies of data reach the specified number

The recycle bin

When a user or application deletes a file , This document did not immediately from HDFS Delete in . actually ,HDFS This file rename will be transferred to /trash Catalog . As long as the documents are still there /trash Directory , The file can be recovered quickly .

The file in /trash The time saved in is configurable ( Configuration parameters fs.trash.interval), When the time is over ,Namenode The file will be removed from the namespace . Deleting the file will release the data block related to the file . Be careful , Delete files from users to HDFS There will be some time delay between the increase of free space .

Metadata protection

FsImage and Editlog yes HDFS The core data of . If these files are damaged , Whole HDFS All of them will be invalid . thus ,Namenode It can be configured to support maintenance of multiple FsImage and Editlog Copy of . Any right FsImage perhaps Editlog Modification of , Will be synced to their copies . This multi copy synchronization operation may be reduced Namenode Number of namespace transactions processed per second . But the price is acceptable , Because even if HDFS Application is data intensive , They're not metadata intensive either . When Namenode When restarting , It will pick the latest complete FsImage and Editlog To use

Snapshot mechanism

Snapshot supports the replication and backup of data at a specific time . Using snapshots , It can make HDFS Recover to a known point in time when data is corrupted .HDFS Snapshot is not supported yet , But the plan is to support it in a future release .

Fault tolerance mechanism

There are three main types of faults , For these three fault types ,HDFS Different fault detection mechanisms are provided :

in the light of DataNode Failure problem ,HDFS Using the heartbeat mechanism ,DataNode Regularly send to NameNode Send heartbeat message ,NameNode According to the heartbeat information DataNode Survival
In view of the problem that the network fails to send and receive data ,HDFS Provides ACK The mechanism of , After the sender sends the data , If not received ACK And it's still true after many retries , It's a network failure
For data corruption , all DataNode Regularly to NameNode Send a list of blocks stored in itself , At the same time, the data is transmitted and the total check code is sent ,NameNode Judge whether the data is lost or damaged in turn

Read fault tolerance

When reading fails :

  • DFSInputStream Will try to connect to the next one in the list DataNode , Record the exception node at the same time
  • DFSInputStream It will also check the data obtained checknums . If it's damaged block Found out ,DFSInputStream Just try to get a backup from another DataNode Read the data in the backup block
  • Finally, the synchronization of abnormal data is also done by NameNode To arrange the completion of

Write fault tolerance

When a computer in the writing process DataNode It's down. :

First pipeline Shut down , What's left in the confirmation queue package It will be added to the starting position of the data queue and will not be sent any more , To prevent the node downstream of the failed node from losing data again
then , Stored in normal DataNode Upper block Will be assigned a new identity , And pass the identity to NameNode, So that the fault DataNode After recovery, you can delete the part of the invalid data block that you have stored
The failed node will start from pipeline Remove , And then there are two good ones left DataNode Will form a new pipeline , The rest of this block My bag will continue to be written into pipeline It's normal DataNode in
Last ,NameNode It will be found that the node outage causes some block The number of block backups of is less than the specified number of backups , here NameNode The number of backup nodes will be arranged to meet dfs.replication Configuration requirements

DataNode invalid

stay NameNode Block tables and DataNode Two tables . The block table stores a block of data ( Including copies ) Where DataNode,DataNode The table stores each DataNode List of data blocks saved in . because DataNode Will periodically give NameNode Send your own data block information , therefore NameNode The data block table and DataNode surface . If you find someone DataNode Data block error on ,NameNode The block is removed from the block table ; If you find someone DataNode invalid ,NameNode Two tables will be updated .NameNode And periodically scan the block table , If it is found that the backup number of a database in the data block table is lower than the set backup number , Will coordinate from other DataNode Copy data to another DataNode Complete the backup on .

HDFS Copy placement strategy

If the write request appears in a DataNode On , The first copy will be stored in the current DataNode On the same machine , If the machine is overloaded , You can put the backup on a random machine in the same rack . The second copy is then stored in a different rack than the first one , The third copy will be randomly stored in any rack where the second copy is located DataNode On .

In the case of three copies , The first copy is on the same machine as the original data , The other two copies are placed on random machines in other racks . Such a setting can make both performance and disaster tolerance , Priority to get backup data from the same machine , Reduce data transmission overhead ; In the event that the machine goes down , Get backup data from another rack , Avoid collective downtime of machines in the same rack .

HDFS Of HA framework

All of the above fault tolerance is based on DataNode The problem of failure is considered , however NameNode There is a single point of failure in itself , If NameNode Something goes wrong , Then the whole cluster will go down directly . therefore HDFS Provides HA The architecture of , For a typical HA In terms of clusters ,NameNode Will be configured on two separate machines , At any time , One NameNode be in Active state , And another one. NameNode be in Standby state ,Active State of NameNode Will respond to requests from all clients in the cluster ,Standby State of NameNode Just as a copy , Make sure to provide a quick transfer when necessary , Make the upper layer to NameNode There's no sense in switching ,Standby NameNode And Active NameNode It should be synchronized at all times , stay Active NameNode and Standby NameNode There should be a shared log storage place between them ,Active NameNode hold EditLog Write to the shared storage log ,Standby NameNode Read the log and execute , bring Active NameNode and Standby NameNode In memory HDFS Metadata keeps in sync .


Welcome to your attention ,《 The way of big data becoming God 》 Series articles

Welcome to your attention ,《 The way of big data becoming God 》 Series articles

Welcome to your attention ,《 The way of big data becoming God 》 Series articles

本文为[Wang Zhiwu]所创,转载请带上原文链接,感谢

  1. 【计算机网络 12(1),尚学堂马士兵Java视频教程
  2. 【程序猿历程,史上最全的Java面试题集锦在这里
  3. 【程序猿历程(1),Javaweb视频教程百度云
  4. Notes on MySQL 45 lectures (1-7)
  5. [computer network 12 (1), Shang Xuetang Ma soldier java video tutorial
  6. The most complete collection of Java interview questions in history is here
  7. [process of program ape (1), JavaWeb video tutorial, baidu cloud
  8. Notes on MySQL 45 lectures (1-7)
  9. 精进 Spring Boot 03:Spring Boot 的配置文件和配置管理,以及用三种方式读取配置文件
  10. Refined spring boot 03: spring boot configuration files and configuration management, and reading configuration files in three ways
  11. 精进 Spring Boot 03:Spring Boot 的配置文件和配置管理,以及用三种方式读取配置文件
  12. Refined spring boot 03: spring boot configuration files and configuration management, and reading configuration files in three ways
  13. 【递归,Java传智播客笔记
  14. [recursion, Java intelligence podcast notes
  15. [adhere to painting for 386 days] the beginning of spring of 24 solar terms
  16. K8S系列第八篇(Service、EndPoints以及高可用kubeadm部署)
  17. K8s Series Part 8 (service, endpoints and high availability kubeadm deployment)
  18. 【重识 HTML (3),350道Java面试真题分享
  19. 【重识 HTML (2),Java并发编程必会的多线程你竟然还不会
  20. 【重识 HTML (1),二本Java小菜鸟4面字节跳动被秒成渣渣
  21. [re recognize HTML (3) and share 350 real Java interview questions
  22. [re recognize HTML (2). Multithreading is a must for Java Concurrent Programming. How dare you not
  23. [re recognize HTML (1), two Java rookies' 4-sided bytes beat and become slag in seconds
  24. 造轮子系列之RPC 1:如何从零开始开发RPC框架
  25. RPC 1: how to develop RPC framework from scratch
  26. 造轮子系列之RPC 1:如何从零开始开发RPC框架
  27. RPC 1: how to develop RPC framework from scratch
  28. 一次性捋清楚吧,对乱糟糟的,Spring事务扩展机制
  29. 一文彻底弄懂如何选择抽象类还是接口,连续四年百度Java岗必问面试题
  30. Redis常用命令
  31. 一双拖鞋引发的血案,狂神说Java系列笔记
  32. 一、mysql基础安装
  33. 一位程序员的独白:尽管我一生坎坷,Java框架面试基础
  34. Clear it all at once. For the messy, spring transaction extension mechanism
  35. A thorough understanding of how to choose abstract classes or interfaces, baidu Java post must ask interview questions for four consecutive years
  36. Redis common commands
  37. A pair of slippers triggered the murder, crazy God said java series notes
  38. 1、 MySQL basic installation
  39. Monologue of a programmer: despite my ups and downs in my life, Java framework is the foundation of interview
  40. 【大厂面试】三面三问Spring循环依赖,请一定要把这篇看完(建议收藏)
  41. 一线互联网企业中,springboot入门项目
  42. 一篇文带你入门SSM框架Spring开发,帮你快速拿Offer
  43. 【面试资料】Java全集、微服务、大数据、数据结构与算法、机器学习知识最全总结,283页pdf
  44. 【leetcode刷题】24.数组中重复的数字——Java版
  45. 【leetcode刷题】23.对称二叉树——Java版
  46. 【leetcode刷题】22.二叉树的中序遍历——Java版
  47. 【leetcode刷题】21.三数之和——Java版
  48. 【leetcode刷题】20.最长回文子串——Java版
  49. 【leetcode刷题】19.回文链表——Java版
  50. 【leetcode刷题】18.反转链表——Java版
  51. 【leetcode刷题】17.相交链表——Java&python版
  52. 【leetcode刷题】16.环形链表——Java版
  53. 【leetcode刷题】15.汉明距离——Java版
  54. 【leetcode刷题】14.找到所有数组中消失的数字——Java版
  55. 【leetcode刷题】13.比特位计数——Java版
  56. oracle控制用户权限命令
  57. 三年Java开发,继阿里,鲁班二期Java架构师
  58. Oracle必须要启动的服务
  59. 万字长文!深入剖析HashMap,Java基础笔试题大全带答案
  60. 一问Kafka就心慌?我却凭着这份,图灵学院vip课程百度云