As a distributed file system, how does HDFS ensure data consistency in a distributed environment? In HDFS, a stored file is split into blocks of uniform size that are distributed across different machines. The nodes that store these blocks are called DataNodes, and a NameNode is needed to manage the metadata that describes them.
How the NameNode ensures metadata consistency
When a client uploads a file, the NameNode first records the metadata operation in the edits log. The NameNode also keeps a persistent copy of the metadata on disk (the fsimage file), which corresponds to the data in memory. How is consistency between the two maintained? Before the edits log fills up, its data must be merged into the fsimage, after which the edits log can be cleared. When the edits log does fill up, file uploads cannot be interrupted, so new operations are written to a fresh file, edits.new, while the merge of the old edits log into the fsimage is carried out by the SecondaryNameNode. This merge is the checkpoint operation.
A checkpoint is generally triggered by one of two limits: a size limit on the edits log, configured with fs.checkpoint.size, or a time interval, configured with fs.checkpoint.period. The size limit takes priority: once the edits file exceeds the threshold, a checkpoint is forced whether or not the maximum time interval has been reached.
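This priority rule can be sketched as a small check (a hypothetical helper for illustration, not Hadoop's actual code), using the defaults discussed in this article (3600 s period, 64 MB size limit):

```java
// Hypothetical sketch of the checkpoint trigger rule: the size threshold
// takes priority over the time interval.
public class CheckpointTrigger {
    // Defaults mirroring fs.checkpoint.period (3600 s) and fs.checkpoint.size (64 MB)
    static final long CHECKPOINT_PERIOD_SECONDS = 3600;
    static final long CHECKPOINT_SIZE_BYTES = 64L * 1024 * 1024;

    static boolean shouldCheckpoint(long editsSizeBytes, long secondsSinceLastCheckpoint) {
        if (editsSizeBytes >= CHECKPOINT_SIZE_BYTES) {
            return true; // size limit exceeded: force a checkpoint regardless of time
        }
        return secondsSinceLastCheckpoint >= CHECKPOINT_PERIOD_SECONDS;
    }
}
```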
The SecondaryNameNode is one solution toward HA (High Availability), but it is not a hot standby. Its execution flow: download the metadata (fsimage and edits) from the NameNode, merge the two into a new fsimage, keep it locally, and push it to the NameNode to replace the old fsimage. (Note: the SecondaryNameNode exists only in Hadoop 1.0; Hadoop 2.0 and later do not use it in cluster mode, although pseudo-distributed mode still has one.)
SecondaryNameNode workflow steps:
- The SecondaryNameNode asks the NameNode to roll the edits file
- The SecondaryNameNode fetches the fsimage and edits files from the NameNode (over HTTP)
- The SecondaryNameNode loads the fsimage into memory and then starts merging in the edits. The merge runs only when one of two conditions is met:
1) fs.checkpoint.period specifies the maximum time interval between two checkpoints, 3600 seconds by default
2) fs.checkpoint.size specifies the maximum size of the edits file, 64 MB by default; once this value is exceeded, a checkpoint is forced whether or not the maximum time interval has been reached
- The SecondaryNameNode sends the new fsimage back to the NameNode
- The NameNode replaces the old fsimage with the new one
HDFS computes checksums for all data written (checksum) and verifies them when the data is read. A checksum is computed for every fixed number of bytes; the default is 512 bytes, configurable via the io.bytes.per.checksum property. The CRC-32 encoding produces a 4-byte checksum.
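As a rough illustration of this scheme, the following standalone sketch computes one 4-byte CRC-32 value per 512-byte chunk, the way io.bytes.per.checksum partitions the stream (illustrative only, not HDFS's internal implementation):

```java
import java.util.zip.CRC32;

// One CRC-32 checksum per 512-byte chunk of data, mirroring the default
// io.bytes.per.checksum. Each CRC-32 value fits in 4 bytes, as noted above.
public class ChunkChecksums {
    static final int BYTES_PER_CHECKSUM = 512;

    static long[] checksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int from = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - from);
            CRC32 crc = new CRC32();
            crc.update(data, from, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }
}
```

Corrupting even a single byte of a chunk changes its CRC-32, which is how a reader can detect a damaged block.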
DataNodes are responsible for verifying the checksum before saving data. The client sends the data together with its checksums down a pipeline of DataNodes, and the last DataNode in the pipeline verifies the checksum. If verification fails, a ChecksumException is thrown, which the client must handle. When the client reads data from a DataNode, it also verifies the checksum. Each DataNode keeps a log of checksum verifications; every time a client successfully verifies a block, it tells the DataNode, which updates the log.
Each DataNode also runs a DataBlockScanner in a background thread, which periodically verifies all data blocks on that DataNode. When reading a file with the hadoop fs -get command, you can pass -ignoreCrc to skip verification. When reading through the FileSystem API, you can call setVerifyChecksum(false) to skip it. Hadoop's LocalFileSystem performs client-side checksumming: when a file is written, it creates a hidden file named .filename.crc. To disable the checksum feature, use RawLocalFileSystem instead of LocalFileSystem:
```java
Configuration conf = ...;
FileSystem fs = new RawLocalFileSystem();
fs.initialize(null, conf);
```
Alternatively, set the fs.file.impl property to org.apache.hadoop.fs.RawLocalFileSystem, which disables checksums globally. Internally, LocalFileSystem uses ChecksumFileSystem to do the checksum work, and ChecksumFileSystem can be used to add checksums to other filesystems.
High availability (HA)
One way HDFS copes with node failure is data redundancy, i.e., keeping multiple backups of the data. The number of backups can be set in the HDFS configuration file; if unset, it defaults to 3. Note the difference between dfs.replication.min and dfs.replication: while a block is being written, the write succeeds as long as at least dfs.replication.min replicas are written (default 1); replication then continues until the target replica count dfs.replication (default 3) is reached.
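For reference, these replication settings live in hdfs-site.xml; a minimal sketch showing the defaults mentioned above:

```xml
<!-- hdfs-site.xml: replication settings discussed above (defaults shown) -->
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- target number of replicas per block -->
</property>
<property>
  <name>dfs.replication.min</name>
  <value>1</value> <!-- minimum replicas written before a write succeeds -->
</property>
```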
How are replicas distributed across racks in HDFS?
HDFS uses a strategy called rack awareness to improve data reliability, availability, and utilization of network bandwidth. In most cases the HDFS replication factor is the default 3 (dfs.replication). The storage strategy is to keep one replica on a node in the local rack, another replica on a different node in the same rack, and the last replica on a node in a different rack. This strategy reduces data transfer between racks and improves the efficiency of write operations. Rack failures are far rarer than node failures, so the strategy does not hurt data reliability or availability. At the same time, because blocks are stored on only two distinct racks, the strategy reduces the total network bandwidth used when reading data.
Typically, large Hadoop clusters are organized into racks, and network conditions between nodes within a rack are better than between racks. The NameNode tries to keep block replicas on different racks to improve fault tolerance.
How does HDFS know which rack a DataNode is on?
HDFS uses a strategy called "rack awareness". HDFS cannot automatically determine the network topology of each DataNode; the dfs.network.script parameter must be configured to determine which rack a node is on. The script provides an IP-to-rackid translation, through which the NameNode obtains the rackid of each DataNode.
Node failure is detected with heartbeats. Each DataNode periodically sends a heartbeat to the NameNode. A network partition may cause some DataNodes to lose contact with the NameNode. The NameNode detects this through the absence of heartbeats, marks those DataNodes as down, and stops sending new IO requests to them.
Data stored on a dead DataNode is no longer considered valid. A DataNode outage may drop the replication factor of some blocks below the specified value; the NameNode continuously detects blocks that need re-replication and starts replicating as soon as it finds them.
Re-replication may be needed in the following cases:
a) a DataNode fails
b) a replica is corrupted
c) a disk on a DataNode fails
d) the replication factor of a file is increased
Contents of the DataNode-to-NameNode heartbeat report
A data block is stored on a DataNode's disk as files: one file holds the data itself, and another holds the metadata, including the block length, the block checksums, and a timestamp. When a DataNode reads a block, it recomputes the checksum; if the computed value differs from the value recorded when the block was created, the block has been corrupted. In that case the client reads the block from another DataNode, and the NameNode marks the block as corrupted and replicates it until the expected number of backups is restored.
A DataNode registers with the NameNode at startup and then sends a heartbeat every 3 seconds to tell the NameNode that it is alive and how much space it has available. The heartbeat response can carry commands from the NameNode to the DataNode, such as copying a block to another machine or deleting a block.
Confirming that a DataNode is down
(1) Heartbeat reports stop arriving
By default the NameNode must miss 10 consecutive heartbeats, i.e. go 10 * 3 = 30 s without contact. If even one heartbeat arrives during those 10 intervals, the count starts over.
(2) The NameNode sends checks
After missing 10 consecutive heartbeat reports, the NameNode concludes that the DataNode may be down, actively sends it a check, and starts a background (blocking) process to wait for the result. Each check of a DataNode takes 5 minutes by default.
The check is performed twice by default, 5 minutes apart; if 2 consecutive checks (10 minutes) get no response, the DataNode is confirmed dead (each check sends and then waits 5 minutes). The total time the NameNode needs to confirm that a DataNode is down is therefore 10 * 3 s + 2 * 300 s = 630 s.
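The arithmetic behind the 630 s figure, as a tiny hypothetical helper:

```java
// Total time to confirm a DataNode is down: missed heartbeats plus check rounds.
public class DeadNodeTimeout {
    static long timeoutSeconds(long heartbeatIntervalSec, int missedHeartbeats,
                               long checkIntervalSec, int checks) {
        return heartbeatIntervalSec * missedHeartbeats + checkIntervalSec * checks;
    }
}
```

With the defaults described above: 10 missed heartbeats at 3 s each, then 2 checks of 300 s each, giving 630 s.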
When a DataNode starts, it scans its local file system and produces a list of the correspondences between HDFS data blocks and local files. This is the block report (BlockReport), which lists all the blocks on that DataNode. By default, each DataNode sends a block report to the NameNode every hour.
When the NameNode starts, it enters a special state called safe mode, in which it is read-only to clients. The NameNode receives heartbeats and block status reports (blockreport) from all DataNodes.
Each data block has a specified minimum number of replicas (dfs.replication.min). When the NameNode confirms that a block has reached this minimum, the block is considered safely replicated.
Once a certain percentage of blocks (configured by dfs.safemode.threshold.pct, default 99.9%) have been confirmed safe by the NameNode, and after a further delay (configured by dfs.safemode.extension, default 30 seconds), the NameNode exits safe mode. It then determines which blocks have not reached the specified replica count and replicates them to other DataNodes.
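A sketch of the corresponding hdfs-site.xml entries (property names and defaults as described above; dfs.safemode.extension is in milliseconds):

```xml
<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999</value> <!-- fraction of blocks that must be safely replicated -->
</property>
<property>
  <name>dfs.safemode.extension</name>
  <value>30000</value> <!-- extra wait after the threshold is reached, in ms (30 s) -->
</property>
```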
- HDFS computes checksums for all data written (checksum) and verifies them on read. A DataNode verifies data before storing the checksums, whether the data came from a client or was copied from another DataNode. A client writing data sends the data and its checksums down a pipeline of DataNodes, and the last DataNode in the pipeline verifies the checksums. If a DataNode detects an error, the client receives a ChecksumException
- When a client reads data from a DataNode, it also verifies checksums, comparing them with those stored on the DataNode. Each DataNode persists a log of checksum verifications, so it knows when each block was last verified. After a client successfully verifies a block, it tells the DataNode, which updates this log
- In addition, each DataNode runs a background process called the DataBlockScanner, which periodically verifies all blocks stored on that DataNode. When an error is detected, the NameNode marks the block as corrupted and then copies a replica from another DataNode until the replica count reaches the specified number
The recycle bin
When a user or application deletes a file, the file is not immediately removed from HDFS. Instead, HDFS renames it and moves it into the /trash directory. As long as the file remains in /trash, it can be quickly restored.
How long a file is kept in /trash is configurable (via the fs.trash.interval parameter). When the time expires, the NameNode removes the file from the namespace, and deleting the file releases its data blocks. Note that there can be some delay between a user deleting a file and the corresponding increase in HDFS free space.
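The retention period is set in core-site.xml; a minimal sketch (fs.trash.interval is in minutes, and 0 disables the trash):

```xml
<property>
  <name>fs.trash.interval</name>
  <value>1440</value> <!-- keep deleted files in /trash for 1 day (in minutes) -->
</property>
```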
The FsImage and EditLog are the core data of HDFS. If these files are corrupted, the whole HDFS instance becomes unusable. For this reason, the NameNode can be configured to maintain multiple copies of the FsImage and EditLog; any modification to either is synchronized to all copies. This multi-copy synchronization may reduce the number of namespace transactions the NameNode processes per second, but the cost is acceptable: even when HDFS applications are data-intensive, they are not metadata-intensive. When the NameNode restarts, it picks the most recent complete FsImage and EditLog to use.
Snapshots support copying and backing up data as of a specific point in time. With snapshots, HDFS can be rolled back to a known good point in time after data corruption. HDFS does not yet support snapshots, but support is planned for a future release.
Fault-tolerance mechanisms
There are three main types of fault, and HDFS provides a different detection mechanism for each:
- For DataNode failures, HDFS uses a heartbeat mechanism: DataNodes periodically send heartbeat messages to the NameNode, which uses them to judge whether each DataNode is alive
- For failures of the network to deliver data, HDFS provides an ACK mechanism: if the sender receives no ACK after sending data, and still none after several retries, it treats this as a network failure
- For data corruption, all DataNodes periodically send the NameNode the list of blocks they store, and checksums are sent along with the data, so the NameNode can judge whether data has been lost or corrupted
Read fault tolerance
When a read fails:
- DFSInputStream tries to connect to the next DataNode in the list, recording the failed node as it does so
- DFSInputStream also verifies the checksums of the data it receives; if a corrupted block is detected, DFSInputStream tries to read a replica of that block from another DataNode
- Finally, re-replication of the corrupted data is arranged by the NameNode
Write fault tolerance
When a DataNode goes down while a write is in progress:
- First the pipeline is closed, and the packets remaining in the ack queue are put back at the front of the data queue, so that DataNodes downstream of the failed node do not miss any packets
- Then the block stored on the healthy DataNodes is given a new identity, which is passed to the NameNode, so that the failed DataNode can delete its partial, now-invalid copy of the block once it recovers
- The failed node is removed from the pipeline, the two remaining healthy DataNodes form a new pipeline, and the remaining packets of the block continue to be written to the healthy DataNodes in the pipeline
- Finally, the NameNode notices that the outage has left some blocks with fewer replicas than specified, and arranges additional replicas until the dfs.replication setting is satisfied
The NameNode maintains two tables: a block table and a DataNode table. The block table records, for each data block (including replicas), which DataNodes hold it; the DataNode table records the list of blocks each DataNode stores. Because DataNodes periodically send the NameNode their block information, the NameNode can keep both tables up to date. If it finds a corrupted block on some DataNode, the NameNode removes that block from the block table; if it finds that a DataNode has failed, it updates both tables. The NameNode also periodically scans the block table; if the number of backups of some block falls below the configured value, it coordinates copying the data from one DataNode to another to complete the backup.
HDFS replica placement strategy
If the write request originates on a DataNode, the first replica is stored on that same machine; if the machine is overloaded, the replica can go to a random machine in the same rack. The second replica is then stored on a different rack from the first, and the third replica is stored on a random DataNode in the same rack as the second.
With three replicas, the first replica is on the same machine as the original data and the other two are on random machines in another rack. This arrangement balances performance and disaster tolerance: backup data is preferentially fetched from the same machine, reducing transfer overhead, and if that machine goes down, backups can be fetched from another rack, guarding against the collective failure of a whole rack.
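The placement rule described above can be sketched as follows (a hypothetical simplification for illustration, not Hadoop's actual BlockPlacementPolicy; node and rack names are made up): first replica on the writer's node, second on a node in a different rack, third on another node in the second replica's rack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Simplified sketch of the three-replica placement rule described above.
public class ReplicaPlacement {
    static List<String> place(String writerRack, String writerNode,
                              Map<String, List<String>> nodesByRack, Random rnd) {
        List<String> replicas = new ArrayList<>();
        replicas.add(writerNode); // 1st replica: the writer's own machine

        // 2nd replica: a random node on some other rack
        List<String> otherRacks = new ArrayList<>();
        for (String rack : nodesByRack.keySet()) {
            if (!rack.equals(writerRack)) otherRacks.add(rack);
        }
        String remoteRack = otherRacks.get(rnd.nextInt(otherRacks.size()));
        List<String> remoteNodes = nodesByRack.get(remoteRack);
        String second = remoteNodes.get(rnd.nextInt(remoteNodes.size()));
        replicas.add(second);

        // 3rd replica: a different node on the same rack as the 2nd
        String third;
        do {
            third = remoteNodes.get(rnd.nextInt(remoteNodes.size()));
        } while (third.equals(second));
        replicas.add(third);
        return replicas;
    }
}
```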
The HDFS HA architecture
All of the fault tolerance above addresses DataNode failures, but the NameNode itself is a single point of failure: if the NameNode fails, the whole cluster goes down. HDFS therefore provides an HA architecture. In a typical HA cluster, NameNodes are configured on two separate machines. At any moment, one NameNode is in the Active state and the other is in the Standby state. The Active NameNode responds to all client requests in the cluster, while the Standby NameNode acts purely as a replica, ready to take over quickly when needed so that the failover is invisible to the layers above. The Standby NameNode must stay synchronized with the Active NameNode at all times, so there is shared log storage between them: the Active NameNode writes its EditLog to the shared storage, and the Standby NameNode reads and replays the log, keeping the HDFS metadata in the memory of both NameNodes in sync.
Welcome to follow the 《The way of big data becoming God》 series of articles.