A brief introduction to HDFS
HDFS is the core of Hadoop. It is a distributed storage service.
HDFS is one kind of distributed file system.
Important HDFS concepts
HDFS locates files through a unified namespace directory tree.
In addition, it is distributed: many servers work together to implement its functions, and each server in the cluster has its own role (the essence of a distributed system is to split the work so that each part handles its own duties).
- Typical Master/Slave architecture
HDFS has a typical Master/Slave architecture.
An HDFS cluster usually consists of one NameNode plus multiple DataNodes.
The NameNode is the master node of the cluster, and the DataNodes are the slave nodes.
- Block storage (block mechanism)
Files in HDFS are physically stored in blocks, and the block size can be set through a configuration parameter.
In Hadoop 2.x the default block size is 128 MB.
- Namespace (NameSpace)
HDFS supports a traditional hierarchical file organization.
Users or applications can create directories and then store files in them.
The file system namespace hierarchy is similar to that of most existing file systems: users can create, delete, move, or rename files.
The NameNode is responsible for maintaining the file system namespace. Any change to the namespace or its properties is recorded by the NameNode.
- NameNode metadata management
The directory structure and the file block locations are together called metadata.
For each file, the NameNode's metadata records the corresponding block information (the id of each block and the DataNodes on which it is stored).
- DataNode data storage
The actual storage and management of each file's blocks is handled by the DataNodes.
A block is stored on multiple DataNodes, and each DataNode periodically reports the blocks it holds to the NameNode.
- Replica mechanism
For fault tolerance, every block of a file is replicated.
The block size and the replication factor are configurable per file.
An application can specify the number of replicas of a file.
The replication factor can be specified when the file is created and can be changed later.
The default number of replicas is 3.
- Write once, read many
HDFS is designed for write-once, read-many scenarios and does not support random modification of files (appending is supported, random updates are not).
Because of this, HDFS is well suited as the underlying storage for big data analysis, but not for applications such as network drives (modification is inconvenient, latency is high, network overhead is large, and the cost is too high).
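As a rough illustration of the block and replica mechanisms described above, the following Python sketch computes how a file is cut into blocks and how much raw storage its replicas occupy. It is plain arithmetic, not HDFS code: the function names are made up for this example, and the defaults simply mirror the values quoted above (128 MB blocks, 3 replicas), which in Hadoop 2.x correspond to the dfs.blocksize and dfs.replication settings.

```python
MB = 1024 * 1024
DEFAULT_BLOCK_SIZE = 128 * MB   # Hadoop 2.x default quoted in the text
DEFAULT_REPLICATION = 3         # default replica count quoted in the text


def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into.

    Every block is full except possibly the last one, which only holds
    the remaining bytes.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def raw_storage(file_size, replication=DEFAULT_REPLICATION):
    """Total bytes stored across the cluster once every block is replicated."""
    return file_size * replication


if __name__ == "__main__":
    blocks = split_into_blocks(300 * MB)
    print([size // MB for size in blocks])   # block sizes in MB
    print(raw_storage(300 * MB) // MB)       # raw MB used by all replicas
```

Under these assumptions a 300 MB file is cut into blocks of 128 MB, 128 MB, and 44 MB, and with 3 replicas it occupies 900 MB of raw storage.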
HDFS architecture
NameNode: the manager of the HDFS cluster (Master)
- Maintains and manages the HDFS namespace (NameSpace)
- Maintains the replica policy
- Records the mapping information of file blocks (Block)
- Handles client read and write requests
DataNode: the NameNode issues commands and the DataNode performs the actual operations (Slave)
- Stores the actual data blocks
- Performs reads and writes of data blocks
Client: the client
- When uploading a file to HDFS, the Client splits the file into Blocks and then uploads them
- Interacts with the NameNode to obtain file location information
- Interacts with DataNodes to read or write files
- Can use commands to manage or access HDFS
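The division of roles above can be sketched with a toy model in Python. This is not the Hadoop API: ToyNameNode, ToyDataNode, and all of their methods are hypothetical names invented for illustration. The point is only that the NameNode holds metadata (which blocks make up a file, and which DataNodes hold each block) while the DataNodes hold the bytes and report what they store.

```python
from collections import defaultdict


class ToyDataNode:
    """Slave: stores the actual bytes of each block it has been given."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}            # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def block_report(self):
        """Ids of the blocks this node holds (what it reports to the NameNode)."""
        return sorted(self.blocks)


class ToyNameNode:
    """Master: keeps metadata only, never the file contents."""

    def __init__(self):
        self.file_blocks = {}                     # path -> ordered block ids
        self.block_locations = defaultdict(set)   # block id -> DataNode names

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def receive_report(self, datanode):
        """Record which blocks a DataNode says it holds."""
        for block_id in datanode.block_report():
            self.block_locations[block_id].add(datanode.name)

    def locate(self, path):
        """What a client asks for: each block id with the nodes holding it."""
        return [(block_id, sorted(self.block_locations[block_id]))
                for block_id in self.file_blocks[path]]


if __name__ == "__main__":
    namenode = ToyNameNode()
    dn1, dn2 = ToyDataNode("dn1"), ToyDataNode("dn2")
    dn1.store("blk_1", b"hello ")
    dn2.store("blk_1", b"hello ")
    dn2.store("blk_2", b"world")
    namenode.add_file("/demo.txt", ["blk_1", "blk_2"])
    for datanode in (dn1, dn2):
        namenode.receive_report(datanode)
    print(namenode.locate("/demo.txt"))
```

In this sketch the client would call locate and then fetch each block directly from one of the listed DataNodes, so file contents never pass through the NameNode.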
HDFS read and write analysis
HDFS read data flow
- The client uses DistributedFileSystem to ask the NameNode to download a file. The NameNode queries its metadata to find the addresses of the DataNodes that hold the file's blocks
- The client picks a DataNode server (nearest first, then random) and requests to read data
- The DataNode starts transferring data to the client (it reads the data from disk as an input stream and verifies it in units of Packets)
- The client receives the data Packet by Packet, caches it locally first, and then writes it to the target file
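The read steps above can be imitated with a small Python sketch. It is a simplified model, not the real HDFS wire protocol: the function names are hypothetical, the 4-byte packet size is chosen only to keep the example readable, and CRC32 from the standard zlib module stands in for whatever checksum the DataNode actually uses. Picking the first listed DataNode is likewise a stand-in for the "nearest first, then random" rule.

```python
import zlib

PACKET_SIZE = 4  # bytes per packet; tiny on purpose (assumption for the demo)


def send_packets(block, packet_size=PACKET_SIZE):
    """DataNode side: yield (payload, checksum) pairs for one block."""
    for start in range(0, len(block), packet_size):
        payload = block[start:start + packet_size]
        yield payload, zlib.crc32(payload)


def receive_block(packets):
    """Client side: verify each packet, buffer it, and return the block."""
    buffer = bytearray()
    for payload, checksum in packets:
        if zlib.crc32(payload) != checksum:
            raise IOError("checksum mismatch, packet is corrupt")
        buffer.extend(payload)
    return bytes(buffer)


def read_file(block_locations, datanodes):
    """Read every block from one of its DataNodes and join the result.

    block_locations: list of (block_id, [datanode names]) from the NameNode.
    datanodes: dict mapping a datanode name to its {block_id: bytes} store.
    """
    data = bytearray()
    for block_id, holders in block_locations:
        chosen = holders[0]   # stand-in for "nearest first, then random"
        block = datanodes[chosen][block_id]
        data.extend(receive_block(send_packets(block)))
    return bytes(data)


if __name__ == "__main__":
    stores = {
        "dn1": {"blk_1": b"hello "},
        "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    }
    locations = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2"])]
    print(read_file(locations, stores))
```

Verifying per packet means a corrupt packet is detected as soon as it arrives, rather than after the whole block has been transferred.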
HDFS write data flow
- The client uses the DistributedFileSystem module to ask the NameNode to upload a file. The NameNode checks whether the target file already exists and whether the parent directory exists
- The NameNode replies whether the upload may proceed
- The client asks which DataNode servers the first Block should be uploaded to
- The NameNode returns 3 DataNodes: dn1, dn2, and dn3
- The client uses the FSDataOutputStream module to ask dn1 to accept the upload. On receiving the request, dn1 calls dn2, and dn2 then calls dn3, which establishes the communication pipeline
- dn1, dn2, and dn3 acknowledge back to the client step by step
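The handshake described in these steps can be sketched as a toy Python model. None of this is Hadoop code: can_upload, PipelineNode, and build_pipeline are hypothetical names, and the acknowledgement order modeled here (dn3 answers first, and the reply travels back through dn2 and dn1 to the client) is this sketch's reading of "step by step".

```python
import posixpath


def can_upload(path, existing_files, existing_dirs):
    """NameNode check: the target must not exist and its parent directory must."""
    parent = posixpath.dirname(path)
    return path not in existing_files and parent in existing_dirs


class PipelineNode:
    """A DataNode that passes the connection request on to the next node."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream   # next DataNode in the pipeline, if any

    def connect(self):
        """Open the channel downstream first, then acknowledge upstream.

        Returns the node names in the order their acknowledgements
        arrive: the last node in the pipeline answers first.
        """
        acks = self.downstream.connect() if self.downstream else []
        return acks + [self.name]


def build_pipeline(names):
    """Chain the nodes so the first forwards to the second, and so on."""
    node = None
    for name in reversed(names):
        node = PipelineNode(name, node)
    return node   # head of the pipeline, the node the client talks to


if __name__ == "__main__":
    files, dirs = set(), {"/", "/data"}
    if can_upload("/data/a.txt", files, dirs):
        head = build_pipeline(["dn1", "dn2", "dn3"])
        print(head.connect())   # acknowledgement order seen by the client
```

The client only ever talks to the head of the pipeline (dn1 here), and each node is responsible for reaching the next one.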