Detailed Explanation of HBase Basic Principles

itread01 2021-01-14 14:11:40

## HBase Introduction

HBase is a distributed, column-oriented, open-source database built on top of HDFS. The name comes from "Hadoop database". HBase relies on the Hadoop cluster for its compute and storage capacity.

HBase sits between NoSQL and RDBMS: data can be retrieved only by primary key (row key) or by a range of primary keys, and only single-row transactions are supported (complex operations such as multi-table joins can be implemented through Hive support).

Characteristics of HBase tables:

1. Large: a table can have billions of rows and millions of columns.
2. Column-oriented: storage and access control are organized by column (family), and each column (family) is retrieved independently.
3. Sparse: **columns that are empty (null) take up no storage space, so tables can be designed to be very sparse**.

## HBase Underlying Principles

### System Architecture

[Figure: HBase system architecture]

The components shown in the figure are explained below.

#### Client

The client contains the interfaces for accessing HBase. **It also maintains caches to speed up access to HBase**, such as region location information.

#### Zookeeper

HBase can use its built-in Zookeeper or an external one. In production environments an external Zookeeper is generally used, to keep things consistent.

Zookeeper's roles in HBase:

1. Ensures that the cluster has only one master at any time.
2. Stores the addressing entry point for all regions.
3. Monitors the state of the Region Servers in real time and notifies the Master when a Region Server comes online or goes offline.

#### HMaster

1. Assigns regions to Region Servers.
2. **Balances load across Region Servers.**
3. Detects failed Region Servers and reassigns their regions.
4. Garbage-collects files on HDFS.
5. Handles schema update requests.

#### HRegion Server

An HRegion Server **maintains the regions that HMaster assigns to it** and handles the I/O requests for those regions. It is also responsible for splitting regions that grow too large while the system is running.

As the architecture figure shows, **a client accessing data in HBase does not need HMaster to participate**: addressing goes through Zookeeper and the HRegion Server, and data reads and writes go to the HRegion Server. **HMaster only maintains the metadata of tables and regions, so its load is very low.**

### HBase Table Data Model

[Figure: HBase table structure]

#### Row Key

As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:

1. By a single row key.
2. By a range of row keys.
3. By a full table scan.

A row key can be any string (**the maximum length is 64 KB**; in practice it is usually 10 to 100 bytes). Inside HBase, the row key is stored as a byte array.

**Data in an HBase table is sorted by row key in lexicographic (byte) order.** When designing keys, take full advantage of this sorted storage and place rows that are often read together next to each other (locality).

Note: sorting integers lexicographically gives 1, 10, 100, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, ... **To preserve the natural order of integers, row keys must be left-padded with 0.**

**A single read or write of a row is an atomic operation (no matter how many columns are read or written).** This design decision makes it easy to reason about program behavior when concurrent updates are made to the same row.

#### Column Family

**Every column in an HBase table belongs to a column family.** Column families are part of the table schema (columns are not) and **must be defined before the table is used**.

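
The row key ordering rule and the column family rule described above can be illustrated with a small Python sketch. It uses a hypothetical in-memory `ToyTable`, not the HBase client API; the class, the `padded_key` helper, and the `courses`, `grades`, and `math` names are assumptions made for the example. The sketch rejects a write to a column family that was not declared at creation, accepts any qualifier without declaration, and shows that zero-padded integer row keys scan back in numeric order.

```python
class ToyTable:
    """Hypothetical in-memory model of an HBase table; not the HBase API."""

    def __init__(self, families):
        # Column families belong to the schema and are fixed at creation time.
        self.families = frozenset(families)
        # row key (bytes) -> {(family, qualifier): value}
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Columns are not part of the schema, so any qualifier is accepted.
        self.rows.setdefault(row_key, {})[(family, qualifier)] = value

    def scan(self):
        # Python compares bytes objects byte by byte, i.e. lexicographically.
        return sorted(self.rows)


def padded_key(n, width=5):
    # Left-pad with zeros so byte order matches numeric order
    # (holds for non-negative integers with at most `width` digits).
    return str(n).zfill(width).encode("ascii")


numbers = (1, 2, 10, 11, 100)

plain = ToyTable(["courses"])
padded = ToyTable(["courses"])
for n in numbers:
    plain.put(str(n).encode("ascii"), "courses", "math", n)
    padded.put(padded_key(n), "courses", "math", n)

print(plain.scan())   # [b'1', b'10', b'100', b'11', b'2']
print(padded.scan())  # [b'00001', b'00002', b'00010', b'00011', b'00100']

rejected = False
try:
    plain.put(b"1", "grades", "math", 90)  # "grades" was never declared
except KeyError:
    rejected = True
print(rejected)  # True
```
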
Every column name is prefixed with its column family. For example, courses:history and courses:math both belong to the courses family.

**Access control and disk and memory usage accounting are all done at the column family level. The more column families there are, the more I/O and file lookups are needed to fetch a row, so do not define more column families than necessary.**

#### Column

A column is a specific column under a column family. It belongs to one column family, much like a concrete column created in MySQL.

#### Timestamp

In HBase, the storage unit identified by a row and a column is called a cell. Each cell holds multiple versions of the same data, indexed by timestamp. The timestamp is a 64-bit integer. **HBase can assign the timestamp automatically when data is written**, in which case it is the current system time, accurate to the millisecond. The client can also assign the timestamp explicitly. An application that wants to avoid data version conflicts must generate its own unique timestamps. **Within each cell, versions are sorted in reverse chronological order**, so the newest data comes first.

To avoid the management burden (both storage and indexing) of keeping too many versions, HBase provides two ways to reclaim old versions:

1. Keep only the last n versions of the data.
2. Keep only the versions from a recent period (by setting a time-to-live, TTL, for the data).

Users can configure this per column family.

#### Cell

from {row key, co

