Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like query functionality (HQL), which lowers both development and learning costs for developers.
Metadata: data that describes the data. Internal execution process: the parser (parses the SQL statement), the compiler (compiles the SQL statement into a MapReduce program), the optimizer (optimizes the MapReduce program), and the executor (runs the MapReduce job and writes the results to HDFS).
The first way to interact: the Hive interactive shell (run bin/hive directly). The second way: the Hive JDBC service — 1. start the hiveserver2 service in the foreground: bin/hive --service hiveserver2; 2. connect to hiveserver2 with beeline: run beeline, then beeline> !connect jdbc:hive2://node01:10000.
like: fuzzy pattern matching. rlike: matching against a regular expression.
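A quick sketch of the two operators, using a hypothetical employee table:

```sql
-- LIKE uses SQL wildcards: % (any sequence) and _ (single character).
SELECT name FROM employee WHERE name LIKE '_a%';    -- second character is 'a'

-- RLIKE matches against a Java regular expression.
SELECT name FROM employee WHERE name RLIKE '[0-9]'; -- name contains a digit
```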
Dropping an internal (managed) table deletes both the table's metadata and its data. Dropping an external table deletes only the metadata; the data itself is not deleted.
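For example (the table name and LOCATION path are hypothetical):

```sql
-- External table: Hive only records the schema; the files stay under /data/logs.
CREATE EXTERNAL TABLE logs (
  id  INT,
  msg STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';

DROP TABLE logs;  -- removes only the metastore entry; the files under /data/logs remain
```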
Advantage: queries that specify a partition improve query and analysis efficiency. Requirement: the partition field must not be one of the existing columns in the data table.
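A minimal partitioned-table sketch (table and column names are hypothetical):

```sql
-- 'dt' is a partition column, so it must not also appear in the column list.
CREATE TABLE orders (
  id     INT,
  amount DOUBLE
)
PARTITIONED BY (dt STRING);

-- Querying a single partition scans only that partition's directory.
SELECT id, amount FROM orders WHERE dt = '2023-01-01';
```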
Advantages: 1. for join queries, bucketing can act as an optimization and speed up the join (provided the join field is set as the bucket field); 2. it supports data sampling (obtaining/extracting a sample of the data). Requirement: the bucket field must be one of the columns in the table.
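A sketch of a bucketed table and sampling from it (names are hypothetical):

```sql
-- 'id' is an existing column, used as the bucket field.
CREATE TABLE user_bucketed (
  id   INT,
  name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;

-- Sampling: take the first of the 4 buckets.
SELECT * FROM user_bucketed TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```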
1. Insert data directly into the table. 2. Insert data through a query. 3. Multi-insert mode. 4. Create a table and load data in one query statement. 5. Specify the data path with LOCATION when creating the table.
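Sketches of several of these methods (table names and paths are hypothetical):

```sql
-- 1. Direct insert
INSERT INTO TABLE student VALUES (1, 'alice');

-- 2. Insert through a query
INSERT OVERWRITE TABLE student_copy SELECT id, name FROM student;

-- 4. Create the table and load data in one statement (CTAS)
CREATE TABLE student_ctas AS SELECT id, name FROM student;

-- 5. Point the table at an existing data path when creating it
CREATE TABLE student_ext (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/student';
```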
1. Export query results to the local filesystem. 2. Format the query results and export them to the local filesystem. 3. Export query results to HDFS (without LOCAL). 4. Export to the local filesystem with a Hadoop command. 5. Export with a hive shell command. 6. Export to HDFS with EXPORT (full-table export). 7. Export with sqoop.
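For example, methods 1–3 and 6 look like this (directory paths are hypothetical):

```sql
-- 1. To a local directory
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/export/student'
SELECT * FROM student;

-- 2. Formatted export to a local directory
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/export/student'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM student;

-- 3. To HDFS (no LOCAL keyword)
INSERT OVERWRITE DIRECTORY '/export/student'
SELECT * FROM student;

-- 6. Full-table export to HDFS
EXPORT TABLE student TO '/export/student_full';
```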
order by: global ordering, performed in a single MapReduce job (one reducer). sort by: sorts within each reducer; does not sort the global result set.
WHERE is a constraint clause: it constrains the rows before the query result is returned, i.e. it takes effect before the result is produced, and aggregate functions cannot be used after WHERE. HAVING is a filter clause: it filters after the query result has been produced, i.e. it takes effect after the result is returned, and aggregate functions can be used after HAVING.
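A sketch of the difference (table and column names are hypothetical):

```sql
-- WHERE filters rows before grouping; no aggregates allowed here.
-- HAVING filters groups after aggregation; aggregates are allowed.
SELECT dept, count(*) AS cnt
FROM employee
WHERE salary > 3000       -- row-level filter, applied first
GROUP BY dept
HAVING count(*) > 10;     -- group-level filter, applied to the aggregated result
```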
Use distribute by when you need to control which reducer (partition) rows with a given field value go to. It is usually combined with sort by (distribute first, then sort). Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause.
When you need to distribute by a field and sort in ascending order by that same field, use cluster by.
distribute by + sort by can specify ascending or descending order; cluster by supports only ascending order and cannot be combined with sort by.
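The relationship between the two forms can be sketched as (table name hypothetical):

```sql
-- Rows with the same id go to the same reducer, sorted descending within it.
SELECT * FROM student DISTRIBUTE BY id SORT BY id DESC;

-- Equivalent to DISTRIBUTE BY id SORT BY id (ascending only).
SELECT * FROM student CLUSTER BY id;
```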
-e executes the specified HQL statement; -f executes an HQL script file; -hiveconf sets Hive configuration parameters for the run.
Precedence: configuration file < command-line arguments (-hiveconf) < parameter declarations (set ...).
Use ORC or Parquet as the storage format, with snappy as the data compression format.
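A sketch of creating an ORC table with snappy compression (table name hypothetical):

```sql
CREATE TABLE log_orc_snappy (
  id  INT,
  msg STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");
```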
Custom functions fall into three categories: UDF (User Defined Function): one row in, one row out. UDAF (User Defined Aggregation Function): an aggregate function, many rows in, one out (for example count/max/min). UDTF (User Defined Table-Generating Function): one row in, many rows out, such as lateral view explode().
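A built-in UDTF sketch with explode (table and column names are hypothetical):

```sql
-- 'tags' holds comma-separated values, e.g. 'a,b,c'.
-- explode turns each row into one output row per tag.
SELECT m.id, t.tag
FROM movie m
LATERAL VIEW explode(split(m.tags, ',')) t AS tag;
```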
When hive.fetch.task.conversion is set to more, simple query statements are not converted into MR jobs. When set to none, all query statements are converted into MR jobs.
Provided the amount of data is small, this improves query efficiency.
With map-side aggregation enabled and skew-handling local aggregation turned on, Hive creates two MR jobs: the first performs a partial (local) aggregation of the data; the second produces the final aggregated result.
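The standard Hive switches for this behaviour are:

```sql
SET hive.map.aggr = true;            -- aggregate partially on the map side
SET hive.groupby.skewindata = true;  -- two MR jobs: partial aggregation, then final aggregation
```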
Replace SELECT count(DISTINCT id) FROM bigtable; with SELECT count(id) FROM (SELECT id FROM bigtable GROUP BY id) a; — the GROUP BY deduplicates first, spreading the work across multiple reducers instead of forcing a single reducer to count distinct values.
Column pruning: read only the columns you need. Partition pruning: read only the partitions you need. Take only what you will use.
Dynamic partitioning: the partitions of the second (target) table follow the partition values of the first (source) table, so all partitions of the first table are copied over. When the second table loads the data, there is no need to specify a partition — the first table's partition values are used directly.
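A dynamic-partition sketch (table names are hypothetical; the two SET statements are the standard switches):

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- 'dt' is taken from the last column of the SELECT; no static partition is specified.
INSERT OVERWRITE TABLE orders_target PARTITION (dt)
SELECT id, amount, dt FROM orders_source;
```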
(Split one large task into several small tasks and execute them.) Set the number of reducers (for example 10), then: 1. distribute by (a field); 2. distribute by rand() to spread rows evenly across the reducers.
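A sketch of spreading skewed data with rand() (table names are hypothetical):

```sql
SET mapreduce.job.reduces = 10;

-- Rows are distributed to the 10 reducers at random, evening out skew.
INSERT OVERWRITE TABLE target
SELECT * FROM source DISTRIBUTE BY rand();
```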
When the files are very small, the factor that determines the number of map tasks is the number of files. When the files are large, the factor that determines the number of map tasks is the number of blocks.
Formula: N = min(parameter 2, total input data / parameter 1). Parameter 1: the maximum amount of data each reducer processes (hive.exec.reducers.bytes.per.reducer). Parameter 2: the maximum number of reducers per job (hive.exec.reducers.max).
Parallel execution allows multiple stages without dependencies on each other to run at the same time, improving query efficiency.
1. Scanning all partitions is not allowed (a partition filter is required). 2. Queries using order by are required to use a limit clause. 3. Queries that produce a Cartesian product are restricted.
Allowing multiple tasks to reuse one JVM reduces task-startup overhead and improves task efficiency. (However, the JVM is not released until the job finishes, so it is occupied for a long time; when resources are tight, this wastes resources.)
When the input of the submitted SQL statement is small, it is executed locally and the task is not distributed to the cluster.
After the data is stored on HDFS and the analysis program is written, when the program is distributed it is preferentially dispatched to the nodes holding the data the program uses (data locality).
1. Write the filter condition in the ON clause of the join: SELECT a.id FROM ori a LEFT JOIN bigtable b ON (b.id <= 10 AND a.id = b.id); 2. Filter in a subquery before the join: SELECT a.id FROM bigtable a RIGHT JOIN (SELECT id FROM ori WHERE id <= 10) b ON a.id = b.id;