A summary of some of Hive's common operations.

Keven He 2022-05-14 14:33:17


DDL Data definition

Create database

create database db_hive;
-- IF NOT EXISTS avoids an error if the database already exists
create database if not exists db_hive;
-- specify the HDFS storage location
create database db_hive2 location '/db_hive2.db';

Query the database

show databases;
-- filter which databases are shown
show databases like 'db_hive*';
-- display database information
desc database db_hive;
-- show extended database details
desc database extended db_hive;
-- switch the current database
use db_hive;

Modify the database

alter database db_hive set dbproperties('createtime'='20170830');

The ALTER DATABASE command sets key-value pairs in a database's DBPROPERTIES to describe property information for that database. The database's other metadata is immutable, including the database name and the directory location where the database lives.
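To verify that the property took effect, query the extended database details shown earlier:

-- the parameters section of the output should now include createtime=20170830
desc database extended db_hive;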

Delete database

-- delete an empty database
drop database db_hive2;
-- IF EXISTS avoids an error if the database does not exist
drop database if exists db_hive2;
-- if the database is not empty, force deletion with CASCADE
drop database db_hive cascade;

Create table

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]

(1) CREATE TABLE creates a table with the specified name. If a table with the same name already exists, an exception is thrown; the IF NOT EXISTS option can be used to ignore the exception.
(2) The EXTERNAL keyword lets users create an external table, specifying a path to the actual data (LOCATION) while building the table. When Hive creates an internal (managed) table, it moves the data into the path the data warehouse points to; when it creates an external table, it only records where the data lives and makes no change to the data's location. When a table is dropped, an internal table's metadata and data are deleted together, while for an external table only the metadata is deleted and the data is kept.
(3) COMMENT: adds comments to tables and columns.
(4) PARTITIONED BY creates a partitioned table.
(5) CLUSTERED BY creates a bucketed table.
(6) SORTED BY is not commonly used.
(7) ROW FORMAT
DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value,
property_name=property_value, ...)]
Users can customize the SerDe when creating a table, or use the built-in SerDe. If neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the built-in SerDe is used. When building a table you also specify its columns, and when you specify a custom SerDe, Hive uses that SerDe to determine the data of the table's specific columns (see the sketch after this list).
SerDe is short for Serializer/Deserializer; its purpose is serialization and deserialization.
(8) STORED AS specifies the storage file type.
Common storage file types: SEQUENCEFILE (binary sequence file), TEXTFILE (plain text), RCFILE (columnar storage file).
If the file data is plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.
(9) LOCATION: specifies the table's storage location on HDFS.
(10) LIKE lets users copy an existing table structure without copying its data.
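As a sketch of item (7), here is a table whose columns include an array and a map, with all four DELIMITED delimiters spelled out (the table and column names are illustrative, not from the original):

create table complex_demo(
name string,
hobbies array<string>,
scores map<string,int>
)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;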

Managed tables

-- plain create table
create table if not exists student2(
id int, name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/student2';
-- create a table from query results
create table if not exists student3 as select id, name from student;
-- create a table from an existing table's structure
create table if not exists student4 like student;
-- check the table's type
desc formatted student2;

External table

Scenario for internal vs. external tables:

Collected website logs flow into HDFS as text files. Extensive statistical analysis is done on top of the external table (the raw log table); the intermediate tables and result tables it produces are stored in internal tables, and data enters the internal tables via SELECT + INSERT.

Operation: create external tables for departments and employees, and import data into them.

The dept table

create external table if not exists default.dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';

The emp table

create external table if not exists default.emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
row format delimited fields terminated by '\t';

Import the data:

load data local inpath '/opt/module/data/dept.txt' into table default.dept;
load data local inpath '/opt/module/data/emp.txt' into table default.emp;

Query results

select * from emp;
select * from dept;

Converting between managed and external tables

Change the internal table student2 into an external table:

alter table student2 set tblproperties('EXTERNAL'='TRUE');

Change the external table student2 back into an internal table:

alter table student2 set tblproperties('EXTERNAL'='FALSE');

Note: ('EXTERNAL'='TRUE') and ('EXTERNAL'='FALSE') are fixed syntax and are case sensitive!
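To confirm a conversion, desc formatted (used earlier) reports the table type:

-- the "Table Type:" row shows MANAGED_TABLE or EXTERNAL_TABLE
desc formatted student2;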

Partitioned tables

A partitioned table corresponds to a separate folder on the HDFS file system, and that folder holds all the data files of the partition. A partition in Hive is a subdirectory, dividing a large dataset into smaller ones according to business needs. Selecting only the partitions a query needs, via expressions in the WHERE clause, makes queries much more efficient.

Basic partitioned-table operations

Why partitioned tables (logs need to be managed by date):

/user/hive/warehouse/log_partition/20170702/20170702.log
/user/hive/warehouse/log_partition/20170703/20170703.log
/user/hive/warehouse/log_partition/20170704/20170704.log

Syntax for creating a partitioned table:

create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (month string)
row format delimited fields terminated by '\t';

Load data into the partitioned table:

load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201709');
load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201708');
load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201707');

Query data in the partitioned table:

-- single-partition query
select * from dept_partition where month='201709';
-- multi-partition union query
select * from dept_partition where month='201709'
union
select * from dept_partition where month='201708'
union
select * from dept_partition where month='201707';

Add partitions:

-- add a single partition
alter table dept_partition add partition(month='201706');
-- create multiple partitions at once
alter table dept_partition add partition(month='201705') partition(month='201704');

Delete partitions:

-- drop a single partition
alter table dept_partition drop partition (month='201704');
-- drop multiple partitions at once
alter table dept_partition drop partition (month='201705'), partition (month='201706');

See how many partitions the table has:

show partitions dept_partition;
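Since each partition is just an HDFS subdirectory (see the introduction above), the directories can also be inspected directly from the Hive CLI; a sketch, assuming the default warehouse location:

-- expect one subdirectory per partition, e.g. .../dept_partition/month=201709
dfs -ls /user/hive/warehouse/dept_partition;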

Create a two-level partitioned table

create table dept_partition2(
deptno int, dname string, loc string
)
partitioned by (month string, day string)
row format delimited fields terminated by '\t';

Load data into the two-level partitioned table:

load data local inpath '/opt/module/datas/dept.txt' into table
default.dept_partition2 partition(month='201709', day='13');

Query the partitioned data:

select * from dept_partition2 where month='201709' and day='13';

Modify table

Rename a table:

alter table dept_partition2 rename to dept_partition3;

Update a column:

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
-- example:
alter table dept_partition change column deptdesc desc int;

Add and replace columns:

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
-- add columns
alter table dept_partition add columns(deptdesc string);
-- replace all columns
alter table dept_partition replace columns(deptno string, dname string, loc string);

Delete table

drop table dept_partition;

DML Data manipulation

Data import

Loading data into a table (Load)

load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1, ...)];

(1) load data: load data
(2) local: load from the local file system into the Hive table; otherwise, load from HDFS
(3) inpath: the path of the data to load
(4) overwrite: overwrite the data already in the table; otherwise, append to it
(5) into table: which table to load into
(6) student: the specific table
(7) partition: load into the specified partition

Examples:

1. Create a table:
create table student(id string, name string) row format delimited fields terminated by '\t';
2. Load a local file into Hive:
load data local inpath '/opt/module/datas/student.txt' into table default.student;
3. Load an HDFS file into Hive. First upload the file to HDFS:
dfs -put /opt/module/datas/student.txt /user/atguigu/hive;
-- then load the data from HDFS
load data inpath '/user/atguigu/hive/student.txt' into table default.student;
4. Load data, overwriting what is already in the table. Upload the file to HDFS:
dfs -put /opt/module/datas/student.txt /user/atguigu/hive;
-- load data, overwriting the table's existing data
load data inpath '/user/atguigu/hive/student.txt' overwrite into table default.student;

Insert data into a table through a query statement (Insert)

1. Create a partitioned table:
create table student(id int, name string) partitioned by (month string) row format delimited fields terminated by '\t';
2. Basic insert:
insert into table student partition(month='201709') values(1,'wangwu');
3. Basic-mode insert (from a single-table query result):
insert overwrite table student partition(month='201708')
select id, name from student where month='201709';
4. Multi-insert mode (several inserts fed from one table scan):
from student
insert overwrite table student partition(month='201707')
select id, name where month='201709'
insert overwrite table student partition(month='201706')
select id, name where month='201709';

Create a table and load data with a query (As Select)

create table if not exists student3
as select id, name from student;

Specify the data path with LOCATION when creating the table

1. Create the table, specifying its location on HDFS:
create table if not exists student5(
id int, name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student5';
2. Upload the data to HDFS:
dfs -put /opt/module/datas/student.txt /user/hive/warehouse/student5;

Import data into a specified Hive table

import table student2 partition(month='201709') from
'/user/hive/warehouse/export/student';
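Note that import reads only data previously written by export (covered in the export section below), so a minimal round trip looks like this:

-- export the table's data and metadata to HDFS first
export table default.student to '/user/hive/warehouse/export/student';
-- then import it into the target table
import table student2 partition(month='201709') from '/user/hive/warehouse/export/student';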

Export data

Insert export

1. Export the query results to a local directory:
insert overwrite local directory '/opt/module/datas/export/student'
select * from student;
2. Format the query results, then export them to a local directory:
insert overwrite local directory '/opt/module/datas/export/student1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from student;
3. Export the query results to HDFS (no LOCAL keyword):
insert overwrite directory '/user/atguigu/student2'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from student;

Export to local with a Hadoop command

dfs -get /user/hive/warehouse/student/month=201709/000000_0
/opt/module/datas/export/student3.txt;

Export with a Hive shell command

bin/hive -e 'select * from default.student;' >
/opt/module/datas/export/student4.txt;

Export to HDFS (Export)

export table default.student to '/user/hive/warehouse/export/student';

Clear the data in the table (Truncate)

truncate table student;
-- note: TRUNCATE only works on managed (internal) tables; it cannot delete data from external tables

Queries

The basic query (Select…From)

-- full-table query
select * from emp;
-- query specific columns
select empno, ename from emp;

Column alias

select ename AS name, deptno dn from emp;
1. Renames a column.
2. Convenient for calculations.
3. The alias immediately follows the column name; the keyword AS may also be placed between the column name and the alias.

Arithmetic operators

Operator   Description
A+B        A plus B
A-B        A minus B
A*B        A multiplied by B
A/B        A divided by B
A%B        A modulo B
A&B        bitwise AND of A and B
A|B        bitwise OR of A and B
A^B        bitwise XOR of A and B
~A         bitwise NOT of A

Query all employees' salaries plus 1:

select sal +1 from emp;
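The modulo and bitwise operators from the table can be checked the same way; a quick sketch using emp as a row source:

-- returns 1, 3, 7, 4, -8
select 7 % 3, 7 & 3, 7 | 3, 7 ^ 3, ~7 from emp limit 1;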

Common functions

-- 1. total row count (count)
select count(*) cnt from emp;
-- 2. maximum salary (max)
select max(sal) max_sal from emp;
-- 3. minimum salary (min)
select min(sal) min_sal from emp;
-- 4. sum of salaries (sum)
select sum(sal) sum_sal from emp;
-- 5. average salary (avg)
select avg(sal) avg_sal from emp;

Limit clause

A typical query returns many rows of data. The LIMIT clause limits the number of rows returned:

select * from emp limit 5;

Where clause

1. Use the WHERE clause to filter out rows that do not meet the conditions.
2. The WHERE clause immediately follows the FROM clause.
select * from emp where sal > 1000;
Comparison operators, with the data types they support:

A=B (basic data types): returns TRUE if A equals B, otherwise FALSE.
A<=>B (basic data types): returns TRUE if A and B are both NULL and FALSE if exactly one of them is NULL; otherwise it gives the same result as the = operator.
A<>B, A!=B (basic data types): returns NULL if A or B is NULL; TRUE if A is not equal to B, otherwise FALSE.
A<B (basic data types): returns NULL if A or B is NULL; TRUE if A is less than B, otherwise FALSE.
A<=B (basic data types): returns NULL if A or B is NULL; TRUE if A is less than or equal to B, otherwise FALSE.
A>B (basic data types): returns NULL if A or B is NULL; TRUE if A is greater than B, otherwise FALSE.
A>=B (basic data types): returns NULL if A or B is NULL; TRUE if A is greater than or equal to B, otherwise FALSE.
A [NOT] BETWEEN B AND C (basic data types): returns NULL if any of A, B, or C is NULL; TRUE if A is greater than or equal to B and less than or equal to C, otherwise FALSE; NOT reverses the result.
A IS NULL (all data types): returns TRUE if A is NULL, otherwise FALSE.
A IS NOT NULL (all data types): returns TRUE if A is not NULL, otherwise FALSE.
A IN (value1, value2, ...) (all data types): returns TRUE if A equals any value in the list.
A [NOT] LIKE B (STRING types): B is a simple SQL pattern; returns TRUE if A matches it, otherwise FALSE. In B, 'x%' means A must start with 'x', '%x' means A must end with 'x', and '%x%' means A contains 'x' anywhere: at the start, the end, or in the middle of the string. NOT reverses the result.
A RLIKE B, A REGEXP B (STRING types): B is a regular expression; returns TRUE if A matches it, otherwise FALSE. Matching uses the JDK regular-expression interface, so its rules apply; note that the pattern matches if it matches any substring of A, it does not have to match the entire string.

These operators can also be used in JOIN...ON and HAVING clauses.

-- (1) find all employees whose salary equals 5000
select * from emp where sal = 5000;
-- (2) find employees with salary between 500 and 1000
select * from emp where sal between 500 and 1000;
-- (3) find all employees whose comm is null
select * from emp where comm is null;
-- (4) find employees whose salary is 1500 or 5000
select * from emp where sal IN (1500, 5000);

Like and RLike

1) Use the LIKE operator to select similar values.
2) The selection criteria can contain characters or numbers: % represents zero or more (arbitrary) characters; _ represents exactly one character.
3) The RLIKE clause is a Hive extension of this functionality; it lets you specify matching conditions with the more powerful Java regular-expression language.

-- (1) find employees whose salary starts with 2
select * from emp where sal LIKE '2%';
-- (2) find employees whose salary has 2 as its second digit
select * from emp where sal LIKE '_2%';
-- (3) find employees whose salary contains a 2
select * from emp where sal RLIKE '[2]';

Logical operators (And/Or/Not)

Operator  Meaning
AND       logical and
OR        logical or
NOT       logical not
-- (1) employees with salary greater than 1000 in department 30
select * from emp where sal>1000 and deptno=30;
-- (2) employees with salary greater than 1000, or in department 30
select * from emp where sal>1000 or deptno=30;
-- (3) employees outside departments 20 and 30
select * from emp where deptno not IN(30, 20);

Group By clause

GROUP BY statements are usually used with aggregate functions: group by one or more columns, then run an aggregate operation on each group.

-- (1) the average salary of each department in the emp table
select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;
-- (2) the highest salary for each job in each department
select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;

Having clause

Differences between having and where

(1) where operates on the table's columns to select data; having operates on the query result's columns to filter data.
(2) Aggregate functions cannot be used after where; they can be used after having.
(3) having is only used in group by aggregation statements.

-- (1) find the departments whose average salary is greater than 2000
-- first, the average salary of each department
select deptno, avg(sal) from emp group by deptno;
-- then keep only departments whose average salary is greater than 2000
select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

Join statements

Equi-join

Hive supports the usual SQL JOIN statements, but only equi-joins; non-equi-joins are not supported.

select e.empno, e.ename, d.deptno, d.dname from emp e join dept d
on e.deptno = d.deptno;

Table aliases

(1) Using aliases simplifies queries.

(2) Prefixing columns with the table alias improves execution efficiency.

select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno
= d.deptno;

Inner join

Inner join: only data matching the join condition in both joined tables is returned.

select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno
= d.deptno;

Left outer join

All records from the table on the left of the JOIN operator that satisfy the WHERE clause are returned.

 select e.empno, e.ename, d.deptno from emp e left join dept d on e.deptno = d.deptno;

Right outer join

All records from the table on the right of the JOIN operator that satisfy the WHERE clause are returned.

select e.empno, e.ename, d.deptno from emp e right join dept d on e.deptno = d.deptno;

Full outer join

Returns all records from both tables that satisfy the conditions of the WHERE clause. If the specified field of either table has no matching value, NULL is used instead.

select e.empno, e.ename, d.deptno from emp e full join dept d on e.deptno = d.deptno;

The Cartesian product

It is produced when:

(1) the join condition is omitted
(2) the join condition is invalid
(3) all rows in all tables are joined with each other

select empno, dname from emp, dept;

OR is not supported in join predicates

select
e.empno,
e.ename,
d.deptno
from
emp e
join
dept d
on
e.deptno=d.deptno or e.ename=d.dname;

FAILED: SemanticException [Error 10019]: Line 10:3 OR not supported in JOIN currently 'dname'
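A common workaround (a sketch, not from the original; it assumes Hive 1.2+, where UNION performs a distinct) is to split the OR into two equi-joins and merge them with union:

select e.empno, e.ename, d.deptno
from emp e join dept d on e.deptno = d.deptno
union
select e.empno, e.ename, d.deptno
from emp e join dept d on e.ename = d.dname;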

Sort

Global ordering (Order By)

Order By: global ordering, done by a single Reducer.

1. Sort with the ORDER BY clause:

ASC (ascend): ascending, the default
DESC (descend): descending

2. The ORDER BY clause goes at the end of the SELECT statement.

1) Query employee information in ascending order of salary

select * from emp order by sal;

2) Query employee information in descending order of salary

select * from emp order by sal desc;

Partition sorting (Distribute By)

Distribute By: similar to partition in MR; it partitions the data and is used together with sort by.
Note that Hive requires the DISTRIBUTE BY statement to come before the SORT BY statement.

set mapreduce.job.reduces=3;
-- first partition by department number, then sort by employee number in descending order
insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;

Cluster By

When the distribute by and sort by fields are the same, cluster by can be used instead.

cluster by has the functionality of both distribute by and sort by. However, it sorts only in ascending order; the sort order cannot be specified as ASC or DESC.

select * from emp cluster by deptno;
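The statement above is equivalent to spelling out the distribution and sort explicitly:

select * from emp distribute by deptno sort by deptno;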

Empty field assignment

NVL: assigns a value to data that is NULL. Its format is NVL(string1, replace_with): if string1 is NULL, the NVL function returns the value of replace_with, otherwise it returns the value of string1; if both parameters are NULL, it returns NULL.

1) If an employee's comm is NULL, use -1 instead:

select nvl(comm,-1) from emp;

2) If an employee's comm is NULL, use the manager id (mgr) instead:

select nvl(comm,mgr) from emp;

Date functions

1) date_format: format a date

select date_format('2019-06-29','yyyy-MM-dd');

2) date_add: add days to a date

select date_add('2019-06-29',5);
select date_add('2019-06-29',-5);

3) date_sub: subtract days from a date

select date_sub('2019-06-29',5);
select date_sub('2019-06-29 12:12:12',5);
select date_sub('2019-06-29',-5);

4) datediff: the difference in days between two dates

select datediff('2019-06-29','2019-06-24');
select datediff('2019-06-24','2019-06-29');
select datediff('2019-06-24 12:12:12','2019-06-29');
select datediff('2019-06-24 12:12:12','2019-06-29 13:13:13');

CASE WHEN

1) Data preparation

name      dept_id  sex
Wukong    A        male
Dahai     A        male
Songsong  B        male
Fengjie   A        female
Tingjie   B        female
Tingting  B        female

2) Requirement: count the number of men and women in each department. The expected result:

A 2 1

B 1 2

3) Create a local file emp_sex.txt and add the data.

4) Create the Hive table and import the data:

create table emp_sex(
name string,
dept_id string,
sex string)
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/data/emp_sex.txt' into table emp_sex;

5) Query the data as required:

select
dept_id,
sum(case sex when 'male' then 1 else 0 end) male_count,
sum(case sex when 'female' then 1 else 0 end) female_count
from
emp_sex
group by
dept_id;

Rows to columns

CONCAT(string A/col, string B/col, ...): returns the concatenation of the input strings; any number of input strings is supported;

CONCAT_WS(separator, str1, str2, ...): a special form of CONCAT(). The first parameter is the separator placed between the remaining parameters. The separator can be any string of the same type as the remaining arguments. If the separator is NULL, the return value is also NULL. The function skips any NULL and empty strings after the separator argument; the separator is added between the strings being concatenated;

COLLECT_SET(col): accepts only basic data types; its main purpose is to deduplicate and aggregate the values of a field, producing a field of array type.
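Before the full example, a quick sketch of the three functions in isolation (assuming a Hive version that allows SELECT without FROM; collect_set needs rows to aggregate, so it runs against emp):

select concat('Aries', ',', 'A'); -- 'Aries,A'
select concat_ws('|', 'Dahai', 'Fengjie'); -- 'Dahai|Fengjie'
select collect_set(deptno) from emp; -- e.g. [10,20,30], duplicates removed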

1) Data preparation:

name        constellation  blood_type
Sun Wukong  Aries          A
Dahai       Sagittarius    A
Songsong    Aries          B
Zhu Bajie   Aries          A
Fengjie     Sagittarius    A

2) Requirement: group people who share a zodiac sign and blood type. The expected result:

Sagittarius,A  Dahai|Fengjie
Aries,A        Sun Wukong|Zhu Bajie
Aries,B        Songsong

3) Create a local file constellation.txt and import the data:

vim constellation.txt
Sun Wukong Aries A
Dahai Sagittarius A
Songsong Aries B
Zhu Bajie Aries A
Fengjie Sagittarius A

4) Create the Hive table and import the data:

create table person_info(
name string,
constellation string,
blood_type string)
row format delimited fields terminated by "\t";
load data local inpath "/opt/module/data/person_info.txt" into table person_info;

5) Query the data as required:

select
t1.base,
concat_ws('|', collect_set(t1.name)) name
from
(select
name,
concat(constellation, ",", blood_type) base
from
person_info) t1
group by
t1.base;

Columns to rows

EXPLODE(col): splits a complex array or map structure in a Hive column into multiple rows.

LATERAL VIEW

Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias

Explanation: used together with UDTFs such as split and explode; it splits one column of data into multiple rows, and the split data can then be aggregated.
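A minimal sketch of explode on its own, assuming Hive 0.13+ (which allows SELECT without FROM):

-- produces three rows: suspense / action / sci-fi
select explode(split('suspense,action,sci-fi', ',')) as category;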

1) Data preparation

movie                  category
《Person of Interest》   suspense,action,sci-fi,drama
《Lie to me》            suspense,crime,action,psychology,drama
《Wolf Warrior 2》       war,action,disaster

2) Requirement: expand the array data in the movie category column. The expected result:

《Person of Interest》 suspense
《Person of Interest》 action
《Person of Interest》 sci-fi
《Person of Interest》 drama
《Lie to me》 suspense
《Lie to me》 crime
《Lie to me》 action
《Lie to me》 psychology
《Lie to me》 drama
《Wolf Warrior 2》 war
《Wolf Warrior 2》 action
《Wolf Warrior 2》 disaster

3) Create a local file movie.txt and import the data:

《Person of Interest》 suspense,action,sci-fi,drama
《Lie to me》 suspense,crime,action,psychology,drama
《Wolf Warrior 2》 war,action,disaster

4) Create the Hive table and import the data:

create table movie_info(
movie string,
category array<string>)
row format delimited fields terminated by "\t"
collection items terminated by ",";
load data local inpath "/opt/module/datas/movie.txt" into table movie_info;

5) Query the data as required:

select
movie,
category_name
from
movie_info lateral view explode(category) table_tmp as category_name;
Copyright notice: this article was created by [Keven He]; please include a link to the original when reposting. Thanks. https://javamana.com/2022/134/202205141423370678.html