A Summary of Hive's Compression and Storage

Keven He 2022-05-14 14:33:28



Compression codecs supported by MapReduce

| Compression format | Tool | Algorithm | File extension | Splittable |
| --- | --- | --- | --- | --- |
| DEFLATE | none | DEFLATE | .deflate | no |
| Gzip | gzip | DEFLATE | .gz | no |
| bzip2 | bzip2 | bzip2 | .bz2 | yes |
| LZO | lzop | LZO | .lzo | yes (if indexed) |
| Snappy | none | Snappy | .snappy | no |

To support these compression/decompression algorithms, Hadoop introduces codecs (coder/decoders):

| Compression format | Corresponding codec |
| --- | --- |
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| Gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | com.hadoop.compression.lzo.LzopCodec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |

Comparing compression performance: bzip2 produces the smallest compressed files but has the slowest compression and decompression speeds; LZO and Snappy compress and decompress fastest but achieve a lower compression ratio; gzip falls between the two.

Compression parameter configuration :

To enable compression in Hadoop, the following parameters can be configured (in mapred-site.xml):

| Parameter | Default | Stage | Recommendation |
| --- | --- | --- | --- |
| io.compression.codecs (configured in core-site.xml) | org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec | Input compression | Hadoop uses the file extension to determine whether a codec is supported |
| mapreduce.map.output.compress | false | Map output | Set to true to enable compression |
| mapreduce.map.output.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | Map output | Use LZO, LZ4, or Snappy to compress data at this stage |
| mapreduce.output.fileoutputformat.compress | false | Reducer output | Set to true to enable compression |
| mapreduce.output.fileoutputformat.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | Reducer output | Use a standard tool or codec, such as gzip or bzip2 |
| mapreduce.output.fileoutputformat.compress.type | RECORD | Reducer output | Compression type used for SequenceFile output: NONE or BLOCK |

Enabling compression of the Map output stage

Enabling compression of the map output reduces the amount of data transferred between the map and reduce tasks of a job.

Example:

1. Enable Hive's intermediate data compression:
hive (default)> set hive.exec.compress.intermediate=true;
2. Enable map output compression in MapReduce:
hive (default)> set mapreduce.map.output.compress=true;
3. Set the codec for map output compression (Snappy, for example):
hive (default)> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4. Run a query:
hive (default)> select count(ename) name from emp;

Enabling compression of the Reduce output stage

When Hive writes output to a table, that output can also be compressed. The property hive.exec.compress.output controls this feature.

Example:

1. Enable compression of Hive's final output:
hive (default)> set hive.exec.compress.output=true;
2. Enable compression of the final MapReduce output:
hive (default)> set mapreduce.output.fileoutputformat.compress=true;
3. Set the compression codec for the final output (Snappy, for example):
hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4. Set the final output to block compression:
hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
5. Test whether the output is a compressed file:
hive (default)> insert overwrite local directory
'/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;

The files written to /opt/module/datas/distribute-result should then carry the codec's extension (.snappy in this case).


File storage formats

The main file storage formats Hive supports are TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.

Column and row storage

(Figure: layout of row-oriented versus column-oriented storage.)

1. Characteristics of row storage

When a query needs an entire row that satisfies some condition, a column store has to visit each column's storage region separately to assemble the row, whereas a row store only has to locate one value and the rest of the row sits adjacent to it. Row storage is therefore faster for this kind of query.

2. Characteristics of column storage

Because the data of each column is stored together, a query that touches only a few columns reads far less data. And since all values in a column share the same data type, a column store can use compression algorithms tailored to each column.

Note:

TEXTFILE and SEQUENCEFILE are row-oriented formats;
ORC and PARQUET are column-oriented formats.
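
To make the two access patterns concrete, here are two query shapes against the emp table used elsewhere in this article (the literal employee id 7369 is illustrative):

1. A query that favors row storage, because the whole row is fetched and its fields sit side by side on disk:
hive (default)> select * from emp where empno = 7369;
2. A query that favors column storage, because only two of the table's columns are read at all:
hive (default)> select deptno, count(ename) from emp group by deptno;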

  • TextFile format:
    This is Hive's default format. Data is stored uncompressed, so disk overhead and parsing cost are high. It can be combined with Gzip or Bzip2 compression, but with Gzip Hive cannot split the file, so the data cannot be processed in parallel.
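
For instance, a plain-text table can be declared explicitly (a minimal sketch; the table name and columns are illustrative):

hive (default)> create table log_text (track_time string, url string)
row format delimited fields terminated by '\t'
stored as textfile;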

  • Orc format:
    Each Orc file consists of one or more stripes, each about 250 MB by default. A stripe is roughly equivalent to Parquet's RowGroup concept, but its larger size (250 MB versus 4 MB) should improve the throughput of sequential reads. Each stripe has three parts: Index Data, Row Data, and Stripe Footer:
(Figure: structure of an Orc file: stripes containing Index Data, Row Data, and Stripe Footer.)

1) Index Data: a lightweight index; by default one index entry is created every 10,000 rows. The index only records the offset of each field of a given row within the Row Data.
2) Row Data: the actual data. Orc takes a batch of rows and stores them column by column; each column is encoded and split into multiple streams for storage.
3) Stripe Footer: the type and length information of each stream.

Each file has a File Footer recording the number of rows in each stripe, the data type of each column, and so on. The very end of the file is a PostScript, which records the compression type of the whole file and the length of the File Footer. To read a file, a reader seeks to the end, reads the PostScript, obtains the File Footer length from it, reads the File Footer, parses out the metadata of each stripe, and then reads the stripes themselves, working from the back of the file forward.
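
To store a Hive table as ORC, declare the format in the DDL (a minimal sketch with illustrative names; ORC compresses with ZLIB by default):

hive (default)> create table log_orc (track_time string, url string)
stored as orc;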

  • Parquet format

Parquet is a columnar storage format designed for analytic workloads. Parquet files are stored in binary form and cannot be read directly; each file contains both the data and the file's metadata, so Parquet files are self-describing.

When writing Parquet data, the row group size is usually set according to the HDFS block size. Since each Mapper task generally processes one block as its minimum unit of data, each row group can then be handled by one Mapper task, which increases the parallelism of task execution.

Parquet file format: (Figure: layout of a Parquet file: magic number, row groups with column chunks and pages, and footer metadata.)

A file can contain multiple row groups. The file begins with its magic number, used to verify that it is a Parquet file. The footer length records the size of the file metadata; from this value and the file length, the offset of the metadata can be computed. The file metadata contains the metadata of every row group plus the schema of the stored data. Besides the per-row-group metadata in the footer, each page begins with its own metadata. Parquet has three page types: data pages, dictionary pages, and index pages. A data page stores the values of a column in the current row group; a dictionary page stores the encoding dictionary for a column's values, with at most one dictionary page per column chunk; an index page is meant to store an index of the column within the current row group, although index pages are not yet supported in Parquet.
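
A Parquet table is declared the same way (again a minimal sketch with illustrative names):

hive (default)> create table log_parquet (track_time string, url string)
stored as parquet;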

Compression ratio of the storage formats: ORC > Parquet > TextFile.
Query speed of the storage formats: all three are similar.

Combining storage and compression
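
A format's own compression codec is chosen through table properties in the DDL. A minimal sketch, assuming illustrative table and column names (ORC accepts "orc.compress" values NONE, ZLIB, and SNAPPY):

hive (default)> create table log_orc_snappy (track_time string, url string)
stored as orc
tblproperties ("orc.compress"="SNAPPY");

Loading the same data into tables that differ only in this property makes it easy to measure the ratio/speed trade-off for yourself.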

Storage and compression summary


In real-world project development, the storage format for Hive tables is generally ORC or Parquet, and the compression codec is generally Snappy or LZO.

Copyright notice: this article was written by [Keven He]. Please include a link to the original when reposting. Thank you.