A Summary of Hive Compression and Storage

Keven He · 2022-05-14 14:33:28


Compression

Compression codecs supported by MapReduce

| Compression format | Tool | Algorithm | File extension | Splittable? |
| --- | --- | --- | --- | --- |
| DEFLATE | none | DEFLATE | .deflate | No |
| Gzip | gzip | DEFLATE | .gz | No |
| bzip2 | bzip2 | bzip2 | .bz2 | Yes |
| LZO | lzop | LZO | .lzo | Yes (with an index) |
| Snappy | none | Snappy | .snappy | No |

To support multiple compression/decompression algorithms, Hadoop provides codecs (coder/decoders):

| Compression format | Codec |
| --- | --- |
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | com.hadoop.compression.lzo.LzopCodec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |
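
To see which codecs a given cluster actually registers, you can print the property from a Hive session. A minimal sketch (if io.compression.codecs is not set in core-site.xml, Hive simply reports it as undefined):

hive (default)> set io.compression.codecs;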

Compression performance comparison:

| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
| --- | --- | --- | --- | --- |
| gzip | 8.3 GB | 1.8 GB | 17.5 MB/s | 58 MB/s |
| bzip2 | 8.3 GB | 1.1 GB | 2.4 MB/s | 9.5 MB/s |
| LZO | 8.3 GB | 2.9 GB | 49.3 MB/s | 74.6 MB/s |

Compression parameter configuration:

To enable compression in Hadoop, the following parameters can be configured (in mapred-site.xml):

| Parameter | Default | Stage | Recommendation |
| --- | --- | --- | --- |
| io.compression.codecs (configured in core-site.xml) | org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec | Input compression | Hadoop uses the file extension to determine whether a codec is supported |
| mapreduce.map.output.compress | false | Mapper output | Set to true to enable compression |
| mapreduce.map.output.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | Mapper output | Use LZO, LZ4, or Snappy to compress data at this stage |
| mapreduce.output.fileoutputformat.compress | false | Reducer output | Set to true to enable compression |
| mapreduce.output.fileoutputformat.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | Reducer output | Use a standard tool or codec, such as gzip or bzip2 |
| mapreduce.output.fileoutputformat.compress.type | RECORD | Reducer output | Compression type for SequenceFile output: NONE, RECORD, or BLOCK |
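
For cluster-wide defaults, these properties live in mapred-site.xml. A minimal sketch enabling Snappy for map output (the codec choice here is only an example; per-session overrides from the Hive CLI are shown below):

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>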

Enabling compression at the map output stage

Enabling map output compression reduces the volume of data transferred between the map and reduce tasks of a job.

Example:

1. Enable Hive intermediate data compression:
hive (default)> set hive.exec.compress.intermediate=true;
2. Enable map output compression in MapReduce:
hive (default)> set mapreduce.map.output.compress=true;
3. Set the codec for map output compression:
hive (default)> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4. Run a query:
hive (default)> select count(ename) name from emp;

Enabling compression at the reduce output stage

When Hive writes output to a table, that output can also be compressed. The property hive.exec.compress.output controls this behavior.

Example:

1. Enable compression of Hive's final output:
hive (default)> set hive.exec.compress.output=true;
2. Enable compression of the final MapReduce output:
hive (default)> set mapreduce.output.fileoutputformat.compress=true;
3. Set the codec for the final output:
hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4. Set the final output to block compression:
hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
5. Test whether the output is compressed:
hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;
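
To verify the result, you can list the export directory without leaving the CLI via Hive's ! shell escape. A sketch; with SnappyCodec enabled, the result files should carry the .snappy extension:

hive (default)> ! ls /opt/module/datas/distribute-result;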

Storage

File storage format
The main file storage formats supported by Hive are TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.

Row-oriented and column-oriented storage

[Figure: layout of row-oriented vs. column-oriented storage]

1. Characteristics of row storage

When querying a full row that satisfies some condition, a column store has to visit each clustered field to fetch the value of every column, whereas a row store only needs to locate one value; the remaining values sit in adjacent positions. For this access pattern, row storage queries are faster.

2. Characteristics of column storage

Because the data of each field is stored contiguously, queries that need only a few columns can greatly reduce the amount of data read; and since all values in a column share the same data type, column storage allows better, type-specific compression algorithms to be designed.

Note:

TEXTFILE and SEQUENCEFILE are row-oriented;
ORC and PARQUET are column-oriented.

  • TextFile format:
    This is the default format. Data is not compressed, so disk overhead is high and data parsing is expensive. It can be combined with Gzip or Bzip2, but with Gzip the file cannot be split by Hive, so the data cannot be processed in parallel (see the sketch below).
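
A minimal sketch of the Gzip combination (the table name and file path are hypothetical): a TextFile table can hold pre-compressed files directly, and Hive selects the codec from the file extension at read time:

-- hypothetical single-column table stored as plain text
create table log_text(line string) stored as textfile;
-- load a pre-compressed file; Hive reads it through GzipCodec, but cannot split it
load data local inpath '/tmp/sample.log.gz' into table log_text;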

  • Orc format:
    Each Orc file consists of one or more stripes, each 250 MB by default. A stripe is effectively the RowGroup concept, but its size was raised from 4 MB to 250 MB, which improves the throughput of sequential reads. Each stripe has three parts: Index Data, Row Data, and Stripe Footer:
     [Figure: ORC file structure — stripes composed of Index Data, Row Data, and Stripe Footer]

1) Index Data: a lightweight index; by default an index entry is made every 10,000 rows. The index only records each field's offset within the stripe's Row Data.
2) Row Data: the actual data. A batch of rows is taken and stored column by column; each column is encoded and split into multiple streams for storage.
3) Stripe Footer: the type and length information of each stream.

Each file also has a File Footer, which stores the number of rows in each stripe, the data type of each column, and so on. The very end of the file is a PostScript, which records the compression type of the whole file and the length of the File Footer. When a file is read, the reader seeks to the end of the file, reads the PostScript, obtains the File Footer length from it, reads the File Footer, parses out the per-stripe information from it, and then reads each stripe, working from the back of the file forward.
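
This layout can be observed with the ORC file-dump utility that ships with Hive, which prints the postscript's compression type, the stripe statistics, and the column types (the warehouse path below is hypothetical):

hive --orcfiledump /user/hive/warehouse/log_orc/000000_0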

  • Parquet format:

Parquet is a columnar storage format designed for analytical workloads. Parquet files are stored in binary form and cannot be read directly; each file contains both the data and the metadata for that file, which makes Parquet files self-describing.

When data is stored in Parquet, the row-group size is generally set according to the HDFS block size. Since the smallest unit of data each Mapper task processes is usually one block, each row group can be handled by a single Mapper task, which increases the parallelism of task execution.

Parquet file format: [Figure: Parquet file structure — magic code, row groups, column chunks, pages, and footer]

A file can store multiple row groups. The file starts with the file's magic code, used to verify whether it is a Parquet file. Footer length records the size of the file metadata; from this value and the file length, the offset of the metadata can be computed. The file metadata contains the metadata of every row group and the schema of the data the file stores. Besides the per-row-group metadata in the footer, the beginning of each page stores that page's own metadata. Parquet has three page types: data pages, dictionary pages, and index pages. A data page stores the values of a column in the current row group; a dictionary page stores the encoding dictionary for the column's values (each column chunk contains at most one dictionary page); an index page stores the index of the column within the current row group. Index pages are not yet supported in Parquet.
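
Row groups, column chunks, and the footer schema can be inspected with the parquet-tools utility (a separate tool, not bundled with Hive; the file path is hypothetical):

parquet-tools meta /tmp/log_parquet/000000_0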

Compression ratio of the storage formats: ORC > Parquet > TextFile.
Query speed of the storage formats: roughly similar across the three.

Combining storage and compression

Storage and compression summary

In real project development, the storage format for Hive tables is usually ORC or Parquet, and the compression codec is usually Snappy or LZO.
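
Putting both together, a minimal sketch (the table names and columns are hypothetical; orc.compress and parquet.compression are the table properties that select the codec):

create table log_orc_snappy(
  track_time string,
  url string,
  session_id string
)
row format delimited fields terminated by '\t'
stored as orc
tblproperties("orc.compress"="SNAPPY");

-- the Parquet equivalent
create table log_parquet_snappy(
  track_time string,
  url string,
  session_id string
)
row format delimited fields terminated by '\t'
stored as parquet
tblproperties("parquet.compression"="SNAPPY");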

Copyright notice: this article was written by [Keven He]. Please include a link to the original when reposting: https://javamana.com/2022/134/202205141423370607.html