Keven He 2022-05-14 14:33:28
Compression codecs supported by MapReduce
Compression format | Tool | Algorithm | File extension | Splittable |
---|---|---|---|---|
DEFLATE | none | DEFLATE | .deflate | No |
Gzip | gzip | DEFLATE | .gz | No |
bzip2 | bzip2 | bzip2 | .bz2 | Yes |
LZO | lzop | LZO | .lzo | Yes (if indexed) |
Snappy | none | Snappy | .snappy | No |
To support multiple compression/decompression algorithms, Hadoop provides a set of codecs (coder/decoders):
Compression format | Corresponding codec |
---|---|
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
LZO | com.hadoop.compression.lzo.LzopCodec |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |
Compression performance comparison:
Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
---|---|---|---|---|
gzip | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s |
bzip2 | 8.3GB | 1.1GB | 2.4MB/s | 9.5MB/s |
LZO | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s |
Compression parameter configuration:
To enable compression in Hadoop, configure the following parameters (in mapred-site.xml):
Parameter | Default value | Stage | Recommendation |
---|---|---|---|
io.compression.codecs (configured in core-site.xml) | org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec | Input compression | Hadoop uses the file extension to determine whether a codec is supported |
mapreduce.map.output.compress | false | Mapper output | Set to true to enable compression |
mapreduce.map.output.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | Mapper output | Use LZO, LZ4, or Snappy to compress data at this stage |
mapreduce.output.fileoutputformat.compress | false | Reducer output | Set to true to enable compression |
mapreduce.output.fileoutputformat.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | Reducer output | Use a standard tool or codec, such as gzip or bzip2 |
mapreduce.output.fileoutputformat.compress.type | RECORD | Reducer output | Compression type for SequenceFile output: NONE or BLOCK |
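Before overriding any of these, it is worth checking what the cluster currently uses: in the Hive CLI, `set` with just a property name prints that property's current value. A quick sketch (the values printed depend on your cluster's mapred-site.xml):
hive (default)> set mapreduce.map.output.compress;
hive (default)> set mapreduce.output.fileoutputformat.compress.codec;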
Enabling compression at the Map output stage
Enabling map output compression reduces the amount of data transferred between the map and reduce tasks of a job.
Example:
1. Enable Hive intermediate data compression:
hive (default)> set hive.exec.compress.intermediate=true;
2. Enable map output compression in MapReduce:
hive (default)> set mapreduce.map.output.compress=true;
3. Set the codec for map output compression:
hive (default)> set mapreduce.map.output.compress.codec=
org.apache.hadoop.io.compress.SnappyCodec;
4. Run a query:
hive (default)> select count(ename) name from emp;
Enabling compression at the Reduce output stage
When Hive writes output to a table, that output can also be compressed. The property hive.exec.compress.output controls this behavior.
Example:
1. Enable compression of Hive's final output:
hive (default)> set hive.exec.compress.output=true;
2. Enable compression of the final MapReduce output:
hive (default)> set mapreduce.output.fileoutputformat.compress=true;
3. Set the codec for the final output:
hive (default)> set mapreduce.output.fileoutputformat.compress.codec=
org.apache.hadoop.io.compress.SnappyCodec;
4. Set the final output to block compression:
hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
5. Test whether the output is a compressed file:
hive (default)> insert overwrite local directory
'/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;
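One way to verify step 5 is to list the result directory from inside the CLI with the `!` shell escape; with Snappy output enabled, the result files should carry the .snappy extension:
hive (default)> !ls /opt/module/datas/distribute-result;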
File storage format
The main file storage formats supported by Hive are TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.
Column and row storage
1. Characteristics of row storage
When a query needs an entire row that satisfies some condition, a column store has to visit each column's storage region to collect the value of every field, whereas a row store only needs to locate one value; the rest of the row sits adjacent to it. Row storage is therefore faster for this kind of query.
2. Characteristics of column storage
Because the data for each field is stored together, a query that reads only a few fields touches far less data. And since all values in a column share the same data type, a column store can apply compression algorithms tailored to each column.
Note:
TEXTFILE and SEQUENCEFILE are row-based storage formats;
ORC and PARQUET are column-based storage formats.
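The storage format is declared in the table's DDL with STORED AS. A minimal sketch (table and column names are made up for illustration):
hive (default)> create table log_text(track_time string, url string)
row format delimited fields terminated by '\t'
stored as textfile;
hive (default)> create table log_orc(track_time string, url string)
stored as orc;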
TextFile format:
This is the default format. Data is not compressed, so disk overhead and parsing cost are high. It can be combined with Gzip or Bzip2, but when Gzip is used Hive cannot split the data, so it cannot be processed in parallel.
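Since Hadoop picks the decompression codec from the file extension, a gzipped file can be loaded into a TEXTFILE table directly. A sketch (the path is hypothetical, and log_text is the table from the sketch above); because gzip is not splittable, the whole file will be read by a single mapper:
hive (default)> load data local inpath '/opt/module/datas/log.txt.gz' into table log_text;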
ORC format:
Each ORC file consists of one or more stripes, each about 250MB. A stripe is essentially the row-group concept, but with the size raised from 4MB to 250MB, which should improve sequential read throughput. Each stripe has three parts: Index Data, Row Data, and Stripe Footer:
1) Index Data: a lightweight index; by default an index entry is created every 10,000 rows. The index only records each field's offset within the stripe's Row Data.
2) Row Data: the actual data. A batch of rows is taken and stored column by column; each column is encoded and split into multiple streams for storage.
3) Stripe Footer: the type and length information of each stream.
Each file also has a File Footer, which records the number of rows in each stripe, the data type of each column, and so on. The end of each file is a PostScript, which records the compression type of the whole file and the length of the File Footer. To read a file, a reader seeks to the end, reads the PostScript, obtains the File Footer length from it, reads the File Footer, parses the information about each stripe from it, and then reads the stripes — working from the back of the file toward the front.
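ORC applies its own internal compression on top of the column encodings, chosen per table via TBLPROPERTIES. A minimal sketch (hypothetical table name; ZLIB is ORC's default, with NONE and SNAPPY as the other common values):
hive (default)> create table log_orc_snappy(track_time string, url string)
stored as orc
tblproperties ("orc.compress"="SNAPPY");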
Parquet format:
Parquet is a columnar storage format designed for analytical workloads. Parquet files are binary and cannot be read directly; each file contains both the data and the metadata describing it, so Parquet files are self-describing.
When Parquet data is written, the row group size is generally set according to the HDFS block size. Since the smallest unit of data each Mapper task processes is usually one block, each row group can then be handled by a single Mapper task, which increases the parallelism of job execution.
Parquet file layout:
A file can store multiple row groups. The file begins with the magic code, used to verify that it is a Parquet file. The footer length records the size of the file metadata; from this value and the file length, the offset of the metadata can be computed. The file metadata contains the metadata of every row group plus the schema of the stored data. Besides the per-row-group metadata in the file, the beginning of every page stores that page's own metadata. Parquet has three page types: data pages, dictionary pages, and index pages. A data page stores the values of a column in the current row group; a dictionary page stores the encoding dictionary of that column's values (each column chunk contains at most one dictionary page); an index page stores the index of a column within the current row group. Parquet does not yet support index pages.
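Like ORC, the compression codec of a Hive Parquet table can be set per table in recent Hive versions. A minimal sketch (hypothetical table name; Parquet output is typically uncompressed by default, with SNAPPY and GZIP as common choices):
hive (default)> create table log_parquet_snappy(track_time string, url string)
stored as parquet
tblproperties ("parquet.compression"="SNAPPY");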
Storage file compression ratio summary: ORC > Parquet > TextFile.
Storage file query speed summary: query speeds are roughly similar across formats.
Storage and compression summary
In real-world projects, the storage format for Hive tables is usually ORC or Parquet, and the compression codec is usually Snappy or LZO.
Copyright notice: This article was written by [Keven He]. Please include a link to the original when reposting: https://javamana.com/2022/134/202205141423370607.html