Cloudera has laid out basic principles for data compression. Beyond those principles: which compression formats are actually used, and why these rather than others? The choice comes down to a few main considerations, which we will walk through format by format.
SequenceFile is a flat file format that Hadoop uses to store binary [Key, Value] pairs. You can think of a SequenceFile as a container: packing many small files into a SequenceFile lets you store and process them efficiently. Records in a SequenceFile are not sorted by Key, and the inner Writer class provides an append method. The Key and Value in a SequenceFile can be any Writable type, including custom Writables.
Structurally, a SequenceFile consists of a Header followed by one or more Records. The Header holds the key class name, the value class name, the compression algorithm, user-defined metadata, and so on, plus sync markers used to quickly seek to record boundaries. Each Record is stored as a key-value pair whose byte layout parses into the record length, the key length, the key, and the value; the structure of the value portion depends on whether the record is compressed.
SequenceFile supports three record storage modes: uncompressed, record-compressed (only the value of each record is compressed), and block-compressed (multiple records are compressed together).
However, SequenceFile only supports Java. It is generally used as a container for small files, preventing masses of small files from consuming too much NameNode memory, since the NameNode stores the metadata describing where each file lives on the DataNodes.
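Below is a minimal sketch of this small-file-container pattern, assuming Hadoop's Java client is on the classpath; the output path and input file names are hypothetical. The BLOCK compression option used here is one of the three storage modes mentioned above:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class PackSmallFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // One SequenceFile as a container: file name -> file bytes.
    // CompressionType.BLOCK groups many records before compressing
    // (the other modes are NONE and RECORD).
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("packed.seq")),      // hypothetical path
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        SequenceFile.Writer.compression(
            SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
      for (String name : new String[] {"a.log", "b.log"}) {    // hypothetical files
        byte[] bytes = Files.readAllBytes(Paths.get(name));
        writer.append(new Text(name), new BytesWritable(bytes));
      }
    }
  }
}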
Serialization is the process of converting data into a byte stream, mainly for remote transmission or storage. Hadoop's primary serialization format is Writables, but Writables only support the Java language, which is why formats such as Thrift and Avro appeared later.
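To make this concrete, here is a minimal sketch of a custom Writable; the PageView type and its fields are invented for illustration, while the Writable interface itself is Hadoop's standard org.apache.hadoop.io.Writable:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hadoop serializes a Writable by calling write()/readFields() directly
// against a raw byte stream -- compact and fast, but Java-only.
public class PageView implements Writable {
  private String url;   // illustrative fields, not from the article
  private long hits;

  public PageView() {}  // Writables need a no-arg constructor for reflection

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(url);
    out.writeLong(hits);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    url = in.readUTF();
    hits = in.readLong();
  }
}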
Thrift is a framework developed by Facebook to provide cross-language services and interfaces for cross-platform communication. However, Thrift files do not support splitting, and Thrift lacks native MapReduce support, so we can set this format aside.
Avro is a subproject of Hadoop and also a standalone Apache project. It is high-performance middleware based on binary data transmission. Other Hadoop projects, such as HBase and Hive, also use it for data transfer between client and server.
Avro is a language-independent data serialization system that emerged mainly to remedy Writables' lack of cross-language portability. Avro stores the schema in the file header, so every file is self-describing, and Avro also supports schema evolution: the schema used to read a file does not have to match the schema used to write it, so when new requirements arise you can simply add new fields to the schema.
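Here is a minimal sketch of schema evolution with Avro's Java API, assuming the org.apache.avro dependency is on the classpath; the User schema, its fields, and the file path are made up for illustration:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroEvolutionDemo {
  public static void main(String[] args) throws IOException {
    // Writer's schema (v1): just a name field.
    Schema v1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"}]}");
    // Reader's schema (v2): adds an age field with a default,
    // so files written with v1 can still be read.
    Schema v2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\",\"default\":0}]}");

    File file = new File("users.avro");  // hypothetical path
    try (DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(v1))) {
      writer.create(v1, file);  // v1 schema is embedded in the file header
      GenericRecord rec = new GenericData.Record(v1);
      rec.put("name", "alice");
      writer.append(rec);
    }

    // Read the v1 file with the evolved v2 schema; "age" falls back to its default.
    try (DataFileReader<GenericRecord> reader =
        new DataFileReader<>(file, new GenericDatumReader<GenericRecord>(v2))) {
      for (GenericRecord user : reader) {
        System.out.println(user.get("name") + " is " + user.get("age"));
      }
    }
  }
}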
ORC (Optimized Row Columnar) is a columnar storage format in the Hadoop ecosystem. It appeared in early 2013, originating in Apache Hive as a way to reduce Hadoop storage space and speed up Hive queries. Like Parquet, it is not a pure columnar format: the table is first split into row groups, and data is stored column by column within each row group. ORC files are self-describing, their metadata is serialized with Protocol Buffers, and the data in the file is compressed as aggressively as possible to cut storage consumption. ORC is currently supported by query engines such as Spark SQL and Presto. In 2015 the ORC project was promoted by the Apache foundation to a top-level Apache project. This design gives ORC clear advantages in both storage efficiency and query speed.
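As a rough sketch of the row-group/columnar layout in practice, here is a small write example using the orc-core Java API; it assumes the orc-core dependency, and the schema, values, and path are invented for illustration:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class WriteOrc {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    TypeDescription schema =
        TypeDescription.fromString("struct<name:string,hits:bigint>");
    Writer writer = OrcFile.createWriter(new Path("events.orc"),  // hypothetical
        OrcFile.writerOptions(conf).setSchema(schema));
    // Rows are appended as column vectors, matching ORC's
    // column-within-row-group storage model.
    VectorizedRowBatch batch = schema.createRowBatch();
    BytesColumnVector name = (BytesColumnVector) batch.cols[0];
    LongColumnVector hits = (LongColumnVector) batch.cols[1];
    int row = batch.size++;
    name.setVal(row, "home".getBytes(StandardCharsets.UTF_8));
    hits.vector[row] = 42;
    writer.addRowBatch(batch);
    writer.close();
  }
}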
On to the compression formats themselves. gzip. Advantages: high compression ratio with reasonably fast compression/decompression; supported by Hadoop out of the box, so applications handle gzip files exactly as they would plain text; has a hadoop native library; most Linux systems ship with the gzip command, making it easy to use. Disadvantages: does not support splitting. Use cases: consider gzip when each file compresses to within 130 MB (one block size), for example compressing a day's or an hour's logs into one gzip file and letting MapReduce run concurrently across multiple gzip files. Hive programs, streaming programs, and Java MapReduce programs process it exactly like plain text, so nothing in the original program needs to change after compression.
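For example, here is a small sketch of turning on gzip compression for a job's final output; the setCompressOutput/setOutputCompressorClass calls are Hadoop's standard FileOutputFormat API, while the job name is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzipOutputConfig {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "daily-log-rollup");  // hypothetical name
    // Downstream tools read the gzipped part files as if they were
    // plain text, but each output file is not splittable.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    return job;
  }
}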
lzo. Advantages: fast compression/decompression with a reasonable compression ratio; supports splitting, making it the most popular compression format in Hadoop; supports the hadoop native library; the lzop command can be installed on Linux, making it easy to use. Disadvantages: compression ratio somewhat lower than gzip; not supported by Hadoop itself, so it must be installed; lzo files need special handling in applications (an index must be built to support splitting, and the inputformat must be set to the lzo format). Use cases: large text files that are still over 200 MB after compression; the larger the single file, the more pronounced lzo's advantage.
snappy. Advantages: very fast compression with a reasonable compression ratio; supports the hadoop native library. Disadvantages: does not support splitting; compression ratio lower than gzip; not supported by Hadoop itself, so it must be installed; Linux has no corresponding command. Use cases: compressing the intermediate data flowing from map to reduce when a MapReduce job's map output is large; or compressing the output of one MapReduce job that serves as the input of another.
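A minimal sketch of enabling snappy for the map-to-reduce intermediate data; the property names are Hadoop 2's standard mapreduce.* keys, and this assumes the native snappy library is installed on the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SnappyMapOutput {
  // Compresses only the intermediate shuffle data between map and reduce;
  // the job's final output format is unaffected.
  public static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);
    return conf;
  }
}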
bzip2. Advantages: supports splitting; has a high compression ratio, higher than gzip's; supported by Hadoop itself, though without native library support; Linux ships with the bzip2 command, making it easy to use. Disadvantages: slow compression/decompression; no native support. Use cases: when speed matters little but a high compression ratio is needed, for example as the output format of a MapReduce job; when output is large and the processed data must be compressed and archived to save disk space while being rarely used afterward; or when a single large text file must be compressed to save space while still supporting splits and remaining compatible with existing applications (no application changes required).
Finally, a summary comparison of these four compression formats:
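Format  | Splittable        | Compression ratio    | Speed   | Hadoop support
gzip    | no                | high                 | fast    | built in, with native library
lzo     | yes (needs index) | slightly below gzip  | fast    | requires install, native library
snappy  | no                | below gzip           | fastest | requires install, native library
bzip2   | yes               | highest of the four  | slow    | built in, no native library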
Source: the WeChat official account "Big data is fun" (havefun_bigdata); originally published 2021-01-11.