Introduction and Use of Hive's Data Compression

Homo sapiens 2021-01-22 18:07:04


In practical work, the data that Hive processes usually needs to be compressed. When we studied Hadoop earlier we already configured compression for Hadoop, and the same applies to Hive: compression saves network bandwidth during MR processing.

Compression codecs supported by MR

| Compression format | Tool | Algorithm | File extension | Splittable |
|---|---|---|---|---|
| DEFAULT | none | DEFAULT | .deflate | no |
| Gzip | gzip | DEFAULT | .gz | no |
| bzip2 | bzip2 | bzip2 | .bz2 | yes |
| LZO | lzop | LZO | .lzo | no |
| LZ4 | none | none | .lz4 | no |
| Snappy | none | Snappy | .snappy | no |

To support this variety of compression and decompression algorithms, Hadoop provides the codecs (coder/decoders) shown in the following table:

| Compression format | Corresponding codec |
|---|---|
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | com.hadoop.compression.lzo.LzopCodec |
| LZ4 | org.apache.hadoop.io.compress.Lz4Codec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |
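If your cluster sets io.compression.codecs in core-site.xml (the property appears again in the parameter table further below), you can print the registered codec classes from the Hive CLI. A minimal sketch:

```sql
-- Prints the value of io.compression.codecs when it is set in core-site.xml:
-- a comma-separated list of codec class names such as
-- org.apache.hadoop.io.compress.GzipCodec.
set io.compression.codecs;
```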

Comparison of compression performance

| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
|---|---|---|---|---|
| gzip | 8.3 GB | 1.8 GB | 17.5 MB/s | 58 MB/s |
| bzip2 | 8.3 GB | 1.1 GB | 2.4 MB/s | 9.5 MB/s |
| LZO | 8.3 GB | 2.9 GB | 49.3 MB/s | 74.6 MB/s |

Do you want to know why I keep singling out Snappy? Take a look at its open-source website: http://google.github.io/snappy/

On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

As you can see, Snappy compresses at about 250 MB/s and decompresses at about 500 MB/s, which far outstrips the algorithms benchmarked above. That is why Snappy is so often chosen as the compression format in enterprise data pipelines.

Next, let's look at how to configure the compression parameters.

Compression parameter configuration

To enable compression in Hadoop, configure the following parameters (in mapred-site.xml):

| Parameter | Default value | Stage | Recommendation |
|---|---|---|---|
| io.compression.codecs (configured in core-site.xml) | org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec | Input compression | Hadoop uses the file extension to determine whether a codec is supported |
| mapreduce.map.output.compress | false | mapper output | Set to true to enable compression |
| mapreduce.map.output.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | mapper output | Compress data at this stage with LZO, LZ4, or Snappy |
| mapreduce.output.fileoutputformat.compress | false | reducer output | Set to true to enable compression |
| mapreduce.output.fileoutputformat.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | reducer output | Use a standard tool or codec such as gzip or bzip2 |
| mapreduce.output.fileoutputformat.compress.type | RECORD | reducer output | Compression type used for SequenceFile output: NONE or BLOCK |
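Before changing any of these parameters from Hive, you can check their current values: running set with just a parameter name prints its value. A minimal sketch using parameters from the table above:

```sql
-- `set <name>;` with no value echoes the parameter's current setting.
set mapreduce.map.output.compress;
set mapreduce.output.fileoutputformat.compress;
set mapreduce.output.fileoutputformat.compress.type;
```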

Enabling Map-output compression

Compressing the map output reduces the amount of data transferred between the map and reduce tasks of a job. The specific configuration is as follows (the four steps are also consolidated into a single script after the list):

Case practice

<1> Enable Hive's intermediate data compression:
hive (default)> set hive.exec.compress.intermediate=true;

<2> Enable map output compression in MapReduce:
hive (default)> set mapreduce.map.output.compress=true;

<3> Set the codec used to compress map output:
hive (default)> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

<4> Run a query:
hive (default)> select count(1) from score;
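A minimal consolidated sketch of the four steps above, runnable as one Hive script (it assumes the score table from step <4> already exists):

```sql
-- Enable compression of Hive's intermediate data.
set hive.exec.compress.intermediate=true;
-- Enable compression of map output in MapReduce.
set mapreduce.map.output.compress=true;
-- Compress map output with Snappy.
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Run a query that launches a MapReduce job, so the settings take effect.
select count(1) from score;
```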

Enabling Reduce-output compression

When Hive writes output to a table, that output can also be compressed. The property hive.exec.compress.output controls this feature. You may want to keep its default value of false in your settings file, so that the default output remains an uncompressed plain-text file, and set it to true only when you want the output compressed.

Case practice

<1> Enable compression of Hive's final output:
hive (default)> set hive.exec.compress.output=true;

<2> Enable compression of the final MapReduce output:
hive (default)> set mapreduce.output.fileoutputformat.compress=true;

<3> Set the codec used to compress the final output:
hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

<4> Set the final output compression type to block compression:
hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;

<5> Test whether the output is a compressed file:
hive (default)> insert overwrite local directory '/export/servers/snappy' select * from score distribute by s_id sort by s_id desc;

These five steps are consolidated into a single script below.
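A minimal consolidated sketch of the five steps above (the local directory /export/servers/snappy and the score table are the ones used in this article; adjust them to your environment). After it runs, the files in that directory should carry the .snappy extension:

```sql
-- Enable compression of Hive's final output.
set hive.exec.compress.output=true;
-- Enable compression of the final MapReduce output.
set mapreduce.output.fileoutputformat.compress=true;
-- Compress the final output with Snappy.
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Use block-level compression for SequenceFile output.
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
-- Write query results to a local directory so the output files can be inspected.
insert overwrite local directory '/export/servers/snappy'
select * from score distribute by s_id sort by s_id desc;
```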

That's all for this post. If you liked it, please follow along (^U^)ノ~YO, more introductory Hive articles are on the way!


Copyright notice
This article was written by [Homo sapiens]. Please include a link to the original when reposting. Thanks.
https://javamana.com/2021/01/20210122180032284j.html
