Data storage formats in Hive

Homo sapiens 2021-01-22 18:06:57


The main storage formats Hive supports are: TEXTFILE (row storage), SEQUENCEFILE (row storage), ORC (columnar storage), and PARQUET (columnar storage).

Column and row storage

In the original figure, the left side is the logical table; of the two layouts on the right, the first is row storage and the second is column storage.

Characteristics of row storage: when querying a whole row that satisfies some criteria, row storage only needs to locate one of the values; the rest of the row's values are stored in adjacent locations. Column storage has to go to each column's storage area to collect the value of every field, so row storage is faster for this kind of query.

Characteristics of column storage: because the data for each field is stored together, a query that needs only a few fields can greatly reduce the amount of data read; and since all values in a column share the same data type, compression algorithms can be designed specifically for each column.

TEXTFILE and SEQUENCEFILE are row-based storage formats;

ORC and PARQUET are column-based storage formats.
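The difference between the two layouts can be sketched in a few lines of Python. This is purely illustrative: the field names are borrowed from the example table used later in the experiment, and this is not how Hive physically encodes data.

```python
# Illustrative sketch of row-oriented vs column-oriented layout.
# Field names (track_time, url, session_id) are from the example table.

rows = [
    ("2021-01-22", "/index", "s1"),
    ("2021-01-22", "/login", "s2"),
    ("2021-01-23", "/index", "s3"),
]

# Row storage: all values of one record sit next to each other.
row_layout = [value for record in rows for value in record]

# Column storage: all values of one field sit next to each other.
col_layout = {
    "track_time": [r[0] for r in rows],
    "url":        [r[1] for r in rows],
    "session_id": [r[2] for r in rows],
}

# A query that only needs `url` touches one contiguous list in the
# columnar layout instead of scanning every record.
urls = col_layout["url"]
print(urls)  # ['/index', '/login', '/index']
```

Reading a single full record favors `row_layout` (its values are adjacent); reading one field across many records favors `col_layout`, which is exactly the trade-off described above.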

TEXTFILE Format

TEXTFILE is the default format. Data is not compressed, so disk overhead is high and data parsing is expensive. It can be combined with Gzip or Bzip2 (Hive detects the codec automatically and decompresses during query execution), but data compressed this way cannot be split, so Hive cannot process it in parallel.

ORC Format

ORC (Optimized Row Columnar) is a storage format introduced in Hive 0.11.

Each ORC file consists of one or more stripes, each around 250 MB in size. A stripe is roughly equivalent to the RowGroup concept in Parquet, but much larger (250 MB versus 4 MB), which improves the throughput of sequential reads. Each stripe is made up of three parts: Index Data, Row Data, and Stripe Footer:

An ORC file is divided into several stripes, and each stripe consists of three parts:

<1> Index Data: a lightweight index. By default an index entry is written every 10,000 rows; the index only records each field's offset within the Row Data.
<2> Row Data: the actual data. A batch of rows is taken and stored column by column; each column is encoded and split into multiple streams for storage.
<3> Stripe Footer: the metadata of the stripe.

Each file also has a File Footer, which stores the number of rows in each stripe, the data type of each column, and so on. The very end of the file is a PostScript, which records the compression type of the whole file and the length of the File Footer. To read a file, the reader seeks to the end and reads the PostScript, obtains the File Footer length from it, reads the File Footer, parses the information about each stripe from it, and then reads each stripe. In other words, the file is read from back to front.
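This back-to-front read order can be sketched with a toy byte layout. Note this is illustrative only: the real ORC format uses protobuf-encoded structures and a variable-length PostScript, not the fixed-width fields assumed here.

```python
import struct

# Toy model of how an ORC reader locates its metadata (NOT the real
# ORC encoding): the last byte stores the PostScript length, and the
# PostScript stores the File Footer length, so reading starts at the end.

stripes = b"...stripe bytes..."                   # stand-in for stripe data
file_footer = b"footer"                           # stand-in for File Footer
postscript = struct.pack("<I", len(file_footer))  # records File Footer length
data = stripes + file_footer + postscript + bytes([len(postscript)])

# Read backwards, as an ORC reader does:
ps_len = data[-1]                                 # 1) last byte: PostScript length
ps = data[-1 - ps_len:-1]                         # 2) read the PostScript
footer_len = struct.unpack("<I", ps)[0]           # 3) get File Footer length
footer = data[-1 - ps_len - footer_len:-1 - ps_len]  # 4) read the File Footer
```

From the recovered File Footer a real reader would then locate and read each stripe.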

PARQUET Format

Parquet is a columnar storage format designed for analytical workloads, developed jointly by Twitter and Cloudera; in May 2015 it graduated from the Apache Incubator to become a top-level Apache project. Parquet files are stored in binary form and cannot be read directly; each file contains both the data and the metadata describing it, so Parquet files are self-describing. Usually, when writing Parquet data, the row-group size is set to the HDFS block size: since the smallest unit of data a Mapper task generally processes is one block, each row group can then be handled by one Mapper task, which increases the parallelism of task execution.

A Parquet file can store multiple row groups. The file begins with the magic code, used to verify that it is a Parquet file; a Footer length field records the size of the file metadata, and the offset of the metadata can be computed from this value and the file length. The file metadata contains the metadata of every row group and the schema of the stored data. In addition to the per-row-group metadata, the beginning of each page stores that page's own metadata. Parquet has three types of pages: data pages, dictionary pages, and index pages. A data page stores the values of a column in the current row group; a dictionary page stores the encoding dictionary for that column's values (each column chunk contains at most one dictionary page); an index page is meant to store an index of the column within the current row group, but at the time of writing Parquet does not yet support index pages.
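The footer-based lookup can be sketched as follows. The real Parquet layout does end with a 4-byte little-endian metadata length followed by the `PAR1` magic, but the metadata bytes below are a placeholder, not real Thrift-encoded metadata.

```python
import struct

# Sketch of how a reader locates Parquet file metadata. Layout:
# "PAR1" magic, column data, file metadata, 4-byte little-endian
# metadata length, trailing "PAR1" magic. Contents are placeholders.

MAGIC = b"PAR1"
column_data = b"<row groups>"
metadata = b"<file metadata: row-group info + schema>"
footer_len = struct.pack("<I", len(metadata))
parquet_file = MAGIC + column_data + metadata + footer_len + MAGIC

# Verify the magic, then compute the metadata offset from the
# file length and the recorded footer length:
assert parquet_file[:4] == MAGIC and parquet_file[-4:] == MAGIC
meta_len = struct.unpack("<I", parquet_file[-8:-4])[0]
meta_offset = len(parquet_file) - 8 - meta_len
recovered = parquet_file[meta_offset:meta_offset + meta_len]
```

This is why Parquet files are self-describing: the reader needs nothing beyond the file itself to find the schema and row-group metadata.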

Now that the mainstream file storage formats have been introduced, let's run a comparison experiment on them.

Compression-ratio test of the storage formats:

We use a raw data file of about 19 MB as the example.

TextFile

<1> Create a table whose storage format is TEXTFILE

create table log_text (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

<2> Load the data into the table

load data local inpath '/export/servers/hivedatas/log.data' into table log_text;

<3> Look at the size of the table data

dfs -du -h /user/hive/warehouse/myhive.db/log_text;

The stored data size is 18.1 MB.

ORC

<1> Create a table whose storage format is ORC

create table log_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS ORC;

<2> Load the data into the table

insert into table log_orc select * from log_text;

<3> View the data size in the table

dfs -du -h /user/hive/warehouse/myhive.db/log_orc;

The stored data size is 2.8 MB.

Parquet

<1> Create a table whose storage format is PARQUET

create table log_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET;

<2> Load the data into the table

insert into table log_parquet select * from log_text ;

<3> View the data size in the table

dfs -du -h /user/hive/warehouse/myhive.db/log_parquet;

The stored data size is 13.1 MB.

Compression-ratio summary of the storage formats: ORC > Parquet > TextFile

Query-speed test of the storage formats:

TextFile

hive (default)> select count(*) from log_text;
Result:
_c0
100000
1 row selected (5.97 seconds)
1 row selected (5.754 seconds)

ORC

hive (default)> select count(*) from log_orc;
Result:
_c0
100000
1 row selected (5.967 seconds)
1 row selected (6.761 seconds)

Parquet

hive (default)> select count(*) from log_parquet;
Result:
_c0
100000
1 row selected (6.7 seconds)
1 row selected (6.812 seconds)

Query-speed summary for these runs (fastest to slowest):

TextFile > ORC > Parquet (the differences for this simple count(*) query are small)


Copyright notice
This article was written by [Homo sapiens]; when reposting, please include the original link. Thank you.
https://javamana.com/2021/01/20210122180032279w.html
