HDFS classic short answer questions (required for interns!)

Homo sapiens 2021-01-22 16:50:02

Some time ago I finished sharing my 12-part HDFS blog series. As a follow-up, this post collects classic HDFS interview questions. Most of the answers are drawn from those earlier posts, which interested readers can browse on their own.

1. How should we understand "distributed"?

"Distributed" needs to be discussed from two angles: computation and storage. Distributed computing is a computing approach that breaks an application into many small parts and assigns them to multiple computers for processing; this shortens the overall computation time and greatly improves efficiency. Distributed storage is a data storage technique that pools the disk space of many machines over the network into one virtual storage device, so the data ends up spread across many servers throughout the enterprise.
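
The "break into parts, process independently, combine the results" idea can be sketched in a few lines of Java. This is only a single-machine analogy (threads stand in for cluster nodes), not a distributed framework:

```java
import java.util.stream.LongStream;

// Minimal sketch of divide-and-combine: the range is split into chunks
// that are summed independently (here by threads on one machine; a real
// cluster would use separate nodes), then the partial sums are combined.
public class DivideAndCombine {
    static long parallelSum(long n) {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }

    public static void main(String[] args) {
        // Produces the same result as a serial sum, but the work is split up.
        System.out.println(parallelSum(1_000_000));
    }
}
```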

2. What are Hadoop's component parts?

a) HDFS — manager: NameNode; workers: DataNodes; assistant manager: SecondaryNameNode
b) MapReduce
c) YARN — manager: ResourceManager; workers: NodeManagers

3. What is the HDFS replica placement mechanism?

i. The first replica is written to the client's own node (if the client runs on a DataNode; otherwise to a random node in the writer's rack).
ii. The second replica is placed on a node in a different (remote) rack.
iii. The third replica is placed on a different node in the same rack as the second replica.
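
As a toy model of Hadoop's default placement policy (BlockPlacementPolicyDefault: first replica on the writer's node, second on a remote rack, third on another node in that remote rack) — an illustration only, not the real selection code, which also weighs load and randomness:

```java
import java.util.*;

// Toy model of default replica placement for replication factor 3.
public class ReplicaPlacement {
    // nodes maps node name -> rack name; insertion order stands in for
    // the random choices the real policy makes.
    static List<String> place(String writer, Map<String, String> nodes) {
        List<String> replicas = new ArrayList<>();
        replicas.add(writer);                                   // 1st: writer's node
        String localRack = nodes.get(writer);
        String second = nodes.keySet().stream()
                .filter(n -> !nodes.get(n).equals(localRack))   // 2nd: remote rack
                .findFirst().orElseThrow();
        replicas.add(second);
        String remoteRack = nodes.get(second);
        String third = nodes.keySet().stream()                  // 3rd: same rack as 2nd
                .filter(n -> nodes.get(n).equals(remoteRack) && !n.equals(second))
                .findFirst().orElseThrow();
        replicas.add(third);
        return replicas;
    }
}
```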

4. What does the NameNode do?

a) Maintains the namespace (metadata) of the file system it manages.
b) Maintains the mapping from each file's blocks to the specific DataNodes that hold them.
c) Manages the heartbeat information that DataNodes report periodically.

5. What does the DataNode do?

a) Reads and writes block data.
b) Periodically reports heartbeat information to the NameNode (including block reports and checksums). If a DataNode sends no heartbeat for more than about 10 minutes, the NameNode considers that node dead.
c) Performs pipelined replication of data.
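
The "about 10 minutes" threshold in b) is actually derived from two configuration values. With the defaults (heartbeat recheck interval of 5 minutes, heartbeat interval of 3 seconds) it works out to 10 minutes 30 seconds:

```java
// The NameNode declares a DataNode dead after
//   2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
public class HeartbeatTimeout {
    static long timeoutMillis(long recheckIntervalMs, long heartbeatIntervalMs) {
        return 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
    }

    public static void main(String[] args) {
        // Defaults: recheck = 5 min (300000 ms), heartbeat = 3 s (3000 ms)
        // -> 630000 ms = 10 min 30 s
        System.out.println(timeoutMillis(300_000, 3_000) / 1000 + " s");
    }
}
```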

6. What is rack awareness?

Broadly speaking, it means the NameNode learns which rack each node belongs to by reading a mapping we configure (typically a topology script, set via net.topology.script.file.name).

7. When is rack awareness used?

When the NameNode assigns nodes — both when building the write pipeline and when placing HDFS replicas.

8. What is the HDFS write process?

1. The client initiates a file-upload request and establishes communication with the NameNode via RPC. The NameNode checks whether the target file already exists and whether its parent directory exists, then returns whether the upload may proceed.
2. The client asks which DataNode servers should receive the first block.
3. The NameNode allocates DataNodes according to the configured replication factor and the rack-awareness policy, and returns the addresses of usable DataNodes, e.g. A, B, C.
4. The client asks one of the three DataNodes, A, to accept the data (essentially an RPC call that establishes a pipeline). A receives the request and calls B, then B calls C, completing the whole pipeline, after which the acknowledgement travels back step by step to the client.
5. The client starts uploading the first block to A (the data is first read from disk into a local memory cache) in units of packets (64 KB by default). A passes each packet it receives on to B, and B passes it to C; for every packet sent, A places it in a reply queue to await acknowledgement.
6. The data packets travel along the pipeline while acks (acknowledgements of correct receipt) flow back one by one in the opposite direction; finally the first DataNode in the pipeline, A, sends the pipeline ack to the client.
7. When one block finishes transmitting, the client again asks the NameNode where to upload the next block, repeating the steps above.
8. When all blocks are written, the client closes the write stream.
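
The steps above involve two sizes worth remembering: the block size (128 MB by default in Hadoop 2+) and the packet size (64 KB by default). A quick back-of-the-envelope calculation, using ceiling division since the last block or packet may be partial:

```java
// How many blocks does a file occupy, and how many packets does one block
// take to stream through the pipeline? (Defaults: 128 MB blocks, 64 KB packets;
// both are configurable.)
public class WriteMath {
    static long numBlocks(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;       // ceiling division
    }

    static long packetsPerBlock(long blockBytes, long packetBytes) {
        return (blockBytes + packetBytes - 1) / packetBytes;
    }

    public static void main(String[] args) {
        // A 300 MB file -> 3 blocks; one full 128 MB block -> 2048 packets
        System.out.println(numBlocks(300L << 20, 128L << 20));
        System.out.println(packetsPerBlock(128L << 20, 64L << 10));
    }
}
```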

9. What is the HDFS read process?

1. The client calls open() on a FileSystem object to open the file it wants to read.
2. The client issues an RPC request to the NameNode to determine where the file's blocks are located.
3. The NameNode returns some or all of the file's block list as appropriate; for each block it returns the addresses of the DataNodes holding a replica. These returned DataNode addresses are sorted by two rules: nodes closer to the client in the cluster's network topology come first, and DataNodes whose heartbeats have timed out (state STALE) are placed last.
4. The client picks the top-ranked DataNode to read each block from; if the client itself is a DataNode holding a replica, it reads the data locally (the short-circuit read feature).
5. Under the hood this builds a socket stream (FSDataInputStream) and repeatedly calls the parent class DataInputStream's read method until the block's data is fully read.
6. Blocks are read in parallel; a failed read is retried from another replica.
7. After finishing the current block list, if the file is not yet fully read, the client asks the NameNode for the next batch of block locations.
8. Finally the client closes the read stream, and the blocks that were read are assembled into the complete file.
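
The sorting rule in step 3 can be sketched with a comparator: fresh nodes ordered by network distance, stale nodes pushed to the back regardless of distance. A toy model, not the real NetworkTopology code:

```java
import java.util.*;

// Order candidate DataNodes for a read: non-stale first, then nearest first.
public class DnSort {
    record Dn(String name, int distance, boolean stale) {}

    static List<String> order(List<Dn> dns) {
        return dns.stream()
                .sorted(Comparator.comparing(Dn::stale)      // fresh (false) before stale
                        .thenComparingInt(Dn::distance))     // then nearest first
                .map(Dn::name)
                .toList();
    }
}
```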

10.HDFS How to guarantee data integrity ?

After the data is written, the check sum is calculated ,DataNode Check and calculate periodically , Compare the calculation results with the results of the first time . If the same, it means no data loss , If not, it means that the data is lost , Data recovery for lost data . Check the data before reading , Compared with the first results . If it is the same, it means that the data is not lost , Can read . If not, it means data Something is missing . Read to other copies .
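
The verify-on-read idea looks like this in miniature. HDFS actually stores one CRC-32C per 512-byte chunk; the sketch below uses the JDK's plain CRC-32 over a whole buffer, which is a simplification:

```java
import java.util.zip.CRC32;

// Compute a checksum at write time, recompute at read time; a mismatch
// means the data is corrupt and another replica should be read.
public class ChecksumDemo {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    static boolean verify(byte[] data, long expected) {
        return checksum(data) == expected;
    }
}
```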

11. What are HDFS's strengths?

1. Massive data storage: HDFS is scalable and can store files at the PB scale.
2. High fault tolerance: the system remains available when nodes are lost; data is kept in multiple replicas, and lost replicas are re-created automatically. It can be built on cheap machines (compared with midrange and mainframe computers) and scales linearly: as nodes are added, the cluster's storage and computing capacity grow.
3. Large-file storage: HDFS stores data in blocks, splitting a large file into blocks that are stored across the cluster.

12. What are HDFS's weaknesses?

1. No low-latency data access: HDFS is optimized for high-throughput reads of large volumes of data, at the cost of latency.
2. Unsuitable for storing large numbers of small files:
A. The NameNode keeps the file system's metadata in memory, so the total number of files it can store is limited by the NameNode's memory capacity.
B. Each file, directory, and block costs roughly 150 bytes of metadata.
For these two reasons, HDFS is not suited to storing huge numbers of small files.
3. Limited file modification: HDFS is designed for write-once, read-many access; random in-place modification is not supported.
4. Concurrent writes to the same file by multiple users are not supported.
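
Point 2 becomes concrete with a rough estimate based on the ~150-bytes-per-object figure above:

```java
// Rough NameNode heap estimate: each file and each block costs ~150 bytes.
public class NamenodeMemory {
    static long estimateBytes(long files, long blocksPerFile) {
        final long BYTES_PER_OBJECT = 150;
        return (files + files * blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 10 million one-block files -> roughly 3 GB of NameNode heap,
        // regardless of how tiny each file is.
        System.out.println(estimateBytes(10_000_000, 1));
    }
}
```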

13. When does the cluster enter safe mode?

When the cluster starts, it first enters safe mode automatically; you can also enter it manually with the hdfs dfsadmin -safemode enter command.

14. What are the characteristics of safe mode?

a) Clients are not allowed to modify any files: no uploading files, deleting files, renaming, creating folders, and so on.
b) Clients are only allowed to read data.

15. What is the cluster doing while in safe mode?

a) Checking the integrity of data blocks.
b) Merging fsimage and the edits log to restore the state from before the last shutdown.

16. How do you enter / exit safe mode manually?

a) hdfs dfsadmin -safemode enter — enter manually
b) hdfs dfsadmin -safemode leave — exit manually

17. What are the roles of Fsimage and Edits?

a) The fsimage file is a permanent checkpoint of the Hadoop file system's metadata.
b) The edits file records the clients' operations on the cluster since the last checkpoint. Together, these two files can restore the cluster's state from before shutdown!

18. When are Fsimage and Edits used?

a) When the NameNode starts, it loads the contents of fsimage into memory and then replays the operations in edits, bringing the in-memory metadata up to date with reality.
b) The SecondaryNameNode periodically pulls fsimage and edits and merges them to create a new fsimage.
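
Step a) — snapshot plus replayed operation log — can be sketched as follows. A toy model only: the real fsimage holds rich inode metadata, not a bare set of paths:

```java
import java.util.*;

// The fsimage is a snapshot of the namespace; the edit log is a sequence of
// operations replayed on top of it at startup to rebuild current metadata.
public class Replay {
    static Set<String> restore(Set<String> fsimage, List<String[]> edits) {
        Set<String> namespace = new TreeSet<>(fsimage);
        for (String[] op : edits) {             // op = {operation, path}
            switch (op[0]) {
                case "mkdir", "create" -> namespace.add(op[1]);
                case "delete"          -> namespace.remove(op[1]);
            }
        }
        return namespace;
    }
}
```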

19. What is the SecondaryNameNode's working mechanism?

a) The NameNode creates a new edits file, edits.new, to receive subsequent operations.
b) The SNN copies fsimage and edits from the NameNode, loads both files into memory, merges them, and creates a new fsimage.ckpt file.
c) The SNN sends the new fsimage.ckpt back to the NameNode, which renames it to fsimage, replacing the original fsimage; the original edits is replaced by the new edits.new.

20. What is the point of having a SecondaryNameNode?

a) It merges fsimage and edits, keeping the edits log small and speeding up cluster startup.
b) It effectively keeps a backup of fsimage and edits, guarding against loss.

21. What triggers the SecondaryNameNode's work?

1. A time threshold: by default once an hour (dfs.namenode.checkpoint.period = 3600 seconds).
2. A transaction-count threshold: by default after 1,000,000 HDFS operations (dfs.namenode.checkpoint.txns = 1000000).
3. Every 60 seconds the SNN checks whether the 1,000,000-transaction threshold has been reached.
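
These triggers correspond to hdfs-site.xml properties; a sketch showing the default values (normally you only set these when you want to override the defaults):

```xml
<!-- hdfs-site.xml: SecondaryNameNode checkpoint triggers (default values) -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>   <!-- checkpoint at least once an hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>  <!-- ...or after 1,000,000 transactions -->
</property>
<property>
  <name>dfs.namenode.checkpoint.check.period</name>
  <value>60</value>   <!-- poll the transaction count every 60 seconds -->
</property>
```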

22. What is the process for restoring the NameNode from the SNN's Fsimage and Edits?

Go into the SNN's data storage folder -----> copy the latest fsimage and edits to the NameNode, placing them in the NameNode's configured directory -----> restart the cluster.

23. Cluster expansion (1): what preparation does a new node need?

a) Turn off the firewall, disable SELinux, configure passwordless SSH login, configure the IP-to-hostname mapping, change the hostname, and install the JDK.

24. Cluster expansion (2): what is the process of adding the node to the cluster?

b) Create a whitelist file (dfs.hosts), add all nodes to it, and edit hdfs-site.xml to point the dfs.hosts property at that file.
c) Run hdfs dfsadmin -refreshNodes to refresh the NameNode.
d) Run yarn rmadmin -refreshNodes to update the ResourceManager.
e) Edit the slaves file to add the new node's hostname.
f) Start the new node.
g) View the new node's information in the browser (web UI).
h) Run start-balancer.sh to rebalance the load.

25. How do you merge small files?

a) Use the -getmerge command provided by HDFS [HDFS -> local].
b) Traverse the small files, append each to one file, and then upload it [local -> HDFS].
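
Approach b) can be sketched with plain java.nio file operations for the local concatenation step (the final upload would use the HDFS shell or Java API; the file names here are illustrative):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

// Concatenate many small local files into one before uploading to HDFS,
// so the NameNode tracks one file instead of thousands.
public class MergeFiles {
    static void merge(List<Path> smallFiles, Path merged) throws IOException {
        Files.deleteIfExists(merged);
        Files.createFile(merged);
        for (Path p : smallFiles) {
            Files.write(merged, Files.readAllBytes(p), StandardOpenOption.APPEND);
        }
    }
}
```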

26. What is the configuration key that turns on permission control?

a) dfs.permissions (dfs.permissions.enabled in newer Hadoop versions)

27. What is the process for creating a new directory on HDFS with the Java API?

a) Instantiate a Configuration object.
b) Instantiate a FileSystem object for HDFS.
c) Call the FileSystem object's mkdirs() method.
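
The three steps look roughly like this. Note this needs the hadoop-client jars on the classpath, and the NameNode address below is a placeholder — substitute your cluster's fs.defaultFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of creating a directory on HDFS via the Java API.
public class MkdirDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // a) instantiate configuration
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder address
        FileSystem hdfs = FileSystem.get(conf);             // b) instantiate the file system
        boolean ok = hdfs.mkdirs(new Path("/demo/dir"));    // c) call mkdirs()
        System.out.println("created: " + ok);
        hdfs.close();
    }
}
```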

28. The HDFS web interface

(The original post walked through the NameNode web UI here with screenshots, explaining the meaning of each section of the interface.)


This article was written by [Homo sapiens]; when reposting, please include a link to the original. Thank you.
