Linux file system and file cache knowledge

osc_mpdsal 2020-11-11 15:09:00
linux file file cache knowledge

Python The actual combat community Java In the actual combat community, long press to identify the QR code below , Add scanning code according to demand, pay attention to add customer service into Python community ▲ Scan code, pay attention to add customer service into Java community ▲
from :luozhiyun
link :

Linux File system

File system features

  1. The file system should have a strict organization form , Enables files to be stored in blocks .

  2. There should also be an index area in the file system , It is used to find the location where the multiple blocks of a file are stored .

  3. If some files in the file system are hot files , Recently, it is often read and written , The file system should have a cache layer .

  4. Files should be organized in the form of folders , Easy to manage and query .

  5. Linux The kernel maintains a set of data structures in its own memory , To save which files are opened and used by which processes .

    On the whole , The main functions of file system are as follows :

ext Series of file system formats

inode With block storage

The hard disk is divided into units of the same size , We call it blocks (Block). The size of a block is an integral multiple of the sector size , The default is 4K. When formatting , This value can be set .

A big hard disk is divided into small blocks , The part of data used to hold a file . thus , If we're like storing a file , You don't have to assign him a continuous space . We can store them in small pieces . It's much more flexible , It's also easier to add 、 Delete and insert data .

inode File index means file index , We have a file for each of us inode; A folder is a file , It also corresponds to a inode.

inode The data structure is as follows :

struct ext4_inode {
__le16 i_mode; /* File mode */
__le16 i_uid; /* Low 16 bits of Owner Uid */
__le32 i_size_lo; /* Size in bytes */
__le32 i_atime; /* Access time */
__le32 i_ctime; /* Inode Change time */
__le32 i_mtime; /* Modification time */
__le32 i_dtime; /* Deletion Time */
__le16 i_gid; /* Low 16 bits of Group Id */
__le16 i_links_count; /* Links count */
__le32 i_blocks_lo; /* Blocks count */
__le32 i_flags; /* File flags */
__le32 i_block[EXT4_N_BLOCKS];/* Pointers to blocks */
__le32 i_generation; /* File version (for NFS) */
__le32 i_file_acl_lo; /* File ACL */
__le32 i_size_high;

inode There are read and write permissions for files i_mode, To which user i_uid, Which group i_gid, What's the size i_size_io, How many blocks does it take i_blocks_io,i_atime yes access time, It's the last time the file was accessed ;i_ctime yes change time, It's the last change inode Time for ;i_mtime yes modify time, It's the last time the file was changed .

All the files are saved in i_block Inside . The specific preservation rules are as follows EXT4_N_BLOCKS decision ,EXT4_N_BLOCKS It has the following definition :

#define EXT4_NDIR_BLOCKS 12

stay ext2 and ext3 in , The top 12 The item directly stores the location of the block , in other words , We can go through i_block[0-11], Directly get the block to save the contents of the file .

however , If a file is large ,12 The block won't fit . When we use i_block[12] When , You can't put the data block directly , otherwise i_block It's going to run out soon .

Then you can make i_block[12] Point to a block , There is no data block in this block , It's the location of the data block , This block is called the indirect block . If the file is bigger ,i_block[13] It points to a block , We can use quadratic indirect blocks . The location of the indirect block is stored in the secondary indirect block , The indirect block contains the location of the data block , Data blocks hold real data . If the file is bigger , that i_block[14] Empathy .

There is a very significant problem , For big files , We have to read the hard disk many times to find the corresponding block , In this way, the access speed will be slower .

To solve this problem ,ext4 Made some changes . It introduces a new concept , called Extents. For example , A file size is 128M, If you use 4k The size of the block to store , need 32k Block . If according to ext2 perhaps ext3 It's spread out like that , It's too much . however Extents Can be used to store contiguous blocks , in other words , We can 128M In a Extents Inside . In this case , Read and write performance for large files has been improved , File fragmentation has also been reduced .

Exents It's a tree structure :

Each node has a head ,ext4_extent_header It can be used to describe a node .

struct ext4_extent_header {
__le16 eh_magic; /* probably will support different formats */
__le16 eh_entries; /* number of valid entries */
__le16 eh_max; /* capacity of store in entries */
__le16 eh_depth; /* has tree real underlying blocks? */
__le32 eh_generation; /* generation of the tree */

eh_entries Indicates how many items are in this node . There are two kinds of items here , If it's a leaf node , This entry points directly to the address of the contiguous blocks on the hard disk , We call it data nodes ext4_extent; If it's a branch node , This item points to the branch node or leaf node at the next level , We call it the inode ext4_extent_idx. The size of both types of items is 12 individual byte.

* This is the extent on-disk structure.
* It's used at the bottom of the tree.
struct ext4_extent {
__le32 ee_block; /* first logical block extent covers */
__le16 ee_len; /* number of blocks covered by extent */
__le16 ee_start_hi; /* high 16 bits of physical block */
__le32 ee_start_lo; /* low 32 bits of physical block */
* This is index on-disk structure.
* It's used at all the levels except the bottom.
struct ext4_extent_idx {
__le32 ei_block; /* index covers logical blocks from 'block' */
__le32 ei_leaf_lo; /* pointer to the physical block of the next *
* level. leaf or next index could be there */
__le16 ei_leaf_hi; /* high 16 bits of physical block */
__u16 ei_unused;
}; If the file is not big ,inode Inside i_block in , It can hold the next one ext4_extent_header and 4 term ext4_extent. So at this point ,eh_depth by 0, That is to say inode Inside is the leaf node , The height of the tree is 0.

If the file is large ,4 individual extent can't let go , It's about to split into a tree ,eh_depth>0 The node of is the index node , The root node has the largest depth , stay inode in . At the bottom eh_depth=0 It's the leaf node .

Except for the root node , The other nodes are stored in a block 4k Inside ,4k deduction ext4_extent_header Of 12 individual byte, The rest can put 340 term , Every extent Maximum energy means 128MB The data of ,340 individual extent Will make your presentation file reach 42.5GB.

inode Bitmaps and block bitmaps

inode The bitmap size of is 4k, Each bit corresponds to one inode. If it is 1, Express this inode It's been used ; If it is 0, It means that it is not used .block It's the same thing with bitmap .

stay Linux In the operating system , Want to create a new file , Would call open function , And the parameters will have O_CREAT. This means that when the file can't be found , We need to create a . that open The call procedure of the function is roughly : To open a file , First find the folder according to the path . If you find that there is no file under the folder , At the same time, it sets O_CREAT, That means we need to create a file under this folder .

Create a file , Then you need to create a inode, Then it will read from the file system inode Bitmap , And find the next one for 0 Of inode, Just idle inode. about block Bitmap , When writing to a file , There will also be this process .

File system format

The bitmap of a data block is placed in a block , common 4k. Each represents a block of data , A total can mean $4 * 1024 * 8 = 2{15}$ Data blocks . If each block is also by default 4K, The maximum space can be expressed as $2{15} * 4 * 1024 = 2^{27}$ individual byte, That is to say 128M, So obviously it's not enough .

In this case, we need to use the block group , The data structure is ext4_group_desc, This is for a block group of inode Bitmap bg_inode_bitmap_lo、 Block bitmap bg_block_bitmap_lo、inode list bg_inode_table_lo, There are corresponding member variables .

Such a block by block , It basically constitutes the structure of our entire file system . Because there are multiple blocks , Block group descriptors also form a list , We call these block group descriptor tables .

We also need to have a data structure , Describe the whole file system , This is the super block ext4_super_block. There's a whole file system in it, how many inode,s_inodes_count; How many pieces are there ,s_blocks_count_lo, How many... Per block group inode,s_inodes_per_group, How many pieces per block group ,s_blocks_per_group etc. . These are global information of this kind .

Final , The entire file system format looks like this .

By default , A copy of the superblock and block group descriptor tables is stored in each block group . To prevent this data from being lost , The entire file system can't be opened .

Because if there is a complete block group descriptor table in each block group , On the one hand, it's a waste of space ; Another aspect , Because a block group is the largest 128M, How many items are in the block group descriptor table , This limits how many blocks there are ,128M * The total number of block groups is the size of the entire file system , It's limited .

So introduce Meta Block Groups characteristic .

First , The block group descriptor table does not hold all block group descriptors , Instead, we divide the block groups into groups , We call them chunks (Meta Block Group). The list of block group descriptors in each meta block group only includes its own , A chunk group contains 64 Block groups , Such a metablock group has the largest number of block group descriptors 64 term .

Let's assume that there are 256 Block groups , It turns out to be a whole list of block group descriptors , There are 256 term , If you want to back up, back up all of them , Now it's divided into 4 Block groups , The list of block group descriptors in each meta block group is only 64 The item , It's much smaller , And the four chunk groups back up their own .

According to the picture , Each chunk group contains 64 Block groups , The block group descriptor table is also 64 term , Make three copies of it , At the first of the chunk groups , The beginning of the second and last block group .

If it's on sparse_super characteristic , Copies of the superblock and block group descriptor tables are only saved in the block group index of 0、3、5、7 In the integer power of . So the superblock in the figure above is only indexed to 0、3、5、7 In the integer power of .

The storage format of the directory

In fact, the directory itself is a file , Also have inode.inode It also points to some blocks . Different from ordinary documents , The file data is stored in the block of ordinary file , The block of the directory file stores the file information of each item in the directory . This information we call ext4_dir_entry.

In the block of the catalog file , The simplest format to save is a list , Each item will save the file name of the next level file in this directory and the corresponding file name inode, Through this inode, You can find the real file . The first is “.”, Represents the current directory , The second is “…”, Represents the directory above , Next are the file names and inode.

If in inode Set in EXT4_INDEX_FL sign , Then it means to find the file according to the index . The index entry maintains a mapping relationship between the hash value of a file name and the data block .

If we want to find the file name under a directory , You can hash by name . If the hash matches , It means that the information of this file is in the corresponding block . And then open this block , If it's no longer an index , It's the leaf node of the index tree , There is still ext4_dir_entry A list of , We just need to find the file names one by one . Through the index tree , We can put a directory under N Many files are scattered into many blocks , It's quick to find .

Linux File cache in

ext4 File system layer

about ext4 File system , The kernel defines a ext4_file_operations.

const struct file_operations ext4_file_operations = {
.read_iter = ext4_file_read_iter,
.write_iter = ext4_file_write_iter,

ext4_file_read_iter Would call generic_file_read_iter,ext4_file_write_iter Would call __generic_file_write_iter.

generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
if (iocb->ki_flags & IOCB_DIRECT) {
struct address_space *mapping = file->f_mapping;
retval = mapping->a_ops->direct_IO(iocb, iter);
retval = generic_file_buffered_read(iocb, iter, retval);
ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (iocb->ki_flags & IOCB_DIRECT) {
written = generic_file_direct_write(iocb, from);
} else {
written = generic_perform_write(file, from, iocb->ki_pos);
}generic_file_read_iter and __generic_file_write_iter There's a similar logic , Is to distinguish whether to use cache or not . therefore , Depending on whether memory is used for caching , We can put the file of I/O There are two types of operations .

The first type is caching I/O. Default for most file systems I/O Operations are all caching I/O. For reading operations , The operating system will check , Does the kernel buffer have the required data . If it's already cached , It's going to come back directly from the cache ; Otherwise read from disk , It is then cached in the operating system's cache . For write operations , The operating system will first copy the data from user space to kernel space cache . Now for the user program , The write operation is complete . It's up to the operating system to decide when to write to disk again , Unless you explicitly call sync Synchronization command .

The second type is direct IO, That is, applications access disk data directly , Without passing through the kernel buffer , This reduces data replication between the kernel cache and the user program .

If you're writing logic __generic_file_write_iter Inside , Found that the settings IOCB_DIRECT, Call generic_file_direct_write, It also calls address_space Of direct_IO Function of , Write data directly to the hard disk .

Write operations with cache

Let's first look at functions with cache writes generic_perform_write.

ssize_t generic_perform_write(struct file *file,
struct iov_iter *i, loff_t pos)
struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
do {
struct page *page;
unsigned long offset; /* Offset into pagecache page */
unsigned long bytes; /* Bytes to write to page */
status = a_ops->write_begin(file, mapping, pos, bytes, flags,
&page, &fsdata);
copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
status = a_ops->write_end(file, mapping, pos, bytes, copied,
page, fsdata);
pos += copied;
written += copied;
} while (iov_iter_count(i));

These are the main things that are done in the loop :

  • For every page , First call address_space Of write_begin Do some preparation ;

  • call iov_iter_copy_from_user_atomic, Copy the contents written from user mode to kernel state page ;

  • call address_space Of write_end Complete the write operation ;

  • call balance_dirty_pages_ratelimited, See if there are too many dirty pages , Need to write back to the hard disk . Dirty pages , It's writing to the cache , But it hasn't been written to the hard disk yet .

For the first step , It's called ext4_write_begin Come on , There are two main things to do :

First, do log related work .

ext4 It's a log file system , It is to prevent data loss in case of sudden power failure , The introduction of the Journal (Journal) Pattern . There is one more log file system than a non log file system Journal Area . The file in ext4 There are two parts to store , Part of it is the metadata of the file , The other part is data . Metadata and data operation log Journal It's also managed separately . You can mount ext4 When , choice Journal Pattern . This mode before writing data to the file system , You have to wait for the metadata and the log of the data to go to disk before it can work . This performance is poor , But the safest .

Another model is order Pattern . This mode does not log data , Log metadata only , But before writing the metadata log , You have to make sure that the data is on the disk . This compromise , It's the default mode .

There's another pattern writeback, Logs that don't record data , Log metadata only , And there is no guarantee that the data will be put on the disk before the metadata . This is the best performance , But the least safe .

The second call grab_cache_page_write_begin Come on , Get the cache page that should be written .

struct page *grab_cache_page_write_begin(struct address_space *mapping,
pgoff_t index, unsigned flags)
struct page *page;
page = pagecache_get_page(mapping, index, fgp_flags,
if (page)
return page;

In kernel , The cache is placed in memory on a page by page basis , Every open file has a struct file structure , Every struct file Every structure has a struct address_space Used to associate files and memory , It's in this structure , There is a tree. , Used to save all cache pages associated with this file .

For the second step , call iov_iter_copy_from_user_atomic. Call the assigned page first kmap_atomic Mapping to a virtual address in the kernel , Then copy the user mode data to the virtual address of the kernel state page , call kunmap_atomic Remove the mapping in the kernel .

size_t iov_iter_copy_from_user_atomic(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
char *kaddr = kmap_atomic(page), *p = kaddr + offset;
iterate_all_kinds(i, bytes, v,
copyin((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
v.bv_offset, v.bv_len),
memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
return bytes;

In the third step , call ext4_write_end Write complete . It will call ext4_journal_stop Finish writing log , Would call block_write_end->__block_commit_write->mark_buffer_dirty, Mark the modified cache as dirty . It can be seen that , In fact, the so-called completion of writing , It's not really written to the hard disk , Just after writing to the cache , Marked as Dirty page .

Step four , call balance_dirty_pages_ratelimited, It's a dirty page .

* balance_dirty_pages_ratelimited - balance dirty memory state
* @mapping: address_space which was dirtied
* Processes which are dirtying memory should call in here once for each page
* which was newly dirtied. The function will periodically check the system's
* dirty state and will initiate writeback if needed.
void balance_dirty_pages_ratelimited(struct address_space *mapping)
struct inode *inode = mapping->host;
struct backing_dev_info *bdi = inode_to_bdi(inode);
struct bdi_writeback *wb = NULL;
int ratelimit;
if (unlikely(current->nr_dirtied >= ratelimit))
balance_dirty_pages(mapping, wb, current->nr_dirtied);

stay balance_dirty_pages_ratelimited Inside , Found that the number of dirty pages exceeds the specified number , Just call balance_dirty_pages->wb_start_background_writeback, Start a back thread and start writing back .

There are also several scenarios that trigger copybacks :

  • The user calls sync, Brush the cache onto the hard disk , Will eventually call wakeup_flusher_threads, Synchronize dirty pages ;

  • When memory is very tight , When you can't allocate pages , Would call free_more_memory, Will eventually call wakeup_flusher_threads, Release dirty pages ;

  • Dirty pages have been updated for a long time , The time exceeds the set time , Need to write back in time , Keep data consistency between memory and disk .

Read operations with cache

Read with cache , It's a function generic_file_buffered_read.

static ssize_t generic_file_buffered_read(struct kiocb *iocb,
struct iov_iter *iter, ssize_t written)
struct file *filp = iocb->ki_filp;
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
for (;;) {
struct page *page;
pgoff_t end_index;
loff_t isize;
page = find_get_page(mapping, index);
if (!page) {
if (iocb->ki_flags & IOCB_NOWAIT)
goto would_block;
ra, filp,
index, last_index - index);
page = find_get_page(mapping, index);
if (unlikely(page == NULL))
goto no_cached_page;
if (PageReadahead(page)) {
ra, filp, page,
index, last_index - index);
* Ok, we have the page, and it's up-to-date, so
* now we can copy it to user space...
ret = copy_page_to_iter(page, offset, nr, iter);

stay generic_file_buffered_read Function , We need to find page cache Is there a cache page in it . If not found , Not only read this page , And a preview , This needs to be in page_cache_sync_readahead Function . After the preview , Try again to find the cache page .

If you look for the cache page for the first time, you will find , We still have to judge , Should we continue to preview ; if necessary , Just call page_cache_async_readahead Initiate an asynchronous read ahead .

Last ,copy_page_to_iter The contents will be copied from the kernel cache page to the user memory space .

 Programmer column   Scan code and pay attention to customer service   Press and hold to recognize the QR code below to enter the group

Recent highlights are recommended :  

  My girlfriend thinks the annual salary is 50 Ten thousand is the average level , What do I do ?

  The sexy goddess of the biggest straight men forum in China overturned

 IntelliJ IDEA Fully optimized settings , Efficiency bars !

  Very useful Python skill

Here's a look Good articles to share with more people ↓↓


  1. 【计算机网络 12(1),尚学堂马士兵Java视频教程
  2. 【程序猿历程,史上最全的Java面试题集锦在这里
  3. 【程序猿历程(1),Javaweb视频教程百度云
  4. Notes on MySQL 45 lectures (1-7)
  5. [computer network 12 (1), Shang Xuetang Ma soldier java video tutorial
  6. The most complete collection of Java interview questions in history is here
  7. [process of program ape (1), JavaWeb video tutorial, baidu cloud
  8. Notes on MySQL 45 lectures (1-7)
  9. 精进 Spring Boot 03:Spring Boot 的配置文件和配置管理,以及用三种方式读取配置文件
  10. Refined spring boot 03: spring boot configuration files and configuration management, and reading configuration files in three ways
  11. 精进 Spring Boot 03:Spring Boot 的配置文件和配置管理,以及用三种方式读取配置文件
  12. Refined spring boot 03: spring boot configuration files and configuration management, and reading configuration files in three ways
  13. 【递归,Java传智播客笔记
  14. [recursion, Java intelligence podcast notes
  15. [adhere to painting for 386 days] the beginning of spring of 24 solar terms
  16. K8S系列第八篇(Service、EndPoints以及高可用kubeadm部署)
  17. K8s Series Part 8 (service, endpoints and high availability kubeadm deployment)
  18. 【重识 HTML (3),350道Java面试真题分享
  19. 【重识 HTML (2),Java并发编程必会的多线程你竟然还不会
  20. 【重识 HTML (1),二本Java小菜鸟4面字节跳动被秒成渣渣
  21. [re recognize HTML (3) and share 350 real Java interview questions
  22. [re recognize HTML (2). Multithreading is a must for Java Concurrent Programming. How dare you not
  23. [re recognize HTML (1), two Java rookies' 4-sided bytes beat and become slag in seconds
  24. 造轮子系列之RPC 1:如何从零开始开发RPC框架
  25. RPC 1: how to develop RPC framework from scratch
  26. 造轮子系列之RPC 1:如何从零开始开发RPC框架
  27. RPC 1: how to develop RPC framework from scratch
  28. 一次性捋清楚吧,对乱糟糟的,Spring事务扩展机制
  29. 一文彻底弄懂如何选择抽象类还是接口,连续四年百度Java岗必问面试题
  30. Redis常用命令
  31. 一双拖鞋引发的血案,狂神说Java系列笔记
  32. 一、mysql基础安装
  33. 一位程序员的独白:尽管我一生坎坷,Java框架面试基础
  34. Clear it all at once. For the messy, spring transaction extension mechanism
  35. A thorough understanding of how to choose abstract classes or interfaces, baidu Java post must ask interview questions for four consecutive years
  36. Redis common commands
  37. A pair of slippers triggered the murder, crazy God said java series notes
  38. 1、 MySQL basic installation
  39. Monologue of a programmer: despite my ups and downs in my life, Java framework is the foundation of interview
  40. 【大厂面试】三面三问Spring循环依赖,请一定要把这篇看完(建议收藏)
  41. 一线互联网企业中,springboot入门项目
  42. 一篇文带你入门SSM框架Spring开发,帮你快速拿Offer
  43. 【面试资料】Java全集、微服务、大数据、数据结构与算法、机器学习知识最全总结,283页pdf
  44. 【leetcode刷题】24.数组中重复的数字——Java版
  45. 【leetcode刷题】23.对称二叉树——Java版
  46. 【leetcode刷题】22.二叉树的中序遍历——Java版
  47. 【leetcode刷题】21.三数之和——Java版
  48. 【leetcode刷题】20.最长回文子串——Java版
  49. 【leetcode刷题】19.回文链表——Java版
  50. 【leetcode刷题】18.反转链表——Java版
  51. 【leetcode刷题】17.相交链表——Java&python版
  52. 【leetcode刷题】16.环形链表——Java版
  53. 【leetcode刷题】15.汉明距离——Java版
  54. 【leetcode刷题】14.找到所有数组中消失的数字——Java版
  55. 【leetcode刷题】13.比特位计数——Java版
  56. oracle控制用户权限命令
  57. 三年Java开发,继阿里,鲁班二期Java架构师
  58. Oracle必须要启动的服务
  59. 万字长文!深入剖析HashMap,Java基础笔试题大全带答案
  60. 一问Kafka就心慌?我却凭着这份,图灵学院vip课程百度云