VFS
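A VFS specifies an interface (or a "contract") between the kernel and a concrete file system. More specifically, it sits between the system calls and a particular file system's file_operations. Therefore, it is easy to add support for new file system types to the kernel simply by fulfilling the contract.
The VFS common file model contains the following four metadata structures:
- Superblock object (super_block): stores information about a registered file system. Examples include basic disk file systems such as ext2 and ext3, the socket file system used to read and write sockets, and the cgroups file system used to read and write cgroups configuration.
- Inode object (inode): stores information about a specific file. For an ordinary disk file system, the inode typically records where the file's blocks are stored on disk; for the socket file system, the inode holds the socket's attributes; for a special file system such as cgroups, the inode holds the attributes of the cgroup node. An important member is the inode_operations structure, which defines how the concrete file system implements operations such as creating and deleting files.
- File object (file): represents a file opened by a process; file objects are held in the process's fd table. An important member is the file_operations structure, which describes how the concrete file system implements reads and writes. When a process reads or writes through a file descriptor, the methods actually invoked are the ones defined in file_operations. For an ordinary disk file system, file_operations defines ordinary block device reads and writes; for the socket file system, it defines the socket's send/recv operations; for a special file system such as cgroups, it defines the operations on the cgroup structures.
- Dentry object (dentry): when the kernel looks up a file by path, it creates a dentry object for every component of the path. A dentry leads to the corresponding inode object, and dentries are usually cached to speed up lookups.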
struct super_block
Kernel / VFS
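The superblock represents an entire file system. It is the file system's control block and holds file-system-wide information, and every inode of the file system is linked to its superblock. In short, one superblock stands for one file system.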
struct super_block {
//...
// All inodes belong to this super_block
struct list_head s_inodes; /* all inodes */
} __randomize_layout;
Pseudo filesystems
A filesystem that doesn't have actual files.
For example, /proc on many OSes is a procfs which dynamically generates directories for every process. Similarly, /sys on Linux generates files and directories to represent hardware layouts. There are FUSE-based pseudo-filesystems for a lot of things.
filesystems - What is a pseudo file system in Linux? - Super User
Magic number
A magic number is used to specify a type of file system. Isn't a magic number what tells you a file's format? (That is a different magic number and has nothing to do with this one.)
BPF_FS_MAGIC is used to identify the BPF filesystem, BTRFS_TEST_MAGIC is used to identify a btrfs filesystem used for testing, while BTRFS_MAGIC_NUMBER is used to identify a (production-ready) btrfs filesystem, and so on. The naming of these macros is not consistent, but they all share the same purpose. The one for gmem is called KVM_GUEST_MEMORY_MAGIC, which matches none of these patterns.
In short, the names are only a convention; the purpose is the same in every case.
filesystems - Linux file system SUPER_MAGIC, FS_MAGIC and TEST_MAGIC difference - Stack Overflow
Dentry / struct dcache
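Dentry is short for directory entry (Chinese: 目录项).
Do not confuse a dentry with a pathname (filename): the latter has to go through a lookup to find the former. The dcache is designed to speed up this lookup.
Dentries and pathnames are not 1 to 1. For example, for the pathname /bin/vi, the components /, bin, and vi are all dentry objects.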
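To make the "one dentry per path component" idea concrete, here is a small userspace sketch that walks an absolute pathname and counts the components a lookup would visit. It is only a toy: it does not touch the kernel's dcache, it ignores "." and "..", and the function name count_components is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Count the components of an absolute `path`, including the root "/".
 * If `print` is non-zero, print each component on its own line.
 * Returns -1 for relative paths, which this toy does not handle. */
int count_components(const char *path, int print)
{
    const char *p = path;
    int n = 0;

    assert(path != NULL);
    if (*p != '/')
        return -1;

    n++;                          /* the root directory "/" */
    if (print)
        printf("/\n");

    while (*p) {
        while (*p == '/')         /* skip one or more separators */
            p++;
        if (!*p)
            break;
        size_t len = strcspn(p, "/");
        if (print)
            printf("%.*s\n", (int)len, p);
        n++;
        p += len;
    }
    return n;
}
```

For "/bin/vi" this visits "/", "bin", and "vi", matching the three dentries mentioned above.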
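Dentries are cached in the dcache. filesystems - How long do dentries stay in the dcache? - Unix & Linux Stack Exchange
The dcache is to the page cache what the TLB is to the CPU cache.
From the memory side, the CPU cache caches memory contents. Memory is located through page tables, and the page tables themselves live in memory, so page table lookups need a cache too; that is the job of the TLB.
From the I/O side, the page cache caches file contents. File contents are located through directories, and directories themselves live in the on-disk file system metadata, so directory lookups need a cache too; that is the job of the dcache.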
#define d_lock d_lockref.lock
struct dentry {
/* RCU lookup touched fields */
unsigned int d_flags; /* protected by d_lock */
seqcount_spinlock_t d_seq; /* per dentry seqlock */
struct hlist_bl_node d_hash; /* lookup hash list */
struct dentry *d_parent; /* parent directory */
struct qstr d_name;
struct inode *d_inode; /* Where the name belongs to - NULL is negative */
unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */
/* Ref lookup also touches following */
// In use this is written dentry->d_lock, because the macro above expands d_lock to d_lockref.lock.
// This lock protects some of the key fields of this dentry.
struct lockref d_lockref; /* per-dentry lock and refcount */
const struct dentry_operations *d_op;
struct super_block *d_sb; /* The root of the dentry tree */
unsigned long d_time; /* used by d_revalidate */
void *d_fsdata; /* fs-specific data */
union {
struct list_head d_lru; /* LRU list */
wait_queue_head_t *d_wait; /* in-lookup ones only */
};
struct list_head d_child; /* child of parent list */
struct list_head d_subdirs; /* our children */
/*
* d_alias and d_rcu can share memory
*/
union {
struct hlist_node d_alias; /* inode alias list */
struct hlist_bl_node d_in_lookup_hash; /* only for in-lookup ones */
struct rcu_head d_rcu;
} d_u;
CK_KABI_RESERVE(1)
CK_KABI_RESERVE(2)
} __randomize_layout;
Struct file (include/linux/fs.h)
Representing an opened file.
struct file and file descriptors are roughly 1-to-1. linux - What's the relationship between struct file and file descriptor? - Stack Overflow
Different open() calls create different struct file objects. This is because each open() returns a different fd, and each fd needs its own private state, such as the offset it has reached in the file.
The fd is mainly the handle that userspace operates on, whereas struct file seems to be what the kernel mostly works with.
Reading, writing and closing files (and other assorted VFS operations) is done by using the userspace file descriptor to grab the appropriate file structure, and then calling the required file structure method to do whatever is required.
In the kernel sources, a pointer to struct file is usually called either file or filp (“file pointer”).
Do not confuse it with FILE. A FILE is defined in the C library and never appears in kernel code. A struct file, on the other hand, is a kernel structure that never appears in user programs.
The file Structure - Linux Device Drivers, Second Edition [Book]
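The relationship between fds and struct file objects can be observed from userspace through the file offset, which is per struct file. The sketch below assumes a writable /tmp; the function name offsets_after_write is made up for illustration. It opens the same file twice and also dup()s the first fd, then writes five bytes through the first fd. The second open() should still report offset 0 (a separate struct file), while the dup()ed fd should report offset 5 (the same struct file), which is why the mapping is only "roughly" 1-to-1.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Write 5 bytes through one fd, then report the offset seen through
 * (a) a second open() of the same file and (b) a dup() of the first fd.
 * Returns 0 on success, -1 if any setup step fails. */
int offsets_after_write(off_t *via_open, off_t *via_dup)
{
    char path[] = "/tmp/vfs_offset_XXXXXX";
    int fd1, fd2, fd3, rc = -1;

    assert(via_open != NULL && via_dup != NULL);

    fd1 = mkstemp(path);            /* first open: first struct file */
    if (fd1 < 0)
        return -1;
    fd2 = open(path, O_RDONLY);     /* second open: a separate struct file */
    fd3 = dup(fd1);                 /* new fd, same struct file as fd1 */

    if (fd2 >= 0 && fd3 >= 0 && write(fd1, "hello", 5) == 5) {
        *via_open = lseek(fd2, 0, SEEK_CUR);
        *via_dup = lseek(fd3, 0, SEEK_CUR);
        rc = 0;
    }

    if (fd3 >= 0)
        close(fd3);
    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    unlink(path);
    return rc;
}
```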
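file_operations / anon_inode_getfile() / anon_inode_getfd()
file_operations ties system calls to the driver behind a file or fd: most members of file_operations serve one system call.
anon_inode_getfd() creates an fd and associates it with the private data passed in (for example a struct vcpu *). The file_operations passed in implements the system calls made on this fd. The function returns the new fd.
anon_inode_getfd() can be seen as a wrapper around anon_inode_getfile(): it calls the latter first, then uses fd_install() to associate the fd with the struct file the latter returned.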
These function pointers are filled in by the device driver or file system that implements the operations.
struct file_operations {
struct module *owner;
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
// release is not called on every close(): it is invoked only when the
// last reference to the struct file is dropped, i.e. on the last close
// of the last fd that refers to it
// https://stackoverflow.com/questions/11393674/why-is-the-close-function-is-called-release-in-struct-file-operations-in-the-l
int (*release) (struct inode *, struct file *);
//
long (*fallocate)(struct file *file, int mode, loff_t offset, loff_t len);
//...
} __randomize_layout;
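The dispatch idea behind file_operations can be modeled in plain userspace C. The sketch below is a toy, not kernel code: all toy_* names and the two "drivers" are invented for illustration. The generic read path only looks at the function pointer table attached to the file, so two files with different tables behave differently under the same call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Toy counterpart of struct file_operations: one slot per operation. */
struct toy_file_operations {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

/* Toy counterpart of struct file: per-open state plus its operations. */
struct toy_file {
    const struct toy_file_operations *f_op;
    void *private_data;
    size_t pos;
};

/* "Driver" A: reads from a NUL-terminated string in private_data. */
static long mem_read(struct toy_file *f, char *buf, size_t len)
{
    const char *data = f->private_data;
    size_t left = strlen(data) - f->pos;   /* pos never exceeds strlen */

    if (len > left)
        len = left;
    memcpy(buf, data + f->pos, len);
    f->pos += len;
    return (long)len;
}

/* "Driver" B: behaves like /dev/zero. */
static long zero_read(struct toy_file *f, char *buf, size_t len)
{
    (void)f;
    memset(buf, 0, len);
    return (long)len;
}

const struct toy_file_operations mem_fops = { .read = mem_read };
const struct toy_file_operations zero_fops = { .read = zero_read };

/* Toy counterpart of the generic read path: it only knows the table. */
long toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    assert(f != NULL && f->f_op != NULL);
    if (!f->f_op->read)
        return -1;                 /* operation not supported */
    return f->f_op->read(f, buf, len);
}
```

A caller sets f_op to &mem_fops or &zero_fops when it creates the toy_file and from then on only calls toy_vfs_read(), much as a process only calls read() on an fd.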
file_operations for a socket fd
The file_operations structure is a common interface used by different subsystems, not just file systems; KVM also uses it (kvm_vm_fops and kvm_vcpu_fops, for example).
For a socket fd, the file_operations is implemented in the networking subsystem. Although both file systems and the networking subsystem in Linux use file_operations, they use it for different purposes:
- In a file system, the file_operations structure defines the operations that can be performed on a file or device, such as reading and writing data.
- In the networking subsystem, the file_operations structure defines the operations that can be performed on a network resource, such as sending and receiving data over a network socket.
// The file_operations implemented by networking subsystem, in net/socket.c
static const struct file_operations socket_file_ops = {
.owner = THIS_MODULE,
//...
.show_fdinfo = sock_show_fdinfo,
};
// bind the socket to a file, because a file struct is needed for file_operations functions
// sock_alloc_file - Bind a &socket to a &file
struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
{
struct file *file;
//...
file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,
O_RDWR | (flags & O_NONBLOCK),
&socket_file_ops);
//...
return file;
}
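poll()
For eventpoll, fops->poll is ep_eventpoll_poll. (select and poll have no fops->poll of their own because they are not file systems, whereas eventpoll is backed by eventpollfs.)
An implementation of poll must do two things (this is effectively the contract):
- Call poll_wait on each wait queue it is interested in (ep_eventpoll_poll does so) in order to receive wakeups. The implementation must treat the poll_table argument as an opaque object; it does not need to know its layout.
- Return a bitmask (__poll_t) of the operations that can be performed right now without blocking.
In terms of the callback stored in the poll_table, the same two duties are:
- Expose the queue(s) related to "readiness" inside file: call the callback _qproc once for each such queue.
- Return a bitmask indicating current "readiness" (__poll_t).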
typedef unsigned __bitwise __poll_t;
__poll_t (*poll) (struct file *, struct poll_table_struct *);
typedef struct poll_table_struct {
poll_queue_proc _qproc;
__poll_t _key;
} poll_table;
typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);
Why design like this?
From the official definition of this function:
This function is called by the VFS when a process wants to check if there is activity on this file and (optionally) go to sleep until there is activity. Called by the select(2) and poll(2) system calls.
poll_wait()
Motivation: Add a process to the wait queue for a particular fd.
// @wait_address: the wait queue of processes for the descriptor
static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
if (p && p->_qproc && wait_address)
p->_qproc(filp, wait_address, p);
}
Why does poll_wait() go through poll_table to call another function, _qproc, instead of adding to the queue directly?
poll_table
file_operations has a poll method; its signature and the types it uses are:
typedef struct poll_table_struct {
poll_queue_proc _qproc;
__poll_t _key; // event mask
} poll_table;
typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
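Here poll_table is a typedef for struct poll_table_struct.
The poll_table (the second argument) is tied to the caller (select, poll, or epoll) and is independent of the fd being polled (the first argument). For example:
- For select and poll, the function in the poll_table is __pollwait().
- For epoll, the function in the poll_table is ep_ptable_queue_proc().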
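The split described above, where the file's poll method is fixed and the queueing behavior comes from the caller, can be sketched as a userspace toy. Everything here (the toy_* types, the two callbacks, the two counters) is invented for illustration and only mirrors the shape of poll_wait() and poll_table; it is not how the kernel's wait queues are implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Toy wait queue: just counts what kind of entries were added. */
struct toy_wait_queue { int temporary; int persistent; };
struct toy_poll_table;

typedef void (*toy_queue_proc)(struct toy_wait_queue *wq,
                               struct toy_poll_table *pt);

/* Toy counterpart of poll_table: carries the caller's callback. */
struct toy_poll_table { toy_queue_proc qproc; };

/* Toy counterpart of poll_wait(): forwards to the callback, if any. */
void toy_poll_wait(struct toy_wait_queue *wq, struct toy_poll_table *pt)
{
    if (pt && pt->qproc && wq)
        pt->qproc(wq, pt);
}

/* A "device" with one wait queue and a readiness flag. */
struct toy_device {
    struct toy_wait_queue readq;
    int has_data;
};

#define TOY_POLLIN 0x1u

/* Toy .poll method: the same code no matter who is polling. */
unsigned toy_device_poll(struct toy_device *dev, struct toy_poll_table *pt)
{
    assert(dev != NULL);
    toy_poll_wait(&dev->readq, pt);          /* duty 1: expose the queue */
    return dev->has_data ? TOY_POLLIN : 0;   /* duty 2: readiness mask */
}

/* Caller 1 (think select/poll): adds an entry meant to be removed
 * when the call returns. */
void selectlike_qproc(struct toy_wait_queue *wq, struct toy_poll_table *pt)
{
    (void)pt;
    wq->temporary++;
}

/* Caller 2 (think epoll): adds an entry meant to stay registered. */
void epolllike_qproc(struct toy_wait_queue *wq, struct toy_poll_table *pt)
{
    (void)pt;
    wq->persistent++;
}
```

The device code never learns which caller it is talking to; passing a NULL table (or a NULL callback) turns toy_device_poll() into a pure readiness query.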
Meaning of the name _qproc
Most likely "queue proc(edure)": the field has type poll_queue_proc, and the callback is what queues the caller on the wait queue. The field is an implementation detail, so its name and use may differ between kernel versions.
poll_queue_proc function means
Motivation
The poll_table structure decouples a file's poll method from whoever is polling it. A driver or file system implements a single poll method: it calls poll_wait() on its wait queues and returns the current readiness mask. How the caller actually gets queued on those wait queues is decided by the _qproc callback that the caller (select, poll, or epoll) stores in the poll_table.
Without this indirection, every poll method would need to know about each polling mechanism and how it wants to be woken up. That would duplicate the same logic across drivers and make the polling mechanisms hard to change or optimize in one place.
With it, select/poll and epoll can each register their own wait queue entries and wakeup behavior, while drivers keep one uniform and small poll implementation.
Who calls .poll, and what is the call path?
A process wants to check whether there is activity on a file: .poll invokes the given callback on each readiness-related queue, so that the process can learn about the activity (the "readiness").
// ---------- For select ----------
SYSCALL_DEFINE5(select, int, n, fd_set __user *, inp, fd_set __user *, outp,
fd_set __user *, exp, struct __kernel_old_timeval __user *, tvp)
{
return kern_select(n, inp, outp, exp, tvp);
}
kern_select()
core_sys_select()
do_select()
@vfs_poll() // need to be called on each fd
f_op->poll(file, pt);
// ---------- For poll ----------
SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
int, timeout_msecs)
{
//...
ret = do_sys_poll(ufds, nfds, to);
//...
}
do_sys_poll()
do_poll()
@do_pollfd() // need to be called on each fd
vfs_poll()
f_op->poll(file, pt);
// ---------- For epoll ----------
// An epitem may itself refer to an epoll fd, which also needs to be polled, so polling is recursive
do_epoll_wait()
ep_poll()
ep_send_events()
@ep_item_poll(epi, &pt, 1); // called on each epitem; the depth argument is for mutex locking; the same poll_table pt is used for all
__ep_eventpoll_poll() // the fd in the epitem is **also** an eventpoll fd, i.e. we have a nested epoll fd
poll_wait() // calls the function in pt; it is currently NULL, so this does nothing
@ep_item_poll(epi, &pt, 1 + 1);
__ep_eventpoll_poll() // for nested eventpoll fd...
poll_wait()
ep_item_poll(epi, &pt, 1 + 2);
//......
vfs_poll() // for non-eventpoll fd
Do not confuse .poll with the poll() system call.
https://www.kernel.org/doc/Documentation/filesystems/vfs.txt
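Since the three call paths above all end in the same .poll method of the file, select(2), poll(2), and epoll should agree about the readiness of one fd. The sketch below checks a single fd for readability through each interface without blocking. The helper names are made up for illustration, and the epoll part is Linux-specific.

```c
#include <assert.h>
#include <poll.h>
#include <sys/epoll.h>
#include <sys/select.h>
#include <unistd.h>

/* Each helper returns 1 if `fd` is readable now, 0 if not, -1 on error. */

int readable_via_select(int fd)
{
    fd_set rfds;
    struct timeval tv = { 0, 0 };      /* zero timeout: do not block */
    int n;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    n = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    return n == 1 && FD_ISSET(fd, &rfds);
}

int readable_via_poll(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, 0);

    if (n < 0)
        return -1;
    return n == 1 && (pfd.revents & POLLIN) != 0;
}

int readable_via_epoll(int fd)
{
    struct epoll_event ev = { .events = EPOLLIN }, out;
    int ep = epoll_create1(0);
    int result = -1;

    if (ep < 0)
        return -1;
    ev.data.fd = fd;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == 0) {
        int n = epoll_wait(ep, &out, 1, 0);
        if (n >= 0)
            result = n == 1 && (out.events & EPOLLIN) != 0;
    }
    close(ep);
    return result;
}
```

On the read end of an empty pipe all three are expected to return 0; after one byte is written to the pipe, all three are expected to return 1.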
The implementation of poll() for a network socket
static __poll_t sock_poll(struct file *file, struct poll_table_struct *wait);
which calls another poll, the one in struct proto_ops:
struct proto_ops {
//...
__poll_t(*poll)(struct file *file, struct socket *sock, struct poll_table_struct *wait);
//...
};
which eventually calls poll_wait().
file_system_type
Represents a file system type.
struct file_system_type {
// For example sysfs, nfs, ext4, and so on.
const char *name;
int (*init_fs_context)(struct fs_context *);
//...
};
.init_fs_context()
Kernel
Each file system registers its own implementation of this function.
SYSCALL_DEFINE5(fsconfig...
vfs_fsconfig_locked
finish_clean_context
init_fs_context
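As shown above, fsconfig() is itself a syscall: once userspace invokes it, the call chain reaches init_fs_context. For a pseudo file system, one of the things this function does is set the magic number. A file system generally defines its own magic number; for gmem it is 0x474d454d (the ASCII bytes "GMEM").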
.kill_sb()
Kernel
Effectively deactivates the whole file system.
deactivate_locked_super
fs->kill_sb(s);
Inode
They live either on the disk (for block device filesystems) or in the memory (for pseudo filesystems). Inodes that live on the disc are copied into the memory when required and changes to the inode are written back to disc. A single inode can be pointed to by multiple dentries (hard links, for example, do this).
inodes aren't loaded from disk until they are actually needed.
https://www.kernel.org/doc/Documentation/filesystems/vfs.txt
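alloc_anon_inode() / new_inode_pseudo() / Anonymous inode
The anonymous inode file system keeps one shared anonymous inode (anon_inode_inode), allocated once with alloc_anon_inode(). anon_inode_getfile() attaches that same inode to the files it creates, so multiple fds/files share a single inode: there is no backend behind them, so allocating an inode for every fd would only waste memory. Note that alloc_anon_inode() itself, as the code below shows, allocates a new inode through new_inode_pseudo() on every call.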
struct inode *alloc_anon_inode(struct super_block *s)
{
static const struct address_space_operations anon_aops = {
.dirty_folio = noop_dirty_folio,
};
struct inode *inode = new_inode_pseudo(s);
if (!inode)
return ERR_PTR(-ENOMEM);
inode->i_ino = get_next_ino();
inode->i_mapping->a_ops = &anon_aops;
/*
* Mark the inode dirty from the very beginning,
* that way it will never be moved to the dirty
* list because mark_inode_dirty() will think
* that it already _is_ on the dirty list.
*/
inode->i_state = I_DIRTY;
inode->i_mode = S_IRUSR | S_IWUSR;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
inode->i_flags |= S_PRIVATE;
inode->i_atime = inode->i_mtime = inode_set_ctime_current(inode);
return inode;
}
struct inode *new_inode_pseudo(struct super_block *sb)
{
struct inode *inode = alloc_inode(sb);
if (inode) {
spin_lock(&inode->i_lock);
inode->i_state = 0;
spin_unlock(&inode->i_lock);
}
return inode;
}
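Whether two anonymous-inode-backed fds share an inode can be checked from userspace by comparing fstat() results. The sketch below uses eventfd(2), on the assumption that eventfd files are created through anon_inode_getfile() and therefore use the shared anonymous inode; the function name eventfds_share_inode is made up for illustration. The result is reported rather than assumed, because it depends on the running kernel.

```c
#include <assert.h>
#include <sys/eventfd.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create two eventfds and compare the inodes behind them.
 * Returns 1 if both report the same (st_dev, st_ino), 0 if they differ,
 * -1 if setup fails. */
int eventfds_share_inode(void)
{
    int a = eventfd(0, 0);
    int b = eventfd(0, 0);
    struct stat sa, sb;
    int result = -1;

    /* Two successful calls must return distinct descriptors. */
    assert(a < 0 || a != b);

    if (a >= 0 && b >= 0 && fstat(a, &sa) == 0 && fstat(b, &sb) == 0)
        result = (sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino);

    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return result;
}
```

A return value of 1 would be consistent with both files pointing at the one shared anonymous inode.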