用户态、内核态、系统调用、syscall、system call

内核态和用户态是怎样实现的？这个问题看似简单，其实不然。

首先从一个简单的回答说起，一般来说，为了应付面试，我们可以说内核态和用户态是基于 CPU 实现的，x86 架构下定义了四个特权级，分别是 Ring 0 到 Ring 3，其中 Ring 0 具有最高的优先级，一般用作内核态；而 Ring 3 的优先级最低，一般用作用户态。对于特权级，我们可以展开说说其实现机制。

CPU 对特权级的定义及处理

特权级有四级，分别是 0 到 3，越小的特权级越高。特权级还可以细分为

CPL，是当前进程的权限级别（Current Privilege Level），是当前正在执行的代码所在的段（也就是 cs 段）的特权级，代表着这个进程的特权级，存在于 cs 寄存器的低两位。CPL 其实指的就是我们说的 ring，也就是说，ring 在硬件上的反映其实就是 cs 寄存器的低两位，参考 operating system - Is an x86 CPU in kernel mode when the CPL value of the CS register is equal to 0? - Stack Overflow。
RPL，说明的是进程对段访问的请求权限（Request Privilege Level），是对于段选择子（保护模式）而言的，每个段选择子有自己的 RPL，它说明的是进程对段访问的请求权限。而且 RPL 对每个段来说不是固定的，两次访问同一段时的 RPL 可以不同。
DPL，说明的是进程对段访问的权限，规定访问该段的权限级别（Descriptor Privilege Level），每个段的 DPL 固定。

对于 RPL 与 DPL 之间的区别可以访问 memory segmentation - Difference between DPL and RPL in x86 - Stack Overflow。

防止用户态代码更改访问级别是如何实现的？

如果一个用户态代码想要进行一些必要的文件读取操作，那么它必须先切换到内核态才有足够的权限。一般来说，如我我们对于内核态/用户态以及它们之间的切换不够了解，那么我们会自然地认为用户态代码切换到内核态需要以下步骤：

内核存在能够从用户态切换到内核态的例程；
用户态代码执行系统调用；
在调用系统调用的过程中，调用权限切换例程，完成权限的切换。

正常来说，这个步骤是没有问题的。但是，有没有一种可能，用户态进程可以自己实现这个权限切换例程，从而偷取权限，执行一些能够毁坏系统的代码？

答案当然是不可能的，权限等级的设计就是要保证权限切换的方向是单向的，也就是只能够从高优先级切换到低优先级，而不能从低优先级切换到高优先级。

那么如果这是不可能的，我们的系统调用又是如何通过执行例程，在当前特权等级较低的情况下，完成向高特权等级的切换的呢？

所以真实的系统调用流程是这样的：

用户态代码以中断的形式，进行系统调用；
CPU 在响应中断的同时，自动将特权级别设置为 0，这是一种硬件机制，参考 operating system - how does an interrupt put CPU into the required privilege level? - Stack Overflow；
调用操作系统预先设置的中断响应函数；
返回，将特权级别恢复。

可以看到，借助了硬件的实现，用户态进程就不可能自己实现这个权限切换例程，而是只能乖乖地通过中断机制进行系统调用，将代码执行权乖乖地交给内核。

这也是 系统调用以中断进行调用的原因。

ISA 里的 `SYSCALL` 指令也是通过中断实现的吗？

From the previous part we know that system call concept is very similar to an interrupt. Furthermore, system calls are implemented as software interrupts. So, when the processor handles a syscall instruction from a user application, this instruction causes an exception which transfers control to an exception handler.

How the Linux kernel handles a system call · Linux Inside

When the kernel is invoked with a system call, it is not necessarily using an interrupt. On x86-64, it is invoked directly using a specific processor instruction (syscall). This instruction makes the processor jump to the address stored in a special register.

process - Does a system call involve a context switch or not? - Stack Overflow

The syscall instruction causes an exception, which transfers control to an exception handler.

syscall

Syscall instruction

SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction following SYSCALL into RCX). (The WRMSR instruction ensures that the IA32_LSTAR MSR always contain a canonical address.)

From SDM.

So if you want to call a procedure, you should first write the RIP of the procedure to the MSR, then syscall.

Syscall 在 Linux kernel 中的实现

__NR 开头的宏是一个系统调用号的宏定义。

#define __NR_memfd_restricted 451
__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)

asmlinkage long sys_memfd_restricted(unsigned int flags);

SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
{
    //...
}

All system calls are marked with the asmlinkage tag, so they all look to the stack for arguments.

c - What is the 'asmlinkage' modifier meant for? - Stack Overflow

SYSCALL_DEFINE() is a macro that is used to define the implementation of a system call function.

__SYSCALL() is a low-level macro that is used to declare the interface to a system call and is used by the kernel to generate the system call table.

The main entry point for your new xyzzy(2) system call will be called sys_xyzzy(), but you add this entry point with the appropriate SYSCALL_DEFINEn() macro rather than explicitly.

SYSCALL_DEFINEn(). The ‘n’ indicates the number of arguments to the system call.

Adding a New System Call — The Linux Kernel documentation

Take this as an example: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory - Chao Peng

A call from userspace to syscall() function actually won't get the return value you defined. Because you are calling the wrapper function which is:

int syscall(num, ...)
{
    // ...
    int rc = /* arch-specific code to get the result of the system call */ ;
    if (rc < 0) { errno = -rc; return -1; }
    return 0;
}

So the actual return value is in errno, if error, syscall() will always return -1 no matter what the real return value is.

Syscall vs. int80

A system call number is a unique integer (i.e., whole number), from one to around 256, that is assigned to each system call in a Unix-like operating system.

syscall is the default way of entering kernel mode on x86-64. This instruction is not available in 32 bit modes of operation on Intel processors.
sysenter is an instruction most frequently used to invoke system calls in 32 bit modes of operation. It is similar to syscall, a bit more difficult to use though, but that is the kernel's concern.
int 0x80 is a legacy way to invoke a system call and should be avoided.

int 0x80 does work in some cases in 64-bit code, but is never recommended.

Chromium OS Docs - Linux System Call Table

assembly - What is better "int 0x80" or "syscall" in 32-bit code on Linux? - Stack Overflow

Can a syscall number be different between x86 and x64?

Yes, they may different.

fork in x86 is 2, in x64 is 57.

系统调用的 context switch 会改 CR3 吗？/ 内核页表 / 进程页表 / Meltdown / Kernel page table / PTI / KPTI

答案是：

老 CPU 老 kernel，不会改（因为漏洞还没发现）；
老 CPU 新 kernel，会改（漏洞被发现并在 Kernel 层做了修复），
新 CPU 新 kernel，不会改（在处理器硬件上已经得到了修复）。

至于原因请看下面：

进程从用户态进入内核态不会引起 CR3 的改变但会引起堆栈的改变。之所以不会引起 CR3 的改变，是因为内核页表和进程页表并不是在独立的两片区域，而是被 merge 在了同一片区域，也就是共享同一个 CR3。比如说一个页表所映射的 0-3G 是用户空间页表，但是 3-4G 是内核空间页表，PSE/PTE 有对应的 U/S bit 来表示这个页表项是 supervisor 还是 user 的，如果 User 试图访问内核页表所映射的内存，那么就会 crash。注意，整个页表都是在内核内存空间放着的。

每个进程的页面目录就分成了两部分，第一部分为“用户空间”，用来映射其整个进程空间（0x0000 0000－0xBFFF FFFF）即 3G 字节的虚拟地址；第二部分为“系统空间”，用来映射（0xC000 0000－0xFFFF FFFF）1G 字节的虚拟地址。可以看出 Linux 系统中每个进程的页面目录的第二部分是相同的，所以从进程的角度来看，每个进程有 4G 字节的虚拟空间，较低的 3G 字节是自己的用户空间，最高的 1G 字节则为与所有进程以及内核共享的系统空间。

进入内核态CR3会修改吗？ - 知乎

但是在 Meltdown 这个漏洞出现之后，就变了：内核页表隔离 - 维基百科，自由的百科全书，基于此我们有了内核页表隔离技术（KPTI）：

如果没有 KPTI，每当执行用户空间代码（应用程序）时，Linux 会在其分页表中保留整个内核内存的映射，并保护其访问。这样做的优点是当应用程序向内核发送系统调用或收到中断时，内核页表始终存在，可以避免绝大多数上下文切换相关的开销（TLB 刷新、页表交换也就是 CR3 交换等）。

KPTI 通过完全分离用户空间与内核空间页表来解决页表泄露。 KPTI fixes these leaks by separating user-space and kernel-space page tables entirely.

One set of page tables includes both kernel-space and user-space addresses same as before, but it is only used when the system is running in kernel mode.
The second set of page tables for use in user mode contains a copy of user-space and a minimal set of kernel-space mappings that provides the information needed to enter or exit system calls, interrupts and exceptions.

支持进程上下文标识符（PCID）特性的 x86 处理器可以用它来避免 TLB 刷新，但即便如此，它依然有很高的性能成本：

据 KAISER 原作者称，其开销为 0.28%；
一名 Linux 开发者称大多数工作负载下测得约为 5%。
但即便有 PCID 优化，在某些情况下开销高达 30%。

KPTI 在 2018 年早期被合并到 Linux 内核 4.15 版，使用内核启动选项 pti=off 可以部分禁用内核页表隔离。依规定也可对已修复漏洞的新款处理器禁用 KPTI。

有用的参考：context switch^。

内核页表和用户页表不同，它一般不做 swap，也就没有 page fault，而且它一般不会把连续的虚拟地址空间映射成不连续的物理空间，一般只是做一个 offset。这样有一个好处，内核代码比如 driver 如果要拿到一个 VA 所对应的 PA，不需要去遍历页表，只需要通过 offset 来加就可以了，性能上会更好。

所有进程内核态空间到物理地址空间的映射是相同的。

即使内核代码执行的指令操作数也是虚拟地址而不是物理地址。

21. Page Table Isolation (PTI) — The Linux Kernel documentation

所有进程的内核页表用的是同一片内存区域吗？

答案是是的，更确切地说，大家用的是同一片物理内存区域。

虽然页表的基址 CR3 肯定都是不一样的，但是因为每次创建新进程时都要 merge kernel 的这部分到整个进程的页表中，这部分页表其实是放在一个地方的？

linux - Why does kernel add kernel master page table to process's page table? - Stack Overflow

There is no need to duplicate the kernel page tables for each process, ie. all process-specific page directories can reference the same set of kernel page tables. 尽管 virtual memory 空间 duplicate 的，但是其实用的物理内存并没有。页表也是要存在内存里的，所以指向内核页表的物理地址的页表项大家都保持一致就可以啦，毕竟页表的内存也是要由页表来 map 的。

The top level page table is per-process, but it can contain a single entry for the kernel address space. This entry would then point to a common "sub" page table that is shared between every process. By having each process share a common "sub" page table for the kernel addresses, the kernel does not have to duplicate the entries in memory and waste space. And the any updates to the kernel's memory usage only requires modifying one page table and not every process.

linux - Does page table per process contains entries mapping to kernel address space? - Stack Overflow

memory - How are the kernel page tables shared among all processes? - Unix & Linux Stack Exchange

Thus, a change to the kernel page tables does not need to be duplicated to each process, because it already is (due to that referencing via the page directory) and thus only the processors' memory address lookup table must be changed so that the new mapping is loaded from memory.

进程页表除了包含内核页表和用户页表，还有其他部分吗？

我觉得应该是有的，每一个进程都有自己的内核栈。

内核栈不能被用户态访问所以明显映射其 PTE 的 U/S bit 应该是 S，
同时每一个进程的内核栈内容又不尽相同，所以不能放在内核页表那部分里面，因为内核页表那部分是所有 process 需要共享的。

需要说明的是，在内核执行期间分配的内存（除了栈）是可以包括在内核页表那部分里面的，因为是共享的，原因请参考 linux kernel architecture^，映射其 PTE 的 U/S bit 也应该是 S。这也就是为什么在 process 关闭后其之前在内核态分配的内存比如 kmalloc() 的内存没有被回收，可能因为是放在内核页表里是共享的。

Kernel memory is never freed automatically. This includes kmalloc().

linux - is memory allocated by kmalloc() ever automatically freed? - Stack Overflow

综上所述，除了内核页表和用户页表，还有一些其他的部分是 process-specific 同时又是只能是 kernel 来访问的，这应该就是其对应内核线程所分配的那一部分内存。

内核线程 / Kernel thread

内核线程（是不是叫内核进程比较好？）是直接由内核本身启动的进程。内核线程实际上是将内核函数委托给独立的进程，它与内核中的其他进程”并行”执行。

内核线程也是有自己的 pid 的。Linux 中，用户态进程的“祖先”，都是 PID 号为 1 的 init 进程 (或者 systemd)。2 号进程为 kthreadd 进程，在内核态运行，用来管理内核线程。

和其他用户空间的线程的区别应该在于 code 不一样吧：用户空间线程的代码肯定有很多 SYSENTER 和 SYSRET 指令，也就导致一会跑在用户态一会跑在内核态；但是内核线程没有用这些指令，一直跑在内核态。

一些常用的内核线程：

ksoftirqd
oom-reaper: it is the implementation of the OMM (Out–of-Memory) killer function of the Linux kernel, try and reap the memory used by the OOM victim.
kswapd0: 用于内存回收
kworker: 用于执行内核工作队列，分为
- 绑定 CPU（名称格式为 kworker/CPU86330）和
- 未绑定 CPU（名称格式为 kworker/uPOOL86330）两类
migration: 在负载均衡过程中，把进程迁移到 CPU 上，每个 CPU 一个
jbd2/sda1-8:
- jbd 是 Journaling Block Device 的缩写
- 用来为文件系统提供日志功能，以保证数据的完整性；
- 名称中的 sda1-8，表示磁盘分区名称和设备
- 每个使用了 ext4 文件系统的磁盘分区，都会有一个对应的 jbd2 内核线程

4.17 总结-内核线程 - LoveIt

内核栈 / 用户栈 / 中断栈

linux - kernel stack and user space stack - Stack Overflow 这个里面的回答可以看一看 kernel stack 和 user stack 的区别已经为什么要区分。

先说内核栈和用户栈。

每个进程会有两个栈，一个用户栈，存在于用户空间，一个内核栈，存在于内核空间。

当进程在用户空间运行时，CPU 堆栈指针寄存器里面的内容是用户堆栈地址，使用用户栈；
当进程在内核空间时，CPU 堆栈指针寄存器里面的内容是内核栈空间地址，使用内核栈。

所以 CPU syscall context switch 之后堆栈寄存器也需要切换。

The size of the kernel stack is configured during compilation and remains fixed. This is usually two pages (8KB) for each thread.

[Kernel Stack and User Space Stack

Baeldung on Linux](https://www.baeldung.com/linux/kernel-stack-and-user-space-stack)

为什么要设置两套栈呢，直接在用户栈上面加不好吗？

The kernel therefore cannot trust the user space stack pointer to be valid nor usable, and therefore will require one set under its own control. 栈是由两个寄存器 RBP（基址）和 RSP（当前指针）组成的：

如果 Userspace 有权限改 RSP，那么 Userspace 可以故意给 RSP 一个恶意的值，比如 -1，kernel 应该怎么应对？
如果 Userspace 没有权限改 RSP，那么 Userspace 里面的函数调用应该怎么进行呢？

这就是需要一个内核栈的原因。每个进程在内核态都有一个专用的内核栈。当一个进程从用户态转入内核态（例如，通过系统调用或硬件中断）时，该进程会使用它的内核栈。这是因为在内核态需要一个安全的栈区域来执行内核代码，而不能共享用户态的栈，因为用户态可能是不可信或者不安全的。

内核栈并不是硬件架构设计所要求必须有的，而是 OS 这一层出于自己的需求而设计的机制。

linux - kernel stack and user space stack - Stack Overflow

内核是所有进程共享的，那么内核栈也是共享的吗？同一个 CPU 上的不同进程在 syscall 时置上的内核堆栈寄存器的值都是一样的吗？

同一个 CPU 上不同进程在 syscall 时置上的内核栈寄存器的值应该是不相等的，也就是说，每一个进程都有自己的 kernel stack 区域。这样也合理，如果一个进程运行在 kernel space 被调度了出去，栈的内容应该不变。

栈指针所指向的区域也是 map 在页表里的，所以页表所占用的内存空间其实也是 process memory layout 的一部分。页表应该放在 kernel space，这应该是为了防止用户态直接操作页表。

那么问题来了，不是说内核空间大家映射都是相等的吗，内核空间大家都是 shared 的，那怎么保证大家有独自的内核栈呢？

进程在用户态运行时，内核栈是空的吗？

在进程从用户态转到内核态的时候，进程的内核栈总是空的。一旦进程返回到用户态后，内核栈中保存的信息无效，会全部恢复，因此每次进程从用户态陷入内核的时候得到的内核栈都是空的。所以在进程陷入内核的时候，内核直接把内核栈的栈顶地址给堆栈指针寄存器就可以了。

当从内核态返回用户态时，如何恢复用户态栈寄存器？

在 SYSENTER 之后，内核代码会先把用户态堆栈的地址保存在内核栈之中，然后设置堆栈指针寄存器的内容为内核栈的地址，这样就完成了用户栈向内核栈的转换；在 SYSRET 之前，在内核态执行的最后将保存在内核栈里面的用户栈的地址恢复到堆栈指针寄存器即可。

当从用户态进入到内核态时，如何确定内核态栈的位置？

进程从用户态转到内核态的时候，进程的内核栈总是空的。这是因为，当进程在用户态运行时，使用的是用户栈，当进程陷入到内核态时，内核栈保存进程在内核态运行的相关信息，但是一旦进程返回到用户态后，内核栈中保存的信息无效，会全部恢复，因此每次进程从用户态陷入内核的时候得到的内核栈都是空的。所以在进程陷入内核的时候，直接把内核栈（此时是全局的）的栈顶地址给堆栈指针寄存器就可以了。这个由谁来做呢？

这个应该是由内核来做的。

study-note/Linux内核栈与用户栈.md at master · Sawyer-zh/study-note

为什么有内核栈但是没有内核数据段/堆段？

因为如上所述，内核中申请的内存都是全局可以访问的，并不限于一个线程，同时对应的 userspace 线程即使退出了，在调用的过程中跑在 kernel 时申请的内存也是不会被释放的，所以不需要一个内核数据段的概念。

Anyway，反正 64bit 已经没有段的概念了，所以也没关系了。

Context switch

如果我们把上下文切换理解成需要对一些寄存器进行 save/restore 那么，上下文切换发生的时机（越往下 overhead 越大）：

函数调用：也需要 save/restore 寄存器的。但是其轻量的原因是不会发生大量的 cache miss。
系统调用（syscall）：当用户程序调用系统调用时，系统需要从用户程序的上下文切换到内核态，执行系统调用的操作（只涉及部分的 context switch，比如寄存器的 save/restore，不需要更重量级的 context switch 比如切换 CR3，所以你可以说 syscall 会导致 context switch，也可以说不会导致，取决于你怎么去定义什么是一个 context switch）。
中断：当硬件设备需要与操作系统交互时，系统会触发一个中断，从而进行上下文切换。
进程调度：当操作系统需要调度多个进程执行时，它会进行上下文切换。

Why system call needs context switch?

A system call does not necessarily require a context switch in general, but rather a privilege switch. This is because the kernel memory is mapped in each process memory.

Do system calls always means a context switch? - Computer Science Stack Exchange 这里的 comment 很有意思，可以看看。

System calls are very light compared to process context switching, about 100ns or less. Saving the context isn't that expensive. The more expensive part is all the cache misses that will happen as the instruction flow goes to cold memory.

The cost of context switching - SoByte

Tsuna's blog: How long does it take to make a context switch?

我们的确不能把 save and restore registers 这个动作妖魔化：因为 it even happens when calling a function entirely in userspace. Most platforms (i.e. the combination of CPU and OS) specify that some registers are "caller-save" (that is, the value in that register is not guaranteed to be preserved if you call a function, so should be saved by the caller) and some are "callee-save". A system call is callee-save.

System call vs. context switch

process - System call and context switch - Stack Overflow

In a nutshell: context contains many parts, some system calls just need to switch few parts and other system calls need to switch more parts, which makes it heavier.

Voluntary context switch

是指进程无法获取所需资源，导致的上下文切换。比如说， I/O、内存等系统资源不足时，就会发生自愿上下文切换。

Non-voluntary context switch

是指进程由于时间片已到等原因，被系统强制调度，进而发生的上下文切换。比如说，大量进程都在争抢 CPU 时，就容易发生非自愿上下文切换。如果进程比较多，那么更容易触发被打断的调度。