2024-09 Monthly Archive

Measuring CPU usage for a subset of CPUs

import psutil
import time


def parse_cpu_ranges(cpu_ranges):
    """Parse the CPU ranges and return a list of CPU indices."""
    cpus = []
    for part in cpu_ranges.split(','):
        if '-' in part:
            start, end = map(int, part.split('-'))
            cpus.extend(range(start, end + 1))
        else:
            cpus.append(int(part))
    return cpus


def get_cpu_usage_percent(interval, cpus):
    """Get CPU usage percentages for specific CPUs over the interval."""
    usage = psutil.cpu_times_percent(interval=interval, percpu=True)
    filtered_usages = [usage[i] for i in cpus]

    # Initialize a dictionary to sum the CPU usage stats
    sum_usage = {
        'user': 0.0, 'nice': 0.0, 'system': 0.0, 'idle': 0.0,
        'iowait': 0.0, 'irq': 0.0, 'softirq': 0.0,
        'steal': 0.0, 'guest': 0.0, 'guest_nice': 0.0
    }

    for u in filtered_usages:
        sum_usage['user'] += u.user
        sum_usage['nice'] += u.nice
        sum_usage['system'] += u.system
        sum_usage['idle'] += u.idle
        sum_usage['iowait'] += getattr(u, 'iowait', 0.0)
        sum_usage['irq'] += getattr(u, 'irq', 0.0)
        sum_usage['softirq'] += getattr(u, 'softirq', 0.0)
        sum_usage['steal'] += getattr(u, 'steal', 0.0)
        sum_usage['guest'] += getattr(u, 'guest', 0.0)
        sum_usage['guest_nice'] += getattr(u, 'guest_nice', 0.0)

    # Calculate the average usage
    num_cpus = len(filtered_usages)
    average_usage = {key: value / num_cpus for key, value in sum_usage.items()}

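    # On Linux, guest and guest_nice are already included in user and nice,
    # so they are left out here to avoid counting them twice.
    busy_keys = ('user', 'nice', 'system', 'iowait', 'irq', 'softirq', 'steal')
    non_idle = sum(average_usage[key] for key in busy_keys)
    denominator = non_idle + average_usage['idle']
    average_usage['total'] = non_idle / denominator * 100 if denominator else 0.0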

    return average_usage


def main():
    cpu_ranges = input("Enter CPU ranges (e.g., 0-5,11-20,30-32): ")
    interval_duration = 1  # interval to measure CPU usage in seconds
    monitoring_time = 5  # total monitoring time in seconds

    cpus = parse_cpu_ranges(cpu_ranges)
    start_time = time.time()
    end_time = start_time + monitoring_time

    total_usage = {key: 0.0 for key in ['user', 'nice', 'system', 'idle',
                                        'iowait', 'irq', 'softirq',
                                        'steal', 'guest', 'guest_nice']}

    intervals = 0
    while time.time() < end_time:
        interval_usage = get_cpu_usage_percent(interval_duration, cpus)
        for key in total_usage:
            total_usage[key] += interval_usage[key]
        intervals += 1
        # No extra sleep here: cpu_times_percent(interval=...) already blocks
        # for interval_duration seconds while it samples.

    average_usage = {key: value / intervals for key, value in total_usage.items()}

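    # Calculate total usage. guest and guest_nice are skipped because Linux
    # already accounts for them inside user and nice.
    busy_keys = ('user', 'nice', 'system', 'iowait', 'irq', 'softirq', 'steal')
    non_idle = sum(average_usage[key] for key in busy_keys)
    denominator = non_idle + average_usage['idle']
    average_usage['total'] = non_idle / denominator * 100 if denominator else 0.0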

    print(f"\nAverage CPU usage over {monitoring_time} seconds:")
    for key, value in average_usage.items():
        print(f"{key.capitalize()}: {value:.2f}%")

    print(f"\nSum CPU usage over {monitoring_time} seconds and {len(cpus)} cores:")
    for key, value in average_usage.items():
        print(f"{key.capitalize()}: {value * len(cpus):.2f}%")


if __name__ == "__main__":
    main()

Find the NIC model

sudo lspci -s `ethtool -i eth0 | grep bus | awk '{print $2}'` -vvv | grep Product

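Can one filesystem be mounted beneath another?

Of course. For example, on the host's root filesystem you can create a folder and use it as the mount point for a filesystem on another disk.

When a filesystem is unmounted, all filesystems mounted beneath it have to be unmounted first. A plain umount fails while child mounts exist; umount -R unmounts them recursively.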

Thread group / tgid / `clone()` / `fork()`

In the kernel, each thread has its own ID, called a PID, although it would possibly make more sense to call this a TID, or thread ID, and they also have a TGID (thread group ID) which is the PID of the first thread that was created when the process was created.

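clone() is the syscall used by fork(). With some parameters it creates a new process; with others it creates a thread. The difference between them is just which data structures (memory space, processor state, stack, PID, open files, etc.) are shared or not. So whether the tgid changes, that is, whether a new process is started or a new thread is started inside the existing process, depends on the flags passed to the clone() syscall (CLONE_THREAD keeps the child in the caller's thread group).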
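The PID/TGID distinction can be observed from user space. The sketch below is a minimal illustration that assumes Linux and CPython 3.8 or newer: os.getpid() reports the thread group ID, while threading.get_native_id() reports the kernel's per-thread ID. The helper names thread_ids and collect_ids are made up for this example.

```python
import os
import threading


def thread_ids():
    """Return (tgid, tid) as seen by the calling thread."""
    # os.getpid() is the thread group ID shared by every thread in the process;
    # threading.get_native_id() is the kernel's ID for this particular thread.
    return os.getpid(), threading.get_native_id()


def collect_ids():
    """Collect (tgid, tid) from the calling thread and from one worker thread."""
    results = {"main": thread_ids()}

    def worker():
        results["worker"] = thread_ids()

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return results


if __name__ == "__main__":
    for name, (tgid, tid) in collect_ids().items():
        print(f"{name}: tgid={tgid} tid={tid}")
```

Both threads report the same tgid but different tids, which is what a clone() call with CLONE_THREAD would be expected to produce.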

A process is woken up immediately after it is forked:

SYSCALL_DEFINE0(fork)
    kernel_clone
        wake_up_new_task
            p->state = TASK_RUNNING;
            trace_sched_wakeup_new()

sudo bpftrace -l 'tracepoint:sched:sched_wakeup_new' produces output, so the tracepoint exists; it corresponds to exactly this path.

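Viewing the logs from the last crash on Linux

# Show only messages of priority err (3) or more severe
# -r reverses the order so the newest entries come first
# -b -1 selects the log of the previous boot
sudo journalctl -p 3 -r -b -1

# Depending on the distribution, use one of the commands below
# Note that, unlike -b -1, these are not limited to the previous boot
less /var/log/syslog
less /var/log/messages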

Iocost / `blk-iocost`

Paper: IOCost: Block IO Control for Containers in Datacenters

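Signals in Linux

A signal is essentially a software-level imitation of the interrupt mechanism. Signals are asynchronous: a process only needs to register a handler for a signal, and when the signal arrives the system invokes that handler asynchronously.

Linux signals have nothing to do with hardware and are not implemented through interrupts.

To get a signal handled as soon as possible, the sender calls signal_wake_up() to wake the receiver if it is sleeping. The woken process then runs its own kernel-mode code until it returns to user space. This is not the original code path: the interrupted operation undoes whatever needs undoing and returns with ret = -EINTR, dropping the work in hand so that the signal can be answered sooner.

Linux handles signals just before a process returns from kernel mode to user mode, that is, at ret_from_sys_call. The reason for this design: We cannot kill processes in kernel mode because this might corrupt data. (See the code walkthrough in the link below.) Before returning, the kernel changes the rip saved on the kernel stack to the user-space address of the signal handler, so the return lands in the handler.

When the handler finishes, a separate mechanism makes it execute the sigreturn syscall and re-enter the kernel. Why enter the kernel again? The handler ran at the point of returning to user space, so the syscall had already finished; why not jump straight to the user-space return address? The article below does not say. The reason is that the handler's address was injected by modifying the kernel stack, so the kernel has to be re-entered to put the original state back. For example, the original user stack address was saved on the kernel stack, and the user-space handler cannot run with the same registers as the interrupted program; it needs a context of its own. So we cannot jump back directly and need the kernel to restore the saved context.

On the final return to user space the result is EINTR: sigreturn() may "return" something different from EINTR? Yes. -EINTR return means that the interrupted system call won't be restarted.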

https://lists.strace.io/pipermail/strace-devel/2012-January/002043.html

This article analyzes the Linux signal mechanism very well and is worth reading: 一文看懂 Linux 信号处理原理与实现-linux 信号处理流程
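The "register a handler, then let the system call it asynchronously" model can be tried from user space. This is a minimal sketch assuming a POSIX system and that it runs in the main thread; deliver_signal_to_self is a hypothetical helper name. Note that CPython runs the Python-level handler shortly after the kernel delivers the signal rather than inside the kernel-invoked C handler itself.

```python
import os
import signal
import time


def deliver_signal_to_self(signum=signal.SIGUSR1, timeout=2.0):
    """Install a handler, send signum to this process, return what the handler saw."""
    received = []

    def handler(sig, frame):
        # Invoked asynchronously with respect to the loop below.
        received.append(sig)

    previous = signal.signal(signum, handler)
    try:
        os.kill(os.getpid(), signum)
        deadline = time.monotonic() + timeout
        # Wait until the handler has run (or give up after the timeout).
        while not received and time.monotonic() < deadline:
            time.sleep(0.01)
    finally:
        if previous is not None:
            signal.signal(signum, previous)  # restore the old disposition
    return received


if __name__ == "__main__":
    print(deliver_signal_to_self())
```

The code never calls the handler itself; it only registers it and waits, which mirrors the asynchronous delivery described above.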

Swapper process in Linux

For historical reasons the idle process is also called the swapper process. When a CPU has no task to run, it runs the swapper process.

Create new tmux connection

tmux -L trist new -A -s trist

`/proc/iomem`

Shows the current map of the system's physical address space, listing which ranges are system RAM, reserved, or assigned to devices.

Jbd2

jbd2 is a kernel thread that updates the filesystem journal.

They're kernel threads. [jbd2/%s] are used by JBD2 (the journal manager for ext4) to periodically flush journal commits and other changes to disk.

The journal protects the filesystem against metadata inconsistencies in the case of a system crash.

Kprobe

A tracing mechanism provided by the kernel itself, standing alongside the BPF subsystem.

It can monitor events inside a production system.

KProbes heavily depends on processor architecture specific features and uses slightly different mechanisms depending on the architecture on which it's being executed.

A kernel probe is a set of handlers placed on a certain instruction address. A KProbe is defined by a pre-handler and a post-handler. When a KProbe is installed at a particular instruction and that instruction is executed, the pre-handler is executed just before the execution of the probed instruction. Similarly, the post-handler is executed just after the execution of the probed instruction.

JProbes are used to get access to a kernel function's arguments at runtime. A JProbe is defined by a JProbe handler with the same prototype as that of the function whose arguments are to be accessed. When the probed function is executed the control is first transferred to the user-defined JProbe handler, followed by the transfer of execution to the original function. (JProbes were removed from the kernel in Linux 4.15.)

The KProbes package has been designed in such a way that tools for debugging, tracing and logging could be built by extending it.

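File handle leaks / troubleshooting a full disk

A file handle leak happens when a process opens a file or another resource and never closes it, so the handle is not released in time. A long-running program with such a leak gradually uses up its available file descriptors and eventually cannot open new files or establish new network connections because it hits the file descriptor limit.

The number of file descriptors the system grants each process is limited; once they are used up, the process can no longer open files or sockets, which may crash the program or cause denial of service.
Too many unreleased file descriptors may degrade system performance, because the operating system has to operate on a larger file descriptor table, which adds overhead.
If an active file is not closed properly, its data may not be written to disk completely, causing data loss or corruption.

When a disk fills up, it is worth checking whether this is the cause:

On Linux, once a file is deleted it is no longer visible in the filesystem's directory tree, so du stops counting it. However, if a running process still holds an fd for the deleted file, the file is not actually removed from disk and the information in the partition's superblock does not change, so the disk usage reported by df does not go down. The superblock of a partition is very important in Linux: it records the overall state of the filesystem on that partition, including the total, free, and used counts of inodes and blocks, as well as the filesystem format. So as long as the superblock is not updated, the reported disk usage will not change; to store a file the operating system first reads the superblock and then allocates free inodes and blocks.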
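One way to look for such files is to walk /proc/<pid>/fd and check which symlink targets end in " (deleted)", which is how Linux marks an fd whose file has been unlinked. The sketch below assumes Linux with /proc mounted; find_deleted_open_files is a hypothetical helper name, and inspecting another user's process would additionally need sufficient privileges.

```python
import os


def find_deleted_open_files(pid="self"):
    """Return [(fd, target)] for fds of `pid` whose file has been unlinked."""
    fd_dir = f"/proc/{pid}/fd"
    deleted = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed between listdir() and readlink()
        if target.endswith(" (deleted)"):
            deleted.append((int(name), target))
    return deleted


if __name__ == "__main__":
    for fd, target in find_deleted_open_files():
        print(fd, target)
```

Space held by a file found this way is only returned once the holding process closes the fd or exits.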

解决进程文件句柄泄露导致磁盘空间无法释放问题_java 文件句柄不释放导致磁盘空间不足-CSDN博客

SNC (Sub-NUMA Clustering) / NPS (NUMA Nodes Per Socket)

Sub-NUMA clustering is the functionality of partitioning Intel CPU packages into multiple NUMA domains. AMD provides similar functionality called NUMA Nodes Per Socket (NPS).

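SLUB Debug, KASAN, KFENCE: memory corruption and memory leaks

The kernel does not monitor or constrain how memory allocated from the linear mapping area is actually used. This means that if kernel code misbehaves, it can corrupt memory in other regions. This causes many problems, and in severe cases it crashes the machine outright.

Because there is no bounds checking, and because the whole linear mapping area must already be present in the page tables (otherwise the offset calculation used for this area would be meaningless), no error is reported. The typical bugs are:

Out-of-bounds access (out-of-bound)
Use after free (use-after-free)
Invalid free (invalid-free)

This is hard to debug because, when code A corrupts memory that belongs to user B, the crash is eventually triggered by B, so all the resulting logs and the vmcore point the finger at B. In other words, the crash is already the second scene of the problem; there is a time gap between it and the first scene where the memory was corrupted, and by then A may have long since vanished.

Limitations of the different solutions:

SLUB DEBUG has to be enabled on the boot cmdline followed by a reboot, costs a significant amount of slab performance, and only covers slab memory;
KASAN is powerful but introduces a large performance overhead, so it is not suitable for production; the later tag-based scheme reduces the overhead but depends on an Arm64 hardware feature, so it is not general;
KFENCE is a considerable improvement (it was merged only recently) and can be left on in production, but it works by sampling and so finds problems only with very small probability; it needs to be enabled across a large fleet to raise the odds. It can also only detect corruption of slab memory.

Why memory leaks are hard to debug: the kernel keeps no record of allocations from the linear mapping area, so there is no way to know who owns each piece of memory.

How to use KFENCE and how it works (the principle is very simple): 内核内存错误检测工具KFENCE-腾讯云开发者社区-腾讯云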

如何解决Linux内核调测两大难题：内存被改与内存泄露_开源_Kernel SIG成员_InfoQ精选文章

How does the kernel know a segmentation fault should occur?

The process is:

The CPU's MMU sends a signal (it raises a page fault exception), and
the kernel directs it to the offending program, terminating it (by default, via SIGSEGV).

First, why would the CPU raise a fault at all? If the out-of-bounds address happens to fall inside a 4 KiB page that was already allocated, the page table has a mapping for it, so the hardware translates the address without a page fault and has no reason to intervene. The answer is that an out-of-bounds access does not necessarily cause a problem: The language simply says what should happen if you access the elements within the bounds of an array. It is left undefined what happens if you go out of bounds. c++ - Accessing an array out of bounds gives no error, why? - Stack Overflow

Array bounds checking is something the compiler may do: It seems that the widely-used languages which are compiled (eg C, C++, Fortran) all decided to not generate array-bounds-checking by default. But their compilers provide the option to generate code with array-bounds-checking. The compiler may insert code at compile time that checks addresses against the array length.

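Linear mapping area / kernel memory management

The kernel's virtual address space contains a special region in which Linux maps one contiguous block of virtual memory onto one contiguous block of physical memory, so both the virtual and the physical addresses are contiguous. Within this region a single linear formula gives the physical address for a virtual address, and the virtual address for a physical address. This region is called the linear mapping area.

For a virtual address in the linear region, the linear relationship is enough to find the physical memory it maps, without a costly software walk of the page tables. Likewise, for a physical address the same relationship tells you which kernel virtual address maps it, without going through reverse mapping.

I haven't looked into the other benefits yet, so they are unclear to me for now. What is known is that the linear mapping area is large, and the vast majority of memory allocations in the kernel simply carve a piece out of it for their own use.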

内核线性映射区(Linear Mapping Area) on Paging(试读版)
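The linear relationship described in this section amounts to adding or subtracting a constant offset. The sketch below is only an illustration: PAGE_OFFSET is set to the default x86-64 direct-map base for 4-level paging without KASLR, which is an assumption; other architectures and configurations use different bases.

```python
# Assumed base of the linear (direct) mapping: the x86-64 default with
# 4-level paging and KASLR disabled. Real systems may differ.
PAGE_OFFSET = 0xFFFF_8880_0000_0000


def phys_to_virt(phys):
    """Kernel virtual address that maps physical address `phys` linearly."""
    return phys + PAGE_OFFSET


def virt_to_phys(virt):
    """Physical address behind a linear-mapping virtual address."""
    if virt < PAGE_OFFSET:
        raise ValueError("address is below the linear mapping area")
    return virt - PAGE_OFFSET


if __name__ == "__main__":
    virt = phys_to_virt(0x1234000)
    print(hex(virt), hex(virt_to_phys(virt)))
```

No page table walk is involved in either direction, which is the point made above.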

KVM `mmu_shrinker`

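It has already been removed: [PATCH v3 0/1] Remove KVM MMU shrinker - Vipin Sharma

What it reclaimed was the memory occupied by the EPT page tables, not the contents of the guest VM's memory pages.

Is a process waiting for a lock in the S state or the D state?

Either is possible. It depends on whether the task waits in TASK_UNINTERRUPTIBLE.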

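Linux process states

R (TASK_RUNNING): runnable (on a run queue, rq). Many OS textbooks define a process executing on a CPU as RUNNING and one that is runnable but not yet scheduled as READY; in Linux both are folded into TASK_RUNNING.
S (TASK_INTERRUPTIBLE sleep): the process is suspended while waiting for some event (for example a socket connection or a semaphore). When the event occurs (triggered by an external interrupt or by another process), one or more processes on the corresponding wait queue are woken up. ps shows that, normally, the vast majority of processes are in TASK_INTERRUPTIBLE.
D (TASK_UNINTERRUPTIBLE sleep): uninterruptible sleep. "Uninterruptible" does not mean the CPU ignores hardware interrupts; it means the process does not respond to asynchronous signals. In the vast majority of cases a sleeping process should be able to respond to asynchronous signals, otherwise you would be surprised to find that kill -9 cannot kill a sleeping process. The state exists because some kernel code paths must not be interrupted. Responding to an asynchronous signal inserts a signal-handling path into the execution flow (it may stay inside the kernel or extend into user space), which interrupts the original flow. When a process operates on certain hardware (for example, read() on a device file ends up in the device driver, which talks to the physical device), TASK_UNINTERRUPTIBLE may be needed to protect the process so that its interaction with the device is not interrupted and the device is not left in an uncontrollable state. In this case the TASK_UNINTERRUPTIBLE state is always very short, and it is practically impossible to catch with ps. A process in TASK_UNINTERRUPTIBLE survives both kill and kill -9. Note that the process can still be scheduled out; the one thing that cannot happen is making it respond to an asynchronous signal and do something else (if kill -9 took effect immediately, the process would die and leave the device unusable).
T (TASK_STOPPED or TASK_TRACED): stopped by a job-control signal, or stopped while being traced.
Z (TASK_DEAD - EXIT_ZOMBIE)
X (TASK_DEAD - EXIT_DEAD)

So is the difference between S and D just the type of IO? Some peripherals (disks, for example) cannot be interrupted in the middle of an interaction, which is why the D state was introduced. A process in the D state may also be waiting for a lock: if it cannot get the lock for a long time, it stays in D. For more, see loadd.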
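The state letters above are what the kernel exposes in /proc/<pid>/stat. The following is a small sketch, assuming Linux with /proc mounted, that extracts the letter for one process; process_state is a hypothetical helper name.

```python
def process_state(pid="self"):
    """Return the one-letter state (R, S, D, T, Z, ...) of process `pid`."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The second field (comm) is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' before reading the state field.
    after_comm = stat.rsplit(")", 1)[1]
    return after_comm.split()[0]


if __name__ == "__main__":
    print(process_state())
```

A process that reads its own entry is on a CPU at that moment, so it sees R for itself.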

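IOWait

As a rule of thumb, a system with very high iowait probably has an IO bottleneck.
A CPU in the iowait state can still leave it to run other compute-intensive tasks. So when IO-intensive and compute-intensive tasks coexist on a system, iowait may be low.
High iowait means a large share of the CPU's idle time is spent waiting for IO: the CPU is idle while the disk is heavily loaded.
If the run queue is empty, the CPU runs the kernel's idle thread.
When a thread performs IO and has to be switched out, the kernel first sets the thread's task_struct.in_iowait to 1, sets its state to TASK_UNINTERRUPTIBLE (D state), and increments rq.nr_iowait on the run queue. When the IO completes, task_struct.in_iowait goes back to 0 and rq.nr_iowait is decremented; see the kernel function io_schedule_timeout. So rq.nr_iowait is the number of threads on this CPU waiting for IO.
While the CPU runs the idle code, it checks rq.nr_iowait: if it is greater than 0 the idle time is accounted as iowait, otherwise as idle; see the kernel function account_idle_time.

To summarize the difference between idle and iowait:

iowait time is really CPU idle time. Linux has two kinds of idle time: plain idle and iowait.
The difference is that iowait is idle time during which some task has disk IO in flight, whereas plain idle has none.

iowait looks at IO from the CPU's point of view rather than from the IO layer, so high iowait does not necessarily indicate a problem. For example:

If a program moves to a faster CPU, the CPU finishes the code sooner and has more idle time; since iowait time is idle time, a faster machine can show higher iowait.
Take two programs. Program A performs 1 IO operation every second for 10 s, and each operation takes 1 s, so iowait over the 10 s is 100% while the program's IOPS is 1. Program B issues 10 IO operations concurrently in the first second of the 10 s, so iowait is 10% while its peak IOPS is 10. The example is extreme, but program B clearly has the higher peak IOPS and yet the lower iowait.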
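The accounting rule and the two-program example can be reproduced with a toy model. This is a simplified sketch, not kernel code: it assumes one CPU, one tick per second, and that a tick is charged to iowait whenever the CPU is idle and nr_iowait is greater than 0. The function and variable names are made up for illustration.

```python
def account_idle_ticks(ticks):
    """ticks: list of (cpu_busy, nr_iowait) per tick. Return percentages."""
    if not ticks:
        raise ValueError("need at least one tick")
    counts = {"busy": 0, "iowait": 0, "idle": 0}
    for cpu_busy, nr_iowait in ticks:
        if cpu_busy:
            counts["busy"] += 1
        elif nr_iowait > 0:
            counts["iowait"] += 1  # idle while some task waits for IO
        else:
            counts["idle"] += 1    # idle with no IO outstanding
    total = len(ticks)
    return {key: 100.0 * value / total for key, value in counts.items()}


# Program A: one IO in flight during each of the 10 seconds.
program_a = [(False, 1)] * 10
# Program B: 10 concurrent IOs in the first second, nothing afterwards.
program_b = [(False, 10)] + [(False, 0)] * 9

if __name__ == "__main__":
    print(account_idle_ticks(program_a))
    print(account_idle_ticks(program_b))
```

Program A ends up with all idle time charged to iowait and program B with only one tick in ten, even though B issued far more IO at its peak.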

Linux命令拾遗-%iowait指标代表了什么？你可能见过top、vmstat、mpstat中有一个叫wa(%iowa - 掘金

iowait 到底是什么？ - Steins;Lab

LOC, CAL

CAL (Function Call Interrupts): my understanding is that these are IPIs.

That is interrupt signals sent by one processor to any other processor in the system and delivered not through an IRQ line, but directly as a message on the bus that connects the local APIC of all CPUs.

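CPU utilization

Total CPU time = user + nice + system + idle + iowait + irq + softirq + steal

user and system need no explanation.

nice: CPU time spent in user mode at low priority, that is, by processes whose nice value has been adjusted to between 1 and 19. nice ranges from -20 to 19; the larger the value, the lower the priority.
iowait: CPU time spent waiting for I/O.
irq: CPU time spent handling hard interrupts.
softirq: CPU time spent handling soft interrupts.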
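These counters appear, in this order, on the cpu line of /proc/stat, so utilization over a window can be computed from two snapshots. The sketch below treats idle and iowait as non-busy time, which is a common convention but an assumption here; the function names are hypothetical.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse a 'cpu ...' line from /proc/stat into a dict of tick counters."""
    parts = line.split()
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def utilization(before, after):
    """Percent of non-idle time between two snapshots."""
    delta = {key: after[key] - before[key] for key in FIELDS}
    total = sum(delta.values())
    if total == 0:
        return 0.0
    busy = total - delta["idle"] - delta["iowait"]
    return 100.0 * busy / total


if __name__ == "__main__":
    import time
    with open("/proc/stat") as f:
        first = parse_cpu_line(f.readline())
    time.sleep(1)
    with open("/proc/stat") as f:
        second = parse_cpu_line(f.readline())
    print(f"{utilization(first, second):.1f}%")
```

The trailing guest and guest_nice columns are ignored on purpose, since Linux already includes them in user and nice.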

CPU 使用率过高怎么办 - 观海云不远 - 博客园

`sar` / `ssar` / `tsar2` (SRE SAR)

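Compared with sar, every other Linux command is trash. sar is a monitoring tool for Linux that has always sat at the top of the pecking order.

ssar is a brand-new member of the sar tool family. Besides covering most of the main features of the traditional sar tools, it adds more whole-machine metrics, new process-level metrics, and its distinctive load metrics.

Tsar is a classic sar-style tool in the industry that many people use regularly when investigating problems. Tsar2 stays largely consistent and compatible with tsar in its options and output format.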

ssar: ssar(SRE SAR)是sar工具家族中崭新的一个。在几乎涵盖了传统sar工具的大部分主要功能之外，它还扩展了更多的整机指标；新增了进程级指标和特色的load指标。

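Memory bandwidth and latency

Are memory bandwidth and latency related, or independent of each other?

Does high memory bandwidth usage necessarily bring high memory latency? It does have an effect.

Which CPU does a peripheral's interrupt get delivered to?

Intel Memory Latency Checker (Intel MLC)

Intel Memory Latency Checker (Intel MLC) is a tool for measuring memory latency and bandwidth, and it can also measure how latency and bandwidth change as the load on the system increases.

How does the hardware inject an interrupt into a vCPU in non-root mode and get the interrupt handler called?

Just as the host has an IDT, the guest has one too: the guest-state area of the VMCS holds the guest's IDTR, which points to the guest's IDT. When the hardware injects an interrupt, it uses the IDT referenced by the VMCS of the vCPU running in non-root mode and calls the handler registered there for the interrupt vector.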

Hrtimer / HPET

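hrtimer can use the APIC timer, particularly on x86-based systems, to implement high-resolution timing. The APIC timer is a good choice especially when no HPET is available.

hrtimer is an abstraction layer in the kernel (not a concept at the same level as the APIC timer); underneath it, various hardware timers (APIC timer, HPET, TSC) can provide the high-resolution timing.

How the choice between the APIC timer and HPET is made is explained in detail here: APIC Calibration Using HPET : r/osdev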

HPET is higher overhead, partially from being external to the CPU core, and partially because there's only one HPET and there could be hundreds of CPU cores.

It looks like the APIC timer is the more widely used source of timer interrupts. In practice, the Linux kernel uses a mix of available timers to balance precision and performance. So the timer that generates the interrupt may not be fixed.

Linux capabilities

Every process has its own capability list, describing what it may and may not do to the system.

Traditional UNIX implementations distinguish two categories of processes:

privileged processes (whose effective user ID is 0, referred to as superuser or root),
and unprivileged processes (whose effective UID is nonzero).

Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to full permission checking based on the process's credentials (usually: effective UID, effective GID, and supplementary group list).

Capabilities are a per-thread attribute.
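A thread's capability sets are visible as hexadecimal bitmasks in /proc/<pid>/status (the CapEff line holds the effective set). The sketch below decodes such a mask. The CAP_NAMES table lists only a handful of capabilities with their bit numbers from capabilities(7) and is deliberately incomplete; the helper names are made up for this example.

```python
# A few capability bit numbers; not the full list.
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    5: "CAP_KILL",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    21: "CAP_SYS_ADMIN",
}


def effective_caps(status_text):
    """Extract the CapEff bitmask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def decode_caps(mask):
    """Turn a capability bitmask into a list of names (or cap_<bit>)."""
    names = []
    bit = 0
    while (mask >> bit) != 0:
        if (mask >> bit) & 1:
            names.append(CAP_NAMES.get(bit, f"cap_{bit}"))
        bit += 1
    return names


if __name__ == "__main__":
    with open("/proc/self/status") as f:
        print(decode_caps(effective_caps(f.read())))
```

An unprivileged process typically shows an all-zero CapEff, while a root process shows most bits set.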

capabilities(7) - Linux manual page

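How does an application on bare metal find out how many cores it can use?

A Python application most likely calls os.cpu_count().
A Java application calls Runtime.getRuntime().availableProcessors().

So how do the JVM and CPython get the number of CPU cores? Through sysconf(_SC_NPROCESSORS_ONLN), a glibc function.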
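The different views of "how many CPUs" can be compared side by side. This sketch assumes Linux, where os.sysconf("SC_NPROCESSORS_ONLN") and os.sched_getaffinity() are available; the affinity count is included because it can be smaller than the online count when the process is pinned to a subset of CPUs. cpu_counts is a hypothetical helper name.

```python
import os


def cpu_counts():
    """Return several views of the CPU count for the current process."""
    return {
        # What most Python programs use.
        "os.cpu_count": os.cpu_count(),
        # The glibc query mentioned above: CPUs currently online.
        "sysconf_online": os.sysconf("SC_NPROCESSORS_ONLN"),
        # CPUs this process is actually allowed to run on.
        "affinity": len(os.sched_getaffinity(0)),
    }


if __name__ == "__main__":
    for name, value in cpu_counts().items():
        print(f"{name}: {value}")
```

When a tool such as taskset restricts the process, only the affinity figure is expected to shrink.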

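Soft interrupts / `ksoftirqd` / the `INT` instruction

The term "soft interrupt" can mean two things:

The INT instruction: an interrupt triggered by software, hence "software interrupt". Some debugging features rely on INT, and syscalls used to be implemented with INT as well.
softirq: the bottom half of interrupt handling, which is about software processing rather than things like programming hardware registers, hence also "soft interrupt".

Don't mix them up; work out which meaning is intended from the context.

INT is the software interrupt instruction, a special form of the CALL instruction:

a subroutine invoked by CALL is part of the user program,
whereas a subroutine invoked by INT is a special routine provided by the operating system or the BIOS.

A softirq is the bottom half of interrupt handling; it is usually raised toward the kernel by a hard interrupt's service routine.

How is ksoftirqd told to process the bottom half? As shown below, simply by waking up that thread.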

raise_softirq(nr)
    raise_softirq_irqoff
        wakeup_softirqd
            wake_up_process
                try_to_wake_up

// The softirq numbers are:
enum
{
	HI_SOFTIRQ=0,
	TIMER_SOFTIRQ,
	// network interrupt
	NET_TX_SOFTIRQ,
	NET_RX_SOFTIRQ,
	// block storage interrupt
	BLOCK_SOFTIRQ,
	IRQ_POLL_SOFTIRQ,
	TASKLET_SOFTIRQ,
	// cfs load balancing
	SCHED_SOFTIRQ,
	HRTIMER_SOFTIRQ,
	RCU_SOFTIRQ,    /* Preferable RCU should always be the last softirq */
	NR_SOFTIRQS
};

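// The handling logic ksoftirqd runs for each softirq number, taking the scheduler's SCHED_SOFTIRQ as the example
// kernel/sched/fair.c registers the handler:
open_softirq(SCHED_SOFTIRQ, sched_balance_softirq);
// This handler mainly serves load balancing
sched_balance_trigger
    raise_softirq(SCHED_SOFTIRQ)

After a hardware interrupt fires, the CPU has to map the request to a specific service routine through the vector table; the hardware does this automatically.
A softirq is different: a daemon thread has to do this work, so it is an interrupt emulated in software, hence the name. Another reason for the name is that it resembles a hard interrupt: both are distinguished by a number, and ksoftirqd relies on the softirq nr kept by software to decide how to service the request. See raise_softirq(nr) above.
ksoftirqd does not only run after a hardware interrupt; scheduling, for example, seems to involve no hardware interrupt of its own?
The handler for each softirq is chosen by its user; CFS, for example, registers the function to use through open_softirq(). Most of the time the scheduler does not need a softirq; it is only needed when load balancing is triggered.

Every CPU has a softirq kernel thread named ksoftirqd/<CPU index>.

What it does:

This kernel thread is responsible for processing the bottom halves of interrupts.

How does monitoring usually collect softirq statistics?

What can make this thread's CPU usage spike?

A large number of peripheral interrupts, with the bottom halves all left to ksoftirqd;
Problems in a kernel module or a driver;
Frequent scheduling? Normally it is only triggered for load balancing, so it is probably not a big factor?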
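On the monitoring question, one source such statistics can be built from is /proc/softirqs, which lists cumulative per-CPU counts for each softirq type. The sketch below parses that table and computes how much each type grew between two snapshots; it is an illustration with made-up function names, not a description of any particular monitoring tool.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs text into {softirq_name: {cpu_name: count}}."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        name, *counts = line.split()
        table[name.rstrip(":")] = dict(zip(cpus, map(int, counts)))
    return table


def softirq_deltas(before, after):
    """Total increase per softirq type (summed over CPUs) between snapshots."""
    return {
        name: sum(after[name].values()) - sum(before[name].values())
        for name in after
        if name in before
    }


if __name__ == "__main__":
    import time
    with open("/proc/softirqs") as f:
        first = parse_softirqs(f.read())
    time.sleep(1)
    with open("/proc/softirqs") as f:
        second = parse_softirqs(f.read())
    print(softirq_deltas(first, second))
```

A type whose delta dominates (NET_RX under heavy network load, for instance) is a natural first suspect when a ksoftirqd thread is busy.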

PELT (Per Entity Load Tracking) / kernel

kernel/sched/pelt.c. There is not much code at the moment, only about 400 lines.
Per-entity load tracking [LWN.net]

The problem it solves / the effect it achieves: the scheduler now has a much clearer idea of how much each process and scheduler control group is contributing to the load on the system. The most obvious target is likely to be load balancing: distributing the processes on the system so that each CPU is carrying roughly the same load.

Time is viewed as a sequence of 1ms (actually, 1024µs) periods. An entity's contribution to the system load in a period $p_i$ is just the portion of that period that the entity was runnable, either actually running or waiting for an available CPU. This raises a question: why does the time an entity spends waiting for an available CPU also count as its contribution to load? Probably because load here represents how much computation the process actually needs, so that the scheduler can allocate CPU resources according to the demand of each scheduling entity. After all, as the LWN article above puts it:

Perfect scheduling requires a crystal ball; when the kernel knows exactly what demands every process will make on the system and when, it can schedule those processes optimally. Unfortunately, hardware manufacturers continue to push affordable prediction-offload devices back in their roadmaps, so the scheduler has to be able to muddle through in their absence.

If we let $L_i$ designate the entity's load contribution in period $p_i$, then an entity's total contribution can be expressed as:

\[L=L_0+L_1\times y+L_2\times y^2 +\dots\]

Where $y$ is the decay factor chosen. In the current code, $y$ has been chosen so that $y^{32}$ is equal to 0.5. Thus, an entity's load contribution 32 ms in the past is weighted half as strongly as its current contribution.

Another advantage of this approach is that it is cheap to compute: The nice thing about this series is that it is not actually necessary to keep an array of past load contributions; simply multiplying the previous period's total load contribution by $y$ and adding the new $L_0$ is sufficient.
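The claim that no history array is needed can be checked numerically. The sketch below uses floating point for readability, which is a simplification: the kernel works in fixed-point arithmetic. It computes the series both directly and incrementally with $y^{32} = 0.5$; the function names are made up for this example.

```python
# Decay factor chosen so that Y ** 32 == 0.5 (floating-point approximation).
Y = 0.5 ** (1 / 32)


def load_direct(newest_first):
    """L = L_0 + L_1*y + L_2*y^2 + ..., where newest_first[0] is L_0."""
    return sum(l * Y ** i for i, l in enumerate(newest_first))


def load_incremental(oldest_first):
    """Same value, keeping only a running total: total = total*y + new L_0."""
    total = 0.0
    for contribution in oldest_first:
        total = total * Y + contribution
    return total


if __name__ == "__main__":
    history = [1.0, 0.5, 0.0, 1.0, 0.25]  # newest first, fraction of each period runnable
    print(load_direct(history))
    print(load_incremental(reversed(history)))
```

Both calls print the same value up to rounding, while the incremental form stores a single number per entity.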

It uses an exponentially weighted moving average (EWMA): The most recent 32ms contribute half, while the rest of history contribute the other half.

\[ewma(u)=ewma\_sum(u)/ewma\_sum(1)\]

Schedutil — The Linux Kernel documentation

`update_rq_clock()` Kernel

rq->clock records the time of this runqueue's most recent scheduling-related clock update.

void update_rq_clock(struct rq *rq)
{
	s64 delta;

	lockdep_assert_rq_held(rq);

	if (rq->clock_update_flags & RQCF_ACT_SKIP)
		return;

    //...
	delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
	if (delta < 0)
		return;
	rq->clock += delta;
	update_rq_clock_task(rq, delta);
}

`struct rq` Kernel

The run queue holds the threads that can run right now; thread scheduling means choosing among these runnable threads.

/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
	/* runqueue lock: */
	raw_spinlock_t		__lock;

    // Number of runnable tasks on this runqueue
	unsigned int		nr_running;
#ifdef CONFIG_NUMA_BALANCING
	unsigned int		nr_numa_running;

    // Each rq belongs to one CPU, and that CPU sits in exactly one NUMA node.
    // Number of tasks on this rq that want to run on this rq's NUMA node,
    // i.e. tasks currently running on their preferred node.
	unsigned int		nr_preferred_running;
	unsigned int		numa_migrate_on;
#endif
#ifdef CONFIG_NO_HZ_COMMON
#ifdef CONFIG_SMP
	unsigned long		last_blocked_load_update_tick;
	unsigned int		has_blocked_load;
	call_single_data_t	nohz_csd;
#endif /* CONFIG_SMP */
	unsigned int		nohz_tick_stopped;
	atomic_t		nohz_flags;
#endif /* CONFIG_NO_HZ_COMMON */

#ifdef CONFIG_SMP
	unsigned int		ttwu_pending;
#endif
	u64			nr_switches;

#ifdef CONFIG_UCLAMP_TASK
	/* Utilization clamp values based on CPU's RUNNABLE tasks */
	struct uclamp_rq	uclamp[UCLAMP_CNT] ____cacheline_aligned;
	unsigned int		uclamp_flags;
#define UCLAMP_FLAG_IDLE 0x01
#endif

	struct cfs_rq		cfs;
	struct rt_rq		rt;
	struct dl_rq		dl;

#ifdef CONFIG_FAIR_GROUP_SCHED
	/* list of leaf cfs_rq on this CPU: */
	struct list_head	leaf_cfs_rq_list;
	struct list_head	*tmp_alone_branch;
#endif /* CONFIG_FAIR_GROUP_SCHED */

	/*
	 * This is part of a global counter where only the total sum
	 * over all CPUs matters. A task can increase this counter on
	 * one CPU and if it got migrated afterwards it may decrease
	 * it on another CPU. Always updated under the runqueue lock:
	 */
	unsigned int		nr_uninterruptible;

    // The task currently running on this runqueue; for how it differs from current, see:
    // [current与rq->curr浅析 - 温暖的电波 - 博客园](https://www.cnblogs.com/liuhailong0112/p/14921228.html)
	struct task_struct __rcu	*curr;
	struct task_struct	*idle;
	struct task_struct	*stop;
	unsigned long		next_balance;
	struct mm_struct	*prev_mm;

	unsigned int		clock_update_flags;
    // CPU clock time at the last update; updates can happen at many points
	u64			clock;
    // Time actually consumed by tasks: rq->clock_task = rq->clock - (irq time + steal time)
    // Between two clock updates there is a delta; irq_delta is first subtracted
    // from it (delta -= irq_delta) and the remainder is added to clock_task.
	u64			clock_task ____cacheline_aligned;
	u64			clock_pelt;
	unsigned long		lost_idle_time;
	u64			clock_pelt_idle;
	u64			clock_idle;
#ifndef CONFIG_64BIT
	u64			clock_pelt_idle_copy;
	u64			clock_idle_copy;
#endif

	atomic_t		nr_iowait;

#ifdef CONFIG_SCHED_DEBUG
	u64 last_seen_need_resched_ns;
	int ticks_without_resched;
#endif

#ifdef CONFIG_MEMBARRIER
	int membarrier_state;
#endif

#ifdef CONFIG_SMP
	struct root_domain		*rd;
    // The scheduling domain for this rq, i.e. for this CPU.
    // In theory this should be the lowest-level domain (the base domain).
	struct sched_domain __rcu	*sd;

	unsigned long		cpu_capacity;

    // Each CPU runqueue has one balance callback
	struct balance_callback *balance_callback;

	unsigned char		nohz_idle_balance;
	unsigned char		idle_balance;

	unsigned long		misfit_task_load;

	/* For active balancing */
	int			active_balance;
	int			push_cpu;
	struct cpu_stop_work	active_balance_work;

	/* CPU of this runqueue: */
	int			cpu;
	int			online;

	struct list_head cfs_tasks;

	struct sched_avg	avg_rt;
	struct sched_avg	avg_dl;
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
	struct sched_avg	avg_irq;
#endif
#ifdef CONFIG_SCHED_HW_PRESSURE
	struct sched_avg	avg_hw;
#endif
	u64			idle_stamp;

    // Average idle time of the CPU this rq belongs to.
    // Not the same as CPU utilization: idle 1ms / busy 1ms and
    // idle 10ms / busy 10ms give the same utilization, but the former
    // clearly has the lower avg_idle.
	u64			avg_idle;

	// This is used to determine avg_idle's max value
    // Sum of the maximum time costs of new-idle balancing on this CPU across all domain levels.
    // Mainly used to cap avg_idle: the computed avg_idle cannot exceed 2x
    // max_idle_balance_cost.
	u64			max_idle_balance_cost;

#ifdef CONFIG_HOTPLUG_CPU
	struct rcuwait		hotplug_wait;
#endif
#endif /* CONFIG_SMP */

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
	u64			prev_irq_time;
	u64			psi_irq_time;
#endif
#ifdef CONFIG_PARAVIRT
	u64			prev_steal_time;
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
	u64			prev_steal_time_rq;
#endif

	/* calc_load related fields */
	unsigned long		calc_load_update;
	long			calc_load_active;

#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
	call_single_data_t	hrtick_csd;
#endif
	struct hrtimer		hrtick_timer;
	ktime_t			hrtick_time;
#endif

#ifdef CONFIG_SCHEDSTATS
	/* latency stats */
	struct sched_info	rq_sched_info;
	unsigned long long	rq_cpu_time;
	/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */

	/* sys_sched_yield() stats */
	unsigned int		yld_count;

	/* schedule() stats */
	unsigned int		sched_count;
	unsigned int		sched_goidle;

	/* try_to_wake_up() stats */
	unsigned int		ttwu_count;
	unsigned int		ttwu_local;
#endif

#ifdef CONFIG_CPU_IDLE
	/* Must be inspected within a RCU lock section */
	struct cpuidle_state	*idle_state;
#endif

#ifdef CONFIG_SMP
	unsigned int		nr_pinned;
#endif
	unsigned int		push_busy;
	struct cpu_stop_work	push_work;

#ifdef CONFIG_SCHED_CORE
	/* per rq */
	struct rq		*core;
	struct task_struct	*core_pick;
	unsigned int		core_enabled;
	unsigned int		core_sched_seq;
	struct rb_root		core_tree;

	/* shared state -- careful with sched_core_cpu_deactivate() */
	unsigned int		core_task_seq;
	unsigned int		core_pick_seq;
	unsigned long		core_cookie;
	unsigned int		core_forceidle_count;
	unsigned int		core_forceidle_seq;
	unsigned int		core_forceidle_occupation;
	u64			core_forceidle_start;
#endif

	/* Scratch cpumask to be temporarily used under rq_lock */
	cpumask_var_t		scratch_mask;

#if defined(CONFIG_CFS_BANDWIDTH) && defined(CONFIG_SMP)
	call_single_data_t	cfsb_csd;
	struct list_head	cfsb_csd_list;
#endif
};

A roundup of Docker registry mirrors in China

DockerPull 镜像加速

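Five-stage pipeline: instruction fetch (IF), decode (ID), execute (EX), memory access (MEM), write back (WB)

Every instruction is therefore stretched to take 5 clock cycles (one clock cycle per pipeline stage), and the length of a clock cycle is set by the slowest stage.

Fetch: nothing much to say; the instruction is read from its memory address.
Decode: decoding yields the register indices of the operands the instruction needs, and these indices are used to read the operands from the general-purpose register file.
Execute: the stage where the instruction's actual computation happens. For an add instruction the operands are added; for a subtract instruction they are subtracted.
Memory access: memory access instructions are among the most important instruction types in an instruction set. This stage is where such an instruction reads data from memory or writes data to memory.
Write back: the result of the instruction is written back to the general-purpose register file. For an ordinary arithmetic instruction the result comes from the execute stage; for a memory read it is the data read from memory in the memory access stage.

Basically every pipeline stage (except write back) has its own pipeline register for holding data temporarily. A good illustration: page 29 of https://users.cs.utah.edu/~bojnordi/classes/6810/f19/slides/05-pipelining.pdf.

Why not access memory first and compute afterwards? Shouldn't we fetch the memory contents we need and then compute on the fetched value?

Putting execute first means the memory address can be computed in the execute stage and the access performed in the next stage, for example when a mov uses relative addressing: the execution unit first computes the address to access, and the memory access stage then performs the access.

The add instruction (on x86, for instance) can operate on memory directly; doesn't that require the memory access to come first?

Hypothetically: if R-type instructions could also access memory, the stages would become fetch, decode, address calculation, memory access, execute, and write back.

Chapter 1

From the above it is easy to see that not every instruction needs all five pipeline stages.

Why five stages, rather than four or six?

A 5-stage pipeline allows 5 instructions to be executing at once, as long as they are in different stages. If we split the stages more finely, could even more instructions execute at the same time?

The five-stage pipeline is the classic one. The most popular RISC architecture ARM processor follows 3-stage and 5-stage pipelining. In 3-stage pipelining the stages are: Fetch, Decode, and Execute. Processors have reasonable implements with 3 or 5 stages of the pipeline because as the depth of pipeline increases the hazards related to it increases.

The main cost of a deeper pipeline is that per-instruction latency goes up and so does the branch misprediction penalty. A pipeline also pays off most when instructions have similar characteristics, and less so when they differ widely.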
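The benefit of overlapping instructions can be put in numbers under idealized conditions. The sketch below assumes no hazards, stalls, or mispredictions, which real pipelines do not enjoy: the first instruction needs one cycle per stage to drain through, and each later instruction completes one cycle after the previous one. The function names are made up for illustration.

```python
def unpipelined_cycles(n_instructions, stages=5):
    """Each instruction runs through every stage before the next one starts."""
    return n_instructions * stages


def pipelined_cycles(n_instructions, stages=5):
    """Ideal pipeline: fill it once, then retire one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return stages + n_instructions - 1


def speedup(n_instructions, stages=5):
    """Ideal speedup of the pipelined over the unpipelined design."""
    return unpipelined_cycles(n_instructions, stages) / pipelined_cycles(n_instructions, stages)


if __name__ == "__main__":
    for n in (1, 5, 100):
        print(n, pipelined_cycles(n), unpipelined_cycles(n), round(speedup(n), 2))
```

The ideal speedup approaches the number of stages as the instruction count grows, which is why deeper pipelines look attractive until the hazard and misprediction costs mentioned above are taken into account.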

What is Pipelining : Architecture, Hazards, Advantages Disadvantages

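Block device files and regular files

Regular files: as the name suggests, nothing special to explain;
Block device files: for example, the disk device nodes under /dev/.

A block device file is the block device's physical address space; a regular file is a virtual address space on the block device. A regular file has one more layer of address translation than a block device file: the filesystem.

The system calls that work on regular files can also be used to access the contents of a block device file, for example open(), read(), and mmap().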

Block device files provide access to devices at the block level, meaning that data is read from and written to the device in fixed-size blocks. These blocks are usually 512 bytes or a multiple of 512 bytes in size.

const struct file_operations def_blk_fops = {
	.open		= blkdev_open,
	.release	= blkdev_release,
	.llseek		= blkdev_llseek,
	.read_iter	= blkdev_read_iter,
	.write_iter	= blkdev_write_iter,
	.iopoll		= iocb_bio_iopoll,
	.mmap		= blkdev_mmap,
	.fsync		= blkdev_fsync,
	.unlocked_ioctl	= blkdev_ioctl,
#ifdef CONFIG_COMPAT
	.compat_ioctl	= compat_blkdev_ioctl,
#endif
	.splice_read	= filemap_splice_read,
	.splice_write	= iter_file_splice_write,
	.fallocate	= blkdev_fallocate,
	.fop_flags	= FOP_BUFFER_RASYNC,
};
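Because a block device file accepts the same system calls as a regular file, block-aligned reads look identical for both. The sketch below is a hedged example: read_blocks is a hypothetical helper, the 512-byte block size is an assumption, and opening a real device such as a disk node under /dev usually requires root, so the demo reads from a regular file instead.

```python
import os

SECTOR_SIZE = 512  # assumed block size; real devices may use a multiple of this


def read_blocks(path, block_index, count=1, block_size=SECTOR_SIZE):
    """Read `count` blocks starting at block `block_index` from `path`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # pread() takes an explicit byte offset, so no separate seek is needed.
        return os.pread(fd, count * block_size, block_index * block_size)
    finally:
        os.close(fd)


if __name__ == "__main__":
    import tempfile
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"A" * SECTOR_SIZE + b"B" * SECTOR_SIZE)
    print(read_blocks(f.name, 1)[:8])
    os.unlink(f.name)
```

Passing a device path instead of the temporary file would go through def_blk_fops rather than a filesystem's file operations, with the same user-space code.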

`copy_to_user()`, `copy_from_user()` Kernel

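Both functions are APIs exposed by include/linux/uaccess.h.

There are three candidate ways to exchange data with userspace:

Userspace passes in a pointer and we dereference it directly.
We memcpy the userspace data into the kernel.
We dutifully use copy_to_user()/copy_from_user() to copy the data safely, which is the slowest of the three.

Why doesn't the first way work? The kernel and user space use the same page table (the same CR3), so the MMU should be able to translate the address. The reasons are:

User space may maliciously hand the kernel an address that user space itself is not allowed to access, to get the kernel to access it. So a check is necessary.
If the pointer refers to memory that has been swapped out, accessing it causes a page fault. But kernel code is not allowed to page fault in kernel mode; the result would be an "oops".
Sometimes the kernel and user page tables are separate (KPTI, for example, splits them, although on x86 the kernel-mode table still maps user space). Where the address spaces really are separate, a user virtual address is meaningless to the kernel: the kernel page table either has no mapping for it or maps it to different physical memory.

The reason memcpy() can't be used is that it skips the checks, i.e. the first and second points above. You only use memcpy() with pointers internal to the kernel that are never supplied to userspace.

A raw pointer can't be used because of all three points above. The third matters most when the page tables are separate: the data has to be copied so the kernel can use its own virtual address with its own page table, rather than accessing the user virtual address directly.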

Why page fault is not allowed in kernel mode?

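Architecturally it is allowed: from the point of view of the CPU's hardware design, a page fault exception (#PF) can be raised when a fault happens while running in Ring 0.

The Linux kernel's code does not allow it; when it happens the kernel treats it as a bug and reports an error: Normally, page faults incurred when running in kernel mode will cause a kernel oops. There are exceptions, though; the functions which copy data between user and kernel space are one example.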

Kernel development [LWN.net]

In user space, you can simply suspend the user process and move on without causing any problems. But in kernel space, your thread may have taken many locks, or disabled interrupts. If you have to stop to handle a page fault, then you have a choice:

Let the entire system grind to a halt for millions of instructions while that page is loaded from disk. This would lead to terrible performance.
Add complexity so that at any point, the locks/interrupts can be "un-wound", allowing other kernel threads to proceed.

Can the Linux kernel use pageable (swappable) memory for its own buffers? - Stack Overflow

Spinlocks in Linux

The order of evolution seems to be:

Ticket spinlock
MCS lock
qspinlock

Linux中的spinlock机制[一] - CAS和ticket spinlock - 知乎

Queued spinlock

Queued spinlocks are a locking mechanism in the Linux kernel that replaces the standard spinlocks. At least this is true for the x86_64 architecture. In other words, spinlocks in Linux are now implemented as qspinlocks. Problems with the old spinlock implementation:

May be unfair since other threads which arrived later at the lock may acquire it first. For example, two threads are both spinning, one started earlier and one later, but the later one may get the lock, because the contention is a free-for-all that does not take into account how long each thread has been spinning;
The second problem is that all threads which want to acquire a lock must execute many atomic operations like test_and_set on a variable which is in shared memory. This leads to the cache invalidation as the cache of the processor will store lock=1, but the value of the lock in memory may not be 1 after a thread will release this lock. In other words, each CPU's cached copy keeps getting invalidated to keep its view of memory up to date, which performs poorly.

A queued spinlock spins on its own memory location rather than on shared memory. The idea is to put all threads waiting for the lock in a queue: the first thread will contain a next field which will point to the second thread. From this moment, the second thread will wait until the first thread release its lock and notify next thread about this event. Because all waiting threads are linked through next, the lock holder can hand the lock directly to the next waiter instead of releasing it and letting every waiter fight for it. Each waiter spins on its own local variable; the holder writes that variable to notify the waiter, and the waiter takes one invalidation and then reads the latest value.

The Queued spinlock uses a queuing mechanism (hence its name) to keep track of processors waiting on the spinlock. This is different from the ordinary spinlock, where processors simply spin on the same piece of memory. The Queued Spinlock is more efficient by avoiding this memory contention.

A very detailed introduction: Queued spinlocks · Linux Inside

Ticket lock

Ticketlocks have a lot of advantages for large systems, including:

reduced cacheline bouncing and,
more predictable wait time. (See this LWN article for a complete description.)

The key attribute for this discussion is that ticketlocks essentially make a first-come first-served queue: if A has the lock, then B tries to grab it, and then C, B is guaranteed to get the lock before C. So now, if C is spinning waiting for the lock, it must wait for both A and B to finish before it can get the lock.

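This strict ordering backfires when the waiter that is next in line is not running, for example a preempted vCPU under virtualization: everyone queued behind it keeps spinning until it runs again. The result is that on a moderately loaded system, the vast majority of the cpu cycles are actually spent waiting for ticketlocks rather than doing any useful work. This is called a “ticketlock storm”.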

Ticket Locks: This is a type of spinlock that provides fairness by serving requests in the order they were made. Each thread that wants to acquire the lock takes a "ticket" (increments a counter) and waits until its ticket number matches a "serving" counter.

Ticket locks ensure that every thread gets access in the order they requested the lock, preventing starvation that can occur with traditional spinlock implementations. It seems that the ticket lock solves the spinlock fairness problem, but probably not the problem of all waiters spinning on shared memory.
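The "take a ticket, wait until it is being served" scheme can be sketched with ordinary threads. This is a user-space toy, not the kernel implementation: a small threading.Lock stands in for the atomic fetch-and-add that draws a ticket, and time.sleep(0) stands in for a CPU relax instruction inside the spin loop. The class and function names are made up for illustration.

```python
import threading
import time


class TicketLock:
    def __init__(self):
        self._next_ticket = 0
        self._now_serving = 0
        self._draw = threading.Lock()  # stand-in for an atomic fetch-and-add

    def acquire(self):
        with self._draw:
            my_ticket = self._next_ticket
            self._next_ticket += 1
        # Every waiter spins on the same shared counter.
        while self._now_serving != my_ticket:
            time.sleep(0)  # yield so other threads can make progress
        return my_ticket

    def release(self):
        # Only the current holder writes this, handing over to the next ticket.
        self._now_serving += 1


def run_workers(n_threads=3, iterations=100):
    """Increment a shared counter under the lock; return (counter, service order)."""
    lock = TicketLock()
    state = {"counter": 0}
    order = []

    def worker():
        for _ in range(iterations):
            ticket = lock.acquire()
            order.append(ticket)
            state["counter"] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"], order


if __name__ == "__main__":
    total, order = run_workers()
    print(total, order == sorted(order))
```

Tickets are served strictly in the order they were drawn, which is the fairness property; all waiters still poll the same _now_serving variable, which is the shared-memory contention that queued spinlocks avoid.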