Debugging a segfault with addr2line

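When a page fault occurs, addr2line cannot always locate the offending code: the segfault may come from accessing an array's memory, in which case the address that page-faulted is a data address, not an address in the text segment.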

According to "c++ - Determine the line of code that causes a segmentation fault? - Stack Overflow", we can use AddressSanitizer instead:

AddressSanitizer — Clang 19.0.0git documentation

AddressSanitizer · google/sanitizers Wiki

AddressSanitizer is a part of LLVM starting with version 3.1 and a part of GCC starting with version 4.8.
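
A minimal sketch of the kind of bug ASan is good at catching. The helpers `sum_first()` and `sum_ones()` are hypothetical and not taken from QEMU. If `sum_ones()` is called with `count` larger than `len`, the read runs past the end of a heap buffer, so the faulting address (if it faults at all) is a data address. Built with `-fsanitize=address -g`, the bad access is reported with a stack trace instead.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: sums the first `count` ints of `values`.
 * There is deliberately no bounds check; the caller must guarantee
 * that `values` holds at least `count` elements. */
long sum_first(const int *values, size_t count)
{
    long total = 0;

    for (size_t i = 0; i < count; i++) {
        total += values[i]; /* out-of-bounds read if count is too large */
    }
    return total;
}

/* Allocates `len` ints set to 1, sums `count` of them, frees the buffer.
 * Calling this with count > len is the bug ASan would report as a
 * heap-buffer-overflow. Returns -1 if the allocation fails. */
long sum_ones(size_t len, size_t count)
{
    int *values = malloc(len * sizeof(*values));
    long total;

    if (!values) {
        return -1;
    }
    for (size_t i = 0; i < len; i++) {
        values[i] = 1;
    }
    total = sum_first(values, count);
    free(values);
    return total;
}
```

Only calls with `count <= len` are well defined; anything else is exactly the case one would hand to an ASan build.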

For example, to debug QEMU we can add this flag: --enable-sanitizers

# install dependency
yum install libasan
./configure --target-list=x86_64-softmmu --enable-kvm --enable-debug --disable-strip --enable-sanitizers --extra-cflags="-lunwind"

Build QEMU using Clang

./configure --target-list=x86_64-softmmu --enable-kvm  --disable-werror --enable-debug --disable-strip --cc=clang --cxx=clang++

[QUESTION] Clang as QEMU compiler - two questions

Shadow RAM

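On a PC with 32 address lines (a 32-bit system), the BIOS firmware is mapped at 0x00000000fffc0000 - 0x00000000ffffffff (prio 0, rom): pc.bios, i.e. the last 256KB of the 4G address space. To stay backward compatible with 16-bit systems, and to speed up execution, the machine copies the BIOS from ROM into the BIOS ROM area of RAM at boot. So once the first jump instruction after power-on, ljmpw, has executed, the BIOS firmware code is already present in RAM, which lets 16-bit code run on a 32-bit machine.

Copying the ROM from the high address range down to the low address range is known as the Shadow RAM technique. Afterwards this memory is write-protected (what the 16-bit system sees is a ROM region, but since it actually lives in RAM it has to be marked read-only). SeaBIOS, however, needs this memory to be writable during initialization so that it can update some statically allocated global variables; it does this through make_bios_writable.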

PAM in QEMU | What is the Utopian World!

PAM in QEMU

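In other words, the PAM registers can redirect reads and writes that target the BIOS ROM to RAM. This speeds up BIOS execution on bare metal, because executing directly from ROM is very slow.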

PAM in QEMU | What is the Utopian World!

i440FX / PIIX

The Intel 440FX PMC (PCI and Memory Controller) is a northbridge chip; PIIX (PCI ISA IDE Xcelerator) is a southbridge chip.

In QEMU, i440FX is also a PCI device:

static const TypeInfo i440fx_info = {
    .name          = TYPE_I440FX_PCI_DEVICE,
    // As we can see, this is a PCI device
    .parent        = TYPE_PCI_DEVICE,
    .instance_size = sizeof(PCII440FXState),
    .class_init    = i440fx_class_init,
    .interfaces = (InterfaceInfo[]) {
        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
        { },
    },
};

struct PCII440FXState {
    /*< private >*/
    PCIDevice parent_obj;
    /*< public >*/

    // #define PAM_REGIONS_COUNT       13
    /*
     * SMRAM memory area and PAM memory area in Legacy address range for PC.
     * PAM: Programmable Attribute Map registers
     *
     * 0xa0000 - 0xbffff compatible SMRAM
     *
     * 0xc0000 - 0xc3fff Expansion area memory segments
     * 0xc4000 - 0xc7fff
     * 0xc8000 - 0xcbfff
     * 0xcc000 - 0xcffff
     * 0xd0000 - 0xd3fff
     * 0xd4000 - 0xd7fff
     * 0xd8000 - 0xdbfff
     * 0xdc000 - 0xdffff
     * 0xe0000 - 0xe3fff Extended System BIOS Area Memory Segments
     * 0xe4000 - 0xe7fff
     * 0xe8000 - 0xebfff
     * 0xec000 - 0xeffff
     *
     * 0xf0000 - 0xfffff System BIOS Area Memory Segments
     */
    PAMMemoryRegion pam_regions[PAM_REGIONS_COUNT];
    MemoryRegion smram_region;
    MemoryRegion smram, low_smram;
};
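
A rough sketch of how the 13 PAM regions listed in the comment could be looked up by address. The region ordering (index 0 for the system BIOS area, 1..12 for the 16K segments starting at 0xc0000) and the attribute encoding (bit 0 = read enable, bit 1 = write enable) are assumptions based on my reading of the i440FX datasheet and QEMU, and the helper names are made up for illustration.

```c
#include <stdint.h>

#define PAM_EXPAN_BASE 0xc0000u /* first 16K expansion segment */
#define PAM_EXPAN_SIZE 0x04000u /* 16K per expansion/extended BIOS segment */
#define PAM_BIOS_BASE  0xf0000u /* 64K system BIOS area */
#define PAM_AREA_END   0xfffffu /* last byte covered by PAM */

/* Assumed meaning of the two attribute bits of one PAM region. */
enum pam_mode {
    PAM_MODE_PCI       = 0, /* reads and writes both go to PCI (the ROM) */
    PAM_MODE_READ_RAM  = 1, /* reads from RAM, writes to PCI: shadowed, read-only */
    PAM_MODE_WRITE_RAM = 2, /* reads from PCI, writes to RAM: used while copying */
    PAM_MODE_RAM       = 3  /* reads and writes both go to RAM */
};

/* Returns the assumed pam_regions[] index for a legacy address,
 * or -1 if the address is outside 0xc0000 - 0xfffff. */
int pam_region_index(uint32_t addr)
{
    if (addr < PAM_EXPAN_BASE || addr > PAM_AREA_END) {
        return -1;
    }
    if (addr >= PAM_BIOS_BASE) {
        return 0;
    }
    return 1 + (int)((addr - PAM_EXPAN_BASE) / PAM_EXPAN_SIZE);
}

/* Extracts the mode from the low two bits of a region's attribute field. */
enum pam_mode pam_mode_from_attr(uint8_t attr)
{
    return (enum pam_mode)(attr & 0x3);
}
```

Under these assumptions, making the BIOS area writable corresponds to moving region 0 from PAM_MODE_READ_RAM to PAM_MODE_RAM.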

Hello, when I list the PCI devices I get one, for which I can't find any driver. It is Vendor ID 0x8086 and Device ID 0x1237. pcidatabase.com says, it is "PCI & Memory", and I found it is "Intel 82440FX". After reading the Intel manual, I still don't know what I can do with it. Can this device be configured, or can I get information about the system from it? I have a PCI driver that uses it, but I don't know how to communicate with it…..

Can I ask it for example, how much memory the system has? I heard some OSes don't use the BIOS to detect the memory, and since manual probing is not the bes methode. Do they use the 82440FX to get the memory map?

This page explains what the device is:

OSDev.org • View topic - Intel 82440FX
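
To make "how to communicate with it" a bit more concrete: on a PC the device is reached through PCI configuration space. The sketch below only builds the address word for the legacy configuration mechanism (the value written to the index port at 0xcf8 before reading the data port at 0xcfc) and splits the first config dword into vendor and device ID. The function names are made up, and the port I/O itself is left out because it needs ring 0 or ioperm().

```c
#include <stdint.h>

/* Builds the configuration address for bus/device/function/register:
 * bit 31 enable, bits 23:16 bus, 15:11 device, 10:8 function,
 * bits 7:2 dword-aligned register offset. */
uint32_t pci_conf1_address(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg)
{
    return 0x80000000u |
           ((uint32_t)bus << 16) |
           ((uint32_t)(dev & 0x1f) << 11) |
           ((uint32_t)(fn & 0x07) << 8) |
           (uint32_t)(reg & 0xfc);
}

/* The first config dword holds the vendor ID in its low 16 bits... */
uint16_t pci_vendor_id(uint32_t dword0)
{
    return (uint16_t)(dword0 & 0xffff);
}

/* ...and the device ID in its high 16 bits. */
uint16_t pci_device_id(uint32_t dword0)
{
    return (uint16_t)(dword0 >> 16);
}
```

For the device in the question, a dword of 0x12378086 decodes to vendor 0x8086 and device 0x1237.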

i440FX can also be a PCI host bridge:

static const TypeInfo i440fx_pcihost_info = {
    .name          = TYPE_I440FX_PCI_HOST_BRIDGE,
    .parent        = TYPE_PCI_HOST_BRIDGE,
    .instance_size = sizeof(I440FXState),
    .instance_init = i440fx_pcihost_initfn,
    .class_init    = i440fx_pcihost_class_init,
};

static void i440fx_register_types(void)
{
    type_register_static(&i440fx_info);
    type_register_static(&i440fx_pcihost_info);
}
type_init(i440fx_register_types)

【虚拟化qemu】Q35 与 I440FX - 知乎

Run a single QEMU test

pip3 install meson
cd build
meson test qtest-x86_64/qos-test

Grep: /etc/sysconfig/network-scripts/ifcfg-*: No such file or directory

Just create the file directly:

ifconfig - CentOS ifcfg-eth0 config file deleted. Utility to recreate it? - Server Fault

DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes

then:

nmcli connection reload

Enlightened VMCS

The enlightenment is nested specific, it targets Hyper-V on KVM guests.

Hyper-V Enlightenments — QEMU documentation

"Curl Error (60): SSL peer certificate or SSH remote key was not OK" when yum install

If the error above shows up when installing a package with yum:

vim /etc/yum.conf
# Then add sslverify=False

Then run yum install again. If it reports "Waiting for process with pid to finish.", kill that process with kill -9.

VMsucceed, VMfail, VMfailInvalid, VMfailValid

These are all conventions used in the SDM, a kind of shorthand.

The operation sections also use the pseudo-functions VMsucceed, VMfail, VMfailInvalid, and VMfailValid. These pseudo-functions signal instruction success or failure by setting or clearing bits in RFLAGS and, in some cases, by writing the VM-instruction error field.

VMsucceed: CF := 0; PF := 0; AF := 0; ZF := 0; SF := 0; OF := 0;

VMfail(ErrorNumber):

IF VMCS pointer is valid
    THEN VMfailValid(ErrorNumber);
    ELSE VMfailInvalid;
FI;

VMfailInvalid: CF := 1; PF := 0; AF := 0; ZF := 0; SF := 0; OF := 0;

VMfailValid(ErrorNumber): // executed only if there is a current VMCS

CF := 0; PF := 0; AF := 0; ZF := 1; SF := 0; OF := 0;
Set the VM-instruction error field to ErrorNumber;
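
Read the other way round, the conventions above tell software how to interpret RFLAGS after a VMX instruction. A small sketch, with made-up names, assuming the standard bit positions CF = bit 0 and ZF = bit 6:

```c
#include <stdint.h>

#define RFLAGS_CF (1ull << 0)
#define RFLAGS_ZF (1ull << 6)

enum vmx_result {
    VMX_SUCCEED,      /* CF = 0 and ZF = 0 */
    VMX_FAIL_INVALID, /* CF = 1: no current VMCS, no error number available */
    VMX_FAIL_VALID    /* ZF = 1: error number is in the VM-instruction error field */
};

/* Classifies the outcome of a VMX instruction from the RFLAGS value
 * observed right after it executed. */
enum vmx_result vmx_classify(uint64_t rflags)
{
    if (rflags & RFLAGS_CF) {
        return VMX_FAIL_INVALID;
    }
    if (rflags & RFLAGS_ZF) {
        return VMX_FAIL_VALID;
    }
    return VMX_SUCCEED;
}
```

Only in the VMX_FAIL_VALID case does it make sense to read the VM-instruction error field afterwards.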

X86_EFLAGS_ZF

ZF = 1 when the result of an operation is zero; otherwise ZF = 0.

IA32_VMX_VMCS_ENUM

Prerequisite: VMCS Encoding^.

This is an MSR related to VMX.

The IA32_VMX_VMCS_ENUM MSR (index 48AH) provides information to assist software in enumerating fields in the VMCS.

Each field in the VMCS is associated with a 32-bit encoding which is structured as follows:

  • Bits 31:15 are reserved (must be 0).
  • Bits 14:13 indicate the field’s width.
    • 0: 16-bit
    • 1: 64-bit
    • 2: 32-bit
    • 3: natural-width
  • Bit 12 is reserved (must be 0).
  • Bits 11:10 indicate the field’s type.
    • 0: control
    • 1: VM-exit information
    • 2: guest state
    • 3: host state
  • Bits 9:1 is an index field that distinguishes different fields with the same width and type.
  • Bit 0 indicates access type.
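
The bit layout above can be turned into a small decoder. This is only a sketch with made-up names. The reading of bit 0 as 0 = full access, 1 = high 32 bits of a 64-bit field is from my memory of the SDM and is not stated in the list above.

```c
#include <stdint.h>

struct vmcs_field {
    unsigned width;       /* bits 14:13: 0=16-bit, 1=64-bit, 2=32-bit, 3=natural */
    unsigned type;        /* bits 11:10: 0=control, 1=exit info, 2=guest, 3=host */
    unsigned index;       /* bits 9:1 */
    unsigned access_high; /* bit 0 (assumed: 0 = full, 1 = high) */
};

/* Splits a 32-bit VMCS field encoding into its components. */
struct vmcs_field vmcs_decode(uint32_t enc)
{
    struct vmcs_field f;

    f.access_high = enc & 0x1;
    f.index = (enc >> 1) & 0x1ff;
    f.type = (enc >> 10) & 0x3;
    f.width = (enc >> 13) & 0x3;
    return f;
}

/* Returns nonzero if the reserved bits (31:15 and 12) are all zero. */
int vmcs_encoding_valid(uint32_t enc)
{
    return (enc & 0xffff9000u) == 0;
}
```

For example, 0x681e (the guest RIP field) decodes to natural width, guest state, index 15, full access.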

This MSR corresponds to the index field above (bits 9:1): it reports the highest index used by any VMCS field, across all widths and types. The KVM code below confirms this:

static u64 nested_vmx_calc_vmcs_enum_msr(void)
{
	unsigned int max_idx, idx;
	int i;

	max_idx = 0;
	for (i = 0; i < nr_vmcs12_fields; i++) {
		/* The vmcs12 table is very, very sparsely populated. */
		if (!vmcs12_field_offsets[i])
			continue;

		idx = vmcs_field_index(VMCS12_IDX_TO_ENC(i));
        // Keep the largest index seen across all fields.
		if (idx > max_idx)
			max_idx = idx;
	}

	return (u64)max_idx << VMCS_FIELD_INDEX_SHIFT;
}

split_huge_page_to_list() Kernel

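Suppose the huge page is a 2M page made up of 512 4K pages. It is split only when its refcount is 513, i.e. only when nobody holds a reference on any subpage other than the one we are accessing. Because there is no way to keep a separate refcount for each subpage of a huge page, once more than one subpage has been get'ed we cannot tell which subpage a given reference belongs to. See can_split_folio()^.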
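
A simplified model of that check, to make the arithmetic explicit. It is not the kernel code: it ignores the swap cache case, and the function name and parameters are made up. The assumption is that a page-cache folio carries one reference per subpage (the "extra pins"), the caller holds one pin, and every remaining reference must be accounted for by a mapping.

```c
#include <stdbool.h>

/* Returns true if the only references on the folio are the expected ones:
 * one per mapping, `*extra_pins` from the page cache, and the caller's pin. */
bool can_split_model(bool is_anon, int nr_pages, int refcount, int mapcount,
                     int *extra_pins)
{
    /* Assumed: anonymous folios (not in swap cache) have no extra pins,
     * page-cache folios have one per subpage. */
    *extra_pins = is_anon ? 0 : nr_pages;
    return mapcount == refcount - *extra_pins - 1;
}
```

With 512 subpages in the page cache and no mappings, the only refcount that passes is 512 + 1 = 513, which matches the number above; one more reference from anywhere makes the check fail.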

/*
 * This function splits huge page into normal pages. @page can point to any
 * subpage of huge page to split. Split doesn't change the position of @page.
 *
 * Only caller must hold pin on the @page, otherwise split fails with -EBUSY.
 * The huge page must be locked.
 *
 * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
 *
 * Both head page and tail pages will inherit mapping, flags, and so on from
 * the hugepage.
 *
 * GUP pin and PG_locked transferred to @page. Rest subpages can be freed if
 * they are not mapped.
 *
 * Returns 0 if the hugepage is split successfully.
 * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under
 * us.
 */
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
	struct folio *folio = page_folio(page);
	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
	struct anon_vma *anon_vma = NULL;
	struct address_space *mapping = NULL;
	int extra_pins, ret;
	pgoff_t end;
	bool is_hzp;

	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);

	is_hzp = is_huge_zero_page(&folio->page);
	if (is_hzp) {
		pr_warn_ratelimited("Called split_huge_page for huge zero page\n");
		return -EBUSY;
	}

	if (folio_test_writeback(folio))
		return -EBUSY;

	if (folio_test_anon(folio)) {
		/*
		 * The caller does not necessarily hold an mmap_lock that would
		 * prevent the anon_vma disappearing so we first we take a
		 * reference to it and then lock the anon_vma for write. This
		 * is similar to folio_lock_anon_vma_read except the write lock
		 * is taken to serialise against parallel split or collapse
		 * operations.
		 */
		anon_vma = folio_get_anon_vma(folio);
		if (!anon_vma) {
			ret = -EBUSY;
			goto out;
		}
		end = -1;
		mapping = NULL;
		anon_vma_lock_write(anon_vma);
	} else {
		gfp_t gfp;

		mapping = folio->mapping;

		/* Truncated ? */
		if (!mapping) {
			ret = -EBUSY;
			goto out;
		}

		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
							GFP_RECLAIM_MASK);

		if (!filemap_release_folio(folio, gfp)) {
			ret = -EBUSY;
			goto out;
		}

		xas_split_alloc(&xas, folio, folio_order(folio), gfp);
		if (xas_error(&xas)) {
			ret = xas_error(&xas);
			goto out;
		}

		anon_vma = NULL;
		i_mmap_lock_read(mapping);

		/*
		 *__split_huge_page() may need to trim off pages beyond EOF:
		 * but on 32-bit, i_size_read() takes an irq-unsafe seqlock,
		 * which cannot be nested inside the page tree lock. So note
		 * end now: i_size itself may be changed at any moment, but
		 * folio lock is good enough to serialize the trimming.
		 */
		end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
		if (shmem_mapping(mapping))
			end = shmem_fallocend(mapping->host, end);
	}

	/*
	 * Racy check if we can split the page, before unmap_folio() will
	 * split PMDs
	 */
	if (!can_split_folio(folio, &extra_pins)) {
		ret = -EAGAIN;
		goto out_unlock;
	}

	unmap_folio(folio);

	/* block interrupt reentry in xa_lock and spinlock */
	local_irq_disable();
	if (mapping) {
		/*
		 * Check if the folio is present in page cache.
		 * We assume all tail are present too, if folio is there.
		 */
		xas_lock(&xas);
		xas_reset(&xas);
		if (xas_load(&xas) != folio)
			goto fail;
	}

	/* Prevent deferred_split_scan() touching ->_refcount */
	spin_lock(&ds_queue->split_queue_lock);
	if (folio_ref_freeze(folio, 1 + extra_pins)) {
		if (!list_empty(&folio->_deferred_list)) {
			ds_queue->split_queue_len--;
			list_del(&folio->_deferred_list);
		}
		spin_unlock(&ds_queue->split_queue_lock);
		if (mapping) {
			int nr = folio_nr_pages(folio);

			xas_split(&xas, folio, folio_order(folio));
			if (folio_test_pmd_mappable(folio)) {
				if (folio_test_swapbacked(folio)) {
					__lruvec_stat_mod_folio(folio,
							NR_SHMEM_THPS, -nr);
				} else {
					__lruvec_stat_mod_folio(folio,
							NR_FILE_THPS, -nr);
					filemap_nr_thps_dec(mapping);
				}
			}
		}

		__split_huge_page(page, list, end);
		ret = 0;
	} else {
		spin_unlock(&ds_queue->split_queue_lock);
fail:
		if (mapping)
			xas_unlock(&xas);
		local_irq_enable();
		remap_page(folio, folio_nr_pages(folio));
		ret = -EAGAIN;
	}

out_unlock:
	if (anon_vma) {
		anon_vma_unlock_write(anon_vma);
		put_anon_vma(anon_vma);
	}
	if (mapping)
		i_mmap_unlock_read(mapping);
out:
	xas_destroy(&xas);
	count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
	return ret;
}

filemap_add_folio() / __filemap_add_folio() Kernel

int filemap_add_folio(struct address_space *mapping, struct folio *folio,
				pgoff_t index, gfp_t gfp)
{
    //...
	ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
    //...
    /*
     * The folio might have been evicted from cache only
     * recently, in which case it should be activated like
     * any other repeatedly accessed folio.
     * The exception is folios getting rewritten; evicting other
     * data from the working set, only to cache data that will
     * get overwritten with something else, is a waste of memory.
     */
    WARN_ON_ONCE(folio_test_active(folio));
    if (!(gfp & __GFP_WRITE) && shadow)
        workingset_refault(folio, shadow);
    folio_add_lru(folio);
    //...
}

DMAR

DMA remapping.

Userspace simply read memory range without knowing its content


qemu_sem_timedwait() QEMU


truncate_inode_pages_range() Kernel

/**
 * truncate_inode_pages_range - truncate range of pages specified by start & end byte offsets
 * @mapping: mapping to truncate
 * @lstart: offset from which to truncate
 * @lend: offset to which to truncate (inclusive)
 *
 * Truncate the page cache, removing the pages that are between
 * specified offsets (and zeroing out partial pages
 * if lstart or lend + 1 is not page aligned).
 *
 * Truncate takes two passes - the first pass is nonblocking.  It will not
 * block on page locks and it will not block on writeback.  The second pass
 * will wait.  This is to prevent as much IO as possible in the affected region.
 * The first pass will remove most pages, so the search cost of the second pass
 * is low.
 *
 * We pass down the cache-hot hint to the page freeing code.  Even if the
 * mapping is large, it is probably the case that the final pages are the most
 * recently touched, and freeing happens in ascending file offset order.
 *
 * Note that since ->invalidate_folio() accepts range to invalidate
 * truncate_inode_pages_range is able to handle cases where lend + 1 is not
 * page aligned properly.
 */
// Both lstart and lend are inclusive when calling this function, as in this usage:
// truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1)
void truncate_inode_pages_range(struct address_space *mapping, loff_t lstart, loff_t lend)
{
	pgoff_t		start;		/* inclusive */
	pgoff_t		end;		/* exclusive */
	struct folio_batch fbatch;
	pgoff_t		indices[PAGEVEC_SIZE];
	pgoff_t		index;
	int		i;
	struct folio	*folio;
	bool		same_folio;

    //...
	/*
	 * 'start' and 'end' always covers the range of pages to be fully
	 * truncated. Partial pages are covered with 'partial_start' at the
	 * start of the range and 'partial_end' at the end of the range.
	 * Note that 'end' is exclusive while 'lend' is inclusive.
	 */
    // If lstart is not page-aligned, round it up to the start of the next page.
	start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
	if (lend == -1)
		/*
		 * lend == -1 indicates end-of-file so we have to set 'end'
		 * to the highest possible pgoff_t and since the type is
		 * unsigned we're using -1.
		 */
		end = -1;
	else
        // If lend + 1 is not page-aligned, round it down to a page boundary.
		end = (lend + 1) >> PAGE_SHIFT;

    // Aligning start and end this way guarantees that we only truncate whole pages contained in [lstart, lend],
    // i.e. the page range [start, end) lies within the byte range [lstart, lend].
	folio_batch_init(&fbatch);
	index = start;
    // find_lock_entries() does not return folios that are partially outside the range,
    // so a large folio (e.g. a 2M folio made of 512 4K pages) that is only partly inside the range is not handled by this loop.
	while (index < end && find_lock_entries(mapping, &index, end - 1, &fbatch, indices)) {
		truncate_folio_batch_exceptionals(mapping, &fbatch, indices);
		for (i = 0; i < folio_batch_count(&fbatch); i++)
			truncate_cleanup_folio(fbatch.folios[i]);
		delete_from_page_cache_batch(mapping, &fbatch);
		for (i = 0; i < folio_batch_count(&fbatch); i++)
			folio_unlock(fbatch.folios[i]);
		folio_batch_release(&fbatch);
		cond_resched();
	}

	same_folio = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT);
	folio = __filemap_get_folio(mapping, lstart >> PAGE_SHIFT, FGP_LOCK, 0);
	if (!IS_ERR(folio)) {
		same_folio = lend < folio_pos(folio) + folio_size(folio);
        // May split the 2M folio into 4K pages.
		if (!truncate_inode_partial_folio(folio, lstart, lend)) {
			start = folio_next_index(folio);
			if (same_folio)
				end = folio->index;
		}
		folio_unlock(folio);
		folio_put(folio);
		folio = NULL;
	}

	if (!same_folio) {
		folio = __filemap_get_folio(mapping, lend >> PAGE_SHIFT,
						FGP_LOCK, 0);
		if (!IS_ERR(folio)) {
			if (!truncate_inode_partial_folio(folio, lstart, lend))
				end = folio->index;
			folio_unlock(folio);
			folio_put(folio);
		}
	}

	index = start;
	while (index < end) {
		cond_resched();
		if (!find_get_entries(mapping, &index, end - 1, &fbatch,
				indices)) {
			/* If all gone from start onwards, we're done */
			if (index == start)
				break;
			/* Otherwise restart to make sure all gone */
			index = start;
			continue;
		}

		for (i = 0; i < folio_batch_count(&fbatch); i++) {
			struct folio *folio = fbatch.folios[i];

			/* We rely upon deletion not changing page->index */

			if (xa_is_value(folio))
				continue;

			folio_lock(folio);
			VM_BUG_ON_FOLIO(!folio_contains(folio, indices[i]), folio);
			folio_wait_writeback(folio);
			truncate_inode_folio(mapping, folio);
			folio_unlock(folio);
		}
		truncate_folio_batch_exceptionals(mapping, &fbatch, indices);
		folio_batch_release(&fbatch);
	}
}
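
The start/end rounding at the top of the function is easy to get wrong by one, so here is the same arithmetic pulled out into a standalone sketch. It assumes 4K pages (PAGE_SHIFT = 12); the struct and function names are made up.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ull << PAGE_SHIFT)

struct page_range {
    uint64_t start; /* first fully covered page index, inclusive */
    uint64_t end;   /* one past the last fully covered page index, exclusive */
};

/* Computes which whole pages are covered by the inclusive byte range
 * [lstart, lend]; lend == -1 means "to end of file". If start >= end,
 * no page is fully covered and only partial pages need zeroing. */
struct page_range full_pages(int64_t lstart, int64_t lend)
{
    struct page_range r;

    r.start = ((uint64_t)lstart + PAGE_SIZE - 1) >> PAGE_SHIFT; /* round up */
    if (lend == -1) {
        r.end = UINT64_MAX;
    } else {
        r.end = ((uint64_t)lend + 1) >> PAGE_SHIFT;             /* round down */
    }
    return r;
}
```

For instance, the byte range [100, 8191] fully covers only page 1, while [100, 200] covers no whole page at all (start = 1, end = 0).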

CR0 / CR4

Volume 3 (3A, 3B, 3C, & 3D): System Programming Guide
CHAPTER 2 SYSTEM ARCHITECTURE OVERVIEW
2.5 Control Registers
  • CR0 guest/host mask
  • CR4 guest/host mask
  • CR0 read shadow
  • CR4 read shadow

A guest access to a control register may cause a VM exit, and it has its own dedicated exit reason number: 28.

See:

APPENDIX C VMX BASIC EXIT REASONS
Table C-1. Basic Exit Reasons

The MOV to CR4 instruction causes a VM exit unless the value of its source operand matches, for the position of each bit set in the CR4 guest/host mask, the corresponding bit in the CR4 read shadow.
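
The rule quoted above boils down to one XOR and one AND. A sketch with made-up function names; the second helper (what the guest sees when it reads the register) is not in the quote, it is my understanding of how the read shadow is used for host-owned bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* MOV to CR0/CR4 exits if the new value differs from the read shadow
 * in any bit position that is set in the guest/host mask. */
bool mov_to_cr_exits(uint64_t new_val, uint64_t mask, uint64_t shadow)
{
    return ((new_val ^ shadow) & mask) != 0;
}

/* Assumed read behavior: host-owned bits (set in the mask) come from the
 * read shadow, guest-owned bits come from the real register value. */
uint64_t guest_visible_cr(uint64_t real, uint64_t mask, uint64_t shadow)
{
    return (shadow & mask) | (real & ~mask);
}
```

With only CR4.VMXE (bit 13) in the mask and a shadow of 0, a guest write that sets VMXE exits, while a write that leaves it clear does not.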

The code in KVM:

[EXIT_REASON_CR_ACCESS]               = handle_cr,
    handle_set_cr4
        kvm_set_cr4

Yum self-signed certificate in certificate chain

vim /etc/yum.conf

Add sslverify=false

Update ld

http://ftp.gnu.org/gnu/binutils

cd /root
LD_VERSION=2.42
wget https://ftp.gnu.org/gnu/binutils/binutils-$LD_VERSION.tar.xz
tar -xf binutils-$LD_VERSION.tar.xz
cd binutils-$LD_VERSION
./configure --prefix=/root/binutils-$LD_VERSION/build
make -j64
make tooldir=/usr install
# one time use
# echo "export PATH=/root/binutils-$LD_VERSION/build/bin:$PATH" >> /etc/profile.d/localld.sh
# source /etc/profile.d/localld.sh
ld -v

Centos 下 ld 链接器版本更新_手动升级 gnu ld-CSDN博客

qemu_put_buffer() / qemu_put_buffer_async() QEMU

The former copies the data first and then sends it; the latter just records the pointer and sends the data a bit later.

void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, size_t size,
                           bool may_free)
{
    if (f->last_error) {
        return;
    }

    add_to_iovec(f, buf, size, may_free);
}

void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, size_t size)
{
    size_t l;

    if (f->last_error) {
        return;
    }

    while (size > 0) {
        l = IO_BUF_SIZE - f->buf_index;
        if (l > size) {
            l = size;
        }
        memcpy(f->buf + f->buf_index, buf, l);
        add_buf_to_iovec(f, l);
        if (qemu_file_get_error(f)) {
            break;
        }
        buf += l;
        size -= l;
    }
}
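
The practical consequence of "copy now" versus "keep the pointer" is who owns the data until the flush. The toy model below is not QEMU code: all `model_*` names, the buffer sizes and the flush function are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_BUF_SIZE 64
#define MODEL_MAX_IOV  8

struct model_iov {
    const uint8_t *base;
    size_t len;
};

struct model_file {
    uint8_t buf[MODEL_BUF_SIZE];        /* internal staging buffer */
    size_t buf_index;                   /* bytes of buf already used */
    struct model_iov iov[MODEL_MAX_IOV];
    size_t iovcnt;
};

/* Copying variant: data is copied into the internal buffer immediately,
 * so the caller may reuse `data` right after the call. */
int model_put_buffer(struct model_file *f, const uint8_t *data, size_t size)
{
    if (size > MODEL_BUF_SIZE - f->buf_index || f->iovcnt == MODEL_MAX_IOV) {
        return -1;
    }
    memcpy(f->buf + f->buf_index, data, size);
    f->iov[f->iovcnt].base = f->buf + f->buf_index;
    f->iov[f->iovcnt].len = size;
    f->iovcnt++;
    f->buf_index += size;
    return 0;
}

/* Deferred variant: only the pointer is queued, so `data` must stay
 * valid and unchanged until the flush. */
int model_put_buffer_async(struct model_file *f, const uint8_t *data, size_t size)
{
    if (f->iovcnt == MODEL_MAX_IOV) {
        return -1;
    }
    f->iov[f->iovcnt].base = data;
    f->iov[f->iovcnt].len = size;
    f->iovcnt++;
    return 0;
}

/* Gathers all queued segments into `out`; returns the bytes written. */
size_t model_flush(struct model_file *f, uint8_t *out, size_t out_size)
{
    size_t written = 0;

    for (size_t i = 0; i < f->iovcnt; i++) {
        if (f->iov[i].len > out_size - written) {
            break;
        }
        memcpy(out + written, f->iov[i].base, f->iov[i].len);
        written += f->iov[i].len;
    }
    f->iovcnt = 0;
    f->buf_index = 0;
    return written;
}
```

If the caller modifies its buffer between the put and the flush, the copying variant still sends the old bytes while the deferred variant sends the modified ones.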

KVM x86 upstream workflow

KVM x86 — The Linux Kernel documentation

Why are ram-below-4g and ram-above-4g each 2G?

Historically, 32-bit architectures were prevalent, and usable RAM below 4GB was generally limited due to addressing limitations (around 3.5GB). Setting ram-below-4g to 2GB in QEMU provided a reasonable default that worked well for emulating 32-bit guests and maintaining compatibility with older hardware configurations.
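
For the q35 machine the split looks more mechanical than historical. The sketch below reproduces the rule as I remember it from hw/i386/pc_q35.c, so treat the two constants as an assumption: they may differ between QEMU versions and can be changed with the max-ram-below-4g machine option. The struct and function names are made up.

```c
#include <stdint.h>

struct ram_split {
    uint64_t below_4g;
    uint64_t above_4g;
};

/* Assumed q35 rule: the low-memory limit is 0xb0000000 (just below the
 * MMCFG window), but drops to 2G when RAM does not fit below that limit,
 * so that the split falls on a 1G boundary. */
struct ram_split q35_split_ram(uint64_t ram_size)
{
    uint64_t lowmem = (ram_size >= 0xb0000000ull) ? 0x80000000ull
                                                  : 0xb0000000ull;
    struct ram_split s;

    if (ram_size >= lowmem) {
        s.below_4g = lowmem;
        s.above_4g = ram_size - lowmem;
    } else {
        s.below_4g = ram_size;
        s.above_4g = 0;
    }
    return s;
}
```

Under this rule a 4G guest ends up with 2G below and 2G above the 4G boundary, which is what the info mtree output shows.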

connect() And accept() syscall

The OS takes care of the TCP handshake, when the handshake is finished, connect() returns. (that is, connect() does not block until the other side calls accept())

A successful TCP handshake will be queued to the server application, and can be accept()'ed any time later.

Reason: Just want to add that connect() just wait for the handshake and not for the server to call accept().

connect() blocks until finishing TCP 3-way handshake. Handshake on listening side is handled by TCP/IP stack in kernel and finished without notifying user process. Only after handshake is completed (and initiator can return from connect() call already), accept() in user process can pick up new socket and return. No waiting accept() needed for completing handshake.

The reason is simple: if you have single thread listening for connections and require waiting accept() for establishing connections, you can't respond to TCP SYN's while processing another request. TCP stack on initiating side will retransmit, but on moderately loaded server chances are high this retransmitted packet still will arrive while no accept() pending and will be dropped again, resulting in ugly delays and connection timeouts.

c - Does connect() block for TCP socket? - Stack Overflow

Then why is connect() still called a blocking syscall?

It could take many (hundreds) of milliseconds to complete the handshake over a network. There's literally millions of things you could do in that time instead of waiting (blocking) for the handshake to complete.
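
The claim that connect() only waits for the handshake, not for accept(), can be checked in a single thread over loopback. A sketch assuming a Linux-like socket API; the function name is made up. If connect() really needed a pending accept(), this function would hang at the connect() call.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 if connect() completed before accept() was ever called and
 * accept() then picked up the queued connection; -1 on any failure. */
int connect_before_accept(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int client = -1, conn = -1, result = -1;
    int listener = socket(AF_INET, SOCK_STREAM, 0);

    if (listener < 0) {
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listener, 1) < 0 ||
        getsockname(listener, (struct sockaddr *)&addr, &len) < 0) {
        goto out;
    }

    client = socket(AF_INET, SOCK_STREAM, 0);
    if (client < 0) {
        goto out;
    }

    /* Nobody has called accept() yet: the kernel completes the handshake. */
    if (connect(client, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        goto out;
    }

    /* The established connection was queued; accept() just dequeues it. */
    conn = accept(listener, NULL, NULL);
    if (conn < 0) {
        goto out;
    }
    result = 0;

out:
    if (conn >= 0) {
        close(conn);
    }
    if (client >= 0) {
        close(client);
    }
    close(listener);
    return result;
}
```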

QEMU main event loop

QEMU's main event loop is main_loop_wait().

How to interpret info mtree QEMU

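Broadly, the output has two parts: all the AddressSpaces are printed first, then all the MemoryRegions.

address-space: memory is the important one; it shows the memory ranges that can be seen.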

// These two ranges are only 2G each, so why are they called ram-below-4g and ram-above-4g?
0000000000000000-000000007fffffff (prio 0, ram): alias ram-below-4g @ram1 0000000000000000-000000007fffffff
0000000100000000-000000017fffffff (prio 0, ram): alias ram-above-4g @ram1 0000000080000000-00000000ffffffff
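
Each alias line reads as "guest-physical range : alias NAME @target target-range". A sketch that applies the two alias lines above by hand, translating a guest-physical address into an offset inside ram1. The function name is made up, the constants are copied from this particular output, and overlapping higher-priority regions (such as the pam-pci aliases) are ignored.

```c
#include <stdint.h>

/* Returns the offset into the ram1 region that backs guest-physical
 * address `gpa`, or UINT64_MAX if neither RAM alias covers it. */
uint64_t gpa_to_ram1_offset(uint64_t gpa)
{
    /* ram-below-4g: 0x0 - 0x7fffffff maps to ram1 offset 0x0 */
    if (gpa <= 0x7fffffffull) {
        return gpa;
    }
    /* ram-above-4g: 0x100000000 - 0x17fffffff maps to ram1 offset 0x80000000 */
    if (gpa >= 0x100000000ull && gpa <= 0x17fffffffull) {
        return gpa - 0x100000000ull + 0x80000000ull;
    }
    return UINT64_MAX;
}
```

So the names describe where each alias sits in the guest address space (below or above the 4G mark), not how large it is; the hole between 2G and 4G is left for PCI and other MMIO.
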
address-space: ICH9-SMB
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
    0000000000000000-ffffffffffffffff (prio 0, i/o): alias bus master @system 0000000000000000-ffffffffffffffff

address-space: ICH9-LPC
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
    0000000000000000-ffffffffffffffff (prio 0, i/o): alias bus master @system 0000000000000000-ffffffffffffffff

address-space: mch
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
    0000000000000000-ffffffffffffffff (prio 0, i/o): alias bus master @system 0000000000000000-ffffffffffffffff

address-space: I/O
  0000000000000000-000000000000ffff (prio 0, i/o): io
    0000000000000000-0000000000000007 (prio 0, i/o): dma-chan
    0000000000000008-000000000000000f (prio 0, i/o): dma-cont
    0000000000000040-0000000000000043 (prio 0, i/o): pit
    0000000000000060-0000000000000060 (prio 0, i/o): i8042-data
    0000000000000061-0000000000000061 (prio 0, i/o): pcspk
    0000000000000064-0000000000000064 (prio 0, i/o): i8042-cmd
    0000000000000070-0000000000000071 (prio 0, i/o): rtc
      0000000000000070-0000000000000070 (prio 0, i/o): rtc-index
    000000000000007e-000000000000007f (prio 0, i/o): kvmvapic
    0000000000000080-0000000000000080 (prio 0, i/o): ioport80
    0000000000000081-0000000000000083 (prio 0, i/o): dma-page
    0000000000000087-0000000000000087 (prio 0, i/o): dma-page
    0000000000000089-000000000000008b (prio 0, i/o): dma-page
    000000000000008f-000000000000008f (prio 0, i/o): dma-page
    0000000000000092-0000000000000092 (prio 0, i/o): port92
    00000000000000b2-00000000000000b3 (prio 0, i/o): apm-io
    00000000000000c0-00000000000000cf (prio 0, i/o): dma-chan
    00000000000000d0-00000000000000df (prio 0, i/o): dma-cont
    00000000000000f0-00000000000000f0 (prio 0, i/o): ioportF0
    00000000000003f8-00000000000003ff (prio 0, i/o): serial
    0000000000000510-0000000000000511 (prio 0, i/o): fwcfg
    0000000000000514-000000000000051b (prio 0, i/o): fwcfg.dma
    0000000000000600-000000000000067f (prio 0, i/o): ich9-pm
      0000000000000600-0000000000000603 (prio 0, i/o): acpi-evt
      0000000000000604-0000000000000605 (prio 0, i/o): acpi-cnt
      0000000000000608-000000000000060b (prio 0, i/o): acpi-tmr
      0000000000000620-000000000000062f (prio 0, i/o): acpi-gpe0
      0000000000000630-0000000000000637 (prio 0, i/o): acpi-smi
      0000000000000660-000000000000067f (prio 0, i/o): sm-tco
    0000000000000cc0-0000000000000cd7 (prio 0, i/o): acpi-pci-hotplug
    0000000000000cd8-0000000000000ce3 (prio 0, i/o): acpi-cpu-hotplug
    0000000000000cf8-0000000000000cfb (prio 0, i/o): pci-conf-idx
    0000000000000cf9-0000000000000cf9 (prio 1, i/o): lpc-reset-control
    0000000000000cfc-0000000000000cff (prio 0, i/o): pci-conf-data
    0000000000005658-0000000000005658 (prio 0, i/o): vmport
    0000000000006000-000000000000603f (prio 1, i/o): pm-smbus
    0000000000006040-000000000000605f (prio 1, i/o): ahci-idp

address-space: virtio-serial-pci
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
    0000000000000000-ffffffffffffffff (prio 0, i/o): alias bus master @system 0000000000000000-ffffffffffffffff

address-space: cpu-memory-0
address-space: memory
  0000000000000000-ffffffffffffffff (prio 0, i/o): system
    0000000000000000-000000007fffffff (prio 0, ram): alias ram-below-4g @ram1 0000000000000000-000000007fffffff
    0000000000000000-ffffffffffffffff (prio -1, i/o): pci
      00000000c0000000-00000000c0000fff (prio 1, i/o): ahci
      00000000c0001000-00000000c0001fff (prio 1, i/o): virtio-blk-pci-msix
        00000000c0001000-00000000c000101f (prio 0, i/o): msix-table
        00000000c0001800-00000000c0001807 (prio 0, i/o): msix-pba
      00000000c0002000-00000000c0002fff (prio 1, i/o): virtio-serial-pci-msix
        00000000c0002000-00000000c000201f (prio 0, i/o): msix-table
        00000000c0002800-00000000c0002807 (prio 0, i/o): msix-pba
      00000000c0003000-00000000c0003fff (prio 1, i/o): virtio-net-pci-msix
        00000000c0003000-00000000c000303f (prio 0, i/o): msix-table
        00000000c0003800-00000000c0003807 (prio 0, i/o): msix-pba
      00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios
      0000700000000000-0000700000003fff (prio 1, i/o): virtio-pci
        0000700000000000-0000700000000fff (prio 0, i/o): virtio-pci-common-virtio-net
        0000700000001000-0000700000001fff (prio 0, i/o): virtio-pci-isr-virtio-net
        0000700000002000-0000700000002fff (prio 0, i/o): virtio-pci-device-virtio-net
        0000700000003000-0000700000003fff (prio 0, i/o): virtio-pci-notify-virtio-net
      0000700000004000-0000700000007fff (prio 1, i/o): virtio-pci
        0000700000004000-0000700000004fff (prio 0, i/o): virtio-pci-common-virtio-serial
        0000700000005000-0000700000005fff (prio 0, i/o): virtio-pci-isr-virtio-serial
        0000700000006000-0000700000006fff (prio 0, i/o): virtio-pci-device-virtio-serial
        0000700000007000-0000700000007fff (prio 0, i/o): virtio-pci-notify-virtio-serial
      0000700000008000-000070000000bfff (prio 1, i/o): virtio-pci
        0000700000008000-0000700000008fff (prio 0, i/o): virtio-pci-common-virtio-blk
        0000700000009000-0000700000009fff (prio 0, i/o): virtio-pci-isr-virtio-blk
        000070000000a000-000070000000afff (prio 0, i/o): virtio-pci-device-virtio-blk
        000070000000b000-000070000000bfff (prio 0, i/o): virtio-pci-notify-virtio-blk
    00000000000c0000-00000000000c3fff (prio 1, i/o): alias pam-pci @pci 00000000000c0000-00000000000c3fff
    00000000000c4000-00000000000c7fff (prio 1, i/o): alias pam-pci @pci 00000000000c4000-00000000000c7fff
    00000000000c8000-00000000000cbfff (prio 1, i/o): alias pam-pci @pci 00000000000c8000-00000000000cbfff
    00000000000cc000-00000000000cffff (prio 1, i/o): alias pam-pci @pci 00000000000cc000-00000000000cffff
    00000000000d0000-00000000000d3fff (prio 1, i/o): alias pam-pci @pci 00000000000d0000-00000000000d3fff
    00000000000d4000-00000000000d7fff (prio 1, i/o): alias pam-pci @pci 00000000000d4000-00000000000d7fff
    00000000000d8000-00000000000dbfff (prio 1, i/o): alias pam-pci @pci 00000000000d8000-00000000000dbfff
    00000000000dc000-00000000000dffff (prio 1, i/o): alias pam-pci @pci 00000000000dc000-00000000000dffff
    00000000000e0000-00000000000e3fff (prio 1, i/o): alias pam-pci @pci 00000000000e0000-00000000000e3fff
    00000000000e4000-00000000000e7fff (prio 1, i/o): alias pam-pci @pci 00000000000e4000-00000000000e7fff
    00000000000e8000-00000000000ebfff (prio 1, i/o): alias pam-pci @pci 00000000000e8000-00000000000ebfff
    00000000000ec000-00000000000effff (prio 1, i/o): alias pam-pci @pci 00000000000ec000-00000000000effff
    00000000000f0000-00000000000fffff (prio 1, i/o): alias pam-pci @pci 00000000000f0000-00000000000fffff
    00000000b0000000-00000000bfffffff (prio 0, i/o): pcie-mmcfg-mmio
    00000000fec00000-00000000fec00fff (prio 0, i/o): ioapic
    00000000fed00000-00000000fed003ff (prio 0, i/o): hpet
    00000000fed1c000-00000000fed1ffff (prio 1, i/o): lpc-rcrb-mmio
    00000000fee00000-00000000feefffff (prio 4096, i/o): kvm-apic-msi
    0000000100000000-000000017fffffff (prio 0, ram): alias ram-above-4g @ram1 0000000080000000-00000000ffffffff

address-space: ich9-ahci
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
    0000000000000000-ffffffffffffffff (prio 0, i/o): alias bus master @system 0000000000000000-ffffffffffffffff

address-space: virtio-net-pci
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
    0000000000000000-ffffffffffffffff (prio 0, i/o): alias bus master @system 0000000000000000-ffffffffffffffff

address-space: virtio-blk-pci
  0000000000000000-ffffffffffffffff (prio 0, i/o): bus master container
    0000000000000000-ffffffffffffffff (prio 0, i/o): alias bus master @system 0000000000000000-ffffffffffffffff

memory-region: system
  0000000000000000-ffffffffffffffff (prio 0, i/o): system
    0000000000000000-000000007fffffff (prio 0, ram): alias ram-below-4g @ram1 0000000000000000-000000007fffffff
    0000000000000000-ffffffffffffffff (prio -1, i/o): pci
      00000000c0000000-00000000c0000fff (prio 1, i/o): ahci
      00000000c0001000-00000000c0001fff (prio 1, i/o): virtio-blk-pci-msix
        00000000c0001000-00000000c000101f (prio 0, i/o): msix-table
        00000000c0001800-00000000c0001807 (prio 0, i/o): msix-pba
      00000000c0002000-00000000c0002fff (prio 1, i/o): virtio-serial-pci-msix
        00000000c0002000-00000000c000201f (prio 0, i/o): msix-table
        00000000c0002800-00000000c0002807 (prio 0, i/o): msix-pba
      00000000c0003000-00000000c0003fff (prio 1, i/o): virtio-net-pci-msix
        00000000c0003000-00000000c000303f (prio 0, i/o): msix-table
        00000000c0003800-00000000c0003807 (prio 0, i/o): msix-pba
      00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios
      0000700000000000-0000700000003fff (prio 1, i/o): virtio-pci
        0000700000000000-0000700000000fff (prio 0, i/o): virtio-pci-common-virtio-net
        0000700000001000-0000700000001fff (prio 0, i/o): virtio-pci-isr-virtio-net
        0000700000002000-0000700000002fff (prio 0, i/o): virtio-pci-device-virtio-net
        0000700000003000-0000700000003fff (prio 0, i/o): virtio-pci-notify-virtio-net
      0000700000004000-0000700000007fff (prio 1, i/o): virtio-pci
        0000700000004000-0000700000004fff (prio 0, i/o): virtio-pci-common-virtio-serial
        0000700000005000-0000700000005fff (prio 0, i/o): virtio-pci-isr-virtio-serial
        0000700000006000-0000700000006fff (prio 0, i/o): virtio-pci-device-virtio-serial
        0000700000007000-0000700000007fff (prio 0, i/o): virtio-pci-notify-virtio-serial
      0000700000008000-000070000000bfff (prio 1, i/o): virtio-pci
        0000700000008000-0000700000008fff (prio 0, i/o): virtio-pci-common-virtio-blk
        0000700000009000-0000700000009fff (prio 0, i/o): virtio-pci-isr-virtio-blk
        000070000000a000-000070000000afff (prio 0, i/o): virtio-pci-device-virtio-blk
        000070000000b000-000070000000bfff (prio 0, i/o): virtio-pci-notify-virtio-blk
    00000000000c0000-00000000000c3fff (prio 1, i/o): alias pam-pci @pci 00000000000c0000-00000000000c3fff
    00000000000c4000-00000000000c7fff (prio 1, i/o): alias pam-pci @pci 00000000000c4000-00000000000c7fff
    00000000000c8000-00000000000cbfff (prio 1, i/o): alias pam-pci @pci 00000000000c8000-00000000000cbfff
    00000000000cc000-00000000000cffff (prio 1, i/o): alias pam-pci @pci 00000000000cc000-00000000000cffff
    00000000000d0000-00000000000d3fff (prio 1, i/o): alias pam-pci @pci 00000000000d0000-00000000000d3fff
    00000000000d4000-00000000000d7fff (prio 1, i/o): alias pam-pci @pci 00000000000d4000-00000000000d7fff
    00000000000d8000-00000000000dbfff (prio 1, i/o): alias pam-pci @pci 00000000000d8000-00000000000dbfff
    00000000000dc000-00000000000dffff (prio 1, i/o): alias pam-pci @pci 00000000000dc000-00000000000dffff
    00000000000e0000-00000000000e3fff (prio 1, i/o): alias pam-pci @pci 00000000000e0000-00000000000e3fff
    00000000000e4000-00000000000e7fff (prio 1, i/o): alias pam-pci @pci 00000000000e4000-00000000000e7fff
    00000000000e8000-00000000000ebfff (prio 1, i/o): alias pam-pci @pci 00000000000e8000-00000000000ebfff
    00000000000ec000-00000000000effff (prio 1, i/o): alias pam-pci @pci 00000000000ec000-00000000000effff
    00000000000f0000-00000000000fffff (prio 1, i/o): alias pam-pci @pci 00000000000f0000-00000000000fffff
    00000000b0000000-00000000bfffffff (prio 0, i/o): pcie-mmcfg-mmio
    00000000fec00000-00000000fec00fff (prio 0, i/o): ioapic
    00000000fed00000-00000000fed003ff (prio 0, i/o): hpet
    00000000fed1c000-00000000fed1ffff (prio 1, i/o): lpc-rcrb-mmio
    00000000fee00000-00000000feefffff (prio 4096, i/o): kvm-apic-msi
    0000000100000000-000000017fffffff (prio 0, ram): alias ram-above-4g @ram1 0000000080000000-00000000ffffffff

memory-region: ram1
  0000000000000000-00000000ffffffff (prio 0, ram): ram1

memory-region: pci
  0000000000000000-ffffffffffffffff (prio -1, i/o): pci
    00000000c0000000-00000000c0000fff (prio 1, i/o): ahci
    00000000c0001000-00000000c0001fff (prio 1, i/o): virtio-blk-pci-msix
      00000000c0001000-00000000c000101f (prio 0, i/o): msix-table
      00000000c0001800-00000000c0001807 (prio 0, i/o): msix-pba
    00000000c0002000-00000000c0002fff (prio 1, i/o): virtio-serial-pci-msix
      00000000c0002000-00000000c000201f (prio 0, i/o): msix-table
      00000000c0002800-00000000c0002807 (prio 0, i/o): msix-pba
    00000000c0003000-00000000c0003fff (prio 1, i/o): virtio-net-pci-msix
      00000000c0003000-00000000c000303f (prio 0, i/o): msix-table
      00000000c0003800-00000000c0003807 (prio 0, i/o): msix-pba
    00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios
    0000700000000000-0000700000003fff (prio 1, i/o): virtio-pci
      0000700000000000-0000700000000fff (prio 0, i/o): virtio-pci-common-virtio-net
      0000700000001000-0000700000001fff (prio 0, i/o): virtio-pci-isr-virtio-net
      0000700000002000-0000700000002fff (prio 0, i/o): virtio-pci-device-virtio-net
      0000700000003000-0000700000003fff (prio 0, i/o): virtio-pci-notify-virtio-net
    0000700000004000-0000700000007fff (prio 1, i/o): virtio-pci
      0000700000004000-0000700000004fff (prio 0, i/o): virtio-pci-common-virtio-serial
      0000700000005000-0000700000005fff (prio 0, i/o): virtio-pci-isr-virtio-serial
      0000700000006000-0000700000006fff (prio 0, i/o): virtio-pci-device-virtio-serial
      0000700000007000-0000700000007fff (prio 0, i/o): virtio-pci-notify-virtio-serial
    0000700000008000-000070000000bfff (prio 1, i/o): virtio-pci
      0000700000008000-0000700000008fff (prio 0, i/o): virtio-pci-common-virtio-blk
      0000700000009000-0000700000009fff (prio 0, i/o): virtio-pci-isr-virtio-blk
      000070000000a000-000070000000afff (prio 0, i/o): virtio-pci-device-virtio-blk
      000070000000b000-000070000000bfff (prio 0, i/o): virtio-pci-notify-virtio-blk

glue() In QEMU

glue is not a function but a pair of macros:

#define xglue(x, y) x ## y
#define glue(x, y) xglue(x, y)

So the following declaration:

#define SUFFIX _cached_slow
static inline uint16_t glue(address_space_lduw_internal, SUFFIX)(ARG1_DECL, hwaddr addr, MemTxAttrs attrs, MemTxResult *result,
    enum device_endian endian)

declares a function whose name expands to:

address_space_lduw_internal_cached_slow
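
The token pasting above can be tried outside QEMU. The following is a minimal sketch; the suffix _demo and the names load_u16 and read_first are made up for illustration and do not exist in QEMU.

```c
#include <stdint.h>

/* Same two-level macro pair as above. The extra level makes sure that
 * SUFFIX is replaced by its value before the tokens are pasted. */
#define xglue(x, y) x ## y
#define glue(x, y) xglue(x, y)

/* Hypothetical suffix and base name, for illustration only. */
#define SUFFIX _demo
static inline uint16_t glue(load_u16, SUFFIX)(const uint16_t *p)
{
    return *p;
}
#undef SUFFIX

/* The function defined above is named load_u16_demo. */
uint16_t read_first(const uint16_t *buf)
{
    return load_u16_demo(buf);
}
```

With a single-level macro (x ## y used directly), the arguments of ## would not be macro-expanded, so the pasted name would be load_u16SUFFIX instead of load_u16_demo.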

.c.inc File in QEMU

A .c.inc file contains C code, but it is meant to be included by other C files rather than compiled on its own, for example:

#include "memory_ldst.c.inc"

The main purpose is to avoid code duplication, so these files act as a kind of template code.

lduw QEMU

Load from a given host pointer and return it.

  • ld: stands for "load" which is a common naming convention for instructions that retrieve data from memory.
  • u: indicates that the instruction operates on unsigned integer values. Unsigned integers represent non-negative values (0 and positive numbers).
  • w: signifies that the instruction deals with a word, which typically refers to 16 bits or 2 bytes on many architectures.

For a detailed explanation, see:

Load and Store APIs — QEMU documentation
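
As a rough illustration of the naming scheme (ld + u + w), here is a sketch of a 16-bit unsigned load from a host pointer in both byte orders. The demo_ names are hypothetical and this is not QEMU's implementation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load an unsigned 16-bit little-endian value
 * from a host pointer. memcpy keeps unaligned pointers safe. */
static inline uint16_t demo_lduw_le(const void *ptr)
{
    uint8_t b[2];
    memcpy(b, ptr, sizeof(b));
    return (uint16_t)(b[0] | (b[1] << 8));
}

/* Same load, but interpreting the two bytes as big-endian. */
static inline uint16_t demo_lduw_be(const void *ptr)
{
    uint8_t b[2];
    memcpy(b, ptr, sizeof(b));
    return (uint16_t)((b[0] << 8) | b[1]);
}
```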

clflush_cache_range() / CLFLUSH / CLFLUSHOPT Kernel

Flush a cache range with CLFLUSH.

Invalidates from every level of the cache hierarchy in the cache coherence domain the cache line that contains the linear address specified with the memory operand.

If that cache line contains modified data at any level of the cache hierarchy, that data is written back to memory.

CLFLUSHOPT: Flush Cache Line Optimized.

/**
 * clflush_cache_range - flush a cache range with clflush
 * @vaddr:	virtual start address
 * @size:	number of bytes to flush
 *
 * CLFLUSHOPT is an unordered instruction which needs fencing with MFENCE or
 * SFENCE to avoid ordering issues.
 */
void clflush_cache_range(void *vaddr, unsigned int size)
{
	mb();
	clflush_cache_range_opt(vaddr, size);
	mb();
}

Git request-pull example

git request-pull HEAD~1 https://github.com/intel-sandbox/leinux pre-si/6.8/staging/cwf > ~/mail
git request-pull 8bc230b5ad0a938a1302b5ff1d25ce692721219 https://github.com/intel-sandbox/leinux pre-si/6.8/staging/cwf  > ~/mail

Git reset branch to a specific commit

git branch --force <branch-name> [<new-tip-commit>]

iowrite32() Guest kernel

iowrite32() in Linux is a function used to write a 32-bit value to a device register or MMIO address.

While memcpy or similar functions are meant for general memory operations, iowrite32 specifically targets device registers and I/O memory, which handle communication with hardware components.

iowrite32() may include memory barriers to ensure the write operation completes before other instructions.

In some cases, writel might be used interchangeably with iowrite32.

In short, it writes a value to a specific memory-mapped address.

For the lazy ones, here are my conclusions:

  • On x86 platforms, iowrite32() and writel() are translated to just a “mov” into memory.
  • On x86, the following functions translate into nothing: mmiowb(), smp_wmb() and smp_rmb(). wmb() and rmb() translate into “sfence” and “lfence” respectively.

iowrite32(), writel() and memory barriers taken apart
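
To make the "just a mov into memory" point concrete, here is a userspace sketch of what such an accessor boils down to on x86. The demo_ functions are stand-ins written for this note, not the kernel implementation; the argument order (value first, then address) follows iowrite32().

```c
#include <stdint.h>

/* Hypothetical stand-in for iowrite32(): a single volatile 32-bit store.
 * volatile keeps the compiler from merging or dropping the access. */
static inline void demo_iowrite32(uint32_t value, volatile void *addr)
{
    *(volatile uint32_t *)addr = value;
}

/* Hypothetical stand-in for ioread32(): a single volatile 32-bit load. */
static inline uint32_t demo_ioread32(const volatile void *addr)
{
    return *(const volatile uint32_t *)addr;
}
```

In a real driver the address comes from an ioremap()-style mapping of a device BAR; here any uint32_t variable can play the role of the register.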

Qemu-storage-daemon

qemu-storage-daemon is a tool that provides disk image functionality for a VM without running the VM itself.

QEMU Storage Daemon - DEV Community

exit_fastpath / fastpath_t KVM

enum exit_fastpath_completion {
	EXIT_FASTPATH_NONE,
	EXIT_FASTPATH_REENTER_GUEST,
	EXIT_FASTPATH_EXIT_HANDLED,
};
typedef enum exit_fastpath_completion fastpath_t;

get_user_pages() Kernel

When kernel code needs to work directly with user-space pages, it often calls get_user_pages() (or one of several variants) to fault those pages into RAM and pin them there.

When it is called, it will translate user-space virtual addresses to physical addresses and ensure that the pages are in memory.

In other words, a page fault may occur during the call.

KVM_REQ_EVENT KVM

vcpu_enter_guest
    if (kvm_check_request(KVM_REQ_EVENT, vcpu))
        //...
		r = kvm_apic_accept_events(vcpu);
        //...
		r = kvm_check_and_inject_events(vcpu, &req_immediate_exit);
        //...
		if (kvm_lapic_enabled(vcpu)) {
            // TPR related things
			update_cr8_intercept(vcpu);
            
			kvm_lapic_sync_to_vapic(vcpu);
		}

run_on_cpu() QEMU

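The calling thread submits a work item to the target vCPU thread and then blocks: it can only continue after that vCPU thread has finished running the work item.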

thread A:

run_on_cpu
    do_run_on_cpu
        wi.func = func;
        // See the async_run_on_cpu() section.
        // This call basically hands the work item to the vCPU thread
        // and lets the vCPU thread carry on.
        queue_work_on_cpu(cpu, &wi);
        // Wait for the vCPU thread to finish the work item.
        qemu_cond_wait(&qemu_work_cond, mutex);

vCPU thread B:

kvm_vcpu_thread_fn
    do {
        if (cpu_can_run(cpu)) {
            r = kvm_cpu_exec(cpu);
            //...
        }
        qemu_wait_io_event(cpu);
            qemu_wait_io_event_common
                qatomic_set_mb(&cpu->thread_kicked, false);
                process_queued_cpu_work(cpu);
                    QSIMPLEQ_REMOVE_HEAD(&cpu->work_list, node);
                        wi->func(cpu, wi->data);
                        // Mark the work item as done.
                        qatomic_store_release(&wi->done, true);
                    // Notify the waiting thread that the work has finished.
                    qemu_cond_broadcast(&qemu_work_cond);
    } while (!cpu->unplug || cpu_can_run(cpu));
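
The submit-then-wait pattern above can be sketched with plain pthreads. Everything here is a simplified assumption made for illustration: the demo_ names are invented, there is a single work slot instead of a work list, and the work function runs with the lock held.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*work_fn)(void *data);

/* Hypothetical, simplified vCPU: one work slot instead of a queue. */
struct demo_cpu {
    pthread_mutex_t lock;
    pthread_cond_t work_ready;  /* signalled when work is submitted */
    pthread_cond_t work_done;   /* signalled when work has been run */
    work_fn func;
    void *data;
    bool pending, done, stop;
};

/* "vCPU thread": wait for work, run it, then report completion. */
static void *demo_vcpu_thread(void *opaque)
{
    struct demo_cpu *cpu = opaque;

    pthread_mutex_lock(&cpu->lock);
    while (!cpu->stop) {
        if (!cpu->pending) {
            pthread_cond_wait(&cpu->work_ready, &cpu->lock);
            continue;
        }
        cpu->pending = false;
        cpu->func(cpu->data);
        cpu->done = true;
        pthread_cond_broadcast(&cpu->work_done);
    }
    pthread_mutex_unlock(&cpu->lock);
    return NULL;
}

/* Caller side: submit the work and block until it has been run. */
static void demo_run_on_cpu(struct demo_cpu *cpu, work_fn func, void *data)
{
    pthread_mutex_lock(&cpu->lock);
    cpu->func = func;
    cpu->data = data;
    cpu->done = false;
    cpu->pending = true;
    pthread_cond_signal(&cpu->work_ready);
    while (!cpu->done) {
        pthread_cond_wait(&cpu->work_done, &cpu->lock);
    }
    pthread_mutex_unlock(&cpu->lock);
}

static void demo_add_one(void *data)
{
    ++*(int *)data;
}

/* Start a vCPU thread, run demo_add_one on it synchronously, and
 * return the updated value (or -1 if the thread cannot be created). */
int demo_sync_roundtrip(void)
{
    struct demo_cpu cpu = { .pending = false };
    pthread_t tid;
    int value = 41;

    pthread_mutex_init(&cpu.lock, NULL);
    pthread_cond_init(&cpu.work_ready, NULL);
    pthread_cond_init(&cpu.work_done, NULL);
    if (pthread_create(&tid, NULL, demo_vcpu_thread, &cpu) != 0) {
        return -1;
    }

    demo_run_on_cpu(&cpu, demo_add_one, &value);  /* returns after the work ran */

    pthread_mutex_lock(&cpu.lock);
    cpu.stop = true;
    pthread_cond_signal(&cpu.work_ready);
    pthread_mutex_unlock(&cpu.lock);
    pthread_join(tid, NULL);

    pthread_cond_destroy(&cpu.work_done);
    pthread_cond_destroy(&cpu.work_ready);
    pthread_mutex_destroy(&cpu.lock);
    return value;
}
```

Build with -pthread. Because demo_run_on_cpu() only returns once done is set, the caller can read value right afterwards without further synchronization.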

async_run_on_cpu() QEMU

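The calling thread submits a work item to the target vCPU thread and returns immediately. It keeps running and does not wait for the work item to finish.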

thread A:

async_run_on_cpu
    queue_work_on_cpu // Append the work item to the target vCPU thread's work list.
        QSIMPLEQ_INSERT_TAIL(&cpu->work_list, wi, node);
        qemu_cpu_kick
            cpus_kick_thread
                cpu->thread_kicked = true;
                // Post the target CPU's semaphore to tell it there is new work to run.
                qemu_sem_post(&cpu->sem);

vCPU thread B:

kvm_vcpu_thread_fn
    do {
        if (cpu_can_run(cpu)) {
            r = kvm_cpu_exec(cpu);
            if (r == EXCP_DEBUG) {
                cpu_handle_guest_debug(cpu);
            }
        }
        qemu_wait_io_event(cpu);
            qemu_wait_io_event_common
                qatomic_set_mb(&cpu->thread_kicked, false);
                process_queued_cpu_work(cpu);
                    QSIMPLEQ_REMOVE_HEAD(&cpu->work_list, node);
                        wi->func(cpu, wi->data);
    } while (!cpu->unplug || cpu_can_run(cpu));
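
The asynchronous variant can be sketched as a plain FIFO of heap-allocated work items. This is a single-threaded illustration with invented demo_ names: the real code protects the list with a lock and kicks the vCPU thread, both of which are left out here.

```c
#include <stdlib.h>

typedef void (*work_fn)(void *data);

/* Hypothetical work item; heap-allocated because the submitter
 * returns before the item is run. */
struct demo_work {
    struct demo_work *next;
    work_fn func;
    void *data;
};

struct demo_cpu {
    struct demo_work *head, *tail;   /* FIFO work list */
};

/* Queue a work item and return immediately. Returns 0, or -1 if the
 * allocation fails. */
int demo_async_run_on_cpu(struct demo_cpu *cpu, work_fn func, void *data)
{
    struct demo_work *wi = malloc(sizeof(*wi));

    if (!wi) {
        return -1;
    }
    wi->next = NULL;
    wi->func = func;
    wi->data = data;
    if (cpu->tail) {
        cpu->tail->next = wi;
    } else {
        cpu->head = wi;
    }
    cpu->tail = wi;
    return 0;
}

/* What the vCPU loop does between guest runs: drain the list in
 * submission order. Returns the number of items that were run. */
int demo_process_queued_work(struct demo_cpu *cpu)
{
    int n = 0;

    while (cpu->head) {
        struct demo_work *wi = cpu->head;

        cpu->head = wi->next;
        if (!cpu->head) {
            cpu->tail = NULL;
        }
        wi->func(wi->data);
        free(wi);
        n++;
    }
    return n;
}

void demo_add_ten(void *data)
{
    *(int *)data += 10;
}
```

The key difference from run_on_cpu() is visible in the call order: after demo_async_run_on_cpu() returns, the work has not run yet; it only runs when the vCPU side reaches demo_process_queued_work().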

QEMU iothread

The guest driver's virtio interactions all go through the IOThread, including ioeventfd and irqfd.

Usage:

-object iothread,id=iothread1
-device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1

Arch capabilities / Core capabilities

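These two are not the same thing: Arch capabilities is CPUID_7_0_EDX bit 29 and Core capabilities is CPUID_7_0_EDX bit 30. Each bit enumerates support for the corresponding MSR described below.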

IA32_ARCH_CAPABILITIES is an MSR in Intel processors that enumerates various architectural features, primarily related to security mitigations. It is a read-only register: you can query its value to determine the supported features, but you cannot modify it.

IA32_CORE_CAPABILITIES is an architectural MSR that enumerates model-specific features. A bit being set in this MSR indicates that a model specific feature is supported; software must still consult CPUID family/model/stepping to determine the behavior of the enumerated feature as features enumerated in IA32_CORE_CAPABILITIES may have different behavior on different processor models. Some of these features may have behavior that is consistent across processor models (and for which consultation of CPUID family/model/stepping is not necessary); such features are identified explicitly where they are documented in this manual.
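
A small sketch of how the two enumeration bits can be tested, given the EDX value returned by CPUID leaf 7, sub-leaf 0. The DEMO_ macros and demo_ functions are names invented for this note.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in CPUID.(EAX=07H,ECX=0):EDX, as described above. */
#define DEMO_CPUID_7_0_EDX_ARCH_CAPABILITIES (UINT32_C(1) << 29)
#define DEMO_CPUID_7_0_EDX_CORE_CAPABILITIES (UINT32_C(1) << 30)

/* True if the IA32_ARCH_CAPABILITIES MSR is enumerated. */
bool demo_has_arch_capabilities(uint32_t edx)
{
    return (edx & DEMO_CPUID_7_0_EDX_ARCH_CAPABILITIES) != 0;
}

/* True if the IA32_CORE_CAPABILITIES MSR is enumerated. */
bool demo_has_core_capabilities(uint32_t edx)
{
    return (edx & DEMO_CPUID_7_0_EDX_CORE_CAPABILITIES) != 0;
}
```

On an x86 host with GCC or Clang, the EDX value can be obtained with __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) from <cpuid.h>. Reading the MSRs themselves requires ring 0.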

Show all RAMBlocks information

info ramblock

bdrv_inactivate_recurse() QEMU

Open question: why are block drivers organized in this tree-like structure?

static int bdrv_inactivate_recurse(BlockDriverState *bs)
{
    BdrvChild *child, *parent;
    int ret;
    uint64_t cumulative_perms, cumulative_shared_perms;

    //...
    /* Inactivate this node */
    // only qcow2 has this function defined
    if (bs->drv->bdrv_inactivate)
        ret = bs->drv->bdrv_inactivate(bs);
        //...

    QLIST_FOREACH(parent, &bs->parents, next_parent) {
        if (parent->klass->inactivate)
            ret = parent->klass->inactivate(parent);
            //...
    }

    bdrv_get_cumulative_perm(bs, &cumulative_perms, &cumulative_shared_perms);
    if (cumulative_perms & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED)) {
        /* Our inactive parents still need write access. Inactivation failed. */
        return -EPERM;
    }

    bs->open_flags |= BDRV_O_INACTIVE;

    /*
     * Update permissions, they may differ for inactive nodes.
     * We only tried to loosen restrictions, so errors are not fatal, ignore
     * them.
     */
    bdrv_refresh_perms(bs, NULL, NULL);

    /* Recursively inactivate children */
    QLIST_FOREACH(child, &bs->children, next) {
        ret = bdrv_inactivate_recurse(child->bs);
        //...
    }
    //...
}

bdrv_inactivate_all() QEMU

Block driver inactivate all.

Inactivate before sending QEMU_VM_EOF so that the bdrv_activate_all() on the other end won't fail.

int bdrv_inactivate_all(void)
{
    BlockDriverState *bs = NULL;
    BdrvNextIterator it;
    int ret = 0;
    GSList *aio_ctxs = NULL, *ctx;

    //...
    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
        AioContext *aio_context = bdrv_get_aio_context(bs);

        if (!g_slist_find(aio_ctxs, aio_context)) {
            aio_ctxs = g_slist_prepend(aio_ctxs, aio_context);
            aio_context_acquire(aio_context);
        }
    }

    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
        /* Nodes with BDS parents are covered by recursion from the last
         * parent that gets inactivated. Don't inactivate them a second
         * time if that has already happened. */
        // This BDS may have a parent. In that case it is inactivated during
        // the recursion, so there is no need to inactivate it here.
        if (bdrv_has_bds_parent(bs, false))
            continue;

        ret = bdrv_inactivate_recurse(bs);
        //...
    }

out:
    for (ctx = aio_ctxs; ctx != NULL; ctx = ctx->next) {
        AioContext *aio_context = ctx->data;
        aio_context_release(aio_context);
    }
    g_slist_free(aio_ctxs);

    return ret;
}

QEMU process name / What's the difference between thread <process> and qemu-system-x86?

$QEMU -accel kvm \
-name lm_dst,process=lm_dst \

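What is the difference between name and process?

The name parameter sets the window title.

process sets the process name, which can then be found with pgrep.

debug-threads controls whether each thread is given a distinct name to make debugging easier.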

-name string1[,process=string2][,debug-threads=on|off]
     set the name of the guest
     string1 sets the window title and string2 the process name
     When debug-threads is enabled, individual threads are given a separate name
     NOTE: The thread names are for debugging and not a stable API.

qemu_init
    qemu_process_early_options
        qemu_opts_foreach(qemu_find_opts("name"), parse_name, NULL, &error_fatal);
            proc_name = qemu_opt_get(opts, "process");
                prctl(PR_SET_NAME, name)

With the parameters above, listing all threads of a QEMU process shows at least the following three threads:

    PID    SPID TTY          TIME CMD
 629509  629509 pts/3    00:00:01 lm_src
 629509  629510 pts/3    00:00:00 qemu-system-x86
 629509  629511 pts/3    00:00:25 CPU 0/KVM

Which one is the main thread, lm_src or qemu-system-x86? Without the parameters above, we see:

    PID    SPID TTY          TIME CMD
 632772  632772 pts/3    00:00:00 qemu-system-x86
 632772  632773 pts/3    00:00:00 qemu-system-x86
 632772  632775 pts/3    00:00:15 CPU 0/KVM

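Both threads are now named qemu-system-x86. The first thread (the one whose SPID equals the PID) is certainly the main thread, so what does the second thread do?

Debugging with GDB shows that this thread runs call_rcu_thread(), which is used to implement QEMU's RCU.

// rcu_init is marked with __attribute__((__constructor__)),
// which means it runs before main().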
rcu_init
    rcu_init_complete
        qemu_thread_create(&thread, "call_rcu", call_rcu_thread, NULL, QEMU_THREAD_DETACHED);
            call_rcu_thread

iowrite* / ioread* Kernel

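What do iowrite32(), iowrite16() and the related functions mean?

The number is the size of the data to write: for example, 32 means a 32-bit value is written to the given address.

These are the functions for MMIO reads and writes. But if it is MMIO anyway, why not simply use the memory-style accessor writel(), and why are they called io*()? In practice the two are the same on x86: iowrite32() and writel() are translated to just a “mov” into memory.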

iowrite32(), writel() and memory barriers taken apart

hwaddr In QEMU

DMA in virtualization

memory_access_is_direct() QEMU

In QEMU, the memory_access_is_direct function is used to determine whether a given memory region supports direct access from a device using translated addresses.

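It appears to be related to device I/O.

It tells whether this memory access (read or write) can go directly to the RAM backing the region, rather than through the region's I/O callbacks.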

static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
{
    if (is_write) {
        return memory_region_is_ram(mr) && !mr->readonly &&
               !mr->rom_device && !memory_region_is_ram_device(mr);
    } else {
        return (memory_region_is_ram(mr) && !memory_region_is_ram_device(mr)) ||
               memory_region_is_romd(mr);
    }
}

ref_count / Refcount / ref count

A page's refcount:

page_ref_count(pfn_to_page(pfn))

Why TSC frequency tsc-freq and -invtsc should be specified to live migrate between different platforms?

Why invtsc cannot be migrated?

x86 cpus can't migrate with the 'invtsc' feature flag enabled.

Features/Migration/Troubleshooting - QEMU

We can see from here that invtsc cannot be migrated.

    [FEAT_8000_0007_EDX] = {
        //...
        .unmigratable_flags = CPUID_APM_INVTSC,
    },

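I think the main reason is that invtsc tells the guest that the TSC runs at a constant frequency, so the guest may use the TSC as a wall clock source. After migration to a new machine the TSC frequency may be different, which breaks the invtsc guarantee. That is why invtsc is unmigratable by default, unless the TSC frequency is specified manually.

Open question: why can't the destination, on detecting invtsc, simply use the source's TSC frequency instead of its own?

If tsc-frequency is specified together with invtsc, the invtsc error is not reported.

With -cpu host, the CPU is migratable by default, so the invtsc CPUID bit is not exposed to the guest unless -cpu host,+invtsc is specified explicitly.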

1254124 – -cpu $cpu-model,+invtsc doesn't support migration

5-level Paging and 5-level EPT

5-Level Paging and 5-Level EPT White Paper

qemu_system_wakeup_request() QEMU

Wakeup guest from suspend.

Wake the guest, either with system_wakeup or moving the mouse or something.

Comma expression in C

The value of a comma expression is the value of its last operand.

a = (a=3*5, a*4) evaluates to 60.

逗号表达式_百度百科
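
The expression above can be checked with a tiny function. The name demo_comma is made up for this note.

```c
/* Evaluates a = (a = 3*5, a*4) and returns the final value of a.
 * The comma operator first assigns 15 to a, then yields a*4. */
int demo_comma(void)
{
    int a;

    a = (a = 3 * 5, a * 4);
    return a;
}
```

The comma operator introduces a sequence point between its operands, so the inner assignment is complete before a*4 is evaluated.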

Memory address line / address bus width

The number of address lines describes the width of the physical address, not of the virtual address.

┌───┐ Virtual ┌───┐ Physical Address ┌──────┐
│CPU├────────►│MMU├─────────────────►│Memory│
└───┘ Address └───┘   Address Line   └──────┘

Usually "Address line" denotes the electrical connection between a single address bit of the CPU (after translation by a MMU from a virtual address to a physical address) and the memory.

ram - Address line, 16 bit memory, and addresses - Stack Overflow
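
As a quick worked example of what the address bus width implies: with n address lines, 2^n distinct addresses can be selected. The sketch below assumes byte-addressable memory; the function name is invented for this note.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct byte addresses reachable with `lines` address
 * lines, assuming byte-addressable memory. Valid for 0..63 lines. */
uint64_t demo_addressable_bytes(unsigned lines)
{
    assert(lines < 64);
    return UINT64_C(1) << lines;
}
```

For example, 32 address lines give 4294967296 bytes (4 GiB), and 16 address lines give 65536 bytes (64 KiB).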

mmu_lock KVM MMU

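This is a big, coarse-grained lock.

It is either a read-write lock or a spinlock, depending on whether KVM_HAVE_MMU_RWLOCK is defined.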

struct kvm {
	//...
#ifdef KVM_HAVE_MMU_RWLOCK
	rwlock_t mmu_lock;
#else
	spinlock_t mmu_lock;
#endif
	//...
};

#ifdef KVM_HAVE_MMU_RWLOCK
#define KVM_MMU_LOCK_INIT(kvm)		rwlock_init(&(kvm)->mmu_lock)
#define KVM_MMU_LOCK(kvm)		write_lock(&(kvm)->mmu_lock)
#define KVM_MMU_UNLOCK(kvm)		write_unlock(&(kvm)->mmu_lock)
#else
#define KVM_MMU_LOCK_INIT(kvm)		spin_lock_init(&(kvm)->mmu_lock)
#define KVM_MMU_LOCK(kvm)		spin_lock(&(kvm)->mmu_lock)
#define KVM_MMU_UNLOCK(kvm)		spin_unlock(&(kvm)->mmu_lock)
#endif /* KVM_HAVE_MMU_RWLOCK */

Use KVM_MMU_LOCK() and KVM_MMU_UNLOCK() to take and release the lock.

What mmu_lock is for:

  • Multiple vCPUs share critical sections and must take the lock to synchronize.

Cache type

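Strong Uncacheable (UC): For memory of this type, no read or write goes through the cache. It is typically used for memory-mapped I/O addresses. Using it for ordinary RAM is strongly discouraged because performance would be very poor.

Uncacheable (UC-): Behaves the same as UC (Strong Uncacheable). The only difference is that memory of this type can be changed to WC by programming the MTRRs.

Write Combining (WC): Similar to UC, except that speculative reads are allowed (open question: what exactly is a speculative read?). Each write may be delayed, and the written data is buffered in the "write combining buffer". WC can be set either by programming the MTRRs or through the PAT (open question: what is the PAT?).

Write-through (WT): Every write goes to memory and, if the write hits, to the corresponding cache line as well. WT keeps the cache and memory consistent.

Write-back (WB): Reads and writes behave like an ordinary cache. The difference from WT is that a write that lands in the cache is not written to memory immediately. The CPU writes it back at a suitable later time, for example when the cache is full. This is the most efficient type.

Write-protected (WP): Reads behave as in WB, but every write causes a cache invalidation.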

Cache学习(UC, WC)_cpu地址wc属性-CSDN博客