L0: Host, L1: Guest, L2: Nested Guest.

VMCS01 and VMCS02 live on L0.

VMCS12 lives on L1.

The key flow is (when the Shadow VMCS feature is not available?):

  1. L1's VMRESUME triggers a VMExit to L0;
  2. L0 merges VMCS01 & VMCS12 into VMCS02;
  3. L0 runs L2 with VMCS02;
  4. L2 runs until a VMExit is triggered according to VMCS02 (an L2 VMExit goes directly to L0);
  5. L0 decides whether to handle the VMExit itself or reflect it to L1.
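Roughly, these five steps map onto KVM code paths that are walked through later in this note (function names as found under arch/x86/kvm/vmx/; nested_vmx_reflect_vmexit() itself is not quoted below, but that is where step 5 lives):

handle_vmlaunch / handle_vmresume               // step 1: L1's VMLAUNCH/VMRESUME exits to L0
    nested_vmx_run
        nested_vmx_enter_non_root_mode
            vmx_switch_vmcs(vcpu, &vmx->nested.vmcs02)
            prepare_vmcs02_early / prepare_vmcs02   // step 2: merge vmcs01 + vmcs12 into vmcs02
// steps 3-4: L2 runs on vmcs02 until it exits, and the exit always lands in L0
vmx_handle_exit
    nested_vmx_reflect_vmexit                   // step 5: handle in L0, or reflect to L1 via nested_vmx_vmexit()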

Why do we need to merge VMCS01 and VMCS12 into VMCS02? Can't we just load VMCS12 directly?

Some adjustment is still needed: VMCS12 cannot be run as-is and has to be adapted according to VMCS01.

L0 cannot use VMCS1→2 to execute L2 directly, since VMCS1→2 is not valid in L0's environment and L0 cannot use L1's page tables to run L2. After all, the host state fields, for example, are certainly different: they must hold L0's host data, not L1's.

Once the Shadow VMCS VMX feature is available, do we still need to merge VMCS01 and VMCS12? Isn't the shadow VMCS itself VMCS02?

Yes, we still do. VMRESUME always triggers a VMExit; Shadow VMCS only lets certain fields be accessed with VMREAD/VMWRITE without a VMExit. When VMRESUME is actually executed, a VMExit still occurs and the adjustments still have to be made, so Shadow VMCS does not remove the need for the merge.

How does L0 know that the VMExit reason is L1 wanting to run a nested VM?

static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
    //...
    // VMRESUME is a VMX instruction; a guest executing VMX instructions is necessarily nested-related.
    // Actually the exit does not land here, because this handler has been replaced during nested setup;
    // the real handler is handle_vmresume().
	[EXIT_REASON_VMRESUME]		      = handle_vmx_instruction,
    //...
}

vmx_hardware_setup
    if (nested) {
        nested_vmx_hardware_setup
            exit_handlers[EXIT_REASON_VMCLEAR]	= handle_vmclear;
        	exit_handlers[EXIT_REASON_VMLAUNCH]	= handle_vmlaunch;
        	exit_handlers[EXIT_REASON_VMPTRLD]	= handle_vmptrld;
        	exit_handlers[EXIT_REASON_VMPTRST]	= handle_vmptrst;
        	exit_handlers[EXIT_REASON_VMREAD]	= handle_vmread;
        	exit_handlers[EXIT_REASON_VMRESUME]	= handle_vmresume;
        	exit_handlers[EXIT_REASON_VMWRITE]	= handle_vmwrite;
        	exit_handlers[EXIT_REASON_VMOFF]	= handle_vmxoff;
        	exit_handlers[EXIT_REASON_VMON]		= handle_vmxon;
        	exit_handlers[EXIT_REASON_INVEPT]	= handle_invept;
        	exit_handlers[EXIT_REASON_INVVPID]	= handle_invvpid;
        	exit_handlers[EXIT_REASON_VMFUNC]	= handle_vmfunc;
static u64 nested_vmx_calc_vmcs_enum_msr(void)
	/*
	 * For better or worse, KVM allows VMREAD/VMWRITE to all fields in
	 * vmcs12, regardless of whether or not the associated feature is
	 * exposed to L1.  Simply find the field with the highest index.
	 */

Nested virtualization in KVM is based on this paper:

Ben-Yehuda.pdf

VMCS01, VMCS02, VMCS12 in KVM

vmcs01 is the VMCS used by this vCPU itself. Since L1 may run many L2s (an L1 vCPU may run many L2 vCPUs), each L2 also has its own VMCS.

struct vcpu_vmx {
    //...
	/*
	 * loaded_vmcs points to the VMCS currently used in this vcpu. For a
	 * non-nested (L1) guest, it always points to vmcs01. For a nested
	 * guest (L2), it points to a different VMCS.
	 */
	struct loaded_vmcs    vmcs01;
	struct loaded_vmcs   *loaded_vmcs;
    //...
}

What happens during the merge:

  • The host state defined in VMCS0→2 must contain the values required by the CPU to correctly switch back from L2 to L0. (That is, some of the host state in VMCS0→1 needs to be written into VMCS0→2.)
  • In addition, VMCS1→2 host state must be copied to VMCS0→1 guest state. Thus, when L0 emulates a switch from L2 to L1, the processor loads the correct L1 specifications. (To make L1 believe a VMExit happened: L2 exits to L0 and L0 then VMEnters L1, so the host state in VMCS1→2 has to be recorded into the guest state of VMCS0→1, so that the next VMEntry into L1 restores the correct state.)
  • The guest state stored in VMCS1→2 does not require any special handling in general, and most fields can be copied directly to the guest state of VMCS0→2. See the sketch after this list.
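A minimal sketch of these three copy directions (the types and helper names here are made up purely for illustration; the real work is done by prepare_vmcs02()/prepare_vmcs02_early() and by the L2→L1 exit path shown later):

/* Illustration only; not KVM's actual structures. */
struct state { unsigned long regs[64]; };	/* placeholder for one VMCS area */

struct vmcs_view {
	struct state guest;	/* guest-state area */
	struct state host;	/* host-state area  */
	struct state ctrl;	/* control fields   */
};

static struct state combine_controls(const struct state *l0, const struct state *l1)
{
	struct state out = *l0;
	/* e.g. an event that either L0 or L1 wants to trap must also be trapped in vmcs02 */
	for (int i = 0; i < 64; i++)
		out.regs[i] |= l1->regs[i];
	return out;
}

/* Build vmcs02 for entering L2 from L0. */
static void merge_for_l2_entry(const struct vmcs_view *vmcs01,
			       const struct vmcs_view *vmcs12,
			       struct vmcs_view *vmcs02)
{
	vmcs02->host  = vmcs01->host;	/* an L2 VMExit must land back in L0         */
	vmcs02->guest = vmcs12->guest;	/* L2's guest state mostly copies over as-is */
	vmcs02->ctrl  = combine_controls(&vmcs01->ctrl, &vmcs12->ctrl);
}

/* Emulate an L2 -> L1 VMExit: L1 resumes with the host state it put into vmcs12. */
static void emulate_exit_to_l1(const struct vmcs_view *vmcs12, struct vmcs_view *vmcs01)
{
	vmcs01->guest = vmcs12->host;
}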

What happens when L1 executes VMXON? / handle_vmon() / enter_vmx_operation() KVM

/* Emulate the VMXON instruction. */
static int handle_vmon(struct kvm_vcpu *vcpu)
{
	int ret;
	gpa_t vmptr;
	uint32_t revision;
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	const u64 VMXON_NEEDED_FEATURES = FEAT_CTL_LOCKED
		| FEAT_CTL_VMX_ENABLED_OUTSIDE_SMX;

	/*
	 * The Intel VMX Instruction Reference lists a bunch of bits that are
	 * prerequisite to running VMXON, most notably cr4.VMXE must be set to
	 * 1 (see vmx_is_valid_cr4() for when we allow the guest to set this).
	 * Otherwise, we should fail with #UD.  But most faulting conditions
	 * have already been checked by hardware, prior to the VM-exit for
	 * VMXON.  We do test guest cr4.VMXE because processor CR4 always has
	 * that bit set to 1 in non-root mode.
	 */
    // The comment above already explains this clearly
	if (!kvm_read_cr4_bits(vcpu, X86_CR4_VMXE)) {
		kvm_queue_exception(vcpu, UD_VECTOR);
		return 1;
	}

	/* CPL=0 must be checked manually. */
    // The vCPU must be running in kernel mode;
    // this prevents guest userspace from executing VMXON.
	if (vmx_get_cpl(vcpu)) {
		kvm_inject_gp(vcpu, 0);
		return 1;
	}

    // VMXON after VMXON is not allowed
	if (vmx->nested.vmxon)
		return nested_vmx_fail(vcpu, VMXERR_VMXON_IN_VMX_ROOT_OPERATION);

	if ((vmx->msr_ia32_feature_control & VMXON_NEEDED_FEATURES)
			!= VMXON_NEEDED_FEATURES) {
		kvm_inject_gp(vcpu, 0);
		return 1;
	}

    // The VMXON pointer; do not confuse it with the VMCS pointer.
    // We rarely deal with this region.
	if (nested_vmx_get_vmptr(vcpu, &vmptr, &ret))
		return ret;

	/*
	 * SDM 3: 24.11.5
	 * The first 4 bytes of VMXON region contain the supported
	 * VMCS revision identifier
	 *
	 * Note - IA32_VMX_BASIC[48] will never be 1 for the nested case;
	 * which replaces physical address width with 32
	 */
	if (!page_address_valid(vcpu, vmptr))
		return nested_vmx_failInvalid(vcpu);

	if (kvm_read_guest(vcpu->kvm, vmptr, &revision, sizeof(revision)) || revision != VMCS12_REVISION)
		return nested_vmx_failInvalid(vcpu);

    // Record this pointer in nested.vmxon_ptr
	vmx->nested.vmxon_ptr = vmptr;
    // See the walkthrough below
	ret = enter_vmx_operation(vcpu);
    //...
	return nested_vmx_succeed(vcpu);
}
static int enter_vmx_operation(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	int r;

    // Allocate and initialize VMCS02
	r = alloc_loaded_vmcs(&vmx->nested.vmcs02);
    //...

    // Allocate and initialize VMCS12.
    // Since vmcs12 can never be loaded into hardware (only vmcs01 and vmcs02 can),
    // we do not use alloc_loaded_vmcs() here.
	vmx->nested.cached_vmcs12 = kzalloc(VMCS12_SIZE, GFP_KERNEL_ACCOUNT);
    //...

	vmx->nested.shadow_vmcs12_cache.gpa = INVALID_GPA;
    // What is shadow_vmcs12 used for?
	vmx->nested.cached_shadow_vmcs12 = kzalloc(VMCS12_SIZE, GFP_KERNEL_ACCOUNT);
    //...

    // Allocate a shadow VMCS and associate it with the **currently loaded VMCS**.
    // At this point the loaded VMCS should be VMCS01.
	if (enable_shadow_vmcs && !alloc_shadow_vmcs(vcpu))
		goto out_shadow_vmcs;

    // Timer related; skip for now.
	hrtimer_init(&vmx->nested.preemption_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
	vmx->nested.preemption_timer.function = vmx_preemption_timer_fn;

    // There is also a vpid02
	vmx->nested.vpid02 = allocate_vpid();

	vmx->nested.vmcs02_initialized = false;
    // VMXON has now been done
	vmx->nested.vmxon = true;

    // Intel PT related; ignore for now.
	if (vmx_pt_mode_is_host_guest()) {
		vmx->pt_desc.guest.ctl = 0;
		pt_update_intercept_for_msr(vcpu);
	}
	return 0;
}

What happens when L1 executes VMLAUNCH/VMRESUME? / handle_vmresume() / handle_vmlaunch() / nested_vmx_run() KVM

  • Write writable values in the shadow VMCS to VMCS12.
exit_handlers[EXIT_REASON_VMRESUME]	= handle_vmresume;
// launch is true for VMLAUNCH and false for VMRESUME
static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
{
    // vmcs12 has its own dedicated struct, not shared with the other VMCS types:
	struct vmcs12 *vmcs12;
	enum nvmx_vmentry_status status;
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	u32 interrupt_shadow = vmx_get_interrupt_shadow(vcpu);
	enum nested_evmptrld_status evmptrld_status;

    // Fail if the guest vCPU has not executed VMXON yet
	if (!nested_vmx_check_permission(vcpu))
		return 1;

    // Hyper-V (enlightened VMCS) related; not our concern for now
	evmptrld_status = nested_vmx_handle_enlightened_vmptrld(vcpu, launch);
	if (evmptrld_status == EVMPTRLD_ERROR) {
		kvm_queue_exception(vcpu, UD_VECTOR);
		return 1;
	}

    // PMU related; not our concern for now
	kvm_pmu_trigger_event(vcpu, PERF_COUNT_HW_BRANCH_INSTRUCTIONS);

    // Also Hyper-V related; not our concern for now
	if (CC(evmptrld_status == EVMPTRLD_VMFAIL))
		return nested_vmx_failInvalid(vcpu);

	if (CC(!evmptr_is_valid(vmx->nested.hv_evmcs_vmptr) &&
	       vmx->nested.current_vmptr == INVALID_GPA))
		return nested_vmx_failInvalid(vcpu);


    // Get cached_vmcs12
	vmcs12 = get_vmcs12(vcpu);

	/*
	 * Can't VMLAUNCH or VMRESUME a shadow VMCS. Despite the fact
	 * that there *is* a valid VMCS pointer, RFLAGS.CF is set
	 * rather than RFLAGS.ZF, and no error number is stored to the
	 * VM-instruction error field.
	 */
    // We cannot LAUNCH/RESUME a shadow VMCS;
    // the SDM also specifies this.
	if (CC(vmcs12->hdr.shadow_vmcs))
		return nested_vmx_failInvalid(vcpu);

	if (evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)) {
		copy_enlightened_to_vmcs12(vmx, vmx->nested.hv_evmcs->hv_clean_fields);
		/* Enlightened VMCS doesn't have launch state */
		vmcs12->launch_state = !launch;
	} else if (enable_shadow_vmcs) {
        // Copy the writable VMCS shadow fields back to the VMCS12.
        // L1 has been running for a while before this VMLAUNCH/VMRESUME, so the
        // shadow VMCS it accessed during that time (vmcs01.shadow_vmcs) may contain
        // updated values. These updates must be written back into VMCS12.
        // Note that only guest-writable fields are copied back: only those can have
        // been updated, so only they need the write-back.
		copy_shadow_to_vmcs12(vmx);
	}


    // VMLAUNCH must not be used on an already-launched VMCS;
    // VMRESUME must not be used on a VMCS that has never been launched.
	if (CC(vmcs12->launch_state == launch))
		return nested_vmx_fail(vcpu, launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS : VMXERR_VMRESUME_NONLAUNCHED_VMCS);

    // Check the execution/exit/entry controls in VMCS12;
    // fail if anything is set incorrectly.
	if (nested_vmx_check_controls(vcpu, vmcs12))
		return nested_vmx_fail(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);

    // Check the address-space size
	if (nested_vmx_check_address_space_size(vcpu, vmcs12))
		return nested_vmx_fail(vcpu, VMXERR_ENTRY_INVALID_HOST_STATE_FIELD);

    // Check whether the host state is sane.
    // It may not be, if the L1 guest does not handle virtualization properly.
	if (nested_vmx_check_host_state(vcpu, vmcs12))
		return nested_vmx_fail(vcpu, VMXERR_ENTRY_INVALID_HOST_STATE_FIELD);

	// We're finally done with prerequisite checking, and can start with the nested launch/resume.
	vmx->nested.nested_run_pending = 1;
	vmx->nested.has_preemption_timer_deadline = false;
    // As requested, let the guest enter non-root mode (see nested_vmx_enter_non_root_mode() below).
	status = nested_vmx_enter_non_root_mode(vcpu, true);
	if (unlikely(status != NVMX_VMENTRY_SUCCESS))
		goto vmentry_failed;

    // Interrupt related.
	/* Emulate processing of posted interrupts on VM-Enter. */
	if (nested_cpu_has_posted_intr(vmcs12) &&
	    kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
		vmx->nested.pi_pending = true;
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
	}

	/* Hide L1D cache contents from the nested guest.  */
	vmx->vcpu.arch.l1tf_flush_l1d = true;

	/*
	 * Must happen outside of nested_vmx_enter_non_root_mode() as it will
	 * also be used as part of restoring nVMX state for
	 * snapshot restore (migration).
	 *
	 * In this flow, it is assumed that vmcs12 cache was
	 * transferred as part of captured nVMX state and should
	 * therefore not be read from guest memory (which may not
	 * exist on destination host yet).
	 */
	nested_cache_shadow_vmcs12(vcpu, vmcs12);

	switch (vmcs12->guest_activity_state) {
	case GUEST_ACTIVITY_HLT:
		/*
		 * If we're entering a halted L2 vcpu and the L2 vcpu won't be
		 * awakened by event injection or by an NMI-window VM-exit or
		 * by an interrupt-window VM-exit, halt the vcpu.
		 */
		if (!(vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) &&
		    !nested_cpu_has(vmcs12, CPU_BASED_NMI_WINDOW_EXITING) &&
		    !(nested_cpu_has(vmcs12, CPU_BASED_INTR_WINDOW_EXITING) &&
		      (vmcs12->guest_rflags & X86_EFLAGS_IF))) {
			vmx->nested.nested_run_pending = 0;
			return kvm_emulate_halt_noskip(vcpu);
		}
		break;
	case GUEST_ACTIVITY_WAIT_SIPI:
		vmx->nested.nested_run_pending = 0;
		vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
		break;
	default:
		break;
	}

	return 1;
    //...
}

nested_vmx_enter_non_root_mode() KVM

enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu, bool from_vmentry)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
	enum vm_entry_failure_code entry_failure_code;
	bool evaluate_pending_interrupts;
	union vmx_exit_reason exit_reason = {
		.basic = EXIT_REASON_INVALID_STATE,
		.failed_vmentry = 1,
	};
	u32 failed_index;

	kvm_service_local_tlb_flush_requests(vcpu);

	evaluate_pending_interrupts = exec_controls_get(vmx) &
		(CPU_BASED_INTR_WINDOW_EXITING | CPU_BASED_NMI_WINDOW_EXITING);
	if (likely(!evaluate_pending_interrupts) && kvm_vcpu_apicv_active(vcpu))
		evaluate_pending_interrupts |= vmx_has_apicv_interrupt(vcpu);

	if (!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS))
		vmx->nested.vmcs01_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
	if (kvm_mpx_supported() &&
		!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
		vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);

	/*
	 * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*
	 * nested early checks are disabled.  In the event of a "late" VM-Fail,
	 * i.e. a VM-Fail detected by hardware but not KVM, KVM must unwind its
	 * software model to the pre-VMEntry host state.  When EPT is disabled,
	 * GUEST_CR3 holds KVM's shadow CR3, not L1's "real" CR3, which causes
	 * nested_vmx_restore_host_state() to corrupt vcpu->arch.cr3.  Stuffing
	 * vmcs01.GUEST_CR3 results in the unwind naturally setting arch.cr3 to
	 * the correct value.  Smashing vmcs01.GUEST_CR3 is safe because nested
	 * VM-Exits, and the unwind, reset KVM's MMU, i.e. vmcs01.GUEST_CR3 is
	 * guaranteed to be overwritten with a shadow CR3 prior to re-entering
	 * L1.  Don't stuff vmcs01.GUEST_CR3 when using nested early checks as
	 * KVM modifies vcpu->arch.cr3 if and only if the early hardware checks
	 * pass, and early VM-Fails do not reset KVM's MMU, i.e. the VM-Fail
	 * path would need to manually save/restore vmcs01.GUEST_CR3.
	 */
	if (!enable_ept && !nested_early_check)
		vmcs_writel(GUEST_CR3, vcpu->arch.cr3);

	vmx_switch_vmcs(vcpu, &vmx->nested.vmcs02);

    // See the prepare_vmcs02_early() section below
	prepare_vmcs02_early(vmx, &vmx->vmcs01, vmcs12);

    // called from VMLAUNCH/VMRESUME
	if (from_vmentry) {
		if (unlikely(!nested_get_vmcs12_pages(vcpu))) {
			vmx_switch_vmcs(vcpu, &vmx->vmcs01);
			return NVMX_VMENTRY_KVM_INTERNAL_ERROR;
		}

		if (nested_vmx_check_vmentry_hw(vcpu)) {
			vmx_switch_vmcs(vcpu, &vmx->vmcs01);
			return NVMX_VMENTRY_VMFAIL;
		}

		if (nested_vmx_check_guest_state(vcpu, vmcs12,
						 &entry_failure_code)) {
			exit_reason.basic = EXIT_REASON_INVALID_STATE;
			vmcs12->exit_qualification = entry_failure_code;
			goto vmentry_fail_vmexit;
		}
	}

	enter_guest_mode(vcpu);

    // Prepare the rest of VMCS02 (the early part was done above)
	if (prepare_vmcs02(vcpu, vmcs12, from_vmentry, &entry_failure_code)) {
		exit_reason.basic = EXIT_REASON_INVALID_STATE;
		vmcs12->exit_qualification = entry_failure_code;
		goto vmentry_fail_vmexit_guest_mode;
	}

	if (from_vmentry) {
		failed_index = nested_vmx_load_msr(vcpu,
						   vmcs12->vm_entry_msr_load_addr,
						   vmcs12->vm_entry_msr_load_count);
		if (failed_index) {
			exit_reason.basic = EXIT_REASON_MSR_LOAD_FAIL;
			vmcs12->exit_qualification = failed_index;
			goto vmentry_fail_vmexit_guest_mode;
		}
	} else {
		/*
		 * The MMU is not initialized to point at the right entities yet and
		 * "get pages" would need to read data from the guest (i.e. we will
		 * need to perform gpa to hpa translation). Request a call
		 * to nested_get_vmcs12_pages before the next VM-entry.  The MSRs
		 * have already been set at vmentry time and should not be reset.
		 */
		kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
	}

	/*
	 * If L1 had a pending IRQ/NMI until it executed
	 * VMLAUNCH/VMRESUME which wasn't delivered because it was
	 * disallowed (e.g. interrupts disabled), L0 needs to
	 * evaluate if this pending event should cause an exit from L2
	 * to L1 or delivered directly to L2 (e.g. In case L1 don't
	 * intercept EXTERNAL_INTERRUPT).
	 *
	 * Usually this would be handled by the processor noticing an
	 * IRQ/NMI window request, or checking RVI during evaluation of
	 * pending virtual interrupts.  However, this setting was done
	 * on VMCS01 and now VMCS02 is active instead. Thus, we force L0
	 * to perform pending event evaluation by requesting a KVM_REQ_EVENT.
	 */
	if (unlikely(evaluate_pending_interrupts))
		kvm_make_request(KVM_REQ_EVENT, vcpu);

	/*
	 * Do not start the preemption timer hrtimer until after we know
	 * we are successful, so that only nested_vmx_vmexit needs to cancel
	 * the timer.
	 */
	vmx->nested.preemption_timer_expired = false;
	if (nested_cpu_has_preemption_timer(vmcs12)) {
		u64 timer_value = vmx_calc_preemption_timer_value(vcpu);
		vmx_start_preemption_timer(vcpu, timer_value);
	}

	/*
	 * Note no nested_vmx_succeed or nested_vmx_fail here. At this point
	 * we are no longer running L1, and VMLAUNCH/VMRESUME has not yet
	 * returned as far as L1 is concerned. It will only return (and set
	 * the success flag) when L2 exits (see nested_vmx_vmexit()).
	 */
	return NVMX_VMENTRY_SUCCESS;

	/*
	 * A failed consistency check that leads to a VMExit during L1's
	 * VMEnter to L2 is a variation of a normal VMexit, as explained in
	 * 26.7 "VM-entry failures during or after loading guest state".
	 */
vmentry_fail_vmexit_guest_mode:
	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)
		vcpu->arch.tsc_offset -= vmcs12->tsc_offset;
	leave_guest_mode(vcpu);

vmentry_fail_vmexit:
	vmx_switch_vmcs(vcpu, &vmx->vmcs01);

	if (!from_vmentry)
		return NVMX_VMENTRY_VMEXIT;

	load_vmcs12_host_state(vcpu, vmcs12);
	vmcs12->vm_exit_reason = exit_reason.full;
	if (enable_shadow_vmcs || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr))
		vmx->nested.need_vmcs12_to_shadow_sync = true;
	return NVMX_VMENTRY_VMEXIT;
}

prepare_vmcs02_early() KVM

static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs01, struct vmcs12 *vmcs12)
{
	u32 exec_control;
	u64 guest_efer = nested_vmx_calc_efer(vmx, vmcs12);

	if (vmx->nested.dirty_vmcs12 || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr))
		prepare_vmcs02_early_rare(vmx, vmcs12);

	/*
	 * PIN CONTROLS
	 */
	exec_control = __pin_controls_get(vmcs01);
	exec_control |= (vmcs12->pin_based_vm_exec_control & ~PIN_BASED_VMX_PREEMPTION_TIMER);

	/* Posted interrupts setting is only taken from vmcs12. */
	vmx->nested.pi_pending = false;
	if (nested_cpu_has_posted_intr(vmcs12))
		vmx->nested.posted_intr_nv = vmcs12->posted_intr_nv;
	else
		exec_control &= ~PIN_BASED_POSTED_INTR;
	pin_controls_set(vmx, exec_control);

	/*
	 * EXEC CONTROLS
	 */
	exec_control = __exec_controls_get(vmcs01); /* L0's desires */
	exec_control &= ~CPU_BASED_INTR_WINDOW_EXITING;
	exec_control &= ~CPU_BASED_NMI_WINDOW_EXITING;
	exec_control &= ~CPU_BASED_TPR_SHADOW;
	exec_control |= vmcs12->cpu_based_vm_exec_control;

	vmx->nested.l1_tpr_threshold = -1;
	if (exec_control & CPU_BASED_TPR_SHADOW)
		vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);
#ifdef CONFIG_X86_64
	else
		exec_control |= CPU_BASED_CR8_LOAD_EXITING |
				CPU_BASED_CR8_STORE_EXITING;
#endif

	/*
	 * A vmexit (to either L1 hypervisor or L0 userspace) is always needed
	 * for I/O port accesses.
	 */
	exec_control |= CPU_BASED_UNCOND_IO_EXITING;
	exec_control &= ~CPU_BASED_USE_IO_BITMAPS;

	/*
	 * This bit will be computed in nested_get_vmcs12_pages, because
	 * we do not have access to L1's MSR bitmap yet.  For now, keep
	 * the same bit as before, hoping to avoid multiple VMWRITEs that
	 * only set/clear this bit.
	 */
	exec_control &= ~CPU_BASED_USE_MSR_BITMAPS;
	exec_control |= exec_controls_get(vmx) & CPU_BASED_USE_MSR_BITMAPS;

	exec_controls_set(vmx, exec_control);

	/*
	 * SECONDARY EXEC CONTROLS
	 */
	if (cpu_has_secondary_exec_ctrls()) {
		exec_control = __secondary_exec_controls_get(vmcs01);

		/* Take the following fields only from vmcs12 */
		exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
				  SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
				  SECONDARY_EXEC_ENABLE_INVPCID |
				  SECONDARY_EXEC_ENABLE_RDTSCP |
				  SECONDARY_EXEC_XSAVES |
				  SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE |
				  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
				  SECONDARY_EXEC_APIC_REGISTER_VIRT |
				  SECONDARY_EXEC_ENABLE_VMFUNC |
				  SECONDARY_EXEC_TSC_SCALING |
				  SECONDARY_EXEC_DESC);

		if (nested_cpu_has(vmcs12,
				   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
			exec_control |= vmcs12->secondary_vm_exec_control;

		/* PML is emulated and never enabled in hardware for L2. */
		exec_control &= ~SECONDARY_EXEC_ENABLE_PML;

		/* VMCS shadowing for L2 is emulated for now */
		exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS;

		/*
		 * Preset *DT exiting when emulating UMIP, so that vmx_set_cr4()
		 * will not have to rewrite the controls just for this bit.
		 */
		if (!boot_cpu_has(X86_FEATURE_UMIP) && vmx_umip_emulated() &&
		    (vmcs12->guest_cr4 & X86_CR4_UMIP))
			exec_control |= SECONDARY_EXEC_DESC;

		if (exec_control & SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY)
			vmcs_write16(GUEST_INTR_STATUS,
				vmcs12->guest_intr_status);

		if (!nested_cpu_has2(vmcs12, SECONDARY_EXEC_UNRESTRICTED_GUEST))
		    exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;

		if (exec_control & SECONDARY_EXEC_ENCLS_EXITING)
			vmx_write_encls_bitmap(&vmx->vcpu, vmcs12);

		secondary_exec_controls_set(vmx, exec_control);
	}

	/*
	 * ENTRY CONTROLS
	 *
	 * vmcs12's VM_{ENTRY,EXIT}_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
	 * are emulated by vmx_set_efer() in prepare_vmcs02(), but speculate
	 * on the related bits (if supported by the CPU) in the hope that
	 * we can avoid VMWrites during vmx_set_efer().
	 */
	exec_control = __vm_entry_controls_get(vmcs01);
	exec_control |= vmcs12->vm_entry_controls;
	exec_control &= ~(VM_ENTRY_IA32E_MODE | VM_ENTRY_LOAD_IA32_EFER);
	if (cpu_has_load_ia32_efer()) {
		if (guest_efer & EFER_LMA)
			exec_control |= VM_ENTRY_IA32E_MODE;
		if (guest_efer != host_efer)
			exec_control |= VM_ENTRY_LOAD_IA32_EFER;
	}
	vm_entry_controls_set(vmx, exec_control);

	/*
	 * EXIT CONTROLS
	 *
	 * L2->L1 exit controls are emulated - the hardware exit is to L0 so
	 * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
	 * bits may be modified by vmx_set_efer() in prepare_vmcs02().
	 */
	exec_control = __vm_exit_controls_get(vmcs01);
	if (cpu_has_load_ia32_efer() && guest_efer != host_efer)
		exec_control |= VM_EXIT_LOAD_IA32_EFER;
	else
		exec_control &= ~VM_EXIT_LOAD_IA32_EFER;
	vm_exit_controls_set(vmx, exec_control);

	/*
	 * Interrupt/Exception Fields
	 */
	if (vmx->nested.nested_run_pending) {
		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
			     vmcs12->vm_entry_intr_info_field);
		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
			     vmcs12->vm_entry_exception_error_code);
		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
			     vmcs12->vm_entry_instruction_len);
		vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
			     vmcs12->guest_interruptibility_info);
		vmx->loaded_vmcs->nmi_known_unmasked =
			!(vmcs12->guest_interruptibility_info & GUEST_INTR_STATE_NMI);
	} else {
		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
	}
}

handle_vmwrite() KVM

static int handle_vmwrite(struct kvm_vcpu *vcpu)
{
	struct vmcs12 *vmcs12 = is_guest_mode(vcpu) ? get_shadow_vmcs12(vcpu)
						    : get_vmcs12(vcpu);
	unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
	u32 instr_info = vmcs_read32(VMX_INSTRUCTION_INFO);
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	struct x86_exception e;
	unsigned long field;
	short offset;
	gva_t gva;
	int len, r;

	/*
	 * The value to write might be 32 or 64 bits, depending on L1's long
	 * mode, and eventually we need to write that into a field of several
	 * possible lengths. The code below first zero-extends the value to 64
	 * bit (value), and then copies only the appropriate number of
	 * bits into the vmcs12 field.
	 */
	u64 value = 0;

	if (!nested_vmx_check_permission(vcpu))
		return 1;

	/*
	 * In VMX non-root operation, when the VMCS-link pointer is INVALID_GPA,
	 * any VMWRITE sets the ALU flags for VMfailInvalid.
	 */
	if (vmx->nested.current_vmptr == INVALID_GPA ||
	    (is_guest_mode(vcpu) &&
	     get_vmcs12(vcpu)->vmcs_link_pointer == INVALID_GPA))
		return nested_vmx_failInvalid(vcpu);

	if (instr_info & BIT(10))
		value = kvm_register_read(vcpu, (((instr_info) >> 3) & 0xf));
	else {
		len = is_64_bit_mode(vcpu) ? 8 : 4;
		if (get_vmx_mem_address(vcpu, exit_qualification,
					instr_info, false, len, &gva))
			return 1;
		r = kvm_read_guest_virt(vcpu, gva, &value, len, &e);
		if (r != X86EMUL_CONTINUE)
			return kvm_handle_memory_failure(vcpu, r, &e);
	}

	field = kvm_register_read(vcpu, (((instr_info) >> 28) & 0xf));

	offset = get_vmcs12_field_offset(field);
	if (offset < 0)
		return nested_vmx_fail(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);

	/*
	 * If the vCPU supports "VMWRITE to any supported field in the
	 * VMCS," then the "read-only" fields are actually read/write.
	 */
	if (vmcs_field_readonly(field) &&
	    !nested_cpu_has_vmwrite_any_field(vcpu))
		return nested_vmx_fail(vcpu, VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT);

	/*
	 * Ensure vmcs12 is up-to-date before any VMWRITE that dirties
	 * vmcs12, else we may crush a field or consume a stale value.
	 */
	if (!is_guest_mode(vcpu) && !is_shadow_field_rw(field))
		copy_vmcs02_to_vmcs12_rare(vcpu, vmcs12);

	/*
	 * Some Intel CPUs intentionally drop the reserved bits of the AR byte
	 * fields on VMWRITE.  Emulate this behavior to ensure consistent KVM
	 * behavior regardless of the underlying hardware, e.g. if an AR_BYTE
	 * field is intercepted for VMWRITE but not VMREAD (in L1), then VMREAD
	 * from L1 will return a different value than VMREAD from L2 (L1 sees
	 * the stripped down value, L2 sees the full value as stored by KVM).
	 */
	if (field >= GUEST_ES_AR_BYTES && field <= GUEST_TR_AR_BYTES)
		value &= 0x1f0ff;

	vmcs12_write_any(vmcs12, field, offset, value);

	/*
	 * Do not track vmcs12 dirty-state if in guest-mode as we actually
	 * dirty shadow vmcs12 instead of vmcs12.  Fields that can be updated
	 * by L1 without a vmexit are always updated in the vmcs02, i.e. don't
	 * "dirty" vmcs12, all others go down the prepare_vmcs02() slow path.
	 */
    // This path is taken when the exit came from L1 (not L2) and the field is not a
    // shadow read-write field. If the exit came from L2, it is the shadow vmcs12 that
    // gets dirtied instead; if the field is shadow read-write, it has already been
    // written straight into VMCS02.
    // Note that the shadow VMCS feature can be emulated; hardware support is not required.
	if (!is_guest_mode(vcpu) && !is_shadow_field_rw(field)) {
		// L1 can read these fields without exiting, ensure the shadow VMCS is up-to-date.
        // If L1 can read this field without a VMExit, we must write the up-to-date value
        // into the shadow VMCS first, so that L1 does not read a stale value.
		if (enable_shadow_vmcs && is_shadow_field_ro(field)) {
			preempt_disable();
			vmcs_load(vmx->vmcs01.shadow_vmcs);

			__vmcs_writel(field, value);

			vmcs_clear(vmx->vmcs01.shadow_vmcs);
			vmcs_load(vmx->loaded_vmcs->vmcs);
			preempt_enable();
		}
		vmx->nested.dirty_vmcs12 = true;
	}

	return nested_vmx_succeed(vcpu);
}

Maintain a linked list of all VMCSs loaded on this CPU, so we can clear them if the CPU goes down.

This presumably contains the VMCS of the L1 vCPU plus the VMCSs of however many L2 vCPUs.

struct loaded_vmcs {
    //...
	struct list_head loaded_vmcss_on_cpu_link;
    //...
};

The list entry is added when a VMCS is loaded onto this physical CPU:

kvm_arch_vcpu_load
    // cpu is normally the CPU we are currently running on.
	static_call(kvm_x86_vcpu_load)(vcpu, cpu);
        vt_vcpu_load
            vmx_vcpu_load
                vmx_vcpu_load_vmcs
vmx_vcpu_load_vmcs
    bool already_loaded = vmx->loaded_vmcs->cpu == cpu;
	if (!already_loaded) {
		loaded_vmcs_clear(vmx->loaded_vmcs);
        //...
        // Add the current VMCS to this physical CPU's list
		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link, &per_cpu(loaded_vmcss_on_cpu, cpu));
	}

The list entry is removed here:

// When loading a new VMCS, the currently loaded VMCS must first be removed from the list.
vmx_vcpu_load_vmcs
	if (!already_loaded) {
        loaded_vmcs_clear
            __loaded_vmcs_clear
            	list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link);

The list is referenced via per_cpu(loaded_vmcss_on_cpu, cpu), which shows that each physical CPU has its own list:

DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
//...
list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu), loaded_vmcss_on_cpu_link)
	vmcs_clear(v->vmcs);

handle_vmx_instruction() KVM

When nested is disabled in KVM (nested=0), the guest must not execute any VMX instruction, because the VMX feature has not been exposed to the guest. So every VMX instruction gets the same handler, which raises an exception:

[EXIT_REASON_VMCLEAR]		      = handle_vmx_instruction,
[EXIT_REASON_VMLAUNCH]		      = handle_vmx_instruction,
[EXIT_REASON_VMPTRLD]		      = handle_vmx_instruction,
[EXIT_REASON_VMPTRST]		      = handle_vmx_instruction,
[EXIT_REASON_VMREAD]		      = handle_vmx_instruction,
[EXIT_REASON_VMRESUME]		      = handle_vmx_instruction,
[EXIT_REASON_VMWRITE]		      = handle_vmx_instruction,
[EXIT_REASON_VMOFF]		      = handle_vmx_instruction,
[EXIT_REASON_VMON]		      = handle_vmx_instruction,

static int handle_vmx_instruction(struct kvm_vcpu *vcpu)
{
	kvm_queue_exception(vcpu, UD_VECTOR);
	return 1;
}

How does L0 know which VMCSs are 01 VMCSs and which are 02s?

What's the relationship between 01, 02 and 12?

Why does the VMCS need to be merged? (from 01 and 12 into 02)

Why a CPU model will or won't enable VMX (nested) by default

Disable:

[PATCH v3 5/6] target-i386: Don't enable nested VMX by default - Eduardo Habkost

But nowadays nested is enabled by default: "The nested VMX feature is enabled by default since Linux kernel v4.20. For older Linux kernels, it can be enabled by giving the 'nested=1' option to the kvm-intel module." So it seems fine now for the CPU model to turn the vmx feature on.

kvm_x86_nested_ops

vmx_secondary_exec_control(), vmx_tertiary_exec_control() KVM

These return the exec_control value that should be programmed for a vCPU.

struct nested_vmx KVM

struct vcpu_vmx {
    //...
	/* Support for a guest hypervisor (nested VMX) */
	struct nested_vmx nested;
    //...
}

struct nested_vmx {
    //...
	// Cache of the guest's VMCS, existing **outside of guest memory**. (VMCS12 is allocated
	// by the guest, so it lives in the guest's own memory.) Loaded from **guest memory**
	// during VMPTRLD. Flushed to guest memory during VMCLEAR and VMPTRLD.
    // This is the real VMCS12.
	struct vmcs12 *cached_vmcs12;
	// Cache of the guest's shadow VMCS, existing **outside of guest memory**.
	// Loaded from guest memory during VM entry. Flushed to guest memory during VM exit.
	struct vmcs12 *cached_shadow_vmcs12;
    // The VMX capability reporting values that L1's KVM can use when L1 itself acts as a hypervisor
	struct nested_vmx_msrs msrs;
    //...
};

Cached VMCS12

nested_vmx_setup_ctls_msrs() / KVM

This function sets up the global VMX capability-reporting MSR values, not those of any particular VM: it writes vmcs_config.nested rather than vcpu_vmx->nested.msrs (both are of type struct nested_vmx_msrs).

vt_hardware_setup
    vmx_hardware_setup
        nested_vmx_setup_ctls_msrs

Different VMs do not necessarily expose the same VMX capabilities; for example we might run:

-cpu host,+vmx,-vmx-tsc-scaling

In that case different VMs should see different capability-reporting MSRs. How is that guaranteed?

/*
 * nested_vmx_setup_ctls_msrs() sets up variables containing the values to be
 * returned for the various VMX controls MSRs when nested VMX is enabled.
 * The same values should also be used to verify that vmcs12 control fields are
 * valid during nested entry from L1 to L2.
 * Each of these control msrs has a low and high 32-bit half: A low bit is on
 * if the corresponding bit in the (32-bit) control field *must* be on, and a
 * bit in the high half is on if the corresponding bit in the control field
 * may be on. See also vmx_control_verify().
 */
void nested_vmx_setup_ctls_msrs(struct vmcs_config *vmcs_conf, u32 ept_caps)
{
	struct nested_vmx_msrs *msrs = &vmcs_conf->nested;

	/*
	 * Note that as a general rule, the high half of the MSRs (bits in
	 * the control fields which may be 1) should be initialized by the
	 * intersection of the underlying hardware's MSR (i.e., features which
	 * can be supported) and the list of features we want to expose -
	 * because they are known to be properly supported in our code.
	 * Also, usually, the low half of the MSRs (bits which must be 1) can
	 * be set to 0, meaning that L1 may turn off any of these bits. The
	 * reason is that if one of these bits is necessary, it will appear
	 * in vmcs01 and prepare_vmcs02, when it bitwise-or's the control
	 * fields of vmcs01 and vmcs02, will turn these bits off - and
	 * nested_vmx_l1_wants_exit() will not pass related exits to L1.
	 * These rules have exceptions below.
	 */
    // All of these setup functions configure the capability-reporting MSRs exposed to the guest
	nested_vmx_setup_pinbased_ctls(vmcs_conf, msrs);
	nested_vmx_setup_exit_ctls(vmcs_conf, msrs);
	nested_vmx_setup_entry_ctls(vmcs_conf, msrs);
	nested_vmx_setup_cpubased_ctls(vmcs_conf, msrs);
	nested_vmx_setup_secondary_ctls(ept_caps, vmcs_conf, msrs);
	nested_vmx_setup_misc_data(vmcs_conf, msrs);
	nested_vmx_setup_basic(msrs);
	nested_vmx_setup_cr_fixed(msrs);

	msrs->vmcs_enum = nested_vmx_calc_vmcs_enum_msr();
}

nested_vmx_setup_pinbased_ctls() / KVM

is_guest_mode() / leave_guest_mode() / enter_guest_mode() (KVM)

L2 is active.

After a VMExit, we need to tell whether it was the L1 guest, the L2 guest, the L3 guest, ... that exited, because every VMExit lands in L0, and otherwise L0 could not tell them apart.

"In guest mode" means the exit came from L2 or deeper, not from L1.

#define HF_GUEST_MASK		(1 << 5) /* VCPU is in guest-mode */
static inline bool is_guest_mode(struct kvm_vcpu *vcpu)
{
	return vcpu->arch.hflags & HF_GUEST_MASK;
}

static inline void enter_guest_mode(struct kvm_vcpu *vcpu)
{
	vcpu->arch.hflags |= HF_GUEST_MASK;
	vcpu->stat.guest_mode = 1;
}

// This function is called on the way out of a VMExit, e.g. from:
// vmx_handle_exit()
// __vmx_handle_exit()
// nested_vmx_vmexit()
static inline void leave_guest_mode(struct kvm_vcpu *vcpu)
{
	vcpu->arch.hflags &= ~HF_GUEST_MASK;

	if (vcpu->arch.load_eoi_exitmap_pending) {
		vcpu->arch.load_eoi_exitmap_pending = false;
		kvm_make_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu);
	}

	vcpu->stat.guest_mode = 0;
}

VMCS Shadowing

VMCS Shadowing exists precisely for the nested VMX scenario.

KVM: we can emulate "VMCS shadowing" even if the hardware doesn't support it:

static bool __read_mostly enable_shadow_vmcs = 1;
#define SECONDARY_EXEC_SHADOW_VMCS		0x00004000
nested_vmx_setup_secondary_ctls
    msrs->secondary_ctls_high |= SECONDARY_EXEC_SHADOW_VMCS;

In Secondary Processor-Based VM-Execution Controls. Bit 14.

If this control is 1, executions of VMREAD and VMWRITE in VMX non-root operation may access a shadow VMCS (instead of causing VM exits).

Every VMCS is either an ordinary VMCS or a shadow VMCS. A VMCS’s type is determined by the shadow-VMCS indicator in the VMCS region: 0 indicates an ordinary VMCS, while 1 indicates a shadow VMCS.

A shadow VMCS differs from an ordinary VMCS in two ways:

  • An ordinary VMCS can be used for VM entry but a shadow VMCS cannot. Attempts to perform VM entry when the current VMCS is a shadow VMCS fail.
  • The VMREAD and VMWRITE instructions can be used in VMX non-root operation to access a shadow VMCS but not an ordinary VMCS.

In VMX root operation, both types of VMCSs can be accessed with the VMREAD and VMWRITE instructions. This makes it convenient for KVM to manage both the VMCS and the shadow VMCS.

Relationship between the Shadow VMCS, VMCS12 and VMCS02

VMCS01's VMCS link pointer points to the shadow VMCS.

For L2, i.e. VMCS02, what is linked should be something like the shadow VMCS12.

There is no "shadow VMCS02": VMCS02 is what L0 prepares for L2, whereas the shadow VMCS12 is the shadow VMCS that L1 prepares for L2. It does not need to be processed or merged by L0, so it is simply called shadow VMCS12.

VMREAD_BITMAP/VMWRITE_BITMAP

If the “VMCS shadowing” VM-execution control is 1, executions of VMREAD and VMWRITE may consult these bitmaps.

When the corresponding bit in VMREAD_BITMAP is 1, a guest VMREAD of that VMCS field causes a VMExit. See:

SDM:
VMREAD: Read Field from Virtual-Machine Control Structure

When the corresponding bit in VMWRITE_BITMAP is 1, a guest VMWRITE of that VMCS field causes a VMExit. See:

SDM:
VMWRITE: Write Field to Virtual-Machine Control Structure

From this we can infer that to make a VMCS field read-only for the guest, we need to:

  • clear the corresponding bit in VMREAD_BITMAP, so reads are not intercepted;
  • set the corresponding bit in VMWRITE_BITMAP, so writes are intercepted.

To make a VMCS field read-write for the guest, we need to:

  • clear the corresponding bit in VMREAD_BITMAP;
  • clear the corresponding bit in VMWRITE_BITMAP.
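A small sketch of what this looks like in KVM's terms, reusing the vmx_vmread_bitmap/vmx_vmwrite_bitmap pages referenced later in this note (indexing the bitmap by the raw field encoding, as init_vmcs_shadow_fields() does; the two helper names are made up for illustration):

/* Sketch only: flip a shadowed field between guest read-only and guest read-write.
 * vmx_vmread_bitmap / vmx_vmwrite_bitmap are the 4 KiB pages that the
 * VMREAD_BITMAP / VMWRITE_BITMAP VMCS fields point at. */
static void make_field_guest_ro(unsigned long field)
{
	clear_bit(field, vmx_vmread_bitmap);	/* VMREAD hits the shadow VMCS, no exit */
	set_bit(field, vmx_vmwrite_bitmap);	/* VMWRITE still causes a VMExit        */
}

static void make_field_guest_rw(unsigned long field)
{
	clear_bit(field, vmx_vmread_bitmap);
	clear_bit(field, vmx_vmwrite_bitmap);
}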

is_shadow_field_rw() / is_shadow_field_ro() KVM

  • rw: L1 can update this field without a VMExit;
  • ro: L1 can read this field without a VMExit;
  • ordinary field: both reads and writes trigger a VMExit.

Some fields are RW and some are RO; they are all defined in arch/x86/kvm/vmx/vmcs_shadow_fields.h.

Is whether a field is RW or RO dictated by the hardware, or is it a software-defined choice?

/*
 * We do NOT shadow fields that are modified when L0 traps and emulates any vmx instruction (e.g. VMPTRLD,
 * VMXON...) executed by L1.
 *
 * For example, VM_INSTRUCTION_ERROR is read by L1 if a vmx instruction fails (part of the error path).
 * Note the code assumes this logic. If for some reason we start shadowing these fields then we need to
 * force a shadow sync when L0 emulates vmx instructions (e.g. force a sync if VM_INSTRUCTION_ERROR is modified
 * by nested_vmx_failValid).
 // "Force a sync" means syncing the VM_INSTRUCTION_ERROR field into the shadow VMCS,
 // because L1 does not VMExit when reading that field. Why can't we have that kind of sync?
 *
 * When adding or removing fields here, note that shadowed
 * fields must always be synced by prepare_vmcs02, not just
 * prepare_vmcs02_rare.
 */

/*
 * Keeping the fields ordered by size is an attempt at improving
 * branch prediction in vmcs12_read_any and vmcs12_write_any.
 */
static bool is_shadow_field_rw(unsigned long field)
{
	switch (field) {
#define SHADOW_FIELD_RW(x, y) case x:
#include "vmcs_shadow_fields.h"
		return true;
	default:
		break;
	}
	return false;
}

static bool is_shadow_field_ro(unsigned long field)
{
	switch (field) {
#define SHADOW_FIELD_RO(x, y) case x:
#include "vmcs_shadow_fields.h"
		return true;
	default:
		break;
	}
	return false;
}

We can see that the bits in the vmread/vmwrite bitmaps are updated when the shadow fields are initialized:

// For the clearing logic, see the VMREAD_BITMAP/VMWRITE_BITMAP section above
init_vmcs_shadow_fields
    struct shadow_vmcs_field entry = shadow_read_only_fields[i];
        clear_bit(field, vmx_vmread_bitmap);
    struct shadow_vmcs_field entry = shadow_read_write_fields[i];
        clear_bit(field, vmx_vmwrite_bitmap);
		clear_bit(field, vmx_vmread_bitmap);

void nested_vmx_set_vmcs_shadowing_bitmap(void)
{
	if (enable_shadow_vmcs) {
		vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap));
		vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap));
	}
}

VMREAD_BITMAP / VMWRITE_BITMAP

These are two VMCS fields, with encodings 0x00002026 and 0x00002028.

If the “VMCS shadowing” VM-execution control is 1, executions of VMREAD and VMWRITE may consult these bitmaps.

What is "shadow VMCS sync" in VMCS shadowing?

Notes on the nested virtualization paper

The paper: Ben-Yehuda.pdf, an OSDI 2010 paper.

Because of the lack of architectural support for nested virtualization, an x86 guest hypervisor cannot use the hardware virtualization support directly to run its own guests.

There are two possible models for nested virtualization, which differ in the amount of support provided by the underlying architecture.

  • In the first model, multi-level architectural support for nested virtualization, each hypervisor handles all traps caused by sensitive instructions of any guest hypervisor running directly on top of it. This model is implemented, for example, in the IBM System z architecture.
  • In the second model, single-level architectural support for nested virtualization, execution returns to the level-0 trap handler regardless of the level at which the trap occurred. Both VMX and SVM are based on this model.

In short: with single-level support, any level $L_n$ traps to $L_0$; with multi-level support, $L_n$ traps to $L_{n-1}$.

Nesting can go many levels deep: "Since the Intel x86 architecture is a single-level virtualization architecture, only a single hypervisor can use the processor's VMX instructions to run its guests. For unmodified guest hypervisors to use VMX instructions, this single bare-metal hypervisor, which we call L0, needs to emulate VMX. This emulation of VMX can work recursively. Given that L0 provides a faithful emulation of the VMX hardware any time there is a trap on VMX instructions, the guest running on L1 will not know it is not running directly on the hardware. Building on this infrastructure, the guest at L1 is itself able to use the same techniques to emulate the VMX hardware to an L2 hypervisor which can then run its L3 guests. More generally, given that the guest at Ln−1 provides a faithful emulation of VMX to guests at Ln, a guest at Ln can use the exact same techniques to emulate VMX for a guest at Ln+1. We thus limit our discussion below to L0, L1, and L2."

Thus L0 multiplexes the hardware between L1 and L2, both of which end up running as L0 virtual machines.

When any hypervisor or virtual machine causes a trap, the L0 trap handler is called. The trap handler then inspects the trapping instruction and its context, and decides whether that trap should be handled by L0 (e.g., because the trapping context was L1) or whether to forward it to the responsible hypervisor (e.g., because the trap occurred in L2 and should be handled by L1). In the latter case, L0 forwards the trap to L1 for handling.

Nested virtualization for CPU

Although VMCS1→2 is never loaded into the processor, L0 uses it to emulate a VMX enabled CPU for L1.

  • In general, when L0 emulates VMX instructions, it updates VMCS structures according to the update process described in the next section. Then, L0 resumes L1, as though the instructions were executed directly by the CPU. Most of the VMX instructions executed by L1 cause, first, a VMExit from L1 to L0, and then a VMEntry from L0 to L1.
  • For the instructions used to run a new VM, vmresume and vmlaunch, the process is different, since L0 needs to emulate a VMEntry from L1 to L2. Therefore, any execution of these instructions by L1 causes, first, a VMExit from L1 to L0, and then a VMEntry from L0 to L2.

The control data of VMCS1→2 and VMCS0→1 must be merged to correctly emulate the processor behavior. For example, consider the case where L1 specifies to trap an event EA in VMCS1→2 but L0 does not trap such event for L1 (i.e., a trap is not specified in VMCS0→1). To forward the event EA to L1, L0 needs to specify the corresponding trap in VMCS0→2. In addition, the field used by L1 to inject events to L2 needs to be merged, as well as the fields used by the processor to specify the exit cause.

When L2 is running and a VMExit occurs there are two possible handling paths, depending on whether the VMExit must be handled only by L0 or must be forwarded to L1.

  • When the event causing the VMExit is related to L0 only, L0 handles the event and resumes L2. This kind of event can be an external interrupt, a non-maskable interrupt (NMI) or any trappable event specified in VMCS0→2 that was not specified in VMCS1→2. From L1’s perspective this event does not exist because it was generated outside the scope of L1’s virtualized environment. By analogy to the non-nested scenario, an event occurred at the hardware level, the CPU transparently handled it, and the hypervisor continued running as before.
  • The second handling path is caused by events related to L1 (e.g., trappable events specified in VMCS1→2). In this case L0 forwards the event to L1 by copying VMCS0→2 fields updated by the processor to VMCS1→2 and resuming L1. The hypervisor running in L1 believes there was a VMExit directly from L2 to L1. The L1 hypervisor handles the event and later on resumes L2 by executing vmresume or vmlaunch, both of which will be emulated by L0.
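In current KVM this decision is taken on every L2 VMExit. A trimmed-down sketch of nested_vmx_reflect_vmexit() in arch/x86/kvm/vmx/nested.c (simplified; the real function also deals with vmx->fail, tracing, and various corner cases):

bool nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu)
{
	union vmx_exit_reason exit_reason = to_vmx(vcpu)->exit_reason;

	/* Exits that L0 itself must handle (e.g. external interrupts, or events
	 * that only L0 enabled in vmcs02) are never forwarded, even if L1 also
	 * asked for them. */
	if (nested_vmx_l0_wants_exit(vcpu, exit_reason))
		return false;

	/* If L1 did not ask to intercept this event in vmcs12, L0 handles it
	 * transparently and resumes L2. */
	if (!nested_vmx_l1_wants_exit(vcpu, exit_reason))
		return false;

	/* Otherwise emulate an L2->L1 VMExit: copy the exit information from
	 * vmcs02 into vmcs12, switch back to vmcs01 and resume L1. */
	nested_vmx_vmexit(vcpu, exit_reason.full, vmx_get_intr_info(vcpu),
			  vmx_get_exit_qual(vcpu));
	return true;
}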

The paper also covers nested virtualization of the MMU and of I/O; we will look at those later.