Spec: 341431-remote-action-request-white-paper.pdf

Remote Action Request (RAR) is introduced in Intel Architecture as a model-specific feature to speed up inter-processor operations by moving parts of those operations from software (OS, App) to hardware (the IA core).

RAR speeds up remote TLB shootdowns by allowing them to be serviced while a long instruction is executing on the remote processor, or while interrupts are disabled on that processor.

When the target pCPU receives a RAR while in non-root mode, it does not VMExit!

Concretely, a RAR can replace an IPI, so its use cases are not limited to TLB shootdown.

Benefits:

  • Fewer context switches: RAR doesn't need to be handled by software, so there are fewer context switches caused by interrupts.
  • Lower latency: if a long instruction is executing on the remote processor, or interrupts are disabled on that processor, a TLB flush using an IPI has long latency, but RAR
    • doesn't need interrupts and
    • can be handled while a long instruction is executing,

so it has lower latency.

ๆณจๆ„ PVRAR ๅ’Œ RAR ็š„ๅŒบๅˆซ๏ผŒRAR ๆœฌ่บซๆ˜ฏไธ€ไธช bare-metal ็š„ featureใ€‚ๆ˜ฏๅฏไปฅ็”จๆฅๅŠ ้€Ÿ baremetal ไธŠ็š„ TLB invalidation ็š„ใ€‚

The reason RAR was introduced on bare metal is to avoid sending an IPI to the RLP: the software on the RLP side no longer needs to participate, the hardware just does the work directly, which improves efficiency. The underlying idea is that an IPI can be seen as a more general mechanism for asking the RLP to do anything, depending on the handler on the RLP side; but because in most cases the workloads are only a handful of kinds, such as TLB shootdown, EPT invalidation, and so on, we can implement the flows of these workloads in hardware so they execute faster. In other words, the essence of RAR is trading flexibility for efficiency.

RAR adds two new MSRs:

  • RAR_INFO (read-only): reports capabilities.
  • RAR_CONTROL: enables the feature.

How to perform a RAR?

Signaling a RAR is done similarly to sending an INTR, by writing to the Interrupt Command Register (ICR).

A new Delivery Mode is added: RAR.
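As a minimal sketch of what the initiating LP does: the delivery-mode field sits in bits [10:8] of the architectural x2APIC ICR layout, but the actual encoding assigned to the RAR delivery mode is model-specific, so `RAR_DELIVERY_MODE` below is a made-up placeholder.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding: the real value assigned to the new RAR
 * delivery mode is model-specific and not given in this note. */
#define RAR_DELIVERY_MODE 0x3ULL

/* Compose an x2APIC ICR value: destination APIC ID in bits [63:32],
 * delivery mode in bits [10:8], vector in bits [7:0]. Signaling the
 * RAR is then a single WRMSR of this value to the ICR (MSR 0x830). */
static uint64_t rar_icr_value(uint32_t dest_apic_id, uint8_t vector)
{
	return ((uint64_t)dest_apic_id << 32) |
	       (RAR_DELIVERY_MODE << 8) |
	       vector;
}
```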

RAR Physical Memory Regions

Two memory regions must be pointed to by MSRs:

  • Payload table;
  • Action vector.

Payload Table

It is a 4K region containing 64 entries of 64 bytes each. The payload table is also called a RAR island.

The payload table contains RAR payloads. It is allocated by the OS in contiguous physical memory and pointed to by each LP's RAR_PAYLOAD_TABLE_BASE MSR. Each entry defines the payload for a single action.

This architecture doesn't preclude the option for the OS to allocate multiple Payload Tables, one per subset of LPs, and to set each Logical Processor's RAR_PAYLOAD_TABLE_BASE MSR accordingly.

ไนŸๅฐฑๆ˜ฏไธๅŒ LP ็š„ MSR ้ƒฝๅฏไปฅ share ๅŒไธ€ไธช payload table ็š„ๅŸบๅ€ใ€‚ๅฝ“็„ถไนŸๅฏไปฅไธ share๏ผˆไปฃ็ ้‡Œ็›ฎๅ‰็š„ๅฎž็Žฐๆ˜ฏ share ็š„๏ผ‰๏ผš

In a PCID enabled system where software threads are allocated each with a different PCID, the OS may choose to allocate a separate Payload Table for each RLP, i.e., each RLP is its own โ€˜RAR Islandโ€™.

่ฟ™็งๆžถๆž„ๅนถๆฒก้™ๅˆถ็”จๅคšไธช Payload table๏ผŒๆž็ซฏไธ€็‚น๏ผŒ็”š่‡ณ OS ๅฏไปฅไธบๆฏไธ€ไธช LP ๅˆ†้…ไธ€ไธช Payload tableใ€‚

The payload table has a fixed size of 4KB and contains 64 entries of 64 bytes (512 bits) each.
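The sizes can be captured in a sketch like the following; the entry and table sizes come from the text above, but the internal field layout of an entry is a guess for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* One payload entry: 64 bytes (512 bits). Only the sizes are from the
 * spec text; the internal field layout here is hypothetical. */
struct rar_payload {
	uint32_t type;        /* e.g. 4 = VPID invalidation */
	uint32_t subtype;     /* refines the action within the type */
	uint64_t operands[7]; /* type-specific data (addresses, PCID, ...) */
};

/* The payload table: a fixed 4 KB region holding 64 entries,
 * pointed to by each LP's RAR_PAYLOAD_TABLE_BASE MSR. */
struct rar_payload_table {
	struct rar_payload entries[64];
};
```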

RAR Payload Type / Subtype

ๆฏไธ€ไธช payload ้ƒฝๆœ‰ Type ๅ’Œ Subtype ไธคไธชๅŸŸใ€‚Type ๅŸŸ๏ผš

  • Type 0 = Page invalidation; invalidate one or more pages.
  • Type 1 = Page invalidation; invalidate one or more pages, ignore CR3 match.
  • Type 2 = PCID invalidation; invalidate pages associated with a specific PCID.
  • Type 3 = EPT invalidation; invalidate pages associated with a specific EPTP.
  • Type 4 = VPID invalidation; invalidate pages associated with a specific VPID.
  • Type 5 = MSR write; write value to specified MSR.
  • Other types reserved for future usages.

For RAR-based PV TLB flush, Type 4 (VPID invalidation) is used.

Subtype ๅŸŸ็”จๆฅๅ†ณๅฎš่ฟ™ไธช type ็ป†ๅˆ†็š„ๅŠจไฝœ๏ผŒๆฏ”ๅฆ‚ Type 0 ๆ˜ฏ page invalidation๏ผŒsubtype ๅฐฑ็”จๆฅๅ†ณๅฎšๅš address specific ็š„ invalidation ่ฟ˜ๆ˜ฏ่ฏดๆ‰ง่กŒ all-context ็š„ invalidationใ€‚

Action Vector

The action vector is a per-RLP, 64-byte-aligned vector of actions containing 64 entries of 8 bits each. It is pointed to by the RAR Action Vector MSR. Each entry holds the status of the corresponding requested action.

ไธบไป€ไนˆ่ฆๆœ‰ 64 ไธช entries ๅ‘ข๏ผŸๆ˜ฏไธบไบ†ๅ’Œ payload^ ็š„ๆ•ฐ้‡ๅฏนๅบ”ใ€‚Each 8-bit entry j (0 <= j < N) defines the per-RLP status of the jth action request;

  • 0x00 = RAR_SUCCESS.
  • 0x01 = RAR_PENDING.
  • 0x02 = RAR_ACKNOWLEDGED.
  • 0x80 = RAR_FAILURE.
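A sketch of how the initiating LP might consume these status codes after signaling a RAR. This is a simplification: a real implementation would need a timeout path and proper memory ordering on the loads.

```c
#include <assert.h>
#include <stdint.h>

#define RAR_SUCCESS      0x00
#define RAR_PENDING      0x01
#define RAR_ACKNOWLEDGED 0x02
#define RAR_FAILURE      0x80

/* Spin until entry j of one RLP's action vector leaves the in-progress
 * states, then return the final status (RAR_SUCCESS or RAR_FAILURE).
 * The vector is the 64-entry region pointed to by that RLP's RAR
 * Action Vector MSR. */
static uint8_t rar_wait_for_action(volatile uint8_t *action_vector, int j)
{
	uint8_t status;

	do {
		status = action_vector[j];
	} while (status == RAR_PENDING || status == RAR_ACKNOWLEDGED);

	return status;
}
```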

RAR TLB Flush Design

Shoot4U

Shoot4U is the paper's name; it is the same as PV_SEND_IPI in KVM.

It is different from KVM's PV-based TLB flush mechanism: there, when the target vCPU is online and NOT preempted, the guest still needs to send a vIPI to the target vCPU; in Shoot4U, all vIPIs are translated into physical IPIs, and the polling is also done by the VMM, not the guest. So there is no need to inject an interrupt back into the vCPUs to perform the TLB flush.

Traditional vIPI-based TLB flush (note: vIPI is not the same as IPIv) / PV-based flush when the vCPU is NOT preempted:

  1. Prepare payload in memory
  2. vCPU0's vIPI send traps to the VMM
  3. The VMM sends an IPI to pCPU1
  4. pCPU1 injects an interrupt into vCPU1
  5. vCPU1 reads the payload from memory and invalidates TLBs with INVLPG
  6. vCPU1 notifies vCPU0 that the invalidation is complete
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚            Guest VM            โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚                          โ”‚  โ”‚
โ”‚  โ”‚           โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚  โ”‚
โ”‚  โ”‚           โ”‚           โ”‚  โ”‚  โ”‚
โ”‚  โ”‚   Memory  โ”‚  Payload  โ”‚  โ”‚  โ”‚
โ”‚  โ”‚           โ”‚           โ”‚  โ”‚  โ”‚
โ”‚  โ”‚      โ”Œโ”€โ”€โ”€โ–บโ””โ”€โ”€โ”€โ”€โ”€โ–ฒโ”€โ”€โ”ฌโ”€โ”€โ”˜  โ”‚  โ”‚
โ”‚  โ”‚    1 โ”‚        6 โ”‚  โ”‚ 5   โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚         โ”‚          โ”‚  โ”‚        โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”ดโ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚   vCPU0   โ”‚  โ”‚   vCPU1   โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ–ฒโ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚       โ”‚                โ”‚       โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
      2 โ”‚              4 โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚       โ”‚                โ”‚       โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”      โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚  pCPU0  โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ–บ  pCPU1  โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   3  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚                                โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Note: with IPIv, a unicast vIPI won't VMexit. IPIv is a hardware-supported feature.

Shoot4U:

  1. Guest invokes a hypercall to flush TLBs
  2. The VMM converts the hypercall into multiple remote INVVPID payloads
  3. The VMM sends IPIs to the target pCPUs
  4. pCPU1 reads the payload from memory and invalidates TLBs with INVVPID
  5. pCPU1 notifies that the invalidation is complete

This eliminates the problems that the KVM PV-based approach has:

  • Overheads of IPI routing between vCPUs
  • It is possible that the preemption state of a vCPU can change after its state has been checked by the invoking CPU but before the IPI is actually delivered (TOCTOU problem)

Shortcomings:

  • External interrupt VM exits on the host in step 3 (when the target pCPU is in non-root mode, it still needs to VMExit)
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚            Guest VM            โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚   vCPU0   โ”‚  โ”‚   vCPU1   โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚       โ”‚                        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
  1 Hypercall
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚       โ”‚                        โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”      โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚  pCPU0  โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ–บ  pCPU1  โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜   3  โ””โ”€โ”ฌโ”€โ”€โ”€โ–ฒโ”€โ”€โ”€โ”˜  โ”‚
โ”‚          2โ”‚         โ”‚5  โ”‚4     โ”‚
โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ” โ”‚
โ”‚ โ”‚         โ”‚         โ”‚   โ”‚    โ”‚ โ”‚
โ”‚ โ”‚         โ””โ”€โ–บโ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”ดโ”€โ”  โ”‚ โ”‚
โ”‚ โ”‚   Memory   โ”‚   Payload  โ”‚  โ”‚ โ”‚
โ”‚ โ”‚            โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚ โ”‚
โ”‚ โ”‚                            โ”‚ โ”‚
โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚                                โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Shoot4U + RAR

Just replace steps 2, 3, 4, 5 above with RAR rather than physical IPIs. Because a RAR can be delivered while the target CPU is running in non-root mode, we can avoid the VM exits introduced by IPIs.

However, the VM exit caused by the source side's hypercall presumably still cannot be avoided.

Shoot4U | Proceedings of the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments

RAR TLB Flush in KVM

native_send_rar_ipi() Kernel

void native_send_rar_ipi(const struct cpumask *mask)
{
	cpumask_var_t allbutself;

	if (!alloc_cpumask_var(&allbutself, GFP_ATOMIC)) {
        // allocation failed
		apic->send_IPI_mask(mask, RAR_VECTOR);
		return;
	}

    // allbutself: all online CPUs except the current one
	cpumask_copy(allbutself, cpu_online_mask);
	cpumask_clear_cpu(smp_processor_id(), allbutself);

    // If the mask we want to send to is the same as allbutself, call
    // send_IPI_allbutself directly; it may have some optimizations.
	if (cpumask_equal(mask, allbutself) && cpumask_equal(cpu_online_mask, cpu_callout_mask))
		apic->send_IPI_allbutself(RAR_VECTOR);
	else
		apic->send_IPI_mask(mask, RAR_VECTOR);
    //...
}

RAR TLB Flush Exposure and Detection

The guest enumerates this PV feature via CPUID in KVM_CPUID_FEATURES, and KVM exposes the feature when possible:

#define KVM_FEATURE_RAR_TLBFLUSH   		18

__do_cpuid_func
    case KVM_CPUID_FEATURES:
        //...
        if (rar_invvpid_supported())
            entry->eax |= (1 << KVM_FEATURE_RAR_TLBFLUSH);
        //...
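On the guest side, detection boils down to testing bit 18 of the EAX of the KVM_CPUID_FEATURES leaf (0x40000001 with the default KVM signature); in the kernel this goes through kvm_para_has_feature(). A minimal sketch of the bit test:

```c
#include <assert.h>
#include <stdint.h>

#define KVM_FEATURE_RAR_TLBFLUSH 18

/* Given EAX of the KVM_CPUID_FEATURES CPUID leaf, report whether the
 * host advertised the RAR-based PV TLB flush feature. */
static int rar_tlbflush_advertised(uint32_t features_eax)
{
	return !!(features_eax & (1u << KVM_FEATURE_RAR_TLBFLUSH));
}
```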

How does guest kernel use RAR TLB Flush?

A new hypercall is defined:

#define KVM_HC_FLUSH_TLB       13

If the guest wants to use this feature, it must first use the KVM-based PV TLB flush feature; that is, the KVM_FEATURE_PV_TLB_FLUSH and KVM_FEATURE_STEAL_TIME PV features are required as dependencies:

// If KVM_FEATURE_PV_TLB_FLUSH and KVM_FEATURE_STEAL_TIME are used,
// this function is installed as the handler for pv_ops.mmu.flush_tlb_others
//
static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
			const struct flush_tlb_info *info)
{
    //...
    // First use the deferred PV flush where possible, i.e., for vCPUs that are preempted
	for_each_cpu(cpu, flushmask) {
		src = &per_cpu(steal_time, cpu);
		state = READ_ONCE(src->preempted);
		if ((state & KVM_VCPU_PREEMPTED)) {
			if (try_cmpxchg(&src->preempted, &state,
					state | KVM_VCPU_FLUSH_TLB))
				__cpumask_clear_cpu(cpu, flushmask);
		}
	}

    // nopvrar will be true if "nopvrar" is added to the guest kernel cmdline.
    // The vCPUs still left in flushmask are running (not preempted), so we
    // need to notify them to flush their TLBs; with the RAR PV feature we can
    // replace the IPI with a RAR where possible
	if (!nopvrar && kvm_para_has_feature(KVM_FEATURE_RAR_TLBFLUSH)) {
        // returns 0 on a successful execution
		if (!pv_rar_flush_tlb_others(flushmask, info)) {
			CNT_STOP(kvm_flush_tlb, cyc);
			return;
		}
	}

    // If RAR PV is not used, fall back to the traditional non-PV TLB flush
    native_flush_tlb_multi(flushmask, info);
}

static long pv_rar_flush_tlb_others(const struct cpumask *cpumask,
				const struct flush_tlb_info *info)
{
	u64 flags = 0, start = 0, end = 0, mask = 0;
	long ret;
	int cpu, apic_id;

	if (!info->mm || (info->end == TLB_FLUSH_ALL))
		flags |= PV_TLB_FLUSH_ALL;
	else {
		start = info->start;
		end = info->end;
	}

	for_each_cpu(cpu, cpumask) {
		// apic_id may not equal cpu_id, so first map the cpu id to its apic_id
		apic_id = per_cpu(x86_cpu_to_apicid, cpu);
        // set the bit for each target apic_id in "mask"
		__set_bit(apic_id, (unsigned long *)&mask);
	}

    // invoke the KVM_HC_FLUSH_TLB hypercall
	ret = kvm_hypercall4(KVM_HC_FLUSH_TLB, mask, flags, start, end);
    //...
}

kvm_pv_rar_flush_tlb() / how KVM handles the KVM_HC_FLUSH_TLB hypercall:

[EXIT_REASON_VMCALL] = kvm_emulate_hypercall
    __kvm_emulate_hypercall
        case KVM_HC_FLUSH_TLB:
            kvm_pv_rar_flush_tlb(vcpu, a0, a1, a2, a3);

static long kvm_pv_rar_flush_tlb(struct kvm_vcpu *vcpu, unsigned long mask,
				 unsigned long flags, unsigned long start,
				 unsigned long end)
{
    //...
    map = rcu_dereference(vcpu->kvm->arch.apic_map);
	if (likely(map)) {
        //...
        // walk the apic map: apic_id -> vcpu -> vpid and physical cpu id
		for_each_set_bit(i, &mask, min) {
			if (map->phys_map[i]) {
				lvcpu = map->phys_map[i]->vcpu;
                // collect each target vCPU's vpid
				lvpid[pcpus] = to_vmx(lvcpu)->vpid;
                // collect the physical cpu id mapped from the apic_id
				phys_cpu[pcpus] = lvcpu->cpu;
				pcpus++;
			}
		}
	}

    //...
    // lvpid: the vpid list that may be used for RAR_ACTION_INVVPID
	rar_invalidate_vpid_others_batch(lvpid, phys_cpu, pcpus, &info);
    //...
}

The full function:

static long kvm_pv_rar_flush_tlb(struct kvm_vcpu *vcpu,
				unsigned long mask, unsigned long flags,
				unsigned long start, unsigned long end)
{
	struct flush_tlb_info info;
	int *lvpid;
	int *phys_cpu;
	int pcpus = 0;
	struct kvm_apic_map *map;
	int i;
	unsigned long long __maybe_unused cyc;

    // Bail out if the RAR INVVPID payload type is not supported
	if (!rar_invvpid_supported())
		return 1;

    map = rcu_dereference(vcpu->kvm->arch.apic_map);

	if (likely(map)) {
		u32 min;
		struct kvm_vcpu *lvcpu;

		min = min((u32)BITS_PER_LONG, (map->max_apic_id + 1));
		printk_once("min: %d\n", min);

		lvpid = kmalloc(min * sizeof(int), GFP_KERNEL);
		phys_cpu = kmalloc(min * sizeof(int), GFP_KERNEL);

		for_each_set_bit(i, &mask, min) {
			if (map->phys_map[i]) {
				lvcpu = map->phys_map[i]->vcpu;
				lvpid[pcpus] = to_vmx(lvcpu)->vpid;
				phys_cpu[pcpus] = lvcpu->cpu;
				pcpus++;
			}
		}
	}

	if (pcpus <= 0)
		return 0;

	if (flags == TLB_FLUSH_ALL)
		end = TLB_FLUSH_ALL;

	info.start = start;
	info.end = end;
	info.stride_shift = PAGE_SHIFT; /* need guest to provide */

	rar_invalidate_vpid_others_batch(lvpid, phys_cpu, pcpus, &info);
	kfree(lvpid);
	kfree(phys_cpu);

	return 0;
}