
Motivation

version 1

= Motivation

// In this section, we identify three key challenges that current approaches face: TLB flush overhead, PEBS accessibility, and scalability concerns. These challenges inform our guest-delegated design principle.

== TLB Flush Overhead <motivation-overhead>

Access tracking in existing hypervisor-based tiered memory management solutions depends on TLB flush-intensive PTE.A bits for hotness information.
To quantify this overhead, we compared TLB flush instruction counts between hypervisor-based and guest-based solutions.

Due to the lack of available source code and omitted design details in existing hypervisor-based solutions @vee15heterovisor @socc16raminate @isca17heteroos @eurosys23vtmm, we converted the host-based TPP @asplos23tpp to a hypervisor-based solution (H-TPP) by integrating its PTE.A scanning backend with KVM's MMU notifier. 
This allows it to observe guest accesses through PTE.A bits in EPT.
For guest-based solutions, we evaluated a direct application of the host-based TPP in guests (G-TPP).
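
To make the H-TPP backend concrete, the minimal sketch below (not code from TPP or KVM; `h_tpp_scan_region` is a hypothetical helper) walks a VM's host memory range through the standard Linux MMU-notifier entry point that KVM wires to its EPT accessed-bit handlers:

```c
/* Hedged sketch of H-TPP-style access tracking over a VM's memory map.
 * mmu_notifier_clear_flush_young() is the standard Linux entry point;
 * KVM's callback tests and clears the accessed bits of the EPT entries
 * backing the range. h_tpp_scan_region() is a hypothetical helper. */
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Returns how many pages in [start, end) were accessed since the last scan. */
static unsigned long h_tpp_scan_region(struct mm_struct *mm,
                                       unsigned long start,
                                       unsigned long end)
{
	unsigned long addr, hot = 0;

	for (addr = start; addr < end; addr += PAGE_SIZE)
		hot += !!mmu_notifier_clear_flush_young(mm, addr, addr + PAGE_SIZE);

	return hot;
}
```

In KVM, each call that finds a set accessed bit typically ends in a remote TLB flush, which for EPT means a full `invept`; this is the overhead quantified next.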

For our evaluation, we limited the hypervisor-based system to 36 GB DRAM during boot, while for guest-based solutions, we started a VM with 36 GB DRAM.
We used PMEM as the SMEM tier and ran a GUPS workload with a 144 GB total footprint using 36 threads, as shown in @fig-motivation-tlb-flush.
The results reveal that the hypervisor-based solution H-TPP generates 4.7× as many TLB flush instructions as G-TPP, resulting in 2.5× the total execution time.

The severe performance penalty stems from the necessity of destructive full invalidation of all EPT mappings.
TLB flush instructions fall into two categories: full invalidation (`invept`) and single gVA invalidation (`invvpid`/`invpcid`/`invlpg`).
Hypervisor-based solutions, which capture GPT entries through faulting @vee15heterovisor @isca17heteroos or scanning @eurosys23vtmm, can only access GPT and EPT entries, which contain only gPAs and hPAs.
Without gVA information, they must resort to full invalidation to ensure capturing future PTE.A/D bits.
In contrast, guest-based solutions can follow the entire GPT walk process and extract the initial gVA, enabling the use of more efficient single-address invalidations.
Furthermore, guest-based solutions can leverage EPT-friendly PEBS instead of TLB flush-intensive PTE.A/D bits.
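
To make the asymmetry concrete, the snippet below (illustrative only, not taken from any of the evaluated systems) shows the single-address invalidation available to a guest kernel that knows the gVA from its own page-table walk; without that gVA, the hypervisor's conservative option is a full `invept` of the EPT context.

```c
/* Illustrative only: a guest kernel that performed the GPT walk knows the
 * gVA, so it can invalidate one translation instead of flushing everything.
 * invlpg must execute at CPL 0, i.e., inside the guest kernel. */
static inline void guest_flush_one_gva(const void *gva)
{
	asm volatile("invlpg (%0)" : : "r"(gva) : "memory");
}
```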

== EPT-Friendly PEBS

Contrary to prior assumptions @eurosys23vtmm @osdi24memstrata, we find that PEBS sampling is now well-supported with strong isolation for guest states under virtualization @lkml22eptfriendlypebs.
Cross-architecture support for PMU sampling also exists @lkml24riscvguestsampling, allowing PEBS-based hotness classification to extend to a wide range of cloud machines with simple modifications to the collected events.

The primary concern regarding guest PEBS support has been the potential breach of isolation boundaries caused by sharing the sample buffer, potentially leaking sensitive information across VMs.
Prior systems intuitively assumed that PEBS enabled in guests would generate samples and write to the host OS's PEBS buffer @eurosys23vtmm, leaking load/store addresses via generated samples.
However, we found this is not the case.

The PEBS sample buffer is part of the CPU debug control data structure and is managed via a debug control register.
Hardware virtualization automatically switches to a guest-private PEBS buffer through the `vmcs.debugctl` field in the Virtual Machine Control Structure.
Samples generated while executing different VMs are written directly to their private buffers, ensuring proper isolation.
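
As a hedged illustration of what this enables, the user-space sketch below opens a PEBS-backed load-latency sampling event from inside a guest via `perf_event_open`; the raw event encoding `0x1cd` and the latency threshold are typical Intel values chosen for illustration and are microarchitecture-specific.

```c
/* Minimal sketch: request PEBS ("precise") load sampling inside a guest.
 * The load-latency event encoding and the ldlat threshold in config1 are
 * illustrative, microarchitecture-specific values. */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_pebs_loads(pid_t pid)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_RAW;
	attr.config = 0x1cd;          /* load-latency event (example encoding) */
	attr.config1 = 64;            /* minimum load latency to sample, cycles */
	attr.sample_period = 4001;    /* one sample per ~4k qualifying loads */
	attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | PERF_SAMPLE_WEIGHT;
	attr.precise_ip = 2;          /* PEBS: request precise samples */

	/* With EPT-friendly PEBS, these samples land in the guest's own
	 * DS/PEBS buffer; the host's buffer is never involved. */
	return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}
```

The sampled virtual addresses (`PERF_SAMPLE_ADDR`) are exactly the gVAs needed for hotness classification, without touching PTE.A/D bits.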

== Scalability Challenges

With PEBS now available in guests, it might seem intuitive to apply existing PEBS-based designs directly within guest VMs.
However, cloud environments often run numerous VMs concurrently on a single machine, which would compound management overhead if existing designs were directly repurposed.
For instance, HeMem @sosp21hemem dedicates one CPU core to pulling samples from the PEBS buffer, which would waste one core per VM, a prohibitive overhead in multi-tenant environments.
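
The sketch below (hedged; it assumes the `sample_type` from the earlier sketch, `classify()` is a hypothetical hotness callback, and record wraparound at the end of the data area is not handled) shows why such a design burns a core: the tracking thread spins on the perf mmap ring buffer draining PEBS samples.

```c
/* Hedged sketch of a HeMem-style polling core: spin on the perf mmap ring
 * buffer (the fd from the earlier sketch, mmap()ed with 1 + 2^n pages) and
 * drain PEBS samples. Records wrapping past the end of the data area would
 * need to be copied out first; omitted here for brevity. */
#include <linux/perf_event.h>
#include <stdint.h>

struct pebs_sample {              /* layout for TID | ADDR | WEIGHT samples */
	struct perf_event_header hdr;
	uint32_t pid, tid;
	uint64_t addr;
	uint64_t weight;
};

static void poll_pebs(struct perf_event_mmap_page *meta,
		      void (*classify)(uint64_t gva, uint64_t weight))
{
	char *data = (char *)meta + meta->data_offset;
	uint64_t mask = meta->data_size - 1;   /* data area is a power of two */

	for (;;) {   /* dedicated thread: this loop alone occupies one core */
		uint64_t head = __atomic_load_n(&meta->data_head, __ATOMIC_ACQUIRE);
		uint64_t tail = meta->data_tail;

		while (tail != head) {
			struct pebs_sample *s =
				(struct pebs_sample *)(data + (tail & mask));

			if (s->hdr.type == PERF_RECORD_SAMPLE)
				classify(s->addr, s->weight);
			tail += s->hdr.size;
		}
		/* Publish consumption so the kernel can reuse the space. */
		__atomic_store_n(&meta->data_tail, tail, __ATOMIC_RELEASE);
	}
}
```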

We conducted a scalability study of existing designs using both PTE.A/D-based (TPP) and PEBS-based (Memtis) approaches, as shown in @fig-motivation-scalability.
We launched up to nine virtual machines on a 36-core system and ran 8.1 billion GUPS transactions with a 126 GiB working set divided evenly across all VMs while preserving the access distribution.

The results demonstrate that naively using TPP in guests could waste more than 3.5 CPU cores in a 36-core system, while even the optimized PEBS design of Memtis still wastes approximately half a core.
In cloud environments where CPU resources are rented to customers with the goal of maximizing utilization, such wastage would increase total cost of ownership (TCO), potentially negating the benefits of memory expansion through tiered memory.

// These findings highlight the need for a fundamentally new approach to tiered memory management in virtualized environments—one that leverages the benefits of guest-side access tracking while addressing the scalability challenges of existing designs. This motivates our guest-delegated principle and the development of HyperFlex and HyperPlace, which we detail in the following sections.

Draft

= Motivation

== TLB flush
Access tracking in existing hypervisor-based tiered memory management solutions depends on TLB-flush-intensive PTE.A bits for hotness information.
We present a TLB flush instruction count comparison between hypervisor-based and guest-based solutions.
Because of the lack of available source code and omitted design details in existing hypervisor-based solutions @vee15heterovisor @socc16raminate @isca17heteroos @eurosys23vtmm, we convert the host-based TPP @asplos23tpp to a hypervisor-based solution, named H-TPP, by plugging its PTE.A scanning backend into KVM's MMU notifier, allowing it to observe guest accesses through PTE.A bits in EPT.
For guest-based solutions, we present results of directly applying the existing host-based TPP in guests, named G-TPP, and of our HyperTier design.

For the hypervisor-based setup, we limit the usable DRAM of the evaluation machine to 36 GB during boot; for the guest-based setups, we start a VM with 36 GB DRAM.
We use PMEM as the SMEM tier.
We run a GUPS workload with a 144 GB total footprint using 36 threads and plot the results in @fig-motivation-tlb-flush.
The results show that the hypervisor-based solution H-TPP issues 4.7× as many TLB flush instructions as G-TPP and takes 2.5× the total execution time to finish.

The culprit behind this severe performance penalty is the necessity of destructive full invalidation of all EPT mappings.
TLB flush instructions can be categorized into two kinds: full invalidation (`invept`) and single gVA invalidation (`invvpid`/`invpcid`/`invlpg`).
Because hypervisor-based solutions, which capture GPT entries through faulting @vee15heterovisor @isca17heteroos or scanning @eurosys23vtmm, can only see GPT and EPT entries, which contain only gPAs and hPAs, they have to resort to full invalidation, for lack of the gVA, to ensure the capture of future PTE.A/D bits.
However, guest-based solutions are able to follow the entire GPT walk and extract the initial gVA, enabling the use of cheaper single-address invalidations.
Furthermore, guest-based solutions can now leverage the more efficient EPT-friendly PEBS instead of TLB-flush-intensive PTE.A/D bits.

== EPT-friendly PEBS
Contrary to prior assumptions @eurosys23vtmm @osdi24memstrata, we find that PMU sampling is now well-supported with strong isolation for guest states under virtualization @lkml22eptfriendlypebs.
Even cross-architecture support for PMU sampling exists @lkml24riscvguestsampling, allowing PMU-sampling-based hotness classification to extend to a wide range of machines in the cloud with simple changes to the collected events.

The main concern for guest PEBS support is the _potential_ breach of isolation boundaries caused by sharing the sample buffer and leaking sensitive information across VMs.
Prior systems intuitively assume that PEBS enabled in the guests will generate samples and write to the host OS's PEBS buffer @eurosys23vtmm, leaking load/store addresses via generated samples. 
However, we found that this is not the case.
The PEBS sample buffer is part of the CPU debug control data structure and is controlled via a debug control register.
Hardware virtualization automatically switches to a guest-private PEBS buffer via the `vmcs.debugctl` field in the Virtual Machine Control Structure.
Samples generated while executing different VMs are written directly to their private buffers, ensuring isolation.

== Scalability
With the availability of PEBS in guests, it is intuitive to apply prior PEBS-based designs directly in guests.
However, a machine often has a large number of VMs running concurrently, so directly repurposing these designs would compound management overhead.
HeMem @sosp21hemem dedicates one core to pulling samples from the PEBS buffer, wasting one CPU core per VM just on access tracking.

We present a scalability study of existing PTE.A/D-based (TPP) and PEBS-based (Memtis) designs in @fig-motivation-scalability.
We launch up to nine virtual machines on a 36-core machine and run 8.1 billion GUPS transactions with a 126 GiB working set divided evenly across all virtual machines while preserving the access distribution.

Results show that naively using TPP in guests could waste more than three and a half CPU cores in a 36-core system, while an optimized PEBS design like Memtis still wastes about half a core.
Cloud environments rent CPU resources to customers and often aim for maximal resource utilization.
Such wastage might increase TCO, negating the benefits of memory expansion through a slow memory tier.

          Total flushes   invept        Elapsed    Seconds
H-TPP     82,504,466      20,214,840    14:56.35   896.35
G-TPP     17,707,154                    05:53.91   353.91
G-Ours     9,305,363                    04:59.57   299.57

Design

Implementation

Evaluation

Discussion


Raw data

                                        Host          Guest         Ours
Root      tlb_flush                     58,697,567    1,330         1,029
Root      remote_tlb_flush              20,214,840    0             0
Non-root  nr_tlb_local_flush_one        2,976,774     2,752,613     2,753,576
Non-root  nr_tlb_remote_flush           602,668       14,929,754    6,508,334
Non-root  nr_tlb_local_flush_all        12,617        23,457        42,424
Total flush                             82,504,466    17,707,154    9,305,363
GUPS Elapsed                            14:56.35      5:53.91       4:59.57