Plan
- Collection (kernel PEBS)
  - What is the cost for collecting one access sample?
- Identification (SDH)
  - How accurate is the identified working set compared to ground truth?
- Migration (no special design)
  - Do we really need special migration design?
  - What is the marginal cost?
  - How often does migration actually happen?
  - How much data to migrate each time?
Collection
Nimble
Nimble doesn't have its own sample collection technique. Instead, it relies on Linux's sample source for the active/inactive list.
From the Nimble paper:

> we build a holistic multi-level memory solution that directly moves data between heterogeneous memories using the existing OS active/inactive page lists, eliminating another major source of software overhead in current systems.
So, to find out how to measure the cost for collecting one sample, we first need to determine how Linux collects samples for the active/inactive list.
From LWN, we can learn that the cost is mainly the active-list accounting:

> The active list contains anonymous and file-backed pages that are thought (by the kernel) to be in active use by some process on the system. The inactive list, instead, contains pages that the kernel thinks might not be in use. When active pages are considered for eviction, they are first moved to the inactive list and unmapped from the address space of the process(es) using them. Thus, once a page moves to the inactive list, any attempt to reference it will generate a page fault; this "soft fault" will cause the page to be moved back to the active list. Pages that sit in the inactive list for long enough are eventually removed from the list and evicted from memory entirely.
From folio_add_lru(), we can learn that the accounting process breaks down into two parts:

- mark the page as active via folio_mark_accessed() or its wrapper mark_page_accessed()
- process a queued batch of pages and finalise the active/inactive decision via lru_add_drain_cpu()

The decision on whether to add the page to the [in]active [file|anon] list is deferred until the folio_batch is drained. This gives the caller of folio_add_lru() a chance to have the folio added to the active list using folio_mark_accessed().
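A simplified sketch of this flow, paraphrased from mm/swap.c (not the exact kernel code), looks roughly like this:

```c
/*
 * Paraphrased sketch of the deferred LRU-placement flow in mm/swap.c.
 * Folios are queued into a per-CPU folio_batch; the [in]active decision
 * is only finalised when the batch is drained.
 */
void folio_add_lru(struct folio *folio)
{
	struct folio_batch *fbatch;

	folio_get(folio);			/* hold a reference while queued */
	local_lock(&cpu_fbatches.lock);
	fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
	if (!folio_batch_add(fbatch, folio))	/* batch is full: drain it now */
		folio_batch_move_lru(fbatch, lru_add_fn);
	local_unlock(&cpu_fbatches.lock);
}
```

lru_add_drain_cpu() performs the same drain on the kernel's schedule; during the drain, lru_add_fn() reads the folio's active/referenced flags (set earlier by folio_mark_accessed()) and links the folio onto the matching [in]active [file|anon] list.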
Here is another source for reference.
Ours
Our samples are all collected through PMIs. We can accumulate the total cycle count spent in all PMI handlers, then divide it by the number of samples collected.
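A minimal sketch of that accounting, where pmi_total_cycles, pmi_total_samples, our_pmi_handler() and collect_pebs_samples() are all hypothetical names, not the real code:

```c
/*
 * Hedged sketch: account the per-sample cost of our PEBS PMI handler.
 * All names here are hypothetical placeholders for illustration.
 */
static atomic64_t pmi_total_cycles;
static atomic64_t pmi_total_samples;

static void our_pmi_handler(struct pt_regs *regs)
{
	u64 start = rdtsc();
	int nr = collect_pebs_samples(regs);	/* hypothetical: drain the PEBS buffer */

	atomic64_add(rdtsc() - start, &pmi_total_cycles);
	atomic64_add(nr, &pmi_total_samples);
}

/* average cost per sample = pmi_total_cycles / pmi_total_samples */
```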
HeMem
HeMem's samples are also collected through PMI, but the major difference is that the PMI handler only delivers samples to the kernel. The kernel then copies them into a user-visible buffer (the perf buffer), and HeMem constantly polls that buffer to pick up the samples passed from the kernel (see the polling sketch below). So, the cost includes the total cycles spent on:
- PMI handling (including the copy to the perf buffer)
- polling
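A userspace sketch of that polling loop, assuming an event already set up with perf_event_open() and mmap(); handle_sample() and the extern setup variables are hypothetical:

```c
/*
 * Sketch of HeMem-style polling of a perf ring buffer from userspace.
 * Cycles spent in this loop are the "polling" half of the cost.
 * Assumes the mmap is one metadata page followed by data_size bytes
 * (a power of two, as perf requires). Records that wrap around the
 * ring boundary are not handled here for brevity.
 */
#include <linux/perf_event.h>
#include <stdint.h>

extern struct perf_event_mmap_page *meta;	/* first page of the mmap */
extern char *data;				/* meta + one page */
extern uint64_t data_size;

extern void handle_sample(struct perf_event_header *h);	/* hypothetical consumer */

void poll_samples(void)
{
	uint64_t head = meta->data_head;
	__sync_synchronize();		/* pairs with the kernel's write barrier */

	while (meta->data_tail < head) {
		struct perf_event_header *h =
			(void *)(data + (meta->data_tail & (data_size - 1)));

		if (h->type == PERF_RECORD_SAMPLE)
			handle_sample(h);
		meta->data_tail += h->size;	/* tell the kernel we consumed it */
	}
}
```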
Sample Collection
To measure sample collection performance, we measure the average time taken to collect one sample. For Nimble and other kernel-LRU-based designs, this is done by counting how many times folio_referenced() is called and measuring its total time via native_sched_clock()/rdtsc. We check folio_referenced() because every time the kernel LRU is scanned, it inspects each page's hardware accessed bit (PTE.A) on the active and inactive lists and updates the lists accordingly.
Check here for more details on the kernel's LRU design.
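A minimal instrumentation sketch, assuming x86 and two hypothetical event items (FOLIO_REF_CALLS, FOLIO_REF_CYCLES) that are plumbed in below; the function body itself is elided:

```c
/*
 * Sketch: time every folio_referenced() call with rdtsc and account
 * both the call count and the cycle total. FOLIO_REF_CALLS and
 * FOLIO_REF_CYCLES are hypothetical vm_event_item entries we add.
 */
int folio_referenced(struct folio *folio, int is_locked,
		     struct mem_cgroup *memcg, unsigned long *vm_flags)
{
	u64 start = rdtsc();
	int referenced;

	/* ... original folio_referenced() body, setting `referenced` ... */

	count_vm_event(FOLIO_REF_CALLS);
	count_vm_events(FOLIO_REF_CYCLES, rdtsc() - start);
	return referenced;
}
```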
To count invocations and cycles, we can add extra statistics items to vm_event_item and vmstat_text, then use count_vm_event(ITEM) to make them visible to userspace via /proc/vmstat (see the sketch below).
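A sketch of the counter plumbing, using the same hypothetical item names as above; the mechanism (vm_event_item plus a matching vmstat_text string) is the standard one:

```c
/* include/linux/vm_event_item.h: new event items (hypothetical names) */
enum vm_event_item {
	/* ... existing items ... */
	FOLIO_REF_CALLS,
	FOLIO_REF_CYCLES,
	NR_VM_EVENT_ITEMS
};

/* mm/vmstat.c: strings must stay in the same order as the enum */
const char * const vmstat_text[] = {
	/* ... existing names ... */
	"folio_ref_calls",
	"folio_ref_cycles",
};
```

After boot, `grep folio_ref /proc/vmstat` should show both counters; dividing folio_ref_cycles by folio_ref_calls gives the average cost per check.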
We can reliably use rdtsc because modern Intel processors have a fixed TSC frequency. You can check this via the TscInvariant CPU feature, e.g. sudo cpuid -1 | rg TscInvariant. The TSC is also checked at kernel boot through determine_cpu_tsc_frequencies().
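Given an invariant TSC, converting a cycle delta into wall time is a fixed-ratio computation. A minimal in-kernel sketch using the calibrated tsc_khz:

```c
#include <linux/math64.h>	/* div64_u64() */
#include <asm/tsc.h>		/* tsc_khz: TSC frequency in kHz, i.e. cycles per ms */

/*
 * Minimal sketch: convert a TSC cycle delta to nanoseconds. Assumes
 * an invariant TSC, so tsc_khz is constant. The multiply can overflow
 * for very long intervals (hours at GHz rates), which is fine for the
 * short deltas we measure.
 */
static inline u64 tsc_cycles_to_ns(u64 cycles)
{
	return div64_u64(cycles * 1000000ULL, tsc_khz);	/* ms -> ns: x 1e6 */
}
```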
Nimble
Nimble's code currently cannot boot, so we are trying to rebase it onto a newer version of Linux than v5.6-rc6.
| Version | Status |
|---|---|
| v5.6-rc6 | Boot hang due to a virtio-related issue |
| v5.6 | Boot hang due to an NX violation in network drivers |
| v5.7 | Ditto |
| v5.8 | Compilation failed due to arch/x86/entry/thunk_64.o |
| v5.9 | Ditto |
| v6.4 | Works |
| v6.5-rc1 | Unsolved page fault in kernel space (virtio-net OK) |
| v6.5-rc2 | Ditto (triggered in ip_vs_protocol_init) |
| v6.5-rc2 | Works after disabling IP_VS |
| v6.5-rc3 | Ditto (released on 7.24) |
| v6.5-rc4 | Boot hang at virtnet_probe (released on 7.31); works after reverting 2526612 listed here |
| v6.5-rc5 | Boot hang at virtnet_probe (released on 8.7) |
| v6.5-rc5 | Still hangs after disabling IP_VS |
After trying several candidate versions, it is clear that old kernels have many unsolved problems, so the best option is to merge Nimble's code into our codebase.
Benefits:
- We gain more insight into Nimble's work, i.e. exactly what gives the most improvement
- Our migration can reuse Nimble's code
- Simpler benchmarking process