
Microbenchmarks

Plan

Collection

Nimble

Nimble doesn't have its own sample collection technique. Instead, it relies on Linux's existing active/inactive page lists as its sample source.

As the Nimble paper puts it: "we build a holistic multi-level memory solution that directly moves data between heterogeneous memories using the existing OS active/inactive page lists, eliminating another major source of software overhead in current systems."

So, to figure out how to measure the cost of collecting one sample, we first need to determine how Linux collects samples for the active/inactive lists.

From LWN, we can learn that the cost is mainly the active list accounting:

The active list contains anonymous and file-backed pages that are thought (by the kernel) to be in active use by some process on the system. The inactive list, instead, contains pages that the kernel thinks might not be in use. When active pages are considered for eviction, they are first moved to the inactive list and unmapped from the address space of the process(es) using them. Thus, once a page moves to the inactive list, any attempt to reference it will generate a page fault; this "soft fault" will cause the page to be moved back to the active list. Pages that sit in the inactive list for long enough are eventually removed from the list and evicted from memory entirely.

From the comment on folio_add_lru(), we can learn that the accounting process can be broken down into two parts:

The decision on whether to add the page to the [in]active [file|anon] list is deferred until the folio_batch is drained. This gives a chance for the caller of folio_add_lru() [to] have the folio added to the active list using folio_mark_accessed()
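Simplified, that deferral looks roughly like the sketch below. This is not the kernel's exact code: locking is elided and the per-CPU batch name is shortened, but the shape matches mm/swap.c.

```c
/* Sketch: folios are queued in a per-CPU batch; the actual
 * active/inactive placement happens only when the batch drains. */
void folio_add_lru(struct folio *folio)
{
        struct folio_batch *fbatch = this_cpu_ptr(&lru_add_fbatch);

        /* Defer the list decision: just queue the folio for now. */
        if (!folio_batch_add(fbatch, folio))
                folio_batch_move_lru(fbatch, lru_add_fn); /* batch full: drain */
}
```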

Here is another source for reference.

Ours

Our samples are all collected through PMIs (performance monitoring interrupts). We can calculate the total cycle count spent in all PMI handling, then divide it by the number of samples collected.
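One way to account for this is sketched below: wrap the PMI path with TSC reads. The wrapper and both counters are hypothetical; in a real kernel you would patch the existing x86 perf NMI handler (perf_event_nmi_handler() in arch/x86/events/core.c) directly rather than wrap it.

```c
/* Hypothetical cycle accounting around the x86 PMI (perf NMI) path. */
static atomic64_t pmi_cycles  = ATOMIC64_INIT(0);
static atomic64_t pmi_samples = ATOMIC64_INIT(0);

static int timed_pmi_handler(unsigned int cmd, struct pt_regs *regs)
{
        u64 start = rdtsc();
        int handled = perf_event_nmi_handler(cmd, regs); /* existing handler */

        atomic64_add(rdtsc() - start, &pmi_cycles);
        atomic64_add(handled, &pmi_samples); /* overflows handled by this PMI */
        return handled;
}

/* Average cost per sample:
 *   atomic64_read(&pmi_cycles) / atomic64_read(&pmi_samples)
 */
```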

HeMem

HeMem's samples are also collected through PMI. The major difference is that the PMI only delivers samples to the kernel; the kernel then copies them into a user-visible buffer (the perf buffer), and HeMem constantly polls that buffer for the samples passed up from the kernel. So the cost includes the total cycles spent on both the kernel-side copy into the perf buffer and the user-side polling.
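The user-side half of that cost is a polling loop over the mmap()ed perf ring buffer. A simplified sketch (handle_sample() is a placeholder, and wrap-around of a record across the buffer end is elided):

```c
#include <linux/perf_event.h>
#include <stdint.h>

void handle_sample(struct perf_event_header *hdr); /* placeholder */

/* Drain new records from an mmap()ed perf ring buffer.  `meta` is the
 * first page of the mapping; `data` points at the sample pages after it. */
void poll_perf_buffer(struct perf_event_mmap_page *meta,
                      char *data, uint64_t data_size)
{
        static uint64_t tail;             /* our read position */
        uint64_t head = meta->data_head;

        __sync_synchronize();             /* rmb: pairs with the kernel's write */

        while (tail < head) {
                struct perf_event_header *hdr =
                        (struct perf_event_header *)(data + (tail % data_size));

                if (hdr->type == PERF_RECORD_SAMPLE)
                        handle_sample(hdr);
                tail += hdr->size;
        }
        meta->data_tail = tail;           /* tell the kernel the space is free */
}
```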

Sample Collection

To measure sample collection performance, we measure the average time taken to collect one sample. For Nimble and other kernel-LRU-based designs, we do this by counting how many times folio_referenced() is called and measuring its total time via native_sched_clock()/rdtsc [1].

We check folio_referenced() because every time the kernel LRU is scanned, it inspects each page's hardware accessed bit (PTE.A) on the active and inactive lists and updates the lists accordingly. Check here for more details on the kernel's LRU design.

To record the invocation and cycle counts, we can add additional statistic items to vm_event_item and vmstat_text, then use count_vm_event(ITEM) to make them visible to userspace via /proc/vmstat.
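A minimal sketch of that instrumentation is below. The two event names and the __folio_referenced() split are ours for illustration, not upstream code; only vm_event_item, vmstat_text, count_vm_event()/count_vm_events(), and rdtsc() are real kernel interfaces.

```c
/* include/linux/vm_event_item.h */
enum vm_event_item {
        /* ... existing items ... */
        FOLIO_REF_CALLS,        /* times folio_referenced() was invoked */
        FOLIO_REF_CYCLES,       /* total TSC cycles spent inside it */
        NR_VM_EVENT_ITEMS
};

/* mm/vmstat.c — names shown in /proc/vmstat, same order as the enum */
const char * const vmstat_text[] = {
        /* ... existing entries ... */
        "folio_ref_calls",
        "folio_ref_cycles",
};

/* mm/rmap.c — time the original body, renamed __folio_referenced() here */
int folio_referenced(struct folio *folio, int is_locked,
                     struct mem_cgroup *memcg, unsigned long *vm_flags)
{
        u64 start = rdtsc();
        int referenced = __folio_referenced(folio, is_locked, memcg, vm_flags);

        count_vm_event(FOLIO_REF_CALLS);
        count_vm_events(FOLIO_REF_CYCLES, rdtsc() - start);
        return referenced;
}
```

The average cost per sample is then folio_ref_cycles / folio_ref_calls, with both counters readable from /proc/vmstat.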

[1]

We can reliably use rdtsc because modern Intel processors have an invariant (fixed-frequency) TSC. You can check this via the TscInvariant CPU feature, e.g. sudo cpuid -1 | rg TscInvariant. The TSC is also checked at kernel boot in determine_cpu_tsc_frequencies().
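The same bit can also be read from C if cpuid is not installed; it is CPUID.80000007H:EDX[8], the documented invariant-TSC feature bit:

```c
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

        /* Leaf 0x80000007, EDX bit 8 = invariant TSC (TscInvariant). */
        __get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx);
        printf("TscInvariant: %s\n", (edx & (1u << 8)) ? "yes" : "no");
        return 0;
}
```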

Nimble

Nimble's code currently cannot boot; we are trying to rebase it onto a newer version of Linux than v5.6-rc6.

| Version | Status |
| --- | --- |
| v5.6-rc6 | Boot hang due to a virtio-related issue |
| v5.6 | Boot hang due to an NX violation in network drivers |
| v5.7 | Ditto |
| v5.8 | Compilation failed due to arch/x86/entry/thunk_64.o |
| v5.9 | Ditto |
| v6.4 | Works |
| v6.5-rc1 | Unsolved page fault in kernel space (virtio-net OK) |
| v6.5-rc2 | Ditto (triggered in ip_vs_protocol_init) |
| v6.5-rc2 | Works after disabling IP_VS |
| v6.5-rc3 | Ditto (released on 7.24) |
| v6.5-rc4 | Boot hang at virtnet_probe (released on 7.31); works after reverting 2526612, listed here |
| v6.5-rc5 | Boot hang at virtnet_probe (released on 8.7) |
| v6.5-rc5 | Still hangs after disabling IP_VS |

After trying several candidate versions, it's obvious that the old kernel has many unsolved problems. It's best to merge Nimble's code into our codebase instead.

Benefits: