on
Developing a performance test for Kernel Samepage Merging (KSM)
1 Introduction
This report presents the contribution project for the Linux Kernel developed during the MO806A course.
The project consists of developing code to test the performance of the memory-saving feature Kernel Samepage Merging (KSM) when using huge pages. These tests are called Kselftests, and are contained in the Kernel tree.
Section 2 presents the Kselftest testing framework. Sections 3 and 4 present the KSM feature and the Transparent Huge Pages system, respectively, so that in section 5 the problem that occurs when enabling both systems together is presented. The contribution, which is related to this problem, is presented in section 6.
2 Kselftests
Kselftests are a set of unit tests and regression tests for the Linux Kernel [1] , which can be found in the tools/testing/selftest directory in the Kernel tree. These tests are executed daily on several Kernel trees.
Each test aims to execute specific code paths in order to find bugs and regressions.
The tests are organized into groups, where each one tests a different part of the Kernel (memory, scheduler, KVM, etc.). The user can run all the tests, just a group of tests, or just a specific test.
One of the requirements of Kselftests is that the execution time of all the tests does not exceed 20 minutes.
The Kselftests development tree is maintained by developer Shuah Khan, and can be found at [2]
3 Kernel Samepage Merging (KSM)
Kernel Samepage Merging is a memory-saving feature of the Linux Kernel. When enabled, the Kernel scans the memory for identical memory pages. When it finds duplicate pages, one of them is kept and shared between processes, and the others are released, saving memory usage. The page shared by the processes is marked as copy on write (COW) [3].
KSM was developed to be used on hosts that run several virtual machines. In these systems, a significant part of the memory used by the guests is identical, so activating KSM generates a large memory saving.
The figure below shows a representation of a physical machine, using KSM, running three virtual machines. Part of the pages used by the virtual machines are shared by KSM, and the rest are private to each virtual machine.
Host machine with KSM activated sharing pages of virtual machines [4]
4 Transparent Huge Pages
Transparent Huge Pages (THP) is a system implemented in the Linux Kernel to use huge pages, normally 2MB in size, in a way that is transparent to the programmer.
A huge page is allocated, created, and mapped at the first page fault in a 2MB-aligned region. If there is no free contiguous memory region for the huge page, the default behavior is to compact the memory to create it.
In addition, Linux implements the khugepaged daemon, which asynchronously scans the page tables of processes, looking for anonymous 2MB-aligned regions that contain at least one page marked as dirty, to be combined to create a huge page. Huge pages can also be “broken” into base pages when necessary.
Using huge pages reduces the number of pages used by applications, which reduces the number of TLB misses and increases performance. [5] shows that some applications can spend 58% of their execution time handling TLB misses. Therefore, the use of huge pages is essential to increase the performance of many applications.
5 THP vs KSM
Using THP with KSM can lead to a significant drop in system performance, as KSM “breaks” huge pages when it finds duplicate base pages within a huge page. This leads to low usage of huge pages, which can negate their benefits, leaving the system with only the overhead of THP.
Due to this problem, several modifications to both KSM and THP are being proposed. With this in mind, the project presented in this report consists of contributing tests to measure the speed of the KSM system using THP.
6 Contribution
The contribution made to this project was in the Kselftest code that tests KSM. A parameter was added to test the speed of sharing huge pages.
The purpose of this test is to help KSM system developers by evaluating the system’s performance when interacting with huge pages.
Contributions to the Linux Kernel are made through patches, which are sent to email lists. There are several email lists, and each one receives patches for different parts of the Kernel.
The patches are managed by the maintainers, who are responsible for approving them, rejecting them, or suggesting improvements. Each maintainer has a Kernel tree where, if the patch is approved, it is applied.
There are several maintainers, responsible for managing the development of different parts of the Kernel.
To find out which email list to send a patch to, simply run the get_maintainer.pl script, which is found in the Kernel tree, passing the path to your patch as a parameter. It will return the name and email of the maintainers and the mailing lists to which the patch should be sent.
The following subsections describe details about the code and the interaction with the community when contributing.
6.1 Code
The test to measure the speed of KSM page sharing is simple. It is located in tools/testing/selftest/vm/ksm_tests.c, and receives as parameters the type of test and the size of the duplicated memory region. It measures the execution time of the ksm_merge_pages function, shown below.
static int ksm_merge_pages(void *addr, size_t size,
struct timespec start_time, int timeout)
{
if (madvise(addr, size, MADV_MERGEABLE)) {
perror("madvise");
return 1;
}
if (ksm_write_sysfs(KSM_FP("run"), 1))
return 1;
/* Since merging occurs only after 2 scans,
make sure to get at least 2 full scans */
if (ksm_do_scan(2, start_time, timeout))
return 1;
return 0;
}This function first enables the memory region used in the test to be scanned by KSM. Then, it executes KSM. Finally, it commands KSM to scan the memory twice, since only the second time does page sharing occur.
To add the option to test the speed of sharing huge pages, simply allocate the memory region to be shared using huge pages.
Two versions of the patch were developed, the first and second versions can be found in appendices A and B, respectively.
In the first version, the madvise system call is used to allocate the memory region to be tested with huge pages. However, this method means that there is no control over the number of huge pages that were allocated to that region, which can interfere with the tests.
In the second version, the allocate_transhuge function is used to allocate the huge pages. This function attempts to allocate huge pages and returns -1 if it is not successful. This way, it is possible to know how many huge pages were allocated in the region. The number of normal pages and huge pages allocated is informed to the user.
The memory is allocated in chunks of size HPAGE_SIZE, which corresponds to the size of a Huge Page, as shown in the code snippet below. The allocate_transhuge function is called for each chunk, and according to its return, the number of allocated pages of each type is counted.
n_normal_pages = 0;
n_huge_pages = 0;
for (void *p = map_ptr; p < map_ptr + len;
p += HPAGE_SIZE) {
if (allocate_transhuge(p, pagemap_fd) < 0)
n_normal_pages++;
else
n_huge_pages++;
}The allocate_transhuge function, shown below, uses the Linux kernel interface called pagemap [6], which allows the user to access a program’s page table to check the type of page allocated. It can be accessed in /proc/<pid>/pagemap, where <pid> is the process identification number or self, if the process wants to access its own page table.
int64_t allocate_transhuge(void *ptr, int pagemap_fd)
{
uint64_t ent[2];
/* drop pmd */
if (mmap(ptr, HPAGE_SIZE, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANONYMOUS |
MAP_NORESERVE | MAP_PRIVATE, -1, 0) != ptr)
errx(2, "mmap transhuge");
if (madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE))
err(2, "MADV_HUGEPAGE");
/* allocate transparent huge page */
*(volatile void **)ptr = ptr;
if (pread(pagemap_fd, ent, sizeof(ent),
(uintptr_t)ptr >> (PAGE_SHIFT - 3)) != sizeof(ent))
err(2, "read pagemap");
if (PAGEMAP_PRESENT(ent[0]) &&
PAGEMAP_PRESENT(ent[1]) &&
PAGEMAP_PFN(ent[0]) + 1 == PAGEMAP_PFN(ent[1]) &&
!(PAGEMAP_PFN(ent[0]) &
((1 << (HPAGE_SHIFT - PAGE_SHIFT)) - 1)))
return PAGEMAP_PFN(ent[0]);
return -1;
}First, the information of the physical page mapped by the virtual address ptr and its successor are stored in the array ent. For a huge page to be allocated, both pages must be present, i.e., not in swap, the Page Frame Number (PFN) of the second page must be the successor of the first, and the PFN of the first page must be a multiple of 512, since a huge page has 512 normal pages.
The organization of the physical memory pages mapped by huge pages is detailed in [7].
To run this test, simply compile Kselftests and run sudo ./ksm_tests -H -s <N>, where N is the size in MB of the duplicated region, in the tools/testing/selftests/vm directory.
An example of the test output with a 100MB duplicated region is shown below.
Number of normal pages: 0
Number of huge pages: 50
Total size: 100 MiB
Total time: 0.178690991 s
Average speed: 559.625 MiB/sThe full test code can be found at [8].
6.2 Community
The patches were sent to the emails obtained through the get_maintainer.pl script.
The first version of the patch sent received no responses, and a few days later the second version was sent.
The second version was first applied to the linux-mm tree of the Kernel, which receives patches related to the memory system. It was then selected by its maintainer Andrew Morton to be sent to the Mainline Kernel, and was applied to Kernel v5.16-rc3. The commit in which the patch was applied can be found at [9].
After being applied to the Mainline Kernel, the patch was also applied to the Kselftests development tree.
7 Conclusion
This project made it possible to get in touch with the Linux Kernel developer community, to learn the tools used to make a contribution and to test changes to the Kernel, and mainly to learn how the memory management system works.
In addition to the contribution itself, reading mailing lists and information portals about the Linux Kernel, such as LWN.net [10], throughout the development of the project, helped to follow the discussion process that occurs until a major change is applied to the Mainline Kernel.
Appendix A
This appendix presents the first version of the patch developed in the project.
Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
---
tools/testing/selftests/vm/ksm_tests.c | 40 +++++++++++++++++++-------
1 file changed, 30 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/vm/ksm_tests.c b/tools/testing/selftests/vm/ksm_tests.c
index b61dcdb44c5b..92b716565d9c 100644
--- a/tools/testing/selftests/vm/ksm_tests.c
+++ b/tools/testing/selftests/vm/ksm_tests.c
@@ -5,6 +5,7 @@
#include <time.h>
#include <string.h>
#include <numa.h>
+#include <err.h>
#include "../kselftest.h"
#include "../../../../include/vdso/time64.h"
@@ -34,7 +35,8 @@ enum ksm_test_name {
CHECK_KSM_ZERO_PAGE_MERGE,
CHECK_KSM_NUMA_MERGE,
KSM_MERGE_TIME,
- KSM_COW_TIME
+ KSM_COW_TIME,
+ KSM_MERGE_TIME_HUGE_PAGES
};
static int ksm_write_sysfs(const char *file_path, unsigned long val)
@@ -99,6 +101,9 @@ static void print_help(void)
" -U (page unmerging)\n"
" -P evaluate merging time and speed.\n"
" For this test, the size of duplicated memory area (in MiB)\n"
+ " must be provided using -s option\n"
+ " -H evaluate merging time and speed of huge pages.\n"
+ " For this test, the size of duplicated memory area (in MiB)\n"
" must be provided using -s option\n"
" -C evaluate the time required to break COW of merged pages.\n\n");
@@ -118,10 +123,14 @@ static void print_help(void)
exit(0);
}
-static void *allocate_memory(void *ptr, int prot, int mapping, char data, size_t map_size)
+static void *allocate_memory(void *ptr, int prot, int mapping, char data, size_t map_size,
+ bool huge_page)
{
void *map_ptr = mmap(ptr, map_size, PROT_WRITE, mapping, -1, 0);
+ if (huge_page && madvise(map_ptr, map_size, MADV_HUGEPAGE))
+ err(2, "MADV_HUGEPAGE");
+
if (!map_ptr) {
perror("mmap");
return NULL;
@@ -250,7 +259,7 @@ static int check_ksm_merge(int mapping, int prot, long page_count, int timeout,
}
/* fill pages with the same data and merge them */
- map_ptr = allocate_memory(NULL, prot, mapping, '*', page_size * page_count);
+ map_ptr = allocate_memory(NULL, prot, mapping, '*', page_size * page_count, false);
if (!map_ptr)
return KSFT_FAIL;
@@ -282,7 +291,7 @@ static int check_ksm_unmerge(int mapping, int prot, int timeout, size_t page_siz
}
/* fill pages with the same data and merge them */
- map_ptr = allocate_memory(NULL, prot, mapping, '*', page_size * page_count);
+ map_ptr = allocate_memory(NULL, prot, mapping, '*', page_size * page_count, false);
if (!map_ptr)
return KSFT_FAIL;
@@ -325,7 +334,7 @@ static int check_ksm_zero_page_merge(int mapping, int prot, long page_count, int
return KSFT_FAIL;
/* fill pages with zero and try to merge them */
- map_ptr = allocate_memory(NULL, prot, mapping, 0, page_size * page_count);
+ map_ptr = allocate_memory(NULL, prot, mapping, 0, page_size * page_count, false);
if (!map_ptr)
return KSFT_FAIL;
@@ -416,7 +425,7 @@ static int check_ksm_numa_merge(int mapping, int prot, int timeout, bool merge_a
return KSFT_FAIL;
}
-static int ksm_merge_time(int mapping, int prot, int timeout, size_t map_size)
+static int ksm_merge_time(int mapping, int prot, int timeout, size_t map_size, bool huge_page)
{
void *map_ptr;
struct timespec start_time, end_time;
@@ -424,7 +433,7 @@ static int ksm_merge_time(int mapping, int prot, int timeout, size_t map_size)
map_size *= MB;
- map_ptr = allocate_memory(NULL, prot, mapping, '*', map_size);
+ map_ptr = allocate_memory(NULL, prot, mapping, '*', map_size, huge_page);
if (!map_ptr)
return KSFT_FAIL;
@@ -466,7 +475,7 @@ static int ksm_cow_time(int mapping, int prot, int timeout, size_t page_size)
/* page_count must be less than 2*page_size */
size_t page_count = 4000;
- map_ptr = allocate_memory(NULL, prot, mapping, '*', page_size * page_count);
+ map_ptr = allocate_memory(NULL, prot, mapping, '*', page_size * page_count, false);
if (!map_ptr)
return KSFT_FAIL;
@@ -541,7 +550,7 @@ int main(int argc, char *argv[])
bool merge_across_nodes = KSM_MERGE_ACROSS_NODES_DEFAULT;
long size_MB = 0;
- while ((opt = getopt(argc, argv, "ha:p:l:z:m:s:MUZNPC")) != -1) {
+ while ((opt = getopt(argc, argv, "ha:p:l:z:m:s:MUZNPCH")) != -1) {
switch (opt) {
case 'a':
prot = str_to_prot(optarg);
@@ -598,6 +607,9 @@ int main(int argc, char *argv[])
case 'C':
test_name = KSM_COW_TIME;
break;
+ case 'H':
+ test_name = KSM_MERGE_TIME_HUGE_PAGES;
+ break;
default:
return KSFT_FAIL;
}
@@ -645,12 +657,20 @@ int main(int argc, char *argv[])
return KSFT_FAIL;
}
ret = ksm_merge_time(MAP_PRIVATE | MAP_ANONYMOUS, prot, ksm_scan_limit_sec,
- size_MB);
+ size_MB, false);
break;
case KSM_COW_TIME:
ret = ksm_cow_time(MAP_PRIVATE | MAP_ANONYMOUS, prot, ksm_scan_limit_sec,
page_size);
break;
+ case KSM_MERGE_TIME_HUGE_PAGES:
+ if (size_MB == 0) {
+ printf("Option '-s' is required.\n");
+ return KSFT_FAIL;
+ }
+ ret = ksm_merge_time(MAP_PRIVATE | MAP_ANONYMOUS, prot, ksm_scan_limit_sec,
+ size_MB, true);
+ break;
}
if (ksm_restore(&ksm_sysfs_old)) {
--
2.25.1
Appendix B
This appendix presents the second version of the patch developed in the project.
Add test case of KSM merging time using mostly huge pages
Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
---
tools/testing/selftests/vm/ksm_tests.c | 125 ++++++++++++++++++++++++-
1 file changed, 124 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/vm/ksm_tests.c b/tools/testing/selftests/vm/ksm_tests.c
index b61dcdb44c5b..61f3360a19df 100644
--- a/tools/testing/selftests/vm/ksm_tests.c
+++ b/tools/testing/selftests/vm/ksm_tests.c
@@ -5,6 +5,10 @@
#include <time.h>
#include <string.h>
#include <numa.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <stdint.h>
+#include <err.h>
#include "../kselftest.h"
#include "../../../../include/vdso/time64.h"
@@ -18,6 +22,15 @@
#define KSM_MERGE_ACROSS_NODES_DEFAULT true
#define MB (1ul << 20)
+#define PAGE_SHIFT 12
+#define HPAGE_SHIFT 21
+
+#define PAGE_SIZE (1 << PAGE_SHIFT)
+#define HPAGE_SIZE (1 << HPAGE_SHIFT)
+
+#define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0)
+#define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1))
+
struct ksm_sysfs {
unsigned long max_page_sharing;
unsigned long merge_across_nodes;
@@ -34,6 +47,7 @@ enum ksm_test_name {
CHECK_KSM_ZERO_PAGE_MERGE,
CHECK_KSM_NUMA_MERGE,
KSM_MERGE_TIME,
+ KSM_MERGE_TIME_HUGE_PAGES,
KSM_COW_TIME
};
@@ -99,6 +113,9 @@ static void print_help(void)
" -U (page unmerging)\n"
" -P evaluate merging time and speed.\n"
" For this test, the size of duplicated memory area (in MiB)\n"
+ " must be provided using -s option\n"
+ " -H evaluate merging time and speed of area allocated mostly with huge pages\n"
+ " For this test, the size of duplicated memory area (in MiB)\n"
" must be provided using -s option\n"
" -C evaluate the time required to break COW of merged pages.\n\n");
@@ -416,6 +433,101 @@ static int check_ksm_numa_merge(int mapping, int prot, int timeout, bool merge_a
return KSFT_FAIL;
}
+int64_t allocate_transhuge(void *ptr, int pagemap_fd)
+{
+ uint64_t ent[2];
+
+ /* drop pmd */
+ if (mmap(ptr, HPAGE_SIZE, PROT_READ | PROT_WRITE,
+ MAP_FIXED | MAP_ANONYMOUS |
+ MAP_NORESERVE | MAP_PRIVATE, -1, 0) != ptr)
+ errx(2, "mmap transhuge");
+
+ if (madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE))
+ err(2, "MADV_HUGEPAGE");
+
+ /* allocate transparent huge page */
+ *(volatile void **)ptr = ptr;
+
+ if (pread(pagemap_fd, ent, sizeof(ent),
+ (uintptr_t)ptr >> (PAGE_SHIFT - 3)) != sizeof(ent))
+ err(2, "read pagemap");
+
+ if (PAGEMAP_PRESENT(ent[0]) && PAGEMAP_PRESENT(ent[1]) &&
+ PAGEMAP_PFN(ent[0]) + 1 == PAGEMAP_PFN(ent[1]) &&
+ !(PAGEMAP_PFN(ent[0]) & ((1 << (HPAGE_SHIFT - PAGE_SHIFT)) - 1)))
+ return PAGEMAP_PFN(ent[0]);
+
+ return -1;
+}
+
+static int ksm_merge_hugepages_time(int mapping, int prot, int timeout, size_t map_size)
+{
+ void *map_ptr, *map_ptr_orig;
+ struct timespec start_time, end_time;
+ unsigned long scan_time_ns;
+ int pagemap_fd, n_normal_pages, n_huge_pages;
+
+ map_size *= MB;
+ size_t len = map_size;
+
+ len -= len % HPAGE_SIZE;
+ map_ptr_orig = mmap(NULL, len + HPAGE_SIZE, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
+ map_ptr = map_ptr_orig + HPAGE_SIZE - (uintptr_t)map_ptr_orig % HPAGE_SIZE;
+
+ if (map_ptr_orig == MAP_FAILED)
+ err(2, "initial mmap");
+
+ if (madvise(map_ptr, len + HPAGE_SIZE, MADV_HUGEPAGE))
+ err(2, "MADV_HUGEPAGE");
+
+ pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ if (pagemap_fd < 0)
+ err(2, "open pagemap");
+
+ n_normal_pages = 0;
+ n_huge_pages = 0;
+ for (void *p = map_ptr; p < map_ptr + len; p += HPAGE_SIZE) {
+ if (allocate_transhuge(p, pagemap_fd) < 0)
+ n_normal_pages++;
+ else
+ n_huge_pages++;
+ }
+ printf("Number of normal pages: %d\n", n_normal_pages);
+ printf("Number of huge pages: %d\n", n_huge_pages);
+
+ memset(map_ptr, '*', len);
+
+ if (clock_gettime(CLOCK_MONOTONIC_RAW, &start_time)) {
+ perror("clock_gettime");
+ goto err_out;
+ }
+ if (ksm_merge_pages(map_ptr, map_size, start_time, timeout))
+ goto err_out;
+ if (clock_gettime(CLOCK_MONOTONIC_RAW, &end_time)) {
+ perror("clock_gettime");
+ goto err_out;
+ }
+
+ scan_time_ns = (end_time.tv_sec - start_time.tv_sec) * NSEC_PER_SEC +
+ (end_time.tv_nsec - start_time.tv_nsec);
+
+ printf("Total size: %lu MiB\n", map_size / MB);
+ printf("Total time: %ld.%09ld s\n", scan_time_ns / NSEC_PER_SEC,
+ scan_time_ns % NSEC_PER_SEC);
+ printf("Average speed: %.3f MiB/s\n", (map_size / MB) /
+ ((double)scan_time_ns / NSEC_PER_SEC));
+
+ munmap(map_ptr_orig, len + HPAGE_SIZE);
+ return KSFT_PASS;
+
+err_out:
+ printf("Not OK\n");
+ munmap(map_ptr_orig, len + HPAGE_SIZE);
+ return KSFT_FAIL;
+}
+
static int ksm_merge_time(int mapping, int prot, int timeout, size_t map_size)
{
void *map_ptr;
@@ -541,7 +653,7 @@ int main(int argc, char *argv[])
bool merge_across_nodes = KSM_MERGE_ACROSS_NODES_DEFAULT;
long size_MB = 0;
- while ((opt = getopt(argc, argv, "ha:p:l:z:m:s:MUZNPC")) != -1) {
+ while ((opt = getopt(argc, argv, "ha:p:l:z:m:s:MUZNPCH")) != -1) {
switch (opt) {
case 'a':
prot = str_to_prot(optarg);
@@ -595,6 +707,9 @@ int main(int argc, char *argv[])
case 'P':
test_name = KSM_MERGE_TIME;
break;
+ case 'H':
+ test_name = KSM_MERGE_TIME_HUGE_PAGES;
+ break;
case 'C':
test_name = KSM_COW_TIME;
break;
@@ -647,6 +762,14 @@ int main(int argc, char *argv[])
ret = ksm_merge_time(MAP_PRIVATE | MAP_ANONYMOUS, prot, ksm_scan_limit_sec,
size_MB);
break;
+ case KSM_MERGE_TIME_HUGE_PAGES:
+ if (size_MB == 0) {
+ printf("Option '-s' is required.\n");
+ return KSFT_FAIL;
+ }
+ ret = ksm_merge_hugepages_time(MAP_PRIVATE | MAP_ANONYMOUS, prot,
+ ksm_scan_limit_sec, size_MB);
+ break;
case KSM_COW_TIME:
ret = ksm_cow_time(MAP_PRIVATE | MAP_ANONYMOUS, prot, ksm_scan_limit_sec,
page_size);
--
2.25.1