All modern operating systems provide similar memory management services. This article will discuss these services and their internals.
Before you read this article, I recommend you brush up on your powers of 2.
Memory spaces
Each process has its own "virtual" memory address space.
None of these processes can directly access each other's memory.
This is done for two major reasons:
- Security - a process might deal with sensitive information like passwords, so it's best if others cannot access it
- Reliability - if one process crashes, it can't corrupt the memory of other processes as it does so
When a process tries to access memory, the processor will first translate the address it uses.
```cpp
void IncrementVariable(int *pointer) {
	int value = *pointer; // Read from memory
	value++;
	*pointer = value; // Write to memory
}
```
In a 32-bit program, the pointer will be a 32-bit number. In a 64-bit program, the pointer will be a 64-bit number.
This number is translated to some index into physical memory.
Not every address has a translation; attempting to access an unmapped address will result in a page fault.
Unless processes are sharing memory (which we'll talk about later), a given address that is valid in multiple processes will be translated into a different byte of physical memory in each of them.
The translation, or mapping, of virtual memory to physical memory is usually done in fixed-size blocks called pages, for reasons we'll discuss shortly.
On many modern processors, such as AMD64, these pages are 4KB each in size.
Note how the orange and green blocks (the first two pages in Process 2) are virtually contiguous - they might represent an 8KB array - but in RAM (where they are the second and third pages) there is a gap between them. This means that even when physical memory becomes fragmented, processes can still allocate virtually contiguous regions of memory.
Page mappings are usually stored in some form of sparse structure. Each mapping needs to store the physical page number it refers to. The processor then can look up a virtual page in the structure to get a physical page number.
Process 1: 0 -> 4, 1 -> invalid, 2 -> 0, 3 -> invalid, 4 -> invalid, etc. Process 2: 0 -> 1, 1 -> 3, 2 -> invalid, 3 -> invalid, 4 -> invalid, etc.
So when process 1 accesses the blue block (the leftmost, page 0) the processor knows to translate this to physical page 4. And when process 2 accesses the orange block (the leftmost, page 0) the processor knows to translate this to physical page 1.
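To make this concrete, here's a small sketch of how a page-granular translation works. This is only a software model, not the processor's actual mechanism; LookupPhysicalPage and HandlePageFault are hypothetical helpers for the example.

```cpp
#define PAGE_SIZE 4096

// Hypothetical helpers for this sketch.
uintptr_t LookupPhysicalPage(uintptr_t virtualPageNumber); // Returns (uintptr_t) -1 if unmapped.
void HandlePageFault(uintptr_t virtualAddress);

uintptr_t TranslateAddress(uintptr_t virtualAddress) {
	uintptr_t virtualPageNumber = virtualAddress / PAGE_SIZE; // Which page is being accessed?
	uintptr_t offsetInPage = virtualAddress % PAGE_SIZE;      // Where within that page?

	uintptr_t physicalPageNumber = LookupPhysicalPage(virtualPageNumber);

	if (physicalPageNumber == (uintptr_t) -1) {
		HandlePageFault(virtualAddress); // No translation - the processor raises a page-fault exception.
	}

	// The offset within the page is the same in virtual and physical memory.
	return physicalPageNumber * PAGE_SIZE + offsetInPage;
}
```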
AMD64 page tables
AMD64 processors have 48-bit address spaces. This means there's 256TB of virtual memory in each process; at 4KB per page, that's 64 billion pages. If we use 8 bytes per translation (to store the number of the physical page), we'd need 512GB of RAM - per process!
Obviously, it's therefore necessary to store the translations in a sparse structure. AMD64 processors use a tree-like system of page tables.
Each page table is exactly 4KB, the same as the size of a page. Since each entry in a page table is 8 bytes, a page table will contain 512 entries. Every entry in a page table will either refer to another page table, or to a physical page containing data.
The page tables are arranged in a 4-level tree.
- Every process has one page-map level-4 (PML4) table, with 512 entries (PML4Es). The PML4 covers the entire 256TB address space.
- Each PML4E points to a page-directory-pointer (PDP) table, with 512 entries (PDPEs). Each PDP covers 512GB of the address space (2^48 / 512 = 2^39 = 512GB).
- Each PDPE points to a page-directory (PD) table, with 512 entries (PDEs). Each PD covers 1GB of the address space (2^39 / 512 = 2^30 = 1GB).
- Each PDE points to a page table (PT), with 512 entries (PTEs). Each PT covers 2MB of the address space (2^30 / 512 = 2^21 = 2MB).
- Each PTE points to a page of RAM - which hence is 4KB (2^21 / 512 = 2^12).

Read that through until you understand it.
During address translation, the processor breaks up a virtual address into 5 sections: the PML4 index (bits 47-39), the PDP index (bits 38-30), the PD index (bits 29-21), the PT index (bits 20-12), and the offset within the page (bits 11-0).
We need 9 bits to index into a page table, since each table contains 512 entries. The index is multiplied by 8 bytes to get the byte offset of the entry within the table, since each entry is 8 bytes. (512 * 8 = 4096, the size of a page [table].)
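As an illustration (a sketch, not processor-defined code), extracting the four table indices and the page offset from a 48-bit virtual address looks like this:

```cpp
struct VirtualAddressParts {
	uint64_t pml4Index, pdpIndex, pdIndex, ptIndex, offset;
};

VirtualAddressParts SplitVirtualAddress(uint64_t virtualAddress) {
	VirtualAddressParts parts;
	parts.offset    = (virtualAddress >>  0) & 0xFFF; // Bits 0-11:  byte offset within the 4KB page.
	parts.ptIndex   = (virtualAddress >> 12) & 0x1FF; // Bits 12-20: index into the page table (PT).
	parts.pdIndex   = (virtualAddress >> 21) & 0x1FF; // Bits 21-29: index into the page directory (PD).
	parts.pdpIndex  = (virtualAddress >> 30) & 0x1FF; // Bits 30-38: index into the PDP.
	parts.pml4Index = (virtualAddress >> 39) & 0x1FF; // Bits 39-47: index into the PML4.
	return parts;
}
```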
The processor locates the PML4 table using a special register called CR3, which contains the physical address of the page containing the PML4 table.
When the running process changes, the operating system will update this register.
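For instance, a context switch might reload CR3 like this. This is only a sketch; the Process structure and its pml4PhysicalAddress field are made-up names for the example.

```cpp
struct Process {
	uintptr_t pml4PhysicalAddress; // The physical address of this process's PML4 table.
	// ... other scheduling state ...
};

// Point the processor at the new process's page tables.
// After this, all virtual addresses are translated using the new process's mappings.
void SwitchAddressSpace(Process *process) {
	asm volatile ("mov %0, %%cr3" : : "r" (process->pml4PhysicalAddress) : "memory");
}
```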
Now let's look at the general format of an entry in one of these tables:
```cpp
struct Entry {
	uint64_t present : 1,      // Indicates the entry contains a valid mapping.
	                           // When a process attempts to access an invalid address, a page-fault exception occurs.
	         writable : 1,     // Set if writable.
	                           // All 4 entries leading to a page of RAM must have this bit set to allow writes.
	         user : 1,         // Clear if only the kernel can access the page.
	                           // All 4 entries leading to a page of RAM must have this bit set to allow user access.
	         writethrough : 1, // Writes to this page don't go into the cache.
	         cacheDisable : 1, // Reads from this page don't come from the cache.
	         accessed : 1,     // Set when the page is accessed.
	         dirty : 1,        // Set when the page is modified. We'll discuss this and the accessed bit later.
	         pageSize : 1,     // Used for large pages. Discussed later.
	         global : 1,       // Don't remove the page from the TLB when CR3 changes. Discussed later.
	         available : 3,    // Three bits the OS can use freely to store extra information about the translation.
	         physicalPageIndex : 40,
	         available2 : 11,  // More bits available for the OS to use.
	         noExecute : 1;    // The page cannot be used as code.
};
```
This all fits neatly into an 8-byte entry.
So when the processor needs to look up a translation, it first visits the PML4, then a PDP, then a PD, and finally a PT. If the present bit is not set in any of the entries, or the writable/user/noExecute bits indicate the access is invalid, a page fault exception occurs, alerting the OS's memory manager.
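Putting the pieces together, a software model of that walk might look like the sketch below. ReadPhysicalQword is a hypothetical helper that reads an 8-byte entry from physical memory; only the present bit and the physical page index are checked here.

```cpp
uint64_t ReadPhysicalQword(uint64_t physicalAddress); // Hypothetical helper for this sketch.

// Walk the PML4, PDP, PD and PT to translate a virtual address.
// Returns (uint64_t) -1 if any entry along the way is not present.
uint64_t TranslateVirtualAddress(uint64_t cr3, uint64_t virtualAddress) {
	uint64_t tablePhysicalAddress = cr3 & ~0xFFFULL; // CR3 holds the physical address of the PML4.

	for (int level = 3; level >= 0; level--) {
		uint64_t index = (virtualAddress >> (12 + 9 * level)) & 0x1FF;       // 9 bits per level.
		uint64_t entry = ReadPhysicalQword(tablePhysicalAddress + index * 8); // Each entry is 8 bytes.

		if (!(entry & 1)) {
			return (uint64_t) -1; // Present bit clear - the processor would raise a page fault.
		}

		// Bits 12-51 of the entry give the physical address of the next table
		// (or, at the last level, of the page of data itself).
		tablePhysicalAddress = entry & 0x000FFFFFFFFFF000ULL;
	}

	return tablePhysicalAddress + (virtualAddress & 0xFFF); // Add the offset within the page.
}
```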
Allocating virtual memory
See also: Making a Heap Allocator
Committing memory and working sets
When you allocate memory from the OS, you are committing memory. The OS keeps track of a commit charge - how much of the physical memory in the system has been committed. When you try to allocate memory, it will check whether the commit charge would exceed the amount of memory available. If it would, the allocation fails. Otherwise, the commit charge is increased by the amount allocated, and decreased when the memory is freed.
However, committing memory does not actually allocate physical memory. It just guarantees that, when the physical memory is eventually needed, allocating it will not fail.
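A minimal sketch of that bookkeeping could look like this. The names commitLimit, CommitPages and DecommitPages are made up for the example.

```cpp
uintptr_t commitCharge; // How many pages have been committed so far.
uintptr_t commitLimit;  // How many pages the system can promise to back with physical memory.
Mutex commitMutex;      // Synchronises access to the commit charge.

// Returns false if the OS cannot guarantee the memory will be available when accessed.
bool CommitPages(uintptr_t pageCount) {
	MutexAcquire(commitMutex);
	bool success = commitCharge + pageCount <= commitLimit;
	if (success) commitCharge += pageCount;
	MutexRelease(commitMutex);
	return success;
}

void DecommitPages(uintptr_t pageCount) {
	MutexAcquire(commitMutex);
	commitCharge -= pageCount;
	MutexRelease(commitMutex);
}
```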
The amount of physical memory that has been mapped into your process's virtual address space is called the working set. Physical memory is only placed into the working set as necessary: when the memory is accessed.
For example:
```cpp
// Allocate 1MB of virtual address space.
void *address = MemoryReserve(0x100000);

// Commit 1MB of physical memory to the address.
bool success = MemoryCommit(address, 0x100000);

if (!success) {
	Print("The OS is unable to guarantee that it will have enough physical memory when we need it.\n");
	return;
}

// Note that no memory has been mapped into our paging tables yet.
// We have to access the memory.

// Fill the megabyte with the byte 0xCC.
MemorySetBytes(address, 0xCC, 0x100000);

// When the above function runs,
// the processor will generate a page-fault exception at the start of every new 4KB page,
// alerting the OS that we tried to access an address that was not mapped.
// The OS will then allocate a page of physical memory and map it into our address space.
// Our working set will increase, and we can then access the memory proper.
```
By delaying actually allocating the memory until we access it, the OS's memory manager can perform some neat tricks which we'll discuss later.
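For instance, here's a rough sketch of what the page-fault handler might do the first time a committed page is touched. IsAddressCommitted, MapPageIntoAddressSpace and CrashProcess are hypothetical helpers; PhysicalMemoryAllocatePage is implemented in the next section.

```cpp
bool IsAddressCommitted(uintptr_t virtualPage);                              // Hypothetical helpers
void MapPageIntoAddressSpace(uintptr_t virtualPage, uintptr_t physicalPage); // for this sketch.
void CrashProcess();
uintptr_t PhysicalMemoryAllocatePage(); // Implemented later in the article.

void HandlePageFault(uintptr_t faultingAddress) {
	uintptr_t virtualPage = faultingAddress & ~(uintptr_t) 0xFFF; // Round down to the page boundary.

	if (!IsAddressCommitted(virtualPage)) {
		CrashProcess(); // The process accessed memory it never committed.
		return;
	}

	// Guaranteed to succeed, because the commit charge was increased up front.
	uintptr_t physicalPage = PhysicalMemoryAllocatePage();

	// Fill in the paging tables; the process's working set grows by one page.
	MapPageIntoAddressSpace(virtualPage, physicalPage);

	// The faulting instruction is then retried, and this time the translation succeeds.
}
```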
Basic physical memory management
When your OS boots, it will receive a memory map from the bootloader. This will essentially be a list of the usable regions of RAM:
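The exact format depends on the bootloader, but conceptually each region is just a base address and a length. A minimal representation (the names here are only illustrative) might be:

```cpp
struct MemoryRegion {
	uintptr_t baseAddress; // The physical address at which the usable region starts.
	uintptr_t pageCount;   // How many 4KB pages the region contains.
};

MemoryRegion *memoryRegions; // The memory map passed to us by the bootloader.
uintptr_t memoryRegionCount; // The number of usable regions.
```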
Here is the output from an emulator with 256MB of RAM:
We'll keep track of all the free physical pages in a singly linked list. So we need to make an array containing an entry for every page up to the highest address in the memory map.
```cpp
struct PhysicalPage {
	enum {
		FREE,    // The page is available for use.
		PRIVATE, // The page is in use by a process.
	} state;

	union {
		struct {
			uintptr_t next; // The next free page in the linked list.
			                // Set to -1 to indicate the end of the list.
		} free;
	};
};

PhysicalPage *physicalPages; // The physical page database.
uintptr_t firstFreePage;     // The index of the first free page.
Mutex freePageListMutex;     // A mutex to synchronise access to the free page list.
```
When the physical memory manager (PMM) initialises the physicalPages array, it can construct the linked list of free pages.
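For example, assuming the illustrative MemoryRegion list from earlier (and that the physicalPages array itself has already been allocated somewhere suitable), initialisation might look like this sketch:

```cpp
void PhysicalMemoryInitialise() {
	firstFreePage = (uintptr_t) -1; // The free list starts empty.

	for (uintptr_t i = 0; i < memoryRegionCount; i++) {
		MemoryRegion *region = memoryRegions + i;
		uintptr_t firstPage = region->baseAddress / 4096;

		for (uintptr_t j = 0; j < region->pageCount; j++) {
			uintptr_t page = firstPage + j;

			// Mark the page as free, and push it onto the front of the free list.
			physicalPages[page].state = PhysicalPage::FREE;
			physicalPages[page].free.next = firstFreePage;
			firstFreePage = page;
		}
	}
}
```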
So we can implement very simple allocate and free functions:
```cpp
uintptr_t PhysicalMemoryAllocatePage() {
	MutexAcquire(freePageListMutex);

	uintptr_t page = firstFreePage;

	if (page == (uintptr_t) -1) {
		KernelPanic("Out of physical memory\n");
	}

	PhysicalPage *entry = physicalPages + page;

	if (entry->state != PhysicalPage::FREE) {
		KernelPanic("Corrupt physical page database\n");
	}

	firstFreePage = entry->free.next;
	entry->state = PhysicalPage::PRIVATE;

	MutexRelease(freePageListMutex);
	return page;
}

void PhysicalMemoryFreePage(uintptr_t page) {
	MutexAcquire(freePageListMutex);

	PhysicalPage *entry = physicalPages + page;

	if (entry->state == PhysicalPage::FREE) {
		KernelPanic("Attempt to free a page multiple times\n");
	}

	entry->state = PhysicalPage::FREE;
	entry->free.next = firstFreePage;
	firstFreePage = page;

	MutexRelease(freePageListMutex);
}
```
Note that the allocation function does not expect to ever run out of memory, since, as explained above, the OS should not allow the commit charge to go over the limit of physical memory.
Here is a diagram showing the lifecycle of a physical page:
Zeroing memory
For security reasons, we can't give uninitialised memory to user (non-kernel) processes. This means we have to zero pages before we map them into a user's address space.
```cpp
if (forUser) {
	ZeroPhysicalPage(page);
}
```
This makes allocating physical memory slower, since zeroing out the page takes time. Instead, why don't we zero pages during CPU idle time?
We'll now introduce a thread, running at the lowest possible priority, that:
- removes pages from the free list
- zeroes them
- and adds them to a zeroed list
So, when we allocate a page:
- if it's for the kernel/a driver, we first check the free list, then the zeroed list
- if it's for a user process, we first check the zeroed list, then the free list
The kernel should use the free list for its own allocations, since it won't be a security problem for the kernel to receive non-zeroed pages. If the free list is empty, then we can use the zeroed list. If we're allocating for a user process and the zeroed list is empty, we'll take a page from the free list and zero it on demand. This is why the page-zeroing thread runs at the lowest possible priority - we'll have to zero the pages at some point anyway, but there's no point spending CPU time on it unless the CPU would otherwise be idle.
Here's the new page lifecycle diagram:
As far as our "implementation" is concerned, again - it's not very complicated.
We add the new page state, and create a new singly linked list of zeroed pages.
```cpp
struct PhysicalPage {
	enum {
		FREE,    // The page is available for use.
		PRIVATE, // The page is in use by a process.
		ZEROED,  // The page is available for use, AND contains all zeroes.
	} state;

	union {
		struct {
			uintptr_t next; // The next free page in the linked list.
			                // Set to -1 to indicate the end of the list.
		} free;

		struct {
			uintptr_t next; // The next zeroed page in the linked list.
		} zeroed;
	};
};

PhysicalPage *physicalPages; // The physical page database.
uintptr_t firstFreePage;     // The index of the first free page.
uintptr_t firstZeroedPage;   // The index of the first zeroed page.
Mutex freePageListMutex;     // A mutex to synchronise access to the free page list.
Mutex zeroedPageListMutex;   // A mutex to synchronise access to the zeroed page list.
```
We can then update PhysicalMemoryAllocatePage. (FreePage doesn't need to change.)
```cpp
uintptr_t PhysicalMemoryAllocatePage() {
	MutexAcquire(zeroedPageListMutex);

	uintptr_t page = firstZeroedPage;

	if (page != (uintptr_t) -1) {
		PhysicalPage *entry = physicalPages + page;

		if (entry->state != PhysicalPage::ZEROED) {
			KernelPanic("Corrupt physical page database\n");
		}

		firstZeroedPage = entry->zeroed.next;
		entry->state = PhysicalPage::PRIVATE;

		MutexRelease(zeroedPageListMutex);
		return page;
	}

	MutexRelease(zeroedPageListMutex);

	MutexAcquire(freePageListMutex);

	page = firstFreePage;

	if (page != (uintptr_t) -1) {
		PhysicalPage *entry = physicalPages + page;

		if (entry->state != PhysicalPage::FREE) {
			KernelPanic("Corrupt physical page database\n");
		}

		firstFreePage = entry->free.next;
		entry->state = PhysicalPage::PRIVATE;

		MutexRelease(freePageListMutex);

		ZeroPhysicalPage(page);
		return page;
	}

	MutexRelease(freePageListMutex);

	KernelPanic("Out of physical memory\n");
}
```
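Finally, here's what the page-zeroing thread itself might look like - a minimal sketch, assuming a hypothetical ThreadYield primitive that gives the CPU back to the scheduler when there's nothing left to zero:

```cpp
void ThreadYield(); // Hypothetical: let other threads run (or the CPU idle) for a while.

// Runs at the lowest possible priority, converting free pages into zeroed pages.
void ZeroPageThread() {
	while (true) {
		MutexAcquire(freePageListMutex);
		uintptr_t page = firstFreePage;

		if (page == (uintptr_t) -1) {
			// Nothing to zero right now.
			MutexRelease(freePageListMutex);
			ThreadYield();
			continue;
		}

		firstFreePage = physicalPages[page].free.next; // Pop from the free list.
		MutexRelease(freePageListMutex);

		ZeroPhysicalPage(page);

		MutexAcquire(zeroedPageListMutex);
		physicalPages[page].state = PhysicalPage::ZEROED;
		physicalPages[page].zeroed.next = firstZeroedPage; // Push onto the zeroed list.
		firstZeroedPage = page;
		MutexRelease(zeroedPageListMutex);
	}
}
```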