Advice on manual memory management scheme

AlexKindel

#20908

April 14, 2019

I had previously only ever taken the "call the language's default dynamic allocator whenever I want" approach to memory management, but I recently decided to try rewriting one of my existing programs in C, and figured it would be a good time to try out a more controlled approach to memory management at the same time. It would be helpful to get some outside thoughts on what I've considered so far.

The program is a computer algebra system. If you dig down far enough, almost everything in the program is done in terms of an Integer type, which contains an array of uint32_t's, which is dynamically-sized so that it can represent however large a value turns out to be necessary in order to complete the calculation. I usually want Integers to behave as if they were on the stack - in C++, I could make Integer a class, where the destructor cleans up the array on the heap, and call it good. My understanding is that the functional equivalent of this in C would be to allocate the array using malloc when the Integer is created, and call free on it when I'm done with it, but at least in the common case, I think I can do better.

What I have in mind is to allocate a big block of memory at startup, like Handmade Hero does. Every time someone needs a new Integer, they would put the dynamic array at the location of the memory pointer, then increment the pointer by the size of the array. Functions that need to create Integers for use in internal calculations but don't return anything that contains an Integer would take a pointer to the memory - that is, a local copy of the caller's memory pointer, which can be incremented in the function body without affecting the caller's copy. That way, any Integers the caller allocates after calling the function will overwrite the ones allocated inside the function.

On the other hand, functions that do return something that contains one or more Integers would take a pointer to the memory pointer, so that they could increment the caller's pointer to the end of the last array that the return value references. Often, generating the return value requires allocating additional Integers that are not themselves part of the return value. Ideally, the caller's memory pointer would not be incremented in such a way that the heap memory for these temporary values falls before the pointer, preventing the caller from overwriting it. If an upper bound on the size of the return value's heap data is known a priori, this could be done by leaving an empty block of that size at the initial location of the memory pointer before allocating the local values. If no such bound is known, all Integer heap data could be allocated in the order that it's needed for the calculation, and then the parts that are needed for the return value could be copied to the initial location of the memory pointer.

One uncertainty I have about this is how big a block of memory to allocate initially. In a video game like Handmade Hero, it makes sense for the creator to decide what they want the game to be like, then choose an amount of memory that will suffice to deliver that experience. The idea of choosing an amount of memory that will suffice to deliver the kinds of calculations I want the user to be able to do doesn't seem as sensible. I'd rather the user be able to do as intensive a calculation as is possible on their hardware. The simplest approach would seem to be allocating the largest block the OS will allow. That seems rude to say the least. Alternatively, I could do something like a C++ Vector, where if the currently-allocated block gets filled, I could copy the data to a new, larger one. If I did that, I suppose I'd want to allocate out of the block using offsets from the start rather than using pointers so all the pointers to parts of the old block wouldn't have to be updated.

Another thing I'm not sure about is a case that doesn't seem as well-behaved as the Integer one. There is a Number type, which is capable of representing various irrational numbers. This type includes a rational_approximation field, which contains two Integer ratios - a lower bound and an upper bound. Whenever someone happens to need a rational approximation, they access it through a function that takes a parameter that specifies an upper bound on what the distance between the upper and lower bound of rational_approximation needs to be. If the existing value of rational_approximation doesn't satisfy the bound, then the function replaces it with a new one that does before returning it. The problem is that, unlike Integers themselves, whose heap memory will always stay the same size once they have been created, the heap memory of a rational approximation might need to grow, if the Integers in its ratios need to be replaced with ones with larger arrays. Given that there is no upper bound on the amount of heap memory a given Number's rational approximation might need, I don't see a reasonable way to reconcile this with the stacklike approach I described for Integers. I could allocate a second large memory block for such cases, with a more complicated custom allocator that does searches for holes that can be filled and such, but at that point, I don't know if there would be any advantage to writing the allocator myself over just using malloc.

Edited by AlexKindel on April 14, 2019, 7:18pm

Ryan Fleury

#20916

April 15, 2019

Some others in the community might do a much better job of explaining this than I can (and they might give you better advice), but you have many options here. The important thing to consider first here, I think, is allocation lifetime variance, and allocation size variance. The reason why I say this is because if you can guarantee certain constraints, like uniform lifetimes or uniform sizes, the problem of allocation becomes much simpler.

If you can guarantee a case in which some operation requires uniform lifetimes but non-uniform sizes, you have a few options.

1. Memory Block Linked List

This case is particularly good when you tend to linearly iterate through your allocations (a good example of this is when allocating, then traversing, an abstract syntax tree, as the traversal of the abstract syntax tree tends to be in the same order in which it was allocated). In this case, you might have a couple of structures that looks something like this:

struct MemoryBlock
{
    int size;
    int alloc_position;
    void *memory;
    struct MemoryBlock *next;
};

struct Allocator
{
    struct MemoryBlock *active;
};

Each MemoryBlock is pretty much just a stack. When you run out of space, you allocate a new one, forming a linked list of memory blocks over time. When you are done, you can traverse the linked list and release each block. Alternatively, if you don't need to free memory during runtime, you can let the operating system release all of your memory, and never free anything.

Also in the category of "non-variable lifetimes, variable sizes", you have:

2. Virtual Memory Linear Arena

In this case, pages of physical memory are not actually reserved until your program accesses them. The operating system simply gives your program a section of virtual address space, and when you require a read or write on those pages, the operating system will physically reserve them. This is done through operating system calls like VirtualAlloc.

In this case, you can treat the reserved virtual memory as a single MemoryBlock in the previous example, where to allocate, you just bump a pointer forward.

In cases where you cannot guarantee uniform lifetimes, but you can guarantee uniform sizes, you have the following option:

3. Memory Pool

In this case, you can allocate a big block of memory that is subdivided into small chunks. You maintain a linked list, which is usually called the "free list". In this case, you just store the head of the linked list. At initialization time, all chunks in the memory block should be in the free list. When you need to allocate, you grab the head of the free list (some chunk of the memory), change the head to the previous head's next pointer, and use the chunk of memory.

When you need to free a chunk of memory, you simply replace the current head of the free list with the chunk to free, and set the new head's next pointer to the previous head.

In this case, when a memory chunk is free, you treat it as a node in a linked list:

struct MemoryChunk
{
    struct MemoryChunk *next;
};

But when a memory chunk is allocated, the caller that allocated that memory treats that block of memory however it wants (not aliasing it as a MemoryChunk).

Now, in the last case, you have variable sizes and variable lifetimes.

4. Variable-Sized Memory Pool

In this case, you subdivide a pre-allocated block of memory into variably-sized chunks. For example, you might have chunks for 16 bytes, 32 bytes, 64 bytes, 128 bytes, 256 bytes, and so on. Then, if an allocation call requires something bigger than the maximum chunk size, you can make an operating system call to reserve pages for it directly. Each memory block is managed by you in a similar fashion to that of the previously discussed memory pool.

In the variable-size, variable-length case, things can get complicated. Also keep in mind that, for simplicity, I didn't discuss alignment which is something you might want to look into if you are writing an allocator.

I am of the opinion that you should always simplify your problem wherever you can. If you can somehow reduce a problem to one where you have uniform lifetimes or uniform sizes, even if some amount of memory is wasted, that is preferable to a complex, slow system that doesn't waste any memory.

AlexKindel

#20920

April 15, 2019

Suppose a virtual memory linear arena would suffice for all of one's heap allocation needs. Would it be possible to reserve the largest theoretically allowed amount of memory (how much is that?), regardless of how much memory is currently available? If so, doing that, and committing one page at a time as needed seems like it would be the neatest way to pursue the goal of allowing the program to consume all available memory if necessary. I assume when no more memory is available, the next commit is the thing that would fail.

Could you define "variable lifetime"? I think I have a sense of what it means, but mental clarity would be helpful here. If I understand correctly, the rational approximations necessarily qualify as variable size, variable lifetime, in which case I'll need at least two memory blocks in which to do different kinds of allocations. I suppose for each block, I would estimate what fraction of the program's memory footprint is typically occupied by data that belongs in that block, and set the size to reserve to that fraction of the total amount reserved.

Edited by AlexKindel on April 16, 2019, 12:55am

Simon Anciaux

#20928

April 17, 2019

AlexKindel
I don't know if there would be any advantage to writing the allocator myself over just using malloc.

If malloc is working and you don't see any problem (measured performances or limiting the user possibilities) than there is no problem using malloc.

AlexKindel
Would it be possible to reserve the largest theoretically allowed amount of memory (how much is that?), regardless of how much memory is currently available? If so, doing that, and committing one page at a time as needed seems like it would be the neatest way to pursue the goal of allowing the program to consume all available memory if necessary.

On Windows, I believe you can use VirtualAlloc with MEM_RESERVE to reserve all the possible address space (8TB before Windows 8.1, 128TB since Windows 8.1). Reserved memory is not counted toward the system commit limit and can't be used until you call VirtualAlloc again with MEM_COMMIT, but reserved memory still has a memory cost because the system needs memory to store the page tables describing that memory. For example reserving 1GB will cost about 2MB in page tables. You can also decommit memory using VirtualFree with MEM_DECOMMIT. As those are system calls there might be some performance issue (I don't actually know) and you might want to use bigger size than a page size when you call them. The allocation granularity with VirtualAlloc is 64KB (see GetSystemInfo) meaning that if you try to reserve less than 64KB you'll waste the remaining memory of the 64KB.

You can't commit more memory that the system commit limit, which is the size of the physical memory plus the size of the paging file, and that's a lot smaller (than 8TB). You can query the system commit limit using GetPerformanceInfo. If you try to allocate more than the system commit limit, VirtualAlloc will fail. The system commit limit on Windows can change because by default the paging file is set to grow when needed up to some value (2 times physical memory size or something like that).

I don't know if you can reserve memory on other OSes. As far as I know, on linux, when you asked for a big chunk of memory, the kernel reserves it and it commits automatically when you actually use the memory. If you ask for too big of a chunk (I don't know what size that is), the allocation will just fail. And you can't decommit. If anyone as more information on that I would appreciate your knowledge.

Here are some video about Windows memory management:
Mysteries of Windows Memory Management Revealed (Virtual memory)
Mysteries of Windows Memory Management Revealed (Physical memory)
An article in the wiki: Memory Management.
Ginger Bill's articles: Memory allocation strategies.

AlexKindel
Could you define "variable lifetime"? I think I have a sense of what it means, but mental clarity would be helpful here. If I understand correctly, the rational approximations necessarily qualify as variable size, variable lifetime, in which case I'll need at least two memory blocks in which to do different kinds of allocations. I suppose for each block, I would estimate what fraction of the program's memory footprint is typically occupied by data that belongs in that block, and set the size to reserve to that fraction of the total amount reserved.

I think he meant "known" VS "unknown" lifetime meaning whether or not, when you allocate some memory, you know when it will not be needed anymore. For example a known lifetime could be the scope of a function, an iteration in a loop, the program lifetime... An unknown lifetime could be a lifetime that depends on user input, like pressing delete when something is not needed anymore.

I'm also trying to figure out memory allocations for myself so don't take these as fact.

Does your system really needs to support the maximum memory amount available ? Aren't there other constraints that would make it impossible/improbable and that could define the memory model more precisely ? I don't needs those answer but those are question that I think you should think about.

If I understood your description correctly, you seem to have three memory use case:
- Permanent storage: your growing stack;
- Temporary storage: memory used for computation that can be freed after the computation. It could be another growing stack that you reset to 0 after the computation;
- Storage for ratios that needs to be able to grow: could be a more involve memory allocator that allows you to free random memory. Or another stack if you don't mind wasting some space, in which case it could just be in the permanent storage. You could also try to reserve twice the memory needed initially for 1 number so you could grow it before needing to move it and see if that works for the most common cases. If those ratio when they are modified are always near the end of the stack, maybe it would be possible to always rewrite the end of the stack ?

Edited by Simon Anciaux on April 17, 2019, 3:08pm Reason: typo

Miguel Lechón

#20929

April 17, 2019

Edit: This post is wrong. Keep reading the thread to see why.

mrmixer

I don't know if you can reserve memory on other OSes. As far as I know, on linux, when you asked for a big chunk of memory, the kernel reserves it and it commits automatically when you actually use the memory. If you ask for too big of a chunk (I don't know what size that is), the allocation will just fail. And you can't decommit. If anyone as more information on that I would appreciate your knowledge.

On linux you can reserve memory with mmap with flags=MAP_PRIVATE|MAP_ANONYMOUS (to avoid using a backing file) and protection=PROT_NONE (to specify that the memory cannot be used). Then, when the time comes to commit some of that reserved memory, you call mprotect with protection=PROT_READ|PROT_WRITE. Decommiting is also an mprotect call with protection=PROT_NONE.

Here's a copy-and-paste of the relevant code from a project of mine:

// NOTE: The base argument should be set to NULL unless you want to hardcode the virtual 
//       memory address, and you should check first that that address is available
internal void * lin64_platform_reserve_memory(void *base, U64 size){
    int pagesize = sysconf(_SC_PAGE_SIZE); 
    ASSERT(pagesize != -1);
    ASSERT(!(size & (pagesize-1)));
    void *result = mmap(base, size, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, (off_t)0);
    if(result == MAP_FAILED){
        result = 0;
    }
    return result;
}

internal void lin64_platform_commit_memory(void *address, U64 size){
    int ret = mprotect(address, size, PROT_READ|PROT_WRITE);

    if(ret){
        ASSERT(!"mprotect failed");
    }
}

internal void lin64_platform_release_memory(void *address, U64 size){
    int ret = mprotect(address, size, PROT_NONE);

    if(ret){
        ASSERT(!"mprotect failed");
    }
}

Edited by Miguel Lechón on April 21, 2019, 8:19am Reason: Damage control

Simon Anciaux

#20931

April 17, 2019

I remember trying that. But, if I remember correctly, you can't remove the MAP_ANONYMOUS flag and have the paging file used for that memory so if you run out of physical memory the allocations will fail ?

Also I wrote a comment in my code saying that mprotect with PROT_NONE will only prevent any access to the memory but might not actually free it (and since there is no paging file, it can't be paged out). Am I wrong about that ?

Mārtiņš Možeiko

#20933

April 17, 2019

Having no MAP_ANONYMOUS means that memory is backed to actual real file, right? And doesn't OS simple discard physical memory mapped to virtual memory which is backed to actual file (in case system is running out of physical memory)? Because it can always reload content of this memory page from actual file in case you are accessing the memory. That's my understanding how it works.

Miguel Lechón

#20937

April 18, 2019

mrmixer
I remember trying that. But, if I remember correctly, you can't remove the MAP_ANONYMOUS flag and have the paging file used for that memory so if you run out of physical memory the allocations will fail ?

Also I wrote a comment in my code saying that mprotect with PROT_NONE will only prevent any access to the memory but might not actually free it (and since there is no paging file, it can't be paged out). Am I wrong about that ?

Both are good points and I'll make some time this weekend to test them experimentally and report back.

Simon Anciaux

#20939

April 18, 2019

@mmozeiko I think I'm mixing up memory backed by a file and virtual memory using the swap partition. I thought that MAP_ANONYMOUS would allocate memory that wouldn't go to the swap partition if necessary.

After reading the mmap man page, it seems that there is a MAP_NORESERVE flag, which would suggest that by default the mapping is always backed by the swap file/partition if it is not set. So does that mean that unless MAP_NORESERVE is set, all memory allocations are counted toward the system commit limit (search for /proc/sys/vm/overcommit_memory in the proc man page); and if you set it, it's not counted toward the system commit limit but if your application use more than the available physical memory it will just be killed (as stated in the bug section of mmap) ? That doesn't make sens to me. Unless pages that are never accessed don't count toward system commit limit ? I didn't read all of the proc man page, maybe there are some explanations about how virtual memory works on linux.

@debiatan Thanks, that would be useful. If you write some code, could you share it ?

ratchetfreak

#20940

April 18, 2019

if you want to avoid it going to the swap you need to pin the memory. However it's usecase is security based (avoiding secrets ending up on disk instead of performance). Once you start needing swapping you are hose with regards to memory read speeds anyway.

Miguel Lechón

#20941

April 18, 2019

I think that the MAP_ANONYMOUS flag just means that instead of using mmap for its apparent purpose (mapping a file to memory), the caller wants a portion of memory with no associated file, but that does not mean that the memory is prevented from swapping to disk.

Regarding my use of mprotect(..., PROT_NONE), I think it's incorrect (as mrmixer points out) and that the intended result is actually accomplished through madvise(..., MADV_DONTNEED).

@mrmixer Yes, I'll link the code for my tests once I'm done. My first comment on this thread was so stackoverflowish that it's the least I can do.

Miguel Lechón

#20958

April 21, 2019

I'm back with some experimental results. I've only tested one machine and one kernel, so don't take my conclusions as universal. The code is here.
Run 'make' to generate these three binaries: test_mmap_swap, test_mmap_swap_unused_memory and test_mmap_decommit. Here's what they do:

test_mmap_swap:
This test shows that anonymous mappings are backed by swap memory. It does so by allocating and filling 128 MiB blocks of RAM using mmap(..., MAP_ANONYMOUS) until some of the memory of the process is swapped. I detect swap use by reading /proc/self/status:VmSwap between mmap calls. Before running this test, be aware that your kernel might swap memory of other processes before reaching that of the test, so don't be surprised if some applications freeze momentarily.

test_mmap_swap_unused_memory:
This test shows that unused mmapped memory does count towards the system overcommit limit. It allocates 128 MiB blocks of memory without ever using them and inspects /proc/meminfo:Committed_AS between mmap calls. The exact amount of memory that you'll be able to reserve depends on your overcommit configuration:
- With /proc/sys/vm/overcommit_memory=1 (unlimited overcommit) the kernel allows to allocate 128 TiB on x64, the full 47 bits of user-space virtual memory.
- With /proc/sys/vm/overcommit_memory=2 (strict overcommit) the kernel respects /proc/sys/vm/overcommit_kbytes if nonzero or /proc/sys/vm/overcommit_ratio otherwise. This mode is not great if you rely on overcommitment as an allocation strategy.
- With /proc/sys/vm/overcommit_memory=0 (the default configuration) the kernel uses a heuristic to decide which allocations are OK. I think there is no official explanation of what that heuristic does (a sad state of affairs) and that we can only rely on reading the kernel source (in particular the __vm_enough_memory function in mm/util.c, under the sysctl_overcommit_memory == OVERCOMMIT_GUESS if branch).
In my case, the kernel still allows me to allocate the full 128 TiB in this mode when I have plenty of free available physical memory.

test_mmap_decommit:
This test shows that you can decommit and recover physical pages for anonymous mmapped memory by calling madvise(..., MADV_DONTNEED) to tell the kernel to discard the contents of a range of memory. There's no need to munmap that memory. The kernel will start committing pages in that range as the user reuses them.

Simon Anciaux

#20959

April 21, 2019

Thanks. Did you try MAP_NORESERVE (and PROT_NONE maybe) ?

Miguel Lechón

#20960

April 21, 2019

I haven't tried MAP_NORESERVE. Since I'm already already getting the full 128 TiB range user-space can provide (with overcommit_memory=0), I don't expect to see any difference in the result, unless I run more complex tests with some other processes consuming memory. The results for the other two overcommit modes should remain the same.

AlexKindel

#20963

April 22, 2019

mrmixer
If I understood your description correctly, you seem to have three memory use case:
- Permanent storage: your growing stack;
- Temporary storage: memory used for computation that can be freed after the computation. It could be another growing stack that you reset to 0 after the computation;
- Storage for ratios that needs to be able to grow: could be a more involve memory allocator that allows you to free random memory. Or another stack if you don't mind wasting some space, in which case it could just be in the permanent storage. You could also try to reserve twice the memory needed initially for 1 number so you could grow it before needing to move it and see if that works for the most common cases. If those ratio when they are modified are always near the end of the stack, maybe it would be possible to always rewrite the end of the stack ?

In more detail: the program runs a loop where it accepts a text input from the console, parses it into a binary tree and, as its primary task, edits the tree according to certain rules. I haven't chosen to constrain the length of the input, but even if I did, the number of nodes a given tree will end up with once the editing is done isn't knowable in advance, so the nodes have to go on the heap. Nodes might have to be removed or inserted at any point in the tree, but the nodes are all the same size, so a free list is probably appropriate.

Most of the allocation the editing routine needs to do aside from nodes is for temporary Integers and dynamically-sized arrays of Integers that can be treated as lexically scoped. A stack should indeed work for them.

A trickier case is a set of potentially-expensive-to-calculate values associated with a node, which are stored on the node so they don't have to be recalculated if they're needed multiple times over the node's lifetime, and are set to null until they're requested for the first time. This includes the rational approximations, and a polynomial. The approximations might need to be refined, as described before, at any point during the node's lifetime, so it unfortunately can't be ensured that they will always be at the end of a stack when modified. The other suggestions you made are all possible though. It is indeed the case that value arrays only one element long is the common case for Integers, including the ones in approximations. The polynomial would seem to be a simpler case, because once it has been calculated, it stays the same for the lifetime of the node. Unfortunately, its memory footprint still isn't knowable in advance, so it isn't possible to let the free list system for the nodes take care of it. I don't see a way to leverage the stability of the polynomial to allow for a simpler scheme than the approximations need. One last consideration is that either could in principle be moved into the lexically-scoped, stack-managed category by requiring that they be recalculated with every use rather than saving them. It's a question of whether the redundant calculation would be cheaper than a more complicated memory management scheme.

Finally, while the tree manipulation happens in a loop, and each tree only lasts for its iteration, there are a few things that last across iterations; a rational approximation of pi, an array of prime Integers that grows as needed, and a similarly growable set of trees, similar to the ones being edited, complete with polynomial and rational approximations. The primes never need to be freed, so they can just go in a linear arena. The trees, unlike the user input ones, never have nodes added or deleted, so they can go in a linear arena too. The polynomials and approximations, I see no advantage to treating any differently than the ones local to the loop.

mrmixer
Does your system really needs to support the maximum memory amount available ? Aren't there other constraints that would make it impossible/improbable and that could define the memory model more precisely ? I don't needs those answer but those are question that I think you should think about.

Can you give an example of such a constraint in another program? I don't see a clear rationale for choosing any particular hard memory usage cap. Is the idea that a program is likely to run out of some other resource before it runs out of heap memory, so one might as well estimate how much memory would be in use when that happens, and set the cap there?

I get the impression reserving the maximum allowed amount of memory is an unusual thing to do, in any case. I could dodge the issue of how much to reserve by taking the approach where I reserve only a page at a time at the same time as I commit the page, and chain the possibly noncontiguous-in-virtual-memory pages in a linked list. I didn't favor that idea because I'd have to add extra machinery to account for arrays that cross page boundaries.