Introduction
I never cease being amazed by what you can find by browsing the MFC source code files. My latest discovery is rather spectacular. It consists of a small utility class that allows you to modify how objects for a given class are dynamically allocated. This class is CFixecAlloc. It is used inside the famous CString class to allocate string buffers. Using CFixedAlloc with the existing code requires very minimal changes and yields surprising improvements in performance. This article will discuss CRT dynamic allocation overheads, usual alternatives to CRT shortcomings, the anatomy of the CFixedAlloc class, how to use CFixedAlloc in your programs, and finally, a demo application so you can see for yourself what CFixedAlloc can do for your programs.
Why Use Alternatives to malloc/free, new/delete?
It is a well-known fact that dynamic memory allocation is expensive. It incurs performance overhead both in execution time and space. If you need to be convinced, just take a peek in the malloc.c file provided with the CRT source code files. The malloc() function is a somewhat complex function that takes some time to execute. Concerning the space overhead, the subject is a bit complex. Here is an overview. Depending on the version of VC++ you are using when you compile your program, and the Windows version where your program is running, a different version of heap will be used. This is out of scope for this article, but if you are curious about how CRT chooses the heap version, you can find the function __heap_select() in the heapinit.c file. In the CRT winheap.h header file, three heap versions are defined:
- __SYSTEM_HEAP
- __V5_HEAP
- __V6_HEAP
The V5 and V6 heap versions map to heap implementations private to the CRT library, whereas the system heap maps directly to the Win32 heap services (HeapCreate(), HeapAlloc(), and so forth). What is interesting about the V5 and V6 versions is that you can know the space overhead for these heaps exactly by looking at the CRT source code. For example, you can find this statement for the V6 heap:
// add 8 bytes entry overhead and round up to next para size sizeEntry = (intSize + 2 * (int)sizeof(int) + (BYTES_PER_PARA - 1)) & ~(BYTES_PER_PARA - 1);
BYTES_PER_PARA equals 16 bytes. This is a huge overhead. Consider the case where you would request twelve bytes. The CRT would reserve 32 bytes for this request. This is more than double what you requested! Unfortunately (or fortunately; it depends on your perspective), the V5 and the V6 heap are nowadays rarely used and you cannot know for sure what the HeapAlloc() overhead is because you do not have access to this source code. However, it is a very good assumption that there is an overhead. Microsoft states this fact in the HeapCreate() documentation:
The system uses memory from the private heap to store heap support structures, so not all of the specified heap size is available to the process. For example, if the HeapAlloc function requests 64 kilobytes (K) from a heap with a maximum size of 64K, the request may fail because of system overhead.
Limitations of Other Alternatives
Some people have written memory pool classes to work around the CRT heap overhead. The excellent article from Uri Twig is an example of such a class. However, most of them require massive code change to replace all new and delete invocations to some calls to the buffer pool class member functions. Another common potential limitation (“potential” because it depends on your needs) is that they lack thread safety. CFixedAlloc addresses these limitations and it will be explained in the next section.
Anatomy of CFixedAlloc
You can start by viewing the class declaration:
class CFixedAlloc { // Constructors public: CFixedAlloc(UINT nAllocSize, UINT nBlockSize = 64); // Attributes UINT GetAllocSize() { return m_nAllocSize; } // Operations public: void* Alloc(); // return a chunk of memory of // nAllocSize void Free(void* p); // free chunk of memory returned // from Alloc void FreeAll(); // free everything allocated from // this allocator // Implementation public: ~CFixedAlloc(); protected: struct CNode { CNode* pNext; // only valid when in // free list }; UINT m_nAllocSize; // size of each block from Alloc UINT m_nBlockSize; // number of blocks to get at a time CPlex* m_pBlocks; // linked list of blocks // (is nBlocks*nAllocSize) CNode* m_pNodeFree; // first free node // (NULL if no free nodes) CRITICAL_SECTION m_protect; };
As you can see, it is a very simple class. The first thing that you can notice is that the class provides thread safety with a critical section. Next, m_AllocSize contains the object size of the class that will be using its service, and m_blockSize specifies the number of objects each fixed block of memory can contain. Both data members are set at the construction. The remaining unknown is the mysterious CPlex pointer. I will not go in to the details of this class, but just know that it is the class that does the actual CRT dynamic allocation calls. It only contains a pointer to another CPlex object to create a linked list of CPlex so its size is only 4 bytes. When allocating memory, it requests:
m_allocSize*m_blockSize+sizeof(CPlex)
If you are interested in knowing more about CPlex, I recommend you to read the book MFC Internals, which discusses CPlex in more details. With this information, you already can compare the memory overhead difference between CRT and CFixedAlloc. By using CRT, you will have the CRT overhead for each object, whereas with CFixedAlloc, it is the CRT overhead plus the size of CPlex (4 bytes) divided by the number of allocated objects. When the number of objects is big, the memory overhead is very close to zero, and as you will see with the demo, it can make a huge difference of many MBs. Now, look more closely at the two more important CFixedAlloc functions:
void* CFixedAlloc::Alloc() { EnterCriticalSection(&m_protect); if (m_pNodeFree == NULL) { CPlex* pNewBlock = NULL; TRY { // add another block pNewBlock = CPlex::Create(m_pBlocks, m_nBlockSize, m_nAllocSize); } CATCH_ALL(e) { LeaveCriticalSection(&m_protect); THROW_LAST(); } END_CATCH_ALL // chain them into free list CNode* pNode = (CNode*)pNewBlock->data(); // free in reverse order to make it easier to debug (BYTE*&)pNode += (m_nAllocSize * m_nBlockSize) - m_nAllocSize; for (int i = m_nBlockSize-1; i >= 0; i--, (BYTE*&)pNode -= m_nAllocSize) { pNode->pNext = m_pNodeFree; m_pNodeFree = pNode; } } ASSERT(m_pNodeFree != NULL); // we must have something // remove the first available node from the free list void* pNode = m_pNodeFree; m_pNodeFree = m_pNodeFree->pNext; LeaveCriticalSection(&m_protect); return pNode; } void CFixedAlloc::Free(void* p) { if (p != NULL) { EnterCriticalSection(&m_protect); // simply return the node to the free list CNode* pNode = (CNode*)p; pNode->pNext = m_pNodeFree; m_pNodeFree = pNode; LeaveCriticalSection(&m_protect); } }
In Alloc(), the code checks whether the pool of free memory is empty; if it is, it creates a new block of memory (CPlex) and sets up a list of free nodes. It is interesting to note that there is no wasted space to place node information because it is placed directly into each block. Of course, for this scheme to work, m_nAllocSize must be bigger than sizeof(CNode). This is why it is checked in the constructor:
ASSERT(nAllocSize >= sizeof(CNode));
If you want to see the CRT code, you can tell with a glimpse that CFixedAlloc is much lighter. From an algorithmic point of view, the main factor for this simplicity is because the memory block’s size is fixed (thus the name of the class). In a heap that allows variable size allocation, after many allocations and deallocations, the heap gets fragmented and the heap code has to scan the heap once in a while to merge adjacent small free blocks into big ones to make an efficient use of memory. With my demo program, I got the allocation time shred by a factor of 4 to 5.
And finally, let me introduce to you macros that help you use CFixedAlloc with your classes:
// DECLARE_FIXED_ALLOC -- used in class definition #define DECLARE_FIXED_ALLOC(class_name) \ public: \ void* operator new(size_t size) \ { \ ASSERT(size == s_alloc.GetAllocSize()); \ UNUSED(size); \ return s_alloc.Alloc(); \ } \ void* operator new(size_t, void* p) \ { return p; } \ void operator delete(void* p) { s_alloc.Free(p); } \ void* operator new(size_t size, LPCSTR, int) \ { \ ASSERT(size == s_alloc.GetAllocSize()); \ UNUSED(size); \ return s_alloc.Alloc(); \ } \ protected: \ static CFixedAlloc s_alloc \ // IMPLEMENT_FIXED_ALLOC -- used in class implementation file #define IMPLEMENT_FIXED_ALLOC(class_name, block_size) \ CFixedAlloc class_name::s_alloc(sizeof(class_name), block_size) \
The DECLARED_FIXED_ALLOC() macro overloads the new and delete operators of your class, thus allowing you to use CFixedAlloc with absolutely no code change. With the IMPLEMENT_FIXED_ALLOC(), this is where you specify the block size.