nedalloc v1.05 15th June 2008:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Niall Douglas (http://www.nedprod.com/programs/portable/nedmalloc/)

Enclosed is nedalloc, an alternative malloc implementation for multiple
threads without lock contention based on dlmalloc v2.8.4. It is more
or less a newer implementation of ptmalloc2, the standard allocator in
Linux (which is based on dlmalloc v2.7.0) but also contains a per-thread
cache for maximum CPU scalability.

It is licensed under the Boost Software License which basically means
you can do anything you like with it. This does not apply to the malloc.c.h
file which remains copyright to others.

It has been tested on win32 (x86), win64 (x64), Linux (x64), FreeBSD (x64)
and Apple MacOS X (x86). It works very well on all of these and is very
significantly faster than the system allocator on all of these platforms.

By literally dropping in this allocator as a replacement for your system
allocator, you can see real world improvements of up to three times in normal
code!

To use:
-=-=-=-
Drop in nedmalloc.h, nedmalloc.c and malloc.c.h into your project.
Configure using the instructions in nedmalloc.h. Run and enjoy.

To test, compile test.c. It will run a comparison between your system
allocator and nedalloc and tell you how much faster nedalloc is. It also
serves as an example of usage.

Notes:
-=-=-=
If you want the very latest version of this allocator, get it from the
TnFOX SVN repository at svn://svn.berlios.de/viewcvs/tnfox/trunk/src/nedmalloc

Because of how nedalloc allocates an mspace per thread, it can cause
severe bloating of memory usage under certain allocation patterns.
You can substantially reduce this wastage by setting MAXTHREADSINPOOL
or the threads parameter to nedcreatepool() to a fraction of the number of
threads which would normally be in a pool at once. This will reduce
bloating at the cost of an increase in lock contention.
If allocated size is less than THREADCACHEMAX, locking is avoided 90-99% of the
time and if most of your allocations are below this value, you can safely set
MAXTHREADSINPOOL to one.

You will suffer memory leakage unless you call neddisablethreadcache()
per pool for every thread which exits. This is because nedalloc cannot
portably know when a thread exits and thus when its thread cache can
be returned for use by other code. Don't forget pool zero, the system pool.

For C++ type allocation patterns (where the same sizes of memory are
regularly allocated and deallocated as objects are created and destroyed),
the threadcache always benefits performance. If however your allocation
patterns are different, searching the threadcache may significantly slow
down your code - as a rule of thumb, if cache utilisation is below 80%
(see the source for neddisablethreadcache() for how to enable debug
printing in release mode) then you should disable the thread cache for
that thread. You can compile out the threadcache code by setting
THREADCACHEMAX to zero.

Speed comparisons:
-=-=-=-=-=-=-=-=-=
See Benchmarks.xls for details.

The enclosed test.c can do two things: it can be a torture test or a speed
test. The speed test is designed to be a representative synthetic
memory allocator test. It works by randomly mixing allocations with frees
with half of the allocation sizes being a power of two less than
512 bytes (to mimic C++ stack instantiated objects) and the other half
being a simple random value less than 16Kb.

The real world code results are from Tn's TestIO benchmark. This is a
heavily multithreaded and memory intensive benchmark with a lot of branching
and other stuff modern processors don't like so much. As you'll note, the
test doesn't show the benefits of the threadcache mostly due to the saturation
of the memory bus being the limiting factor.
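
The allocation size mix used by the speed test, as described above, can be
sketched roughly as follows. This is only an illustration and is not taken
from test.c: the names size_from_random() and pick_size() are hypothetical,
and the exact ranges (powers of two from 1 to 256, random sizes from 1 to
16383 bytes) are assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not from test.c): maps two random numbers to one
   allocation size following the mix described in the text. */
size_t size_from_random(unsigned coin, unsigned r)
{
    size_t size;
    if (coin & 1u) {
        /* Half of the sizes: a power of two below 512 bytes (1, 2, ... 256). */
        size = (size_t)1 << (r % 9u);
    } else {
        /* Other half: a simple random value below 16Kb (assumed 1..16383). */
        size = (size_t)(r % (16u * 1024u - 1u)) + 1u;
    }
    assert(size >= 1 && size < 16u * 1024u);
    return size;
}

/* Hypothetical wrapper: draws the two random numbers from rand(). */
size_t pick_size(void)
{
    unsigned coin = (unsigned)rand();
    unsigned r = (unsigned)rand();
    return size_from_random(coin, r);
}
```

A benchmark loop built on this would call srand() once and then pick_size()
for each allocation, interleaving the allocations with frees of earlier blocks.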

ChangeLog:
-=-=-=-=-=
v1.05 15th June 2008:
 * { 1042 } Added error check for TLSSET() and TLSFREE() macros. Thanks to
Markus Elfring for reporting this.
 * { 1043 } Fixed a segfault when freeing memory allocated using
nedindependent_comalloc(). Thanks to Pavel Vozenilek for reporting this.

v1.04 14th July 2007:
 * Fixed a bug with the new optimised implementation that failed to lock
on a realloc under certain conditions.
 * Fixed lack of thread synchronisation in InitPool() causing pool corruption
 * Fixed a memory leak of thread cache contents on disabling. Thanks to Earl
Chew for reporting this.
 * Added a sanity check for freed blocks being valid.
 * Reworked test.c into being a torture test.
 * Fixed GCC assembler optimisation misspecification

v1.04alpha_svn915 7th October 2006:
 * Fixed failure to unlock thread cache list if allocating a new list failed.
Thanks to Dmitry Chichkov for reporting this. Further thanks to Aleksey Sanin.
 * Fixed realloc(0, <size>) segfaulting. Thanks to Dmitry Chichkov for
reporting this.
 * Made config defines #ifndef so they can be overridden by the build system.
Thanks to Aleksey Sanin for suggesting this.
 * Fixed deadlock in nedprealloc() due to unnecessary locking of preferred
thread mspace when mspace_realloc() always uses the original block's mspace
anyway. Thanks to Aleksey Sanin for reporting this.
 * Made some speed improvements by hacking mspace_malloc() to no longer lock
its mspace, thus allowing the recursive mutex implementation to be removed
with an associated speed increase. Thanks to Aleksey Sanin for suggesting this.
 * Fixed a bug where allocating mspaces overran its max limit. Thanks to
Aleksey Sanin for reporting this.

v1.03 10th July 2006:
 * Fixed memory corruption bug in threadcache code which only appeared with >4
threads and in heavy use of the threadcache.

v1.02 15th May 2006:
 * Integrated dlmalloc v2.8.4, fixing the win32 memory release problem and
improving performance still further. Speed is now up to twice the speed of v1.01
(average is 67% faster).
 * Fixed win32 critical section implementation. Thanks to Pavel Kuznetsov
for reporting this.
 * Wasn't locking mspace if all mspaces were locked. Thanks to Pavel Kuznetsov
for reporting this.
 * Added Apple Mac OS X support.

v1.01 24th February 2006:
 * Fixed multiprocessor scaling problems by removing sources of cache sloshing
 * Earl Chew <earl_chew <at> agilent <dot> com> sent patches for the following:
   1. size2binidx() wasn't working for default code path (non x86)
   2. Fixed failure to release mspace lock under certain circumstances which
      caused a deadlock

v1.00 1st January 2006:
 * First release