In a previous article I reported what appeared to be a performance issue with CAS/LOCK instructions on the Sandy Bridge microarchitecture compared to the previous Nehalem microarchitecture. Since then I’ve worked with the good people of Intel to understand what was going on and I’m now pleased to be able to shine some light on the previous results.
I observed a small drop in throughput for the uncontended single-thread case, and an order-of-magnitude decrease in throughput once multiple threads contended to perform updates. This testing grew out of observations made while testing Java Queue implementations and the Disruptor in the multi-producer case. I was initially puzzled by these findings because almost every other performance test I applied to Sandy Bridge indicated a major step forward for this microarchitecture.
After digging deeper into this issue, it has come to light that my tests have once again fallen foul of the difficulties of micro-benchmarking. My test is not a good means of measuring throughput; in a roundabout manner, it is actually testing fairness. Let’s revisit the code and work through what is going on.
Test Code
#include <time.h>
#include <pthread.h>
#include <stdlib.h>
#include <iostream>

typedef unsigned long long uint64;
const uint64 COUNT = 500 * 1000 * 1000;

// shared counter that all threads update atomically
volatile uint64 counter = 0;

// test 1: blind atomic increment (LOCK ADD); each thread performs its share
void* run_add(void* numThreads)
{
    register uint64 value = (COUNT / *((int*)numThreads)) + 1;
    while (--value != 0)
    {
        __sync_add_and_fetch(&counter, 1);
    }

    return NULL;
}

// test 2: atomic fetch-and-add (LOCK XADD) until the target count is reached
void* run_xadd(void*)
{
    register uint64 value = counter;
    while (value < COUNT)
    {
        value = __sync_add_and_fetch(&counter, 1);
    }

    return NULL;
}

// test 3: CAS retry loop driven by the boolean result (LOCK CMPXCHG)
void* run_cas(void*)
{
    register uint64 value = 0;
    while (value < COUNT)
    {
        do
        {
            value = counter;
        }
        while (!__sync_bool_compare_and_swap(&counter, value, value + 1));
    }

    return NULL;
}

// test 4: CAS retry loop driven by the returned value (LOCK CMPXCHG),
// reusing the value witnessed on failure rather than re-reading the counter
void* run_cas2(void*)
{
    register uint64 value = 0;
    register uint64 next = 0;

    while (value < COUNT)
    {
        value = counter;
        do
        {
            next = value + 1;
            value = __sync_val_compare_and_swap(&counter, value, next);
        }
        while (value != next);
    }

    return NULL;
}

int main (int argc, char *argv[])
{
    const int NUM_THREADS = atoi(argv[1]);
    const int TESTCASE = atoi(argv[2]);

    pthread_t threads[NUM_THREADS];
    void* status;

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    for (int i = 0; i < NUM_THREADS; i++)
    {
        switch (TESTCASE)
        {
            case 1:
                std::cout << "LOCK ADD" << std::endl;
                pthread_create(&threads[i], NULL, run_add, (void*)&NUM_THREADS);
                break;

            case 2:
                std::cout << "LOCK XADD" << std::endl;
                pthread_create(&threads[i], NULL, run_xadd, (void*)&NUM_THREADS);
                break;

            case 3:
                std::cout << "LOCK CMPXCHG BOOL" << std::endl;
                pthread_create(&threads[i], NULL, run_cas, (void*)&NUM_THREADS);
                break;

            case 4:
                std::cout << "LOCK CMPXCHG VAL" << std::endl;
                pthread_create(&threads[i], NULL, run_cas2, (void*)&NUM_THREADS);
                break;

            default:
                exit(1);
        }
    }

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_join(threads[i], &status);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    std::cout << "threads = " << NUM_THREADS << std::endl;
    std::cout << "duration = " << duration << std::endl;
    std::cout << "ns per op = " << (duration / (COUNT * 2)) << std::endl;
    std::cout << "op/sec = " << ((COUNT * 2 * 1000 * 1000 * 1000) / duration) << std::endl;
    std::cout << "counter = " << counter << std::endl;

    return 0;
}

The code above makes it possible to test the major CAS-based techniques on x86. For full clarity, an objdump -d of the binary reveals the compiler-generated assembly instructions for the above methods. The “lock” instruction in each section is where the atomic update is happening.
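Before reading the disassembly, it may help to have a model of what lock cmpxchg does, since it is the heart of both CAS loops. The sketch below is my own paraphrase of the instruction’s semantics, not output from the test program; the rax parameter stands in for the implicit accumulator register.

#include <stdint.h>

// Software model of x86 "lock cmpxchg src, dest". Illustration only:
// the real instruction performs all of this as a single atomic step.
// RAX is the implicit register holding the expected value. The returned
// flag corresponds to ZF, which __sync_bool_compare_and_swap exposes;
// the value left in rax is what __sync_val_compare_and_swap returns.
bool cmpxchg_model(uint64_t& dest, uint64_t& rax, uint64_t src)
{
    if (dest == rax)
    {
        dest = src;    // expected value matched: store the new value, ZF = 1
        return true;
    }

    rax = dest;        // mismatch: the current value is loaded into RAX, ZF = 0
    return false;
}

The value left in rax on failure is why run_cas2 can retry without separately re-reading counter, whereas run_cas must reload it at the top of its inner loop.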
0000000000400dc0 <_Z8run_cas2Pv>:
  400dc0: 48 8b 05 d9 07 20 00    mov    0x2007d9(%rip),%rax
  400dc7: 66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  400dce: 00 00
  400dd0: 48 8d 50 01             lea    0x1(%rax),%rdx
  400dd4: f0 48 0f b1 15 c3 07    lock cmpxchg %rdx,0x2007c3(%rip)
  400ddb: 20 00
  400ddd: 48 39 c2                cmp    %rax,%rdx
  400de0: 75 ee                   jne    400dd0 <_Z8run_cas2Pv>
  400de2: 48 3d ff 64 cd 1d       cmp    $0x1dcd64ff,%rax
  400de8: 76 d6                   jbe    400dc0 <_Z8run_cas2Pv>
  400dea: f3 c3                   repz retq
  400dec: 0f 1f 40 00             nopl   0x0(%rax)

0000000000400df0 <_Z7run_casPv>:
  400df0: 48 8b 15 a9 07 20 00    mov    0x2007a9(%rip),%rdx
  400df7: 48 8d 4a 01             lea    0x1(%rdx),%rcx
  400dfb: 48 89 d0                mov    %rdx,%rax
  400dfe: f0 48 0f b1 0d 99 07    lock cmpxchg %rcx,0x200799(%rip)
  400e05: 20 00
  400e07: 75 e7                   jne    400df0 <_Z7run_casPv>
  400e09: 48 81 fa ff 64 cd 1d    cmp    $0x1dcd64ff,%rdx
  400e10: 76 de                   jbe    400df0 <_Z7run_casPv>
  400e12: f3 c3                   repz retq
  400e14: 66 66 66 2e 0f 1f 84    data32 data32 nopw %cs:0x0(%rax,%rax,1)
  400e1b: 00 00 00 00 00

0000000000400e20 <_Z8run_xaddPv>:
  400e20: 48 8b 05 79 07 20 00    mov    0x200779(%rip),%rax
  400e27: 48 3d ff 64 cd 1d       cmp    $0x1dcd64ff,%rax
  400e2d: 77 1b                   ja     400e4a <_Z8run_xaddPv>
  400e2f: 90                      nop
  400e30: b8 01 00 00 00          mov    $0x1,