Mike Ash has written some microbenchmarks to test the speed of operations like Objective-C message dispatch and object creation, in response to people’s premature optimizations based on unfounded assumptions. This is one of those issues that comes up rather often. The numbers are interesting – especially the Objective-C message send vs. floating-point division – but I wanted some numbers for PowerPC, since ABI and hardware differences could be expected to reorder the list somewhat.
So here they are. There are two sets of numbers, with and without optimization (-O3 vs. -O0), because the two builds have different weak points. The test system is a 1.6 GHz iMac G5 with 1.5 GB of RAM and a stock hard disk. First, the optimized numbers:
Name | Iterations | Total time (sec) | Time per (ns)
---- | ---- | ---- | ----
Floating-point division | 100000000 | -0.0 | -0.0 |
Integer division | 1000000000 | -0.0 | -0.0 |
Float division with int conversion | 100000000 | 0.0 | 0.0 |
IMP-cached message send | 1000000000 | 4.6 | 4.6 |
C++ virtual method call | 1000000000 | 6.5 | 6.5 |
Objective-C message send (accelerated) | 1000000000 | 13.9 | 13.9 |
16 byte memcpy | 100000000 | 2.0 | 19.8 |
Objective-C message send (regular) | 1000000000 | 20.5 | 20.5 |
16 byte malloc/free | 100000000 | 18.2 | 182.3 |
NSInvocation message send | 10000000 | 7.3 | 728.5 |
NSObject alloc/init/autorelease | 10000000 | 7.6 | 761.1 |
NSAutoreleasePool alloc/init/autorelease | 10000000 | 8.8 | 883.2 |
16MB malloc/free | 100000 | 1.3 | 13261.6 |
NSButtonCell creation | 1000000 | 15.5 | 15461.7 |
Read 16-byte file | 100000 | 3.2 | 31881.5 |
Zero-second delayed perform | 100000 | 5.3 | 53081.2 |
pthread create/join | 10000 | 1.0 | 97580.2 |
NSButtonCell draw | 100000 | 19.2 | 192317.0 |
Write 16-byte file | 10000 | 5.0 | 498845.9 |
1MB memcpy | 10000 | 10.3 | 1029161.4 |
Write 16-byte file (atomic) | 10000 | 12.1 | 1208244.8 |
NSTask process spawn | 1000 | 16.5 | 16538000.4 |
Read 16MB file | 100 | 9.1 | 91019080.5 |
Write 16MB file | 30 | 18.8 | 625328159.8 |
Write 16MB file (atomic) | 30 | 19.1 | 635190321.8 |
The obvious problem here is that the division loops have been optimized away, which is why their times come out as zero (or slightly negative, once loop overhead is subtracted). Other than that, we can see that, er, my computer is slower than Mike’s, as expected. We can also see that the “accelerated Objective-C dispatch” (PowerPC only) provides a significant boost. I wanted to try “assume non-nil receivers” too, but couldn’t get it to link. The 16-byte memcpy is actually faster on my system, which I suspect is due more to set-up overhead within memcpy than anything else. Anyone have numbers on L1 cache performance?
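For readers who want to reproduce this, here is a minimal sketch of one common way to keep a division loop alive under -O3 – marking the result `volatile` so the compiler must perform each store. The `timed_divides` helper is hypothetical; I don’t know what Mike’s harness actually does, and this uses portable `clock()` rather than anything Mac-specific:

```c
#include <time.h>

/* Hypothetical helper: times `iters` floating-point divides.
 * Without `volatile`, an optimizing compiler may delete the whole
 * loop because the result is never used, which is exactly what
 * produced the zero/negative timings in the table above. */
static double timed_divides(long iters) {
    volatile double x = 0.0;   /* volatile store: divide can't be elided */
    volatile double y = 42.3;  /* volatile load: divide can't be hoisted */
    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        x = 1e8 / y;           /* the fdiv under test */
    (void)x;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

The trade-off is that the volatile loads and stores themselves add overhead, so the harness still needs to subtract the cost of an equivalent empty loop.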
Unoptimized numbers:
Name | Iterations | Total time (sec) | Time per (ns)
---- | ---- | ---- | ----
Floating-point division | 100000000 | 0.1 | 1.3 |
Integer division | 1000000000 | 8.1 | 8.1 |
16 byte memcpy | 100000000 | 1.6 | 15.9 |
Objective-C message send | 1000000000 | 25.4 | 25.4 |
IMP-cached message send | 1000000000 | 30.5 | 30.5 |
Float division with int conversion | 100000000 | 3.4 | 33.9 |
C++ virtual method call | 1000000000 | 45.2 | 45.2 |
16 byte malloc/free | 100000000 | 18.6 | 185.6 |
NSObject alloc/init/autorelease | 10000000 | 7.3 | 726.2 |
NSInvocation message send | 10000000 | 7.9 | 791.3 |
NSAutoreleasePool alloc/init/autorelease | 10000000 | 9.1 | 906.7 |
16MB malloc/free | 100000 | 1.3 | 13073.2 |
NSButtonCell creation | 1000000 | 15.8 | 15770.0 |
Read 16-byte file | 100000 | 3.2 | 31810.6 |
Zero-second delayed perform | 100000 | 5.5 | 54678.7 |
pthread create/join | 10000 | 1.0 | 96391.2 |
NSButtonCell draw | 100000 | 19.3 | 192928.8 |
Write 16-byte file | 10000 | 5.0 | 501396.6 |
1MB memcpy | 10000 | 10.2 | 1022297.1 |
Write 16-byte file (atomic) | 10000 | 12.4 | 1240962.9 |
NSTask process spawn | 1000 | 17.0 | 16983961.1 |
Read 16MB file | 100 | 9.0 | 89896392.0 |
Write 16MB file | 30 | 18.7 | 624687735.4 |
Write 16MB file (atomic) | 30 | 19.0 | 634871834.7 |
This build brings the division loops back, and the various method dispatches are significantly slower. Everything else happens outside the program’s own code and is therefore much the same.
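The gap between “regular” and “IMP-cached” message sends can be pictured with a loose analogy in plain C (this is not the actual Objective-C runtime; `lookup`, `IMP_like`, and `add_one` are invented stand-ins): a regular send pays for a method lookup on every call, while an IMP-cached send fetches the function pointer once and then calls through it directly.

```c
/* Stand-in for a method implementation pointer (an "IMP"). */
typedef int (*IMP_like)(int);

static int add_one(int x) { return x + 1; }

/* Stand-in for the runtime's method lookup (objc_msgSend's job). */
static IMP_like lookup(const char *selector_name) {
    (void)selector_name;       /* a real runtime would search method tables */
    return add_one;
}

/* "Regular" send: lookup happens on every call. */
static int send_regular(int v) {
    return lookup("addOne:")(v);
}

/* "IMP-cached" send: caller did the lookup once, up front. */
static int send_cached(IMP_like cached, int v) {
    return cached(v);
}
```

In the real runtime the regular path is already quite fast because of its inline method cache, which is why the measured gap is tens of nanoseconds rather than orders of magnitude.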
I expected the floating-point divide to be higher on the list than in the x86 version, but I was a bit sceptical of this result. However, it really is looping over the fdiv instruction, rather inefficiently:
```
+96  000047a0 3c400000 lis  r2,r2,0x0       ; Get offset of constant 1e8, high word
+100 000047a4 384269a8 addi r2,r2,0x69a8    ; low word
+104 000047a8 c9a20000 lfd  f13,0x0(r2)     ; f13 = 1e8 (literal)
+108 000047ac c81e0040 lfd  f0,0x40(r30)    ; f0 = 42.3 (y)
+112 000047b0 fc0d0024 fdiv f0,f13,f0       ; f0 = f13 / f0
+116 000047b4 d81e0048 stfd f0,0x48(r30)    ; store result (x)
+120 000047b8 805e0038 lwz  r2,0x38(r30)    ; r2 = counter (i)
+124 000047bc 38020001 addi r0,r2,0x1       ; r0 = r2 + 1
+128 000047c0 901e0038 stw  r0,0x38(r30)    ; store result (i)
+132 000047c4 801e0038 lwz  r0,0x38(r30)    ; load counter (i)
+136 000047c8 805e003c lwz  r2,0x3c(r30)    ; load counter max (iters)
+140 000047cc 7f801000 cmpw cr7,r0,r2       ; if (r0 < r2)
+144 000047d0 409dffd0 ble  cr7,0x47a0      ; goto top
```
For comparison, here is the integer division loop, with much the same structure:
```
+80  000046dc 3c003b9a lis  r0,0x3b9a
+84  000046e0 6000ca00 ori  r0,r0,0xca00
+88  000046e4 805e0038 lwz  r2,0x38(r30)
+92  000046e8 7c0013d6 divw r0,r0,r2
+96  000046ec 901e0040 stw  r0,0x40(r30)
+100 000046f0 805e0038 lwz  r2,0x38(r30)
+104 000046f4 38020001 addi r0,r2,0x1
+108 000046f8 901e0038 stw  r0,0x38(r30)
+112 000046fc 801e0038 lwz  r0,0x38(r30)
+116 00004700 805e003c lwz  r2,0x3c(r30)
+120 00004704 7f801000 cmpw cr7,r0,r2
+124 00004708 409dffd4 ble  cr7,0x46dc
```
Even so, two cycles for a 13-instruction loop containing loads and stores is a bit unbelievable, especially since Shark tells me there’s a 32-cycle latency between the divide and the immediately following store. Part of the explanation is that the benchmark framework subtracts the cost of a do-nothing loop, which compiles to this:
```
+80  00004160 805e0038 lwz  r2,0x38(r30)
+84  00004164 38020001 addi r0,r2,0x1
+88  00004168 901e0038 stw  r0,0x38(r30)
+92  0000416c 801e0038 lwz  r0,0x38(r30)
+96  00004170 805e003c lwz  r2,0x3c(r30)
+100 00004174 7f801000 cmpw cr7,r0,r2
+104 00004178 409dffe8 ble  cr7,0x4160
```
(The corresponding loop-overhead instructions appear in the previous two listings.) This still doesn’t completely explain the observed performance. The overhead calculation is probably somewhat inaccurate, and it’s hard to get exactly right; but any bias in it applies to all the benchmarks, not just this one. The divide is in the right place on the chart even if the number itself is suspect.
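The overhead-subtraction scheme described above can be sketched as follows. This is an assumed reconstruction, not Mike’s actual harness; the helpers `elapsed`, `empty_loop`, `divide_loop`, and `net_cost` are all invented for illustration:

```c
#include <time.h>

/* Time one run of `body` over `iters` iterations, in seconds. */
static double elapsed(void (*body)(long), long iters) {
    clock_t t0 = clock();
    body(iters);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* The do-nothing calibration loop (volatile counter so it survives
 * optimization, mirroring the lwz/addi/stw/cmpw/ble skeleton above). */
static void empty_loop(long iters) {
    for (volatile long i = 0; i < iters; i++) { }
}

/* The same skeleton with the divide under test added. */
static void divide_loop(long iters) {
    volatile double x = 0.0;
    volatile double y = 42.3;
    for (volatile long i = 0; i < iters; i++)
        x = 1e8 / y;
    (void)x;
}

/* Net cost = measured loop minus calibration loop. If noise in the
 * calibration run exceeds the work being measured, this can even come
 * out negative -- exactly the "-0.0" rows in the optimized table. */
static double net_cost(long iters) {
    return elapsed(divide_loop, iters) - elapsed(empty_loop, iters);
}
```

Note that subtraction like this assumes the loop skeleton costs the same with and without the payload, which out-of-order execution makes only approximately true.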
In summary, when people say PowerPCs rule at floating-point performance, they really mean it.
Some things to consider with the divide-loop timings are the processor’s out-of-order execution, caching, branch prediction, and memory-prefetching facilities. With predictable accesses and huge iteration counts, you can bet the processor will hide most of the cost. Core 2 can even hoist loads past stores in many cases. As such, trying to work out timings by counting instructions is somewhat futile.
The memory operations that seem faster than Mike’s most likely really are faster. Mike used a Mac Pro, which uses FB-DIMMs, whose latency and bandwidth are not very predictable. He should have used a Core 2 iMac or something similar to get more reliable numbers.