Performance Comparisons of Common Operations, PPC Edition

Mike Ash has written some microbenchmarks to test the speed of operations like Objective-C message dispatch and object creation, in response to people’s premature optimizations based on unfounded assumptions. This is one of those issues that comes up rather often. The numbers are interesting – especially the Objective-C message send vs. floating-point division – but I wanted some numbers for PowerPC, since ABI and hardware differences could be expected to reorder the list somewhat.
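For anyone who hasn’t seen the original post, the basic shape of these tests is: time a tight loop around the operation, then subtract the time of a do-nothing loop. Here is a minimal sketch of the idea in plain C (my own illustration, not Mike’s actual framework; the 42.3 and 1e8 constants match what shows up in the disassembly further down):

    #include <stdio.h>
    #include <sys/time.h>

    /* Wall-clock seconds; crude, but fine for loops that run for whole seconds. */
    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        const int iters = 100000000;
        volatile double x, y = 42.3;   /* volatile, so the divide can't be deleted at -O3 */
        double start, overhead, total;
        int i;

        /* Calibrate: time the do-nothing loop. */
        start = now();
        for (i = 0; i < iters; i++)
            ;
        overhead = now() - start;

        /* Time the operation under test. */
        start = now();
        for (i = 0; i < iters; i++)
            x = 100000000.0 / y;
        total = now() - start;

        printf("fdiv: %.1f ns per operation\n", (total - overhead) / iters * 1e9);
        return 0;
    }

As the optimized table below shows, the real framework lost this fight with gcc and had its division loops thrown out entirely; microbenchmarking is largely a war against the optimizer.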

So here they are. There are two sets of numbers, with (-O3) and without (-O0) optimization, because each build has different weak points. The test system is a 1.6 GHz iMac G5 with 1.5 GB of RAM and a stock hard disk. First, the optimized numbers:

Name                                        Iterations    Total time (sec)  Time per (ns)
Floating-point division                      100000000                -0.0           -0.0
Integer division                            1000000000                -0.0           -0.0
Float division with int conversion           100000000                 0.0            0.0
IMP-cached message send                     1000000000                 4.6            4.6
C++ virtual method call                     1000000000                 6.5            6.5
Objective-C message send (accelerated)      1000000000                13.9           13.9
16 byte memcpy                               100000000                 2.0           19.8
Objective-C message send (regular)          1000000000                20.5           20.5
16 byte malloc/free                          100000000                18.2          182.3
NSInvocation message send                     10000000                 7.3          728.5
NSObject alloc/init/autorelease               10000000                 7.6          761.1
NSAutoreleasePool alloc/init/autorelease      10000000                 8.8          883.2
16MB malloc/free                                100000                 1.3        13261.6
NSButtonCell creation                          1000000                15.5        15461.7
Read 16-byte file                               100000                 3.2        31881.5
Zero-second delayed perform                     100000                 5.3        53081.2
pthread create/join                              10000                 1.0        97580.2
NSButtonCell draw                               100000                19.2       192317.0
Write 16-byte file                               10000                 5.0       498845.9
1MB memcpy                                       10000                10.3      1029161.4
Write 16-byte file (atomic)                      10000                12.1      1208244.8
NSTask process spawn                              1000                16.5     16538000.4
Read 16MB file                                     100                 9.1     91019080.5
Write 16MB file                                     30                18.8    625328159.8
Write 16MB file (atomic)                            30                19.1    635190321.8

The obvious problem here is that the division loops have been optimized away. Other than that, we can see that, er, my computer is slower than Mike’s, as expected. We can also see that the “accelerated Objective-C dispatch” (PowerPC only) provides a significant boost. I wanted to try “assume non-nil receivers”, too, but couldn’t get it to link. The 16-byte memcpy is actually faster on my system, which I suspect is due more to set-up overhead within memcpy than anything else. Anyone have numbers on L1 cache performance?
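For anyone who hasn’t run into the IMP-cached case before: the trick is to ask the runtime for a method’s implementation once, then call it as a plain C function pointer, bypassing objc_msgSend on every subsequent call. A quick sketch (the class and method here are made up for illustration):

    #import <Foundation/Foundation.h>

    @interface Widget : NSObject
    - (void)poke;
    @end

    @implementation Widget
    - (void)poke {}
    @end

    int main(void)
    {
        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
        Widget *w = [[Widget alloc] init];
        SEL sel = @selector(poke);
        int i;

        /* Resolve the implementation once... */
        void (*poke)(id, SEL) = (void (*)(id, SEL))[w methodForSelector:sel];

        /* ...then call it directly, skipping objc_msgSend every time through. */
        for (i = 0; i < 1000000; i++)
            poke(w, sel);

        [w release];
        [pool release];
        return 0;
    }

Per the table above, that buys roughly a 4x speedup over the regular send here – worth it only in a genuinely hot loop, since even the regular send costs just 20 ns.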

Unoptimized numbers:

Name                                        Iterations    Total time (sec)  Time per (ns)
Floating-point division                      100000000                 0.1            1.3
Integer division                            1000000000                 8.1            8.1
16 byte memcpy                               100000000                 1.6           15.9
Objective-C message send                    1000000000                25.4           25.4
IMP-cached message send                     1000000000                30.5           30.5
Float division with int conversion           100000000                 3.4           33.9
C++ virtual method call                     1000000000                45.2           45.2
16 byte malloc/free                          100000000                18.6          185.6
NSObject alloc/init/autorelease               10000000                 7.3          726.2
NSInvocation message send                     10000000                 7.9          791.3
NSAutoreleasePool alloc/init/autorelease      10000000                 9.1          906.7
16MB malloc/free                                100000                 1.3        13073.2
NSButtonCell creation                          1000000                15.8        15770.0
Read 16-byte file                               100000                 3.2        31810.6
Zero-second delayed perform                     100000                 5.5        54678.7
pthread create/join                              10000                 1.0        96391.2
NSButtonCell draw                               100000                19.3       192928.8
Write 16-byte file                               10000                 5.0       501396.6
1MB memcpy                                       10000                10.2      1022297.1
Write 16-byte file (atomic)                      10000                12.4      1240962.9
NSTask process spawn                              1000                17.0     16983961.1
Read 16MB file                                     100                 9.0     89896392.0
Write 16MB file                                     30                18.7    624687735.4
Write 16MB file (atomic)                            30                19.0    634871834.7

At -O0 the division loops survive, and the various method dispatches get significantly slower. Everything else happens outside the program’s own code and is therefore much the same.

I expected the floating-point divide to place higher on the list than in the x86 version, but I was a bit skeptical of this result. However, it really is looping over the fdiv instruction, rather inefficiently:

   +96  000047a0  3c400000  lis      r2,0x0         ; Get offset of constant 1e8, high word
  +100  000047a4  384269a8  addi     r2,r2,0x69a8   ; low word
  +104  000047a8  c9a20000  lfd      f13,0x0(r2)    ; f13 = 1e8 (literal)
  +108  000047ac  c81e0040  lfd      f0,0x40(r30)   ; f0 = 42.3 (y)
  +112  000047b0  fc0d0024  fdiv     f0,f13,f0      ; f0 = f13 / f0
  +116  000047b4  d81e0048  stfd     f0,0x48(r30)   ; store result (x)
  +120  000047b8  805e0038  lwz      r2,0x38(r30)   ; r2 = counter (i)
  +124  000047bc  38020001  addi     r0,r2,0x1      ; r0 = r2 + 1
  +128  000047c0  901e0038  stw      r0,0x38(r30)   ; store result (i)
  +132  000047c4  801e0038  lwz      r0,0x38(r30)   ; load counter (i)
  +136  000047c8  805e003c  lwz      r2,0x3c(r30)   ; load counter max (iters)
  +140  000047cc  7f801000  cmpw     cr7,r0,r2      ; compare i against iters
  +144  000047d0  409dffd0  ble      cr7,0x47a0     ; loop again if i <= iters
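
Reading the literals back out, the source is presumably something close to this (a reconstruction from the disassembly, so the variable names and exact loop bounds are guesses):

    /* Reconstruction; names and loop bounds are guesses. */
    double x, y = 42.3;
    int i, iters = 100000000;
    for (i = 1; i <= iters; i++)
        x = 100000000.0 / y;    /* the fdiv at +112 */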

For comparison, here is the integer division loop, with much the same structure:

   +80  000046dc  3c003b9a  lis      r0,0x3b9a      ; Get constant 1000000000, high half
   +84  000046e0  6000ca00  ori      r0,r0,0xca00   ; low half (0x3b9aca00 = 1e9)
   +88  000046e4  805e0038  lwz      r2,0x38(r30)   ; r2 = counter (i)
   +92  000046e8  7c0013d6  divw     r0,r0,r2       ; r0 = 1e9 / r2
   +96  000046ec  901e0040  stw      r0,0x40(r30)   ; store result
  +100  000046f0  805e0038  lwz      r2,0x38(r30)   ; r2 = counter (i)
  +104  000046f4  38020001  addi     r0,r2,0x1      ; r0 = r2 + 1
  +108  000046f8  901e0038  stw      r0,0x38(r30)   ; store result (i)
  +112  000046fc  801e0038  lwz      r0,0x38(r30)   ; load counter (i)
  +116  00004700  805e003c  lwz      r2,0x3c(r30)   ; load counter max (iters)
  +120  00004704  7f801000  cmpw     cr7,r0,r2      ; compare i against iters
  +124  00004708  409dffd4  ble      cr7,0x46dc     ; loop again if i <= iters

Even so, two cycles for a 13-instruction loop with loads and stores in it is a bit hard to believe (1.3 ns per iteration at 1.6 GHz works out to roughly two cycles), especially since Shark tells me there’s a 32-cycle latency between the divide and the immediately-following store. Part of the explanation is that the benchmark framework subtracts the cost of a do-nothing loop, which compiles to this:

   +80  00004160  805e0038  lwz      r2,0x38(r30)   ; load counter (i)
   +84  00004164  38020001  addi     r0,r2,0x1      ; r0 = r2 + 1
   +88  00004168  901e0038  stw      r0,0x38(r30)   ; store result (i)
   +92  0000416c  801e0038  lwz      r0,0x38(r30)   ; load counter (i)
   +96  00004170  805e003c  lwz      r2,0x3c(r30)   ; load counter max (iters)
  +100  00004174  7f801000  cmpw     cr7,r0,r2      ; compare i against iters
  +104  00004178  409dffe8  ble      cr7,0x4160     ; loop again if i <= iters

(These are the same seven loop-maintenance instructions that close each of the previous listings.) This still doesn’t completely explain the performance being observed. The accuracy of the overhead calculation is probably a bit off, and it’s hard to get it exactly right; the subtraction is also why the optimized division rows come out at -0.0, since when the loop body is optimized away entirely the result lands at or below zero. However, any bias in this calculation applies to all the benchmarks, not just this one, so floating-point division is in the right place on the chart even if the exact number is suspect.

In summary: when people say PowerPCs rule at floating-point performance, they really mean it.


2 Responses to Performance Comparisons of Common Operations, PPC Edition

  1. David Smith says:

    Some things to consider with the divide loop timings are the processor’s out-of-order execution, caching, branch prediction, and memory prefetching facilities. If you’re doing predictable accesses with huge loop iteration counts, you can bet the processor will hide most of the latency. Core 2 can even hoist loads past stores in many cases. As such, trying to figure out timings by counting instructions is somewhat futile.

  2. hachu says:

    The memory ops that seem faster than Mike’s most likely really are faster. Mike used a Mac Pro, which uses FB-DIMMs; those are not predictable in terms of latency and bandwidth. He should have used a Core 2 iMac or something to get more reliable numbers.
