Motivation for hand-optimized Assembly code:

Last updated on 5th July, 2013 by Shervin Emami. Posted originally on 8th Oct, 2010.

There's a popular saying that "in 90% of cases, a modern compiler writes faster code than a typical Assembly programmer would". But anyone who has actually tested this theory knows how wrong that statement is! Hand-written Assembly code is ALWAYS faster and/or smaller than the equivalent compiled code, as long as the programmer understands the many intricate details of the CPU they are developing for. eg: I wrote both an optimized C function and an optimized Assembly function (using NEON SIMD instructions) to shrink an image to a quarter of its width & height, and my Assembly code was 30x faster than my best C code! Even the best C compilers are still terrible at using SIMD acceleration, a feature that is available on most modern CPUs and can allow code to run between 4 and 50 times faster, yet is rarely used properly!

ARM's RVDS compiler typically generates code that is up to 2x faster than any other C compiler for ARM, but on most ARM devices, hand-written Assembly code can often be 10x faster! (Assuming you use SIMD vectorization such as ARM's NEON Media Processing Engine or Intel's MMX/SSE/AVX). This is similar to the speedups you can expect from GPGPU acceleration (using NVidia's CUDA or OpenCL), but on a small mobile device rather than an expensive desktop video card! And luckily the iPhone, iPad, iPod, Raspberry Pi, ODROID and Android phones & tablets nearly all use ARM CPUs with NEON vector processing, so you can use the same Assembly code in apps for the official iPhone App Store, the Android Market (with the NDK) and the Raspberry Pi. And with the recent popularity of ARM CPUs in portable devices, this is likely to continue for several generations of smartphones, tablets, and ultra-portables (eg: in the NVidia Tegra3 "Kal-el", TI OMAP4, Qualcomm Snapdragon S4 "Krait", Apple iPad2 & iPhone5, etc). Obviously you shouldn't write a whole app in Assembly language, but if you need certain loops to run as fast as possible, then a few sections of Assembly language might be exactly what you need!

Modern processor architectures are much more complicated now than they were at the start of the PC era. This definitely makes efficient Assembly code hard to write by hand, but it also makes efficient code hard for a compiler to generate, so there is still significant room for improvement through careful code design.

UPDATE: Note that the Cortex-A9 and Cortex-A15 CPUs are much more advanced than the Cortex-A5, Cortex-A7 & Cortex-A8, so the advantages of Assembly code & NEON SIMD will be less important on a Cortex-A9 than on simpler devices such as the Cortex-A8.

 

Free libraries with hand-optimized Assembly code:

There are already some free libraries of hand-optimized code for Intel x86 and ARM CPUs, so for some tasks you can simply use one of these existing libraries from your C/C++ code without writing any Assembly language yourself.

For ARM CPUs (including nearly all smartphones, tablets & Linux embedded systems):

  • ARM's OpenMAX DL implementation with hundreds of fast functions for ARMv7-A Cortex-A8 and ARM11 (devices such as the iPhone, iPad, Android, Raspberry Pi, BlackBerry PlayBook, Palm Pre, etc). The OpenMAX functions are grouped into Audio Codecs, Image Codecs, Image Processing, Signal Processing, and Video Codecs.
    An OpenMAX example given by ARM:
    #include <omxSP.h>
    
    // Each OMX_S16 is a signed 16-bit integer, so an array of 8 of them fits in a single 128-bit SIMD register.
    OMX_S16 source1[] = {42, 23, 983, 7456, 124, 11111, 4554, 10002};
    OMX_S16 source2[] = {242, 423, 9832, 746, 1124, 1411, 2254, 1298};
    
    // Calculates the dot product of two arrays of signed 16-bit integers using
    // the OpenMAX function omxSP_DotProd_S16(), which uses NEON SIMD instructions.
    OMX_S32 fast_dotproduct(void)
    {
    	OMX_INT len = sizeof(source1) / sizeof(OMX_S16);
    	return omxSP_DotProd_S16(source1, source2, len);
    }
    
  • Eigen high-level C++ math library has SIMD vectorization for both Intel SSE and ARM NEON.
    An Eigen example:
    #include <Eigen/Dense>
    using namespace Eigen;
    
    // A Vector4f stores 4 floats into a single 128-bit SIMD register.
    Vector4f source1(23.0, 0.5, 2.0, 6.5);
    Vector4f source2(1.0, 2.0, 6.4, 3.14);
    
    // Calculates the dot product of two arrays of 4 floats using
    // the Eigen function dot(), which should use NEON SIMD instructions.
    double fast_dotproduct(void)
    {
    	return source1.dot(source2);
    }
    
  • If you use ARM's DS-5 or RVDS 4 compiler, you can enable auto vectorization so it will try to optimize your C code using NEON, perhaps generating code that runs twice as fast as normal.
  • Or if you use GCC or LLVM or CodeSourcery you can also enable auto vectorization, but it rarely makes any improvement (in XCode 3, it would be "GCC 4.2 - Language" -> "Other C flags"):
    "-O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -ftree-vectorize -ffast-math -funsafe-math-optimizations -fsingle-precision-constant"


For Intel x86 (desktop) CPUs:

  • Eigen high-level C++ math library with SIMD vectorization for both Intel SSE and ARM NEON.
  • SIMDx86 low-level SIMD functions for Intel MMX, SSE, SSE2, SSE3, SSSE3 and AMD 3DNow!+.
  • SSEPlus low-level intrinsic SIMD functions for Intel SSE, SSE2, SSE3, SSSE3, SSE4, SSE5.
  • libSIMD low-level SIMD functions for Intel SSE, SSE2.
  • Perhaps other SIMD libraries too, which you can find by searching SourceForge for SIMD.
  • Commercial or free math libraries such as ATLAS, PLASMA, libFlame, Intel's MKL & IPP.

 

How to write ARM Assembly code for Android or iPhone or Raspberry Pi:

To write Assembly language code for ARM, you can either:

  1. write inline asm statements in C/C++/Objective-C code, or
  2. write standalone Assembly functions in a '.s' file and simply add it to your XCode sources, or
  3. write standalone Assembly functions for an external assembler: you write the Assembly code in a '.s' file, generate a '.o' object file, and link that object file into your XCode project.

So if you are just trying to write a few Assembly instructions then inline assembler would be the easiest way, but if you plan on writing many Assembly functions then I'd recommend a standalone Assembly file for GCC, or an external assembler such as FASMARM.
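For example, option 1 (inline assembler) in GCC looks like this. It is just a minimal sketch (the function name is my own, and it only compiles for ARM targets); the "r" constraints tell GCC which values to place in general-purpose registers and "=r" marks the output:

// Add two integers using one ARM instruction via GCC's extended inline assembler.
static inline int add_two_ints(int a, int b)
{
	int result;
	// %0 = result (output), %1 = a, %2 = b (inputs), all placed in ARM registers by GCC.
	__asm__("add %0, %1, %2" : "=r" (result) : "r" (a), "r" (b));
	return result;
}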

Once you have set up your assembler environment, you need to learn how to write ARM Assembly code, since iPhones and pretty much all portable devices, smartphones, tablets, smartwatches & Raspberry Pi / Linux dev boards use the same ARM instruction set. Some good intro tutorials to learn ARM Assembly are:
  • A very good but old intro at Coranac's Whirlwind Tour of ARM Assembly,
  • A recent intro including Thumb-2 instructions at DaveSpace's Introduction to ARM,
  • Another good intro at BraveGNU's Embedded Programming with the GNU Toolchain.
  • A brief introduction at WebShaker's Begin Programming Assembler with GCC,
  • And if you specifically want to use GCC inline assembler then you should read Ethernut's ARM GCC Inline Assembler Cookbook.
There are also the books ARM System Developer's Guide and ARM Assembly Language. These are a good way to learn the basics of ARM Assembly from scratch, and then you can target specific features for your device such as NEON or Thumb-2 or multi-core.

When it comes to Assembly programming, the official instruction set reference manual is usually the main source of information for everything you will write, so you should go to the ARM website and download the ARM and Thumb-2 Quick Reference Card (6 pages long) as well as the 2 full documents for your exact CPU. For example, the iPhone 3GS and iPhone 4 both have an ARMv7-A Cortex-A8 CPU, so you can download the ARM Architecture Reference Manual ARMv7-A and ARMv7-R Edition (2000 pages long) that tells you exactly which instructions are available and exactly how they work, and the Cortex-A8 Technical Reference Manual (700 pages long) that explains the instruction timings, etc. for your specific CPU. There is also a recent ARM Cortex-A Programmer's Guide, containing useful info and comparisons of the Cortex-A8, Cortex-A9, Cortex-A5 and Cortex-A15 CPUs.

UPDATE: Note that the Cortex-A5 & Cortex-A7 CPUs in recent ARM devices such as the Raspberry Pi 2 and ODROID-C1 all use the ARMv7 instruction set and behave much like the ARM Cortex-A8 or Cortex-A9 CPUs. Whereas the original Raspberry Pi 1 and the original iPhone use the older ARMv6 instruction set and an old ARM11 CPU, so they are quite different from all modern ARM CPUs.

It is important to understand that many ARM CPUs include the NEON Advanced SIMD coprocessor (aka NEON or the Media Processing Engine), so if you expect to run operations that can take advantage of a SIMD architecture (eg: heavily data-parallel tasks), then you should make it a big priority to learn how to use NEON effectively (there is a small C sketch using NEON intrinsics just after the links below)! As mentioned above, the official ARM Architecture Reference Manual and ARM Cortex-A8 Technical Reference Manual are the most important sources of info, but there are other places for quicker info such as:
  • The List of NEON Instructions,
  • The official ARM Tech Forum,
  • An example using NEON for optimization at Hilbert-Space.de,
  • Another blog that includes some NEON for iPhone by Wandering Coder,
  • An official ARM blog with an intro on NEON at "Coding for NEON",
  • ARM's Fastest memcpy() implementation,
  • A forum post with an Even faster memcpy() implementation,
  • An experimental ARM Cortex-A8 cycle counter online tool,
  • A discussion on How to efficiently shrink an image by 50% or 25%,
  • Some hints on how to use NEON for Floating Point Optimization and Assembly Code Optimization,
  • Many good ARM Assembly links collected by dpt,
  • A list of many Bit Twiddling Hacks that might help you reduce some "if" statements in your SIMD code, etc.
  • A 270-page beginner's tutorial covering Intel x86 and ARM assembly, compilers and reverse engineering: Quick Introduction to Reverse Engineering for Beginners by Dennis Yurichev.
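Besides hand-written Assembly, most compilers also let you use NEON from C through the intrinsics in <arm_neon.h> (build with "-mfpu=neon"). Here is a minimal sketch (the function name is just for illustration) that adds 4 pairs of 32-bit integers in one SIMD operation, the same idea as the Assembly example further below:

#include <arm_neon.h>

// Add 4 pairs of 32-bit integers at once using one NEON SIMD operation (VADD.I32).
void addFourIntsUsingIntrinsics(int32_t *arrayA, const int32_t *arrayB)
{
	int32x4_t a = vld1q_s32(arrayA);	// Load 4 x 32-bit integers of the 1st array.
	int32x4_t b = vld1q_s32(arrayB);	// Load 4 x 32-bit integers of the 2nd array.
	a = vaddq_s32(a, b);			// Add all 4 pairs with a single operation.
	vst1q_s32(arrayA, a);			// Store the result back into the 1st array.
}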

 

Example Assembly module for GCC / GNU Assembler:

//---------------------------------------------------------------------------//
// libASM.s:	Assembly language functions that can be called from C/C++.
//---------------------------------------------------------------------------//

// Create some macros to generate the start and end of my assembly functions
// that can be called from C code.
// WARNING: These macros assume no more than 4 arguments per function.
// For more than 4 arguments, you need a stack frame for the arguments after the 4th.
.macro BEGIN_FUNCTION
	.align 2		// Align the function code to a 4-byte (2^n) word boundary.
	.arm			// Use ARM instructions instead of Thumb.
	.globl _$0		// Make the function globally accessible.
	.no_dead_strip _$0	// Stop the linker from dead-stripping this function!
	.private_extern _$0
_$0:				// Declare the function.
.endmacro

.macro BEGIN_FUNCTION_THUMB
	.align 2		// Align the function code to a 4-byte (2^n) word boundary.
	.thumb			// Use THUMB-2 instructions instead of ARM.
	.globl _$0		// Make the function globally accessible.
	.thumb_func _$0		// Use THUMB-2 for the following function.
	.no_dead_strip _$0	// Stop the linker from dead-stripping this function!
	.private_extern _$0
_$0:				// Declare the function.
.endmacro

.macro END_FUNCTION
	bx	lr		// Jump back to the caller.
.endmacro


// Store a 32-bit constant into a register.
// eg: SET_REG r1, 0x11223344
.macro 	SET_REG
	// Recommended for ARMv6+ because the number is stored inside the instruction:
	movw	$0, #:lower16:$1
	movt	$0, #:upper16:$1
.endmacro

//---------------------------------------------------------------------------//


	// Set up this module so its code is visible to the iPhone build in XCode.
	.syntax unified		// Allow both ARM and Thumb-2 instructions
	.section __TEXT,__text,regular
	.section __TEXT,__textcoal_nt,coalesced
	.section __TEXT,__const_coal,coalesced
	.section __TEXT,__symbol_stub4,symbol_stubs,none,12
	.text


BEGIN_FUNCTION testSimpleAddFunction
	add		r0, r0, r1		// Return the sum of the first 2 function parameters
END_FUNCTION


BEGIN_FUNCTION_THUMB testAddFunctionWithProlog
	// Function prolog that saves all important registers.
	push		{r4,r5,r6, r7,lr}	// Save registers r4-r6 if used and Frame Pointer (r7) and Link Register (r14).
	add		r7, sp, #12		// Adjust FP to point to the saved FP (r7).
	push		{r8,r10,r11,r14}	// Save any general registers that should be preserved.
	//vstmdb	sp!, {d8-d15}		// Save any VFP or NEON registers that should be preserved (S16-S31 / Q4-Q7).
	//sub		sp, sp, #4		// Allocate space for some local storage (optional).

	add		r0, r0, r1		// Return the sum of the first 2 function parameters
	
	// Function epilog that restores all important registers.
	//add		sp, sp, #4		// Deallocate space for the local storage.
	//vldmia	sp!, {d8-d15}		// Restore any VFP or NEON registers that were saved.
	pop		{r8,r10,r11,r14}	// Restore any general registers that were saved.
	pop		{r4,r5,r6, r7,pc}	// Restore saved registers, the saved FP (r7), and return to the caller (saved LR as PC).
END_FUNCTION


// Add 4 integers A + B at the same time using 1 NEON SIMD instruction.
BEGIN_FUNCTION addFourIntsUsingNeon
	// Function prolog that saves all important registers.
	push	{r4,r5,r6, r7,lr}	// Save registers r4-r6 if used and Frame Pointer (r7) and Link Register (r14).
	add		r7, sp,#12	// Adjust FP to point to the saved FP (r7).
	push	{r8,r10,r11,r14}	// Save any general registers that should be preserved.
	//vstmdb	sp!, {d8-d15}	// Save any VFP or NEON registers that should be preserved (S16-S31 / Q4-Q7).
	//sub		sp, sp,#4	// Allocate space for some local storage (optional).

//------ ARM registers used:
// r0:	arg0 & result (input & output parameters)
// r1:	arg1 (input parameter)
// r2:		
// r3:	
// r4:		(must restore if modified)
// r5:		(must restore if modified)
// r6:		(must restore if modified)
// r7:		* Frame Pointer in iOS (don't touch!)
// r8:		(must restore if modified)
// r9:	
// r10:		(must restore if modified)
// r11:		(must restore if modified)
// r12: 
// r13:		* Stack Pointer (don't touch!)
// r14:		(must restore if modified)
// r15:		* Program Counter (don't touch!)
//------ NEON registers used:
// q0:	arrayA	(4 x 32bit integers)
// q1:	arrayB	(4 x 32bit integers)
//------ Local stack variables used:
// (none)


// If NEON instructions aren't available, don't execute anything.
#if defined __ARM_NEON__

	vld1.i32	{q0}, [r0]	// Load 4 x 32bit integers of the 1st array.
	vld1.i32	{q1}, [r1]	// Load 4 x 32bit integers of the 2nd array.
	vadd.i32	q0, q0, q1	// Add the values of 4 int32 using just one operation.
	vst1.i32	{q0}, [r0]	// Store the 4 x 32bit integers back into the 1st array.

#endif //defined __ARM_NEON__
	
	// Function epilog that restores all important registers.
	//add		sp, sp,#4		// Deallocate space for the local storage.
	//vldmia	sp!, {d8-d15}		// Restore any VFP or NEON registers that were saved.
	pop		{r8,r10,r11,r14}	// Restore any general registers that were saved.
	pop		{r4,r5,r6, r7,pc}	// Restore saved registers, the saved FP (r7), and return to the caller (saved LR as PC).
END_FUNCTION

Note: The Assembler in GCC (GNU "as" / "gas", invoked through GCC with "-x assembler-with-cpp" if you need the C preprocessor) can have certain peculiarities, such as:
  • All Assembly instructions should be in lower-case, so CAPITALS are not allowed!
  • The macro features are not nearly as powerful as other assemblers such as NASM.
  • Some versions of GCC / BINUTILS have a bug in parsing NEON alignment, so you may need a space after the comma. eg:   vld1.8    {q0}, [r0, :64]   instead of:   vld1.8    {q0}, [r0,:64]
  • I found that NEON can only do simple addressing modes (as mentioned in the specs), but GCC does not give an error if you use an invalid one!
    eg:     vld1.32     {q0}, [r1,r2]           // Load vector from mem[r1+r2]
    or:      vld1.32     {q0}, [r1,r2,lsl#2]    // Load vector from mem[r1+r2*4]
    silently compiles to:
              vld1.32      {q0}, [r1]              // Load vector from mem[r1]

 

Combining Assembly code with Objective-C in XCode for iPhone:

To actually use Assembly code in your XCode project, I recommend creating a .H header file with function headers that can be included by your iPhone code. For example:

In your Objective-C or C or C++ files:

#include "libASM.h"

....

// Add the 4 pairs of numbers using NEON SIMD.
int arrA[4] = {5,10,15,20};
int arrB[4] = {1,2,3,4};
int *arrOut;
// NOTE: The same array is used in this function for arrA and the return value
// (ie: it will overwrite the data in arrA), but I'm just using a separate "arrOut"
// pointer to show how to return data from the Assembly function using r0.
arrOut = addFourIntsUsingNeon(arrA, arrB);
printf("arrOut = {%d, %d, %d, %d}\n", arrOut[0], arrOut[1], arrOut[2], arrOut[3]);

In the file "libASM.h":

#ifndef _LIBASM_H_
#define _LIBASM_H_

// Allow the functions to work in C, C++ and Objective-C code:
#ifdef __cplusplus
extern "C" {
#endif	

	//----  Function declarations in C/C++ syntax:  ----
	
	// Add 2 numbers and return the result.
	int testSimpleAddFunction(int argA, int argB);
	int testAddFunctionWithProlog(int argA, int argB);
	
	// Add 4 integers A + B at the same time using 1 NEON SIMD instruction.
	int* addFourIntsUsingNeon(int *argA, int *argB);

#ifdef __cplusplus
}
#endif

#endif // _LIBASM_H_

If you have a syntax error in your Assembly code, then XCode will fail with the error message:
Command /Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/gcc-4.2 failed with exit code 1
But to see what the actual error message is, click on the small icon on the right-side of that line, which expands the message. At the bottom, it should now show the GNU assembler's error message, such as:
/iPhone/TestAssembly/Classes/libASM.s:341:bad instruction `xor r0, r0, r0'

 

Accessing elements of a structure in Assembly:

Here is an example of how to create or access a C struct using GCC. This example lets you access all the elements of an IplImage that OpenCV is based on:

// IplImage struct from OpenCV, with the field offsets on the left:
.set IplImage_nSize,		0	//int  nSize;             /* sizeof(IplImage) */
.set IplImage_ID,		4	//int  ID;                /* version (=0)*/
.set IplImage_nChannels,	8	//int  nChannels;         /* Most of OpenCV functions support 1,2,3 or 4 channels */
.set IplImage_alphaChannel,	12	//int  alphaChannel;      /* Ignored by OpenCV */
.set IplImage_depth,		16	//int  depth;             /* Pixel depth in bits: IPL_DEPTH_8U, IPL_DEPTH_8S, IPL_DEPTH_16S,
					//                                   IPL_DEPTH_32S, IPL_DEPTH_32F and IPL_DEPTH_64F are supported.  */
.set IplImage_colorModel,	20	//char colorModel[4];   /* Ignored by OpenCV */
.set IplImage_channelSeq,	24	//char channelSeq[4];     /* ditto */
.set IplImage_dataOrder,	28	//int  dataOrder;         /* 0 - interleaved color channels, 1 - separate color channels.
					//                                   cvCreateImage can only create interleaved images */
.set IplImage_origin,		32	//int  origin;            /* 0 - top-left origin,
					//                                   1 - bottom-left origin (Windows bitmaps style).  */
.set IplImage_align,		36	//int  align;             /* Alignment of image rows (4 or 8).
					//                                   OpenCV ignores it and uses widthStep instead.    */
.set IplImage_width,		40	//int  width;             /* Image width in pixels.                           */
.set IplImage_height,		44	//int  height;            /* Image height in pixels.                          */
.set IplImage_roi,		48	//struct _IplROI *roi;    /* Image ROI. If NULL, the whole image is selected. */
.set IplImage_maskROI,		52	//struct _IplImage *maskROI;      /* Must be NULL. */
.set IplImage_imageId,		56	//void  *imageId;                 /* "           " */
.set IplImage_tileInfo,		60	//struct _IplTileInfo *tileInfo;  /* "           " */
.set IplImage_imageSize,	64	//int  imageSize;         /* Image data size in bytes
					//                                   (==image->height*image->widthStep
					//                                   in case of interleaved data)*/
.set IplImage_imageData,	68	//char *imageData;        /* Pointer to aligned image data.         */
.set IplImage_widthStep,	72	//int  widthStep;         /* Size of aligned image row in bytes.    */
.set IplImage_BorderMode,	76	//int  BorderMode[4];     /* Ignored by OpenCV.                     */
.set IplImage_BorderConst,	92	//int  BorderConst[4];    /* Ditto.                                 */
.set IplImage_imageDataOrigin, 108	//char *imageDataOrigin;  /* Pointer to very origin of image data
					//                                   (not necessarily aligned) -
					//                                   needed for correct deallocation */
For example, you can get the width of an image with this code:
	ldr	r1, [r0, #IplImage_width]	// Get src->width in pixels
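Since these offsets are maintained by hand, it is worth checking them against the real OpenCV header from C before trusting them in Assembly. A minimal sketch (assuming the OpenCV C API header "opencv/cxcore.h"; the path may differ between OpenCV versions):

#include <stddef.h>
#include <stdio.h>
#include <opencv/cxcore.h>	// Defines the IplImage struct (path may vary with your OpenCV version).

int main(void)
{
	// If any of these don't match the .set values above, update the .s file!
	printf(".set IplImage_width,     %d\n", (int)offsetof(IplImage, width));
	printf(".set IplImage_height,    %d\n", (int)offsetof(IplImage, height));
	printf(".set IplImage_widthStep, %d\n", (int)offsetof(IplImage, widthStep));
	printf(".set IplImage_imageData, %d\n", (int)offsetof(IplImage, imageData));
	return 0;
}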

 

ARM Cortex-A8 CPU of many Smart Phones:

The ARM Cortex-A8 single-core CPU is used in many smart phones and portable devices, such as the Apple iPhone 3GS, iPhone 4, iPad, iPod Touch (3rd Gen), the Palm Pre, Motorola Droid, Nokia N900, the BeagleBoard (palm-sized Linux computer), the Gumstix Overo (finger-sized computer) and Pandora (open-source gaming console). I collected the following information while developing Assembly language code for the iPhone, but it can also be useful for other mobile devices with ARM CPUs:

  • ARM functions can use the 32-bit registers R0-R12 for general purpose (R13 is SP, R14 is LR and R15 is PC), but must restore R4-R14. In iOS, R7 is the frame pointer so should never be used, but R9 and R12 can be used without preserving them. Also in iOS, a 'bool' is possibly 1 byte and data is little-endian (ARM can potentially support both little-endian & big-endian).
  • When interfacing to C programs, the first 4 integer parameters of a function are passed in r0-r3. The return value is in r0, with r1-r3 used if the return value needs more space.
  • The Cortex-A8 has dual-issue execution and most instructions take 1 cycle, so 2 instructions can potentially run in 1 cycle. But any load/store/multiply/branch or register reuse must wait for the next cycle, and using a register straight after loading it from memory causes a 2-cycle stall. LDM can load 2 memory words per cycle but only runs in pipeline 0, and will only free pipeline 1 in its last iteration.
  • The Cortex-A8 has a 13-stage pipeline, which means that if a branch prediction fails (eg: an 'else' statement), it is a 13-cycle penalty! And accessing memory that is not in the L2 cache costs at least 25 cycles!
  • The cache line size of the Cortex-A8 is 16 words (64 bytes), so you should ideally align everything to 64 bytes (see the alignment sketch just after this list).
  • The NEON coprocessor mainly runs at 1 SIMD instruction per cycle, 5 cycles behind the ARM unit. Data in the Level-1 cache is accessed instantly by NEON, and the NEON coprocessor can access ARM registers instantly, but for ARM to access a NEON register or the same memory location there is at least a 20-cycle penalty! During that penalty it is possible to continue processing, as long as nothing requires the ARM registers.
  • ARM & NEON can be combined for faster memcpy() & memset().
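As a rough illustration of the 64-byte alignment mentioned in the list above (this is just a sketch with my own names; it assumes GCC/Clang for the aligned attribute and a POSIX system for posix_memalign):

#include <stdlib.h>

// A statically-allocated image buffer aligned to the Cortex-A8's 64-byte cache line.
static unsigned char imageBuffer[640 * 480 * 4] __attribute__((aligned(64)));

// Allocate a 64-byte-aligned buffer at runtime.
static void *allocCacheAligned(size_t bytes)
{
	void *p = NULL;
	if (posix_memalign(&p, 64, bytes) != 0)
		return NULL;	// Allocation failed.
	return p;
}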

 

ARM Cortex-A9 CPU of new multi-core Tablets:

The ARM Cortex-A9 CPU can be single-core, dual-core or quad-core, and features speculative out-of-order execution (which allows high-level code such as C/C++ to automatically run more efficiently), yet uses very little battery power. So the ARM Cortex-A9 is used in most of the latest multi-core devices, such as the Apple iPad2 (Apple A5 processor), LG Optimus 2X (nVidia Tegra2), Samsung Galaxy S II (Samsung Exynos 4210), Sony NGP PSP2, and the PandaBoard (TI OMAP4430). Here are some notes I made when reading the ARM Cortex-A Programmer's Guide:

Differences between ARM Cortex-A8 and Cortex-A9 (eg: iPad 1 vs iPad 2):

  • Cortex-A9 has many advanced features for a RISC CPU, such as speculative data accesses, branch prediction, multi-issuing of instructions, hardware cache coherency, out-of-order execution and register renaming. Cortex-A8 does not have these, except for dual-issuing of instructions and branch prediction. Therefore Assembly code optimizations & NEON SIMD are not as important on the Cortex-A9 anymore.
  • Cortex-A9 has 32 bytes per L1 cache line, whereas Cortex-A8 has 64 bytes per cache line.
  • Cortex-A9 has an external L2 cache (a separate "outer" PL310 or new L2C-310 chip), whereas Cortex-A8 has an internal L2 cache (on-chip "inner" cache, therefore faster).
  • Cortex-A9 MPCore has separate L1 Data and Instruction caches for each core, with hardware cache coherency for the L1 Data cache but not the L1 Instruction cache. Any L2 cache is shared externally between all the cores.
  • Cortex-A9 must use the PreLoad Engine in the external L2 cache controller (if it has one), whereas Cortex-A8 has an internal PLE for its L2 cache.
  • Cortex-A9 has a full VFPv3 FPU, whereas Cortex-A8 only has VFPLite. The main difference is that most float operations take 1 cycle on Cortex-A9 but take 10 cycles on Cortex-A8! Therefore VFP is very slow on Cortex-A8 but decent on Cortex-A9.
  • Cortex-A9 allows half-precision (16-bit) floats, whereas Cortex-A8 only allows 32-bit floats and 64-bit doubles. But very few operations support half-precision directly anyway.
  • Cortex-A9 can't dual-issue multiple NEON instructions, whereas Cortex-A8 can potentially dual-issue certain NEON load/store instructions with other NEON instructions.
  • Cortex-A8 had the NEON unit behind the ARM unit, so NEON had fast access to ARM registers & memory, but it took a 20-cycle delay for any registers or flags from NEON to reach the ARM side! This often occurs with function return values (unless the "hardfp" calling convention or function inlining is used).
  • Cortex-A8 had a separate load/store unit for NEON and one for ARM, so if they were both loading or storing addresses in the same cache line, it adds about 20 cycles of delay.
  • Cortex-A9 uses LDREX/STREX for multi-threaded synchronization without blocking all cores, whereas Cortex-A8 uses simple disabling of interrupts for mutexes.
  • All Cortex-A8 CPUs have a NEON SIMD unit, whereas some Cortex-A9 CPUs don't have a NEON SIMD unit (eg: nVidia Tegra 2 does not have NEON, but nVidia Tegra 3 will have NEON).

Notes on ARM Cortex-A9 or any ARM Cortex-A in general:

  • Cortex-A9 has a 4-way set associative L1 Data Cache using 32 bytes per cache line (16kB, 32kB or 64kB of L1 cache, which is 512, 1024 or 2048 L1 cache lines).
  • Cortex-A9 MPCore can't clean or invalidate both the L1 & external L2 caches at the same time, so incoherency can occur unless it is done in the correct order by software: to clean, clean the L1 cache first then L2; to invalidate, invalidate the L2 cache first then L1.
  • Cortex-A9 contains a "Fast Loop Mode" where very small loops (under 64 bytes of code and possibly cache line aligned) can run completely in the CPU decode & prefetch stages without accessing the instruction cache.
  • Cortex-A9 has support for Automatic Data Prefetching (if enabled by the OS), so that if you are accessing 1 or 2 arrays sequentially, it will detect this and prefetch the next data to cache before you will need it.
  • Cortex-A9 can detect when the instruction STM is used for memset() & memcpy(), and optimize the cache access by not loading data into cache if it will be overwritten anyway.
  • Cortex-A9 MPCore has a separate NEON module for each core. eg: a quad-core Cortex-A9 has 4 NEON units!
  • If the TLB does not have a page in its table, then a "page table walk" needs 2 or 3 memory accesses instead of 1.
  • "char" variables on ARM may default to unsigned chars, whereas they default to signed chars on x86, so this can cause runtime errors if not expected.
  • The first 4 arguments to a function are passed directly in the first 4 32-bit registers, whereas the rest of the arguments use stack memory and so are slower. But C++ automatically uses the 1st argument as a pointer to "this", so only 3 function arguments can go in registers.
  • 64-bit arguments are more tricky and limiting due to their 8-byte alignment requirement.
  • If a function will call another function, it needs to maintain an 8-byte stack alignment, so it should PUSH/POP an even number of registers. Leaf functions don't need 8-byte stack alignment.
  • When passing arguments with NEON Advanced SIMD using the "hardfp" calling convention, registers q0-q3 (s0-s15 or d0-d7) are used. Registers q4-q7 (s16-s31 or d8-d15) must be preserved if modified.
  • Newer C99 compilers allow the "restrict" keyword to say that pointers do not overlap other pointers, allowing compiler optimizations.
  • Cortex-A8 & Cortex-A9 have no hardware integer divide instruction, so any integer division is a slow (~50 cycle) library call or a floating-point divide. But shifts left or right are often free.
  • Since the Branch Target Address Cache (BTAC) is based on 16-byte sizes and only allows 2 branches per line, if any code has more than 2 branches within 16-bytes of code, then it is likely to flush the instruction pipeline.
  • Since Cortex-A9 does Register Renaming at up to 2 registers per cycle, LDM or STM instructions of 5 or more registers can cause pipeline stalls.
  • Conditional Execution of ARM mode (not Thumb) allowed speedups in older CPUs but now it is often faster to use branches, because conditional instructions may need unwinding.
  • Good info on optimizing memset() & memcpy() is given in section 17.19 of the ARM Programmer's Guide, which says to use LDM & STM of a whole cache line, that an aligned store is more important than an aligned load, and that up to 4 PLDs should be inserted, roughly 3 cache lines ahead of the current cache line.
  • Some info on optimizing float operations with VFP are given in Chapter 18 of the ARM Programmers Guide.
  • The Cortex-A9 has a big delay when switching between VFP and NEON instructions.
  • NEON can't process 64-bit floats, divisions or square roots, so they are done with VFP instead.
  • NEON can be detected at compile time by checking: #ifdef __ARM_NEON__
  • NEON can be detected at runtime on Linux by checking the CPU flags: either run "cat /proc/cpuinfo", or search the file "/proc/self/auxv" for AT_HWCAP and check the HWCAP_NEON bit (4096). See the sketch just after this list.
  • Cortex-A9 MPCore uses the MESI protocol to keep all the L1 caches coherent. Unfortunately, if one thread is often writing to a piece of data and another thread is often reading a different piece of data on the same cache line, that cache line is constantly transferred between the cores (cache-line thrashing).
  • The ARM DS-5 development suite generates faster code than the GCC/LLVM compilers and has a more powerful debugger (using the Eclipse IDE) that can analyze the system non-intrusively using CoreSight or JTAG.
  • The ARM "Vector Floating Point" (VFP) module was intended for SIMD vector operations, but it never became so! The VFP unit is just a scalar FPU for 32-bit floats and 64-bit doubles.
  • The ARM "Advanced SIMD" (NEON Media Processing Engine) unit is a true SIMD unit for integers (8, 16, 32 or 64 bit signed or unsigned), floats (32-bit only, plus limited 16-bit half-precision float support) and 16-bit binary polynomials.

 

Tutorial: Rotating an image using SIMD instructions

Many image processing operations can be performed very efficiently using NEON SIMD acceleration, since they operate locally on a small neighborhood of pixels at a time and can scan through the whole image in the same serial manner that it is stored in memory (ie: row-major form). But some tasks such as rotation require sparse or discontinuous memory access, and therefore may be tricky to implement in SIMD and still achieve a high performance boost.

This animation details how to rotate an image by 90 degrees (turn it onto its side) efficiently using NEON instructions. You can watch the data move instead of just reading about it:


Note: The very last slide in this animation shows the pixels as rotated 90 degrees counter-clockwise instead of 90 degrees clockwise! (Thanks to John Driscoll for pointing it out).
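To give a rough idea of the 4x4-block approach (this is not the exact Assembly code used for the benchmarks below, just a sketch using NEON intrinsics, and the function name & stride convention are my own): each 4x4 block of 32-bit BGRA pixels is transposed entirely in registers, and the full 90-degree rotation then comes from choosing where each transposed block is written in the destination image (the transpose plus a flip):

#include <stdint.h>
#include <arm_neon.h>

// Transpose one 4x4 block of 32-bit BGRA pixels using NEON intrinsics.
// srcStride & dstStride are measured in pixels (32-bit elements), not bytes.
static void transpose4x4_bgra(const uint32_t *src, int srcStride,
                              uint32_t *dst, int dstStride)
{
	uint32x4_t row0 = vld1q_u32(src + 0 * srcStride);	// Load 4 rows of 4 pixels each.
	uint32x4_t row1 = vld1q_u32(src + 1 * srcStride);
	uint32x4_t row2 = vld1q_u32(src + 2 * srcStride);
	uint32x4_t row3 = vld1q_u32(src + 3 * srcStride);

	// Interleave pairs of rows, then recombine the 64-bit halves to finish the transpose.
	uint32x4x2_t t01 = vtrnq_u32(row0, row1);
	uint32x4x2_t t23 = vtrnq_u32(row2, row3);
	uint32x4_t col0 = vcombine_u32(vget_low_u32(t01.val[0]),  vget_low_u32(t23.val[0]));
	uint32x4_t col1 = vcombine_u32(vget_low_u32(t01.val[1]),  vget_low_u32(t23.val[1]));
	uint32x4_t col2 = vcombine_u32(vget_high_u32(t01.val[0]), vget_high_u32(t23.val[0]));
	uint32x4_t col3 = vcombine_u32(vget_high_u32(t01.val[1]), vget_high_u32(t23.val[1]));

	vst1q_u32(dst + 0 * dstStride, col0);	// Store the 4 transposed rows.
	vst1q_u32(dst + 1 * dstStride, col1);
	vst1q_u32(dst + 2 * dstStride, col2);
	vst1q_u32(dst + 3 * dstStride, col3);
}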

 

Speed comparisons:

Here are some speed results I obtained from 2 different types of image processing functions: rotating an image, and shrinking an image. These two operations behave quite differently, since image shrinking can be done from top to bottom so memory is accessed in a serial manner, whereas rotation requires accessing memory in "column-major" order, which causes major delays in memory access rather than in CPU computation.


Rotating a 480x640 pixel BGRA color image by 90 degrees on an iPhone 3GS:

Implementation:                                             Time:      Speedup:
Transpose & Flip in OpenCV C library (GCC 4.2 with -O3):    66 msec
My NEON Assembly code using 4x4 blocks:                     7.5 msec   (9x faster than C code!)
My NEON Assembly code using 8x8 blocks:                     6.9 msec   (9.5x faster than C code!)
My NEON Assembly code using 16x4 blocks:                    7.8 msec   (8.5x faster than C code!)


Shrinking a 480x360 pixel greyscale image to 25% width and 25% height on an iPhone 3GS:

Implementation:                                             Time:      Speedup:
Resize function in OpenCV C library (GCC 4.2 with -O3):     38 msec
My C code (GCC 4.2 with -O3):                               27 msec
My ARMv7 Assembly code:                                     7.8 msec   (3.5x faster than C code)
My NEON Assembly code:                                      0.9 msec   (30x faster than C code!)