ARM assembler in Lisp

21st January 2020

This page has now been superseded by the latest version at ARM assembler overview.


This article describes an in-line assembler written in Lisp, as an add-on for uLisp. The project is currently a work in progress, but I thought it would be interesting to describe where I've got to, and I'd be interested to hear suggestions for extensions or improvements.

The assembler generates ARM Thumb code, and I've tested it on ARM M0 platforms, such as the Adafruit ItsyBitsy M0, and ARM M4 platforms, such as the Adafruit Metro M4. It is currently restricted to generating one assembler routine with up to three integer parameters, and returning an integer result.

22nd January 2020: Have updated the ARM Assembler to v2 and improved the tak assembler program.

24th January 2020: Have added an example to find the least prime factor of a number.

30th January 2020: Have added an example to reverse the order of bits in a number, and updated the assembler to add support for in-line constants and a debug mode that prints the generated code.

31st January 2020: Added an example for $word.

uLisp extensions

The assembler uses a special version of uLisp to which I've added two functions:

The assemble function takes a series of 16-bit integer arguments comprising the machine code of the routine being assembled. These are written into RAM. Any argument can also return nil, in which case it is ignored.

The call function will then call the assembler routine in RAM, and return the result.

The actual instruction generation is handled by an assembler written in Lisp. This makes it easy to check the generated code, and extend the assembler to handle additional instructions and addressing modes.

Using the assembler

To use the assembler you need to run the version of uLisp ARM 3.0x from here: uLisp ARM Version 3.0x

Then load in the Lisp source of the assembler from here: ARM Assembler

Assembler syntax

Where possible the syntax is very similar to ARM assembler syntax, with the following differences:

  • The mnemonics are prefixed by '$' (because some mnemonics such as push and pop are already in use as Lisp functions).
  • Registers are represented as symbols, prefixed with a quote. Constants are just numbers.
  • Lists of registers, as used in the $push and $pop mnemonics, are represented as a Lisp list.

Assembler instructions are just Lisp functions, so you can see the code they generate:

> (x ($mov 'r1 13))
#x210d

The function x is a convenient addition that prints a 16-bit value in hexadecimal.

The following table shows typical ARM assembler formats, and the equivalent in this Lisp assembler:

Examples ARM assembler uLisp assembler
Push and pop push  {r4, r5, r6, lr} ($push '(r4 r5 r6 lr))
Registers mov  r1, r4 ($mov 'r1 'r4)
Immediate mov r2, #3 ($mov 'r2 3)
Load relative ldr  r0, [r3, #0] ($ldr 'r0 '(r3 0))
Load in-line constant ldr  r0, label ($ldr 'r0 'label)
Branch bne label ($bne 'label)
Constant .word  0x0f0f0f0f ($word #x0f0f0f0f)

Simple example

Here's a trivial example consisting of three ARM Thumb instructions that multiplies its parameter by 13 and returns the result:

(assemble
 ($mov 'r1 13)
 ($mul 'r0 'r1)
 ($bx 'lr))

After executing the assemble command we can then call the function as follows:

> (call 11)
143

The result is the number returned in the r0 register.

If you prefer you could wrap up the call in a Lisp function:

(defun times13 (x) (call x))

Labels

The previous example assembled a function with no branches, in which case you can simply give the mnemonics as arguments to the assemble function.

However, to assemble branches to labels you have to do a bit of extra work as described here:

Enclose the whole program in a let to declare the labels you are going to use as local variables.

Then enclose the assemble call in a

  (dotimes (p 2 ass)
    (setq *pc* 0)
    ....
    )

to do a two-pass assembly which will resolve forward references [1].

Where you want to insert a label such as loop in the program use a statement such as:

($label 'loop)

This assigns the current program counter to the label variable, but returns nil so that no additional code is generated.

For example, here's a simple routine to calculate the Greatest Common Divisor, which uses three labels:

; Greatest Common Divisor
(let ((swap 0) (again 0) (minus 0))
  (dotimes (p 2)
    (setq *pc* 0)
    (assemble
     ($label 'swap)
     ($mov 'r2 'r1)
     ($mov 'r1 'r0)
     ($label 'again)
     ($mov 'r0 'r2)
     ($label 'minus)
     ($sub 'r2 'r2 'r1)
     ($blt 'swap)
     ($bne 'again)
     ($bx 'lr))))

For example, to find the GCD of 3287 and 3460:

> (call 3287 3460)
173

In-line constants

You can insert an in-line 32-bit constant with the $word function. This is often used in conjunction with the $ldr mnemonic to load a 32-bit constant into a register. The following example loads 1234567890 into r0 and returns it:

(let ((const 0))
  (dotimes (p 2)
    (setq *pc* 0)
    (assemble
     ($ldr 'r0 'const)
     ($bx 'lr)
     ($label 'const)
     ($word 1234567890))))
The result:
> (call)
1234567890 

For some more examples see Examples below.

The assembler

Here's a description of some of the key assembler routines.

The program counter *pc*

To handle labels and branches the assembler uses a global variable *pc* that needs to be defined as follows:

(defvar *pc* 0)

The other routines will give errors if you don't do this.

The debug option *debug*

The global variable *debug* should be defined:

(defvar *debug* nil)

To print a hexadecimal listing of the machine code generated by the assemble command set *debug* to t.

Packing arguments into bit fields

The emit function takes a list of bit field widths, and a series of arguments, packs the arguments into a 16-bit number according to the bit widths, and returns the result:

(defun emit (bits &rest args)
  (let ((word 0))
    (unless (= (apply #'+ bits) 16) (error "Incorrect number of bits"))
    (mapc #'(lambda (width value)
              (when (> value (1- (expt 2 width))) (error "Won't fit"))
              (setq word (logior (ash word width) value)))
          bits args)
    (when *debug* (x *pc*) (princ " ") (x word) (terpri))
    (incf *pc* 2)
    word))

The bit field widths should add up to 16. For example:

> (x (emit '(4 4 4 4) 12 13 14 15))
#xcdef

As a side effect the emit function also increments the global variable *pc*.

Generating the assembler instructions

A set of helper routines are provided that handle groups of mnemonics with similar parameters; for example mov-cmp-2 handles the $mov and $cmp mnemonics with an 8-bit immediate argument:

(defun mov-cmp-2 (op argd immed8)
  (emit '(4 1 3 8) 2 op (regno argd) immed8))

These are called in turn by the functions for each individual mnemonic such as $cmp:

(defun $cmp (argd argm)
  (cond
   ((numberp argm)
    (mov-cmp-2 1 argd argm))
   (t
    (and-mvn-4 10 argd argm))))

The number in the name, 2, refers to the value in the top four bits of the instruction to help classify the routines.

A separate helper routine, and-mvn-4, handles the case of $cmp with two register arguments.

I found Appendix B2 from the ARM System Developer's Guide by Andrew Sloss, Dominic Symes, and Chris Wright useful. [2]

Examples

Here are some examples, many from the Benchmarks page.

Fibonacci sequence

The Fibonacci sequence is:

1, 1, 2, 3, 5, 8, 13, 21 ...

where the first two terms are 1, and each subsequent term is the sum of the two previous terms. The following recursive function finds the nth term, counting from 0:

(defun fib (n)
  (if (< n 3) 1
    (+ (fib (- n 1)) (fib (- n 2)))))

Running the Lisp version on an Adafruit Metro M4:

> (for-millis () (print (fib 27)))

196418
24609

The first number is the result and the second number is the time, in milliseconds.

Here's the assembler version:

; Fibonacci sequence
(let ((fib 0) (loop 0) (add 0))
  (dotimes (p 2)
    (setq *pc* 0)
    (assemble
     ($label 'fib)
     ($push '(r4 r5 r6 lr))
     ($mov 'r5 'r0)
     ($mov 'r4 0)
     ($label 'loop)
     ($cmp 'r5 2)
     ($ble 'add)
     ($sub 'r0 'r5 1)
     ($bl 'fib)
     ($sub 'r5 2)
     ($add 'r4 'r4 'r0)
     ($b 'loop)
     ($label 'add)
     ($add 'r0 'r4 1)
     ($pop '(r4 r5 r6 pc)))))

Running the assembler version on an Adafruit Metro M4:

> (for-millis () (print (call 27)))

196418
61

The assembler version is approximately a factor of 400 faster.

Takeuchi function

The Takeuchi function is a classic benchmark for comparing implementations of Lisp, originally used by Ikuo Takeuchi of Japan. Here's the Lisp version:

(defun tak (x y z)
  (if (not (< y x))
      z
    (tak
     (tak (1- x) y z)
     (tak (1- y) z x)
     (tak (1- z) x y))))

For example:

> (for-millis () (print (tak 18 12 6)))

7 
5230

Here's an improved assembler version:

(let ((tak 0) (less 0))
  (dotimes (p 2)
    (setq *pc* 0)
    (assemble
     ($label 'tak)
     ($push '(lr))
     ($sub 'r3 'r1 'r0)
     ($blt 'less)
     ($mov 'r0 'r2)
     ($pop '(pc))
     ($label 'less)
     ($push '(r2 r1 r0))
     ($sub 'r0 1)
     ($bl 'tak)
     ($mov 'r3 'r0)
     ($pop '(r2))
     ($pop '(r0 r1))
     ($push '(r3 r2 r1 r0))
     ($sub 'r0 1)
     ($bl 'tak)
     ($mov 'r3 'r0)
     ($pop '(r2))
     ($pop '(r0 r1))
     ($sub 'r0 1)
     ($push '(r3))
     ($bl 'tak)
     ($mov 'r2 'r0)
     ($pop '(r1))
     ($pop '(r0))
     ($bl 'tak)
     ($pop '(pc)))))

Run it as follows:

> (for-millis () (print (call 18 12 6)))

7 
13

On a Adafruit Metro M4 the assembler version is a factor of 400 faster.

Hofstadter Q sequence

This is one of several recursive sequences described in Douglas Hofstadter's book "Gödel, Escher, Bach: an Eternal Golden Braid". It is related to the Fibonacci sequence, except that in this case the two preceding terms specify how far to go back in the sequence to find the two terms to be summed:

(defun q (n)
  (if (<= n 2) 1
    (+
     (q (- n (q (- n 1))))
     (q (- n (q (- n 2)))))))

Running the Lisp version:

> (for-millis () (print (q 21)))

12 
6798

Here's the assembler version:

; Hofstadter Q sequence
(let ((q 0) (compare 0) (add 0))
  (dotimes (p 2)
    (setq *pc* 0)
    (assemble
     ($label 'q)
     ($push '(r4 r5 r6 lr))
     ($mov 'r4 'r0)
     ($mov 'r5 0)
     ($label 'compare)
     ($cmp 'r4 2)
     ($ble 'add)
     ($sub 'r0 'r4 1)
     ($bl 'q)
     ($sub 'r0 'r4 'r0)
     ($bl 'q)
     ($mov 'r6 'r0)
     ($sub 'r0 'r4 2)
     ($bl 'q)
     ($add 'r5 'r5 'r6)
     ($sub 'r4 'r4 'r0)
     ($b 'compare)
     ($label 'add)
     ($add 'r0 'r5 1)
     ($pop '(r4 r5 r6 pc))))))

Running the assembler version:

> (for-millis () (print (call 21)))

12 
25

In other words, about 270 times faster.

Factor

This function takes a simple approach to finding the least prime factor of a number:

(defun factor (n)
  (let ((d 2) (i 1))
    (loop
     (when (> (* d d) n) (return n))
     (when (zerop (mod n d)) (return d))
     (incf d i) (setq i 2))))

If the number is prime, factor will print the number itself. To find the least prime factor of 2146654199 (46327 x 46337):

> (for-millis () (print (factor 2146654199)))

46327 
8189

Here's an equivalent assembler version:

(let ((factor 0) (test 0) (returnd 0) (returnn 0) (rem 0) (loop 0) (skip 0))
  (dotimes (p 2)
    (setq *pc* 0)
    (assemble
     ($label 'factor)
     ($push '(r3 r4 r5 r6 lr))
     ($mov 'r5 'r0)
     ($mov 'r4 2)
     ($mov 'r6 1)
     ($label 'test)
     ($mov 'r3 'r4)
     ($mul 'r3 'r4)
     ($mov 'r0 'r5)
     ($cmp 'r3 'r5)
     ($bgt 'returnn)
     ($mov 'r1 'r4)
     ($bl 'rem)
     ($cmp 'r0 0)
     ($beq 'returnd)
     ($add 'r4 'r6)
     ($mov 'r6 2)
     ($b 'test)
     ($label 'returnd)
     ($mov 'r0 'r4)
     ($label 'returnn)
     ($pop '(r3 r4 r5 r6 pc))
     ; rem subroutine
     ($label 'rem)
     ($push '(r4 lr))
     ($mov 'r3 0)
     ($mov 'r2 32)
     ($label 'loop)
     ($lsl 'r0 1)
     ($adc 'r3 'r3)
     ($cmp 'r3 'r1)
     ($blt 'skip)
     ($sub 'r3 'r1)
     ($add 'r0 1)
     ($label 'skip)
     ($sub 'r2 1)
     ($bne 'loop)
     ($mov 'r0 'r3)
     ($pop '(r4 pc)))))

It uses a subroutine, rem, to calculate the remainder when an unsigned 32-bit number is divided by a 32-bit divisor. On an ARM Cortex M4 board you could replace this with a udiv instruction (and a mul).

Doing the same test to find the least prime factor of 2146654199 (46327 x 46337):

> (for-millis () (print (call 2146654199)))

46327 
140

In other words, approximately 60 times faster.

Factorize

You can use the above function as the basis for a simple recursive routine to factorize a number into a list of its prime factors:

(defun factorize (n)
  (let ((f (call n)))
    (if (= n f) (list n) (cons f (factorize (/ n f))))))

For example:

> (factorize 731731731)
(3 17 43 333667)

Reversebits

This is a function to efficiently reverse the order of bits in a 32-bit number [3]. Here's the uLisp version:

(defun reversebits (x)
  (setq x (logior (ash (logand x #x55555555) 1)
                  (logand (ash x -1) #x55555555)))
  (setq x (logior (ash (logand x #x33333333) 2)
                  (logand (ash x -2) #x33333333)))
  (setq x (logior (ash (logand x #x0f0f0f0f) 4)
                  (logand (ash x -4) #x0f0f0f0f)))
  (setq x (logior (ash x 24) (ash (logand x #xff00) 8)
                  (logand (ash x -8) #xff00) (ash x -24))))

For example:

> (reversebits #x12345678)
510274632

where 510274632 is #x1E6A2C48, which you'll see is the right answer.

Here's the assembler version, which demonstrates the use of in-line constants:

(let ((rev 0) (fives 0) (threes 0) (nibbles 0) (return 0))
  (dotimes (p 2)
    (setq *pc* 0)
    (assemble
     ($label 'rev)
     ($push '(r4 lr))
     ; Swap odd and even
     ($ldr 'r2 'fives)
     ($lsr 'r1 'r0 1)
     ($lsl 'r0 1)
     ($and 'r1 'r2)
     ($bic 'r0 'r2)
     ($orr 'r0 'r1)
     ; Swap adjacent pairs
     ($ldr 'r1 'threes)
     ($lsr 'r2 'r0 2)
     ($lsl 'r0 2)
     ($and 'r2 'r1)
     ($bic 'r0 'r1)
     ($orr 'r2 'r0)
     ; Swap adjacent nibbles
     ($ldr 'r0 'nibbles)
     ($lsr 'r1 'r2 4)
     ($lsl 'r2 4)
     ($and 'r1 'r0)
     ($bic 'r2 'r0)
     ($orr 'r1 'r2)
     ; Reverse all bytes
     ($rev 'r0 'r1)
     ($label 'return)
     ($pop '(r4 pc))
     ($label 'fives)
     ($word #x55555555)
     ($label 'threes)
     ($word #x33333333)
     ($label 'nibbles)
     ($word #x0f0f0f0f))))

Testing it:

> (call #x12345678)
510274632

  1. ^ If anyone remembers the Acorn Atom, a home computer with an inline 6502 assembler, this is the trick that it used to resolve forward references.
  2. ^ It's available as a PDF here: http://engold.ui.ac.ir/~nikmehr/Appendix_B2.pdf.
  3. ^ From "Hacker's Delight" by Henry S. Warren, Jr., page 129.