FlashForth: Execution Speed

2024-04-01

9 minute read

Forth , FlashForth , ATmega328

Where I discuss the various methods of increasing execution speed in Forth and demonstrate the ease in doing so.

Introduction

In this entry, I showed the relative speeds of board/language combinations from a ARM M0 running CircuitPython to a Pi Pico running C. While execution speed is important in embedded microcontrollers, development speed is equally important along with ease of development. A great attribute of Forth is its ability to combine all three. In this entry, I’ll show how you can start with a simple, fast example and make it significantly faster without great difficulty.

This entry will require a tool to measure the frequency on a pin. I’ll use the frequency to derive the execution speed of a word or application. I’ll be using a Digilent Analog Discovery as an oscilloscope, however, many digital multimeters (DMM)may provide a frequency readout as well.

Why is this Important?

Execution speed may matter even in embedded applications which aren’t time-critical. A common example is button debouncing, which I will cover later in this entry. If a button debounce routine is too slow, the application will feel unresponsive to the user leading to a frustrated user. If the debounce routine consumes too much processor time, it could lead to bad data, when the processor misses other inputs. Even a very simple application such as persistence-of-vision requires speed to fool the eye. Anything complex quickly becomes untenable, if the application isn’t capable of delivering sufficient speed.

In Forth its quite easy to develop an embedded application without an initial concern for speed. Th.is is due to the development process of Forth and it is quite fast “out-of-the-box”. The language C has the latter, however, its rarely as easy to use as Forth and MicroPython can be quite easy to use, however, its speed is sorely lacking.

Start With a Word

It seems like all Forth application development starts with a word. Typically, the critical one which serves as an example as to what you want to accomplish. In this case, let’s start with one, which is very simple, light an LED.

The previously discussed 328P_ports provides the words neccesary to setup a pin and light the pin. Load the file on to your Forth board and enter:

LED out
LED on

The first command, sets the built-in LED as an output and the second, turns it on. You can also access the level of the pin by connecting your scope or DMM to pin D13 on your Uno, Nano or equivalent.

As I’ll be building the task switching application show in here, I’ll create two words:

\ used to setup tasks, sets pins to output
: setup ( -- )
    D13 out
;

\ individual tasks to run, toggles the pin
: task_13 ( -- )
	D13 tog
;

The words do the same thing as our original commands, however, we can expand on them. I also used tog which flips the bit as compared to on and off as its easier to read, use and already begins to enhance our speed. Using the words on/off requires far more overhead, than a simple toggle. Load the above words, either from a file or by typing them in then run them.

setup task_13
task_13

You’ll see the built-in LED light then turn off, (or vice versa), depending on the original value. You may also enter two sequential task_13 words, and the value won’t appear to change, as the two words execute faster than the eye can detect.

Depending how you loaded the words (type or download file), it is immediately apparent that this type of development, is at worst equal to using the Python REPL and much easier than C or Arduino’s compile/link/load. I am still unable to measure execution time, unless I want to use my watch.

Execution Time

Measuring execution time with a scope or DMM requires repeated execution. When the task is repeatedly executed, I can measure the frequency, calculate the positive width, based on the duty cycle and this will be the time it takes for a full round-trip of execution.

Adding an infinite loop, looks like this:

\ run all tasks including setup to measure execution speed
: alltasks ( -- )
	setup
	begin
		task_13
	again
;

And when I run alltasks and measure on pin 13 using the scope, I get the following image:

Large Version to see detail

The calculations are automatically performed, however, if you were using a DMM, you would do the following:

1/frequency = period, 1/103 x 10^3 = 9.6 microseconds (us)
50% duty cycle needs to be confirmed, if only one task
9.6 * .5 is the width of each toggle or 4.8us

Which means it takes 4.8us for each time through the loop.

Let’s add more tasks.

Adding Tasks

To add task, duplicate the word, task_13 and change the numbers appropriately, for each pin added, also add them to setup. To do this for 8 pins, it would look like this:

\ used to setup tasks, sets pins to output
: setup_8 ( -- )
    D2 out
    D3 out
    D4 out
    D5 out
    D6 out
    D7 out
    D8 out
    D13 out
;

\ used to initialize pins to known (low) state
: init_8 ( -- )
    D2 low
    D3 low
    D4 low
    D5 low
    D6 low
    D7 low
    D8 low
    D13 low
;

\ individual tasks to run, toggles the pin
: task_2 ( -- )
	D2 tog
;

\ individual tasks to run, toggles the pin
: task_3 ( -- )
	D3 tog
;

\ individual tasks to run, toggles the pin
: task_4 ( -- )
	D4 tog
;

\ individual tasks to run, toggles the pin
: task_5 ( -- )
	D5 tog
;

\ individual tasks to run, toggles the pin
: task_6 ( -- )
	D6 tog
;

\ individual tasks to run, toggles the pin
: task_7 ( -- )
	D7 tog
;

\ individual tasks to run, toggles the pin
: task_8 ( -- )
	D8 tog
;

\ run all tasks including setup to measure execution speed
: alltasks_8 ( -- )
	setup_8
	begin
		task_2
		task_3
		task_4
		task_5
		task_6
		task_7
		task_8
		task_13
	again
;

Large Version to see detail

It took me a moment to understand why the pulse width was so narrow. Using the toggle method to flip a pin value, will flip all pins on the port, which means when D4 (blue) is high, and D5 (orange) is toggled, it will automatically toggle D4 back to low. Which might not be what we want to have happen. For now, let’s focus on how often D4 is toggled high, which is every 37.6us, which is a little less 8 * 4.8us ~ 38.4us above.

As for the toggle issue, this is easy to confirm. Enter the following in your serial console and watch the values for the pins change:

D4 tog \ D4 goes high, D5 no change
D5 tog \ D5 goes high, D4 goes low
D4 tog \ D4 goes high, D5 goes low

This is an artifact of using the mset word in FlashForth. The mset word will read the address, then or it with the bit mask. If a pin has been toggled, a 1 has been written to the pin and when read, it will read high, toggling its value as well as the desired value. I won’t worry about this yet, as it might not be a concern when using assembly language-based words.

Understanding the Timing

What does this mean?

It is a simple experiment to help us understand the execution time of the task. We start to gain a better understanding of the costs by expressing it in clock cycles. With a 16Mhz ATmega328P, most instructions take 1 clock cycle, or 62.5 nanoseconds (ns). If we have a series of instructions which take 4.6us, we divide 4600ns/62.5ns to arrive at ~74 clock cycles.

As we refine how long the task execution is, we also can begin to understand the task switching costs, which in some languages can be quite high. Let’s continue.

Optimizing using Assembly Language

Forth and particularly FlashForth has an extremely easy and powerful mechanism which allows replacing Forth words with Forth assembler words. By doing so, I can remove overhead and increase the speed of execution. The change is absurdly easy, we do the following:

\ replace the tog word with assembly language
: task_2 [ pind-io #2 sbi, ] ;

This new task word, uses an assembly language command sbi to set a bit on I/O port. In this example, we are setting pin 2 of the PIND register in the I/O space. It uses the microcontroller’s ability to set a bit, which means we’ll probably fix the toggle issue. Our new task code is the following, and notice that we also change the names slightly. We will also need to load asm.fs (found in FlashForth/avr/forth):

: task_2a [ pind-io #2 sbi, ] ; 
: task_3a [ pind-io #3 sbi, ] ; 
: task_4a [ pind-io #4 sbi, ] ; 
: task_5a [ pind-io #5 sbi, ] ; 
: task_6a [ pind-io #6 sbi, ] ; 
: task_7a [ pind-io #7 sbi, ] ; 
: task_8a [ pinb-io #0 sbi, ] ; 
: task_13a [ pinb-io #5 sbi, ] ;

Here’s the scope version:

alltasks running 8 assembly language tasks

Large Version to see detail

We immediately see two things, our overall speed as gone from 26kHz to 108kHz, which means the execution time of each task is 4.8us. And the waveform is symmetrical, which means we no longer have the toggle issue. (Another fix for the toggle issue is to use bit addr c! on the PINn register.)

We have one more trick up our sleeve. If we had the word inlined after each set of assembly code, it will inline the code as compared to a call. This increases the speed as well.

: task_2a [ pind-io #2 sbi, ] ; inlined
: task_3a [ pind-io #3 sbi, ] ; inlined 
: task_4a [ pind-io #4 sbi, ] ; inlined
: task_5a [ pind-io #5 sbi, ] ; inlined
: task_6a [ pind-io #6 sbi, ] ; inlined
: task_7a [ pind-io #7 sbi, ] ; inlined
: task_8a [ pinb-io #0 sbi, ] ; inlined
: task_13a [ pinb-io #5 sbi, ] ; inlined

Here’s the scope version:

alltasks running 8 assembly language, inlined tasks

Large Version to see detail

This is just about as fast as it can get! Here’s the analysis from the board speed page:

With 1.12 microseconds between when a task toggled the bit high then went it toggled the bit low. Let’s analyze the overhead.

With a 16Mhz clock, 1.12us is 18 clock cycles.
toggle consumes 1 clock cycle * 8 = 8
begin/again are 1 clock cycles each = 2
total overhead is the balance or 8 clock cycles
for an overhead of 1 clock cycle per task

Using Button as a Parameter

This approach is more like an Arduino way of doing things. It uses a series of arrays to track button pin, button history etc. This allows for the greatest flexibility particularly when in development. It also consumes more execution time, as to interact more with memory than the hard-coded example below.

Description	avr_blink	blink
Program (bytes)	624	1022
Data (bytes)	11	11
Overhead (us)	2.6	10.8

Hardcoding Button

This is much more Forth-like. Forth is designed to be simple and specific. Hard-coding the pin numbers into the Forth words, makes sense as the numbers are

FlashForth: Execution Speed

Introduction

Why is this Important?

Start With a Word

Execution Time

alltasks running 1 task

Adding Tasks

alltasks running 8 tasks

Understanding the Timing

Optimizing using Assembly Language

alltasks running 8 assembly language tasks

alltasks running 8 assembly language, inlined tasks

Using Button as a Parameter

Hardcoding Button