K4print

lib/v2.1 lib/v3.0 Example program: concepts and tests of the serial terminal output driver for kernel-v4.

Overview

Output to a serial terminal, or many other output devices, is notoriously slow compared to the execution pace of code. One single character over a UART running at 38,400 Baud takes some 260 microseconds to transmit, which is over 30,000 clock cycles at 125 MHz. Keeping the CPU spinning while busy waiting wastes all these cycles, and can disrupt the responsiveness and timeliness of the control program.^[1]
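The figures above follow from simple arithmetic. Here is a minimal sketch, assuming 8N1 framing (one start bit, eight data bits, one stop bit, so ten bits on the wire per character). The framing is an assumption; the text only states the baud rate and the clock frequency.

```python
# Time on the wire for one character, and what that costs in CPU cycles.
# Assumption: 8N1 framing, i.e. 10 bits per character.
BAUD = 38_400
CLOCK_HZ = 125_000_000
BITS_PER_CHAR = 10

char_time_us = BITS_PER_CHAR / BAUD * 1e6
cycles_per_char = CLOCK_HZ * BITS_PER_CHAR / BAUD

print(f"{char_time_us:.1f} microseconds per character")
print(f"{cycles_per_char:.0f} clock cycles per character")
```

Under this assumption, the result lands at roughly 260 microseconds and a bit over 32,000 cycles, in line with the numbers quoted above.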

Now, with kernel-v4 being pre-emptive, any busy waiting loop can be interrupted to run higher priority program code (actors), that is, the kernel remains responsive and keeps the timing, within the limitations imposed by the context switch using the NVIC. Hence the question arises if it's even worth implementing a more complex scheme to avoid busy waiting. The answer is, as so often: it depends.

This test program serves to investigate three different approaches when using kernel-v4:

1. busy waiting directly in the control program actor;
2. busy waiting in a dedicated driver-level actor;
3. no busy waiting using interrupts.

Cases 2 and 3 also show a real-world application for kernel-v4, beyond the test programs so far. In fact, their design and implementation have led to a few improvements of the kernel implementation, but the concepts are sound. So far. I'll admit that thinking in event and ready queues, following the execution through all the involved interrupts etc. requires some effort initially. :)

The Test Program

The test program runs three actors:

One high priority actor ahp to visualise the pre-emptive nature of the kernel. It periodically reactivates itself with a period of 10 microseconds, and simply toggles a GPIO pin.
Two actors a0 and a1 running back to back, activated by a one-second kernel tick. They run at a medium priority, executing the same code, writing to the serial terminal. Using two actors allows us to better explore the effect of the UART FIFO. Their test output is a mix of literal strings and numbers.

The behaviours and effects were visualised and measured using GPIO pins and an oscilloscope, apart from the boring terminal output, which however shows if things are even working.^[2]

The UART Drivers

Overview

The UART drivers conform to the concepts defined in module TextIO. They are selected and configured in module Main. The test program will use the corresponding TextIO.Writer, without the need to modify the program itself.

The drivers take over when the output has been converted to strings in module Texts – see PutString in the driver modules.

UARTstr

UARTstr is the plain, default busy-waiting driver for string output to the terminal. Like all other drivers, it is based on UARTdev, which provides the device data and the base operations used by all output drivers. UARTstr is not dependent on the kernel.

UARTstrKbw

UARTstrKbw uses the concepts of kernel-v4, namely actors, messages, and event and ready queues.

The basic idea is to delegate the actual terminal output to a dedicated actor in the driver which runs at the lowest priority. When a program-level actor writes to the serial terminal, the output string is divided up into parts as needed (possibly only one), which are put onto an event queue as messages with a string data field (payload). The driver-level actor fetches these messages, is put on a ready queue, and then writes the corresponding characters to the UART FIFO when it is run from the ready queue's interrupt handler. When no message is on the event queue, the driver actor subscribes to the event queue, awaiting the next message. That is, the event queue serves as "job queue" for the driver actor.

When writing the characters contained in a message to the UART FIFO, the actor writes the FIFO using busy-waiting (write FIFO on 'not full'), and then fetches the next message from the event queue.
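The parcelling and the "job queue" described above can be sketched as a small simulation. This is not the driver's code: the payload size of 16 characters and all names (`split_into_messages`, `put_string`, `driver_actor_run`) are hypothetical, and a plain `deque` stands in for the kernel's event queue.

```python
from collections import deque

PAYLOAD_SIZE = 16  # hypothetical capacity of a message's string field


def split_into_messages(text, payload_size=PAYLOAD_SIZE):
    """Cut text into consecutive parts of at most payload_size characters."""
    return [text[i:i + payload_size] for i in range(0, len(text), payload_size)]


job_queue = deque()   # stands in for the event queue used as "job queue"
uart_output = []      # stands in for characters written to the UART FIFO


def put_string(text):
    """Program-level side: only parcels the string and enqueues the parts."""
    for part in split_into_messages(text):
        job_queue.append(part)


def driver_actor_run():
    """Driver side: take one message and write its characters out.

    Returns False when the queue is empty; the real actor would then
    subscribe to the event queue and wait for the next message.
    """
    if not job_queue:
        return False
    for ch in job_queue.popleft():
        uart_output.append(ch)   # real driver: busy-wait on 'FIFO not full'
    return True


put_string("The quick brown fox jumps over it")  # 33 characters
parts = list(job_queue)
while driver_actor_run():
    pass
print(parts)
print("".join(uart_output))
```

The point of the split is visible in `put_string`: the calling actor only pays for slicing and enqueueing, while the slow part sits in `driver_actor_run`.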

As outlined above, the pre-emptive nature of the kernel allows all actors with higher priority than the driver actor to run in a timely manner. What sets this implementation apart from the plain busy-waiting UART driver UARTstr is that, first, the application-level actor writing to the serial terminal is only blocked for the time it takes to parcel the output string into message strings, and second, that it does not prevent actors of lower priority from being activated for all the busy-waiting time.

For clarity, since any actor can be placed on any ready queue, we could also implement this concept on program level, by setting up a lowest prio ready queue to be used with the plain UARTstr, on which the actor would be put while writing to the terminal. Obviously, the actor would not be responsive any more to its own triggering events while writing to the terminal, but depending on the actual use case and program architecture, this can still be a valid implementation.

UARTstrKint

Like UARTstrKbw, UARTstrKint uses the concepts of kernel-v4. It works the same way regarding the preparation of the message strings to be put on the event queue (see below). However, it does not use busy-waiting when writing to the UART FIFO, but the interrupts from the UART.

Unfortunately, the UART does not provide an interrupt for the "FIFO empty" condition. When using the FIFO, the transmit interrupt is triggered when the number of elements in the FIFO crosses one of the defined threshold levels. The lowest threshold possible is four elements. So if we only write up to four characters to the FIFO, we never get an interrupt.
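To make the limitation concrete, here is a toy model of a transition-triggered threshold interrupt. It only mirrors the statement above and is not a model of the actual UART hardware: it ignores the transmit shift register, writes all characters at once, and the function name `count_tx_interrupts` is made up for this sketch.

```python
THRESHOLD = 4  # lowest selectable threshold, per the text above


def count_tx_interrupts(chars_written, threshold=THRESHOLD):
    """Write all characters at once, then let the FIFO drain completely.

    Toy assumption: an interrupt is counted only when the fill level
    drops from above the threshold to the threshold.
    """
    level = chars_written
    interrupts = 0
    while level > 0:
        previous, level = level, level - 1
        if previous > threshold and level <= threshold:
            interrupts += 1
    return interrupts


for n in (1, 4, 5, 32):
    print(n, count_tx_interrupts(n))
```

In this model, short strings of up to four characters never cross the threshold, so nothing would ever wake up a driver that waits for the interrupt.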

I have experimented with different solutions to try to work around this limitation, but have come to the conclusion that the simplest approach is to avoid the FIFO for this driver. Without FIFO, whenever the one character written to the single-item transmit buffer is transmitted out to the serial line, we get an interrupt.

As with UARTstrKbw, a string to be written to the terminal gets assigned to a message, or split into several such messages if the message string buffer cannot hold the whole string. These messages are queued on the output event queue (the "job queue"). Again, the device actor acquires these messages, gets put onto the ready queue and run from there, and writes them to the UART. More precisely, it writes one character, then subscribes to a second event queue designated to the UART. After the one character has been written to the serial line, the interrupt signalling the empty transmit buffer fires, and its handler writes the next character. This repeats until all characters of the string contained in the original message have been written to the UART. That is, the message assigned to the driver actor serves as temporary storage for the string. After the whole string has been written to the UART, the UART interrupt handler sends a "done" message to the waiting driver actor, which will be executed via the one ready queue, and fetch the next string message from the "job queue".
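The hand-over between driver actor and interrupt handler described above can be sketched as a small state machine. This is a simulation under assumptions, not the driver's code: the class and method names are hypothetical, the "done" message is reduced to a flag, and the interrupts are fired by a plain loop.

```python
class TxDriverSketch:
    def __init__(self):
        self.line = []        # characters that went out to the serial line
        self.pending = ""     # string held by the message in progress
        self.index = 0        # next character to write
        self.done = False     # stands in for the "done" message

    def _write_char(self):
        self.line.append(self.pending[self.index])
        self.index += 1

    def actor_start(self, message_text):
        """Driver actor: take a non-empty message, write its first character."""
        self.pending, self.index, self.done = message_text, 0, False
        self._write_char()
        # real actor: now subscribes to the UART event queue and returns

    def tx_interrupt(self):
        """UART handler: runs each time the transmit buffer becomes empty."""
        if self.index < len(self.pending):
            self._write_char()
        else:
            self.done = True  # real handler: send "done" to the driver actor


drv = TxDriverSketch()
drv.actor_start("hello")
interrupts = 0
while not drv.done:
    drv.tx_interrupt()        # simulated: one interrupt per character sent
    interrupts += 1
print("".join(drv.line), interrupts)
```

Note that the actor writes only the first character. Everything else happens in the handler, one character per interrupt, with the last interrupt used to signal completion.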

This approach avoids busy waiting, at the cost of slightly higher conceptual and implementation complexity.

Results

Let's look at the behaviour and results with each of the drivers when running the test program, as described above.

The serial line is configured to a low 38,400 Baud, the rate of the built-in Astrobe terminal. The RP2350 MCU clock frequency is 125 MHz.


UARTstr

With this driver, thanks to the 32-entry transmit FIFO, writing fewer than 32 characters is as fast as it gets – nothing beats a tight loop filling up the FIFO. If we wait for the FIFO to drain over the serial line before we write again, this is the fastest output method with the least impact on the rest of the system. However, running the two actors back to back, we see that the first is fast, but the second will have to pace itself down to the speed of the serial connection, resulting in a long-lasting busy wait loop.

a0 and a1 writing 33 characters^[3] each:

blue: actor a0, time to write all output (see program)
green: actor a1, ditto
yellow: high prio actor

First, we see that the high prio actor, running with a period of 10 microseconds, does pre-empt the other actors indeed (yellow).

The blue line for a0 shows that writing to the empty FIFO is very fast (about 24 microseconds, see the blue cursors), but then a1 (green) has to substantially slow down. Compressing the time base to make the green busy waiting loop fully visible:

We get about 8.6 milliseconds (8,600 microseconds) for the second actor. Compare the barely visible blue loop time of a0 to the green of a1.^[4]

If we add one character to write by both actors a0 and a1:

Oops! Now also actor a0 has to wait, for one character to be transmitted out of the FIFO, which takes about 230 microseconds (note the different time base).

With even more characters to serially transmit, the timing gets worse, obviously. With 60 characters, a0 takes nearly 7,000 microseconds for its blocking busy waiting loop, a1 takes over 15,000 microseconds.
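These measurements fit a simple back-of-the-envelope model. The sketch below is an estimate, not the author's method: it assumes 10 bits per character (8N1), and it assumes that 33 characters can be written to an idle transmitter without waiting (32 FIFO entries plus the one being shifted out), which is inferred from the measurements above. It ignores the time spent executing the writes themselves, so it slightly overestimates short waits.

```python
CHAR_TIME_US = 10 / 38_400 * 1e6   # assumes 10 bits per character (8N1)
NO_WAIT_CHARS = 33                 # assumed: 32 FIFO entries + 1 in transit


def a0_wait_us(n):
    """First actor writes n characters into an idle transmitter."""
    return max(0, n - NO_WAIT_CHARS) * CHAR_TIME_US


def a1_wait_us(n):
    """Second actor follows immediately after the first.

    Whatever a0 left queued still has to go out, so a1 waits for one
    character time per character that does not fit right away.
    """
    backlog = min(n, NO_WAIT_CHARS)
    return max(0, backlog + n - NO_WAIT_CHARS) * CHAR_TIME_US


for n in (33, 34, 60):
    print(n, round(a0_wait_us(n)), round(a1_wait_us(n)))
```

For 33 and 60 characters the estimates come out close to the oscilloscope readings quoted above, which suggests that the blocking time is almost entirely serial transmission time.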

Note that during the blue and green busy waiting loops, all actors with the same or lower priority than a0 and a1 cannot run. The scheduler is pre-emptive, but does not do any time slicing, so all actors of the same priority run sequentially, and while one actor runs it cannot be interrupted by any actor of the same (or lower) priority.^[5]

UARTstrKbw

Same test with 33 characters:

a0 (blue) and a1 (green) take just over 50 microseconds.

Notably, both actors a0 and a1 take about the same time to do their outputs, because they simply offload the actual serial transmit to another actor inside the driver, as described above. So they only block for the time it takes to split up all output strings as needed and assign them to string messages, which are put on the corresponding event queue, from where the driver actor will fetch them and do the serial transmission. Eagle-eyed, you have realised that it takes longer to create all these string messages than to fill up an empty FIFO: as said above, with fewer characters to transmit than the FIFO depth, nothing beats the simple UARTstr driver.

The added red-ish trace shows the behaviour of the driver actor, namely the time it spends writing to the FIFO, including any busy waiting. The driver actor repeatedly gets the next message string from the event queue, then fills the FIFO in a busy waiting loop.

Compressing the time base:

The sharp red spikes downwards signify when the driver actor fetches another string message from the event queue, that is, the busy waiting is split into several shorter loops.

The overall time spent busy-looping to transmit all strings from the messages is about 8,600 microseconds, that is, the same as with UARTstr. Which is not surprising, since we have simply "delegated" the busy waiting to a lowest prio actor, allowing actors a0 and a1 to continue, using only just above the 50 microseconds it takes to split up their output strings into the message strings, indicated by the blue and green spikes on the left hand side of the screen.

With 60 transmitted characters per actor:

a0 and a1 now take close to 80 microseconds blocking time each, compared to 7,000 and 15,000 microseconds, respectively, with UARTstr.

The overall transmission time is close to 23,000 microseconds:

Only a higher serial baud rate would bring down that busy waiting period. However, any actor with a higher priority than the driver actor can run during this time. Even an actor with the same priority would run, if triggered, whenever the driver actor gets the next message (the red downward spikes).

UARTstrKint

Last but not least, the driver based on the UART interrupt.

The usual test with 33 characters per actor a0 and a1:

Since creating the string messages, and putting them on the event queue, is the same for UARTstrKbw and UARTstrKint, we again get about 50 microseconds for both a0 and a1.

Here, the red trace shows the runs of the UART interrupt handler once every character:

We get an overall time duration of 17,000 microseconds for 33 characters per a0 and a1, which basically corresponds to the serial transmission time, since each character is written by the software individually, without the benefit of the FIFO.

With 60 characters written by a0 and by a1 each, the picture is basically the same, just with more "red spikes", one per character.
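The overall times of the two kernel-based drivers can be estimated the same way. This is a rough model, not a measurement: it assumes 10 bits per character (8N1), and for UARTstrKbw it assumes that the first 33 characters go into the idle FIFO without waiting. The function names are made up for the sketch.

```python
CHAR_TIME_US = 10 / 38_400 * 1e6   # assumes 10 bits per character (8N1)
NO_WAIT_CHARS = 33                 # assumed: 32 FIFO entries + 1 in transit


def kbw_busy_us(total_chars):
    """UARTstrKbw: busy waiting starts once the FIFO has been filled."""
    return max(0, total_chars - NO_WAIT_CHARS) * CHAR_TIME_US


def kint_span_us(total_chars):
    """UARTstrKint: no FIFO, one interrupt per character at line speed."""
    return total_chars * CHAR_TIME_US


for per_actor in (33, 60):
    total = 2 * per_actor          # a0 and a1 write the same amount
    print(per_actor, round(kbw_busy_us(total)), round(kint_span_us(total)))
```

For 33 characters per actor the model gives roughly 8,600 and 17,000 microseconds, and for 60 characters roughly 23,000 microseconds with UARTstrKbw, matching the readings above. The UARTstrKint value for 60 characters is the model's estimate only.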

Bottom Line

Responsiveness

The high priority actor in the test program, toggling a pin every 10 microseconds (the yellow trace), demonstrates that the kernel and the control program remain responsive also during busy-wait loops in lower priority actors. Pro memoria, the context switches are handled by the NVIC hardware, which also takes care of prioritising the actors, or control processes. I still need to run the kernel through its paces to actually measure the responsiveness and latency.

The UART String Output Drivers

With the responsiveness aspect of the control processes covered with kernel-v4, and using a suitable interrupt priority scheme, busy waiting loses some of its detrimental potential to disrupt the timing of a control program. Even the simple UARTstr driver, which does busy waiting right in the actor (task) that writes to the terminal, may be a suitable design. If we just sample some data every second, and update a log or a screen, we probably don't even need a kernel. I personally would use one anyway, since I find structuring a control program into control processes an approach that fits my brain.

UARTstrKbw is the more elaborate cousin of UARTstr. It still busy-waits, but delegated to a dedicated actor in the driver itself, which runs at a priority lower than the control program actors, possibly the lowest the MCU provides. This allows the control program actors to continue while the output is written to the terminal, and remain responsive to their triggering events.

UARTstrKint uses the same principles as UARTstrKbw regarding the "input job queue" with all the strings, or parts thereof, to be written to the UART as payloads in messages lined up in the event queue. However, UARTstrKint employs the transmit interrupt to write the strings to the UART, without busy waiting.

At 38,400 baud, the interrupt handler runs every 260 microseconds, once per character transmitted, for about 0.5 microseconds. That's a very low duty cycle, and may be used to reduce power consumption, depending on the overall program design (think WFI).
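The duty cycle mentioned above is easy to put a number on. A minimal sketch, assuming 10 bits per character (8N1) for the interrupt interval; the handler run times are the ones quoted in this text, 0.5 microseconds for the specific handlers and 0.8 microseconds for the generic one.

```python
CHAR_TIME_US = 10 / 38_400 * 1e6   # interrupt interval, 8N1 assumed
SPECIFIC_HANDLER_US = 0.5          # run time quoted in the text
GENERIC_HANDLER_US = 0.8           # run time quoted in the footnote

duty_specific = SPECIFIC_HANDLER_US / CHAR_TIME_US
duty_generic = GENERIC_HANDLER_US / CHAR_TIME_US

print(f"specific handler: {duty_specific:.2%}")
print(f"generic handler:  {duty_generic:.2%}")
```

Either way, the handler occupies well under one percent of the time during a transmission.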

That is, as mentioned at the top, each of the UART string output drivers can have its use, depending on the overall program requirements and design.

General Notes

Using messages to share data between concurrent processes is a useful concept (duh). The messages are either on event queues, where they are not mutated, or are "owned" by either the sender or the receiver. As we don't allow creating new message instances after program initialisation, the messages are held in message pools when not in use. These pools must be sized appropriately. The queues themselves are just linked lists, so there are no size limits, and thus no corresponding design decision to make.
There's quite a bit of boilerplate program text to set up all the queues and the pools. Together with consolidating the kernel-v4 modules, this will be addressed.
With kernel-v4 getting more "complete" and stable, I'll finally need to test and measure its responsiveness, with and without busy waiting loops involved.
In general, interrupt handlers can be written to handle exactly one specific interrupt of one device, or in a more generic fashion. For example, one handler procedure in UARTstrKint can handle the interrupt from UART0, and another one from UART1 (devHandler0 and devHandler1), or one procedure can handle both (devHandler, commented out). The generic devHandler needs to figure out at run-time which UART interrupt it serves, devHandler0 and devHandler1 are hard-coded for their interrupts. It is always possible for an interrupt handler to determine "who am I" by inspecting register IPSR, that is, in this case, which UART it serves, and consequently which interrupt data to use – at the cost of some complexity and run-time overhead.^[6] If we have two possible devices, such as the UARTs, two separate hard-coded handler procedures are feasible. For timer alarms, where we have eight alarm devices and separate interrupts, this may be too much "dead code" in case we don't use many alarms. Module KernelAlarms employs a one-handler approach. As so often, it's a program design trade-off – time vs. (memory) space.
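The message pool idea from the first note can be sketched as follows. This is an illustration only, with hypothetical names (`Message`, `MessagePool`, `acquire`, `release`) that are not taken from the kernel modules: all messages are created once at initialisation, and a pool that is sized too small simply has nothing left to hand out.

```python
class Message:
    def __init__(self):
        self.text = ""   # string payload


class MessagePool:
    """All messages are created once, up front; none are created later."""

    def __init__(self, size):
        self._free = [Message() for _ in range(size)]

    def acquire(self):
        """Return a free message, or None if the pool is exhausted."""
        return self._free.pop() if self._free else None

    def release(self, msg):
        """Hand a message back to the pool once it is no longer in use."""
        msg.text = ""
        self._free.append(msg)


pool = MessagePool(2)
m1, m2 = pool.acquire(), pool.acquire()
m3 = pool.acquire()      # pool was sized too small for three in flight
pool.release(m1)
m4 = pool.acquire()      # the released message is reused
print(m3, m4 is m1)
```

What to do when `acquire` comes back empty is a design decision the sketch leaves open, which is exactly why the pools need to be sized appropriately.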

Repository

lib/v2.1 {{< reporefd "/" "examples/v2.1/rpi/pico2" >}}
lib/v3.0 {{< reporefd "/" "examples/v3.0/rpi/any" >}}

[1] There is also the aspect of wasting energy in a tight busy waiting loop, but we'll not investigate this here.
[2] Debugging terminal output drivers has its challenges, as the usual terminal output cannot be used to trace program flow. I have left a second terminal use the simple output driver, writing any debug and trace data there, apart from using the oscilloscope.
[3] Since the serial transmitter starts before the FIFO is full, we can write more than 32 characters without filling up the FIFO.
[4] I'll use microseconds for all times. However, a number such as 8,600 does not imply a measurement by that precision. In fact, times are measured using visual cursors on the oscilloscope screen.
[5] Which simplifies actor coordination and mutual lockout, very similar to cooperative scheduling.
[6] The generic interrupt handler in UARTstrKint takes about 0.8 microseconds to run, compared to 0.5 microseconds for the specific ones.

Last updated: 23 October 2025