K4perf
lib/v2.1 lib/v3.0 Example/test programs: kernel-v4 performance.
Introduction
K4perf0 is a test program for kernel-v4 that measures the kernel's responsiveness.
Description
The objective is to measure the latency of activating an actor by an interrupt, using the message passing mechanism of kernel-v4.
There are three actors:
- Actor am: at medium priority, running off a 1 millisecond kernel tick, switching a GPIO pin on and off periodically. This GPIO pin is wired externally to another GPIO pin, which triggers an interrupt configured for this pin. The GPIO interrupt is configured to trigger on both the rising and the falling edge of the GPIO signal caused by am.
- Actor ah: activated by the handler servicing the GPIO interrupt. It runs at the highest priority. The handler acquires a message from the pool, and sends it to actor ah via the kernel, which will cause ah to run via its ready queue. The latency from the interrupt to the actor starting its processing is the time we're interested in.
- Actor al: a potential troublemaker, running at the lowest priority, and simply running an infinite loop, to test if this loop has an impact on the latency of actor ah.
All code is minimal to measure the basic, raw latency.
The measured activation sequence:
- the GPIO interrupt caused by am is acknowledged and the handler executed by the MCU;
- the GPIO interrupt handler fetches a message from the pool;
- the GPIO interrupt handler sends the message to actor ah;
- the kernel puts ah on its ready queue, triggering the ready queue's interrupt;
- the ready queue's interrupt handler runs the actor's task.
The GPIO interrupt handler runs at the same priority as the ready queue interrupt handler of ah.
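The activation sequence can be illustrated with a toy model. The following Python sketch is not kernel-v4 code: the names `MessagePool`, `ReadyQueue`, `gpio_handler`, and `actor_ah`, as well as the pool size of 4, are made up for illustration. It assumes that, since both handlers run at the same priority, the ready queue's interrupt is pended while the GPIO handler runs and is serviced only once that handler has returned.

```python
from collections import deque

pending = deque()   # pended interrupt handlers, all at one priority level
trace = []          # records the order of events


def dispatch():
    """Run pended handlers one after the other (no nesting at equal priority)."""
    while pending:
        pending.popleft()()


class MessagePool:
    """Fixed-size pool of reusable messages (hypothetical model)."""

    def __init__(self, size):
        self.free = deque({"payload": None} for _ in range(size))

    def get(self):
        # Returns None when the pool is exhausted.
        return self.free.popleft() if self.free else None

    def put(self, msg):
        msg["payload"] = None
        self.free.append(msg)


class ReadyQueue:
    """Ready queue of one priority level; run() models its interrupt handler."""

    def __init__(self):
        self.ready = deque()

    def put(self, actor, msg):
        self.ready.append((actor, msg))
        pending.append(self.run)    # pend the ready queue's interrupt

    def run(self):
        while self.ready:
            actor, msg = self.ready.popleft()
            actor(msg)


pool = MessagePool(size=4)
rdq = ReadyQueue()


def actor_ah(msg):
    # The actor's task: process the message, then return it to the pool.
    trace.append(("ah runs", msg["payload"]))
    pool.put(msg)


def gpio_handler(edge):
    trace.append(("handler entered", edge))
    msg = pool.get()                # fetch a message from the pool
    if msg is None:
        return                      # pool exhausted: the event is dropped
    msg["payload"] = edge
    rdq.put(actor_ah, msg)          # send the message; ah goes on its ready queue
    trace.append(("handler exits", edge))


# Each edge of the signal produced by am raises the GPIO interrupt.
for edge in ("rising", "falling"):
    pending.append(lambda e=edge: gpio_handler(e))
    dispatch()

for event in trace:
    print(event)
```

In this model the actor always runs after the GPIO handler has exited, which mirrors the order of the red and blue traces in the screenshots below.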
Results
Since we have sub-microsecond or low single-digit microsecond times, the latency is measured using GPIO pins and an oscilloscope. As usual, clicking the white thingie at the top right of the screenshot enlarges the picture, and gives a dark background for better visual contrast.
The RP2350 MCU runs at 125 MHz.
Overview
The following overview timing shot shows the GPIO signal created by am in yellow. The interrupt handler and the task of ah are barely visible as fine spikes at the rising and falling edge of this signal (red, blue).
{{< gallery gallery0 "oscilloscope screenshot" "600px" >}}
Details
The following screenshots show the details at the aforementioned spikes.
{{< gallery gallery1 "oscilloscope screenshot" "600px" >}}
We measure, from the rising edge of the GPIO signal of am (yellow trace):
- blue trace: about 2.3 microseconds for actor ah to set its GPIO pin high.
Unsurprisingly, the picture is the same at the falling edge (yellow trace):
{{< gallery gallery2 "oscilloscope screenshot" "600px" >}}
The interrupt handler itself:
{{< gallery gallery3 "oscilloscope screenshot" "600px" >}}
- red trace: about 230 nanoseconds
Discussion
Actor Latency
With the minimalistic code used, our actor is activated within 2.3 microseconds of the GPIO interrupt when the pin gets electrically raised to high, for example by an instrument or a sensor. This may or may not be sufficiently fast for a specific control program. But as outlined below, we're operating at the MCU's capability limits here.
Interrupt Handler Latency
The interrupt handler itself starts doing any "useful" processing after about 230 nanoseconds, with this latency divided between the Cortex-M33 hardware interrupt handling latency of 12 clock cycles[1] and the handler's procedure prologue. So in case we need a response time of under 1 microsecond, the handler itself can take on some of the corresponding operations, eg. read and acknowledge sensor values, then pack the read value into a message and send it to an actor for processing. If we require continuous sampling and response times of below 1 microsecond, we are probably better off employing a signal processor, or a dedicated hardware frontend to our software. The kernel can still serve to collect and process consolidated data very quickly, but is clearly at its performance edge at these operating frequencies and latencies close to the hardware capabilities of the M33 itself.
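To put these numbers into perspective, here is a small worked calculation in Python. It only uses figures stated in the text (125 MHz clock, 12 cycles of hardware latency, about 230 nanoseconds measured); the variable names are chosen for illustration, and the 230 nanoseconds is an approximate oscilloscope reading, so the results are approximate as well.

```python
CLOCK_HZ = 125_000_000
NS_PER_CYCLE = 1e9 / CLOCK_HZ            # 8 ns per clock cycle

HW_ENTRY_CYCLES = 12                     # Cortex-M33 interrupt entry, per the text
MEASURED_NS = 230                        # approximate oscilloscope reading
BUDGET_NS = 1_000                        # the 1 microsecond target discussed above

hw_ns = HW_ENTRY_CYCLES * NS_PER_CYCLE   # share spent in hardware
software_ns = MEASURED_NS - hw_ns        # share left for prologue and pin write
budget_cycles = BUDGET_NS / NS_PER_CYCLE
remaining_cycles = (BUDGET_NS - MEASURED_NS) / NS_PER_CYCLE

print(f"hardware entry:        {hw_ns:.0f} ns")
print(f"prologue and pin set:  {software_ns:.0f} ns")
print(f"1 us budget:           {budget_cycles:.0f} cycles")
print(f"left after entry:      {remaining_cycles:.2f} cycles")
```

Under these assumptions, a handler that has to respond within 1 microsecond has fewer than a hundred clock cycles left for its actual work once it has been entered.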
Cross Check: Clock Cycle Counting
Apart from using the oscilloscope to measure the latency and activation times, I have also counted the clock cycles using the built-in data watchpoint and trace (DWT) facilities.[2] If we count to right after the handler prologue, there are 26 clock cycles, ie. 208 nanoseconds, including reading the cycle counter, which takes three instructions. If we measure right after the GPIO pin is set high by the handler, we get 29 cycles, ie. 232 nanoseconds. The same measurement for the activated actor is 281 clock cycles, ie. 2.25 microseconds. These measurements jibe well with the readings on the oscilloscope.
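The conversion from cycle counts to times is simple arithmetic at 125 MHz. The following Python snippet reproduces it for the three counts given above; the function name `cycles_to_ns` is chosen for illustration only.

```python
CLOCK_HZ = 125_000_000  # MCU clock as stated in the Results section


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a clock cycle count into nanoseconds."""
    return cycles * 1_000_000_000 / clock_hz


measurements = [
    ("after handler prologue", 26),
    ("after handler sets pin high", 29),
    ("after actor sets pin high", 281),
]

for label, cycles in measurements:
    print(f"{label}: {cycles} cycles = {cycles_to_ns(cycles):.0f} ns")
```

The three results are 208, 232, and 2248 nanoseconds, the last one being the 2.25 microseconds quoted for the actor.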
Reproducible and Predictable Behaviour
An important take-away here is that the response times are consistent and reproducible, based on the Cortex-M33 hardware design and the compiler's code generation. This enables guaranteed response times, which is important for a "hard" real-time system. The Astrobe compiler produces compact code, but also code that is a direct representation of the source code, which is again important for a real-time system: you look at the source code, and you know exactly what the generated code looks like. If in doubt, look at the assembly listing, courtesy of Astrobe, but you'll never be surprised. It's exactly the kind of compiler I want for programming a real-time system.
The Troublemaker
Actor al, which runs an infinite loop at the lowest priority, has no impact on the above measurements. This is not surprising, since the detection, acknowledgement, and invocation of the triggering GPIO interrupt all happen in hardware.
Flash Cache Impact
The above times were measured in a steady state, after the code has been cached. The following screenshots were captured at start-up, loading the code directly from the flash memory:
{{< gallery gallery5 "oscilloscope screenshot" "600px" >}}
{{< gallery gallery6 "oscilloscope screenshot" "600px" >}}
The raw interrupt latency is now over 3.3 microseconds, and actor activation takes nearly 29 microseconds. Hence, due to the design of the RP2350, a rare event that requires a fast response needs appropriate measures to keep (or prepare) the corresponding code in the cache.
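A quick comparison shows the size of the effect. The Python snippet below uses the approximate readings quoted in the text ("over 3.3" and "nearly 29" microseconds for the uncached case), so the resulting factors are rough estimates, not measurements.

```python
# Approximate oscilloscope readings quoted in the text, in nanoseconds.
warm = {"interrupt handler": 230, "actor activation": 2_300}
cold = {"interrupt handler": 3_300, "actor activation": 29_000}

# Slowdown factor when the code is fetched from flash instead of the cache.
slowdown = {name: cold[name] / warm[name] for name in warm}

for name, factor in slowdown.items():
    print(f"{name}: roughly {factor:.1f} times slower from flash")
```

Both paths come out more than an order of magnitude slower when the code is not yet in the cache.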
Non-minimalistic Interrupt Handler
As explained above, the GPIO interrupt handler used to determine the basic, raw response times is minimalistic, ie. only using the absolutely necessary code to use the kernel facilities. In general, the interrupt handler will be more complex, since we only have one GPIO interrupt line per core, and the handler would need to determine which GPIO interrupt it is serving. There's a commented-out handler pinHandler2 in the test program module to demonstrate this concept.
With this GPIO interrupt handler we get:
{{< gallery gallery4 "oscilloscope screenshot" "600px" >}}
Unsurprisingly, the interrupt latency stays the same, but the actor is activated only after just over 3.1 microseconds, due to the increased run time of the handler.
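The extra run time comes from working out which pin raised the shared interrupt. The following Python sketch illustrates that idea only; it is not the pinHandler2 code from the test module. The names `pending_pins`, `handlers`, and `shared_gpio_handler` are hypothetical, and the status word with one bit per pin is a simplification that does not attempt to mirror the RP2350's actual registers.

```python
def pending_pins(status):
    """Yield the numbers of all pins whose bit is set in the status word."""
    pin = 0
    while status:
        if status & 1:
            yield pin
        status >>= 1
        pin += 1


handlers = {}  # pin number -> callback, filled in by the control program


def shared_gpio_handler(status):
    """Dispatch one shared interrupt to the callbacks of the pending pins."""
    served = []
    for pin in pending_pins(status):
        callback = handlers.get(pin)
        if callback is not None:
            callback(pin)
            served.append(pin)
    return served


events = []
handlers[2] = lambda pin: events.append(f"pin {pin} edge")

# Pins 2 and 4 are pending, but only pin 2 has a callback registered.
served = shared_gpio_handler(0b10100)
```

Every additional check of this kind runs before the message is sent, so it adds directly to the actor's activation time, while the interrupt latency itself is unaffected.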
Kernel Composition
Unlike kernel-v1, which is pretty much monolithic, kernel-v4 is modular, and needs to be composed by the control program: prioritised ready queues, event queues, and message pools are to be created. The upside is that the kernel can be set up and structured exactly as required for the control program. This composition, or design, of kernel-v4 also demands more upfront definition work compared to kernel-v1, such as determining the actor priorities to achieve the appropriate pre-emption scheme, the timely responses to the triggering interrupts, and the number of messages required in the different pools.
The modularity of kernel-v4 is preserved, even though it is now consolidated into a single Oberon module.
[1] Typical, for an interrupt during the execution of secure code, with a handler running also in the secure realm. If the handler were running in the non-secure realm, the hardware latency would be 21 cycles due to the extended stacking. However, in the latter case the procedure prologue would not need to push all the additional registers, so it would be a wash. Note that the use of the FPU, even though relevant for the stacking, does not come into play for the initial activation of the handler due to lazy stacking.
[2] There's some commented-out corresponding code in the test module. Slow down the kernel to one second in lieu of one millisecond to get useful terminal output. :)
Last updated: 4 September 2025