The real realtime preemption end game

By Jonathan Corbet
November 16, 2023

The addition of realtime support to Linux is a long story; it first shows up in LWN in 2004. For much of that time, it has seemed like only a little more work was needed to get across the finish line; thus we ran headlines like the realtime preemption endgame — in 2009. At the 2023 Linux Plumbers Conference, Thomas Gleixner informed the group that, now, the end truly is near. There is really only one big problem left to be solved before all of that work can land in the mainline.

The point of realtime preemption is to ensure that the highest-priority process will always be able to run with a minimum (and predictable) delay. To that end, it makes the kernel preemptible in as many situations as possible, with the exceptions being tightly limited in scope. The basic mechanics of how that works have been established for a long time, but there have been a lot of details to resolve along the way. The realtime preemption work has resulted in the rewriting of much of the core kernel over the years, with benefits that extend far beyond the realtime use case.

Gleixner started by noting that, while the realtime preemption project has been underway for nearly 20 years, it is actually closer to 25 years for him — he started working on realtime support for Linux in 1999. Once it's done, he said, there will be "a big party". Is that point at hand? The answer, he said, is "yes — kind of". There is one last holdout to be dealt with: printk().

Whenever code in the kernel needs to send something to the system consoles and logs, it calls printk() or one of the numerous functions built on top of it. One might not think that printing a message would be a challenging task, but it is. A call to printk() can come from any context, including in non-maskable-interrupt handlers or other printk() calls. The information being printed may be crucial, especially in the case of a system crash, so printk() calls have to work regardless of the context. As a result, there are a lot of concurrency and locking issues, and lots of driver-related complications.

printk(), Gleixner said, is fully synchronous in current kernels; a call will not return until the message has been sent to all of the configured destinations. That is "stupid"; much of what is printed is simply noise, especially during the boot process, and there is no point to waiting for it all to go out. Beyond being pointless, that waiting introduces latency, which runs counter to the goals of the realtime work, so the realtime developers have long since moved printk() output into separate threads, making it asynchronous. That code is a bunch of hacks rather than a real solution, though. A better job must be done to make this work useful for the rest of the kernel.
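The difference between the two models can be sketched in user space. The following is a toy model, not kernel code: the console names and per-message costs are invented for illustration, and `DeferredLog` is a hypothetical stand-in for a buffer that a separate printing thread drains later.

```python
from collections import deque

# Hypothetical per-message output cost of each console, in milliseconds.
CONSOLE_COST_MS = {"serial": 20.0, "netconsole": 2.0, "framebuffer": 5.0}


def sync_printk_cost_ms(n_messages: int) -> float:
    """Synchronous model: the caller waits for every console on every message."""
    return n_messages * sum(CONSOLE_COST_MS.values())


class DeferredLog:
    """Asynchronous model: the caller only appends to a buffer."""

    def __init__(self) -> None:
        self.pending = deque()
        self.printed = []

    def printk(self, msg: str) -> None:
        # Cheap for the caller: no console is touched here.
        self.pending.append(msg)

    def drain(self) -> None:
        # Stands in for the printing thread, which runs outside the caller's path.
        while self.pending:
            self.printed.append(self.pending.popleft())


log = DeferredLog()
for i in range(3):
    log.printk(f"boot message {i}")
queued_before_drain = len(log.pending)
log.drain()
```

With these made-up costs, 100 synchronous messages would hold their callers for 2700 ms in total, while the deferred version charges the caller only for the append.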

The printk() problem has been worked on seriously since 2018, resulting in about 300 patches that have either gone upstream or are waiting in linux-next; this work has been covered here at times. There are, he said, three final patch sets currently in the works to finish the job. A few tricky details are still being worked on. One of those is the handover mechanism; if the kernel has an emergency message to put out (it's crashing, for example), it may need to grab control of a console that is currently printing a lower-priority message. Doing that safely from any context is not an easy thing to do.
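The handover idea can be illustrated with a small priority-based ownership model. This is only a sketch under stated assumptions: the `Prio` levels, the `Console` class, and its method names are hypothetical, and the genuinely hard part (doing this atomically and safely from any context, including non-maskable interrupts) is not modeled at all.

```python
from enum import IntEnum
from typing import Optional


class Prio(IntEnum):
    """Hypothetical ownership priorities, lowest to highest."""
    NORMAL = 1
    EMERGENCY = 2
    PANIC = 3


class Console:
    def __init__(self, name: str) -> None:
        self.name = name
        self.owner: Optional[Prio] = None  # None means nobody is printing

    def try_acquire(self, prio: Prio) -> bool:
        """Take the console if it is free or held at a strictly lower priority."""
        if self.owner is None or prio > self.owner:
            self.owner = prio
            return True
        return False

    def still_owned_by(self, prio: Prio) -> bool:
        """A writer checks this between chunks and backs off once it is False."""
        return self.owner == prio

    def release(self, prio: Prio) -> None:
        # Only the current owner may release; a displaced writer's release is a no-op.
        if self.owner == prio:
            self.owner = None


con = Console("ttyS0")
normal_started = con.try_acquire(Prio.NORMAL)          # routine message begins
emergency_took_over = con.try_acquire(Prio.EMERGENCY)  # crash message arrives
normal_must_stop = not con.still_owned_by(Prio.NORMAL)
con.release(Prio.NORMAL)                               # ignored: no longer the owner
normal_retry = con.try_acquire(Prio.NORMAL)            # refused while emergency holds it
con.release(Prio.EMERGENCY)
normal_after_release = con.try_acquire(Prio.NORMAL)
```

The point of the sketch is the displaced writer: it has to notice that it lost the console and stop touching the hardware, which is why it rechecks ownership rather than assuming it still holds the device.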

Another ongoing task is marking console drivers that are not safe to use in some contexts; if, for example, outputting a message during a non-maskable interrupt requires doing video-mode setting, it's just not going to work.

Gleixner finished the prepared part of his talk by saying that, even though it's getting close, nobody should ask him when the work will be done. printk() is unpredictable, and he is no longer willing to even try. Even so, he expressed hopes that the rest of the realtime preemption code would be in mainline before the 20th anniversary comes late in 2024.

An audience member asked whether there had been any interesting changes in the printk() code over the last year; Gleixner answered that there have been no fundamental conceptual changes. John Ogness, who has done much of the printk() work, said that the handover code has been reduced somewhat, but that some work remains; there are 76 console drivers in the kernel that need to be fixed, and it may take a while until they are all done. The handover code has been changed to allow drivers to be updated one at a time rather than requiring that this work all be done at once. (See this article for more discussion on the recent printk() work.)

Masami Hiramatsu asked which kernel messages need to be printed synchronously; Gleixner answered that almost everything should be made asynchronous. Beyond reducing latency associated with printk() calls, asynchronous output allows the creation of a separate kernel thread for each console, letting the faster consoles go at full speed rather than waiting for the slowest one. He also said that the code has been changed to ensure that important messages are fully copied into the message buffer before the first line is output, just in case a faulty console driver brings the whole system down in flames. Further safety is obtained by writing to the known-safe consoles first. If, for example, there is a persistent-memory store available, messages are put there before being sent to physical devices, once again preserving the output even if a faulty driver kills the system.
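A little arithmetic shows why per-console threads help and what "known-safe consoles first" means in practice. This is a sketch with invented numbers: the console list, the per-message costs, and the `known_safe` flag are all assumptions made for illustration, not values from the kernel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Console:
    name: str
    ms_per_message: float  # hypothetical output cost
    known_safe: bool       # e.g. a persistent-memory store


CONSOLES = [
    Console("serial", 20.0, False),
    Console("pstore", 0.5, True),
    Console("netconsole", 2.0, False),
]


def shared_loop_total_ms(n: int) -> float:
    """One loop feeds every console in turn; it is done after the summed cost."""
    return n * sum(c.ms_per_message for c in CONSOLES)


def per_console_finish_ms(n: int) -> dict:
    """One thread per console: each drains the buffer at its own pace."""
    return {c.name: n * c.ms_per_message for c in CONSOLES}


def emergency_order() -> list:
    """Known-safe consoles first; sorted() is stable, so ties keep list order."""
    return [c.name for c in sorted(CONSOLES, key=lambda c: not c.known_safe)]


shared = shared_loop_total_ms(100)
independent = per_console_finish_ms(100)
order = emergency_order()
```

With these assumed costs, a shared loop needs 2250 ms for 100 messages, whereas with independent threads the persistent store has everything after 50 ms and only the serial line takes the full 2000 ms.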

As the session closed, Clark Williams asked whether, once the printk() patches go upstream, Gleixner would try to push the rest of the realtime code (which wasn't discussed in this session) in the same merge window. The answer was a qualified "yes"; he might try if all of the code is staged in linux-next and seems ready to go.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our travel to this event.]

Posted Nov 16, 2023 14:17 UTC (Thu) by grawity (subscriber, #80596)

> printk(), Gleixner said, is fully synchronous in current kernels; a call will not return until the message has been sent to all of the configured destinations.

I remember when I was managing a large Linux-based gateway, and I configured serial console (over IPMI), and later I added some iptables LOG rules, and it turned out that just a few matching packets per second would DoS it because it wasn't processing any packets while waiting for each log message to go out through ttyS1...
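A rough calculation makes the effect plausible. The comment does not give the line speed or the message size, so the 150-byte log line and the two baud rates below are assumptions; the only fixed fact used is that 8N1 framing sends 10 bits on the wire for each byte.

```python
def serial_line_time_ms(line_bytes: int, baud: int, bits_per_byte: int = 10) -> float:
    """Time to clock one log line out of a UART (8N1 framing = 10 bits per byte)."""
    return line_bytes * bits_per_byte / baud * 1000.0


def max_lines_per_second(line_bytes: int, baud: int) -> float:
    """Upper bound on log lines per second if printing is fully synchronous."""
    return 1000.0 / serial_line_time_ms(line_bytes, baud)


slow = serial_line_time_ms(150, 9600)     # 156.25 ms per line
fast = serial_line_time_ms(150, 115200)   # roughly 13 ms per line
ceiling = max_lines_per_second(150, 9600)
```

Under those assumptions a 9600-baud console tops out at about six log lines per second, and every one of them stalls the CPU that is logging for the full transmission time.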

Posted Nov 17, 2023 1:30 UTC (Fri) by areilly (subscriber, #87829)

I remember a microVax that I used as an undergrad that had a real teletype set up as the (serial) console. Had the effect of (a) rebooting the machine if it was ever accidentally turned off, and (b) halting the machine whenever it ran out of paper. I was always amazed that the Ultrix would keep going as though nothing had happened as soon as more paper was loaded and it was put back on line.

Posted Nov 17, 2023 5:51 UTC (Fri) by donald.buczek (subscriber, #112892)

Nothing tops the feeling when a bugcheck on a VAX/VMS system causes all these hex dumps to be printed slowly and noisily to the paper of your hardcopy console. Physical printout and slowness signify importance, while stack dumps with hex values signify secret knowledge. This is not for mortals.

Posted Nov 17, 2023 6:23 UTC (Fri) by donald.buczek (subscriber, #112892)

Oh, and of course everybody had to wait for you while you carefully read through the secret scripture, which nobody but you could understand, until you decided to enter "b <ret>", which initiated the printing of another few pages of startup messages during the next 10 minutes :-)

Posted Nov 16, 2023 16:17 UTC (Thu) by IanKelling (subscriber, #89418)

https://wiki.linuxfoundation.org/realtime/start link to latest development version patch, cloc says:

Language                     files          blank        comment           code
-------------------------------------------------------------------------------
diff                             1           1899           5676           8032

That is pretty small. I've enjoyed reading about this over the years.

Posted Nov 16, 2023 21:44 UTC (Thu) by itsmycpu (subscriber, #139639)

That has me wondering... in general, is it possible to create a protected memory space that survives a reboot?

Posted Nov 16, 2023 22:53 UTC (Thu) by mjg59 (subscriber, #23239)

In general? No. You have no guarantees about the behaviour of the firmware over reboot, and it's legitimate for it to just wipe the entire contents of RAM before booting anything else. But there are specific cases where this can be guaranteed - check the various pstore backends for examples.

Posted Nov 17, 2023 0:47 UTC (Fri) by itsmycpu (subscriber, #139639)

Maybe something like this could preserve a range of memory? However I wouldn't know if that could be interesting in this context...

https://en.wikipedia.org/wiki/Reboot#Warm

"The Linux family of operating systems supports an alternative to warm boot; the Linux kernel has optional support for kexec, a system call which transfers execution to a new kernel and skips hardware or firmware reset. The entire process occurs independently of the system firmware. The kernel being executed does not have to be a Linux kernel.[citation needed]"

Posted Nov 17, 2023 5:33 UTC (Fri) by mjg59 (subscriber, #23239)

kexec certainly provides a mechanism for preserving memory ranges over kernel switches, but it's not really what most people would describe as reboot

Posted Nov 18, 2023 5:36 UTC (Sat) by mirabilos (subscriber, #84359)

It works in practice on BSD, dmesg shows the previous run’s messages as well.

In practice here means x86 hardware like Thinkpads and other assorted PCs and servers whose BIOS will not overwrite the entire memory during warm reboot, as well as SPARCstations whose OpenBoot will similarly not clear the high-up memory used for the kernel log buffer.

Posted Nov 19, 2023 3:20 UTC (Sun) by Paf (subscriber, #91811)

“It works in practice on BSD, dmesg shows the previous run’s messages as well.”

And surely this is only possible through the retention of data in memory over reboot! What other magic could do this?

Sorry, but I’d lay a lot of money this is done with storage.

Posted Nov 19, 2023 5:15 UTC (Sun) by mirabilos (subscriber, #84359)

How much money are you willing to hand over?

https://mbsd.evolvis.org/cvs.cgi/src/sys/kern/subr_log.c?...
(I’m using a somewhat beefier mirror here to not get the main server slashdotted)
look for initmsgbuf near the beginning of the file, which gets a pointer to the RAM region.

It is called for SPARC from:
https://mbsd.evolvis.org/cvs.cgi/src/sys/arch/sparc/sparc...
(initmsgbuf called with an almost fixed (only the oldest systems avoid the first page) address…)

For i386, the call is at…
https://mbsd.evolvis.org/cvs.cgi/src/sys/arch/i386/i386/m...
… where msgbufp comes from…
https://mbsd.evolvis.org/cvs.cgi/src/sys/arch/i386/i386/p...
(the __OpenBSD__ ifdef) which sets the virtual address. The physical address (MMU mapping) is done somewhere between locore.s and there, and it looks to me like its location depends on the size of the kernel image, so you’d only get the log messages if you boot the same or a very similar-sized kernel after warm reboot.

And yes, it’s purely memory-based. It helps immensely in copying e.g. the remainder of a ddb(4) session (in-kernel debugger) out if you don’t have a serial console.
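The general technique (a log buffer at a fixed RAM location that the next boot adopts if it still looks valid) can be sketched as follows. This is a user-space illustration, not the code from subr_log.c: the header layout, the `MAGIC` value, and the function names are assumptions made for the example, and a `bytearray` stands in for the physical memory region.

```python
import struct

MAGIC = 0x4D534742               # arbitrary marker chosen for this sketch
HEADER = struct.Struct("<II")    # magic, number of valid bytes


def init_msgbuf(region: bytearray) -> bytes:
    """Adopt the old contents if the header looks valid, else start fresh.

    Returns whatever log text survived from the previous run.
    """
    capacity = len(region) - HEADER.size
    magic, used = HEADER.unpack_from(region, 0)
    if magic == MAGIC and used <= capacity:
        return bytes(region[HEADER.size:HEADER.size + used])
    HEADER.pack_into(region, 0, MAGIC, 0)
    return b""


def log_append(region: bytearray, text: bytes) -> None:
    """Append text after the header; anything that does not fit is dropped."""
    magic, used = HEADER.unpack_from(region, 0)
    capacity = len(region) - HEADER.size
    text = text[:capacity - used]   # no wraparound in this sketch
    start = HEADER.size + used
    region[start:start + len(text)] = text
    HEADER.pack_into(region, 0, magic, used + len(text))


ram = bytearray(64)                      # fresh memory: no valid header yet
first_boot = init_msgbuf(ram)            # nothing to recover
log_append(ram, b"panic: oops\n")
warm_reboot = init_msgbuf(ram)           # same bytes still there: log recovered
wiped = init_msgbuf(bytearray(64))       # firmware cleared RAM: nothing survives
```

The validity check is what makes the scheme tolerable on machines where the firmware does clear memory: the buffer simply starts empty, which matches the "no big loss" behavior described in the thread.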

Posted Nov 19, 2023 5:17 UTC (Sun) by mirabilos (subscriber, #84359)

Heh, and of course just as I posted this, the hoster fell off the ’net (including not only the hoster’s own homepage but also their status page, which is not hosted at a different site as is usual).

Oh well, it’ll be back at some point.

Posted Nov 19, 2023 18:29 UTC (Sun) by kreijack (guest, #43513)

My understanding is that (at least in the x86 world), the memory is cleaned during the reboot.

So the problem is not finding a fixed area in which to store the data, but preventing that area from being cleaned up during a reboot.
And this cannot be done in a generic way.

The kind of reboot that I am talking about is the one that lets you get out of a "crash", so I think that we are talking about a hard reboot. And a hard reboot implies the memory cleanup.
Consider what would happen if it didn't: one could extract secrets from memory with a simple reboot at the "right time"; it would be a giant security hole.

On x86 there are mainly two pstore back-ends: the first relies on the UEFI variable storage; the second relies on the ACPI-ERST, which is like a flash memory.

Posted Nov 19, 2023 18:35 UTC (Sun) by mirabilos (subscriber, #84359)

I’m afraid your understanding has always been wrong, then.

Posted Nov 19, 2023 18:40 UTC (Sun) by mirabilos (subscriber, #84359)

Hm, perhaps a bit more elaboration.

Yes, it’s not a persistent storage like the BIOS (or EFI) settings.

No, a boot does not imply memory cleaning (except for memory used during boot, of course). It usually does imply some kind of memory test, and several kinds of memory amount probing by different places in the boot process, but these are often nōn-intrusive enough to keep the memory contents.

A cold boot does have empty memory simply because the memory had no power and the memory controller likewise did not refresh the memory banks.

A warm reboot does not have a period of such, so the memory is *usually* retained.

A hard reboot can fall into either category, depending on how it is executed and wired. The usual power button long-press will be a poweroff followed by a mostly-cold boot; a watchdog reboot, or if the kernel crashed but is still able to reboot-ish (even if just by causing a triple-fault) can be warm reboots (this mostly depends on the memory controller to continue refreshing the memory during that, and of course the firmware not overwriting it).

Posted Nov 20, 2023 19:57 UTC (Mon) by kreijack (guest, #43513)

> A warm reboot does not have a period of such, so the memory is *usually* retained.

I think that the key word is "*usually*". On my UEFI system I built a UEFI program that dumps the first 4 bytes at each of the following addresses:
- 3GB
- 7GB
- 14GB

Then it sets these bytes to a specific value, and then it dumps them again.

What I saw is:
1) the first time that I ran the program, I saw "random values", like 0 and other non-0 values.
2) the 2nd time that I ran the program, I saw the same values that I had set in the first iteration.

This proves that UEFI doesn't reset the memory between different program invocations.

Then I "warm rebooted" the system, and I saw "random values" again, as in 1). So it seems that on my system the memory is cleared across the reboot.

What I'm saying is that at least some BIOSes clear the memory. In my case (an ASUS B550 desktop mainboard) it seems that the BIOS clears the memory.

What I found is that it is possible to force the BIOS to not clear the memory after a reset [1]. But again this is not typically what happens after a crash; after a crash you push the physical reset button.

[1] https://stackoverflow.com/questions/36608101/does-a-soft-...

Posted Nov 21, 2023 23:31 UTC (Tue) by mirabilos (subscriber, #84359)

Yes, that’s precisely what I meant with “usually”: a sufficient number of systems keep sufficient amounts of memory alive to make this feature worth having, even if counter-examples exist and no spec supports this usage.

The reset button as the only way out of a crash is such a PC thing though. Some machines have watchdogs, and some have something like ddb(4) on BSD or SysRq on Linux that allow for warm reboots even in the face of a crash.

Posted Nov 19, 2023 5:40 UTC (Sun) by mjg59 (subscriber, #23239)

It works as long as your firmware behaves in a specific way, something no specification requires of it.

Posted Nov 19, 2023 18:22 UTC (Sun) by mirabilos (subscriber, #84359)

Yes, of course. Which is why I said that “it works in practice”: it works on a sufficiently large array of machines that OpenBSD (and probably NetBSD before) chose to implement it so, and even if it doesn’t work on one machine it’s no big loss.

Posted Nov 19, 2023 9:40 UTC (Sun) by DemiMarie (subscriber, #164188)

In my experience, pstore is the only reliable way to get a stack trace on panic, unless one is in a VM. Serial consoles don’t work because end-user systems don’t have them. Graphical consoles don’t work because the user is in an X11 or Wayland session.

How does Windows manage to display the BSOD message?

Posted Nov 19, 2023 14:10 UTC (Sun) by Wol (subscriber, #4433)

> How does Windows manage to display the BSOD message?

I guess it just seizes control of the graphics card, or puts it into text mode, or whatever.

Cheers,
Wol

Posted Nov 19, 2023 20:38 UTC (Sun) by ballombe (subscriber, #9523)

It displays the BSOD at start up, and then prints the normal screen as an overlay on top of it. This way, when something goes wrong, the overlay disappears and you see the BSOD. That is why the option "customize the BSOD" requires you to reboot. That is also why, if you set the background to an image with transparency, you get the BSOD.

(just joking of course)

Posted Nov 19, 2023 21:07 UTC (Sun) by Cyberax (✭ supporter ✭, #52523)

> How does Windows manage to display the BSOD message?

Windows drivers are much more resilient than the drivers in Linux. A surprising amount of functionality remains working in Windows even if half the kernel is going haywire.

In particular, modesetting and simple framebuffer access have always been a part of the kernel driver. And each driver is also responsible for pre-allocating its object pools, so there's much less dependency on memory allocation. The IRQL system also has a side effect of forcing driver writers to avoid putting anything too involved inside the critical pathways.

Posted Dec 8, 2023 17:45 UTC (Fri) by pawel44 (guest, #162008)

Wishful thinking. USB driver failure will bring Windows down. Not to mention Windows Driver Model and Windows Driver Frameworks are holey like Swiss cheese.

Posted Nov 24, 2023 21:06 UTC (Fri) by mtthu (subscriber, #123091)

This would maybe be doable if you have control over all stages of the boot process and the environment the kernel runs in. I guess it would be easier to introduce in virtualized environments, as a memory region could be mapped into a file on the host, where it could be synced to disk before a restart. Memory integrity could be checked at that level as well, for example in case the host has an uncontrolled reboot.