x86-64 vs x86
Having written bare-metal boot code both 20 years ago and again a few years ago, these are the main differences I found between booting the two:
- Going from 32-bit mode to 64-bit mode is about 10-20 lines of assembly. However, that's not the hard part. The hard part is that there is no such thing as 64-bit "real mode", so you also need to set up page tables. This isn't difficult, and it's reasonably inexpensive now that we have 1G pages; there's a sketch of the page-table half after this list. (But see the discussion about APs below.) Difficulty level: Hey, not too rough.
- More complicated than this is the move from single core to multicore. I'm going to talk about this in a bit more detail in a moment. Difficulty level: Hurt me plenty.
- 20 years ago, we had 32-bit plug-and-play BIOS calls to handle most hardware discovery. Today, this has been replaced by ACPI. ACPI is a bunch of data structures written into the BIOS ROM or RAM which the operating system reads. No operating system that I'm aware of parses these with code it rolled itself. Instead, they use the reference library, ACPICA, which clocks in at over 125,000 lines of code. Integrating ACPICA is one of the reasons why Homebrew OS is on hiatus. (The RSDP sketch after this list shows the easy first step, just to give a flavour.) Difficulty level: Nightmare.
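For concreteness, here is a minimal sketch of the page-table half of that 32-to-64-bit jump, written in C rather than the assembly a real boot path would use. It assumes the CPU advertises 1 GiB pages and that identity-mapping early memory is acceptable; the names and structure are illustrative, not taken from any particular kernel.

```c
#include <stdint.h>

/* Early-boot page tables: identity-map the first 512 GiB with 1 GiB pages.
 * One PML4 entry points to one PDPT whose 512 entries are 1 GiB "huge" pages.
 * Tables must be 4 KiB aligned; virtual == physical here for simplicity. */

#define PTE_PRESENT  (1ull << 0)
#define PTE_WRITE    (1ull << 1)
#define PTE_HUGE     (1ull << 7)   /* PS bit: in a PDPT entry, maps 1 GiB */

static uint64_t pml4[512] __attribute__((aligned(4096)));
static uint64_t pdpt[512] __attribute__((aligned(4096)));

void build_early_page_tables(void)
{
    for (int i = 0; i < 512; i++)
        pdpt[i] = ((uint64_t)i << 30) | PTE_PRESENT | PTE_WRITE | PTE_HUGE;

    pml4[0] = (uint64_t)(uintptr_t)pdpt | PTE_PRESENT | PTE_WRITE;

    /* The remaining "10-20 lines of assembly" then:
     *   - load CR3 with the physical address of pml4,
     *   - set CR4.PAE and IA32_EFER.LME (MSR 0xC0000080, bit 8),
     *   - set CR0.PG, and far-jump into a 64-bit code segment. */
}
```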
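Finding the ACPI tables is the easy part; interpreting them (especially the AML bytecode) is where those 125,000 lines go. As a flavour of the entry point, here is a hedged sketch of the legacy-BIOS way to locate the root pointer (RSDP); on UEFI firmware it is handed to you via the configuration table instead. It assumes the low 1 MiB is identity-mapped, which is typical in an early-boot environment.

```c
#include <stdint.h>
#include <string.h>

/* ACPI RSDP as laid out in memory (ACPI 1.0 fields shown). */
struct rsdp {
    char     signature[8];   /* "RSD PTR " */
    uint8_t  checksum;       /* bytes 0-19 must sum to 0 (mod 256) */
    char     oem_id[6];
    uint8_t  revision;
    uint32_t rsdt_address;   /* physical address of the RSDT */
} __attribute__((packed));

/* Scan the legacy BIOS area (0xE0000-0xFFFFF) on 16-byte boundaries. */
static struct rsdp *find_rsdp(void)
{
    for (uintptr_t p = 0xE0000; p < 0x100000; p += 16) {
        struct rsdp *r = (struct rsdp *)p;
        if (memcmp(r->signature, "RSD PTR ", 8) != 0)
            continue;
        uint8_t sum = 0;
        for (int i = 0; i < 20; i++)
            sum += ((uint8_t *)r)[i];
        if (sum == 0)
            return r;   /* r->rsdt_address leads to every other table */
    }
    return 0;           /* real code also checks the EBDA's first 1 KiB */
}
```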
Multicore is complicated, but then the underlying problem is complicated. There are basically two issues that you need to solve:
- Booting a multicore system is tricky, but not as tricky as it could have been. Intel took the decision that all the good hardware should be dedicated to running multicore systems efficiently, at the cost of making them difficult to boot. At the time your kernel starts running, you are in 64-bit mode on one core, known as the bootstrap processor, or BSP for short. The other cores are known as application processors (APs), and are still in 16-bit mode. So you need to set up enough of the system for the APs to be able to run, and then get them into 64-bit mode. This requires placing their boot code in 16-bit addressable space... yeah. It's tricky. (The AP start-up sketch after this list shows the shape of it.)
- Interrupt delivery is more complex. The central problem is that a piece of hardware needs to be serviced (e.g. a network packet comes in, a disk has finished reading a block, or something), but which CPU gets the job of servicing it? There is a programmable chip, the IOAPIC, whose job it is to decide that; the routing sketch after this list shows the gist for a single IOAPIC. On server-class hardware, there's more than one of them because CPUs are clustered. Discovering the topology of the system and then doing something appropriate is a little tricky, but more or less straightforward for consumer-class hardware.
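Roughly, waking an AP looks like the sketch below: the BSP pokes the local APIC's interrupt command register with an INIT IPI followed by two STARTUP IPIs, whose "vector" is the page number of a real-mode trampoline parked below 1 MiB. This is a sketch under simplifying assumptions (the default APIC base address, a delay_us() helper, and APIC IDs already known from the ACPI MADT), not a production AP bring-up.

```c
#include <stdint.h>

/* Local APIC registers, memory-mapped at the usual default base.
 * Real code reads the base from the IA32_APIC_BASE MSR and parses the
 * ACPI MADT to learn the APIC IDs of the other cores. */
#define LAPIC_BASE     0xFEE00000u
#define LAPIC_ICR_LOW  0x300
#define LAPIC_ICR_HIGH 0x310

static inline void lapic_write(uint32_t reg, uint32_t val)
{
    *(volatile uint32_t *)(uintptr_t)(LAPIC_BASE + reg) = val;
}

void delay_us(unsigned us);   /* assumed to exist (PIT/TSC based) */

/* Wake one application processor. `trampoline` is the physical address of
 * its 16-bit startup code: page-aligned and below 1 MiB, because the
 * startup IPI only carries an 8-bit page number. */
void start_ap(uint8_t apic_id, uint32_t trampoline)
{
    uint32_t vector = trampoline >> 12;

    /* INIT IPI: reset the target core. */
    lapic_write(LAPIC_ICR_HIGH, (uint32_t)apic_id << 24);
    lapic_write(LAPIC_ICR_LOW,  0x00004500);
    delay_us(10000);

    /* Two STARTUP IPIs, as Intel's MP spec recommends. The AP starts
     * executing in 16-bit real mode at `trampoline`, and the trampoline
     * has to walk it up to 64-bit mode itself. */
    for (int i = 0; i < 2; i++) {
        lapic_write(LAPIC_ICR_HIGH, (uint32_t)apic_id << 24);
        lapic_write(LAPIC_ICR_LOW,  0x00004600 | vector);
        delay_us(200);
    }
}
```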
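And interrupt routing, in the single-IOAPIC consumer case, boils down to filling in one redirection table entry per interrupt line: which vector it raises, and which CPU's local APIC it goes to. A hedged sketch, assuming one IOAPIC at the usual default address (the real address, and the mapping from legacy IRQ numbers to IOAPIC inputs, come from the ACPI MADT):

```c
#include <stdint.h>

/* IOAPIC registers are reached indirectly: write the register index to
 * IOREGSEL, then read/write the value through IOWIN. */
#define IOAPIC_BASE    0xFEC00000u
#define IOAPIC_REGSEL  0x00
#define IOAPIC_IOWIN   0x10
#define IOAPIC_REDTBL  0x10   /* redirection entries start at register 0x10 */

static void ioapic_write(uint8_t reg, uint32_t val)
{
    *(volatile uint32_t *)(uintptr_t)(IOAPIC_BASE + IOAPIC_REGSEL) = reg;
    *(volatile uint32_t *)(uintptr_t)(IOAPIC_BASE + IOAPIC_IOWIN)  = val;
}

/* Route IOAPIC input `irq` to interrupt vector `vector` on the CPU whose
 * local APIC ID is `apic_id`. Each redirection entry is 64 bits, split
 * across two 32-bit registers; leaving everything else zero means fixed
 * delivery, physical destination, edge-triggered, active-high. */
void ioapic_route(uint8_t irq, uint8_t vector, uint8_t apic_id)
{
    ioapic_write(IOAPIC_REDTBL + 2 * irq + 1, (uint32_t)apic_id << 24);
    ioapic_write(IOAPIC_REDTBL + 2 * irq,     vector);
}
```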
There are also a bunch of issues that a running operating system needs to get right once the system is booted (e.g. TLB coherency), but that's a different story.
Deployment
The principle of every program coming with its own operating system is sort of already happening in the data centre (and, to a lesser extent, on the cloud). If you have a data centre that you built last year, you probably have a bunch of boxes that all run hypervisors, on which you install one virtual machine per service.
Even if you don't do it that way, and install Linux on each machine, your services are probably implemented as containers. A container is essentially its own deployable operating system, sharing only the kernel with its host. Nesting operating systems on top of operating systems is arguably a step backwards as far as source code line count goes, but it illustrates that deploying an operating system to run a program is feasible, because it already happens on the server side.
Research
One thing that industry may not be aware of is that the early-to-mid 1990s, before we ended up with essentially three operating systems in the world, was a fertile time for operating systems research, too. That research into new basic ideas for operating systems essentially halted 15 years ago.
Many of these research operating systems were made possible by OSKit, a toolkit for building your own kernels. It had its problems, but it made it easy to experiment and got out of your way if you wanted it to.
A lot of it was about anticipated advances in hardware: the move to multicore and to network distribution, for example, and especially the need to run untrusted code downloaded over a network. The Flask design had the idea of nesting virtual machines from a security perspective without sacrificing straight-line code performance.
But here's the one I'd like to highlight: Exokernels. Here's the blurb:
Since its inception, the field of operating systems has been attempting to identify an appropriate structure: previous attempts include the familiar monolithic and micro-kernel operating systems as well as more exotic language-based and virtual machine operating systems. Exokernels dramatically depart from this previous work. An exokernel eliminates the notion that an operating system should provide abstractions on which applications are built. Instead, it concentrates solely on securely multiplexing the raw hardware: from basic hardware primitives, application-level libraries and servers can directly implement traditional operating system abstractions, specialized for appropriateness and speed.
This is from 1995! But it sounds familiar, no?
The frightening thing is that this "research", on new ways to think about operating systems, has now largely been taken up by hobbyists. All due respect to hobbyists, but we can't bet our industry on this.
The flip side
Finally, I want to briefly mention the flip side. There are essentially two benefits from using a shared operating system as an application platform instead of raw hardware: security and balance.
(Note that I am not claiming that you get either from a modern operating system! Merely that they are much more difficult in a "just multiplex the hardware" environment.)
Security is an obvious one, and Casey touched on this in the Q&A. The job of memory protection is to abstract RAM, so that an application can't read or scribble on memory that it's not supposed to. But the same argument applies to other parts of the computer: If an application has access to the raw disk, it has access to everything, so that clearly needs to be abstracted in some sense. If an application has access to the raw network card, you essentially have no firewall, so that needs to be abstracted in some sense. The more you think along these lines, the more you essentially end up with a modern operating system. So are we really talking about "raw hardware", or are we really talking about "getting the abstraction level right"?
The other benefit is that an operating system which acts as an application platform can balance the needs of competing applications. The Windows NT kernel, for example, goes to a lot of trouble to determine exactly how much RAM each application is really using and adjust the virtual memory settings for the system as a whole to suit. Similarly, the Unix timesharing algorithm dynamically decides if a given program is an "interactive" or "background" task; interactive jobs suspend often (waiting for input/disk/whatever), and should get priority when they wake to make the system seem more responsive. Background tasks don't need to run as often, but should get a big timeslice when they do.
The flipside of that, of course, is something that my postgrad supervisor (who taught me most of what I know about modern operating systems) pointed out: the Unix timesharing system makes sure that everyone gets equal service, by which we mean equally shitty service.
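To make that balancing point concrete, here is a toy sketch of the kind of heuristic described above: tasks that keep blocking before their timeslice expires drift towards "interactive" (higher priority, short slice), and tasks that keep burning their whole slice drift towards "background" (lower priority, longer slice). This is illustrative only, not any real kernel's policy, and the constants are made up.

```c
#include <stdbool.h>

/* Toy priority adjustment in the spirit of classic Unix timesharing.
 * Lower `priority` means "runs sooner". */
struct task {
    int priority;      /* 0 (most interactive) .. 39 (most background) */
    int timeslice_ms;  /* how long it may run when scheduled */
};

/* Called when a task comes off the CPU. `used_full_slice` is true if the
 * timer kicked it off, false if it blocked (input, disk, whatever) first. */
void adjust_after_run(struct task *t, bool used_full_slice)
{
    if (used_full_slice) {
        /* CPU-bound: demote it, but let it run longer when it does run. */
        if (t->priority < 39) t->priority++;
        if (t->timeslice_ms < 200) t->timeslice_ms += 20;
    } else {
        /* It blocked early: treat it as interactive. Promote it and give
         * it a short slice so it responds quickly and gets out of the way. */
        if (t->priority > 0) t->priority--;
        t->timeslice_ms = 20;
    }
}
```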
One thing that needs to be mentioned is that operating system vendors have a vested interest in making applications on the same platform have a similar "look and feel", and commercial reality is that this matters. If you've ever worked in a business environment, you know that non-technically-savvy people need to be trained in how to use the software they rely on for their job, and following platform conventions saves time and money there. But this is a matter of conventions. The existence of Qt (which of course has many problems of its own) proves that you don't need to use the operating system's toolkits to maintain look and feel.
The disadvantage is the advantage
My final thought is an underlying principle: the key disadvantage of a technology is often also its key advantage. Take C++. Its main disadvantage, the thing that makes the higher-level stuff utterly painful to use, is that all the good syntax is already taken by the C subset. But that is also the main advantage of C++: it includes C.
One key difference between GPUs and x86 is that GPUs run well-written code extremely fast, but x86 runs moderately crappy code quite well under the circumstances. Modern operating systems also run moderately crappy code quite well under the circumstances. That's a disadvantage, but it's also an advantage. We don't want that crappy code at all! But the crappy code out there does reasonably well under the circumstances.
It's hard out there.