FORGE Scheduling

Update: the project FORGE in this post refers to the FORGE Operating System. Also, when discussing feedback schedulers, note that the type of feedback discussed (thread pre-emption) is not the only possible method for feedback – just the one I’ve selected for this post.

Update #2: please note that this all changes when more than one core/CPU can run threads, as various additional factors exist (CPU loads, caches, NUMA domains, etc…) that make scheduling more complex. This blog post is primarily written within the context of a uniprocessor system.

The topic of scheduling in operating system theory is one which can easily fill a chapter, if not several, of a textbook. Optimally scheduling the various tasks running during the course of normal operation is a unique challenge and one which has many solutions. One must consider the type of tasks being run, their overheads, the overhead of switching threads and processes, and various other factors. Furthermore, scheduling may change between operating system types. A real-time operating system will have different scheduling semantics to, say, a general-purpose operating system.

I will not discuss co-operative scheduling here; this is all discussion around a pre-emptive scheduler. That is, a scheduler where threads are given a certain amount of time to run (their timeslice), and are interrupted if they exceed this time.

In FORGE, the current scheduler is simply a round-robin scheduler that switches between threads and doesn’t care about any metadata. This means that every task in the system runs at an equal priority, and also means that threads in the same process are not ‘grouped’.

Let’s take a quick break from discussing round-robin scheduling to discuss this grouping of threads by process, and why it is important in scheduling.

Conceptually, a process can be thought of as an enclosing environment around a set of threads. This environment typically includes an address space, any data shared across threads, and metadata about the actual process itself. For the purposes of this detour, we will consider a process and address space to have a 1:1 relationship. In an environment where address space switches are cost-free (ie, every process runs in the same address space), scheduling threads out-of-order is fine. The threads from various processes can be switched to and from at will, without worrying about costly address space transitions. Consider however the case where an address space switch is in fact a costly operation (as it should be expected to be on most modern architectures): switching between a number of threads from different processes will induce a costly address space switch each time.

Now, consider the following – three threads, two processes. Threads one and three are linked to process one. Thread two is linked to process two. Switching from thread one to two requires an address space switch into process two’s address space. Then, switching from two to three requires yet another address space back into process one’s address space. It would be far more sensible to queue threads from the same process together.

Back to round-robin scheduling.

We essentially have two first-in-first-out queues: the ‘ready’ queue, and the ‘already’ queue. Threads ready to be scheduled are queued in the ready queue, and threads that have already been scheduled are queued in the already queue. When the ready queue runs out of items, the two queues are swapped (already queue becomes the ready queue, and vice versa). This works excellently for an ‘initial’ testing algorithm for an operating system, but does not offer any priorities or have any sort of scheduling heuristics.

To add priorities, it is possible to create a list of queues, and there are various other improvements that can be made to round-robin scheduling as well. In FORGE however, I have decided to take the current round-robin scheduler and replace it with a ‘feedback’ scheduler. This is a very powerful scheduler type that can dynamically respond to the changing requirements of the system as time goes on. Consider a general purpose operating system under a reasonable load. There are a number of threads in a number of processes and each is performing various tasks. Some threads are performing heavy I/O, while others are heavily using the CPU. Others still are mixing the two. This variety in workloads is very common and also very difficult to handle ‘well’.

A feedback scheduler works off the feedback from the threads it runs (hence the name). If a thread is interrupted by the operating system for completing its timeslice, the operating system can generally assume the thread is using the CPU quite a bit. If however a thread gives up its timeslice to the operating system (by yielding, or implicitly by performing I/O), the operating system can determine that the thread is probably more I/O heavy. In a sense, the feedback the threads give to the scheduler allow the scheduler to determine if the threads are using the resources effectively.

So how does a feedback scheduler respond to the feedback? Well, if we consider that each thread has a priority (and a base priority, which is its ‘default’ and starting priority), the scheduler can both punish threads that use too much CPU and reward threads that do not. So, if a thread completes its timeslice before being rescheduled, the scheduler reduces its priority. If a thread does not complete its timeslice before being rescheduled, the scheduler increases its priority. This balances the system by prioritising I/O tasks (which end up blocking while the CPU-heavy tasks do their thing). Then, if a CPU-heavy task needs to block and wait for I/O, its priority can again be increased, and the system is responsive and continues to be usable even during this common workload.

Conceptually this is very easy to understand, but the actual implementation in code (or, perhaps, data structures) is a challenge. At this stage, this has not been implemented into FORGE, so no code is available for demonstration. However, the planned implementation will consist of a combination of binary trees and first-in-first-out queues. Remember earlier the discussion about grouping threads by process; by using a binary tree containing each process ID, we can traverse the tree completely and then load the queue of threads on each node (ie, process). This works for selecting the next thread to schedule.

Now, for the priorities, a simple array suffices; with the index being the priority.

So, for scheduling, we end up with an array, containing binary trees for each priority level, which then contain queues for each process.

This enables a reasonably efficient lookup (assuming iterators are sensible for trees, and that lookup costs and the like are negligible for linked lists), groups threads by process, and enables the feedback system to work correctly. The outcome is a reasonably-well balanced system with priority given to I/O.

As mentioned previously, scheduler design is a unique challenge, and this particular type of scheduler is not necessarily the ideal solution for, say, a real-time operating system. For reference, a similar design is used in OSX, FreeBSD, NetBSD, and Windows NT-based operating systems. Perhaps at a later date I will investigate and discuss an O(1) scheduler type. I’ll blog again later with the intricate details of actually implementing this algorithm.

FORGE – Timer API

This blog post will refer primarily to the FORGE Operating System, a new project I am working on with a primary goal of being able to host various different websites. This project is in C (not C++, as Pedigree was), and I have been working on it over a number of months. Progress is slow; I am currently not spending too much time working on the codebase and rather working on other things such as finishing my university course.

As an aside, the name FORGE is partially because I’d love to host the site that’s hosting it on it (QuokForge), and partially because the idea of a “forge” for any cool ideas I come up with appeals to me. It’s awesome to be able to sit back and think of something different or unique for memory management or caching or something and be able to just design it and implement it straight into FORGE. In that sense, it’s kind of like an incubator for ideas.

Anyway – one thing I really like in FORGE is the timer API. In a modern operating system, time is an important resource. The scheduler typically uses ticks to determine when a process should be interrupted, some operations need to be given a timeout, and yet more operations need to happen at a particular time in the future (whether regularly, or once-off). To add to the complexity, on x86 alone there are a variety of timers available, including the tried-and-true programmable interval timer (PIT), the high precision event timer (HPET), and even the system’s real time clock (RTC) can be used for this purpose. Other platforms such as ARM can have even more clocks, each with their own individual characteristics. And timers such as the HPET may not be present on all hardware – particularly legacy machines.

So, conceptually, time is simple in an operating system. Practically, it is a nightmare.

In FORGE, I happened to already know this fact from my experience with Pedigree and my other OS projects before that. I also knew that trying to integrate the HPET back into Pedigree (as an example) was going to be painful – so whatever I came up with for FORGE had to be flexible and ready to handle any potential new timing mechanism that might come up in the future. That means I can’t ever assume a particular type of device, and the frequencies and even semantics of the timer must be configurable.

Let’s have a look at how FORGE handles all of this: timer.h, timer.c.

First up, there are a number of different timer types that FORGE is capable of dealing with (these can be OR-d together, too):

One-shot. These type of timers designate a time in the future at which the timer should fire (or, a time period from the current time).
Periodic. These timers fire at regular time intervals. This is the most conventional type of timer that just about everyone (should) know about.
UNIX Timestamp. Okay, this isn’t a timer, but it’s an important feature regardless. The ability for a clock to provide a UNIX timestamp (that is, to know the current time and date) is very useful for a variety of system modules. For example, a filesystem driver might want to use this kind of timer to change the access time of a file.
Tick count. Similar to the UNIX timestamp type – not a timer, but useful. Here, we simply count ticks of an arbitrary time period.
Useless. Really only useful for a timer which has been disabled. Any timer that the system knows about, but can’t do any of the above, will go under this category.

Okay, so that’s the types of timer that we can use – what about time resolution? Again, these can be OR-d together to indicate multiple granularity are supported.

Terrible. Higher granularity than seconds. Perhaps useful only for timers with time periods measured in days/months/years.
Seconds. As the name suggests, time periods measured in seconds. For example, a timer which needs to trigger in an hour’s time.
Milliseconds. A millisecond timer. This is more useful for things such as a scheduler which might need to tick at 100 Hz.
Microseconds, and nanoseconds. Very high-resolution timing.

As each driver that configures and provides a timing device to the system completes its initialisation, it will register the timer with the system with a mix of the above features. Each time the timer ‘ticks’ (eg, the PIT IRQ fires), it will call a function provided by the timer API which registers the tick.

Modules, drivers, and kernel code all hook into the timer API by installing what is essentially a callback. This callback is installed with a request for a particular type and resolution (eg, one-shot, fire in 500 time units, time unit is milliseconds) and this is then added to the queue of pending callbacks. Eventually, this will be stored in an optimal data structure that places the callbacks with the lowest remaining time first. Callbacks are assigned the best possible timer for the request.

This is where things get really cool. Each tick that comes in needs to be normalised to the highest possible resolution (to simplify operations across the numerous timer resolutions that could be present). Each callback in the queue that has hooked this particular timer has its ‘remaining time’ reduced by this tick’s time. If the time period has completed (ie, remaining time is less than or equal to zero), the callback is called. Should the callback type be ‘periodic’, a copy of the callback is made and prepared for the next trigger.

This all hides a lot of the details from the developer – all a developer needs to do is ask for a particular timer type, at a particular resolution, and it all “just works”. That is, assuming the resolution is attainable. If it is not, the callback is not installed.

The nice part about having every timer in the system integrated under one module is that I can do some really handy things as a result. The canonical example, which I think is just awesome, is to implement one-shot timers on top of periodic timers where a hardware-supported one-shot timer is not available, and vice versa. This means that in systems where one-shot timing is not available, it is still possible to request a one-shot timer – the implementation will just recognise this and avoid reinstalling the callback after it fires.

So far, this has worked beautifully in the scheduler for FORGE, and I can see it being incredibly useful for more complex code down the track, including networking or USB. It also makes implementing a function to put a thread to sleep for a period of time ridiculously easy – install a callback, acquire a mutex that only the callback can release, release it in the callback (greatly simplified – multi-core and even multi-thread issues abound, but you get the idea).

The code linked above is unfortunately fairly messy and the implementation is in need of a cleanup, but at this stage it’s fully functional and entirely available for perusal. The license for FORGE is permissive, so by all means have a look around and use anything you like (albeit with appropriate attribution). And I’m always open to constructive criticism (it’s an excellent way to build my skills) or questions – just hit the comments below.

So that’s the timer API in FORGE – flexible, usable, and quite powerful.

Pedigree: New Core Features

It’s been a long time since I posted anything here. Moving to a new city has slowed my development down significantly, but things are getting back to normal again.

Recently there has been a fair bit of activity, and I’ll summarise what’s been going on in this post. There’s a couple new features, some changes to Pedigree itself, and some bug fixes. In-depth discussion of each, where necessary, follows the list.

ZombieQueue (commit d5314bf7)
Using > 4 GB of RAM now works, with the x86_64 port (commit bb6b954b)
MemoryPool (commit 64454975)
UdpLogger (versatile log callbacks) (commit ff5fddba)

ZombieQueue

ZombieQueue solves a problem that has been lying around for quite some time. Things such as signals require that a Process object is deleted as part of it being killed, so it is no longer schedulable. However, this was dealt with originally by a “delete this;” line. Any form of “delete this” can be dangerous when the next line is not a return statement. Even when the next line is a return statement, a reschedule out of the current context and into another, which allocates memory, may overwrite the existing object… making “this” invalid. That said, it did suffice for quite some time.

The idea of ZombieQueue is that you queue these deletions, allowing the thread to be rescheduled (or the call that needs deletion to return) and the object to be deleted at a later time, when it’s safe to do so. Eventually ZombieQueue might also have a “delay” parameter, which ensures that it never deletes an object until after a specific time period has passed. This could easily be used for things such as TIME_WAIT handling in the TCP stack.

ZombieQueue has already been implemented into Process, Pipe, and Socket, all of which needed to delete themselves somehow. Now they call into ZombieQueue, which is able to safely delete the object.

x86_64 port using > 4 GB of RAM

A TODO all through the physical memory manager for quite some time has been to handle regions above 4 GB in the x86_64 port. Until this change was made, no ranges of memory above 4 GB were actually accessible. In fact, using the x86_64 port on a system with over 4 GB of memory actually caused anything from a panic to a triple fault and reboot.

This has been rectified now. I’m still not sure it’ll handle a full 64-bit physical address space yet, but it will handle a massive amount of memory nonetheless.

MemoryPool

This one is something I’m surprised hasn’t been implemented yet. The idea of MemoryPool is to take buffer allocation out of the kernel heap, where it consumes far too much memory – and sometimes even consumed the full size of the heap.

MemoryPool takes a block of memory, outside of the heap, and splits it up into buffers of a specified size. These buffers are distributed to modules and applications where necessary, and returned to the pool when they are no longer needed. Buffers in, for example, the network stack, are around 2048 bytes (rounded up to the next power of 2 from 1500). Rather than have every incoming packet cause an allocation on the kernel heap of, on average, 800 bytes, this moves the processing into a pool of memory dedicated to the job.

A key feature to MemoryPool is its blocking nature. If an allocation cannot be satisfied by the current pool of memory (ie, all buffers are allocated out), it will block until a buffer becomes available. Compared to the heap, this means a constant memory usage is enforced. Effectively, the pool provides a strict contract on its memory consumption: a pool will never resize to fit a new allocation.

This does bring up a minor problem: some code can’t block – an interrupt handler for example. For this situation, a non-blocking allocation method is available, which merely returns NULL if no buffers are available.

UdpLogger

It has long been possible to install custom Log callbacks which take log entries and write them to a destination. Callbacks that already exist include the Serial logger, and a callback is used for the loading screen log dump. For testing on real machines, however, neither of these suffice. None of my test machines have a serial port, for example. They are all networked via Ethernet, however.

UdpLogger is able to be installed after the route table is installed, so it’s only really useful for debugging if you can actually boot to a shell on the test machine. It dumps all log entries to a given IP address and port. A test machine can then dump all its debugging output to a development machine running netcat.

Here it is, in action:

Log::installCallback has now also been updated to dump all the old log entries to the new callback, so every logging destination should have the same data once they are all installed.

These are just a couple of exciting new features and fixes that have happened in the past couple of weeks. At this stage we have not planned a second release – we’re still working through the bugs from Foster and finally getting around to fixing all those TODOs scattered in the code.

At the moment git master is being kept “stable” – that is, it builds and runs to the bash shell (apart from x86_64) – so feel free to grab the latest code and have a play!

Pedigree’s “Services” Feature

Well, the Pedigree team is currently in bug fixing mode as we prepare for our first release, codenamed “Foster” and as such, there’s not been a lot to talk about innovation-wise in the kernel. So I’ve decided instead to chat about a feature of the kernel, which was implemented recently.

For our “live CDs” we realised we would need a read/write file system, in memory, with a set of applications and files to allow people to try Pedigree out without making changes to their computers. This meant that we needed a way to mount a disk image as a usable disk, which is a feature we lacked at that stage in Pedigree. So I spent a couple of hours whipping up a module to support such a feature. However, this module required a couple of tricky design decisions: Cache, and write support.

Cache is a must-have in such a module, as without it each read from the disk image ends up in a read to its parent hardware – not so great if a slow CD drive needs to seek! However, the cache must also be big enough to hold at least one sector of disk data (preferably more). I eventually settled upon a 4096-byte buffer cache, which holds two full CD sectors, and eight hard disk sectors. Initial reads will miss the cache and read straight from the hardware into the cache. Further reads come from that buffer in cache. For an image formatted as FAT, for example, this improves read times from the file allocation table and root directory (high-use areas) significantly.

Writing to the virtual disk makes things slightly more complicated. I decided upon an implementation, which gave the option for the virtual disk to be write-through, or totally in RAM. These options need a bit of explanation. A write-through virtual disk will place writes into cache, and write to the real disk image itself. This is an ideal setup for something like Linux-style loopback disks. However, consider the “live CD” situation: we can’t write back to a CD! So the second option writes only into the cache without affecting the original file. This means all changes are kept as long as the system is running, and do not persist. Implementing these two write options generalised the module – a massive bonus.

However, all is not well at this stage. A module to provide a disk is one part of the battle. At this stage, all we have is an abstraction of a disk – no partitions, no file systems – it’s useless for normal usage. At the time of implementation, Pedigree had no way to dynamically detect and mount such disks. This simply cannot do for a modern operating system where storage devices are hardly static – USB mass storage devices come and go, hot-pluggable hard disks exist, and so on.

I didn’t want to have my loopback disk module talk directly to the partition driver though. Exposing the internals of the partitioner to other modules makes changing the partitioner’s interface awfully complicated, and creates an explicit dependency on a specific module.

I decided instead to implement what I call Pedigree’s service manager. This kernel feature sits between different parts of the operating system and provides a standardised interface to other modules. Each service provides the following types of functionality (at the time of writing):

Write: Send data to the module
Read: Read data from the module
Touch: Inform the module of new state
Probe: Probe the module for a specific state or piece of information

Each service decides which features it provides, so it is possible for a service to provide only read as a function. The service manager takes these potential features and provides a generic interface for drivers and modules to talk to named services. In effect, this idea of services is a method of inter-process communication using named destinations. Therefore, with this new service manager, I was able to modify the partitioner to add support for the touch service. There is no need for a partitioner to support read, write, or probe, as the only notification to be sent to the partitioner is to inform it of a new disk.

With a quick modification to the loopback disk code, I was able to inform the partitioner of the presence of the new disk with error handling and no direct partitioner-specific functionality used:

// Chat to the partition service and let it pick up that we're around now
ServiceFeatures *pFeatures = ServiceManager::instance().enumerateOperations(String("partition"));
Service         *pService  = ServiceManager::instance().getService(String("partition"));
NOTICE("Asking if the partition provider supports touch");
if(pFeatures->provides(ServiceFeatures::touch))
{
    NOTICE("It does, attempting to inform the partitioner of our presence...");
    if(pService)
    {
        if(pService->serve(ServiceFeatures::touch, reinterpret_cast<void *>(this), sizeof(FileDisk)))
            NOTICE("Successful.");
        else
            ERROR("Failed.");
    }
    else
        ERROR("FileDisk: Couldn't tell the partition service about the new disk presence");
}
else
    ERROR("FileDisk: Partition service doesn't appear to support touch");

This feature has already been added to other areas of the kernel (mainly talking to the partitioner), but has the potential to even be expanded to call applications that the user runs. This means it would be theoretically possible to replace the partitioner at runtime, or replace a component of the network stack to provide a different level of service. That means that Pedigree can be modular and flexible, even though it uses the conventionally rigid “monolithic kernel” design. Now that’s something to write home about!