Update: the project FORGE in this post refers to the FORGE Operating System. Also, when discussing feedback schedulers, note that the type of feedback discussed (thread pre-emption) is not the only possible method for feedback – just the one I’ve selected for this post.
Update #2: please note that this all changes when more than one core/CPU can run threads, as various additional factors exist (CPU loads, caches, NUMA domains, etc…) that make scheduling more complex. This blog post is primarily written within the context of a uniprocessor system.
The topic of scheduling in operating system theory is one which can easily fill a chapter, if not several, of a textbook. Optimally scheduling the various tasks running during the course of normal operation is a unique challenge and one which has many solutions. One must consider the type of tasks being run, their overheads, the overhead of switching threads and processes, and various other factors. Furthermore, scheduling may change between operating system types. A real-time operating system will have different scheduling semantics to, say, a general-purpose operating system.
I will not discuss co-operative scheduling here; this is all discussion around a pre-emptive scheduler. That is, a scheduler where threads are given a certain amount of time to run (their timeslice), and are interrupted if they exceed this time.
In FORGE, the current scheduler is simply a round-robin scheduler that switches between threads and doesn’t care about any metadata. This means that every task in the system runs at an equal priority, and also means that threads in the same process are not ‘grouped’.
Let’s take a quick break from discussing round-robin scheduling to discuss this grouping of threads by process, and why it is important in scheduling.
Conceptually, a process can be thought of as an enclosing environment around a set of threads. This environment typically includes an address space, any data shared across threads, and metadata about the actual process itself. For the purposes of this detour, we will consider a process and address space to have a 1:1 relationship. In an environment where address space switches are cost-free (ie, every process runs in the same address space), scheduling threads out-of-order is fine. The threads from various processes can be switched to and from at will, without worrying about costly address space transitions. Consider however the case where an address space switch is in fact a costly operation (as it should be expected to be on most modern architectures): switching between a number of threads from different processes will induce a costly address space switch each time.
Now, consider the following – three threads, two processes. Threads one and three are linked to process one. Thread two is linked to process two. Switching from thread one to two requires an address space switch into process two’s address space. Then, switching from two to three requires yet another address space back into process one’s address space. It would be far more sensible to queue threads from the same process together.
Back to round-robin scheduling.
We essentially have two first-in-first-out queues: the ‘ready’ queue, and the ‘already’ queue. Threads ready to be scheduled are queued in the ready queue, and threads that have already been scheduled are queued in the already queue. When the ready queue runs out of items, the two queues are swapped (already queue becomes the ready queue, and vice versa). This works excellently for an ‘initial’ testing algorithm for an operating system, but does not offer any priorities or have any sort of scheduling heuristics.
To add priorities, it is possible to create a list of queues, and there are various other improvements that can be made to round-robin scheduling as well. In FORGE however, I have decided to take the current round-robin scheduler and replace it with a ‘feedback’ scheduler. This is a very powerful scheduler type that can dynamically respond to the changing requirements of the system as time goes on. Consider a general purpose operating system under a reasonable load. There are a number of threads in a number of processes and each is performing various tasks. Some threads are performing heavy I/O, while others are heavily using the CPU. Others still are mixing the two. This variety in workloads is very common and also very difficult to handle ‘well’.
A feedback scheduler works off the feedback from the threads it runs (hence the name). If a thread is interrupted by the operating system for completing its timeslice, the operating system can generally assume the thread is using the CPU quite a bit. If however a thread gives up its timeslice to the operating system (by yielding, or implicitly by performing I/O), the operating system can determine that the thread is probably more I/O heavy. In a sense, the feedback the threads give to the scheduler allow the scheduler to determine if the threads are using the resources effectively.
So how does a feedback scheduler respond to the feedback? Well, if we consider that each thread has a priority (and a base priority, which is its ‘default’ and starting priority), the scheduler can both punish threads that use too much CPU and reward threads that do not. So, if a thread completes its timeslice before being rescheduled, the scheduler reduces its priority. If a thread does not complete its timeslice before being rescheduled, the scheduler increases its priority. This balances the system by prioritising I/O tasks (which end up blocking while the CPU-heavy tasks do their thing). Then, if a CPU-heavy task needs to block and wait for I/O, its priority can again be increased, and the system is responsive and continues to be usable even during this common workload.
Conceptually this is very easy to understand, but the actual implementation in code (or, perhaps, data structures) is a challenge. At this stage, this has not been implemented into FORGE, so no code is available for demonstration. However, the planned implementation will consist of a combination of binary trees and first-in-first-out queues. Remember earlier the discussion about grouping threads by process; by using a binary tree containing each process ID, we can traverse the tree completely and then load the queue of threads on each node (ie, process). This works for selecting the next thread to schedule.
Now, for the priorities, a simple array suffices; with the index being the priority.
So, for scheduling, we end up with an array, containing binary trees for each priority level, which then contain queues for each process.
This enables a reasonably efficient lookup (assuming iterators are sensible for trees, and that lookup costs and the like are negligible for linked lists), groups threads by process, and enables the feedback system to work correctly. The outcome is a reasonably-well balanced system with priority given to I/O.
As mentioned previously, scheduler design is a unique challenge, and this particular type of scheduler is not necessarily the ideal solution for, say, a real-time operating system. For reference, a similar design is used in OSX, FreeBSD, NetBSD, and Windows NT-based operating systems. Perhaps at a later date I will investigate and discuss an O(1) scheduler type. I’ll blog again later with the intricate details of actually implementing this algorithm.