Every time I’ve spun up a blog I’ve been motivated and inspired to write more. And… every time I’ve had a small burst of energy and then run out of interesting things to write about. It’s frustrating to want to publish content but to get caught up in the practicalities of actually doing so and forever procrastinate writing another post.
That’s what this is for. It’s to admit that I’m just not very good at this. Admitting this now gives me space to not be good at this. If I miss posts for a few months, that’s okay. If I get very productive and write some interesting content and then go radio-silent, that’s okay. I can’t get better at long-form blog writing without accepting these facts, because otherwise I’ll forever be caught up in my head about it.
To help I’ve also spun up short-form writing destinations, so I can keep practicing putting thoughts into the Internet without the daunting shape of a long-form blog post.
Those are Twitter (@matt_iselin) and a Mastodon instance I spun up in my homelab (@firstname.lastname@example.org).
Short-form goes there. Long-form stays here.
Here’s to being bad at blogging and knowing that that’s perfectly fine.
The other major holiday project I worked on over the end-of-year break was to take my 3D engine and build something with it. I have a lot of little tech demos and ideas but nothing that’s actually “finished”.
Many years ago, I worked on a 3D engine that ended up being capable of loading Quake 3 maps:
So I thought – why not do what I didn’t do in 2007 – actually build something that is more or less a clone of Quake 3?
Well, I managed to get a little further than I did in 2007 to start with – with working lightmaps for example:
OK, but rendering a Quake 3 map is just scratching the surface of making a game. What’s more, Quake 3 BSP maps have more than just blocky rectangular geometry – they have Bezier surfaces to create curved geometry at runtime. Back then, this let id offer a geometry detail slider to let weaker computers reduce tessellation and get less-round round surfaces. Nowadays, that’s less of a big deal, so I thought I’d take it a little further.
Rather than write code to grab these Bezier surfaces (defined by 9 control points), and tessellate them at load time, I thought this would be a great opportunity to learn how to write OpenGL tessellation shaders.
It took a little while to wrap my head around the surfaces, which did involve drawing several example surfaces in a notebook (always keep a notebook near your keyboard!) and recognizing the characteristics of the control points and how they resulted in curved geometry.
Once I managed to figure that out, and load the control points in the correct order from the BSP file (again…. pen & paper), I was able to put together the shader:
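As a rough illustration of what the tessellation evaluation stage computes – hedged, since the real thing is GLSL inside my engine – a biquadratic Bezier patch over the 3x3 control grid can be evaluated like this (all names here are mine, not the engine's):

```python
# A minimal CPU-side sketch of what the tessellation evaluation shader
# computes for a Quake 3 patch: a biquadratic Bezier surface over a
# 3x3 grid of control points.

def bernstein2(i, t):
    """Quadratic Bernstein basis functions B0, B1, B2."""
    return ((1 - t) ** 2, 2 * t * (1 - t), t * t)[i]

def eval_patch(control, u, v):
    """Evaluate the surface at (u, v) in [0,1]^2.
    `control` is a 3x3 grid of (x, y, z) control points."""
    x = y = z = 0.0
    for i in range(3):
        for j in range(3):
            w = bernstein2(i, u) * bernstein2(j, v)
            px, py, pz = control[i][j]
            x += w * px
            y += w * py
            z += w * pz
    return (x, y, z)

def tessellate(control, level):
    """Produce a (level+1) x (level+1) grid of surface points,
    roughly what the GPU emits per patch at a given tess level."""
    return [[eval_patch(control, i / level, j / level)
             for j in range(level + 1)]
            for i in range(level + 1)]
```

On the GPU, the tessellator generates the (u, v) grid and the evaluation shader runs the `eval_patch` math per generated vertex, so the geometry detail is just the tessellation level.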
The results worked out great, and it was very satisfying to use this as an opportunity to build out support in my 3D engine for tessellation shaders. The only shader type not yet supported is a Compute shader, and I’m hoping to dig into those soon too.
Quake 3 Shaders
Quake 3 (well, idTech 3) allows the use of its own scripted shaders to shade geometry. While many surfaces are textured with just an image (plus a lightmap), shaders allow for significantly more flexibility when rendering. For example, the flames on torches in maps are shaders that set up an animation across up to 8 image files – so with just two quads, a flickering flame can be created.
The shaders also support texture coordinate modification – based on constants or trigonometric functions – used to great effect to create moving skies, lava, or plasma electric effects by just rotating texture coordinates around.
Vertex deformation is even an option – more on that later.
Traditionally, these shaders would require multiple draw passes to merge the layers of the shader into one final visual result. I implemented this at first, but realized my 3D engine already supports uniform buffer objects (i.e. buffers of data to be passed to shaders), so I rebuilt my rendering path for Quake 3 models and geometry. Now, each Quake 3 shader is converted into a structure that contains all the information, texture stages, and other flags related to rendering. That's passed to the graphics card, where the GLSL shader iterates through the list of stages and renders, performing blending between stages within the fragment shader itself.
The end result is a single draw call for geometry but with all of the same visual results! Combined with bindless textures, the result was a dramatic reduction in draw and bind calls, and no need to sequence multiple draw calls and OpenGL blending stages.
The per-stage data in the uniform buffer is laid out like this:

* stage_pack0: packed bitfield with stage shader parameters. 4 bytes.
  * alt_texture : 2 - alternative texture flag
  * blend_src : 3 - source blend mode in this stage
  * blend_dst : 3 - destination blend mode in this stage
  * rgbgen_func : 3 - rgbgen function selector
  * rgbgen_wave : 3 - rgbgen wave to use (if function is wave)
  * tcgen_mode : 2 - tcgen mode selector
  * num_tcmods : 3 - number of tcmods in use
* rgbgen wave params (if function is wave) - 16 bytes
* all our texture coord modification parameters

The enclosing `layout (std140) uniform Game` block carries the per-draw state:

* game_pack0: packed bitfield with common shader parameters. 4 bytes.
  * num_stages : 3 - number of stages in this shader
  * direct : 1 - avoid caring about blends and whatnot
  * sky : 1 - rendering sky
  * model : 1 - rendering a model (ignores lightmaps)
* `layout(bindless_sampler) uniform sampler2D lightmap;`
* `layout(bindless_sampler) uniform sampler2D textures;`
* additional transformations to apply beyond the model/VP matrices, used for models attached to attachment points on other models
I packed a number of the parameters into bitfields to reduce the size of the uniform buffer. I had visions of sending the GPU the entire selection of shaders in use and passing only a shader index in the draw call (or perhaps storing it as a vertex attribute), but I ultimately ran into the uniform buffer size limit on larger maps and decided to leave this idea alone.
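For illustration, here's how that packing might look on the CPU side, in Python. The field names and widths follow the layout above, but the exact bit order in my engine is an assumption here:

```python
# Sketch of packing stage_pack0 into a single 32-bit value before
# upload. Fields are packed low-to-high; the GLSL side reverses this
# with shifts and masks. Bit order is illustrative.

STAGE_FIELDS = [          # (name, bit width)
    ("alt_texture", 2),
    ("blend_src",   3),
    ("blend_dst",   3),
    ("rgbgen_func", 3),
    ("rgbgen_wave", 3),
    ("tcgen_mode",  2),
    ("num_tcmods",  3),
]                          # 19 bits total - fits easily in 4 bytes

def pack_stage(**values):
    """Pack named fields into one 32-bit integer."""
    packed, shift = 0, 0
    for name, width in STAGE_FIELDS:
        v = values.get(name, 0)
        assert 0 <= v < (1 << width), name + " out of range"
        packed |= v << shift
        shift += width
    return packed

def unpack_stage(packed):
    """Inverse of pack_stage - what the shader does when reading."""
    out, shift = {}, 0
    for name, width in STAGE_FIELDS:
        out[name] = (packed >> shift) & ((1 << width) - 1)
        shift += width
    return out
```

The win is size: seven selectors collapse into one 4-byte word per stage, which matters when the whole shader list has to fit in a bounded uniform buffer.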
I mentioned vertex deformation earlier in this post.
Quake 3 shaders allow for vertex deformation, utilizing trigonometric functions to manipulate the vertices of the geometry in real time. This is used for things like flags. It seemed like an even better fit for tessellation shaders than the Bezier surfaces I mentioned above.
This was just a matter of implementing an alternative tessellation shader that would perform the deformations. In the code example below, “q3sin” and such are functions that just call the relevant trigonometric function, utilizing the parameters provided to correctly manipulate the result. In the original Quake 3 source, these were lookups in precomputed tables. Now, I’m just running the functions on the GPU.
if (wave == RGBGEN_WAVE_SIN) d_val = q3sin(base, amp, phase, freq);
if (wave == RGBGEN_WAVE_TRI) d_val = q3tri(base, amp, phase, freq);
if (wave == RGBGEN_WAVE_SQUARE) d_val = q3square(base, amp, phase, freq);
if (wave == RGBGEN_WAVE_SAW) d_val = q3saw(base, amp, phase, freq);
if (wave == RGBGEN_WAVE_INVSAW) d_val = q3invsaw(base, amp, phase, freq);
// deform along the normal
vertex_out.Position.xyz += vertex_out.Normal * d_val;
This works pretty well:
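For reference, the q3sin/q3tri/etc. helpers follow the standard Quake 3 waveform convention of base + amplitude * wave(phase + time * frequency), with wave() periodic over one unit. A Python sketch of that math follows; the real versions are GLSL, and time arrives as a uniform rather than a parameter, so treat the signatures as illustrative:

```python
import math

# Quake 3 style waveform generators:
#   value = base + amplitude * wave(phase + time * frequency)
# where wave() has period 1.

def _cycle(phase, freq, time):
    # fractional position within the current period
    return (phase + time * freq) % 1.0

def q3sin(base, amp, phase, freq, time):
    return base + amp * math.sin(2.0 * math.pi * _cycle(phase, freq, time))

def q3square(base, amp, phase, freq, time):
    return base + amp * (1.0 if _cycle(phase, freq, time) < 0.5 else -1.0)

def q3saw(base, amp, phase, freq, time):
    return base + amp * _cycle(phase, freq, time)

def q3invsaw(base, amp, phase, freq, time):
    return base + amp * (1.0 - _cycle(phase, freq, time))

def q3tri(base, amp, phase, freq, time):
    t = _cycle(phase, freq, time)
    # rises 0 -> 1 over the first half period, falls back over the second
    return base + amp * (2.0 * t if t < 0.5 else 2.0 * (1.0 - t))
```

The original engine read these out of precomputed lookup tables; evaluating them directly per-vertex is cheap on a modern GPU.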
There’s much more work to be done – I’ve been working on getting 3D models loaded and animated, building out the projectile logic so the game’s weapons work, and all that gameplay stuff that turns this from a cool handful of technical demo elements to a playable game.
The main reason I wanted to set this as a challenge for myself is that I can use the assets from the base game to be able to build the clone – I’m not a particularly great artist – and also have a point of reference along the way. At first I thought it was a little strange to build something that already exists, but the extent to which this project has already pushed the edge of my 3D engine and my own thinking about how to build 3D applications has been invaluable.
Here’s to being able to actually play the game soon, though 🙂
Over the new year’s break I’ve been working on a few projects – it’s been a while since I had a chance to get into writing some code for a fun project.
I’m still working on another holiday project – so this might be a bit short on deep technical details – but here’s some of what I’ve been doing.
One project I set out to build was a software raytracer. Having already done some work with 3D graphics (see other posts on this blog), it seemed like an interesting project to take on to see the differences between rasterization and tracing. The vast majority of rendering code I’ve written has been for OpenGL rasterization so it was certainly a change in mental paradigm.
It was fascinating to be able to build scenes with mirrors, reflective surfaces, shadows… coming from rasterization, it felt like “cheating” to just add another light to the scene, or to add a pair of mirrors and have the infinite mirror effect just work.
I have a physically-based rendering demo in my 3D game engine source tree that I tried to replicate in the raytracer:
I had a lot of fun writing the raytracer and for the most part I was able to get it up and running over a weekend.
For a long time I’ve been using physically-based rendering in my 3D engine. Over the break I was able to dig into it a little more and fix a number of bugs across the implementation.
One such bug was an inversion of parameters when generating a lookup table texture for the image-based lighting specular calculations. The flipped parameters caused lookups to return incorrect data, producing inaccurate scenes.
I also spent a lot of time rewriting the equations. I made a fundamental error years ago when I first wrote the physically-based rendering shaders. The rendering integral was implemented as a plain equation and never worked the way I expected, so I jammed factors around until it kind of worked. Working on the raytracer helped me realize it really is an integral – using multiple rays per pixel in a raytracer amounts to a Monte Carlo estimation of it – and so I made sure everything was up-to-date in source control and rewrote the shaders.
The results look particularly improved on the dielectrics, where the roughness scale now looks a lot more realistic.
For several years now, I’ve been experimenting with integrating tech into the houses and apartments I’ve lived in. For the most part, this has manifested as very basic automation – some trivial routines in Philips Hue, or an auto-off timer on a Wemo power socket, for example.
Well, in the past few months, I’ve been able to dramatically amplify my efforts in this realm by introducing a wide variety of Zigbee and Zwave home automation devices.
Zigbee & Zwave
Zigbee and Zwave are wireless communication systems designed for Internet-of-Things devices, including features like mesh networking and security to create an environment ripe for automation.
The distinction between “secure” and “insecure” tends to follow the categorization of the device – you probably don’t want your wireless door lock to be accessible to anyone nearby that can transmit a Zigbee control code, but perhaps you don’t have quite the same concerns with a temperature sensor.
Both systems require a hub somewhere that owns the network and keeps track of the devices within. The websites of each protocol have information about possible hubs, but I went with a USB variant and connected it to HomeAssistant.
HomeAssistant is the “brains” of my home automation. Some folks use openHAB as an alternative. I run my instance on a VMware server in my garage – you can run yours on anything from a home PC to a Raspberry Pi.
HomeAssistant is the hub through which all of the Zigbee and Zwave devices are communicating, sending their sensor readings and receiving commands. While it can also be used as a dashboard, I personally use a Grafana instance (also on my VMware server) along with InfluxDB and Prometheus instances to collect and present data from around the house.
HomeAssistant also allows automations that use readings from various devices to make decisions and trigger other devices. For example, the lights in my backyard turn on when the outdoor luminance drops below a certain level (rather than using sunset data), and turn off at 10pm unless there’s motion in the yard. This is just one example of what is possible – I’ve barely even begun to scratch the surface in my own usage of HomeAssistant!
I’ve spent some time building up a Grafana dashboard that I can use as a quick go-to to see what’s happening in my smart home. Internet stats, power and water usage, and environmental data are all available at a glance.
My favorite part of this dashboard is the power data. Using a Smart Utility Meter connected to our PG&E meter, I was able to set up a data feed with point-in-time power meter readings. I had to write a little Python daemon to receive the regularly-submitted reports from the meter and push them to my InfluxDB instance, but once that was up and running it has been pretty solid – other than one little bug with units that sometimes reported a power consumption of 1 MW or more. Once that got squashed, everything has looked very reasonable.
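The daemon itself is small. This Python sketch captures its shape – the transport and the actual InfluxDB write are stubbed out, and the report format is an assumption of mine, but the sanity check mirrors the unit bug described above:

```python
import json

# Sketch of a meter-report daemon: decode a periodic report, normalise
# to watts, drop implausible readings, and shape the rest for InfluxDB.
# The JSON layout and the "pge" tag are illustrative, not the real feed.

MAX_PLAUSIBLE_WATTS = 1_000_000  # 1 MW - no house draws this

def parse_report(raw):
    """Decode one meter report into (timestamp, watts), or None if
    the reading is implausible (the unit bug produced >= 1 MW spikes)."""
    report = json.loads(raw)
    # some meters report in kW with a separate multiplier field;
    # normalise everything to watts here
    watts = float(report["demand"]) * float(report.get("multiplier", 1))
    if not (0 <= watts < MAX_PLAUSIBLE_WATTS):
        return None
    return (report["timestamp"], watts)

def to_influx_point(ts, watts):
    """Shape a reading as an InfluxDB line-protocol string."""
    return f"power,meter=pge demand_watts={watts} {ts}"
```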
My second favorite part of this dashboard!
Using a handful of Zwave multi-sensors, I have all sorts of useful data from various locations in the house – temperature, humidity, luminance, motion. Some of these I am already using in automation – the use of dwindling luminance to trigger turning on smart bulbs is very useful – but others are more informational.
I don’t yet use the other data for anything other than my own interest. It is however useful to know what the temperature is around the house, and I can use that to experiment.
Which brings me to…
All this data is useful to satisfy a curiosity, but I’ve been slowly starting to find new ways to make use of it.
Temperature data is useful to figure out if I should wear a jacket today (I should set up a push notification for that…), but I’ve also been experimenting with turning heaters on and off and seeing the impact on the longer-term temperature graphs. This is particularly interesting when looking at ways to trim back cost – heating isn’t particularly efficient, and if it is not making a major difference, that’s a strong signal that we can change our usage of it!
I’ve also used power data to recognize particularly power-hungry light fixtures and identify them as candidates for a bulb conversion. That’s something that’s doable without all this fancy tech, but I love how quickly I can see results – with the dashboard open on my phone, after throwing a light switch, I can see results in seconds and move on to another fixture.
Many folks have used these sensors for even more interesting purposes – calibrating the UV sensors to open and close blinds to protect furniture from UV rays, for example.
I’m barely scratching the surface of what’s possible, but it’s nonetheless an exciting place to be.
Just moments ago while writing this article, I realized it had become quite dark in my office. I think it’s time to publish this… and start working on an automation to turn the lights on when it gets dark and I’m working!
I made a search engine, mostly out of curiosity about what such an undertaking might entail, and also to build my skills with Go.
Update: I ended up shutting down the search engine, after stumbling into altogether too many bugs with robots.txt and crawling pages that should have been ignored. This post remains as a historical record.
I’ve been coding in Go a bit lately as it’s extremely well-suited to server-side workloads. This blog is served by a Go binary which enables full HTTP2 support as well as further customizations as I see fit.
The ease of integration with systems like Prometheus for capturing metrics for monitoring and statistical analysis (e.g. requests per second) has been exceptional as well compared to my experience with similar systems in C++. It has been a great opportunity to also play with Grafana for graphing metrics scraped by Prometheus.
Initially this blog is the main starting point for crawls, but I would like to also crawl Lobsters and Hacker News to bring in a fairly tech-news-heavy dataset to the index.
You can also see some interesting data from the index – navigating its full contents, inspecting the TLDs seen by the crawler, or even enumerating the hostnames seen during crawls.
These extra data pages do reveal some of the internal structure of the index.
URL paths, page descriptions, and titles are stored with full-text search indexes to enable querying.
Each entry in the index is associated with a single URL and stores the URL path as well as a link to both a TLD and a hostname record. The use of TLD and hostname records allows for many URLs to come from the same domain name and TLD (e.g. hostname “google” and TLD “com”) without duplicating both unnecessarily in the database.
This means that for a 500 URL index, the storage space consumed is around 200KiB, most of which is consumed by full-text titles and descriptions from crawled webpages.
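A sketch of that layout, using SQLite purely for illustration (the real storage layer may differ, and the real index also carries full-text indexes over titles and descriptions, which are omitted here):

```python
import sqlite3

# Hostname and TLD live in their own tables so many URLs from the
# same site share one row each, rather than repeating "google"/"com"
# per URL.

def make_index(conn):
    conn.executescript("""
        CREATE TABLE tlds      (id INTEGER PRIMARY KEY, tld  TEXT UNIQUE);
        CREATE TABLE hostnames (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE urls (
            id          INTEGER PRIMARY KEY,
            hostname_id INTEGER REFERENCES hostnames(id),
            tld_id      INTEGER REFERENCES tlds(id),
            path        TEXT,
            title       TEXT,
            description TEXT
        );
    """)

def _intern(conn, table, column, value):
    # insert-if-missing, then fetch the shared row's id
    conn.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)",
                 (value,))
    row = conn.execute(f"SELECT id FROM {table} WHERE {column} = ?",
                       (value,)).fetchone()
    return row[0]

def add_url(conn, hostname, tld, path, title, description):
    conn.execute(
        "INSERT INTO urls (hostname_id, tld_id, path, title, description)"
        " VALUES (?, ?, ?, ?, ?)",
        (_intern(conn, "hostnames", "name", hostname),
         _intern(conn, "tlds", "tld", tld),
         path, title, description))
```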
It’s certainly been an interesting learning journey to get here and in particular it has been fascinating discovering the subtle nuances of robots.txt, crawling many URLs in a scalable way, and building something useful from the ground up in Go.
This demonstrates some of the improvements I’ve made. Recent work since this video has been primarily focused on building a voxel geometry demonstration and implementing some performance improvements.
nanogui is a great little GUI framework that integrates nicely with OpenGL. It’s quite extensible, which is great as I have made use of its core set of user interface widgets to build more complex user interfaces.
I’ve supported a number of effects in Victor for a while, but this video fully shows several of them including SSAO (screen-space ambient occlusion), motion blur, and a vignette effect.
The last post showed a PBR-specific video, while this video shows the PBR functioning in an existing environment.
It’s worth noting that since this video there have been further changes to the PBR support in Victor. Some of these are still in progress, particularly related to physically-correct lighting.
Render targets are well-supported now, and are shown in the video by rendering an entirely new scene and showing it embedded within a nanogui window.
I’m using these for a proof-of-concept editor that allows editing materials with a live demonstration of any changes made. It’s still in progress and the latest work to fix performance and add voxel support has broken it a bit, but once things are working properly I’ll be able to upload a video of it in action.
The last video didn’t show this at all, but it’s shown in this video (though it may be difficult to see!) – the bricks on the “ground” are all parallax mapped to give an illusion of depth without requiring the extra geometry.
The parallax mapping can be seen more obviously in the render target view, where the parallax effect stretches around the sphere.
What’s happening now?
My focus at the moment in Victor is primarily:
a voxel demonstration with modifiable geometry
performance improvements; visibility culling was a great start, but Victor needs culling via something like an octree to perform nicely on complex scenes (right now it still has to consider every object in a scene to perform the visibility cull, whereas with an octree a large number of objects could be skipped at once).
fixes to PBR lighting and shadowing, which is currently a little broken
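The octree idea from the list above can be sketched as follows – a minimal version that stores objects as points (their centres, say) and skips whole subtrees whose bounds miss the query region. A real implementation would store bounding volumes and test against the view frustum, not an axis-aligned box:

```python
class Octree:
    """Minimal point octree: nodes split into 8 children once full,
    and queries prune entire subtrees that miss the query box."""
    MAX_ITEMS = 4

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi     # node bounds: (x, y, z) corners
        self.items = []               # (point, payload) pairs
        self.children = None

    def insert(self, point, payload):
        if self.children is not None:
            self._child_for(point).insert(point, payload)
            return
        self.items.append((point, payload))
        if len(self.items) > self.MAX_ITEMS:
            self._split()

    def _split(self):
        mid = tuple((l + h) / 2 for l, h in zip(self.lo, self.hi))
        self.children = []
        for i in range(8):
            lo = tuple(mid[a] if (i >> a) & 1 else self.lo[a] for a in range(3))
            hi = tuple(self.hi[a] if (i >> a) & 1 else mid[a] for a in range(3))
            self.children.append(Octree(lo, hi))
        items, self.items = self.items, []
        for point, payload in items:
            self._child_for(point).insert(point, payload)

    def _child_for(self, point):
        mid = tuple((l + h) / 2 for l, h in zip(self.lo, self.hi))
        return self.children[sum(1 << a for a in range(3)
                                 if point[a] >= mid[a])]

    def query(self, qlo, qhi, out):
        # skip this whole subtree if its bounds miss the query box -
        # this is the win over testing every object individually
        if any(self.hi[a] < qlo[a] or self.lo[a] > qhi[a] for a in range(3)):
            return out
        for point, payload in self.items:
            if all(qlo[a] <= point[a] <= qhi[a] for a in range(3)):
                out.append(payload)
        if self.children:
            for child in self.children:
                child.query(qlo, qhi, out)
        return out
```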
This might take a while (months), but I’m aiming to add more videos of some of these features as they mature. The editor proof-of-concept is also something that I’m pretty excited about, so once that’s cleaned up a bit and has some key problems fixed I’ll be able to show that off too.
I’ve done some work recently to put together a new build system for Pedigree’s ports, where dependencies can be a first-class citizen and various other modes of operation can be used. For example, it’s now fairly trivial to just dump the commands that would be run into a file.
I hope to discuss the system more once it’s a little more tested, but for now, I’ll leave you with the latest SVG (will update after each build!) of the (build – not necessarily installation) dependency tree for all Pedigree ports.
My last post on this blog covered off the work so far on memory mapped files. There has been quite a bit of progress since then in this area:
Memory mapped file cleanup works as expected. Programs can remove their memory mappings at runtime, and this will be successful – including ‘punching holes’ in mappings.
Remapping an area with different permissions is now possible. The dynamic linker uses this to map memory as required for the type of segment it is loading – for example, executable or writeable. This means it is no longer possible to execute data as code on Pedigree, on hardware that supports enforcing this.
Anonymous memory maps are mapped to a single zeroed page, copy-on-write. Programs that never write to an anonymous memory map are therefore significantly lighter in physical memory.
The virtual address space available for memory maps on 64-bit builds of Pedigree is now significantly larger.
Other significant changes since the last post include:
Implemented a VGA text-mode VT100 emulator and the necessary fallbacks for systems that cannot support a graphics mode. This significantly improves the experience on such machines.
Pseudo-terminal support has improved substantially, such that the ‘Dropbear’ SSH server runs and accepts numerous connections without nasty error messages.
POSIX job control is functional.
I have successfully used Pedigree on my old Eee PC with a USB mass storage device as the root filesystem, writing files on Pedigree using Vim and reading them on a Linux system.
The build system now uses GCC 4.8.2 and Binutils 2.24.
Pedigree is now only 64-bit when targeting x86 hardware, in order to reduce development complexity and to acknowledge the fact that very few modern systems are 32-bit-only anymore.
Of particular interest has been the switch to 64-bit-only when targeting x86. The following is a post-mortem from a particularly interesting side-effect of this.
Python has been a supported port in Pedigree for quite a while. Python entered the tree proper in 2009, version 2.5. The process of and lessons learned while building Python for Pedigree led to the creation of the Porting Python page on the OSDev.org wiki. Suffice it to say, this is a port that has great significance to the project. Our build system (SCons) also uses Python, so it is critical to support Python in order to achieve the goal of building Pedigree on Pedigree. Recently I noticed that Python was consistently hitting a segmentation fault during its startup. Noting that this is probably not a great state for the Python port to be in, I decided to take a closer look.
All code is from Python 2.7.3.
The problem lies in moving from 32-bit to 64-bit; I am sure by now many readers will have identified precisely what the problem is, or will do so within the first paragraph or two of reading. Read on and find out if you are correct! 🙂
The first order of business was to get an idea as to where the problem was taking place. I rebuilt Python making sure that `-fno-omit-frame-pointer` was in the build flags, so the Pedigree kernel debugger could do a trivial backtrace for me. I removed the code that only enters the kernel debugger when a kernel thread crashes (normally, it makes more sense for a SIGSEGV to be sent to the faulting process – but I needed more debugging flexibility to fix this). I managed to get a backtrace and discovered that the process was crashing within the _PySys_Init function.
With a disassembly of the Python binary in hand, and the source code available, I quickly identified that the problem line was:
Okay, so it turns out that somehow, the sys module’s dictionary of attributes, methods, and documentation is returning a ‘not found’. This is bad! The question is, why is the lookup failing?
I ended up having to trace through the source with breakpoints and disassembly, which took a good 5-6 man-hours to complete. I reached a point where I could no longer isolate the issue and it was at this point I realised I needed something a bit heavier than Pedigree’s builtin debugging tools. The QEMU emulator provides a GDB stub, which is perfect for debugging this kind of thing.
I also reached the conclusion to use GDB after a number of test runs where I ended up inspecting the raw contents of RAM to decipher the problem at hand – while this is helpful for learning a lot about how Python data structures work and how they look, it is nowhere near a sustainable solution for debugging a problem like this.
I linked a local GDB up to the Python binary, with a .gdbinit file that made sure to transform the file paths the binary held within it so GDB could show me source references while running. The file looks a little like this:
target remote localhost:1234
set substitute-path /path/to/builds/build-python27-2.7.3/build/.. /real/path/to/misc/src/Python-2.7.3
set filename-display absolute
set disassemble-next-line on
The breakpoint on the final line is set to the line of code shown above.
The key to the .gdbinit file is that it essentially specifies a list of GDB commands to run before GDB is interactive. This saves a huge amount of time when doing the same general debug process repeatedly. So the stage is set, the actors are ready!
Up comes Pedigree, up comes GDB, everything is connected and functioning correctly. QEMU hits the breakpoint address and hands off control to GDB. At this point, I am able to print the value of various variables in scope and start tracing. First of all, I check the sysdict dictionary to make sure it actually has items…
> print sysdict.ma_used
(number greater than zero)
Okay, so there’s items in the dictionary. Excellent. I’ll confess at this point I became a little bit excited – I hadn’t used GDB with QEMU before, and I hadn’t realised that it would be exactly the same as debugging any other userspace application. The entire toolset is at my fingertips.
So I trace, stepping through function after function, nested deeply. Fortunately GDB has the finish command – which basically continues execution until the current function is about to return. Many functions included things like allocating memory, interning strings, and creating Python objects. Jumping to the end and seeing each of these functions completed successfully indicated the issue was not in any of these particular areas of the Python source tree.
Finally, after much stepping and moving through the call tree, I ended up at the PyDict_GetItem function. Excellent – I know I’m close now!
I’ll confess, as soon as I saw the source dump for this function I had a bit of an a-ha moment; the first line of the function is:
From my previous memory dumping and traversing the Python codebase, I happened to have an awareness that dictionary objects use the type Py_ssize_t for their hashes. This is defined as ssize_t normally, which is fine on most systems. I had a hunch at this point, but I continued stepping – I wanted conclusive evidence before I left the GDB session and identified a fix.
The next few steps essentially involved tracing until finding something along the lines of:
I aborted the GDB session here, closed QEMU, and ran a quick test to see what the actual size of Pedigree’s ssize_t on 64-bit is… and discovered that it is in fact only 4 bytes (where size_t is 8 bytes). Of course, a long on 64-bit is a full 8-byte integer. Matching the hash would be a true fluke; the dictionary lookup could never succeed.
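The failure mode is easy to reproduce in miniature: store a 64-bit long hash through a 4-byte ssize_t and the high bits vanish, so the later comparison against the full hash can only succeed for values that happen to fit in 32 bits – and Python's string hashes use the full range of a long:

```python
import ctypes

# A toy reproduction of the bug: the hash is computed as a 64-bit
# C long, but a mis-sized Py_ssize_t (4 bytes instead of 8) stores
# only the low 32 bits in the dictionary entry.

def truncate_to_32(value):
    """What storing a long into a 4-byte signed ssize_t does."""
    return ctypes.c_int32(value & 0xFFFFFFFF).value

def broken_lookup(hash_64):
    stored = truncate_to_32(hash_64)  # what went into me_hash
    probe = hash_64                   # the full long at lookup time
    return stored == probe            # fails for any hash above 32 bits
```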
The problem has now been fixed and Python now runs perfectly well on 64-bit systems. Python checks the size of size_t in its configure script but not the signed variant; nor should it need to – the two types should be the same size. Even so, PyObject_Hash returns a long; there is a comment to this effect in Python’s dictobject.h:
/* Cached hash code of me_key. Note that hash codes are C longs.
 * We have to use Py_ssize_t instead because dict_popitem() abuses
 * me_hash to hold a search finger.
 */
I have not yet checked whether this is resolved in newer Python.
It’s nice to be able to run Python code again in Pedigree. 🙂
Update: the project FORGE in this post refers to the FORGE Operating System. Also, when discussing feedback schedulers, note that the type of feedback discussed (thread pre-emption) is not the only possible method for feedback – just the one I’ve selected for this post.
Update #2: please note that this all changes when more than one core/CPU can run threads, as various additional factors exist (CPU loads, caches, NUMA domains, etc…) that make scheduling more complex. This blog post is primarily written within the context of a uniprocessor system.
The topic of scheduling in operating system theory is one which can easily fill a chapter, if not several, of a textbook. Optimally scheduling the various tasks running during the course of normal operation is a unique challenge and one which has many solutions. One must consider the type of tasks being run, their overheads, the overhead of switching threads and processes, and various other factors. Furthermore, scheduling may change between operating system types. A real-time operating system will have different scheduling semantics to, say, a general-purpose operating system.
I will not discuss co-operative scheduling here; this is all discussion around a pre-emptive scheduler. That is, a scheduler where threads are given a certain amount of time to run (their timeslice), and are interrupted if they exceed this time.
In FORGE, the current scheduler is simply a round-robin scheduler that switches between threads and doesn’t care about any metadata. This means that every task in the system runs at an equal priority, and also means that threads in the same process are not ‘grouped’.
Let’s take a quick break from discussing round-robin scheduling to discuss this grouping of threads by process, and why it is important in scheduling.
Conceptually, a process can be thought of as an enclosing environment around a set of threads. This environment typically includes an address space, any data shared across threads, and metadata about the actual process itself. For the purposes of this detour, we will consider a process and address space to have a 1:1 relationship. In an environment where address space switches are cost-free (ie, every process runs in the same address space), scheduling threads out-of-order is fine. The threads from various processes can be switched to and from at will, without worrying about costly address space transitions. Consider however the case where an address space switch is in fact a costly operation (as it should be expected to be on most modern architectures): switching between a number of threads from different processes will induce a costly address space switch each time.
Now, consider the following – three threads, two processes. Threads one and three are linked to process one. Thread two is linked to process two. Switching from thread one to two requires an address space switch into process two’s address space. Then, switching from two to three requires yet another address space back into process one’s address space. It would be far more sensible to queue threads from the same process together.
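That cost argument is easy to quantify with a toy model (thread and process names here are made up):

```python
# Count address-space switches for a run order: a switch is charged
# whenever two consecutive slots belong to different processes.

def address_space_switches(schedule):
    """schedule is a list of (thread, process) pairs in run order."""
    return sum(1 for (_, prev), (_, cur) in zip(schedule, schedule[1:])
               if prev != cur)

def group_by_process(schedule):
    """Stable-reorder the run queue so same-process threads run together,
    preserving the order in which processes first appear."""
    order = []
    for _, proc in schedule:
        if proc not in order:
            order.append(proc)
    return sorted(schedule, key=lambda tp: order.index(tp[1]))
```

With the three-thread example above, the interleaved order pays two address-space switches per round; grouping process one's threads together pays only one.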
Back to round-robin scheduling.
We essentially have two first-in-first-out queues: the ‘ready’ queue, and the ‘already’ queue. Threads ready to be scheduled are queued in the ready queue, and threads that have already been scheduled are queued in the already queue. When the ready queue runs out of items, the two queues are swapped (already queue becomes the ready queue, and vice versa). This works excellently for an ‘initial’ testing algorithm for an operating system, but does not offer any priorities or have any sort of scheduling heuristics.
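A minimal sketch of that two-queue round-robin (names are mine):

```python
from collections import deque

class RoundRobin:
    """Two FIFO queues: 'ready' holds threads awaiting a turn, each
    scheduled thread moves to 'already', and when 'ready' empties the
    queues swap roles."""

    def __init__(self, threads):
        self.ready = deque(threads)
        self.already = deque()

    def next_thread(self):
        if not self.ready:
            # everyone has had a turn: swap the queues and start over
            self.ready, self.already = self.already, self.ready
        thread = self.ready.popleft()
        self.already.append(thread)
        return thread
```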
To add priorities, it is possible to create a list of queues, and there are various other improvements that can be made to round-robin scheduling as well. In FORGE however, I have decided to take the current round-robin scheduler and replace it with a ‘feedback’ scheduler. This is a very powerful scheduler type that can dynamically respond to the changing requirements of the system as time goes on. Consider a general purpose operating system under a reasonable load. There are a number of threads in a number of processes and each is performing various tasks. Some threads are performing heavy I/O, while others are heavily using the CPU. Others still are mixing the two. This variety in workloads is very common and also very difficult to handle ‘well’.
A feedback scheduler works off the feedback from the threads it runs (hence the name). If a thread is interrupted by the operating system for using its entire timeslice, the operating system can generally assume the thread is using the CPU quite a bit. If however a thread gives up its timeslice to the operating system (by yielding, or implicitly by performing I/O), the operating system can determine that the thread is probably more I/O-heavy. In a sense, the feedback the threads give to the scheduler allows it to determine whether the threads are using their resources effectively.
So how does a feedback scheduler respond to the feedback? Well, if we consider that each thread has a priority (and a base priority, which is its ‘default’ and starting priority), the scheduler can both punish threads that use too much CPU and reward threads that do not. So, if a thread completes its timeslice before being rescheduled, the scheduler reduces its priority. If a thread does not complete its timeslice before being rescheduled, the scheduler increases its priority. This balances the system by prioritising I/O tasks (which end up blocking while the CPU-heavy tasks do their thing). Then, if a CPU-heavy task needs to block and wait for I/O, its priority can again be increased, and the system is responsive and continues to be usable even during this common workload.
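The punish/reward rule itself is tiny in code. A sketch, where the priority range and step size are arbitrary choices of mine, not FORGE's:

```python
MIN_PRIORITY, MAX_PRIORITY = 0, 7   # higher number = higher priority

class Thread:
    def __init__(self, name, base_priority=4):
        self.name = name
        self.base_priority = base_priority  # the 'default' starting priority
        self.priority = base_priority

    def on_timeslice_exhausted(self):
        # pre-empted by the timer: punish the CPU-heavy thread
        self.priority = max(MIN_PRIORITY, self.priority - 1)

    def on_yield(self):
        # gave up the CPU early (blocked on I/O, or yielded): reward it
        self.priority = min(MAX_PRIORITY, self.priority + 1)
```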
Conceptually this is very easy to understand, but the actual implementation in code (or, perhaps, data structures) is a challenge. At this stage, this has not been implemented into FORGE, so no code is available for demonstration. However, the planned implementation will consist of a combination of binary trees and first-in-first-out queues. Remember earlier the discussion about grouping threads by process; by using a binary tree containing each process ID, we can traverse the tree completely and then load the queue of threads on each node (ie, process). This works for selecting the next thread to schedule.
Now, for the priorities, a simple array suffices; with the index being the priority.
So, for scheduling, we end up with an array, containing binary trees for each priority level, which then contain queues for each process.
This enables a reasonably efficient lookup (assuming iterators are sensible for trees, and that lookup costs and the like are negligible for linked lists), groups threads by process, and enables the feedback system to work correctly. The outcome is a reasonably-well balanced system with priority given to I/O.
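A sketch of that combined structure, with a sorted mapping standing in for the per-priority binary tree (illustrative only – as noted above, no FORGE code exists for this yet):

```python
from collections import deque

NUM_PRIORITIES = 8

class RunQueues:
    """Array indexed by priority; each slot maps process ID to a FIFO
    of that process's runnable threads. Traversal order - highest
    priority first, processes in key order, FIFO within a process -
    keeps same-process threads together."""

    def __init__(self):
        self.levels = [dict() for _ in range(NUM_PRIORITIES)]

    def add(self, priority, pid, thread):
        self.levels[priority].setdefault(pid, deque()).append(thread)

    def pick_next(self):
        for level in reversed(self.levels):   # highest priority first
            for pid in sorted(level):         # stand-in for tree traversal
                queue = level[pid]
                if queue:
                    return pid, queue.popleft()
        return None                           # nothing runnable
```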
As mentioned previously, scheduler design is a unique challenge, and this particular type of scheduler is not necessarily the ideal solution for, say, a real-time operating system. For reference, a similar design is used in OSX, FreeBSD, NetBSD, and Windows NT-based operating systems. Perhaps at a later date I will investigate and discuss an O(1) scheduler type. I’ll blog again later with the intricate details of actually implementing this algorithm.