Lottery and stride scheduling : umbrant

Lottery and stride scheduling

July 17, 2011

Today is a shorter post than previous topics, since I didn't want to lump the last paper (Paxos, ick) in with these. I'm covering lottery and stride scheduling, two very related approaches to doing efficient proportional-share scheduling. I believe this is the canonical way of doing things, since mClock (by Gulati et al., presented at OSDI 2010) used stride scheduling successfully to schedule disk I/O in a hypervisor.

"Lottery Scheduling: Flexible Proportional-Share Resource Management", Waldspurger and Weihl, 1994
"Stride Scheduling: Deterministic Proportional-Share Resource Management", Waldspurger and Weihl, 1995

Background

The basic problem for proportional-share scheduling is, given a set of processes that have been assigned relative weights (e.g. 3:2:1), schedule the processes with some kind of quantum such that they all get their assigned proportion of CPU time (e.g. 1/2rd, 1/3rd, 1/6th). A naive way of doing this is to simply schedule each process for weight number of scheduling quantums, but this penalizes processes that do not use their entire allocation (for instance, if they block on I/O early) and results in unfairness at small timescales (think of a 100:1:1 weighting). It's also desired that scheduling is responsive to changes in priorities.

These are some of the practical problems that any good proportional share scheduler has to solve.

Lottery Scheduling

Waldspurger and Weihl's first approach is a probabilistically fair one. Processes are assigned a number of tickets based on their relative weight (bigger weight=more tickets), and the scheduler holds a lottery each scheduling quantum (choosing a random ticket) where the winner gets scheduled. A quantum is small unit of time, in this case 10ms, which is the smallest unit that the scheduler will assign to a process. This means processes with more tickets get scheduled more often, and over time the actual scheduling should probabilistically approaches the desired relative weighting between the processes with standard deviation proportional to sqrt(n).

This can be implemented by logically storing the number of tickets of each processes in an array, and keeping a running sum of tickets as one traverses the array, advancing until the winning ticket is found. This is an O(n) operation, but sorting the array in descending order with insertion sort (or, as the authors do it, the "move to front" heuristic) can make this a lot faster. For large n, it's better to take an approach with a tree and binary search toward the winning leaf node process.

If a process ends early or runs over its quantum, the size of its ticket pool gets adjusted until it gets scheduled again with a compensation ticket. This compensation ticket is valued at (1/f)*num_tickets, where f is the fraction of its allocated time that it actually used. This will increase or decrease its likelihood of getting scheduled appropriately.

It's relatively straightforward to build a hierarchical scheduling system via what the authors term ticket currencies: different types of tickets that are backed by other tickets. In this way, groups of processes can be weighted based on one currency, and then the members of each group weighted based on another currency. This is basically just a nice management feature; in the end, everything gets translated back into the base currency.

They also have this idea of letting clients pass their ticket allocations off to a server, which is kind of cool when combined with a microkernel where everything is a server. It means you can give tickets only to clients, and make it so all server requests have to be paid for with a ticket allocation. Then, tickets accurately capture all work done in the system on behalf of a client process. My random idea.

They also have a similar idea with a priority inheritance scheme of sorts for locks, where all the tickets of processes waiting on a lock are passed to the process currently holding the lock. They also test lottery scheduling lock acquisition, which seems of somewhat lower utility to me. I don't know how often I would ever want to use this, as a programmer.

Lottery scheduling isn't really that responsive to dynamic changes, since it takes time to converge. Its probabilistic nature also means that it can't really give predictable performance, meaning that it might be decently suited for "average throughput" schedulers, but probably sucks for interactive uses where you want SLA-like performance guarantees.

Stride Scheduling

The follow up paper the year after solves many of the problems with lottery scheduling, by introducing the concept of stride scheduling, which has all the benefits of lottery scheduling while also being deterministic, responsive to dynamic priorities, and better error properties. This is slightly different from how it's presented in the paper, but I think it makes more sense.

This is something most easily explained visually, but the basic idea is the concept of virtual time, where each process has a clock that ticks at a different rate depending on its priority. High priority clocks tick slowly, while low priority clocks tick quickly. The rate at which a clock ticks is called its stride. The scheduler makes decisions by finding the clock with the oldest time, scheduling the corresponding process, and then advancing the clock by the clock's stride. This is implemented simply by keeping the clocks in ascending sorted order via a heap, insertion sort, or something like that.

Fractional quantums are handled via a multiplicative compensation factor: simply multiplying the stride by the fraction of the quantum used, before advancing the local clock.

There is also the concept of global virtual time that advances at the rate of the slowest possible tick (~1/sum(tickets)). This is used to calculate a compensation factor, remain, used to compensate a process for time spent waiting when there is a dynamic change. remain is the amount of virtual time until a process would next be scheduled, i.e. the difference between the global virtual time and the local virtual time. When the process re-enters the system, its local time is set to global_time+remain. In this way, if the process waited to be allocated before leaving (remain<stride) it gets scheduled sooner. The opposite happens if the process previously got an early allocation.

Since the stride and remain are both related to the size of the ticket allocation, in the case of a dynamic change, the stride is recomputed and used to scale remain appropriately to immediate reflect the new allocation.

This is extremely predictable since scheduling is deterministic, and processes are guaranteed to be scheduled at least once every complete cycle of virtual time (where a cycle is the slowest stride). This is a major boon for responsiveness since we no longer have to wait for probabilities to converge. Glancing at the evaluation section, I see major improvements to predictability, responsiveness to priority changes, and accuracy. The same ideas of ticket currencies and ticket passing also apply to stride scheduling.