Module 1

What Does a Scheduler Actually Do?

Imagine a traffic controller at a busy airport. Dozens of planes are ready to land, but only a handful of runways are available. The controller decides which plane lands next, for how long it uses the runway, and when to wave it off so someone else gets a turn. That is what a scheduler does — except the planes are threads and the runways are CPUs.

Source File Map

File	Purpose
`kern/sched_4bsd.c`	The traditional 4BSD scheduler implementation
`kern/sched_ule.c`	The ULE scheduler (default since FreeBSD 7.1)
`kern/sched_shim.c`	Pluggable scheduler framework (FreeBSD 16+)
`sys/sched.h`	Scheduler API — public function declarations
`kern/kern_switch.c`	Run queue operations and `mi_switch()`
`sys/runq.h`	Run queue data structure definitions
`kern/kern_synch.c`	Sleep/wakeup, thread blocking and unblocking
`amd64/amd64/cpu_switch.S`	AMD64 low-level context switch assembly
`arm64/arm64/swtch.S`	ARM64 low-level context switch assembly

The Scheduler API

The sched.h header defines the contract every scheduler must fulfill:

Function	What It Does
`sched_add()`	Place a thread on a run queue — "this thread is ready to run"
`sched_switch()`	Pick the next thread and switch to it
`sched_choose()`	Select the highest-priority runnable thread
`sched_clock()`	Periodic tick — update usage stats, check time slices
`sched_prio()`	Set a thread's priority
`sched_wakeup()`	Wake a sleeping thread and schedule it
`sched_fork()`	Initialize scheduling state for a new child thread
`sched_exit()`	Clean up scheduling state when a thread dies
`sched_affinity()`	Handle CPU affinity changes

The Priority Space

FreeBSD uses numeric priorities where lower = more important. The 256-value priority space is divided into classes:

Class	FreeBSD 10	FreeBSD 15	FreeBSD 16
PRI_ITHD (interrupts)	0–47	0–15	0–7
PRI_REALTIME	48–79	16–47	8–39
PRI_KERN (kernel)	80–119	48–87	40–55
PRI_TIMESHARE (user)	120–223	88–223	56–223
PRI_IDLE	224–255	224–255	224–255

💡

The Trend

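The interrupt and kernel ranges have narrowed across versions (interrupts from 48 levels to 8, kernel from 40 to 16), while the realtime range keeps its 32 levels and simply shifts down. The freed levels go to timeshare threads, which grow from 104 levels to 168. This reflects modern workloads where interactive responsiveness matters more than having many distinct interrupt priority levels.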
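To make the table concrete, here is a minimal sketch that maps a numeric priority to its class. It hard-codes the FreeBSD 10 column from the table above; the struct and function names are hypothetical helpers for illustration, not kernel identifiers.

```c
#include <stddef.h>

/* One priority class covering [lo, hi] inclusive. */
struct pri_band {
    const char *name;
    int lo;
    int hi;
};

/* Boundaries copied from the "FreeBSD 10" column of the table. */
static const struct pri_band fb10_bands[] = {
    { "PRI_ITHD",        0,  47 },
    { "PRI_REALTIME",   48,  79 },
    { "PRI_KERN",       80, 119 },
    { "PRI_TIMESHARE", 120, 223 },
    { "PRI_IDLE",      224, 255 },
};

#define FB10_NBANDS (sizeof(fb10_bands) / sizeof(fb10_bands[0]))

/* Hypothetical helper: index of the class containing pri, or -1. */
int
pri_band_index(int pri)
{
    size_t i;

    for (i = 0; i < FB10_NBANDS; i++) {
        if (pri >= fb10_bands[i].lo && pri <= fb10_bands[i].hi)
            return ((int)i);
    }
    return (-1);
}

/* Number of distinct priority levels in class i. */
int
pri_band_width(int i)
{
    return (fb10_bands[i].hi - fb10_bands[i].lo + 1);
}
```

With these boundaries, priority 130 falls in index 3 (PRI_TIMESHARE), and the timeshare class spans 104 levels. Swapping in another column of the table only changes the array contents.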

SRQ Flags

When a thread is added to a run queue, flags describe why it is being added:

CODE


#define SRQ_BORING    0x0000   /* No special circumstances */
#define SRQ_YIELDING  0x0001   /* Thread is yielding voluntarily */
#define SRQ_OURSELF   0x0002   /* Adding ourselves to the run queue */
#define SRQ_INTR      0x0004   /* Wakeup is interrupt-driven (urgent) */
#define SRQ_PREEMPTED 0x0008   /* Thread was preempted */
#define SRQ_BORROWING 0x0010   /* Priority updated due to lending */
#define SRQ_HOLD      0x0020   /* Return holding original td lock (14+) */
#define SRQ_HOLDTD    0x0040   /* Return holding td lock (14+) */

PLAIN ENGLISH

SRQ_BORING: The default — just a regular thread being placed on the queue, nothing unusual.

SRQ_YIELDING: The thread said "I'm done for now, let someone else go." The scheduler places it at the back of its priority queue.

SRQ_INTR: This thread was woken by a hardware interrupt (like a network packet arriving). It may need to run urgently.

SRQ_PREEMPTED: A higher-priority thread showed up, so this one was forcibly pulled off the CPU. When re-queued, it goes to the head of its priority queue so it can resume before its peers.

SRQ_HOLD / SRQ_HOLDTD (FreeBSD 14+): Lock-management flags that control which locks are held when the function returns — important for avoiding lock-order violations.

A web server thread is sleeping, waiting for a network packet. The packet arrives and triggers a hardware interrupt. Which SRQ flag will be used when adding this thread back to the run queue?

Module 2

Meet the 4BSD Scheduler

The 4BSD scheduler is the old guard — the scheduling algorithm that dates back to original BSD Unix. Its approach is elegant in its simplicity: track how much CPU time each thread has used recently, and gradually lower the priority of CPU-hungry threads so interactive programs stay responsive.

The Priority Decay Formula

Every time a clock tick fires, the 4BSD scheduler asks: "How much CPU has this thread used?" and adjusts its priority accordingly:

newpriority = PUSER + (ts_estcpu / INVERSE_ESTCPU_WEIGHT) + NICE_WEIGHT × (p_nice − PRIO_MIN)

PUSER — base priority for timeshare threads
ts_estcpu — estimated recent CPU usage (higher = used more CPU)
INVERSE_ESTCPU_WEIGHT — controls how strongly CPU usage affects priority (typically 8)
p_nice — user-settable "niceness" value (−20 to +20)
The two remaining terms: NICE_WEIGHT scales how many priority steps each nice level is worth, and PRIO_MIN is the lowest nice value (−20), so the nice term is never negative.

💡

The Key Insight

The more CPU you use, the higher your priority number becomes — which means lower actual priority. CPU hogs naturally sink to the back of the line.
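The formula can be tried out with a few numbers. The sketch below is not the kernel's resetpriority(); it assumes PUSER = 120 and an upper bound of 223 (the FreeBSD 10 timeshare range from Module 1), a weight of 8, and NICE_WEIGHT = 1. The clamp to the timeshare range and all `SK_` names are assumptions of this example.

```c
/* Assumed values; see the lead-in. */
#define SK_PUSER                 120
#define SK_PRI_MAX_TIMESHARE     223
#define SK_INVERSE_ESTCPU_WEIGHT 8
#define SK_NICE_WEIGHT           1
#define SK_PRIO_MIN              (-20)

/* Compute a timeshare priority from CPU usage and niceness. */
int
resetpriority_sketch(unsigned int estcpu, int nice)
{
    int pri;

    pri = SK_PUSER + (int)(estcpu / SK_INVERSE_ESTCPU_WEIGHT) +
        SK_NICE_WEIGHT * (nice - SK_PRIO_MIN);

    /* Keep the result inside the timeshare class (assumed clamp). */
    if (pri < SK_PUSER)
        pri = SK_PUSER;
    if (pri > SK_PRI_MAX_TIMESHARE)
        pri = SK_PRI_MAX_TIMESHARE;
    return (pri);
}
```

Under these assumptions, an idle thread at nice 0 gets priority 140, the same thread with ts_estcpu = 80 sinks to 150, and renicing it to −20 lifts it back to 130.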

CPU Usage Decay (schedcpu)

Every second, the schedcpu() function applies exponential decay to each thread's CPU usage estimate:

ts_estcpu = (2 × loadavg × ts_estcpu) / (2 × loadavg + FSCALE)

This means old CPU usage is gradually "forgotten." Under high load, decay happens more slowly — the system remembers CPU-hungry threads longer when resources are scarce.
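A worked example shows how the load average changes the decay rate. This sketch assumes the load average is stored in fixed point with FSCALE = 2048 (a shift of 11), so 2048 represents a load of 1.0; the function and macro names are hypothetical.

```c
#include <stdint.h>

#define SK_FSHIFT 11                  /* assumed fixed-point shift */
#define SK_FSCALE (1u << SK_FSHIFT)   /* 2048 represents a load of 1.0 */

/*
 * Apply one decay pass to estcpu.
 * loadavg_fp is the load average in fixed point (load * SK_FSCALE).
 */
uint32_t
decay_estcpu(uint32_t loadavg_fp, uint32_t estcpu)
{
    uint64_t loadfac = 2ull * loadavg_fp;

    return ((uint32_t)((loadfac * estcpu) / (loadfac + SK_FSCALE)));
}
```

With a load of 1.0 the multiplier is 2/3, so an estimate of 90 decays to 60 in one pass. With a load of 4.0 the multiplier is 8/9 and the same estimate only decays to 80, which is the "slower forgetting under load" effect described above.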

The Clock Tick Handler

CODE


/* Called on every scheduler clock tick */
ts->ts_cpticks++;
ts->ts_estcpu = ESTCPULIM(ts->ts_estcpu + 1);
if ((ts->ts_estcpu % INVERSE_ESTCPU_WEIGHT) == 0)
    resetpriority(td);

PLAIN ENGLISH

Increment the tick counter (ts_cpticks) — this tracks raw clock ticks for the current scheduling window.

Add 1 to the estimated CPU usage (ts_estcpu) and clamp it to a maximum value so it does not overflow.

Every Nth tick (where N = INVERSE_ESTCPU_WEIGHT, typically 8), recalculate the thread's priority using the newpriority formula above.

Priority does not change on every tick — it is batched for efficiency. The thread's position in the run queue only shifts every 8 ticks.
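The batching is easy to see in a small user-space simulation of the handler. This is a sketch only: the struct, the clamp value of 255, and the recalculation counter (which stands in for the call to resetpriority()) are all assumptions of the example.

```c
#define SK_INVERSE_ESTCPU_WEIGHT 8      /* "typically 8" per the text */
#define SK_ESTCPU_MAX            255u   /* hypothetical clamp for ESTCPULIM */

struct sk_td_sched {
    int      cpticks;   /* raw ticks in the current window */
    unsigned estcpu;    /* estimated recent CPU usage */
    int      recalcs;   /* how many times priority was recomputed */
};

/* One scheduler clock tick, mirroring the three steps of the handler. */
void
sk_sched_clock(struct sk_td_sched *ts)
{
    ts->cpticks++;
    if (ts->estcpu < SK_ESTCPU_MAX)
        ts->estcpu++;
    if ((ts->estcpu % SK_INVERSE_ESTCPU_WEIGHT) == 0)
        ts->recalcs++;   /* stands in for resetpriority(td) */
}

/* Run nticks ticks on a fresh thread and report the recalculation count. */
int
sk_recalcs_after(int nticks)
{
    struct sk_td_sched ts = { 0, 0, 0 };
    int i;

    for (i = 0; i < nticks; i++)
        sk_sched_clock(&ts);
    return (ts.recalcs);
}
```

Starting from zero, 7 ticks trigger no recalculation, the 8th triggers the first, and 32 ticks trigger four in total.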

Per-Thread State: `struct td_sched`

The 4BSD scheduler's per-thread state has evolved across versions:

FreeBSD 10


struct td_sched {
    fixpt_t     ts_pctcpu;
    int         ts_cpticks;
    int         ts_slptime;
    int         ts_flags;
    struct runq *ts_runq;
};
/* FreeBSD 11 added: */
    int         ts_slice; /* Remaining ticks in quantum */
/* FreeBSD 12+ added: */
    u_int       ts_estcpu; /* Estimated CPU utilization */

PLAIN ENGLISH

ts_pctcpu: Percentage of CPU used, shown in ps output as the %CPU column.

ts_cpticks: Raw clock ticks consumed in the current window. Feeds the ts_pctcpu estimate.

ts_slptime: How long this thread has been sleeping. After sleeping, the next wakeup gets a priority boost.

ts_slice (FB11+): Remaining ticks before the scheduler forces a switch. Adds explicit time-slicing.

ts_estcpu (FB12+): Moved from the process struct into the per-thread struct for finer-grained accounting.

Run Queue Organization

On SMP systems, there are two pools of queues: a global queue for threads without CPU preferences, and per-CPU queues for threads pinned to specific CPUs.

4BSD Feature Matrix

Feature	10	11	12	13	14	15	16
ts_slice field	—	✓	✓	✓	✓	✓	✓
TDF_SLICEEND flag	—	✓	✓	✓	✓	✓	✓
ts_estcpu in td_sched	—	—	✓	✓	✓	✓	✓
sched_clock(td, cnt)	—	—	—	—	✓	✓	✓
sched_switch(td, flags)	—	—	—	—	✓	✓	✓
HWT hooks	—	—	—	—	—	—	✓
sched_4bsd_* naming	—	—	—	—	—	—	✓
struct sched_instance	—	—	—	—	—	—	✓

A thread has been using a lot of CPU. What happens to its `ts_estcpu` value during the `schedcpu()` decay pass?

Module 3

Meet the ULE Scheduler

ULE (pronounced "you-lee") replaced the 4BSD scheduler as the default in FreeBSD 7.1. Where 4BSD treats SMP as an afterthought, ULE was designed from the ground up for multi-core systems.

Per-CPU Design

The fundamental insight of ULE is that each CPU gets its own scheduler state: a struct tdq with its own run queues (realtime, timeshare, and idle through FreeBSD 15).

💡

Scalability Win

On a 64-core server, 64 CPUs can schedule threads simultaneously without contending on a single lock — a massive scalability win over 4BSD's global queue approach.

Interactivity Detection

ULE automatically classifies threads as interactive or batch-oriented by measuring their voluntary sleep patterns. The magic happens in sched_interact_score():

CODE


static int
sched_interact_score(struct thread *td) {
    struct td_sched *ts = td->td_sched;
    int div;
    if (ts->ts_runtime >= ts->ts_slptime) {
        div = max(1, ts->ts_runtime / SCHED_INTERACT_HALF);
        return (SCHED_INTERACT_HALF +
            (SCHED_INTERACT_HALF - (ts->ts_slptime / div)));
    }
    div = max(1, ts->ts_slptime / SCHED_INTERACT_HALF);
    return (ts->ts_runtime / div);
}

PLAIN ENGLISH

Get the thread's scheduler state, which tracks time spent running vs. sleeping.

If the thread has spent more time running than sleeping, it gets a high score (less interactive — it is a CPU hog).

If the thread sleeps more than it runs, it gets a low score (more interactive — waiting for user input or I/O).

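SCHED_INTERACT_HALF is the midpoint of the score range: a thread that runs more than it sleeps scores above it, and a thread that sleeps more scores below it. The cutoff for interactive treatment is stricter than the midpoint; the comparison table in Module 7 gives it as a score below 30.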

Interactive threads receive a priority boost, keeping them responsive even under heavy load.
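The scoring logic above can be run in user space with plain integers. The sketch below copies the simplified function but takes runtime and sleep time as arguments; it assumes a score range of 0 to 100 (so SCHED_INTERACT_HALF is 50) and arbitrary time units. The `SK_` names are hypothetical.

```c
#define SK_INTERACT_MAX  100                      /* assumed score range */
#define SK_INTERACT_HALF (SK_INTERACT_MAX / 2)

static int
sk_max(int a, int b)
{
    return (a > b ? a : b);
}

/* Simplified interactivity score: low = interactive, high = CPU-bound. */
int
interact_score_sketch(int runtime, int slptime)
{
    int div;

    if (runtime >= slptime) {
        /* Mostly running: score lands in the upper half. */
        div = sk_max(1, runtime / SK_INTERACT_HALF);
        return (SK_INTERACT_HALF +
            (SK_INTERACT_HALF - (slptime / div)));
    }
    /* Mostly sleeping: score lands in the lower half. */
    div = sk_max(1, slptime / SK_INTERACT_HALF);
    return (runtime / div);
}
```

A thread with 100 units of runtime and 900 of sleep scores 5, the reverse split scores 95, and an even 500/500 split scores exactly 50. Because the arithmetic uses integer division, the exact score depends on the magnitude of the inputs as well as their ratio.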

Work Stealing & Load Balancing

1

Local check

Is my own queue empty? If no, run the highest-priority thread. No cross-CPU coordination needed.

2

Steal threshold

Check if any other CPU's tdq_load exceeds the load balancing threshold. If so, steal from the busiest.

3

Topology-aware

Prefer stealing from the closest CPUs first: SMT siblings on the same core, then the same physical package, then the same NUMA domain. The farther a thread migrates, the more of its cached working set it loses.
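The first two steps can be sketched as a search for the busiest CPU above a threshold. This is a simplification under stated assumptions: loads are plain integers in an array, topology is ignored, and the function name and signature are hypothetical rather than ULE's own.

```c
/*
 * Pick a CPU to steal from: the most loaded CPU other than `self`
 * whose load exceeds `thresh`. Returns its index, or -1 if none qualifies.
 */
int
pick_steal_victim(const int *load, int ncpu, int self, int thresh)
{
    int i, victim = -1;

    for (i = 0; i < ncpu; i++) {
        if (i == self || load[i] <= thresh)
            continue;   /* never steal from ourselves or a light queue */
        if (victim == -1 || load[i] > load[victim])
            victim = i;
    }
    return (victim);
}
```

A topology-aware version would run the same search over progressively wider groups of CPUs and stop at the first group that yields a victim.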

ULE Feature Matrix

Feature	10	11	12	13	14	15	16
3 run queues per CPU	✓	✓	✓	✓	✓	✓	—
1 unified run queue	—	—	—	—	—	—	✓
sched_slice_min	—	—	✓	✓	✓	✓	✓
always_steal tunable	—	—	—	—	✓	✓	✓
Cache-padded lock	—	—	—	—	✓	✓	✓
tdq_curthread field	—	—	—	—	—	✓	✓
Lockless TDQ accessors	—	—	—	—	—	—	✓
kern.sched.ule.* namespace	—	—	—	—	—	—	✓

A thread spends 90% of its time sleeping (waiting for user keystrokes) and only 10% running. What does ULE's `sched_interact_score()` return for it?

Module 4

The Run Queue

The run queue is the central data structure that holds all threads ready to execute. Its design has evolved significantly between FreeBSD versions, with the most dramatic change arriving in FreeBSD 16.

Data Structure Evolution

FreeBSD 10–15


#define RQ_NQS      64    /* 64 run queues */
#define RQ_PPQ       4    /* 4 priorities per queue */

struct rqbits {
    rqb_word_t rqb_bits[RQB_LEN];
};
struct runq {
    struct rqbits  rq_status;
    struct rqhead  rq_queues[RQ_NQS];
};

PLAIN ENGLISH

64 queues, each covering 4 priority values. Priority 0–3 → queue 0, priority 4–7 → queue 1, and so on.

rq_status: a bitmask with one bit per queue — set means "this queue has threads, check it."

Finding the highest-priority runnable thread = finding the lowest-set bit in rq_status. That is a single bsf (bit-scan forward) instruction — O(1).

The downside: 4 different priorities share a queue, so a priority-5 thread might wait behind priority-7 threads at the same queue slot.

FreeBSD 16


#define RQ_NQS      256   /* 256 run queues */
#define RQ_PPQ      1     /* 1 priority per queue */

typedef unsigned long rqsw_t;
#define RQSW_NB     4     /* 4 × 64-bit = 256 bits */

struct rq_status {
    rqsw_t rq_sw[RQSW_NB];
};
struct runq {
    struct rq_status rq_status;
    struct rq_queue  rq_queues[RQ_NQS];
};

PLAIN ENGLISH

256 queues — one per priority level. Priority 5 → queue 5, no sharing.

rq_status: now 4 × 64-bit words = 256 bits total. One bit per queue, just more of them.

Finding the highest-priority thread: scan the 4 words with __builtin_ctzl() (count trailing zeros) — still effectively O(1).

✅ Eliminates intra-band priority inversion: no thread with priority 5 waits behind priority 7 in the same slot.
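The four-word scan can be sketched in portable C. This example assumes 64-bit status words and uses the GCC/Clang builtin `__builtin_ctzll`; the struct and function names are hypothetical stand-ins, not the kernel's `runq_sw_*()` functions.

```c
#include <stdint.h>

#define SK_RQ_NQS   256                          /* one queue per priority */
#define SK_RQSW_BPW 64                           /* bits per status word */
#define SK_RQSW_NB  (SK_RQ_NQS / SK_RQSW_BPW)    /* 4 words */

struct sk_rq_status {
    uint64_t sw[SK_RQSW_NB];
};

/* Mark queue qi as non-empty. */
void
sk_status_set(struct sk_rq_status *st, int qi)
{
    st->sw[qi / SK_RQSW_BPW] |= (uint64_t)1 << (qi % SK_RQSW_BPW);
}

/* Index of the lowest-numbered non-empty queue, or -1 if all are empty. */
int
sk_status_first(const struct sk_rq_status *st)
{
    int w;

    for (w = 0; w < SK_RQSW_NB; w++) {
        if (st->sw[w] != 0)
            return (w * SK_RQSW_BPW + __builtin_ctzll(st->sw[w]));
    }
    return (-1);
}

/* Convenience wrapper: mark n queues, then report the first non-empty one. */
int
sk_first_nonempty(const int *queues, int n)
{
    struct sk_rq_status st = { { 0 } };
    int i;

    for (i = 0; i < n; i++)
        sk_status_set(&st, queues[i]);
    return (sk_status_first(&st));
}
```

Marking queues 200, 70, and 130 yields 70 as the first non-empty queue: word 0 is empty, and the lowest set bit in word 1 is bit 6, giving 64 + 6. The loop runs at most four times regardless of how many threads are queued.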

Run Queue Comparison

Aspect	FreeBSD 10–15	FreeBSD 16
Queues	64	256
Priorities per queue	4	1
Status tracking	`struct rqbits` (platform-specific)	`struct rq_status` with `rqsw_t` array
Bit operations	Direct `rqb_bits` manipulation	`runq_sw_*()` abstraction functions
Priority mapping	`pri / RQ_PPQ` = queue index	1:1 mapping

Core Queue Operations

CODE


/* Conceptual runq_add (simplified) */
void
runq_add(struct runq *rq, struct thread *td, int flags) {
    int pri = td->td_priority;
    int qi = pri / RQ_PPQ; /* FB10-15: pri/4  FB16: pri/1 */
    struct rq_queue *rqh = &rq->rq_queues[qi];
    if (flags & SRQ_PREEMPTED)
        TAILQ_INSERT_HEAD(rqh, td, td_runq); /* resume before peers */
    else
        TAILQ_INSERT_TAIL(rqh, td, td_runq); /* wait behind peers */
    runq_setbit(&rq->rq_status, qi);
}

PLAIN ENGLISH

Convert the thread's numeric priority to a queue index. In FB10-15, divide by 4 so priorities 0–3 share queue 0. In FB16, it is 1:1.

Get a pointer to the target queue head in the run queue array.

If the thread was preempted (forced off the CPU before it finished its turn), add it to the head so it resumes before other threads in the same queue.

Otherwise (yielding or newly runnable), add it to the tail so it waits behind the threads already queued there.

Set the corresponding bit in the status bitmask to indicate this queue is non-empty. Enables O(1) queue selection via bit-scan instructions.

💡

Key Insight: RQ_PPQ=1

With RQ_PPQ=1 in FB16, every priority gets its own queue. This eliminates "priority inversion within a band" — no thread with priority 5 has to wait behind a thread with priority 7 in the same queue slot.
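The effect of RQ_PPQ can be reproduced with a toy run queue. This sketch makes several simplifying assumptions: threads are represented only by their priority, every insert goes to the tail, each queue holds at most 8 entries, and selection is a linear scan rather than a bitmask lookup. All names are hypothetical.

```c
#define SK_NPRI 256   /* size of the priority space */
#define SK_QCAP 8     /* hypothetical per-queue capacity for this sketch */

struct sk_runq {
    int ppq;                      /* priorities per queue (4 or 1) */
    int count[SK_NPRI];           /* threads waiting in each queue */
    int prio[SK_NPRI][SK_QCAP];   /* per-queue FIFO of thread priorities */
};

void
sk_runq_init(struct sk_runq *rq, int ppq)
{
    int qi;

    rq->ppq = ppq;
    for (qi = 0; qi < SK_NPRI; qi++)
        rq->count[qi] = 0;
}

/* Append a thread to the tail of its queue; returns the queue index. */
int
sk_runq_add(struct sk_runq *rq, int pri)
{
    int qi = pri / rq->ppq;

    if (rq->count[qi] == SK_QCAP)
        return (-1);
    rq->prio[qi][rq->count[qi]++] = pri;
    return (qi);
}

/* Remove and return the priority of the next thread to run, or -1. */
int
sk_runq_choose(struct sk_runq *rq)
{
    int qi, i, pri;

    for (qi = 0; qi < SK_NPRI / rq->ppq; qi++) {
        if (rq->count[qi] == 0)
            continue;
        pri = rq->prio[qi][0];
        for (i = 1; i < rq->count[qi]; i++)
            rq->prio[qi][i - 1] = rq->prio[qi][i];
        rq->count[qi]--;
        return (pri);
    }
    return (-1);
}

/* Queue two threads in order and report which one runs first. */
int
sk_first_to_run(int ppq, int first_pri, int second_pri)
{
    struct sk_runq rq;

    sk_runq_init(&rq, ppq);
    sk_runq_add(&rq, first_pri);
    sk_runq_add(&rq, second_pri);
    return (sk_runq_choose(&rq));
}
```

Queue a priority-7 thread and then a priority-5 thread. With 4 priorities per queue both land in queue 1 and the priority-7 thread runs first; with 1 priority per queue the priority-5 thread wins, as the priority order says it should.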

How does FreeBSD 16 track which of its 256 run queues are non-empty?

Module 5

Context Switching

Every time the scheduler picks a new thread, the CPU needs to save the old thread's register state and load the new thread's state. This is the context switch, one of the most performance-critical paths in the kernel.

What Gets Saved?

AMD64 (x86-64)

Register Class	Registers	Why
Callee-saved GPRs	`r12, r13, r14, r15, rbp, rsp, rbx`	ABI requires these survive function calls
Instruction pointer	`rip` (via return address on stack)	Resume execution at the right place
FPU/SIMD state	xsave area (or fxsave on older CPUs)	AVX/SSE registers — lazily saved only if used

ARM64 (AArch64)

Register Class	Registers	Why
Callee-saved GPRs	`x19–x28, lr (x30)`	ARM64 calling convention preserves these
Stack & frame	`sp, fp (x29)`	Each thread has its own kernel stack
VFP/NEON state	Floating-point registers	Lazily saved when next used
PAC keys	`ptrauth_switch()`	Pointer authentication keys (security feature)

💡

Lazy FPU Saving

Both architectures use lazy FPU saving: the FPU state is only saved/restored when a thread actually uses floating-point instructions. This avoids saving 512+ bytes of SIMD state on every context switch — most kernel threads never touch the FPU.

The `mi_switch()` Orchestrator

CODE


void
mi_switch(int flags) {
    struct thread *td = curthread;
    /* 1. Account for CPU time used (elided in this sketch) */
    /* 2. Ask the scheduler to pick the next thread; */
    /*    sched_switch() calls cpu_switch(old_td, new_td, lock) */
    sched_switch(td, flags);
    /* 3. We return here when scheduled again */
    td->td_oncpu = PCPU_GET(cpuid);
}

PLAIN ENGLISH

Get the currently running thread (curthread is a per-CPU global — fast, no locking needed).

Call sched_switch(), which asks the scheduler to pick the next thread. The scheduler calls cpu_switch() under the hood.

The key insight: cpu_switch() does not return to us — it returns to the new thread's saved state. We only resume when some future scheduling decision picks us again.

When we are re-scheduled (maybe milliseconds or seconds later), we update our CPU ID to reflect which CPU we are now running on — we might have migrated to a different core!
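The "does not return until we are picked again" behavior has a user-space analogy in the POSIX ucontext API. The sketch below is not kernel code: `swapcontext()` plays the role of cpu_switch(), the two contexts stand in for two threads, and all `sk_` names are hypothetical. It assumes a platform that still ships `<ucontext.h>` (for example, Linux with glibc).

```c
#define _XOPEN_SOURCE 700   /* some platforms require this for <ucontext.h> */
#include <ucontext.h>

static ucontext_t sk_main_ctx, sk_worker_ctx;
static char sk_worker_stack[64 * 1024];
static int sk_trace[8];
static int sk_ntrace;

static void
sk_worker(void)
{
    sk_trace[sk_ntrace++] = 2;                    /* first time on the "CPU" */
    swapcontext(&sk_worker_ctx, &sk_main_ctx);    /* switch away ... */
    sk_trace[sk_ntrace++] = 4;                    /* ... and resume here later */
}

/* Returns the order of events packed into a decimal number. */
int
sk_switch_demo(void)
{
    int i, order = 0;

    sk_ntrace = 0;
    getcontext(&sk_worker_ctx);
    sk_worker_ctx.uc_stack.ss_sp = sk_worker_stack;
    sk_worker_ctx.uc_stack.ss_size = sizeof(sk_worker_stack);
    sk_worker_ctx.uc_link = &sk_main_ctx;   /* resume main when worker returns */
    makecontext(&sk_worker_ctx, sk_worker, 0);

    sk_trace[sk_ntrace++] = 1;
    swapcontext(&sk_main_ctx, &sk_worker_ctx);    /* returns only when switched back */
    sk_trace[sk_ntrace++] = 3;
    swapcontext(&sk_main_ctx, &sk_worker_ctx);    /* worker continues after its swap */

    for (i = 0; i < sk_ntrace; i++)
        order = order * 10 + sk_trace[i];
    return (order);
}
```

The function returns 1234: each call to `swapcontext()` appears to "pause" until the other side switches back, which is the same shape as a thread sitting inside mi_switch() until the scheduler chooses it again.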

Context Switch Overhead

1

Thread→Thread (same process) — ~1–3 μs

Only register save/restore. No TLB flush. Cheapest possible switch.

2

Process→Process — ~3–10 μs

Registers + page table switch + TLB flush. Some TLB entries may be preserved with PCID/ASID tags.

3

+ FPU state — add ~0.5–2 μs

Added if either thread uses FPU/AVX/NEON. The xsave area can be 512–8192 bytes depending on available extensions.

After `cpu_switch()` is called, when does the old thread resume executing?

Module 6

The Pluggable Scheduler Framework

Through FreeBSD 15, choosing between 4BSD and ULE was a compile-time decision: you had to rebuild the kernel to switch. FreeBSD 16 changes this with a pluggable scheduler framework that enables boot-time selection.

The Shim Architecture

💡

IFUNC Trampolines

The shim layer uses IFUNC trampolines — the same mechanism used for CPU-optimized string functions like memcpy(). The linker resolves each function pointer once at boot time. After that, each scheduler call goes directly to the implementation — no extra pointer dereference at runtime.

`struct sched_instance` — The Vtable

Each scheduler registers itself by filling in a structure of function pointers:

CODE


struct sched_instance {
    /* Thread lifecycle */
    void (*sched_fork)(struct thread *, struct thread *);
    void (*sched_exit)(struct proc *, struct thread *);
    /* Core scheduling */
    void (*sched_clock)(struct thread *, int);
    void (*sched_switch)(struct thread *, int);
    void (*sched_add)(struct thread *, int);
    struct thread *(*sched_choose)(void);
    void (*sched_wakeup)(struct thread *, int);
    /* Priority management */
    void (*sched_prio)(struct thread *, u_char);
    void (*sched_user_prio)(struct thread *, u_char);
    /* ~45 total: affinity, preempt, bind, load… */
};

PLAIN ENGLISH

This is a vtable — the same design pattern as C++ virtual functions, but in plain C.

Each scheduler (4BSD, ULE, or a future third-party one) fills in this struct with its own function pointers.

The kernel calls sched_add(). The shim resolves this to whichever scheduler's .sched_add pointer was set at boot.

With ~45 function pointers, the interface covers everything from creating threads (sched_fork) to making scheduling decisions (sched_choose).

How `DECLARE_SCHEDULER` Works

CODE


#define DECLARE_SCHEDULER(si)  \
    DATA_SET(schedulers, si); \
    SCHED_DEFINE_IFUNCS(si)

/* In sched_ule.c: */
static struct sched_instance ule_instance = {
    .sched_add     = sched_ule_add,
    .sched_switch  = sched_ule_switch,
    .sched_choose  = sched_ule_choose,
    /* ... all ~45 functions ... */
};
DECLARE_SCHEDULER(ule_instance);

/* Loader tunable — set in /boot/loader.conf: */
kern.sched.name="ULE"     # or "4BSD"

PLAIN ENGLISH

DECLARE_SCHEDULER does two things: registers the scheduler instance in a linker set, and generates IFUNC trampoline stubs.

DATA_SET(schedulers, si) places the instance in a special linker section so the kernel can discover all available schedulers at boot.

SCHED_DEFINE_IFUNCS generates one IFUNC resolver per scheduler function. At boot, the linker resolves each to the active scheduler's function pointer.

Setting kern.sched.name=ULE in /boot/loader.conf selects the scheduler — no recompilation needed.
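The register-then-select-by-name pattern can be sketched in ordinary C. This example makes loud simplifications: a static array stands in for the linker set, a plain function pointer call stands in for the IFUNC trampoline, and the two toy policies ("PRIO" and "FIFO") are invented for illustration and do not model 4BSD or ULE.

```c
#include <stddef.h>
#include <string.h>

/* A one-entry vtable: each scheduler supplies its own choose(). */
struct sk_sched_instance {
    const char *name;
    int (*choose)(const int *prio, int n);   /* index of thread to run */
};

/* Toy policy 1: pick the thread with the lowest priority number. */
static int
sk_prio_choose(const int *prio, int n)
{
    int i, best = -1;

    for (i = 0; i < n; i++) {
        if (best == -1 || prio[i] < prio[best])
            best = i;
    }
    return (best);
}

/* Toy policy 2: always pick the first thread in the list. */
static int
sk_fifo_choose(const int *prio, int n)
{
    (void)prio;
    return (n > 0 ? 0 : -1);
}

/* Stand-in for the linker set of registered schedulers. */
static const struct sk_sched_instance sk_schedulers[] = {
    { "PRIO", sk_prio_choose },
    { "FIFO", sk_fifo_choose },
};

static const struct sk_sched_instance *sk_active = NULL;

/* Select a scheduler by name, as a loader tunable would. 0 on success. */
int
sk_sched_select(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(sk_schedulers) / sizeof(sk_schedulers[0]); i++) {
        if (strcmp(sk_schedulers[i].name, name) == 0) {
            sk_active = &sk_schedulers[i];
            return (0);
        }
    }
    return (-1);
}

/* The "shim": callers use this without knowing which policy is active. */
int
sk_sched_choose(const int *prio, int n)
{
    return (sk_active != NULL ? sk_active->choose(prio, n) : -1);
}
```

Given threads with priorities 130, 90, and 200, selecting "PRIO" makes `sk_sched_choose()` return index 1, while selecting "FIFO" makes the same call return index 0. The calling code is identical in both cases, which is the point of the indirection.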

What This Enables

A/B Testing

Compare schedulers on the same hardware by rebooting with a different tunable. No recompilation, no separate kernel binary.

🔬

Third-Party Schedulers

Researchers can implement novel scheduling algorithms as kernel modules, loadable without modifying the base kernel.

📦

Unified Kernel

Distributions ship a single kernel binary with both schedulers compiled in, selected at boot time based on workload.

A researcher wants to test a new scheduling algorithm on FreeBSD 16. What do they need to do?

Module 7

The Big Picture

Now that we have explored each subsystem in detail, let us step back and see how the FreeBSD scheduler has evolved across major versions, how the two schedulers compare, and how to observe it all with SDT probes and DTrace.

Timeline of Major Changes

FreeBSD 10 — Baseline

Both schedulers are present as separate compile-time options, and both use the 64-queue runq.

FreeBSD 11

ts_slice added to 4BSD for explicit time-slicing. SDT probes added to both schedulers for DTrace observability.

FreeBSD 12

ts_estcpu becomes explicit in 4BSD's td_sched. ULE gets a minimum time-slice tunable (sched_slice_min).

FreeBSD 14

API cleanup: sched_switch loses the newtd parameter, sched_wakeup gains flags. ULE gets cache-padded locks to avoid false sharing on multi-socket systems.

FreeBSD 15

tdq_curthread added to ULE, sched_ap_entry for application processor startup, AST-based preemption improvements.

FreeBSD 16

Unified run queue (3 queues → 1 per CPU), 256-queue runq (RQ_PPQ=1), pluggable framework (sched_shim.c), per-scheduler sysctl namespaces (kern.sched.ule.*).

4BSD vs ULE Comparison

Feature	4BSD	ULE
Design	Single global lock, global + per-CPU queues	Per-CPU locks, per-CPU queues only
Priority algorithm	Exponential CPU-usage decay	Interactivity scoring (sleep/run ratio)
Time slices	Fixed quantum (~100 ms)	Variable, based on classification
Interactive boost	None	Automatic: score < 30 → boosted
CPU topology	Basic (last-CPU preference)	Full: SMT, packages, NUMA
Load balancing	Shortest-queue + IPI	Work-stealing, topology-aware
Run queues (FB10–15)	1 global + N per-CPU	3 per CPU: realtime, timeshare, idle
Run queues (FB16)	1 global + N per-CPU	1 unified per CPU
Locking	`sched_lock` global spinlock	Per-CPU `tdq_lock` spinlocks
Default since	Original BSD (retired from default in 7.1)	FreeBSD 7.1 onward

Available SDT Probes

Probe	Fires When
`sched:::change-pri`	Thread priority changed via `sched_thread_priority()`
`sched:::lend-pri`	Priority lent due to priority propagation
`sched:::enqueue`	Thread added to run queue (`sched_add()`)
`sched:::dequeue`	Thread removed from run queue
`sched:::load-change`	Run queue load changes
`sched:::on-cpu`	Thread begins executing after context switch
`sched:::off-cpu`	Thread switched off CPU
`sched:::surrender`	Thread yields due to preemption

DTrace One-Liners

DTRACE


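# Watch threads being enqueued in real time
# (args[1] is the enqueued thread's struct proc; execname/pid would
#  describe the thread doing the enqueueing)
dtrace -n 'sched:::enqueue {
    printf("%s (pid %d) enqueued", args[1]->p_comm, args[1]->p_pid); }'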

# Top 10 threads by scheduling events, every 5 seconds
dtrace -n 'sched:::on-cpu { @[execname, tid] = count(); }' \
  -n 'tick-5s { trunc(@, 10); printa(@); clear(@); }'

# How long do threads stay off-CPU between runs?
dtrace -n 'sched:::off-cpu { self->ts = timestamp; }' \
  -n 'sched:::on-cpu /self->ts/ {
    @[execname] = quantize(timestamp - self->ts);
    self->ts = 0; }'

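# Watch priority changes in real time
# (args[0] is the affected struct thread, arg2 the new priority)
dtrace -n 'sched:::change-pri {
    printf("%s[%d]: %d -> %d",
        args[1]->p_comm, args[0]->td_tid,
        args[0]->td_priority, arg2); }'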

PLAIN ENGLISH

enqueue probe: Fires every time a thread enters the run queue. Great for seeing which processes are most active.

on-cpu aggregation: Counts on-cpu events per thread (execname, tid) and reports the top 10 every 5 seconds. A high count means a thread is scheduled often, which is not the same as using the most CPU time.

off-cpu/on-cpu latency: Measures how long each thread waits between being switched out and switched back in. A distribution (quantize) shows if outliers exist.

change-pri trace: Shows priority changes as they happen. Useful for diagnosing unexpected priority boosts or inversions.

All probes fire at near-zero cost when not active — SDT probes are nop instructions when DTrace is not running.

Scheduling Data Flow

The complete journey of a scheduling decision, from timer interrupt to new thread executing:

❶ Timer Interrupt

The timer interrupt fires and the clock handler (statclock() in the FreeBSD sources) calls sched_clock(td, cnt) to update CPU usage statistics and check if the time slice has expired.

❷ Preemption Check

If the time slice is up, the scheduler sets TDF_NEEDRESCHED on the thread. This flag is checked on return from interrupt.

❸ mi_switch()

The kernel calls mi_switch(), which calls sched_switch(). The scheduler picks the highest-priority runnable thread from the run queue.

❹ cpu_switch()

Architecture-specific code saves old registers, loads new registers, switches the kernel stack pointer. On return, we are running as the new thread.

❺ Address Space Switch

If the new thread is in a different process, switch page tables and flush the TLB (unless PCID/ASID is available to avoid the flush).

❻ Resume Execution

The new thread returns from its saved mi_switch() call frame and continues executing. sched:::on-cpu fires.

⚡

Speed

The entire flow — from timer interrupt to new thread executing — typically takes 1–10 microseconds. Fast enough to happen thousands of times per second on each CPU without noticeable overhead.

Putting It All Together

Priority system: 256 levels, lower is better, divided into interrupt / realtime / kernel / timeshare / idle classes, with class boundaries that have shifted across versions.
4BSD: Classic decay-based scheduling, simple and predictable. Still useful for workloads where global lock contention is not an issue.
ULE: Modern per-CPU design with interactivity scoring and work stealing. The default since FreeBSD 7.1 and continually refined.
Run queue: Evolved from 64 shared-priority queues (FB10–15) to 256 one-priority-per-queue (FB16), eliminating intra-band inversions.
Context switching: Architecture-specific register save/restore with lazy FPU and PCID/ASID support for cheaper process switches.
Pluggable framework: IFUNC-based dispatching for boot-time scheduler selection — the prerequisite for a future ecosystem of specialized schedulers.
DTrace: Production-safe observability into every scheduling decision with near-zero overhead when inactive.

You notice a thread's priority changing unexpectedly under load. Which DTrace probe would you use to investigate?
