Module 1

What Does a Scheduler Actually Do?

Imagine a traffic controller at a busy airport. Dozens of planes are ready to land, but only a handful of runways are available. The controller decides which plane lands next, for how long it uses the runway, and when to wave it off so someone else gets a turn. That is what a scheduler does — except the planes are threads and the runways are CPUs.

Source File Map

File | Purpose
kern/sched_4bsd.c | The traditional 4BSD scheduler implementation
kern/sched_ule.c | The ULE scheduler (default since FreeBSD 7.1)
kern/sched_shim.c | Pluggable scheduler framework (FreeBSD 16+)
sys/sched.h | Scheduler API: public function declarations
kern/kern_switch.c | Run queue operations and mi_switch()
sys/runq.h | Run queue data structure definitions
kern/kern_synch.c | Sleep/wakeup, thread blocking and unblocking
amd64/amd64/cpu_switch.S | AMD64 low-level context switch assembly
arm64/arm64/swtch.S | ARM64 low-level context switch assembly

The Scheduler API

The sched.h header defines the contract every scheduler must fulfill:

Function | What It Does
sched_add() | Place a thread on a run queue: "this thread is ready to run"
sched_switch() | Pick the next thread and switch to it
sched_choose() | Select the highest-priority runnable thread
sched_clock() | Periodic tick: update usage stats, check time slices
sched_prio() | Set a thread's priority
sched_wakeup() | Wake a sleeping thread and schedule it
sched_fork() | Initialize scheduling state for a new child thread
sched_exit() | Clean up scheduling state when a thread dies
sched_affinity() | Handle CPU affinity changes

The Priority Space

FreeBSD uses numeric priorities where lower = more important. The 256-value priority space is divided into classes:

Class | FreeBSD 10 | FreeBSD 15 | FreeBSD 16
PRI_ITHD (interrupts) | 0–47 | 0–15 | 0–7
PRI_REALTIME | 48–79 | 16–47 | 8–39
PRI_KERN (kernel) | 80–119 | 48–87 | 40–55
PRI_TIMESHARE (user) | 120–223 | 88–223 | 56–223
PRI_IDLE | 224–255 | 224–255 | 224–255
💡
The Trend

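The interrupt and kernel priority ranges have been getting narrower across versions, while the realtime range has kept its 32 levels and simply shifted down. The freed levels go to timeshare threads: 104 levels in FreeBSD 10, 136 in FreeBSD 15, and 168 in FreeBSD 16. This reflects modern workloads where interactive responsiveness matters more than having many distinct interrupt priority levels.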
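
To make the table concrete, here is a minimal sketch that maps a numeric priority to its class using the FreeBSD 10 column. The enum and the function name pri_class_fb10() are hypothetical and exist only for illustration; the kernel expresses these boundaries as PRI_MIN_* / PRI_MAX_* constants instead.

```c
#include <assert.h>

/* Hypothetical class labels for this sketch only. */
enum pri_class {
    CLASS_ITHD,
    CLASS_REALTIME,
    CLASS_KERN,
    CLASS_TIMESHARE,
    CLASS_IDLE
};

/*
 * Map a priority (0-255, lower = more important) to its class.
 * Boundaries are copied from the FreeBSD 10 column of the table above.
 */
enum pri_class
pri_class_fb10(int pri)
{
    assert(pri >= 0 && pri <= 255);
    if (pri <= 47)
        return (CLASS_ITHD);
    if (pri <= 79)
        return (CLASS_REALTIME);
    if (pri <= 119)
        return (CLASS_KERN);
    if (pri <= 223)
        return (CLASS_TIMESHARE);
    return (CLASS_IDLE);
}
```

Under these assumptions, priority 120 is the most important timeshare priority and 224 is the first idle priority. Swapping in the FreeBSD 15 or 16 boundaries only changes the four constants.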

SRQ Flags

When a thread is added to a run queue, flags describe why it is being added:

CODE

#define SRQ_BORING    0x0000   /* No special circumstances */
#define SRQ_YIELDING  0x0001   /* Thread is yielding voluntarily */
#define SRQ_OURSELF   0x0002   /* Adding ourselves to the run queue */
#define SRQ_INTR      0x0004   /* Wakeup is interrupt-driven (urgent) */
#define SRQ_PREEMPTED 0x0008   /* Thread was preempted */
#define SRQ_BORROWING 0x0010   /* Priority updated due to lending */
#define SRQ_HOLD      0x0020   /* Return holding original td lock (14+) */
#define SRQ_HOLDTD    0x0040   /* Return holding td lock (14+) */
        
PLAIN ENGLISH

SRQ_BORING: The default — just a regular thread being placed on the queue, nothing unusual.

SRQ_YIELDING: The thread said "I'm done for now, let someone else go." The scheduler places it at the back of its priority queue.

SRQ_INTR: This thread was woken by a hardware interrupt (like a network packet arriving). It may need to run urgently.

SRQ_PREEMPTED: A higher-priority thread showed up, so this one was forcibly pulled off the CPU. It is re-queued at the head of its priority queue so it resumes before other threads at the same priority.

SRQ_HOLD / SRQ_HOLDTD (FreeBSD 14+): Lock-management flags that control which locks are held when the function returns — important for avoiding lock-order violations.
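
Because the SRQ values are single bits, callers can combine them with bitwise OR and test them with bitwise AND. The sketch below copies the flag values from the listing above; the helpers srq_has() and srq_is_boring() are hypothetical names introduced here, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values copied from the listing above. */
#define SRQ_BORING    0x0000
#define SRQ_YIELDING  0x0001
#define SRQ_OURSELF   0x0002
#define SRQ_INTR      0x0004
#define SRQ_PREEMPTED 0x0008

/* Hypothetical helper: true if every bit in mask is set in flags. */
bool
srq_has(int flags, int mask)
{
    assert(mask != 0);  /* SRQ_BORING is 0 and cannot be tested with & */
    return ((flags & mask) == mask);
}

/* Hypothetical helper: SRQ_BORING means "no flag bits at all". */
bool
srq_is_boring(int flags)
{
    return (flags == SRQ_BORING);
}
```

One detail worth noticing: SRQ_BORING is zero, so it is the absence of flags rather than a bit of its own, which is why it needs an equality test.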

A web server thread is sleeping, waiting for a network packet. The packet arrives and triggers a hardware interrupt. Which SRQ flag will be used when adding this thread back to the run queue?

Module 2

Meet the 4BSD Scheduler

The 4BSD scheduler is the old guard — the scheduling algorithm that dates back to original BSD Unix. Its approach is elegant in its simplicity: track how much CPU time each thread has used recently, and gradually lower the priority of CPU-hungry threads so interactive programs stay responsive.

The Priority Decay Formula

Every time a clock tick fires, the 4BSD scheduler asks: "How much CPU has this thread used?" and adjusts its priority accordingly:

newpriority = PUSER + (ts_estcpu / INVERSE_ESTCPU_WEIGHT) + NICE_WEIGHT × (p_nice − PRIO_MIN)
  • PUSER — base priority for timeshare threads
  • ts_estcpu — estimated recent CPU usage (higher = used more CPU)
  • INVERSE_ESTCPU_WEIGHT — controls how strongly CPU usage affects priority (typically 8)
  • p_nice — user-settable "niceness" value (−20 to +20)
💡
The Key Insight

The more CPU you use, the higher your priority number becomes — which means lower actual priority. CPU hogs naturally sink to the back of the line.
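
A worked calculation may help here. The sketch below evaluates the formula with assumed constants: PUSER = 120 (the FreeBSD 10 start of the timeshare range in the table above), INVERSE_ESTCPU_WEIGHT = 8, NICE_WEIGHT = 1, and PRIO_MIN = -20. The function name bsd4_user_priority() and the clamp to 223 are assumptions of this sketch, not a copy of the kernel code.

```c
#include <assert.h>

/* Assumed values for illustration; real values depend on version and CPU count. */
#define PUSER                 120
#define PRI_MAX_TIMESHARE     223
#define INVERSE_ESTCPU_WEIGHT 8
#define NICE_WEIGHT           1
#define PRIO_MIN              (-20)
#define PRIO_MAX              20

/* Hypothetical helper: evaluate the priority formula for one thread. */
int
bsd4_user_priority(unsigned estcpu, int nice)
{
    int pri;

    assert(nice >= PRIO_MIN && nice <= PRIO_MAX);
    pri = PUSER + (int)(estcpu / INVERSE_ESTCPU_WEIGHT) +
        NICE_WEIGHT * (nice - PRIO_MIN);
    /* Keep the result inside the timeshare range (assumption of this sketch). */
    if (pri > PRI_MAX_TIMESHARE)
        pri = PRI_MAX_TIMESHARE;
    return (pri);
}
```

With these numbers, an idle thread at nice 0 lands on priority 140, and the same thread with ts_estcpu = 80 lands on 150: ten levels less important, purely because of recent CPU use.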

CPU Usage Decay (schedcpu)

Every second, the schedcpu() function applies exponential decay to each thread's CPU usage estimate:

ts_estcpu = (2 × loadavg × ts_estcpu) / (2 × loadavg + FSCALE)

This means old CPU usage is gradually "forgotten." Under high load, decay happens more slowly — the system remembers CPU-hungry threads longer when resources are scarce.
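
The effect of load on the decay rate is easier to see with numbers. This sketch assumes the load average is stored as a fixed-point value scaled by FSCALE, with FSCALE = 2048 (FSHIFT = 11); decay_estcpu() is a hypothetical name for the formula above.

```c
#include <assert.h>

/* Assumed fixed-point scale: 1.0 is represented as FSCALE. */
#define FSHIFT 11
#define FSCALE (1 << FSHIFT)

/*
 * Apply one decay pass to estcpu.
 * loadavg_fp is the load average in fixed point (1.0 == FSCALE).
 */
unsigned
decay_estcpu(unsigned loadavg_fp, unsigned estcpu)
{
    unsigned long long loadfac = 2ULL * loadavg_fp;
    unsigned decayed;

    decayed = (unsigned)((loadfac * estcpu) / (loadfac + FSCALE));
    assert(decayed <= estcpu);  /* decay never increases the estimate */
    return (decayed);
}
```

At a load average of 1.0 the multiplier is 2/3, so an estimate of 90 drops to 60 in one pass. At a load average of 4.0 the multiplier is 8/9, so the same 90 only drops to 80.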

The Clock Tick Handler

CODE

/* Called on every scheduler clock tick */
ts->ts_cpticks++;
ts->ts_estcpu = ESTCPULIM(ts->ts_estcpu + 1);
if ((ts->ts_estcpu % INVERSE_ESTCPU_WEIGHT) == 0)
    resetpriority(td);
        
PLAIN ENGLISH

Increment the tick counter (ts_cpticks) — this tracks raw clock ticks for the current scheduling window.

Add 1 to the estimated CPU usage (ts_estcpu) and clamp it to a maximum value so it does not overflow.

Every Nth tick (where N = INVERSE_ESTCPU_WEIGHT, typically 8), recalculate the thread's priority using the priority formula above.

Priority does not change on every tick — it is batched for efficiency. The thread's position in the run queue only shifts every 8 ticks.
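
The batching can be checked with a small simulation. The struct tick_state, the cap ESTCPU_MAX, and the function names below are all hypothetical stand-ins; the sketch only counts how often the priority would be recalculated, assuming INVERSE_ESTCPU_WEIGHT = 8.

```c
#include <assert.h>
#include <stddef.h>

#define INVERSE_ESTCPU_WEIGHT 8
#define ESTCPU_MAX            1000  /* hypothetical cap standing in for ESTCPULIM */

/* Hypothetical per-thread state for this sketch. */
struct tick_state {
    int      cpticks;   /* raw ticks in the current window */
    unsigned estcpu;    /* estimated CPU usage */
    int      recalcs;   /* how many times priority was recalculated */
};

/* Model one scheduler clock tick. */
void
tick_once(struct tick_state *ts)
{
    assert(ts != NULL);
    ts->cpticks++;
    if (ts->estcpu < ESTCPU_MAX)
        ts->estcpu++;
    if ((ts->estcpu % INVERSE_ESTCPU_WEIGHT) == 0)
        ts->recalcs++;  /* stands in for resetpriority(td) */
}

/* Run nticks ticks from a fresh state and report the recalculation count. */
int
recalcs_after(int nticks)
{
    struct tick_state ts = { 0, 0, 0 };

    for (int i = 0; i < nticks; i++)
        tick_once(&ts);
    return (ts.recalcs);
}
```

Starting from zero, 7 ticks trigger no recalculation, the 8th triggers the first, and 24 ticks trigger three in total.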

Per-Thread State: struct td_sched

The 4BSD scheduler's per-thread state has evolved across versions:

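FreeBSD 10 (later additions marked in comments)

struct td_sched {
    fixpt_t     ts_pctcpu;   /* Recent CPU percentage */
    int         ts_cpticks;  /* Ticks consumed in the current window */
    int         ts_slptime;  /* Time spent sleeping */
    int         ts_flags;    /* Scheduler flags */
    struct runq *ts_runq;    /* Run queue the thread is on */
    int         ts_slice;    /* FreeBSD 11+: remaining ticks in quantum */
    u_int       ts_estcpu;   /* FreeBSD 12+: estimated CPU utilization */
};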
        
PLAIN ENGLISH

ts_pctcpu: Percentage of CPU used, shown in ps output as the %CPU column.

ts_cpticks: Raw clock ticks consumed in the current window. schedcpu() uses it to update ts_pctcpu.

ts_slptime: How long this thread has been sleeping. After sleeping, the next wakeup gets a priority boost.

ts_slice (FB11+): Remaining ticks before the scheduler forces a switch. Adds explicit time-slicing.

ts_estcpu (FB12+): The CPU usage estimate now lives in the 4BSD-private td_sched (previously it was kept in struct thread), so only the scheduler that uses it carries it.

Run Queue Organization

[Figure: 4BSD run queue organization. On an SMP system, a global runq holds unbound threads (no CPU affinity) and per-CPU queues runq_pcpu[N] (CPU0, CPU1, ... CPUn) hold pinned or bound threads. UP systems use a single struct runq only.]

On SMP systems, there are two pools of queues: a global queue for threads without CPU preferences, and per-CPU queues for threads pinned to specific CPUs.

4BSD Feature Matrix

Feature | 10 | 11 | 12 | 13 | 14 | 15 | 16
ts_slice field
TDF_SLICEEND flag
ts_estcpu in td_sched
sched_clock(td, cnt)
sched_switch(td, flags)
HWT hooks
sched_4bsd_* naming
struct sched_instance

A thread has been using a lot of CPU. What happens to its ts_estcpu value during the schedcpu() decay pass?

Module 3

Meet the ULE Scheduler

ULE (pronounced "you-lee") replaced the 4BSD scheduler as the default in FreeBSD 7.1. Where 4BSD treats SMP as an afterthought, ULE was designed from the ground up for multi-core systems.

Per-CPU Design

The fundamental insight of ULE is that each CPU gets its own scheduler state: a struct tdq with its own lock and its own run queues (separate realtime, timeshare, and idle queues through FreeBSD 15, a single unified queue in FreeBSD 16):

[Figure: ULE per-CPU thread dispatch queues. Each CPU (0, 1, ... N) owns a struct tdq protected by a per-CPU tdq_lock, with tdq_cpu set to that CPU's number. FreeBSD 10–15: three separate queues (tdq_realtime, tdq_timeshare, tdq_idle). FreeBSD 16+: a single unified tdq_runq. Neighboring CPUs exchange threads through work stealing and load balancing (tdq_idled / sched_balance).]
💡
Scalability Win

On a 64-core server, 64 CPUs can schedule threads simultaneously without contending on a single lock — a massive scalability win over 4BSD's global queue approach.

Interactivity Detection

ULE automatically classifies threads as interactive or batch-oriented by measuring their voluntary sleep patterns. The magic happens in sched_interact_score():

CODE

static int
sched_interact_score(struct thread *td) {
    struct td_sched *ts = td->td_sched;
    int div;
    if (ts->ts_runtime >= ts->ts_slptime) {
        div = max(1, ts->ts_runtime / SCHED_INTERACT_HALF);
        return (SCHED_INTERACT_HALF +
            (SCHED_INTERACT_HALF - (ts->ts_slptime / div)));
    }
    div = max(1, ts->ts_slptime / SCHED_INTERACT_HALF);
    return (ts->ts_runtime / div);
}
        
PLAIN ENGLISH

Get the thread's scheduler state, which tracks time spent running vs. sleeping.

If the thread has spent more time running than sleeping, it gets a high score (less interactive — it is a CPU hog).

If the thread sleeps more than it runs, it gets a low score (more interactive — waiting for user input or I/O).

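SCHED_INTERACT_HALF is the midpoint of the score range: threads that run more than they sleep score above it, and threads that sleep more score below it. A thread is treated as interactive only when its score falls under a separate, lower threshold (30 by default), so it must sleep clearly more than it runs to qualify.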

Interactive threads receive a priority boost, keeping them responsive even under heavy load.
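
To see what scores come out in practice, the simplified function above can be lifted into a standalone sketch. The version below takes the two times as plain integers and assumes SCHED_INTERACT_MAX = 100 (so SCHED_INTERACT_HALF = 50); interact_score() and int_max() are hypothetical names, and the units of the inputs are arbitrary.

```c
#include <assert.h>

/* Assumed score range for this sketch: 0..100 with 50 as the midpoint. */
#define SCHED_INTERACT_MAX  100
#define SCHED_INTERACT_HALF (SCHED_INTERACT_MAX / 2)

static int
int_max(int a, int b)
{
    return (a > b ? a : b);
}

/* Same arithmetic as the simplified sched_interact_score() above. */
int
interact_score(int runtime, int slptime)
{
    int div;

    assert(runtime >= 0 && slptime >= 0);
    if (runtime >= slptime) {
        /* Mostly running: score lands in the upper half. */
        div = int_max(1, runtime / SCHED_INTERACT_HALF);
        return (SCHED_INTERACT_HALF +
            (SCHED_INTERACT_HALF - (slptime / div)));
    }
    /* Mostly sleeping: score lands in the lower half. */
    div = int_max(1, slptime / SCHED_INTERACT_HALF);
    return (runtime / div);
}
```

With 1000 units of run time against 9000 units of sleep, the score is 5, well inside the interactive range. Reversing the ratio gives 95, and an even split gives exactly 50.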

Work Stealing & Load Balancing

  1. Local check: Is my own queue empty? If not, run the highest-priority thread. No cross-CPU coordination is needed.
  2. Steal threshold: Check whether any other CPU's tdq_load exceeds the load-balancing threshold. If so, steal from the busiest.
  3. Topology-aware: Prefer the closest victims first: a CPU on the same SMT core, then the same physical package, then the same NUMA domain. The farther a thread moves, the more of its cached working set it leaves behind.
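
The first two steps can be sketched in a few lines. The function pick_steal_victim() and its parameters are hypothetical; it works on a plain array of per-CPU load counts and deliberately ignores topology, so it is a simplification rather than a model of the kernel's actual search.

```c
#include <assert.h>

/*
 * Hypothetical helper: decide which CPU an idle CPU should steal from.
 * load[i] is the number of runnable threads on CPU i.
 * Returns the busiest other CPU whose load exceeds steal_thresh,
 * or -1 if the caller has local work or nobody is busy enough.
 */
int
pick_steal_victim(const int *load, int ncpu, int self, int steal_thresh)
{
    int best = -1;

    assert(load != 0 && ncpu > 0 && self >= 0 && self < ncpu);
    if (load[self] > 0)
        return (-1);    /* step 1: local queue is not empty */
    for (int cpu = 0; cpu < ncpu; cpu++) {
        if (cpu == self || load[cpu] <= steal_thresh)
            continue;   /* step 2: below the steal threshold */
        if (best == -1 || load[cpu] > load[best])
            best = cpu;
    }
    return (best);
}
```

For loads of {0, 3, 5, 1} and a threshold of 2, CPU 0 would pick CPU 2. A topology-aware version would run the same search over progressively larger groups of CPUs, starting with the nearest.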

ULE Feature Matrix

Feature | 10 | 11 | 12 | 13 | 14 | 15 | 16
3 run queues per CPU
1 unified run queue
sched_slice_min
always_steal tunable
Cache-padded lock
tdq_curthread field
Lockless TDQ accessors
kern.sched.ule.* namespace

A thread spends 90% of its time sleeping (waiting for user keystrokes) and only 10% running. What does ULE's sched_interact_score() return for it?

Module 4

The Run Queue

The run queue is the central data structure that holds all threads ready to execute. Its design has evolved significantly between FreeBSD versions, with the most dramatic change arriving in FreeBSD 16.

Data Structure Evolution

FreeBSD 10–15

#define RQ_NQS      64    /* 64 run queues */
#define RQ_PPQ       4    /* 4 priorities per queue */

struct rqbits {
    rqb_word_t rqb_bits[RQB_LEN];
};
struct runq {
    struct rqbits  rq_status;
    struct rqhead  rq_queues[RQ_NQS];
};
        
PLAIN ENGLISH

64 queues, each covering 4 priority values. Priority 0–3 → queue 0, priority 4–7 → queue 1, and so on.

rq_status: a bitmask with one bit per queue — set means "this queue has threads, check it."

Finding the highest-priority runnable thread = finding the lowest-set bit in rq_status. That is a single bsf (bit-scan forward) instruction — O(1).

The downside: 4 different priorities share a queue, so a priority-5 thread might wait behind priority-7 threads at the same queue slot.
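
The sharing problem follows directly from the integer division. A minimal sketch, using the RQ_NQS and RQ_PPQ values from the listing above (the function name runq_index_legacy() is hypothetical):

```c
#include <assert.h>

/* FreeBSD 10-15 values from the listing above. */
#define RQ_NQS 64
#define RQ_PPQ 4

/* Hypothetical helper: map a priority to its run queue index. */
int
runq_index_legacy(int pri)
{
    assert(pri >= 0 && pri < RQ_NQS * RQ_PPQ);
    return (pri / RQ_PPQ);
}
```

Priorities 5 and 7 both map to queue 1, so within that queue their relative importance is lost and only arrival order decides who runs first.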

FreeBSD 16

#define RQ_NQS      256   /* 256 run queues */
#define RQ_PPQ         1   /* 1 priority per queue */

typedef unsigned long rqsw_t;
#define RQSW_NB     4     /* 4 × 64-bit = 256 bits */

struct rq_status {
    rqsw_t rq_sw[RQSW_NB];
};
struct runq {
    struct rq_status rq_status;
    struct rq_queue  rq_queues[RQ_NQS];
};
        
PLAIN ENGLISH

256 queues — one per priority level. Priority 5 → queue 5, no sharing.

rq_status: now 4 × 64-bit words = 256 bits total. One bit per queue, just more of them.

Finding the highest-priority thread: scan the 4 words with __builtin_ctzl() (count trailing zeros) — still effectively O(1).

✅ Eliminates intra-band priority inversion: no thread with priority 5 waits behind priority 7 in the same slot.
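
The multi-word scan can be sketched in userland C. This version assumes 64-bit status words and uses the GCC/Clang builtin __builtin_ctzll(); the struct name rq_status_sketch, the constant RQSW_BPW, and the rqs_* helpers are hypothetical and are not the kernel's runq_sw_*() functions.

```c
#include <assert.h>
#include <stdint.h>

#define RQ_NQS   256
#define RQSW_BPW 64                     /* bits per status word (assumed) */
#define RQSW_NB  (RQ_NQS / RQSW_BPW)    /* 4 words */

struct rq_status_sketch {
    uint64_t rq_sw[RQSW_NB];
};

/* Mark queue qi as non-empty. */
void
rqs_set(struct rq_status_sketch *s, int qi)
{
    assert(qi >= 0 && qi < RQ_NQS);
    s->rq_sw[qi / RQSW_BPW] |= (uint64_t)1 << (qi % RQSW_BPW);
}

/* Return the lowest non-empty queue index, or -1 if all are empty. */
int
rqs_first(const struct rq_status_sketch *s)
{
    for (int w = 0; w < RQSW_NB; w++) {
        if (s->rq_sw[w] != 0)
            return (w * RQSW_BPW + __builtin_ctzll(s->rq_sw[w]));
    }
    return (-1);
}

/* Convenience wrapper: mark the n listed queues, then find the first. */
int
rqs_first_of(const int *queues, int n)
{
    struct rq_status_sketch s = { { 0 } };

    for (int i = 0; i < n; i++)
        rqs_set(&s, queues[i]);
    return (rqs_first(&s));
}
```

With queues 200, 70, and 130 marked non-empty, the scan skips the empty first word and returns 70. The loop runs at most four times regardless of how many threads are queued, which is why the lookup is still effectively constant time.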

Run Queue Comparison

Aspect | FreeBSD 10–15 | FreeBSD 16
Queues | 64 | 256
Priorities per queue | 4 | 1
Status tracking | struct rqbits (platform-specific) | struct rq_status with rqsw_t array
Bit operations | Direct rqb_bits manipulation | runq_sw_*() abstraction functions
Priority mapping | pri / RQ_PPQ = queue index | 1:1 mapping

Core Queue Operations

CODE

/* Conceptual runq_add (simplified) */
void
runq_add(struct runq *rq, struct thread *td, int flags) {
    int pri = td->td_priority;
    int qi = pri / RQ_PPQ; /* FB10-15: pri/4  FB16: pri/1 */
    struct rq_queue *rqh = &rq->rq_queues[qi];
    if (flags & SRQ_PREEMPTED)
        TAILQ_INSERT_HEAD(rqh, td, td_runq); /* resume before peers */
    else
        TAILQ_INSERT_TAIL(rqh, td, td_runq); /* wait behind peers */
    runq_setbit(&rq->rq_status, qi);
}
        
PLAIN ENGLISH

Convert the thread's numeric priority to a queue index. In FB10-15, divide by 4 so priorities 0–3 share queue 0. In FB16, it is 1:1.

Get a pointer to the target queue head in the run queue array.

If the thread was preempted (pulled off the CPU before its time slice ended), add it to the head so it resumes before other threads at the same priority.

Otherwise (newly runnable, woken up, or yielding voluntarily), add it to the tail, where it waits behind others at the same priority. Inserting every new arrival at the head would let latecomers cut in line.

Set the corresponding bit in the status bitmask to indicate this queue is non-empty. Enables O(1) queue selection via bit-scan instructions.

💡
Key Insight: RQ_PPQ=1

With RQ_PPQ=1 in FB16, every priority gets its own queue. This eliminates "priority inversion within a band" — no thread with priority 5 has to wait behind a thread with priority 7 in the same queue slot.

How does FreeBSD 16 track which of its 256 run queues are non-empty?

Module 5

Context Switching

Every time the scheduler picks a new thread, the CPU needs to save the old thread's register state and load the new thread's state. This is the context switch, one of the most performance-critical paths in the kernel.

What Gets Saved?

AMD64 (x86-64)

Register Class | Registers | Why
Callee-saved GPRs | r12, r13, r14, r15, rbp, rsp, rbx | ABI requires these survive function calls
Instruction pointer | rip (via return address on stack) | Resume execution at the right place
FPU/SIMD state | xsave area (or fxsave on older CPUs) | AVX/SSE registers, lazily saved only if used

ARM64 (AArch64)

Register Class | Registers | Why
Callee-saved GPRs | x19–x29, lr (x30) | ARM64 calling convention preserves these
Stack & frame | sp, fp (x29) | Each thread has its own kernel stack
VFP/NEON state | Floating-point registers | Lazily saved when next used
PAC keys | ptrauth_switch() | Pointer authentication keys (security feature)
💡
Lazy FPU Saving

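Both architectures avoid touching FPU state when they can: it is saved and restored only for threads that have actually used floating-point or SIMD instructions. This avoids moving 512+ bytes of SIMD state on every context switch, since most kernel threads never touch the FPU. On amd64, the restore itself has been eager rather than trap-driven since the 2018 LazyFP mitigation.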

The mi_switch() Orchestrator

CODE

void
mi_switch(int flags) {
    struct thread *td = curthread;
    /* 1. Account for CPU time used (bookkeeping elided here) */
    /* 2. Hand off: sched_switch() picks the next thread and */
    /*    calls cpu_switch(old_td, new_td, lock) */
    sched_switch(td, flags);
    /* 3. We return here only when scheduled again */
    td->td_oncpu = PCPU_GET(cpuid);
}
        
PLAIN ENGLISH

Get the currently running thread (curthread is a per-CPU global — fast, no locking needed).

Call sched_switch(), which asks the scheduler to pick the next thread. The scheduler calls cpu_switch() under the hood.

The key insight: cpu_switch() does not return to us — it returns to the new thread's saved state. We only resume when some future scheduling decision picks us again.

When we are re-scheduled (maybe milliseconds or seconds later), we update our CPU ID to reflect which CPU we are now running on — we might have migrated to a different core!
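
The "call returns much later, possibly on another CPU" behavior can be imitated with a toy model. The sketch below does no real register switching: each toy thread stores a resume point (standing in for the saved instruction pointer) and picks up from there the next time it is run. All names (toy_thread, toy_run, toy_demo) are hypothetical.

```c
#include <assert.h>

/* Hypothetical toy thread: resume_at plays the role of the saved rip. */
struct toy_thread {
    int resume_at;   /* where to continue when scheduled again */
    int oncpu;       /* CPU the thread last ran on */
    int steps_done;  /* how many code segments have executed */
};

/* Run the thread on the given CPU until its next "switch out" point. */
void
toy_run(struct toy_thread *t, int cpu)
{
    assert(t != 0 && cpu >= 0);
    t->oncpu = cpu;              /* like refreshing td_oncpu after resuming */
    switch (t->resume_at) {
    case 0:
        t->steps_done++;         /* work before the first switch */
        t->resume_at = 1;        /* remember where to continue */
        return;                  /* control leaves this thread */
    case 1:
        t->steps_done++;         /* continues here, maybe on another CPU */
        t->resume_at = 2;
        return;
    default:
        return;                  /* thread has finished */
    }
}

/* Run A on CPU 0, B on CPU 0, then A again on CPU 1; encode the outcome. */
int
toy_demo(void)
{
    struct toy_thread a = { 0, -1, 0 }, b = { 0, -1, 0 };

    toy_run(&a, 0);
    toy_run(&b, 0);
    toy_run(&a, 1);
    return (a.steps_done * 100 + a.oncpu * 10 + b.steps_done);
}
```

In toy_demo(), thread A completes its second segment only after B has had a turn, and it does so on CPU 1 rather than CPU 0, which mirrors why the real code refreshes the CPU ID after resuming.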

Context Switch Overhead

  1. Thread→Thread (same process), ~1–3 μs: Only register save/restore. No TLB flush. Cheapest possible switch.
  2. Process→Process, ~3–10 μs: Registers + page table switch + TLB flush. Some TLB entries may be preserved with PCID/ASID tags.
  3. FPU state, add ~0.5–2 μs: Added if either thread uses FPU/AVX/NEON. The xsave area can be 512–8192 bytes depending on available extensions.

After cpu_switch() is called, when does the old thread resume executing?

Module 6

The Pluggable Scheduler Framework

Through FreeBSD 15, choosing between 4BSD and ULE was a compile-time decision: you had to rebuild the kernel to switch. FreeBSD 16 changes this with a new pluggable scheduler framework that enables boot-time selection.

The Shim Architecture

[Figure: FreeBSD 16 pluggable scheduler architecture. User and kernel code call sched_add(), sched_switch(), and so on. These calls pass through the IFUNC trampolines in sched_shim.c, where active_sched resolves the function pointers at boot time to either the 4BSD scheduler (sched_4bsd_add(), sched_4bsd_switch(), ...) or the ULE scheduler (sched_ule_add(), sched_ule_switch(), ...). FreeBSD 16: boot-time selection via the kern.sched.name tunable (default: ULE). FreeBSD 10–15: compile-time selection only, with no shim layer.]
💡
IFUNC Trampolines

The shim layer uses IFUNC trampolines — the same mechanism used for CPU-optimized string functions like memcpy(). The linker resolves each function pointer once at boot time. After that, each scheduler call goes directly to the implementation — no extra pointer dereference at runtime.

struct sched_instance — The Vtable

Each scheduler registers itself by filling in a structure of function pointers:

CODE

struct sched_instance {
    /* Thread lifecycle */
    void (*sched_fork)(struct thread *, struct thread *);
    void (*sched_exit)(struct proc *, struct thread *);
    /* Core scheduling */
    void (*sched_clock)(struct thread *, int);
    void (*sched_switch)(struct thread *, int);
    void (*sched_add)(struct thread *, int);
    struct thread *(*sched_choose)(void);
    void (*sched_wakeup)(struct thread *, int);
    /* Priority management */
    void (*sched_prio)(struct thread *, u_char);
    void (*sched_user_prio)(struct thread *, u_char);
    /* ~45 total: affinity, preempt, bind, load… */
};
        
PLAIN ENGLISH

This is a vtable — the same design pattern as C++ virtual functions, but in plain C.

Each scheduler (4BSD, ULE, or a future third-party one) fills in this struct with its own function pointers.

The kernel calls sched_add(). The shim resolves this to whichever scheduler's .sched_add pointer was set at boot.

With ~45 function pointers, the interface covers everything from creating threads (sched_fork) to making scheduling decisions (sched_choose).

How DECLARE_SCHEDULER Works

CODE

#define DECLARE_SCHEDULER(si)  \
    DATA_SET(schedulers, si); \
    SCHED_DEFINE_IFUNCS(si)

/* In sched_ule.c: */
static struct sched_instance ule_instance = {
    .sched_add     = sched_ule_add,
    .sched_switch  = sched_ule_switch,
    .sched_choose  = sched_ule_choose,
    /* ... all ~45 functions ... */
};
DECLARE_SCHEDULER(ule_instance);

/* Loader tunable — set in /boot/loader.conf: */
kern.sched.name="ULE"     # or "4BSD"
        
PLAIN ENGLISH

DECLARE_SCHEDULER does two things: registers the scheduler instance in a linker set, and generates IFUNC trampoline stubs.

DATA_SET(schedulers, si) places the instance in a special linker section so the kernel can discover all available schedulers at boot.

SCHED_DEFINE_IFUNCS generates one IFUNC resolver per scheduler function. At boot, the linker resolves each to the active scheduler's function pointer.

Setting kern.sched.name=ULE in /boot/loader.conf selects the scheduler — no recompilation needed.
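
The registration-and-lookup idea can be illustrated in ordinary C. The sketch below uses a plain array and function pointers instead of a linker set and IFUNCs, so it shows the selection logic only, not the zero-overhead dispatch. Every name in it (sched_ops_sketch, sketch_select, and so on) is hypothetical, and falling back to ULE for an unknown name is an assumption based on ULE being the stated default.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much smaller stand-in for struct sched_instance. */
struct sched_ops_sketch {
    const char *name;
    int (*choose)(void);
};

/* Stub implementations that just identify themselves. */
int sketch_4bsd_choose(void) { return (1); }
int sketch_ule_choose(void)  { return (2); }

/* Stands in for the "schedulers" linker set. */
const struct sched_ops_sketch sketch_schedulers[] = {
    { "4BSD", sketch_4bsd_choose },
    { "ULE",  sketch_ule_choose },
};
#define SKETCH_NSCHED \
    (sizeof(sketch_schedulers) / sizeof(sketch_schedulers[0]))
#define SKETCH_DEFAULT 1    /* index of ULE (assumed default) */

/* Pick a scheduler by tunable value; unknown or missing names get the default. */
const struct sched_ops_sketch *
sketch_select(const char *name)
{
    assert(SKETCH_DEFAULT < SKETCH_NSCHED);
    if (name != NULL) {
        for (size_t i = 0; i < SKETCH_NSCHED; i++) {
            if (strcmp(sketch_schedulers[i].name, name) == 0)
                return (&sketch_schedulers[i]);
        }
    }
    return (&sketch_schedulers[SKETCH_DEFAULT]);
}
```

A caller would run sketch_select("4BSD")->choose() once the choice is made. The real framework avoids even that pointer dereference by binding each sched_*() symbol to its implementation at boot.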

What This Enables

A/B Testing

Compare schedulers on the same hardware by rebooting with a different tunable. No recompilation, no separate kernel binary.

🔬
Third-Party Schedulers

Researchers can implement novel scheduling algorithms as kernel modules, loadable without modifying the base kernel.

📦
Unified Kernel

Distributions ship a single kernel binary with both schedulers compiled in, selected at boot time based on workload.

A researcher wants to test a new scheduling algorithm on FreeBSD 16. What do they need to do?

Module 7

The Big Picture

Now that we have explored each subsystem in detail, let us step back and see how the FreeBSD scheduler has evolved across major versions, how the two schedulers compare, and how to observe it all with SDT probes and DTrace.

Timeline of Major Changes

FreeBSD 10 — Baseline

Both schedulers present, 64-queue runq, compile-time selection. The 4BSD and ULE schedulers coexist as separate compile options.

FreeBSD 11

ts_slice added to 4BSD for explicit time-slicing. SDT probes added to both schedulers for DTrace observability.

FreeBSD 12

ts_estcpu becomes explicit in 4BSD's td_sched. ULE gets a minimum time-slice tunable (sched_slice_min).

FreeBSD 14

API cleanup: sched_switch loses the newtd parameter, sched_wakeup gains flags. ULE gets cache-padded locks to avoid false sharing on multi-socket systems.

FreeBSD 15

tdq_curthread added to ULE, sched_ap_entry for application processor startup, AST-based preemption improvements.

FreeBSD 16

Unified run queue (3 queues → 1 per CPU), 256-queue runq (RQ_PPQ=1), pluggable framework (sched_shim.c), per-scheduler sysctl namespaces (kern.sched.ule.*).

4BSD vs ULE Comparison

Feature | 4BSD | ULE
Design | Single global lock, global + per-CPU queues | Per-CPU locks, per-CPU queues only
Priority algorithm | Exponential CPU-usage decay | Interactivity scoring (sleep/run ratio)
Time slices | Fixed quantum (~100 ms) | Variable, based on classification
Interactive boost | None | Automatic: score < 30 → boosted
CPU topology | Basic (last-CPU preference) | Full: SMT, packages, NUMA
Load balancing | Shortest-queue + IPI | Work-stealing, topology-aware
Run queues (FB10–15) | 1 global + N per-CPU | 3 per CPU: realtime, timeshare, idle
Run queues (FB16) | 1 global + N per-CPU | 1 unified per CPU
Locking | sched_lock global spinlock | Per-CPU tdq_lock spinlocks
Default since | Original BSD (retired from default in 7.1) | FreeBSD 7.1 onward

Available SDT Probes

Probe | Fires When
sched:::change-pri | Thread priority changed via sched_thread_priority()
sched:::lend-pri | Priority lent due to priority propagation
sched:::enqueue | Thread added to run queue (sched_add())
sched:::dequeue | Thread removed from run queue
sched:::load-change | Run queue load changes
sched:::on-cpu | Thread begins executing after context switch
sched:::off-cpu | Thread switched off CPU
sched:::surrender | Thread yields due to preemption

DTrace One-Liners

DTRACE

# Watch threads being enqueued in real time
# (execname/pid name the thread doing the enqueue, so read the probe arguments)
dtrace -n 'sched:::enqueue {
    printf("%s (pid %d) enqueued", args[1]->p_comm, args[1]->p_pid); }'

# Top 10 threads by scheduling events, every 5 seconds
dtrace -n 'sched:::on-cpu { @[execname, tid] = count(); }' \
  -n 'tick-5s { trunc(@, 10); printa(@); clear(@); }'

# How long do threads stay off-CPU between runs?
dtrace -n 'sched:::off-cpu { self->ts = timestamp; }' \
  -n 'sched:::on-cpu /self->ts/ {
    @[execname] = quantize(timestamp - self->ts);
    self->ts = 0; }'

# Watch priority changes in real time
# (args[0] is the affected thread, arg2 the new priority)
dtrace -n 'sched:::change-pri {
    printf("%s[%d]: %d -> %d",
        args[0]->td_name, args[0]->td_tid,
        args[0]->td_priority, arg2); }'
        
PLAIN ENGLISH

enqueue probe: Fires every time a thread enters the run queue. Great for seeing which processes are most active.

on-cpu aggregation: Counts on-CPU events per thread and reports the top 10 every 5 seconds. Shows which threads are scheduled most often, which is not necessarily the same as using the most CPU time.

off-cpu/on-cpu latency: Measures how long each thread waits between being switched out and switched back in. A distribution (quantize) shows if outliers exist.

change-pri trace: Shows priority changes as they happen. Useful for diagnosing unexpected priority boosts or inversions.

All probes fire at near-zero cost when not active — SDT probes are nop instructions when DTrace is not running.

Scheduling Data Flow

The complete journey of a scheduling decision, from timer interrupt to new thread executing:

❶ Timer Interrupt

The timer interrupt runs statclock(), which calls sched_clock(td, cnt) to update CPU usage statistics and check whether the time slice has expired.

❷ Preemption Check

If the time slice is up, the scheduler sets TDF_NEEDRESCHED on the thread. This flag is checked on return from interrupt.

❸ mi_switch()

The kernel calls mi_switch(), which calls sched_switch(). The scheduler picks the highest-priority runnable thread from the run queue.

❹ cpu_switch()

Architecture-specific code saves old registers, loads new registers, switches the kernel stack pointer. On return, we are running as the new thread.

❺ Address Space Switch

If the new thread is in a different process, switch page tables and flush the TLB (unless PCID/ASID is available to avoid the flush).

❻ Resume Execution

The new thread returns from its saved mi_switch() call frame and continues executing. sched:::on-cpu fires.

Speed

The entire flow — from timer interrupt to new thread executing — typically takes 1–10 microseconds. Fast enough to happen thousands of times per second on each CPU without noticeable overhead.

Putting It All Together

  1. Priority system: 256 levels, lower is better, divided into interrupt / realtime / kernel / timeshare / idle classes, with the interrupt and kernel ranges narrowing and the timeshare range growing across versions.
  2. 4BSD: Classic decay-based scheduling, simple and predictable. Still useful for workloads where global lock contention is not an issue.
  3. ULE: Modern per-CPU design with interactivity scoring and work stealing. The default since FreeBSD 7.1 and continually refined.
  4. Run queue: Evolved from 64 shared-priority queues (FB10–15) to 256 one-priority-per-queue (FB16), eliminating intra-band inversions.
  5. Context switching: Architecture-specific register save/restore with lazy FPU and PCID/ASID support for cheaper process switches.
  6. Pluggable framework: IFUNC-based dispatching for boot-time scheduler selection — the prerequisite for a future ecosystem of specialized schedulers.
  7. DTrace: Production-safe observability into every scheduling decision with near-zero overhead when inactive.

You notice a thread's priority changing unexpectedly under load. Which DTrace probe would you use to investigate?