Construction of Microkernels
Lecture of Prof. J. Liedtke, summer term 2000
Summary by Felix Hupfeld
1 Outline
Some features of L4:
- Threads, IPC
- drivers at user level
- address space operation: map, unmap, grant
- page fault handling via ipc
- hierarchical construction of address spaces from an initial address space
(sigma0) via map/grant operations.
2 Threads, Thread Switch
A thread has:
- internal properties
- register set (IP (instruction pointer), SP (stack pointer), general
registers)
- stack
- status (Flags, OS-specific like priority, time, ... )
- address space
- external properties
- unique id
- communication status
Conclusions:
- Thread state must be saved/restored on thread switch - so that it does not
get lost
- We need a thread control block (tcb) per thread - to store the state
- tcbs must be kernel objects - for protection of the thread state
Operations for tcbs:
- find any thread's tcb by its uid
- find the currently executing tcb (per processor)
Thread switch:
- user mode a
- switch to kernel code and kernel stack (enter kernel)
- user mode b
Enter kernel (by int instruction, exception, or hardware interrupt)
- single instruction (intxxx or sysenter)
- push user esp on the kernel stack (pointed to by esp0), load kernel
esp0
- push user flags, reset flags
- push user eip, load kernel entry eip
- push X, which indicates error code or kernel-call type
- push registers (optional)
Kernel Stack
From the hardware perspective there are two stacks in the system: the user-mode
stack and the kernel stack. There are two different models for execution in
the kernel. One uses a single kernel stack for every entry into kernel mode
(system call, interrupt, ...), the other switches to a dedicated stack per thread.
Kernel stack
- per processor (event model aka interrupt model)
- continuations
- on switch (blocking call), store execution context and give a continuation
(pointer to a function continuing the ongoing work along with some
kind of context) to the kernel which is called when the thread can
resume. Kernel stack can be discarded. All context is in the continuations.
- inefficient and inelegant to program
- kernel stack can be discarded on switch
- stateless kernel
- no kernel threads,
- kernel not interruptible,
- inefficient system calls,
- kernel can be exchanged on the fly
- low cache footprint
- per thread (process model)
- kernel can use threads
- no special means for keeping state on interruption
- no conceptual difference between kernel mode and user mode
- difficult to exchange kernel on the fly (either no persistent tcbs or
tcbs with virtual addresses)
- larger cache footprint (minimize cache footprint!)
Per-thread stacks are feasible with a microkernel because their potential stack
size is tightly bounded (no deep nesting of calls in the kernel). Otherwise the
stack size would not be bounded and a potentially big stack per thread would be
needed.
See: Draves, Bershad, Rashid et al.: Using Continuations to Implement Thread
Management and Communication in Operating Systems
In the L4 implementation, the per-thread kernel stack sits at the bottom of the
size-aligned tcb (the current tcb is found from the alignment of the current esp0).
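A minimal C sketch of this trick, assuming an illustrative 2 KB, size-aligned
tcb (the real tcb size and field layout differ):

#include <stdint.h>
#include <stdio.h>

#define TCB_SIZE_LOG2 11                    /* assumption: 2 KB tcbs */
#define TCB_SIZE (1u << TCB_SIZE_LOG2)

typedef struct tcb {
    uint32_t myself;                        /* thread id */
    uint32_t thread_esp;                    /* saved stack pointer */
    /* ... rest of the tcb; the kernel stack occupies the remainder ... */
} tcb_t;

/* Any address inside the tcb (in particular the kernel stack pointer)
   rounded down to the tcb alignment yields the tcb itself. */
static tcb_t *current_tcb(uintptr_t kernel_sp)
{
    return (tcb_t *)(kernel_sp & ~(uintptr_t)(TCB_SIZE - 1));
}

int main(void)
{
    /* pretend the kernel stack pointer currently points near the top of
       the tcb at the (made-up) address 8 * TCB_SIZE */
    uintptr_t esp = 8 * TCB_SIZE + TCB_SIZE - 64;
    printf("tcb at %p\n", (void *)current_tcb(esp));
    return 0;
}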
Operations on Kernel entry
Kernel entry/exit on x86:
- push X
- push registers (pushad)
- ...
- popad
- add esp,4 (get rid of X)
- iretd
Resulting Stack Layout:
- Hardware:
- ESP
- flags
- CS (for segment)
- eip
- Software
X makes it possible to differentiate between stack layouts:
- ipc (w/o register push)
- interrupt, exception, some system calls (w/ register push)
- V86 mode (additional information, DS, ES, FS, GS, is saved before the above
mentioned procedure starts)
Thread Switch (x86)
- push X
- pushad
- get current tcb from alignment of esp
- switch stack pointer (store the current esp in the tcb, load the new tcb's
esp into esp)
- prepare kernel stack pointer esp0 (bottom of the new current tcb)
- popad
- add esp,4
- iretd
MIPS R4600
- no hardware stack
- special registers
- manual save of context is needed
3 TCBs, AS layout, booting
Basic TCB data structure (no AS)
- myself (the thread_id, which consists of id and version bits. Used to check
for integrity and validity of a potential tcb)
- state
- resources (like MMX stuff)
- thread_esp
- scheduling:
- timeslice
- remaining_timeslice
- priority
- cpu_clock
- present_link (all present tcbs)
- ready_link (all threads that are ready to execute)
- tcb_id (magic word)
- tcb_version
additional fields:
- send queue links (for threads waiting to send to this TCB's thread)
- wakeup queue links (see the section on timeouts below)
Size: 0.5 - 1 KB
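An illustrative C sketch of the tcb fields listed above (field types, names
and sizes are assumptions for readability; the real kernel fixes the layout at
assembly-level offsets):

#include <stdint.h>

typedef uint32_t l4_threadid_t;           /* id + version bits */
typedef struct tcb tcb_t;

struct tcb {
    l4_threadid_t myself;                 /* used to validate a potential tcb */
    uint32_t      state;
    uint32_t      resources;              /* e.g. FPU/MMX in use */
    void         *thread_esp;             /* saved kernel stack pointer */

    /* scheduling */
    uint32_t      timeslice;
    uint32_t      remaining_timeslice;
    uint8_t       priority;
    uint64_t      cpu_clock;

    /* doubly linked queues */
    tcb_t *present_next, *present_prev;   /* all present tcbs */
    tcb_t *ready_next,   *ready_prev;     /* ready-to-execute list */
    tcb_t *send_next,    *send_prev;      /* threads waiting to send to us */
    tcb_t *wakeup_next,  *wakeup_prev;    /* wakeup/timeout list */

    uint32_t tcb_id;                      /* magic word */
    uint32_t tcb_version;
    /* the per-thread kernel stack fills the rest of the 0.5-1 KB tcb */
};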
ThreadID -> TCB
- direct table: tcb_addr[thread_id]
- hash table: tcb_addr[thread_id] (aka physical tcb)
- like the direct table, but with an additional compare to check whether we
really found the right tcb (pipeline conflict!, slow)
- no MMU
- table access per tcb
- TLB entry for table
- tcb pointer array, requires 1 MB for 256k threads (4-byte pointers * 256k
threads)
- Trick: map unused parts of the tcb array to a 0-tcb page; an access will
result in "invalid thread id", so the size of the tcb array depends on the
maximum thread id
- TLB grows: 4 entries for 4000 threads (4000 threads = 4000 pointers of
4 bytes each = 16 KB = 4 pages of 4 KB each = 4 TLB entries; the 0-tcb pages
need 4 KB page-table entries; a version without 0-tcbs could work with a
4 MB page)
- direct address: tcb_addr = thread_id << ... & mask (to strip the
version bits), aka virtual tcb
- MMU (because not every unused tcb should use real physical memory)
- no table access
- TLB entry per tcb
- TCB array, requires 128 MB for 256k threads
- TLB grows: 1 TLB entry per 8 threads, i.e. one 4 KB page covers 8 tcbs of
512 bytes each
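A sketch of the virtual-tcb (direct address) scheme; the base address, tcb
size and bit split are illustrative assumptions, not the real L4 layout:

#include <stdint.h>

#define TCB_AREA_BASE 0xE0000000u   /* assumed kernel virtual region for tcbs */
#define TCB_BITS      9             /* assumption: 512-byte tcbs */
#define VERSION_BITS  7             /* assumption: low id bits hold the version */

static uintptr_t tcb_address(uint32_t thread_id)
{
    uint32_t thread_no = thread_id >> VERSION_BITS;      /* strip the version */
    return TCB_AREA_BASE + ((uintptr_t)thread_no << TCB_BITS);
}

/* Validity is then checked by comparing tcb->myself with thread_id; an
   unused slot is backed by the shared 0-tcb page and fails this check. */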
Layout of initial AS
- user regions
- shared system regions
- physical memory
- kernel code
- tcbs
- other kernel tables
- per space system regions
Current Limitations:
- 256k or 512k threads (the tcbs need virtual memory in the kernel part of the
address space, so the available user part is smaller)
- 64 or 128 MB of physical memory
Variations:
- With physical tcbs (table lookup) there is no need to map a tcb area; the
tcbs are accessible via the physical-memory mapping
- If physical memory grows, a full mapping in the physical-memory region is no
longer possible. Not all of the physical memory has to be mapped, only the
part where the page tables are.
Booting
L4, sigma0, L4 init and the OS modules are loaded into physical memory. L4
init is run and destroyed, sigma0 constructs the initial AS, and the pagers of
the OS construct further ASes:
- L4 init
- basic memory
- build initial PTAB
- turn address translation on
- basic VM
- exception handling
- build IDT
- enter dummy handlers
- processor (+coprocessor)
- check processor
- enable features like 4M pages, ...
- initialize TSS (Task State Segment): ESP0, IOPBM (io permission bitmap)
- init. coproc.
- TCBs
- initialize first tcb
- switch to this tcb's kernel stack (make the current "thread"
the first thread)
- install pf handler for tcbs
- dispatching
- init ready queue (current is the only one)
- init wakeup queue (empty)
- install switch system call handler
- hardware interrupts
- init all hardware interrupts as detached, i.e. no thread associated
- create sigma0 and root task
- release init code
- start sigma0 and root task
4 Clans & Chiefs, IPC API
Clans & Chiefs
A clan consists of a number of tasks, one of which is designated as the
chief. Intra-clan IPC is delivered directly by the microkernel. IPC to other
clans (inter-clan IPC) is redirected to the chief of the clan, who may forward
the message to the chief of the other clan, who in turn forwards the message
to the receiver. The chief acts as the sender of the message and thus
deceives the receiver of the message. All redirections are direction-preserving:
chiefs may only pretend that a message comes from outside for messages going
into the clan, and that it comes from inside for messages within the clan.
This results in certain trust relationships: clan members can always trust
their chief, and so on.
Applications of this concept:
- remote IPC (chief forwards message via network)
- multi level security
- debugging
- heterogeneity (conversion between big and little endian by the chief, protocol conversion)
- migration
Drawbacks:
- Only one hierarchy
- Inefficient because of the many copies involved
Messages must be copied, otherwise an attack called "TOCTOU" (time of check to
time of use) is possible: after the message has been checked (e.g. by the
chief), it can still be changed by the sender as it is still located in user
space.
The Clans & Chiefs model is outdated; the newer model uses a mechanism called
IPC redirection.
Thread IDs
Varying number of bits for
- Version
- thread number
- task number
- task version
- chief
- site
IPC-API
Operations
- send
- receive from (closed)
- receive (open)
- call (send to and receive from the callee; atomic, so that the reply cannot
be sent before the caller is ready to receive)
- send & receive (receive from anyone; an optimization: servers use it to
reply to a request and wait for the next one, which involves fewer kernel
entries/exits, as sketched below)
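A hypothetical sketch of why the combined operation pays off for servers: one
kernel entry both sends the reply and blocks for the next request. The ipc_*
interfaces below are made up for illustration and are not the real L4
system-call bindings:

typedef struct { unsigned long w0, w1, w2; } msg_t;   /* register message */
typedef unsigned long thread_t;

/* hypothetical IPC interface, declared only to show the control flow */
int ipc_receive(thread_t *sender, msg_t *request);              /* open receive */
int ipc_reply_and_wait(thread_t client, const msg_t *reply,
                       thread_t *next_sender, msg_t *next_request);
void handle(const msg_t *request, msg_t *reply);

void server_loop(void)
{
    thread_t client;
    msg_t request, reply;

    ipc_receive(&client, &request);            /* wait for the first request */
    for (;;) {
        handle(&request, &reply);
        /* one system call instead of a separate send and receive */
        ipc_reply_and_wait(client, &reply, &client, &request);
    }
}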
Message Types
- register
- direct string (up to 2 MB)
- indirect strings (up to 15 x 4 MB, supports scatter/gather, important for
example for protocol stack implementations: add header and packet payload)
- map pages
IPC-ABI
Parameters are in registers, wherever possible.
- 3 registers for register messages (EDI, EDX, EBX).
- send descriptor (EAX)
- map msg bit
- deceiving bit (for virtual ids (clans & chiefs!))
- 30 bit pointer (0 for mapping and register messages, otherwise message
buffer pointer)
- receive descriptor (EBP)
- map msg bit
- open wait bit
- 30 bit pointer (0 for register messages, otherwise message buffer pointer)
The message buffer is a memory region which contains some parameters, space
for the direct string, and pointers to (and space for) additional indirect
string buffers.
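An illustrative encoding of these descriptors; the exact bit positions are an
assumption of this sketch, only the idea (flag bits in the low bits of a
4-byte-aligned, i.e. effectively 30-bit, buffer pointer) comes from the notes:

#include <stdint.h>
#include <assert.h>

#define DESC_MAP_BIT    (1u << 0)   /* assumed position of the map-msg bit */
#define DESC_DECEIT_BIT (1u << 1)   /* send descriptor: deceiving bit */
#define DESC_OPEN_BIT   (1u << 1)   /* receive descriptor: open-wait bit */

/* Message buffers are 4-byte aligned, so the two low bits are free. A zero
   pointer stands for a pure register (or map) message. */
static uint32_t send_descriptor(uint32_t msg_buf, int map, int deceit)
{
    assert((msg_buf & 3) == 0);
    return msg_buf | (map ? DESC_MAP_BIT : 0)
                   | (deceit ? DESC_DECEIT_BIT : 0);      /* -> EAX */
}

static uint32_t recv_descriptor(uint32_t msg_buf, int map, int open_wait)
{
    assert((msg_buf & 3) == 0);
    return msg_buf | (map ? DESC_MAP_BIT : 0)
                   | (open_wait ? DESC_OPEN_BIT : 0);     /* -> EBP */
}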
5 IPC API
Timeouts
L4 Version 2, X.0
- snd timeout (given by the sender: time until the message transfer starts),
rcv timeout (analogous, for the receiver)
- snd-pf timeout (how much time is allowed per pagefault during the transfer),
rcv-pf timeout (against DoS through the pager)
- 1us - 19h
L4 Version X.1, 4
- snd timeout, rcv timeout, xfer timeout (time from the start of the message
transfer to its end, i.e. an absolute timeout for all pagefaults while
transferring the message)
- 1us - 610h (relative timeout values)
- absolute timeout values
6 IPC Implementation: Short IPC, Long IPC
Critical Path
- System call pre (disable intr)
- identify dest thread and check
- same chief/ no ipc redirection
- ready to receive
- analyze msg and transfer
- short: no action
- long: copy strings, see extra section
- switch to dest thread & address space
- system call post (enable intr)
Short IPC
- call: the sender goes into the waiting-to-receive state, the receiver
becomes running, switch to the receiver
- send: both become running; lazy send: don't switch to the receiver
immediately, i.e. decouple the scheduling decision from the IPC event
- the message consists of 3 registers which stay unchanged during the operation
Long IPC
- enable interrupt for message transfer (interrupts and preemptions are possible)
- lock both partners
- mem copy (pagefaults are possible)
- copy in/copy out
- 2 x 2n r/w operations ( 2n r/w copy in, 2n r/w copy out)
- 3 x n/8 cache lines (both AS, kernel area)
- temporary mapping (TM)
- 2n operations
- 2 x n/8 cache lines
- map dest into source kernel area
- copy
- on leave (dispatch): invalidate the PTEs and flush the TLB (another thread
might want to use the kernel TM area and the mappings might change)
- on return (dispatch): page fault: re-establish the (possibly changed) mapping
- optimization: don't flush if it is the same AS, don't flush if the AS's TLB
entries have just been flushed anyway
- handling a pagefault in the partner's AS: tunnel to the partner (receiver),
access the area, tunnel back
- SMP:
- TM area per processor
- page table per processor
7 Small Address Spaces
Context-switch costs (untagged TLB)
- enter/exit kernel: 20...200 cycles
- switch thread: 10 cycles
- switch AS: flush TLB (50 cycles) + refill TLB (9...96 refills at 15...40
cycles/refill, i.e. 100...4000 cycles)
Small AS on x86 - emulating a tagged TLB
Several L4 ASes are packed into one hardware AS (HwAS), separated via
segmentation. All HwASes share the same small ASes in a dedicated region.
                            method                         source offset     destination offset
large -> large              temp mapping, switch HwAS      0                 kernel com area
large -> small              direct (no HwAS switch)        0                 dest AS base
large -> same               direct (no HwAS switch)        0                 0
small -> same               direct (no HwAS switch)        source AS base    source AS base
small -> small              direct (no HwAS switch)        source AS base    dest AS base
small -> current large      direct (no HwAS switch)        source AS base    0
small -> non-current large  temp mapping, switch HwAS      source AS base    kernel com area
                            or: switch HwAS, then direct   source AS base    0
TCB additions:
- partner
- waddr
- HwAS
- AS base
- AS size
Potential problem during small -> large IPC: the IPC is interrupted, the HwAS
changes, and when the IPC is resumed it runs in the wrong HwAS. Therefore a
small AS doing such an IPC must be marked as operating in the partner's space
so that the kernel switches back to the correct HwAS.
8 IPC Abort
During a long IPC the sender is locked running, the receiver is locked
waiting, and the pagers are waiting. Nested IPC (IPC during IPC) is needed on
a pagefault, which requires a per-thread IPC stack. The stack size is bounded:
only one short IPC (which is atomic) to the pager can be nested within a
normal IPC.
If the pagefault IPC fails, its IPC state is popped, i.e. the pagefault IPC is
cancelled. Then the real IPC is aborted, i.e. A and B get a completion code
of "aborted".
Abort is for threads currently sending/receiving a message (the transfer has
started); cancel is for threads still waiting to receive/send a message (the
transfer has not yet started). The pagefault IPC can be cancelled because it
is a short (atomic) IPC.
9 Dispatching, Timeouts
Thread Switch
- Switch to(), i.e. switch to the dispatcher/idle thread
- Switch to( Thread ), see the thread-switch section above
Dispatcher/Idle thread
- no user mode (never returns to usermode)
- no own AS
- no id (=> you can't send IPCs to it)
- almost stateless
- stack is discarded on interruption
- knows highest active priority
- is the idle thread
tcb[A].sp = SP                   // save stack pointer (SP) of A
SP = dispatcher stack bottom     // switch from A to the dispatcher stack
B = select next ready thread
if B == nil
    idle                         // on the next scheduling event, execution
                                 // restarts at the first line of this code
SP = tcb[A].sp                   // discard the dispatcher stack, restore A's SP
if B != A
    switch from A to B           // see the thread-switch section above
else
    return
Priorities, Preemption
- priority levels: 0...255
- hard priorities
- round robin per priority
- ready list per prio
- current tcb per prio (head of ready list)
- keep track of the highest active prio (otherwise one would have to check the
heads of all ready lists to see whether they hold a ready thread)
- remaining timeslice per thread: if it reaches 0, the next thread is selected
and allocated a new timeslice
Lazy dispatching
- waiting threads stay in the ready list
- a dispatch attempt takes them out of the ready list (sketched below)
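A sketch of per-priority round-robin selection with lazy removal, as described
above (data layout and names are illustrative; insertion into the ready list,
which also raises highest_active_prio, is not shown):

#include <stddef.h>

enum { PRIO_LEVELS = 256 };

typedef struct tcb tcb_t;
struct tcb {
    int    is_ready;                  /* is the thread actually runnable? */
    int    prio;
    tcb_t *ready_next, *ready_prev;   /* circular ready list per priority */
};

static tcb_t *ready_head[PRIO_LEVELS];   /* current tcb per priority */
static int    highest_active_prio;

static void ready_list_remove(tcb_t *t)
{
    if (t->ready_next == t) {                 /* last entry on this level */
        ready_head[t->prio] = NULL;
    } else {
        t->ready_prev->ready_next = t->ready_next;
        t->ready_next->ready_prev = t->ready_prev;
        if (ready_head[t->prio] == t)
            ready_head[t->prio] = t->ready_next;
    }
    t->ready_next = t->ready_prev = NULL;
}

/* Waiting threads are only taken out of the ready list here, when the
   dispatcher stumbles over them while searching for a runnable thread. */
static tcb_t *select_next_ready(void)
{
    for (int p = highest_active_prio; p >= 0; p--) {
        tcb_t *t;
        while ((t = ready_head[p]) != NULL && !t->is_ready)
            ready_list_remove(t);             /* lazy removal */
        if (t != NULL) {
            highest_active_prio = p;          /* remember for the next call */
            ready_head[p] = t->ready_next;    /* round robin: advance head */
            return t;
        }
    }
    return NULL;                              /* nothing ready: idle */
}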
Timeouts & Wakeups
Data structure operations:
- insert
- delete
- find next
- raise
An unsorted/sorted list or a sorted tree is too expensive.
Solution: separation into wakeup classes (aka a calendar queue). The list of
wakeups is divided into three intervals (t = tick interval)
- soon [ 0, s * t ]
- late ( s * t, l * t ]
- late late ( l * t, oo )
Insert puts the entry (a TCB) into the corresponding list, depending on which
interval the entry's wakeup time falls into. Find next only has to search the
soon list for pending timeouts. Every s time steps, the part of the late list
that is now within s*t of expiring has to be sorted into the soon list
("correction phase"). Likewise, those entries of the late late list whose
wakeup time now falls within the late interval are periodically sorted into
the late list.
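A simplified C sketch of these wakeup classes (the class borders and the
unsorted singly linked lists are assumptions of the sketch):

#include <stddef.h>
#include <stdint.h>

#define SOON_LIMIT 128u     /* s * t: assumed border of the soon class (ticks) */
#define LATE_LIMIT 8192u    /* l * t: assumed border of the late class (ticks) */

typedef struct wakeup wakeup_t;
struct wakeup {               /* in the kernel this would live in the tcb */
    uint64_t  time;           /* absolute wakeup time in ticks */
    wakeup_t *next;
};

static wakeup_t *soon_list, *late_list, *late_late_list;

static void push(wakeup_t **list, wakeup_t *w) { w->next = *list; *list = w; }

/* insert: choose the class by the distance of the wakeup from now */
void wakeup_insert(wakeup_t *w, uint64_t now)
{
    if (w->time <= now + SOON_LIMIT)      push(&soon_list, w);
    else if (w->time <= now + LATE_LIMIT) push(&late_list, w);
    else                                  push(&late_late_list, w);
}

/* find next: only the short soon list has to be scanned */
wakeup_t *wakeup_find_next(void)
{
    wakeup_t *min = NULL;
    for (wakeup_t *w = soon_list; w != NULL; w = w->next)
        if (min == NULL || w->time < min->time)
            min = w;
    return min;
}

/* correction phase, run every s ticks: move entries that have come close
   enough from the late list into the soon list (late late -> late works the
   same way and is omitted here) */
void wakeup_correct(uint64_t now)
{
    wakeup_t *w = late_list;
    late_list = NULL;
    while (w != NULL) {
        wakeup_t *next = w->next;
        if (w->time <= now + SOON_LIMIT) push(&soon_list, w);
        else                             push(&late_list, w);
        w = next;
    }
}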
10 Mapping, Implementation
grant and map induce a tree-like data structure: for every physical frame
(tile) there is a tree, the tile-map tree, whose root is the physical frame
and whose nodes are the address spaces that received it, connected according
to the map/grant relations.
Needed operations on this datastructure:
- map: find corresponding entry for page in mapping tree, add child
- unmap: find corresponding entry for page in mapping tree, kill child
subtree (incl. delete entry in corresponding page table)
- grant: find entry, move entry
Tree Parsing:
- avoid recursion (limited stack size in kernel!)
- solution: save the tree as a list with depth numbers, where children
directly follow their parent node
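A sketch of this representation, assuming an array of nodes annotated with
their depth (field names are illustrative): a node's subtree is exactly the
run of following entries with a larger depth, so no recursion is needed.

#include <stddef.h>

typedef struct {
    unsigned depth;     /* 0 = the physical frame (root of the tile-map tree) */
    void    *pte;       /* backlink to the page table entry (simplified) */
    int      valid;
} mapnode_t;

/* unmap: kill the subtree below nodes[i] by walking forward while the
   depth stays greater than nodes[i].depth */
void unmap_subtree(mapnode_t *nodes, size_t count, size_t i)
{
    for (size_t j = i + 1; j < count && nodes[j].depth > nodes[i].depth; j++) {
        if (nodes[j].valid) {
            nodes[j].pte = NULL;    /* real kernel: clear the PTE via the
                                       backlink and flush the TLB */
            nodes[j].valid = 0;
        }
    }
}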
Node structure:
- parent
- sibling
- child
- Mapnode pointer
Pointers between PTE and mapnode
- Mapnode pointers
- pointer from the page table to the corresponding mapping-db node.
- Stored in an area per page table (at x + 4 KB) or in a hash-table cache
- Used for access to the node during map/grant operations
- Backlink pointers
- pointer from the tile-map tree node to the corresponding PTE.
- Saved in the node; a cache is not useful
- used for unmap
Locking:
- Coarse: one lock per physical page, i.e. per tree
- Fine: locks per node:
- uplock: protects the up and backlink pointers
- downlock: protects the down pointers
- protocol:
- get the downlock, then also get the uplock; if the uplock cannot be
taken, release the downlock and retry (see the sketch below)
- you have to hold both the up- and the downlock to insert/remove a link
between two nodes, as the link information is stored in both nodes and has
to be modified atomically
To change a link you therefore have to take the locks guarding the up and the
down pointers between the linked nodes.
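A hedged sketch of this retry protocol using C11 atomic flags as spinlocks
(initialise each flag with ATOMIC_FLAG_INIT); the structure is illustrative,
not the actual L4 mapping-database code:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag uplock;     /* protects up and backlink pointers */
    atomic_flag downlock;   /* protects down pointers */
} mapnode_locks_t;

static bool try_lock(atomic_flag *l) { return !atomic_flag_test_and_set(l); }
static void unlock(atomic_flag *l)   { atomic_flag_clear(l); }

/* Both locks are needed before a link can be inserted or removed. Taking the
   downlock first and backing off if the uplock is unavailable avoids a
   deadlock with a walker that already holds the uplock. */
void lock_for_link_change(mapnode_locks_t *n)
{
    for (;;) {
        while (!try_lock(&n->downlock))
            ;                           /* spin for the downlock */
        if (try_lock(&n->uplock))
            return;                     /* got both */
        unlock(&n->downlock);           /* back off and retry */
    }
}

void unlock_after_link_change(mapnode_locks_t *n)
{
    unlock(&n->uplock);
    unlock(&n->downlock);
}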
Optimization:
Store the parent and child pointers XORed into one node entry. Depending on
the direction of traversal, one of the two pointers is already known from the
traversal's predecessor; the other can then be recovered by XORing it with the
stored value.
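A small, runnable demonstration of the XOR trick on a linear parent/child
chain (simplified; the real mapnode also has sibling pointers):

#include <stdint.h>
#include <stdio.h>

typedef struct node node_t;
struct node {
    uintptr_t parent_x_child;   /* parent XOR child (0 stands for "none") */
    int id;
};

static uintptr_t x(const node_t *a, const node_t *b)
{
    return (uintptr_t)a ^ (uintptr_t)b;
}

int main(void)
{
    node_t a = { 0, 1 }, b = { 0, 2 }, c = { 0, 3 };   /* chain a -> b -> c */
    a.parent_x_child = x(NULL, &b);
    b.parent_x_child = x(&a, &c);
    c.parent_x_child = x(&b, NULL);

    /* walk downwards: coming from prev, the other end of the link is the
       stored word XORed with prev */
    const node_t *prev = NULL, *cur = &a;
    while (cur != NULL) {
        printf("node %d\n", cur->id);
        const node_t *next =
            (const node_t *)(cur->parent_x_child ^ (uintptr_t)prev);
        prev = cur;
        cur = next;
    }
    return 0;
}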
See: LN Reference Manual V2.2, 1.1 Implementing the Model
A.1 x86 page tables
- Page sizes: pages 4 KB = 2^12, superpages 4 MB = 2^22.
- Register CR3 points to the page directory, which is one page in size and
contains 1024 entries of 4 bytes each; each entry maps a 4 MB superpage or
points to a page table.
- A page table itself is one page and contains 1024 entries for 4 KB pages.
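A small worked example of the resulting two-level translation: a 32-bit
virtual address splits into a 10-bit page-directory index, a 10-bit page-table
index and a 12-bit offset (for 4 KB pages).

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t vaddr = 0xC0801234u;                /* arbitrary example address */

    uint32_t pdir_idx = (vaddr >> 22) & 0x3FF;   /* which of the 1024 PDEs */
    uint32_t ptab_idx = (vaddr >> 12) & 0x3FF;   /* which of the 1024 PTEs */
    uint32_t offset   =  vaddr        & 0xFFF;   /* offset within the 4 KB page */

    printf("pdir %u, ptab %u, offset 0x%03x\n", pdir_idx, ptab_idx, offset);
    /* for a 4 MB superpage the PDE maps the frame directly and the low
       22 bits of the address are the offset */
    return 0;
}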
2000, 2001 Felix Hupfeld - last change: 31.05.2001 15:05