Construction of Microkernels
Lecture of Prof. J. Liedtke, summer term 2000
Summary by Felix Hupfeld
1 Outline
Some features of L4:
- Threads, IPC
- drivers at user level
- address space operation: map, unmap, grant
- page fault handling via ipc
- hierarchical construction of address spaces from an initial address space
(sigma0) via map/grant operations.
2 Threads, Thread Switch
A thread has:
- internal properties
- register set (IP (instruction pointer), SP (stack pointer), general
registers)
- stack
- status (Flags, OS-specific like priority, time, ... )
- address space
- external properties
- unique id
- communication status
Conclusions:
- Thread state must be saved/restored on thread switch - so that it does not
get lost
- We need a thread control block (tcb) per thread - to store the state
- tcbs must be kernel objects - for protection of the thread state
Operations for tcbs:
- find any thread's tcb by its uid
- find the currently executing tcb (per processor)
Thread switch:
- user mode a
- switch to kernel code and kernel stack (enter kernel)
- user mode b
Enter kernel (by int instruction, exception, or hardware interrupt)
- single instruction (intxxx or sysenter)
- push user esp on the kernel stack (pointed to by esp0), load kernel
esp0
- push user flags, reset flags
- push user eip, load kernel entry eip
- push X, which indicates error code or kernel-call type
- push registers (optional)
Kernel Stack
From the hardware perspective there are two stacks in the system: the user-mode
stack and the kernel stack. There are two different models for execution in
the kernel. One uses a single kernel stack for every entry into kernel mode
(system call, interrupt, ...), the other switches to a dedicated stack per thread.
Kernel stack
- per processor (event model aka interrupt model)
- continuations
- on switch (blocking call), store execution context and give a continuation
(pointer to a function continuing the ongoing work along with some
kind of context) to the kernel which is called when the thread can
resume. Kernel stack can be discarded. All context is in the continuations.
- inefficient and inelegant to program
- kernel stack can be discarded on switch
- stateless kernel
- no kernel threads,
- kernel not interruptible,
- inefficient system calls,
- kernel can be exchanged on the fly
- low cache footprint
- per thread (process model)
- kernel can use threads
- no special means for keeping state on interruption
- no conceptual difference between kernel mode and user mode
- difficult to exchange kernel on the fly (either no persistent tcbs or
tcbs with virtual addresses)
- larger cache footprint (minimize cache footprint!)
Per-thread stacks are feasible with a microkernel because their potential stack
size is tightly bounded (no deep nesting of calls in the kernel). Otherwise the
stack size would not be bounded and a potentially big stack per thread would be
needed.
See: Draves, Bershad, Rashid et al.: Using Continuations to Implement Thread
Management and Communication in Operating Systems
In the L4 implementation, the per-thread kernel stack sits at the bottom of the
size-aligned tcb (the current tcb is found from the alignment of the current esp0).
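A minimal C sketch of this trick, assuming an illustrative 2 KB, size-aligned
tcb (the real tcb size and field layout differ):

#include <stdint.h>
#include <stdio.h>

#define TCB_SIZE_LOG2 11                    /* assumption: 2 KB tcbs */
#define TCB_SIZE (1u << TCB_SIZE_LOG2)

typedef struct tcb {
    uint32_t myself;                        /* thread id */
    uint32_t thread_esp;                    /* saved stack pointer */
    /* ... rest of the tcb; the kernel stack occupies the remainder ... */
} tcb_t;

/* Any address inside the tcb (in particular the kernel stack pointer)
   rounded down to the tcb alignment yields the tcb itself. */
static tcb_t *current_tcb(uintptr_t kernel_sp)
{
    return (tcb_t *)(kernel_sp & ~(uintptr_t)(TCB_SIZE - 1));
}

int main(void)
{
    /* pretend the kernel stack pointer currently points near the top of
       the tcb at the (made-up) address 8 * TCB_SIZE */
    uintptr_t esp = 8 * TCB_SIZE + TCB_SIZE - 64;
    printf("tcb at %p\n", (void *)current_tcb(esp));
    return 0;
}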
Operations on Kernel entry
Kernel entry/exit on x86:
- push X
- push registers (pushad)
- ...
- popad
- add esp,4 (get rid of X)
- iretd
Resulting Stack Layout:
- Hardware:
- ESP
- flags
- CS (for segment)
- eip
- Software
X makes it possible to differentiate between stack layouts:
- ipc (w/o register push)
- interrupt, exception, some system calls (w/ register push)
- V86 mode (additional information, DS, ES, FS, GS, is saved before the above
mentioned procedure starts)
Thread Switch (x86)
- push X
- pushad
- get current tcb from alignment of esp
- switch stack pointer (store the current esp in the tcb, load the new tcb's
esp into esp)
- prepare kernel stack pointer esp0 (bottom of the new current tcb)
- popad
- add esp,4
- iretd
MIPS R4600
- no hardware stack
- special registers
- manual save of context is needed
3 TCBs, AS layout, booting
Basic TCB data structure (no AS)
- myself (the thread_id, which consists of id and version bits. Used to check
for integrity and validity of a potential tcb)
- state
- resources (like MMX stuff)
- thread_esp
- scheduling:
- timeslice
- remaining_timeslice
- priority
- cpu_clock
- present_link (all present tcbs)
- ready_link (all threads that are ready to execute)
- tcb_id (magic word)
- tcb_version
additional fields:
- send queue links (for threads waiting to send to this TCB's thread)
- wakeup queue links (see the section on timeouts below)
Size: 0.5 - 1 KB
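An illustrative C sketch of the tcb fields listed above (field types, names
and sizes are assumptions for readability; the real kernel fixes the layout at
assembly-level offsets):

#include <stdint.h>

typedef uint32_t l4_threadid_t;           /* id + version bits */
typedef struct tcb tcb_t;

struct tcb {
    l4_threadid_t myself;                 /* used to validate a potential tcb */
    uint32_t      state;
    uint32_t      resources;              /* e.g. FPU/MMX in use */
    void         *thread_esp;             /* saved kernel stack pointer */

    /* scheduling */
    uint32_t      timeslice;
    uint32_t      remaining_timeslice;
    uint8_t       priority;
    uint64_t      cpu_clock;

    /* doubly linked queues */
    tcb_t *present_next, *present_prev;   /* all present tcbs */
    tcb_t *ready_next,   *ready_prev;     /* ready-to-execute list */
    tcb_t *send_next,    *send_prev;      /* threads waiting to send to us */
    tcb_t *wakeup_next,  *wakeup_prev;    /* wakeup/timeout list */

    uint32_t tcb_id;                      /* magic word */
    uint32_t tcb_version;
    /* the per-thread kernel stack fills the rest of the 0.5-1 KB tcb */
};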
ThreadID -> TCB
- direct table: tcb_addr[thread_id]
- hash table: tcb_addr[thread_id] (aka physical tcb)
- like the direct table, but with an additional compare to check whether we
really found the right tcb (pipeline conflict!, slow)
- no MMU
- table access per tcb
- TLB entry for table
- tcb pointer array, requires 1 MB for 256k threads (4-byte pointers * 256k
threads)
- Trick: map unused parts of the tcb array to a 0-tcb page; an access will
result in "invalid thread id", so the size of the tcb array depends on the
maximum thread id
- TLB grows: 4 entries for 4000 threads (4000 threads = 4000 pointers of
4 bytes each = 16 KB = 4 pages of 4 KB each = 4 TLB entries; the 0-tcb pages
need 4 KB page-table entries; a version without 0-tcbs could work with a
4 MB page)
- direct address: tcb_addr = thread_id << ... & mask (to strip the
version bits), aka virtual tcb
- MMU (because not every unused tcb should use real physical memory)
- no table access
- TLB entry per tcb
- TCB array, requires 128 MB for 256k threads
- TLB grows: 1 TLB entry per 8 threads, i.e. one 4 KB page covers 8 tcbs of
512 bytes each
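A sketch of the virtual-tcb (direct address) scheme; the base address, tcb
size and bit split are illustrative assumptions, not the real L4 layout:

#include <stdint.h>

#define TCB_AREA_BASE 0xE0000000u   /* assumed kernel virtual region for tcbs */
#define TCB_BITS      9             /* assumption: 512-byte tcbs */
#define VERSION_BITS  7             /* assumption: low id bits hold the version */

static uintptr_t tcb_address(uint32_t thread_id)
{
    uint32_t thread_no = thread_id >> VERSION_BITS;      /* strip the version */
    return TCB_AREA_BASE + ((uintptr_t)thread_no << TCB_BITS);
}

/* Validity is then checked by comparing tcb->myself with thread_id; an
   unused slot is backed by the shared 0-tcb page and fails this check. */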
Layout of initial AS
- user regions
- shared system regions
- physical memory
- kernel code
- tcbs
- other kernel tables
- per space system regions
Current Limitations:
- 256k or 512k threads (the tcbs need virtual memory in the kernel part of the
address space, so the available user part is smaller)
- 64 or 128 MB of physical memory
Variations:
- With physical tcbs (table lookup) there is no need to map a tcb area; the
tcbs are accessible via the physical-memory mapping
- If physical memory grows, a full mapping in the physical-memory region is no
longer possible. Not all of the physical memory has to be mapped, only the
part where the page tables are.
Booting
L4, sigma0, L4 init and the OS modules are loaded into physical memory. L4
init is run and destroyed, sigma0 constructs the initial AS, and the pagers of
the OS construct further ASes:
- L4 init
- basic memory
- build initial PTAB
- turn address translation on
- basic VM
- exception handling
- build IDT
- enter dummy handlers
- processor (+coprocessor)
- check processor
- enable features like 4M pages, ...
- initialize TSS (Task State Segment): ESP0, IOPBM (io permission bitmap)
- init. coproc.
- TCBs
- initialize first tcb
- switch to this tcb's kernel stack (make the current "thread"
the first thread)
- install pf handler for tcbs
- dispatching
- init ready queue (current is the only one)
- init wakeup queue (empty)
- install switch system call handler
- hardware interrupts
- init all hardware interrupts as detached, i.e. no thread associated
- create sigma0 and root task
- release init code
- start sigma0 and root task
4 Clans & Chiefs, IPC API
Clans & Chiefs
A clan consists of a number of tasks, one of which is designated as the
chief. Intra-clan IPC is delivered directly by the microkernel. IPC to other
clans (inter-clan IPC) is redirected to the chief of the clan, who may forward
the message to the chief of the other clan, who in turn forwards the message
to the receiver. The chief acts as the sender of the message and thus
deceives the receiver of the message. All redirections are direction-preserving:
chiefs may only pretend that a message comes from outside for messages going
into the clan, and that it comes from inside for messages within the clan.
This results in certain trust relationships: clan members can always trust
their chief, and so on.
Applications of this concept:
- remote IPC (chief forwards message via network)
- multi level security
- debugging
- heterogeneity (conversion between big and little endian by the chief, protocol conversion)
- migration
Drawbacks:
- Only one hierarchy
- Inefficient because of the many copies involved
Messages must be copied, otherwise an attack called "TOCTOU" (time of check to
time of use) is possible: after the message has been checked (e.g. by the
chief), it can still be changed by the sender as it is still located in user
space.
The Clans & Chiefs model is outdated; the newer model uses a mechanism called
IPC redirection.
Thread IDs
Varying number of bits for
- Version
- thread number
- task number
- task version
- chief
- site
IPC-API
Operations
- send
- receive from (closed)
- receive (open)
- call (send to and receive from the callee; atomic, so that the reply cannot
be sent before the caller is ready to receive)
- send & receive (receive from anyone; an optimization: servers use it to
reply to a request and wait for the next one, which involves fewer kernel
entries/exits, as sketched below)
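A hypothetical sketch of why the combined operation pays off for servers: one
kernel entry both sends the reply and blocks for the next request. The ipc_*
interfaces below are made up for illustration and are not the real L4
system-call bindings:

typedef struct { unsigned long w0, w1, w2; } msg_t;   /* register message */
typedef unsigned long thread_t;

/* hypothetical IPC interface, declared only to show the control flow */
int ipc_receive(thread_t *sender, msg_t *request);              /* open receive */
int ipc_reply_and_wait(thread_t client, const msg_t *reply,
                       thread_t *next_sender, msg_t *next_request);
void handle(const msg_t *request, msg_t *reply);

void server_loop(void)
{
    thread_t client;
    msg_t request, reply;

    ipc_receive(&client, &request);            /* wait for the first request */
    for (;;) {
        handle(&request, &reply);
        /* one system call instead of a separate send and receive */
        ipc_reply_and_wait(client, &reply, &client, &request);
    }
}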
Message Types
- register
- direct string (up to 2 MB)
- indirect strings (up to 15 x 4 MB, supports scatter/gather, important for
example for protocol stack implementations: add header and packet payload)
- map pages
IPC-ABI
Parameters are in registers, wherever possible.
- 3 registers for register messages (EDI, EDX, EBX).
- send descriptor (EAX)
- map msg bit
- deceiving bit (for virtual ids (clans & chiefs!))
- 30 bit pointer (0 for mapping and register messages, otherwise message
buffer pointer)
- receive descriptor (EBP)
- map msg bit
- open wait bit
- 30 bit pointer (0 for register messages, otherwise message buffer pointer)
The message buffer is a memory region which contains some parameters, space
for the direct string, and pointers to (and space for) additional indirect
string buffers.
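An illustrative encoding of these descriptors; the exact bit positions are an
assumption of this sketch, only the idea (flag bits in the low bits of a
4-byte-aligned, i.e. effectively 30-bit, buffer pointer) comes from the notes:

#include <stdint.h>
#include <assert.h>

#define DESC_MAP_BIT    (1u << 0)   /* assumed position of the map-msg bit */
#define DESC_DECEIT_BIT (1u << 1)   /* send descriptor: deceiving bit */
#define DESC_OPEN_BIT   (1u << 1)   /* receive descriptor: open-wait bit */

/* Message buffers are 4-byte aligned, so the two low bits are free. A zero
   pointer stands for a pure register (or map) message. */
static uint32_t send_descriptor(uint32_t msg_buf, int map, int deceit)
{
    assert((msg_buf & 3) == 0);
    return msg_buf | (map ? DESC_MAP_BIT : 0)
                   | (deceit ? DESC_DECEIT_BIT : 0);      /* -> EAX */
}

static uint32_t recv_descriptor(uint32_t msg_buf, int map, int open_wait)
{
    assert((msg_buf & 3) == 0);
    return msg_buf | (map ? DESC_MAP_BIT : 0)
                   | (open_wait ? DESC_OPEN_BIT : 0);     /* -> EBP */
}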
5 IPC API
Timeouts
L4 Version 2, X.0
- snd timeout (given by the sender: time until the message transfer starts),
rcv timeout (analogous, for the receiver)
- snd-pf timeout (how much time is allowed per pagefault during the transfer),
rcv-pf timeout (against DoS through the pager)
- 1us - 19h
L4 Version X.1, 4
- snd timeout, rcv timeout, xfer timeout (time from the start of the message
transfer to its end, i.e. an absolute timeout for all pagefaults while
transferring the message)
- 1us - 610h (relative timeout values)
- absolute timeout values
6 IPC Implementation: Short IPC, Long IPC
Critical Path
- System call pre (disable intr)
- identify dest thread and check
- same chief/ no ipc redirection
- ready to receive
- analyze msg and transfer
- short: no action
- long: copy strings, see extra section
- switch to dest thread & address space
- system call post (enable intr)
Short IPC
- call: the sender goes into the waiting-to-receive state, the receiver
becomes running, switch to the receiver
- send: both become running; lazy send: don't switch to the receiver
immediately, i.e. decouple the scheduling decision from the IPC event
- the message consists of 3 registers which stay unchanged during the operation
Long IPC
- enable interrupt for message transfer (interrupts and preemptions are possible)
- lock both partners
- mem copy (pagefaults are possible)
- copy in/copy out
- 2 x 2n r/w operations ( 2n r/w copy in, 2n r/w copy out)
- 3 x n/8 cache lines (both AS, kernel area)
- temporary mapping (TM)
- 2n operations
- 2 x n/8 cache lines
- map dest into source kernel area
- copy
- on leave (dispatch): invalidate the PTEs and flush the TLB (another thread
might want to use the kernel TM area and the mappings might change)
- on return (dispatch): page fault: re-establish the (possibly changed) mapping
- optimization: don't flush if it is the same AS, don't flush if the AS's TLB
entries have just been flushed anyway
- handling a pagefault in the partner's AS: tunnel to the partner (receiver),
access the area, tunnel back
- SMP:
- TM area per processor
- page table per processor
7 Small Address Spaces
Context-switch costs (untagged TLB)
- enter/exit kernel: 20...200 cycles
- switch thread: 10 cycles
- switch AS: flush TLB (50 cycles) + refill TLB (9...96 refills at 15...40
cycles/refill, i.e. 100...4000 cycles)
Small AS on x86 - emulating a tagged TLB
Several L4 ASes are packed into one hardware AS (HwAS), separated via
segmentation. All HwASes share the same small ASes in a dedicated region.
                            method                         source offset     destination offset
large -> large              temp mapping, switch HwAS      0                 kernel com area
large -> small              direct (no HwAS switch)        0                 dest AS base
large -> same               direct (no HwAS switch)        0                 0
small -> same               direct (no HwAS switch)        source AS base    source AS base
small -> small              direct (no HwAS switch)        source AS base    dest AS base
small -> current large      direct (no HwAS switch)        source AS base    0
small -> non-current large  temp mapping, switch HwAS      source AS base    kernel com area
                            or: switch HwAS, then direct   source AS base    0
TCB additions:
- partner
- waddr
- HwAS
- AS base
- AS size
Potential problem during small -> large IPC: the IPC is interrupted, the HwAS
changes, and when the IPC is resumed it runs in the wrong HwAS. Therefore a
small AS doing such an IPC must be marked as operating in the partner's space
so that the kernel switches back to the correct HwAS.
8 IPC Abort
During a long IPC the sender is locked running, the receiver is locked
waiting, and the pagers are waiting. Nested IPC (IPC during IPC) is needed on
a pagefault, which requires a per-thread IPC stack. The stack size is bounded:
only one short IPC (which is atomic) to the pager can be nested within a
normal IPC.
If the pagefault IPC fails, its IPC state is popped, i.e. the pagefault IPC is
cancelled. Then the real IPC is aborted, i.e. A and B get a completion code
of "aborted".
Abort is for threads currently sending/receiving a message (the transfer has
started); cancel is for threads still waiting to receive/send a message (the
transfer has not yet started). The pagefault IPC can be cancelled because it
is a short (atomic) IPC.
9 Dispatching, Timeouts
Thread Switch
- Switch to(), i.e. switch to the dispatcher/idle thread
- Switch to( Thread ), see the thread-switch section above
Dispatcher/Idle thread
- no user mode (never returns to usermode)
- no own AS
- no id (=> you can't send IPCs to it)
- almost stateless
- stack is discarded on interruption
- knows highest active priority
- is the idle thread
tcb[A].sp = SP                   // save stack pointer (SP) of A
SP = dispatcher stack bottom     // switch from A to the dispatcher stack
B = select next ready thread
if B == nil
    idle                         // on the next scheduling event, execution
                                 // restarts at the first line of this code
SP = tcb[A].sp                   // discard the dispatcher stack, restore A's SP
if B != A
    switch from A to B           // see the thread-switch section above
else
    return
Priorities, Preemption
- priority levels: 0...255
- hard priorities
- round robin per priority
- ready list per prio
- current tcb per prio (head of ready list)
- keep track of the highest active prio (otherwise one would have to check the
heads of all ready lists to see whether they hold a ready thread)
- remaining timeslice per thread: if it reaches 0, the next thread is selected
and allocated a new timeslice
Lazy dispatching
- waiting threads stay in the ready list
- a dispatch attempt takes them out of the ready list (sketched below)
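A sketch of per-priority round-robin selection with lazy removal, as described
above (data layout and names are illustrative; insertion into the ready list,
which also raises highest_active_prio, is not shown):

#include <stddef.h>

enum { PRIO_LEVELS = 256 };

typedef struct tcb tcb_t;
struct tcb {
    int    is_ready;                  /* is the thread actually runnable? */
    int    prio;
    tcb_t *ready_next, *ready_prev;   /* circular ready list per priority */
};

static tcb_t *ready_head[PRIO_LEVELS];   /* current tcb per priority */
static int    highest_active_prio;

static void ready_list_remove(tcb_t *t)
{
    if (t->ready_next == t) {                 /* last entry on this level */
        ready_head[t->prio] = NULL;
    } else {
        t->ready_prev->ready_next = t->ready_next;
        t->ready_next->ready_prev = t->ready_prev;
        if (ready_head[t->prio] == t)
            ready_head[t->prio] = t->ready_next;
    }
    t->ready_next = t->ready_prev = NULL;
}

/* Waiting threads are only taken out of the ready list here, when the
   dispatcher stumbles over them while searching for a runnable thread. */
static tcb_t *select_next_ready(void)
{
    for (int p = highest_active_prio; p >= 0; p--) {
        tcb_t *t;
        while ((t = ready_head[p]) != NULL && !t->is_ready)
            ready_list_remove(t);             /* lazy removal */
        if (t != NULL) {
            highest_active_prio = p;          /* remember for the next call */
            ready_head[p] = t->ready_next;    /* round robin: advance head */
            return t;
        }
    }
    return NULL;                              /* nothing ready: idle */
}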
Timeouts & Wakeups
Data structure operations:
- insert
- delete
- find next
- raise
An unsorted/sorted list or a sorted tree is too expensive.
Solution: separation into wakeup classes (aka a calendar queue). The list of
wakeups is divided into three intervals (t = tick interval)
- soon [ 0, s * t ]
- late ( s * t, l * t ]
- late late ( l * t, oo )
Insert puts the entry (a TCB) into the corresponding list, depending on which
interval the entry's wakeup time falls into. Find next only has to search the
soon list for pending timeouts. Every s time steps, the part of the late list
that is now within s*t of expiring has to be sorted into the soon list
("correction phase"). Likewise, those entries of the late late list whose
wakeup time now falls within the late interval are periodically sorted into
the late list.
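A simplified C sketch of these wakeup classes (the class borders and the
unsorted singly linked lists are assumptions of the sketch):

#include <stddef.h>
#include <stdint.h>

#define SOON_LIMIT 128u     /* s * t: assumed border of the soon class (ticks) */
#define LATE_LIMIT 8192u    /* l * t: assumed border of the late class (ticks) */

typedef struct wakeup wakeup_t;
struct wakeup {               /* in the kernel this would live in the tcb */
    uint64_t  time;           /* absolute wakeup time in ticks */
    wakeup_t *next;
};

static wakeup_t *soon_list, *late_list, *late_late_list;

static void push(wakeup_t **list, wakeup_t *w) { w->next = *list; *list = w; }

/* insert: choose the class by the distance of the wakeup from now */
void wakeup_insert(wakeup_t *w, uint64_t now)
{
    if (w->time <= now + SOON_LIMIT)      push(&soon_list, w);
    else if (w->time <= now + LATE_LIMIT) push(&late_list, w);
    else                                  push(&late_late_list, w);
}

/* find next: only the short soon list has to be scanned */
wakeup_t *wakeup_find_next(void)
{
    wakeup_t *min = NULL;
    for (wakeup_t *w = soon_list; w != NULL; w = w->next)
        if (min == NULL || w->time < min->time)
            min = w;
    return min;
}

/* correction phase, run every s ticks: move entries that have come close
   enough from the late list into the soon list (late late -> late works the
   same way and is omitted here) */
void wakeup_correct(uint64_t now)
{
    wakeup_t *w = late_list;
    late_list = NULL;
    while (w != NULL) {
        wakeup_t *next = w->next;
        if (w->time <= now + SOON_LIMIT) push(&soon_list, w);
        else                             push(&late_list, w);
        w = next;
    }
}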
10 Mapping, Implementation
grant and map induce a tree-like data structure: for every physical frame
(tile) there is a tree, the tile-map tree, whose root is the physical frame
and whose nodes are the address spaces that received it, connected according
to the map/grant relations.
Needed operations on this datastructure:
- map: find corresponding entry for page in mapping tree, add child
- unmap: find corresponding entry for page in mapping tree, kill child
subtree (incl. delete entry in corresponding page table)
- grant: find entry, move entry
Tree Parsing:
- avoid recursion (limited stack size in kernel!)
- solution: save the tree as a list with depth numbers, where children
directly follow their parent node
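A sketch of this representation, assuming an array of nodes annotated with
their depth (field names are illustrative): a node's subtree is exactly the
run of following entries with a larger depth, so no recursion is needed.

#include <stddef.h>

typedef struct {
    unsigned depth;     /* 0 = the physical frame (root of the tile-map tree) */
    void    *pte;       /* backlink to the page table entry (simplified) */
    int      valid;
} mapnode_t;

/* unmap: kill the subtree below nodes[i] by walking forward while the
   depth stays greater than nodes[i].depth */
void unmap_subtree(mapnode_t *nodes, size_t count, size_t i)
{
    for (size_t j = i + 1; j < count && nodes[j].depth > nodes[i].depth; j++) {
        if (nodes[j].valid) {
            nodes[j].pte = NULL;    /* real kernel: clear the PTE via the
                                       backlink and flush the TLB */
            nodes[j].valid = 0;
        }
    }
}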
Node structure:
- parent
- sibling
- child
- Mapnode pointer
Pointers between PTE and mapnode
- Mapnode pointers
- pointer from the page table to the corresponding mapping-db node.
- Stored in an area per page table (at x + 4 KB) or in a hash-table cache
- Used for access to the node during map/grant operations
- Backlink pointers
- pointer from the tile-map tree node to the corresponding PTE.
- Saved in the node; a cache is not useful
- used for unmap
Locking:
- Coarse: one lock per physical page, i.e. per tree
- Fine: locks per node:
- uplock: protects the up and backlink pointers
- downlock: protects the down pointers
- protocol:
- get the downlock, then also get the uplock; if the uplock cannot be
taken, release the downlock and retry (see the sketch below)
- you have to hold both the up- and the downlock to insert/remove a link
between two nodes, as the link information is stored in both nodes and has
to be modified atomically
To change a link you therefore have to take the locks guarding the up and the
down pointers between the linked nodes.
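A hedged sketch of this retry protocol using C11 atomic flags as spinlocks
(initialise each flag with ATOMIC_FLAG_INIT); the structure is illustrative,
not the actual L4 mapping-database code:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag uplock;     /* protects up and backlink pointers */
    atomic_flag downlock;   /* protects down pointers */
} mapnode_locks_t;

static bool try_lock(atomic_flag *l) { return !atomic_flag_test_and_set(l); }
static void unlock(atomic_flag *l)   { atomic_flag_clear(l); }

/* Both locks are needed before a link can be inserted or removed. Taking the
   downlock first and backing off if the uplock is unavailable avoids a
   deadlock with a walker that already holds the uplock. */
void lock_for_link_change(mapnode_locks_t *n)
{
    for (;;) {
        while (!try_lock(&n->downlock))
            ;                           /* spin for the downlock */
        if (try_lock(&n->uplock))
            return;                     /* got both */
        unlock(&n->downlock);           /* back off and retry */
    }
}

void unlock_after_link_change(mapnode_locks_t *n)
{
    unlock(&n->uplock);
    unlock(&n->downlock);
}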
Optimization:
Store the parent and child pointers XORed into one node entry. Depending on
the direction of traversal, one of the two pointers is already known from the
traversal's predecessor; the other can then be recovered by XORing it with the
stored value.
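A small, runnable demonstration of the XOR trick on a linear parent/child
chain (simplified; the real mapnode also has sibling pointers):

#include <stdint.h>
#include <stdio.h>

typedef struct node node_t;
struct node {
    uintptr_t parent_x_child;   /* parent XOR child (0 stands for "none") */
    int id;
};

static uintptr_t x(const node_t *a, const node_t *b)
{
    return (uintptr_t)a ^ (uintptr_t)b;
}

int main(void)
{
    node_t a = { 0, 1 }, b = { 0, 2 }, c = { 0, 3 };   /* chain a -> b -> c */
    a.parent_x_child = x(NULL, &b);
    b.parent_x_child = x(&a, &c);
    c.parent_x_child = x(&b, NULL);

    /* walk downwards: coming from prev, the other end of the link is the
       stored word XORed with prev */
    const node_t *prev = NULL, *cur = &a;
    while (cur != NULL) {
        printf("node %d\n", cur->id);
        const node_t *next =
            (const node_t *)(cur->parent_x_child ^ (uintptr_t)prev);
        prev = cur;
        cur = next;
    }
    return 0;
}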
See: LN Reference Manual V2.2, 1.1 Implementing the Model
A.1 x86 page tables
- Page sizes: pages 4 KB = 2^12, superpages 4 MB = 2^22.
- Register CR3 points to the page directory, which is one page in size and
contains 1024 entries of 4 bytes each; each entry maps a 4 MB superpage or
points to a page table.
- A page table itself is one page and contains 1024 entries for 4 KB pages.
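A small worked example of the resulting two-level translation: a 32-bit
virtual address splits into a 10-bit page-directory index, a 10-bit page-table
index and a 12-bit offset (for 4 KB pages).

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t vaddr = 0xC0801234u;                /* arbitrary example address */

    uint32_t pdir_idx = (vaddr >> 22) & 0x3FF;   /* which of the 1024 PDEs */
    uint32_t ptab_idx = (vaddr >> 12) & 0x3FF;   /* which of the 1024 PTEs */
    uint32_t offset   =  vaddr        & 0xFFF;   /* offset within the 4 KB page */

    printf("pdir %u, ptab %u, offset 0x%03x\n", pdir_idx, ptab_idx, offset);
    /* for a 4 MB superpage the PDE maps the frame directly and the low
       22 bits of the address are the offset */
    return 0;
}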
2000, 2001 Felix Hupfeld - last change: 31.05.2001 15:05