# µ-Kernel Construction (4)

## **IPC Functionality & Interface**

**IPC Primitives**

- Send to (a specified thread)
- Receive from (a specified thread)

- Two threads communicate
- No interference from other threads
- Other threads block until it's their turn
- Problem
  - How to communicate with a thread unknown a priori (e.g., a server's clients)

**IPC Primitives** 

- Send to (a specified thread)
- Receive from (a specified thread)
- Receive (from any thread)

- Scenario
  - A client thread sends a message to a server expecting a response
  - The server replies expecting the client thread to be ready to receive
- Problem
  - The client might be preempted between the send to and receive from

**IPC Primitives** 

- Send to (a specified thread)
- Receive from (a specified thread)
- Receive (from any thread)

- Call (send to, receive from specified thread)
  - Atomic operation to ensure that the server's (callee's) reply cannot arrive before the client (caller) is ready to receive
- Send to & Receive (send to, receive from any thread)
  - Atomic operation for optimization reasons; typically used by servers to reply and wait for the next request (from anyone)
- Send to & Receive from (send to, receive from specified different thread)

Are other combinations appropriate?



- Strings (optional)
  - In-memory messages copied from sender to receiver
  - May incur user-level page faults during copy operation
- Mappings (optional)
  - Messages that map pages from sender to receiver
  - Can map other resources too



### Operations

- Send to
- Receive from
- Receive
- Call
- Send to & Receive
- Send to & Receive from

- Message Types
  - Registers
  - Strings
  - Mappings



- How do we deal with threads that are
  - Uncooperative
  - Malfunctioning
  - Malicious?
- How to prevent an IPC operation from never completing?



- snd timeout, rcv timeout
  - snd-pf timeout
    - specified by sender
    - attack through receiver's pager
  - snd-pf / rcv-pf timeout
    - specified by receiver
    - attack through sender's pager





- Worst case IPC transfer time is high
  - Potential worst case is a page fault per memory access
    - IPC time = send timeout + $n \times$ page fault timeout
  - Worst case for a careless implementation is unbounded
    - Pager might respond with a null mapping that does not resolve the fault
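
The bound above can be written out as a small helper. This is only an illustrative sketch: the function name and the choice of microseconds as the unit are assumptions, not part of any kernel interface.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound on the IPC time from the slide:
 *   send timeout + n * page-fault timeout
 * All values are in microseconds (an assumed unit for this sketch).
 */
static inline uint64_t worst_case_ipc_us(uint64_t snd_timeout_us,
                                         uint64_t n_faults,
                                         uint64_t pf_timeout_us)
{
    /* Guard against 64-bit overflow of the product and the sum. */
    assert(n_faults == 0 ||
           pf_timeout_us <= (UINT64_MAX - snd_timeout_us) / n_faults);
    return snd_timeout_us + n_faults * pf_timeout_us;
}
```

For example, a 1 ms send timeout with 1024 faults of up to 10 ms each already gives a bound of more than ten seconds, which is why a single transfer timeout for the whole copy is attractive.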



snd timeout, rcv timeout, xfer timeout snd, xfer timeout rcv



(specified by the partner thread)



- What timeout values are typical or necessary?
- How do we encode timeouts to minimize space needed to specify all four values?

- Timeout values
  - $\infty$ (infinite)
    - Client waiting for a (trusted) server
  - 0 (zero)
    - Server responding to a client
    - Polling
  - Specific time
    - 1 µs – 610 h (log)






snd timeout, rcv timeout, xfer timeout snd, xfer timeout rcv

- relative timeout values
  - 0
  - infinite
  - 1 µs ... 610 h (log)





#### Timeouts (vX.2, v4)

snd timeout, rcv timeout, xfer timeout snd, xfer timeout rcv

#### relative timeout values





- User gives absolute timeouts relative to the current epoch (:= all but the least significant 10+e bits of clock).
- Kernel computes absolute timeout via "(clock' & (~0ull << (10+e))) | (m << e)", i.e., "epoch' | (m << e)".
  - The clock readings of the client and the kernel are different!
- (a) Timeout 09:50, clock 09:45 => epoch 09:00 => delta := m << e = 50'
  - Kernel reached at clock' 09:48 => epoch' 09:00 => timeout 09:50 (ok)
- (b) Timeout 10:12, clock 09:55 => epoch 09:00 => delta 1:12 (must be able to specify "next epoch")
  - (1) Kernel reached at clock' 09:59 => epoch' 09:00 => timeout 10:12 (ok)
  - (2) Kernel reached at clock' 10:01 => epoch' 10:00 => timeout 11:12 (wrong)

Instead of specifying "this vs. next epoch" specify least significant bit (LSB) of target epoch:

- (a) Timeout 09:50, clock 09:45 => epoch 09:00 => c = LSB(09) == 1, delta 50'
  - Kernel reached at clock' 09:48 => epoch' 09:00 => LSB(09) == 1 == c => epoch'' 09:00 => timeout 09:50 (ok)
- (b) Timeout 10:12, clock 09:55 => epoch 09:00 => c = LSB(10) == 0, delta 12'
  - (1) Kernel reached at clock' 09:59 => epoch' 09:00 => LSB(09) == 1 != c => epoch'' 10:00 => timeout 10:12 (ok)
  - (2) Kernel reached at clock' 10:01 => epoch' 10:00 => LSB(10) == 0 == c => epoch'' 10:00 => timeout 10:12 (ok)
- Errors occur only if the epoch changes more than once between the client and the kernel reading the clock, i.e., if more than one complete epoch ((1 << (10+e)) µs ≈ (1 << e) ms) passed in between.
- As can be seen in (b1), using c is different from using more bits for the delta (effectively specifying the LSB of the target epoch): epoch' is 09, having delta include the LSB would decrease this to 08 (LSB forced to 0); considering c != LSB(09) increases epoch' to 10.
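
The LSB scheme can be sketched in C as follows. The struct and function names are hypothetical, and the 10-bit mantissa `m` is inferred from the "10+e" in the formula; the clock is assumed to count microseconds. The kernel side applies the slide's rule: if the LSB of its own epoch differs from `c`, the target lies in the next epoch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for an absolute timeout: exponent, epoch LSB, mantissa. */
struct abs_timeout {
    unsigned e;   /* exponent: granularity is 2^e microseconds */
    unsigned c;   /* least significant bit of the target epoch number */
    uint64_t m;   /* 10-bit offset within the epoch, in units of 2^e us */
};

/* User side: encode an absolute target time (low e bits are dropped). */
static inline struct abs_timeout abs_timeout_encode(uint64_t target_us, unsigned e)
{
    struct abs_timeout t;
    assert(e <= 31);
    t.e = e;
    t.m = (target_us >> e) & 0x3ff;
    t.c = (unsigned)((target_us >> (10 + e)) & 1);
    return t;
}

/* Kernel side: rebuild the absolute time from the kernel's own clock reading. */
static inline uint64_t abs_timeout_decode(struct abs_timeout t, uint64_t clock_us)
{
    unsigned shift = 10 + t.e;
    uint64_t epoch = clock_us & (~0ull << shift);   /* epoch' */

    if (((epoch >> shift) & 1) != t.c)
        epoch += 1ull << shift;                     /* epoch'' = next epoch */
    return epoch | (t.m << t.e);
}
```

With `e = 0` an epoch lasts 1024 µs, so a target in epoch 10 encoded while the client is still in epoch 9 decodes to the same value whether the kernel reads its clock in epoch 9 or in epoch 10, mirroring cases (b1) and (b2).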

#### Do not waste your time understanding this – informational only!

## Timeout Range of Values (seconds) [v4, vX.2]

| e  | m = 1       | m = 1023       | Notes                                  |
|----|-------------|----------------|----------------------------------------|
| 0  | 0.000001    | 0.001023       | 1 µs to 1023 µs with 1 µs granularity  |
| 1  | 0.000002    | 0.002046       |                                        |
| 3  | 0.000008    | 0.008184       |                                        |
| 5  | 0.000032    | 0.032736       |                                        |
| 7  | 0.000128    | 0.130944       |                                        |
| 9  | 0.000512    | 0.523776       |                                        |
| 11 | 0.002048    | 2.095104       |                                        |
| 13 | 0.008192    | 8.380416       |                                        |
| 15 | 0.032768    | 33.521664      |                                        |
| 17 | 0.131072    | 134.086656     |                                        |
| 19 | 0.524288    | 536.346624     |                                        |
| 21 | 2.097152    | 2145.386496    |                                        |
| 23 | 8.388608    | 8581.545984    |                                        |
| 25 | 33.554432   | 34326.18394    |                                        |
| 27 | 134.217728  | 137304.7357    |                                        |
| 29 | 536.870912  | 549218.943     |                                        |
| 31 | 2147.483648 | 2196875.772    | Up to ~610 h with ~35 min granularity  |
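
Every entry in the table is m × 2^e microseconds. The sketch below reproduces that arithmetic and adds a simple encoder; the function names are hypothetical, and rounding down to the chosen granularity is an assumption of this example rather than something the slides specify.

```c
#include <assert.h>
#include <stdint.h>

/* Relative timeout in microseconds: m * 2^e, with m <= 1023 and e <= 31. */
static inline uint64_t rel_timeout_us(unsigned m, unsigned e)
{
    assert(m <= 1023 && e <= 31);
    return (uint64_t)m << e;
}

/*
 * Pick the smallest exponent whose 10-bit mantissa can hold `us`
 * (rounding down). Returns 0 on success, -1 if `us` exceeds the
 * largest finite timeout and the caller should use "infinite".
 */
static inline int rel_timeout_encode(uint64_t us, unsigned *m, unsigned *e)
{
    for (unsigned k = 0; k <= 31; k++) {
        if ((us >> k) <= 1023) {
            *m = (unsigned)(us >> k);
            *e = k;
            return 0;
        }
    }
    return -1;
}
```

The largest finite value, `rel_timeout_us(1023, 31)`, is 2 196 875 771 904 µs, which is the ~610 h in the last table row.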



- Send to
- Receive from
- Receive
- Call
- Send to & Receive
- Send to & Receive from
- Destination thread ID
- Source thread ID
- Send registers
- Receive registers
- Number of map pages
- Page range for each map page
- Number of send strings
- Send string start for each string
- Send string size for each string

- Receive window for mappings
- Number of receive strings
- Receive string start for each string
- Receive string size for each string
- Send timeout
- Receive timeout
- Send xfer timeout
- Receive xfer timeout
- IPC result code
- Sender thread ID
- Specify deceiting IPC
- Thread ID to deceive as
- Intended receiver of deceiting IPC



- Parameters in registers whenever possible
- Make frequent/simple operations simple and fast



[Figure: sender registers and receiver registers]







## Why use a single call instead of many?

- The implementation of the individual send and receive is very similar to the combined send and receive
  - We can use the same code
    - We reduce cache footprint of the code
    - We make applications more likely to be in cache
- L4 only implements combined "send to A and receive from B" syscall
  - A may but need not be equal to B
  - A or B may be 0 to avoid a send or receive phase
    - A == B == 0 is just a costly no-operation
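
A minimal sketch of how the six primitives collapse into one "send to A and receive from B" operation. The type and helper names are hypothetical; `NILTHREAD` follows the slide's "0" convention, while the `ANYTHREAD` wildcard value is an assumption, since the slides do not say how "from any thread" is encoded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t thread_id;            /* hypothetical thread ID type */
#define NILTHREAD ((thread_id)0)       /* 0 skips the phase, as on the slide */
#define ANYTHREAD ((thread_id)~0u)     /* assumed wildcard: receive from any thread */

/* Arguments of the single combined system call. */
struct ipc_args {
    thread_id to;    /* A: destination, or NILTHREAD for no send phase */
    thread_id from;  /* B: source, ANYTHREAD, or NILTHREAD for no receive phase */
};

static inline struct ipc_args ipc_send(thread_id dst)
{
    return (struct ipc_args){ dst, NILTHREAD };
}

static inline struct ipc_args ipc_receive_from(thread_id src)
{
    return (struct ipc_args){ NILTHREAD, src };
}

static inline struct ipc_args ipc_receive(void)
{
    return (struct ipc_args){ NILTHREAD, ANYTHREAD };
}

static inline struct ipc_args ipc_call(thread_id server)
{
    assert(server != NILTHREAD);       /* a call needs a real partner */
    return (struct ipc_args){ server, server };
}

static inline struct ipc_args ipc_send_and_receive(thread_id client)
{
    return (struct ipc_args){ client, ANYTHREAD };   /* reply, then wait for anyone */
}

static inline struct ipc_args ipc_send_and_receive_from(thread_id dst, thread_id src)
{
    return (struct ipc_args){ dst, src };
}

static inline bool has_send_phase(struct ipc_args a)    { return a.to != NILTHREAD; }
static inline bool has_receive_phase(struct ipc_args a) { return a.from != NILTHREAD; }
```

Each wrapper only fills in the two thread IDs, so the kernel path stays the same for all six operations.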






Assume that 64 extra registers are available

- Name them MR<sub>0</sub> ... MR<sub>63</sub> (message register 0 ... 63)
- All message registers are transferred during IPC




## **Message Construction**

- Messages are stored in registers (MR<sub>0</sub> ... MR<sub>63</sub>)
- First register (MR<sub>0</sub>) acts as message tag
- Subsequent registers contain
  - Untyped words (u)
  - Typed words (t)
    (e.g., map item, string item)










## **Message Construction**

- Typed items occupy one or more words
- Three currently defined items
  - Map item (2 words)
  - Grant item (2 words)
  - String item (2+ words)
- Typed items can have arbitrary order
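
As an illustration of how MR<sub>0</sub> can describe the rest of the message, the sketch below packs the number of untyped words (u) and typed words (t) into a tag. The bit layout (u in bits 0 to 5, t in bits 6 to 11, a label from bit 16 upward) and all names are assumptions made for this example on a 32-bit word; the slides themselves do not give the layout.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t word_t;   /* assumed 32-bit machine word */

/* Build a message tag for MR0 from a label and the u/t word counts. */
static inline word_t msg_tag(word_t label, unsigned u, unsigned t)
{
    assert(label <= 0xffff);
    assert(u + t <= 63);               /* MR1..MR63 hold the payload */
    return (label << 16) | ((word_t)t << 6) | (word_t)u;
}

static inline unsigned tag_untyped(word_t tag) { return tag & 0x3f; }
static inline unsigned tag_typed(word_t tag)   { return (tag >> 6) & 0x3f; }

/* Total number of message registers in use, including MR0 itself. */
static inline unsigned tag_total_mrs(word_t tag)
{
    return 1 + tag_untyped(tag) + tag_typed(tag);
}
```

A message with three untyped words followed by one two-word map item would use `msg_tag(label, 3, 2)` and occupy MR<sub>0</sub> through MR<sub>5</sub>.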







- Up to 4 MB (per string)
- Compound strings supported
  - Allows scatter-gather
- Incorporates cacheability hints
  - Reduce cache pollution for long copy operations








### Assume that 34 extra registers are available

- Name them BR<sub>0</sub> ... BR<sub>33</sub> (buffer register 0 ... 33)
- Buffer registers specify
  - Receive strings
  - Receive window for mappings








**Timeouts**

- Send and receive timeouts are the important ones
  - Xfer timeouts only needed during string transfer
  - Store xfer timeouts in predefined memory location

[Figure: sender registers and receiver registers]




**IPC Result**

- Error conditions are exceptional
  - Not common case
  - No need to optimize for error handling
- Bit in received message tag indicates error
  - Fast check
- Exact error code stored in predefined memory location
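
The fast-path check can be sketched like this. The position of the error bit (bit 15 here) and the `ipc_status` structure standing in for the "predefined memory location" are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TAG_ERROR_FLAG (UINT32_C(1) << 15)   /* assumed position of the error bit */

/* Hypothetical thread-local slot where the kernel leaves the exact error code. */
struct ipc_status {
    uint32_t error_code;
};

/* Common case: a single bit test on the received message tag. */
static inline bool ipc_failed(uint32_t tag)
{
    return (tag & TAG_ERROR_FLAG) != 0;
}

/* Slow path: only touch memory when the tag says something went wrong. */
static inline uint32_t ipc_error(uint32_t tag, const struct ipc_status *st)
{
    assert(st != NULL);
    return ipc_failed(tag) ? st->error_code : 0;
}
```

The successful path never reads the error slot, which matches the goal of not optimizing for error handling.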



- IPC errors flagged in MR<sub>0</sub>
- Sender's thread ID stored in register

[Figure: sender registers and receiver registers]






- When the redirection bit is set
  - Thread ID to deceive as and intended receiver ID are stored in predefined memory locations




- What about predefined memory locations?
  - Must be thread local



## What are Virtual Registers?

- Virtual registers are backed by either
  - Physical registers, or
  - Non-pageable memory
- UTCBs hold the memory backed registers
  - UTCBs are thread local
  - UTCBs cannot be paged
    - No page faults
    - Registers always accessible
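
A rough sketch of what a memory-backed UTCB could look like, collecting the items the previous slides place in "predefined memory locations". The field names, their order, and the 32-bit word size are all hypothetical; in a real kernel some of the message registers would live in physical registers instead.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t word_t;   /* assumed 32-bit machine word */

/* Hypothetical per-thread UTCB: thread local and never paged out. */
struct utcb {
    word_t mr[64];             /* message registers MR0 ... MR63 */
    word_t br[34];             /* buffer registers  BR0 ... BR33 */
    word_t xfer_timeouts;      /* send and receive xfer timeouts */
    word_t error_code;         /* exact IPC error code */
    word_t deceive_as;         /* thread ID to deceive as */
    word_t intended_receiver;  /* intended receiver of a deceiting IPC */
};

/* Virtual register access: plain loads and stores, no page faults possible. */
static inline void store_mr(struct utcb *u, unsigned i, word_t value)
{
    assert(i < 64);
    u->mr[i] = value;
}

static inline word_t load_mr(const struct utcb *u, unsigned i)
{
    assert(i < 64);
    return u->mr[i];
}
```

Because the structure is pinned, reading or writing a virtual register costs no more than an ordinary memory access.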









- Use separate segment for UTCB pointer
  - `movl %gs:0, %edi`
- Switch pointer on context switches



## Message Registers and UTCB

- Some MRs are mapped to physical registers
- Kernel will need the UTCB pointer anyway, so pass it



[Figure: sender registers]



## Free Up Registers for Temporary Values

- Kernel needs registers for temporary values
- MR<sub>1</sub> and MR<sub>2</sub> are the only values that the kernel may not need



[Figure: sender registers and receiver registers]

- Sysexit instruction requires
  - ECX = user SP
  - EDX = user IP

[Figure: sender registers]



















### **IPC Register Usage**

## **Register Encoding on IA-64**





























