# Lazy Context Switching Algorithms for Sparc-like Processors

Jochen Liedtke

German National Research Center for Computer Science (GMD) \* jochen.liedtke@gmd.de

GMD Technical Report No. 776 September 1993

<sup>\*</sup>GMD SET-RS, Schlo Birlinghoven, 53757 Sankt Augustin, Germany

#### Abstract

Recent experiences show that inter-process communication (ipc) can be implemented very fast and efficiently. The necessary context switching basically consists of changing the address space and saving/restoring the processor's registers. This may become a performance bottleneck on processors with a large number of registers. For example, ipc would be 5 times slower on a Sparc processor than on a comparable 8-register processor, if all 136 Sparc registers are saved and restored on context switch.

Therefore, we propose to delay saving and restoring most registers until they are accessed (hoping that they are not accessed until the next process switch occurs).

This paper presents lazy context switching algorithms and tuning options on an abstract level. It is shown that on this level they do never perform worse and often better than existing algorithms. There are situations in which they need only about 4 memory references per context switch.

Since real life performance of these algorithms will heavily depend on coding, integration into an OS kernel and RPC profile, this paper can only be a basis for further experiments. CONTENTS 5

# Contents

| 1 | Motivation                               |                                                 |    |  |  |  |
|---|------------------------------------------|-------------------------------------------------|----|--|--|--|
| 2 | Spa                                      | Sparc Register Architecture                     |    |  |  |  |
| 3 |                                          |                                                 |    |  |  |  |
| 4 |                                          |                                                 |    |  |  |  |
|   | 4.1                                      | Over/Underflow and Context Switch               | 13 |  |  |  |
|   | 4.2                                      | Enregistering                                   | 13 |  |  |  |
|   | 4.3                                      | Window Flushing                                 | 15 |  |  |  |
|   | 4.4                                      | Cross-Domain Register Saving                    | 16 |  |  |  |
| 5 | Improving the Algorithms                 |                                                 |    |  |  |  |
|   | 5.1                                      | Introducing Window Masks                        | 19 |  |  |  |
|   |                                          | 5.1.1 Over/Underflow and Context Switch         | 19 |  |  |  |
|   |                                          | 5.1.2 Window Flushing                           | 20 |  |  |  |
|   | 5.2                                      | Using Cpu-Registers cwp and wim                 | 21 |  |  |  |
|   |                                          | 5.2.1 Over/Underflow and Context Switch         | 22 |  |  |  |
|   |                                          | 5.2.2 Window Flushing                           | 24 |  |  |  |
| 6 | Improving the Single Thread Situation 25 |                                                 |    |  |  |  |
|   | 6.1                                      | Introducing Simple Mode                         | 25 |  |  |  |
|   |                                          | 6.1.1 Complex Over/Underflow and Context Switch | 26 |  |  |  |
|   |                                          | 6.1.2 Simple Over/Underflow and Context Switch  | 28 |  |  |  |
|   |                                          | 6.1.3 Window Flushing                           | 29 |  |  |  |
|   | 6.2                                      | Introducing Trivial Mode                        | 31 |  |  |  |
|   |                                          | 6.2.1 Simple Over/Underflow and Context Switch  | 32 |  |  |  |
|   |                                          | 6.2.2 Trivial Over/Underflow and Context Switch | 32 |  |  |  |
| 7 | Ren                                      | narks                                           | 33 |  |  |  |
|   | 7.1                                      | Performance                                     | 33 |  |  |  |
|   | 7.2                                      | Hardware Support                                | 34 |  |  |  |
|   | 7.3                                      | Tuning                                          | 35 |  |  |  |
|   | 7.4                                      | Special RPC Support                             | 35 |  |  |  |
|   | 7.5                                      | Entering and Leaving Kernel Mode                | 35 |  |  |  |

CONTENTS

## 1 Motivation

Inter-process communication (ipc) by message passing is one of the central paradigms of most  $\mu$ -kernel based and other client/server architectures. It helps to increase modularity, flexibility, security and scalability, and it is the key for distributed systems and applications. It has to be fast and effective, otherwise programmers will not use remote procedure calls (RPC), multi-threading and multitasking adequately. Thus ipc performance is vital for modern operating systems.

Recent experiences show that ipc can be implemented very fast and efficiently. As described in [Lie 93], L3 running on an Intel 486 processor needs approximately 250 cycles for a 8-byte cross-domain ipc. Due to the built-in segment system<sup>1</sup>, entering and leaving kernel mode is very expensive (107 cycles) on 486 [i486]. Since most other modern processors need less than 10 cycles for this, you can hope to achieve a performance of about 150 cycles per short ipc.

Compared to this value, context switching (which is used inside ipc and also other routines) is a serious performance problem on processors with a large number of registers. For example, saving and restoring all 136 registers of the Cypress Sparc processor CY7C601 [Ross] costs at least  $136/2 \times 5 + 136/2 \times 4 = 612$  cycles, i.e. ipc would be 5 times slower than expected!

# 2 Sparc Register Architecture

We describe the Sparc's register architecture in so far as needed for discussing context switch algorithms. Details can be found in [Ross].

A Sparc processor has 8 global registers and 8 or more register windows. The active window is identified by the current window pointer, an internal register called cwp. Besides the global registers, only the registers of the active window can be accessed.

For changing the active window there are two user level instructions which increase/decrease the cwp register by one modulo the number of windows:

```
save: cwp := cwp - 1 (push)
restore: cwp := cwp + 1 (pop)
```

**Remark:** For operations on window indices we use + and - to denote addition and subtraction modulo the number of windows. This seems to be better readable than special symbols  $\oplus$  and  $\ominus$  and is not ambiguous.

<sup>&</sup>lt;sup>1</sup>The processor automatically loads and checks segment descriptors when switching between user and kernel mode, even if a flat memory model instead of a segmented one is used.

As shown in figure 1 the register file is intended to be used as a circular stack of overlapping register windows.



Figure 1: Sparc Register Windows

Changing the current window is controlled by the window invalid mask, a further internal register called wim, which associates a valid/invalid bit to each window. If cwp is set to a window marked invalid, the processor raises an exception.

Exceptions, traps and external interrupts decrease cwp but are *not* sensitive to the window invalid mask. Thus one window marked invalid permits safe handling of these events including window overflow and underflow.

Here and in the following we assume that register windows will be saved by pushing them onto some user level memory stack. The operations

push/pop the values of register window i to/from the stack of thread t. Due to overlapping only the registers  $r_{16} \cdots r_{31}$  of each window are saved/restored by push/pop. The registers  $r_8 \cdots r_{15}$  of the top window must be handled differently by

```
push stack top (i,t)
pop stack top (i,t)
```

Window over/underflow exception handlers usually look like

```
overflow: \left\{\begin{array}{l} \mbox{wim}_{\,\mathrm{cwp}-1} = \mbox{invalid} \;\right\} \\ \mbox{push (cwp-1, actual);} \\ \mbox{wim}_{\,\mathrm{cwp-1}} := \mbox{invalid}; \\ \mbox{wim}_{\,\mathrm{cwp}} := \mbox{valid} \;. \end{array} underflow: \left\{\begin{array}{l} \mbox{wim}_{\,\mathrm{cwp+1}} = \mbox{invalid} \;\right\} \\ \mbox{wim}_{\,\mathrm{cwp+2}} := \mbox{invalid}; \\ \mbox{wim}_{\,\mathrm{cwp+1}} := \mbox{valid}; \\ \mbox{pop (cwp+1, actual)}. \end{array}
```

# 3 Frugal Context Switch

The costs for register saving can simply be reduced by only saving the used part of the window stack. Since the current window belongs to the OS kernel which is called by a trap, the top window to be saved is cwp+1; the bottom window is the last valid one:

```
i := cwp + 1;
while wim i+1 = valid do i := i + 1 od;
do
    push (i, actual);
    i := i + 1
until i = cwp od;
push stack top (i, actual)
```

For a complete context switch to a new thread at least one window of this thread has to be restored. Since the kernel does not know how many of its previous windows will be used in the near future, restoring previous windows should be delayed until they are accessed. In this way, the new thread has a well defined current window and a maximum of unused windows available. Previous windows will be restored on demand by window underflow.

For preventing hidden channels, the values contained in the unused register windows must be destroyed, e.g. by filling them with zeroes.

```
switch to (new):
  bottom := cwp+1;
   while wim bottom+1 = valid do bottom := bottom + 1 od;
  i := bottom ;
  do
     push (i, actual);
     i := i + 1
   until i = cwp od ;
  push stack top (i, actual);
   pop (bottom, new);
  i := bottom - 1;
  do
     fill with zero (i);
     i := i - 1
   until i = bottom od :
   cwp := bottom - 1.
```

Let us assume that saving a window costs 4 time units, restoring 5, filling with zeroes 1, and all other operations are for free. Then on an n-window processor, a frugal context switch from a thread actually using k windows would cost

$$4k + 5 + (n - 1)$$

whereas the stupid context switch always saving and restoring all n-1 windows costs (4+5)n. Thus on an 8 window processor 63 units would be needed for stupid and 4k + 12 (16...40) for frugal context switch. But if a frugal context switch is immediately followed by 7 window underflows, the real costs can increase up to 75.

If remote procedure calls (RPC) are implemented adequately, only the bottom window is in use when returning from server to client. Then the two frugal context switches (client  $\rightarrow$  server  $\rightarrow$  client) cost 4k + 2n + 12. If we assume that all k client windows have to be restored after RPC, the costs on an 8-window processor sum up to 9k + 24 (33...87) units.

# 4 Lazy Context Switch

The basic idea of lazy context switch is to associate windows and threads in a flexible way. Not only restoring is delayed (as already in frugal context switch), but also saving windows is delayed as long as possible. Ideally, neither saving nor restoring is necessary on context switch.



Figure 2: Lazily Managed Register Windows

In the situation shown in figure 2, a context switch from thread A to thread B requires only to change wim and cwp. If a thread hits a window belonging to a different thread, this window is first saved and then given to the requesting thread.

Note that due to overlapping there must be always at least one free window between the regions of two different threads. To leave things simple, we insist that the windows associated to one thread must form a contiguous region being the top of the thread's logical window stack.

Associating different windows to different threads requires more than the window invalid mask. The kernel holds the **owner** thread of each window and always ensures that exactly the windows owned by the actual thread are marked valid in the wim register. Free windows have the owner nil. Furthermore, the kernel has a per thread variable called **top** which holds the index of the thread's actual top window, if there is at least one window associated to the thread.

In this section we present the lazy context switch algorithms on an abstract level, e.g. without using the processor registers cwp and wim. Optimizations are also not yet considered.

For reasoning we will use some **predicates**:

```
\begin{array}{lll} \textit{registered } (t): & \exists i: \textit{owner}_i = t \;. \\ \\ \textit{is min } (i,t): & \textit{owner}_i = t \; \land \; \textit{owner}_{i-1} = \textit{nil} \;. \\ \\ \textit{is max } (i,t): & \textit{owner}_i = t \; \land \; \textit{owner}_{i+1} = \textit{nil} \;. \\ \\ \textit{undefined } (i): & \textit{registers of window } i \; \textit{may be changed by OS}. \\ \\ \textit{defined } (i): & \textit{registers of window } i \; \textit{must not be changed by OS}. \\ \\ \textit{left undefined } (i,t): & \textit{owner}_i = t \; \Longrightarrow \; \textit{undefined } (i) \; \land \; \textit{left undefined } (i-1,\,t) \;. \\ \\ \textit{right defined } (i,t): & \textit{owner}_i = t \; \Longrightarrow \; \textit{defined } (i) \; \land \; \textit{right defined } (i+1,\,t) \;. \\ \\ \end{aligned}
```

The windows associated with a registered t are contiguous and always separated by at least one free window:

$$\mathbf{I0} \qquad \qquad is \ min \ (i,t) \, \wedge \, is \ min \ (j,t) \implies i = j \ .$$

$$I1 \hspace{1cm} \textit{owner}_{i} \neq \textit{owner}_{i+1} \implies \textit{owner}_{i} = \textit{nil} \lor \textit{owner}_{i+1} = \textit{nil} \;.$$

For each registered t holds

I2 
$$owner_{top_t} = t$$
,  
I3  $left \ undefined \ (top_t-1,t)$ ,

I4 
$$right \ defined \ (top_t, t)$$
).

The invariants I0...I4 will be valid on entering and leaving the routines, but not necessarily between these two points.

# 4.1 Over/Underflow and Context Switch

```
{ is min (top<sub>actual</sub>, actual) }
overflow:
   flush (top _{actual}-2);
   owner_{top_{actual}-1} := actual;
                                              \{ is min (top_{actual}-1, actual) \}
   fill with zero (top actual-1).
                                              \{ is max (top_{actual}, actual) \}
underflow:
   flush (top _{actual}+2);
   owner top_{actual}+1 := actual;
                                               \{ is \ max \ (top_{actual} + 1, \ actual) \}
   pop (top<sub>actual</sub>+1, actual).
                                               \{ \ \}
switch to (new):
   if ¬ registered (new)
       then enregister (new)
                                               \{ registered (new) \}
   fi.
```

# 4.2 Enregistering

"Enregistering" denotes the action of allocating (at least) one register window to a thread which actually is not registered. In a way this corresponds to a page replacement or cache line replacement algorithm.

The first algorithm, called *outside actual* enregistering, places the new window left to the actual region (see figure 3). If the actual region covers n-3 windows or less, the new region is outside the actual one which is not changed by enregistering. The unused windows of the actual region remain allocated to the actual thread. When the new region grows by window underflow, first these unused windows are used. Growing by overflow sometime grabs windows from the bottom of the actual region.



Figure 3: Outside Actual Enregistering

```
enregister (t):  \left\{ \begin{array}{l} \neg \ \mathit{registered} \ (t) \end{array} \right\}   i := top_{\mathtt{actual}} ;   \mathbf{do} \ i := i-1 \ \mathbf{until} \ \mathrm{owner}_{i} = \mathrm{nil} \ \mathbf{od} ;   \mathrm{flush} \ (i-1) \ ;   \mathrm{flush} \ (i-2) \ ;   \mathrm{owner}_{i-1} := t \ ;   \mathrm{pop} \ (i-1, t) \ ;   top_t := i-1 \ .   \left\{ \begin{array}{l} registered \ (t) \end{array} \right\}
```

The second algorithm presented is called *inline actual* enregistering. It tries to use the unused windows (left of the top) of the actual region (see figure 4). Now there are probably more windows available which can be used without prior saving them and growing by overflow hits the actual region later than in the inside case. But on the other hand, growing by underflow immediately flushes the actual region. In the same way, overflow of the actual region immediately induces saving the new bottom window to memory.



Figure 4: Inside Actual Enregistering

```
\begin{array}{lll} \text{enregister (t):} & \left\{ \begin{array}{l} \neg \ \textit{registered (t)} \end{array} \right\} \\ \text{i := top}_{\texttt{actual}} - 1 \ ; \\ \textbf{while} \ \text{owner}_{i} = \text{actual do} \\ & \text{owner}_{i} := \text{nil} \ ; \\ & \text{i := i - 1} \\ \textbf{od} \ ; \\ \text{flush (top}_{\texttt{actual}} - 2) \ ; \\ \text{flush (top}_{\texttt{actual}} - 3) \ ; \\ & \text{owner}_{\texttt{top}_{\texttt{actual}} - 2} := t \ ; \\ & \text{top}_{t} := \text{top}_{\texttt{actual}} - 2 \ ; \\ & \text{pop (top}_{t}, \ t) \ . \end{array} \quad \left\{ \begin{array}{l} \neg \ \textit{registered (t)} \end{array} \right\} \end{array}
```

# 4.3 Window Flushing

Regions of windows belonging to the same thread must not be split (invariant **IO**). Therefore only windows at the left or right margin of such a region (or nil-owned ones, of course) can be flushed. Furthermore, flushing a (used) top window requires flushing the complete region. Otherwise the stack of windows saved in memory would be inconsistent.

```
 \begin{aligned} & \text{flush (i)}: & \left\{\begin{array}{l} \textit{owner}_{i+1} = \textit{nil} \; \forall \; \textit{owner}_{i+1} = \textit{nil} \; \right\} \\ & \text{if owner}_{i} \neq \; \text{nil} \\ & \text{then if top}_{owner}_{i} = i \\ & \text{then flush all (owner}_{i}) \\ & \text{elif is max (i,owner}_{i}) \\ & \text{then push (i, owner}_{i}); \\ & \text{owner}_{i} := \text{nil} \\ & \text{else } \left\{\textit{is min (i,owner}_{i}), \textit{undefined (i)}\right\} \\ & \text{owner}_{i} := \text{nil} \end{aligned}   fi . \left\{\begin{array}{l} \textit{owner}_{i} = \textit{nil} \; \right\}
```

```
\begin{aligned} &\text{flush all } (t): & \left\{ \begin{array}{l} \textit{registered } (t) \end{array} \right\} \\ &\text{i}:= top_t \ ; \\ &\text{while } owner_{i+1} \neq \ nil \ do \ i:= i+1 \ od \ ; \\ &\text{while } i \neq top_t-1 \ do \\ &\text{push } (i,t) \ ; \\ &\text{owner}_i := nil \ ; \\ &\text{i}:= i-1 \\ &\text{od} \ ; \\ &\text{push stack } top \ (i) \ ; \\ &\text{while } owner_i := nil \ ; \\ &\text{i}:= i-1 \ ; \\ &\text{od} \ . & \left\{ \begin{array}{l} \textit{registered } (t) \end{array} \right\} \end{aligned}
```

## 4.4 Cross-Domain Register Saving

We take the routines for push/pop register windows on/from a thread's stack as already defined. But since lazy context switch delays register saving, a thread's memory may be inaccessible when saving is demanded. Therefore we introduce a stack extension in each thread's control block (tcb):

17

In this way, registers can be pushed as long as the tcb remains accessible. Of course, on closing a tcb the thread must be deregistered:

```
 \begin{cases} tcb \ accessible \ (t) \\ t \neq \ actual \end{cases} 
 \text{if registered (t)} 
 \text{then flush all (t)} 
 \text{fi .} 
 \begin{cases} \neg \ registered \ (t) \\ \end{cases}
```

The tcb stack must be able to hold at least one maximum sized window region, but this is not sufficient. Unfortunately, arbitrary growth of the tcb stack is possible:

| $\mathbf{t}_1$                            | $\mathbf{t}_2$                             | kernel                  |                                                              |
|-------------------------------------------|--------------------------------------------|-------------------------|--------------------------------------------------------------|
| $n \times \text{save}$                    |                                            |                         |                                                              |
| $\operatorname{rpc} (\operatorname{t}_2)$ |                                            | switch $(t_2)$          | $3 \times \text{push user } (t_1)$                           |
|                                           | $n \times \text{save}$                     | overflow                | $(n-3) \times \mathbf{push} \ \mathbf{tcb} \ (\mathbf{t_1})$ |
|                                           | $\operatorname{rpc}\ (\operatorname{t}_1)$ | switch $(t_1)$          | $3 \times \text{push user } (t_2)$                           |
| $n \times \text{save}$                    |                                            | overflow                | $(n-3) \times \mathbf{push} \ \mathbf{tcb} \ (\mathbf{t}_2)$ |
| $\operatorname{rpc} (\operatorname{t}_2)$ |                                            | $\operatorname{switch}$ | $3 \times \text{push user } (t_1)$                           |
| :                                         | <u>:</u>                                   |                         | <u>:</u>                                                     |

For solving this problem we assume that the user stack of the actual thread is always accessible. We use a fixed size tcb stack which is large enough to hold at least 2(n-1) windows, i.e. twice the available processor registers. (The processor supports n windows and at least one must be free.) We define a tcb stack to be critical, if its free space is less than needed for n-1 windows. Then we extend push/pop and the enregister operation as follows:

- 1. If the actual thread's tcb stack is critical and a window is pushed onto it, the complete tcb stack will be copied to the user stack so that the actual tcb stack becomes empty.
- 2. If pushing onto a non actual tcb stack leads to a critical tcb, the corresponding thread will be completely deregistered.
- 3. Enregistering a thread with a critical tcb stack leads to restore all n-1 windows.

**I**5

 $\neg critical(t)$ .

```
\left\{\begin{array}{l} defined\ (i) \\ tcb\ accessible\ (t) \end{array}\right\}
    push (i,t):
        if user stack accessible (t) AND tcb stack empty (t)
             then push onto user stack (i,t)
             else push onto tcb stack (i,t);
                      if critical (t)
                          then if t = actual
                                        then copy tcb stack to user (t)
                                                \{tcb\ stack\ empty\ (t)\}
                                        else flush all (t)
                                                 \{\neg registered(t)\}
                                   fi
                      fi
        fi.
                                                              \left\{ \begin{array}{l} \textit{undefined (i)} \\ \textit{registered (t)} \implies \neg \textit{ critical (t)} \end{array} \right\}
                                                              \{ \neg registered (t) \}
    enregister (t):
        if critical (t)
             then i := top_t;
                      while i \neq top_{t}-1 do
                          i := i + 1;
                          flush (i);
                          owner_i := t;
                          pop from tcb stack (i,t)
                      od
        fi.
                                                              \left\{ \begin{array}{l} registered \ (t) \\ \neg \ critical \ (t) \end{array} \right\}
Now for each registered t holds
```

# 5 Improving the Algorithms

## 5.1 Introducing Window Masks

For better performance we introduce a per thread variable wmask. Its semantics is defined by a new invariant:

For each t holds

I6 
$$owner_i = t \iff wmask_{t,i} = valid$$
.

As a consequence, for each t holds

$$registered (t) \iff wmask_t = invalid^n$$
.

Most loops parsing the owner-array will disappear and deciding whether a thread is registered or not can be done fast, because the wmask array of one thread fits into one machine word.

### 5.1.1 Over/Underflow and Context Switch

```
{ is min (top<sub>actual</sub>, actual) }
overflow:
     flush bottom (top _{actual}-2);
    owner top_{actual}^{-1} := actual;
    \operatorname{wmask}_{\operatorname{actual}, \operatorname{top}_{\operatorname{actual}^{-1}}} := \operatorname{valid};
                                                               \{ is min (top_{actual}-1, actual) \}
     fill with zero (top _{actual}-1).
                                                               { is max (top<sub>actual</sub>, actual) }
underflow:
     flush top (top _{actual}+2);
    \mathrm{owner}_{\mathtt{top}_{\mathtt{actual}}+1} := \mathtt{actual} \ ;
    \operatorname{wmask}_{\operatorname{actual}, \operatorname{top}_{\operatorname{actual}}+1} := \operatorname{valid};
                                                               \{ is \ max \ (top_{actual} + 1, \ actual) \}
     pop (top<sub>actual</sub>+1, actual).
switch to (new):
     if ¬ registered (new)
         then enregister (new)
     fi.
                                                                \{ registered (new) \}
```

```
\begin{array}{ll} enregister \; (t): & \left\{ \begin{array}{l} \neg \; registered \; (t) \end{array} \right\} \\ i:= top_{actual} \; ; \\ \textbf{do} \; i:= i-1 \; \textbf{until} \; owner_i = nil \; \textbf{od} \; ; \\ flush \; bottom \; (i-1) \; ; \\ flush \; bottom \; (i-2) \; ; \\ owner_{\; i-1} := t \; ; \\ wmask_{\; t,\; i-1} := valid \; ; \\ pop \; (i-1,\; t) \; ; \\ top_{\; t} := i-1 \; . & \left\{ \; registered \; (t) \; \right\} \end{array}
```

## 5.1.2 Window Flushing

```
\{ owner_{i+1} = nil \}
flush bottom (i):
   if owner i \neq nil
       then if top_{owner} = i
                  then flush all (owner<sub>i</sub>)
                  else push (i, owner;);
                          owner_i := nil;
                          wmask owner; i := invalid
              fi;
                                                 \{ owner_i = nil \}
   fi.
                                                 \left\{ owner_{i-1} = nil \right\}
flush top (i):
   if owner i \neq nil
       then if top_{owner} = i
                  then flush all (owner;)
                  else \ owner_i := nil;
                          \operatorname{wmask}_{\operatorname{owner}_i, i} := \operatorname{invalid}
              fi;
                                                 \{ owner_i = nil \}
   fi.
```

```
\begin{aligned} &\text{flush all } (t): & \left\{ \begin{array}{l} \textit{registered } (t) \end{array} \right\} \\ &\text{i}:= top_t \ ; \\ &\text{while } owner_{i+1} \neq \ nil \ \textbf{do} \ i:= i+1 \ \textbf{od} \ ; \\ &\text{while } i \neq top_t-1 \ \textbf{do} \\ &\text{push } (i,t) \ ; \\ &\text{owner}_i := nil \ ; \\ &\text{i}:= i-1 \\ &\textbf{od} \ ; \\ &\text{push } stack \ top \ (i) \ ; \\ &\text{while } owner_i = t \ \textbf{do} \\ &\text{owner}_i := nil \ ; \\ &\text{i}:= i-1 \ ; \\ &\textbf{od} \ ; \\ &\text{wmask}_t := invalid^n \ . \end{aligned} \quad \left\{ \begin{array}{l} \textit{registered } (t) \end{array} \right\}
```

# 5.2 Using Cpu-Registers cwp and wim

For further improvement we use the processor's built in registers cwp and wim directly. For this purpose we redefine the invariants **I2...I4** and **I6**:

For each registered t holds

```
 \begin{aligned} \mathbf{I2'} & owner_{\,\mathrm{TOP}_{\,\mathbf{t}}} = t \;, \\ \\ \mathbf{I3'} & left \; undefined \; (TOP_{\,\mathbf{t}}-1,t) \;, \\ \\ \mathbf{I4'} & right \; defined \; (TOP_{\,\mathbf{t}},t) \;) \;. \\ \\ \mathbf{I6'} & owner_{\,\mathbf{i}} = t \iff WMASK_{\,\mathbf{t}_{\,\,\mathbf{i}}} = valid \;. \end{aligned}
```

where

$$\begin{split} TOP_t &= \left\{ \begin{array}{ll} \textit{top}_t & \textit{if} \ t \neq \ \textit{actual} \\ \textit{cwp} & \textit{if} \ t = \ \textit{actual}, \ \neg \ \textit{is} \ \textit{min} \ (\textit{cwp-1}, \textit{actual}) \\ \textit{cwp-1} & \textit{if} \ t = \ \textit{actual}, \ \textit{is} \ \textit{min} \ (\textit{cwp-1}, \textit{actual}) \end{array} \right. \\ WMASK_{t,i} &= \left\{ \begin{array}{ll} \textit{wmask}_{t,i} & \textit{if} \ t \neq \ \textit{actual} \\ \textit{wim}_i & \textit{if} \ t = \ \textit{actual} \end{array} \right. \end{aligned}$$

#### 5.2.1 Over/Underflow and Context Switch

```
\{ is min (cwp+1, actual) \}
overflow:
   flush bottom (cwp-1);
   owner _{cwp} := actual ;
   wim_{cwp} := valid;
   fill with zero (cwp).
                                               \{ is min (cwp, actual) \}
                                               \{ is max (cwp, actual) \}
underflow:
   flush top (cwp+2);
   owner _{cwp+1} := actual;
   wim_{cwp+1} := valid;
                                               \{ is max (cwp+1, actual) \}
   pop (cwp+1, actual).
                                               \{ \}
switch to (new):
   if ¬ registered (new)
       then enregister (new)
   fi;
   \operatorname{wmask}_{\operatorname{actual}} := \operatorname{wim} ;
   top_{actual} := cwp + 1;
   wim := wmask_{new};
   cwp := top_{new}-1;
                                               \{ registered (new), actual = new. \}
   actual := new.
                                               \{ \neg registered (t) \}
enregister (t):
   i := min valid window;
   flush bottom (i-1);
   flush bottom (i-2);
   owner _{i-1} := t;
   \operatorname{wmask}_{t,i-1} := \operatorname{valid};
   pop (i-1, t);
                                               \left\{ \begin{array}{c} registered \ (t) \end{array} \right\}
   top_t := i - 1.
```

The function 'min valid window' can be implemented by means of a loop:

```
\label{eq:min_valid_window} \begin{split} \text{min valid window} : \\ \text{j} := \text{cwp} ; \\ \text{while wim}_{j} = \text{valid do } j := j-1 \text{ od} ; \\ \text{j} . \end{split}
```

Obviously, the cost of this function depends on n, the number of register windows. For avoiding this you can interpret wim 0...n-1 as an integer, rotate it by cwp and use as index into a given array which holds the index of the lowest bit set minus one:

```
min valid window:

j := (2^n \times wim + wim)/2^{cwp};

index_j.

index = [-1, 0, 1, 0, 2, 0, 1, 0, 3, ...].
```

If n is too large, i.e. if an index array of size  $2^n$  would be too expensive, halve or quarter its size by:

```
if j and 0xff = 0
then index_{j/256} + 8
else index_j
fi
```

In this way, for n=32 (the maximum number of windows in the Sparc architecture) the function can be calculated by approximately 6 integer operations and 1 memory reference.

Regardless of the version used, we will assume that the costs of calculating min valid window are defacto independent of n.

#### 5.2.2 Window Flushing

```
\left\{ owner_{i+1} = nil \right\}
flush bottom (i):
     if wim_i = valid
          then push (i, actual);
                    owner_i := nil;
                    \mathrm{wim}\,_i := \mathrm{invalid}
     elif \ \mathrm{owner} \ _i \neq \ \mathrm{nil}
          then if top_{owner_i} = i
                         then flush all (owner<sub>i</sub>)
                         else push (i, owner<sub>i</sub>);
                                   owner_i := nil;
                                   \operatorname{wmask}_{\operatorname{owner}_{\dot{\mathbf{i}}},\dot{\mathbf{i}}} := \operatorname{invalid}
                    fi;
                                                                   \left\{ \ \mathit{owner}_{\,i} \, = \, \mathit{nil} \ \right\}
     fi.
                                                                   \left\{ owner_{i-1} = nil \right\}
flush top (i):
     if wim_i = valid
          then owner i := nil;
                    wim_i := invalid
     \textbf{elif} \ \mathrm{owner} \ _i \neq \ \mathrm{nil}
          then if top_{owner_i} = i
                         then flush all (owner<sub>i</sub>)
                         else \ owner_i := nil;
                                   \operatorname{wmask}_{\operatorname{owner}_i, i} := \operatorname{invalid}
                    fi;
                                                                   \left\{ owner_{i} = nil \right\}
     fi.
```

```
\begin{aligned} &\text{flush all } (t): & \left\{ \begin{array}{l} \textit{registered } (t) \;, \; t \neq \; \textit{actual } \right\} \\ &\text{i} := top_t \;; \\ &\text{while } owner_{i+1} \neq \; \text{nil } \textbf{do } i := i+1 \; \textbf{od } ; \\ &\text{while } i \neq \; top_t-1 \; \textbf{do} \\ &\text{push } (i,\,t) \;; \\ &\text{owner}_i := \text{nil } ; \\ &\text{i} := i-1 \\ &\textbf{od } ; \\ &\text{push } \textit{stack } top \; (i) \;; \\ &\text{while } owner_i := t \; \textbf{do} \\ &\text{owner}_i := \text{nil } ; \\ &\text{i} := i-1 \;; \\ &\textbf{od } ; \\ &\text{wmask }_t := invalid^n \;. & \left\{ \begin{array}{l} \textit{registered } (t) \;, \; t \neq \; \textit{actual } \end{array} \right\} \end{aligned}
```

# 6 Improving the Single Thread Situation

The window over/underflow handlers are still burdened by inspecting the owner variable and the tcb stack state for each push or pop. Although these operations are not expensive, they may count in situations when over/underflow events dominate context switching.

# 6.1 Introducing Simple Mode

To get rid of this overhead we differentiate between *simple mode*, when all windows (but the one need as a barrier) are owned by the actual thread, and *complex mode* (otherwise). In simple mode, it is no longer necessary to inspect or change the owner field. This fact can be used without dynamic mode check on each exception. Since the exception handlers are always invoked indirectly, mode switch can be very efficiently done by establishing new exception handlers.

A further benefit of simple mode is that zeroing a register window on overflow can be omitted, since the values in this window have been generated by the same thread.

The invariants **I2**′...**I4**′ are slightly reformulated to become independent of the owner array in simple mode:

For each registered t holds

$$\begin{array}{ll} \textbf{I2''} & \textit{owner}_{\, \text{TOP}_{\, \text{t}}} \, = \, t \; , \\ \\ \textbf{I3''} & \textit{left undefined } \left( \textit{TOP}_{\, \text{t}} \text{--} 1, t \right) \; , \\ \\ \textbf{I4''} & \textit{right defined } \left( \textit{TOP}_{\, \text{t}}, t \right) \; ) \; . \end{array}$$
 where

 $TOP_{\,t} \; = \left\{ \begin{array}{ll} top_{\,t} & \textit{if} \; t \neq \; \textit{actual} \\ cwp & \textit{if} \; t = \; \textit{actual}, \; wim_{\,cwp} = \; valid \\ cwp-1 & \textit{if} \; t = \; \textit{actual}, \; wim_{\,cwp} = \; \textit{invalid} \end{array} \right.$ 

## 6.1.1 Complex Over/Underflow and Context Switch

```
 \left\{ \begin{array}{l} \textit{is min (cwp+1, actual)} \\ \textit{complex mode} \end{array} \right\}   \textbf{if wim}_{\texttt{cwp-1}} = \textit{valid} \\ \textbf{then enter simple mode by overflow} \\ \textbf{else flush bottom non actual (cwp-1)}; \\ \textit{owner}_{\texttt{cwp}} := \textit{actual}; \\ \textit{wim}_{\texttt{cwp}} := \textit{valid}; \\ \textit{fill with zero (cwp)} \\ \textbf{fi} \; . \\ \left\{ \begin{array}{l} \textit{complex mode} \vee \textit{simple mode} \\ \textit{is min (cwp, actual)} \end{array} \right\}
```

```
\left\{\begin{array}{c} is\ max\ (cwp,\ actual) \\ complex\ mode \end{array}\right\}
complex underflow:
    if wim _{cwp+2} = valid
         then enter simple mode by underflow
         else flush top non actual (cwp+2);
                   owner _{cwp+1} := actual;
                   wim_{cwp+1} := valid;
                   pop (cwp+1, actual)
    fi.
                                                                 \left\{\begin{array}{l} complex \ mode \ \lor \ simple \ mode \\ is \ max \ (cwp+1, \ actual) \end{array}\right\}
                                                                \{ complex mode \}
switch to (new):
     if ¬ registered (new)
         then enregister (new)
     fi;
     \operatorname{wmask}_{\operatorname{actual}} := \operatorname{wim} ;
     top_{actual} := cwp + 1;
     wim := wmask_{new};
     cwp := top_{new}-1;
     actual := new.
                                                                 \left\{ \begin{array}{l} complex \ mode \\ registered \ (new) \ , \ actual = new \end{array} \right\}
                                                                \left\{ \begin{array}{l} \neg \ registered \ (t) \\ complex \ mode \end{array} \right\}
enregister (t):
    i := cwp;
     while wim i = \text{valid } \mathbf{do} \ i := i - 1 \ \mathbf{od} \ ;
     flush bottom (i-1);
     flush bottom (i-2);
     owner_{i-1} := t;
     \operatorname{wmask}_{t,i-1} := \operatorname{valid};
     pop (i-1, t);
     top_t := i - 1.
                                                                 \left\{\begin{array}{c} complex \ mode \\ registered \ (t) \end{array}\right\}
```

### 6.1.2 Simple Over/Underflow and Context Switch

```
 \left\{ \begin{array}{l} \forall \ i \neq \ cwp \colon owner_i = \ actual \\ owner_{cwp} = \ nil \\ is \ min \ (cwp+1, \ actual) \end{array} \right\} 
enter simple mode by overflow:
     owner _{cwp} := actual);
     establish (simple overflow, simple underflow, simple switch to);
     simple overflow.
                                                                         \left\{\begin{array}{l} simple\ mode \\ is\ min\ (cwp,\ actual) \end{array}\right\}
                                                                         \left\{ \begin{array}{l} \forall \: i \neq \: cwp \colon owner_{\:i} \: = \: actual \\ owner_{\:cwp+1} \: = \: nil \\ is \: max \: (cwp, \: actual) \end{array} \right\}
enter simple mode by underflow:
     owner _{cwp+1} := actual;
     establish (simple overflow, simple underflow, simple switch to);
     simple underflow.
                                                                        \left\{\begin{array}{c} simple \ mode \\ is \ max \ (cwp+1, \ actual) \end{array}\right\}
                                                                         \left\{\begin{array}{l} is \ min \ (cwp+1, \ actual) \\ simple \ mode \end{array}\right\}
simple overflow:
     push (cwp-1, actual);
      \{zeroing\ the\ window\ is\ not\ necessary\}
      wim _{cwp-1} := invalid ;
     wim_{cwp} := valid.
                                                                         \left\{\begin{array}{l} simple \ mode \\ is \ min \ (cwp, \ actual) \end{array}\right\}
                                                                          \left\{\begin{array}{c} is\ max\ (cwp,\ actual) \\ simple\ mode \end{array}\right\}
simple underflow:
     pop (cwp+1, actual);
      wim_{cwp+1} := valid;
     wim _{cwp+2} := invalid.
                                                                         \left\{\begin{array}{l} simple\ mode \\ is\ max\ (cwp+1,\ actual) \end{array}\right\}
```

```
\begin{split} & simple \ switch \ to \ (new) : & \left\{ \begin{array}{l} \textit{simple mode} \end{array} \right\} \\ & i := cwp \ ; \\ & \textbf{while} \ wim_{\,i} = valid \ \textbf{do} \ i := i - 1 \ \textbf{od} \ ; \\ & owner_{\,i} := nil \ ; \\ & wim_{\,i} := invalid \ ; \\ & establish \ (complex \ overflow, \ complex \ underflow, \ switch \ to) \ ; \\ & switch \ to \ (new) \ . & \left\{ \begin{array}{l} \textit{complex mode} \\ \textit{registered (new) , actual = new} \end{array} \right\} \end{split}
```

## 6.1.3 Window Flushing

```
\left\{\begin{array}{l} owner_{i+1} = nil \\ complex \ mode \end{array}\right\}
flush bottom (i):
     if wim_i = valid
           then push (i, actual);
                       owner_i := nil;
                       wim_i := invalid
           else flush bottom non actual (i)
     fi.
                                                                              \left\{\begin{array}{c} complex \ mode \\ owner_{i} = nil \end{array}\right\}
                                                                             \left\{ \begin{array}{l} \textit{owner}_{i+1} = \textit{nil} \\ \textit{owner}_{i} \neq \textit{actual} \\ \textit{complex mode} \end{array} \right\}
flush bottom non actual (i):
     if \ \mathrm{owner}_{\,i} \neq \ \mathrm{nil}
           then if top owner; = i
                             then flush all (owner;)
                             else push (i, owner;);
                                         owner_i := nil;
                                         wmask owner; ,i := invalid
                       fi;
                                                                             \left\{ \begin{array}{l} \mathit{complex} \ \mathit{mode} \\ \mathit{owner}_{\,i} = \mathit{nil} \end{array} \right\}
     fi.
```

```
\left\{\begin{array}{l} owner_{i-1} = nil \\ complex \ mode \end{array}\right\}
flush top (i):
      if wim_i = valid
             then owner i := nil;
                           wim_i := invalid
             else flush top non actual (i)
       fi.
                                                                                         \left\{ \begin{array}{l} \mathit{complex} \ \mathit{mode} \\ \mathit{owner}_{i} = \mathit{nil} \end{array} \right\}
                                                                                          \left\{ \begin{array}{l} \mathit{owner}_{i-1} = \mathit{nil} \\ \mathit{owner}_i \neq \mathit{actual} \\ \mathit{complex} \ \mathit{mode} \end{array} \right\}
flush top non actual (i):
      if \ \mathrm{owner} \ _i \neq \ \mathrm{nil}
             then if top_{owner}_i = i
                                  then flush all (owner;)
                                  else \ owner_i := nil;
                                               wmask owner; i := invalid
                          fi;
       fi.
                                                                                         \left\{ \begin{array}{l} \mathit{complex} \ \mathit{mode} \\ \mathit{owner}_i = \mathit{nil} \end{array} \right\}
```

```
\left\{ \begin{array}{ll} registered \ (t) \ , \ t \neq \ actual \\ complex \ mode \end{array} \right\}
flush all (t):
     i := top_t;
     \mathbf{while} \ \mathrm{owner}_{\,i+1} \neq \ \mathrm{nil} \ \mathbf{do} \ i := i + 1 \ \mathbf{od} \ ;
     while i \neq top_t-1 do
           push (i, t);
           owner_i := nil;
           i := i - 1
      od;
      push stack top (i);
      while owner _{i} = t do
           owner_i := nil;
           i := i - 1;
      od;
      \operatorname{wmask}_{t} := \operatorname{invalid}^{n}.
                                                                         \left\{ \begin{array}{l} complex \ mode \\ \neg \ registered \ (t) \end{array} \right\}
```

# 6.2 Introducing Trivial Mode

In simple mode, the tcb stacks have still to be inspected on push/pop operations. So we apply the same technique once more to get rid of this and introduce trivial mode.

```
trivial mode: simple mode, tcb stack empty (actual).
```

Recall that we assume that the user stack of the *actual* thread is always accessible!

#### 6.2.1 Simple Over/Underflow and Context Switch

```
\left\{\begin{array}{l} is \ min \ (cwp+1, \ actual) \\ simple \ mode \end{array}\right\}
simple overflow:
    if tcb stack empty (actual)
         then enter trivial mode by overflow
         else push tcb (cwp-1, actual);
                   wim_{cwp-1} := invalid;
                   wim_{cwp} := valid
    fi.
                                                              \left\{\begin{array}{l} simple\ mode\ \lor\ trivial\ mode\\ is\ min\ (cwp,\ actual) \end{array}\right\}
                                                              \left\{\begin{array}{l} is\ max\ (cwp,\ actual) \\ simple\ mode \end{array}\right\}
simple underflow:
    if tcb stack empty (actual)
         then enter trivial mode by underflow
         else pop (cwp+1, actual);
                   wim _{cwp+1} := valid;
                   wim_{cwp+2} := invalid
    fi.
                                                              \left\{\begin{array}{l} simple \ mode \lor trivial \ mode \\ is \ max \ (cwp+1, \ actual) \end{array}\right\}
```

## 6.2.2 Trivial Over/Underflow and Context Switch

```
 \left\{ \begin{array}{l} is \ min \ (cwp+1, \ actual) \\ simple \ mode \end{array} \right\}  enter trivial mode by overflow :  establish \ (trivial \ overflow, \ trivial \ underflow) \ ; \\ trivial \ overflow \ . \\ \left\{ \begin{array}{l} trivial \ mode \\ is \ min \ (cwp, \ actual) \end{array} \right\}
```

```
\left\{\begin{array}{c} is\ max\ (cwp,\ actual) \\ simple\ mode \end{array}\right\}
enter trivial mode by underflow:
     establish (trivial overflow, trivial underflow);
     trivial underflow.
                                                                     \left\{\begin{array}{l} trivial\ mode \\ is\ max\ (cwp+1,\ actual) \end{array}\right\}
                                                                     \left\{\begin{array}{l} is \ min \ (cwp+1, \ actual) \\ trivial \ mode \end{array}\right\}
trivial overflow:
     push onto user stack (cwp-1, actual);
     wim _{cwp-1} := invalid;
     wim_{cwp} := valid.
                                                                     \left\{\begin{array}{l} trivial\ mode \\ is\ min\ (cwp,\ actual) \end{array}\right\}
                                                                     \left\{\begin{array}{l} is\ max\ (cwp,\ actual) \\ trivial\ mode \end{array}\right\}
trivial underflow:
     pop from user stack (cwp+1, actual);
     wim_{cwp+1} := valid;
     wim _{cwp+2} := invalid.
                                                                     \left\{\begin{array}{l} trivial\ mode \\ is\ max\ (cwp+1,\ actual) \end{array}\right\}
```

# 7 Remarks

#### 7.1 Performance

The resulting algorithms are rather complex and their performance relies heavily on the application dependent interplay of window overflows, underflows and context switches. A precise performance analysis seems impossible, but we can state some interesting highlights:

 Properly implemented RPC saves and restores register windows on a n-window-processor exactly like a normal procedure call on a n - 1window-processor, if inside actual enregistering is used. This means, RPC profits in the same way from multiple windows as local PC. (Classically, PC profits, whereas RPC suffers.) 34 7 REMARKS

• In periods of dominating window under/overflow and less context switches (trivial mode), the algorithms perform exactly like the classical ones.

• Window overflow and context switch costs are independent of the number of the processor's windows, if outside actual enregistering is used.<sup>2</sup> Overflow saves at most 1, context switch saves at most 2 and restores 1 register window.

Unfortunately, underflow is not completely lazily handled, since it flushes a total region then hitting it and not only its top window. We discuss three ideas to overcome this problem:

- You could save only the bottom window to memory and all others of the same region move one window towards the bottom. Unfortunately, copy window i to i + 1 is nearly as expensive as copying it to memory. Although this method is cheaper in some cases, it can also be much more expensive than the original one.
- You could give up the restriction that the windows in the processor's register file are always the stack top of the thread's logical window stack. Then the top window can be saved to memory while the other windows of the region still stay in registers. But as long as there is one window left of the region, a new activation of the thread would induce reloading all register windows at the same place. In most cases this would be very inefficient due to clashing with the same windows which were restored just before.

Additionally you could allow region splitting. Then the stack top of a partially flushed region could be restored elsewhere.

Indeed, the last method is probably the only promising one. But it would complicate the algorithms a lot and it is not sure whether it will lead to increased or decreased efficiency. Furthermore, practice may even show that 'flush all on window underflow' is a non-problem.

# 7.2 Hardware Support

An elegant, effective and cheap method supporting lazy context switch even better would be to replace the window invalid mask register by a simple register management unit (RMU). This should extend the currently used

<sup>&</sup>lt;sup>2</sup>For inside actual enregistering, owner; must be set to nil from the actual top to the leftmost valid window.

7.3 Tuning 35

wim register (1 bit per window) by  $log_2(n)$  bits per window mapping each logical window to a physical one. Then window allocation would no longer be restricted to contiguous regions and 'flush all' becomes obsolete. Instead, window allocation would be free of topological constraints.

## 7.3 Tuning

Due to a Sparc processor's low exception raising time (4 cycles) presaving or prerestoring register windows is not helpful, neither on overflow nor on underflow nor when enregistering a new thread.

The effective tuning point is choosing the bottom window of the new region when enregistering. Statements about the real effects of outside actual, inside actual and further enregistering methods require practical experiments. One should also try combinations of the inside and outside method, for example

```
if is rpc call switch
    then use inside
elif actual region is small
    then use outside
elif most of the actual region is free
    then use half outside
    else use outside
fi
```

# 7.4 Special RPC Support

RPC-related context switch can be supported by

- omitting the barrier nil window and using inside actual enregistering on call,
- deallocating the complete actual region on return,
- on return using the actual region as new region, if the new (return to) thread is deregistered during executing RPC.

# 7.5 Entering and Leaving Kernel Mode

The techniques presented here should also be applied to handle (non ipc) system calls, page faults and other exceptions/interrupts.

36 REFERENCES

# References

[i486] Intel Corporation. i486 Processor Programmer's Reference Manual. Santa Clara, 1986

[Lie 93] J.Liedtke. Improving IPC by Kernel Design. Proceedings 14th ACM Symposium on Operating Principles, Asheville, North Carolina, December 1993.

[Ross] Ross Technology Inc. SPARC RISC User's Guide. Austin 1990.