Saturday, April 28, 2007

OS as parallel system

  1. The Linux kernel contains the following subsystems:
    • process management (virtual memory system, kernel entry points)
    • physical memory management system
    • file system
    • device drivers and device management
    • network system
  2. The subsystems in the Linux kernel resemble the "servers" in a micro-kernel operating system, and each can be seen as a multi-threaded server too. Although the Linux kernel provides a multi-process environment for user space, the kernel itself should be seen as a multi-threaded program, because everything inside it is shared.
  3. In the process management system, most resources are not shared between processes (e.g., the virtual memory of one process is not shared with another), while they are shared between threads.
  4. Processes/threads are only one kind of parallel actor; the others include IRQ handlers, softirqs, tasklets, work queues, and timer operations. With all of these running concurrently, the subsystems in the Linux kernel can be seen as multi-threaded server systems.
  5. If multiple threads access different objects, such as different files, there is no conflict.
  6. Copy, cache, and transaction techniques from parallel systems can be used in the Linux kernel too, such as the per-CPU cache in the slab subsystem (see the sketch below).
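
The per-CPU cache idea can be illustrated in user space. Below is a minimal sketch of the pattern, assuming invented names (NR_CPUS, this_cpu(), cache_alloc()); the real slab allocator does the equivalent inside kmem_cache_alloc(), with preemption disabled while the per-CPU array is touched.

    #include <stdlib.h>

    #define NR_CPUS   4
    #define CACHE_MAX 8

    struct obj_cache
    {
        void *slot[CACHE_MAX]; /* cached free objects, touched by one CPU only */
        int count;
    };

    static struct obj_cache percpu_cache[NR_CPUS];

    /* Stand-in for smp_processor_id(). */
    static int this_cpu(void) { return 0; }

    void *cache_alloc(size_t size)
    {
        struct obj_cache *c = &percpu_cache[this_cpu()];

        /* Fast path: take from the local cache; no lock is needed
         * because only this CPU touches its own array. */
        if (c->count > 0)
            return c->slot[--c->count];
        /* Slow path: fall back to the shared allocator (which would
         * take a lock in a real kernel). */
        return malloc(size);
    }

    void cache_free(void *p)
    {
        struct obj_cache *c = &percpu_cache[this_cpu()];

        if (c->count < CACHE_MAX)
            c->slot[c->count++] = p; /* keep it for local reuse */
        else
            free(p);                 /* overflow back to the shared allocator */
    }

Because each CPU works on its own array, the common case needs no locking; this is the "copy" technique applied to the allocator's free lists.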

Scheduler for all "threads"

In an operating system, not only the traditional processes and threads are schedulable entities; conceptually, IRQ handlers, software IRQs, tasklets, timer operations, and work queues should be schedulable entities too.

Some threads can have a higher priority than some tasklets, so they can preempt those tasklets. A thread can even have a higher priority than some IRQ handlers, so that those IRQs are not serviced while the thread executes. I think this is a useful concept in real-time systems, and the thread-style IRQ handling in FreeBSD may follow the same concept; a minimal sketch of the idea appears below.
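
Below is a minimal user-space sketch of that idea, using POSIX threads and a semaphore; fake_hard_irq() is an invention of the sketch, standing in for the real interrupt entry. Once the handler body is an ordinary thread, its priority can be adjusted with pthread_setschedparam() like any other thread's.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t irq_event;

    /* Called in "interrupt context": do the minimum and return. */
    void fake_hard_irq(void)
    {
        sem_post(&irq_event); /* just record that the device needs service */
    }

    /* The schedulable part of the handler: an ordinary thread. */
    static void *irq_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            sem_wait(&irq_event);         /* sleep until the IRQ fires */
            printf("servicing device\n"); /* the actual device handling */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        sem_init(&irq_event, 0, 0);
        pthread_create(&tid, NULL, irq_thread, NULL);
        fake_hard_irq();
        pthread_join(tid, NULL); /* never returns in this toy example */
        return 0;
    }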

Video pipeline

  1. Video capture
    1. Analog video
      1. Tuner
      2. ADC
      3. Analog video decoder (PAL, NTSC)
    2. Digital video
      1. Digital tuner
      2. Demodulation
      3. Demux
      4. Decoder
  2. Video processing
    1. Video clipping (cropping)
    2. Video scaling
    3. Video scan type/scan rate conversion
      1. Frame/field rate conversion
      2. Interlaced -> progressive, progressive -> interlaced
    4. Color space conversion (see the sketch after this outline)
      1. RGB -> YUV
      2. YUV -> RGB
      3. Brightness, contrast, hue/tint, color temperature conversion
    5. Video enhancement
      1. Histogram equalization
      2. Flesh tone correction
      3. Noise reduction
    6. Video blending
      1. Chroma keying
      2. Video mixing
  3. Video display
    1. Gamma correction
    2. Converting to panel signal format (Panel timing control)
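
As a concrete example of the color space conversion stage, here is a minimal RGB -> YCbCr sketch using the BT.601 full-range coefficients in 8-bit fixed point. A real pipeline would pick the coefficients (BT.601 vs. BT.709, limited vs. full range) to match the rest of the chain.

    #include <stdint.h>

    void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                      uint8_t *y, uint8_t *cb, uint8_t *cr)
    {
        /* Fixed-point versions of Y = 0.299 R + 0.587 G + 0.114 B etc.,
         * scaled by 256; division instead of a shift keeps the rounding
         * of negative intermediate values well defined. */
        *y  = (uint8_t)((77 * r + 150 * g + 29 * b) / 256);
        *cb = (uint8_t)((-43 * r - 85 * g + 128 * b) / 256 + 128);
        *cr = (uint8_t)((128 * r - 107 * g - 21 * b) / 256 + 128);
    }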

Cancelable operations

There are many cancelable operations in software systems with user interaction, such as canceling auto-tuning in a TV. Possible implementation mechanisms for cancelable operations are as follows:

  1. message
    A "cancel" message is sent to the thread doing the long-running job when necessary. The function doing the long-running job must check periodically whether a "cancel" message has been received; if so, it must stop working and return.

    Cons:
    • It isn't the native programming model in Linux/Unix, so third-party software will not use it.
    • Can't cancel work in the kernel (for example, a sleep or a blocking read cannot be interrupted).
  2. signal / real-time signal
    A "cancel" signal is sent to the thread doing the long-running job when necessary. When the signal is received, any blocked operation (such as a sleep or a blocking read) is interrupted and the signal handler is called, which sets a flag to indicate that the cancel signal has been received. The function doing the long-running job must check the flag to see whether a "cancel" signal has been received; if so, it must stop working and return.

    Pros:
    • Native programming model in Linux/Unix, easy to cooperate with third-party software.
    • Can cancel work in the kernel.

Problems:
  • Race conditions!!! The cancel signal may arrive between the flag check and the blocking call; see the note after the sketch below.
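
Here is a minimal sketch of the signal mechanism, assuming SIGUSR1 is chosen as the "cancel" signal. The handler only sets a flag, and sigaction() is installed without SA_RESTART so that blocking calls fail with EINTR instead of being restarted.

    #include <errno.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t cancelled;

    static void cancel_handler(int sig)
    {
        (void)sig;
        cancelled = 1; /* only set a flag: async-signal-safe */
    }

    int long_running_job(int fd)
    {
        char buf[4096];
        ssize_t n;

        for (;;) {
            if (cancelled)  /* check the flag before each blocking call */
                return -1;
            n = read(fd, buf, sizeof(buf));
            if (n < 0 && errno == EINTR)
                continue;   /* interrupted: loop back to the flag check */
            if (n <= 0)
                return (int)n;
            /* ... process buf ... */
        }
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = cancel_handler; /* no SA_RESTART: let read() fail with EINTR */
        sigaction(SIGUSR1, &sa, NULL);
        return long_running_job(STDIN_FILENO);
    }

This sketch still has the race mentioned above: if the signal arrives after the flag check but before read() blocks, the read is not interrupted. Real code closes that window with pselect()/ppoll() or the self-pipe trick.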

Friday, April 27, 2007

Message mechanism implementation in POSIX system

The message mechanism is widely used for inter-thread (inter-process) communication. Some possible implementations of a message mechanism are as follows:


  1. pipe (anonymous or named)
    • Pros:
      • Can be used for inter-process and inter-thread messages
      • Can suspend on get message (desired)
      • Supports non-blocking operation
    • Cons:
      • Does not support message priority
  2. Socket (stream or datagram)
    Used by the X Window System for its messages. Almost the same as a pipe, with the following additional pros and cons:
    • Pros:
      • Can be used in a network environment (not important now)
    • Cons:
      • Needs kernel networking support.
  3. POSIX message queue
    • Pros:
      • Can be used for inter-process and inter-thread messages
      • Can suspend on get message (desired)
      • Supports time-outs
      • Supports message priority
      • Supports non-blocking operation
    • Cons:
      • Needs a recent kernel (>= 2.6.6)
  4. SysV message queue
    • Pros:
      • Can be used for inter-process and inter-thread messages
      • Can suspend on get message (desired)
      • Supports message priority
      • Supports non-blocking operation
    • Cons:
      • No time-out support (unlike the POSIX message queue)

Conclusion:
  • The mechanisms above are almost functionally equivalent; the POSIX or SysV message queue is preferred where available, because they are closer to a message abstraction and therefore need less implementation effort. A minimal POSIX message queue example follows.
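
Here is a minimal sketch of the preferred choice, the POSIX message queue; the queue name "/demo_mq" and the message contents are inventions of the example (link with -lrt on Linux).

    #include <fcntl.h>
    #include <mqueue.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        struct mq_attr attr;
        char buf[64];
        unsigned int prio;
        mqd_t mq;
        int i;

        memset(&attr, 0, sizeof(attr));
        attr.mq_maxmsg = 10;           /* queue depth */
        attr.mq_msgsize = sizeof(buf); /* maximum message size */

        mq = mq_open("/demo_mq", O_CREAT | O_RDWR, 0600, &attr);
        if (mq == (mqd_t)-1) {
            perror("mq_open");
            return 1;
        }

        /* Messages with higher priority are received first. */
        mq_send(mq, "low", 4, 0);
        mq_send(mq, "high", 5, 9);

        for (i = 0; i < 2; i++) {
            /* mq_receive() suspends while the queue is empty: the
             * desired "suspend on get message" behavior. */
            if (mq_receive(mq, buf, sizeof(buf), &prio) > 0)
                printf("prio %u: %s\n", prio, buf);
        }

        mq_close(mq);
        mq_unlink("/demo_mq");
        return 0;
    }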

Simple component system for embedded system

1. Objective

  • Separate interface and implementation
    • Users depend only on the interface, not the implementation, so the implementation can evolve without breaking users.
    • Users can use several implementations of one interface simultaneously.
  • Modular
    • A component is a special kind of module, with stronger modularity.
    • The dependency relationships of software modules are managed.
  • Testable
    • The component is the unit of testing.
  • Position transparent (single-thread/multi-thread transparent)
    • Local/remote and same-thread/different-thread transparent.

2. Interface

  • Described formally in an IDL: language neutral
  • Binary standard: vtable
  • Implementation in C: a function pointer structure
    typedef struct dvFoo dvFoo;   /* forward declaration, so the ops
                                     functions can take dvFooIntf */
    typedef dvFoo *dvFooIntf;

    typedef struct
    {
        void (*dvFooInit)(dvFooIntf intf);
        void (*dvFooDestroy)(dvFooIntf intf,
                             FOO_DESCRIPTION sFooDesc);
        void (*dvFooBarEnable)(dvFooIntf intf, BOOL bEnable);
        ...
    } dvFooOps;

    struct dvFoo
    {
        dvFooOps *ops;   /* the vtable */
    };
  • Implementation in C++: a class with only pure virtual methods
    class dvFoo
    {
    public:
        virtual void dvFooInit(void) = 0;
        virtual int32 dvFooDestroy(FOO_DESCRIPTION desc) = 0;
        virtual void dvFooBarEnable(BOOL bEnable) = 0;
        ...
    };

    typedef dvFoo *dvFooIntf;
  • A tool can be developed to generate the C and C++ implementations of an interface from the IDL automatically.

3. Component (object)

  • Properties:
    • A software module that implements one or several interfaces. The interfaces are the API of the component and should be stable.
    • The source code, documents, configuration information, and test code of a component are organized in an individual folder.
    • Every component has its own version number and can evolve individually.
    • Components can be selected and configured via a configuration tool.
  • Implementation
    • Implementation in C: fill the function pointer structure
      dvFooOps opsfoo = {
          dvLisaFooInit,
          dvLisaFooDestroy,
          dvLisaFooBarEnable,
          ...
      };

      typedef struct
      {
          dvFooOps *ops;   /* must stay first: same layout as dvFoo */
          /* data members for Lisa */
      } dvLisaFoo;

      dvLisaFoo foo = {
          &opsfoo,
          /* data members for Lisa */
      };
      For a driver implementation, the instance number can be used as a data member.
    • Implementation in C++: inherit from and implement the interfaces
      class dvLisaFoo : public dvFoo
      {
          /* data members for Lisa */
      public:
          void dvFooInit(void);
          int32 dvFooDestroy(FOO_DESCRIPTION desc);
          void dvFooBarEnable(BOOL bEnable);
          ...
      };
      For a driver implementation, the instance number can be used as a data member.

4. Usage

After getting a pointer to the interface of a component, the user of the component can use it via direct function calls, without needing to know the implementation of the component (a C user can even use a C++ implementation).

#define dvFooInit(intf) (intf)->ops->dvFooInit((intf))
#define dvFooDestroy(intf, sFooDesc) \
    (intf)->ops->dvFooDestroy((intf), (sFooDesc))

dvFooIntf intf;
/* Get interface */
dvFooInit(intf);
dvFooDestroy(intf, sFooDesc);

5. Inter-thread (inter-process) component

If the user and the implementer of a component reside in different threads (or processes), a direct function call is not sufficient. To use a component across threads (processes), the following mechanism is provided:
  • A proxy is provided in the user's thread (process); it is the local representative of the component implemented in the other thread (process). The user uses the proxy in the same way as a component in a single-thread environment. The proxy implements the interface with inter-thread (inter-process) communication (such as message sending and receiving).
  • A stub is provided in the implementation thread (process); it is the representative of the user in the other thread (process). The stub uses the component in the same way as the user in a single-thread environment. The stub simulates the user with inter-thread (inter-process) communication (such as message receiving and sending).
  • A simple inter-thread (inter-process) implementation also gives a crude mutual exclusion scheme: brute-force serialization of all function calls on one component.
  • In the implementation, the implementing thread's id can be kept in the proxy as a data member.
  • A tool can be developed to generate the proxy and stub implementations of an interface from the IDL automatically; a hand-written proxy sketch follows.
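
A hand-written proxy for one function of dvFoo might look like the sketch below, reusing the dvFoo types from section 2 and assuming a POSIX message queue carries requests to the implementing thread. The message layout, the MSG_FOO_BAR_ENABLE opcode, and proxyFooBarEnable() are inventions of this sketch, standing in for what the IDL tool would generate.

    #include <mqueue.h>

    enum { MSG_FOO_BAR_ENABLE = 1 }; /* one opcode per interface function */

    struct foo_msg
    {
        int op;      /* which interface function is being called */
        int bEnable; /* marshalled argument */
    };

    typedef struct
    {
        dvFooOps *ops; /* must stay first: same layout as dvFoo */
        mqd_t req_q;   /* request queue to the implementing thread */
    } dvProxyFoo;

    /* Proxy implementation of dvFooBarEnable: marshal the arguments
     * into a message and send it to the implementing thread. */
    static void proxyFooBarEnable(dvFooIntf intf, BOOL bEnable)
    {
        dvProxyFoo *proxy = (dvProxyFoo *)intf;
        struct foo_msg msg;

        msg.op = MSG_FOO_BAR_ENABLE;
        msg.bEnable = bEnable;
        mq_send(proxy->req_q, (const char *)&msg, sizeof(msg), 0);
        /* A function with a return value would block here on a reply
         * queue; the stub on the other side unmarshals the message
         * and calls the real component. */
    }

Because the proxy begins with the same ops pointer as any component, the user-side macros of section 4 work on it unchanged.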

6. Component manager (component directory/component factory)

In order to get the interface of a component, a component manager should be provided. The simplest component manager works as follows:
  • All components are created before any component is used. For example, all device driver components are created by the source code generated by the configuration tool, and likewise for all middleware components.
  • The component manager keeps lists of all components in the system. For example, the device driver manager has the list of all device driver components.
  • The component manager provides an interface for getting the interfaces of components. For example, the interfaces of all device driver components can be obtained from the device driver manager.
  • For inter-thread (inter-process) components, the proxy and stub are created before use, just like the component itself. The proxy, instead of the component itself, goes into the component manager's list, so the user of the component transparently gets the proxy, which is exactly what is desired. A minimal manager sketch follows.
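
The simplest manager can be little more than a name-to-interface table, as sketched below; dv_register() and dv_get_interface() are names invented for this sketch, standing in for the code generated by the configuration tool.

    #include <string.h>

    struct component_entry
    {
        const char *name;
        void *intf; /* dvFooIntf, dvBarIntf, ... */
    };

    /* Filled by generated code at startup; proxies are registered
     * here for inter-thread components. */
    static struct component_entry components[16];
    static int nr_components;

    void dv_register(const char *name, void *intf)
    {
        components[nr_components].name = name;
        components[nr_components].intf = intf;
        nr_components++;
    }

    void *dv_get_interface(const char *name)
    {
        int i;

        for (i = 0; i < nr_components; i++)
            if (strcmp(components[i].name, name) == 0)
                return components[i].intf;
        return NULL;
    }

A user then writes dvFooIntf intf = dv_get_interface("foo"); and uses the macros of section 4, whether intf points at the real component or at its proxy.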

Auto Shanghai 2007

Photos of Auto Shanghai 2007: http://picasaweb.google.com/huang.ying.caritas/AutoShanghai2007

Sunday, April 15, 2007

Virtualization

1. Virtualization level

a. Timing-accurate virtualization

The timing logic of the hardware is simulated by software. It is
mainly used for chip verification. Its performance is the lowest.

b. Full virtualization

It provides instruction-accurate CPU virtualization and the necessary
peripheral virtualization. An "unmodified" guest operating system can
run on it.

Examples: Bochs, Qemu, Skyeye, Xcopilot.

c. Para-virtualization

It provides instruction-accurate CPU virtualization and the necessary
peripheral virtualization, but with some modifications to the real
hardware architecture, so the guest operating system must be
"modified" to run on it.

Examples: Xen, UML.

d. Operating system level virtualization

Some CPU instructions (most non-privileged instructions) and the
operating system interface (mostly the system call interface) are
virtualized, so that target applications can run on it. The
virtualized OS interface may be the same as or different from that of
the host operating system.

Examples with the same OS interface as that of the host OS: Linux
vserver, FreeBSD jail, Qemu user space emulation.

Examples with a different OS interface from that of the host OS:
Linux binary compatibility support on FreeBSD, Wine.

2. Implementation

a. Hardware based or OS based implementation

Some virtualization systems work as a user space application (perhaps
with some kernel space components) on a host OS; these are called OS
based implementations. Others work on bare hardware directly; these
are called hardware based implementations.

b. Interpretation based implementation

It is mainly used for full virtualization. The core of the
virtualization system is usually a loop that fetches an instruction,
interprets it, executes it, and advances the PC, perhaps with some
pipeline handling.

An interpretation based virtualization system is usually good on
portability, if it is programmed in a portable language. A toy sketch
of the core loop follows the examples.

Examples: Bochs, Skyeye, Xcopilot.
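
The core loop can be illustrated with a toy machine. The three
opcodes below are invented for the sketch; a real interpreter like
Bochs decodes a full ISA and models devices and exceptions.

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2 }; /* invented opcodes */

    int main(void)
    {
        /* Guest "memory": LOADI r0,#5; LOADI r1,#7; ADD r0,r1; HALT */
        uint8_t mem[] = { OP_LOADI, 0, 5, OP_LOADI, 1, 7,
                          OP_ADD, 0, 1, OP_HALT };
        uint32_t reg[2] = { 0, 0 };
        uint32_t pc = 0;

        for (;;) {
            uint8_t op = mem[pc]; /* fetch */
            switch (op) {         /* decode and execute */
            case OP_LOADI:
                reg[mem[pc + 1]] = mem[pc + 2];
                pc += 3;          /* advance the PC */
                break;
            case OP_ADD:
                reg[mem[pc + 1]] += reg[mem[pc + 2]];
                pc += 3;
                break;
            case OP_HALT:
                printf("r0 = %u\n", reg[0]); /* prints 12 */
                return 0;
            default:
                return 1; /* illegal instruction */
            }
        }
    }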

c. Dynamic translation based implementation

In a dynamic translation based virtualization system, the target
binary code is first translated into host binary code; the translated
code is then executed on the host CPU directly.

In general, its performance is better than that of an interpretation
based implementation, while portability can remain good.

It can be used for full virtualization, in which both the kernel
space code and the user space code are translated before execution.
It can also be used for operating system level virtualization, in
which only the user space code is translated and the OS interface
(system calls) is implemented by calling the host OS interface
directly.

Examples: Qemu, Qemu user space emulation.

d. Monitor based implementation

In a monitor based virtualization system, the target CPU must be the
same as the host CPU. The target code is executed on the host CPU
directly, but usually at a lower privilege level. Certain privileged
instructions or resource accesses trap into the monitor, which
usually runs at the highest privilege level; the monitor then does
some checks, completes the privileged operation or resource access on
behalf of the target code, and returns to the guest code.

Its performance can be nearly native, but its main constraint is that
the guest CPU must be the same as the host CPU.

Examples: KQemu.

e. Para-virtualization implementation

In a para-virtualization system, the "unmodified" guest code can not
be run directly. Instead, the guest OS must usually be "modified" to
be implemented on top of the para-virtualization system instead of
the real hardware. Some privileged operations or resource accesses
are done by calling into the para-virtualization system instead of
programming the hardware directly.

Examples: Xen, UML, coLinux.

f. Implement the API of another OS in user space

This can be used for operating system level virtualization.

Examples: Wine.

g. Implement the system call of another OS in kernel space

This can be used for operating system level virtualization.

Examples: Linux binary compatibility support on FreeBSD.

h. Partition resources between processes

Some resources, such as the file system, CPU time, network addresses
and memory, are partitioned between processes, such that some
processes work on the restricted resources only.

Examples: Linux vserver, FreeBSD jail.

Tuesday, January 30, 2007

GTK+ Map Viewer

Recently, I developed a simple map viewer for my Zaurus with GTK+.

Because most map image files are very big, from several MB to tens of MB, it is very inconvenient to view a map with an ordinary image viewer, especially on an embedded system like a PDA or mobile phone.

This program was developed to facilitate map viewing. The basic ideas behind it are:
  1. The big map image file is split into many small image files, which are loaded on demand to reduce the memory requirement (see the sketch below).
  2. The big map image file is pre-scaled too, because scaling a big image file is also a memory/CPU hungry task.
Some other useful features for a map viewer are also provided. For example, personal marks can be added to a map. A map-making tool is also provided to facilitate splitting and scaling the big map image file.
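
The on-demand loading reduces to computing which tiles cover the current viewport. The sketch below assumes fixed-size tiles named tile_<col>_<row>.png; the naming scheme and TILE_SIZE are assumptions of the sketch, not necessarily what gviewmap uses.

    #include <stdio.h>

    #define TILE_SIZE 256

    /* Compute (and here, just name) the tiles covering a viewport
     * given in map pixel coordinates. */
    void load_visible_tiles(int view_x, int view_y, int view_w, int view_h)
    {
        int col0 = view_x / TILE_SIZE;
        int row0 = view_y / TILE_SIZE;
        int col1 = (view_x + view_w - 1) / TILE_SIZE;
        int row1 = (view_y + view_h - 1) / TILE_SIZE;
        int col, row;
        char name[64];

        for (row = row0; row <= row1; row++) {
            for (col = col0; col <= col1; col++) {
                snprintf(name, sizeof(name), "tile_%d_%d.png", col, row);
                /* A real viewer would load the file with
                 * gdk_pixbuf_new_from_file(name, NULL), keep it in a
                 * small cache, and evict tiles far off screen. */
                printf("would load %s\n", name);
            }
        }
    }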

The source code of the program can be downloaded from:
http://linux-mcr700.sourceforge.net/gviewmap-0.1.tar.gz

Monday, January 29, 2007

Stack-less multi-task system (2) -- implementation

The basic ideas behind the implementation:
  1. Replace the implicit C stack with an explicit linked stack.
  2. Use a label to record the PC.
  3. Store task state in the heap instead of on the stack.
  4. Only functions that sleep, directly or indirectly, need to deal with the stack explicitly; other functions aren't affected.
Performance comparison:
  1. Task stack space is saved; only one stack is needed.
  2. Task switching time may be a little longer, because the execution path after a switch goes through all the functions invoked before the sleep. But if the tasks are small enough and the function call depth is short, the task switching time is affordable.
Sample Code:

#include <stdlib.h>
#include <assert.h>
#include <string.h>

/* One explicit "stack frame"; frames are linked together instead of
 * living on the C stack. */
struct stack
{
    short stk_label;          /* resume point: the saved "PC" */
    void *stk_data;           /* frame-local state, kept in the heap */
    struct stack *stk_prev;
    struct stack *stk_next;
};

#define TASK_STATE_RUN 0
#define TASK_STATE_SLEEP 1

struct task
{
    short tsk_state;
    void (*tsk_entry)(void);
    struct stack tsk_stk;       /* embedded top-level frame */
    struct stack *tsk_stk_curr; /* currently active frame */
};

extern struct task tasks[];
struct task *task_curr;

/* Mark the current task sleeping; the actual "switch" happens by
 * returning from all the functions the task has entered. */
void sleep(void)
{
    task_curr->tsk_state = TASK_STATE_SLEEP;
}

void wakeup(int task_id)
{
    tasks[task_id].tsk_state = TASK_STATE_RUN;
}

/* return code */
#define RC_SUCCESS 0
#define RC_FAILED 1
#define RC_PENDING 2

/* Enter a nested function: descend to the next frame, allocating it
 * on first use (zeroed, so stk_label starts at the first label). */
void push_stack(void)
{
    if (!task_curr->tsk_stk_curr->stk_next) {
        struct stack *stk;

        stk = malloc(sizeof(struct stack));
        assert(stk);
        memset(stk, 0, sizeof(*stk));
        stk->stk_prev = task_curr->tsk_stk_curr;
        task_curr->tsk_stk_curr->stk_next = stk;
    }
    task_curr->tsk_stk_curr = task_curr->tsk_stk_curr->stk_next;
}

/* Leave a nested function: free the current frame and ascend. */
void pop_stack(void)
{
    struct stack *stk;

    stk = task_curr->tsk_stk_curr->stk_prev;
    assert(stk->stk_next->stk_next == NULL);
    free(stk->stk_next);
    stk->stk_next = NULL;
    task_curr->tsk_stk_curr = stk;
}

#define TEST_TASK_LABEL1 0
#define TEST_TASK_LABEL2 1
#define TEST_TASK_LABEL3 2

#define TEST_TASK_SUB_LABEL1 0
#define TEST_TASK_SUB_LABEL2 1

/* A function that never sleeps needs no explicit frame. */
void test_task_label1(void)
{
    /* everything as normal, without sleep */
}

/* A function that sleeps must manage an explicit frame and return
 * RC_PENDING so that its callers can unwind. */
int test_task_label2(void)
{
    int rc = RC_SUCCESS;

    push_stack();

    switch (task_curr->tsk_stk_curr->stk_label) {
    case TEST_TASK_SUB_LABEL1:
        /* record where to resume, then "block" */
        task_curr->tsk_stk_curr->stk_label = TEST_TASK_SUB_LABEL2;
        rc = RC_PENDING;
        sleep();
        break;
    case TEST_TASK_SUB_LABEL2:
        /* resumed here after wakeup() */
        break;
    }
    if (rc != RC_PENDING)
        pop_stack();    /* keep the frame alive while pending */
    return rc;
}

void test_task(void)
{
    switch (task_curr->tsk_stk.stk_label) {
    case TEST_TASK_LABEL1:
        test_task_label1();
        task_curr->tsk_stk.stk_label = TEST_TASK_LABEL2;
        task_curr->tsk_stk.stk_data = malloc(10); /* state lives in the heap */
        sleep();
        break;
    case TEST_TASK_LABEL2:
        if (test_task_label2() == RC_PENDING)
            break;      /* still waiting: unwind to the scheduler */
        task_curr->tsk_stk.stk_label = TEST_TASK_LABEL3;
        /* fall through */
    case TEST_TASK_LABEL3:
        break;
    default:
        assert(0);
        break;
    }
}

/* A second task whose only job is to wake the first one up. */
void test_task2(void)
{
    wakeup(0);
}

struct task tasks[2] =
{
    {
        TASK_STATE_RUN,
        test_task,
        { 0, NULL, NULL, NULL },
        NULL,
    },
    {
        TASK_STATE_RUN,
        test_task2,
        { 0, NULL, NULL, NULL },
        NULL,
    },
};

/* Run every runnable task until it returns (completed or sleeping);
 * the single C stack is reused by each task in turn. */
void schedule(void)
{
    int i;

    for (i = 0; i < 2; i++) {
        if (tasks[i].tsk_state == TASK_STATE_RUN) {
            task_curr = &tasks[i];
            task_curr->tsk_stk_curr = &task_curr->tsk_stk;
            tasks[i].tsk_entry();
        }
    }
}

int main(int argc, char *argv[])
{
    for (;;)
        schedule();
    return 0;
}

Sunday, January 14, 2007

Stack-less multi-task system

In a traditional multi-task operating system, there is one stack per task. The stack is mainly used for storing the call chain, function parameters, and local variables, that is, the context of the task. When tasks are switched, their stacks are switched too.

The problem of a stack-based multi-task system is that the size of each stack must be the maximum size the stack may ever grow to. In Linux, the kernel task stack is 4 kB or 8 kB, but in 99% of cases only a small part of the whole stack is used, so the efficiency of stack memory usage is low.

Another solution is for all tasks to share the same stack. While a task executes, the stack is used for storing the call chain, function parameters, and local variables, as in a stack-based multi-task system. When the task is about to be switched off the CPU, it saves all the state (context) needed to continue its work elsewhere (such as in global variables or malloc'ed memory) and hands the stack to the task being switched onto the CPU. The new task will clobber all the contents of the stack and use it exclusively until the next switch.

The problem of a stack-less multi-task system is that when the switching point is deeply nested, it is hard to store all the needed state in a modular way. For example, when a page fault occurs in a memory-mapped page, the handler goes through the memory subsystem and the file system, and may be switched off the CPU to wait for a disk read. How can the task save the state of the memory subsystem and the file system in a modular way? This is why the stack-less scheme is used mainly in the kernels of micro-kernel operating systems: there are only a few simple subsystems there, so the state saving can be done in a modular, manageable way.

One resolution is an adaptive stack scheme. If a task is working on a system call with very small stack usage (such as wait in POSIX), all the needed state is stored elsewhere and the stack is freed; otherwise the stack is preserved. But this resolution is not very effective if most system calls use more stack.

I think another possible resolution is to split each task into smaller tasks. Because every task is so small, its state can be stored without referring to other subsystems, that is, in a modular way. And because the per-task overhead is much less than before, the total overhead is not too high. Perhaps one small task corresponds to one subsystem of a traditional system; for example, the memory system and the file system can be tasks too. In such a system, creating a process turns into creating several tasks.