OCEOSMP – a new real time operating system for multicore embedded systems

Monday 12 September 2022

A number of significant differences exist between the operating systems used in embedded systems and the more familiar operating systems used in desktop and other computers, writes Professor Michael Ryan.

Perhaps the most significant of these is due to the embedded system code base being fixed once the system has been developed, unlike desktop and similar systems where the application code base can change radically during the lifetime of the system.

Desktop operating systems must allow the addition of new application code at any time and must support this new code being put into use irrespective of what other applications may be active.

The typical approach used involves virtual addressing, where applications use logical addresses that are mapped to actual physical addresses by page tables set up by the operating system together with support from special memory management hardware.

Since embedded system software is fixed, with all tasks and the associated code defined before scheduling begins, it is usually possible to use physical addresses in the code and to avoid the need for logical-to-physical address translation and the associated overheads. OCEOSMP is designed for systems of this type.

‘Black box’ versus ‘white box’

Another difference might be described as ‘black box’ versus ‘white box’. Desktop or similar operating systems are usually ‘black box’: it is difficult for the application code to determine what is happening, or has happened, as the operating system switches between tasks.

The OCEOSMP ‘white box’ approach involves keeping a record of activities that can be inspected by the application at any time. All designs are based on assumptions, and this approach makes it possible to check, while the system is running, how close those assumptions are to being violated and, if one has been violated, to determine what happened so that corrective action can be taken. Capture of task timing and other information, together with fault anticipation, detection, isolation and recovery features, helps ensure that the overall system is robust.

A further difference relates to roles. In desktop or similar operating systems the operating system starts before any application code is run, and is essentially the ‘master’, allowing applications to be set up and put into action.

In embedded systems the application code usually starts first and in due course initialises the operating system, causes it to start scheduling tasks, and if scheduling is abandoned may resume control. The operating system can be thought of as the ‘servant’ of the application, and OCEOSMP is designed to operate in this manner.
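The ‘servant’ relationship can be sketched in a few lines of C. This is only an illustration of the control flow described above: the names rtos_init, rtos_start and app_run are hypothetical stand-ins, not the OCEOSMP API, and the operating system is replaced by stubs so the sketch compiles on its own.

```c
#include <stdbool.h>

/* Hypothetical status codes, for illustration only. */
typedef enum { RTOS_OK = 0, RTOS_STOPPED = 1, RTOS_NOT_READY = 2 } rtos_status;

static bool rtos_ready = false;   /* stub state standing in for the OS */
bool app_resumed = false;         /* set when control returns to the application */

/* Stub: initialise the operating system's fixed data. */
static rtos_status rtos_init(void) {
    rtos_ready = true;
    return RTOS_OK;
}

/* Stub: start scheduling. A real scheduler would run tasks here and
   return only when scheduling is abandoned. */
static rtos_status rtos_start(void) {
    return rtos_ready ? RTOS_STOPPED : RTOS_NOT_READY;
}

/* The application runs first, then calls into the operating system. */
rtos_status app_run(void) {
    /* 1. Application-specific set-up would go here. */
    if (rtos_init() != RTOS_OK) {
        return RTOS_NOT_READY;
    }
    /* 2. Hand control to the scheduler. */
    rtos_status reason = rtos_start();
    /* 3. Scheduling has ended and the application is in control again. */
    app_resumed = true;
    return reason;
}
```

The point of the sketch is the ordering: the application decides when the operating system is initialised, when scheduling starts, and what to do with the status it gets back.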

These general differences are compounded by the need in most embedded systems to provide deterministic behaviour and to guarantee that timing deadlines will be met. This is unlike desktop or similar operating systems, where the reaction to every circumstance, and how long it will be before a result is obtained, are usually not predictable.

Unbounded priority inversion

In an embedded system, deterministic behaviour and meeting timing requirements can be critically important. Operating system problems such as unbounded priority inversion (which has occurred as far away as Mars) are among the issues that need to be avoided and that can complicate embedded system software design and development. OCEOSMP is deterministic, with problems such as unbounded priority inversion excluded by its design.

OCEOSMP allows task deadlines to be specified and, if a deadline is missed, takes application-specific action. In addition, it provides time-based services that allow the output of data bits and task start requests to be set to occur in a specific system time window irrespective of task scheduling.

Overall OCEOSMP has been designed ab initio to meet embedded system development needs. It is intended for use in single or multicore CPU based embedded software that must be deterministic, meet timing constraints, be highly robust, and be efficient in the use of resources. It was developed in co-operation with the European Space Agency to ESA’s flight level qualification Level B and further qualification is in progress.

[Figure: Debug support for OCEOSMP applications (DMON software)]

OCEOSMP and the multi-core CPU

Two main advantages arise from using a multicore CPU: additional computational power and redundancy. OCEOSMP is designed to support both. It makes symmetric use of the CPU cores and allows any core to be removed from use if found to be faulty.

Additional computational power

In OCEOSMP a core can be restricted to use only by tasks above a certain priority, and a task configuration can specify that the task should only run on a specific CPU core. Usually, however, any task is allowed to run on any core, with greatly increased throughput when many tasks are ready to run.
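A minimal sketch of how such restrictions might be evaluated is given below. The structures, field names and the convention that a smaller number means a higher priority are all assumptions made for illustration; they are not taken from the OCEOSMP configuration format.

```c
#include <stdbool.h>
#include <stdint.h>

#define ANY_CORE 0xFFu  /* hypothetical marker: task is not tied to one core */

/* Hypothetical task configuration. Assumption: 0 is the highest priority. */
typedef struct {
    uint8_t priority;
    uint8_t pinned_core;    /* ANY_CORE, or the only core the task may use */
} task_cfg;

/* Hypothetical core configuration. */
typedef struct {
    bool    in_use;         /* false if the core has been removed from use */
    uint8_t priority_limit; /* numerically largest priority value accepted */
} core_cfg;

/* Returns true if the task is allowed to run on the given core. */
bool task_may_run_on(const task_cfg *t, const core_cfg *c, uint8_t core_id) {
    if (!c->in_use) {
        return false;
    }
    if (t->pinned_core != ANY_CORE && t->pinned_core != core_id) {
        return false;
    }
    return t->priority <= c->priority_limit;
}
```

With priority_limit set to its maximum value and pinned_core left as ANY_CORE, every task may run on every core, which is the usual case described above.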

A task can be configured so that OCEOSMP can distribute different ‘jobs’ (execution instances) of the task across different cores. As each job has its own data pointer, different parts of a task can then be processed in parallel (assuming the application code is re-entrant). An example is a large FFT, where different parts of the input array might initially be processed concurrently by several CPU cores.
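The idea of one re-entrant task body working through per-job data pointers can be sketched as follows. The names job_data, job_body and make_jobs are hypothetical, and a simple sum of squares stands in for a real FFT stage. In this sketch the jobs would be called one after another; on a multicore system the scheduler could dispatch them to different cores.

```c
#include <stddef.h>

/* Hypothetical per-job data block: each job sees only its own slice. */
typedef struct {
    const float *in;      /* first sample of this job's slice */
    size_t       len;     /* number of samples in the slice */
    float        energy;  /* result written by the job */
} job_data;

/* Re-entrant task body: it touches nothing except what its data pointer
   refers to, so several jobs can safely run at the same time. */
void job_body(void *data) {
    job_data *d = (job_data *)data;
    float acc = 0.0f;
    for (size_t i = 0; i < d->len; i++) {
        acc += d->in[i] * d->in[i];
    }
    d->energy = acc;
}

/* Split n samples into njobs contiguous slices of near-equal length.
   Returns the number of job descriptors filled in. */
size_t make_jobs(const float *samples, size_t n, job_data *jobs, size_t njobs) {
    if (njobs == 0) {
        return 0;
    }
    size_t base = n / njobs;
    size_t extra = n % njobs;  /* the first 'extra' slices get one more sample */
    size_t offset = 0;
    for (size_t j = 0; j < njobs; j++) {
        size_t len = base + (j < extra ? 1u : 0u);
        jobs[j].in = samples + offset;
        jobs[j].len = len;
        jobs[j].energy = 0.0f;
        offset += len;
    }
    return njobs;
}
```

Because the slices do not overlap and each result is written into the job's own descriptor, no locking is needed between the jobs.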

Redundancy

Software redundancy involves the same software running on different CPU cores, with the results from the different cores compared to check that they are reliable. If a task is configured to allow it, OCEOSMP will run different ‘jobs’ (execution instances) of the task concurrently, each with its own pointer to its result storage area. Results can then be compared to ensure they agree.
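The comparison step might look like the sketch below, which assumes each redundant job has written its result into a separate area of the same size. The function name results_agree is hypothetical, and the byte-wise comparison assumes results made of plain integers (structures with padding would need a field-by-field comparison).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compare the result areas written by redundant jobs of one task.
   results[i] points to the result area of job i; all areas are 'size'
   bytes long. Returns true only if every area matches the first. */
bool results_agree(const void *const *results, size_t count, size_t size) {
    for (size_t i = 1; i < count; i++) {
        if (memcmp(results[0], results[i], size) != 0) {
            return false;
        }
    }
    return true;
}
```

What the application does on disagreement (repeat the computation, vote between three results, or take a core out of use) is a design decision outside this sketch.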

Hardware redundancy requires being able to remove faulty cores from use at run time and may involve bringing cores that have been kept in reserve into use. OCEOSMP supports both. Selected cores can be exempted from use by OCEOSMP in the initial configuration, allowing them to be used by some other aspect of the application.

Symmetric processing

OCEOSMP treats all CPU cores on an equal basis after the initial start-up sequence. Context switch and other OCEOSMP code is re-entrant and can be run on any core, with critical parts protected from use by more than one core at a time. Multiprocessing is symmetric once start-up is complete.
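One common way to keep a critical part of kernel code from being executed by more than one core at a time is a spinlock. The sketch below uses C11 atomics; it is a generic illustration of the technique, not a description of how OCEOSMP implements its protection, and the core_lock names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock guarding a critical part of shared code. */
typedef struct {
    atomic_flag busy;
} core_lock;

/* Try once to take the lock; returns true if this core now holds it. */
bool core_lock_try(core_lock *l) {
    return !atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire);
}

/* Spin until the lock is free, then take it. */
void core_lock_acquire(core_lock *l) {
    while (!core_lock_try(l)) {
        /* busy-wait: another core is inside the critical part */
    }
}

/* Release the lock so another core can enter. */
void core_lock_release(core_lock *l) {
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}
```

A lock is initialised with `core_lock l = { ATOMIC_FLAG_INIT };`. In a real kernel the protected region would be kept very short, and interrupts on the holding core would typically be masked as well, so that the time other cores spend spinning stays bounded.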

The start-up sequence depends on the CPU family involved. It may involve just one core, which then starts the other cores, or it may involve several cores starting simultaneously. In the latter case the application specifies the core to be used in initialising and starting OCEOSMP. Apart from the start-up core, all cores will be in a wait state until start-up is complete.

Fault anticipation, detection, isolation and recovery

Information such as the maximum execution time for a task, the number of times pre-empted, the maximum items on a data queue, the upper and lower bounds of a core’s system stack and many other items is updated as OCEOSMP schedules tasks and is stored at predefined addresses.

This information is readily available to the application and allows it to check if design assumptions are coming under threat, and if so to anticipate a problem and take appropriate action, perhaps temporarily disabling a task.
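As an illustration, an application might compare the recorded maximum execution time of a task with the budget assumed at design time. The structure layout, the names and the percentage threshold below are assumptions made for the sketch; the actual record format is defined by OCEOSMP.

```c
#include <stdint.h>

/* Hypothetical copy of the statistics kept for one task. */
typedef struct {
    uint32_t max_exec_us;    /* longest execution time observed so far */
    uint32_t preempt_count;  /* number of times the task was pre-empted */
} task_stats;

typedef enum { MARGIN_OK, MARGIN_WARNING, MARGIN_EXCEEDED } margin_state;

/* Compare the observed maximum with the design-time budget.
   warn_percent is the share of the budget at which to start worrying. */
margin_state check_exec_margin(const task_stats *s, uint32_t budget_us,
                               uint32_t warn_percent) {
    if (s->max_exec_us > budget_us) {
        return MARGIN_EXCEEDED;
    }
    /* 64-bit arithmetic avoids overflow in the percentage comparison. */
    uint64_t used = (uint64_t)s->max_exec_us * 100u;
    uint64_t limit = (uint64_t)budget_us * warn_percent;
    return used >= limit ? MARGIN_WARNING : MARGIN_OK;
}
```

A MARGIN_WARNING result is the ‘anticipation’ case: the assumption still holds, but the application can act, for example by temporarily disabling a lower-value task, before it fails.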

OCEOSMP itself automatically checks for problems and provides a range of responses.

At the simplest level, OCEOSMP checks all system call parameters. If a parameter is invalid, as for example in an attempt to start a non-existent task, OCEOSMP gives an appropriate return code, makes a log entry, and updates the system status flags, which may result in a call to an application-defined problem handling function.

A less simple case is an attempt to write to a data queue that is already full. This gives rise to a similar response from OCEOSMP, but with a different return code, a different log entry, and a different status flag update.
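The two cases can be sketched with a pair of stand-in system calls. All names, codes and sizes here are hypothetical, and the log is reduced to a single variable; the sketch shows only the pattern of checking, recording and returning a distinct code for each kind of problem.

```c
#include <stdint.h>

#define NUM_TASKS      4u   /* hypothetical: valid task ids are 0..3 */
#define QUEUE_CAPACITY 3u   /* hypothetical fixed queue size */

/* Hypothetical return codes and matching status flags. */
typedef enum { ST_OK = 0, ST_INVALID_TASK = 1, ST_QUEUE_FULL = 2 } status;
#define FLAG_INVALID_TASK (1u << 0)
#define FLAG_QUEUE_FULL   (1u << 1)

typedef struct {
    uint32_t items[QUEUE_CAPACITY];
    uint32_t count;
} data_queue;

uint32_t status_flags = 0;       /* stand-in for the system status word */
status   last_log_code = ST_OK;  /* stand-in for the system log */

/* Record a problem: log it, set its flag, and hand back the code. */
static status report(status code, uint32_t flag) {
    last_log_code = code;
    status_flags |= flag;
    return code;
}

status task_start(uint32_t task_id) {
    if (task_id >= NUM_TASKS) {
        return report(ST_INVALID_TASK, FLAG_INVALID_TASK);
    }
    /* A real kernel would create a job for the task here. */
    return ST_OK;
}

status queue_write(data_queue *q, uint32_t value) {
    if (q->count >= QUEUE_CAPACITY) {
        return report(ST_QUEUE_FULL, FLAG_QUEUE_FULL);
    }
    q->items[q->count++] = value;
    return ST_OK;
}
```

Because every problem leaves a return code, a log entry and a flag, the application can react immediately at the call site or review what happened later.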

Most serious is when OCEOSMP detects that its own data has been corrupted. Whenever a system call is made, OCEOSMP checks that the sentinels protecting the three data areas are intact, and also that the stack pointer of the current core is within its allowed area. If not, a log entry is made, the system state variable is updated, and OCEOSMP exits, returning an appropriate status code to the application that started it. OCEOSMP can then be restarted later.
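A sentinel check of this kind can be sketched as follows. The sentinel value, the area layout and the function names are assumptions for illustration; in particular, the stack pointer is passed in as a number because reading the real register is specific to each CPU family.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SENTINEL 0x5AFEC0DEu  /* hypothetical guard value */

/* Hypothetical data area with a guard word at each end. */
typedef struct {
    uint32_t start_sentinel;
    uint32_t payload[8];
    uint32_t end_sentinel;
} guarded_area;

void area_init(guarded_area *a) {
    a->start_sentinel = SENTINEL;
    for (size_t i = 0; i < 8; i++) {
        a->payload[i] = 0;
    }
    a->end_sentinel = SENTINEL;
}

/* An overrun from a neighbouring region is likely to damage a guard word. */
bool area_intact(const guarded_area *a) {
    return a->start_sentinel == SENTINEL && a->end_sentinel == SENTINEL;
}

/* Check run on entry to a system call: every area must be intact and the
   stack pointer must lie inside [stack_low, stack_high). */
bool integrity_ok(const guarded_area *areas, size_t count,
                  uintptr_t sp, uintptr_t stack_low, uintptr_t stack_high) {
    if (sp < stack_low || sp >= stack_high) {
        return false;
    }
    for (size_t i = 0; i < count; i++) {
        if (!area_intact(&areas[i])) {
            return false;
        }
    }
    return true;
}
```

The check is cheap (a few comparisons per call), which is what makes it practical to run on every system call.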

Between these extremes the application itself can carry out any checks deemed appropriate, and if necessary disable a task or take a CPU core out of use. If necessary, the application can terminate OCEOSMP and return to the application code that started it.

A 32-bit word containing several system status flags is automatically reset when OCEOSMP is started. The flags are automatically updated when OCEOSMP detects an anomaly, such as a task not starting due to an inadequate number of jobs. An application-defined mask specifies which flags, when set, should lead to a user-defined problem handling function being called.
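The interaction between the flag word, the mask and the handler can be sketched as below. The flag names, bit positions and function names are hypothetical; only the mechanism follows the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits within the 32-bit status word. */
#define FLAG_JOBS_EXHAUSTED (1u << 0)
#define FLAG_QUEUE_FULL     (1u << 1)

typedef void (*problem_handler)(uint32_t flags);

typedef struct {
    uint32_t        flags;    /* current status flags */
    uint32_t        mask;     /* flags that should trigger the handler */
    problem_handler handler;  /* application-defined, may be NULL */
} sys_status;

/* Called at start-up: the flags are reset and the application's choices stored. */
void status_start(sys_status *s, uint32_t mask, problem_handler handler) {
    s->flags = 0;
    s->mask = mask;
    s->handler = handler;
}

/* Called when an anomaly is detected. The flag is always recorded; the
   handler is called only if the application selected this flag in its mask. */
void status_raise(sys_status *s, uint32_t flag) {
    s->flags |= flag;
    if ((flag & s->mask) != 0 && s->handler != NULL) {
        s->handler(s->flags);
    }
}

/* Example application handler: it simply counts how often it is called. */
uint32_t handler_calls = 0;
void example_handler(uint32_t flags) {
    (void)flags;
    handler_calls++;
}
```

The mask lets the application separate anomalies it wants to hear about immediately from those it is content to find later by inspecting the flag word.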

OCEOSMP provides a system log and an optional context switch log. The location and sizes of these are defined in the application configuration. They are stored in the ‘log area’, which also holds the indexes used.
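A fixed-size log with its indexes stored alongside the entries might be organised as in the sketch below. The entry layout and the choice to overwrite the oldest entry when the log is full are assumptions; the source states only that the location and size are set in the application configuration and that the indexes are held in the log area.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log entry. */
typedef struct {
    uint32_t time;  /* system time when the entry was made */
    uint32_t code;  /* what happened */
} log_entry;

/* Hypothetical log header: the indexes live next to the entries. */
typedef struct {
    log_entry *entries;   /* storage supplied by the application */
    uint32_t   capacity;  /* number of entries the storage can hold */
    uint32_t   next;      /* index of the slot to write next */
    uint32_t   count;     /* valid entries, at most capacity */
} sys_log;

void log_init(sys_log *log, log_entry *storage, uint32_t capacity) {
    log->entries = storage;
    log->capacity = capacity;
    log->next = 0;
    log->count = 0;
}

/* Add an entry; once the log is full the oldest entry is overwritten. */
void log_add(sys_log *log, uint32_t time, uint32_t code) {
    if (log->capacity == 0) {
        return;
    }
    log->entries[log->next].time = time;
    log->entries[log->next].code = code;
    log->next = (log->next + 1u) % log->capacity;
    if (log->count < log->capacity) {
        log->count++;
    }
}

/* Read an entry by age: 0 is the oldest entry still held.
   Returns NULL if there is no such entry. */
const log_entry *log_get(const sys_log *log, uint32_t age) {
    if (age >= log->count) {
        return NULL;
    }
    uint32_t first = (log->next + log->capacity - log->count) % log->capacity;
    return &log->entries[(first + age) % log->capacity];
}
```

Since the storage is fixed in size and never reallocated, adding an entry takes constant time, which fits the deterministic behaviour the rest of the system relies on.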

Summary

OCEOSMP has been designed specifically to provide a ‘white box’ operating system whose behaviour can be readily checked and understood. It is deterministic and compact, and allows timing behaviour to be guaranteed. It is symmetric in its use of CPU cores.

It is robust and provides good support for fault anticipation, detection, isolation and recovery. It allows CPU cores to be excluded from use at run time and allows tasks to be disabled if required.

It is supported by OCE’s DMON debug monitor, which provides several advanced monitoring and debugging features including context switch logging. Further information is available on request.

Author: Professor Michael Ryan, CTO at O.C.E. Technology Ltd, michael.ryan@ocetechnology.com