Feature Story

INSIDE THE IMAGINATION TECHNOLOGIES META PROCESSOR, PART 2

PC unit
The PC unit holds two program counter registers per thread, PC and PCX, along with a simple functional unit that handles most of the system's program flow manipulations, such as branches. During the execution of normal background code, PC is the current execution address for the thread, while PCX holds the address from which the next interrupt on that thread will execute. When an interrupt occurs, the contents of PC and PCX swap: PC then holds the execution address for the interrupt handler, while PCX holds the address from which background code execution will resume when the interrupt completes. This swap happens transparently whenever interrupt level is entered or exited.
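As a rough illustration of the swap described above, the following Python sketch models one thread's PC/PCX pair. The class name, the method names and the example addresses are hypothetical; only the swap behavior and the 32-bit (4-byte) instruction size come from the article. It is a toy model, not a description of the hardware implementation.

```python
class ThreadPC:
    """Toy model of one thread's PC/PCX register pair (illustrative only)."""

    def __init__(self, background_entry, interrupt_entry):
        self.pc = background_entry   # current execution address
        self.pcx = interrupt_entry   # address where the next interrupt starts
        self.in_interrupt = False

    def _swap(self):
        # The two registers simply exchange contents.
        self.pc, self.pcx = self.pcx, self.pc

    def step(self, size=4):
        # Advance past one 32-bit instruction.
        self.pc += size

    def enter_interrupt(self):
        if self.in_interrupt:
            raise RuntimeError("already at interrupt level")
        self._swap()                 # PC now points at the handler
        self.in_interrupt = True

    def exit_interrupt(self):
        if not self.in_interrupt:
            raise RuntimeError("not at interrupt level")
        self._swap()                 # PC resumes the background code
        self.in_interrupt = False
```

In this model, a thread that has executed one background instruction from a hypothetical address 0x1000 and then takes an interrupt ends up with PC at the handler entry and PCX at 0x1004, the address at which background execution resumes.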

Branches, jumps and calls should supply the address of the first instruction to be executed, because no further processing is applied to this value before it is copied into a thread’s PC register.

Trigger unit
The trigger unit provides a mechanism for detecting and synchronizing with various types of system event. It provides independent systems for two levels of event handling: background and interrupt. Background control provides voluntary synchronization with zero overhead. Interrupt control is based on a conventional model with overhead. The META core supports two distinct types of trigger source via the trigger unit: hardware triggers, and kicks generated by software events.

Hardware triggers are sourced both from hardware outside the META core itself, such as coprocessors, and from internal sources such as timers. Each hardware trigger provides a simple on/off flag. A set of registers controls the routing of internally generated and externally supplied triggers. Kicks are caused by a software action, such as writing to a controlling register somewhere in the system. Kicks also differ from hardware triggers in that a count of kicks received is accumulated over time and atomically decremented by one each time the thread responds to a kick-sourced trigger. This counter can be used to implement simple software or hardware request queues, using shared memory or a coprocessor interface FIFO as the storage area for the request data. All software inter-thread events should be communicated using kicks.
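The request-queue idea in the paragraph above can be sketched in Python as follows. This is a software analogy only: the class and method names are hypothetical, a lock stands in for the hardware's atomic counter update, and a deque stands in for the shared memory or coprocessor FIFO that holds the request data.

```python
import threading
from collections import deque


class KickQueue:
    """Software analogy of a kick counter paired with a request store."""

    def __init__(self):
        self._lock = threading.Lock()   # stands in for atomic hardware updates
        self._kicks = 0                 # count of kicks not yet responded to
        self._requests = deque()        # stands in for shared memory or a FIFO

    def kick(self, request):
        """Producer side: store the request data, then raise a kick."""
        with self._lock:
            self._requests.append(request)
            self._kicks += 1

    def respond(self):
        """Consumer side: take one request, decrementing the count by one."""
        with self._lock:
            if self._kicks == 0:
                return None             # nothing pending
            self._kicks -= 1
            return self._requests.popleft()

    @property
    def pending(self):
        with self._lock:
            return self._kicks
```

Because the count accumulates, a producer can raise several kicks before the consumer responds, and no request is lost; the consumer simply drains them one at a time.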

Control unit
The control unit is a simple unit that only contains a register file - it has no associated ALU. These registers hold all of the control state that cannot be put in the memory mapped I/O space. This includes all the control registers that are needed within the core itself such as individual thread on/off controls, DSP mode switches, hardware loop controls and repeat counters.

Input/output ports
The META core connects to internal and external data sources/sinks via multiple memory ports. DSP performance is achieved by reducing the direct load on the memory interface using either separate instruction and data caches, or on-chip RAM. For example, when a program is operating in a tight core loop, there may be no instruction fetch activity on the memory bus as all requests are serviced from the contents of the instruction cache or on-chip RAMs.

Coprocessor ports
Up to eight read and/or write coprocessor interfaces may be used in a specific instance of the core to allow threads to operate synchronously with arbitrary hardware. The coprocessor interface module lets data be transferred to and from any application-specific hardware modules, for example real-time data feeds such as digital audio. This interface allows transfers of up to 64 bits to a coprocessor per cycle and supports flow control of the I/O feed. Many hardware functions, such as memory-mapped peripherals, require shared access. Typically, such peripherals are interfaced using SoC interconnect buses and access is governed using interrupts. By interfacing such peripherals to the META coprocessor ports under the control of threads, inter-thread locks and the hardware scheduler may be used to control shared access. The ability to switch threads without any software overhead allows real-time control of I/O, which is essential for complex multi-function products.

System bus
The system bus can carry a number of simultaneous transactions from each thread, allowing the threads to operate independently of one another when accessing memory-mapped hardware with differing response latencies.

Threads and thread scheduling
The META core supports two to four independent hardware threads that share the processor's core resources such as register execution units and memory bandwidth. A fine-grained instruction scheduler switches between the thread contexts on a cycle-by-cycle basis. The instruction scheduler manages multiple threads by extracting a list of required resources from the next pending instruction for each thread. Resource requests are matched to resource availability via an interlocking process that yields a set of instructions that could be issued. From this set, a variable priority scheduler then chooses one instruction to issue in the current cycle. Each thread can use different processor resources at the same time, or one thread can use all of the processor’s resources. To support multiple threads and DSP functionality, the META core has internal RAM, register execution units and external interface ports. All major functional units of META, including the caches and the MMU, are thread-aware.
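The selection process described above (match resource requests against availability, then pick one issuable instruction by priority) can be sketched as below. The function name, the resource labels and the tie-break rule are assumptions made for illustration; the article does not specify how priorities vary or how ties are resolved.

```python
def issue_one(pending, busy, priority):
    """Pick the thread whose next instruction issues this cycle.

    pending  -- dict: thread id -> set of resources its next instruction needs
    busy     -- set of resources that are unavailable this cycle
    priority -- dict: thread id -> current priority (higher wins)

    Returns the chosen thread id, or None if nothing can issue.
    """
    # Interlock step: keep only instructions whose resources are all free.
    ready = [t for t, needed in pending.items() if not (needed & busy)]
    if not ready:
        return None
    # Priority step: highest priority wins; ties go to the lowest thread id
    # (an arbitrary choice for this sketch).
    return max(ready, key=lambda t: (priority[t], -t))
```

For example, with three threads needing `{"D0", "MEM"}`, `{"A0"}` and `{"D1"}` respectively, marking `"MEM"` as busy removes the first thread from the candidate set, and the scheduler picks whichever of the remaining two currently has the higher priority.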

Pipeline
The META pipeline is composed of three stages (post-decode) and the instruction set has three main types of instructions:

Unit internal (UI) instructions: effectively single-cycle operations that complete in the cycle in which they execute

Unit-to-unit (U2U) instructions: have a two-cycle footprint spread over a total of three cycles and result from using different source and destination units or conditional operators, with the exception of conditional compares

DSP instructions: a hybrid form of the UI and U2U forms

Instruction set
The META instruction set is 32-bit, with most base and some DSP instructions supporting conditional execution. META supports four types of instructions:

General-purpose base

DSP (digital signal processing)

SIMD (single instruction, multiple data)

Template - table-driven VLIW (very long instruction word)

These instructions encompass the following classes of operation:

Address unit and data unit ALU

Logical

Bit manipulation and comparison

Program control

Transfer

System

DSP
The DSP capability of META consists of:

Multiple execution (data and address) units

Two banks of DSP RAM per data unit shared between threads as global resources

Saturation, rounding, shifting and scaling hardware support & guard bits to ensure numerical fidelity

Split arithmetic pipelines to support parallel arithmetic, e.g. dual half-word arithmetic

Support for modulo and bit-reverse addressing

16- and 32-bit multiplier options

Four types of DSP instruction are supported:

Simple (MAC): up to four 16-bit MACs or two 32-bit MACs per cycle

Complex: complete complex (16-bit/16-bit packed) FFT butterfly add/sub and multiply instructions. The add/sub and multiply can be overlapped for single-cycle issue

SIMD: an instruction can be executed across multiple data units with different data

VLIW: an instruction set extension that allows data unit operations to be executed in parallel with memory operations. It is enabled by defining a set of up to sixteen template instructions per thread, which can then be issued using a template instantiation instruction. This provides maximum concurrency and, when used in conjunction with the complex FFT instructions, reduces the cycle count for a radix 8 butterfly from 43 cycles to 27 cycles.

The DSP instruction set also includes data unit comparison operations, including MIN/MAX and FFB (find first bit), which are vital for the add-compare-select operation at the heart of Viterbi decode acceleration.
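For readers unfamiliar with the add-compare-select (ACS) operation mentioned above, here is a minimal Python sketch of one ACS step. It shows why a fast MIN/MAX comparison matters: every trellis state performs this step for every received symbol. The function name and the convention that a smaller metric is better are assumptions of this sketch, not details of the META instruction set.

```python
def acs(metric_a, branch_a, metric_b, branch_b):
    """One add-compare-select step for a trellis state with two predecessors.

    metric_a, metric_b -- accumulated path metrics of the two predecessors
    branch_a, branch_b -- branch metrics of the transitions into this state

    Returns (surviving path metric, index of the chosen predecessor).
    """
    # Add: extend each candidate path by its branch metric.
    candidate_a = metric_a + branch_a
    candidate_b = metric_b + branch_b
    # Compare and select: keep the smaller metric (a MIN operation) and
    # record which predecessor it came from for later traceback.
    if candidate_a <= candidate_b:
        return candidate_a, 0
    return candidate_b, 1
```

With path metrics 3 and 2 and branch metrics 1 and 3, the candidates are 4 and 5, so the first path survives.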

Caches
The META core interfaces directly to the instruction and data caches that support its full read/write code/data operating bandwidth. Using this interface the core may issue both a data read/write operation and an instruction read operation in a single cycle with the intention of moving data into and out of the corresponding caches as fast as possible in parallel.

The instruction cache is a 4-way set-associative cache of up to 64 Kbytes in size. Each cache line is 8 x 64-bit words and there are up to 1024 lines. The cache is non-blocking on a per-thread basis: if one thread misses, requests by other threads will still be accepted. Cache line invalidation only occurs when there is available return data. The instruction cache supports pre-fetch, aiding predictability for use in real-time systems.
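The figures above are mutually consistent, as the following arithmetic sketch shows for the largest configuration. The way an address is split into tag, set index and byte offset is the conventional scheme for a set-associative cache and is an assumption here; the article does not describe META's actual indexing.

```python
CACHE_BYTES = 64 * 1024          # largest size quoted in the article
WAYS = 4                         # 4-way set associative
LINE_BYTES = 8 * 64 // 8         # 8 words x 64 bits = 64 bytes per line

LINES = CACHE_BYTES // LINE_BYTES            # 1024 lines, matching the text
SETS = LINES // WAYS                         # 256 sets of 4 lines each
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 6 bits select a byte in a line
INDEX_BITS = SETS.bit_length() - 1           # 8 bits select a set


def split_address(addr):
    """Split an address into (tag, set index, byte offset).

    Assumes conventional indexing: low bits are the offset within the
    line, the next bits choose the set, and the rest form the tag.
    """
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

Under these assumptions, any two addresses that differ only above bit 13 map to the same set and compete for its four ways.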

The data cache has a similar organization and range of sizes to the instruction cache, with the addition of data write-through for all linear addresses other than those in the core memory region. Locked cache lines can be placed within the core memory region so that write-through of such cache lines is also nullified. Cache line locking allows developers to lock in critical code to ensure deterministic response time in interrupt handling routines and to lock in critical data to eliminate memory latency. The caches may be partitioned into halves, quarters, eighths, sixteenths or combinations of these. A thread has exclusive use of a local cache partition and all threads share a global cache partition; the global cache partition remains coherent for interleaved accesses by multiple threads without the use of cache flushing.

Memory Management Unit (MMU)
The MMU is responsible for translating the logical linear addresses used by the META core threads and coprocessor hardware into physical memory addresses accessed via the system bus. The MMU allows META to operate as a microprocessor supporting protected operating systems such as embedded Linux.
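The general idea of linear-to-physical translation can be sketched as follows. Everything specific in this example is an assumption made for illustration: the 4 KiB page size, the single-level table held in a dictionary and the function name are hypothetical, and the article does not describe the META MMU's actual table format or page sizes.

```python
PAGE_BITS = 12                   # assumed 4 KiB pages (not from the article)
PAGE_SIZE = 1 << PAGE_BITS


def translate(page_table, linear):
    """Translate a linear address using a single-level page table.

    page_table -- dict: linear page number -> physical page number
    linear     -- linear address issued by a thread

    Raises LookupError for an unmapped page, which is the hook a
    protected operating system would use to catch invalid accesses.
    """
    page = linear >> PAGE_BITS
    offset = linear & (PAGE_SIZE - 1)
    if page not in page_table:
        raise LookupError(f"no mapping for linear page {page:#x}")
    # The offset within the page passes through unchanged.
    return (page_table[page] << PAGE_BITS) | offset
```

With linear page 0x10 mapped to physical page 0x80, the linear address 0x10ABC translates to 0x80ABC.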
