Feature Story

INSIDE THE IMAGINATION TECHNOLOGIES META PROCESSOR, PART 1

The META SoC (system-on-chip) processor platform and the META multi-threaded processor core from Metagence have been promoted for a wide range of applications, including:

Digital audio for home and personal devices

Digital radio, including DAB and DRM

Analog TV

Digital consumer electronics, including Digital TV, STB, PVR, DVD, CD, digital still cameras, camcorders, etc.

Mobile multimedia

Wireless communications, e.g., 802.11, WiMAX, UWB, ZigBee

Home gateways

In-car infotainment systems

The Meta system includes comprehensive development tools and a variety of peripheral types.

Like other processor core companies, Metagence combines pre-verified hardware and software IP (intellectual property) elements and development tools, reducing development risk and time to market. Where META differs is in the processor structure itself. Similar in concept to the multi-threading proposed by Don Sollars in the now-defunct TeraGen product, META supports multiple threads in hardware; each thread is a virtualized instantiation of the processor with its own register resources. This paradigm enables transparent scaling as hardware capabilities improve with each succeeding semiconductor process generation.

The META Base Architecture
Multi-threading allows META to switch contexts in response to real-time events without software overhead.

The heart of Meta is the threading mechanism that provides hardware control over the process.

In the event of a condition with the potential to cause a stall cycle, e.g., a cache miss, META automatically starts executing the next thread. There are times when a specific thread must run, and to support that, META provides a number of features, including cache line locking and cache pre-fetching to control memory stalls, and data address pre-issue to avoid pipeline delays. Metagence calls its threading variant superthreading. All threads operate in a parallel/overlapped manner with no context-switching overhead, increasing utilization of shared ALU and cache/memory resources.
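The benefit of switching threads on a stall can be illustrated with a toy cycle count. The sketch below is only a model under stated assumptions, not a description of the META pipeline: the `run` function, the `MISS_PENALTY` value and the instruction strings are all hypothetical, and the real hardware interleaves threads at a finer grain than this.

```python
from collections import deque

MISS_PENALTY = 5  # assumed stall length in cycles, not a META figure


def run(threads, switch_on_stall):
    """Count the cycles needed to finish every thread.

    Each thread is a string: '.' is a one-cycle instruction and 'm' is an
    instruction that misses the cache and blocks its own thread for
    MISS_PENALTY further cycles.
    """
    pending = [deque(t) for t in threads]
    ready_at = [0] * len(threads)  # first cycle at which a thread may issue again
    current = 0
    cycle = 0
    while any(pending):
        if switch_on_stall:
            # Stay on the current thread while it can issue, else try the next.
            order = [(current + k) % len(threads) for k in range(len(threads))]
        else:
            # Baseline: run threads one after another and wait out every stall.
            order = [next(i for i, p in enumerate(pending) if p)]
        chosen = next((i for i in order if pending[i] and ready_at[i] <= cycle), None)
        if chosen is not None:
            if pending[chosen].popleft() == "m":
                ready_at[chosen] = cycle + 1 + MISS_PENALTY
            current = chosen
        cycle += 1  # counts as an idle cycle when no thread could issue
    return max([cycle] + ready_at)


workload = ["..m..", "..m.."]
print(run(workload, switch_on_stall=False), run(workload, switch_on_stall=True))
```

In this toy workload the second thread fills most of the cycles that the first thread would otherwise spend waiting on its miss, which is the utilization gain the paragraph above describes.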

Unlike multi-processor systems, where tasks must be carefully partitioned between processors early on, META allows developers to treat the code on each thread as if it were the only code present. They can develop real-time applications in isolation and later run them in parallel on separate threads, since the META hardware and software development tools handle the details of multi-threading automatically.

Concurrency
The META processor can perform a number of real-time, non-real-time, general purpose processor (GPP) and DSP processing tasks concurrently in a single core. This is in contrast to traditional solutions that would require a multi-processor system to achieve similar performance levels. Using a single META processor can save on die area, power consumption and development time when compared to multi-processor solutions.

Configurability
Major functional units that have a bearing on performance, power and die size, such as the number of threads, caches and DSP capability, are selectable by the system designer. The optionally included functions aid the development of an optimal solution in silicon.

Heterogeneous GPP/DSP core
META is a heterogeneous processor core, combining general purpose processing capability with DSP functionality. The combination provides advantages in cost, area and power consumption, when compared to using separate DSP and GPP cores. By combining the instruction set models for the processors in a unified design, META simplifies development flow, assures software tool chain compatibility, and reduces development risk through a unified debug strategy. The DSP capability of META exceeds that of many heterogeneous and standalone DSP cores. The processor supports both SIMD (Single Instruction, Multiple Data) and table-driven VLIW (Very Long Instruction Word) instructions.

Resource management for application-level QoS
Many embedded applications in the communications and consumer space are required to perform to a certain level in order to meet end user expectations, such as video frame rate, audio quality and no lost or dropped communications packets. These system-level considerations have now migrated down to the architecture of an SoC, which can be likened to the hub of an advanced communications network, with many different types of data stream – some latency-critical, others bandwidth-critical – and peripherals requiring attention. META’s Automatic MIPS Allocation process provides automatic resource management in hardware, ensuring that each thread of execution gets the needed MIPS and has the required response time.

Process portable core
The META processor core is supplied as a fully synthesizable core delivered as soft IP in Hardware Description Language (HDL) format. This gives customers flexibility in choosing their foundry, and the META™ core has been proven in designs that have been produced in millions of units, in different foundries and various technology geometries.

Instruction Set Architecture (ISA) & organization

32-bit general purpose architecture with 64-bit extensions and multiple execution units (address and data) allowing parallel execution

64-bit internal bandwidth for cache and general memory

Single cycle instructions and zero-overhead loops

Support for conditional execution in many GPP and DSP instructions

Support for up to four independent hardware threads, configurable for GPP or DSP

24 registers per thread plus 24 global registers allocated to threads under software control

AMA™ (Automatic MIPS Allocation) for system load balancing, allowing global control of deadlines and background processing activity, providing QoS for the application

Precise exceptions and user/supervisor mode for virtualisation support

Digital Signal Processing

Up to four 16-bit MACs/cycle, or two 32-bit MACs/cycle

SIMD DSP, with the same instruction executed in multiple data units

VLIW-like instruction template for complex DSP operations, with the functionality of four instructions combined in a single cycle

Additional 24 private registers for each DSP thread

Configurable DSP RAM in each data unit supports extensive temporary data stores

Support for split 16-bit data types (two parallel operations per data unit); 24- and 32-bit data types and 40-bit accumulation

Multiple zero-overhead rounding/shift options for precise management of integer algorithms

Modulo and bit-reversed addressing support

Caches & Memory Management Unit (MMU)

Optional 4-way set-associative data and instruction caches

Privilege model, with 4 KB page sizes, allows support for protected operating systems, e.g., Linux, supporting META™ used as either an MCU or MPU
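As a small worked example of what a 4 KB page size implies, the sketch below splits a 32-bit address into a page number and an offset within the page. The 32-bit address width is taken from the architecture description; the function name `split_address` is hypothetical, and nothing here describes how the META MMU actually performs translation.

```python
PAGE_SIZE = 4 * 1024                      # 4 KB pages, as in the feature list
OFFSET_BITS = PAGE_SIZE.bit_length() - 1  # 12 bits address a byte within a page


def split_address(addr):
    """Split a 32-bit address into (page number, offset within the page)."""
    addr &= 0xFFFFFFFF
    return addr >> OFFSET_BITS, addr & (PAGE_SIZE - 1)


page, offset = split_address(0x12345678)
print(hex(page), hex(offset))
```

With 12 offset bits, the remaining 20 bits of a 32-bit address select one of 2^20 pages.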

Interfacing

System bus manages simultaneous transactions from each thread allowing independent operation of threads from coexistent fast and slow memory-mapped hardware

Coprocessor interface module with support for up to eight read and/or write interfaces to allow threads to operate synchronously with other hardware modules

Parallel 32-bit debug/control bus typically driven via JTAG interface

Power management

Low-level clock gating controlled by thread and resource scheduling

Unused resources automatically ‘switched off’ cycle by cycle

Functional Overview
The META processor is a 32-bit unified GPP/DSP, with all instructions, internal registers and data paths 32 bits wide. However, DSP functionality may use different word sizes, such as a dual 16-bit memory configuration. The memory interface may also have a different size; for example, the data cache can return a 64-bit word per cycle.

[Table 1: Item, Features]

The META core supports two to four independent hardware threads. Typically, these threads work in parallel on independent activities. The processor has separate data and instruction caches. All instructions are single-cycle, all ALU operations are conditional and zero-overhead loops are supported.

Multiple execution units
To support multiple independent contexts (threads) and to improve the load balancing of those contexts, the META core's logic is built around the concept of execution units. In a META core these units hold localized register state and an execution pipeline. Different classes of unit may have different execution pipeline logic and a different quantity of register state. The different unit types are:

Register execution units

Data unit

Address unit

PC unit

Trigger unit

Control unit

Non-execution units

Input/output ports

Coprocessor ports

Instructions may make use of more than one unit at any one time, which can in turn improve performance through the parallel execution of operations. The META core incorporates units that are targeted towards specific instructions, including modulo addressing, as well as units that are targeted towards the majority of the basic processing requirements of a CPU. The META core also uses units to hold certain core control state and to represent I/O ports such as coprocessor ports; these units have no arithmetic capability.

Data units
Each data unit contains a 32-bit register file with a maximum of 16 registers (0-15) per thread and 16 global registers (16-31). These registers are internally linked to an ALU, which can perform signed add/sub, logical and arithmetic left/right shifts, logical and/or/xor, bit manipulations, address operations, and 16/32-bit multiplies. The data unit's ALU can generate conditions that can then affect the future execution of the instruction stream. The multipliers in the ALU consist of four 17x17 multipliers in parallel. These multipliers are arranged to perform 16x16, 16x16|16x16 split, 31x31 and 32x32 multiplies via a dedicated post-multiplier add/sub unit. Loosely associated with the multiplier is a set of 40-bit accumulators, with every thread having one private accumulator and access to three further accumulators that are shared between all of the threads (global common). These allow multiplier results to be accumulated with as much precision as possible via a 40-bit add/sub ALU component associated with the accumulator registers.
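The arithmetic behind building a wide multiply from narrow multipliers and a post-multiplier adder can be sketched as follows. This is a minimal illustration for unsigned operands, not the META datapath: the function names are hypothetical, the way the hardware handles signed operands is not covered, and wrapping on accumulator overflow is an assumption (the source does not say whether the accumulators wrap or saturate).

```python
MASK16 = 0xFFFF
ACC_BITS = 40


def mul32_from_partials(a, b):
    """Unsigned 32x32 multiply built from four 16x16 partial products."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    a_lo, a_hi = a & MASK16, a >> 16
    b_lo, b_hi = b & MASK16, b >> 16
    # Four narrow multiplies, one per pair of halves.
    ll, lh, hl, hh = a_lo * b_lo, a_lo * b_hi, a_hi * b_lo, a_hi * b_hi
    # The post-multiplier add step aligns and sums the partial products.
    return ll + ((lh + hl) << 16) + (hh << 32)


def acc40_add(acc, value):
    """Add to a 40-bit two's-complement accumulator (assumed to wrap)."""
    total = (acc + value) & ((1 << ACC_BITS) - 1)
    return total - (1 << ACC_BITS) if total & (1 << (ACC_BITS - 1)) else total


print(mul32_from_partials(123456789, 987654321) == 123456789 * 987654321)
```

A 40-bit accumulator leaves 8 guard bits above a 32-bit product, so on the order of 256 full-scale 32-bit results can be summed before overflow becomes possible.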

Data will typically be retrieved from the accumulators via the shifter module (in the logical ALU data path), which allows as much or as little of the accumulator result to be kept as desired. The add/sub and logical sub-pipelines may also be partitioned to support dual half-word (16-bit) operation (carry between bottom and top partitions can be disabled). This partitioning enables parallel calculations to be performed in many cases, and in addition some instructions allow multiple data units to be used at the same time, further enhancing the achievable parallelism.
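The effect of disabling the carry between the bottom and top partitions can be shown with a short sketch. The function name `split_add16` is hypothetical, and each 16-bit lane is assumed to wrap on overflow, which the source does not specify.

```python
def split_add16(x, y):
    """Add two 32-bit words as two independent 16-bit lanes.

    No carry passes from the bottom lane into the top lane, and each lane
    is assumed to wrap on overflow.
    """
    x &= 0xFFFFFFFF
    y &= 0xFFFFFFFF
    lo = ((x & 0xFFFF) + (y & 0xFFFF)) & 0xFFFF
    hi = ((x >> 16) + (y >> 16)) & 0xFFFF
    return (hi << 16) | lo


a, b = 0x0001FFFF, 0x00010001
print(hex(split_add16(a, b)))        # lanes kept separate
print(hex((a + b) & 0xFFFFFFFF))     # ordinary 32-bit add for comparison
```

In the ordinary add, the overflow of the bottom half-word carries into the top half and corrupts it; in the split add, the two half-words behave as separate 16-bit values processed in one operation.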

Each data unit contains two RAMs that may be used by DSP instructions to provide parameter, accumulate or 'twiddle' (sine/cosine look-up tables for FFTs and inverse FFTs) data. Each RAM can supply up to 32 bits of data per cycle and also supports parallel loading/storing of data to the data cache. For dual 16-bit operation, all of the 32-bit data word is used as two separate 16-bit words, typically real and imaginary data.
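A twiddle table of the kind described, with real and imaginary parts sharing one 32-bit word, might be generated along these lines. This is a sketch under assumptions: the Q15 fixed-point format, the placement of the real part in the upper half-word, and the function names are illustrative choices, not documented META conventions.

```python
import math


def q15(x):
    """Convert a value in [-1, 1] to a 16-bit Q15 integer, clamping at the ends."""
    return max(-32768, min(32767, round(x * 32768)))


def pack_twiddles(n):
    """Return n/2 FFT twiddle factors exp(-2*pi*j*k/n) as packed 32-bit words.

    Assumed layout: real part in the upper 16 bits, imaginary part in the lower.
    """
    table = []
    for k in range(n // 2):
        angle = -2.0 * math.pi * k / n
        re, im = q15(math.cos(angle)), q15(math.sin(angle))
        table.append(((re & 0xFFFF) << 16) | (im & 0xFFFF))
    return table


for word in pack_twiddles(8):
    print(f"{word:08X}")
```

Packing both parts into one word means a single 32-bit read from the DSP RAM delivers a complete complex coefficient per cycle.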

Address units
To support the requirements for DSP-style functionality, META includes address units that enable the issue of a load and a store in parallel with the operation of the main DSP execution units. The register files in each address unit have two read ports along with two write ports. The two read ports allow two register operands to be read from the register file per cycle, while the two write ports allow writes from two separate pipeline phases to be committed to the register file per cycle. In essence, the address units are specialized units that incorporate a reduced number of registers and a reduced-functionality ALU (as compared with the data units), although the address units do have some extra DSP-specific functionality. This functionality consists of modulo addressing (which is useful for DSP functions such as filtering) and bit-reversed addressing (which is useful for DSP functions such as FFTs).
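The two addressing modes can be expressed in a few lines. The sketch below shows the address arithmetic in general terms only; the function names and parameters are hypothetical and do not reflect how META instructions encode a buffer base, length or step.

```python
def modulo_step(addr, step, base, length):
    """Advance addr by step, wrapping inside the buffer [base, base + length)."""
    return base + (addr - base + step) % length


def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index, as used to reorder FFT data."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


# A circular delay line for a filter: the pointer wraps instead of running off the end.
print(hex(modulo_step(0x1006, 4, base=0x1000, length=8)))
# Access order for an 8-point FFT.
print([bit_reverse(i, 3) for i in range(8)])
```

Doing this arithmetic in the address unit means the filter or FFT inner loop needs no extra instructions to wrap a pointer or shuffle indices.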

Each address unit holds eight registers that are private to each thread along with eight registers that are common to all of the threads. Including common registers enables compiled code, such as C library code, to be used by more than one thread.
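The split between private and common registers can be modelled as a simple lookup. In this sketch the class name and the register numbering (0-7 private, 8-15 common) are assumptions made by analogy with the data unit numbering given earlier; the source does not state how address unit registers are numbered.

```python
class AddressUnitRegisters:
    """Toy model: eight private registers per thread plus eight common ones."""

    PRIVATE = 8
    COMMON = 8

    def __init__(self, num_threads):
        self.private = [[0] * self.PRIVATE for _ in range(num_threads)]
        self.common = [0] * self.COMMON

    def _slot(self, thread, reg):
        if not 0 <= reg < self.PRIVATE + self.COMMON:
            raise IndexError(f"no such register: {reg}")
        if reg < self.PRIVATE:
            return self.private[thread], reg      # this thread's own bank
        return self.common, reg - self.PRIVATE    # bank shared by all threads

    def read(self, thread, reg):
        bank, index = self._slot(thread, reg)
        return bank[index]

    def write(self, thread, reg, value):
        bank, index = self._slot(thread, reg)
        bank[index] = value & 0xFFFFFFFF


regs = AddressUnitRegisters(num_threads=4)
regs.write(0, 3, 0x1000)    # private: visible to thread 0 only
regs.write(0, 8, 0x2000)    # common: visible to every thread
print(regs.read(1, 3), hex(regs.read(1, 8)))
```

The same register number names different storage in each thread for the private range and the same storage for the common range, which is what lets several threads run one copy of the code while still sharing selected values.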
