Feature Story


CLIENT: Imagination Technologies

April 2, 2006: AudioDesignLine


PC unit
The PC unit holds two program counter registers per thread, along with a simple functional unit that performs the majority of the system's program flow manipulations, such as branches. During the execution of normal background code, PC is the current execution address for the thread, while PCX holds the address from which the next interrupt on that thread will execute. When an interrupt occurs, the contents of PC and PCX swap. PC then holds the execution address for the executing interrupt handler, while PCX holds the address from which background code execution will resume when the interrupt completes. The swapping of the meaning of PC and PCX happens transparently when interrupt level is entered or exited.

Branches, jumps and calls should use the address for the first instruction to be executed as no further processing is applied to this value before it is copied into a thread’s PC register.

Trigger unit
The trigger unit provides a mechanism for detecting and synchronizing with various types of system events. It provides independent systems for two levels of event handling – background and interrupt. Background control provides voluntary synchronization with zero overhead; interrupt control is based on a conventional model with overhead. The META core supports two distinct types of trigger sources via the trigger unit: hardware triggers, and kicks generated by software events.

Hardware triggers are sourced both from hardware outside the META core itself, such as coprocessors, and from internal sources such as timers. Each hardware trigger provides a simple on/off flag. A set of registers controls the routing of internally generated and externally supplied triggers. Kicks are caused by a software action such as writing to a controlling register somewhere in the system. Kicks also differ from hardware triggers in that a count of kicks received is accumulated over time and atomically decremented by one as the thread responds to kick-sourced triggers. This counter can be used to implement simple software or hardware request queues, using shared memory or a coprocessor interface FIFO as the storage area for the request data. All software inter-thread events should be communicated using kicks.

Control unit
The control unit is a simple unit that only contains a register file - it has no associated ALU. These registers hold all of the control state that cannot be put in the memory mapped I/O space. This includes all the control registers that are needed within the core itself such as individual thread on/off controls, DSP mode switches, hardware loop controls and repeat counters.

Input/output ports
The META core connects to internal and external data sources/sinks via multiple memory ports. DSP performance is achieved by reducing the direct load on the memory interface using either separate instruction and data caches, or on-chip RAM. For example, when a program is operating in a tight core loop, there may be no instruction fetch activity on the memory bus as all requests are serviced from the contents of the instruction cache or on-chip RAMs.

Coprocessor ports
Up to eight read and/or write coprocessor interfaces may be used in a specific instance of the core to allow threads to operate synchronously with arbitrary hardware. The coprocessor interface module lets data be transferred to and from any application-specific hardware modules, for example real-time data feeds such as digital audio. This interface allows transfers of up to 64 bits to a coprocessor per cycle and supports flow control of the I/O feed. Many hardware functions, such as memory-mapped peripherals, require shared access. Typically, such peripherals are interfaced using SoC interconnect busses and access is governed using interrupts. By interfacing such peripherals to the META coprocessor ports under the control of threads, inter-thread locks and the hardware scheduler may be used to control shared access. The ability to switch threads without any software overhead allows real-time control of I/O – essential for complex multi-function products.

System bus
The system bus can carry a number of simultaneous transactions from each thread, allowing independent operation of the threads from memory-mapped hardware with differing response latencies.

Threads and thread scheduling
The META core supports two to four independent hardware threads that share the processor's core resources such as register execution units and memory bandwidth. A fine-grained instruction scheduler switches between the thread contexts on a cycle-by-cycle basis. The instruction scheduler manages multiple threads by extracting a list of required resources from the next pending instruction for each thread. Resource requests are matched to resource availability via an interlocking process that yields a set of instructions that could be issued. From this set it is then possible to choose one instruction to issue this cycle via a variable priority scheduler. Each thread can use different processor resources at the same time, or one thread can use all of the processor’s resources. To support multiple threads and DSP functionality the META core has internal RAM, register execution units and external interface ports. All major functional units of META, including caches and the MMU are thread-aware.

The META pipeline is composed of three stages (post-decode) and the instruction set has three main types of instructions:
  • Unit internal (UI) instructions: effectively single-cycle operations that complete upon the cycle of operation
  • Unit-to-unit (U2U) instructions: have a two-cycle footprint spread over a total of three cycles and result from using different source and destination units or conditional operators, with the exception of conditional compares
  • DSP instructions: a hybrid form of the UI and U2U forms

Instruction set
The META instruction set is 32-bit, with most base and some DSP instructions supporting conditional execution. META supports four types of instructions:
  • General-purpose base
  • DSP (digital signal processing)
  • SIMD (single instruction, multiple data)
  • Template - table-driven VLIW (very long instruction word)

These instructions encompass the following classes of operation:
  • Address unit and data unit ALU
  • Logical
  • Bit manipulation and comparison
  • Program control
  • Transfer
  • System

The DSP capability of META consists of:
  • Multiple execution (data and address) units
  • Two banks of DSP RAM per data unit shared between threads as global resources
  • Saturation, rounding, shifting and scaling hardware support & guard bits to ensure numerical fidelity
  • Split arithmetic pipelines to support parallel arithmetic, e.g. dual half-word arithmetic
  • Support for modulo and bit-reverse addressing
  • 16- and 32-bit multiplier options

Four types of DSP instruction are supported:
  • Simple (MAC): up to four 16-bit MACs or two 32-bit MACs per cycle
  • Complex: complete complex (16-bit/16-bit packed) FFT butterfly add/sub and multiply instructions. Can overlap add/sub and multiply for single cycle issue
  • SIMD: an instruction can be executed across multiple data units with different data
  • VLIW: an instruction set extension that allows data unit operations to be executed in parallel with memory operations. It’s enabled by defining a set of up to sixteen (per thread) template instructions which can then be issued using a template instantiation instruction. This provides maximum concurrency and, when used in conjunction with the complex FFT instructions, reduces the cycle count for a radix-8 butterfly from 43 cycles to 27 cycles.

The DSP instruction set also includes data unit comparison operations, including MIN/MAX, FFB (find first bit), etc. – vital for the add-compare-select operation at the heart of Viterbi decode acceleration.

The META core interfaces directly to the instruction and data caches that support its full read/write code/data operating bandwidth. Using this interface the core may issue both a data read/write operation and an instruction read operation in a single cycle, with the intention of moving data into and out of the corresponding caches as fast as possible in parallel.

The instruction cache is a 4-way set-associative cache of up to 64 Kbytes. Each cache line is 8 × 64-bit words and there are up to 1024 lines. The cache is non-blocking on a thread basis: if one thread misses, requests by other threads will still be accepted. Cache line invalidation only occurs when there is available return data. The instruction cache supports pre-fetch, aiding predictability for use in real-time systems.

The data cache has a similar organization and range of sizes as the instruction cache, with the addition of data write-through for all linear addresses other than those in the core memory region. Locked cache lines can be placed within the core memory region so that write-through of such cache lines is also nullified. Cache line locking allows developers to lock in critical code to ensure deterministic response time in interrupt handling routines, and to lock in critical data to eliminate memory latency. The caches may be partitioned into halves, quarters, eighths, sixteenths or combinations. A thread has exclusive use of a local cache partition and all threads share a global cache partition; the global cache partition remains coherent for interleaved accesses by multiple threads without the use of cache flushing.

Memory Management Unit (MMU)
The MMU is responsible for translating the logical linear address used by the META core threads and co-processor hardware into a physical memory address accessed via the system bus. The MMU allows META to operate as a microprocessor supporting protected operating systems such as embedded Linux.

Exceptions and interrupts
Exceptions cause execution to HALT and are then matrixed to allow the interrupt for one thread to be handled by another thread. This means that an exception may be handled by the thread on which it occurs, or on any other thread, depending upon the setup of the thread on which the exception occurs. The META core incorporates full support for interrupts. Interrupts may be triggered by one of a number of causes including the following:
  • Internal actions, for example exceptions including software TRAPs
  • External co-processors
  • Hardware modules
  • Deliberate actions on other threads that generate HALT triggers

Advanced trigger processing
Conventional interrupt handling can have a large overhead, involving a save of the current context, execution of the interrupt service routine (ISR) and a restore. With advanced trigger processing, threads can poll or wait for events and respond immediately. Since META is multi-threaded, no context save is required, resulting in a true one-cycle response without overhead. Simple code on one thread can respond to multiple events synchronously.

Debug interface
This interface is a parallel 32-bit data and 4-bit address read/write interface that can be driven via a JTAG interface module or other appropriate hardware placed outside the META core. This slave interface allows an external debug host or controller to indirectly issue reads or writes within the logical address space of any thread, so that all features of the core can be controlled or monitored. Side-band output signals are also present such that data flow control and system monitoring may be implemented externally without polling via this interface.

Automatic MIPS Allocation
Automatic MIPS Allocation (AMA) is the method used to allow thread instruction issue rates and relative thread priorities to be controlled in a dynamic fashion. Rate control is concerned with the number of instructions a thread wishes to run over a given time period and the total load on the system. In general terms, instruction rates are controlled via two counters – the delay and pool counters – which influence and are influenced by the issue of instructions. Both counters increment at a regular rate (each counter may have a separate rate) minus the rate at which instructions for a thread were issued recently. In addition, both counters have rules for saturation or range limiting. During interrupt-level processing a thread's rate will be boosted to the maximum possible level to reduce interrupt latencies. In addition to rate control, AMA also manages the relative priorities of the different threads. Priority control is handled primarily through a static priority register setting, a deadline counter and the status of the delay counter part of the rate controller. As with rate control, during interrupt-level processing a thread's priority may be boosted to the maximum possible level to reduce interrupt latencies.

Inter-thread communications
A thread encompasses all the features that a conventional processor provides for the execution of a task, including independent support for interrupt handling, software scheduling and privilege protection. Threads are equivalent to multiple independent processors operating together. Code can be developed as if for a single-threaded processor, with the META processor hardware and CodeScape META development system taking care of the details of multi-threading.

There will be times when threads are required to interact. META provides several ways of achieving this:
  • Hardware events (triggers) generated by one thread and sent to another
  • Shared memory areas into which threads may read/write via an agreed protocol
  • Thread synchronization; memory bus interlocks that allow atomic updates to key shared memory locations, which can then be used for synchronization
  • Kick hardware, where software-generated events are stored for later processing by a thread. Separate kick counters are supported for background and interrupt processing by each thread. Background processing allows zero-overhead voluntary processing of kicks whereas interrupt processing allows transparent communication of events between/to supervising software on different threads
  • Signals; a software extension of the hardware kick system. Up to 32 independent events with corresponding handlers may be established for each thread by setting a bit and then sending a kick
  • Cyclic command buffer; a shared memory cyclic buffer exploiting the kick system, in which command descriptions are placed and then corresponding kicks are delivered to the server thread
By combining the methods described above, a unified system of multi-level thread interaction can be created using the hardware scheduler and simple run-time code. Such a system would not require an operating system in applications such as information systems or handheld devices.

META represents one more step in the evolution of combined computational models. Native support for facilities once reserved for operating systems can become powerful tools when integrated into a CPU such as META. More importantly, the META processor eases the hardware/software tradeoff burden, absorbing many of the operating details into the processor hardware.
