Feature Story



CLIENT: HSA FOUNDATION

Aug. 22, 2013: Computing Now

HSAIL: Write-Once-Run-Everywhere for Heterogeneous Systems

Power efficiency has emerged as a primary design goal for modern silicon chips. Accelerators such as GPUs have well-known advantages in compute density per watt and per mm^2 – note, for example, that the systems at the top of the latest Green500 (http://www.green500.org/) and Top500 (http://www.top500.org/) lists are based on heterogeneous designs.

However, these systems have traditionally been difficult to program, due to two challenges. First, many accelerators support only dedicated address spaces that require cumbersome copy operations and prevent pointer-based data structures from being shared between the accelerator and the host processor. Second, accelerator programming has traditionally required a specialized language such as OpenCL™ or CUDA™. Some of these specialized languages are supported by only a single hardware vendor, which further constrains their adoption.

An intermediate language called HSAIL is helping to address these challenges. HSAIL gives existing programming languages an efficient parallel intermediate representation that runs on a wide variety of hardware, providing the underlying infrastructure that brings the benefits of heterogeneous computing to existing, popular programming models such as Java™, OpenMP™, C++, and more.

HSAIL was recently introduced by the Heterogeneous System Architecture (HSA) Foundation, which was created to make these inherent benefits of accelerators available to mainstream programmers – in languages they already know, and with the features and tools they expect. The HSA Foundation is currently working on specifications that will standardize the system architecture for heterogeneous devices. For example, the HSA system architecture specifies that all devices in the system have access to a shared, coherent memory heap – this allows devices to use pointers, and it addresses the first challenge, that of burdensome copies.

Version 0.95 of the HSAIL specification (officially the "HSA Programmer's Reference Manual: Virtual ISA and Programming Model, Compiler Writer's Guide, and Object Format (BRIG)") was recently released to the public and can be obtained from http://hsafoundation.com/. This article describes the design goals for HSA, introduces the HSAIL execution model and instruction set (both described in unambiguous detail in the HSAIL specification), and shows an example of HSAIL in action.

What is HSAIL?

HSAIL is the HSA intermediate language for parallel computing. HSAIL is typically generated by a high-level compiler – the input is a programming language such as Java or parallel C++, and the output includes HSAIL for the parallel code regions as well as host code for the CPU. HSAIL defines a binary format (called "BRIG") which is intended to be the "shipping container" for HSAIL – application binaries will include embedded BRIG. At runtime, a tool called the "finalizer" translates the embedded BRIG to the target instruction set of the heterogeneous device. Depending on the usage model, the finalizer can also be run at build time or install time.

One of the benefits of HSAIL is its portability across multiple vendor products. Compilers that generate HSAIL can be assured that the resulting code will be able to run on a wide variety of target platforms. Likewise, HSAIL-based tools (debuggers, profilers, etc.) also support many target platforms. HSA is an open standard with broad industry support (founding members of the HSA Foundation include AMD, ARM, Imagination, MediaTek, Qualcomm, Samsung, and Texas Instruments). HSAIL is a stable format that is forward-compatible with future hardware revisions (so applications that contain BRIG will continue to run). We expect HSAIL to evolve through the addition of new operations, primarily driven by workload analysis, but at a controlled pace of roughly every couple of years (similar to modern CPU architectures).

HSAIL is a low-level intermediate language, just above the machine instruction set. HSAIL is designed for fast and robust compilation – the conversion from HSAIL to machine ISA is more of a translation than a complex compiler optimization. This simple final step reduces the chance for errors to creep into the design, and also reduces performance variations due to changes in the finalizer. Instead, most optimizations are intended to be done in the high-level compiler, which has a larger time budget and more scope to implement complex optimizations. HSAIL provides a fixed-size register file, so the high-level compiler also performs register allocation, which is traditionally one of the more complex and time-consuming parts of the compilation process.

HSAIL Execution Model

As mentioned above, HSAIL is designed for parallel execution. HSAIL itself specifies the instruction flow for a single "work-item" of execution. When the parallel task is dispatched, the dispatch command specifies the number of work-items that should be executed (the "grid" dimension), and each work-item then works on a single point in the grid. For example, the grid might be a 1920x1080 high-definition video frame, and each work-item might be working on one pixel in the frame.

Figure 1: An HSA grid and its work-groups and work-items

Figure 1 shows the different levels of the HSAIL execution model. This model will likely be familiar to experts in graphics or GPU computing. First, note that the grid consists of a number of work-items. Each work-item has a unique identifier (specified with x, y, z coordinates). HSAIL contains instructions so that each work-item can determine where it is (its unique coordinates), and thus what part of the data it should operate on. Grids can have 1, 2, or 3 dimensions – the figure shows a 3D grid, but the video-frame example from the previous paragraph would use a 2D grid.
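As a quick illustration (a hedged sketch that uses only instruction forms also visible in Figure 4 later in this article, and assumes register $d0 already holds the frame's base address, e.g. loaded from a kernarg), a work-item processing the 1920x1080 frame described above could locate and load its own pixel like this:

    workitemabsid_u32 $s0, 0;      // this work-item's absolute x coordinate
    workitemabsid_u32 $s1, 1;      // this work-item's absolute y coordinate
    mul_u32 $s2, $s1, 1920;        // linear pixel index = y * width + x
    add_u32 $s2, $s2, $s0;
    cvt_u64_u32 $d1, $s2;          // widen the index for 64-bit addressing
    mul_u64 $d1, $d1, 4;           // 4 bytes per pixel
    add_u64 $d1, $d1, $d0;         // add the frame's base address (assumed to be in $d0)
    ld_global_u32 $s3, [$d1];      // load this work-item's pixel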

Grids are divided into one or more work-groups. Work-items in the same work-group can efficiently communicate and synchronize with each other through "group" memory. Work-groups can thus provide opportunities for extracting peak performance from the machine through the use of group memory. The last work-group in each dimension of a grid may be only partially filled, giving developers some flexibility in the grid size.

The wavefront is a hardware concept indicating the number of work-items that are scheduled together. Different hardware may have different wavefront widths, so most programs should not need to be aware of the wavefront width (although HSAIL does expose it for the intrepid expert). HSAIL also provides cross-lane operations that combine results from several work-items in the same work-group.
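For that intrepid expert, the specification defines a compile-time token, WAVESIZE, which the finalizer replaces with the target hardware's actual wavefront width. A one-line sketch, assuming WAVESIZE may be used wherever an immediate operand is allowed:

    mov_b32 $s0, WAVESIZE;         // resolved by the finalizer to the hardware's wavefront width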

When a grid executes, work-groups are distributed to one or more compute units in the target device. The grid is always scheduled in work-group-sized granularity – work-groups thus encapsulate a piece of parallel work, and performance naturally scales on higher-end devices with more compute units.

The work-items in the HSA execution model provide a familiar target for programmers, since each work-item represents a single thread of execution – HSAIL code thus looks like a sequential program. Parallelism is expressed by the grids and work-groups (which specify how many work-items to run) rather than inside the HSAIL code itself. This is a powerful lever that makes the model portable across a wide range of parallel hardware with different vector widths and numbers of compute units. Contrast this with CPU models, which often require expressing thread parallelism (i.e., across CPU cores) and SIMD parallelism (within each core) through different mechanisms. Further, SIMD parallelism is often hard-coded into the algorithm and difficult to scale as the SIMD width increases.

HSAIL Instruction Set

Writing in HSAIL is similar to writing in assembly language for a RISC CPU: the language uses a load/store architecture; supports fundamental integer and floating point operations, branches, atomic operations, and multi-media operations; and uses a fixed-size pool of registers. HSAIL has the functionality to run existing programming models such as OpenCL™ and C++ AMP, but also adds features designed to support programming models that have traditionally targeted only CPUs, such as Java and C++.

Thus HSAIL includes support for function pointers, exceptions, and debugging information. Additionally, HSAIL defines group memory, hierarchical synchronization primitives (for example, work-group-level and global synchronization), and wavefronts, which should look familiar to programmers of GPU computing devices and can be useful for achieving peak performance. The HSAIL specification provides a detailed explanation of the rich set of operations available in HSAIL.
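As an illustrative sketch of group memory and work-group synchronization (hedged – the declaration and barrier syntax here reflect our reading of the 0.95 specification and do not come from this article), each work-item publishes a value to a group-memory scratchpad, waits at a barrier, and then reads a neighbor's result:

    group_f32 %tile[256];           // per-work-group scratchpad, one slot per work-item

    // ... inside the kernel body:
    workitemid_u32 $s0, 0;          // this work-item's id within its work-group
    lda_group_u32 $s1, [%tile];     // base address of the scratchpad (group addresses are 32-bit)
    mul_u32 $s2, $s0, 4;            // byte offset of this work-item's slot
    add_u32 $s1, $s1, $s2;
    st_group_f32 $s9, [$s1];        // publish a value (here assumed to be in $s9)
    barrier;                        // wait until every work-item in the group has stored
    ld_group_f32 $s10, [$s1 + 4];   // read the neighboring work-item's value (bounds handling omitted)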

A key design point for HSAIL is the use of a fixed-size register pool. This effectively moves register allocation into the high-level compiler, and allows the finalizer to run faster and with less complexity. HSAIL provides four classes of registers:

· "C": 1-bit control registers. These are used to store the output of comparison operations.

· "S": 32-bit registers that can store either a 32-bit integer or a single-precision floating point value.

· "D": 64-bit register that can store either a 64-bit integer or a double-precision floating point value.

· "Q": 128-bit registers that store packed values. Several packed formats are supported. Each packed element can range from 8 bits to 16 bits in size.

HSAIL provides up to 8 "C" registers. The "S", "D", and "Q" registers share a single pool of resources that supports up to 128 "S" registers. Each "D" register requires 2 register slots, and each "Q" register requires 4 slots. The high-level compiler must ensure that 1*S + 2*D + 4*Q does not exceed 128 in the generated HSAIL code.
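For example, a kernel whose hottest point holds 30 live "S" values, 40 "D" values, and 4 "Q" values consumes 30*1 + 40*2 + 4*4 = 126 slots, which fits within the 128-slot budget; two more "D" values would raise the total to 130 and force the high-level compiler to spill.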

If the high-level compiler runs out of available registers, it uses the HSAIL "spill" segment to shuffle live values into and out of the registers. If the target machine has more than 128 registers, the finalizer can convert spill locations into hardware registers.
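A minimal sketch of what such spill code might look like (the slot name and declaration style here are illustrative, not taken from the specification):

    spill_u32 %spill0;              // one 32-bit spill slot allocated by the high-level compiler

    st_spill_u32 $s7, [%spill0];    // save $s7's live value so the register can be reused
    // ... $s7 holds other intermediate values here ...
    ld_spill_u32 $s7, [%spill0];    // restore the value when it is needed again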

HSAIL Machine Models and Profiles

HSAIL is intended to support a wide range of devices, from big iron in a computing farm to a small gadget in your hand. To make sure that HSAIL can be implemented efficiently across multiple market segments, the HSA Foundation introduced the concepts of machine models and profiles. The machine model determines the size of data pointers; profiles focus on features and precision requirements.

Too many machine models and profiles would fragment the ecosystem and make it difficult for both the infrastructure and the community to evolve and grow. For this reason, there are currently only two machine models: Small, for a 32-bit address space, and Large, for a 64-bit address space. Likewise, there are only two profiles: Base and Full.

A process executing with a 32-bit address space requires the HSAIL code to use the Small machine model, and a process executing with a 64-bit address space requires the Large machine model. The Small model is appropriate for mobile applications, which are predominantly 32-bit today, or for a legacy PC application with some portion rewritten as data-parallel kernels. The Large model is appropriate for modern PC applications running in today's predominantly 64-bit PC environment. As mobile application processors evolve to 64-bit, the Large model may be adopted in the mobile space as well.

HSAIL profiles are provided to guarantee that implementations support a required feature set and meet a given set of program limits. The strictly defined set of HSAIL profile requirements assures users that a certain level of support is present. The Base profile targets smaller systems that favor power efficiency, and may reduce floating-point precision to achieve it. The Full profile targets larger systems whose hardware can guarantee higher-precision results without sacrificing performance.

The Full profile follows the IEEE-754 rules for floating point operations. Notably, this requires correctly rounded results for addition, subtraction, multiplication, division, and square root operations. The Full profile also supports all IEEE-754 rounding modes. The Base profile relaxes the accuracy requirements for division and square root, and requires only the round-to-nearest rounding mode.

The following rules apply to profiles:

• A finalizer can choose to support either or both profiles.

• A single profile applies to the entire HSAIL program.

• An application is not allowed to mix profiles.

• Both the Large and Small machine models are supported in each profile.
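Both choices are declared up front in HSAIL code, in the version statement visible on line 01 of Figure 4 below. Figure 4's kernel targets the Full profile and the Large model; by analogy, a Base-profile, Small-model target would presumably be declared with the corresponding tokens:

    version 0:95: $full : $large;   // as in Figure 4: Full profile, Large (64-bit) machine model
    version 0:95: $base : $small;   // presumed form for Base profile, Small (32-bit) model

(A module declares exactly one of these; the two lines above show the alternatives.)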

HSAIL Example

Figure 3 shows a simple code example written in Java, and Figure 4 shows the HSAIL code generated from it. The code loops through all players and computes the percentage of the team's scores that were achieved by each player. A key point in this example is that the input language is standard Java (rather than a compute-specific language like OpenCL™). The code is written with the Lambda/Stream API (part of the upcoming Java 8 standard), and uses a "Sumatra"-enabled JVM compiler to generate HSAIL. For more information on the JVM compiler used for this example, see the Sumatra project (http://openjdk.java.net/projects/sumatra/) and the Graal JIT compiler (http://openjdk.java.net/projects/graal/).

class Player {
    private Team team;
    private int scores;
    private float pctOfTeamScores;

    public Team getTeam() { return team; }
    public int getScores() { return scores; }
    public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
}

// "Team" class not shown

// Assume "allPlayers" is an initialized array of Players...
Stream<Player> s = Arrays.stream(allPlayers).parallel();
s.forEach(p -> {
    int teamScores = p.getTeam().getScores();
    float pctOfTeamScores = (float) p.getScores() / (float) teamScores;
    p.setPctOfTeamScores(pctOfTeamScores);
});

Figure 3: Java code example for a parallel loop that uses object references

Looking closer at the Java code, note that the "team" field in the Player data structure is a reference to another class. The HSAIL code will directly dereference the Team pointer to retrieve the total number of scores by the team. While this is a very simple example, it demonstrates the fundamental ease-of-programming shift that HSA will provide – the accelerator now has direct access to host data structures, including those that contain object references (pointers). This is vastly superior to previous approaches, where the Player and Team data would have to be packaged into arrays, all pointers converted into indexes into the new arrays, and the arrays copied to the device – all before the GPU could even begin executing. The HSA model is both easier to program and removes the power-inefficient copy operations.

01: version 0:95: $full : $large;
02: // static method HotSpotMethod<Main.lambda$2(Player)>
03: kernel &run (
04:     kernarg_u64 %_arg0              // Kernel signature for lambda method
05: ) {
06:     ld_kernarg_u64 $d6, [%_arg0];   // Move arg to an HSAIL register
07:     workitemabsid_u32 $s2, 0;       // Read the work-item global "X" coordinate
08:
09:     cvt_u64_s32 $d2, $s2;           // Convert X gid to long
10:     mul_u64 $d2, $d2, 8;            // Adjust index for sizeof ref
11:     add_u64 $d2, $d2, 24;           // Adjust for actual elements data start
12:     add_u64 $d2, $d2, $d6;          // Add to array ref ptr
13:     ld_global_u64 $d6, [$d2];       // Load from array element into reg
14: @L0:
15:     ld_global_u64 $d0, [$d6 + 120]; // p.getTeam()
16:     mov_b64 $d3, $d0;
17:     ld_global_s32 $s3, [$d6 + 40];  // p.getScores()
18:     cvt_f32_s32 $s16, $s3;
19:     ld_global_s32 $s0, [$d0 + 24];  // Team.getScores()
20:     cvt_f32_s32 $s17, $s0;
21:     div_f32 $s16, $s16, $s17;       // p.getScores()/teamScores
22:     st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()
23:     ret;
24: };

Figure 4: HSAIL code generated from Java "forEach" kernel

The HSAIL code implements the lambda function (the code inside the "forEach" block). The HSAIL code starts with the kernel signature (lines 4-5) for the single-argument lambda function inside the Java "forEach" block. Note that the bulk of the HSAIL code looks similar to a modern assembly language, with explicit load/store instructions and ALU operations that operate exclusively on registers. The first operand is typically the destination; for example, the instruction at line 10 stores into the double (64-bit) register $d2 the product of register $d2 and the immediate value 8. Note that the memory instructions specify the segment – for example, the "kernarg" segment in line 6 and the "global" segment in line 13. Line 19 dereferences the "Team" pointer, showing how HSAIL supports pointer-containing data structures. The comments in Figure 4 explain each line of code and the source construct to which it corresponds.

Conclusion

It is widely agreed that the future of computing will be heterogeneous. The elephant in the room is that an open standard must first be established cooperatively before vendors can differentiate and compete with their heterogeneous solutions. HSA is an important step: the elephant has been identified, and action has been taken.

HSA will fundamentally change the way that people program heterogeneous devices. We can already see the potential in the Java example – a compiler for an existing, popular programming model now generates HSAIL. Programmers will continue to write in the languages they already use, can rely on pointers and data structures as they expect, and the resulting HSAIL code is portable and will run on many different parallel targets. HSAIL on HSA will enable programmers to achieve massive performance gains without the traditional pain of accelerator programming.

And it will make programming heterogeneous systems fun!
