Renderscript Part 2

[This post is by R. Jason Sams, an Android engineer who specializes in graphics, performance tuning, and software architecture. —Tim Bray]

In Introducing Renderscript I gave a brief overview of this technology. In this post I’ll look at “compute” in more detail. In Renderscript we use “compute” to mean offloading of data processing from Dalvik code to Renderscript code which may run on the same or different processor(s).

Renderscript’s Design Goals

Renderscript has three primary goals, given here from most to least important.

Portability: Application code needs to be able to run across all devices, even those with radically different hardware. ARM currently comes in several variants — with and without VFP, with and without NEON, and with various register counts. Beyond ARM, there are other CPU architectures like x86, several GPU architectures, and even more DSP architectures.

Performance: The second objective is to get as much performance as possible within the constraints of Portability. For Renderscript to make sense we need to achieve much greater performance than established solutions.

Usability: The third goal is to simplify development as much as possible. Where possible we automate steps to avoid glue code and other developer busy work.

Those three goals lead to several design trade-offs. It’s these trade-offs that separate Renderscript from the existing approaches on the device, such as Dalvik or the NDK. They should be thought of as different tools intended to solve different problems.

Core Design Choices

The first choice that needed to be made is what language should be used. When it comes to languages there are almost unlimited options. Shading style languages, C, and C++ were considered. In the end the shading style languages were ruled out due to the need to be able to manipulate data structures for graphical applications such as scene graphs. The lack of pointers and recursion were considered crippling limitations for usability. C++ on the other hand was very desirable but ran into issues with portability. The advanced C++ features are very difficult to run on non-cpu hardware. In the end we chose to base Renderscript on C99 because it offers equal performance to the other choices, is very well understood by developers, and poses no issues running on a wide range of hardware.

The next design trade-off was work flow. Specifically we focused on how to convert source code to machine code. We explored several options and actually implemented two different solutions during the development of Renderscript. The older versions (Eclair through Gingerbread) compiled the C source code all the way to machine code on the device. While this had some nice properties such as the ability for applications to generate source on the fly it turned out to be a usability problem. Having to compile your app, install it, run it, then find your syntax error was painful. Also the weaker CPU in devices limited the static analysis and scope of optimizations that could be done.

Then we switched to LLVM, moving to a model where scripts are compiled and analyzed on the host using a modified version of clang. We perform high level optimizations at this stage, then emit LLVM bitcode. The translation of the intermediate bitcode to machine code still happens on the device (along with additional device-specific optimizations).

Our last big trade-off for compute was thread launching. The trade-off here is between performance and portability. Given sufficient knowledge, existing compute solutions allow a developer to tune an application for a specific hardware platform to the detriment of others. Given unlimited time and resources developers could tune for every hardware combination. While testing and tuning a variety of devices is never bad, no amount of work allows them to tune for unreleased hardware they don’t yet have. A more portable solution places the tuning burden on the runtime, providing greater average performance at the cost of peak performance. Given that the number one goal was portability we chose to place this burden on the runtime.

A secondary effect of choosing runtime thread-launch management is that dynamic decisions can be made about where to run a script. For example, some compute hardware can support pointers and recursion while others cannot. We could have chosen to disallow these things and give developers a lowest common denominator API, but we chose to instead let the runtime analyze the scripts. This allows developers to get full use of hardware that supports these features, since there is always a fully featured CPU to fall back upon. In the end, developers can focus on writing good apps and the hardware manufacturers can compete on making the most fully featured and efficient hardware. As new features appear, applications will benefit without application code changes.

Usability was a major driver in Renderscript’s design. Most existing compute and graphics platforms require elaborate glue logic to tie the high performance code back to the core application code. This code is very bug prone and usually painful to write. The static analysis we do in the host Renderscript compiler is helpful in solving this issue. Each user script generates a Dalvik “glue” class. Names for the glue class and its accessors are derived from the contents of the script. This greatly simplifies the use of the scripts from Dalvik.

Example: The Application Level

Given these trade-offs, what does a simple compute application look like? In this very basic example we will take a normal android.graphics.Bitmap object and run a script that copies it to a second bitmap, converting it to monochrome along the way. Let’s look at the application code which invokes the script before we look at the script itself; this comes from the HelloCompute SDK sample:

    private Bitmap mBitmapIn;
private Bitmap mBitmapOut;
private RenderScript mRS;
private Allocation mInAllocation;
private Allocation mOutAllocation;
private ScriptC_mono mScript;

private void createScript() {
mRS = RenderScript.create(this);

mInAllocation = Allocation.createFromBitmap(mRS, mBitmapIn,
Allocation.MipmapControl.MIPMAP_NONE,
Allocation.USAGE_SCRIPT);
mOutAllocation = Allocation.createTyped(mRS, mInAllocation.getType());

mScript = new ScriptC_mono(mRS, getResources(), R.raw.mono);

mScript.set_gIn(mInAllocation);
mScript.set_gOut(mOutAllocation);
mScript.set_gScript(mScript);
mScript.invoke_filter();
mOutAllocation.copyTo(mBitmapOut);
}

This function assumes that the two bitmaps have already been created and are of the same size and format.

The first thing all Renderscript applications need is a context object. This is the core object used to create and manage all other Renderscript objects. This first line of the function creates the object, mRS. This object must be kept alive for the duration the application intends to use it or any objects created with it.

The next two function calls create compute allocations from the Bitmaps. Renderscript has its own memory allocator, because the memory may potentially be shared by multiple processors and possibly exist in more than one memory space. When an allocation is created its potential uses need to be enumerated so the system may choose the correct type of memory for its intended uses.

The first function createFromBitmap() creates a matching Renderscript allocation object and copies the contents of the bitmap into the allocation. Allocations are the basic units of memory used in renderscript. The second Allocation created with createTyped() generates an Allocation identical in structure to the first. The definition of that structure is retrieved from the first with the getType() query. Renderscript types define the structure of an Allocation. In this case the type was generated from the height, width, and format of the incoming bitmap.

The next line loads the script, which is named “mono.rs”. R.raw.mono identifies it; scripts are stored as raw resources in an application’s APK. Note the name of the auto-generated “glue” class, ScriptC_mono.

The next three lines set properties of the script, using generated methods in the “glue” class.

Now we have everything set up. The function invoke_filter() actually does some work for us. This causes the function filter() in the script to be called. If the function had parameters they could be passed here. Return values are not allowed as invocations are asynchronous.

The last line of the function copies the result of our compute script back to the managed bitmap; it has the necessary synchronization code built-in to ensure the script has completed running.

Example: The Script

Here’s the Renderscript stored in mono.rs which the application code above invokes:

#pragma version(1)
#pragma rs java_package_name(com.android.example.hellocompute)

rs_allocation gIn;
rs_allocation gOut;
rs_script gScript;

const static float3 gMonoMult = {0.299f, 0.587f, 0.114f};

void root(const uchar4 *v_in, uchar4 *v_out, const void *usrData, uint32_t x, uint32_t y) {
float4 f4 = rsUnpackColor8888(*v_in);

float3 mono = dot(f4.rgb, gMonoMult);
*v_out = rsPackColorTo8888(mono);
}

void filter() {
rsForEach(gScript, gIn, gOut, 0);
}

The first line is simply an indication to the compiler which revision of the native Renderscript API the script is written against. The second line controls the package association of the generated reflected code.

The three globals listed correspond to the globals which were set up in our managed code. The fourth global is not reflected because it is marked as static. Non-static, const, globals are also allowed but only generate a get reflected method. This can be useful for keeping constants in sync between scripts and managed code.

The function root() is special to renderscript. Conceptually it’s similar to main() in C. When a script is invoked by the runtime, this is the function that will be called. In this case the parameters are the incoming and outgoing pixels from our allocation. A generic user pointer is also provided along with the address within the allocation that this invocation is processing. This example ignores these parameters.

The three lines of the root function unpack the pixel from RGBA_8888 to a vector of four floats. The second line uses a built-in math function to compute the dot product of the monochrome constants with the incoming pixel data to get our grey level. Note that while dot returns a single float it can be assigned to a float3 which simply copies the value to each of the x, y, and z components of the float3. In the end we use another builtin to repack the floats back to a 32 bit pixel. This is also an example of an overloaded function as there are separate versions of rsPackColorTo8888 which take RGB (float3) or RGBA (float4) data. If A is not provided the overloaded functions assume a value of 1.0f.

The filter() function is called from managed code to do the conversion. It simply does a compute launch on each element of the allocation. The first parameter is the script to be launched - the root function of this script will be invoked for each element in the allocation. The second and third parameters are the input and output data allocations. The last parameter is the pointer for the user data if we desired to pass additional information to the root function.

The forEach will launch across multiple threads if the device has multiple processors. In the future forEach can provide a transition point where control may pass from one processor to another. In this example it is reasonable to expect that in the future filter() would get executed on the CPU and root() would occur on a GPU or DSP.

I hope this gives glimpse into the design behind Renderscript and a simple example of how it can be used.