Throughout this series, we’ve been converting sample data back and forth between ByteArrays and Vector.<Number>s, and doing all of our audio processing on those Vectors. While working on Vectors is clearly more comfortable than accessing a ByteArray directly (what with having to explicitly set the ByteArray’s position in order to reach a random element), you may have been wondering whether all that copying between the two structures comes with a huge performance penalty.
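As a quick refresher, the round trip we’ve been doing looks roughly like this (a minimal sketch, not the exact code from earlier parts; I’m assuming interleaved stereo samples stored as 4-byte floats, read and written with readFloat()/writeFloat()):

// 'bytes' is assumed to hold interleaved stereo float samples,
// e.g. the data of a SampleDataEvent.
var frameCount:int = int(bytes.length / 8);   // 2 floats of 4 bytes per stereo frame
var left:Vector.<Number> = new Vector.<Number>(frameCount, true);
var right:Vector.<Number> = new Vector.<Number>(frameCount, true);

// ByteArray -> Vectors
bytes.position = 0;
for (var i:int = 0; i < frameCount; i++)
{
     left[i] = bytes.readFloat();
     right[i] = bytes.readFloat();
}

// ... process left and right here ...

// Vectors -> ByteArray
bytes.position = 0;
for (i = 0; i < frameCount; i++)
{
     bytes.writeFloat(left[i]);
     bytes.writeFloat(right[i]);
}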

In this installment, I want to show you that just the opposite is true. http://www.philippseifried.com/blog/files/misc/AS3_Performance_Tests.zip contains a class with various performance tests. You can download the zip and run them yourself, or just read on and let me take you through their output.

ByteArrays vs. Vector.<Number>s

I’ve performed these tests on a MacBook Pro running Flash Player 10.2. Your results may vary depending on player version, hardware and operating system, but the proportions should stay roughly the same.

— 1.) Testing 1000000 random number writes to ByteArray…
Complete. Test took 299ms.

— 2.) Testing 1000000 random number writes to Vector…
Complete. Test took 138ms.

(1000000 calls to Math.random take 135ms.)

The first two tests populate a ByteArray and a Vector.<Number> with a million random numbers, which demonstrates the cost of sequential write access to both data structures. Once you subtract the roughly 135ms that Math.random() itself accounts for, sequentially writing to a Vector is almost free, while the same operation on a ByteArray adds roughly 160 milliseconds.
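The two write loops boil down to something like this (my reconstruction; whether the test class uses writeFloat() or writeDouble() doesn’t change the pattern):

// 'byteArray' and 'vector' are assumed to be pre-allocated,
// the Vector with room for a million Numbers.

// Test 1: sequential writes to the ByteArray
byteArray.position = 0;
for (var i:int = 0; i < 1000000; i++)
{
     byteArray.writeFloat(Math.random());
}

// Test 2: sequential writes to the Vector.<Number>
for (i = 0; i < 1000000; i++)
{
     vector[i] = Math.random();
}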

— 3.) Testing 1000000 sequential reads from ByteArray…
Complete. Test took 141ms. 

— 4.) Testing 1000000 sequential reads from Vector…
Complete. Test took 17ms.

Sequential reading shows the same contrast – the ByteArray is an order of magnitude slower.
Random read access is a different story:

— 5.) Testing 1000000 random reads from ByteArray…
Complete. Test took 551ms.

— 6.) Testing 1000000 random reads from Vector…
Complete. Test took 514ms.

These tests read single values from arbitrary locations in the data. The Vector is still faster, but only by roughly 10%.
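Random access is where the ByteArray’s explicit position bookkeeping shows up; the two read loops look more or less like this (again a reconstruction, assuming 4-byte floats and pre-filled 'byteArray' and 'vector'):

var sum:Number = 0;

// Test 5: random reads from the ByteArray
for (var i:int = 0; i < 1000000; i++)
{
     var index:int = int(Math.random() * 1000000);
     byteArray.position = index * 4;   // 4 bytes per float
     sum += byteArray.readFloat();
}

// Test 6: random reads from the Vector.<Number>
for (i = 0; i < 1000000; i++)
{
     sum += vector[int(Math.random() * 1000000)];
}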

— 7.) Testing 1000 instantiations of a Vector.<Number> of size 4096…
Complete. Test took 157ms.

 

Test 7 measures the cost of instantiating a Vector of a typical audio buffer size. Assuming you’d need about 20 of these per second (44100 samples for each stereo channel), you could instantiate them on the fly at a cost of about 3ms per second. However, if you’re going to build an effect pipeline of any sort, you probably want to plan ahead and reuse those Vectors.
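Reusing them could look as simple as this (a sketch with my own names, not code from the SoundManager):

// Allocate once, e.g. when the effect pipeline is set up:
var left:Vector.<Number> = new Vector.<Number>(4096, true);
var right:Vector.<Number> = new Vector.<Number>(4096, true);

// In each audio callback, reuse the same buffers instead of creating
// new ones; just overwrite (or clear) their contents in place.
for (var i:int = 0; i < 4096; i++)
{
     left[i] = 0;
     right[i] = 0;
}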

— 8.) Testing copying 88200 values from ByteArray into two Vectors and back.
Complete. Test took 28ms.

— 9.) Testing reading 88200 values from ByteArray and writing them back.
Complete. Test took 37ms.

 

Tests 8 and 9 are interesting: test 8 measures the overhead of copying a one-second audio ByteArray into two Vector.<Number>s (one per stereo channel) and then writing them back into the ByteArray. The resulting 28ms is the base cost of any kind of Vector-based audio processing setup.

You might think this is pretty high, but compare it with test 9, which does the same thing without Vectors, reading the ByteArray Number by Number and immediately writing the values back into the ByteArray – the cost is actually higher. The culprit is that in order to write a float back into the ByteArray, we have to set its position back by one value (two in this case, because we alternately read left and right channel data), which comes at a cost, whereas the Vector-based solution leverages the speed of sequential access, both for the Vector.<Number>s and for the ByteArray. It turns out that the setup we’ve been working with – copying sample data into Vectors, manipulating it there and then writing it back – basically comes for free.
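For reference, the Vector-less round trip of test 9 looks something like this (a sketch assuming interleaved 4-byte stereo floats; the exact code is in the zip):

// 'bytes' is assumed to hold one second of interleaved stereo float samples.
bytes.position = 0;
while (bytes.bytesAvailable > 0)
{
     var leftSample:Number = bytes.readFloat();
     var rightSample:Number = bytes.readFloat();

     // To write the values back, the position has to be rewound by
     // two floats (8 bytes), and that rewinding is where the extra cost comes from.
     bytes.position -= 8;
     bytes.writeFloat(leftSample);
     bytes.writeFloat(rightSample);
}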

 

A few more tests to pave the road for part 8

In part 8, we’ll extend the SoundManager from last week’s post with a flexible audio effects architecture. The goal is to let you build a graph of audio devices, connected through their inputs and outputs. You might think that a good way of implementing this is to traverse the graph for each audio sample, letting each connected effect device process the sample in some way. Tests 10 and 11 show why that’s not such a good idea:

— 10.) Testing overhead of 88200 method calls (one per each stereo sample in one second):
(time for calculation without method call overhead: 2.6ms)
Method call overhead: 6.7ms.

— 11.) Testing overhead of 88200 method calls on another class’ instance:
(time for calculation without method call overhead: 2.3ms)
Method call overhead: 7.6ms.

 

Remember that when working on audio data, you’re dealing with 88200 samples per second. Operations that are normally negligible can become a performance drain at that volume of data. Suppose you had a setup of effect devices with some sort of process() method that you called for each sample: the overhead of just calling the method (without the method actually doing anything) would be around 7ms per device for each second of audio. At 14 devices, you’d be spending a tenth of your time on method calls alone.
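That’s the main argument for handing each device a whole buffer instead: pay the call overhead once per block rather than once per sample. As a purely hypothetical illustration (my naming; part 8’s actual architecture may look different):

// Hypothetical per-sample interface: 88200 calls per second and device,
// i.e. roughly 7ms of pure call overhead each.
interface ISampleProcessor
{
     function process(sample:Number):Number;
}

// Hypothetical per-buffer interface: one call per device and buffer,
// so the call overhead becomes negligible.
interface IBufferProcessor
{
     function processBuffer(left:Vector.<Number>, right:Vector.<Number>):void;
}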

On a side note, tests 10 and 11 (the latter is supposed to measure the extra overhead of calling a method on another instance, such as a connected device) consistently return about the same measurements on my machine, which surprised me. Sometimes 10 is a little faster, sometimes 11. It could well be that the compiler performed optimizations it couldn’t do in a real-world scenario (where the called method would access member variables, for example), but since we’re not going to take the method-call-per-sample route anyway, I didn’t investigate further.

— 12.) Mixer-Setup (100 iterations over 10 sources with 4096 samples):
Testing mixing on a per sample basis:
Complete. Mixing took 404ms
Testing mixing on a per source basis:
Complete. Mixing took 136ms

 

Finally, test 12 is supposed to test Flash’s cache behavior by creating something like an audio mixer. It creates 10 Vectors of size 4096 and fills them with random data. Then it writes the sum of the data in these Vectors into an “output” Vector.
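To make the two loops below self-contained, the setup is roughly this (my reconstruction; the names match the loops that follow, but the exact code is in the test class):

var sources:Vector.<Vector.<Number>> = new Vector.<Vector.<Number>>();
var result:Vector.<Number> = new Vector.<Number>(4096, true);
var source:Vector.<Number>;
var j:int;

// Ten source Vectors of 4096 random samples each.
for (var s:int = 0; s < 10; s++)
{
     source = new Vector.<Number>(4096, true);
     for (j = 0; j < 4096; j++)
     {
          source[j] = Math.random();
     }
     sources.push(source);
}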

The first test (“per sample basis”) performs the addition in the following way:

for (j=0; j<4096; j++)
{
     result[j] = 0;
     // For every single sample index, touch all ten source Vectors.
     for each (source in sources)
     {
          result[j] += source[j];
     }
}

For each sample, we go over all source Vectors, adding their value at the sample index to the output.
This takes 404ms for 100 iterations.

The second test (“per source basis”) performs the addition differently:

// Clear the output buffer first.
for (j=0; j<4096; j++)
{
     result[j] = 0;
}
// Then run through the sources one Vector at a time.
for each (source in sources)
{
     for (j=0; j<4096; j++)
     {
          result[j] += source[j];
     }
}

First, the complete output Vector is cleared. Then we go over each source Vector, and add all of its samples to the output. This only takes 136ms.

The difference between the two methods (as I understand it – please comment if I got that wrong!) is that the second version leverages the cache by reading only two Vectors (output and one source) at a time, while the first version takes a look at every source for each sample, effectively incurring the cost of random read access.

 
