## 2012-06-26

### 2D Interpolation, Part 2: Minimizing Array Accesses

Last time I massaged the interpolator to avoid computing every time the linear transformation from destination space to source space, using only integer variables. With that rewriting, it is time to avoid referencing source values more than is needed. First thing we do, let's name all elements:

```void interp0() {
int offset = 0;
int yn = 0;
int yi = 1;
for (int i = 0; i < height; i++) {
final float yr = (float) yn / (float) height;
int xn = 0;
int xi = 1;
for (int j = 0; j < width; j++) {
final float xr = (float) xn / (float) width;
float c00 = data[yi-1][xi-1], c01 = data[yi-1][xi+0], c02 = data[yi-1][xi+1], c03 = data[yi-1][xi+2];
float c10 = data[yi+0][xi-1], c11 = data[yi+0][xi+0], c12 = data[yi+0][xi+1], c13 = data[yi+0][xi+2];
float c20 = data[yi+1][xi-1], c21 = data[yi+1][xi+0], c22 = data[yi+1][xi+1], c23 = data[yi+1][xi+2];
float c30 = data[yi+2][xi-1], c31 = data[yi+2][xi+0], c32 = data[yi+2][xi+1], c33 = data[yi+2][xi+2];
float c = 0;
switch (method) {
case NEAREST:
c = yr < 0.5 ? xr < 0.5 ? c11 : c12 : xr < 0.5 ? c21 : c22;
break;
case BILINEAR:
c = linear(yr,
linear(xr, c11, c12),
linear(xr, c21, c22) );
break;
case BICUBIC:
c = cubic(yr,
cubic(xr, c00, c01, c02, c03),
cubic(xr, c10, c11, c12, c13),
cubic(xr, c20, c21, c22, c23),
cubic(xr, c30, c31, c32, c33) );
break;
}
pixels[offset++] = hue(c);
xn += nx - 3;
if (xn >= width) {
xn -= width;
xi += 1;
}
}
yn += ny - 3;
if (yn >= height) {
yn -= height;
yi += 1;
}
}
}
```

Two things are of note: first, I need all 16 elements for bicubic interpolation, even if nearest-neighbor or bilinear were requested. Second, in the nearest-neighbor case I had to transform an index computation involving a float-to-integer rounding into explicit conditionals. Such is the price of progress.

Since yi, xi don't change at each step, it is possible to just reference new elements when they do, and reuse values in the mean time. First I will reuse the columns. In order to do that I need to hoist the assignments outside the inner loop, in effect converting them to pure initializations, and arrange for the variables to update when xi is incremented:

```void interp0() {
int offset = 0;
int yn = 0;
int yi = 1;
for (int i = 0; i < height; i++) {
final float yr = (float) yn / (float) height;
int xn = 0;
int xi = 1;
float c00 = data[yi-1][0], c01 = data[yi-1][1], c02 = data[yi-1][2], c03 = data[yi-1][3];
float c10 = data[yi+0][0], c11 = data[yi+0][1], c12 = data[yi+0][2], c13 = data[yi+0][3];
float c20 = data[yi+1][0], c21 = data[yi+1][1], c22 = data[yi+1][2], c23 = data[yi+1][3];
float c30 = data[yi+2][0], c31 = data[yi+2][1], c32 = data[yi+2][2], c33 = data[yi+2][3];
for (int j = 0; j < width; j++) {
final float xr = (float) xn / (float) width;
float c = 0;
switch (method) {
case NEAREST:
c = yr < 0.5 ? xr < 0.5 ? c11 : c12 : xr < 0.5 ? c21 : c22;
break;
case BILINEAR:
c = linear(yr,
linear(xr, c11, c12),
linear(xr, c21, c22) );
break;
case BICUBIC:
c = cubic(yr,
cubic(xr, c00, c01, c02, c03),
cubic(xr, c10, c11, c12, c13),
cubic(xr, c20, c21, c22, c23),
cubic(xr, c30, c31, c32, c33) );
break;
}
pixels[offset++] = hue(c);
xn += nx - 3;
if (xn >= width) {
xn -= width;
xi += 1;
if (xi < nx - 2) {
c00 = c01; c01 = c02; c02 = c03; c03 = data[yi-1][xi+2];
c10 = c11; c11 = c12; c12 = c13; c13 = data[yi+0][xi+2];
c20 = c21; c21 = c22; c22 = c23; c23 = data[yi+1][xi+2];
c30 = c31; c31 = c32; c32 = c33; c33 = data[yi+2][xi+2];
}
}
}
yn += ny - 3;
if (yn >= height) {
yn -= height;
yi += 1;
}
}
}
```

The initialization inlines the constant value xi = 1 in the second index. The update shifts the values around in classic bucket brigade fashion, and fetches fresh values for the new column to process. Note that there is a subtlety involving the range check, in that the last iteration for j will have xi = nx−2, but at that point it is too early to notice: logically the loop termination test should come before updating xn, xi and the array elements. I opt here to conditionalize the update to keep the `for` loop structured; feel free to `break` out of the loop instead.

Now it's time to reuse the rows. This is made simpler by the fact that the source matrix is an array of arrays. First I introduce row variables pointing to each active row, and rename the element accesses accordingly:

```void interp0() {
int offset = 0;
int yn = 0;
int yi = 1;
for (int i = 0; i < height; i++) {
final float yr = (float) yn / (float) height;
int xn = 0;
int xi = 1;
float[] p0 = data[yi-1], p1 = data[yi+0], p2 = data[yi+1], p3 = data[yi+2];
float c00 = p0[0], c01 = p0[1], c02 = p0[2], c03 = p0[3];
float c10 = p1[0], c11 = p1[1], c12 = p1[2], c13 = p1[3];
float c20 = p2[0], c21 = p2[1], c22 = p2[2], c23 = p2[3];
float c30 = p3[0], c31 = p3[1], c32 = p3[2], c33 = p3[3];
for (int j = 0; j < width; j++) {
final float xr = (float) xn / (float) width;
float c = 0;
switch (method) {
case NEAREST:
c = yr < 0.5 ? xr < 0.5 ? c11 : c12 : xr < 0.5 ? c21 : c22;
break;
case BILINEAR:
c = linear(yr,
linear(xr, c11, c12),
linear(xr, c21, c22) );
break;
case BICUBIC:
c = cubic(yr,
cubic(xr, c00, c01, c02, c03),
cubic(xr, c10, c11, c12, c13),
cubic(xr, c20, c21, c22, c23),
cubic(xr, c30, c31, c32, c33) );
break;
}
pixels[offset++] = hue(c);
xn += nx - 3;
if (xn >= width) {
xn -= width;
xi += 1;
if (xi < nx - 2) {
c00 = c01; c01 = c02; c02 = c03; c03 = p0[xi+2];
c10 = c11; c11 = c12; c12 = c13; c13 = p1[xi+2];
c20 = c21; c21 = c22; c22 = c23; c23 = p2[xi+2];
c30 = c31; c31 = c32; c32 = c33; c33 = p3[xi+2];
}
}
}
yn += ny - 3;
if (yn >= height) {
yn -= height;
yi += 1;
}
}
}
```

Next I hoist the row assignments out of the outer loop, as I did before:

```void interp0() {
float[] p0 = data[0], p1 = data[1], p2 = data[2], p3 = data[3];
int yn = 0;
int yi = 1;
int offset = 0;
for (int i = 0; i < height; i++) {
final float yr = (float) yn / (float) height;
float c00 = p0[0], c01 = p0[1], c02 = p0[2], c03 = p0[3];
float c10 = p1[0], c11 = p1[1], c12 = p1[2], c13 = p1[3];
float c20 = p2[0], c21 = p2[1], c22 = p2[2], c23 = p2[3];
float c30 = p3[0], c31 = p3[1], c32 = p3[2], c33 = p3[3];
int xn = 0;
int xi = 1;
for (int j = 0; j < width; j++) {
final float xr = (float) xn / (float) width;
float c = 0;
switch (method) {
case NEAREST:
c = yr < 0.5 ? xr < 0.5 ? c11 : c12 : xr < 0.5 ? c21 : c22;
break;
case BILINEAR:
c = linear(yr,
linear(xr, c11, c12),
linear(xr, c21, c22) );
break;
case BICUBIC:
c = cubic(yr,
cubic(xr, c00, c01, c02, c03),
cubic(xr, c10, c11, c12, c13),
cubic(xr, c20, c21, c22, c23),
cubic(xr, c30, c31, c32, c33) );
break;
}
pixels[offset++] = hue(c);
xn += nx - 3;
if (xn >= width) {
xn -= width;
xi += 1;
if (xi < nx - 2) {
c00 = c01; c01 = c02; c02 = c03; c03 = p0[xi+2];
c10 = c11; c11 = c12; c12 = c13; c13 = p1[xi+2];
c20 = c21; c21 = c22; c22 = c23; c23 = p2[xi+2];
c30 = c31; c31 = c32; c32 = c33; c33 = p3[xi+2];
}
}
}
yn += ny - 3;
if (yn >= height) {
yn -= height;
yi += 1;
if (yi < ny - 2) {
p0 = p1; p1 = p2; p2 = p3; p3 = data[yi+2];
}
}
}
}
```

Here I'm presented with exactly the same situation regarding the last iteration as with the columns, and faced with the same choice. At this point I could call it quits and be satisfied with a reasonable implementation; I'll press further. Next time I'll convert the interpolator to operate on a linear array, opening the door to manipulating graphics directly.