## 2012-06-27

### 2D Interpolation, Part 3: Linear Array Accesses

Last time I've hoisted all accesses to the source array. This opens the door to being able to process linear arrays representing a matrix in row-major order by stenciling. Not only that, but I was able to completely eliminate the need to have a bordered array on input by explicitly replicating elements as needed. The first step is to use a row buffer into which to copy the elements to be processed. Instead of stenciling by indexing into the source array, I'll index into this row buffer. The code is identical to the one presented last time, except for the initialization, done by copying, and the indexing. Instead of four arrays p0, p1, p2 and p3 I use just one containing four rows back to front; this requires computing the indices by hand. To bring in the next row to interpolate, a pair of `arrayCopy`s make room in the row buffer and assign the new row in the opened space. Now both yi and xi cease to represent the index of the stencil over the source array, and I'm free to simplify the index arithmetic by offsetting their starting values:

```void interp0() {
final float[] p = new float[4 * nx];
for (int i = 0; i < 4; i++)
arrayCopy(data[i], 0, p, i*nx, nx);
int yr = 0;
int yi = 3;
int offset = 0;
for (int i = 0; i < height; i++) {
final float yf = (float) yr / (float) height;
float c00 = p[0*nx+0], c01 = p[0*nx+1], c02 = p[0*nx+2], c03 = p[0*nx+3];
float c10 = p[1*nx+0], c11 = p[1*nx+1], c12 = p[1*nx+2], c13 = p[1*nx+3];
float c20 = p[2*nx+0], c21 = p[2*nx+1], c22 = p[2*nx+2], c23 = p[2*nx+3];
float c30 = p[3*nx+0], c31 = p[3*nx+1], c32 = p[3*nx+2], c33 = p[3*nx+3];
int xr = 0;
int xi = 3;
for (int j = 0; j < width; j++) {
final float xf = (float) xr / (float) width;
float c = 0;
switch (method) {
case NEAREST:
c = yf < 0.5 ? xf < 0.5 ? c11 : c12 : xf < 0.5 ? c21 : c22;
break;
case BILINEAR:
c = linear(yf,
linear(xf, c11, c12),
linear(xf, c21, c22) );
break;
case BICUBIC:
c = cubic(yf,
cubic(xf, c00, c01, c02, c03),
cubic(xf, c10, c11, c12, c13),
cubic(xf, c20, c21, c22, c23),
cubic(xf, c30, c31, c32, c33) );
break;
}
pixels[offset++] = hue(c);
xr += nx - 3;
if (xr >= width) {
xr -= width;
xi += 1;
if (xi < nx) {
c00 = c01; c01 = c02; c02 = c03; c03 = p[0*nx+xi];
c10 = c11; c11 = c12; c12 = c13; c13 = p[1*nx+xi];
c20 = c21; c21 = c22; c22 = c23; c23 = p[2*nx+xi];
c30 = c31; c31 = c32; c32 = c33; c33 = p[3*nx+xi];
}
}
}
yr += ny - 3;
if (yr >= height) {
yr -= height;
yi += 1;
if (yi < ny) {
arrayCopy(p, nx, p, 0, 3*nx);
arrayCopy(data[yi], 0, p, 3*nx, nx);
}
}
}
}
```

Processing a matrix as a linear array in row-major order is a matter of adjusting the sources to the `arrayCopy`s. I've renamed the source array as srcpix and its dimensions as nrows and ncols so that later I can modify the code to avoid the need for an explicit border without breaking the existing `interp1()`:

```final float[] srcpix = {
0.60, 0.60, 0.48, 0.24, 0.60, 0.60,
0.60, 0.60, 0.48, 0.24, 0.60, 0.60,
0.00, 0.00, 0.36, 0.12, 0.48, 0.48,
0.24, 0.24, 0.48, 0.60, 0.12, 0.12,
0.12, 0.12, 0.24, 0.48, 0.36, 0.36,
0.12, 0.12, 0.24, 0.48, 0.36, 0.36
};
final int nrows = 6;
final int ncols = 6;

void interp0() {
final float[] p = new float[4 * ncols];
arrayCopy(srcpix, 0, p, 0, 4 * ncols);
int yr = 0;
int yi = 3;
int offset = 0;
for (int i = 0; i < height; i++) {
final float yf = (float) yr / (float) height;
float c00 = p[0*ncols+0], c01 = p[0*ncols+1], c02 = p[0*ncols+2], c03 = p[0*ncols+3];
float c10 = p[1*ncols+0], c11 = p[1*ncols+1], c12 = p[1*ncols+2], c13 = p[1*ncols+3];
float c20 = p[2*ncols+0], c21 = p[2*ncols+1], c22 = p[2*ncols+2], c23 = p[2*ncols+3];
float c30 = p[3*ncols+0], c31 = p[3*ncols+1], c32 = p[3*ncols+2], c33 = p[3*ncols+3];
int xr = 0;
int xi = 3;
for (int j = 0; j < width; j++) {
final float xf = (float) xr / (float) width;
float c = 0;
switch (method) {
case NEAREST:
c = yf < 0.5 ? xf < 0.5 ? c11 : c12 : xf < 0.5 ? c21 : c22;
break;
case BILINEAR:
c = linear(yf,
linear(xf, c11, c12),
linear(xf, c21, c22) );
break;
case BICUBIC:
c = cubic(yf,
cubic(xf, c00, c01, c02, c03),
cubic(xf, c10, c11, c12, c13),
cubic(xf, c20, c21, c22, c23),
cubic(xf, c30, c31, c32, c33) );
break;
}
pixels[offset++] = hue(c);
xr += ncols - 3;
if (xr >= width) {
xr -= width;
xi += 1;
if (xi < ncols) {
c00 = c01; c01 = c02; c02 = c03; c03 = p[0*ncols+xi];
c10 = c11; c11 = c12; c12 = c13; c13 = p[1*ncols+xi];
c20 = c21; c21 = c22; c22 = c23; c23 = p[2*ncols+xi];
c30 = c31; c31 = c32; c32 = c33; c33 = p[3*ncols+xi];
}
}
}
yr += nrows - 3;
if (yr >= height) {
yr -= height;
yi += 1;
if (yi < nrows) {
arrayCopy(p, ncols, p, 0, 3*ncols);
arrayCopy(srcpix, yi*ncols, p, 3*ncols, ncols);
}
}
}
}
```

Now in a slow-motion approach towards eliminating the repeated border, I'll first reduce the array dimensions to their true value. In the preceding code I substitute nrows := nrows−2 and ncols := ncols−2, and simplify all indices:

```final int nrows = 4;
final int ncols = 4;

void interp0() {
final float[] p = new float[4*(ncols+2)];
arrayCopy(srcpix, 0, p, 0, 4*(ncols+2));
int yr = 0;
int yi = 3;
int offset = 0;
for (int i = 0; i < height; i++) {
final float yf = (float) yr / (float) height;
float c00 = p[0*ncols+0], c01 = p[0*ncols+1], c02 = p[0*ncols+2], c03 = p[0*ncols+3];
float c10 = p[1*ncols+2], c11 = p[1*ncols+3], c12 = p[1*ncols+4], c13 = p[1*ncols+5];
float c20 = p[2*ncols+4], c21 = p[2*ncols+5], c22 = p[2*ncols+6], c23 = p[2*ncols+7];
float c30 = p[3*ncols+6], c31 = p[3*ncols+7], c32 = p[3*ncols+8], c33 = p[3*ncols+9];
int xr = 0;
int xi = 3;
for (int j = 0; j < width; j++) {
final float xf = (float) xr / (float) width;
float c = 0;
switch (method) {
case NEAREST:
c = yf < 0.5 ? xf < 0.5 ? c11 : c12 : xf < 0.5 ? c21 : c22;
break;
case BILINEAR:
c = linear(yf,
linear(xf, c11, c12),
linear(xf, c21, c22) );
break;
case BICUBIC:
c = cubic(yf,
cubic(xf, c00, c01, c02, c03),
cubic(xf, c10, c11, c12, c13),
cubic(xf, c20, c21, c22, c23),
cubic(xf, c30, c31, c32, c33) );
break;
}
pixels[offset++] = hue(c);
xr += ncols - 1;
if (xr >= width) {
xr -= width;
xi += 1;
if (xi < ncols+2) {
c00 = c01; c01 = c02; c02 = c03; c03 = p[0*ncols+xi+0];
c10 = c11; c11 = c12; c12 = c13; c13 = p[1*ncols+xi+2];
c20 = c21; c21 = c22; c22 = c23; c23 = p[2*ncols+xi+4];
c30 = c31; c31 = c32; c32 = c33; c33 = p[3*ncols+xi+6];
}
}
}
yr += nrows - 1;
if (yr >= height) {
yr -= height;
yi += 1;
if (yi < nrows+2) {
arrayCopy(p, (ncols+2), p, 0, 3*(ncols+2));
arrayCopy(srcpix, yi*(ncols+2), p, 3*ncols+6, ncols+2);
}
}
}
}
```

In the second part of the slo-mo maneuver, I'll eliminate the repeated values from the source matrix. Every row repeats the first and last element, and both the first and the last rows are repeated in turn. I achieve this by careful stuffing into the row buffer:

```final float[] srcpix = {
0.60, 0.48, 0.24, 0.60,
0.00, 0.36, 0.12, 0.48,
0.24, 0.48, 0.60, 0.12,
0.12, 0.24, 0.48, 0.36,
};
final int nrows = 4;
final int ncols = 4;

void interp0() {
final float[] p = new float[4*ncols+8];
p[1*ncols+2] = srcpix[0*ncols]; arrayCopy(srcpix, 0*ncols, p, 1*ncols+3, ncols); p[2*ncols+3] = srcpix[1*ncols-1];
p[2*ncols+4] = srcpix[1*ncols]; arrayCopy(srcpix, 1*ncols, p, 2*ncols+5, ncols); p[3*ncols+5] = srcpix[2*ncols-1];
p[3*ncols+6] = srcpix[2*ncols]; arrayCopy(srcpix, 2*ncols, p, 3*ncols+7, ncols); p[4*ncols+7] = srcpix[3*ncols-1];
arrayCopy(p, 1*ncols+2, p, 0, ncols+2);
int yr = 0;
int yi = 2;
int offset = 0;
for (int i = 0; i < height; i++) {
final float yf = (float) yr / (float) height;
float c00 = p[0*ncols+0], c01 = p[0*ncols+1], c02 = p[0*ncols+2], c03 = p[0*ncols+3];
float c10 = p[1*ncols+2], c11 = p[1*ncols+3], c12 = p[1*ncols+4], c13 = p[1*ncols+5];
float c20 = p[2*ncols+4], c21 = p[2*ncols+5], c22 = p[2*ncols+6], c23 = p[2*ncols+7];
float c30 = p[3*ncols+6], c31 = p[3*ncols+7], c32 = p[3*ncols+8], c33 = p[3*ncols+9];
int xr = 0;
int xi = 3;
for (int j = 0; j < width; j++) {
final float xf = (float) xr / (float) width;
float c = 0;
switch (method) {
case NEAREST:
c = yf < 0.5 ? xf < 0.5 ? c11 : c12 : xf < 0.5 ? c21 : c22;
break;
case BILINEAR:
c = linear(yf,
linear(xf, c11, c12),
linear(xf, c21, c22) );
break;
case BICUBIC:
c = cubic(yf,
cubic(xf, c00, c01, c02, c03),
cubic(xf, c10, c11, c12, c13),
cubic(xf, c20, c21, c22, c23),
cubic(xf, c30, c31, c32, c33) );
break;
}
pixels[offset++] = hue(c);
xr += ncols - 1;
if (xr >= width) {
xr -= width;
xi += 1;
if (xi < ncols+2) {
c00 = c01; c01 = c02; c02 = c03; c03 = p[0*ncols+xi+0];
c10 = c11; c11 = c12; c12 = c13; c13 = p[1*ncols+xi+2];
c20 = c21; c21 = c22; c22 = c23; c23 = p[2*ncols+xi+4];
c30 = c31; c31 = c32; c32 = c33; c33 = p[3*ncols+xi+6];
}
}
}
yr += nrows - 1;
if (yr >= height) {
yr -= height;
yi += 1;
if (yi <= nrows) {
arrayCopy(p, (ncols+2), p, 0, 3*(ncols+2));
if (yi < nrows) {
p[3*ncols+6] = srcpix[yi*ncols];
arrayCopy(srcpix, yi*ncols, p, 3*ncols+7, ncols);
p[4*ncols+7] = srcpix[(yi+1)*ncols-1];
}
}
}
}
}
```

Note that during initialization the first row is duplicated by an explicit copy. Note also that the inner loop doesn't change at all. The outer loop does; since the first duplicated row was eliminated, yi initially is 2, not 3; since the last duplicated row was eliminated, there new bounds for yi is nrows, not nrows + 2, and the conditions simplify accordingly. Since the last row is repeated, it must be processed even when yi is exactly nrows. Finally, since the new row must repeat its first and last elements, I must use the same schema as for the initialization. In the case of the repeated last row, the first `arrayCopy` moves the three last rows to the front; by doing nothing the very last row appears in the last two consecutive strides of the row buffer.

Believe it or not, I consider this code rather compact. Normally, the special case of repeated elements is handled by unrolling the loops and placing a loop header and a loop footer specialized to the appropriate values. By using the row buffer, I can avoid duplicating code, for a cost in O(N) storage. As it is, the only transformation required would be to extract the `switch` from the inner loop into separate the interpolators and simplify the nearest-neighbor and bilinear cases by eliminating the unneeded elements. Before doing that, it would be best to tackle the problem of processing ARGB sources, by unpacking the channels in the source pixels, interpolating each and repacking them into a destination pixel. That can be done neatly with the row buffers for maximal inner-loop speed. Until the next time!