Linear Algebra
Architecture
1.0 Derivation
The Faddeev algorithm computes the expression CX+D subject to the constraint AX=B, where C,D,A, and B are matrices and A is an rank NxN matrix (not necessarily full rank). It is convenient to describe the algorithm as an extended matrix a using the representation
(1)
The algorithm begins by adding a linear combination of the rows A|B to the rows C|D or
If W is chosen so that WA-C=0, then W=CA-1 and this substitution in the lower right hand corner provides the desired result there or
(2)
It can be seen that matrix operations are "programmable" based on the entries in (1), e.g.,
,
,
.
It is not necessary to actually compute the value of W; it is only necessary to annul the elements of C, a process that can be done using an LU decomposition algorithm in which a is decomposed into the product of a lower triangular matrix (l) and an upper triangular (u) matrix or

where the "-" above indicates these elements lij are not computed and each matrix is taken as 4x4. Here the matrix u is the desired final result and corresponds to the transformation from a to the form in (2). This has been achieved by stopping the factorization process at the point where the elements of C have been annulled. Consequently, the desired result from the lower right hand corner of (2) is
.
An explicit mathematical formula for the l's and u's above can be obtained by induction. That is,
From the mathematical expressions in (3) above it is possible to go directly to the coded form by replacing the summation by "add()":
for j to 2*N do
for
i to 2*N do
if i=1 and j>=1 and j<=2*N then u[i,j]:=a[i,j] fi;
if j>=i and i>1 and j<=2*N and i<=N then
u[i,j]:=a[i,j]-add(l[i,k]*u[k,j],k=1..i-1)
fi;
if j>N and i>N and j<=2*N and i<=2*N then
(4)
u[i,j]:=a[i,j]-add(l[i,k]*u[k,j],k=1..N)
fi;
if j=1 and i>=1 and i<=2*N then l[i,j]:=a[i,j]/u[j,j] fi;
if i>=j and j>1 and i<=2*N and j<=N then
l[i,j]:=(a[i,j]-add(l[i,k]*u[k,j],k=1..j-1))/u[j,j]
fi;
od
od;
Here, the loop limits have been replaced by the problem size parameter, N, where it has been assumed that each matrix is NxN in size. Solutions obtained are independent of the size parameter N, which is handled symbolically during the solution search.
While the code in (4) is suitable for input to SPADE, there are many different ways to code the Faddeev algorithm. Each of the different codings can result in different array implementations. In general it is best to assign unique variables to the quantities that you want to calculate. In (4) the result d is not called out separately, which limits the number of array implementations that SPADE can explore. The reason for this is that for (4) SPADE must find a single polygon in space-time that contains all the values of u, only some of which contain d. If instead the input code called out d separately, then SPADE would have an easier task because it would have more freedom to separately place the polygons that represent d and u. Therefore, (4) has been modified as follows:
for j to 2*N do
for i to 2*N do
if
i=1 and j>=1 and j<=N then u[i,j] := a[i,j] fi;
if i=1 and j>=1 and j<=N then b[i,j] := a[i,j+N] fi;
if j>=i and i>1 and j<=N then
u[i,j]:=a[i,j]-add(l[i,k]*u[k,j],k=1..i-1)
fi;
if j>=1 and i>=1 and j<=N and i<=N then
b[i,j]:=a[i,j+N]-add(l[i,k]*b[k,j],k=1..i-1)
(5)
fi;
if j>=1 and i>=1 and j<=N and i<=N then
d[i,j] := a[i+N,j+N]-add(l[i+N,k]*b[k,j],k=1..N)
fi;
if j=1 and i>=1 and i<=2*N then l[i,j]:=a[i,j]/u[j,j] fi;
if i>=j and j>1 and i<=2*N and j<=N then
l[i,j]:=(a[i,j]-add(l[i,k]*u[k,j],k=1..j-1))/u[j,j]
fi;
od
od;
where we have also made the substitution
![]()
2.0 Design Examples
2.1 Minimum Array Design
In this first mapping example the architectural constraints were set to find array designs with the least PE array area among the minimum time latency solutions. In addition, since computation of values of l in (5) require hardware-demanding division operations, it would be desirable to minimize them. One way of doing this is to require that the variable l be mapped such that it is projected (time aligned) onto the PE‑array in a line. In this case division operations would only be necessary in a linear array of PEs as opposed to a 2-D array of PEs. Finally, it is a common characteristic of systolic arrays, which are physically tied to sensors, to have sensor data stream into the array from one or more array-edge boundaries. This is also consistent with FPGA based virtual computers that contain a lot of buffer memory at the board inputs. This input data configuration can be forced by setting a constraint that eliminates exploration of designs that don't map (time align) a to a linear array of PEs on the edge of the complete array.
With these constraints set, a SPADE search discovered 2 unique solutions with time latency 5N‑2, each with ½ N(3N+1) PEs, one of which is shown in Fig. 1. The PE array in Fig. 1 is uniform in terms of the interconnection pattern and there are twelve different flows of data associated with the various variable dependencies, each of which moves along an orthogonal path defined by the arrows in Fig. 1. There are five additional dependencies in which a variable does not move spatially, but rather is updated and reused in the same PE. This corresponds to data "movement" along the time axis in space-time. (The picture in Fig. 1 shows a superposition of data flow at all times. The actual time variation of data flow is more complex, with the size of uniform sub-regions of data movement growing and shrinking with time.)

Figure 1. One of two minimum area designs for Faddeev algorithm with variables a and l constrained to appear on the array boundary (N=6).
2.2 Single Divider Implementation
It is known that the Faddeev algorithm can be mapped to a design that uses only one divider, so a useful question would be what code modifications would be necessary to achieve this. The code (5) shows only divisions by u[j,j]; consequently, a 1-D variable can replace a 2-D variable limiting the number of divisions to N with "u_inv[j]:=1/u[j,j]". Adding this to (5) and changing the divisions by u[j,j] to multiplications by u_inv[j] results in the code (11), where "...." refers to the other statements in (5) that don't involve division, yields
for j to 2*N do
for i to 2*N do
.....
if j<=N and j>=1 then
u_inv[j]:=1/u[j,j] fi;
if j=1 and i>=1 and i<=2*N then
l[i,j]:=a[i,j]*u_inv[j] fi;
if i>=j and j>1 and i<=2*N and j<=N
then
l[i,j]:=(a[i,j]-add(l[i,k]*u[k,j],k=1..j-1))*u_inv[j]
fi;
od
od;
The only way all divisions can occur in the same PE happens when u_inv[j] (a line in space-time) aligns with the time axis. Thus, an architectural constraint is set that only looks for these solutions. With this constraint set, SPADE finds one area-optimal design, that shown in Fig. 3, which has the same 5N-2 time latency as that in Fig. 1, but (2N-1)2 area. (Note that in Fig. 3 the single PE with the divider is in the upper right hand corner of the figure and isn't labeled.)

Figure 3. Single optimal design with one divider (N=6).
