CSC400 Latex Paper Example


A Paper in LaTeX

--D. Thiebaut 21:32, 12 October 2009 (UTC) Here is a paper published at a conference. It is formatted in the two-column style accepted by ACM and IEEE conferences.

Note that this paper does not use a separate bibliography file.

The Main Document

\documentclass[10pt]{article}
\usepackage{latexcad}
\usepackage{setspace}

\setlength{\textheight}{22cm} \setlength{\textwidth}{17cm}
\topmargin 283mm \advance \topmargin -\textheight \divide
\topmargin by 4 \advance \topmargin -1.0in \leftmargin 216mm
\advance \leftmargin -\textwidth \divide \leftmargin by 2 \advance
\leftmargin -1.0in \oddsidemargin \leftmargin \evensidemargin
\leftmargin


\newtheorem{teorema}{Theorem}
\newtheorem{exemplu}{Example}
\newtheorem{definitie}{Definition}

\sloppy

\twocolumn


%\doublespacing

\begin{document}

%\title{{\bf An Integral Parallel Architecture for Embedded Computation}}
\title{{\bf {\em Not Multi-, but Many-Core}:\\Designing Integral Parallel Architectures for Embedded Computation}}
\author{Mihaela Mali\c ta\footnote{St. Anselm College, {\tt mmalita@anselm.edu}},
Dominique Thi\' ebaut\footnote{Smith College, {\tt
thiebaut@cs.smith.edu}}, Gheorghe \c Stefan\footnote{BrightScale
Inc., {\tt gstefan@brightscale.com}}}

\date{Paper submission deadline: 4th of May, 2007}
\maketitle

% http://www.arch.cs.titech.ac.jp/ALPS/


\begin{abstract}
The recent switch to fully programmable parallel architectures in
the embedded system domain must be accompanied by appropriate
solutions that allow all the corner cases
that usually occur in real applications to be implemented
efficiently.  We introduce the {\em
integral parallel architecture} (IPA) as currently the best way to
support intensive data computation in system-on-a-chip
(SoC) implementations with {\em small area} \& {\em low power}.
An IPA supports all three possible styles of parallelism: {\em
data}, {\em time}, and {\em speculative}.

As an illustrative example, we present the {\bf {\em BA1024}}
chip, a fully programmable SoC designed by {\em BrightScale, Inc.}
for HDTV codecs.  Its  main performance figures include $60 \,
GOPS/Watt$ and $2 \, GOPS/mm^{2}$, illustrating an efficient IPA
approach in embedded computation.
\end{abstract}

%\twocolumn

\section{Introduction}

Technology is currently going through several important evolutions
that researchers, in particular Borkar \cite{Borkar05}
and Asanovic \cite{Asanovic06}, have identified:
\begin{itemize}
   \item the slowdown in the rate at which clock speeds increase,
   \item the switch from pure standard functionality to more specific
   functionality in video, graphics, and performance-hungry
   applications requiring full
   programmability, and
   \item the replacement of Application Specific Integrated Circuits (ASICs) by programmable Systems on a Chip (SoC), due to
   development costs and increasing technological difficulties.
\end{itemize}

Borkar and Asanovic propose several new approaches to computer
architectures to respond to these changes and curb their effects:
\begin{itemize}
   \item application-domain-oriented architectures in two versions: {\bf many}- or {\bf
   multi}-processors, or
   \item computation-type-oriented architectures, such as the Dwarf
   approach in Asanovic's paper \cite{Asanovic06}.
\end{itemize}
Intel's Recognition, Mining and Synthesis (RMS) white paper is a
good example of the first type of architecture \cite{Borkar05,Dubey05}.

In this paper we propose two solutions to address some of the
limitations imposed by the current technology shifts:
\begin{itemize}
\item
an optimized approach for {\em low-power \& small-area} embedded
computation in SoC, and
\item a way
to remove some of the limitations emphasized by Asanovic
\cite{Asanovic06} in the 13th Dwarf, which refers to
``automaton-style computation.''
\end{itemize}

The validity of our solutions rests  on two hypotheses.  One is
that programmable SoCs can compete with ASICs only if a fully
programmable parallel architecture is used, because a circuit is
an intrinsically parallel system.  The other hypothesis holds that
the computational model of partial recursive functions
\cite{Kleene36} must be able to treat equally well both circuits
and parallel programmable systems.

Our approach naturally leads to two main results:
\begin{itemize}
   \item the definition of an IPA\footnote{The reader is invited to see our approach as
being different from {\em heterogeneous computing systems},
which are systems with a range of diverse computing resources that
can be local to one another or geographically distributed.} for
intensive computations in embedded systems, and
   \item the proposal of a more nuanced taxonomy of parallel computation as opposed to the
   more structural and functional approach first introduced by Flynn
   \cite{Flynn72}.
\end{itemize}
Both are exemplified by the {\em BA1024}, which we believe is the
first embodiment of an IPA for the HDTV market, and which we use as
an illustrative example in this paper.

Parallel computation is becoming ever more ubiquitous, and
manifests itself in two extreme forms, one in {\em complex}
computation and the other in {\em intense} data-parallel
computation. Our paper deals with the second form.

The remainder of this paper is structured as follows.  In Section
2 we introduce different views and trends in the parallel
computing landscape as seen by researchers at Intel and at
Berkeley.  In Section 3 we present partial recursiveness and show
how it can be used to identify different parallel constructs.
This leads us in Section 4 to the definition of Integral Parallel
Architectures and the two forms in which they appear.
Section 5 presents a state-of-the-art circuit for video decoding
and shows how it embodies the different types of parallelism we
have presented.  Section 6 concludes with some pertinent remarks
about current designs and the trends we feel they should follow
in order to match the performance levels required by
tomorrow's applications.

\section{Comments on Intel's RMS \& Berkeley's Dwarf Approaches}

Intel's approach distinguishes between the {\em multi}-core
era and the {\bf many}-core era, and between scalar, parallel,
and massively parallel applications. For embedded
systems a many-core approach seems to be the solution, because we
must substitute SoCs implemented in ASIC technology with SoCs
implemented as fully programmable solutions. In this case, the
functional granularity of ASICs must have a counterpart in the
granularity of the parallel programmable system used in
high-performance embedded applications.

Many-core approaches also present low-power solutions where simple
processing elements (PEs) are fully utilized in each clock cycle
to perform either simple functions, one per clock cycle, or
complex functions composed of simpler ones.

In his paper describing the landscape of parallel computing
research as seen from Berkeley, Asanovic \cite{Asanovic06} makes
the following statement:
\begin{quote}
{\em We argue that these two ends of the computing spectrum {\em
[embedded and high performance computing]} have more in common
looking forward than they did in the past. First, both are
concerned with power, $\dots$. Second, both are concerned with
hardware utilization, $\dots$. Third, $\dots$ the importance of
software reuse must increase.}
\end{quote}

The architecture we propose here is mainly aimed at solving the
first two issues -- {\em low power \& hardware utilization} -- using
many simple programmable, though not reconfigurable, elements. To
define this architecture, we start from the computational model
which has a very direct circuit or computing network
implementation: the {\em partial recursive function} model, which
we present in the next section.




\section{From Partial Recursiveness to Parallel Computation}

We claim that the most suggestive classic computational model for
defining parallel architectures is the model of partial recursive
functions \cite{Brainerd74}, because the rules defining it have a
direct correspondence with circuits, the intrinsic parallel
support for computation.

\subsubsection{Composition \& basic parallel structures} The first
rule, composition, provides the basic parallel structures to be
used in defining all the forms of parallelism.

Assume we have $m$ $n$-ary functions $h_{i}(x_{0}, \ldots
x_{n-1})$, for $i= 0, 1, \ldots m-1$, along with an $m$-ary
function $g(y_{0}, \ldots y_{m-1})$.

We can  define the composition rule by combining them as follows:
$$f(x_{0}, \ldots x_{n-1}) = g(h_{0}(x_{0}, \ldots x_{n-1}),  \ldots h_{m-1}(x_{0}, \ldots x_{n-1}))$$

The physical structure associated with this concept, and
containing simple circuits or simple programmable machines is
illustrated in Figure \ref{composition}.

\placedrawing[h]{draw1_1.lp} {{\bf The physical structure
associated with the composition rule.} {\small The composition of
the function $g$ with the functions $h_{0}, \ldots, h_{m-1}$
implies a two-level system. The first level, performing $m$
computations in {\bf parallel}, is {\bf serially} connected with the
second level, which performs a reduction function.}}{composition}

This suggests the following four separate and meaningful forms of
parallelism, as illustrated in Figure \ref{comp}:
\begin{enumerate}
   \item {\bf data parallel composition}: when $n=m$, each function
   $h_{i} = h$ depends on a single input
   variable $x_{i}$, for $i=0, 1, \ldots n-1$, and $g$ performs the identity
   function (see Figure \ref{comp}a). Given an input  {\bf vector} containing
$n$ scalars:
$$\{{\bf X}\} = \{x_{0}, x_{1},  \ldots, x_{n-1}\}$$
the result is another vector:
$$\{h(x_{0}), h(x_{1}),  \ldots, h(x_{n-1})\}.$$

\placedrawing[h]{draw1_2.lp} {{\bf The four simple forms of
composition.} {\small   {\bf a.} Data parallel composition. {\bf
b.} Speculative composition. {\bf c.} Serial composition. {\bf d.}
Reduction composition.}}{comp}

   \item {\bf speculative composition}: when $n=1$, i.e., $x_{0} = x$,
   and $g$ performs the identity
   function (see Figure \ref{comp}b). In other words, it computes a vector of functions
   $${\bf H} = [h_{0}(x), \ldots h_{n-1}(x)]$$
   on the same {\bf scalar} input $x$, generating a vector of results:
$${\bf H}(x) = \{h_{0}(x), h_{1}(x),  \ldots, h_{n-1}(x)\}.$$

   \item {\bf serial composition}: This corresponds to the case $n = m = 1$, as shown in Figure
   \ref{comp}c.  Here, a {\em pipe} of different machines receives
   a {\bf stream} of $n$ scalars as input:
   $$<{\bf X}> = <x_{0}, x_{1},  \ldots, x_{n-1}>$$
   and provides another stream of scalars
   $$<f(x_{0}), f(x_{1}),  \ldots, f(x_{n-1})>.$$
   In the general case, the function $f(x)$ is a composition of more
   than two functions $h$ and $g$. Thus, the function $f$ can be
   expressed as a vector of functions ${\bf F}$ receiving as input a
   data stream ${\bf <X>}$:
   $${\bf F(<X>)} = [f_{0}(x), \ldots f_{p-1}(x)].$$
   In Figure \ref{comp}c, ${\bf F}(<X>) = [h(x), g(x)]$.

   \item {\bf reduction composition}: In this last case, each $h_{i}$ performs the identity
   function, as illustrated in Figure \ref{comp}d.  The input is
   the vector
$\{x_{0}, \ldots, x_{n-1}\}$, which the function
$g$ reduces to the scalar $g(x_{0}, \ldots x_{n-1})$; applied
repeatedly, $g$ transforms a stream of vectors into a stream of scalars.
\end{enumerate}
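
For concreteness, these four forms can be sketched in ordinary
sequential C code, with plain loops standing in for the parallel
hardware; the particular functions used below are arbitrary
illustrations of ours, not part of any specific machine model:
{\small
\begin{verbatim}
/* Illustrative sketch of the four simple forms of composition;
   sequential C stands in for the parallel hardware.            */
#include <stdio.h>

#define N 4

static int h(int x)          { return x + 1; }  /* an arbitrary h        */
static int g1(int x)         { return 2 * x; }  /* a second pipe stage   */
static int h_i(int i, int x) { return x + i; }  /* a family h_0..h_{N-1} */
static int g_red(const int *v, int n)           /* a reduction g         */
{ int s = 0; for (int k = 0; k < n; k++) s += v[k]; return s; }

int main(void)
{
    int X[N] = {1, 2, 3, 4}, Y[N], S[N], x = 7;

    /* a. data-parallel: y_i = h(x_i), conceptually all at once   */
    for (int i = 0; i < N; i++) Y[i] = h(X[i]);

    /* b. speculative: every h_i applied to the same scalar x     */
    for (int i = 0; i < N; i++) S[i] = h_i(i, x);

    /* c. serial: a two-stage pipe g1 o h applied to a stream     */
    for (int i = 0; i < N; i++) printf("%d ", g1(h(X[i])));

    /* d. reduction: the vector Y collapses to a single scalar    */
    printf("\nreduction: %d\n", g_red(Y, N));
    return 0;
}
\end{verbatim}
}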

The composition rule provides the context for defining computation
using the following basic concepts:
\begin{description}
   \item[scalar]: $x$
   \item[vector]: $\{{\bf X}\} = \{x_{0}, x_{1},  \ldots, x_{n-1}\}$
   \item[stream]: $<{\bf X}> = <x_{0}, x_{1},  \ldots, x_{n-1}>$
   \item[function]: $f(x)$
   \item[vector of functions]:
       \begin{itemize}
           \item ${\bf F(<X>)} = [f_{0}(x), \ldots f_{p-1}(x)]$
           applied on streams
           \item ${\bf F(x)} = [f_{0}(x), \ldots f_{p-1}(x)]$
           applied on scalars.
       \end{itemize}
\end{description}

We now have all that is required to define the concepts of
primitive recursion and of minimalization in our context.

\subsubsection{Primitive recursive rule}

The primitive recursive rule computes the function $f(x,y)$ as
follows:

$$f(x,y) = h(x, f(x, y-1))$$

where $f(x,0) = g(x)$. This rule can be translated into the
following serial composition:
$$f(x,y) = h(x, h(x, h(x, \ldots h(x, g(x)) \ldots ))).$$

There are two ways to implement the primitive recursive rule in
parallel. In both cases we assume that a large amount of
data is available for the computation, in the form of vectors or
streams of data ready for input into a primitive recursive
function.


If the function $f(x,y)$ is to be computed for the vector of
scalars $\{{\bf X}\} = \{y_{0}, y_{1},  \ldots, y_{n-1}\}$, then a
data-parallel structure is used. In this case each machine
computes the function $f(x,y_{i})$ using a local {\em data
loop}.  The resulting computation takes $\max(y_{0}, y_{1},
\ldots, y_{n-1})$ ``cycles.''

If, on the other hand, the function $f(x,y)$ is to be computed for
a stream of scalars, then a time-parallel structure is used. A
``pipe'' of $n$ machines is selected.  The pipe receives in each
``cycle'' a new scalar from the stream of scalars. In cases where
$y>n$, a {\em data loop} can be added, connecting the output
of the pipe back to its input.
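
Both schemes can be illustrated with a short sequential C sketch;
the particular $g$, $h$, and sizes below are ours, chosen only for
illustration:
{\small
\begin{verbatim}
/* Illustrative sketch: evaluating the primitive recursive
   f(x,y) = h(x, f(x,y-1)), with f(x,0) = g(x), in both styles. */
#include <stdio.h>

#define N 4                    /* number of PEs / pipe stages */

static int g(int x)        { return x; }
static int h(int x, int a) { return a + x; }

/* data-parallel style: PE i runs its own local data loop y_i times;
   the array finishes after max(y_0, ..., y_{N-1}) "cycles"          */
static void data_parallel(int x, const int y[N], int out[N])
{
    for (int i = 0; i < N; i++) {
        int a = g(x);
        for (int t = 0; t < y[i]; t++) a = h(x, a);  /* local data loop */
        out[i] = a;
    }
}

/* time-parallel style: an N-stage pipe, one application of h per stage;
   a data loop would feed the output back to the input when y > N       */
static int pipelined(int x, int y)
{
    int a = g(x);
    for (int stage = 0; stage < y; stage++) a = h(x, a);
    return a;
}

int main(void)
{
    int y[N] = {1, 2, 3, 4}, out[N];
    data_parallel(5, y, out);
    for (int i = 0; i < N; i++) printf("f(5,%d)=%d  ", y[i], out[i]);
    printf("\npipe: f(5,%d)=%d\n", N, pipelined(5, N));
    return 0;
}
\end{verbatim}
}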


\subsubsection{Minimalization}

The minimalization rule assumes a function $f(x)$ defined as
follows:
$$f(x) = \min(y)[m(x,y)=0]$$ The value of $f(x)$ is the minimum $y$ for which $m(x,y)=0$.

Here as well, two different parallel solutions are possible for
the minimalization problem: one uses data-parallel structures,
while the other uses time-parallel structures.


The first implementation of the minimalization function is the
brute-force approach; it uses the speculative structure
represented in Figure \ref{comp}b, where each block computes a
function which returns a pair containing a predicate and a scalar:
$$h_{i} = \{(m(x,i)=0), i\}$$
after which a reduction step (using a structure belonging to the
class represented in Figure \ref{comp}d) selects the smallest $i$
from all pairs having the form $\{1, i\}$, if any,  that were
generated on the previous speculative composition level.  All
pairs of the form $\{0, i\}$ are ignored.
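
A sequential sketch of this brute-force scheme follows; the test
function $m$ and the search window are arbitrary choices of ours,
used only for illustration:
{\small
\begin{verbatim}
/* Illustrative sketch of brute-force (speculative) minimalization:
   n speculative blocks evaluate m(x,i) for i = 0..n-1 on the same x;
   a reduction then selects the smallest i with m(x,i) == 0, if any.  */
#include <stdio.h>

#define N 8

static int m(int x, int y) { return x - y * y; }  /* an arbitrary m(x,y) */

/* returns min{ y : m(x,y) == 0 }, or -1 if no such y < N exists */
static int minimalize(int x)
{
    int pred[N];

    /* speculative composition: all candidates evaluated "at once" */
    for (int i = 0; i < N; i++) pred[i] = (m(x, i) == 0);

    /* reduction composition: select the first index whose predicate holds */
    for (int i = 0; i < N; i++) if (pred[i]) return i;
    return -1;
}

int main(void)
{
    printf("f(9)  = %d\n", minimalize(9));   /* 3, since 9 - 3*3 == 0  */
    printf("f(10) = %d\n", minimalize(10));  /* -1 within this window  */
    return 0;
}
\end{verbatim}
}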

The second implementation of the minimalization operation shows up
in time-parallel environments, where speculation can be used to
speed up pipeline processing. {\bf Reconfigurable pipes} can be
conceived and implemented using special reduction features
distributed along a pipe.  We formalize this concept now.

We define a function pipe as the function vector:
$${\bf P} = [f_{0}(x), \ldots f_{p-1}(x)]$$
where $y_{i} = f_{i}(x)$, for $i = 0, \ldots p-1$. The associated
reconfigurable pipe transforms the original pipe characterized by:
$${\bf P} = [\ldots f_{i}(y_{i-1}), \ldots]$$
into a pipe characterized  by:
$${\bf P} = [\ldots f_{i}(y_{i-1}, \ldots y_{i-s}), \ldots]$$
where $f_{i}(y_{i-1}, \ldots y_{i-s})$ is a function or a program
which decides in each step, or ``cycle,'' which variable to use in
the current computation, selecting\footnote{Selection is one of
the simplest reduction functions.} one of the $\{y_{i-1}, \ldots
y_{i-s}\}$ variables with the help of an $s$-input selector. The
maximum {\em degree of speculation} in this case is the index $s$.

\subsubsection{Examples}

We now present two typical examples of minimalization, and how
speculative composition can be used to solve them and accelerate
the computation.

\begin{exemplu}
Let $f(x,y)$ be a function $f: {\bf N}^{2} \rightarrow {\bf Z}$,
where ${\bf N}$ denotes the set of positive integers and ${\bf Z}$
the set of integers. For $x=a$, the problem is to find all the
successive values $i$ and $i+1$ (if any) for which
$sign(f(a,i)) \neq sign(f(a,i+1))$.  In other words, the problem is
to find the consecutive indices at which $f(a,\cdot)$ changes sign.
\end{exemplu}
One way of solving this problem is with a pool of machines
performing speculative composition, as illustrated in Figure
\ref{comp}b, where $h_{i} = f(a,i)$, followed by an appropriate
reduction composition, illustrated in Figure \ref{comp}d.
$\diamond$
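
As a concrete sketch of this scheme (with an arbitrary $f$ of our
own choosing, mapping into the integers so that sign changes can
occur):
{\small
\begin{verbatim}
/* Illustrative sketch of Example 1: n speculative blocks evaluate
   f(a,i) in parallel; a reduction reports the indices i where the
   sign of f(a,i) differs from the sign of f(a,i+1).                */
#include <stdio.h>

#define N 8

static int f(int a, int i) { return a - i * i; }  /* an arbitrary f(x,y) */

int main(void)
{
    int a = 10, v[N];

    /* speculative composition: h_i = f(a,i) for all i */
    for (int i = 0; i < N; i++) v[i] = f(a, i);

    /* reduction composition: report sign changes between neighbours */
    for (int i = 0; i < N - 1; i++)
        if ((v[i] < 0) != (v[i + 1] < 0))
            printf("sign change between i=%d and i=%d\n", i, i + 1);
    return 0;
}
\end{verbatim}
}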

\begin{exemplu}
Consider a pipeline in which we must compute the following
conditional expression in one of the middle stages:
\begin{center}
{\tt z = z[n-1] ? z + (x-c) : z + (x+c);}
\end{center}
where {\tt c} is a constant value.
\end{exemplu}
To maximize speed, the pipeline must evaluate $x-c$ and $x+c$ in
parallel, but only one of the resulting values is fed to the next
stage.

The physical structure of the pipe must be dynamically
reconfigured in order to accommodate the use of a 2-PE speculative
composition array, as illustrated in Figure \ref{comp}b.

If the variable $x$ is computed in Stage $i$:
$$x=y_{i}=f_{i}(\ldots)$$ then  Stage $i+1$ computes
$$x-c=y_{i+1}=f(y_{i})$$  Stage $i+2$ computes
$$x+c=y_{i+2}=f(y_{i})$$ and Stage $i+3$ computes
$$z=y_{i+3}=f(y_{i+1}, y_{i+2}).$$
The pipe is configured as a ``cross,'' using stages $i+1$ and
$i+2$ as a speculative ``bar'' across the pipe. The degree of
speculation in this case is $s=2$. $\diamond$
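
The effect of the speculative ``bar'' can be mimicked in plain C:
both branches are computed (as in stages $i+1$ and $i+2$) and a
selector keeps one of them (the reduction in stage $i+3$); the bit
width of {\tt z} is an assumption of this sketch:
{\small
\begin{verbatim}
/* Illustrative sketch of Example 2: both branches of the conditional
   are computed speculatively, then one result is selected.  Plain C
   stands in for the reconfigured pipe; z is assumed 16 bits wide.    */
#include <stdio.h>

static int stage(int z, int x, int c)
{
    int minus = x - c;               /* speculative branch (stage i+1)   */
    int plus  = x + c;               /* speculative branch (stage i+2)   */
    int msb   = (z >> 15) & 1;       /* z[n-1] for n = 16                */
    return z + (msb ? minus : plus); /* selection (stage i+3), s = 2     */
}

int main(void)
{
    printf("%d\n", stage(40000, 10, 3)); /* msb set:   takes x - c */
    printf("%d\n", stage(5,     10, 3)); /* msb clear: takes x + c */
    return 0;
}
\end{verbatim}
}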


In order to implement speculative execution there is no need for
any special features in a data-parallel environment. For the
time-parallel environment, however, simple reduction features, or selection
functions\footnote{
             This function is called {\em projection} in the theory
             of partial recursive functions.},
must be added in each pipeline stage. This results in a {\em
2-dimension $n \times s$ pipe}, i.e., an $n$-stage pipe where each
stage is able to select its input from the output of one of
the previous $s$ stages. This 2-dimension pipe has a maximum depth
of $n$ and a maximum degree of speculation of $s$.

In conclusion, we see that the minimalization rule can be
implemented using all the simple parallel resources, each one the
embodiment of a simple form of composition.

Next we use the different forms of composition introduced above to
propose a new taxonomy of parallel architectures.

\subsection{Functional Taxonomy of Parallel Computing}

The previously identified simple forms of compositions, all
summarized in Figure \ref{comp}, yield a functional taxonomy of
parallel computation which intersects with Flynn's original
taxonomy, and which we categorize as follows:
\begin{description}
   \item[data-parallel computing]: this form uses operators that take vectors as arguments
   and return vectors, scalars (by reduction operations), or
   streams (input values for time-parallel computations).  This form is very similar to
   Flynn's SIMD machine \cite{Flynn72}.
   \item[time-parallel computing]: this form uses operators that take streams as arguments
   and return streams, scalars, or
   vectors (which can be used as input values for data-parallel computations).
   This form is akin to MIMD machines.
   \item[speculative-parallel computing]: in this form, operators take scalars
   as arguments and return vectors reduced
   to scalars using selection.  This form is used mainly
   to speed up time-parallel computations, and it contains
    a true MISD-like structure.
 \end{description}

Thus, by positioning ourselves in the context of the partial
recursive functions model, we argue that we can devise a
functional taxonomy of parallel computing which better encompasses
today's architectures, because it uses {\em functions and
variables} rather than instructions and data.

Because we are now in a new theoretical framework, we must define
what an architecture is in this context, and also the machines
that implement this architecture.

In the next section we present the concept of {\em Integral
Parallel Architecture}, or IPA, as a parallel architecture
featuring a multitude of the above-mentioned parallel forms.


\section{Integral Parallel Architectures \& Low-Power Embedded Computation}

In today's technological landscape, high-performance \& low-power embedded
systems compete with ASICs.  We believe that the only way to compete
with the ASIC approach is to reduce the granularity of the
programmable computing elements to the level of the circuits
performing the application-oriented functions. To this end, one
must rely on small many-processors, rather than on large
multi-processors.


\subsection{Integral Parallel Architecture: A Definition}

Two opposite, extreme versions of IPAs exist today:
\begin{itemize}
   \item {\bf complex IPA}s, with all types of parallel mechanisms
   tightly {\em interleaved}.
   \item {\bf intensive IPA}s, with parallel
   mechanisms that are highly {\em segregated}.
\end{itemize}

Current high-performance processors have a complex IPA, because
they implement all types of parallelism in a highly integrated
fashion. They are examples of the first type of IPA. Indeed, a
pipelined super-scalar machine with speculative execution performs
{\em on the same structure} -- a super-speculative pipe --
time-parallel computations, data-parallel computations, and
speculative-parallel computations.

An intensive IPA, on the other hand, is implemented using distinct
hardware support for the simplified, special cases of composition
previously emphasized. Because demanding applications cannot be
implemented using a single type of parallelism, intensive IPAs are
the only solution for building fully programmable, low-power,
high-performance systems-on-a-chip.

We continue our exploration of the concept of intensive IPAs by
looking at their implementations.


\subsection{Intensive IPAs}

Actual embedded applications demand intensive data-parallel
computation and/or intensive time-parallel computation, often
requiring the support of speculative resources. Therefore, the
machine implementation of an IPA must provide resources for both.
We call such a machine an {\bf integral parallel machine}, or IPM,
and observe that there are two ways of designing one:
\begin{itemize}
   \item with a unique physical support for both data- and
   time-parallel computation, which we refer to as a {\bf monolithic IPM},
   or
   \item with two distinct structures for the two types of
   parallelism, which we refer to as a {\bf segregated IPM}.
\end{itemize}

We now briefly present each type of machine.

\subsubsection{Monolithic integral parallel machine}

The main structure of this IPM type is a linear array of
processing elements (PEs) that operates in two modes: as a
data-parallel machine receiving instructions from a controller, or
as a pipeline of PEs, each executing its own locally stored
program \cite{Mitu05,Thiebaut07}.

Note that the performance requirements of today's applications
will dictate the addition of speculative parallelism to this array
of PEs.

However, the simplest form of this type of IPM is a linear
data-parallel array of $n$ bidirectionally connected PEs, adorned
with mechanisms for constant-time selection ($s$-selectors)
that allow $s$-degree speculative execution in pipeline mode.

\subsubsection{Segregated integral parallel machine}

Segregated integral parallel machines, on the other hand, are
chosen when the cost of adding the speculative resources becomes
too high for a very long pipe. If the number of stages $n$ in the
pipe is large (hundreds or thousands) and the degree of
speculation $s$ is larger than 2, the additional hardware is
justified only if these very long pipes are mandated by the
overall design. In the other cases, a short pipe with few stages
and a reasonably large $s$ can be added as a separate element of
the overall design.  In cases where the amount of data parallelism
exceeds the amount of time parallelism, a
segregated IPM is preferable.


\section{Case Study: {\em The BA1024 Chip}}

In this section we present a chip currently manufactured by {\em
BrightScale, Inc.}, the {\bf {\em BA1024}}, which has at its core a
segregated IPM with parameters $n=1024$, $m=8$, and $s=4$. The chip
is dimensioned to fully support the computation required for dual
HDTV codecs.  The chip combines the many-core and multi-processor
approaches: it contains an array of 1024 PEs alongside an array of
four MIPS processors working as a MIMD machine.

The chip is implemented in a $130\,nm$ standard process, runs at
$200\,MHz$, and features:
\begin{itemize}
   \item audio and video interfaces for two HD channels,
   \item four MIPS processors used as: host processor, video
       processor, demultiplexing processor, and audio processor,
   \item a DDR interface sustaining data transfer rates of $3.2\,GB/sec$,
   \item an interconnection ``fabric'' allowing transfers of 128-bit words per
       clock cycle, and
   \item an accelerator with intensive IPA containing:
   \begin{itemize}
       \item 1024 16-bit PEs, each having an integer unit, a boolean
       unit, and 256 16-bit words of data memory,
       \item a global loop structure, used mainly to identify the first PE having the
       selected predicate set to true,
       \item a reduction tree, used to extract data from the array,
       or to calculate  the number of PEs with a  predicate set to
       true,
       \item a bidirectional I/O tree, which transfers data to and from
       the array; the transfer is strictly parallel and concurrent with
       the main computational process, and
       \item 8 16-bit PEs, each with its own program memory.
   \end{itemize}
\end{itemize}
Because codec implementation is data intensive, the data-parallel
section is very heavily used. The time-parallel section is needed
to accelerate purely sequential codec modules, mainly for the
H.264 standard.



\subsection{Performance}

The performance measures of the BA1024 are summarized in Tables 1
and 2.



The dot product operation in Table 1 refers to vectors having
1024 16-bit integer components. Note also that the 200 GOPS figure
does not include floating-point operations, which are emulated in
software; in that case, 2GFLOPS + 100GOPS can be sustained.
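
As an indication of how such a dot product maps onto the
accelerator (this is only our own sketch of the obvious mapping,
not BrightScale code), each PE multiplies one pair of elements in
a single data-parallel step, and the reduction tree then sums the
1024 partial products in $\log_{2} 1024 = 10$ levels:
{\small
\begin{verbatim}
/* Illustrative sketch (not BrightScale code): a 1024-element dot
   product mapped onto the array -- one multiply per PE, followed
   by a 10-level reduction tree summing the partial products.      */
#include <stdio.h>

#define N 1024

static long long dot(const short a[N], const short b[N])
{
    long long p[N];

    /* data-parallel step: every PE multiplies its own pair of elements */
    for (int i = 0; i < N; i++) p[i] = (long long)a[i] * b[i];

    /* reduction tree: log2(N) levels of pairwise additions */
    for (int stride = N / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; i++) p[i] += p[i + stride];

    return p[0];
}

int main(void)
{
    short a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1; b[i] = 2; }
    printf("%lld\n", dot(a, b));   /* 2048 */
    return 0;
}
\end{verbatim}
}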

The DCT operations in Table 2 are performed on $8 \times 8$
arrays.

\begin{table}[h!p]
\begin{center}
\begin{tabular}{|l|c|}
   \hline
   \bf{Operations} & \bf{Performance} \\
   \hline
   16-bit operations
   & 200 GOPS \\
   \hline
   Dot product & 28 clock cycles\\
               & $>$ 7GDotProduct/sec \\
   \hline
   External bandwidth & 3.2GB/sec \\
   \hline
   Internal bandwidth & 400GB/sec \\
   \hline
   OPS/power          & $>$ 60GOPS/Watt \\
   \hline
   OPS/area           & $>$ 2GOPS/$mm^{2}$ \\
   \hline
\end{tabular}
\caption{Overall performance data for the BA1024.}
\label{perftable}
\end{center}
\end{table}

\begin{table}[h!p]
\begin{center}
\begin{tabular}{|l|c|}
   \hline
   \bf{Operations} & \bf{Performance} \\
   \hline
   DCT & 0.15 cycles/pixel \\
   \hline
   SAD & 0.0025 cycles/pixel \\
   \hline
   decoding of H.264 & 85\% utilization \\
   dual HD stream    & of accelerator \\
   \hline
\end{tabular}
\caption{Video performance data for the BA1024.}
\label{videotable}
\end{center}
\end{table}

\section{Concluding Remarks}

We conclude by making the following remarks.

\paragraph{Ubiquitous parallelism.} Parallelism is an integral part of today's medium- and high-performance
processors. It is exploited more and more in the two
IPA variants that we have presented: complex IPAs, where
parallelism appears in general-purpose processors, and in
intensive IPAs, such as specialized accelerators. Chances are that
the term ``parallel'' may become obsolete in the near future, as
technological limitations are removed and all computations end up
implementing parallelism in a ``natural'' way.

\paragraph{Complex versus intensive IPA.}
There is a significant difference between the performance measures
of standard processors using complex IPAs and those of machines
making use of intensive IPAs, of which the BrightScale BA1024 is
an example. Typical values for today's complex IPAs are:
\begin{itemize}
   \item 4GIPS + 4GFLOPS
   \item (0.04GIPS + 0.04GFLOPS)/Watt
   \item (0.016GIPS + 0.016GFLOPS)/$mm^{2}$
\end{itemize}
where an instruction is a 32-bit operation.  For intensive IPAs,
however, the measures are:
\begin{itemize}
   \item 200GOPS {\bf or} 2GFLOPS+100GOPS
   \item 60GOPS/Watt
   \item 2GOPS/$mm^{2}$.
\end{itemize}

Note that our goal in presenting the BA1024 is solely to emphasize
the significant differences between complex and intensive IPAs,
and by no means do we claim any superiority of the BrightScale
design over that of other companies.

\paragraph{Segregation of the complex IPA from the intensive IPA.}
Segregation of IPAs is the best solution for optimizing both {\em
price (area) versus performance} and {\em power versus
performance}.
\begin{center}
{\bf High Performance Architecture=complex IPA + intensive IPA.}
\end{center}
Almost all demanding applications would benefit from the use of a
new kind of computing architecture. We claim that in the case of
such an architecture, {\bf the complex part must be strongly
segregated from the intensive part} in order to reach the target
performance at a competitive price, and with a minimum amount of
dissipated energy.

\paragraph{Acknowledgments}
We would like to thank Emanuele Altieri, Lazar Bivolarski, Frank
Ho, Bogdan M\^{\i}\c tu, Tom Thomson, and Dan Tomescu for
insightful discussions on the material presented here.


\begin{thebibliography}{nnn} %abcdres

% this reference not used
%\bibitem[Andonie '07]{} R\v azvan Andonie, Mihaela Mali\c ta, ``The Connex Array as a Neural
%Network Accelerator", accepted at {\em Third IASTED International
%Conference on Computational Intelligence, 2007}, Bannf, Alberta,
%Canada, July2-4, 2007.

\bibitem{Asanovic06} Krste Asanovic et al.: {\em The Landscape
of Parallel Computing Research: A View from Berkeley}, Technical
Report No. UCB/EECS-2006-183, December 18, 2006.

\bibitem{Borkar05} Shekhar Y. Borkar et al.: {\em Platform
2015: Intel Processor and Platform Evolution for the Next Decade},
Intel Corporation, 2005.

\bibitem{Dubey05} Pradeep Dubey: {\em A Platform 2015 Workload
Model: Recognition, Mining and Synthesis Moves Computers to the
Era of Tera}, Intel Corporation, 2005.

\bibitem{Flynn72} Flynn, M.J.: ``Some Computer Organizations and Their
Effectiveness,'' {\em IEEE Trans. Comp.}, C-21:9 (Sept. 1972), pp.
948--960.

\bibitem{Kleene36} Stephen C. Kleene: ``General Recursive Functions of
Natural Numbers,'' in {\em Math. Ann.}, 112, 1936.

\bibitem{Malita06} Mihaela Mali\c ta, Gheorghe \c Stefan, Marius Stoian: ``Complex vs. Intensive
in Parallel Computation,'' in {\em International Multi-Conference
on Computing in the Global Information Technology -- Challenges for
the Next Generation of IT\&C -- ICCGI 2006}, Bucharest, Romania,
August 1-3, 2006.

\bibitem{Mitu05} Bogdan M\^{\i}\c tu: private communication.

\bibitem{Stefan06} Gheorghe \c Stefan: ``The CA1024: A Massively Parallel Processor for
Cost-Effective HDTV,'' in {\em SPRING PROCESSOR FORUM:
Power-Efficient Design}, May 15-17, 2006, Doubletree Hotel, San
Jose, CA, \& in {\em SPRING PROCESSOR FORUM JAPAN}, June 8-9,
2006, Tokyo.

\bibitem{Stefan06b} Gheorghe \c Stefan, Anand Sheel, Bogdan M\^{\i}\c tu, Tom Thomson, Dan Tomescu:
``The CA1024: A Fully Programmable System-On-Chip for Cost-Effective
HDTV Media Processing,'' in {\em Hot Chips: A Symposium on High
Performance Chips}, Memorial Auditorium, Stanford University,
August 20-22, 2006.

\bibitem{Stefan06c} Gheorghe \c Stefan: ``The CA1024: SoC with Integral Parallel Architecture
for HDTV Processing,'' in {\em 4th International System-on-Chip
(SoC) Conference \& Exhibit}, November 1 \& 2, 2006, Radisson
Hotel Newport Beach, CA.

\bibitem{Thiebaut06} Dominique Thi\' ebaut, Gheorghe \c Stefan, Mihaela Mali\c ta: ``DNA Search and
the Connex Technology,'' in {\em International Multi-Conference on
Computing in the Global Information Technology -- Challenges for
the Next Generation of IT\&C -- ICCGI 2006}, Bucharest, Romania,
August 1-3, 2006.

\bibitem{Thiebaut07} Dominique Thi\' ebaut, Mihaela Mali\c ta:
``Pipelining the Connex Array,'' {\em BARC07}, Boston, Jan. 2007.

\bibitem{Xavier98} C. Xavier, S.S. Iyengar: {\em
Introduction to Parallel Algorithms}, John Wiley \& Sons, Inc,
1998.

\bibitem{Brainerd74} W. S. Brainerd, \& L. H.  Landweber, {\em Theory of Computation.}, Wiley, 1974.

\end{thebibliography}
\end{document}




The Draw1_1.lp File

This file was generated by LatexCad, a package for drawing figures in LaTeX. Other options are available for generating graphs.

% Drawing generated by LaTeX-CAD 1.9 - requires latexcad.sty 
% (c) 1998 John Leis leis@usq.edu.au 
\unitlength 1.8pt
{\tiny
\begin{picture}(112,68)
\Thicklines
\drawframebox{24.0}{42.0}{16.0}{12.0}{$h_{0}$}
\drawframebox{46.0}{42.0}{16.0}{12.0}{$h_{1}$}
\drawframebox{78.0}{42.0}{16.0}{12.0}{$h_{m-1}$}
\thicklines
\drawdot{58.0}{42.0}
\thinlines
\drawdot{62.0}{42.0}
\drawdot{66.0}{42.0}
\Thicklines
\drawframebox{52.0}{18.0}{32.0}{12.0}{$g$}
\thicklines
\drawvector{40.0}{28.0}{4.0}{0}{-1}
\drawvector{46.0}{36.0}{12.0}{0}{-1}
\drawvector{64.0}{28.0}{4.0}{0}{-1}
\drawpath{64.0}{28.0}{78.0}{28.0}
\drawpath{78.0}{28.0}{78.0}{36.0}
\drawpath{40.0}{28.0}{24.0}{28.0}
\drawpath{24.0}{28.0}{24.0}{36.0}
\drawdot{50.0}{28.0}
\drawvector{52.0}{12.0}{6.0}{0}{-1}
\drawpath{24.0}{54.0}{78.0}{54.0}
\drawpath{52.0}{60.0}{52.0}{54.0}
\drawvector{24.0}{54.0}{6.0}{0}{-1}
\drawvector{46.0}{54.0}{6.0}{0}{-1}
\drawvector{78.0}{54.0}{6.0}{0}{-1}
\thinlines
\drawdot{46.0}{54.0}
\drawdot{52.0}{54.0}
\drawcenteredtext{54.0}{62.0}{$x_{0}, x_{1}, \ldots x_{n-1}$}
\drawcenteredtext{64.0}{3.0}{{\tt $out = f(x_{0}, x_{1}, \ldots x_{n-1})$}}
\drawdot{54.0}{28.0}
\drawdot{58.0}{28.0}
\end{picture}
}
\vspace{-0.5cm}