This paper has evolved from an article by Clive “Max” Maxfield that was first published in EE Times and also on the Programmable Logic DesignLine website. Any portions of the original article that appear in this paper are reproduced here with the kind permission of CMP/EE Times.




 
Before We Start
Before we start, we should note that the following discussions relate to the illustration shown below (of which we are inordinately proud, because capturing the diverse computing options graphically proved to be a non-trivial task).

Also, this paper is intended to be a "living-breathing" document that will be updated to reflect any new players and architectures in the multi-processor and reconfigurable computing arena. So if you think we got anything wrong or we missed something out, don't hesitate to contact us and we'll leap into action with gusto and abandon.




 
Defining Some Terms
OK, let's kick things off by defining a few concepts, because this will make things easier as we wend our way through the rest of this paper. The term central processing unit (CPU) refers to the "brain" of a general-purpose digital computer – this is where all of the decision making and number crunching operations are performed. By comparison, a digital signal processor (DSP) is a special-purpose CPU that has been created to process certain forms of digital data more efficiently than can be achieved with a general-purpose CPU.

Both CPUs and DSPs may be referred to as "processors" for short. The term microprocessor refers to a processor that is implemented on a single integrated circuit (often called a "silicon chip," or "chip") or a small number of chips. The term microcontroller refers to the combination of a general-purpose processor along with all of the memory, peripherals, and input/output (I/O) interfaces required to control a target electronic system (all of these functions are implemented on the same chip to cut down on size, cost, and power consumption).

The heart of a processor is its arithmetic-logic unit (ALU) – this is where arithmetic and logical operations are actually performed on the data. Also, in the case of DSP algorithms, it is often required to perform multiply-accumulate (MAC) operations in which two values are multiplied together and the result is added to an accumulator (that is, a register in which intermediate results are stored). Thus, DSP chips often contain special hardware MAC units.
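
By way of a quick illustration (our own C sketch, not tied to any particular DSP architecture; the function name and data types are simply choices made for the example), a MAC-based inner loop looks like this in software:

    #include <stdint.h>

    /* Multiply-accumulate (MAC): multiply pairs of samples and coefficients
     * and sum the products into an accumulator. On a DSP with a hardware
     * MAC unit, each trip around this loop typically maps onto a single
     * MAC operation. */
    int64_t mac_dot_product(const int32_t *samples, const int32_t *coeffs,
                            int count)
    {
        int64_t acc = 0;                            /* the accumulator       */
        for (int i = 0; i < count; i++) {
            acc += (int64_t)samples[i] * coeffs[i]; /* multiply, then add    */
        }
        return acc;
    }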

Last but not least, the term core is understood to refer to a microprocessor (CPU or DSP) or microcontroller that is implemented as a function on a larger device such as a field-programmable gate array (FPGA) or a System-on-Chip (SoC). Depending on the context, the term processor may be used to refer to a chip or a core. [The underlying concepts behind devices such as FPGAs and SoCs – and also ASICs and ASSPs as mentioned later in this paper – are explained in excruciatingly interesting detail in our book Bebop to the Boolean Boogie (An Unconventional Guide to Computers), ISBN: 0750675438.]




 
Introduction
The first commercial microprocessor was the Intel 4004, which was introduced in 1971. This device had a 4-bit CPU with a 4-bit data bus and a 12-bit address bus (the data and address buses were multiplexed through the same set of four pins because the package was pin-limited). Comprising just 2,300 transistors and running with a system clock of only 108 kHz, the 4004 could execute only 60,000 operations per second.

For the majority of the three and a half decades since the 4004's introduction, increases in computational performance and throughput have been largely achieved by means of relatively obvious techniques as follows:

a)  Increasing the width of the data bus from 4 to 8 to 16 to 32 to the current 64 bits used in high-end processors.
b)  Adding (and then increasing the size of) local high-speed "cache" memory.
c)  Shrinking the size – and increasing the number – of transistors; today's high-end processors can contain hundreds of millions of transistors.
d)  Increasing the sophistication of processor architectures, including pipelining and adding specialized execution blocks, such as dedicated floating-point units.
e)  Increasing the sophistication of such things as branch prediction and speculative execution.
f)  Increasing the frequency of the system clock; today's high-end processors have core clock frequencies of 3 GHz (that's three billion clock cycles a second) and higher.

The problem is that these approaches can only go so far, with the result that traditional techniques for increasing computational performance and throughput are starting to run out of steam. When a conventional processor cannot meet the needs of a target application, it becomes necessary to evaluate alternative solutions such as multiple processors (in the form of chips or cores) and/or configurable processors (in the form of chips or cores).




 
The Computing Universe
For the purposes of this paper, we will consider the term computing in its most general sense; that is, we will understand "computing" to refer to the act of performing computations. There are many different types of computational tasks we might wish to perform, including – but not limited to – general-purpose office-automation applications (word-processing, spreadsheet manipulation, etc.); extremely large database manipulations such as performing a Google search; one-dimensional digital-signal processing (DSP) applications such as an audio codec; and two-dimensional DSP applications such as edge-detection in robotic vision systems.

In many cases, these different computational tasks are best addressed by a specific processing solution. For example, an FPGA may be configured (programmed) to perform certain DSP tasks very efficiently, but one typically wouldn't consider using one of these devices as the main processing element in a desktop computer. Similarly, off-the-shelf Intel and AMD processor chips are applicable to a wide variety of computing applications, but you wouldn't expect to find one powering a cell phone (apart from anything else, the battery life of the phone would be measured in seconds).

Fundamentally, there are three main approaches when it comes to performing computations. At one end of the spectrum we have a single, humongously large processor; at the other end of the spectrum we have a massively-parallel conglomeration of extremely fine-grained functions (which some may call "a great big pile of logic gates"); and in the middle we have a gray area involving multiple medium- and coarse-grained processing elements. (Note that this paper focuses on the microprocessor/CPU/DSP arenas; mainframe computers and supercomputers are outside the scope of these discussions.)




 
Single Processors
The classical processing solution for many applications is to use a single, humongously large "off-the-shelf" processor, such as a general-purpose CPU chip from Intel (www.intel.com) or AMD (www.amd.com) or a special-purpose DSP chip from Texas Instruments (www.ti.com). Similarly, in the case of embedded applications, one might choose to use a single general-purpose processor core from ARM or ARC or a DSP core from TI.

At some stage, a single processor simply cannot meet the needs of a target application, in which case it becomes necessary to evaluate alternative solutions as discussed in the following topics.




 
Multiple Processors (Homogeneous)
Perhaps the most famous early example of using multiple processors was the INMOS transputer chip, which surfaced in the mid-1980s (the all-lowercase "transputer" was the official written form). As a point of interest, the native programming language for the transputer was occam (again, the all-lowercase "occam" was the official written form), which was named in honor of the 14th-century English philosopher and Franciscan friar William of Ockham, also spelled Occam (1286–1348, give or take a few years).

Each transputer chip contained a single processor that was designed to communicate with – and work in parallel with – other transputers. The idea was that users could hook as many transputer chips together on a circuit board as was necessary to satisfy the computational requirements of the target application. Many believed that the transputer was going to be the next great leap in computing, but creating programs that ran efficiently on this parallel architecture was non-trivial, and the transputer eventually faded away.

Although most non-engineers don't realize it, it is actually very common for systems to use multiple processors. Consider a home computer, for example; in addition to the main CPU, the keyboard will also have its own processor; each hard disk and optical (CD/DVD) drive will typically contain two or more processors, and so forth. Even a simple "USB Memory Stick" contains its own processor, which is used to make the contents of the stick appear to be a hard disk drive as far as the host computer's operating system is concerned.

However, the above examples are characterized by the fact that these multiple processors all have very focused, well-partitioned tasks that can be largely performed in isolation. It is much more complicated to have tightly-coupled homogeneous processors, such as the dual-core chips that are now available from AMD and Intel (the term "homogeneous" means that these processing elements are of the same kind). Another term that is applicable to this type of configuration is symmetric multiprocessing (SMP), which means that the view of the rest of the system – memory, input/output, operating system, etc. – is exactly the same (i.e. "symmetrical") for each processor.
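
To make the SMP idea a little more concrete, here is a minimal C/POSIX-threads sketch (our own example, not tied to any particular AMD or Intel part) in which two identical worker threads share the same memory and split a task between them:

    #include <pthread.h>
    #include <stdio.h>

    #define N_WORKERS 2            /* e.g. one thread per core on a dual-core chip */
    #define N_ITEMS   1000000

    static double data[N_ITEMS];   /* shared memory: every core sees the same view */
    static double partial[N_WORKERS];

    static void *worker(void *arg)
    {
        int id = (int)(long)arg;
        double sum = 0.0;
        for (int i = id; i < N_ITEMS; i += N_WORKERS)   /* process our own slice */
            sum += data[i];
        partial[id] = sum;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[N_WORKERS];
        for (int i = 0; i < N_ITEMS; i++)
            data[i] = 1.0;

        /* The operating system is free to schedule these identical threads on
         * either core, because each core's view of memory, I/O, and the
         * operating system is the same (that's the "symmetric" part of SMP). */
        for (int i = 0; i < N_WORKERS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)(long)i);

        double total = 0.0;
        for (int i = 0; i < N_WORKERS; i++) {
            pthread_join(tid[i], NULL);
            total += partial[i];
        }
        printf("total = %f\n", total);   /* prints 1000000.000000 */
        return 0;
    }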

Moving from a single processor/core to a dual-processor/core configuration makes the system noticeably more responsive, and users no longer experience those annoying "hang-ups" and "stalls" that are the hallmark of a single-processor environment. And two processors are only the start; for example, Intel is already talking about a four-core microprocessor called "Clovertown," which is expected to appear on the market in early 2007.

Meanwhile, Sun Microsystems (www.sun.com) is already fielding an eight-core processor called the UltraSPARC T1. Formerly known by its codename Niagara, this extreme-performance device is well-suited to highly-threaded commercial environments, such as thread-aware web servers, application servers, and database servers. Of particular interest is the fact that Sun is open sourcing this chip; the register transfer level (RTL) representation of this device was made available to the engineering community when the www.opensparc.net website went live on January 24, 2006.

And if you think an eight-core processor is impressive, you should check out the Vega processor chip from Azul Systems (www.azulsystems.com). The current implementation of this device boasts an array of twenty-four 64-bit CPU cores, and Azul has announced that a forty-eight-core version will be made available in 2007.

Before we move on, we should also make mention of the Multicore Association (www.multicore-association.org), which is a new industry group focused on companies involved with multi-processor hardware, software, and system implementations.




 
Multiple Processors (Heterogeneous)
As opposed to using multiple identical cores, it may be preferable to use a mixture of dissimilar cores. For example, the main digital chip in even the most rudimentary cell phone will typically contain at least one CPU core (to manage the human-machine interface) coupled with at least one DSP core (to perform the baseband signal processing functions). Such solutions are referred to as being "heterogeneous," meaning "consisting of dissimilar elements or parts."

One example of this type of scenario is the Cell processor from IBM (www.ibm.com), which is a single chip containing a general-purpose CPU core tightly coupled with eight DSP cores [IBM actually call these DSP cores Synergistic Processor Elements (SPEs); these little scamps contain floating-point engines and other units; they are predominantly used for graphics calculations.]. Another example is a high-end cell phone, which may include two or more CPU cores and two or more DSP cores combined with large numbers of hardware accelerator blocks and peripheral functions.

Things are further complicated by the fact that the processing cores and other functional units may have their own individual memories along with shared memory structures; and everything may be connected together using multi-level buses and cross-point switches (some of the larger chips actually feature a Network-on-Chip (NoC), which the various processors and peripherals use to communicate with each other). One term which is commonly associated with this type of environment is asymmetric multiprocessing (AMP or ASMP), in which computational tasks (or threads) are strictly divided by type between processors.
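
The following hypothetical C sketch (ours alone; it does not represent any vendor's programming interface) illustrates the asymmetric idea on a small scale: one thread performs control-type work and the other performs signal-processing-type work, with the two strictly divided by type. On a real heterogeneous device, the former would run on a CPU core and the latter on a DSP core:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical sketch of asymmetric multiprocessing (AMP): tasks are
     * divided strictly by type. On a real heterogeneous device the control
     * work would run on a CPU core and the signal-processing work on a DSP
     * core; here both are ordinary POSIX threads so the sketch can run
     * anywhere. */

    #define FRAME_LEN 256
    static int16_t frame[FRAME_LEN];    /* buffer handed from CPU side to DSP side */
    static atomic_int frame_ready = 0;

    static void *dsp_task(void *arg)    /* "DSP-type" work: number crunching       */
    {
        (void)arg;
        while (!atomic_load(&frame_ready))
            ;                           /* wait for the control side               */
        int32_t energy = 0;
        for (int i = 0; i < FRAME_LEN; i++)
            energy += frame[i] * frame[i];   /* e.g. the energy of one frame       */
        printf("frame energy = %d\n", energy);
        return NULL;
    }

    int main(void)                      /* "CPU-type" work: control and I/O        */
    {
        pthread_t dsp;
        pthread_create(&dsp, NULL, dsp_task, NULL);
        for (int i = 0; i < FRAME_LEN; i++)   /* pretend to acquire a data frame   */
            frame[i] = (int16_t)(i & 0x7F);
        atomic_store(&frame_ready, 1);        /* hand the frame to the DSP side    */
        pthread_join(dsp, NULL);
        return 0;
    }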




 
CPU Chips Linked to FPGA-based Coprocessors and Accelerators
As a starting point for this topic, we should note that several companies make computer motherboards that support two general-purpose processors linked by a high-speed bus. For example, there are several motherboards that boast two of AMD's Opteron processor chips linked by the high-speed, low-latency HyperTransport bus, where each of these processors may contain two, four, or more CPU cores as new chips come onto the market.

The idea is to remove one of these general-purpose processor chips and replace it with a small pin-compatible card containing one or more high-end FPGAs. In this case, the general-purpose AMD processor is used to execute control-type tasks, while the FPGA module is configured to perform algorithmically intensive data-processing and number-crunching tasks with extreme speed. Meanwhile, the HyperTransport bus is used to move massive amounts of data around the system.

A good example of this type of approach is offered by the folks at XtremeData (www.xtremedatainc.com) who combine an AMD processor with an FPGA-based module using high-capacity, high-performance FPGAs from Altera (www.altera.com). Similar examples are provided by Cray (www.cray.com) and DRC Computer Corporation (www.drccomputer.com), who do much the same thing but with FPGAs from Xilinx (www.xilinx.com).

Note #1: Many servers now include a HyperTransport (HTX) expansion slot along with their PCI and PCI Express slot(s). The HTX slot attaches directly to the same bus that both processors use; this means that the FPGA-based accelerator card can communicate with the processors in the same way as the 'socket' solutions discussed above, but now you get to keep both processor chips on the motherboard. (There is no technical reason why there couldn't be multiple HTX expansion slots, but today's systems typically offer only one such slot.)

Note #2: AMD's initiative to promote the openness of the HyperTransport Bus is known as Torrenza; in response, Intel announced a proposal – codenamed Geneseo – to open up their front side bus (FSB) to facilitate the same type of implementation.

There are many other vendors with interesting solutions, such as Celoxica (www.celoxica.com), who plug into the HTX slot discussed above; SRC Computers (www.srccomputers.com), whose FPGA-based accelerator card plugs directly into one or more memory slots; and Nallatech (www.nallatech.com), who boast a wide variety of products and tools.




 
On-chip Coprocessors and Accelerators
If you are in the process of creating a new chip from the ground up, one technique is to augment a pre-defined processor core with one or more dedicated coprocessors and/or hardware accelerators. For example, CriticalBlue (www.criticalblue.com) has a tool called Cascade that accepts as input compiled applications (which may be referred to as binaries) in the form of executable ARM machine code. By means of a simple interface, the user selects which functions are to be accelerated, and Cascade then generates the register transfer level (RTL) description for a dedicated coprocessor (and the microcode to run on that coprocessor) to implement the selected functions.

A somewhat similar approach is that taken by Binachip (www.binachip.com), whose tools also take compiled (binary) programs. However, these tools first read the binary code into a neutral format, then they allow you to select which functions will be implemented in hardware and which functions are to be realized in software. Finally, they re-generate the binary code for the software portions of the system and generate register transfer level (RTL) representations for the accelerators used to implement the hardware portions of the system.

An alternative technique is that adopted by Poseidon Systems (www.poseidon-systems.com), whose Triton tool suite allows users to analyze ANSI standard C source code, to identify areas of the code to be accelerated, and to generate accelerators/coprocessors that can be used in conjunction with ARM, PowerPC, Nios, or MicroBlaze hard and soft processor cores implemented in SoCs and/or FPGAs.

And then there are the tools from Synfora (www.synfora.com), which can also analyze ANSI standard C source code and generate register transfer level (RTL) representations for corresponding hardware accelerators.

In reality, there are quite a few other players in this arena; these include (but are not limited to) Altera (www.altera.com) with its C2H (ANSI C to hardware accelerator) technology, Celoxica (www.celoxica.com) with its Agility Compiler (SystemC to hardware accelerator) and DK Suite (Handel-C to hardware accelerator) approaches, Forte Design Systems (www.forteds.com) with its Cynthesizer (SystemC/C++ to hardware accelerator) suite, and Mentor Graphics (www.mentor.com) with its Catapult BL and SL (C to hardware accelerator) technology.




 
Large Arrays of "Things"
One way to think of the hardware used to perform computations is in terms of its granularity. The finest level of granularity is provided by an application-specific integrated circuit (ASIC) or application-specific standard part (ASSP), in which algorithms can be hand-crafted in silicon at the level of individual logic gates. (An ASIC is a device that is custom-created for a particular application and is intended for use by only one – or very few – companies. By comparison, ASSPs are devices that are created using ASIC technologies, but that are intended to be sold as standard parts to anybody who wants to use them.)

Next, we have FPGAs with their four-input lookup tables (LUTs). These are off-the-shelf chips that can contain the equivalent of tens of thousands to tens of millions of logic gates. FPGAs are designed in such a way that they can be configured (programmed) to perform some desired function or functions; the SRAM-based versions of these devices have the advantage that they can be reconfigured as required. [Structured ASICs may be considered to occupy a space somewhere between ASICs and FPGAs, especially in the case of devices from eASIC (www.easic.com), which combine custom routing with FPGA-like SRAM-based LUTs.]
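
As a minimal illustration of what a four-input LUT actually does (this is a software model we have written for this paper, not FPGA configuration code), any Boolean function of four inputs can be stored as a 16-entry truth table, with the four inputs simply selecting one of the entries:

    #include <stdint.h>
    #include <stdio.h>

    /* A 4-input LUT can implement ANY Boolean function of four inputs: the
     * function is stored as a 16-entry truth table, and the four inputs
     * simply select one of those entries. Configuring an SRAM-based FPGA
     * consists (in part) of loading these truth tables. */
    static int lut4(uint16_t truth_table, int a, int b, int c, int d)
    {
        int index = (a << 3) | (b << 2) | (c << 1) | d;   /* 0..15 */
        return (truth_table >> index) & 1;
    }

    int main(void)
    {
        /* Configure the LUT as a 4-input AND gate: only the entry for
         * a=b=c=d=1 (index 15) is a 1, so the truth table is 0x8000. */
        uint16_t and4 = 0x8000;
        printf("%d\n", lut4(and4, 1, 1, 1, 1));   /* prints 1 */
        printf("%d\n", lut4(and4, 1, 0, 1, 1));   /* prints 0 */
        return 0;
    }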

Note that we might decide to include one or more hard processor cores on an ASIC or ASSP, in which case we would refer to this device as a System-on-Chip (SoC). Similarly, we might decide to include one or more hard and/or soft processor cores on an FPGA (which may also be viewed as an SoC by some folks). All of these cases would then be considered to be a hybrid solution involving a mixture of traditional processor core(s) and algorithms implemented in gates/LUTs/etc.

In recent years, a number of companies have started to offer more exotic architectures, each of which is applicable to a focused set of computational applications. If we consider these offerings in terms of granularity, then the first step above traditional FPGAs would be an architecture such as that provided by Elixent (www.elixent.com). This reconfigurable algorithm processing (RAP) architecture – which is targeted toward the efficient implementation of arithmetic/DSP functions – is based on an array of 4-bit arithmetic-logic units (ALUs) in a "sea" of programmable interconnect. These ALUs can be linked using fast carry chains so as to implement wider functions. In addition to forming part of a datapath, the output of one ALU may be used to select the instruction of another ALU. The programming model for these devices is to take the same register transfer level (RTL) representation used to create an ASIC or to configure (program) an FPGA, and to use an appropriate synthesis engine to generate a corresponding configuration file.
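
The carry-chain idea is easy to model in software. The following C sketch (an illustration of the general concept only; it does not describe Elixent's actual fabric) shows how four 4-bit adders linked by a carry chain behave as a single 16-bit adder:

    #include <stdint.h>
    #include <stdio.h>

    /* Four 4-bit adders linked by a carry chain behave as one 16-bit adder.
     * This is a software model of the general concept only, not a description
     * of any vendor's actual fabric. */
    static uint8_t add4(uint8_t x, uint8_t y, uint8_t cin, uint8_t *cout)
    {
        uint8_t sum = (uint8_t)((x & 0xF) + (y & 0xF) + (cin & 1));
        *cout = (sum >> 4) & 1;                  /* carry out of this 4-bit slice */
        return sum & 0xF;
    }

    static uint16_t add16_from_4bit_slices(uint16_t a, uint16_t b)
    {
        uint16_t result = 0;
        uint8_t carry = 0;
        for (int slice = 0; slice < 4; slice++) {    /* four chained 4-bit ALUs */
            uint8_t sum = add4((a >> (4 * slice)) & 0xF,
                               (b >> (4 * slice)) & 0xF,
                               carry, &carry);
            result |= (uint16_t)sum << (4 * slice);
        }
        return result;
    }

    int main(void)
    {
        printf("%u\n", add16_from_4bit_slices(12345, 6789));   /* prints 19134 */
        return 0;
    }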

Next, we have the field programmable object array (FPOA) architecture from MathStar (www.mathstar.com). An example FPOA device may contain around 400 silicon "objects" in the form of 16-bit ALUs (each with its own instruction cache and scratchpad memory), register files, and multiply-accumulators (MACs) – along with internal RAM banks and external high-speed memory interfaces – all of which can communicate with each other through a programmable interconnect fabric. Each object can be programmed individually and acts autonomously. All of the objects and the interconnect run at 1 GHz. In addition to general-purpose I/O (GPIO) pins, the FPOA boasts high-speed I/O that can transmit and receive 2 × 32 GB/s. The main programming model for these devices is to use a graphical interface that generates SystemC, and the target application area is compute-intensive DSP tasks such as edge detection and pattern recognition for robotic vision systems with high frame rates and high resolutions.

Another group of architectures may be classed as comprising one (or a small number) of general-purpose CPU cores coupled with an array of processing elements (PEs). Depending on the implementation, each of these PEs can contain multipliers, adders, ALUs, MACs, counters, synchronizers, memory, etc. Three good examples of this concept are IPFlex (www.ipflex.com), with an off-the-shelf device comprising two CPUs and hundreds of 32-bit PEs; ClearSpeed (www.clearspeed.com), with an off-the-shelf device comprising a general-purpose CPU coupled with an array of 32/64-bit PEs containing floating-point multipliers and suchlike, targeted toward scientific and engineering calculations; and IMEC (www.imec.be), with a configurable core comprising a single very long instruction word (VLIW) CPU coupled with an array of 32/64 PEs, each containing an ALU/MAC combo.

A good example of the next higher level of granularity is provided by picoChip (www.picochip.com), whose picoArray features several hundred 16-bit CPU and DSP cores connected by a sea of programmable interconnect that can move 5 terabits of data per second around the device. Each core has its own local memory (ranging from 1K to 64K depending on the core type). The programming model for a picoArray is an interesting mixture of styles. A VHDL block-level netlist is used to define the connectivity between each of the CPU and DSP cores (each block in the netlist maps onto a specific type of core); meanwhile, the actual function of each block is defined in C and/or assembly code.

Another example of this level of granularity is provided by the multiprocessor DSP (MDSP) architecture from Cradle Technologies (www.cradle.com). Current incarnations of the MDSP offer up to 8 CPU cores and 16 DSP cores. Each of these 32-bit cores has its own local instruction and data memory. The latest programming model for these devices is to create a C program that is divided into multiple threads, and to tag each thread as being either a control thread (to be executed on a CPU) or a signal processing thread (to be executed on a DSP). A run-time dynamic scheduler is then used to assign threads to available resources on the device.

And yet another example is provided by Ambric (www.ambric.com). Right from the beginning, the folks at Ambric resolved that massive parallelism is only practical if you first start with the programming model, and then build the chip accordingly, rather than the other way around. Thus, they started by defining a structural object programming model, which is a hierarchical structure of self-contained objects linked through asynchronous self-synchronizing channels. The objects are written in Java or assembly code, and the structure is programmed graphically or textually in an Eclipse-based Integrated Design Environment (IDE). It was only after defining this programming model that the little rapscallions built a silicon chip upon which to implement it – and what a chip it is! This little scamp comprises an array of 360 32-bit CPU/DSP cores and 360 1-KByte RAMs, all linked by a configurable interconnect of channels. The result is a programmable chip capable of performing one tera-operation per second (which makes your eyes water) that can be easily programmed by system architects and software engineers.




 
Configurable Processors
As for so many things in computing, the term "configurable" is something of a slippery customer, because it means different things to different people. In the case of cores from ARC (www.arc.com), for example, you have the ability to customize the instruction set – and therefore the architecture of the core. By analyzing your source code application(s) using tools from ARC, you can determine which instructions aren’t used and remove them from the instruction set and the processor core. Also, you have the ability to add new instructions to the core (this is a tad more complicated).

Another technique is the concept fielded by Tensilica (www.tensilica.com). In this case, you start with a predefined 32-bit post-RISC processing engine called Xtensa that comprises around 25K gates. Next, Tensilica's tools analyze your C/C++ application and evaluate millions of possible processor extensions based on techniques like single-instruction-multiple-data (SIMD) and vector operations, operator fusion, and parallel execution. Once you select the configuration that's best for your particular application, a processor generator outputs the register transfer level (RTL) description for your custom processor along with a custom compiler, assembler, and source-level debugger. A typical customer may end up with 5 or 6 heterogeneous Tensilica cores on their SoC, and some devices (for networking applications) have several hundred such cores.
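
To give a feel for the sort of code such tools examine, consider the following generic C inner loop (our own example; it is not Tensilica's tool flow or its output). The multiply/shift/saturate sequence repeats for every sample, which makes it a natural candidate either for a fused custom instruction or for a SIMD extension that processes several samples per instruction:

    #include <stdint.h>

    /* The kind of inner loop such a tool flow might examine: the multiply /
     * shift / saturate sequence repeats for every sample, so it is a natural
     * candidate for a fused custom instruction, or for a SIMD extension that
     * processes several samples per instruction. (Generic illustration only,
     * not Tensilica's actual flow or output.) */
    static int16_t saturate16(int32_t x)
    {
        if (x >  32767) return  32767;
        if (x < -32768) return -32768;
        return (int16_t)x;
    }

    void scale_and_saturate(int16_t *dst, const int16_t *src,
                            int16_t gain, int count)
    {
        for (int i = 0; i < count; i++) {
            int32_t product = (int32_t)src[i] * gain;   /* multiply           */
            dst[i] = saturate16(product >> 8);          /* shift and saturate */
        }
    }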

As an aside, in February 2006, Tensilica started offering a suite of off-the-shelf cores called the Diamond Standard family. These are cores that Tensilica have pre-configured to perform a range of CPU and DSP functions extremely efficiently (these cores feature extremely high performance coupled with low power consumption).

And then we have the guys and gals at CoWare (www.coware.com) with their Processor Designer technology, which allows you to create a custom core from the ground up. As opposed to ARC and Tensilica, which we might regard as providing configurable IP, the tools from CoWare should be regarded as being more of an Electronic Design Automation (EDA) approach. In this case, the folks at CoWare have developed a high-level language that is designed to allow you to specify the required functionality of a processor core, including things like the instructions forming the instruction set, register files, execution units, the memory subsystem, and so forth. Using this language, you can define CPU and/or DSP cores with a wide variety of characteristics, such as single instruction multiple data (SIMD) capabilities, very long instruction word (VLIW) superscalar architectures, and so on. Then, once you are ready, you press the "Go" button and Processor Designer generates the register transfer level (RTL) representation used to create your core, along with a custom assembler, C compiler, linker, debugger, and instruction set simulator (ISS).

Another group of folks worth mentioning are the guys and gals at Target Compiler Technologies (www.retarget.com) whose Chess/Checkers technology also allows you to create a custom core from the ground up. Once again, this is more of an EDA approach. Also known as TCT, Target is an interesting company in that they are reputed to have more design wins in this space than any of their competitors, but not many people know about them (apart from the folks who are in this arena). An industry expert told the author that this is largely because everyone who works at Target is an engineer with at least five different jobs, and nobody has the time (or inclination) to do any marketing.

Last but not least, we should also note that ARM (www.arm.com) has a product called OptimoDE that can be used to generate specially configured cores. However, these cores are designed to act as slaves (coprocessors); that is, they require a host processor to load their local memories and start them running. Also, someone who shall remain nameless told the authors of this paper that "OptimoDE is so difficult to work with that only a few guys in Belgium actually know how to use it!"




 
Reconfigurable Processors

The term "reconfigurable computing" means different things to different folks. The best comparison the authors have heard thus far is that of the transporter systems on Star Trek. By this we mean that we all know how these devices are supposed to work and what they do, but we don't have a clue how to build one with the technologies available today.

Similarly, engineers have a vision of the ideal reconfigurable computing scenario, which involves a silicon chip whose function can be reconfigured at the level of individual logic gates (that is, changing an AND gate into an OR gate, for example) and whose connections between gates can be reconfigured on-the-fly without any negative impact with regard to performance or power consumption. In this dream world, it would also be possible to reconfigure certain portions of the device while other portions continued to function, thereby allowing new design variations to dynamically evolve in real time. The problem is that, at this time, we don't have a clue how to build such a device and – even if we did – we don't have the tools required to program one of these little scamps.

OK, back to the real world. One incarnation of reconfigurable computing that can be achieved with today's technologies is known as static reconfiguration. In this case, a programmable device such as an FPGA is first configured to perform a certain task, and is later reconfigured to perform a different task. By comparison, dynamic reconfiguration refers to configuring different portions of a device "on-the-fly" while other portions of the device continue to perform their tasks.

One interesting scenario involves an FPGA containing a number of soft microprocessor and DSP cores, each executing its own local microcode. A special controller block can be used to supply the various processor cores with new microcode as required (this new microcode could be stored in an external memory).

Perhaps the best example of reconfigurable computing to date is provided by Stretch Inc. (www.stretchinc.com), which provides a family of off-the-shelf software-configurable processors. Each of these chips contains two main units: Tensilica's Xtensa core coupled with Stretch's reconfigurable instruction set extension fabric (ISEF), which contains wide register files and lots of computational units (multipliers, adders, and so forth) in a sea of programmable interconnect. Stretch's tools analyze your C/C++ application and generate a corresponding configuration file to program the ISEF to perform specific tasks. The point here is that the ISEF can be reconfigured thousands of times a second so as to tailor it to better serve different portions of the algorithm.
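
As a purely conceptual sketch (our own example; it does not show Stretch's actual API or tool flow), consider an application with two distinct processing phases. On a software-configurable processor of this type, each phase's inner loop could be mapped onto the reconfigurable fabric in turn, with the fabric reloaded between phases so that the same silicon accelerates both kernels:

    #include <stdint.h>

    /* Conceptual sketch only (not Stretch's actual API or tool flow): an
     * application with two distinct phases. On a software-configurable
     * processor, each phase's inner loop could be mapped onto the
     * reconfigurable fabric in turn, with the fabric reloaded between
     * phases so the same silicon accelerates both kernels. */

    /* Phase 1: a simple 3-tap moving-average filter. */
    void phase1_filter(int16_t *dst, const int16_t *src, int n)
    {
        for (int i = 2; i < n; i++)
            dst[i] = (int16_t)((src[i] + src[i - 1] + src[i - 2]) / 3);
    }

    /* Phase 2: count samples that exceed a threshold in the filtered data. */
    int phase2_count_peaks(const int16_t *filtered, int n, int16_t threshold)
    {
        int peaks = 0;
        for (int i = 0; i < n; i++)
            if (filtered[i] > threshold)
                peaks++;
        return peaks;
    }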




 
Summary
This paper has really only scratched the surface of the state of play in modern computing. In addition to yet more hardware solutions, it is also necessary to consider such things as operating system issues along with the problems of programming, debugging, verifying, and profiling applications.

The point is that there are now a lot of options available to the designers of today's state-of-the-art systems. As usual, system architects have to make the traditional tradeoffs between power, performance, and cost. Ultimately, designers have to ask the questions: How much performance do we want? How much do we need? And how much can we afford?