Main characteristics and design philosophy
F-TTA0 is a completely new architecture, based loosely on the work
previously done by groups researching the general TTA concept and previous
work on other F-CPU architecture proposals, but otherwise completely
independent of any other architecture. The basic idea of the proposed
TTA is to simplify the hardware as much as possible while still providing
maximum generality and parallelism. This is facilitated by the use of a
TTA, which effectively exposes more of the pipeline to the software, and
allows many of the instruction scheduling and control operations to be
performed by the compiler instead of the hardware. Unfortunately, this
also creates certain difficulties in terms of requiring that the compiler
know many details of the internal architecture, and eliminating these
dependencies often forces a less optimistic approach to execution
(instruction reordering at run time is impossible with current techniques,
for example). However, the use of a TTA does allow greater thread-level
parallelism when combined with SMT techniques, which should help
compensate for the reduced ILP. Consequently, the F-TTA0 will be an SMT
architecture.
In order to support SMT, the F-TTA0 will be capable of storing
information for multiple contexts on-chip. These contexts will have
separate external and internal representations; that is, contexts may not
be identified in hardware by the OS-defined PID, TID, etc. Whenever an
instruction generates a bus transaction, control signals on the bus will
indicate which context is issuing the instruction, so that functional
units may act accordingly.
The F-TTA0 is composed of several different functional units, grouped into
two categories: control units and execution units. Execution units are the
obvious case: adders, FPUs, and the like; the things that actually DO the
computation. Control units include the hardware that does the more obvious
control activities such as instruction fetch, but also things like the
register file. These units will be covered in more detail in later
chapters. All functional units, control or execution, will be connected
by one or more busses (one per move instruction in the instruction word).
Constants will be implemented as words in the instruction stream. One
read-only address will be used as a constant generator; any reads from
this address will cause the instruction fetch unit to place the data
in the next instruction word in the stream onto the bus as a constant.
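The constant generator can be sketched roughly as follows. This is a minimal Python model, not a definitive design: it assumes one move per instruction word and that a constant occupies exactly one entry in the stream, neither of which the text fixes. The names run and read_port are hypothetical.

```python
CONSTANT_SOURCE = 0b000000  # the read-only constant-generator address


def run(stream, read_port):
    """Walk an instruction stream and return the (dest, value) transfers.

    Moves are modeled as (src, dest) tuples; constants are plain integers
    placed directly after the move that reads them (an assumption).
    """
    transfers = []
    pc = 0
    while pc < len(stream):
        src, dest = stream[pc]
        pc += 1
        if src == CONSTANT_SOURCE:
            # The fetch unit puts the next stream entry on the bus as data
            # and skips over it instead of decoding it as a move.
            value = stream[pc]
            pc += 1
        else:
            value = read_port(src)
        transfers.append((dest, value))
    return transfers


# Hypothetical stream: load the constant 1234 into port 9, then move 8 -> 16.
stream = [(CONSTANT_SOURCE, 9), 1234, (8, 16)]
transfers = run(stream, read_port=lambda address: address * 10)
```

In this model the constant is never decoded as a move, which is the property that matters when instruction word lengths differ between implementations.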
Instructions and encoding
The F-TTA0 will not use traditional instructions. As a TTA, it will
have only one instruction: move. "Instructions" will consist of four
overhead bits and two six-bit addresses: source and destination. This
will result in 16-bit move instructions, which will be grouped to allow
multiple move executions in a single cycle, making the F-TTA0 pseudo-VLIW.
The four overhead bits will be grouped into two conditional cancellation
bits and two operand size bits, covered below.
Instruction words can be any length, depending on the capabilities
of the control units. An F-TTA0 implementation requiring minimal die
space could even issue a single move per cycle, at the expense (obviously)
of very low performance.
The four high-order bits of every instruction word will be the
two conditional cancellation bits, immediately followed by the two
operand size bits. The F-TTA0 will have two conditional cancellation
units. The output of each unit will be ANDed with the respective
conditional cancellation bit in the instruction word. If the result
is 1, the instruction will be squashed. If it is 0, the instruction
will be executed. The operand size bits will reference one of four
special-purpose registers, which will determine the size of the operands
to be computed, up to the maximum capability of the implementation.
The control bits will be followed by the six-bit source operand, then
the six-bit destination operand. These operands will specify logical
addresses, which may (depending on the implementation) be translated by
the appropriate control unit into a physical bus address. These
addresses will specify specific ports on functional units; some ports
may be read-only or write-only. Read-only and write-only ports may
share addresses on the bus (e.g. a write-only integer operand register
may share an address with a read-only integer result register).
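As a rough illustration of the encoding described above, the following Python sketch packs and unpacks one 16-bit move. The function names are hypothetical, and the exact bit positions are an assumption drawn from the field order given in the text (cancellation bits highest, then operand size, then source, then destination).

```python
def encode_move(cancel, size, src, dest):
    """Pack one move into a 16-bit integer: [cancel:2][size:2][src:6][dest:6]."""
    if not (0 <= cancel < 4 and 0 <= size < 4):
        raise ValueError("cancel and size are two-bit fields")
    if not (0 <= src < 64 and 0 <= dest < 64):
        raise ValueError("source and destination are six-bit addresses")
    return (cancel << 14) | (size << 12) | (src << 6) | dest


def decode_move(word):
    """Unpack a 16-bit move into (cancel, size, src, dest)."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("a move is exactly 16 bits")
    return (word >> 14) & 0x3, (word >> 12) & 0x3, (word >> 6) & 0x3F, word & 0x3F


# Example: cancellation bits 01, size register 3, source 001000, dest 010000.
word = encode_move(cancel=0b01, size=0b11, src=0b001000, dest=0b010000)
print(hex(word))  # 0x7210
```

An instruction word would then simply be several such 16-bit moves placed side by side, one per bus.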
Logical addresses implemented in F-TTA0 will be reserved for
functional units as follows:
Logical address | Assigned to                     |
000000          | Constant as source, NOP as dest |
000001          | Invalid address                 |
000010-000111   | BC/reserved                     |
001000-001111   | RF                              |
010000-010011   | IB                              |
010100-010111   | RN                              |
011000-011011   | LS                              |
011100-100011   | Reserved for future use         |
100100-100111   | R0                              |
101000-101011   | R1                              |
101100-101111   | U0                              |
110000-110011   | U1                              |
111000-111011   | C0                              |
111100-111111   | C1                              |
Abbreviations are defined below.
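The address reservations above can be expressed as a simple range lookup. The Python sketch below is only an illustration of how a simulator or assembler might map a six-bit logical address to its unit; the names ADDRESS_MAP and unit_for are hypothetical, and addresses that the table does not list return None.

```python
# (low, high, unit) ranges copied from the logical address table.
ADDRESS_MAP = [
    (0b000000, 0b000000, "constant/NOP"),
    (0b000001, 0b000001, "invalid"),
    (0b000010, 0b000111, "BC/reserved"),
    (0b001000, 0b001111, "RF"),
    (0b010000, 0b010011, "IB"),
    (0b010100, 0b010111, "RN"),
    (0b011000, 0b011011, "LS"),
    (0b011100, 0b100011, "reserved"),
    (0b100100, 0b100111, "R0"),
    (0b101000, 0b101011, "R1"),
    (0b101100, 0b101111, "U0"),
    (0b110000, 0b110011, "U1"),
    (0b111000, 0b111011, "C0"),
    (0b111100, 0b111111, "C1"),
]


def unit_for(address):
    """Return the unit name that owns a six-bit logical address, or None."""
    if not 0 <= address < 64:
        raise ValueError("logical addresses are six bits wide")
    for low, high, name in ADDRESS_MAP:
        if low <= address <= high:
            return name
    return None  # address range not listed in the table


owner = unit_for(0b001010)
```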
Branching and conditional execution
The F-TTA0 will have two conditional cancellation units. Each will
have a single-bit output not present as a logical address on the bus,
and each will have three writable addresses: two data operands and one
control operand specifying the comparison to be made by the unit (a<b,
a<=0, etc). Whenever the output of a cancellation unit and the
corresponding cancellation bit in an instruction are both 1, the
instruction will be squashed. The program counter will also be
readable and writable on the bus, allowing branches to be implemented
as conditional or unconditional writes to the PC.
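The squash rule and its use for branching can be sketched as below. This is a hedged Python model: the mapping of bit 0 to unit 0 and bit 1 to unit 1 is an assumption, as is the idea that a non-taken branch simply falls through to pc + 1. The function names are hypothetical.

```python
def is_squashed(cancel_bits, unit_outputs):
    """Return True if a move must be squashed.

    cancel_bits:  the two cancellation bits from the instruction.
    unit_outputs: the two single-bit outputs of C0 and C1, packed the
                  same way (bit 0 = C0, bit 1 = C1; an assumed ordering).
    A move is squashed when any cancellation bit and the matching unit
    output are both 1.
    """
    return (cancel_bits & unit_outputs & 0b11) != 0


def next_pc(pc, target, cancel_bits, unit_outputs):
    """Model a branch as a (possibly squashed) write of target to the PC."""
    if is_squashed(cancel_bits, unit_outputs):
        return pc + 1  # write to the PC was cancelled; fall through
    return target


# Unconditional branch: no cancellation bits set, so it is never squashed.
unconditional = next_pc(pc=10, target=40, cancel_bits=0b00, unit_outputs=0b11)
# Conditional branch guarded by C0, whose output is currently 1: squashed.
conditional = next_pc(pc=10, target=40, cancel_bits=0b01, unit_outputs=0b01)
```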
A traditional branch predictor will not be implemented; instead, a
squash predictor may be implemented, using any algorithm. The
instruction fetch unit will monitor for writes to the PC and, if a
squash predictor is available, will use its prediction to
speculatively fetch instructions. The fetch unit will monitor the
conditional cancellation units when issuing a branch to the bus, and
will squash speculatively fetched instructions based on the branch
instruction word and the outputs of the conditional cancellation units.
The control unit may also take other optimistic measures based on
squash predictions of other conditional instructions, such as
preemptive actions on constants.
Control units
The F-TTA0 will have the following functional control units:
- BC: Bus Control unit
- RF: Special- and general-purpose Register File
- C0/C1: Conditional Cancellation units 0 and 1.
BC: Bus Control
The bus control unit is similar to the traditional "control unit" of
an OTA. It is the arbiter of all internal busses, but also performs
instruction fetch (and decode, to the limited extent that a TTA does
instruction decode), squash prediction, exception handling, power-on
BIST, etc. Note that the BC will NOT actually transfer data on the
bus, except as requested through its bus ports. It will simply send
the necessary control signals to other units, which will read from or
write to bus data lines as required.
The BC will have three ports: control, status, and immediate. The
control port will be write-only and will provide for any miscellaneous
control actions not controlled by special-purpose registers, as well
as requesting machine configuration and status information from the BC
unit. Status will provide machine status and configuration information,
will be read-only, and will be aliased to control. Immediate will
provide immediate data read from the instruction stream when requested
by an instruction. It will be read-write, but writes to this address
will be ignored (this will act as a redundant NOP).
RF: Register File
The register file will contain all registers, whether special-purpose
or general-purpose. The number of general-purpose registers will be
determined by the implementation, but will normally be 64. The
RF will have multiple input ports to specify internal registers; the
data written to these ports will specify special- or general-purpose
registers, and the specific register to read or write. Each
register specifier port will specify the register to be connected to
one bus port, both read and write. To accommodate larger
implementations, with as many as eight internal busses, the RF may
need to have a very large number of read and write ports. Since
implementing this tends to create a rather distended RF, the RF
may be split into segments, with each segment containing a subset
of the registers and read/write ports.
The RF will have eight ports assigned to it: four address/status
ports and four data ports. The address ports will be used to specify
the specific register to connect to a data port and will be write-only.
The data ports will allow for reading from and writing to specific
registers, and will be read-write. The status ports will be read-only
and will be aliased with address ports.
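A minimal Python model of the address/data port pairing might look like the following. It is a sketch under stated assumptions: special-purpose registers and the status ports are left out, each address port is paired with the data port of the same index, and the class and method names are hypothetical.

```python
class RegisterFile:
    """Toy register file with paired address and data ports."""

    def __init__(self, num_registers=64, num_port_pairs=4):
        self.registers = [0] * num_registers
        # selected[i] is the register currently connected to data port i.
        self.selected = [0] * num_port_pairs

    def write_address(self, port, register_index):
        """Write to an address port: choose the register behind a data port."""
        if not 0 <= register_index < len(self.registers):
            raise IndexError("no such register")
        self.selected[port] = register_index

    def write_data(self, port, value):
        """Write to a data port: store into the selected register."""
        self.registers[self.selected[port]] = value

    def read_data(self, port):
        """Read from a data port: fetch the selected register."""
        return self.registers[self.selected[port]]


rf = RegisterFile()
rf.write_address(0, 5)   # connect register 5 to data port 0
rf.write_data(0, 42)     # register 5 now holds 42
rf.write_address(1, 5)   # connect the same register to data port 1
value = rf.read_data(1)
```

Once an address port has been set, repeated reads or writes of the same register need only the data move, which keeps the selection overhead out of tight loops.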
C0/C1: Conditional Cancellation Units
Although these could also be considered execution units, since
they perform decision-making tasks, they also are heavily involved
in instruction flow and bus control, so they are considered to be
functional control units instead. Each will have five ports
allocated: operand (specifying the comparison to make), status
(giving software access to information about the unit), and three
data inputs. The data and operand ports will be write-only. The
status port will be read-only and will be aliased with the operand
port.
Execution units
The F-TTA0 will have the following functional execution units:
- IB: Integer/bitwise unit
- RN: Real-number unit
- LS: Load/Store unit
- U0: User-defined unit 0
- U1: User-defined unit 1
- R0: Re-programmable unit 0
- R1: Re-programmable unit 1
Integer/bitwise unit
This unit will perform all generic integer and bitwise operations:
add, shift, etc. It will have eight ports: opcode, three operand
ports, status, and three result ports. The opcode port will
specify the operation to perform on the operands and will be write-only.
The status port will provide information on the current status of the
IB, will be read-only, and will be aliased to the opcode port. The
three operand ports will be write-only and will provide data on which
to operate. The three result ports will provide the results of the
opcode, will be read-only, and will be aliased to the operand ports.
Real-number unit
This unit will perform computations on more complex data formats
such as floating-point representations. It will have the same ports
as the IB. It will be capable of standard IEEE representations of
floating-point data. It may also be capable of alternative
representations such as logarithmic number systems. When other number
systems are supported, the RN will support conversion between the
supported number systems.
Load/Store unit
The LS will perform all interactions with caches and memory. It will
have five ports: An opcode port, a status port, an address port,
a data port, and a result port. The opcode port will specify the
operation to perform: load, store, prefetch, cache block invalidate, etc.,
and will be write-only. The status port will provide information about
the status of the LS unit, will be read-only, and will be aliased with
the opcode port. The address port will be write-only, and will
specify the address of the operation to perform. The data port will
be write-only and will specify additional data (such as data to write
to the specified address for a store instruction). The result will
specify the result of the operation performed (such as the data
fetched from memory for a load operation), will be read-only, and will
be aliased to the data port.
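From the software side, a load or store then becomes a short sequence of moves to and from the LS ports. The Python sketch below models that sequence. It assumes that the write to the opcode port is what starts the operation, which the text does not specify, and it uses a dictionary in place of caches and memory. The class name and the opcode strings are hypothetical.

```python
class LoadStoreUnit:
    """Toy LS unit: address, data, and opcode are written; result is read."""

    def __init__(self):
        self.memory = {}   # stand-in for the cache and memory hierarchy
        self.address = 0
        self.data = 0
        self.result = 0

    def write_address(self, value):
        self.address = value

    def write_data(self, value):
        self.data = value

    def write_opcode(self, opcode):
        """Start an operation (assumed to be triggered by the opcode write)."""
        if opcode == "store":
            self.memory[self.address] = self.data
        elif opcode == "load":
            self.result = self.memory.get(self.address, 0)
        else:
            raise ValueError("unsupported opcode in this sketch")

    def read_result(self):
        return self.result


ls = LoadStoreUnit()
# Store 7 at address 0x100: three moves into the LS unit.
ls.write_address(0x100)
ls.write_data(7)
ls.write_opcode("store")
# Load it back: one opcode move, then one move off the result port.
ls.write_opcode("load")
loaded = ls.read_result()
```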
User-defined units
These units will be defined by a particular F-TTA0 implementation.
No guarantees of functionality or the ability to uniquely identify
one particular user-defined unit will be made. These will be for
implementations designed for a specific purpose, and any software
which uses one of these units should be designed in conjunction with
the implementation itself. The user-defined units will have four to
eight ports. Four addresses will be reserved for each unit for
reading, and four for writing. They may be read-only, write-only,
read-write, aliased or not, at the discretion of the designer. In
short, use these at your own risk and for your own benefit.
Re-programmable units
These units are general special-purpose units. They will consist
of re-programmable logic, to be used by any software. Each will have
five to eight ports: control and status, and three definable
addresses. The control port will allow for control and reprogramming
of the unit, and will be write-only. Status will provide information
about the status of the unit, will be read-only, and will be aliased
with the control port. The remaining three addresses assigned to
the re-programmable units will be usable at the discretion of the
software programming the unit.
Issues, comments, and analysis
- Functional execution units (FEUs) may support separate modes of
operation: blocking and streaming. In blocking mode, the FEU will keep
track of whether or not
a particular result is ready to be read, and whether or not it has
been read. Operations will be dispatched based on writes to the
input registers, but execution will block when data becomes ready, and
will not unblock until the data has been read. In streaming mode, the
FEU will continue to execute based on its current input data, possibly
clobbering output data if it is not read in time. Blocking or
streaming mode will be selected based on the data in the opcode
port. Blocking mode will minimize the hardware knowledge required
by the software; if you want to do an add, you just move your data
to the appropriate ports, then move your result off. If the result
is not ready by the time you issue your read, your thread will block
until it becomes ready. Streaming mode will minimize writes to
operand and control registers for repetitive operations such as MACs
and vector operations; once the first piece of data is ready, you
just move your new data into the operand ports and move your old
data off of the result ports.
- Many of these techniques are very conservative, so IPC
will probably be fairly low. However, with SMT, overall throughput
should be quite high. Any time there is a free bus cycle (for example,
a context blocks because an FEU does not have data ready, or there is
a NOP or squashed transaction), the BC can simply fill the gap with
operations from a different context.
- Busses will need the following lines, in addition to data:
  - Source address (six bits)
  - Destination address (six bits)
  - Context (lg <number of contexts> bits, probably no more than
    three or four for most initial implementations)
The first general-purpose implementation of F-TTA0 will probably
have a 64-bit (max) data path, and maybe eight or sixteen contexts (max)
on-chip, and possibly eight busses (max). This would mean that we
would need 8*(6+6+4+64)=640 lines for the busses, not including ground
wires and other control information. Yes, this is an awful lot. We'd
probably need to use a BiCMOS process, too, because these busses will
probably be awfully long. But, by doing this, we should be able to
achieve very good parallelism on multithreaded applications with
relatively few transistors. And if we put the busses on our top metal
layer (maybe even use copper, if we have the money or IBM or someone
else decides to be generous), we shouldn't have to interfere too much
with our other interconnects.
- FUs (control and execution both) are free to run at any
clock rate they want, independent of the bus. This should allow
for any superpipelining we want to do, and might allow many operations
to be completed in a single bus cycle. This may also simplify
design of the FUs (and the chip in general) by allowing more
distributed clock signals and decoupling FU clocking from bus clocking.
It is also possible that it could make more power optimizations
possible, by allowing FUs to use different voltage planes from the
bus signals or each other.
- We should provide two bits for control of instruction word
size. These could be stored in a special-purpose register, but could
use the usual four operand-size registers. By doing this, we
eliminate possible ugliness when it comes to compatibility, especially
since we're putting immediates in the instruction stream. An
application designed for an F-TTA implementation that has a word
length of eight moves wouldn't have to worry about a
single-instruction implementation grabbing an instruction as
immediate data. Software would also be able to change word length
as another possible optimization; if one piece of code is particularly
serial, the application could set the processor to use shorter word
lengths to avoid having to insert many NOPs, thus reducing code bloat.
- Optimizing compilers for TTAs are still in their infancy, so
we may not get performance as good as more traditional architectures
right away. However, software written in assembly language should
be very fast, and, by exposing more of the internal workings to
software, we should be able to get pretty darned good performance once
the compilers get good. We are going back to basic RISC concepts
(the whole "A smart compiler and a dumb machine" concept), so this
is probably a good path to follow.
- Assembly language gets pretty trivial with this architecture.
Everything is just <src>,<dest>;<src>,<dest>.
Conditional cancellation arguments can be prepended (e.g.
0<src>,<dest>;3<src>,<dest> for no
cancellation or cancellation on either unit, respectively) and
operand-size data can be appended (e.g.
<src>,<dest>-1;<src>,<dest>-2 for operand
size registers 1 and 2, respectively). We'd probably want to have
defaults for these so they wouldn't always be required; probably 0
for conditional cancellation and 3 for operand size.
- Register R0 is a usable general-purpose register. Since immediates are placed in the
instruction stream and are full-length, there is no need to do a
read-modify-write operation to zero a register. Simply move a constant
0 from the instruction stream to the register.
- For the same reasons as in the original proposed architecture,
F-TTA0 will have no condition codes.
- In order to speed up context switch, F-TTA0 will implement
all features previously discussed, including smooth register backup,
variable cache reservation, and priority loads and stores. High-
and low-priority loads and stores will also be available to software
as separate opcodes, in order to facilitate use of F-TTA0 in
embedded and real-time applications.
- A separate control bus, not visible to software, will be
implemented to handle exceptions and context switches (note that,
since F-TTA0 is SMT, a "context switch" is only required when
a context needs to be transferred to or from memory due to having
more contexts than can be stored on-chip; otherwise, all contexts
are always executing). This bus will make provision for any FU
to signal exception information (including the exception generated
and the context generating the exception) to all other FUs, so
that they may cancel execution, save state information, or call
an exception handler (in the case of the BC) as required. Note
that, because exception information is broadcast to all FUs and
F-TTA0 is a purely in-order processor, all exceptions are precise.
- In order to keep the BC instruction queue full even when
immediates are generated, the icache will need to be dual-ported.
Whenever an instruction queue starts to empty, the BC will simply
read two instruction words at a time from the icache until it is
again full. This will also help with SMT, since instructions may
be issued from more than one queue in a given bus cycle.
Furthermore, if the icache is already dual-ported, making the
dcache dual-ported should not be difficult from a design point of
view, and would speed up memory accesses, so the dcache will be
dual-ported as well.
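The assembly notation described in the list above (moves written as <src>,<dest>, separated by semicolons, with an optional cancellation digit in front and an optional -<n> operand-size suffix) can be parsed with a few lines of Python. This is only a sketch: it assumes source and destination are symbolic port names that start with a letter, so that a leading digit is unambiguous, and the port names used in the example are hypothetical. The defaults follow the ones suggested in the text (0 for cancellation, 3 for operand size).

```python
import re

# [cancel digit]<src>,<dest>[-size digit]; names must start with a letter
# or underscore so a leading 0-3 can only be the cancellation field.
MOVE_PATTERN = re.compile(
    r"^(?P<cancel>[0-3])?"
    r"(?P<src>[A-Za-z_]\w*),"
    r"(?P<dest>[A-Za-z_]\w*)"
    r"(?:-(?P<size>[0-3]))?$"
)

DEFAULT_CANCEL = 0  # no cancellation
DEFAULT_SIZE = 3    # operand-size register 3


def parse_word(text):
    """Parse one instruction word into a list of (cancel, src, dest, size)."""
    moves = []
    for field in text.split(";"):
        match = MOVE_PATTERN.match(field.strip())
        if match is None:
            raise ValueError(f"cannot parse move: {field!r}")
        cancel = match["cancel"]
        size = match["size"]
        moves.append((
            int(cancel) if cancel is not None else DEFAULT_CANCEL,
            match["src"],
            match["dest"],
            int(size) if size is not None else DEFAULT_SIZE,
        ))
    return moves


# Two moves in one word: the first uses the defaults, the second is
# cancelled by either unit and uses operand-size register 1.
parsed = parse_word("rf_d0,ib_a;3ib_r,rf_d1-1")
```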