General
HyperTransport provides a point-to-point interconnect that can be extended to support a wide range of devices. Figure 2-1 on page 21 illustrates a sample HT system with four internal links. HyperTransport provides a high-speed, high-performance, point-to-point dual simplex link for interconnecting IC components on a PCB. Data is transmitted from one device to another across the link.
Figure 2-1. Example HyperTransport System
Both the width of the link and the clock frequency at which data is transferred are scalable:
Link width ranges from 2 bits to 32-bits
Clock Frequency ranges from 200MHz to 800MHz (and 1GHz in the future)
This scalability allows for a wide range of link performance and potential applications with bandwidths ranging from 200MB/s to 12.8GB/s.
At the current revision of the spec, 1.04, there is no support for connectors, implying that all HyperTransport (HT) devices are soldered onto the motherboard. HyperTransport is technically an "inside-the-box" bus. In reality, connectors have been designed for systems that require board-to-board connections, and where analyzer interfaces are desired for debug.
Once again referring to Figure 2-1, the HT bus has been extended in the sample system via a series of devices known as tunnels. A tunnel is merely an HT device that performs some function, but in addition it contains a second HT interface that permits the connection of another HT device. In Figure 2-1, the tunnel devices provide connections to other I/O buses:
Infiniband
PCI-X
Ethernet
The end device is termed a cave, which always represents the termination of a chain of devices that all reside on the same HT bus. Cave devices include a function, but no additional HT connection. The series of devices that comprise an HT bus is sometimes simply referred to as an HT chain.
Additional HT buses (i.e. chains) may be implemented in a given system by using an HT-to-HT bridge. In this way, a fabric of HT devices may be implemented. Refer to the section entitled "Extending the Topology" on page 33 for additional detail.
Transfer Types Supported
HT supports two types of addressing semantics:
legacy PC, address-based semantics
messaging semantics common to networking environments
The first part of this book discusses the address-based semantics common to compatible PC implementations. Message-passing semantics are discussed in Chapter 19, entitled "Networking Extensions Overview," on page 443.
Address-Based Semantics
The HT bus was initially implemented as a PC compatible solution that by definition uses address-based semantics. This includes a 40-bit, or 1 Terabyte (TB), address space. Transactions specify locations within this address space that are to be read from or written to. The address space is divided into blocks that are allocated for particular functions, listed in Figure 2-2 on page 23.
HT Address Map
HyperTransport does not contain dedicated I/O address space. Instead, CPU I/O space is mapped to a high memory address range (FD_FC00_0000h—FD_FDFF_FFFFh). Each HyperTransport device is configured at initialization time by the boot ROM configuration software to respond to one or more ranges of memory address space. The devices are assigned addresses via the base address registers contained in the configuration register header. Note that these registers are based on the PCI configuration registers, and are also mapped to memory space (FD_FE00_0000h—FD_FFFF_FFFFh). Unlike the PCI bus, there is no dedicated configuration address space.
Read and write request command packets contain a 40-bit address, Addr[39:2]. Additional memory address ranges are used for interrupt signaling and system management messages. Details regarding the use of each range of address space are discussed in subsequent chapters that cover the related topics. For example, a detailed discussion of the configuration address space can be found in Chapter 13, entitled "Device Configuration," on page 305.
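As a concrete illustration of the address map just described, the short Python sketch below classifies a 40-bit address against the two ranges quoted above. The function and constant names are ours, purely for illustration; only the range values come from the text.

    # Illustrative sketch: classify a 40-bit HT address against the ranges above.
    IO_RANGE     = (0xFD_FC00_0000, 0xFD_FDFF_FFFF)   # CPU I/O space
    CONFIG_RANGE = (0xFD_FE00_0000, 0xFD_FFFF_FFFF)   # configuration space

    def classify_address(addr: int) -> str:
        """Return the region of the 40-bit HT address map that 'addr' falls in."""
        if not 0 <= addr < (1 << 40):
            raise ValueError("HyperTransport addresses are 40 bits wide")
        if IO_RANGE[0] <= addr <= IO_RANGE[1]:
            return "CPU I/O space"
        if CONFIG_RANGE[0] <= addr <= CONFIG_RANGE[1]:
            return "configuration space"
        return "memory or other mapped range"

    print(classify_address(0xFD_FC00_0100))   # -> CPU I/O space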
Data Transfer Type and Transaction Flow
The HT architecture supports several methods of data transfer between devices, including:
Programmed I/O
DMA
Peer-to-peer
Each method is illustrated and described below. An overview of packet types and transactions is discussed later in this chapter.
Programmed I/O Transfers
Transfers that originate as a result of executing code on the host CPU are called programmed I/O transfers. For example, a device driver for a given HT device might execute a read transaction to check the device's status. Transactions initiated by the CPU are forwarded to the HT bus via the Host HT Bridge as illustrated in Figure 2-3. The example transaction is a write that is posted by the host bridge; thus no response is returned from the target device. Non-posted operations of course require a response.
Transaction Flow During Programmed I/O Operation
DMA Transfers
HT devices may wish to perform a direct memory access (DMA) by simply initiating a read or write transfer. Figure 2-4 illustrates a master performing a DMA read operation from main DRAM. In this example, a response is required to return data back to the source HT device.
Transaction Flow During DMA Operation
Peer-to-Peer Transfers
A peer-to-peer transfer begins when a device issues a request to read data from a target device residing on the same bus. Note that even though the target device resides on the same bus, it ignores the request moving in the upstream direction (toward the host processor). When the request reaches the upstream bridge, it is turned around and sent in the downstream direction toward the target device. This time the target device detects the request and returns the requested data in a response packet.
Peer-to-Peer Transaction Flow
The peer-to-peer transfer does not occur directly between the requesting and responding devices as might be expected. Rather, the upstream bridge is involved in handling both the request and response to ensure that the transaction ordering requirements are managed correctly. This requirement exists to support PCI-compliant ordering. True, or direct, peer-to-peer transfers are supported when PCI ordering is not required, as defined by the networking extensions. See Chapter 19, entitled "Networking Extensions Overview," on page 443 for details.
HT Signals
The HT signals can be grouped into two broad categories:
The link signal group — used to transfer packets in both directions (High-Speed Signals).
The support signal group — provides required resources such as power and reset, as well as other signals to support optional features such as power management (Low-Speed Signals).
Primary HT Signal Groups
Link Packet Transfer Signals
The high-speed signals used for packet transfer in both directions across an HT link include:
CAD (command, address, data). Multiplexed signals that carry control packets (request, response, information) and data packets. Note that the width of the CAD bus is scalable from 2-bits to 32-bits. (See "Scalable Performance" on page 30.)
CLK (clock). Source-synchronous clock for CAD and CTL signals. A separate clock signal is required for each byte lane supported by the link. Thus, the number of CLK signals required is directly proportional to the number of bytes that can be transferred across the link at one time.
CTL (control). Indicates whether a control packet or data packet is currently being delivered via the CAD signals.
The figure entitled "Link Signals Used to Transfer Packets" illustrates these signals and the various data bus widths supported. The variables "n" and "m" define the scaling option implemented. Refer to "Link Initialization" on page 282 for details regarding HT data width and clock speed scaling.
Link Signals Used to Transfer Packets
Link Support Signals
The low-speed link support signals consist of power- and initialization-related signals and power management signals. Power- and initialization-related signals include:
VLDT & Ground — The 1.2 volt supply that powers HT drivers and receivers
PWROK — Indicates to devices residing in the HT fabric that power and clock are stable.
RESET# — Used to reset and initialize the HT interface within devices and perhaps their internal logic (device specific).
Power management signals
LDTREQ# — Requests re-enabling links for normal operation.
LDTSTOP# — Enables and disables links during system state transitions.
Link Support Signals
Scalable Performance
The width of the transmit and receive portion of the link (CAD signals) may be different. For example, devices that typically send most of their data to main memory (upstream) and receive limited data from the host can implement a wide path in the high performance direction and narrow path for traffic in the lesser used direction, thereby reducing cost.
The HyperTransport link combines the advantages of both serial and parallel bus architectures. HT provides options for the number of data paths implemented and for the clock rate at which data is transferred (see "Scalable Link Width and Speeds" on page 30); thus, providing scalable link performance ranging from 0.2GB/s to 12.8GB/s. This scalability is helpful to system designers. For example:
An implementation that needs all the available bandwidth (e.g. system chipsets), can use wide links (up to 32 bits), running at the highest clock frequencies (up to 800MHz now and 1GHz in the future).
Implementations that don't require high bandwidth but do require low power may use narrow links (as few as 2 bits) and lower frequencies (down to 200MHz).
Scalable Link Width and Speeds
HyperTransport lends itself to scaling well because:
The high frequency bus translates to fewer pins required to transfer a specific amount of data. The same protocol is used regardless of link width.
Differential signaling results in a very low current path to ground, thereby reducing the number of power and ground pins required for devices.
Each additional byte lane added has its own source synchronous clock.
HT's implementation of ACPI compliant power management and interrupt signaling is message based, reducing pin count. Note that only two additional signals, LDTSTOP# and LDTREQ#, are required for managing power.
Data Widths
HT provides scalable data paths with link widths of 2-, 4-, 8-, 16-, or 32-bits wide in each direction, as pictured in Figure 2-10 on page 31. The link width used immediately following reset is restricted to no wider than 8 bits. Later during software initialization, configuration software determines the maximum link width that can be supported in each direction and configures both devices to use the maximum width supported for each direction. See "Tuning the Link Width (Firmware Initialization)" on page 295 for details.
Link Widths Supported
Table 2-1. Signals Used for Different Link Widths

    Pin Names                Number of Pins per Link Width
                             2-bit   4-bit   8-bit   16-bit   32-bit
    Data Pins (CAD)              8      16      32       64      128
    Clock Pins (CLK)             4       4       4        8       16
    Control Pins (CTL)           4       4       4        4        4
    LDTSTOP#/LDTREQ#             2       2       2        2        2
    RESET#                       1       1       1        1        1
    PWROK                        1       1       1        1        1
    VHT                          2       2       3        6       10
    GND                          4       6      10       19       37
    Total Pins                  26      36      57      105      199
As mentioned earlier, asymmetrical link widths are allowed in HyperTransport. For example, devices that typically send the bulk of their data in one direction and receive limited data in the other direction can save on cost by implementing a wide path in the high bandwidth direction and a narrow path for traffic in the low bandwidth direction. Note that the HyperTransport protocol doesn't change with link width. Packet formats remain the same, although it will obviously require more bit times to shift out a 32 bit value on a 2-bit link vs. a 32-bit link (16 bit times vs. 1 bit time).
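To make the bit-time relationship concrete, the hypothetical helper below computes how many bit times a value of a given size occupies on a given CAD width; it reproduces the 16-versus-1 comparison above. The function name is ours.

    # Sketch: bit times needed to move a given number of bytes across a given CAD width.
    def bit_times(packet_bytes: int, link_width_bits: int) -> int:
        total_bits = packet_bytes * 8
        return -(-total_bits // link_width_bits)   # ceiling division

    for width in (2, 4, 8, 16, 32):
        print(f"{width:>2}-bit link: {bit_times(4, width)} bit times for a 4-byte value")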
Clock Speeds
HyperTransport clock speeds currently supported are 200MHz, 300MHz, 400MHz, 500MHz, 600MHz, and 800MHz. Note that 700MHz is not supported. Both the rising and falling edges of the clock are used to clock data; this mechanism is referred to as double data rate (DDR) clocking. DDR clocking translates to an effective transfer rate that is double the actual clock frequency. In addition, because each link is dual simplex, the aggregate bandwidth in both directions is four times that implied by the clock rate alone (2x for DDR, 2x for the two directions).
Table 2-2 shows the bandwidth numbers based on symmetrical links for selected combinations of clock frequency and link width. For example, consider the bandwidth in GigaBytes/second for a 32-bit link operating at 800MHz:
800MHz clock with DDR = an effective transfer rate of 1.6 GTransfers/s
1.6 GTransfers/s x 4 bytes = 6.4GB/s in each direction
6.4GB/s in each of the two directions = 12.8GB/s aggregate
Table 2-2. Maximum Bandwidth Based on Various Speeds and Link Widths

    Link Width (bits)      Bandwidth per Link (in GBytes/sec)
                           800MHz     400MHz     200MHz
     2                        0.8        0.4        0.2
     4                        1.6        0.8        0.4
     8                        3.2        1.6        0.8
    16                        6.4        3.2        1.6
    32                       12.8        6.4        3.2
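The table values follow directly from the arithmetic shown above. The short sketch below (ours, not from the specification) regenerates them: clock rate x 2 (DDR) x link width in bytes x 2 (dual simplex, both directions).

    # Regenerate the Table 2-2 figures (aggregate bandwidth, both directions).
    def link_bandwidth_gbps(clock_mhz: int, width_bits: int) -> float:
        transfers_per_sec = clock_mhz * 1e6 * 2            # DDR: both clock edges
        return transfers_per_sec * (width_bits / 8) * 2 / 1e9

    for width in (2, 4, 8, 16, 32):
        row = "  ".join(f"{link_bandwidth_gbps(mhz, width):5.1f}" for mhz in (800, 400, 200))
        print(f"{width:>2} bits: {row}  GB/s")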
Extending the Topology
Based on point-to-point links, a HyperTransport chain may be extended into a fabric, using single and multi-link devices together. Devices defined for HT include:
Single HT link "cave" devices used to implement a peripheral function
Single- or multi-link bridges (HT-to-HT, or HT to one or more other protocols such as PCI, PCI-X, AGP, or InfiniBand)
Multi-link Tunnel devices used to implement a function and extend a link to a neighboring device downstream, thus creating a chain
These devices are the basic building blocks for the HT fabric.
Basic HT Device Types
The figure entitled "HyperTransport Topology Supporting All Three Major Device Types" exemplifies a HyperTransport topology that includes all three device types previously discussed. The basic difference between an HT-to-HT bridge and a tunnel device is:
A bridge creates a new link (with its own bus number), and acts as a HyperTransport host bridge for each secondary link.
A tunnel buffers signals, passes packets, but merely extends an existing link to another device. It is not a host, and the bus number is the same on both sides of the tunnel. It also implements an internal function of its own, which a bridge typically would not.
HyperTransport Topology Supporting All Three Major Device Types
Packetized Transfers
Transactions are constructed out of combinations of various packet types and carry the commands, address, and data associated with each transaction. Packets are organized in multiples of 4-byte blocks. If the link uses data paths that are narrower than 32 bits, successive bit-times are added to complete the packet transfer on an aligned 4-byte boundary. The primary packet types include:
Control Packets — used to manage various HT features, initiate transactions, and respond to transactions
Data packets — that carry the payload associated with a control packet (maximum payload is 64 bytes).
As illustrated in Figure 2-13, the control (CTL) signal differentiates control packets from data packets on the bus.
Distinguishing Control from Data Packets
For every group of 8 bits (or less) within the CAD path, there is a CLK signal. These groups of signals are transmitted source synchronously with the associated CLK signal. Source synchronous clocking requires that CLK and its associated group of CAD signals must all be routed with equal length traces in order to minimize skew between the signals.
Control Packets
Control packets manage various HT features, initiate transactions, and respond to transactions as listed below:
Information packets
Request packets
Response packets
Information packet (4 bytes)
Information packets are exchanged between the two devices on a link. They are used by the two devices to synchronize the link, convey a serious error condition using the Sync Flood mechanism, and to update flow control buffer availability dynamically (using tags in NOP packets). The information packets are:
NOP
Sync/Error
Request packet (4 or 8 bytes)
Request packets initiate HT transactions and special functions. The request packets include:
Sized Write (Posted)
Broadcast Message
Sized Write (non-posted)
Sized Read
Flush
Fence
Atomic Read-Modify-Write
Response packet (4 bytes)
Response packets are used in HT split-transactions to reply to a previous request. The response may be a Read Response with data, or simply a Target Done Response confirming a non-posted write has reached its destination.
Data Packets
Some Request/Response command packets have data associated with them. Data packet structure varies with the command which caused it:
Sized Dword Read Response or Write data packets are 1-16 dwords (4-64 bytes)
Sized Byte Read Response data packets are 1 dword (any byte combination valid)
Sized Byte Write data packets are 0-32 bytes (any byte combination valid)
Read-Modify-Write
HyperTransport Protocol Concepts
Channels and Streams
In HyperTransport, as in other protocols, ordering rules are needed for reads, posted/non-posted writes, and responses returning from earlier requests. In a point-to-point fabric, all of these occur over the same link. In addition, transactions from different devices also merge onto the same links. HyperTransport implements Virtual Channels and I/O Streams to differentiate a device's posted requests, non-posted requests, and responses from each other and from those originating from different sources.
Virtual Channels
HyperTransport defines a set of three required virtual channels that dictate transaction management and ordering:
Posted Requests — Posted write transactions belong to this channel.
Non-Posted Requests — Reads, non-posted writes, and flushes belong to this channel.
Responses — Read responses and target done packets belong to this channel.
An additional set of Posted, Non-Posted, and Response virtual channels is required for isochronous transactions, if supported. This dedicated set of virtual channels assists in guaranteeing the bandwidth required by isochronous transactions.
When packets are sent over a link, they are sent in one of the virtual channels. Attribute bits in the packets tag them as to which channel they should travel. Each device is responsible for maintaining queues and buffers for managing the virtual channels and enforcing ordering rules.
Each device implements separate command/data buffers for each of the 3 required virtual channels as pictured in Figure 2-14 on page 38. Doing so ensures that transactions moving in one virtual channel do not block transactions moving in another virtual channel. There are I/O ordering rules covering interactions between the three virtual channels of the same I/O stream. Transactions in different I/O streams have no ordering rules (with the exception of ordering rules associated with Fence requests). Enforcing ordering rules between transactions in the same I/O stream prevents deadlocks from occurring and guarantees data is transferred correctly. Based on these ordering requirements, nodes may not:
Make accepting a request dependent on the ability of that node to issue an outgoing request.
Make accepting a request dependent on the receipt of a response due to a request previously issued by that node.
Make issuing a response dependent on the ability to issue a request.
Make issuing a response dependent upon receipt of a response due to a previous request.
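A minimal sketch of the channel structure is shown below, assuming illustrative names of our own: each command type maps to one of the three required virtual channels, and each channel gets its own queue so that traffic in one channel cannot block another.

    # Hypothetical model of the three required virtual channels (names are ours).
    from enum import Enum
    from collections import deque

    class VirtualChannel(Enum):
        POSTED = "Posted Requests"          # posted writes
        NON_POSTED = "Non-Posted Requests"  # reads, non-posted writes, flushes
        RESPONSE = "Responses"              # read responses, target done

    COMMAND_TO_VC = {
        "PostedSizedWrite": VirtualChannel.POSTED,
        "SizedRead":        VirtualChannel.NON_POSTED,
        "NonPostedWrite":   VirtualChannel.NON_POSTED,
        "Flush":            VirtualChannel.NON_POSTED,
        "ReadResponse":     VirtualChannel.RESPONSE,
        "TargetDone":       VirtualChannel.RESPONSE,
    }

    # One queue per channel, so one channel never blocks another.
    vc_queues = {vc: deque() for vc in VirtualChannel}

    def enqueue(command: str, packet) -> None:
        vc_queues[COMMAND_TO_VC[command]].append(packet)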
I/O Streams
In addition to virtual channels, HyperTransport also defines I/O streams. An I/O stream consists of the requests, responses, and data associated with a particular UnitID and HyperTransport link. Ordering rules require that I/O streams be treated independently from each other. When a request/response packet is sent, it is tagged with sender attributes (UnitID, Source Tag, and Sequence ID) that are used by other devices to identify the transaction stream in use, and the required ordering within it. Entries within the virtual channel buffers include the transaction stream identifiers (attributes).
Used properly, the independent I/O streams create the effect of separate connections between the devices and the host bridge above them, much as a shared bus connection appears to provide.
Transactions (Requests, Responses, and Data)
Transfers initiated by HT devices require one or more transactions to complete. These devices may need to perform a variety of operations that include:
sending or forwarding data (write)
requesting that a target return data to it (read)
performing an atomic read/modify/write operation
exercising additional control over the ordering of its posted transactions (using Flush and Fence commands)
broadcasting a message to all downstream agents (done by bridges only)
The format of these transactions also varies depending on the type of operation (request) specified, as listed below:
Requests that behave like reads and that require a read response and data (i.e., Sized Read, Atomic RMW)
Requests that behave like writes, and require a target done response to confirm completion (i.e. Non-posted Sized Writes)
Posted Requests that behave like writes but don't require any target response or data. (i.e. Posted Sized Writes, Broadcast Message, or Fence)
Transaction Requests
Every transaction begins with the transmission of a Request Packet. Note that the actual format of a request packet varies depending on the particular request, but in general each request contains the following information:
Target address within HyperTransport memory space
The request type (command)
Sender's transaction stream ID (UnitID, SeqID)
The amount of data to be transferred (if any)
Other attributes: virtual channel to use, etc.
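The attributes listed above can be pictured as a simple record. The sketch below is only a mnemonic for the fields a request carries; the names are ours, and the actual packet bit layout is covered in later chapters.

    # Illustrative record of request attributes (not the packet bit layout).
    from dataclasses import dataclass

    @dataclass
    class HTRequest:
        command: str     # request type, e.g. "SizedRead" or "PostedSizedWrite"
        address: int     # target address within the 40-bit HT address space
        unit_id: int     # identifies the sender's transaction stream
        seq_id: int      # non-zero value groups requests into an ordered sequence
        count: int       # amount of data to transfer, if any
        posted: bool     # selects the posted vs. non-posted virtual channel

    req = HTRequest(command="SizedRead", address=0x1000_0000,
                    unit_id=3, seq_id=0, count=16, posted=False)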
HT defines seven basic request types. The characteristics of each request type are discussed in the following sections.
Transaction Responses
Responses are generated by the target device in cases where data is to be returned from the target device, or when confirmation of transaction completion is required. Specifically, in HyperTransport, a response follows all non-posted requests. A target responds to:
Return data to satisfy an earlier read or Atomic Read-Modify Write (RMW) request
Confirm the arrival of non-posted write data
Confirm the completion of a Flush operation
Report errors
The information in a response varies both with the Request that causes it, and with the direction the response is traveling in the HyperTransport fabric. However, content of an HT response generally includes:
Response type (command)
Response direction (upstream or downstream)
Transaction stream (UnitID, Source Tag)
Misc. info: virtual channel to use, error, etc.
Transaction Types
As discussed earlier, HT defines seven basic transaction types. This section introduces the characteristics of each type and defines any sub-types that exist.
Sized Read Transactions
Sized Read transactions permit remote access to a device memory or memory-mapped I/O (MMIO) address space. The operation may be initiated on HT from the host bridge (PIO operation), or an HT device may wish to read data from memory (DMA operation) or from another HT device (peer-to-peer operation). Two types of Sized Read transactions define the different quantities of data to be read.
Sized (Byte) Read — this request defines an aligned 4 byte block of address space from which 0 to 4 bytes can be read. Any single byte location or any group of bytes within the 4 byte block can be accessed. The typical use of this transaction is for reading MMIO registers.
Sized (DW) Read — this request identifies an aligned 64 byte block of address space from which 4-64 bytes can be read. Any contiguous group of aligned 4-byte groups (DWs) can be accessed.
The protocol associated with Sized Read transactions is illustrated in Figure 2-15 on page 41. These transactions begin with the delivery of a Sized Read Request packet and complete when the target device returns a corresponding response packet followed by data.
Figure 2-15. Example Protocol — Receiving Data from Target
The basic rules for maintaining high performance of HT reads include:
For reads, the requester won't issue the request until it has buffers available to receive all requested data without wait states.
The requester won't issue the request until it knows the target has room in its transaction queue to accept it (Flow Control)
Upon receiving the read request, the target won't issue the read response until it has all requested data and status available to send. Once it starts the response, there will be no wait states until the read response packet and all data (up to 16 dwords) have been sent.
Upon receiving the response, the requester will check the error bits to make certain the data is valid.
The target and any bridges in the path de-allocate buffers and queue entries as soon as the response has been sent.
Sized Write Transactions
Sized Write transactions permit the host bridge (PIO operation) to send data to a HyperTransport device, or permits a HyperTransport device to send data to memory (DMA operation) or to another device (Peer-to-peer operation). Two types of Sized Write requests permit different sizes of memory or MMIO space to be accessed.
Sized (Byte) Write — this request identifies an aligned block of 32 bytes of address space into which data is to be written. The amount of data to be written can be from 0 to 32 bytes. Note that the maximum transfer size of 32 bytes only occurs if the start address is 32 byte aligned. If the start address is not on a 32-byte boundary, the transfer will be less than 32 bytes. Furthermore, no Byte Write transaction crosses a 32 byte address boundary. Any combination of bytes (need not be contiguous) can be written from the start address to the next aligned 32 byte block of address space.
Sized (DW) Write — this request identifies an aligned block of 64 bytes of address space into which data can be written. The start address must be aligned on a 4-byte boundary, and data to be written is always aligned in contiguous 4-byte groups (DWs). The amount of data written can be from 1 to 16 DWs.
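The 32-byte boundary rule for Sized (Byte) Writes can be expressed as a one-line calculation. The helper below is ours and illustrates only the rule stated above: a single request never crosses an aligned 32-byte boundary, so the largest possible transfer from a given start address runs only to the end of its 32-byte block.

    # Largest legal Sized (Byte) Write from a given start address.
    def max_byte_write_len(start_address: int) -> int:
        return 32 - (start_address % 32)

    print(max_byte_write_len(0x1000))   # 32: start address is 32-byte aligned
    print(max_byte_write_len(0x100C))   # 20: transfer stops at the next 32-byte boundary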
Non-Posted Sized Writes
The packet protocol associated with Sized Write transactions depends on whether the Sized Write is posted or not. Figure 2-16 on page 43 illustrates the case of a non-posted Sized Write. This diagram illustrates the basic HT split-transaction request-target done response sequence.
The basic rules for maintaining high performance in HT writes include:
The requester won't issue the non-posted write request until it knows the target can accommodate the request and all of the data to be sent. Refer to the section on Flow Control to see how this is managed for writes.
Upon receiving the write request and data, the target won't issue the target done response until it has properly delivered all data. Once it starts the response, there will be no wait states until the four bytes of the target done response packet have been sent.
Upon receiving the response, the requester will check the error bits to make certain delivery is complete.
The target and any bridges in the path de-allocate request queue entries as soon as the target done response has been sent.
Posted Sized Writes
In both cases the transaction begins with the Sized Write request followed by the data. Non-posted operations include a response packet that is delivered back to the requester as verification that the operation has completed, whereas posted writes end once the data is sent.
Flush
Flush is useful in cases where a device must be certain that its posted writes are "visible" in host memory before it takes subsequent action. Flush is an upstream, non-posted "dummy" read command that pushes all posted requests ahead of it to memory. Note that only previously posted writes within the same transaction stream as the Flush transaction need be flushed to memory. When an intermediate bridge receives a Flush transaction, it generates whatever Sized Write transactions are necessary to forward all data in its upstream posted-write buffer toward the host bridge. Ultimately, the host bridge receives the command and flushes the previously posted writes to memory. Receipt of the response from the host bridge is confirmation that the flush operation has completed.
The protocol used when performing a Flush transaction is depicted in the figure below. When the Flush request reaches the host bridge, it completes previously posted writes to memory. In this example, two previously posted writes are flushed to memory, after which the Target Done (TgtDone) response is returned to the requester.
Example Protocol — Flush Transaction
Fence
Fence is designed to provide a barrier between posted writes, which applies across all UnitIDs and therefore across all I/O streams and all virtual channels. Thus, the fence command is global because it applies to all I/O streams. The Fence command goes in the posted request virtual channel and has no response. The behavior of a Fence is as follows:
The PassPW bit must be clear so that the Fence pushes all requests in the posted channel ahead of it.
Packets with their PassPW bit clear will not pass a Fence regardless of UnitID.
Packets with their PassPW bit set may pass a Fence.
A nonposted request with PassPW clear will not pass a Fence as it is forwarded through the chain, but it may do so after it reaches a host bridge.
Fence requests are never issued as part of an ordered sequence, so their SeqID will always be 0. Fence requests with PassPW set, or with a nonzero SeqID, are legal, but may have an unpredictable effect. Fence is only issued from a device to a host bridge or from one host bridge to another. Devices are never the target of a fence so they do not need to perform the intended function. If a device at the end of the chain receives a fence, it must decode it properly to maintain proper operation of the flow control buffers. The device should then drop it.
Example Protocol — Fence Transaction
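Reduced to code, the passing rules above hinge on the PassPW bit. The sketch below is a deliberately narrow illustration, not a complete ordering model; it checks only whether a later packet may be reordered ahead of a Fence in the chain.

    # Only packets with PassPW set may pass a Fence, regardless of UnitID.
    def may_pass_fence(later_packet_pass_pw: bool) -> bool:
        return later_packet_pass_pw

    assert may_pass_fence(True) is True     # PassPW set: may pass the Fence
    assert may_pass_fence(False) is False   # PassPW clear: stays behind the Fence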
Atomic
Atomic Read-Modify-Write (ARMW) is used so that a memory location may be read, (evaluated and) modified, then conditionally written back, all without a race condition arising from another device attempting the same operation at the same time. HT defines two types of Atomic operation:
Fetch & Add
Compare & Swap
The protocol associated with an Atomic transaction is shown in Figure 2-20 on page 46. The request is followed by a data packet that contains the argument of the atomic operation. The target device performs the requested operation and returns the original data read from the target location.
Example Protocol — Atomic Operation
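The behavior of the two Atomic variants can be sketched with ordinary memory operations. The functions below are ours, with a simple dictionary standing in for target memory; they show only the semantics the text describes (read, modify, conditionally write back, and return the original data), not the HT packet encoding.

    # Semantics of the two Atomic RMW variants; a dict stands in for target memory.
    def fetch_and_add(memory: dict, addr: int, operand: int) -> int:
        original = memory[addr]
        memory[addr] = original + operand
        return original                     # the response carries the original data

    def compare_and_swap(memory: dict, addr: int, compare: int, swap: int) -> int:
        original = memory[addr]
        if original == compare:
            memory[addr] = swap             # written back only if the comparison matches
        return original

    mem = {0x100: 7}
    print(fetch_and_add(mem, 0x100, 1))         # 7; memory now holds 8
    print(compare_and_swap(mem, 0x100, 8, 0))   # 8; memory now holds 0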
Broadcast
Broadcast Message requests are sent downstream by host bridges, and are used to send messages to all devices. They are accepted and forwarded by all agents onto all links.
In a Broadcast transaction, the broadcast request works its way down the HT fabric. All devices recognize the Broadcast Message request type and the reserved address, accept the message, and pass it along. Examples of Broadcast Messages include Halt, Shutdown, and the End-Of-Interrupt (EOI) message.
Managing the Links
This section introduces a collection of miscellaneous topics that we have labeled Link Management. They include:
Flow Control
Initialization and Reset
Configuration
Error Detection and Handling
Each of these topics is discussed in the following sections.
Flow Control
Other than information packets, all packets are transmitted from a transmitter to a buffer in the receiver. The receiver buffer will overflow if the transmitter sends too many packets. Flow control ensures that the transmitter only sends as many packets to the receiver device as buffer space allows.
Information packets are not subject to flow control. They are not transmitted to buffers within a device. Devices are always ready to accept information packets (e.g. NOP packets). Only request packets, response packets and data packets are subject to flow control.
Flow control occurs across each link between the source and the ultimate target device. HyperTransport devices must implement the six types of buffers listed below as part of their receiver state machine. A designer implements buffers of appropriate size to meet bandwidth/performance requirements. The size of each buffer is conveyed to the transmitter during initialization, and available space is updated dynamically through NOP transmission.
HyperTransport requires transmitters on each link to accept NOP packets from receivers at reset that indicate virtual channel buffering capacity, and then to establish a packet coupon scheme that:
Guarantees no transmitter will send a packet that the receiver can't accept
Eliminates the need for inefficient disconnects and retries on the link.
Requires each receiver to dynamically inform the transmitter (via NOP packets) as buffer space becomes available.
With three virtual channels, there are three pairs of buffers in each receiver to handle the requests/responses and their associated data:
Posted Request Buffer
Posted Request Data Buffer
Non-Posted Request Buffer
Non-Posted Request Data Buffer
Response Buffer
Response Data Buffer
Buffer entries are sized according to what will be contained in them.
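A bare-bones coupon scheme along these lines is sketched below. The class and buffer names are ours; the point is only that the transmitter spends one credit per packet sent into a given buffer type and regains credits as NOP packets report freed entries.

    # Minimal credit ("coupon") tracker for one direction of a link.
    BUFFER_TYPES = ("PostedReq", "PostedData", "NonPostedReq",
                    "NonPostedData", "Response", "ResponseData")

    class LinkTransmitter:
        def __init__(self, initial_credits: dict):    # conveyed by NOPs after reset
            self.credits = dict(initial_credits)

        def can_send(self, buffer_type: str) -> bool:
            return self.credits[buffer_type] > 0

        def send(self, buffer_type: str) -> None:
            if not self.can_send(buffer_type):
                raise RuntimeError("would overflow the receiver buffer")
            self.credits[buffer_type] -= 1

        def nop_received(self, buffer_type: str, freed: int) -> None:
            self.credits[buffer_type] += freed        # receiver freed buffer entries

    tx = LinkTransmitter({name: 4 for name in BUFFER_TYPES})   # sizes are arbitrary here
    tx.send("NonPostedReq")
    tx.nop_received("NonPostedReq", 1)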
If a device supports the optional isochronous channels, it must implement additional flow control buffers to support them. An "Isoc" bit in request and response packets indicates how they should be routed. If the bit is set, all link devices that support isochronous channels will route the packets in them; others will pass isochronous packets along in the regular channels.
ISOC traffic is exempt from the fairness algorithm implemented for non-ISOC traffic, resulting in higher performance. Isochronous transactions are serviced by devices before non-isochronous traffic. Theoretically, isochronous traffic may result in starving non-isochronous traffic. Applications must guarantee that isochronous bandwidth does not exceed overall available bandwidth.
Initialization and Reset
HyperTransport defines two classes of reset events:
Cold Reset. This occurs on boot and starts when the PWROK and RESET# signals are both seen low. When this happens:
All devices and links return to default inactive state
Previously assigned UnitID numbers are "forgotten" and all return to default UnitID of 0.
All Configuration Space registers return to default state
All error bits and dynamic status bits are cleared
Warm Reset. This occurs when PWROK is high and RESET# is seen low.
All devices and links return to default inactive state
Previously assigned UnitID numbers are "forgotten", and all return to default UnitID of 0.
All Configuration Space registers defined as persistent retain previous values. The same is true for Status and error bits defined as persistent.
All other error bits and dynamic status bits are cleared
Because HyperTransport supports scalable link width and clock speed, a set of default minimum link capabilities are in effect following cold reset.
Initial link width is conveyed when both devices sample CAD signal inputs from the other at the end of reset. Initial link clock speed is 200MHz.
Later, configuration software optimizes the CAD width and clock speed for each link.
Refer to the core topic section on Reset and Initialization for details on this process.
It is the motherboard's responsibility to tie the upper CAD inputs to 0 if a device's receiver is attached to a narrower transmitter CAD interface.
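The width-tuning step can be summarized as taking the widest legal width that both ends of each direction support. The helper below is a simplification of what configuration firmware does, using only values from the text (legal widths of 2, 4, 8, 16, and 32 bits).

    # Pick the widest legal width supported by both ends of one link direction.
    LEGAL_WIDTHS = (2, 4, 8, 16, 32)

    def negotiated_width(tx_max_bits: int, rx_max_bits: int) -> int:
        usable = min(tx_max_bits, rx_max_bits)
        return max(w for w in LEGAL_WIDTHS if w <= usable)

    print(negotiated_width(32, 16))   # 16-bit path in this direction
    print(negotiated_width(8, 32))    # 8-bit path in the other direction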
Configuration
At boot time, PCI configuration is used to set up HyperTransport devices:
Read in configuration information about device requirements and capabilities.
Program the device with address range, error handling policy, etc.
Basic configuration of a device is similar to that of PCI devices; however, HyperTransport-specific features are handled via the advanced capability registers.
Error Detection and Handling
HyperTransport defines required and optional error detection and handling. Key areas of error handling include:
Cyclic Redundancy Check (CRC) generation and checking on each link
Protocol (violation) errors
Receive buffer overflow errors
End-Of-Chain errors
Chain Down errors
Response errors
Introduction to HyperTransport
Background: I/O Subsystem Bottlenecks
New I/O buses are typically developed in response to changing system requirements and to promote lower cost implementations. Current-generation I/O buses such as PCI are rapidly falling behind the capabilities of other system components such as processors and memory. Some of the reasons why the I/O bottlenecks are becoming more apparent are described below.
Server Or Desktop Computer: Three Subsystems
A server or desktop computer system is comprised of three major subsystems:
Processor (in servers, there may be more than one)
Main DRAM Memory. There are a number of different synchronous DRAM types, including SDRAM, DDR, and Rambus.
I/O (Input/Output devices). Generally, all components which are not processors or DRAM are lumped together in this subsystem group. This would include such things as graphics, mass storage, legacy hardware, and the buses required to support them: PCI, PCI-X, AGP, USB, IDE, etc.
CPU Speed Makes Other Subsystems Appear Slow
Because of improvements in CPU internal execution speed, processors are more demanding than ever when they access external resources such as memory and I/O. Each external read or write by the processor represents a huge performance hit compared to internal execution.
Multiple CPUs Aggravate The Problem
In systems with multiple CPUs, such as servers, the problem of accessing external devices becomes worse because of competition for access to system DRAM and the single set of I/O resources.
DRAM Memory Keeps Up Fairly Well
Although it is external to the processor(s), system DRAM memory keeps up fairly well with the increasing demands of CPUs for a couple of reasons. First, the performance penalty for accessing external memory is mitigated by the use of internal processor caches. Modern processors generally implement multiple levels of internal caches that run at the full CPU clock rate and are tuned for high "hit rates". Each fetch from an internal cache eliminates the need for an external bus cycle to memory.
In addition, in cases where an external memory fetch is required, DRAM technology and the use of synchronous bus interfaces to it (e.g. DDR, RAMBUS, etc.) have allowed it to maintain bandwidths comparable with the processor external bus rates.
I/O Bandwidth Has Not Kept Pace
While the processor internal speed has raced forward, and memory access speed has managed to follow along reasonably well with the help of caches, I/O subsystem evolution has not kept up.
This Slows Down The Processor
Although external DRAM accesses by processors can be minimized through the use of internal caches, there is no way to avoid external bus operations when accessing I/O devices. The processor must perform small, inefficient external transactions which then must find their way through the I/O subsystem to the bus hosting the device.
It Also Hurts Fast Peripherals
Similarly, bus master I/O devices using PCI or other subsystem buses to reach main memory are also hindered by the lack of bandwidth. Some modern peripheral devices (e.g. SCSI and IDE hard drives) are capable of running much faster than the busses they live on. This represents another system bottleneck. This is a particular problem in cases where applications are running that emphasize time-critical movement of data through the I/O subsystem over CPU processing.
Reducing I/O Bottlenecks
Two important schemes have been used to connect I/O devices to main memory. The first is the shared bus approach, as used in PCI and PCI-X. The second involves point-to-point component interconnects, and includes some proprietary busses as well as open architectures such as HyperTransport. These are described here, along with the advantages and disadvantages of each.
The Shared Bus Approach
The figure on page 12 depicts the common "North-South" bridge PCI implementation. Note that the PCI bus acts as both an "add-in" bus for user peripheral cards and as an interconnect bus to memory for all devices residing on or below it. Even traffic to and from the USB and IDE controllers integrated in the South Bridge must cross the PCI bus to reach main memory.
Until recently, this topology has been very popular in desktop systems for a number of reasons, including:
A shared bus reduces the number of traces on the motherboard to a single set.
All of the devices located on the PCI bus are only one bridge interface away from the principal target of their transactions — main DRAM memory.
A single, very popular protocol (PCI) can be used for all embedded devices, add-in cards, and chipset components attached to the bus.
Unfortunately, some of the things that made this topology so popular also have made it difficult to fix the I/O bandwidth problems which have become more obvious as processors and memory have become faster.
A Shared Bus Runs At Limited Clock Speeds
The fact that multiple devices (including PCB connectors) attach to a shared bus means that trace lengths and electrical complexity will limit the maximum usable clock speed. For example, a generic PCI bus has a maximum clock speed of 33MHz; the PCI Specification permits increasing the clock speed to 66MHz, but the number of devices/connectors on the bus is very limited.
A Shared Bus May Be Host To Many Device Types
The requirements of devices on a shared bus may vary widely in terms of bandwidth needed, tolerance for bus access latency, typical data transfer size, etc. All of this complicates arbitration on the bus when multiple masters wish to initiate transactions.
Backward Compatibility Prevents Upgrading Performance
If a critical shared bus is based on an open architecture, especially one that defines user "add-in" connectors, then another problem in upgrading bus bandwidth is the need to maintain backward compatibility with all of the devices and cards already in existence. If the bus protocol is enhanced and a user installs an "older generation card", then the bus must either revert back to the earlier protocol or lose its compatibility.
Special Problems If The Shared Bus Is PCI
As popular as it has been, PCI presents additional problems that contribute to performance limits:
PCI doesn't support split transactions, resulting in inefficient retries.
Transaction size (there is no limit) isn't known, which makes it difficult to size buffers and causes frequent disconnects by targets. Devices are also allowed to insert numerous wait states during each data phase.
All PCI transactions by I/O devices targeting main memory generally require a "snoop" cycle by CPUs to assure coherency with internal caches. This impacts both CPU and PCI performance.
Its data bus scalability is very limited (32/64 bit data)
Because of the PCI electrical specification (low-power, reflected wave signals), each PCI bus is physically limited in the number of ICs and connectors vs. PCI clock speed
PCI bus arbitration is vaguely specified. Access latencies can be long and difficult to quantify. If a second PCI bus is added (using a PCI-PCI bridge), arbitration for the secondary bus typically resides in the new bridge. This further complicates PCI arbitration for traffic moving vertically to memory.
A Note About PCI-X
Other than scalability and the number of devices possible on each bus, the PCI-X protocol has resolved many of the problems just described with PCI. For third-party manufacturers of high performance add-in cards and embedded devices, the shared bus PCI-X is a straightforward extension of PCI which yields huge bandwidth improvements (up to about 2GB/s with PCI-X 2.0).
The Point-to-Point Interconnect Approach
An alternative to the shared I/O bus approach of PCI or PCI-X is having point-to-point links connecting devices. This method is being used in a number of new bus implementations, including HyperTransport technology. A common feature of point-to-point connections is much higher bandwidth capability; to achieve this, point-to-point protocols adopt some or all of the following characteristics:
only two devices per connection.
low voltage, differential signaling on the high speed data paths
source-synchronous clocks, sometimes using double data rate (DDR)
very tight control over PCB trace lengths and routing
integrated termination and/or compensation circuits embedded in the two devices which maintain signal integrity and account for voltage and temperature effects on timing.
dual simplex interfaces between the devices rather than one bi-directional bus; this enables duplex operations and eliminates "turn around" cycles.
sophisticated protocols that eliminate retries, disconnects, wait-states, etc.
A Note About Connectors
While connectors may or may not be defined in a point-to-point link specification, they may be designed into some implementations for board-to-board connections or for the attachment of diagnostic equipment. There is no definition of a peripheral add-in card connector for HyperTransport as there is in PCI or PCI-X.
What HT Brings
HyperTransport is a point-to-point, high-performance, "inside-the-box" motherboard interconnect bus. It targets IT, Telecom, and other applications requiring high bandwidth, scalability, and low latency access. Figure 1-2 on page 15 illustrates a single HT bus implementation with a variety of functional devices attached.
Sample HT-based System
Key Features Of HyperTransport Protocol
The key characteristics of the HT technology include:
Open architecture, non-proprietary bus
One or more fast, point-to-point links
Scaling of individual link width and clock speed to suit cost/performance targets
Split-transaction protocol eliminates retries, disconnects, and wait-states.
Standard and optional isochronous traffic support
PCI compatible; designed for minimal impact on OS and driver software
CRC error generation and checking
Programmable error handling strategy for CRC, protocol, and other errors
Message signalled interrupts
System Management features
Support for bridges to legacy busses
x86 compatibility features
Device types including tunnels, bridges, and end devices permit construction of a system fabric comprised of independent, customized links.
Formerly known as AMD's Lightning Data Transport (LDT), HyperTransport is backed by a consortium of developers. See www.hypertransport.org.
The Cost Factor
In addition to technology-related issues, there is always pressure on the platform designer to increase performance and other capabilities with each new generation, but to do so at a lower cost than the previous one. One popular method of measuring the success of this effort is to compare the bandwidth of one I/O bus to another, and the number of signals required to achieve it. This bandwidth-per-pin comparison works fairly well because I/O bus bandwidth is a critical factor in determining if system data bottlenecks exist, and a lower pin count translates directly into cost savings due to smaller IC packages, lower power, simplified motherboard routing, etc.
An example:
The bandwidth-per-pin for a generic 32-bit PCI bus during a burst transfer is approximately 3.5 MB/s (132 MB/s [33MHz x 4 bytes]/38 pins [32 data signals + 5 control lines + 1 clock]). By comparison, a 32-bit HyperTransport interface running at the lowest clock speed of 200MHz yields a per-pin burst bandwidth of approximately 22 MB/s (1600 MB/s [200MHz x 2 DDR x 4 bytes]/74 pins [32 CAD signal pairs + 4 clock pairs + 1 CTL pair]).
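The two per-pin figures quoted above follow from simple division, reproduced in the sketch below.

    # Bandwidth per pin for the two examples above.
    pci_bw_per_pin = (33e6 * 4) / 38              # 132 MB/s over 38 pins
    ht_pins = 32 * 2 + 4 * 2 + 1 * 2              # CAD, CLK and CTL differential pairs = 74
    ht_bw_per_pin = (200e6 * 2 * 4) / ht_pins     # 1600 MB/s (200MHz DDR x 4 bytes) over 74 pins

    print(f"PCI: {pci_bw_per_pin / 1e6:.1f} MB/s per pin")   # ~3.5
    print(f"HT : {ht_bw_per_pin / 1e6:.1f} MB/s per pin")    # ~21.6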
Networking Support
At the time of the writing of this book, the HyperTransport I/O Link Specification is at revision 1.04. This specification revision mainly targets I/O subsystem improvements in conventional desktop and server platforms.
A growing number of applications require architectures that integrate well with networking environments. In many of these systems, unlike desktops and servers, processing may be decentralized and features such as message streaming, peer-peer transfers, and assigned isochronous bandwidth become important. In addition, device types such as switches help in building topologies suited to communications networking. To accommodate networking applications, work is well underway on the 1.05 and 1.1 revisions of the HyperTransport I/O Link Specification. The 1.05 specification includes the HyperTransport switch specification and the 1.1 specification incorporates the networking extensions specification. See Chapter 19, entitled "Networking Extensions Overview," on page 443 for a summary of the major features expected to be included in the 1.05 and 1.1 specification revisions.
Visit www.hypertransport.org for up-to-date information on all on-going specification revisions.
Also, visit MindShare's website at www.mindshare.com for updates to this book relating to this and other HyperTransport topics. Information will be available for free download when the new specification revisions are released and details become publicly available.