Background: I/O Subsystem Bottlenecks
New I/O buses are typically developed in response to changing system requirements and to promote lower cost implementations. Current-generation I/O buses such as PCI are rapidly falling behind the capabilities of other system components such as processors and memory. Some of the reasons why the I/O bottlenecks are becoming more apparent are described below.
Server Or Desktop Computer: Three Subsystems
A server or desktop computer system is comprised of three major subsystems:
Processor (in servers, there may be more than one)
Main DRAM Memory. There are a number of different synchronous DRAM types, including SDRAM, DDR, and Rambus.
I/O (Input/Output devices). Generally, all components which are not processors or DRAM are lumped together in this subsystem group. This would include such things as graphics, mass storage, legacy hardware, and the buses required to support them: PCI, PCI-X, AGP, USB, IDE, etc.
CPU Speed Makes Other Subsystems Appear Slow
Because of improvements in CPU internal execution speed, processors are more demanding than ever when they access external resources such as memory and I/O. Each external read or write by the processor represents a huge performance hit compared to internal execution.
Multiple CPUs Aggravate The Problem
In systems with multiple CPUs, such as servers, the problem of accessing external devices becomes worse because of competition for access to system DRAM and the single set of I/O resources.
DRAM Memory Keeps Up Fairly Well
Although it is external to the processor(s), system DRAM memory keeps up fairly well with the increasing demands of CPUs for a couple of reasons. First, the performance penalty for accessing external memory is mitigated by the use of internal processor caches. Modern processors generally implement multiple levels of internal caches that run at the full CPU clock rate and are tuned for high "hit rates". Each fetch from an internal cache eliminates the need for an external bus cycle to memory.
In addition, when an external memory fetch is required, DRAM technology and the synchronous bus interfaces to it (e.g. DDR, Rambus, etc.) have allowed memory to maintain bandwidths comparable with the processor external bus rates.
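The way cache hit rate softens the cost of external memory accesses can be illustrated with a short calculation. This is only a minimal sketch: the cycle counts and hit rates are assumed, illustrative values rather than measurements from any real processor, and `average_access_cycles` is a hypothetical helper name.

```python
def average_access_cycles(hit_rate, cache_cycles, dram_cycles):
    """Average cycles per access when a fraction `hit_rate` is served by the cache.

    Every access pays the cache lookup; misses additionally pay the
    external DRAM access.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return cache_cycles + miss_rate * dram_cycles


# Assumed, illustrative numbers only: 2-cycle cache, 100-cycle external access.
for hit_rate in (0.0, 0.90, 0.99):
    cycles = average_access_cycles(hit_rate, cache_cycles=2, dram_cycles=100)
    print(f"hit rate {hit_rate:.0%}: {cycles:.1f} cycles per access on average")
```

Under these assumed figures, the average cost of an access drops sharply as the hit rate approaches 100%, which is why a slow external memory can still appear to keep up.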
I/O Bandwidth Has Not Kept Pace
While the processor internal speed has raced forward, and memory access speed has managed to follow along reasonably well with the help of caches, I/O subsystem evolution has not kept up.
This Slows Down The Processor
Although external DRAM accesses by processors can be minimized through the use of internal caches, there is no way to avoid external bus operations when accessing I/O devices. The processor must perform small, inefficient external transactions which then must find their way through the I/O subsystem to the bus hosting the device.
It Also Hurts Fast Peripherals
Similarly, bus master I/O devices using PCI or other subsystem buses to reach main memory are also hindered by the lack of bandwidth. Some modern peripheral devices (e.g. SCSI and IDE hard drives) are capable of running much faster than the buses they live on. This represents another system bottleneck, and it is a particular problem for applications that emphasize time-critical movement of data through the I/O subsystem over CPU processing.
Reducing I/O Bottlenecks
Two important schemes have been used to connect I/O devices to main memory. The first is the shared bus approach, as used in PCI and PCI-X. The second involves point-to-point component interconnects, and includes some proprietary buses as well as open architectures such as HyperTransport. These are described here, along with the advantages and disadvantages of each.
The Shared Bus Approach
The figure on page 12 depicts the common "North-South" bridge PCI implementation. Note that the PCI bus acts as both an "add-in" bus for user peripheral cards and as an interconnect bus to memory for all devices residing on or below it. Even traffic to and from the USB and IDE controllers integrated in the South Bridge must cross the PCI bus to reach main memory.
Until recently, the topology on page 12 has been very popular in desktop systems for a number of reasons, including:
A shared bus reduces the number of traces on the motherboard to a single set.
All of the devices located on the PCI bus are only one bridge interface away from the principal target of their transactions — main DRAM memory.
A single, very popular protocol (PCI) can be used for all embedded devices, add-in cards, and chipset components attached to the bus.
Unfortunately, some of the things that made this topology so popular also have made it difficult to fix the I/O bandwidth problems which have become more obvious as processors and memory have become faster.
A Shared Bus Runs At Limited Clock Speeds
The fact that multiple devices (including PCB connectors) attach to a shared bus means that trace lengths and electrical complexity will limit the maximum usable clock speed. For example, a generic PCI bus has a maximum clock speed of 33MHz; the PCI Specification permits increasing the clock speed to 66MHz, but the number of devices/connectors on the bus is very limited.
A Shared Bus May Be Host To Many Device Types
The requirements of devices on a shared bus may vary widely in terms of bandwidth needed, tolerance for bus access latency, typical data transfer size, etc. All of this complicates arbitration on the bus when multiple masters wish to initiate transactions.
Backward Compatibility Prevents Upgrading Performance
If a critical shared bus is based on an open architecture, especially one that defines user "add-in" connectors, then another problem in upgrading bus bandwidth is the need to maintain backward compatibility with all of the devices and cards already in existence. If the bus protocol is enhanced and a user installs an older-generation card, then the bus must either revert to the earlier protocol or lose its compatibility.
Special Problems If The Shared Bus Is PCI
As popular as it has been, PCI presents additional problems that contribute to performance limits:
PCI doesn't support split transactions, resulting in inefficient retries.
Transaction size isn't known in advance (there is no limit), which makes it difficult to size buffers and causes frequent disconnects by targets. Devices are also allowed to insert numerous wait states during each data phase.
All PCI transactions by I/O devices targeting main memory generally require a "snoop" cycle by CPUs to assure coherency with internal caches. This impacts both CPU and PCI performance.
Its data bus scalability is very limited (32/64 bit data).
Because of the PCI electrical specification (low-power, reflected wave signals), each PCI bus is physically limited in the number of ICs and connectors it can support at a given PCI clock speed.
PCI bus arbitration is vaguely specified. Access latencies can be long and difficult to quantify. If a second PCI bus is added (using a PCI-PCI bridge), arbitration for the secondary bus typically resides in the new bridge. This further complicates PCI arbitration for traffic moving vertically to memory.
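The cost of retries compared with split transactions can be sketched with a toy bus-occupancy model. It is not a cycle-accurate description of PCI or of any split-transaction bus: the function names and every clock count below are assumptions chosen purely for illustration.

```python
import math


def retry_bus_clocks(target_latency, attempt_clocks, retry_interval, data_clocks):
    """Bus clocks consumed when the master must repeat a request until data is ready.

    The first attempt starts at clock 0 and each further attempt starts
    `retry_interval` clocks after the previous one. Any attempt that starts
    before `target_latency` clocks have elapsed is retried, wasting
    `attempt_clocks` of bus time.
    """
    wasted_attempts = math.ceil(target_latency / retry_interval)
    successful_attempt = attempt_clocks + data_clocks
    return wasted_attempts * attempt_clocks + successful_attempt


def split_bus_clocks(request_clocks, response_header_clocks, data_clocks):
    """Bus clocks consumed when request and response are separate transactions.

    The bus is free for other traffic while the target fetches the data,
    so the target's latency does not add to bus occupancy.
    """
    return request_clocks + response_header_clocks + data_clocks


# Assumed values: a target needing 30 clocks to fetch 8 clocks' worth of data.
with_retries = retry_bus_clocks(target_latency=30, attempt_clocks=3,
                                retry_interval=8, data_clocks=8)
with_split = split_bus_clocks(request_clocks=3, response_header_clocks=1,
                              data_clocks=8)
print(f"retry-based read occupies the bus for {with_retries} clocks")
print(f"split-transaction read occupies the bus for {with_split} clocks")
```

In this simplified model the retry-based read ties up the bus for roughly twice as long, and the gap widens as the target's latency grows.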
A Note About PCI-X
Other than scalability and the number of devices possible on each bus, the PCI-X protocol has resolved many of the problems just described with PCI. For third-party manufacturers of high performance add-in cards and embedded devices, the shared bus PCI-X is a straightforward extension of PCI which yields huge bandwidth improvements (up to about 2GB/s with PCI-X 2.0).
The Point-to-Point Interconnect Approach
An alternative to the shared I/O bus approach of PCI or PCI-X is having point-to-point links connecting devices. This method is being used in a number of new bus implementations, including HyperTransport technology. A common feature of point-to-point connections is much higher bandwidth capability; to achieve this, point-to-point protocols adopt some or all of the following characteristics:
only two devices per connection.
low voltage, differential signaling on the high speed data paths
source-synchronous clocks, sometimes using double data rate (DDR)
very tight control over PCB trace lengths and routing
integrated termination and/or compensation circuits embedded in the two devices which maintain signal integrity and account for voltage and temperature effects on timing.
dual simplex interfaces between the devices rather than one bi-directional bus; this enables duplex operations and eliminates "turn around" cycles.
sophisticated protocols that eliminate retries, disconnects, wait-states, etc.
A Note About Connectors
While connectors may or may not be defined in a point-to-point link specification, they may be designed into some implementations to connect from board to board or for the attachment of diagnostic equipment. There is no definition of a peripheral add-in card connector for HyperTransport as there is in PCI or PCI-X.
What HT Brings
HyperTransport is a point-to-point, high-performance, "inside-the-box" motherboard interconnect bus. It targets IT, Telecom, and other applications requiring high bandwidth, scalability, and low latency access. Figure 1-2 on page 15 illustrates a single HT bus implementation with a variety of functional devices attached.
Sample HT-based System
Key Features Of HyperTransport Protocol
The key characteristics of the HT technology include:
Open architecture, non-proprietary bus
One or more fast, point-to-point links
Scaling of individual link width and clock speed to suit cost/performance targets
Split-transaction protocol eliminates retries, disconnects, and wait-states.
Standard and optional isochronous traffic support
PCI compatible; designed for minimal impact on OS and driver software
CRC error generation and checking
Programmable error handling strategy for CRC, protocol, and other errors
Message signalled interrupts
System Management features
Support for bridges to legacy buses
x86 compatibility features
Device types including tunnels, bridges, and end devices permit construction of a system fabric comprised of independent, customized links.
Formerly known as AMD's Lightning Data Transport (LDT), HyperTransport is backed by a consortium of developers. See www.hypertransport.org.
The Cost Factor
In addition to technology-related issues, there is always pressure on the platform designer to increase performance and other capabilities with each new generation, but to do so at a lower cost than the previous one. One popular method of measuring the success of this effort is to compare the bandwidth of one I/O bus to another, and the number of signals required to achieve it. This bandwidth-per-pin comparison works fairly well because I/O bus bandwidth is a critical factor in determining if system data bottlenecks exist, and a lower pin count translates directly into cost savings due to smaller IC packages, lower power, simplified motherboard routing, etc.
An example:
The bandwidth-per-pin for a generic 32-bit PCI bus during a burst transfer is approximately 3.5 MB/s (132 MB/s [33MHz x 4 bytes]/38 pins [32 data signals + 5 control lines + 1 clock]). By comparison, a 32-bit HyperTransport interface running at the lowest clock speed of 200MHz yields a per-pin burst bandwidth of approximately 22 MB/s (1600 MB/s [200MHz x 2 DDR x 4 bytes]/74 pins [32 CAD signal pairs + 4 clock pairs + 1 CTL pair]).
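The arithmetic in this example can be reproduced with a few lines of Python. The clock rates, bus widths, and pin counts are taken from the text above; the helper name `bandwidth_per_pin` is simply an illustrative choice.

```python
def bandwidth_per_pin(clock_mhz, transfers_per_clock, bytes_per_transfer, pins):
    """Return (peak burst bandwidth in MB/s, bandwidth per pin in MB/s)."""
    bandwidth = clock_mhz * transfers_per_clock * bytes_per_transfer
    return bandwidth, bandwidth / pins


# 32-bit PCI: 32 data + 5 control + 1 clock = 38 pins, one transfer per clock.
pci_bw, pci_per_pin = bandwidth_per_pin(33, 1, 4, pins=32 + 5 + 1)

# 32-bit HyperTransport at 200MHz: 32 CAD + 4 clock + 1 CTL differential
# pairs = 74 pins, with two transfers per clock (DDR).
ht_bw, ht_per_pin = bandwidth_per_pin(200, 2, 4, pins=2 * (32 + 4 + 1))

print(f"PCI: {pci_bw} MB/s total, {pci_per_pin:.1f} MB/s per pin")
print(f"HT:  {ht_bw} MB/s total, {ht_per_pin:.1f} MB/s per pin")
print(f"HT delivers about {ht_per_pin / pci_per_pin:.1f}x the bandwidth per pin")
```

Counting both wires of each differential pair keeps the comparison fair: even with twice the wires per signal, the HyperTransport link comes out well ahead per pin.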
Networking Support
Finally, at the time of the writing of this book, the HyperTransport I/O Link Specification is at revision 1.04. This specification revision mainly targets I/O subsystem improvements in conventional desktop and server platforms.
A growing number of applications require architectures that integrate well with networking environments. In many of these systems, unlike desktops and servers, processing may be decentralized and features such as message streaming, peer-to-peer transfers, and assigned isochronous bandwidth become important. In addition, device types such as switches help in building topologies suited to communications networking. To accommodate networking applications, work is well underway on the 1.05 and 1.1 revisions of the HyperTransport I/O Link Specification. The 1.05 specification includes the HyperTransport switch specification and the 1.1 specification incorporates the networking extensions specification. See Chapter 19, entitled "Networking Extensions Overview," on page 443 for a summary of the major features expected to be included in the 1.05 and 1.1 specification revisions.
Visit www.hypertransport.org for up-to-date information on all on-going specification revisions.
Also, visit MindShare's website at www.mindshare.com for updates to this book relating to this and other HyperTransport topics. Information will be available for free download when the new specification revisions are released and details become publicly available.