CoreGen Adder Parameters: Add-only, no Carry/Borrow, Non Registered
In the test schematic for these CoreGen adders, only one bit of the adder output bus is connected to an output pin. This is done so that 32 and 64 bit adders may be analyzed: our SpartanII has only 92 I/O pins, and connecting the full input and output busses of the 32 and 64 bit adders would require more than 92 pins. Only one of the input busses is connected to input pins, and only one output pin is used (the most significant bit of the bus, to maximize delay for worst-case timing analysis). Therefore, the largest power-of-two bit width our chip can test in this way is 64 (64 input pins and 1 output; 128 input pins is already too many, so the next power-of-two width is out). In actuality, adders up to 88 bits wide could be tested (88 input pins + CLK + RESET + output pin = 91 of the 92 pins).
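The pin budget behind this wiring choice can be tallied with a short Python sketch. The 92-pin figure and the CLK and RESET pins come from the text; the function names are illustrative, and the full-wiring case assumes the output bus is as wide as each input bus (no carry).

```python
TOTAL_IO_PINS = 92   # I/O pins available on the SpartanII, per the text
CONTROL_PINS = 2     # CLK and RESET

def pins_full(width):
    """Pins if both input busses and the whole output bus are wired out.
    Assumes the output bus is as wide as the inputs (no carry)."""
    return 3 * width + CONTROL_PINS

def pins_reduced(width):
    """Pins for the scheme used here: one input bus plus one output bit."""
    return width + 1 + CONTROL_PINS

for width in (32, 64, 88, 128):
    print(f"{width:3d}-bit: full={pins_full(width):3d}, "
          f"reduced={pins_reduced(width):3d}, "
          f"fits={pins_reduced(width) <= TOTAL_IO_PINS}")
```

Under these assumptions even the 32 bit adder is over budget when fully wired, while the reduced scheme fits every width up to 88 bits.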
Timing and resource analysis:
Delay times are taken from Xilinx Foundation Timing Analyzer, Default
period analysis for net CLK, and represent worst-case, maximum period delay.
Adder Bus width | Total equivalent gate count for design | Speed grade -5 | Speed grade -6 |
4 | 117 | 5.998ns | 5.083ns |
8 | 229 | 6.427ns | 5.436ns |
16 | 453 | 7.205ns | 6.099ns |
16 (no RPM) | 453 | 6.714ns | not tested |
24 (predicted) | 677 | 8ns | 6.5ns |
32 (no RPM) | 901 | 9.159ns | not tested |
32 | 901 | 8.408ns | 7.120ns |
48 (no RPM) | 1,349 | 14.753ns | 12.401ns |
64 (no RPM) | 1,797 | 14.532ns | 12.314ns |
All adders are constructed from LogiCore Adder/Subtractor 4.0; the 48 and 64 bit adders are designed with RPM disabled. Foundation seemed to run out of space with RPMs enabled for these bit widths. 16 and 32 bit adders without RPM are included for comparison.
The relationship between bus width and total gate count is linear: every measured design fits 28 gates per bit plus a constant 5 gates. This result suggests that the CoreGen architecture is based around a large bank of similarly functioning, basic elements (with a fixed number of gates each). In this case, the core element is probably a full adder; CoreGen simply strings together the appropriate number of full adders to produce an adder of the desired width.
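As a quick check of the linearity claim, the following Python sketch computes the slope between neighbouring measured widths and the implied constant term. It uses only the measured rows of the table; the 24 bit row is left out because it is marked as predicted.

```python
# (bus width -> total equivalent gate count), measured rows of the table only.
gate_counts = {4: 117, 8: 229, 16: 453, 32: 901, 48: 1349, 64: 1797}

widths = sorted(gate_counts)
# Slope between each pair of neighbouring widths, in gates per bit.
slopes = [
    (gate_counts[b] - gate_counts[a]) / (b - a)
    for a, b in zip(widths, widths[1:])
]
gates_per_bit = slopes[0]
fixed_gates = gate_counts[widths[0]] - gates_per_bit * widths[0]

def predicted_gates(width):
    """Gate count implied by the straight line through the measured points."""
    return gates_per_bit * width + fixed_gates

print("slopes (gates per bit):", slopes)
print(f"gates = {gates_per_bit:g} * width + {fixed_gates:g}")
print("24-bit prediction:", predicted_gates(24))
```

Every slope comes out identical, and the line evaluated at 24 bits reproduces the 677 gates listed in the predicted row.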
Oddly, the maximum delay for the 64 bit adder is less than that of the 48 bit adder. This may be due to a fundamentally different layout of the 64 bit adder on the chip: notice the two columns of CoreGen elements in the layout of the 64 bit adder, whereas the 48 bit adder appears to use only one column and has more routing delay.
add48 (-5) (7.142ns logic, 7.610ns route) (48.4% logic,
51.6% route)
add64 (-5) (6.735ns logic, 5.579ns route) (54.7% logic,
45.3% route)
As expected, the 48 bit adder suffers from more routing delay, but also has more logic delay. This result is particularly confounding, as the 64 bit adder should be performing more logical operations. This implies that the 64 bit adder works in a fundamentally different way from the 48 bit. However, the gate count relationship among these adders is strongly linear with the bit width; in this way, the 64 bit adder appears to be constructed along the same lines as the smaller adders.
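Before reading too much into this comparison, it may be worth cross-checking the two breakdowns against the table. The Python sketch below sums the quoted logic and route figures and prints them next to the table entries; all numbers are copied from this report.

```python
# Worst-case path breakdowns quoted above: (logic ns, route ns).
breakdowns = {"add48": (7.142, 7.610), "add64": (6.735, 5.579)}
# Table entries for the same designs: (-5 delay ns, -6 delay ns).
table = {"add48": (14.753, 12.401), "add64": (14.532, 12.314)}

totals = {name: logic + route for name, (logic, route) in breakdowns.items()}

for name, (logic, route) in breakdowns.items():
    total = totals[name]
    print(f"{name}: total {total:.3f}ns, "
          f"{100 * logic / total:.1f}% logic, "
          f"{100 * route / total:.1f}% route, "
          f"table -5/-6: {table[name]}")
```

The add48 figures sum to the -5 table entry (within rounding), but the add64 figures sum to 12.314ns, which matches the -6 entry rather than the -5 entry. If the add64 breakdown was in fact taken at -6, part of the apparent difference in logic delay would be explained by speed grade alone.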
The data for the 48 and 64 bit adders does not fit with the other adder data when considering maximum period delay. With RPM turned off for these two adders, period delay is longer, relatively, due to more routing delay. No linear regression was performed due to this discontinuity. Had the XC2S100TQ144 supported RPM'ed 48 and 64 bit adders, the relationship between bit width and period delay would most likely be very linear across all bit widths. If CoreGen adders are simply made of sequences of fundamental adder-blocks, it makes sense that the period delay of an adder be linearly proportional to the bit width.
Without performing a linear regression on the timing data, we can
estimate (visually) that the -5 grade delay on a 24-bit adder would be
about 8ns. Likewise, with a -6 grade chip, the delay would drop to
approximately 6.5ns.
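The visual estimate can be compared against a simple linear interpolation between the two nearest RPM-enabled measurements (16 and 32 bits). This Python sketch is only a sanity check under the assumption that delay is roughly linear between those two points; it is not a regression over the full data set.

```python
# RPM-enabled measurements from the table: width -> (-5 delay ns, -6 delay ns).
measured = {16: (7.205, 6.099), 32: (8.408, 7.120)}

def interpolate(width, lo=16, hi=32):
    """Linearly interpolate both speed grades between two measured widths."""
    t = (width - lo) / (hi - lo)
    return tuple(a + t * (b - a) for a, b in zip(measured[lo], measured[hi]))

d5, d6 = interpolate(24)
print(f"24-bit estimate: {d5:.2f}ns (-5), {d6:.2f}ns (-6)")
```

The interpolated values, about 7.8ns and 6.6ns, are close to the visual estimates of 8ns and 6.5ns.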
Across all of the data points, the Spartan2 "Higher Performance"
-6 grade shortens the worst-case period by about 15% relative to the
-5 grade (equivalently, about an 18% higher maximum clock rate). The
speed gain of the chip did not seem to depend on the complexity of this
design (including the non-RPM designs); the speed change then reflects
a higher clock rate. Pricing for the XC2S100TQ144 -6 grade chip is
$22.27, almost exactly 15% more than the $19.36 for the -5 grade chip.
Prices courtesy of AVNET.
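The per-design speed gain and the price premium can be recomputed from the figures in this report. The Python sketch below covers every width that was tested at both speed grades; the dictionary keys are just labels.

```python
# (-5 delay ns, -6 delay ns) for every width tested at both speed grades.
delays = {
    "4": (5.998, 5.083), "8": (6.427, 5.436), "16": (7.205, 6.099),
    "32": (8.408, 7.120), "48 (no RPM)": (14.753, 12.401),
    "64 (no RPM)": (14.532, 12.314),
}
# Fractional reduction in worst-case period when moving from -5 to -6.
reductions = {name: 1 - d6 / d5 for name, (d5, d6) in delays.items()}
for name, r in reductions.items():
    print(f"{name:>12}: {100 * r:.1f}% shorter period at -6")

price_5, price_6 = 19.36, 22.27   # prices quoted above
price_premium = price_6 / price_5 - 1
print(f"price premium: {100 * price_premium:.2f}%")
```

Every design lands between 15% and 16%, which supports the observation that the gain does not depend on design complexity or on RPM being enabled.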
ERROR:OldMap:160 - The RLOC value of R17C0 on component U1/BU0 in
RPM U1/hset
creates a macro that is too large for the device.
Use a bigger device.
With a similar CoreGen 1.0 8 bit adder, the chip was able to implement the design successfully. With the same delay timing analysis parameters as above, this chip produced 5.691ns of period delay (with the low speed grade), which is actually about 0.7ns faster than the Spartan2 at -5 (6.427ns). However, 368 gates were used, as compared to the Spartan2's 229. 15% of the CLBs in this older chip are used up with an 8 bit adder, whereas are left in the Spartan2. A quick glance at the FPGA Layout Editor reveals that this 8 bit adder takes up significantly more area on the old chip than on the new (approx 25% as compared to less than 10%). Without more speed tests, it is hard to determine whether the older chip is slower in general; it is clear, however, that the new Spartan2 is a much larger, more flexible chip.
The XS40-010XLPC84 advertises 5000 system gates, whereas the Spartan2
boasts 100,000 gates. At speed grade -4, the XS40 is available for
$7.314, and $8.80 at -5. In terms of advertised gates per dollar, the
Spartan2 is roughly 7.5 times more cost-effective at the lower speed
grades. This cost difference seems to be significant enough to make up
for a possible speed reduction relative to the XS40 series.
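The gates-per-dollar comparison can be worked out from the advertised gate counts and the prices quoted in this report. The Python sketch below does so for each chip and speed grade; advertised system gates are a marketing figure, so the ratio is only a rough guide.

```python
# Advertised system gates and unit prices quoted in the text.
chips = {
    "Spartan2 -5": (100_000, 19.36),
    "Spartan2 -6": (100_000, 22.27),
    "XS40 -4": (5_000, 7.314),
    "XS40 -5": (5_000, 8.80),
}
gates_per_dollar = {
    name: gates / price for name, (gates, price) in chips.items()
}
for name, gpd in gates_per_dollar.items():
    print(f"{name:>12}: {gpd:7.0f} gates per dollar")

# Compare the lower speed grade of each family.
ratio = gates_per_dollar["Spartan2 -5"] / gates_per_dollar["XS40 -4"]
print(f"Spartan2 -5 vs XS40 -4: {ratio:.1f}x")
```

Depending on which speed grades are paired, the Spartan2 advantage works out to somewhere between about 6.5 and 9 times.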
FPGA Layout:
add16 |
add32 |
add64 (Relationally Placed Macros disabled) |
RPM vs noRPM (maximum period delay)
add32 (RPM)
8.408ns (5.499ns logic, 2.909ns route) (65.4% logic, 34.6%
route)
add32 (noRPM)
9.159ns (5.403ns logic, 3.756ns route) (59.0% logic, 41.0%
route)
It is apparent from the above data that Relationally Placed Macros
increase (and perhaps maximize) routing efficiency. Over a 32 bit
adder, disabling RPM does not seem to greatly affect delay due to logic.
Routing delay, however, increases by about 29% (2.909ns to 3.756ns),
raising its share of the total delay from 34.6% to 41.0%.
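The effect of disabling RPM on each component of the add32 delay can be computed directly from the two breakdowns. This Python sketch uses only the figures quoted above; the helper name is illustrative.

```python
# add32 worst-case breakdowns quoted above: (logic ns, route ns).
rpm = (5.499, 2.909)
no_rpm = (5.403, 3.756)

def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100 * (after - before) / before

logic_change = pct_change(rpm[0], no_rpm[0])
route_change = pct_change(rpm[1], no_rpm[1])
total_change = pct_change(sum(rpm), sum(no_rpm))

print(f"logic delay: {logic_change:+.1f}%")
print(f"route delay: {route_change:+.1f}%")
print(f"total delay: {total_change:+.1f}%")
```

Logic delay changes by less than 2%, so nearly all of the roughly 9% increase in total period comes from routing.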
ADD32 with RPM off: the design spans the entire chip, and much of the routing seems to be unnecessarily long. |
ADD32 with RPM on: notice that the same design is implemented in far less space, and that routing is more dense. |