Xilinx FPGA CoreGen Adder analysis
Chris McGraw

CoreGen Adder Parameters: Add-only, no Carry/Borrow, Non Registered

In the test schematic for these CoreGen adders, only one bit of the adder output bus is connected to an output pin.  This allows 32 and 64 bit adders to be analyzed: our SpartanII has only 92 I/O pins, and connecting full input and output busses on the 32 and 64 bit adders would require more pins than that.  Only one of the input busses is connected to input pins, and only one output pin is used (the most significant bit of the bus, to maximize delay for worst-case timing analysis).  Therefore, the largest power-of-two bit width our chip can test in this way is 64 (64 input pins and 1 output; a 128 bit input bus already needs too many pins, so the next power-of-two bit width is out).  In actuality, 88 bit wide adders could be tested (88 input pins + CLK + RESET + output pin = 91 of the 92 pins).
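The pin budget above can be checked with a few lines of Python.  This is only a sketch of the arithmetic in the text: the constant and function names are made up for illustration, and it assumes that exactly one input bus, CLK, RESET and a single output bit are wired to pins.

```python
# Pin budget for the test schematic (figures taken from the text above).
TOTAL_IO_PINS = 92   # I/O pins quoted for our SpartanII
CONTROL_PINS = 2     # CLK and RESET
OUTPUT_PINS = 1      # only the most significant sum bit is brought out


def pins_needed(width: int) -> int:
    """Pins used when one `width`-bit input bus is connected to pins."""
    return width + CONTROL_PINS + OUTPUT_PINS


def fits(width: int) -> bool:
    """True if an adder of this width fits within the I/O pin budget."""
    return pins_needed(width) <= TOTAL_IO_PINS


for width in (32, 64, 88, 128):
    print(width, pins_needed(width), fits(width))
```

The 64 and 88 bit cases fit, while a 128 bit input bus does not.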

Timing and resource analysis:

Delay times are taken from Xilinx Foundation Timing Analyzer, Default period analysis for net CLK, and represent worst-case, maximum period delay.

Adder bus width   Total equivalent gate count   Speed grade -5   Speed grade -6
4                 117                           5.998ns          5.083ns
8                 229                           6.427ns          5.436ns
16                453                           7.205ns          6.099ns
16 (no RPM)       453                           6.714ns          not tested
24 (predicted)    677                           8ns              6.5ns
32                901                           8.408ns          7.120ns
32 (no RPM)       901                           9.159ns          not tested
48 (no RPM)       1,349                         14.753ns         12.401ns
64 (no RPM)       1,797                         14.532ns         12.314ns

All adders are constructed from LogiCore Adder/Subtractor 4.0; the 48 and 64 bit adders are designed with RPM disabled.  Foundation seemed to run out of space with RPMs enabled for these bit widths.   16 and 32 bit adders without RPM are included for comparison.

The relationship between bus width and total gate count appears to be almost perfectly linear (28 gates per additional bit across every measured width).  This result suggests that the CoreGen architecture is based around a large bank of similarly functioning, basic elements (with a fixed number of gates each).  In this case, the core element is probably a full adder; the appropriate number of full adders is simply strung together by CoreGen to produce the adder of desired width.
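The linearity of the gate counts can be verified directly from the table.  The sketch below (names are illustrative) computes the slope between neighbouring table entries and then uses the resulting line to predict a 24 bit adder.  Reading the constant term as fixed overhead outside the adder chain is our assumption; the tools do not report what it is.

```python
# (bus width in bits, total equivalent gate count) copied from the table
GATE_COUNTS = [(4, 117), (8, 229), (16, 453), (32, 901), (48, 1349), (64, 1797)]


def gates_per_bit(points):
    """Slope between each pair of neighbouring table entries."""
    return [(g2 - g1) / (w2 - w1)
            for (w1, g1), (w2, g2) in zip(points, points[1:])]


slopes = gates_per_bit(GATE_COUNTS)
slope = slopes[0]
# Constant term of the line through the first table entry.
intercept = GATE_COUNTS[0][1] - slope * GATE_COUNTS[0][0]


def predicted_gates(width):
    """Gate count predicted by the straight line for a given bus width."""
    return slope * width + intercept


print(slopes)               # identical slopes mean the data are exactly linear
print(intercept)
print(predicted_gates(24))  # compare with the 24 (predicted) row of the table
```

Every slope comes out the same, so one line passes through all six measured points, and the prediction for 24 bits agrees with the 677 gates listed in the table.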

Oddly, the maximum delay for the 64 bit adder is less than that of the 48 bit adder.  This may be due to a fundamentally different layout of the 64 bit adder on the chip: notice the two columns of CoreGen elements in the layout of the 64 bit adder, whereas the 48 bit adder appears to use only one column and to have more routing delay.

 add48 (-5)  (7.142ns logic, 7.610ns route) (48.4% logic, 51.6% route)
 add64 (-6)  (6.735ns logic, 5.579ns route) (54.7% logic, 45.3% route)

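As expected, the 48 bit adder suffers from more routing delay, but it also shows more logic delay.  This result is particularly confounding, as the 64 bit adder should be performing more logical operations.  Taken at face value, it implies that the 64 bit adder works in a fundamentally different way from the 48 bit.  Note, however, that the add64 breakdown above sums to 12.314ns, which is the -6 entry in the table rather than the -5 entry (14.532ns), so part of the apparent gap may come from the speed grade rather than from the design.  In addition, the gate count relationship among these adders is strongly linear with the bit width; in this way, the 64 bit adder appears to be constructed along the same lines as the smaller adders.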
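A quick consistency check of the two breakdowns against the table totals can be sketched as follows.  The dictionary layout and function name are illustrative only; the numbers are copied from the table and from the two breakdown lines.

```python
# Maximum period delays from the table, keyed by (design, speed grade).
TABLE_TOTALS = {
    ("add48", "-5"): 14.753, ("add48", "-6"): 12.401,
    ("add64", "-5"): 14.532, ("add64", "-6"): 12.314,
}

# (logic delay, routing delay) in ns from the two breakdown lines.
BREAKDOWNS = {"add48": (7.142, 7.610), "add64": (6.735, 5.579)}


def closest_entry(design):
    """Return the speed grade whose table total best matches logic + route."""
    logic, route = BREAKDOWNS[design]
    total = logic + route
    grade = min(("-5", "-6"),
                key=lambda g: abs(TABLE_TOTALS[(design, g)] - total))
    return grade, round(total, 3)


for design in BREAKDOWNS:
    print(design, closest_entry(design))
```

The add48 breakdown lands within 0.001ns of its -5 table entry, while the add64 breakdown matches its -6 entry, so the two breakdowns should be compared with some care.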

The data for the 48 and 64 bit adders do not fit with the other adder data when considering maximum period delay.  With RPM turned off for these two adders, the period delay is relatively longer because of the additional routing delay.  No linear regression was performed due to this discontinuity.  Had the XC2S100TQ144 supported RPM'ed 48 and 64 bit adders, the relationship between bit width and period delay would most likely be close to linear across all bit widths.  If CoreGen adders are simply made of sequences of fundamental adder-blocks, it makes sense that the period delay of an adder be linearly proportional to the bit width.

Without performing a linear regression on the timing data, we can estimate (visually) that the -5 grade delay on a 24-bit adder would be about 8ns.  Likewise, with a -6 grade chip, the delay would drop to approximately 6.5ns.
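For comparison with the visual estimate, a least-squares line can be fitted to the four RPM data points only (4, 8, 16 and 32 bits), leaving out the non-RPM designs.  This is a sketch we did not run as part of the original analysis; the function names are illustrative and the fit uses plain Python rather than any statistics package.

```python
def least_squares(points):
    """Ordinary least-squares line through (x, y) points: (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# (bus width, maximum period delay in ns) for the RPM designs in the table.
RPM_DELAYS = {
    "-5": [(4, 5.998), (8, 6.427), (16, 7.205), (32, 8.408)],
    "-6": [(4, 5.083), (8, 5.436), (16, 6.099), (32, 7.120)],
}


def predict(grade, width):
    """Delay in ns predicted by the fitted line for one speed grade."""
    slope, intercept = least_squares(RPM_DELAYS[grade])
    return slope * width + intercept


for grade in RPM_DELAYS:
    print(grade, round(predict(grade, 24), 3))
```

With only four points the fit is rough, but it gives about 7.8ns at -5 and 6.6ns at -6 for a 24 bit adder, close to the visual estimates.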
Across all of the data points, the Spartan2 "Higher Performance" -6 grade shows about a 15% reduction in delay relative to the -5 grade.  The speed gain of the chip did not seem to depend on the complexity of this design (including the non-RPM designs); the gain therefore appears to be a uniform property of the faster part, which allows a correspondingly higher clock rate.  Pricing for the XC2S100TQ144 -6 grade chip is $22.27, almost exactly 15% more than the $19.36 for the -5 grade chip.  Prices c/o AVNET.
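The speed and price comparison between the two grades can be reproduced from the table with the sketch below.  The names are illustrative, and only rows tested at both speed grades are included.

```python
# label: (-5 delay in ns, -6 delay in ns), rows tested at both speed grades
DELAYS = {
    "4": (5.998, 5.083),
    "8": (6.427, 5.436),
    "16": (7.205, 6.099),
    "32": (8.408, 7.120),
    "48 (no RPM)": (14.753, 12.401),
    "64 (no RPM)": (14.532, 12.314),
}


def delay_reduction(slow, fast):
    """Percentage by which the -6 delay is shorter than the -5 delay."""
    return (slow - fast) / slow * 100


reductions = {label: delay_reduction(*pair) for label, pair in DELAYS.items()}

# Price premium of the -6 part over the -5 part (AVNET prices from the text).
price_premium = (22.27 - 19.36) / 19.36 * 100

for label, pct in reductions.items():
    print(label, round(pct, 1))
print("price premium", round(price_premium, 1))
```

Every row falls between 15% and 16%, regardless of bus width or RPM setting, and the price premium is also just over 15%.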

XS40-010XLPC84 FPGA Comparison:
This chip was not able to implement a 32 bit adder structure.  Using the CoreGen 1.0 Registered Adder (32 bit unsigned), the implementation stage reported the helpful message:

ERROR:OldMap:160 - The RLOC value of R17C0 on component U1/BU0 in RPM U1/hset
   creates a macro that is too large for the device.  Use a bigger device.

With a similar CoreGen 1.0 8 bit adder, the chip was able to implement the design successfully.  With the same delay timing analysis parameters as above, this chip produced 5.691ns of period delay (with the low speed grade), which is about 0.74ns faster than the Spartan2 at -5 (6.427ns).  However, 368 gates were used, as compared to the Spartan2's 229.  An 8 bit adder uses up 15% of the CLBs in this older chip, whereas far more are left free in the Spartan2.  A quick glance at the FPGA Layout Editor reveals that this 8 bit adder takes up significantly more area on the old chip than on the new (approx 25% as compared to less than 10%).  Without more speed tests, it is hard to determine whether the older chip is slower in general; it is clear, however, that the new Spartan2 is a much larger, more flexible chip.

The XS40-010XLPC84 advertises 5000 system gates, whereas the Spartan2 boasts 100,000 gates.  At speed grade -4, the XS40 is available for $7.314, and $8.80 at -5.  In terms of advertised gates per dollar, the Spartan2 is roughly 6.5 to 9 times more cost-effective, depending on which speed grades are compared, which is close to an order of magnitude.  This cost difference seems to be significant enough to make up for a possible speed reduction relative to the XS40 series.
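The gates-per-dollar comparison works out as in the sketch below.  It uses the advertised system gate counts and the prices quoted in this report; the dictionary keys and function name are illustrative.

```python
# part and speed grade: (advertised system gates, price in dollars)
PARTS = {
    "XS40-010XLPC84 -4": (5000, 7.314),
    "XS40-010XLPC84 -5": (5000, 8.80),
    "XC2S100TQ144 -5": (100000, 19.36),
    "XC2S100TQ144 -6": (100000, 22.27),
}


def gates_per_dollar(part):
    """Advertised system gates bought per dollar for one part."""
    gates, price = PARTS[part]
    return gates / price


for part in PARTS:
    print(part, round(gates_per_dollar(part)))

# Advantage of the cheaper Spartan2 grade over the cheaper XS40 grade.
advantage = gates_per_dollar("XC2S100TQ144 -5") / gates_per_dollar("XS40-010XLPC84 -4")
print(round(advantage, 1))
```

Advertised system gates are a marketing figure, so this ratio is only a rough guide to usable logic per dollar.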

FPGA Layout:
add64 (Relationally Placed Macros disabled)
The "size" of the add32 design is roughly twice that of the add16, which is to be expected (presumably, the 32 bit adder is made of the same components as the 16 bit, only twice as many).  It is slightly difficult to verify this linear expansion with the add64, as the routing on the chip is a bit distracting.  However, if only the dense regions of the layout are considered, the design is roughly twice the size of the 32 bit.  The CoreGen architecture seems to occupy a column on the left of the chip; when more CoreGen adder components are needed, more of this column is utilized.  This reflects a concern for routing efficiency, as the CoreGen adders appear to be made up of repeated, fundamental CoreGen full adders.  Use of multiple-bit adders is expected, so the CoreGen full adders are all arrayed close to each other, to minimize routing delay between them.  The 64 bit adder reveals that there are multiple columns of CoreGen elements throughout the chip.

RPM vs noRPM (maximum period delay)
 add32 (RPM)
  8.408ns (5.499ns logic, 2.909ns route) (65.4% logic, 34.6% route)
 add32 (noRPM)
  9.159ns (5.403ns logic, 3.756ns route) (59.0% logic, 41.0% route)

It is apparent from the above data that Relationally Placed Macros increase (and perhaps maximize) routing efficiency.  Over a 32 bit adder, disabling RPM does not seem to greatly affect delay due to logic.  Routing delay, however, increases by about 29% (2.909ns to 3.756ns), raising its share of the total delay from 34.6% to 41.0%.
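The effect of disabling RPM on each delay component can be worked out from the two add32 breakdowns.  The sketch below uses illustrative names and only the figures listed above.

```python
# add32 delay components in ns, from the two breakdown lines above
RPM = {"logic": 5.499, "route": 2.909}
NO_RPM = {"logic": 5.403, "route": 3.756}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


logic_change = percent_change(RPM["logic"], NO_RPM["logic"])
route_change = percent_change(RPM["route"], NO_RPM["route"])
total_change = percent_change(sum(RPM.values()), sum(NO_RPM.values()))

# Share of the total delay spent in routing, in percent.
route_share_rpm = RPM["route"] / sum(RPM.values()) * 100
route_share_no_rpm = NO_RPM["route"] / sum(NO_RPM.values()) * 100

print(round(logic_change, 1), round(route_change, 1), round(total_change, 1))
print(round(route_share_rpm, 1), round(route_share_no_rpm, 1))
```

Logic delay changes by less than 2%, so nearly all of the extra period delay without RPM comes from routing.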

ADD32 with RPM off

The design spans the entire chip, and much of the routing seems to be unnecessarily long.

ADD32 with RPM on  

Notice that the same design is implemented in far less space, and that routing is more dense.