Sampled-data techniques such as Class IV Partial Response with Maximum Likelihood detection (PR-IV) are being applied to magnetic disk drive read channels in order to increase transfer rates and recording densities [1-2]. To make these channels robust, key blocks such as timing recovery, adaptive equalization, and sequence detection are often implemented in the digital domain. The power consumed by these blocks can be appreciable because of the high operating speeds required, with data rates on the order of 50-100 Mbits/sec and beyond. Power consumption is critical for two reasons: the demand for battery-operated portable systems, and the proximity of the electronics to the magnetic media as form factors continually decrease. The latter matters because the heat dissipated by the electronics affects the media. For example, the CMOS logic alone in a BiCMOS 65 Mbits/sec read channel IC, which implements digital equalization and sequence detection among other functions, dissipates 1 W [3].
This paper describes an architecture for digital implementation of an adaptive equalizer suitable for use in PRML disk drive channels. The new architecture was used in a prototype IC containing an 8-tap, 50 MHz sampling rate adaptive equalizer and a Viterbi detector; the IC consumes a total of 70 mW from a 3.3 V power supply in a 1.2 µm CMOS technology. Two key ideas are used in the implementation of the adaptive equalizer. First, parallelism and pipelining are employed in the architecture to achieve a high throughput rate at the reduced power supply voltage of 3.3 V. Second, the sign-LMS algorithm is used for coefficient adaptation and tracking, and results in improved performance as compared with the full LMS algorithm.
A block diagram of a PR-IV read channel is shown in Figure 1. The output of the magnetic disk is first amplified by the read amplifier and then passed to the analog front-end, which includes a variable gain amplifier (controlled in an automatic gain control loop, not shown), a lowpass filter, a sampler, and an analog-to-digital converter. The functions of adaptive equalization, sequence detection, and timing recovery follow in the digital domain. The adaptive equalizer operates on the 6-bit samples from the A/D converter, equalizing these samples for subsequent detection by the sequence detector and for use by the timing recovery block. The prototype IC contains the adaptive equalizer and sequence detector blocks.
The multiplier is the building block in the adaptive equalizer that is most costly in terms of both speed and power. Since the power consumed by a CMOS digital circuit is $CV^2f$, reducing the power supply voltage and employing parallelism, pipelining, or a combination of the two can result in significant power savings [4]. Applications using a 3.3 V power supply are becoming widespread in order to reap this reduction in power. PRML channel implementations typically use 6 bits into the adaptive equalizer, as dictated by the off-channel signal-to-noise ratio. Extensive simulations were performed to explore various permutations of parallelism and pipelining in the implementation of a conventional multiplier [5] for filter sampling rates above 50 MHz and a power supply of 3.3 V. It was found that, for a 6-bit by 6-bit multiplier, the use of 4 multipliers operating in parallel and staggered in phase by the output period T, as shown in Figure 2, dissipated the lowest overall power. This low power comes at the cost of increased silicon area, which, however, will scale with technology. For the required resolution of 6 bits, the overhead associated with pipeline latches in a pipelined implementation of the multiplier increases the power while decreasing the attainable speed (due to latch set-up times).
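As a rough worked example of the supply-voltage argument, assume the simple dynamic-power model $P = CV^2f$ with the switched capacitance $C$ and clock rate $f$ held fixed. The subscripts and the parallelism factor $N$ below are illustrative labels, not notation from the original. Moving from a 5.0 V to a 3.3 V supply then gives:

```latex
\frac{P_{3.3}}{P_{5.0}}
  = \frac{C\,(3.3)^2 f}{C\,(5.0)^2 f}
  = \frac{10.89}{25}
  \approx 0.44

% N parallel units, each clocked at f/N, with total capacitance about N C:
P_{\text{parallel}} = (N C)\, V^2\, \frac{f}{N} = C V^2 f
```

In this idealized model, parallelism by itself leaves the power unchanged at a given voltage. The saving comes from the relaxed timing of each unit, which is what allows the supply to be lowered while the overall throughput is maintained. Routing, multiplexing, and latch overhead are ignored here.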
3.1 Parallel Filter Architecture
A block diagram of a filter stage used to implement the FIR filter is shown in Figure 3a. It comprises a delay line, a set of multipliers, and an accumulator, and so resembles the block diagram of a conventional FIR filter. Because the multipliers take 4 output periods to perform a multiplication, 4 filter stages are used in parallel. The multiplier and accumulator sections are each clocked at one-fourth the output rate of the filter and operate staggered in phase by one output period. The parallel architecture and timing diagrams for this approach are shown in Figures 3b and 3c, respectively. The use of input latches in both the multipliers and the accumulator enables the functions of multiplication and accumulation to be pipelined, as shown for one filter stage in Figure 3d. During each filter stage clock cycle, the current 8 inputs to a filter stage are multiplied by their respective tap weights. The resulting products are summed by the accumulator during the next filter stage clock cycle.
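The staggered parallel arrangement can be illustrated with a small behavioral sketch. It is not a model of the chip's circuitry or pipeline timing: it only shows that four stages, each producing every fourth output, together reproduce the output of a single full-rate FIR filter. The function names and the zero initial delay-line state are assumptions made for illustration.

```python
NUM_STAGES = 4  # parallel filter stages, as described in the text
NUM_TAPS = 8    # taps per filter stage, as described in the text


def direct_fir(x, c):
    """Reference full-rate FIR: y[k] = sum_i c[i] * x[k - i], zero initial state."""
    return [
        sum(c[i] * x[k - i] for i in range(len(c)) if k - i >= 0)
        for k in range(len(x))
    ]


def stage_output(x, c, k):
    """One multiply-accumulate pass of a single filter stage for output index k."""
    # Delay-line contents seen by this stage (zeros before the first sample).
    window = [x[k - i] if k - i >= 0 else 0 for i in range(len(c))]
    # Bank of multipliers: one product per tap.
    products = [ci * xi for ci, xi in zip(c, window)]
    # Accumulator: sum of the products.
    return sum(products)


def parallel_fir(x, c, num_stages=NUM_STAGES):
    """Interleave the outputs of num_stages stages, each running at 1/num_stages rate."""
    y = [0] * len(x)
    for stage in range(num_stages):
        # Stage s is responsible for outputs s, s + N, s + 2N, ...
        for k in range(stage, len(x), num_stages):
            y[k] = stage_output(x, c, k)
    return y
```

Each stage has four output periods in which to finish its products and sum, which is the slack that permits slower, lower-voltage multipliers.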
The Least Mean Square (LMS) algorithm is often used to update the tap weights in the adaptive equalizer. In the LMS algorithm, each coefficient $c$ is updated using the equation
$$c_{k+1} = c_k + \mu\, e_k\, x_k$$
where $\mu$ is the step size, $e_k$ the input error to the slicer at time k, and $x_k$ the input to the particular tap weight at time k. Implementation of this algorithm requires two multiplies on top of generation of the slicer error in order to obtain the correction term, which is added to the current value of the coefficient as shown in Figure 4a. The time required for these multiplies adds to the latency in the coefficient update, which in simulation resulted in stability problems in the adaptation, requiring very small values of $\mu$.
In order to reduce this feedback latency in the coefficient update, the sign-LMS algorithm was employed, where the coefficients are updated using the equation
$$c_{k+1} = c_k + \mu\, \mathrm{sgn}(e_k)\, \mathrm{sgn}(x_k)$$
where sgn(.) is the signum function, which is +1 or -1 for a positive or negative argument, respectively. The coefficient update then reduces to
$$c_{k+1} = c_k \pm \mu$$
depending on the product $\mathrm{sgn}(e_k)\,\mathrm{sgn}(x_k)$. In the implementation, both $c_k + \mu$ and $c_k - \mu$ are computed in parallel while $\mathrm{sgn}(e_k)\,\mathrm{sgn}(x_k)$ is computed using an exclusive-OR operation on the sign bits of the error and of the input to the respective tap. Depending on the outcome of the exclusive-OR operation, $c_k + \mu$ or $c_k - \mu$ is chosen. This is shown schematically in Figure 4b. This approach reduces the feedback latency from 16 periods down to 8, as listed in Figure 4c, and performs robustly. There is the added advantage that the multiplications required in the full LMS algorithm reduce to a single exclusive-OR, which results in significant power and area savings.
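The select-between-two-candidates update described above can be sketched in a few lines. This is a behavioral illustration under stated assumptions, not the circuit of Figure 4b: the function names are hypothetical, the step size is written `mu`, the correction is taken with a positive sign, and a zero operand is treated as positive because its sign bit is 0.

```python
def sign_bit(v: int) -> int:
    """Sign bit of a two's-complement value: 1 if negative, else 0 (zero counts as positive)."""
    return 1 if v < 0 else 0


def sign_lms_update(c: int, mu: int, e: int, x: int) -> int:
    """Sign-LMS update: pick c + mu or c - mu using an XOR of the two sign bits."""
    # Both candidates are formed up front, mirroring the parallel computation in the text.
    c_plus = c + mu
    c_minus = c - mu
    # XOR of the sign bits is 1 exactly when sgn(e) * sgn(x) = -1.
    signs_differ = sign_bit(e) ^ sign_bit(x)
    return c_minus if signs_differ else c_plus


def lms_update(c: int, mu: int, e: int, x: int) -> int:
    """Full LMS update for comparison: needs two multiplies per coefficient."""
    return c + mu * e * x
```

For example, `sign_lms_update(10, 1, 3, -2)` selects the `c - mu` candidate because the error and the tap input have opposite signs, while the full LMS update for the same operands scales the correction by the magnitudes of both.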
Since the disk drive channel is slowly varying, the output of only one of the four parallel filter stages is used in the coefficient update, and the same set of coefficients is used by all four filter stages. In order to relax the timing requirements on the availability of the coefficient set at each of the four filter stages, the approach shown in Figure 5 is used. With this approach, the coefficients output from the coefficient update block need only be available at the latching instant of Filter Stage 1. The coefficients are passed through this filter stage and are sampled one output period later by Filter Stage 2. This process is repeated until the coefficients are latched by Filter Stage 4. The update circuitry therefore has to meet a single latching instant rather than four. Furthermore, this approach reduces the parasitic capacitance which would otherwise have to be driven by one set of buffers if all four filter stages were to receive the coefficients directly from the update block.
A 3-tap raised-cosine equalizer response with tap weights -K, 1.0, -K was assumed for the initial response before convergence. An externally controlled initialization signal allows the fourth through the sixth coefficients to be initialized to tap weight values provided off-chip. While these three taps are set to their respective values (the fourth and sixth tap weights are the same), the other 5 tap weights are forced to zero. The ability to externally set each of the tap weights individually could easily be implemented but was avoided for simplicity. The value of the step size $\mu$ is also provided externally and can be programmed to implement gear-shifting algorithms for fast convergence.
The chip receives two six-bit words at one-half rate, which relaxes the timing requirements for latching the input data while having no effect on the performance of the filter. Since seven of the inputs used by Filter Stage 1 are also used by Filter Stage 2, the inputs, after being latched by a filter stage, are routed through that stage to be sampled by the subsequent filter stage one output period later, as shown in Figure 6. This relaxes the timing requirements on movement of the input data throughout the chip.
The sequence (Viterbi) detector is realized by two half-rate Viterbi detectors [6], one operating on the outputs of Filter Stages 1 and 3 and the other on the outputs of Filter Stages 2 and 4, as shown in Figure 7. The inputs to the Viterbi detector blocks are 6 bits wide. The outputs of both detectors contribute to a two-wide, one-half rate bit stream which is output off-chip (commutator shown in the figure).
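The reason two half-rate detectors suffice can be illustrated with a short sketch. It assumes the ideal Class IV response 1 - D^2, so each sample depends only on inputs two positions apart, and the even- and odd-indexed samples form two independent 1 - D sequences. The sketch shows only this de-interleaving property, not a Viterbi detector, and the function names and zero initial channel state are assumptions made for illustration.

```python
def pr4_channel(a):
    """Ideal PR-IV (1 - D^2) response: y[k] = a[k] - a[k-2], zero initial state."""
    return [a[k] - (a[k - 2] if k >= 2 else 0) for k in range(len(a))]


def dicode_channel(a):
    """Ideal 1 - D response: y[k] = a[k] - a[k-1], zero initial state."""
    return [a[k] - (a[k - 1] if k >= 1 else 0) for k in range(len(a))]


def deinterleave(seq):
    """Split a sequence into its even-indexed and odd-indexed subsequences."""
    return seq[0::2], seq[1::2]


def interleave(even, odd):
    """Commutate two half-rate sequences back into one full-rate sequence."""
    out = []
    for i in range(len(even) + len(odd)):
        out.append(even[i // 2] if i % 2 == 0 else odd[i // 2])
    return out
```

Passing the even and odd halves of an input sequence through `dicode_channel` separately and then interleaving the results gives the same samples as `pr4_channel` on the full sequence, which is why each half-rate detector can work on its own subsequence.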
Due to the regularity of the flow of both input data and filter taps through the chip shown in Figures 5 and 6, a close mapping of the block diagram to the layout is possible. A detailed schematic of a filter stage is shown in Figure 8 indicating the flow of the input and coefficient into each multiplier cell. The two words enter their respective latches where they are held for the duration of the multiply as well as passed through the cell to the next filter stage. A high level view illustrating the flow of the input data and coefficients through the filter is shown in Figure 9.
A die photo of the chip is shown in Figure 10. The layout closely resembles the block diagram shown in Figure 9.
A prototype IC including the adaptive equalizer and sequence detector was fabricated in 1.2 µm MOSIS CMOS.
The power consumption for a power supply (Vdd) equal to 3.3 V and 5.0 V is plotted as a function of output rate in Figure 11. The prototype circuit was originally designed to operate at 100 MHz with a power supply of 3.3 V. Due to an error in the extraction procedure during the design process, the capacitance in the critical path through the accumulator and coefficient update circuitry was greatly underestimated, resulting in 100 MHz operation only with a power supply near 5 V. Extrapolating the power consumption measured at low frequencies suggests that a redesign could achieve 100 MHz operation at 3.3 V with a power consumption below 200 mW. A summary of the key performance characteristics is given in Table 1.
The authors would like to thank A. Chandrakasan, S. Sheng, members of the IC Group, and Professors R. Brodersen and J. Rabaey and their research groups at UC Berkeley for their guidance, support, and encouragement.
This work was supported by NSF MIP 9109525, California Micro Program, and National Semiconductor.
[1] H. Kobayashi and D. T. Tang, "Application of partial-response channel coding to magnetic recording systems," IBM Journal of Research and Development, pp. 368-375, July 1970.
[2] R. D. Cideciyan, F. Dolivo, R. Hermann, W. Hirt, and W. Schott, "A PRML system for digital magnetic recording," IEEE J. on Selected Areas in Com., vol. 10, no. 1, pp. 38-56, January 1992.
[3] R. Philpott, R. Kertis, R. Richetta, T. Schmerbeck, and D. Schulte, "A 7 MB/sec (65 MHz) mixed-signal magnetic recording channel DSP using partial response signaling with maximum likelihood detection," CICC 1993, pp. 10.4.1-10.4.4, May 1993.
[4] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low power CMOS digital design," IEEE Journal of Solid State Circuits, vol. 27, no. 4, pp. 473-484, April 1992.
[5] C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm," IEEE Trans. on Computers, vol. C-22, no. 12, Dec. 1973.
[6] R. W. Wood and D. A. Peterson, "Viterbi detection of class IV partial response on a magnetic recording channel," IEEE Trans. on Communications, vol. COM-34, no. 5, pp. 454-461, May 1986.