A new architecture for digital implementation of the adaptive equalizer in Class IV Partial Response Maximum Likelihood (PRML) channels employing parallelism and pipelining is described. The architecture was used in a prototype integrated circuit in a 1.2 μm CMOS technology to implement an 8-tap adaptive equalizer and Viterbi sequence detector which consumes a total of 70 mW from a 3.3 V supply operating at an input sampling rate of 50 MHz.
Sampled-data techniques such as Class IV Partial Response with Maximum Likelihood detection (PRML) are being applied to magnetic disk storage channels in order to increase transfer rates and recording densities [1-2]. In order to provide a robust implementation of the functions required in the read path of these channels, implementation of key blocks such as timing recovery, adaptive equalization, and sequence detection is often in the digital domain. The power consumed by these blocks can be appreciable due to the high speeds of operation required with data rates currently on the order of 50-100 Mbits/sec and exceeding this in the future. For example, the power dissipation of the CMOS logic alone in a recently reported BiCMOS 65 Mbits/sec read channel IC implementing digital equalization, sequence detection, and other associated functions was 1 W [3]. Powerful battery-operated portable personal computers are becoming increasingly prevalent and the performance requirements of the storage devices in these systems continually increase. Power consumption is a critical parameter in such systems, both to increase battery life and to minimize heat dissipation effects due to the proximity of the electronics to the magnetic media in drives with decreasing form factors. The net result is the need for increased speed performance of the electronics at a reduced overall power consumption.
A block diagram of a typical PRML read channel is shown in Figure 1. The output of the magnetic disk is first amplified by the read amplifier before being passed on to the analog front-end which includes a variable gain amplifier (controlled in an automatic gain control loop not shown), lowpass filter, sampler, and analog-to-digital converter followed by the functions of adaptive equalization, sequence detection, and timing recovery in the digital domain. The adaptive equalizer operates on the 6-bit samples coming from the A/D converter, equalizing these samples for subsequent detection by the sequence detector and use by the timing recovery block.
This paper describes a low-power architecture for digital implementation of an adaptive equalizer suitable for use in PRML magnetic disk drive channels. The new architecture was used in a prototype IC containing an 8-tap adaptive equalizer and Viterbi detector which at a 50 MHz input sampling rate consumes a total power of 70 mW operating from a 3.3 V power supply in a 1.2 μm CMOS technology [14]. The reduction in operating power relative to the current practice was made possible through the use of an optimum mixture of parallelism and pipelining, which in turn allowed the use of a 3.3 V supply while meeting the throughput requirements.
2.0 Low Power Adaptive Equalizer Architecture and Design
As has been recently demonstrated in other work [4], major reductions in power dissipation in such digital functions are possible through the use of lower power supply voltages, since the power consumed by a CMOS digital circuit is proportional to $CV^2f$. The corresponding increase in gate delay is only linear with supply voltage, all else being equal, so that the effective power-delay product is improved at lower voltages. The increased gate delay can be accommodated through the use of a combination of parallelism and pipelining, requiring more gates for the implementation but giving major reductions in overall power dissipation. The optimum combination of parallelism and pipelining is highly dependent on the particular function being performed, and the exploration of these trade-offs for the implementation of an adaptive equalizer suitable for use in PRML magnetic disk drive read channels is the main topic of this paper.
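As a first-order illustration of this trade-off, the following worked estimate applies the $CV^2f$ relation to an N-way parallel implementation. It is a sketch under stated assumptions, not a result from the prototype: it ignores the extra capacitance of added latches, routing, and multiplexing, and the 5 V reference supply is chosen here only for comparison.

$$P_{\mathrm{ref}} = C\,V_{\mathrm{ref}}^{2}\,f, \qquad P_{\mathrm{par}} = (N C)\,V_{\mathrm{low}}^{2}\,\frac{f}{N} = C\,V_{\mathrm{low}}^{2}\,f$$

$$\frac{P_{\mathrm{par}}}{P_{\mathrm{ref}}} = \left(\frac{V_{\mathrm{low}}}{V_{\mathrm{ref}}}\right)^{2} = \left(\frac{3.3}{5.0}\right)^{2} \approx 0.44$$

Under these assumptions, each of the N units runs at $f/N$ and can therefore tolerate the longer gate delay at the lower supply, while the total throughput is unchanged. Any overhead capacitance reduces the saving below this idealized figure.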
Magnetic disk drive read channels are unique both because of the high sampling and bit rates involved, and because of the relatively modest SNR of the signals detected off the disk. Most implementations of PRML channels use a 6-bit signal representation into the adaptive equalizer, dictated by the off-channel signal-to-noise ratio. This small word width has major implications for the minimization of power through the use of parallelism, since multipliers for this width are relatively compact. Extensive simulations were performed to explore various permutations of parallelism and pipelining in the implementation of a conventional multiplier [5] for filter sampling rates above 50 MHz and a power supply of 3.3 V. It was found that for implementation of a 6-bit by 6-bit multiplier, the use of 4 multipliers operating in parallel and staggered in phase by the output period T, as shown in Figure 2, resulted in the solution dissipating the lowest overall power. This advantage of low power comes at the cost of increased silicon area which, however, will scale with technology. For the required resolution of 6 bits, the overhead associated with pipeline latches in a pipelined implementation as opposed to a parallel implementation of the multiplier increases the power while decreasing the attainable speed (due to latch set-up times).
While the parallel architecture helps to increase overall throughput for a given supply voltage, system latency is also increased. This latency will have implications from a system standpoint, particularly in applications in which the block employing parallelism is nested in one or more feedback loops. The resulting delay will affect the stability of the feedback loops and must be considered in the system design.
2.1 Parallel Filter Architecture
A block diagram of a filter stage used to implement the equalizer is shown in Figure 3a; it comprises a delay line, a set of multipliers, and an accumulator. The morphology of the filter stage resembles the block diagram of an FIR filter. In order to use multipliers which take 4 output periods to perform multiplication, 4 filter stages are used in parallel. The multiplier and accumulator sections are each clocked at one-fourth the output rate of the equalizer and operate staggered in phase by one output period. The parallel architecture and timing diagrams for this approach are shown in Figures 3b and 3c, respectively. The use of input latches in both the multipliers and accumulator enables the functions of multiplication and accumulation to be pipelined as shown for one filter stage in Figure 3d. During each filter stage clock cycle, the current 8 inputs into a filter stage are multiplied by their respective tap weights. The resulting products are summed by the accumulator during the next clock cycle of that filter stage.
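The interleaving of four quarter-rate stages can be illustrated with a small behavioral model. The sketch below is a minimal Python model under stated assumptions, not the chip's implementation: the names `direct_fir` and `interleaved_fir` and the example sample and tap values are hypothetical, and the multiply/accumulate pipelining inside a stage is not modeled. It only shows that four stages, each producing every fourth output, reproduce the output sequence of a conventional 8-tap FIR filter.

```python
NUM_STAGES = 4   # parallel filter stages, as in the text
NUM_TAPS = 8     # equalizer length, as in the text


def direct_fir(samples, taps):
    """Reference: conventional FIR filter, one output per input sample."""
    out = []
    for k in range(len(samples)):
        acc = 0
        for i, c in enumerate(taps):
            if k - i >= 0:
                acc += c * samples[k - i]
        out.append(acc)
    return out


def interleaved_fir(samples, taps, num_stages=NUM_STAGES):
    """Behavioral model: stage s computes outputs k = s, s + N, s + 2N, ...

    Each stage latches a window of the most recent len(taps) samples once
    every num_stages output periods, so its multipliers have num_stages
    output periods to finish.
    """
    out = [0] * len(samples)
    for stage in range(num_stages):
        for k in range(stage, len(samples), num_stages):
            window = [samples[k - i] if k - i >= 0 else 0
                      for i in range(len(taps))]
            out[k] = sum(c * x for c, x in zip(taps, window))
    return out


# Hypothetical 6-bit signed samples and integer tap weights (illustration only).
example_samples = [3, -7, 12, 31, -32, 5, 0, -1, 9, 14, -20, 6]
example_taps = [0, 0, -2, 8, -2, 0, 0, 1]
```

Consecutive stages see windows that differ by a single new sample, which is why most of the input data can be handed from one stage to the next rather than redistributed at the full rate.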
Much of the discussion thus far has focused on block level considerations to reduce power consumption of the multiplier function. Further optimization can occur with appropriate design choices at the circuit level. The multipliers were constructed using the Baugh-Wooley two's complement parallel array algorithm [5]. The benefits of two logic styles, complementary pass logic and static CMOS, were combined to implement the full adders as shown in Figure 4. To reduce the number of transistors, and consequently the power consumption associated with extra parasitic drain and source capacitances, complementary pass logic was used in the sum generation portion of the full adder. To increase speed and drive capability, static CMOS was used both at the output of the sum generator and to implement the carry generation signal.
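For readers unfamiliar with two's complement array multiplication, the arithmetic idea can be sketched in a few lines. The Python below uses the commonly taught modified Baugh-Wooley formulation (inverted sign-bit partial products plus two correction constants); this is an assumption for illustration and may differ in detail from the array in [5] and from the circuit on the chip. The function names are hypothetical.

```python
N_BITS = 6  # operand width used by the equalizer multipliers


def to_bits(value, n=N_BITS):
    """Two's complement bits of a signed integer, least significant bit first."""
    return [(value >> i) & 1 for i in range(n)]


def baugh_wooley_multiply(a, b, n=N_BITS):
    """Multiply two n-bit signed integers using only non-negative partial products."""
    a_bits, b_bits = to_bits(a, n), to_bits(b, n)
    total = 0
    for i in range(n):
        for j in range(n):
            pp = a_bits[i] & b_bits[j]
            # Partial products involving exactly one sign bit are inverted.
            if (i == n - 1) != (j == n - 1):
                pp ^= 1
            total += pp << (i + j)
    # Correction constants: a 1 in column n and a 1 in column 2n - 1.
    total += (1 << n) + (1 << (2 * n - 1))
    total &= (1 << (2 * n)) - 1            # keep the low 2n bits
    # Reinterpret the 2n-bit result as a signed number.
    if total >= 1 << (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

Because every partial product is added rather than subtracted, the array can be built from identical full-adder cells, which is what makes the choice of full-adder logic style so significant for power.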
The Least Mean Square (LMS) algorithm is often used to update the tap weights in the adaptive equalizer. In the LMS algorithm, each coefficient $c$ is updated using the equation
$$c_{k+1} = c_k + \mu\, e_k\, x_k$$
where $\mu$ is the step size, $e_k$ the input error to the slicer at time k, and $x_k$ the input to the particular tap weight at time k. Implementation of this algorithm requires two multiplies following generation of the slicer error in order to obtain the correction term which is added to the current value of the coefficient as shown in Figure 5a. The time required for these multiplies adds directly to the latency in the coefficient update. Simulation results indicated that the latency associated with the parallel multipliers degraded stability in the adaptation loop, in turn requiring very small values of $\mu$ and long adaptation times to eliminate stability problems.
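The full LMS update can be sketched as follows. This is a minimal floating-point illustration under stated assumptions, not the fixed-point datapath of the chip: the error is taken as desired minus actual output (with the opposite definition the sign of the correction flips), the adaptation loop has no latency, and the setup is a simple noise-free identification of a known reference filter rather than a disk channel. The names `lms_update` and `run_lms` are hypothetical.

```python
import random


def lms_update(coeffs, window, error, mu):
    """One LMS step: c_i <- c_i + mu * e_k * x_{k-i} for every tap i."""
    return [c + mu * error * x for c, x in zip(coeffs, window)]


def run_lms(target_taps, mu=0.05, steps=2000, seed=0):
    """Adapt coefficients so the filter output tracks a reference filter."""
    rng = random.Random(seed)
    n = len(target_taps)
    coeffs = [0.0] * n
    window = [0.0] * n                       # x_k, x_{k-1}, ...
    for _ in range(steps):
        window = [rng.choice((-1.0, 1.0))] + window[:-1]
        desired = sum(t * x for t, x in zip(target_taps, window))
        output = sum(c * x for c, x in zip(coeffs, window))
        error = desired - output             # assumed convention: e_k = d_k - y_k
        coeffs = lms_update(coeffs, window, error, mu)
    return coeffs
```

Each coefficient update needs the product of the step size, the error, and the tap input, which is the pair of multiplies whose latency the text identifies as the problem in a parallel implementation.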
In order to reduce this feedback latency in the update of the coefficients, the sign-sign LMS algorithm was employed where the coefficients are updated using the equation
$$c_{k+1} = c_k + \mu\, \mathrm{sgn}(e_k)\, \mathrm{sgn}(x_k)$$
where sgn(.) is the signum function which is +1 or -1 for a positive or negative argument, respectively. The coefficient update then reduces to $c_k + \mu$ or $c_k - \mu$, depending on the product $\mathrm{sgn}(e_k)\,\mathrm{sgn}(x_k)$. In the implementation, both $c_k + \mu$ and $c_k - \mu$ are computed in parallel while $\mathrm{sgn}(e_k)\,\mathrm{sgn}(x_k)$ is computed using an exclusive-OR operation on the sign bits of both the error and the input to the respective tap. Depending on the outcome of the exclusive-OR operation, $c_k + \mu$ or $c_k - \mu$ is selected. This is shown schematically in Figure 5b. This approach reduces the feedback latency from 16 periods down to 8 as listed in Figure 5c and performs robustly. Another advantage of the sign-sign LMS algorithm is that the multiplications required in the full LMS algorithm reduce to a single exclusive-OR, which results in significant power and area savings. Information on the stability of the sign-sign LMS algorithm can be found in [7-10].
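The select-between-two-candidates structure can be sketched as below. This is a behavioral illustration only, assuming the convention that matching signs select the incremented coefficient and that a zero value is treated as positive (sign bit 0), as it would be in two's complement; the function names are hypothetical.

```python
def sign_bit(value):
    """Sign bit of a two's complement number: 1 if negative, else 0."""
    return 1 if value < 0 else 0


def sign_sign_update(coeff, error, tap_input, mu):
    """Select between precomputed c + mu and c - mu using an XOR of sign bits."""
    plus = coeff + mu      # both candidates are formed in parallel
    minus = coeff - mu
    signs_differ = sign_bit(error) ^ sign_bit(tap_input)
    # Differing signs mean the product of the signs is -1, so take c - mu.
    return minus if signs_differ else plus
```

Since both candidates can be computed before the error is known, only the exclusive-OR and the final selection remain in the feedback path once the error arrives.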
Because the disk drive channel is slowly time varying, the output of only one of the four parallel filter stages is used in the update of the coefficients. This same set of coefficients is used by all four filter stages. In order to reduce timing requirements on the availability of the coefficient set to each of the four filter stages, the approach shown in Figure 6 is used. New coefficients from the coefficient update block need be available only during the latching instant of Filter Stage 1. The coefficients are held by the Filter Stage 1 latches for the multipliers there as well as applied to the inputs of the latches in Filter Stage 2 where they are sampled one output period later. This process is repeated until the coefficients are latched into Filter Stage 4. Since the outputs from the coefficient update block need only be available during the latching instant of Filter Stage 1, their availability requirements are much less than if the coefficient update block were to feed each of the filter stages directly. Furthermore, the amount of parasitic capacitance which would otherwise have to be driven by one set of buffers if all four filter stages were to receive the coefficients directly is greatly reduced.
A 3-tap raised-cosine equalizer response with tap weights -K, 1.0, -K was assumed for the initial equalizer impulse response before convergence. An externally controlled initialization signal allows the fourth through the sixth coefficients to be initialized to tap weight values provided off-chip. While these three taps are set to their respective values (fourth and sixth tap weights are the same), the other 5 tap weights are forced to zero. The ability to externally set each of the tap weights individually could easily have been implemented but was avoided in the prototype for simplicity. The value of the step size $\mu$ is also provided externally and can be programmed to implement gear shifting algorithms for faster convergence.
The chip receives two six-bit words at a one-half rate relaxing the timing requirements for latching of the input data while having no effect on the equalizer performance. Input samples are supplied via a tapped delay line shown in Figure 7. Samples are first latched into Filter Stage 1 on the rising edge of Clock 1. Seven of the samples residing in the Filter Stage 1 latches, while being held for the multipliers, are passed through to Filter Stage 2 to be latched along with one new sample on the rising edge of Clock 2. This process is again repeated for Filter Stages 3 and 4 on the rising edge of Clocks 3 and 4, respectively. This approach reduces the amount of data needing to be passed around the chip at high speed and relaxes timing requirements on the movement of most of the input data throughout the chip. Note that the desired convolution occurs as a vector of input samples moves diagonally through the parallel stages.
The sequence (Viterbi) detector is designed for a PR-IV target and as such is realized by two interleaved independent half-rate Viterbi detectors [6] operating on the outputs of Filter Stages 1 and 3, and 2 and 4 as shown in Figure 8. The input words into the Viterbi detector blocks are 6-bits wide. The outputs of both detectors contribute to a two-wide one-half rate bit stream which is output off-chip (commutator shown in the figure). Descriptions of Viterbi detector implementations can be found in the literature [16-17].
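The decomposition into two interleaved detectors follows from the PR-IV polynomial $1 - D^2$: the even-indexed and odd-indexed samples each form an independent $1 - D$ (dicode) channel. The Python sketch below illustrates this under simplifying assumptions that are not taken from the chip: data bits are in {0, 1}, the ideal sample levels are -1, 0, and +1, the channel starts from zeros, and the detector keeps full survivor paths instead of a truncated path memory. All names are hypothetical.

```python
def dicode_viterbi(samples, initial_bit=0):
    """Two-state Viterbi detector for y_m = a_m - a_{m-1}, with a in {0, 1}."""
    INF = float("inf")
    metrics = [INF, INF]          # state = previous bit
    metrics[initial_bit] = 0.0    # start in the known initial state
    paths = [[], []]
    for y in samples:
        new_metrics = [INF, INF]
        new_paths = [None, None]
        for bit in (0, 1):        # candidate current bit = next state
            for prev in (0, 1):
                if metrics[prev] == INF:
                    continue
                cost = metrics[prev] + (y - (bit - prev)) ** 2
                if cost < new_metrics[bit]:
                    new_metrics[bit] = cost
                    new_paths[bit] = paths[prev] + [bit]
        metrics, paths = new_metrics, new_paths
    best = 0 if metrics[0] <= metrics[1] else 1
    return paths[best]


def pr4_channel(bits):
    """Ideal PR-IV (1 - D^2) samples, assuming the channel starts from zeros."""
    padded = [0, 0] + list(bits)
    return [padded[k + 2] - padded[k] for k in range(len(bits))]


def pr4_detect(samples):
    """Detect PR-IV data with two interleaved half-rate dicode detectors."""
    even = dicode_viterbi(samples[0::2])
    odd = dicode_viterbi(samples[1::2])
    bits = [0] * len(samples)
    bits[0::2] = even
    bits[1::2] = odd
    return bits
```

Each half-rate detector only ever sees every other equalizer output, which matches the pairing of filter stage outputs described above.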
Due to the regularity in the flow of both the input data and the filter taps throughout the chip (shown in Figures 6 and 7), a close mapping of the block diagram to the layout is possible. A detailed schematic of a filter stage is shown in Figure 9 indicating the flow of the input data and coefficient into each multiplier cell. The two words enter their respective latches where they are held for the duration of the multiply as well as passed through the cell to the next filter stage. A high level view illustrating the flow of the input data and coefficients through the filter is shown in Figure 10. A die photo of the chip is shown in Figure 11. Note that the layout closely resembles the block diagram shown in Figure 10.
A prototype IC including the adaptive equalizer and sequence detector was fabricated in a 1.2 μm MOSIS CMOS process.
To verify performance, the chip was supplied via a logic analyzer with pseudo-random channel samples generated using Ptolemy, a system simulation tool [11]. A linear channel was assumed and modelled using a Lorentzian pulse [18] to represent transitions on the disk. $PW_{50}/T$ is a parameter used in the Lorentzian model as a relative measure of the intersymbol interference (ISI) in the channel, where higher values of $PW_{50}/T$ correspond to more ISI. Typical disk drive channels today can be characterized by values of $PW_{50}/T$ between 1.5 and 2.0, and will approach 2.5 in the future. Since it requires the most boost from the equalizer, a $PW_{50}/T$ value of 2.5 was used to generate channel samples for all functional testing of the prototype. Experiments were performed by initializing the equalizer taps to intentionally introduce mis-equalization, feeding the generated channel samples via the logic analyzer to the prototype, and observing the equalizer outputs and the tap weight values over time. The equalizer outputs approached the desired target response while the tap weights converged to their desired values for pseudo-random data. Functionality was further verified by comparing the equalizer outputs to previous system simulations.
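To make the role of the pulse-width parameter concrete, the sketch below generates noise-free readback samples from a Lorentzian transition response. It is a minimal Python illustration under stated assumptions and is not the Ptolemy setup used for the measurements: the pulse is normalized to unit peak, T is taken as 1, samples are taken at the transition instants, and the function names and the `span` truncation are hypothetical.

```python
def lorentzian(t, pw50):
    """Isolated-transition response with unit peak; pw50 is the width at half amplitude."""
    return 1.0 / (1.0 + (2.0 * t / pw50) ** 2)


def channel_samples(bits, pw50_over_t, span=10):
    """Noise-free readback samples at t = kT for NRZ data bits in {0, 1} (T = 1)."""
    levels = [2 * b - 1 for b in bits]            # write current, +1 or -1
    # A transition exists at index j wherever the write current changes.
    transitions = [(j, (levels[j] - levels[j - 1]) / 2.0)
                   for j in range(1, len(levels))
                   if levels[j] != levels[j - 1]]
    samples = []
    for k in range(len(bits)):
        y = sum(amp * lorentzian(k - j, pw50_over_t)
                for j, amp in transitions if abs(k - j) <= span)
        samples.append(y)
    return samples
```

With a ratio of 2.5, the response one bit period away from a transition is still about 0.61 of the peak, compared with about 0.36 for a ratio of 1.5, which is why the higher value demands more equalizer boost.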
The power consumption for supply voltages (Vdd) of 3.3 V and 5.0 V is plotted as a function of output rate in Figure 12. The prototype circuit was originally designed to operate at 100 MHz with a power supply of 3.3 V. Due to an error in the extraction procedure during the design process, the capacitance in the critical path through the accumulator and coefficient update circuitry was underestimated. As a result, in order to achieve proper operation at 100 MHz, the supply voltage had to be increased to 5.0 V. Extrapolating the power consumption at low frequencies suggests that at 3.3 V, it is feasible that 100 MHz operation could be achieved in a re-design with a power consumption below 200 mW. A summary of the key performance characteristics is given in Table 1. A comprehensive summary of test results can be found in [12] and [13].
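The extrapolation can be checked with simple arithmetic. The estimate below is a sketch that assumes dynamic power dominates and scales linearly with frequency at a fixed supply; it uses only the 70 mW at 50 MHz and 3.3 V figure reported for the prototype and does not account for any added capacitance in a re-design.

$$\frac{P}{f} = \frac{70\ \mathrm{mW}}{50\ \mathrm{MHz}} = 1.4\ \mathrm{mW/MHz}, \qquad P(100\ \mathrm{MHz}) \approx 1.4\ \mathrm{mW/MHz} \times 100\ \mathrm{MHz} = 140\ \mathrm{mW}$$

Under these assumptions the linear estimate falls below the 200 mW bound quoted above, leaving some margin for the larger drivers a corrected critical path might need.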
In this paper, an architecture for implementation of adaptive equalizers as required in applications such as the PRML magnetic disk read channel was described. The architecture utilizes parallelism and pipelining which enables high speed operation at a reduced power supply, resulting in a speed-power ratio of 1.4 mW/MHz (for 8 taps) which compares favorably to conventional approaches. The design of a prototype digital detector including the functions of adaptive equalization and sequence detection was described. The prototype demonstrates the potential of the proposed architecture to implement these key functions at high speed in a relatively conservative technology at a low power consumption.
The speed-power ratio advantage of the described approach comes at the cost of increased hardware, system complexity, and latency. The optimum combination of parallelism and pipelining depends on the requirements of a particular system and the available power supply values. In most applications, the power supply values are set by those already available in the particular system although efficient DC-DC conversion techniques are specifically being developed to optimize the combination of system performance and power consumption [15]. Latency affects adaptive algorithms such as those used to control the coefficients of the equalizer (in the case of the adaptive equalizer) as well as those which operate on the equalizer outputs. The allowable latency depends upon the convergence requirements of the different algorithms within a particular system and simulations must be performed to explore the specific performance/power savings trade-offs particular to that system.
The authors would like to thank A. Chandrakasan, S. Sheng, C. Conroy, members of the IC Group, and Professors R. Brodersen and J. Rabaey and their research groups at UC Berkeley for their guidance, support, and encouragement.
This work was supported by NSF MIP 9109525, California Micro Program, and National Semiconductor.
[1] H. Kobayashi and D. T. Tang, "Application of partial-response channel coding to magnetic recording systems," IBM Journal of Research and Development, pp. 368-375, July 1970.
[2] R. D. Cideciyan, F. Dolivo, R. Hermann, W. Hirt, and W. Schott, "A PRML system for digital magnetic recording," IEEE J. on Selected Areas in Com., vol 10, no. 1, pp. 38-56, January 1992.
[3] R. Philpott, R. Kertis, R. Richetta, T. Schmerbeck, and D. Schulte, "A 7 MB/sec (65 MHz) mixed-signal magnetic recording channel DSP using partial response signaling with maximum likelihood detection," 1993 CICC, pp. 10.4.1-10.4.4, May 1993.
[4] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low power CMOS digital design," IEEE Journal of Solid State Circuits, Vol. 27, No. 4, pp. 473-484, April 1992.
[5] C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm," IEEE Trans. on Computers, vol. C-22, no. 12, Dec. 1973.
[6] R. W. Wood and D. A. Peterson, "Viterbi detection of class IV partial response on a magnetic recording channel," IEEE Trans. on Communications, vol. com-34, no. 5, pp. 454-461, May 1986.
[7] P. M. Clarkson, "Optimal and Adaptive Signal Processing", pp. 281-287, CRC Press, Boca Raton 1993.
[8] S. Dasgupta and C. R. Johnson, Jr., "Some comments on the behavior of sign-sign adaptive identifiers," Systems & Control Letters 7, Elsevier Science Publishers B. V. (North-Holland), pp. 75-82, April 1986.
[9] S. Dasgupta, C. R. Johnson, Jr., and A. M. Baksho, "Sign-sign LMS convergence with independent stochastic inputs," IEEE Trans. on Information Theory, Vol. 36, No. 1, pp. 197-201, January 1990.
[10] E. Eweda, "Convergence of the sign algorithm for adaptive filtering with correlated data," IEEE Trans. on Information Theory, Vol. 37, No. 5, pp. 1450-1457, September 1991.
[11] E. A. Lee et al. "Ptolemy Manual" University of California at Berkeley, Copyright 1991.
[12] C. S. H. Wong, "Low-power high-speed DSP architecture for magnetic disk PRML read channel," Masters Thesis, University of California at Berkeley, Memorandum No. UCB/ERL M93/72, October 1993.
[13] J. C. Rudell, "A low-power high-speed digital adaptive equalizer for magnetic disk drive channels utilizing class IV partial response signalling," Masters Thesis, University of California at Berkeley, Memorandum No. UCB/ERL M94/14, March 1994.
[14] G. T. Uehara, C. S. H. Wong, J. C. Rudell, and P. R. Gray, "A 50 MHz 70 mW 8-tap adaptive equalizer/Viterbi sequence detector in 1.2 μm CMOS," 1994 CICC, San Diego, California, pp. 4.2.1 - 4.2.4, May 1994.
[15] A. Stratakos, S. Sanders, and R. Brodersen, "A Low-Voltage CMOS DC-DC Converter for a Portable Battery-Operated System," Proc. IEEE Power Electronics Specialists Conf., pp. 619-626, 1994.
[16] P. J. Black and T. H. Meng, "A 140 Mb/s 32-state radix-4 Viterbi decoder," IEEE Journal of Solid State Circuits, Vol. 27, No. 12, pp. 1877-1885, December 1992.
[17] P. A. Ziperovich and J. K. Wolf, "CMOS implementation of a Viterbi detector for hard disk drives," 1993 CICC, San Diego, California, pp. 10.3.1 - 10.3.4, May 1993.
[18] J. M. Cioffi, W. L. Abbott, H. K. Thapar, C. M. Melas, and K. D. Fisher, "Adaptive Equalization in Magnetic-Disk Storage Channels." IEEE Communications Magazine, pp 14-29, February 1990.
