Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Zero-Aware Low-Precision RNS Scaling Scheme

Zero-Aware Low-Precision RNS Scaling Scheme axioms Article Amir Sabbagh Molahosseini School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast BT7 1NN, UK; A.SabbaghMolahosseini@qub.ac.uk Abstract: Scaling is one of the complex operations in the Residue Number System (RNS). This operation is necessary for RNS-based implementations of deep neural networks (DNNs) to prevent overflow. However, the state-of-the-art RNS scalers for special moduli sets consider the 2 modulo as the scaling factor, which results in a high-precision output with a high area and delay. Therefore, low-precision scaling based on multi-moduli scaling factors should be used to improve performance. However, low-precision scaling for numbers less than the scale factor results in zero output, which makes the subsequent operation result faulty. This paper first presents the formulation and hardware architecture of low-precision RNS scaling for four-moduli sets using new Chinese remainder theorem 2 (New CRT-II) based on a two-moduli scaling factor. Next, the low-precision scaler circuits are reused to achieve a high-precision scaler with the minimum overhead. Therefore, the proposed scaler can detect the zero output after low-precision scaling and then transform low-precision scaled residues to high precision to prevent zero output when the input number is not zero. Keywords: residue number system (RNS); scaling; Chinese remainder theorem (CRT) 1. Introduction Residue Number Systems (RNSs) have been used in different applications such as digital signal processing (DSP) [1] and deep learning systems [2] to provide low-power, high-speed and fault-tolerant computations [3]. The main feature of an RNS is fast and par- Citation: Sabbagh Molahosseini, A. allel implementation of addition and multiplication based on separate modular arithmetic Zero-Aware Low-Precision RNS circuits. However, detection of multiplication overflow is one of the difficult RNS problems, Scaling Scheme. Axioms 2022, 11, 5. since the multiplication of any two operands larger than half of the dynamic range results https://doi.org/10.3390/ in overflow. Therefore, the high probability of overflow occurrence in multiplication has axioms11010005 motivated researchers to develop overflow prevention mechanisms for an RNS. Scaling Academic Editor: Chong Wang (i.e., division of the RNS number by a constant number) is one of the ways to reduce the size of the operands to prevent overflow in RNS operations. However, scaling is a difficult Received: 21 November 2021 process, since the division operation in an RNS cannot be performed in parallel modular Accepted: 20 December 2021 channels like multiplication and addition [4]. Therefore, usually one of the modulo of the Published: 23 December 2021 moduli set is selected as the scaling factor to reduce the complexity [5]. Publisher’s Note: MDPI stays neutral The scaling for general moduli sets is usually realized using look-up tables (LUTs) [6], with regard to jurisdictional claims in while the adder-based implementations can be achieved based on special moduli sets with published maps and institutional affil- higher performance. Due to this, there is a variety of works focusing on designing scalers iations. n n n for the well-known RNS three-moduli set {2 1, 2 , 2 + 1} [7–9]. The authors of [7,8] n n considered the modulo 2 as the scaling factor. Using 2 as the scaling factor resulted in simplified scalers with high-precision output. However, using only one modulo as the scaling factor is mostly applicable for addition operations, since it cannot drastically reduce Copyright: © 2021 by the author. the size of the numbers to prevent multiplication overflow. Due to this, the authors of [9] Licensee MDPI, Basel, Switzerland. n n proposed two-moduli scaling based on 2 (2 + 1) as the scaling factor, which led to a This article is an open access article low-precision output. Although this scaling factor can significantly reduce the size of the distributed under the terms and n n n operands, the limited 3n-bit dynamic range of the three-moduli set {2 1, 2 , 2 + 1} is not conditions of the Creative Commons suitable for two-moduli scaling factors because in this three-moduli RNS system, the values Attribution (CC BY) license (https:// n n creativecommons.org/licenses/by/ of most numbers are less than the scaling factor (i.e., 2 (2 + 1)), which results in a zero 4.0/). output for the scaler, consequently making the next operation faulty. This is a significant Axioms 2022, 11, 5. https://doi.org/10.3390/axioms11010005 https://www.mdpi.com/journal/axioms Axioms 2022, 11, 5 2 of 13 problem which indicates the importance of a zero-aware scaling mechanism, which is not covered by previous research. Therefore, two-moduli scaling factors together with the large dynamic range four- or k n n n + 1 n n n 2n + 1 five-moduli sets, such as {2 , 2 1, 2 + 1, 2 1} [10], {2 1, 2 , 2 + 1, 2 n n 2n 2n n n 2n 2n + 1 2n + p 1} [11] and {2 1, 2 + 1, 2 , 2 + 1} [12], {2 1, 2 + 1, 2 , 2 1} [13] and {2 , n n n (n + 1)/2 n (n + 1)/2 2 1, 2 + 1, 2 2 +1, 2 + 2 +1} [14], should be used. However, there is a limited number of works that consider the scaling for four-moduli sets. The authors of [15] designed a scaler based on a two-level architecture with the single-modulo scaling n + k factor 2 . The first level of this scaler performs scaling based on the three-moduli set n n + x n {2 1, 2 , 2 + 1}, where 0  x  n, and then the second level computes the final n + k 2n four-moduli scaling using the composite set {2 (2 1), m } [15]. This two-level architecture requires high hardware requirements due to the multiple uses of modular adders. Furthermore, scaling by the 2 modulo is not sufficient for large dynamic range four-moduli sets to avoid overflow. In other words, the regular modulo 2 scaling of the n n n n numbers based on the three-moduli set {2 1, 2 , 2 + 1} is not equivalent to modulo 2 n n n n + 1 scaling in the four-moduli set {2 1, 2 , 2 + 1, 2 1}, since the dynamic ranges of these moduli sets are 3n- and (4n + 1)-bit, respectively. Therefore, two-moduli scaling must be used to prevent multiplication overflow for large dynamic range RNS systems. On the other hand, to have a zero-aware scaler, the operands should be compared with the scaling constant before operation to prevent scaling with zero output. However, magnitude comparison is a difficult RNS operation, and its realization increases hardware complexity [4]. Here, we address this problem without using an RNS magnitude compara- tor based on a method for deriving two different scaling outputs from the same circuit. In the proposed work, first, a low-precision scaler based on two moduli is proposed for RNS four-moduli sets. Then, we reuse the hardware architecture of a low-precision scaler for producing high-precision scaled output that can be used when the low-precision scaler generates zero for non-zero operands. It is shown how new Chinese remainder theorem 2 (New CRT-II) can be used to achieve simplified two-moduli scaling for four- moduli sets. The proposed approach (i.e., a double-output scaler with both low- and high-precision outputs) has two main advantages. First, the high-precision output can be applied to addition operands, while the low-precision output can be used for multiplication operations to prevent overflow by considerably reducing the operands’ size. Second, in the case of using a low-precision output for multiplication, if the low-precision output becomes zero, the high-precision output can be used to prevent overflow, resulting in a zero-aware scaling approach. Moreover, derivation of the proposed general approach for two special n n 2n 2n n n 2n large dynamic range four-moduli sets {2 1, 2 + 1, 2 , 2 + 1} and {2 1, 2 + 1, 2 , 2n + 1 2 1} is presented, and its performance is compared with the conventional method. In the rest of the paper, the mathematical formulation and proof of the proposed scaling approach for both general and special four-moduli sets are described in Section 2. Next, Section 3 presents the fully adder-based hardware design of the proposed scalers. Moreover, a performance comparison is presented in Section 4. Finally, Section 5 concludes the paper. 2. Low-Precision Scaling with Two-Moduli Scaling Factor: Mathematical Formulation This section presents the proposed approach to design scalers for RNS using the New CRT-II [11]. In the rest of this section, a brief introduction about the scaling concept and new CRT-II is first described. Then, mathematical formulations of the proposed approach in general form and for two sample special forms will be presented. 2.1. Scaling Concept and CRT-II The scaling of the weighted number X by the constant factor K according to the scaling operator defined in [16] and is as follows: X = SK +jXj (1) K Axioms 2022, 11, 5 3 of 13 This formula shows that any weighted number can be formed as a summation of its remainder with scaling factor K and the multiplication of the scaling result (i.e., S) and K. In other words, S is the integer quotient of dividing X by K, and it can be expressed as follows [5,7]: S = (2) Note that Equations (1) and (2) are based on weighted numbers. However, they should be implemented inside RNS using residues. Therefore, consider the following residue representations for X and S based on the four-moduli set {m , m , m , m }: 1 2 3 4 RNS X ! (x , x , x , x ) (3) 1 2 3 4 RNS S = ! s , s , s , s (4) ( ) 1 2 3 4 where the scale factor m is one of the moduli. Aside from that, also consider s = jSj f or i = 1 . . . 4 (5) Second, consider the RNS number (x , x , x , x ), which can be converted into its 1 2 3 4 corresponding weighted number X using the New CRT-II conversion formulas for the generic four-moduli set {m , m , m , m } as follows [11]: 1 2 3 4 X = Z + m m k Y Z (6) j ( )j 1 2 1 m m 3 4 Z = x + m jk (x x )j (7) 1 1 2 2 1 Y = x + m jk (x x )j (8) 3 3 3 4 3 where the required multiplicative inverses can be achieved by considering the following relations: jk m m j = 1 (9) 1 1 2 m m 3 4 jk m j = 1 (10) 2 1 jk m j = 1 (11) 3 3 Equations (6)–(8) can be rewritten as follows: X = Z + m m T (12) 1 2 Z = x + m H (13) 1 1 Y = x + m P (14) 3 3 where T = jk (Y Z)j (15) m m 3 4 H = jk (x x )j (16) 2 2 1 P = jk (x x )j (17) 3 3 2.2. General Formulations Now, we choose the scaling factor as the product of the first and second modulo (i.e., m m ). Therefore, scaling of X by m m can be performed by considering k = m m and 1 2 1 2 1 2 substituting Equation (6) into Equation (4) as follows: X Z + m m T Z 1 2 S = = = + T (18) K m m m m 1 2 1 2 Axioms 2022, 11, 5 4 of 13 where x is a residue in modulo m and the maximum value of H in Equation (16) is m 1 1 2 1. Therefore, the maximum value of Z in Equation (13) can be computed as follows: Z = m 1 + m (m 1) = m m 1 (19) Max 1 1 2 1 2 It is clear that the floor of the division of Z by m m is zero. Therefore, by consid- Max 1 2 ering this point and taking into account that T is an integer number, Equation (18) can be simplified as follows: Z Z S = + T = + T = T (20) m m m m 2 2 1 1 Now, according to Equation (5), the residues of T based on the moduli should be computed to achieve the residues of the scaled number as follows: s = jS j = jTj f or i = 1 . . . 4 (21) Li L m m i i Therefore, scaling of Z by m m is reduced to T, and the full reverse conversion (i.e., 1 2 full computing of Equation (6)) is not needed. Now, we are going to achieve a single-modulo scaler for the same moduli set, (i.e., {m , m , m , m }) but with the aim of reusing the two-moduli scaler formulas to reduce 1 2 3 4 the overhead. Hence, considering k = m and the main CRT-II formula of Equation (6) in Equation (4) results in X Z + m m T Z 1 2 S = = = + m T (22) H 2 K m m 1 1 Insertion of Equation (13) into Equation (22) leads to x + m H x 1 1 1 S = + m T = + H + m T (23) H 2 2 m m 1 1 Therefore, since x is less than m , and H and T are integer numbers, Equation (23) can 1 1 be simplified as follows: x + m H x 1 1 1 S = + m T = + H + m T = H + m T (24) H 2 2 2 m m 1 1 Now, according to Equation (5), we have s = S = jH + m Tj f or i = 1 . . . 4 (25) Hi Single 2 According to the residue arithmetic properties [6], Equation (25) can be rewritten as s = jHj +jm j jTj f or i = 1 . . . 4 (26) Hi 2 m m m i i i However, from Equation (21), we know the remainders of T in moduli m are the two-moduli scaling residues. Therefore, we have s = jHj +jm j s f or i = 1 . . . 4 (27) Hi 2 Li m m i i Therefore, by using Equation (21), the single-modulo scaling residues can be achieved from the previously computed two-moduli scaling residues with the minimum overhead. It should be mentioned that more simplifications of Equation (21) can be performed using the exact value of the moduli as shown in the next subsections. Axioms 2022, 11, 5 5 of 13 2n 2n n n 2.3. Case Study: Moduli Set {2 + 1, 2 ,2 + 1, 2 1} The moduli set has a 6n-bit dynamic range, and its reverse converters are all de- signed based on the New CRT-I [11]. However, in contrast to the reverse converters of this moduli set, here, we use the New CRT-II to derive efficient two-moduli scaling for- 2n 2n n n mulas. First, consider the moduli set {m , m , m , m } = {2 + 1, 2 , 2 + 1, 2 1}. 1 2 3 4 According to Equation (20), we must compute T in Equation (15), and then its residues are the low-precision scaling residues. First, the following lemma computes the required multiplicative inverses. 2n 1 Lemma 1. The multiplicative inverses required in Equations (15)–(17) are k = 2 , k = 1 and 1 2 n 1 k = 2 . Proof of Lemma 1. Verification can be performed by substituting the values of the multi- plicative increases and moduli in Equations (9)–(11) as follows: 2n1 2n 2n 2n1 j2  (2 + 1)2 j = j2  2 1j = 1 (28) 2n 2n 2 1 2 1 2n j1 (2 + 1)j 2n = 1 (29) n1 n j2  2 + 1 j n = 1 (30) ( ) 2 1 Now, inserting the values of the moduli and multiplicative inverses in Equations (13)–(17) leads to 2n Z = x + (2 + 1)H (31) Y = x + (2 + 1)P (32) H = jx x j (33) 2n 2 1 n1 P = j2 (x x )j n (34) 4 3 2 1 2n1 T = j2 (Y Z)j 2n (35) 2 1 Equation (31) can be further simplified by substituting Equations (31) and (32) into it as follows: 2n1 n 2n T = j2 (x + (2 + 1)P x (2 + 1)H)j 2n (36) 2 1 The following well-known residue arithmetic properties can be used to further simplify Equations (33)–(36). Property 1. 2 v is equal to the P-bit circular left shifting of v if v is represented as a k-bit i i i 2 1 binary number [11]. Property 2. jv j is equal to one’s complement of v (i.e., v ) if v is represented as a k-bit i i i i 2 1 binary number [11]. Property 3. jv j is equal to v + 1 if v is represented as a k-bit binary number [17]. i i i Property 4. jv j is equal to v + 2 if v is represented as a k-bit binary number [17]. i i i 2 +1 2n 2n n n First, according to the moduli set {2 + 1, 2 , 2 + 1, 2 1}, x and x are (2n + 1)- 1 2 and 2n-bit numbers, respectively. Therefore, Equation (33) can be simplified using Property 3 as follows: 2n H = jx x j = jx (x  2 + x )j 2n 2n 2 1 1,2n 2 2,(2n1)...0 1,(2n1)...0 2 (37) = jx + x + 1j 2n 2,(2n1)...0 1,(2n1)...0 2 Axioms 2022, 11, 5 6 of 13 where x means the j-th bit of the residue x and x and x are (n + 1)- and n-bit numbers, i,j i 4 3 respectively. Therefore, Equation (34) can be rewritten as n1 n P = 2 (x x  2 x ) (38) 3,n 4 3,(n1)...0 2 1 where x is a residue in modulo 2 + 1. Therefore, when x is equal to one, the other bits 3 3,n will be surely be zero, and if the n low significant bits (LSBs) of x are not equal to zero, then the most significant bit (MSB) of x (i.e., x ) should be zero [12]. Therefore, by considering 3 3,n this point and Properties 1 and 2, Equation (38) can be simplified as follows: P = jP + P j (39) 2 n 2 1 where n1 P = j2 x j n = x x x (40) 1 4 2 1 4,0 4,n1... 4,1 | {z } n bits 01 . . . 11 i f x = 1 > 3,n | {z } n bits P = (41) x x . . . x i f x = 0 > 3,0 3,n1 3,n2 3,n | {z } n bits Finally, Equation (35) can be simplified using Properties 1 and 2 as follows: T = jT + T + T + T + T j (42) 2n 1 2 3 4 5 2 1 where 2n1 T = j2 x j 2n = x 0 . . . 00x . . . x (43) 1 3 3,0 3,n 3,1 2 1 | {z } | {z } n1 bits n bits 2n1 n 2n1 T = 2 (2 + 1)P = 2 (p . . . p p . . . p ) 2n 2 n1 0 n1 0 2 1 | {z }| {z } (44) 2n n bits n bits 2 1 = p p . . . p p . . . p 0 n1 0 n1 1 | {z }| {z } n bits n1 bits 2n1 T = 2 x = x x . . . x (45) 3 1,0 1,2n1 1,1 1,(2n1)...0 2n 2 1 | {z } 2n bits 2n1 2n T = 2  2  x = x 1 . . . 11 (46) 4 1,2n 1,2n 2n 2 1 | {z } 2n bits 2n1 2n T = j2 (2 + 1)Hj 2n = jHj = H . . . H H (47) 5 2n 0 2n1 1 2 1 2 1 | {z } 2n bits Note that P is an n-bit number, and due to this computation of Equation (44), it became a simple concatenation. Aside from that, the constant coefficient of H in Equation (47) was substituted with 1 since 2n1 2n 2n1 jj2 j j2 + 1j j = j2  2j = 1 (48) 2n 2n 2n 2n 2 1 2 1 2 1 2 1 Next, after calculation of T using Equation (42), we must compute the residues of T according to Equation (21) to achieve the two-moduli scaled residues. However, the largest 2n value of T is 2 2, and therefore, it is always less than the first and second moduli. Hence, we have s = T = T (49) j j 2n L1 2 +1 s = jTj = T (50) 2n L2 2 Axioms 2022, 11, 5 7 of 13 The third and fourth two-moduli scaled residues can be achieved as follows: s = jTj = jT  2 + T j n L3 (2n1)...n (n1)...0 2 +1 2 +1 = jT T j n (51) 2 +1 (n1)...0 (2n1)...(n1) = jT + T + 2j n 2 +1 (n1)...0 (2n1)...(n) s = jTj = jT + T j n (52) L4 2 1 2 1 (2n1)...n (n1)...0 Now, based on Equation (27), we can also achieve single-modulo scaling formulas from the two-moduli scaling residues as follows: 2n s = jjHj + 2 jTj j 2n 2n 2n 2n H1 2 +1 2 +1 2 +1 2 +1 (53) 2n = jjHj + 2 s j 2n 2n 2n L1 2 +1 2 +1 2 +1 We can simplify Equation (27) by substituting Equation (49) into it and considering 2n that the maximum value of H in (37) is 2 1. Therefore, we have s = jH Tj = H + T + 2 (54) 2n 2n H1 2 +1 2 +1 Similarly, for other residues, we have 2n s = jjHj 2n +j2 j 2njTj 2nj 2n = H (55) H2 2 2 2 2 2n s = jjHj n +j2 j n jTj n j n H3 2 +1 2 +1 2 +1 2 +1 = H  2 + H + s L3 (2n1)...n (n1)...0 (56) 2 +1 2 +1 = H + H + 2 + s L3 (2n1)...n (n1)...0 2 +1 2n s = jjHj n + 2 jTj n j H4 n 2 1 2 1 2 1 2 1 (57) = jH + H + s j n (2n1)...n (n1)...0 L4 2 1 Therefore, we can compute H and T from Equations (37) and (42) just one time and then using them several times to compute both the single- and two-moduli scaling. n n 2n 2n + 1 2.4. Case Study: Moduli Set {2 1, 2 + 1, 2 , 2 1} n n 2n 2n 2n This moduli set has the same moduli as {2 1, 2 + 1, 2 , 2 + 1} except for 2 + 1 2n + 1 which is substituted with 2 1. Due to this, it can lead to a faster RNS arithmetic unit. However, its reverse converter will be more complex. The overall process of designing the n n 2n 2n scaler for this moduli set is relatively the same as for the moduli set {2 1, 2 + 1, 2 , 2 + 1} described in the previous subsection. 2n + 1 2n n n First, consider the moduli order {m , m , m , m } = {2 1, 2 , 2 + 1, 2 1}. 1 2 3 4 Then, according to Equations (15)–(17), the multiplicative inverses can be computed as k n 1 = k = 1, and k = 2 (the proof is straightforward and similar to Lemma 1). Therefore, 2 3 Equation (15) is a key formula in the scaling that can be calculated as follows: T = jY Zj 2n 2 1 n 2n+1 = x + (2 + 1)P x 2 1 H (58) 3 2n 2 1 = jT + T + T + T + T j 2n 1 2 3 4 5 2 1 where T = jx j = 0 . . . 00 x . . . x (59) 2n 1 3 3,n1 3,0 2 1 | {z } | {z } n bits n bits T = 2 + 1 P = p . . . p p . . . p (60) j( ) j 2n 2 n1 0 n1 0 2 1 | {z }| {z } n bits n bits Axioms 2022, 11, 5 8 of 13 Axioms 2022, 10, x FOR PEER REVIEW 8 of 13 T = x = x . . . x (61) 3 1,(2n1)...0 1,2n1 1,0 2n 2 1 | {z } 𝑇 = − 𝑥 = 𝑥 … 𝑥 ,( )… ,2 , (61) 2n bits 2n T = 2  x = 1 . . . 11x (62) 4 1,2n 1,2n 2n 𝑇 = − 2 × 𝑥 = 1 … 11𝑥 2 1 | {z } , ,2 (62) 2n bits 2n+1 T = j(2 1)Hj = jHj = H . . . H H (63) 2n 2n 5 2n1 1 0 2 1 2 1 𝑇 = | − (2 − 1)𝐻 | = | − 𝐻 | = 𝐻 … 𝐻 𝐻 | {z } (63) 2n bits 2n + 1 2n n n 2n + 1 2n n n The The two-moduli two-moduli scaled scaled r residue esidue formulas formulas for for the the moduli moduli set set {2 {2 − 1, 1, 2 2, 2 , + 21,+ 21, −2 1} 2n 2n n n 2n 2n n n are 1} th ar ee sthe amesame as tho as sethose for th for e m the odu moduli li set {2set +{2 1, 2 + , 2 1, + 2 1,, 2 2 −+ 11, } (i2 .e., E1} qu(i.e., ationEquations s (49–52)), (49)–(52)), since all of them are based on T. That aside, the single-modulo scaled residues since all of them are based on T. That aside, the single-modulo scaled residues are the are the same as in Equations (55)–(57) except for the first scaled residue, which is as follows: same as in Equations (55–57) except for the first scaled residue, which is as follows: 2n 𝑠 = ||𝐻 | + |2 | 𝑠 | s = jjHj + 2 s j 2n+1 2n+1 H1 2n+1 L1 2 1 2 1 2 1 (64) = 𝐻 + 𝑠 … 𝑠 𝑠 , , , (64) = H + s . . . s s L1,2n1 L1,0 L1,2n | {z } 2n+1 bits 2n+1 2 1 3. Low-Precision Scaling with Two-Moduli Scaling Factor: Hardware Design 3. Low-Precision Scaling with Two-Moduli Scaling Factor: Hardware Design This section presents the full adder-based and memory-free hardware design of the This section presents the full adder-based and memory-free hardware design of the proposed RNS scaler. The overview of the proposed approach for a generic four-moduli proposed RNS scaler. The overview of the proposed approach for a generic four-moduli set is depicted in Figure 1. First, P, H and T are computed using Equations (15–17), and set is depicted in Figure 1. First, P, H and T are computed using Equations (15)–(17), and then, the two-moduli scaled residues are computed using Equation (21). Afterward, the then, the two-moduli scaled residues are computed using Equation (21). Afterward, the single-modulo scaled residues are obtained using the precomputed two-moduli scaled single-modulo scaled residues are obtained using the precomputed two-moduli scaled residues based on Equation (27). The important part of the scaler is the calculation of T residues based on Equation (27). The important part of the scaler is the calculation of T that that is shared between both kinds of scaling, resulting in a significant hardware reduction. is shared between both kinds of scaling, resulting in a significant hardware reduction. Figure Figure 1. 1. The The block block diagram diagram of of the the pr proposed oposed zer zero-awar o-aware e low-pr low-precision ecision scaler scaler for for the the ge generic neric RNS RNS four-moduli set {m1, m2, m3, m4}. four-moduli set {m , m , m , m }. 1 2 3 4 The The scaler scaler of of Figur Figure e 1 1 is is designed designed based based on on a a general general value value of of the the moduli. moduli. H However owever,, 2n 2n n n 2n 2n n n for special RNS moduli sets with power-of-two moduli such as {2 + 1, 2 , 2 + 1, 2 1}, for special RNS moduli sets with power-of-two moduli such as {2 + 1, 2 , 2 + 1, 2 − 1}, the design can be considerably simplified, as presented in Figure 2. First, the H in Equation the design can be considerably simplified, as presented in Figure 2. First, the H in Equation (39) is implemented using a 2n-bit regular carry-propagate adder (CPA) where its carry-in (39) is implemented using a 2n-bit regular carry-propagate adder (CPA) where its carry- Double-Modulo Axioms 2022, 10, x FOR PEER REVIEW 9 of 13 Axioms 2022, 11, 5 9 of 13 in is connected to one. Aside from that, P also requires a modulo 2 − 1 CPA, which can be is connected to one. Aside from that, P also requires a modulo 2 1 CPA, which can be implemented using an n-bit CPA with EAC [18] based on Equation (40). implemented using an n-bit CPA with EAC [18] based on Equation (40). x1 x2 x3 x4 Operand Preparation 2n-bit CPA 1 n-bit CPA-EAC Operand Preparation 2n-bit CSA-EAC 2n-bit CSA-EAC 2n-bit CSA-EAC 2n-bit CPA-EAC Operand Preparation n-bit CPA- n-bit CSA- EAC CEAC n-bit CPA- CEAC Operand Preparation n-bit n-bit 2n-bit CSA- CSA- CSA- EAC CEAC CEAC n-bit n-bit 2n-bit CPA- CPA- CPA- EAC CEAC CEAC Singl e-Modulo Scal ed Resi dues 2n 2n n n 2n 2n n n 2n Figure 2. The proposed scaler for the moduli set {2 + 1, 2 , 2 + 1, 2 1} with scale coefficients Figure 2. The proposed scaler for the moduli set {2 + 1, 2 , 2 + 1, 2 − 1} with scale coefficients (2 2n 2n 2n 2n 2n (2 + 1) 2 and 2 . + 1) 2 and 2 . The operand preparation unit performs the required inversions, shifting and multi- The operand preparation unit performs the required inversions, shifting and multi- plexing needed in Equation (41). Then, the important variable T in Equation (42) can be plexing needed in Equation (41). Then, the important variable T in Equation (42) can be 2n realized using three carry-save adders (CSAs) with EAC followed by a modulo 2n 2 1 realized using three carry-save adders (CSAs) with EAC followed by a modulo 2 − 1 CPA CPA [18]. Then, according to Equations (49)–(52), the first and second two-moduli scaled [18]. Then, according to Equations (49–52), the first and second two-moduli scaled resi- residues are equal to T, and the third and fourth are only the reduction of T in moduli dues are equal to T, and the third and fourth are only the reduction of T in moduli 2 − 1 n n 2 1 and 2 + 1, which can be realized using an n-bit CPA with EAC and n-bit CPA and 2 + 1, which can be realized using an n-bit CPA with EAC and n-bit CPA with com- with complement EAC (CEAC), respectively. Note that CPA-CEAC is a representation of plement EAC (CEAC), respectively. Note that CPA-CEAC is a representation of the mod- the modulo 2 + 1 adder which can be realized using different methods [19]. Finally, the ulo 2 + 1 adder which can be realized using different methods [19]. Finally, the single- single-modulo scaled residues can be achieved using Equations (54)–(57). The CSAs are modulo scaled residues can be achieved using Equations (54–57). The CSAs are used to used to compress the three operands into two, and then a modulo adder produces the compress the three operands into two, and then a modulo adder produces the scaled res- scaled residue. It can be seen that in the customized version of the scaler for the moduli set 2n idue. It can be seen that in the customized version of the scaler for the moduli set {2 + 1, 2n 2n n n {2 + 1, 2 , 2 + 1, 2 1}, the units for m and m reduction in the two-moduli scaling 1 2 Axioms 2022, 11, 5 10 of 13 part are removed, since the scaled residues are equal to T. Aside from that, the second single-modulo scaled residue is H, and hence, the required m reduction unit is removed. Finally, Algorithm 1 shows how the proposed hardware architecture can be used to provide zero-aware RNS scaling. If the low-precision scaled residues become zero, then the high-precision scaled residue should be used as the output, except in the case that they also become zero. In this case (i.e., both scaler outputs become zero), the number is very small and is less than both of the scaling constants. In this case, its original value can be used in the computations. Note that here we do not use any magnitude comparator which is a complex unit in RNS, and only by checking the scaled residues against zero we could evaluate the relative magnitude of the number (less or greater than the scaling coefficients). Algorithm 1: Zero-Aware RNS Scaling. Input: Non-Zero RNS Number (x , x , x , x ) 1 2 3 4 Output: Non-Zero Scaled RNS Number (s , s , s , s ) 1 2 3 4 1: Calculate the low-precision scaled residues (sl , sl , sl , sl ) 1 2 3 4 2: If (sl , sl , sl , sl ) 6= (0, 0, 0, 0) Then return (sl , sl , sl , sl ) 1 2 3 4 1 2 3 4 3: Calculate the high-precision scaled residues (sh , sh , sh , sh ) 1 2 3 4 4: If (sh , sh , sh , sh ) 6= (0, 0, 0, 0) Then return (sh , sh , sh , sh ) 1 2 3 4 1 2 3 4 5: Return original residues (x , x , x , x ) 1 2 3 4 4. Performance Evaluation The majority of the available state-of-the-art RNS scalers are dedicatedly designed for three-moduli sets, and only [15] presents the first RNS scaler design for four-moduli 2n + 1 2n n n sets. The RNS scaler for the moduli set {2 1, 2 , 2 + 1, 2 1} is fully designed 2n in [15] based on the scaling factor 2 as shown in Figure 3. To perform a technology- independent performance comparison, the unit-gate (U-G) model is used according to [15] for comparative assessment of the works. All the assumptions considered in [15] for estimation of the area and delay of modular adders are also considered here for a fair comparison, as shown in Table 1. Table 1. The area and delay formulas for different n-bit modulo adders based on the U-G model reported in [15]. Modulo Adder Area Delay CPA-EAC 3ndlog n 1e + 12n 2dlog n 1e + 3 2 2 2 1 CSA-EAC 7n 4 2 CPA 1.5ndlog ne + 5n 2dlog ne + 3 2 2 CPA-CEAC 4.5ndlog ne + 0.5n + 6 2dlog ne + 3 2 2 2 + 1 CSA-CEAC 7n 4 Note that in the U-G model, each XOR or XNOR gate counts as two unit gates in the area and delay, and an AND or OR gate is considered one unit gate for both the area and delay. Therefore, the combinatorial circuits such as FAs, half adders (HAs) and one-bit 21 multiplexers count as 7, 3 and 3 unit gates in area and 4, 2 and 2 gates in the delay, respectively. Aside from that, the U-G area and delay estimations for each component of the proposed scaler is described in Table 2. Note that the gray lines in Table 2 are not on the critical delay path. Axioms 2022, 10, x FOR PEER REVIEW 11 of 13 Axioms 2022, 11, 5 11 of 13 x1 x2 x3 x4 Operand Preparation n-bit CPA- n-bit CSA- 2n-bit CSA-EAC EAC CEAC 2n-bit simplified CSA- n-bit CPA- EAC CEAC 2n-bit CPA-EAC Operand Preparation (2n+1)-bit CSA-EAC (2n+1)-bit CPA-EAC Operand Preparation 2n-bit (2n+1)-bit CPA CSA-EAC (2n+1)-bit CPA-EAC Singl e-Modulo Scal ed Resi du es 2n + 1 2n n n Figure 3. The single-modulo scaler for the special moduli set {2 1, 2 , 2 + 1, 2 1} with 2n + 1 2n n n Figure 3. The single-modulo scaler for the special moduli set {2 − 1, 2 , 2 + 1, 2 − 1} with scaling 2n scaling factor 2 proposed in [15]. 2n factor 2 proposed in [15]. Table 2. The area and delay formulas based on the U-G model for different components of the Table 1. The area and delay formulas for different n-bit modulo adders based on the U-G model proposed double-modulo scaler. reported in [15]. Component Area Delay Modulo Adder Area Delay CPA-EAC 3𝑛 ⌈log 𝑛 − 1⌉ + 12𝑛 2⌈log 𝑛 − 1⌉ + 3 2n-bit CPA 3ndlog ne + 13n 2dlog ne + 5 2 2 2 − 1 7𝑛 CSA-EAC 4 n-bit 2  1 MUX 3n 2 2 CPA 1.5𝑛 ⌈log 𝑛 ⌉ + 5𝑛 2⌈log 𝑛 ⌉ + 3 n-bit CPA-EAC 3ndlog n 1e + 12n 2dlog n 1e + 3 2 2 CPA-CEAC 4.5𝑛 ⌈log 𝑛 ⌉ + 0.5𝑛 + 6 2⌈log 𝑛 ⌉ + 3 2n-bit 2Simplified + 1 CSA-EAC1 10n + 4 4 7𝑛 CSA-CEAC 4 2n-bit Simplified CSA-EAC2 6n + 4 4 2n-bit CSA-EAC 14n 4 Note that in the U-G model, each XOR or XNOR gate counts as two unit gates in the area and delay, and an AND or OR gate is considered one unit gate for both the area and 2n-bit CPA-EAC 6ndlog ne + 24n 2dlog ne + 3 2 2 delay. Therefore, the combinatorial circuits such as FAs, half adders (HAs) and one-bit n-bit CSA-CEAC 7n 4 2×1 multiplexers count as 7, 3 and 3 unit gates in area and 4, 2 and 2 gates in the delay, n-bit CPA-CEAC 4.5ndlog ne + 0.5n + 6 2dlog ne + 3 2 2 respectively. Aside from that, the U-G area and delay estimations for each component of the proposed scaler is described in Table 2. Note that the gray lines in Table 2 are not on Finally, the overall area and delay estimations for scalers are presented in Table 3 for the critical delay path. a general value of n. It can be seen that while the proposed low-precision scaler is based on Tab altwo-moduli e 2. The area ascaling nd delayfactor formu , lthe as bhar ased dwar on th eer U equir -G m ement odel for is dless iffere than nt coth mp eosingle-modulo nents of the pro- posed double-modulo scaler. scaler for the same moduli set, while the delay is almost the same. That aside, the proposed scaler outperforms the design of [15] in terms of hardware requirements. Furthermore, as is Component Area Delay expected, the high-precision single-modulo version requires a higher area and delay since ⌈ ⌉ ⌈ ⌉ 2n-bit CPA 3𝑛 log 𝑛 + 13𝑛 2 log 𝑛 + 5 it is computed based on the output of the two-moduli scaler. n-bit 2 × 1 MUX 3𝑛 2 ⌈ ⌉ ⌈ ⌉ n-bit CPA-EAC 3𝑛 log 𝑛 − 1 + 12𝑛 2 log 𝑛 − 1 + 3 Axioms 2022, 11, 5 12 of 13 2n + 1 Table 3. The total area and delay estimations for the RNS scalers based on the four-moduli set {2 2n n n 1, 2 , 2 + 1, 2 1}. Scaler Scale Factor Area Delay Proposed 19.5ndlog ne + 2n 2n + 1 2 2 (2 1) 6dlog ne + 27 Low-Precision 95.5n + 14 Proposed 31.5ndlog ne + 2n 8dlog ne + 34 High-Precision 160.5n + 14 [15] (28.5n + 6)dlog ne + 2n 2 2 6dlog ne + 25 High-Precision 150.5n + 44 5. Conclusions Scaling is an overflow prevention mechanism that must be extensively used to prevent overflow by reducing the size of the operands before RNS addition and multiplication operations. However, high-precision single-modulo scaling is not suitable for overflow prevention in multiplication due to its inability to perform significant size reduction, but the low-precision scaling with two moduli can lead to a zero result for small numbers. Therefore, this work presents a novel zero-aware low-precision scaler based on the two- moduli scaling factor. Then, the proposed circuits are reused to derive a high-precision scaling output to use in situations where low-precision output is not usable, resulting in a zero-aware RNS scaler. Therefore, the proposed design is pushing forward the RNS into practical applications by providing an efficient mechanism for overflow prevention, which is one of the major challenges of RNS. On the other hand, the high latency of the scaler is one of the limitations of this approach which can be improved in the context of the final application. Funding: This research received no external funding. Conflicts of Interest: The author declares no conflict of interest. References 1. Chang, C.H.; Molahosseini, A.S.; Zarandi, A.A.E.; Tay, T.F. Residue Number Systems: A New Paradigm to Datapath Optimization for Low-Power and High-Performance Digital Signal Processing. IEEE Circuits Syst. Mag. 2015, 15, 26–44. [CrossRef] 2. Samimi, N.; Kamal, M.; Afzali-Kusha, A.; Pedram, M. Res-DNN: A Residue Number System-Based DNN Accelerator Unit. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 67, 658–671. [CrossRef] 3. Deng, B.; Srikanth, S.; Jain, A.; Conte, T.; Debenedictis, E.; Cook, J. Scalable Energy-Efficient Microarchitectures with Computa- tional Error Tolerance via Redundant Residue Number Systems. IEEE Trans. Comput. 2021, in press. [CrossRef] 4. Omondi, A.R.; Premkumar, B. Residue Number Systems: Theory and Implementation; Imperial College Press: London, UK, 2007. 5. Molahosseini, A.S.; Zarandi, A.A.E.; Martins, P.; Sousa, L. A Multifunctional Unit for Designing Efficient RNS-Based Datapaths. IEEE Access 2017, 5, 25972–25986. [CrossRef] 6. Kong, Y.; Phillips, B. Fast Scaling in the Residue Number System. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2009, 17, 443–447. [CrossRef] n n n 7. Chang, C.H.; Low, J.Y.S. Simple, Fast, and Exact RNS Scaler for the Three-Moduli Set {2 1, 2 , 2 + 1}. IEEE Trans. Circuits Syst. I Regul. Pap. 2011, 58, 2686–2697. [CrossRef] n n n 8. Low, J.Y.S.; Chang, C.H. A VLSI Efficient Programmable Power-of-Two Scaler for {2 1, 2 , 2 + 1} RNS. IEEE Trans. Circuits Syst. I Regul. Pap. 2012, 59, 2911–2919. [CrossRef] n n n 9. Low, J.Y.S.; Tay, T.F.; Chang, C.H. A unified {2 1, 2 , 2 + 1} RNS scaler with dual scaling constants. In Proceedings of the 2012 IEEE Asia Pacific Conference on Circuits and Systems, Kaohsiung, Taiwan, 2–5 December 2012. k n n n +1 k n 10. Patronik, P.; Piestrak, S.J. Design of Reverse Converters for General RNS Moduli Sets {2 , 2 1, 2 + 1, 2 1} and {2 , 2 1, n n 1 2 + 1, 2 1} (n even). IEEE Trans. Circuits Syst. I Regul. Pap. 2014, 61, 1687–1700. [CrossRef] 11. Molahosseini, A.S.; Navi, K.; Dadkhah, C.; Kavehei, O.; Timarchi, S. Efficient reverse converter designs for the new 4-moduli sets n n n 2n+1 n n 2n 2n {2 1, 2 , 2 + 1, 2 1} and {2 1, 2 + 1, 2 , 2 + 1} based on new CRTs. IEEE Trans. Circuits Syst. I Regul. Pap. 2010, 57, 823–835. [CrossRef] 12. Zarandi, A.A.E.; Molahosseini, A.S.; Sousa, L.; Hosseinzadeh, M. An Efficient Component for Designing Signed Reverse K P Converters for a Class of RNS Moduli Sets with Composite Form {2 , 2 1}. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 25, 48–59. [CrossRef] Axioms 2022, 11, 5 13 of 13 n n n 2n+1 n n 13. Sousa, L.; Antao, S. MRC-Based RNS Reverse Converters for the Four-Moduli Sets {2 + 1, 2 1, 2 , 2 1} and {2 + 1, 2 2n 2n+1 1, 2 , 2 1}. IEEE Trans. Circuits Syst. II 2012, 59, 244–248. [CrossRef] 14. Hiasat, A. A Reverse Converter and Sign Detectors for an Extended RNS Five-Moduli Set. IEEE Trans. Circuits Syst. I Regul. Pap. 2017, 64, 111–121. [CrossRef] 15. Sousa, L. 2 RNS Scalers for Extended 4-Moduli Sets. IEEE Trans. Comput. 2015, 64, 3322–3334. [CrossRef] 16. Garcia, A.; Lioris, A. A Look-Up Scheme for Scaling in the RNS. IEEE Trans. Comput. 1999, 48, 748–751. [CrossRef] 17. Vassalos, E.; Bakalis, D. CSD-RNS-based Single Constant Multipliers. J. Signal Process. Syst. 2012, 67, 255–268. [CrossRef] 18. Piestrak, S.J. A high speed realization of a residue to binary converter. IEEE Trans. Circuits Syst. II 1995, 42, 661–663. [CrossRef] 19. Vergos, H.T.; Bakalis, D.; Efstathiou, C. Fast modulo 2 + 1 multi-operand adders and residue generators. Integration 2010, 43, 42–48. [CrossRef] http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Axioms Multidisciplinary Digital Publishing Institute

Zero-Aware Low-Precision RNS Scaling Scheme

Axioms , Volume 11 (1) – Dec 23, 2021

Loading next page...
 
/lp/multidisciplinary-digital-publishing-institute/zero-aware-low-precision-rns-scaling-scheme-u701XjAQgx
Publisher
Multidisciplinary Digital Publishing Institute
Copyright
© 1996-2022 MDPI (Basel, Switzerland) unless otherwise stated Disclaimer The statements, opinions and data contained in the journals are solely those of the individual authors and contributors and not of the publisher and the editor(s). MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. Terms and Conditions Privacy Policy
ISSN
2075-1680
DOI
10.3390/axioms11010005
Publisher site
See Article on Publisher Site

Abstract

axioms Article Amir Sabbagh Molahosseini School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast BT7 1NN, UK; A.SabbaghMolahosseini@qub.ac.uk Abstract: Scaling is one of the complex operations in the Residue Number System (RNS). This operation is necessary for RNS-based implementations of deep neural networks (DNNs) to prevent overflow. However, the state-of-the-art RNS scalers for special moduli sets consider the 2 modulo as the scaling factor, which results in a high-precision output with a high area and delay. Therefore, low-precision scaling based on multi-moduli scaling factors should be used to improve performance. However, low-precision scaling for numbers less than the scale factor results in zero output, which makes the subsequent operation result faulty. This paper first presents the formulation and hardware architecture of low-precision RNS scaling for four-moduli sets using new Chinese remainder theorem 2 (New CRT-II) based on a two-moduli scaling factor. Next, the low-precision scaler circuits are reused to achieve a high-precision scaler with the minimum overhead. Therefore, the proposed scaler can detect the zero output after low-precision scaling and then transform low-precision scaled residues to high precision to prevent zero output when the input number is not zero. Keywords: residue number system (RNS); scaling; Chinese remainder theorem (CRT) 1. Introduction Residue Number Systems (RNSs) have been used in different applications such as digital signal processing (DSP) [1] and deep learning systems [2] to provide low-power, high-speed and fault-tolerant computations [3]. The main feature of an RNS is fast and par- Citation: Sabbagh Molahosseini, A. allel implementation of addition and multiplication based on separate modular arithmetic Zero-Aware Low-Precision RNS circuits. However, detection of multiplication overflow is one of the difficult RNS problems, Scaling Scheme. Axioms 2022, 11, 5. since the multiplication of any two operands larger than half of the dynamic range results https://doi.org/10.3390/ in overflow. Therefore, the high probability of overflow occurrence in multiplication has axioms11010005 motivated researchers to develop overflow prevention mechanisms for an RNS. Scaling Academic Editor: Chong Wang (i.e., division of the RNS number by a constant number) is one of the ways to reduce the size of the operands to prevent overflow in RNS operations. However, scaling is a difficult Received: 21 November 2021 process, since the division operation in an RNS cannot be performed in parallel modular Accepted: 20 December 2021 channels like multiplication and addition [4]. Therefore, usually one of the modulo of the Published: 23 December 2021 moduli set is selected as the scaling factor to reduce the complexity [5]. Publisher’s Note: MDPI stays neutral The scaling for general moduli sets is usually realized using look-up tables (LUTs) [6], with regard to jurisdictional claims in while the adder-based implementations can be achieved based on special moduli sets with published maps and institutional affil- higher performance. Due to this, there is a variety of works focusing on designing scalers iations. n n n for the well-known RNS three-moduli set {2 1, 2 , 2 + 1} [7–9]. The authors of [7,8] n n considered the modulo 2 as the scaling factor. Using 2 as the scaling factor resulted in simplified scalers with high-precision output. However, using only one modulo as the scaling factor is mostly applicable for addition operations, since it cannot drastically reduce Copyright: © 2021 by the author. the size of the numbers to prevent multiplication overflow. Due to this, the authors of [9] Licensee MDPI, Basel, Switzerland. n n proposed two-moduli scaling based on 2 (2 + 1) as the scaling factor, which led to a This article is an open access article low-precision output. Although this scaling factor can significantly reduce the size of the distributed under the terms and n n n operands, the limited 3n-bit dynamic range of the three-moduli set {2 1, 2 , 2 + 1} is not conditions of the Creative Commons suitable for two-moduli scaling factors because in this three-moduli RNS system, the values Attribution (CC BY) license (https:// n n creativecommons.org/licenses/by/ of most numbers are less than the scaling factor (i.e., 2 (2 + 1)), which results in a zero 4.0/). output for the scaler, consequently making the next operation faulty. This is a significant Axioms 2022, 11, 5. https://doi.org/10.3390/axioms11010005 https://www.mdpi.com/journal/axioms Axioms 2022, 11, 5 2 of 13 problem which indicates the importance of a zero-aware scaling mechanism, which is not covered by previous research. Therefore, two-moduli scaling factors together with the large dynamic range four- or k n n n + 1 n n n 2n + 1 five-moduli sets, such as {2 , 2 1, 2 + 1, 2 1} [10], {2 1, 2 , 2 + 1, 2 n n 2n 2n n n 2n 2n + 1 2n + p 1} [11] and {2 1, 2 + 1, 2 , 2 + 1} [12], {2 1, 2 + 1, 2 , 2 1} [13] and {2 , n n n (n + 1)/2 n (n + 1)/2 2 1, 2 + 1, 2 2 +1, 2 + 2 +1} [14], should be used. However, there is a limited number of works that consider the scaling for four-moduli sets. The authors of [15] designed a scaler based on a two-level architecture with the single-modulo scaling n + k factor 2 . The first level of this scaler performs scaling based on the three-moduli set n n + x n {2 1, 2 , 2 + 1}, where 0  x  n, and then the second level computes the final n + k 2n four-moduli scaling using the composite set {2 (2 1), m } [15]. This two-level architecture requires high hardware requirements due to the multiple uses of modular adders. Furthermore, scaling by the 2 modulo is not sufficient for large dynamic range four-moduli sets to avoid overflow. In other words, the regular modulo 2 scaling of the n n n n numbers based on the three-moduli set {2 1, 2 , 2 + 1} is not equivalent to modulo 2 n n n n + 1 scaling in the four-moduli set {2 1, 2 , 2 + 1, 2 1}, since the dynamic ranges of these moduli sets are 3n- and (4n + 1)-bit, respectively. Therefore, two-moduli scaling must be used to prevent multiplication overflow for large dynamic range RNS systems. On the other hand, to have a zero-aware scaler, the operands should be compared with the scaling constant before operation to prevent scaling with zero output. However, magnitude comparison is a difficult RNS operation, and its realization increases hardware complexity [4]. Here, we address this problem without using an RNS magnitude compara- tor based on a method for deriving two different scaling outputs from the same circuit. In the proposed work, first, a low-precision scaler based on two moduli is proposed for RNS four-moduli sets. Then, we reuse the hardware architecture of a low-precision scaler for producing high-precision scaled output that can be used when the low-precision scaler generates zero for non-zero operands. It is shown how new Chinese remainder theorem 2 (New CRT-II) can be used to achieve simplified two-moduli scaling for four- moduli sets. The proposed approach (i.e., a double-output scaler with both low- and high-precision outputs) has two main advantages. First, the high-precision output can be applied to addition operands, while the low-precision output can be used for multiplication operations to prevent overflow by considerably reducing the operands’ size. Second, in the case of using a low-precision output for multiplication, if the low-precision output becomes zero, the high-precision output can be used to prevent overflow, resulting in a zero-aware scaling approach. Moreover, derivation of the proposed general approach for two special n n 2n 2n n n 2n large dynamic range four-moduli sets {2 1, 2 + 1, 2 , 2 + 1} and {2 1, 2 + 1, 2 , 2n + 1 2 1} is presented, and its performance is compared with the conventional method. In the rest of the paper, the mathematical formulation and proof of the proposed scaling approach for both general and special four-moduli sets are described in Section 2. Next, Section 3 presents the fully adder-based hardware design of the proposed scalers. Moreover, a performance comparison is presented in Section 4. Finally, Section 5 concludes the paper. 2. Low-Precision Scaling with Two-Moduli Scaling Factor: Mathematical Formulation This section presents the proposed approach to design scalers for RNS using the New CRT-II [11]. In the rest of this section, a brief introduction about the scaling concept and new CRT-II is first described. Then, mathematical formulations of the proposed approach in general form and for two sample special forms will be presented. 2.1. Scaling Concept and CRT-II The scaling of the weighted number X by the constant factor K according to the scaling operator defined in [16] and is as follows: X = SK +jXj (1) K Axioms 2022, 11, 5 3 of 13 This formula shows that any weighted number can be formed as a summation of its remainder with scaling factor K and the multiplication of the scaling result (i.e., S) and K. In other words, S is the integer quotient of dividing X by K, and it can be expressed as follows [5,7]: S = (2) Note that Equations (1) and (2) are based on weighted numbers. However, they should be implemented inside RNS using residues. Therefore, consider the following residue representations for X and S based on the four-moduli set {m , m , m , m }: 1 2 3 4 RNS X ! (x , x , x , x ) (3) 1 2 3 4 RNS S = ! s , s , s , s (4) ( ) 1 2 3 4 where the scale factor m is one of the moduli. Aside from that, also consider s = jSj f or i = 1 . . . 4 (5) Second, consider the RNS number (x , x , x , x ), which can be converted into its 1 2 3 4 corresponding weighted number X using the New CRT-II conversion formulas for the generic four-moduli set {m , m , m , m } as follows [11]: 1 2 3 4 X = Z + m m k Y Z (6) j ( )j 1 2 1 m m 3 4 Z = x + m jk (x x )j (7) 1 1 2 2 1 Y = x + m jk (x x )j (8) 3 3 3 4 3 where the required multiplicative inverses can be achieved by considering the following relations: jk m m j = 1 (9) 1 1 2 m m 3 4 jk m j = 1 (10) 2 1 jk m j = 1 (11) 3 3 Equations (6)–(8) can be rewritten as follows: X = Z + m m T (12) 1 2 Z = x + m H (13) 1 1 Y = x + m P (14) 3 3 where T = jk (Y Z)j (15) m m 3 4 H = jk (x x )j (16) 2 2 1 P = jk (x x )j (17) 3 3 2.2. General Formulations Now, we choose the scaling factor as the product of the first and second modulo (i.e., m m ). Therefore, scaling of X by m m can be performed by considering k = m m and 1 2 1 2 1 2 substituting Equation (6) into Equation (4) as follows: X Z + m m T Z 1 2 S = = = + T (18) K m m m m 1 2 1 2 Axioms 2022, 11, 5 4 of 13 where x is a residue in modulo m and the maximum value of H in Equation (16) is m 1 1 2 1. Therefore, the maximum value of Z in Equation (13) can be computed as follows: Z = m 1 + m (m 1) = m m 1 (19) Max 1 1 2 1 2 It is clear that the floor of the division of Z by m m is zero. Therefore, by consid- Max 1 2 ering this point and taking into account that T is an integer number, Equation (18) can be simplified as follows: Z Z S = + T = + T = T (20) m m m m 2 2 1 1 Now, according to Equation (5), the residues of T based on the moduli should be computed to achieve the residues of the scaled number as follows: s = jS j = jTj f or i = 1 . . . 4 (21) Li L m m i i Therefore, scaling of Z by m m is reduced to T, and the full reverse conversion (i.e., 1 2 full computing of Equation (6)) is not needed. Now, we are going to achieve a single-modulo scaler for the same moduli set, (i.e., {m , m , m , m }) but with the aim of reusing the two-moduli scaler formulas to reduce 1 2 3 4 the overhead. Hence, considering k = m and the main CRT-II formula of Equation (6) in Equation (4) results in X Z + m m T Z 1 2 S = = = + m T (22) H 2 K m m 1 1 Insertion of Equation (13) into Equation (22) leads to x + m H x 1 1 1 S = + m T = + H + m T (23) H 2 2 m m 1 1 Therefore, since x is less than m , and H and T are integer numbers, Equation (23) can 1 1 be simplified as follows: x + m H x 1 1 1 S = + m T = + H + m T = H + m T (24) H 2 2 2 m m 1 1 Now, according to Equation (5), we have s = S = jH + m Tj f or i = 1 . . . 4 (25) Hi Single 2 According to the residue arithmetic properties [6], Equation (25) can be rewritten as s = jHj +jm j jTj f or i = 1 . . . 4 (26) Hi 2 m m m i i i However, from Equation (21), we know the remainders of T in moduli m are the two-moduli scaling residues. Therefore, we have s = jHj +jm j s f or i = 1 . . . 4 (27) Hi 2 Li m m i i Therefore, by using Equation (21), the single-modulo scaling residues can be achieved from the previously computed two-moduli scaling residues with the minimum overhead. It should be mentioned that more simplifications of Equation (21) can be performed using the exact value of the moduli as shown in the next subsections. Axioms 2022, 11, 5 5 of 13 2n 2n n n 2.3. Case Study: Moduli Set {2 + 1, 2 ,2 + 1, 2 1} The moduli set has a 6n-bit dynamic range, and its reverse converters are all de- signed based on the New CRT-I [11]. However, in contrast to the reverse converters of this moduli set, here, we use the New CRT-II to derive efficient two-moduli scaling for- 2n 2n n n mulas. First, consider the moduli set {m , m , m , m } = {2 + 1, 2 , 2 + 1, 2 1}. 1 2 3 4 According to Equation (20), we must compute T in Equation (15), and then its residues are the low-precision scaling residues. First, the following lemma computes the required multiplicative inverses. 2n 1 Lemma 1. The multiplicative inverses required in Equations (15)–(17) are k = 2 , k = 1 and 1 2 n 1 k = 2 . Proof of Lemma 1. Verification can be performed by substituting the values of the multi- plicative increases and moduli in Equations (9)–(11) as follows: 2n1 2n 2n 2n1 j2  (2 + 1)2 j = j2  2 1j = 1 (28) 2n 2n 2 1 2 1 2n j1 (2 + 1)j 2n = 1 (29) n1 n j2  2 + 1 j n = 1 (30) ( ) 2 1 Now, inserting the values of the moduli and multiplicative inverses in Equations (13)–(17) leads to 2n Z = x + (2 + 1)H (31) Y = x + (2 + 1)P (32) H = jx x j (33) 2n 2 1 n1 P = j2 (x x )j n (34) 4 3 2 1 2n1 T = j2 (Y Z)j 2n (35) 2 1 Equation (31) can be further simplified by substituting Equations (31) and (32) into it as follows: 2n1 n 2n T = j2 (x + (2 + 1)P x (2 + 1)H)j 2n (36) 2 1 The following well-known residue arithmetic properties can be used to further simplify Equations (33)–(36). Property 1. 2 v is equal to the P-bit circular left shifting of v if v is represented as a k-bit i i i 2 1 binary number [11]. Property 2. jv j is equal to one’s complement of v (i.e., v ) if v is represented as a k-bit i i i i 2 1 binary number [11]. Property 3. jv j is equal to v + 1 if v is represented as a k-bit binary number [17]. i i i Property 4. jv j is equal to v + 2 if v is represented as a k-bit binary number [17]. i i i 2 +1 2n 2n n n First, according to the moduli set {2 + 1, 2 , 2 + 1, 2 1}, x and x are (2n + 1)- 1 2 and 2n-bit numbers, respectively. Therefore, Equation (33) can be simplified using Property 3 as follows: 2n H = jx x j = jx (x  2 + x )j 2n 2n 2 1 1,2n 2 2,(2n1)...0 1,(2n1)...0 2 (37) = jx + x + 1j 2n 2,(2n1)...0 1,(2n1)...0 2 Axioms 2022, 11, 5 6 of 13 where x means the j-th bit of the residue x and x and x are (n + 1)- and n-bit numbers, i,j i 4 3 respectively. Therefore, Equation (34) can be rewritten as n1 n P = 2 (x x  2 x ) (38) 3,n 4 3,(n1)...0 2 1 where x is a residue in modulo 2 + 1. Therefore, when x is equal to one, the other bits 3 3,n will be surely be zero, and if the n low significant bits (LSBs) of x are not equal to zero, then the most significant bit (MSB) of x (i.e., x ) should be zero [12]. Therefore, by considering 3 3,n this point and Properties 1 and 2, Equation (38) can be simplified as follows: P = jP + P j (39) 2 n 2 1 where n1 P = j2 x j n = x x x (40) 1 4 2 1 4,0 4,n1... 4,1 | {z } n bits 01 . . . 11 i f x = 1 > 3,n | {z } n bits P = (41) x x . . . x i f x = 0 > 3,0 3,n1 3,n2 3,n | {z } n bits Finally, Equation (35) can be simplified using Properties 1 and 2 as follows: T = jT + T + T + T + T j (42) 2n 1 2 3 4 5 2 1 where 2n1 T = j2 x j 2n = x 0 . . . 00x . . . x (43) 1 3 3,0 3,n 3,1 2 1 | {z } | {z } n1 bits n bits 2n1 n 2n1 T = 2 (2 + 1)P = 2 (p . . . p p . . . p ) 2n 2 n1 0 n1 0 2 1 | {z }| {z } (44) 2n n bits n bits 2 1 = p p . . . p p . . . p 0 n1 0 n1 1 | {z }| {z } n bits n1 bits 2n1 T = 2 x = x x . . . x (45) 3 1,0 1,2n1 1,1 1,(2n1)...0 2n 2 1 | {z } 2n bits 2n1 2n T = 2  2  x = x 1 . . . 11 (46) 4 1,2n 1,2n 2n 2 1 | {z } 2n bits 2n1 2n T = j2 (2 + 1)Hj 2n = jHj = H . . . H H (47) 5 2n 0 2n1 1 2 1 2 1 | {z } 2n bits Note that P is an n-bit number, and due to this computation of Equation (44), it became a simple concatenation. Aside from that, the constant coefficient of H in Equation (47) was substituted with 1 since 2n1 2n 2n1 jj2 j j2 + 1j j = j2  2j = 1 (48) 2n 2n 2n 2n 2 1 2 1 2 1 2 1 Next, after calculation of T using Equation (42), we must compute the residues of T according to Equation (21) to achieve the two-moduli scaled residues. However, the largest 2n value of T is 2 2, and therefore, it is always less than the first and second moduli. Hence, we have s = T = T (49) j j 2n L1 2 +1 s = jTj = T (50) 2n L2 2 Axioms 2022, 11, 5 7 of 13 The third and fourth two-moduli scaled residues can be achieved as follows: s = jTj = jT  2 + T j n L3 (2n1)...n (n1)...0 2 +1 2 +1 = jT T j n (51) 2 +1 (n1)...0 (2n1)...(n1) = jT + T + 2j n 2 +1 (n1)...0 (2n1)...(n) s = jTj = jT + T j n (52) L4 2 1 2 1 (2n1)...n (n1)...0 Now, based on Equation (27), we can also achieve single-modulo scaling formulas from the two-moduli scaling residues as follows: 2n s = jjHj + 2 jTj j 2n 2n 2n 2n H1 2 +1 2 +1 2 +1 2 +1 (53) 2n = jjHj + 2 s j 2n 2n 2n L1 2 +1 2 +1 2 +1 We can simplify Equation (27) by substituting Equation (49) into it and considering 2n that the maximum value of H in (37) is 2 1. Therefore, we have s = jH Tj = H + T + 2 (54) 2n 2n H1 2 +1 2 +1 Similarly, for other residues, we have 2n s = jjHj 2n +j2 j 2njTj 2nj 2n = H (55) H2 2 2 2 2 2n s = jjHj n +j2 j n jTj n j n H3 2 +1 2 +1 2 +1 2 +1 = H  2 + H + s L3 (2n1)...n (n1)...0 (56) 2 +1 2 +1 = H + H + 2 + s L3 (2n1)...n (n1)...0 2 +1 2n s = jjHj n + 2 jTj n j H4 n 2 1 2 1 2 1 2 1 (57) = jH + H + s j n (2n1)...n (n1)...0 L4 2 1 Therefore, we can compute H and T from Equations (37) and (42) just one time and then using them several times to compute both the single- and two-moduli scaling. n n 2n 2n + 1 2.4. Case Study: Moduli Set {2 1, 2 + 1, 2 , 2 1} n n 2n 2n 2n This moduli set has the same moduli as {2 1, 2 + 1, 2 , 2 + 1} except for 2 + 1 2n + 1 which is substituted with 2 1. Due to this, it can lead to a faster RNS arithmetic unit. However, its reverse converter will be more complex. The overall process of designing the n n 2n 2n scaler for this moduli set is relatively the same as for the moduli set {2 1, 2 + 1, 2 , 2 + 1} described in the previous subsection. 2n + 1 2n n n First, consider the moduli order {m , m , m , m } = {2 1, 2 , 2 + 1, 2 1}. 1 2 3 4 Then, according to Equations (15)–(17), the multiplicative inverses can be computed as k n 1 = k = 1, and k = 2 (the proof is straightforward and similar to Lemma 1). Therefore, 2 3 Equation (15) is a key formula in the scaling that can be calculated as follows: T = jY Zj 2n 2 1 n 2n+1 = x + (2 + 1)P x 2 1 H (58) 3 2n 2 1 = jT + T + T + T + T j 2n 1 2 3 4 5 2 1 where T = jx j = 0 . . . 00 x . . . x (59) 2n 1 3 3,n1 3,0 2 1 | {z } | {z } n bits n bits T = 2 + 1 P = p . . . p p . . . p (60) j( ) j 2n 2 n1 0 n1 0 2 1 | {z }| {z } n bits n bits Axioms 2022, 11, 5 8 of 13 Axioms 2022, 10, x FOR PEER REVIEW 8 of 13 T = x = x . . . x (61) 3 1,(2n1)...0 1,2n1 1,0 2n 2 1 | {z } 𝑇 = − 𝑥 = 𝑥 … 𝑥 ,( )… ,2 , (61) 2n bits 2n T = 2  x = 1 . . . 11x (62) 4 1,2n 1,2n 2n 𝑇 = − 2 × 𝑥 = 1 … 11𝑥 2 1 | {z } , ,2 (62) 2n bits 2n+1 T = j(2 1)Hj = jHj = H . . . H H (63) 2n 2n 5 2n1 1 0 2 1 2 1 𝑇 = | − (2 − 1)𝐻 | = | − 𝐻 | = 𝐻 … 𝐻 𝐻 | {z } (63) 2n bits 2n + 1 2n n n 2n + 1 2n n n The The two-moduli two-moduli scaled scaled r residue esidue formulas formulas for for the the moduli moduli set set {2 {2 − 1, 1, 2 2, 2 , + 21,+ 21, −2 1} 2n 2n n n 2n 2n n n are 1} th ar ee sthe amesame as tho as sethose for th for e m the odu moduli li set {2set +{2 1, 2 + , 2 1, + 2 1,, 2 2 −+ 11, } (i2 .e., E1} qu(i.e., ationEquations s (49–52)), (49)–(52)), since all of them are based on T. That aside, the single-modulo scaled residues since all of them are based on T. That aside, the single-modulo scaled residues are the are the same as in Equations (55)–(57) except for the first scaled residue, which is as follows: same as in Equations (55–57) except for the first scaled residue, which is as follows: 2n 𝑠 = ||𝐻 | + |2 | 𝑠 | s = jjHj + 2 s j 2n+1 2n+1 H1 2n+1 L1 2 1 2 1 2 1 (64) = 𝐻 + 𝑠 … 𝑠 𝑠 , , , (64) = H + s . . . s s L1,2n1 L1,0 L1,2n | {z } 2n+1 bits 2n+1 2 1 3. Low-Precision Scaling with Two-Moduli Scaling Factor: Hardware Design 3. Low-Precision Scaling with Two-Moduli Scaling Factor: Hardware Design This section presents the full adder-based and memory-free hardware design of the This section presents the full adder-based and memory-free hardware design of the proposed RNS scaler. The overview of the proposed approach for a generic four-moduli proposed RNS scaler. The overview of the proposed approach for a generic four-moduli set is depicted in Figure 1. First, P, H and T are computed using Equations (15–17), and set is depicted in Figure 1. First, P, H and T are computed using Equations (15)–(17), and then, the two-moduli scaled residues are computed using Equation (21). Afterward, the then, the two-moduli scaled residues are computed using Equation (21). Afterward, the single-modulo scaled residues are obtained using the precomputed two-moduli scaled single-modulo scaled residues are obtained using the precomputed two-moduli scaled residues based on Equation (27). The important part of the scaler is the calculation of T residues based on Equation (27). The important part of the scaler is the calculation of T that that is shared between both kinds of scaling, resulting in a significant hardware reduction. is shared between both kinds of scaling, resulting in a significant hardware reduction. Figure Figure 1. 1. The The block block diagram diagram of of the the pr proposed oposed zer zero-awar o-aware e low-pr low-precision ecision scaler scaler for for the the ge generic neric RNS RNS four-moduli set {m1, m2, m3, m4}. four-moduli set {m , m , m , m }. 1 2 3 4 The The scaler scaler of of Figur Figure e 1 1 is is designed designed based based on on a a general general value value of of the the moduli. moduli. H However owever,, 2n 2n n n 2n 2n n n for special RNS moduli sets with power-of-two moduli such as {2 + 1, 2 , 2 + 1, 2 1}, for special RNS moduli sets with power-of-two moduli such as {2 + 1, 2 , 2 + 1, 2 − 1}, the design can be considerably simplified, as presented in Figure 2. First, the H in Equation the design can be considerably simplified, as presented in Figure 2. First, the H in Equation (39) is implemented using a 2n-bit regular carry-propagate adder (CPA) where its carry-in (39) is implemented using a 2n-bit regular carry-propagate adder (CPA) where its carry- Double-Modulo Axioms 2022, 10, x FOR PEER REVIEW 9 of 13 Axioms 2022, 11, 5 9 of 13 in is connected to one. Aside from that, P also requires a modulo 2 − 1 CPA, which can be is connected to one. Aside from that, P also requires a modulo 2 1 CPA, which can be implemented using an n-bit CPA with EAC [18] based on Equation (40). implemented using an n-bit CPA with EAC [18] based on Equation (40). x1 x2 x3 x4 Operand Preparation 2n-bit CPA 1 n-bit CPA-EAC Operand Preparation 2n-bit CSA-EAC 2n-bit CSA-EAC 2n-bit CSA-EAC 2n-bit CPA-EAC Operand Preparation n-bit CPA- n-bit CSA- EAC CEAC n-bit CPA- CEAC Operand Preparation n-bit n-bit 2n-bit CSA- CSA- CSA- EAC CEAC CEAC n-bit n-bit 2n-bit CPA- CPA- CPA- EAC CEAC CEAC Singl e-Modulo Scal ed Resi dues 2n 2n n n 2n 2n n n 2n Figure 2. The proposed scaler for the moduli set {2 + 1, 2 , 2 + 1, 2 1} with scale coefficients Figure 2. The proposed scaler for the moduli set {2 + 1, 2 , 2 + 1, 2 − 1} with scale coefficients (2 2n 2n 2n 2n 2n (2 + 1) 2 and 2 . + 1) 2 and 2 . The operand preparation unit performs the required inversions, shifting and multi- The operand preparation unit performs the required inversions, shifting and multi- plexing needed in Equation (41). Then, the important variable T in Equation (42) can be plexing needed in Equation (41). Then, the important variable T in Equation (42) can be 2n realized using three carry-save adders (CSAs) with EAC followed by a modulo 2n 2 1 realized using three carry-save adders (CSAs) with EAC followed by a modulo 2 − 1 CPA CPA [18]. Then, according to Equations (49)–(52), the first and second two-moduli scaled [18]. Then, according to Equations (49–52), the first and second two-moduli scaled resi- residues are equal to T, and the third and fourth are only the reduction of T in moduli dues are equal to T, and the third and fourth are only the reduction of T in moduli 2 − 1 n n 2 1 and 2 + 1, which can be realized using an n-bit CPA with EAC and n-bit CPA and 2 + 1, which can be realized using an n-bit CPA with EAC and n-bit CPA with com- with complement EAC (CEAC), respectively. Note that CPA-CEAC is a representation of plement EAC (CEAC), respectively. Note that CPA-CEAC is a representation of the mod- the modulo 2 + 1 adder which can be realized using different methods [19]. Finally, the ulo 2 + 1 adder which can be realized using different methods [19]. Finally, the single- single-modulo scaled residues can be achieved using Equations (54)–(57). The CSAs are modulo scaled residues can be achieved using Equations (54–57). The CSAs are used to used to compress the three operands into two, and then a modulo adder produces the compress the three operands into two, and then a modulo adder produces the scaled res- scaled residue. It can be seen that in the customized version of the scaler for the moduli set 2n idue. It can be seen that in the customized version of the scaler for the moduli set {2 + 1, 2n 2n n n {2 + 1, 2 , 2 + 1, 2 1}, the units for m and m reduction in the two-moduli scaling 1 2 Axioms 2022, 11, 5 10 of 13 part are removed, since the scaled residues are equal to T. Aside from that, the second single-modulo scaled residue is H, and hence, the required m reduction unit is removed. Finally, Algorithm 1 shows how the proposed hardware architecture can be used to provide zero-aware RNS scaling. If the low-precision scaled residues become zero, then the high-precision scaled residue should be used as the output, except in the case that they also become zero. In this case (i.e., both scaler outputs become zero), the number is very small and is less than both of the scaling constants. In this case, its original value can be used in the computations. Note that here we do not use any magnitude comparator which is a complex unit in RNS, and only by checking the scaled residues against zero we could evaluate the relative magnitude of the number (less or greater than the scaling coefficients). Algorithm 1: Zero-Aware RNS Scaling. Input: Non-Zero RNS Number (x , x , x , x ) 1 2 3 4 Output: Non-Zero Scaled RNS Number (s , s , s , s ) 1 2 3 4 1: Calculate the low-precision scaled residues (sl , sl , sl , sl ) 1 2 3 4 2: If (sl , sl , sl , sl ) 6= (0, 0, 0, 0) Then return (sl , sl , sl , sl ) 1 2 3 4 1 2 3 4 3: Calculate the high-precision scaled residues (sh , sh , sh , sh ) 1 2 3 4 4: If (sh , sh , sh , sh ) 6= (0, 0, 0, 0) Then return (sh , sh , sh , sh ) 1 2 3 4 1 2 3 4 5: Return original residues (x , x , x , x ) 1 2 3 4 4. Performance Evaluation The majority of the available state-of-the-art RNS scalers are dedicatedly designed for three-moduli sets, and only [15] presents the first RNS scaler design for four-moduli 2n + 1 2n n n sets. The RNS scaler for the moduli set {2 1, 2 , 2 + 1, 2 1} is fully designed 2n in [15] based on the scaling factor 2 as shown in Figure 3. To perform a technology- independent performance comparison, the unit-gate (U-G) model is used according to [15] for comparative assessment of the works. All the assumptions considered in [15] for estimation of the area and delay of modular adders are also considered here for a fair comparison, as shown in Table 1. Table 1. The area and delay formulas for different n-bit modulo adders based on the U-G model reported in [15]. Modulo Adder Area Delay CPA-EAC 3ndlog n 1e + 12n 2dlog n 1e + 3 2 2 2 1 CSA-EAC 7n 4 2 CPA 1.5ndlog ne + 5n 2dlog ne + 3 2 2 CPA-CEAC 4.5ndlog ne + 0.5n + 6 2dlog ne + 3 2 2 2 + 1 CSA-CEAC 7n 4 Note that in the U-G model, each XOR or XNOR gate counts as two unit gates in the area and delay, and an AND or OR gate is considered one unit gate for both the area and delay. Therefore, the combinatorial circuits such as FAs, half adders (HAs) and one-bit 21 multiplexers count as 7, 3 and 3 unit gates in area and 4, 2 and 2 gates in the delay, respectively. Aside from that, the U-G area and delay estimations for each component of the proposed scaler is described in Table 2. Note that the gray lines in Table 2 are not on the critical delay path. Axioms 2022, 10, x FOR PEER REVIEW 11 of 13 Axioms 2022, 11, 5 11 of 13 x1 x2 x3 x4 Operand Preparation n-bit CPA- n-bit CSA- 2n-bit CSA-EAC EAC CEAC 2n-bit simplified CSA- n-bit CPA- EAC CEAC 2n-bit CPA-EAC Operand Preparation (2n+1)-bit CSA-EAC (2n+1)-bit CPA-EAC Operand Preparation 2n-bit (2n+1)-bit CPA CSA-EAC (2n+1)-bit CPA-EAC Singl e-Modulo Scal ed Resi du es 2n + 1 2n n n Figure 3. The single-modulo scaler for the special moduli set {2 1, 2 , 2 + 1, 2 1} with 2n + 1 2n n n Figure 3. The single-modulo scaler for the special moduli set {2 − 1, 2 , 2 + 1, 2 − 1} with scaling 2n scaling factor 2 proposed in [15]. 2n factor 2 proposed in [15]. Table 2. The area and delay formulas based on the U-G model for different components of the Table 1. The area and delay formulas for different n-bit modulo adders based on the U-G model proposed double-modulo scaler. reported in [15]. Component Area Delay Modulo Adder Area Delay CPA-EAC 3𝑛 ⌈log 𝑛 − 1⌉ + 12𝑛 2⌈log 𝑛 − 1⌉ + 3 2n-bit CPA 3ndlog ne + 13n 2dlog ne + 5 2 2 2 − 1 7𝑛 CSA-EAC 4 n-bit 2  1 MUX 3n 2 2 CPA 1.5𝑛 ⌈log 𝑛 ⌉ + 5𝑛 2⌈log 𝑛 ⌉ + 3 n-bit CPA-EAC 3ndlog n 1e + 12n 2dlog n 1e + 3 2 2 CPA-CEAC 4.5𝑛 ⌈log 𝑛 ⌉ + 0.5𝑛 + 6 2⌈log 𝑛 ⌉ + 3 2n-bit 2Simplified + 1 CSA-EAC1 10n + 4 4 7𝑛 CSA-CEAC 4 2n-bit Simplified CSA-EAC2 6n + 4 4 2n-bit CSA-EAC 14n 4 Note that in the U-G model, each XOR or XNOR gate counts as two unit gates in the area and delay, and an AND or OR gate is considered one unit gate for both the area and 2n-bit CPA-EAC 6ndlog ne + 24n 2dlog ne + 3 2 2 delay. Therefore, the combinatorial circuits such as FAs, half adders (HAs) and one-bit n-bit CSA-CEAC 7n 4 2×1 multiplexers count as 7, 3 and 3 unit gates in area and 4, 2 and 2 gates in the delay, n-bit CPA-CEAC 4.5ndlog ne + 0.5n + 6 2dlog ne + 3 2 2 respectively. Aside from that, the U-G area and delay estimations for each component of the proposed scaler is described in Table 2. Note that the gray lines in Table 2 are not on Finally, the overall area and delay estimations for scalers are presented in Table 3 for the critical delay path. a general value of n. It can be seen that while the proposed low-precision scaler is based on Tab altwo-moduli e 2. The area ascaling nd delayfactor formu , lthe as bhar ased dwar on th eer U equir -G m ement odel for is dless iffere than nt coth mp eosingle-modulo nents of the pro- posed double-modulo scaler. scaler for the same moduli set, while the delay is almost the same. That aside, the proposed scaler outperforms the design of [15] in terms of hardware requirements. Furthermore, as is Component Area Delay expected, the high-precision single-modulo version requires a higher area and delay since ⌈ ⌉ ⌈ ⌉ 2n-bit CPA 3𝑛 log 𝑛 + 13𝑛 2 log 𝑛 + 5 it is computed based on the output of the two-moduli scaler. n-bit 2 × 1 MUX 3𝑛 2 ⌈ ⌉ ⌈ ⌉ n-bit CPA-EAC 3𝑛 log 𝑛 − 1 + 12𝑛 2 log 𝑛 − 1 + 3 Axioms 2022, 11, 5 12 of 13 2n + 1 Table 3. The total area and delay estimations for the RNS scalers based on the four-moduli set {2 2n n n 1, 2 , 2 + 1, 2 1}. Scaler Scale Factor Area Delay Proposed 19.5ndlog ne + 2n 2n + 1 2 2 (2 1) 6dlog ne + 27 Low-Precision 95.5n + 14 Proposed 31.5ndlog ne + 2n 8dlog ne + 34 High-Precision 160.5n + 14 [15] (28.5n + 6)dlog ne + 2n 2 2 6dlog ne + 25 High-Precision 150.5n + 44 5. Conclusions Scaling is an overflow prevention mechanism that must be extensively used to prevent overflow by reducing the size of the operands before RNS addition and multiplication operations. However, high-precision single-modulo scaling is not suitable for overflow prevention in multiplication due to its inability to perform significant size reduction, but the low-precision scaling with two moduli can lead to a zero result for small numbers. Therefore, this work presents a novel zero-aware low-precision scaler based on the two- moduli scaling factor. Then, the proposed circuits are reused to derive a high-precision scaling output to use in situations where low-precision output is not usable, resulting in a zero-aware RNS scaler. Therefore, the proposed design is pushing forward the RNS into practical applications by providing an efficient mechanism for overflow prevention, which is one of the major challenges of RNS. On the other hand, the high latency of the scaler is one of the limitations of this approach which can be improved in the context of the final application. Funding: This research received no external funding. Conflicts of Interest: The author declares no conflict of interest. References 1. Chang, C.H.; Molahosseini, A.S.; Zarandi, A.A.E.; Tay, T.F. Residue Number Systems: A New Paradigm to Datapath Optimization for Low-Power and High-Performance Digital Signal Processing. IEEE Circuits Syst. Mag. 2015, 15, 26–44. [CrossRef] 2. Samimi, N.; Kamal, M.; Afzali-Kusha, A.; Pedram, M. Res-DNN: A Residue Number System-Based DNN Accelerator Unit. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 67, 658–671. [CrossRef] 3. Deng, B.; Srikanth, S.; Jain, A.; Conte, T.; Debenedictis, E.; Cook, J. Scalable Energy-Efficient Microarchitectures with Computa- tional Error Tolerance via Redundant Residue Number Systems. IEEE Trans. Comput. 2021, in press. [CrossRef] 4. Omondi, A.R.; Premkumar, B. Residue Number Systems: Theory and Implementation; Imperial College Press: London, UK, 2007. 5. Molahosseini, A.S.; Zarandi, A.A.E.; Martins, P.; Sousa, L. A Multifunctional Unit for Designing Efficient RNS-Based Datapaths. IEEE Access 2017, 5, 25972–25986. [CrossRef] 6. Kong, Y.; Phillips, B. Fast Scaling in the Residue Number System. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2009, 17, 443–447. [CrossRef] n n n 7. Chang, C.H.; Low, J.Y.S. Simple, Fast, and Exact RNS Scaler for the Three-Moduli Set {2 1, 2 , 2 + 1}. IEEE Trans. Circuits Syst. I Regul. Pap. 2011, 58, 2686–2697. [CrossRef] n n n 8. Low, J.Y.S.; Chang, C.H. A VLSI Efficient Programmable Power-of-Two Scaler for {2 1, 2 , 2 + 1} RNS. IEEE Trans. Circuits Syst. I Regul. Pap. 2012, 59, 2911–2919. [CrossRef] n n n 9. Low, J.Y.S.; Tay, T.F.; Chang, C.H. A unified {2 1, 2 , 2 + 1} RNS scaler with dual scaling constants. In Proceedings of the 2012 IEEE Asia Pacific Conference on Circuits and Systems, Kaohsiung, Taiwan, 2–5 December 2012. k n n n +1 k n 10. Patronik, P.; Piestrak, S.J. Design of Reverse Converters for General RNS Moduli Sets {2 , 2 1, 2 + 1, 2 1} and {2 , 2 1, n n 1 2 + 1, 2 1} (n even). IEEE Trans. Circuits Syst. I Regul. Pap. 2014, 61, 1687–1700. [CrossRef] 11. Molahosseini, A.S.; Navi, K.; Dadkhah, C.; Kavehei, O.; Timarchi, S. Efficient reverse converter designs for the new 4-moduli sets n n n 2n+1 n n 2n 2n {2 1, 2 , 2 + 1, 2 1} and {2 1, 2 + 1, 2 , 2 + 1} based on new CRTs. IEEE Trans. Circuits Syst. I Regul. Pap. 2010, 57, 823–835. [CrossRef] 12. Zarandi, A.A.E.; Molahosseini, A.S.; Sousa, L.; Hosseinzadeh, M. An Efficient Component for Designing Signed Reverse K P Converters for a Class of RNS Moduli Sets with Composite Form {2 , 2 1}. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 25, 48–59. [CrossRef] Axioms 2022, 11, 5 13 of 13 n n n 2n+1 n n 13. Sousa, L.; Antao, S. MRC-Based RNS Reverse Converters for the Four-Moduli Sets {2 + 1, 2 1, 2 , 2 1} and {2 + 1, 2 2n 2n+1 1, 2 , 2 1}. IEEE Trans. Circuits Syst. II 2012, 59, 244–248. [CrossRef] 14. Hiasat, A. A Reverse Converter and Sign Detectors for an Extended RNS Five-Moduli Set. IEEE Trans. Circuits Syst. I Regul. Pap. 2017, 64, 111–121. [CrossRef] 15. Sousa, L. 2 RNS Scalers for Extended 4-Moduli Sets. IEEE Trans. Comput. 2015, 64, 3322–3334. [CrossRef] 16. Garcia, A.; Lioris, A. A Look-Up Scheme for Scaling in the RNS. IEEE Trans. Comput. 1999, 48, 748–751. [CrossRef] 17. Vassalos, E.; Bakalis, D. CSD-RNS-based Single Constant Multipliers. J. Signal Process. Syst. 2012, 67, 255–268. [CrossRef] 18. Piestrak, S.J. A high speed realization of a residue to binary converter. IEEE Trans. Circuits Syst. II 1995, 42, 661–663. [CrossRef] 19. Vergos, H.T.; Bakalis, D.; Efstathiou, C. Fast modulo 2 + 1 multi-operand adders and residue generators. Integration 2010, 43, 42–48. [CrossRef]

Journal

AxiomsMultidisciplinary Digital Publishing Institute

Published: Dec 23, 2021

Keywords: residue number system (RNS); scaling; Chinese remainder theorem (CRT)

There are no references for this article.