Zero-Aware Low-Precision RNS Scaling Scheme
Zero-Aware Low-Precision RNS Scaling Scheme
Sabbagh Molahosseini, Amir
2021-12-23 00:00:00
axioms Article Amir Sabbagh Molahosseini School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast BT7 1NN, UK; A.SabbaghMolahosseini@qub.ac.uk Abstract: Scaling is one of the complex operations in the Residue Number System (RNS). This operation is necessary for RNS-based implementations of deep neural networks (DNNs) to prevent overflow. However, the state-of-the-art RNS scalers for special moduli sets consider the 2 modulo as the scaling factor, which results in a high-precision output with a high area and delay. Therefore, low-precision scaling based on multi-moduli scaling factors should be used to improve performance. However, low-precision scaling for numbers less than the scale factor results in zero output, which makes the subsequent operation result faulty. This paper first presents the formulation and hardware architecture of low-precision RNS scaling for four-moduli sets using new Chinese remainder theorem 2 (New CRT-II) based on a two-moduli scaling factor. Next, the low-precision scaler circuits are reused to achieve a high-precision scaler with the minimum overhead. Therefore, the proposed scaler can detect the zero output after low-precision scaling and then transform low-precision scaled residues to high precision to prevent zero output when the input number is not zero. Keywords: residue number system (RNS); scaling; Chinese remainder theorem (CRT) 1. Introduction Residue Number Systems (RNSs) have been used in different applications such as digital signal processing (DSP) [1] and deep learning systems [2] to provide low-power, high-speed and fault-tolerant computations [3]. The main feature of an RNS is fast and par- Citation: Sabbagh Molahosseini, A. allel implementation of addition and multiplication based on separate modular arithmetic Zero-Aware Low-Precision RNS circuits. However, detection of multiplication overflow is one of the difficult RNS problems, Scaling Scheme. Axioms 2022, 11, 5. since the multiplication of any two operands larger than half of the dynamic range results https://doi.org/10.3390/ in overflow. Therefore, the high probability of overflow occurrence in multiplication has axioms11010005 motivated researchers to develop overflow prevention mechanisms for an RNS. Scaling Academic Editor: Chong Wang (i.e., division of the RNS number by a constant number) is one of the ways to reduce the size of the operands to prevent overflow in RNS operations. However, scaling is a difficult Received: 21 November 2021 process, since the division operation in an RNS cannot be performed in parallel modular Accepted: 20 December 2021 channels like multiplication and addition [4]. Therefore, usually one of the modulo of the Published: 23 December 2021 moduli set is selected as the scaling factor to reduce the complexity [5]. Publisher’s Note: MDPI stays neutral The scaling for general moduli sets is usually realized using look-up tables (LUTs) [6], with regard to jurisdictional claims in while the adder-based implementations can be achieved based on special moduli sets with published maps and institutional affil- higher performance. Due to this, there is a variety of works focusing on designing scalers iations. n n n for the well-known RNS three-moduli set {2 1, 2 , 2 + 1} [7–9]. The authors of [7,8] n n considered the modulo 2 as the scaling factor. Using 2 as the scaling factor resulted in simplified scalers with high-precision output. However, using only one modulo as the scaling factor is mostly applicable for addition operations, since it cannot drastically reduce Copyright: © 2021 by the author. the size of the numbers to prevent multiplication overflow. Due to this, the authors of [9] Licensee MDPI, Basel, Switzerland. n n proposed two-moduli scaling based on 2 (2 + 1) as the scaling factor, which led to a This article is an open access article low-precision output. Although this scaling factor can significantly reduce the size of the distributed under the terms and n n n operands, the limited 3n-bit dynamic range of the three-moduli set {2 1, 2 , 2 + 1} is not conditions of the Creative Commons suitable for two-moduli scaling factors because in this three-moduli RNS system, the values Attribution (CC BY) license (https:// n n creativecommons.org/licenses/by/ of most numbers are less than the scaling factor (i.e., 2 (2 + 1)), which results in a zero 4.0/). output for the scaler, consequently making the next operation faulty. This is a significant Axioms 2022, 11, 5. https://doi.org/10.3390/axioms11010005 https://www.mdpi.com/journal/axioms Axioms 2022, 11, 5 2 of 13 problem which indicates the importance of a zero-aware scaling mechanism, which is not covered by previous research. Therefore, two-moduli scaling factors together with the large dynamic range four- or k n n n + 1 n n n 2n + 1 five-moduli sets, such as {2 , 2 1, 2 + 1, 2 1} [10], {2 1, 2 , 2 + 1, 2 n n 2n 2n n n 2n 2n + 1 2n + p 1} [11] and {2 1, 2 + 1, 2 , 2 + 1} [12], {2 1, 2 + 1, 2 , 2 1} [13] and {2 , n n n (n + 1)/2 n (n + 1)/2 2 1, 2 + 1, 2 2 +1, 2 + 2 +1} [14], should be used. However, there is a limited number of works that consider the scaling for four-moduli sets. The authors of [15] designed a scaler based on a two-level architecture with the single-modulo scaling n + k factor 2 . The first level of this scaler performs scaling based on the three-moduli set n n + x n {2 1, 2 , 2 + 1}, where 0 x n, and then the second level computes the final n + k 2n four-moduli scaling using the composite set {2 (2 1), m } [15]. This two-level architecture requires high hardware requirements due to the multiple uses of modular adders. Furthermore, scaling by the 2 modulo is not sufficient for large dynamic range four-moduli sets to avoid overflow. In other words, the regular modulo 2 scaling of the n n n n numbers based on the three-moduli set {2 1, 2 , 2 + 1} is not equivalent to modulo 2 n n n n + 1 scaling in the four-moduli set {2 1, 2 , 2 + 1, 2 1}, since the dynamic ranges of these moduli sets are 3n- and (4n + 1)-bit, respectively. Therefore, two-moduli scaling must be used to prevent multiplication overflow for large dynamic range RNS systems. On the other hand, to have a zero-aware scaler, the operands should be compared with the scaling constant before operation to prevent scaling with zero output. However, magnitude comparison is a difficult RNS operation, and its realization increases hardware complexity [4]. Here, we address this problem without using an RNS magnitude compara- tor based on a method for deriving two different scaling outputs from the same circuit. In the proposed work, first, a low-precision scaler based on two moduli is proposed for RNS four-moduli sets. Then, we reuse the hardware architecture of a low-precision scaler for producing high-precision scaled output that can be used when the low-precision scaler generates zero for non-zero operands. It is shown how new Chinese remainder theorem 2 (New CRT-II) can be used to achieve simplified two-moduli scaling for four- moduli sets. The proposed approach (i.e., a double-output scaler with both low- and high-precision outputs) has two main advantages. First, the high-precision output can be applied to addition operands, while the low-precision output can be used for multiplication operations to prevent overflow by considerably reducing the operands’ size. Second, in the case of using a low-precision output for multiplication, if the low-precision output becomes zero, the high-precision output can be used to prevent overflow, resulting in a zero-aware scaling approach. Moreover, derivation of the proposed general approach for two special n n 2n 2n n n 2n large dynamic range four-moduli sets {2 1, 2 + 1, 2 , 2 + 1} and {2 1, 2 + 1, 2 , 2n + 1 2 1} is presented, and its performance is compared with the conventional method. In the rest of the paper, the mathematical formulation and proof of the proposed scaling approach for both general and special four-moduli sets are described in Section 2. Next, Section 3 presents the fully adder-based hardware design of the proposed scalers. Moreover, a performance comparison is presented in Section 4. Finally, Section 5 concludes the paper. 2. Low-Precision Scaling with Two-Moduli Scaling Factor: Mathematical Formulation This section presents the proposed approach to design scalers for RNS using the New CRT-II [11]. In the rest of this section, a brief introduction about the scaling concept and new CRT-II is first described. Then, mathematical formulations of the proposed approach in general form and for two sample special forms will be presented. 2.1. Scaling Concept and CRT-II The scaling of the weighted number X by the constant factor K according to the scaling operator defined in [16] and is as follows: X = SK +jXj (1) K Axioms 2022, 11, 5 3 of 13 This formula shows that any weighted number can be formed as a summation of its remainder with scaling factor K and the multiplication of the scaling result (i.e., S) and K. In other words, S is the integer quotient of dividing X by K, and it can be expressed as follows [5,7]: S = (2) Note that Equations (1) and (2) are based on weighted numbers. However, they should be implemented inside RNS using residues. Therefore, consider the following residue representations for X and S based on the four-moduli set {m , m , m , m }: 1 2 3 4 RNS X ! (x , x , x , x ) (3) 1 2 3 4 RNS S = ! s , s , s , s (4) ( ) 1 2 3 4 where the scale factor m is one of the moduli. Aside from that, also consider s = jSj f or i = 1 . . . 4 (5) Second, consider the RNS number (x , x , x , x ), which can be converted into its 1 2 3 4 corresponding weighted number X using the New CRT-II conversion formulas for the generic four-moduli set {m , m , m , m } as follows [11]: 1 2 3 4 X = Z + m m k Y Z (6) j ( )j 1 2 1 m m 3 4 Z = x + m jk (x x )j (7) 1 1 2 2 1 Y = x + m jk (x x )j (8) 3 3 3 4 3 where the required multiplicative inverses can be achieved by considering the following relations: jk m m j = 1 (9) 1 1 2 m m 3 4 jk m j = 1 (10) 2 1 jk m j = 1 (11) 3 3 Equations (6)–(8) can be rewritten as follows: X = Z + m m T (12) 1 2 Z = x + m H (13) 1 1 Y = x + m P (14) 3 3 where T = jk (Y Z)j (15) m m 3 4 H = jk (x x )j (16) 2 2 1 P = jk (x x )j (17) 3 3 2.2. General Formulations Now, we choose the scaling factor as the product of the first and second modulo (i.e., m m ). Therefore, scaling of X by m m can be performed by considering k = m m and 1 2 1 2 1 2 substituting Equation (6) into Equation (4) as follows: X Z + m m T Z 1 2 S = = = + T (18) K m m m m 1 2 1 2 Axioms 2022, 11, 5 4 of 13 where x is a residue in modulo m and the maximum value of H in Equation (16) is m 1 1 2 1. Therefore, the maximum value of Z in Equation (13) can be computed as follows: Z = m 1 + m (m 1) = m m 1 (19) Max 1 1 2 1 2 It is clear that the floor of the division of Z by m m is zero. Therefore, by consid- Max 1 2 ering this point and taking into account that T is an integer number, Equation (18) can be simplified as follows: Z Z S = + T = + T = T (20) m m m m 2 2 1 1 Now, according to Equation (5), the residues of T based on the moduli should be computed to achieve the residues of the scaled number as follows: s = jS j = jTj f or i = 1 . . . 4 (21) L i L m m i i Therefore, scaling of Z by m m is reduced to T, and the full reverse conversion (i.e., 1 2 full computing of Equation (6)) is not needed. Now, we are going to achieve a single-modulo scaler for the same moduli set, (i.e., {m , m , m , m }) but with the aim of reusing the two-moduli scaler formulas to reduce 1 2 3 4 the overhead. Hence, considering k = m and the main CRT-II formula of Equation (6) in Equation (4) results in X Z + m m T Z 1 2 S = = = + m T (22) H 2 K m m 1 1 Insertion of Equation (13) into Equation (22) leads to x + m H x 1 1 1 S = + m T = + H + m T (23) H 2 2 m m 1 1 Therefore, since x is less than m , and H and T are integer numbers, Equation (23) can 1 1 be simplified as follows: x + m H x 1 1 1 S = + m T = + H + m T = H + m T (24) H 2 2 2 m m 1 1 Now, according to Equation (5), we have s = S = jH + m Tj f or i = 1 . . . 4 (25) H i Single 2 According to the residue arithmetic properties [6], Equation (25) can be rewritten as s = jHj +jm j jTj f or i = 1 . . . 4 (26) H i 2 m m m i i i However, from Equation (21), we know the remainders of T in moduli m are the two-moduli scaling residues. Therefore, we have s = jHj +jm j s f or i = 1 . . . 4 (27) H i 2 L i m m i i Therefore, by using Equation (21), the single-modulo scaling residues can be achieved from the previously computed two-moduli scaling residues with the minimum overhead. It should be mentioned that more simplifications of Equation (21) can be performed using the exact value of the moduli as shown in the next subsections. Axioms 2022, 11, 5 5 of 13 2n 2n n n 2.3. Case Study: Moduli Set {2 + 1, 2 ,2 + 1, 2 1} The moduli set has a 6n-bit dynamic range, and its reverse converters are all de- signed based on the New CRT-I [11]. However, in contrast to the reverse converters of this moduli set, here, we use the New CRT-II to derive efficient two-moduli scaling for- 2n 2n n n mulas. First, consider the moduli set {m , m , m , m } = {2 + 1, 2 , 2 + 1, 2 1}. 1 2 3 4 According to Equation (20), we must compute T in Equation (15), and then its residues are the low-precision scaling residues. First, the following lemma computes the required multiplicative inverses. 2n 1 Lemma 1. The multiplicative inverses required in Equations (15)–(17) are k = 2 , k = 1 and 1 2 n 1 k = 2 . Proof of Lemma 1. Verification can be performed by substituting the values of the multi- plicative increases and moduli in Equations (9)–(11) as follows: 2n 1 2n 2n 2n 1 j2 (2 + 1)2 j = j2 2 1j = 1 (28) 2n 2n 2 1 2 1 2n j1 (2 + 1)j 2n = 1 (29) n 1 n j2 2 + 1 j n = 1 (30) ( ) 2 1 Now, inserting the values of the moduli and multiplicative inverses in Equations (13)–(17) leads to 2n Z = x + (2 + 1)H (31) Y = x + (2 + 1)P (32) H = jx x j (33) 2n 2 1 n 1 P = j2 (x x )j n (34) 4 3 2 1 2n 1 T = j2 (Y Z)j 2n (35) 2 1 Equation (31) can be further simplified by substituting Equations (31) and (32) into it as follows: 2n 1 n 2n T = j2 (x + (2 + 1)P x (2 + 1)H)j 2n (36) 2 1 The following well-known residue arithmetic properties can be used to further simplify Equations (33)–(36). Property 1. 2 v is equal to the P-bit circular left shifting of v if v is represented as a k-bit i i i 2 1 binary number [11]. Property 2. j v j is equal to one’s complement of v (i.e., v ) if v is represented as a k-bit i i i i 2 1 binary number [11]. Property 3. j v j is equal to v + 1 if v is represented as a k-bit binary number [17]. i i i Property 4. j v j is equal to v + 2 if v is represented as a k-bit binary number [17]. i i i 2 +1 2n 2n n n First, according to the moduli set {2 + 1, 2 , 2 + 1, 2 1}, x and x are (2n + 1)- 1 2 and 2n-bit numbers, respectively. Therefore, Equation (33) can be simplified using Property 3 as follows: 2n H = jx x j = jx (x 2 + x )j 2n 2n 2 1 1,2n 2 2,(2n 1)...0 1,(2n 1)...0 2 (37) = jx + x + 1j 2n 2,(2n 1)...0 1,(2n 1)...0 2 Axioms 2022, 11, 5 6 of 13 where x means the j-th bit of the residue x and x and x are (n + 1)- and n-bit numbers, i,j i 4 3 respectively. Therefore, Equation (34) can be rewritten as n 1 n P = 2 (x x 2 x ) (38) 3,n 4 3,(n 1)...0 2 1 where x is a residue in modulo 2 + 1. Therefore, when x is equal to one, the other bits 3 3,n will be surely be zero, and if the n low significant bits (LSBs) of x are not equal to zero, then the most significant bit (MSB) of x (i.e., x ) should be zero [12]. Therefore, by considering 3 3,n this point and Properties 1 and 2, Equation (38) can be simplified as follows: P = jP + P j (39) 2 n 2 1 where n 1 P = j2 x j n = x x x (40) 1 4 2 1 4,0 4,n 1... 4,1 | {z } n bits 01 . . . 11 i f x = 1 > 3,n | {z } n bits P = (41) x x . . . x i f x = 0 > 3,0 3,n 1 3,n 2 3,n | {z } n bits Finally, Equation (35) can be simplified using Properties 1 and 2 as follows: T = jT + T + T + T + T j (42) 2n 1 2 3 4 5 2 1 where 2n 1 T = j2 x j 2n = x 0 . . . 00x . . . x (43) 1 3 3,0 3,n 3,1 2 1 | {z } | {z } n 1 bits n bits 2n 1 n 2n 1 T = 2 (2 + 1)P = 2 (p . . . p p . . . p ) 2n 2 n 1 0 n 1 0 2 1 | {z }| {z } (44) 2n n bits n bits 2 1 = p p . . . p p . . . p 0 n 1 0 n 1 1 | {z }| {z } n bits n 1 bits 2n 1 T = 2 x = x x . . . x (45) 3 1,0 1,2n 1 1,1 1,(2n 1)...0 2n 2 1 | {z } 2n bits 2n 1 2n T = 2 2 x = x 1 . . . 11 (46) 4 1,2n 1,2n 2n 2 1 | {z } 2n bits 2n 1 2n T = j 2 (2 + 1)Hj 2n = j Hj = H . . . H H (47) 5 2n 0 2n 1 1 2 1 2 1 | {z } 2n bits Note that P is an n-bit number, and due to this computation of Equation (44), it became a simple concatenation. Aside from that, the constant coefficient of H in Equation (47) was substituted with 1 since 2n 1 2n 2n 1 jj2 j j2 + 1j j = j2 2j = 1 (48) 2n 2n 2n 2n 2 1 2 1 2 1 2 1 Next, after calculation of T using Equation (42), we must compute the residues of T according to Equation (21) to achieve the two-moduli scaled residues. However, the largest 2n value of T is 2 2, and therefore, it is always less than the first and second moduli. Hence, we have s = T = T (49) j j 2n L 1 2 +1 s = jTj = T (50) 2n L 2 2 Axioms 2022, 11, 5 7 of 13 The third and fourth two-moduli scaled residues can be achieved as follows: s = jTj = jT 2 + T j n L 3 (2n 1)...n (n 1)...0 2 +1 2 +1 = jT T j n (51) 2 +1 (n 1)...0 (2n 1)...(n 1) = jT + T + 2j n 2 +1 (n 1)...0 (2n 1)...(n) s = jTj = jT + T j n (52) L 4 2 1 2 1 (2n 1)...n (n 1)...0 Now, based on Equation (27), we can also achieve single-modulo scaling formulas from the two-moduli scaling residues as follows: 2n s = jjHj + 2 jTj j 2n 2n 2n 2n H 1 2 +1 2 +1 2 +1 2 +1 (53) 2n = jjHj + 2 s j 2n 2n 2n L 1 2 +1 2 +1 2 +1 We can simplify Equation (27) by substituting Equation (49) into it and considering 2n that the maximum value of H in (37) is 2 1. Therefore, we have s = jH Tj = H + T + 2 (54) 2n 2n H 1 2 +1 2 +1 Similarly, for other residues, we have 2n s = jjHj 2n +j2 j 2njTj 2nj 2n = H (55) H 2 2 2 2 2 2n s = jjHj n +j2 j n jTj n j n H 3 2 +1 2 +1 2 +1 2 +1 = H 2 + H + s L 3 (2n 1)...n (n 1)...0 (56) 2 +1 2 +1 = H + H + 2 + s L 3 (2n 1)...n (n 1)...0 2 +1 2n s = jjHj n + 2 jTj n j H 4 n 2 1 2 1 2 1 2 1 (57) = jH + H + s j n (2n 1)...n (n 1)...0 L 4 2 1 Therefore, we can compute H and T from Equations (37) and (42) just one time and then using them several times to compute both the single- and two-moduli scaling. n n 2n 2n + 1 2.4. Case Study: Moduli Set {2 1, 2 + 1, 2 , 2 1} n n 2n 2n 2n This moduli set has the same moduli as {2 1, 2 + 1, 2 , 2 + 1} except for 2 + 1 2n + 1 which is substituted with 2 1. Due to this, it can lead to a faster RNS arithmetic unit. However, its reverse converter will be more complex. The overall process of designing the n n 2n 2n scaler for this moduli set is relatively the same as for the moduli set {2 1, 2 + 1, 2 , 2 + 1} described in the previous subsection. 2n + 1 2n n n First, consider the moduli order {m , m , m , m } = {2 1, 2 , 2 + 1, 2 1}. 1 2 3 4 Then, according to Equations (15)–(17), the multiplicative inverses can be computed as k n 1 = k = 1, and k = 2 (the proof is straightforward and similar to Lemma 1). Therefore, 2 3 Equation (15) is a key formula in the scaling that can be calculated as follows: T = jY Zj 2n 2 1 n 2n+1 = x + (2 + 1)P x 2 1 H (58) 3 2n 2 1 = jT + T + T + T + T j 2n 1 2 3 4 5 2 1 where T = jx j = 0 . . . 00 x . . . x (59) 2n 1 3 3,n 1 3,0 2 1 | {z } | {z } n bits n bits T = 2 + 1 P = p . . . p p . . . p (60) j( ) j 2n 2 n 1 0 n 1 0 2 1 | {z }| {z } n bits n bits Axioms 2022, 11, 5 8 of 13 Axioms 2022, 10, x FOR PEER REVIEW 8 of 13 T = x = x . . . x (61) 3 1,(2n 1)...0 1,2n 1 1,0 2n 2 1 | {z } 𝑇 = − 𝑥 = 𝑥 … 𝑥 ,( )… ,2 , (61) 2n bits 2n T = 2 x = 1 . . . 11x (62) 4 1,2n 1,2n 2n 𝑇 = − 2 × 𝑥 = 1 … 11𝑥 2 1 | {z } , ,2 (62) 2n bits 2n+1 T = j (2 1)Hj = j Hj = H . . . H H (63) 2n 2n 5 2n 1 1 0 2 1 2 1 𝑇 = | − (2 − 1)𝐻 | = | − 𝐻 | = 𝐻 … 𝐻 𝐻 | {z } (63) 2n bits 2n + 1 2n n n 2n + 1 2n n n The The two-moduli two-moduli scaled scaled r residue esidue formulas formulas for for the the moduli moduli set set {2 {2 −