Finding Normal Binary Floating-Point Factors Efficiently

Solving the floating-point equation x ⊗ y = z, where x, y and z belong to floating-point intervals, is a common task in automated reasoning for which no efficient algorithm is known in general. We show that it can be solved by computing a constant number of floating-point factors, and give a fast algorithm for computing successive normal floating-point factors of normal floating-point numbers in radix 2. This leads to an efficient procedure for solving the given equation, running in time of the same order as floating-point multiplication.

Keywords: Floating-point equation solving · Interval multiplication · Decision procedures

1 Introduction

Floating-point arithmetic has long been the prevailing system used for numerical computing, allowing efficient and precise calculation on modern hardware. Whilst it closely approximates real arithmetic, it does not always satisfy its laws. These sporadic exceptions often confuse programmers and complicate mathematical reasoning to the point that deciding even the simplest statements can take excessive amounts of time. Our interest stems from the area of formal software verification, where automated reasoning is used to prove that programs satisfy their specifications. Although modern verification systems can reason about floating-point arithmetic, it has much less support than integer or bit-vector arithmetic. This makes proving correctness far more difficult and drastically limits practical applications of formal methods. Our work seeks to fill these gaps.

Denoting floating-point multiplication by ⊗, we study the following problem:

Problem 1 Let X, Y, Z be intervals over the floating-point numbers and let V = {(x, y, z) | x ∈ X, y ∈ Y, z ∈ Z, x ⊗ y = z}. What is the smallest box (i.e. Cartesian product of intervals) enclosing V?

Mak Andrlon (mnazecic@student.unimelb.edu.au)
School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
We can give a simplified example without getting into the details of floating-point arithmetic:

Example 1.1 For the sake of exposition, we will use decimal numbers with two decimal places, rounding downward the results of arithmetic operations. For example, since 0.50 · 1.01 = 0.505, we have 0.50 ⊗ 1.01 = 0.50, where ⊗ is rounded multiplication. Let x and y be two-place decimal numbers, and suppose x ⊗ y = 5.00 where 2.20 ≤ x ≤ 2.50 and 1.00 ≤ y ≤ 2.50. These bounds are clearly quite loose. For instance, the lower bound on y is too small: if x = 2.50 and y = 1.00, then x ⊗ y = 2.50, which is much less than the desired 5.00.

First, we construct an equivalent condition that uses real multiplication instead of floating-point multiplication. This can be accomplished with a simple transformation. Since the numbers that round downward to 5.00 are exactly those in [5.00, 5.01), it follows that x ⊗ y = 5.00 is equivalent to 5.00 ≤ xy < 5.01. By getting rid of the ⊗ operator, we can now reason using real arithmetic:

1. Multiplying x and y and combining their bounds, we get 2.20 ≤ xy ≤ 6.25.
2. This is worse than our existing bounds on xy, so we cannot improve them.
3. Dividing the bounds on xy by y, we obtain 5.00/2.50 ≤ x < 5.01/1.00.
4. This also does not improve our existing bounds on x.
5. Dividing by x instead, we obtain 5.00/2.50 ≤ y < 5.01/2.20.
6. We can thus improve our bounds on y to 2.00 ≤ y < 2.27727272… (the digits 72 repeating).

If we were allowed the full range of the reals, we would now be finished, stating that 5.00 ≤ xy < 5.01 where 2.20 ≤ x ≤ 2.50 and 2.00 ≤ y < 2.27727272… We can easily show that these bounds are tight for real multiplication:

• If x = 2.20 and y < 2.27727272…, then xy < 5.01. If y ≥ 2.27727272… instead, then xy ≥ 5.01.
• If x = 2.50 and y = 2.00, then xy = 5.00.

Having used each of the bounds on x, y and xy, we know that this set of bounds is tight on its own. However, we also require that x and y are decimal numbers with two decimal places.
Thus, since y < 2.27727272…, we can further conclude that y ≤ 2.27. Since the bounds all have two decimal places now, we would ideally like to be done. If we check, though, we find that the bounds are no longer tight: if x = 2.20 and y = 2.27, then xy = 4.994, which is too small, and therefore x ≥ 2.21. We can continue tightening even further. If x = 2.21 instead and y = 2.27, then xy = 5.0167, which is too large. Therefore y ≤ 2.26, and so on. From this point, division does not help us as much as it did earlier. Since the product of the bounds on x and y is so close to the optimal result, all of the quotients differ by less than 0.01, and so the procedure is equivalent to an exhaustive search. After a few more iterations, we eventually reach the optimal solution: 2.33 ≤ x ≤ 2.50 and 2.00 ≤ y ≤ 2.15. This clearly satisfies our bounds, since 2.33 · 2.15 = 5.0095 ∈ [5.00, 5.01). However, proving optimality is tedious without further results, requiring us to iterate the above argument for every decimal until we reach the optimum.

The toy example above illustrates the limits of current floating-point interval reasoning. As demonstrated, the analogous problem over the reals is straightforward, and the steps in the example suggest a general algorithm. If we were working with the much larger IEEE 754 64-bit floating-point numbers instead, we could easily be forced to iterate over millions of values: for a concrete example, take 12738103310254127 ≤ x ≤ 12738103379848963, y unrestricted and z = 2^53 − 1 under rounding to nearest. Although x ⊗ y = z has no solution under these conditions, it will take 69,594,839 steps to prove this using the traditional interval-based approach. In general, the traditional algorithm struggles to prove unsatisfiability if the real relaxation of the problem is solvable, since the optimal intervals for an unsatisfiable system are empty.
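Since the toy format of Example 1.1 is tiny, the claimed optimum can be confirmed by exhaustive search. The sketch below is our own check (Python), working in integer hundredths so that products are exact and no binary rounding interferes:

```python
# Exhaustive check of Example 1.1: two-place decimals, round-down multiplication.
# Scaling to integer hundredths makes products exact integer ten-thousandths.
# x ⊗ y = 5.00 means 5.00 <= x*y < 5.01, i.e. 50000 <= X*Y < 50100 when scaled.

solutions = [
    (X, Y)
    for X in range(220, 251)      # 2.20 <= x <= 2.50
    for Y in range(100, 251)      # 1.00 <= y <= 2.50
    if 50000 <= X * Y < 50100     # x*y rounds down to exactly 5.00
]

xs = [X for X, _ in solutions]
ys = [Y for _, Y in solutions]
print(min(xs) / 100, max(xs) / 100)  # 2.33 2.5
print(min(ys) / 100, max(ys) / 100)  # 2.0 2.15
```

The search confirms the smallest enclosing box: 2.33 ≤ x ≤ 2.50 and 2.00 ≤ y ≤ 2.15.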
By contrast, the method we present in this paper arrives at the answer immediately.

We consider the following to be the primary contributions of this paper:

• An algorithm to efficiently compute the next normal floating-point factor of any floating-point number in radix 2.
  – This algorithm is also partially applicable in higher bases. We do not assume a particular radix in any of our results.
• When the above algorithm is applicable, an efficient procedure for finding optimal interval bounds on variables of floating-point constraints of the form x ⊗ y = z.

In addition, we believe the following secondary contributions are of interest:

• A demonstration that "round-trip" rounding errors have useful, nontrivial piecewise polynomial extensions to the reals.
• A characterization of upward and downward floating-point rounding in terms of the remainder of floored division.
• A floating-point predicate for deciding, in any base, whether a number is a floating-point factor of another floating-point number.
• Results on solving certain kinds of interval constraints over integer remainders.

1.1 Related Work

Although floating-point arithmetic has existed for a long time, work on solving floating-point equations and inequalities has been sporadic and has mostly been concerned with binary floating-point arithmetic. Michel [21] developed partial solutions for floating-point equations of the form x ◦ y = z where ◦ is one of rounded addition, subtraction, multiplication, or division, as well as a complete solution for rounded square root. His work contains an early description of the classical interval-based algorithm, as it applies to the floating-point numbers. As an early example, only normal, binary floating-point numbers and the standard rounding modes were considered. Ziv et al. [30] showed that, under mild conditions, if z can take at least 2 different values, then the equation x ⊗ y = z has at least one solution in binary floating-point arithmetic.
However, their work does not treat other bases, nor the case where z is fixed. Bagnara et al. [2] gave some extremal bounds on x and y in terms of the bounds on z for binary floating-point arithmetic. However, the resulting bounds are very weak, since they do not account for any of the bounds on x or y. At present, there are no known efficient algorithms for computing the solution set of x ⊗ y = z in general. Indeed, in terms of similar efforts, the only basic operation which has been solved (apart from square root) is addition [1, 12]. Note that the classical algorithm is optimal for any strictly increasing (or decreasing) single-variable real function.

1.2 Outline

In Sect. 2, we introduce our notation and develop the formalisms used in this paper. We also give an account of the basic properties of floating-point arithmetic that we will need, and precisely define the classical interval-based algorithm. Due to the substantial amount of notation, a glossary is supplied for reference in Appendix B. In Sect. 3, we relate optimal intervals, the division-based algorithm and floating-point factors, and simplify the problem by narrowing down the conditions under which the division-based algorithm does not produce an optimal result. In Sect. 4, we then develop conditions for numbers being factors based on the rounding error in the product. Then, in Sect. 5, we show that we can control that rounding error under certain conditions and thus directly find floating-point factors. Combining these in Sect. 6, we give an algorithm to compute such factors efficiently and thus solve the problem. Appendix A contains some ancillary proofs used in the results of Sect. 6. In Sect. 7, we conclude and discuss open problems.

2 Preliminaries

Following convention, we denote the integers by Z, the reals by R, and the extended reals by R̄ = R ∪ {+∞, −∞}; the same sets excluding zero are denoted by Z*, R* and R̄*. The power set of a set X is denoted by P(X).
We write f[X] for the image of X under a function f, and f⁻¹[X] for the preimage of X under f. To simplify exposition, we will denote the images of X × Y and X × {y} under a binary operator ◦ simply by X ◦ Y and X ◦ y, respectively. We will use such notation most frequently in the context of intervals.

We also discuss the size of intervals in terms of their width on the extended real line. To this end, we define a notion of diameter for subsets of the extended reals:

Definition 2.1 (Set diameter) For any X ⊆ R̄, we define

  diam X = sup({d(x, y) | x, y ∈ X} ∪ {0}),

where d is the function such that d(x, y) = 0 if x = y, and d(x, y) = |x − y| otherwise.

The diameter satisfies the following properties for all sets X ⊆ R̄:

  diam ∅ = 0  (diameter of empty set) (1)
  diam {x} = 0 for all x ∈ R̄  (diameter of singleton) (2)
  if inf X < sup X, then diam X = sup X − inf X  (3)
  diam aX = |a| diam X for all a ∈ R*  (absolute homogeneity) (4)
  diam(X + b) = diam X for all b ∈ R  (translation invariance) (5)

Importantly, the diameter of an interval is equal to the distance between its endpoints:

Lemma 2.2 For all a, b ∈ R̄, if a < b, then diam [a, b] = diam (a, b] = diam [a, b) = diam (a, b) = b − a.

Additionally, if two intervals overlap, then the larger one contains at least one of the endpoints of the smaller one:

Lemma 2.3 Let a, b ∈ R̄ and let I ⊆ R̄ be an interval. If [a, b] ∩ I is nonempty and diam I > b − a, then either a ∈ I or b ∈ I.

Since the subject of this work is solving equations similar to xy = z, we will often need to use division. However, division alone does not provide what we need. For instance, if y = z = 0, then any finite and nonzero x satisfies xy = z, even though z/y = 0/0 is undefined. In order to succinctly describe these solution sets, we introduce a special "division" operator:

Definition 2.4 For any Y, Z ⊆ R̄, we define

  Z“/”Y = {x ∈ R̄ | ∃y ∈ Y (xy ∈ Z)}.
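For finite sets, Definition 2.1 can be transcribed almost literally. The sketch below is our own illustration (Python, using math.inf for ±∞; the names d and diam mirror the definition, and max stands in for sup since the sets are finite):

```python
import math

def d(x, y):
    """Distance on the extended reals: 0 for equal points (even +/-inf),
    |x - y| otherwise, as in Definition 2.1."""
    return 0.0 if x == y else abs(x - y)

def diam(X):
    """Diameter of a finite set X of extended reals: sup of pairwise
    distances, with 0 adjoined so the empty set has diameter 0."""
    return max([d(x, y) for x in X for y in X] + [0.0])

print(diam(set()))               # 0.0: diameter of the empty set, property (1)
print(diam({math.inf}))          # 0.0: a singleton, property (2); d(inf, inf) = 0
print(diam({1.0, 2.5, 4.0}))     # 3.0: sup X - inf X, property (3)
print(diam({-math.inf, 0.0}))    # inf: an unbounded set has infinite diameter
```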
The following properties then hold for all Y, Z ⊆ R̄ and a ∈ R*:

  aZ“/”Y = Z“/”(Y/a), (6)
  Z“/”(Y ∩ R*) = Z/(Y ∩ R*), (7)
  Z“/”0 = R̄ if 0 ∈ Z, and ∅ otherwise, (8)
  Z“/”+∞ = (0, +∞] if +∞ ∈ Z and −∞ ∉ Z;
       [−∞, 0) if +∞ ∉ Z and −∞ ∈ Z;
       R̄* if +∞ ∈ Z and −∞ ∈ Z;
       ∅ otherwise. (9)

Furthermore, for all Y, Y′, Z, Z′ ⊆ R̄, we also have the following properties:

  Z“/”(Y ∪ Y′) = (Z“/”Y) ∪ (Z“/”Y′), (10)
  (Z ∪ Z′)“/”Y = (Z“/”Y) ∪ (Z′“/”Y). (11)

Remark Note that, since Y = (Y ∩ R*) ∪ (Y \ R*), we have

  Z“/”Y = (Z/(Y ∩ R*)) ∪ (Z“/”(Y \ R*)).

It is also worth noting that Y \ R* = Y ∩ {0, −∞, +∞}. This allows us to compute Z“/”Y as the union of an ordinary division and at most three special cases as described above.

We also define the box operator as follows:

Definition 2.5 (Box closure) A box is any Cartesian product of (extended) real intervals. For any set X ⊆ R̄ⁿ, we define its box closure X̄ as the unique smallest n-dimensional box containing X as a subset. For any sets X, Y ⊆ R̄ⁿ we have:

  X ⊆ X̄  (extensivity) (12)
  if X ⊆ Y, then X̄ ⊆ Ȳ  (monotonicity) (13)
  the box closure of X̄ is X̄ itself  (idempotence) (14)

Existence and uniqueness follow since every nonempty subset of R̄ is bounded and thus has both a supremum and infimum.

Furthermore, for all subsets A₁, …, Aₙ of R̄, where the product is the Cartesian product, we have the following property:

  the box closure of A₁ × ⋯ × Aₙ equals Ā₁ × ⋯ × Āₙ  (distributivity) (15)

That is, the box of a product of sets is exactly the product of the boxes of those sets.

Remark A 1-dimensional box is an interval, a 2-dimensional box is a rectangle, a 3-dimensional box is a cuboid, and so on. Also note that a box does not necessarily contain its boundary: an open interval is still an interval, and therefore a box. However, if X is a finite subset of R̄, then X̄ = [min X, max X]. In general, if X is finite (or otherwise contains its boundary), then X̄ is a Cartesian product of closed intervals.
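The special cases (8) and (9) for the “/” operator amount to a small case analysis on which of 0, +∞ and −∞ belong to Z. The sketch below is our own illustration of that case analysis (Python; the result sets are encoded as descriptive strings purely for readability, and the function names are ours):

```python
import math

def div_by_zero(Z):
    """Z "/" 0, property (8): x*0 = 0 for every extended real x, so every x
    qualifies when 0 is in Z, and none otherwise."""
    return "R-bar" if 0 in Z else "empty"

def div_by_pos_inf(Z):
    """Z "/" +inf, property (9): x*(+inf) is +inf for x > 0, -inf for x < 0,
    and undefined for x = 0, so the sign of x decides membership."""
    has_pos, has_neg = math.inf in Z, -math.inf in Z
    if has_pos and has_neg:
        return "R-bar minus 0"   # every nonzero extended real
    if has_pos:
        return "(0, +inf]"
    if has_neg:
        return "[-inf, 0)"
    return "empty"

print(div_by_zero({0.0, 1.0}))      # R-bar
print(div_by_pos_inf({math.inf}))   # (0, +inf]
```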
2.1 Floating-Point Arithmetic

Much work has been done on specifying floating-point arithmetic in various formal systems [3–6, 8–11, 14–20, 22, 23, 25, 26, 29]. However, to the best of the author's knowledge, there is no well-established and precise informal language of floating-point theory. For the sake of completeness and rigor, the rest of this section will be devoted to defining our terms. Readers interested in a deeper introduction may wish to consult a reference such as Muller et al. [24] or Goldberg [13].

2.1.1 Floating-Point Numbers, Briefly

A finite floating-point number is usually written in the style of scientific notation as ±d_1.d_2d_3⋯d_p × β^e, where each digit d_i lies in {0, …, β − 1}. The ±d_1.d_2d_3⋯d_p part is the significand, the number of digits p is the precision, β is the base (commonly 2 or 10), and e is the exponent. For example, with precision p = 3 and base β = 2, we can write the decimal number 0.75 as either 1.10 × 2^−1 or 0.11 × 2^0. Of the two representations, the former is preferred, as it is normalized (that is, it has no leading zeros) and is therefore unique.

With a higher precision we can represent more numbers. For instance, we need p ≥ 4 to represent the binary number 1.101 exactly. Regardless of precision, however, there are still limits imposed by the choice of base. Just as 1/3 lacks a finite decimal expansion, for any given base there are numbers that cannot be represented. Notably, the decimal 0.1 is 0.000110011001100… in binary (the block 1100 repeating), and thus it has no representation in any binary floating-point system. As an example, if we take β = 2 and p = 3, then 1.10 × 2^−4 and 1.11 × 2^−4 are consecutive floating-point numbers, but they correspond to the decimal numbers 0.09375 and 0.109375, respectively. Consequently, the sum, product or ratio of two precision-p base-β floating-point numbers is rarely itself a precision-p base-β floating-point number, and thus computing with them typically involves accepting some degree of rounding error.
In ordinary usage, we minimize rounding error by rounding to nearest; that is, by choosing the floating-point number closest to the real number we are rounding. Other rounding modes such as rounding upward, downward, or towards zero (also known as truncation) are typically only used in situations where the direction of the rounding is more important than the accuracy of a single operation. For example, the simplest way to avoid underapproximating the set of true results in interval arithmetic is to always round upper bounds upward and lower bounds downward.

2.1.2 Definitions and Terminology

The arithmetic described in this work is compatible with and generalizes that of the IEEE 754 standard, with two exceptions: we omit signed zeros and "NaN" values, because they are essentially irrelevant to the problem but add clutter. Appendix B contains a summary of the definitions introduced here.

We begin with a construction of the floating-point numbers.

Definition 2.6 (Floating-point numbers) A floating-point format is a quadruple of integers (β, p, e_min, e_max) such that β ≥ 2, p ≥ 1, and e_min ≤ e_max, where β is the base, p is the precision, e_min is the minimum exponent, and e_max is the maximum exponent. For convenience, we also define the minimum quantum exponent q_min = e_min − p + 1 and the maximum quantum exponent q_max = e_max − p + 1. Given these quantities, we define the set F* of finite nonzero floating-point numbers as follows:

  F* = {M · β^(e−p+1) | M, e ∈ Z, 0 < |M| < β^p, e_min ≤ e ≤ e_max}.

From here, we define the set F of finite floating-point numbers and the set F̄ of floating-point numbers:

  F = F* ∪ {0},  F̄ = F ∪ {−∞, +∞}.

To avoid repetition, from this point forward we will assume that the sets of floating-point numbers correspond to some floating-point format (β, p, e_min, e_max).

From the above definition, we immediately obtain the following:

Lemma 2.7 F*, F and F̄ are each closed under negation.
Furthermore, the largest finite floating-point number and the smallest positive floating-point number have simple expressions:

  max F = (β^p − 1)β^(q_max), (16)
  min(F̄ ∩ (0, +∞]) = β^(q_min). (17)

We further categorize the elements of F* as follows:

Definition 2.8 (Normal and subnormal numbers) A nonzero finite floating-point number x ∈ F* is normal if |x| ≥ β^(e_min), or subnormal otherwise.

The normal numbers are so called because they each have a normalized representation in scientific notation, restricting the exponent to {e_min, …, e_max}. By contrast, a subnormal number can only be written without leading zeros by choosing an exponent below e_min. Note that we do not consider zero to be normal or subnormal.

A number written in scientific notation has an exponent and significand. Similarly, each finite floating-point number has an associated exponent and significand determined by the floating-point format. We will now define these concepts and generalize them to the finite reals.

(Although positive and negative zero correspond to the same real value, they are distinct in IEEE 754, since 1/+0 = +∞ and 1/−0 = −∞. However, the “/” operator allows us to sidestep the issue of division by zero. In IEEE 754 parlance, the "quantum" of a floating-point number is the value of a unit in the last digit of its significand, often referred to as the "unit in last place". This is necessarily an integer power of β, and the quantum exponent is the exponent of that power.)

Definition 2.9 (Exponent and significand) The functions E : R → Z and Q : R → Z are defined as follows:

  E(x) = floor(log_β |x|) if |x| ≥ β^(e_min), and E(x) = e_min if |x| < β^(e_min);
  Q(x) = E(x) − p + 1.

Given x ∈ R, the integers E(x) and Q(x) are the exponent and quantum exponent of x, respectively. The significand of x is the number xβ^(−E(x)). If xβ^(−Q(x)) is an integer, then it is the integral significand [24] of x.
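The construction of F* and the functions E and Q transcribe directly into code for a small format. The sketch below is our own illustration (Python; the toy format (β, p, e_min, e_max) = (2, 3, −2, 2) is an assumption made purely for the example). It spot-checks formulas (16) and (17) and computes the exponent and integral significand of 0.75 = 1.10 × 2^−1:

```python
import math

BETA, P, EMIN, EMAX = 2, 3, -2, 2        # illustrative toy format
QMIN, QMAX = EMIN - P + 1, EMAX - P + 1  # quantum exponent range

# F* = { M * beta^(e-p+1) : 0 < |M| < beta^p, emin <= e <= emax }
F_star = {
    M * BETA ** (e - P + 1)
    for M in range(-(BETA ** P) + 1, BETA ** P) if M != 0
    for e in range(EMIN, EMAX + 1)
}

def E(x):
    """Exponent function of Definition 2.9, clamped below at EMIN."""
    if abs(x) >= BETA ** EMIN:
        return math.floor(math.log(abs(x), BETA))
    return EMIN

def Q(x):
    """Quantum exponent: E(x) - p + 1."""
    return E(x) - P + 1

# Formulas (16) and (17): largest finite and smallest positive elements.
print(max(F_star) == (BETA ** P - 1) * BETA ** QMAX)    # True
print(min(x for x in F_star if x > 0) == BETA ** QMIN)  # True

# 0.75 = 1.10 x 2^-1: exponent -1, quantum exponent -3, integral significand 6.
print(E(0.75), Q(0.75), 0.75 * BETA ** -Q(0.75))        # -1 -3 6.0
```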
To justify these names, we establish that these functions indeed compute the exponent of a number written as a product of a significand and a power of β, respecting the minimum exponent:

Lemma 2.10 (Exponent law) Let m ∈ R and e ∈ Z such that |m| < β and e ≥ e_min. If either |m| ≥ 1 or e = e_min, then E(mβ^e) = e. Otherwise, E(mβ^e) < e.

Proof First, note that if |mβ^e| ≥ β^(e_min), then

  E(mβ^e) = floor(log_β |mβ^e|) = floor(log_β |m| + e) = floor(log_β |m|) + e.

We now proceed by cases:

• Suppose |m| ≥ 1. Then |mβ^e| ≥ β^e ≥ β^(e_min) and floor(log_β |m|) = 0, and thus E(mβ^e) = e.
• Suppose e = e_min and |m| < 1. Then |mβ^e| < β^(e_min), and so we have E(mβ^e) = e_min = e immediately by the definition of E.
• Suppose e > e_min and |m| < 1. If |mβ^e| ≥ β^(e_min), then floor(log_β |m|) < 0 and hence E(mβ^e) = floor(log_β |m|) + e < e. If |mβ^e| < β^(e_min) instead, then E(mβ^e) = e_min < e.

Since the result holds in all cases, we are finished. □

Expanding on the above, the function E also satisfies the following properties for all x, y ∈ R:

  E(x) = E(−x)  (evenness) (18)
  if |x| ≤ |y|, then E(x) ≤ E(y)  (piecewise monotonicity) (19)
  if x ∈ F, then e_min ≤ E(x) ≤ e_max  (exponent bound) (20)
  |xβ^(−E(x))| < β  (significand bound) (21)
  1 ≤ |xβ^(−E(x))| if and only if β^(e_min) ≤ |x|  (normality) (22)

From the above properties, the definition of Q immediately implies the following:

  Q(x) = Q(−x)  (evenness) (23)
  if |x| ≤ |y|, then Q(x) ≤ Q(y)  (piecewise monotonicity) (24)
  if x ∈ F, then q_min ≤ Q(x) ≤ q_max  (quantum exponent bound) (25)
  |xβ^(−Q(x))| < β^p  (integral significand bound) (26)
  β^(p−1) ≤ |xβ^(−Q(x))| iff β^(e_min) ≤ |x|  (normality) (27)

Additionally, the following fact establishes an important connection between integral significands and the finite floating-point numbers:

  if x ∈ F, then xβ^(−Q(x)) ∈ Z  (integrality) (28)

In other words, every finite floating-point number has an integral significand.
Definition 2.11 (Unit in first/last place [27]) The functions ufp, ulp : R → R are defined as follows:

  ufp(x) = β^(floor(log_β |x|)) if x ≠ 0, and ufp(x) = 0 if x = 0;
  ulp(x) = ufp(x)β^(1−p).

For any x ∈ R, we refer to ufp(x) as the unit in first place of x, and to ulp(x) as the unit in last place of x. For all x, y ∈ R and n ∈ Z, the following statements hold:

  ufp(x) ≤ |x|  (lower bound) (29)
  if x ≠ 0, then ufp(x) ≤ |x| < β ufp(x)  (bounds) (30)
  ufp(x) = |x| iff x = 0 or |x| = β^k for some k ∈ Z  (exactness) (31)
  ufp(xβ^n) = ufp(x)β^n  (scaling) (32)
  ufp(x) ufp(y) = ufp(x ufp(y))  (product law) (33)
  if y ≠ 0, then ufp(x)/ufp(y) = ufp(x/ufp(y))  (quotient law) (34)

There is also a close relationship between ufp and E. In particular, the base-β logarithm of the unit in first place is the exponent (as it would be if exponents were unbounded):

  if |x| ≥ β^(e_min), then ufp(x) = β^(E(x)), (35)
  if |x| < β^(e_min), then ufp(x) < β^(E(x)). (36)

The following result allows us to further simplify computing the unit in first place of a quotient by separating out the significands:

Lemma 2.12 For all x, y ∈ R*, we have

  ufp(x/y) = ufp(m_x/m_y) · ufp(x)/ufp(y),

where m_x = x/ufp(x), m_y = y/ufp(y), and

  ufp(m_x/m_y) = 1/β if |m_x| < |m_y|, and 1 otherwise.

In a similar vein, we have the following lemma on the exponents of quotients:

Lemma 2.13 Let x, z ∈ R* and let m_x = xβ^(−E(x)) and m_z = zβ^(−E(z)). If β^(e_min) ≤ |z/x|, then

  E(z/x) = E(z) − E(x) + log_β ufp(m_z/m_x).

(Note that these definitions are independent of the exponent range. In particular, readers should be aware that they do not treat subnormal numbers specially, unlike the Q and E functions. In effect, they assume unbounded exponents.)

Proof Suppose β^(e_min) ≤ |z/x|. Then,

  β^(E(z/x)) = ufp(z/x) = ufp((m_z β^(E(z)))/(m_x β^(E(x)))) = ufp(m_z/m_x) β^(E(z)−E(x)).

Taking logarithms, we obtain the desired result. □
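In radix 2, ufp and ulp can be computed exactly with math.frexp, which sidesteps the rounding pitfalls of floating-point logarithms. The sketch below is our own illustration (Python; β = 2 and precision p = 3 are assumptions made for the example):

```python
import math

BETA, P = 2, 3  # illustrative: radix 2, precision 3

def ufp(x):
    """Unit in first place (Definition 2.11): 2^floor(log2 |x|) for x != 0.
    frexp(x) = (m, k) with 0.5 <= |m| < 1 and x = m * 2^k, so
    floor(log2 |x|) = k - 1 exactly, with no logarithm rounding."""
    if x == 0:
        return 0.0
    _, k = math.frexp(x)
    return 2.0 ** (k - 1)

def ulp(x):
    """Unit in last place: ufp(x) * beta^(1-p)."""
    return ufp(x) * BETA ** (1 - P)

x = 0.75
print(ufp(x))                             # 0.5
print(ulp(x))                             # 0.125
print(ufp(x) <= abs(x) < BETA * ufp(x))   # True: property (30)
print(ufp(8.0) == 8.0)                    # True: |x| = 2^k, property (31)
```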
Having defined the basic structure of F̄, we now turn to the definition of the computational operations involving the floating-point numbers.

Definition 2.14 (Rounding) A rounding function is any function f : R̄ → F̄. The downward rounding function RD : R̄ → F̄ and the upward rounding function RU : R̄ → F̄ are defined as follows:

  RD(x) = max {y ∈ F̄ | y ≤ x},
  RU(x) = min {y ∈ F̄ | y ≥ x}.

A rounding function f is faithful if and only if f(x) ∈ {RD(x), RU(x)} for all x ∈ R̄.

It is easy to show that both RD and RU are nondecreasing functions. In addition, they satisfy the following properties for all x ∈ R̄:

  RD(x) ≤ x ≤ RU(x)  (lower/upper bound) (37)
  RU(−x) = −RD(x)  (duality) (38)
  x ∈ F̄ if and only if RU(x) = RD(x)  (exactness) (39)

In addition, the following gap property holds for all x ∈ R̄:

  RU(x) − RD(x) = ∞ if |x| > max F; 0 if x ∈ F̄; β^(Q(x)) otherwise. (40)

We also give a partial definition of rounding to nearest, which is the default rounding mode used in practice:

Definition 2.15 (Rounding to nearest) A rounding function f rounds to nearest if |f(x) − x| = min {|y − x| : y ∈ F} for all x ∈ [min F, max F].

We denote by RN an arbitrary nondecreasing and faithful rounding to nearest. Note that this is not a complete definition! In particular, it does not give a tie-breaking rule for when x is the exact midpoint between two floating-point numbers. It also does not fully specify how to round values outside [min F, max F]. We leave these unspecified since IEEE 754 specifies two different round-to-nearest modes with different rules in these cases. In particular, even though +∞ and −∞ are infinitely far apart from any finite number, according to IEEE 754, they may still result from rounding finite values to nearest. More specifically, it states that the rounding to nearest of any x such that |x| ≥ β^(e_max)(β − β^(1−p)/2) is the infinity with the same sign as x. Note that this bound is the exact midpoint of max F and β^(e_max+1).

(Nondecreasing functions are also known as monotonic, (weakly) increasing, (weakly) order-preserving, or isotone. The two tie-breaking rules used in IEEE 754 are "ties to even" (choose the result with an even integral significand) and "ties away from zero" (choose the larger-magnitude result). The former is the default mode and must be supported; the latter need only be available for decimal systems. Additionally, a "ties to zero" (truncation) rule is used for the "augmented" addition, subtraction and multiplication operations added in IEEE 754-2019 [28], which return the rounding error alongside the principal result.)

We shall denote by fl an arbitrary nondecreasing faithful rounding function. This condition is satisfied by most practical rounding modes, including RD, RU, and rounding to nearest as defined above. The following properties hold for all x ∈ R̄:

  x ∈ F̄ if and only if fl(x) = x  (exactness) (41)
  fl(fl(x)) = fl(x)  (idempotence) (42)

Throughout this paper, we will frequently use the preimages of floating-point intervals under rounding. We will often rely on the following lemma:

Lemma 2.16 Let X ⊆ F̄. If X is a nonempty floating-point interval, then fl⁻¹[X] is an extended real interval such that fl⁻¹[X] ⊇ [min X, max X].

Proof Suppose X is a nonempty floating-point interval, and suppose to the contrary that fl⁻¹[X] is not an interval. Then there are some a, b ∈ fl⁻¹[X] and y ∈ R̄ such that a < y < b and y ∉ fl⁻¹[X]. Since fl is nondecreasing, it follows that fl(a) ≤ fl(y) ≤ fl(b). However, since fl(a), fl(b) ∈ X where X is a floating-point interval, we thus have fl(y) ∈ X. Contradiction! Hence fl⁻¹[X] is an interval.

Now, since fl is the identity over the floating-point numbers, fl(x) = x for all x ∈ X and hence fl⁻¹[X] ⊇ X. Since F̄ is finite, X is also finite, and hence [min X, max X] is the smallest interval containing X as a subset. Therefore, since fl⁻¹[X] is an interval, it follows that fl⁻¹[X] ⊇ [min X, max X]. □
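For a small format, RD and RU can be computed by brute force directly from Definition 2.14, which also lets us spot-check the duality and exactness properties. The sketch below is our own illustration (Python; the toy format (β, p, e_min, e_max) = (2, 3, −2, 2) is assumed for the example):

```python
import math

BETA, P, EMIN, EMAX = 2, 3, -2, 2  # illustrative toy format

# F-bar: the finite nonzero numbers, zero, and the two infinities.
F_bar = sorted(
    {M * BETA ** (e - P + 1)
     for M in range(-(BETA ** P) + 1, BETA ** P) if M != 0
     for e in range(EMIN, EMAX + 1)}
    | {0.0, math.inf, -math.inf}
)

def RD(x):
    """Downward rounding: the largest floating-point number <= x."""
    return max(y for y in F_bar if y <= x)

def RU(x):
    """Upward rounding: the smallest floating-point number >= x."""
    return min(y for y in F_bar if y >= x)

print(RD(0.8), RU(0.8))       # 0.75 0.875: the neighbours of 0.8 here
print(RD(0.75) == RU(0.75))   # True: 0.75 is representable, exactness (39)
print(RU(-0.8) == -RD(0.8))   # True: duality (38)
print(RU(100.0))              # inf: above max F, everything rounds up to +inf
```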
To ensure the generality of our results, we define floating-point multiplication in terms of fl rather than fixing a specific rounding function:

Definition 2.17 (Multiplication) The floating-point multiplication operator ⊗ is defined by x ⊗ y = fl(xy) for all x, y ∈ F̄ such that the product xy is defined.

Note that the extended real product is defined if and only if the factors do not include both zero and an infinity. In particular, the product of an extended real with a nonzero (finite) real number is always defined. As a consequence, however, R̄ is not closed under multiplication, and so neither is F̄. In fact, neither F nor F̄ is necessarily closed under floating-point multiplication, since it is possible for a sufficiently large product to round to infinity or a sufficiently small nonzero product to round to zero.

However, since fl is nondecreasing, we can still carry over some important ordering properties from the reals.

Lemma 2.18 Let a ∈ F* and x, y ∈ F̄ such that x ≤ y. If a > 0, then a ⊗ x ≤ a ⊗ y. If a < 0, then a ⊗ y ≤ a ⊗ x.

Since F̄ is a subset of R̄, it is totally ordered by the usual order relation. As such, there is a simple definition of an interval over F̄:

Definition 2.19 (Floating-point interval) A floating-point interval is any set X ∩ F̄ where X is an extended real interval.

(Although more precise results can sometimes be found by considering a specific rounding function, the limitations of the method described in this paper stem from the unpredictability of upward- and downward-rounded quotients, which are independent of fl.)

With the exception of [−∞, +∞] ∩ F̄, due to the gaps between floating-point numbers, every floating-point interval has many representations in terms of extended real intervals. However, since F̄ is finite, each of its nonempty subsets has a minimum and a maximum, making it trivial to find the smallest interval containing it.
For all a, b ∈ R̄, we have the following:

  if a, b ∈ F̄ and a ≤ b, then a, b ∈ [a, b] ∩ F̄  (tightness) (43)
  [a, b] ∩ F̄ = [RU(a), RD(b)] ∩ F̄  (rounding) (44)

That is, if the bounds are in F̄, they are necessarily the tightest possible bounds, and intersecting an interval with F̄ corresponds to rounding its bounds "inward".

2.2 The Classical Iterative Procedure

As shown in Example 1.1, we can use real division to correctly improve the bounds on the factors of a floating-point product. Although it did not find optimal bounds immediately, the division-based approach shown in the example is still essential to solving the problem efficiently, so let us describe it more rigorously.

In the example at the start, we reasoned in terms of inequalities. However, a general method will be easier to describe in terms of sets. Let X, Y, Z ⊆ F̄ be floating-point intervals and let x ∈ X. First, note that by the definition of floating-point multiplication, for all y ∈ Y, we have x ⊗ y ∈ Z if and only if xy ∈ fl⁻¹[Z]. This greatly simplifies things by allowing us to characterize x using extended real division in most cases. More precisely, using our augmented "division", we have x ⊗ y ∈ Z for some y ∈ Y if and only if x ∈ fl⁻¹[Z]“/”Y.

We relate this to our problem statement as follows:

Definition 2.20 (Solution sets and optimality) A problem instance is any triple (X, Y, Z) where X, Y, Z ⊆ F̄. The solution set of (X, Y, Z) is the set of triples

  V = {(x, y, z) | x ∈ X, y ∈ Y, z ∈ Z, x ⊗ y = z}.

A problem instance is satisfiable if and only if its solution set is nonempty. The triple (X, Y, Z) has optimal bounds if and only if V̄ = X × Y × Z.

Lemma 2.21 Let X, Y, Z ⊆ F̄ and let V be the solution set of (X, Y, Z). Then V̄ = X̄′ × Ȳ′ × Z̄′ (the product of the box closures of X′, Y′ and Z′), where

  X′ = X ∩ (fl⁻¹[Z]“/”Y),
  Y′ = Y ∩ (fl⁻¹[Z]“/”X),
  Z′ = Z ∩ (X ⊗ Y).

Proof Let V₁, V₂ and V₃ be the sets containing, for each triple in V, its first, second, or third element, respectively. Then, V̄ = V̄₁ × V̄₂ × V̄₃. Now, by the definition of “/”, it is easy to check that V₁ = X′, V₂ = Y′ and V₃ = Z′, and the result follows immediately. □

Although useful, it is impractical to directly use the above result to solve the problem. Since Y is a floating-point interval, it has many gaps, and so it is very expensive to compute the set of quotients. These gaps were invisible in Example 1.1 because we were working with inequalities, rather than the exact sets. Translated in terms of sets, what we did corresponds to enlarging the set of denominators Y to the extended real interval Ȳ. By enlarging X and Z similarly, we obtain a precise definition of our algorithm:

Definition 2.22 (Classical iteration) We define the function F : P(R̄)³ → P(R̄)³ by F(X, Y, Z) = (X′, Y′, Z′) where

  X′ = X ∩ (fl⁻¹[Z]“/”Ȳ),
  Y′ = Y ∩ (fl⁻¹[Z]“/”X̄),
  Z′ = Z ∩ fl[X̄Ȳ].

Since Ȳ is a superset of Y, it follows that fl⁻¹[Z]“/”Ȳ is a superset of fl⁻¹[Z]“/”Y, and so this substitution is sound, though it may not produce an optimal result. Importantly, however, a quotient of intervals is easy to compute and represent, since it must itself be either an interval or a union of two disjoint intervals.

Lemma 2.23 (Soundness) Let X, Y, Z ⊆ F̄ and let x ∈ X, y ∈ Y, z ∈ Z. If x ⊗ y = z, then x ∈ X′, y ∈ Y′, z ∈ Z′ where (X′, Y′, Z′) = F(X, Y, Z).

Proof Suppose x ⊗ y = z. Then, since x ⊗ y = fl(xy) ∈ fl[X̄Ȳ] and z ∈ Z, we have z ∈ Z′ by definition. Since z ∈ Z, we also have xy ∈ fl⁻¹[Z] by definition. Thus, since y ∈ Ȳ, we have x ∈ fl⁻¹[Z]“/”Ȳ by definition, and hence x ∈ X′. Since x ∈ X̄, we also similarly have y ∈ Y′, as desired. □

It is easy to prove that iterating F converges in finite time, since the sets involved are finite, and we only shrink them:

Lemma 2.24 (Termination) For all X, Y, Z ⊆ F̄, the sequences of sets given by

  (X₁, Y₁, Z₁) = (X, Y, Z),
  (Xₙ, Yₙ, Zₙ) = F(Xₙ₋₁, Yₙ₋₁, Zₙ₋₁), n > 1,

are eventually constant.
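To make the classical iteration concrete, the sketch below is our own transcription of it for Example 1.1 (Python; the two-place decimal round-down format is assumed, values are scaled to integer hundredths, and all intervals are positive, which reduces the interval quotient to an endpoint computation). Note that, as in Definition 2.22, both components are refined from the previous iterate simultaneously:

```python
# Classical iteration on Example 1.1: x ⊗ y = 5.00, 2.20 <= x <= 2.50,
# 1.00 <= y <= 2.50, in two-place decimal with round-down multiplication.
# Values are integer hundredths; products are exact integer ten-thousandths.
# fl^-1[{5.00}] = [5.00, 5.01) = [50000, 50100) in scaled product units.
# Z = {5.00} is a singleton and is never shrunk here, so we track only X and Y.
ZLO, ZHI = 50000, 50100

def refine(A, B):
    """One division step for positive intervals: A ∩ (fl^-1[Z] "/" hull(B)).
    A grid value a is feasible iff some real b in [blo, bhi] has a*b in
    [ZLO, ZHI), i.e. iff a*bhi >= ZLO and a*blo < ZHI."""
    alo, ahi = A
    blo, bhi = B
    new_lo = max(alo, -(-ZLO // bhi))    # ceil(ZLO / bhi)
    new_hi = min(ahi, (ZHI - 1) // blo)  # largest a with a*blo < ZHI
    return (new_lo, new_hi)

X, Y = (220, 250), (100, 250)
steps = 0
while True:
    X2, Y2 = refine(X, Y), refine(Y, X)  # simultaneous (both use old values)
    if (X2, Y2) == (X, Y):
        break
    X, Y = X2, Y2
    steps += 1

print(X, Y, steps)  # converges to the optimum: (233, 250) (200, 215)
```

As the step count shows, each late iteration shaves only one grid value off an endpoint, matching the slow convergence discussed in Example 1.1.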
Proof Since F is finite, X_1, Y_1 and Z_1 are finite. Therefore, since the definition of F implies that the sequences X_n, Y_n and Z_n are monotonically decreasing by the subset relation, they must be eventually constant.

Furthermore, individual applications of F can be computed quickly:

Lemma 2.25 (Time complexity) If X, Y, and Z are floating-point intervals or are each a disjoint union of an all-nonnegative and an all-nonpositive floating-point interval, then F(X, Y, Z) can be computed using O(1) arithmetic operations and computations of the preimage of a floating-point interval under fl.

Remark Although the number of arithmetic operations needed does not increase with the problem parameters, the individual cost of computing them does. In the sign-significand-exponent encoding, floating-point numbers (in any base) take up O(p + log(e_max − e_min + 1)) bits. If fl belongs to one of the usual families of rounding functions (e.g. RD, RU, or RN), both it and its preimage can be computed in time linear in the length of its input. Thus, floating-point addition can be computed in time linear in the length of the addends, and multiplying two floating-point numbers takes time O(M(p) + log(e_max − e_min + 1)), where M(p) is the time complexity of multiplying two p-digit integers. Correctly rounded division is more difficult, but it can be computed using Newton-Raphson division in the same amount of time asymptotically as multiplication. Thus, for most choices of fl, computing F takes time O(M(p) + log(e_max − e_min + 1)). Note that we do not take the base, precision or exponent range as constants in our analysis, as doing so tends to collapse the time complexity of floating-point algorithms to O(1), precluding any meaningful comparison. We also do not assume a specific multiplication algorithm, as there are many to choose from with different characteristics.
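As a concrete sketch of the classical iteration of Definition 2.22, the fragment below applies one step of F to positive, bounded binary64 intervals. It is a simplified model, not the paper's algorithm verbatim: the names (`step`, `contract`, `preimage`) are ours, fl is taken to be binary64 round-to-nearest-even (Python's native `*`), `preimage` returns a closed superset of fl⁻¹[Z] (which is sound, per the discussion of Definition 2.22), exact real arithmetic is done with `fractions.Fraction`, and empty intersections and non-positive intervals are not handled.

```python
import math
from fractions import Fraction as Fr

def rd(q):
    """Round an exact rational q down to binary64."""
    f = float(q)                       # float() is correctly rounded (RN)
    return f if Fr(f) <= q else math.nextafter(f, -math.inf)

def ru(q):
    """Round an exact rational q up to binary64."""
    f = float(q)
    return f if Fr(f) >= q else math.nextafter(f, math.inf)

def preimage(a, b):
    """Closed superset of fl^-1[[a, b]] for binary64 round-to-nearest:
    extend each endpoint by half the gap to its neighbor (a, b positive)."""
    return ((Fr(math.nextafter(a, 0.0)) + Fr(a)) / 2,
            (Fr(b) + Fr(math.nextafter(b, math.inf))) / 2)

def step(X, Y, Z):
    """One application of F on positive bounded float intervals, each given
    as a pair of binary64 endpoints; assumes no intersection is empty."""
    P = preimage(*Z)
    def contract(A, B):                # A ∩ (fl^-1[Z] "/" B), B positive
        return (max(A[0], ru(P[0] / Fr(B[1]))),   # round real bounds
                min(A[1], rd(P[1] / Fr(B[0]))))   # inward, as in (44)
    Zp = (max(Z[0], X[0] * Y[0]), min(Z[1], X[1] * Y[1]))  # Z ∩ fl[XY]
    return contract(X, Y), contract(Y, X), Zp

print(step((1.0, 2.0), (1.0, 2.0), (4.0, 4.0)))
# -> ((2.0, 2.0), (2.0, 2.0), (4.0, 4.0))
```

On this instance a single step already reaches the fixpoint x = y = 2, illustrating the contraction that Lemma 2.23 shows is sound.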
However, given that what we are computing is a relaxation of the original problem, it is unclear how quickly the iteration converges, or even that it necessarily converges to the optimum. Certainly, it can be quite slow in some cases, as we saw in Example 1.1. Although we do not have a concrete answer, we offer the following conjecture on the rate of convergence:

Conjecture 2.26 Let X_1, Y_1, Z_1 ⊆ F be floating-point intervals, and define the sequences of sets

(X_n, Y_n, Z_n) = F(X_{n−1}, Y_{n−1}, Z_{n−1}), n > 1.

Then, for all m > 2, the sets X_m \ X_{m+1}, Y_m \ Y_{m+1} and Z_m \ Z_{m+1} contain at most 4 members each.

That is, we conjecture that the rate of convergence is constant. The intuition behind this is thus: after two iterations, we will have propagated the bounds from one factor to the other, and back again. After that, however, the bounds are within one floating-point number of the optimum of the real relaxation, so there is no significant progress to be gained from interval reasoning. Instead, each step only removes at most one floating-point number from the end of each interval in the sets. (Each set after the first iteration may be a disjoint union of two intervals, if the initial intervals contain both positive and negative numbers.) This is clearly rather slow, so we make a further conjecture: that, in the worst case, iterating F takes a number of steps to converge exponential in the precision:

Conjecture 2.27 Denote by F_n the floating-point numbers with format (β, n, e_min, e_max) for each precision n ≥ 2 and let fl_n be an associated nondecreasing, faithful rounding function. Let F_n denote the iteration F of Definition 2.22 parameterized over fl_n. Let μ_n(X, Y, Z) = m be the least positive integer m such that F_n^{m+1}(X, Y, Z) = F_n^m(X, Y, Z), where F^k denotes F composed with itself k times, and where X, Y, Z ⊆ F_n. If f : {2, 3, ...} → Z is a function such that

f(n) = max {μ_n(X, Y, Z) | X, Y, Z are floating-point intervals over F_n},

then f(n) = Ω(β^{cn}) for some positive real c.

Remark Proving an upper bound is trivial, as it is of the same order of steps as brute force search. The lower bound is tricky. In order to prove it, we essentially need to prove that bad (i.e. exponentially slow) instances exist in every precision. It would perhaps be easier to show that f is not dominated by any function of lesser order. For this, we merely need an infinite number of precisions with bad instances. A promising candidate is the set of all instances of the form (X, F, {β^p − 1}), where X is the widest floating-point interval centered on β^{p−1}(β^p − 1) such that the instance is unsatisfiable. (The unsatisfiable example we gave earlier for 64-bit IEEE floating-point numbers is one such instance.) Note that to obtain this complexity result rigorously, we must define a family of rounding functions suitably indexed by floating-point format, as well as a matching indexed variant of F.

Leaving aside the rate of convergence, we would also like to know what exactly we are converging to. We can show that whenever X and Y do not contain zero, convergence implies that the bounds on X and Y are optimal. In such cases, we obtain explicit solutions to the equation x ⊗ y = z where x or y is an endpoint of X or Y, respectively.

Lemma 2.28 (Conditional partial optimality) Let X, Y, Z ⊆ F. If Z is a floating-point interval, and neither X nor Y contains zero, and F(X, Y, Z) = (X, Y, Z), then one of the following statements holds:

1. X, Y and Z are empty.
2. The elements of X have the same sign as the elements of Y, the elements of Z are nonnegative, and min X ⊗ max Y ∈ Z and max X ⊗ min Y ∈ Z.
3. The elements of X have the opposite sign of the elements of Y, the elements of Z are nonpositive, and min X ⊗ min Y ∈ Z and max X ⊗ max Y ∈ Z.

Proof Suppose the conditions hold.
Then, if any of X, Y,or Z are empty, the deﬁnition of F trivially implies that they are all empty, so the result holds. Suppose instead that X, Y and Z are nonempty. Since X is an interval not containing zero, it is a subset of either [−∞, 0) or (0, +∞], and thus the elements of X all have the same sign. Similarly, we also ﬁnd that every element of Y has the same sign. We now proceed by cases: Suppose that the elements of X have the same sign as the elements of Y.Thenﬂ(xy) ≥ 0 for all x ∈ X and y ∈ Y where the product is deﬁned. Therefore, by the deﬁnition of F, the elements of Z are all nonnegative. −1 −1 1. Let x ∈ X and y ∈ Y such that min X · y ∈ ﬂ [Z ] and x · max Y ∈ ﬂ [Z ].(These exist by the deﬁnition of “/”.) (a) Suppose min X > 0and max Y > 0. Then, since min X ≤ x and y ≤ max Y , multiplying and rounding gives min Z ≤ ﬂ(min X · y) ≤ min X ⊗ max Y ≤ ﬂ(x · max Y ) ≤ max Z , and since Z is a ﬂoating-point interval, we thus have min X ⊗ max Y ∈ Z. (b) Suppose min X < 0and max Y < 0 instead. Then, similarly, we obtain ﬂ(x · max Y ) ≤ min X ⊗ max Y ≤ ﬂ(min X · y), and thus min X ⊗ max Y ∈ Z. −1 −1 2. Let x ∈ X and y ∈ Y instead such that max X · y ∈ ﬂ [Z ] and x · min Y ∈ ﬂ [Z ]. (a) Suppose max X > 0and min Y > 0. Then, since x ≤ max X and min Y ≤ y, ﬂ(x · min Y ) ≤ max X ⊗ min Y ≤ ﬂ(max X · y), and thus max X ⊗ min Y ∈ Z. (b) Suppose max X < 0and min Y < 0 instead. Then, since x ≤ max X and min Y ≤ y, ﬂ(max X · y) ≤ max X ⊗ min Y ≤ ﬂ(x · min Y ), and thus max X ⊗ min Y ∈ Z. Suppose instead that the elements of X have the opposite sign of the elements of Y.Then ﬂ(xy) ≤ 0for all x ∈ X and y ∈ Y and thus the elements of Z are all nonpositive. −1 −1 1. Let x ∈ X and y ∈ Y such that min X · y ∈ ﬂ [Z ] and x · min Y ∈ ﬂ [Z ]. 123 11 Page 16 of 52 M. Andrlon (a) Suppose min X > 0and min Y < 0. Then, since min X ≤ x and min Y ≤ y, multiplying and rounding gives min Z ≤ ﬂ(x · min Y ) ≤ min X ⊗ min Y ≤ ﬂ(min X · y) ≤ max Z , and therefore min X ⊗ min Y ∈ Z. 
(b) Suppose min X < 0and min Y > 0 instead. Then, similarly, we obtain ﬂ(min X · y) ≤ min X ⊗ min Y ≤ ﬂ(x · min Y ), and thus min X ⊗ min Y ∈ Z. −1 −1 2. Let x ∈ X and y ∈ Y instead such that max X · y ∈ ﬂ [Z ] and x · max Y ∈ ﬂ [Z ]. (a) Suppose max X > 0and max Y < 0. Then, since x ≤ max X and y ≤ max Y , ﬂ(max X · y) ≤ min X ⊗ min Y ≤ ﬂ(x · max Y ), and so max X ⊗ max Y ∈ Z. (b) Suppose max X < 0and max Y > 0 instead. Then, similarly, ﬂ(x · max Y ) ≤ min X ⊗ min Y ≤ ﬂ(max X · y), and thus max X ⊗ max Y ∈ Z. Since we have exhausted all cases, the proof is complete. Remark We can use this result in the general case by splitting X and Y into positive and neg- ative parts, though doing this does mean that may need to solve multiple smaller instances to solve the overall instance. Alternatively, we could reformulate F to split X and Y appro- priately. Remark Note that if there are no solutions, this implies that iterating F eventually proves it by converging to empty sets. Now, although the above result is very informative, it does not tell us about the bounds on Z when F converges. We now show that, unfortunately, iterating F does not always produce optimal bounds: Example 2.29 Consider β = 10, p = 2and e ≤ 0 ≤ e . (That is, 1-place decimal min max ﬂoating-point numbers.) Let X =[1.2, 1.4]∩ F, Y =[1.2, 1.3]∩ F and Z =[1.5, 1.8]∩ F and suppose ﬂ is rounding to nearest, ties to even. We ﬁrst show that F (X , Y , Z ) = (X , Y , Z ). −1 In the following, note that ﬂ [Z]= (1.45, 1.85]: −1 ﬂ [Z ]“/”Y = (1.45, 1.85]/[1.2, 1.3] = (1.1153846, 1.5416] ⊇ X , −1 ﬂ [Z ]“/”X = (1.45, 1.85]/[1.2, 1.4] = (1.03571428, 1.5416] ⊇ Y , ﬂ[X Y]= ﬂ[[1.2, 1.4]·[1.2, 1.3]] = ﬂ[[1.44, 1.82]] =[1.4, 1.8]∩ F ⊇ Z . 
Table 1  Real and rounded products of the elements of X and Y

      ·  | 1.2   1.3   1.4            ⊗  | 1.2   1.3   1.4
    1.2  | 1.44  1.56  1.68         1.2  | 1.4   1.6   1.7
    1.3  | 1.56  1.69  1.82         1.3  | 1.6   1.7   1.8

From the definition of F, the above confirms that F(X, Y, Z) = (X, Y, Z) indeed. It is also easy to verify that the bounds on X and Y are optimal, as per Lemma 2.28:

min X ⊗ max Y = fl(1.2 · 1.3) = fl(1.56) = 1.6 ∈ Z,
max X ⊗ min Y = fl(1.4 · 1.2) = fl(1.68) = 1.7 ∈ Z.

However, if we enumerate all the products of the members of X and Y, we find that none of them round to min Z = 1.5. From Table 1, we see that X ⊗ Y = {1.4, 1.6, 1.7, 1.8}, and so Z ∩ (X ⊗ Y) = [1.6, 1.8] ∩ F is a proper subset of Z that is in fact a floating-point interval. Therefore X × Y × Z is not the smallest box enclosing the solution set.

Example 2.29 demonstrates that the classical algorithm does not necessarily produce optimal results. However, there is a simple strategy we can use to remedy this. To find optimal bounds on Z (that is, to find the least and greatest values z ∈ Z such that x ⊗ y = z is satisfied by some x ∈ X and y ∈ Y), we can use binary search. Now, testing satisfiability is simple if we can find optimal bounds on any part: if a problem instance is unsatisfiable, then the optimal bounds are all empty sets. The key, then, is to find an efficient optimization method. In this paper, we develop such a procedure for the bounds on factors.

3 Simplifying the Problem

3.1 Sufficient and Necessary Conditions

In order to simplify the problem, we will need to precisely determine the conditions under which the division-based algorithm produces optimal results. In the second half of Example 1.1, we encountered certain combinations of values for which the equation x ⊗ y = z has no solutions over the floating-point numbers. Unlike with the reals, there is no guarantee that we can find a value for x satisfying the equality for any given y and z.
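The enumeration in Example 2.29 can be replayed by brute force. The sketch below models the toy decimal format (β = 10, p = 2) with exact rationals; it relies on the fact that Python's `round()` on a `Fraction` rounds half to even, which here coincides with ties-to-even on the integral significand. It confirms that no product of members of X and Y rounds to 1.5, so the optimal bound on Z is [1.6, 1.8] ∩ F.

```python
from fractions import Fraction as Fr
from itertools import product

def fl2(r):
    """Toy rounding for Example 2.29: beta = 10, p = 2, round to nearest,
    ties to even.  Only valid for r in [1, 10), which covers every
    product formed below."""
    return Fr(round(r * 10), 10)       # round() on a Fraction is half-to-even

t = Fr(1, 10)
X = [12 * t, 13 * t, 14 * t]           # [1.2, 1.4] ∩ F
Y = [12 * t, 13 * t]                   # [1.2, 1.3] ∩ F
Z = [15 * t, 16 * t, 17 * t, 18 * t]   # [1.5, 1.8] ∩ F

# Solution set of the instance, and the z-values it actually attains.
V = [(x, y, z) for x, y, z in product(X, Y, Z) if fl2(x * y) == z]
zs = sorted({z for _, _, z in V})
print([float(z) for z in zs])          # -> [1.6, 1.7, 1.8]
```

No triple has z = 1.5, matching Table 1 and the conclusion that Z was not optimal.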
The existence of a solution is necessary for any optimal bound, however, and so the following deﬁnition will be useful: Deﬁnition 3.1 (Floating-point factors) A ﬂoating-point number x is a ﬂoating-point factor of a ﬂoating-point number z if and only if x ⊗ y = z for some y ∈ F. Given a set Z ⊆ F,we say that x is feasible for Z if and only if x is a ﬂoating-point factor of some z ∈ Z.The set of all ﬂoating-point factors of the members of Z is denoted Feas(Z ). For the sake of brevity, we will typically simply write “factor” when it is clear from context that we are discussing ﬂoating-point factors. It is worth noting that sets of factors satisfy the following closure property: Lemma 3.2 For any Z ⊆ F,the set Feas(Z ) is closed under negation. 123 11 Page 18 of 52 M. Andrlon Proof Let Z ⊆ F. If Feas(Z ) is empty, it is trivially closed under negation. Suppose Feas(Z ) is nonempty instead, and let x ∈ Feas(Z ). Then there is some y ∈ F such that x ⊗ y ∈ Z. Therefore, −x ⊗−y = ﬂ(−x ·−y) = ﬂ(xy) = x ⊗ y ∈ Z , and so −x ∈ Feas(Z ). Corollary 3.2.1 For any x ∈ F and Z ⊆ F,wehave max {y ∈ Feas(Z ) | y ≤ x}=− min {y ∈ Feas(Z ) | y ≥−x }. Remark The above corollary implies that we only need one procedure to compute both the minimum and maximum of [a, b]∩ Feas(Z ) for any a, b ∈ F. The following two lemmas are crucial. First, we prove that the real quotients from the division-based algorithm can still provide us with useful bounds. Lemma 3.3 Let x ∈ F, and let Y , Z ⊆ F where Y is a ﬂoating-point interval. If ﬂ(xy) ∈ Z for some y ∈ Y, then 1. x ⊗ s ≥ min Zfor some s ∈ Y , and 2. x ⊗ t ≤ max Zfor some t ∈ Y. Proof Suppose ﬂ(xy) ∈ Z for some y ∈ Y.Since min Y and max Y are ﬂoating-point numbers and min Y ≤ y ≤ max Y , by the monotonicity of rounding, we have min Y ≤ RD(y) ≤ max Y and min Y ≤ RU(y) ≤ max Y.Since Y is a ﬂoating-point interval, it thus follows that RD(y) ∈ Y and RU(y) ∈ Y . We shall ﬁrst deal with the special cases of x being zero or inﬁnite. Suppose x = 0. 
Since xy is deﬁned by assumption, y must be ﬁnite, and hence xy = 0. Thus ﬂ(xy) = 0. Since y is ﬁnite, at least one of either RD(y) or RU(y) is ﬁnite, and thus either x RD(y) = 0or x RU(y) = 0. Therefore either x ⊗ RD(y) = 0or x ⊗ RU(y) = 0. Since ﬂ(xy) = 0and min Z ≤ ﬂ(xy) ≤ max Z, the result follows. Suppose x =+∞ or x =−∞ instead. Since xy is deﬁned by assumption, y must be nonzero and hence either xy =+∞ or xy =−∞. Therefore ﬂ(xy) = xy.Since y is nonzero, at least one of RU(y) or RD(y) is nonzero. If RD(y) is nonzero, then it has the same sign as y and hence xy = x RD(y). Similarly, if RU(y) is nonzero, then it has the same sign as y and hence xy = x RU(y). Therefore either x ⊗ RD(y) = ﬂ(xy) or x ⊗ RU(y) = ﬂ(xy), and hence the result. We now handle the remaining case of x being ﬁnite and nonzero. Suppose x ∈ F instead. Then multiplication by x is always deﬁned. Note that min Y ≤ y ≤ max Y and min Z ≤ ﬂ(xy) ≤ max Z. We shall consider each sign of x separately: • Suppose x is positive. Multiplying by x, it follows that x min Y ≤ xy ≤ x max Y , and by the monotonicity of rounding, we have x ⊗ min Y ≤ ﬂ(xy) ≤ x ⊗ max Y . 123 Finding Normal Binary Floating-Point Factors Efﬁciently Page 19 of 52 11 Combining these bounds with our previous bounds on ﬂ(xy),weobtain x ⊗ min Y ≤ ﬂ(xy) ≤ max Z , min Z ≤ ﬂ(xy) ≤ x ⊗ max Y , and hence the result holds. • Suppose x is negative. Then, similarly, we have x ⊗ min Y ≥ ﬂ(xy) ≥ x ⊗ max Y . Therefore, since ﬂ(xy) ∈ Z, it follows that x ⊗ min Y ≥ ﬂ(xy) ≥ min Z , max Z ≥ ﬂ(xy) ≥ x ⊗ max Y , and thus we obtain the result. Since we have proved the result for all possible cases, we are ﬁnished. Next, we show that the solution set is exactly the set of feasible quotients: Lemma 3.4 Let x ∈ F, and let Y , Z ⊆ F be ﬂoating-point intervals. Then there exists y ∈ Y such that x ⊗ y ∈ Z if and only if 1. x is feasible for Z, and 2. ﬂ(xa) ∈ Zfor some a ∈ Y. 
Proof If x ⊗ y ∈ Z for some y ∈ Y, then x ∈ Feas(Z) by definition, and fl(xy) = x ⊗ y ∈ Z where y ∈ Ȳ trivially.

Suppose instead that x is feasible for Z, and that fl(xa) ∈ Z for some a ∈ Ȳ. Then, by definition, there is some w ∈ F such that x ⊗ w ∈ Z. Further, by Lemma 3.3, we have x ⊗ s ≥ min Z and x ⊗ t ≤ max Z for some s, t ∈ Y. If w ∈ Y, then the result follows immediately. Since Y is a floating-point interval, if instead w ∉ Y, then either w < min Y or w > max Y. For the following, note that min Z ≤ x ⊗ w ≤ max Z. We shall divide the problem into four cases, depending on the sign of x and the region w lies in:

1. Suppose w < min Y. Then w < s and w < t.
(a) Suppose x ≥ 0. Multiplying, we obtain x ⊗ w ≤ x ⊗ t, and thus min Z ≤ x ⊗ t ≤ max Z. Since Z is a floating-point interval, it follows that x ⊗ t ∈ Z.
(b) Suppose x < 0. Then, we can multiply w < s by x to obtain x ⊗ w ≥ x ⊗ s, and thus max Z ≥ x ⊗ s ≥ min Z. Since Z is a floating-point interval, we have x ⊗ s ∈ Z.
2. Suppose w > max Y. Then w > s and w > t.
(a) Suppose x ≥ 0. Then x ⊗ w ≥ x ⊗ s, and thus x ⊗ s ∈ Z by the reasoning of case (1)(b).
(b) Suppose x < 0. Then x ⊗ w ≤ x ⊗ t, and hence x ⊗ t ∈ Z by case (1)(a).

Since we have either x ⊗ s ∈ Z or x ⊗ t ∈ Z in all cases, the result follows.

Remark This is not a solution on its own. Note that a is not necessarily a floating-point number!

This brings us to our first major result. The following theorem breaks the problem of computing optimal bounds into two simpler problems: computing bounds using the division-based algorithm, and then shrinking those bounds to the nearest feasible values.

Theorem 3.5 Let X, Y, Z ⊆ F be nonempty floating-point intervals, and let

X′ = X ∩ (fl⁻¹[Z] "/" Ȳ),
X̂ = {x ∈ X | ∃y ∈ Y (x ⊗ y ∈ Z)}.

Then X̂ = X′ ∩ Feas(Z).

Proof By Lemma 3.4, for all x ∈ F, we have x ∈ X̂ if and only if x ∈ X, x ∈ Feas(Z) and x ∈ fl⁻¹[Z] "/" Ȳ, and the result follows immediately.
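In small formats, Feas(Z) from Definition 3.1 can be computed by brute force, which also makes Lemma 3.2's closure under negation easy to spot-check. The sketch below uses a toy format with hypothetical parameters (β = 10, p = 2, two decades of normal numbers plus zero); subnormals and infinities are omitted, and a product that rounds outside the format simply does not witness feasibility. All names are ours.

```python
from fractions import Fraction as Fr

# Toy format: significands 10..99, values with magnitude in [1.0, 99], plus 0.
F = sorted({s * m * Fr(10) ** e
            for s in (1, -1) for m in range(10, 100) for e in (-1, 0)}
           | {Fr(0)})
LO, HI = Fr(1), Fr(99)

def fl(r):
    """Round to nearest, ties to even significand; None outside the format."""
    if r == 0:
        return Fr(0)
    if r < 0:
        f = fl(-r)
        return None if f is None else -f
    e = 0
    while r < 10:                      # normalize the significand into [10, 100)
        r *= 10; e -= 1
    while r >= 100:
        r /= 10; e += 1
    m = round(r)                       # round() on a Fraction is half-to-even
    if m == 100:
        m, e = 10, e + 1
    x = m * Fr(10) ** e
    return x if LO <= x <= HI else None

Z = {Fr(15, 10)}                       # Z = {1.5}
Feas = {x for x in F if any(fl(x * y) in Z for y in F)}
print(Fr(1) in Feas, Fr(12, 10) in Feas)   # -> True False
```

Note that 1.2 is infeasible for {1.5}: the needed quotient 1.25 falls in a gap of the format, echoing Example 2.29. The set is closed under negation, as Lemma 3.2 requires.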
Remark This result implies that the division-based algorithm produces optimal bounds if and only if it produces feasible bounds. That is, writing X′ and X̂ as in Theorem 3.5, min X̂ = min X′ if and only if min X′ is feasible for Z. Therefore, the difficult case is when the bounds produced by the division-based algorithm are infeasible.

Theorem 3.5 shows that we can solve Problem 1 by using the classical division-based algorithm to compute the set of quotients and then intersecting it with the set of factors. As such, the remainder of this paper will focus on solving this new problem:

Problem 2 Let x ∈ F and let Z ⊆ F be a floating-point interval. What is the least (greatest) number feasible for Z that is no less (greater) than x?

3.2 Narrowing Down Infeasibility

Due to Theorem 3.5, we would like to devise a method to find the factors nearest any given non-factor. To that end, we now identify some more practical conditions for feasibility. The following lemma gives a simple and direct test for feasibility, and also shows that floating-point multiplication and division are closely related:

Lemma 3.6 Let x ∈ F∗ and let Z ⊆ F be a floating-point interval. Let z ∈ Z. Then x is feasible for Z if and only if either x ⊗ RD(z/x) ∈ Z or x ⊗ RU(z/x) ∈ Z.

Proof If either x ⊗ RD(z/x) ∈ Z or x ⊗ RU(z/x) ∈ Z, then x is trivially feasible for Z. For the other half, suppose x is feasible for Z. Then x ⊗ y ∈ Z for some y ∈ F. Now, let Y = fl⁻¹[Z]/x. Then y ∈ Y and z/x ∈ Y, and since Z is a floating-point interval, it follows that Y is an interval. Therefore, if y ≤ z/x, then y ≤ RD(z/x) ≤ z/x, and so RD(z/x) ∈ Y. If y > z/x instead, then y ≥ RU(z/x) ≥ z/x, and thus RU(z/x) ∈ Y. Since either RD(z/x) ∈ Y or RU(z/x) ∈ Y, we have either x ⊗ RD(z/x) ∈ Z or x ⊗ RU(z/x) ∈ Z, as desired.

Corollary 3.6.1 Let x ∈ F∗ and z ∈ F. Then x is a floating-point factor of z if and only if either x ⊗ RD(z/x) = z or x ⊗ RU(z/x) = z.
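Corollary 3.6.1 turns factor-testing into one exact division, two directed roundings, and two multiplications. A binary64 sketch follows (our naming, not the paper's): Python's `float` arithmetic is binary64 with round-to-nearest-even, so `*` plays the role of ⊗, and RD/RU of the exact quotient are recovered by adjusting the correctly rounded `float()` conversion of a `Fraction`. It assumes the quotient is finite and within range.

```python
import math
from fractions import Fraction as Fr

def rd_ru(q):
    """Downward and upward roundings of an exact rational q to binary64
    (assumes q is within the finite range)."""
    f = float(q)                        # correctly rounded to nearest
    lo = f if Fr(f) <= q else math.nextafter(f, -math.inf)
    hi = f if Fr(f) >= q else math.nextafter(f, math.inf)
    return lo, hi

def is_factor(x, z):
    """Corollary 3.6.1 as a test: x is a floating-point factor of z iff
    x (*) RD(z/x) = z or x (*) RU(z/x) = z.  Assumes x, z are finite
    binary64 values with x nonzero."""
    lo, hi = rd_ru(Fr(z) / Fr(x))
    return x * lo == z or x * hi == z

print(is_factor(3.0, 1.0))                  # -> True
print(is_factor(3.0, float(2**53 - 1)))     # -> False
```

The second call is an infeasible case: no binary64 y satisfies 3 ⊗ y = 2⁵³ − 1, since both directed roundings of (2⁵³ − 1)/3 multiply back to a neighbor of the target rather than the target itself.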
Additionally, it is worth mentioning that the above serves as a manual proof of a general statement of the automatically derived "invertibility condition" for multiplication of Brain et al. [7].

Although it is clear that if the set of products is large, there will be many solutions, it is unclear how large exactly it needs to be. The following results give a more precise idea of how its diameter relates to the set of solutions. In the following lemma, we describe a lower bound on the diameter sufficient to ensure feasibility in all cases.

Lemma 3.7 Let Z ⊆ F be a floating-point interval, and let x ∈ F∗ and z ∈ Z. If |z/x| ≤ max F and diam fl⁻¹[Z] > |x|β^{Q(z/x)}, then x is feasible for Z.

Proof If z/x ∈ F, then x ⊗ (z/x) = z, and hence x is feasible for Z. Suppose z/x ∉ F instead, and suppose |z/x| ≤ max F and diam fl⁻¹[Z] > |x|β^{Q(z/x)}. First, note that RD(z/x) < z/x < RU(z/x), and hence z lies strictly between x RD(z/x) and x RU(z/x). More precisely, let I = x · [RD(z/x), RU(z/x)]. Then z ∈ I, and hence I ∩ fl⁻¹[Z] is nonempty. Since Z is a floating-point interval, fl⁻¹[Z] is an interval. We shall now show that fl⁻¹[Z] is strictly wider than I:

diam I = |x|(RU(z/x) − RD(z/x)) = |x|β^{Q(z/x)} < diam fl⁻¹[Z].

Therefore, since I and fl⁻¹[Z] are overlapping intervals, by Lemma 2.3, it follows that either min I ∈ fl⁻¹[Z] or max I ∈ fl⁻¹[Z]. Hence either x ⊗ RU(z/x) ∈ Z or x ⊗ RD(z/x) ∈ Z, and so x is feasible for Z.

The next lemma builds on the previous result to give a width independent of the denominator under which virtually all values are feasible.

Lemma 3.8 Let Z ⊆ F be a floating-point interval, and let x ∈ F∗ and z ∈ Z. If β^{e_min} ≤ |z/x| ≤ max F and diam fl⁻¹[Z] ≥ β^{Q(z)+1}, then x is feasible for Z.

Proof Suppose β^{e_min} ≤ |z/x| ≤ max F and diam fl⁻¹[Z] ≥ β^{Q(z)+1}. Then,

|x|β^{Q(z/x)} = |x|β^{E(z/x)−p+1} ≤ |x||z/x|β^{1−p} = |z|β^{1−p} < β^{E(z)+1}β^{1−p} = β^{Q(z)+1} ≤ diam fl⁻¹[Z].

Therefore, by Lemma 3.7, x is feasible for Z.

Remark For any ordinary rounding function (e.g. RD, RU, or RN), the requirement that diam fl⁻¹[Z] ≥ β^{Q(z)+1} roughly corresponds to Z containing at least β floating-point numbers with the same exponent as z (cf. Ziv et al. [30]).

The next lemma provides an even stronger bound, under the condition that the significand of the numerator is no greater in magnitude than the significand of the denominator:

Lemma 3.9 Let Z ⊆ F be a floating-point interval, and let x ∈ F∗ and z ∈ Z. Let m_x = xβ^{−E(x)} and m_z = zβ^{−E(z)}. If diam fl⁻¹[Z] ≥ β^{Q(z)}, |m_z| ≤ |m_x|, and β^{e_min} ≤ |z/x| ≤ max F, then x is feasible for Z.

Proof Suppose β^{e_min} ≤ |z/x| ≤ max F. If |m_z| = |m_x|, then |z/x| = β^{E(z)−E(x)} and hence z/x ∈ F, so x is trivially feasible for Z. Suppose |m_z| < |m_x| instead, and also suppose diam fl⁻¹[Z] ≥ β^{Q(z)}. Then log_β ufp(m_z/m_x) ≤ −1 and so by Lemma 2.13, it follows that E(z/x) ≤ E(z) − E(x) − 1. Thus,

|x|β^{Q(z/x)} = |x|β^{E(z/x)−p+1} ≤ |x|β^{E(z)−E(x)−p} = |m_x|β^{Q(z)−1} < β^{Q(z)} ≤ diam fl⁻¹[Z].

Therefore, by Lemma 3.7, x is feasible for Z.

Fig. 1  Preimages of 1 and 1.75 under floating-point rounding, with (β, p, e_min, e_max) = (2, 3, −2, 1). RNE is rounding to nearest, breaking ties by choosing numbers with even integral significands

Any "ordinary" rounding function almost always satisfies the condition on the diameter of the preimage. Specifically, we always have diam fl⁻¹[{z}] = β^{Q(z)} unless |z| is normal and a power of β. This can be seen in Fig. 1; note the different diameters of the preimages of 1 under rounding. Note that Lemmas 3.7 and 3.9 are both easier to apply when using standard rounding modes, or more specifically, when the preimage under rounding of any floating-point number is not too narrow.
Although the usual rounding functions are both faithful and monotonic, and these are indeed very useful properties, they are not enough on their own:

Example 3.10 We shall construct an ill-behaved rounding function which is still both faithful and monotonic. Let f : R̄ → F̄ be a function such that

f(x) = RD(x) if x ≤ max F, and f(x) = +∞ otherwise.

Since RD is nondecreasing and faithful, it is easy to see that f is also nondecreasing and faithful. However, the only real that it rounds to max F is max F itself. That is, we have f⁻¹[{max F}] = {max F}, and so the preimage under rounding has zero diameter. Further, given x, y ∈ F, we have x ⊗ y = max F if and only if xy = max F exactly. This means that finding the floating-point factors of max F requires integer factorization. Specifically, since max F = (β^p − 1)β^{q_max}, the floating-point factors of max F are exactly the floating-point numbers with a nonnegative exponent (i.e. integers) whose integral significands divide β^p − 1.

As demonstrated above, faithfulness and monotonicity are not enough to ensure that a rounding function has a tractable preimage. We therefore restrict our attention to a better-behaved class of rounding functions. In particular, we are interested in rounding functions that allocate to each floating-point number at least half of the gap around it:

Definition 3.11 (Equitable rounding) A rounding function f is equitable if it is faithful, monotonic, and diam f⁻¹[{x}] ≥ β^{Q(x)−k}, where k = 1 if |xβ^{−E(x)}| = 1 and k = 0 otherwise, for all x ∈ F∗.

Remark Note that we do not require the preimages under rounding of 0, +∞, and −∞ to have a minimum diameter. This is because they already have every nonzero finite floating-point number as a floating-point factor, and so we do not need their preimages at all.
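The factorization phenomenon of Example 3.10 can be observed by brute force in a toy binary format (the parameters below are our own hypothetical choice: β = 2, p = 4, quantum exponents −3..3). Under the ill-behaved f, a product rounds to max F only when it equals max F exactly, so we can enumerate the factors of max F = (2⁴ − 1)·2³ = 120 with exact arithmetic and check the divisibility flavor of the claim: writing a factor as d·2^k with d odd, d must divide β^p − 1 = 15.

```python
from fractions import Fraction as Fr

# Toy format: signed significands 8..15 times powers of two with
# quantum exponents -3..3 (hypothetical parameters).
F = sorted({s * m * Fr(2) ** q
            for s in (1, -1) for m in range(8, 16) for q in range(-3, 4)})
MAX_F = max(F)                                 # (2**4 - 1) * 2**3 = 120

# Under f of Example 3.10, x (*) y = MAX_F forces x*y = MAX_F exactly.
factors = {x for x in F if any(x * y == MAX_F for y in F)}

def odd_part(r):
    """The odd integer d with |r| = d * 2**k (r a nonzero dyadic rational)."""
    n = abs(r.numerator)
    while n % 2 == 0:
        n //= 2
    return n

print(all(15 % odd_part(x) == 0 for x in factors))   # -> True
```

For instance, 3 is a factor (3 · 40 = 120, and its odd part 3 divides 15), while 7 is not, since 120/7 has no finite binary representation.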
Given this definition, the following is then straightforward (albeit tedious) to prove:

Lemma 3.12 RD and RU are equitable rounding functions.

However, RN is not necessarily equitable, since it is unclear what rounding to nearest means for numbers outside [min F, max F], and thus our definition does not fully specify the preimages of min F and max F under RN. For instance, we could choose RN such that RN⁻¹[{x}] = ((x + x⁻)/2, x] when x = max F. This is clearly not equitable, since diam RN⁻¹[{x}] = β^{q_max}/2. However, this inconvenience is not as important as it may seem. As mentioned earlier, IEEE 754-style rounding to nearest specifies that a number rounds to an infinity if and only if its magnitude is no less than (max F + β^{e_max+1})/2. This in turn means that the preimages of min F and max F under rounding have a diameter of β^{q_max}, as required for equitability. In particular, the preimages of min F and max F under any equitable rounding to nearest are at least as large as those under IEEE 754-style rounding:

Lemma 3.13 RN is an equitable rounding function if and only if RN⁻¹[{+∞}] ⊆ I and RN⁻¹[{−∞}] ⊆ −I, where I = [(max F + β^{e_max+1})/2, +∞].

Therefore, since IEEE 754-style systems are virtually the only ones in use, we do not consider equitability to be an especially onerous requirement.

Now, under the assumption of equitable rounding, we can substantially refine the conclusion of Lemma 3.9.

Lemma 3.14 Let x ∈ F∗ and z ∈ F∗, and let m_x = xβ^{−E(x)} and m_z = zβ^{−E(z)}. If fl is equitable and β^{e_min} ≤ |z/x| ≤ max F and |m_z| ≤ |m_x|, then either |m_z| = 1 or x is a floating-point factor of z.

Proof Suppose fl is equitable and β^{e_min} ≤ |z/x| ≤ max F and |m_z| ≤ |m_x|. Suppose also that |m_z| ≠ 1. Then, by the definition of equitable rounding, we have diam fl⁻¹[{z}] ≥ β^{Q(z)}. Therefore by Lemma 3.9, it follows that x is feasible for {z}, and hence the result.
Remark Note that here we show that x is a factor of z, whereas Lemmas 3.7 and 3.9 only prove that it is feasible for some superset Z.

Combining the previous results, we obtain a much clearer picture of what it means when a number is infeasible:

Theorem 3.15 Let Z ⊆ F be a floating-point interval, and let x ∈ F∗ and z ∈ Z. Let m_x = xβ^{−E(x)} and m_z = zβ^{−E(z)}. If fl is equitable and |z/x| ≤ max F and x is not feasible for Z, then either

1. |z/x| < β^{e_min}, or
2. 1 = |m_z| < |m_x| and diam fl⁻¹[Z] < β^{Q(z)}, or
3. |m_x| < |m_z| and diam fl⁻¹[Z] < β^{Q(z)+1}.

Proof Follows immediately from Lemmas 3.7 and 3.14.

Fig. 2  Graphs of exact products of floating-point divisors x and rounded quotients RD(z/x) and RU(z/x), where (β, p, e_min, e_max) = (10, 2, −1, 1) and z = 2.5. The x-axis shows all positive finite floating-point numbers, spaced equally

Remark When rounding the quotient produces a subnormal number, as in case (1), the loss of significant digits can drastically magnify the error and make x infeasible unless Z is exceptionally wide. This makes subnormal quotients difficult to handle in general, since the difference between z and x ⊗ fl(z/x) may be as large as x itself.

Since we assume fl is equitable, case (2) is the only instance where x is infeasible and |m_z| ≤ |m_x|. Although it is a very constrained special case, it shall take some work to dispense with, as we will see later. The most common cause of infeasibility is case (3). In this instance, we have |m_x| < |m_z|, but Z is too narrow to provide a direct solution, likely due to Z containing fewer than β numbers. Note that this means that having more than one number in Z usually suffices for feasibility when β = 2.

Theorem 3.15 gives a straightforward classification of cases where infeasibility can occur.
Although we are not able to give a full treatment of the subnormal quotient case in this paper, we will present an efficient solution for the other two cases. In the next section, we develop criteria for feasibility that are more amenable to a computational solution.

4 Analyzing the Error

When computing a real number, we are often forced to approximate it due to the limitations of the number system. For systems as complicated as the floating-point numbers, the error in this approximation (the rounding error) can seem unpredictable. The notion is simple: given a rounding function f, the rounding error of some x ∈ R with respect to f is just f(x) − x. However, it is difficult to tell from this what values the rounding error can take and how it varies with x.

In our case, we are more precisely interested in a "round-trip" error. Roughly speaking, according to Lemma 3.6, a floating-point number is feasible for a given set if and only if we can undo division by using multiplication. In other words, it is feasible if and only if the error in the "round-trip" is small enough. For any given floating-point numerator z and denominator x, any error in attempting to round-trip multiplication using division stems from rounding the quotient z/x. This is easy to show: since fl is assumed to be faithful, if z/x is also a floating-point number, then x ⊗ fl(z/x) = z exactly. Consequently, any instance where x ⊗ fl(z/x) ≠ z implies that z/x is not exactly representable. This is the common case, since the quotient of two floating-point numbers rarely fits into the available number of digits. However, the product is also rounded, so an inexact quotient may still be close enough. For instance, if we round downward, it is sufficient that the product lies in [z, z⁺), where z⁺ is the next floating-point number after z. A number of such examples can be seen in Fig.
2, as well as a few instances where the product has no error (note that the graph is centered on z). Other patterns are visible as well, such as periodic behavior and a reduction in error whenever the significand of x is greater than the significand of z.

In the remainder of this section, we shall develop ways of manipulating this round-trip error, with the goal of using integer optimization to find floating-point factors. The key insight comes from Lemma 3.6, which tells us that feasibility can be decided using only fl and upward- or downward-rounded quotients. Further, since we assume fl is faithful, we also know that multiplication is also either rounded upward or downward. Therefore, we only really need to be concerned with rounding errors relative to RU and RD to solve the problem. However, the definitions of RU and RD are very cumbersome to manipulate directly. In order to make progress, we must first find a more practical description of rounding. We will do this by expressing rounding error in terms of the remainder of floored division. Furthermore, for the sake of simplicity, we will consider rounding over floating-point numbers with unbounded exponents. We will also show how this relates to the original case.

4.1 Further Preliminaries

4.1.1 The Mod Operator

We shall use a generalized notion of the remainder of a division. Given a numerator n, denominator d, and quotient q, a remainder r satisfies n = qd + r and |r| < |d|. These quantities are traditionally integers, but we extend the operation to the real numbers. Given n and d, we shall determine the remainder r by choosing the quotient q = ⌊n/d⌋. Thus, we define the binary operator mod as follows:

Definition 4.1 (Remainder of floored division) For any numerator x ∈ R and denominator y ∈ R∗, we define (x mod y) = x − y⌊x/y⌋.

(More precisely, if β is prime (such as when β = 2), then the quotient either fits or has no finite base-β representation. For a proof, see Lemma A.5.)
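Definition 4.1 is direct to mirror in code. The sketch below (the helper name `mod` is ours) evaluates it exactly on rationals and illustrates a property discussed shortly: when nonzero, the remainder takes the sign of the denominator.

```python
import math
from fractions import Fraction as Fr

def mod(x, y):
    """Definition 4.1: (x mod y) = x - y * floor(x / y), for real x and
    nonzero y (computed exactly here on rationals)."""
    return x - y * math.floor(x / y)

print(mod(Fr(7, 2), Fr(3, 2)))     # -> 1/2
print(mod(Fr(-7, 2), Fr(3, 2)))    # -> 1
print(mod(Fr(7, 2), Fr(-3, 2)))    # -> -1
```

Adding an integer multiple of the denominator to the numerator leaves the result unchanged, in line with the periodicity lemma that follows.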
(Due to the difficulties with subnormal numbers mentioned in the remark to Theorem 3.15, the specific approach presented here does not benefit from restricting exponents, and doing so would drastically complicate the presentation.)

Remark To avoid excessive parenthesization, the mod operator defined here is nonassociative and has the lowest precedence. That is, within the parentheses, everything before mod is the first argument, and everything after mod is the second argument.

Since this is an extension of the usual notion of the remainder to real numerators and denominators, not all the same properties will hold exactly. However, there is still a close connection with modular arithmetic, which the next two lemmas will illustrate. Firstly, the remainder is periodic in the numerator, with a period equal to the denominator. That is, we can add or subtract integer multiples of the denominator to the numerator without affecting the result:

Lemma 4.2 For any x ∈ R, y ∈ R*, and n ∈ Z, (x + ny mod y) = (x mod y).

Proof Follows directly by a straightforward computation: (x + ny) − y⌊(x + ny)/y⌋ = x + ny − y(⌊x/y⌋ + n) = x − y⌊x/y⌋.

Secondly, if we restrict ourselves wholly to the integers and fix the denominator, equality of remainders is equivalent to congruence:

Lemma 4.3 For all a, b ∈ Z, n ∈ Z*, we have (a mod n) = (b mod n) if and only if a ≡ b (mod |n|).

Proof If (a mod n) = (b mod n), then a − n⌊a/n⌋ = b − n⌊b/n⌋, and hence a ≡ b (mod |n|). Suppose instead that a ≡ b (mod |n|). Then there exist p, q, r ∈ Z such that a = pn + r and b = qn + r. As (pn + r mod n) = (qn + r mod n) = (r mod n), the result follows.

As might be expected from being so closely related to congruence, the mod operator has many useful properties, though we shall only need a few.
The following properties hold for all x ∈ R and a, y ∈ R*:

|(x mod y)| < |y| (upper bound) (45)
(ax mod ay) = a(x mod y) (distributive law) (46)
if x, y ∈ Z, then (x mod y) ∈ Z (integrality) (47)
if (x mod y) ≠ 0, then (−x mod y) = y − (x mod y) (complement) (48)

It is also worth noting that the remainder (when nonzero) has the same sign as the denominator:

y(x mod y) ≥ 0 (49)

Finally, the following results simplify some common cases of the numerator:

(0 mod y) = 0 (50)
(x mod y) = 0 if and only if x = ny for some n ∈ Z (51)
(x mod y) = x if and only if 0 ≤ x/y < 1 (52)
(x mod y) = x if and only if xy ≥ 0 and |x| < |y| (53)

The proofs of these statements are simple and follow easily from the definition of mod.

4.1.2 Floating-Point Numbers with Unbounded Exponents

We define these sets almost identically to the bounded case. For compatibility with their bounded counterparts, we inherit the base β and precision p:

Definition 4.4 (Unbounded floating-point numbers) We define the following sets:

F∞* = {Mβ^q | M, q ∈ Z, 0 < |M| < β^p},
F∞ = F∞* ∪ {0},
F̄∞ = F∞ ∪ {−∞, +∞}.

Remark Note that the unbounded sets are strict supersets of their bounded counterparts. That is, we have F* ⊂ F∞*, F ⊂ F∞, and F̄ ⊂ F̄∞.

The unbounded floating-point numbers satisfy the following properties for all x ∈ F∞ and n ∈ Z:

if |n| ≤ β^p, then n ∈ F∞ (54)
if β^(e_min) ≤ |x| ≤ max F, then x ∈ F* (55)
xβ^n ∈ F∞ (scaling by powers of β) (56)
if x ∈ F∞*, then x/ulp(x) ∈ Z (integral significand) (57)

Proving the existence of a unique upward or downward rounding takes a little more work in the unbounded case, since the sets are infinite, so their subsets might not have a minimum or maximum. For instance, the set (0, 1] ∩ F∞ has no minimum, since it contains every β^(−n) where n ≥ 0. However, a bounded set of unbounded floating-point numbers can be infinite only if it accumulates at zero:

Lemma 4.5 Let a, b ∈ R. If 0 ∉ [a, b], then [a, b] ∩ F∞ is a finite set.
Proof Let S = [a, b] ∩ F∞. If S is empty, then the result follows trivially. Suppose S is nonempty instead and suppose that 0 ∉ [a, b]. Then a ≤ b, and a and b must be nonzero and have the same sign. We shall consider the case of each sign separately.

Suppose a and b are positive. Let e_a = ⌊log_β a⌋ and e_b = ⌊log_β b⌋, and let

T = {Mβ^(e−p+1) | M, e ∈ Z, 0 < M < β^p, e_a ≤ e ≤ e_b + p − 1}.

Clearly, T is finite. Now, let s ∈ S. Since s is positive, by definition we have s = Mβ^q for some M, q ∈ Z where 0 < M < β^p. Thus, to show that s ∈ T, it suffices to show that e_a − p + 1 ≤ q ≤ e_b. Firstly,

β^q ≤ Mβ^q = s ≤ b < β^(e_b + 1),

so q < e_b + 1 and hence q ≤ e_b. Similarly,

β^(e_a) ≤ a ≤ s = Mβ^q < β^(q+p),

so e_a − p < q and hence e_a − p + 1 ≤ q. Therefore s ∈ T, and hence S ⊆ T, and since T is finite, so is S.

Suppose instead a and b are negative. Then −b and −a are positive, so 0 ∉ [−b, −a], and thus [−b, −a] ∩ F∞ is finite. Since F∞ is closed under negation, we have [−b, −a] ∩ F∞ = −([a, b] ∩ F∞) = −S, and hence S is also finite.

Corollary 4.5.1 For any x ∈ R, the sets {y ∈ F̄∞ | y ≤ x} and {y ∈ F̄∞ | y ≥ x} have a minimum and a maximum.

We can now define downward and upward rounding over the unbounded floating-point numbers:

Definition 4.6 (Unbounded rounding) We define the functions RD∞, RU∞ : R̄ → F̄∞ as follows:

RD∞(x) = max {y ∈ F̄∞ | y ≤ x},
RU∞(x) = min {y ∈ F̄∞ | y ≥ x}.

As expected, these functions are equal to their bounded counterparts over the range of the normal numbers:

Lemma 4.7 For all x ∈ R, if β^(e_min) ≤ |x| ≤ max F, then RD(x) = RD∞(x) and RU(x) = RU∞(x).

We also have a gap property similar to Eq. 40:

Lemma 4.8 For all x ∈ R, we have either RD∞(x) = RU∞(x) = x or RU∞(x) − RD∞(x) = ulp(x).

Note that the gap is always finite, since having unbounded exponents means that every finite real rounds to another finite real. Only +∞ and −∞ round to themselves.
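The rounding functions RD∞ and RU∞ just defined can be prototyped exactly with rationals when β = 2. The sketch below is ours, not the paper's (toy precision p = 4, helper names invented); it illustrates the gap property of Lemma 4.8:

```python
from fractions import Fraction
from math import floor, ceil

P = 4  # toy precision: 4 significand bits, radix 2

def ulp(x: Fraction) -> Fraction:
    """ulp(x) = 2**(e - P + 1), where 2**e <= |x| < 2**(e + 1)."""
    assert x != 0
    e, v = 0, abs(x)
    while v >= 2:
        v /= 2
        e += 1
    while v < 1:
        v *= 2
        e -= 1
    return Fraction(2) ** (e - P + 1)

def rd_inf(x: Fraction) -> Fraction:
    """RD_oo(x): the largest unbounded P-bit float <= x."""
    return x if x == 0 else floor(x / ulp(x)) * ulp(x)

def ru_inf(x: Fraction) -> Fraction:
    """RU_oo(x): the smallest unbounded P-bit float >= x."""
    return x if x == 0 else ceil(x / ulp(x)) * ulp(x)

x = Fraction(29, 8)                      # 3.625 needs 5 bits, so it must round
print(rd_inf(x), ru_inf(x))              # 7/2 15/4
print(ru_inf(x) - rd_inf(x) == ulp(x))   # True: the gap of Lemma 4.8
```

Using Fraction rather than float keeps every intermediate value exact, so the prototype decides representability without any double rounding.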
4.1.3 Plausibility

In order to effectively translate our problem into the realm of unbounded floating-point numbers, we will also need a counterpart to our earlier concept of feasibility. Note that we keep the same rounding function: even though fl is not surjective with respect to the unbounded floating-point numbers, it is compatible, since it already maps every extended real to a bounded floating-point number.

Definition 4.9 (Plausibility) Given Z ⊆ F and x ∈ F∞, we say that x is plausible for Z if and only if fl(xy) ∈ Z for some y ∈ F∞. We denote the set of all unbounded floating-point numbers plausible for Z by Plaus(Z).

Remark If Z ⊆ F, then we immediately have Z ⊆ Feas(Z) ⊆ Plaus(Z).

Similarly to the sets of factors, we have the following closure property:

Lemma 4.10 For all Z ⊆ F, the set Plaus(Z) is closed under negation and scaling by powers of β.

Due to exponents being unrestricted, we have the following:

Lemma 4.11 For all Z ⊆ F, we have 1 ∈ Plaus(Z) if and only if Z is nonempty.

Perhaps most importantly, we have an analogue to Lemma 3.6 (the proof is identical, mutatis mutandis):

Lemma 4.12 Let Z ⊆ F be a floating-point interval, and let x ∈ F∞* and z ∈ Z. Then x ∈ Plaus(Z) if and only if either fl(x RD∞(z/x)) ∈ Z or fl(x RU∞(z/x)) ∈ Z.

We also have an analogue to Lemma 3.9:

Lemma 4.13 Let Z ⊆ F be a floating-point interval, and let x ∈ F∞* and z ∈ Z. Let m_x = x/ulp(x) and m_z = z/ulp(z). If diam fl⁻¹[Z] ≥ ulp(z) and |m_z| ≤ |m_x|, then x is plausible for Z.

As expected, feasibility and plausibility agree when the quotients are normal:

Lemma 4.14 Let Z ⊆ F be a floating-point interval, and let x ∈ F* and z ∈ Z. If β^(e_min) ≤ |z/x| ≤ max F, then x ∈ Feas(Z) if and only if x ∈ Plaus(Z).

Proof If x ∈ Feas(Z), then x ∈ Plaus(Z) trivially.
Suppose x ∈ Plaus(Z) and β^(e_min) ≤ |z/x| ≤ max F. Then RD(z/x) = RD∞(z/x) and RU(z/x) = RU∞(z/x), so by Lemma 4.12, we have either x ⊗ RD(z/x) = fl(x RD∞(z/x)) ∈ Z, or x ⊗ RU(z/x) = fl(x RU∞(z/x)) ∈ Z, and therefore x ∈ Feas(Z) by definition.

Since exponents are unbounded, there is always a next and a previous plausible number. We will find it useful to have a function giving the smallest plausible number greater than some lower bound. (Note that F being finite means that we cannot have a similar total function for the feasible numbers.) First, we prove its existence:

Lemma 4.15 Let Z ⊆ F be a floating-point interval, and let x ∈ R*. Let A = {y ∈ Plaus(Z) | y ≥ x}. Then A is nonempty and has a minimum if and only if Z is nonempty.

Proof If Z is empty, then Plaus(Z) is trivially empty and hence so is A. Suppose instead that Z is nonempty, let z ∈ Z, and let t = z/s, where

s = β ufp(x) if x > 0, and s = −ufp(x) otherwise.

Clearly, x ≤ s. Now, since |s| is a power of β and z ∈ F∞, we have s, t ∈ F∞. Therefore fl(st) = fl(z) = z, and hence s, t ∈ Plaus(Z) by definition, and so s ∈ A.

Now, let L = [x, s] ∩ Plaus(Z) and L′ = [x, s] ∩ F∞. Since x and s have the same sign by definition, it follows by Lemma 4.5 that L′ is finite. Hence, since s ∈ L ⊆ L′, it follows that L is also finite, and since it is nonempty, it has a minimum. Now, let R = (s, +∞] ∩ Plaus(Z). Since min L ≤ s, it follows that min L < r for all r ∈ R. Thus, since A = L ∪ R, we have min L ≤ a for all a ∈ A, and hence min A = min L.

With this result, we can now define our "round up to plausible" function:

Definition 4.16 For each nonempty Z ⊆ F, we define the function φ_Z : R* → F∞ by φ_Z(x) = min {y ∈ Plaus(Z) | y ≥ x}.

It is easy to see from the definition that these functions are nondecreasing:

Lemma 4.17 For all Z ⊆ F and x, y ∈ R*, if x ≤ y, then φ_Z(x) ≤ φ_Z(y).

Thus, since the plausible numbers are a superset of the feasible numbers, we have the following:

Lemma 4.18 Let Z ⊆ F, x ∈ R*, and y ∈ Feas(Z).
Then, if x ≤ y, then φ_Z(x) ≤ y. That is, each of these functions gives a lower bound on the next feasible number.

4.2 Relating Feasibility, Plausibility, and Rounding Errors

We begin by giving expressions for rounding in terms of remainders. Note that the following result directly gives a rounding error term.

Lemma 4.19 For all x ∈ R*, we have

RD∞(x) = x − (x mod ulp(x)),
RU∞(x) = x + (−x mod ulp(x)).

Proof Let x₀ = x − (x mod ulp(x)), x₁ = x + (−x mod ulp(x)), and M = x/ulp(x). Then, rearranging using the definition of mod, we obtain x₀ = ulp(x)⌊M⌋ and x₁ = ulp(x)⌈M⌉, and thus x₀ ≤ x ≤ x₁. Since β^(p−1) ≤ |M| < β^p, it follows that |⌊M⌋| ≤ β^p and |⌈M⌉| ≤ β^p. Thus ⌊M⌋, ⌈M⌉ ∈ F∞, and since ulp(x) is a power of β, we have x₀, x₁ ∈ F∞. Thus, by the definitions of RD∞ and RU∞,

x₀ ≤ RD∞(x) ≤ x ≤ RU∞(x) ≤ x₁.

Now, if M ∈ Z, then x₀ = x₁ = x and hence the result follows immediately. Suppose M ∉ Z instead. Then x ∉ F∞, and so RU∞(x) − RD∞(x) = ulp(x) = x₁ − x₀. Therefore,

x₀ ≤ RD∞(x) ≤ x ≤ RD∞(x) + ulp(x) ≤ x₀ + ulp(x),

and hence x₀ = RD∞(x) and x₁ = RU∞(x), as desired.

The above result allows us to manipulate the rounding error using the various laws obeyed by the mod operator. The following result gives us an expression for the round-trip error between any two consecutive plausible numbers.

Lemma 4.20 Let Z ⊆ F and z ∈ Z. Let x ∈ R*. If z ∈ F*, then for all t ∈ [x, φ_Z(x)], we have ufp(x) ≤ |t| ≤ β ufp(x) and

t RD∞(z/t) = z − (z mod t ulp(z/x)),
t RU∞(z/t) = z + (−z mod t ulp(z/x)).

Proof Suppose z ∈ F* and let t ∈ [x, φ_Z(x)]. Let m_x = x/ufp(x), m_z = z/ufp(z), and m_t = t/ufp(t). By Lemma 4.19 and the distributivity of multiplication over mod,

t RD∞(z/t) = z − (z mod t ulp(z/t)),
t RU∞(z/t) = z + (−z mod t ulp(z/t)),

so for the equalities it suffices to show that ulp(z/t) = ulp(z/x), or equivalently, that ufp(z/t) = ufp(z/x).
Note that if ufp(x) = ufp(t) and ufp(m_z/m_x) = ufp(m_z/m_t), then by Lemma 2.12, we have

ufp(z/t) = ufp(m_z/m_t) ufp(z)/ufp(t) = ufp(m_z/m_x) ufp(z)/ufp(x) = ufp(z/x).

Also note that if ufp(t) = ufp(x), then ufp(x) ≤ |t| < β ufp(t) immediately. We proceed by cases. In the following, note that by Lemma 4.10, every power of β and every scaling of z by a power of β is plausible for Z.

• Suppose |m_z| < m_x. Then by Lemma 2.12, we have ufp(m_z/m_x) = 1/β. Then, since x < β ufp(x) ∈ Plaus(Z), we have t ≤ β ufp(x).
  – Suppose t < β ufp(x). Then, since ufp(x) ≤ x ≤ t, we have ufp(t) = ufp(x) and hence |m_z| < m_x ≤ m_t. Therefore, by Lemma 2.12, we have ufp(m_z/m_t) = 1/β as well, as desired.
  – Suppose t = β ufp(x) instead. Then

ufp(z/t) = ufp(z/(β ufp(x))) = (1/β) ufp(z)/ufp(x) = ufp(m_z/m_x) ufp(z)/ufp(x) = ufp(z/x).

• Suppose 0 < m_x ≤ |m_z| instead. Then, since |m_z| ufp(x) ∈ Plaus(Z), we have x ≤ t ≤ |m_z| ufp(x) < β ufp(x). Thus ufp(t) = ufp(x) and hence m_x ≤ m_t = t/ufp(x) ≤ |m_z|. Thus by Lemma 2.12, we also have ufp(m_z/m_x) = ufp(m_z/m_t) = 1, as desired.

• Suppose −|m_z| ≤ m_x < 0 instead. Then, since −ufp(x) ∈ Plaus(Z), we have x ≤ t ≤ −ufp(x), so ufp(t) = ufp(x) and hence −|m_z| ≤ m_x ≤ m_t ≤ −1. Therefore, by Lemma 2.12, we have ufp(m_z/m_x) = ufp(m_z/m_t) = 1, as desired.

• Suppose m_x < −|m_z| instead. Then ufp(m_z/m_x) = 1/β by Lemma 2.12, and since −|m_z| ufp(x) ∈ Plaus(Z), we have x ≤ t ≤ −|m_z| ufp(x), and thus ufp(x) = ufp(t).
  – Suppose t < −|m_z| ufp(x). Then ufp(m_z/m_x) = ufp(m_z/m_t) = 1/β, and hence ufp(z/t) = ufp(z/x), as desired.
  – Suppose t = −|m_z| ufp(x) instead. Then

t ulp(z/x) = −|m_z| ufp(x) ulp(z/x) = −|m_z| ufp(x) ufp(m_z/m_x) (ufp(z)/ufp(x)) β^(1−p) = −|m_z| ufp(z) β^(−p) = −|z|/β^p,

hence z is an integer multiple of t ulp(z/x). Thus

(z mod t ulp(z/x)) = (−z mod t ulp(z/x)) = 0,
and so the rounding error is zero. Similarly, since

z/t = z/(−|m_z| ufp(x)) = −sgn(z) ufp(z)/ufp(x),

the quotient is a (possibly negated) power of β, and thus lies in F∞. Therefore,

t RD∞(z/t) = t RU∞(z/t) = t(z/t) = z

exactly, and hence the result follows. We have thus proven the result in all cases, so we are finished.

The following lemma constrains the feasibility of tiny quotients. Essentially, if any especially small quotients are feasible, so are any larger ones, up to a limit. This result will be useful in determining whether there are any floating-point factors around zero. Although we cannot easily find subnormal factors, we can often eliminate them when they do not exist.

Lemma 4.21 Let Z ⊆ F be a floating-point interval, x, y ∈ F, and z ∈ Z. If y ∈ Feas(Z) and |y| ≤ |x| ≤ |z|/max F, then x ∈ Feas(Z).

Proof If |x| = |z|/max F exactly, then |x| ⊗ max F = |z| and hence x ∈ Feas(Z) by Lemma 3.2. Suppose y ∈ Feas(Z) and |y| ≤ |x| < |z|/max F instead. Then |z/x| > max F and hence z is nonzero. We first prove the result for positive x and y, before then generalizing to all cases. In the following, note that by Lemma 3.6, we have either y ⊗ RU(z/y) ∈ Z or y ⊗ RD(z/y) ∈ Z.

Suppose z > 0. Since |z/x| > max F and 0 < y ≤ x by assumption, we have RD(z/y) = RD(z/x) = max F and RU(z/y) = +∞. Therefore, if y ⊗ RU(z/y) = +∞ ∈ Z, then x ⊗ +∞ = +∞ ∈ Z also, and hence x is feasible for Z. If y ⊗ RD(z/y) ∈ Z instead, then, since y ≤ x, multiplying by RD(z/y) gives

min Z ≤ y ⊗ RD(z/y) ≤ x ⊗ RD(z/y) = x ⊗ RD(z/x) ≤ z ≤ max Z,

and since Z is a floating-point interval, it follows that x ⊗ RD(z/x) ∈ Z. Therefore x is feasible for Z.

Suppose z < 0 instead. Then, we similarly have RD(z/y) = −∞ and RU(z/y) = RU(z/x) = min F.
Hence, if y ⊗ RD(z/y) = −∞ ∈ Z, then x ⊗ −∞ = −∞ ∈ Z, and so x is feasible for Z. If y ⊗ RU(z/y) ∈ Z instead, then since y ≤ x, multiplying by RU(z/y) gives

max Z ≥ y ⊗ RU(z/y) ≥ x ⊗ RU(z/y) = x ⊗ RU(z/x) ≥ z ≥ min Z,

and since Z is a floating-point interval, it follows that x ⊗ RU(z/x) ∈ Z. Therefore x is feasible for Z. We have thus shown the result for positive x and y.

Now, suppose instead that y is negative (x can have any sign). Then |y| is feasible for Z by Lemma 3.2, and thus |x| is feasible for Z by the earlier result. Therefore |x| is feasible for Z regardless of the sign of y, and hence by Lemma 3.2, so is x.

The following theorem allows us to reduce the question of finding feasible numbers to that of finding plausible numbers, when applicable:

Theorem 4.22 Let Z ⊆ F be a floating-point interval, and let x ∈ F and z ∈ Z. Let

A = {y ∈ Feas(Z) | y ≥ x},
B = {y ∈ Plaus(Z) | y ≥ x},
C = {y ∈ Plaus(Z) | y ≥ |x|}.

If x is normal and β^(e_min) ≤ |z/x| ≤ max F, then either A is empty or min A ∈ {min B, min C}.

Proof Suppose x is normal, β^(e_min) ≤ |z/x| ≤ max F, and A is nonempty. Let a = min A, b = min B, and c = min C. Since Feas(Z) ⊆ Plaus(Z), we have a ∈ B and thus x ≤ b ≤ a.

We first show that if c ≤ max F, then c ∈ Feas(Z). Let t = z/β^(E(z/x)). By Lemma 4.10, since |z| ∈ Plaus(Z), we have |t| ∈ Plaus(Z). Now, since |z/x| ≥ β^(e_min), we have |z/x| ≥ β^(E(z/x)). Taking reciprocals and multiplying by |z|, we obtain |x| ≤ |z|/β^(E(z/x)) = |t|, and therefore |t| ∈ C. Thus we obtain |x| ≤ c ≤ |t|, and taking reciprocals and multiplying by |z| again, it follows that

max F ≥ |z/x| ≥ |z/c| ≥ |z/t| = β^(E(z/x)) ≥ β^(e_min).

Therefore, by Lemma 4.14, since β^(e_min) ≤ |x| ≤ c, if c ≤ max F, then c ∈ F* and thus c ∈ Feas(Z).

Next, we dispose of two trivial cases:

• Suppose 0 ∈ Feas(Z). Then 0 ⊗ y ∈ Z for some y ∈ F, and hence 0 ∈ Z. Thus 0 ⊗ x = 0 ∈ Z, and so x ∈ Feas(Z). Therefore a ≤ x, and hence x = a = b.
• Suppose +∞ ∈ Feas(Z). Then +∞ ⊗ y ∈ Z for some nonzero y ∈ F. If y > 0, then +∞ ⊗ y = +∞ = +∞ ⊗ x, and so x ∈ Feas(Z). If y < 0, then +∞ ⊗ y = −∞ = −∞ ⊗ x, and so x ∈ Feas(Z) again. Thus a ≤ x and so x = a = b.

We can now proceed to the proof of the main statement. Suppose instead that 0 ∉ Feas(Z) and +∞ ∉ Feas(Z). Then a < +∞, and so a ≤ max F. We shall divide the problem into two cases, depending on the sign of x.

Suppose x > 0. Then B = C, and hence b = c ≤ a ≤ max F. Therefore c ∈ Feas(Z) by our earlier result, so a ≤ c also, and hence a = b = c, as desired.

Suppose x < 0 instead. Then, since x is normal, we have x ≤ −β^(e_min). Since −β^(e_min) ∈ Plaus(Z) by Lemma 4.10, we have −β^(e_min) ∈ B and thus x ≤ b ≤ −β^(e_min). Therefore β^(e_min) ≤ |b| ≤ |x| ≤ max F, and hence b ∈ F* and β^(e_min) ≤ |z/x| ≤ |z/b|. Thus, if |z/b| ≤ max F, then b ∈ Feas(Z) by Lemma 4.14 and hence a = b.

Suppose |z/b| > max F instead. Then |b| < |z|/max F. We further divide this into two subcases, depending on the sign of a. Note that a ≠ 0, since 0 ∉ Feas(Z) by assumption.

• Suppose a is negative. Then b ≤ a < 0, and thus 0 < |a| ≤ |b| < |z|/max F. Since a ∈ Feas(Z) by definition, Lemma 4.21 implies that b ∈ Feas(Z) also, and thus a = b.

• Suppose a is positive. Then [x, 0] ∩ Feas(Z) is empty, and since Feas(Z) is closed under negation, it follows that a > |x|. Thus, since a ∈ Plaus(Z), we have a ∈ C and hence c ≤ a ≤ max F. Therefore c ∈ Feas(Z) by our earlier result, so a ≤ c and therefore a = c, as desired.

Therefore the result holds in all cases, and we are done.

We now give the main result of this section. The following theorem reduces the problem to an optimization problem over the integral significands of plausible numbers.

Theorem 4.23 Let Z ⊆ F be a floating-point interval and let z ∈ Z ∩ F*.
Then, for all x ∈ F*, we have φ_Z(x) = M ulp(x), where

M = min {N ∈ A ∪ B | N ≥ M_x},
A = {N ∈ Z* | −(M_z β^(p−k) mod N) ∈ I β^(p−k)},
B = {N ∈ Z* | (−M_z β^(p−k) mod N) ∈ I β^(p−k)},
I = (fl⁻¹[Z] − z)/ulp(z),
k = 0 if |M_x| > |M_z|, and k = 1 otherwise,
M_x = x/ulp(x),
M_z = z/ulp(z).

Proof Let y = φ_Z(x) and M_y = y/ulp(x). Since y is plausible by definition, we have either fl(y RD∞(z/y)) ∈ Z or fl(y RU∞(z/y)) ∈ Z by Lemma 4.12. Therefore, by Lemma 4.20, we have either

z − (z mod y ulp(z/x)) ∈ fl⁻¹[Z], or
z + (−z mod y ulp(z/x)) ∈ fl⁻¹[Z].

Since ulp(M_z/M_x) = β^(k−p) by Lemma 2.12, subtracting z and then dividing by ulp(z) ulp(M_z/M_x) gives us either

−(M_z β^(p−k) mod M_y) ∈ I β^(p−k), or
(−M_z β^(p−k) mod M_y) ∈ I β^(p−k).

By Lemma 4.20, we have ufp(x) ≤ |y| ≤ β ufp(x), and thus β^(p−1) ≤ |M_y| ≤ β^p. If |M_y| < β^p, then ufp(y) = ufp(x), and hence M_y is an integer. If |M_y| = β^p instead, then M_y is again an integer, and thus M_y ∈ A ∪ B in all cases. Hence, since x ≤ y and thus M_x ≤ M_y by definition, we obtain M_x ≤ M ≤ M_y. Hence β^(p−1) ≤ |M| ≤ β^p and x ≤ M ulp(x) ≤ y, and therefore M ulp(x) ∈ F∞*. Then, by Lemma 4.20, we have

t RD∞(z/t) = z − (z mod t ulp(z/x)),
t RU∞(z/t) = z + (−z mod t ulp(z/x)),

where t = M ulp(x). Next we prove that t ∈ Plaus(Z). Substituting the definition of I into the definitions of A and B and rearranging, we obtain

A = {N ∈ Z* | z − (z mod N ulp(z)β^(k−p)) ∈ fl⁻¹[Z]},
B = {N ∈ Z* | z + (−z mod N ulp(z)β^(k−p)) ∈ fl⁻¹[Z]}.

Now, since β^(k−p) = ulp(M_z/M_x), we have

M ulp(z)β^(k−p) = M ulp(z) ulp(M_z/M_x) = M ulp(z/M_x) = M ulp(x) ulp(z/x) = t ulp(z/x).

Hence, since M ∈ A ∪ B, we have either

z − (z mod t ulp(z/x)) = t RD∞(z/t) ∈ fl⁻¹[Z], or
z + (−z mod t ulp(z/x)) = t RU∞(z/t) ∈ fl⁻¹[Z].

Since t ∈ F∞*, we thus have t ∈ Plaus(Z). Therefore y ≤ t by definition, and hence y = t = M ulp(x), as desired.
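The round-trip expressions underpinning this argument (Lemma 4.19 pushed through the distributive law (46)) lend themselves to a direct numerical check. The sketch below is ours, not the paper's: a toy radix-2 system with p = 4, using exact rationals, verifying t · RD∞(z/t) = z − (z mod t ulp(z/t)) and its upward counterpart on a few sample pairs:

```python
from fractions import Fraction
from math import floor, ceil

P = 4  # toy precision (4 significand bits), radix 2

def ulp(x: Fraction) -> Fraction:
    """ulp(x) = 2**(e - P + 1), where 2**e <= |x| < 2**(e + 1)."""
    assert x != 0
    e, v = 0, abs(x)
    while v >= 2:
        v /= 2
        e += 1
    while v < 1:
        v *= 2
        e -= 1
    return Fraction(2) ** (e - P + 1)

def rd_inf(x: Fraction) -> Fraction:
    """RD_oo: round down to an unbounded P-bit float."""
    return floor(x / ulp(x)) * ulp(x)

def ru_inf(x: Fraction) -> Fraction:
    """RU_oo: round up to an unbounded P-bit float."""
    return ceil(x / ulp(x)) * ulp(x)

# Python's % on Fraction is exactly the floored mod of Definition 4.1,
# including the sign convention for negative denominators.
for t, z in [(Fraction(3), Fraction(10)),
             (Fraction(-7), Fraction(5)),
             (Fraction(9, 8), Fraction(100))]:
    q = z / t
    assert t * rd_inf(q) == z - z % (t * ulp(q))
    assert t * ru_inf(q) == z + (-z) % (t * ulp(q))
print("round-trip identities hold")
```

Exact rationals matter here: with binary64 arithmetic the intermediate products would themselves round, masking exactly the error terms under study.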
By transforming the problem of finding plausible numbers into an integer optimization problem, we make it much more straightforward to analyze. Even so, the optimization problem presented in Theorem 4.23 is unusual, since what is being optimized is a remainder. However, as we saw earlier in Fig. 2, there are patterns in the round-trip error. In the next section, we will make use of them to bound the error and thus solve the optimization problem efficiently.

5 Minimizing Remainders of a Constant Divided by a Variable

In the previous section, we reduced the problem of finding factors to an optimization problem over remainders where the dividend is constant but the divisor is not. The underlying problem is not specific to the floating-point numbers, and we can state it generally as follows:

Problem 3 Given a rational x and an interval I containing zero, what are the integers n such that (x mod n) ∈ I? Specifically, given a fixed integer m, what is the least (or greatest) such n where n ≥ m (or n ≤ m)?

Naively, this problem can be solved by exhaustive search on the divisors in at most |x| + 1 trials, but this is clearly impractical. We do not have a provably efficient solution to the general problem, and it may well be the case that one does not exist at all. However, the specific case we are interested in can be solved in a constant number of arithmetic operations.

To better grasp the problem, let us first observe how the remainder varies with the divisor. Plotting (−1000 mod n) in Fig. 3, we see parabolas near n = 30, as well as between n = 20 and n = 25. After n = 40, a pattern of "stepped" parabolas emerges, with an increasing number of "steps" later on. Finally, the parabolas start disappearing entirely around n = 60 and give way to a linear pattern. As one might expect from rounding being expressible in terms of remainders, these patterns are also visible in Fig. 2. This behavior may seem an unusual coincidence at first, but it occurs for any choice of numerator.
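The naive exhaustive search just mentioned is at least trivial to write down, and serves as a correctness baseline for what follows. A throwaway sketch (names ours):

```python
def least_divisor_with_mod_in(x: int, lo: int, hi: int, m: int,
                              limit: int = 10**6):
    """Naive search for Problem 3: the least n >= m with (x mod n) in [lo, hi].
    Python's % is the floored mod of Definition 4.1. Exhaustive, and hence
    impractical for large |x|; the point of this section is to avoid it."""
    for n in range(m, m + limit):
        if n != 0 and lo <= x % n <= hi:
            return n
    return None

# With I = [0, 0] and an integer numerator, this is divisor-finding,
# i.e. the factorization special case noted below:
print(least_divisor_with_mod_in(1000, 0, 0, 3))   # 4: the least divisor >= 3
print(least_divisor_with_mod_in(-1000, 0, 0, 3))  # also 4
```

The `limit` cutoff is only a guard for the sketch; a faithful solver for Problem 3 would instead exploit the segment structure described next.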
Algebraically, the reason is remarkably simple. By the definition of the mod operator, we have

(−1000 mod n) = −1000 − n⌊−1000/n⌋.

Thus, if ⌊−1000/n⌋ is constant over a certain interval, then (−1000 mod n) is linear in n over that interval. If ⌊−1000/n⌋ changes linearly, then (−1000 mod n) is instead quadratic in n over that interval. This relationship can be clearly seen in Fig. 3, comparing the plots above and below the x-axis. (Note also that Problem 3 generalizes a hard problem: if x is a positive integer and I = [0, 0], it is equivalent to factorization.)

Fig. 3 Graphs of (−1000 mod n) and ⌊−1000/n⌋, respectively above and below the x-axis. The upper graph follows a parabola whenever the lower graph behaves linearly (e.g. around n = 30). Similarly, linear segments in the upper graph correspond to constant segments in the lower (e.g. near n = 80).

The "stepped parabolas" correspond to values of n where −1000/n changes nearly linearly but with slope of magnitude less than 1, so that ⌊−1000/n⌋ does not change at every step. (Specifically, such a segment corresponds to the linear-quotient case below with d = 1/w, where w is the number of integers m satisfying k = ⌊a/m⌋ for the relevant value of k. We can see this below the x-axis in Fig. 3, where w is the "width" of a level; e.g. around n = 80, we have w = 7.) More formally, consider integers a, n, i such that n and n + i are nonzero, and let q = ⌊a/n⌋:

Constant quotient If ⌊a/(n + i)⌋ = q, then

(a mod n + i) = a − (n + i)q = (a mod n) − iq.

Linear quotient If there is some integer d such that ⌊a/(n + i)⌋ = q + di, then

(a mod n + i) = a − (n + i)(q + di) = (a mod n) − (di² + (q + nd)i).

Note that the above still holds if we permit the variables to take arbitrary real values. In this way we can also find parameters corresponding to each parabola inside the "stepped" segments. In our work, we will focus on linear and quadratic segments. These are especially useful, since we already have the tools to work with them. The following observations are key:
1. Given a linear or quadratic segment, we can directly check it for a satisfying divisor by solving a linear or quadratic equation, respectively.
2. The graph of remainders can always be partitioned into such segments.
3. Therefore, we can search for a solution one segment at a time, rather than one divisor at a time.

With a good (that is, small) partition, we can shrink the search space significantly. However, this is still an exhaustive search, so proving the absence of a solution requires checking every segment. Furthermore, it is not clear that it is necessarily possible to construct a partition that is asymptotically smaller than the magnitude of the dividend. For the specific case we are interested in, however, the number of segments does not matter: for a suitable partition, we can show that the remainder at either the start or the end of any segment always satisfies the bounds. More specifically, according to Theorem 4.23, we are interested in remainders of the form (−ab mod n) or (ab mod n) for nonzero integers a, b, n. We now study the solution space under these special conditions.

5.1 Conditions for a Solution via Extrapolation

Given nonzero integers a and b, we construct an appropriate partition using a function f : R* → R such that f(x) = (−x² − |ab| mod x). Clearly, we have f(n) = (−|ab| mod n) for all nonzero integers n. Since the numerator is a quadratic itself, the interval between each pair of consecutive roots of f is a quadratic segment (Fig. 4, bottom). Now, note the following:

1. A segment contains solutions if and only if the value of least magnitude within it satisfies the bounds.
2. In each segment, the value of least magnitude is either at the start or the end of the segment.
3. The start and the end of each segment lie between two consecutive roots of f, and are obtained by rounding those roots.
Therefore, it suffices to prove that if f(x) = 0, then either (−ab mod ⌊x⌋) or (−ab mod ⌈x⌉) lies within the bounds. Note that a linear mapping does not suffice: although it fits better when x is large (this can be seen in Fig. 4), it does not provide sufficiently strong bounds when x is small.

The next result further explains the behavior of real extensions of remainders. In particular, we see that a quadratic extension makes quadratic segments have constant quotients.

Lemma 5.1 Let I be an interval not containing zero and let f be a real-valued function continuous over I. If (f(x) mod x) ≠ 0 for all x ∈ I, then ⌊f(x)/x⌋ = ⌊f(y)/y⌋ for all x, y ∈ I.

Proof We proceed by contraposition. Suppose ⌊f(x)/x⌋ ≠ ⌊f(y)/y⌋ for some x, y ∈ I. Without loss of generality, we assume that ⌊f(x)/x⌋ < ⌊f(y)/y⌋. Let q : I → R be a function such that q(t) = f(t)/t. Then, since f is continuous over I, so is q. If q(y) ≤ q(x), then ⌊q(y)⌋ ≤ ⌊q(x)⌋ by monotonicity, which is a contradiction. Therefore q(x) < ⌊q(y)⌋ ≤ q(y), and hence by the intermediate value theorem there is some w ∈ (x, y] ⊆ I such that q(w) = ⌊q(y)⌋. Thus f(w) is an integer multiple of w, and so (f(w) mod w) = 0, as desired.

Corollary 5.1.1 Let I be an interval not containing zero and let f be a real-valued function continuous over I. If (f(x) mod x) ≠ 0 for all x ∈ I, then (f(x) mod x) = f(x) − xq for all x ∈ I, where q is the integer such that q = ⌊f(y)/y⌋ for all y ∈ I.

(Note that a partition into such segments always exists: for instance, we can trivially assign each pair of consecutive divisors its own segment.)

Fig. 4 Graphs of (−1000 mod x) and (−x² − 1000 mod x). The dots indicate points where x is an integer.

The following lemma allows us to express the remainders of interest as quadratics.

Lemma 5.2 Let a, b ∈ Z and let f, g, h : R* → R be functions such that

f(t) = (−t² − |ab| mod t),
g(t) = (−ab mod t),
h(t) = −(ab mod t).

Let I be an interval not containing zero.
Let q = ⌊(−x² − |ab|)/x⌋ for some x ∈ I. If f(t) ≠ 0 for all t ∈ I, then for all n ∈ I ∩ Z,

f(n) = −n² − qn − |ab|,
g(n) = f(n) if ab ≥ 0, and g(n) = n − f(n) otherwise,
h(n) = g(n) − n.

Proof Suppose f(t) ≠ 0 for all t ∈ I and let n ∈ I ∩ Z. Then, by Lemma 5.1, we immediately have f(n) = −n² − qn − |ab|. Since n is an integer, we also have f(n) = (−|ab| mod n). Thus, since f(n) is nonzero, it follows that n does not divide ab. Therefore g(n) and h(n) are nonzero, and hence

g(n) − n = −(n − (−ab mod n)) = −(ab mod n) = h(n).

If ab ≥ 0, then f(n) = (−ab mod n) = g(n). If ab < 0, then f(n) = (ab mod n) = −h(n) = n − g(n), and thus g(n) = n − f(n), as desired.

We now proceed to demonstrate the conditions under which the roots of the quadratic mapping provide a solution. We first derive a simpler expression for the quadratic passing through a given root.

Lemma 5.3 Let a, b, δ ∈ R and let x ∈ R*. Then

−(x + δ)² + (x + δ)k − ab = −δ² − (x − ab/x)δ,

where k = (x² + ab)/x.

Remark Note that (−x² − ab mod x) = 0 if and only if (x² + ab)/x is an integer.

We now bound the magnitude of that quadratic in terms of the difference between the factors in the numerator.

Lemma 5.4 Let a, b, x ∈ R. If 1 < a + 1 ≤ |x| ≤ b − 1, then |x − ab/x| < b − a − 1.

Proof Let f : R* → R be a function such that f(t) = t − ab/t, and suppose 1 < a + 1 ≤ |x| ≤ b − 1. Since a and b are positive, we have ab > 0, and so f is strictly increasing over (−∞, 0) and over (0, +∞). Hence f(a + 1) ≤ f(|x|) ≤ f(b − 1). Now, since b − 1 and a are positive, we have

(b − a − 1) − f(b − 1) = (b − a − 1) − (b − 1) + ab/(b − 1) = ab/(b − 1) − a = a/(b − 1) > 0,

and hence b − a − 1 > f(b − 1). Similarly, since b and a + 1 are positive,

−(b − a − 1) − f(a + 1) = −(b − a − 1) − (a + 1) + ab/(a + 1) = ab/(a + 1) − b = −b/(a + 1) < 0,

and so −(b − a − 1) < f(a + 1). Therefore, |f(|x|)| < b − a − 1. If x > 0, then |x| = x and the result follows immediately.
If x < 0 instead, we have |f(|x|)| = |f(−x)| = |−f(x)| = |f(x)|, and so the result follows.

Remark We assume that a + 1 ≤ |x| ≤ b − 1 in order to obtain the necessary bound. When a and b are integers, if x = a or x = b, then (−ab mod x) = 0, which lies within the bounding interval by assumption.

The following lemma is an auxiliary result.

Lemma 5.5 Let x ∈ R and δ ∈ (0, 1). If δx and (δ − 1)(x + 1) are integers, then they are both nonzero and have opposite signs.

Proof Suppose δx and (δ − 1)(x + 1) are integers. If x = 0, then (δ − 1)(x + 1) = δ − 1 ∉ Z, which is a contradiction. If 0 < |x| ≤ 1, then δx ∉ Z, which is also a contradiction. Therefore |x| > 1, and hence x + 1 is nonzero and has the same sign as x. Therefore, since δ − 1 is negative, it follows that δx and (δ − 1)(x + 1) are nonzero and have opposite signs.

We now show that rounding a root of the quadratic extension gives a solution for a sufficiently wide interval. However, we assume that the factors in the numerator are within a factor of two of each other; otherwise, the bound from Lemma 5.4 grows too quickly to be useful. Luckily, this assumption always holds when β = 2, which is by far the most common case.

Lemma 5.6 Let a, b ∈ Z and let x ∈ R*. Let I be an interval containing zero. If 0 < a ≤ |x| ≤ b, (−x² − ab mod x) = 0, b ≤ 2a, and diam I ≥ a, then for some x̂₁, x̂₂ ∈ {⌊x⌋, ⌈x⌉},

• either (−ab mod x̂₁) ∈ I or −(ab mod x̂₁) ∈ I, and
• either (ab mod x̂₂) ∈ I or −(−ab mod x̂₂) ∈ I.

Proof Suppose 0 < a ≤ |x| ≤ b, (−x² − ab mod x) = 0, b ≤ 2a, and diam I ≥ a. We shall first dispose of some trivial cases. If x ∈ Z, then ⌊x⌋ = ⌈x⌉ = x, so

(−ab mod x) = (ab mod x) = (−x² − ab mod x) = 0 ∈ I,

and hence the result. Since 0 ∈ I, if a < |x| < a + 1 instead, then either |⌊x⌋| = a or |⌈x⌉| = a, and hence the result holds trivially. Similarly, if b − 1 < |x| < b, then either |⌊x⌋| = b or |⌈x⌉| = b, and the result follows trivially again.
Suppose x ∉ Z and a + 1 < |x| < b − 1 instead, and let f : R → R be a function such that f(t) = −t² + kt − ab, where k = (x² + ab)/x. Then, since (−x² − ab mod x) = 0, it follows that k is an integer, and thus f(t) is an integer whenever t is an integer. Now, let δ = ⌈x⌉ − x and y = ab/x − x − δ. Then, by Lemma 5.3,

f(⌈x⌉) = −⌈x⌉² + k⌈x⌉ − ab
       = −(x + δ)² + k(x + δ) − ab
       = −δ² − (x − ab/x)δ
       = δ(ab/x − x − δ)
       = δy.

Since x is not an integer, it follows that ⌈x⌉ − ⌊x⌋ = 1 and hence ⌊x⌋ = x + δ − 1. Thus, by Lemma 5.3 again,

f(⌊x⌋) = −⌊x⌋² + k⌊x⌋ − ab
       = −(x + (δ − 1))² + k(x + (δ − 1)) − ab
       = −(δ − 1)² − (x − ab/x)(δ − 1)
       = (δ − 1)(ab/x − x − (δ − 1))
       = (δ − 1)(y + 1).

Now, since b ≤ 2a, by Lemma 5.4 we have |ab/x − x| < b − a − 1 ≤ a − 1. Hence, applying the triangle inequality,

|f(⌈x⌉)| = |δy| < |ab/x − x − δ| ≤ |ab/x − x| + δ < |ab/x − x| + 1 < a,

and

|f(⌊x⌋)| = |(δ − 1)(y + 1)| < |ab/x − x − (δ − 1)| ≤ |ab/x − x| + (1 − δ) < |ab/x − x| + 1 < a,

and

|f(⌈x⌉) − f(⌊x⌋)| = |δy − (δ − 1)(y + 1)| = |y + 1 − δ| = |ab/x − x + 1 − 2δ| ≤ |ab/x − x| + |1 − 2δ| < |ab/x − x| + 1 < a.

Hence the distance between each of f(⌈x⌉), f(⌊x⌋) and 0 is strictly less than a ≤ diam I. Let c = min R and d = max R where R = {f(⌊x⌋), f(⌈x⌉)}. Then diam [c, d] = d − c < a. Since f(⌊x⌋) and f(⌈x⌉) are both integers and δ ∈ (0, 1), it follows by Lemma 5.5 that δy and (δ − 1)(y + 1) are nonzero and have opposite signs, and hence c < 0 < d. Thus both [c, d] and I contain zero, and since diam [c, d] < a ≤ diam I, we have either c ∈ I or d ∈ I by Lemma 2.3, and therefore either f(⌊x⌋) ∈ I or f(⌈x⌉) ∈ I. Similarly, since [−d, −c] also contains zero and has the same diameter as [c, d], we also have either −f(⌊x⌋) ∈ I or −f(⌈x⌉) ∈ I.

Now, in the following, note that we have both |f(⌊x⌋)| < a < |⌊x⌋| and |f(⌈x⌉)| < a < |⌈x⌉|, which drastically simplifies working with mod.

• Suppose f(⌈x⌉) and x have the same sign.
Then −f(⌊x⌋) also has the same sign, and −f(⌈x⌉) and f(⌊x⌋) have the opposite sign. Therefore,

f(⌈x⌉) = (f(⌈x⌉) mod ⌈x⌉) = (−ab mod ⌈x⌉),
−f(⌊x⌋) = (−f(⌊x⌋) mod ⌊x⌋) = (ab mod ⌊x⌋),

and thus

−f(⌈x⌉) = −(−ab mod ⌈x⌉),
f(⌊x⌋) = −(ab mod ⌊x⌋).

Hence either (−ab mod ⌈x⌉) ∈ I or −(ab mod ⌊x⌋) ∈ I, and also either (ab mod ⌊x⌋) ∈ I or −(−ab mod ⌈x⌉) ∈ I.

• Suppose instead f(⌈x⌉) and x have opposite signs. Then f(⌊x⌋) and −f(⌈x⌉) have the same sign as x, and hence −f(⌊x⌋) has the opposite sign of x. Therefore,

f(⌊x⌋) = (f(⌊x⌋) mod ⌊x⌋) = (−ab mod ⌊x⌋),
−f(⌈x⌉) = (−f(⌈x⌉) mod ⌈x⌉) = (ab mod ⌈x⌉),

and thus

−f(⌊x⌋) = −(−ab mod ⌊x⌋),
f(⌈x⌉) = −(ab mod ⌈x⌉).

Hence either (−ab mod ⌊x⌋) ∈ I or −(ab mod ⌈x⌉) ∈ I, and also either (ab mod ⌈x⌉) ∈ I or −(−ab mod ⌊x⌋) ∈ I.

Therefore, for some x̂₁, x̂₂ ∈ {⌊x⌋, ⌈x⌉}, we have either (−ab mod x̂₁) ∈ I or −(ab mod x̂₁) ∈ I, and also either (ab mod x̂₂) ∈ I or −(−ab mod x̂₂) ∈ I.

Combining the above result with Theorem 4.23, we can finally show that, for equitable rounding functions, roots of the quadratic extension give a bound on plausible integral significands. With the help of Lemma 5.2, this will let us compute plausible numbers efficiently in base 2.

Theorem 5.7 Let Z ⊆ F∗ be a floating-point interval, and let z ∈ Z ∩ F∗ and x ∈ F∗. Let f : R∗ → R be a function such that f(t) = (−t² − |M_z| β^(p−k) mod t), where

k = 0 if |M_x| > |M_z|, and k = 1 otherwise,
M_x = x/ulp(x),
M_z = z/ulp(z).

Let r ∈ R∗ be a root of f. If fl is equitable, r ≥ M_x, and either β^(p−1) < |M_z| ≤ 2β^(p−1) or β = 2, then φ_Z(x) ≤ ⌈r⌉ ulp(x).

Proof Suppose the conditions hold and let I = (fl⁻¹[Z] − z)/ulp(z) and M = φ_Z(x)/ulp(x). We divide the proof into two major cases based on the value of M_z, and then handle the different values of k within those cases.

Suppose β^(p−1) < |M_z| ≤ 2β^(p−1).
Then, since fl is equitable, by definition we have

diam fl⁻¹[Z] ≥ diam fl⁻¹[{z}] ≥ β^Q(z) ≥ ulp(z),

and therefore,

diam I β^(p−k) = β^(p−k) diam ((fl⁻¹[Z] − z)/ulp(z)) = (β^(p−k)/ulp(z)) diam fl⁻¹[Z] ≥ β^(p−k).

If k = 0, then |M_z| < |M_x|, so Lemma 4.13 implies that x is plausible for Z, and thus φ_Z(x) = x = M_x ulp(x) ≤ r ulp(x) ≤ ⌈r⌉ ulp(x).

Suppose k = 1 instead, and let a = β^(p−1) and b = |M_z|. Then b ≤ 2a = 2β^(p−1) by assumption and diam I β^(p−k) ≥ a. We now consider each sign of M_z:

• Suppose M_z > 0. Then b = M_z, and hence by Lemma 5.6 we have either −(ab mod r̂) = −(M_z β^(p−k) mod r̂) ∈ I β^(p−k) or (−ab mod r̂) = (−M_z β^(p−k) mod r̂) ∈ I β^(p−k) for some r̂ ∈ {⌊r⌋, ⌈r⌉}, and hence M ≤ r̂ ≤ ⌈r⌉ by Theorem 4.23.

• Suppose M_z < 0. Then b = −M_z, and by Lemma 5.6 we have either (ab mod r̂) = (−M_z β^(p−k) mod r̂) ∈ I β^(p−k) or −(−ab mod r̂) = −(M_z β^(p−k) mod r̂) ∈ I β^(p−k) for some r̂ ∈ {⌊r⌋, ⌈r⌉}, and hence M ≤ r̂ ≤ ⌈r⌉ by Theorem 4.23.

Therefore M ≤ ⌈r⌉ in both cases, and multiplying by ulp(x), we obtain φ_Z(x) ≤ ⌈r⌉ ulp(x).

Finally, suppose β = 2 and |M_z| = β^(p−1) instead. If k = 1, then |M_x| = |M_z|, so x is trivially plausible for Z and hence φ_Z(x) = x ≤ ⌈r⌉ ulp(x) immediately. Suppose k = 0 instead, and let a = |M_z| and b = β^(p−k). Then, since β = 2, we have b = 2a. Since fl is equitable,

diam fl⁻¹[Z] ≥ diam fl⁻¹[{z}] ≥ β^(Q(z)−1) ≥ ulp(z)/β,

and therefore,

diam I β^(p−k) = β^p diam ((fl⁻¹[Z] − z)/ulp(z)) = (β^p/ulp(z)) diam fl⁻¹[Z] ≥ β^(p−1) = a.

We now complete the proof using Lemma 5.6 and Theorem 4.23. By the same argument as earlier, we have M ≤ r̂ ≤ ⌈r⌉ for some r̂ ∈ {⌊r⌋, ⌈r⌉}, and thus φ_Z(x) ≤ ⌈r⌉ ulp(x), as desired.

Remark If r is the least root of f greater than M_x, then f is nonzero over (M_x, r). Thus, according to Lemma 5.2, the restrictions of M ↦ −(M_z β^(p−k) mod M) and M ↦ (−M_z β^(p−k) mod M) to (M_x, r) ∩ Z are quadratics. Hence, if neither x nor ⌊r⌋ ulp(x) is plausible for Z, then we necessarily have φ_Z(x) = ⌈r⌉ ulp(x).
Otherwise, φ_Z(x) ≤ ⌈r⌉ ulp(x), so we can compute φ_Z(x) by solving the quadratics from Lemma 5.2 to find the earliest point where they lie in the interval (fl⁻¹[Z] − z)/ulp(z).

As discussed above, combining the bound given by Theorem 5.7 with Lemma 5.2 allows us to solve Problem 1 for normal numbers in binary floating-point arithmetic. We next turn to assembling our results into an algorithm.

6 Algorithms

Theorem 3.5 states that the classical division-based algorithm can be used to produce optimal bounds for floating-point arithmetic if we can devise some way of finding the feasible floating-point numbers nearest any given infeasible number. If the bounds and their quotients are within the range of the normal numbers, Theorem 4.22 tells us that we can do this by instead finding the nearest plausible numbers. Finally, for equitable rounding functions in base 2 (and other bases under certain conditions), Theorem 5.7 allows us to efficiently find the next plausible number. Putting these together, we can derive an algorithm for efficiently computing optimal bounds on the factors of a binary floating-point multiplication.

Algorithms 1 to 6 form a complete pseudocode implementation of the results of this paper. They are annotated with the properties they satisfy and justifications for them. The proofs of correctness for Algorithms 4 to 6 require some additional minor results, which are included in Appendix A. We thus obtain the following:

Theorem 6.1 Algorithms 1 to 6 are correct. Algorithms 3 to 6 can be computed in O(1) arithmetic operations. Algorithms 1 and 2 can be computed in O(1) arithmetic operations and computations of the preimage of a floating-point interval under fl.

Remark As in Lemma 2.25, for typical choices of fl, these algorithms have time complexity O(M(p) + log(e_max − e_min + 1)), where M(p) is the time complexity of multiplying two p-digit integers.
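The root-rounding mechanism behind this bound can be checked numerically. The sketch below (our own illustration; the values of a, b and the range of k are arbitrary, not taken from the paper) picks real roots x of (−x² − ab mod x), i.e. points where k = (x² + ab)/x is an integer, and verifies the prediction of Lemma 5.6 for the symmetric interval (−a, a): one of the residues at ⌊x⌋ or ⌈x⌉ is within a of zero. Python's % is the floored mod used throughout the paper.

```python
import math

# Numerical sanity check of the root-rounding step (Lemma 5.6):
# if x is a real root of (-x^2 - ab mod x) -- equivalently, if
# k = (x^2 + ab)/x is an integer -- then one of the residues
# (-ab mod m), -(ab mod m) at m = floor(x) or m = ceil(x) lies
# within a of zero.  All constants are arbitrary illustrations.

a, b = 100, 150              # 0 < a <= b <= 2a, as in the lemma
ab = a * b
for k in range(245, 251):    # k = x + ab/x for x in [a, b]
    disc = k * k - 4 * ab
    if disc < 0:
        continue
    x = (k + math.sqrt(disc)) / 2      # real root of t^2 - k*t + ab
    if not (a <= x <= b):
        continue
    candidates = []
    for m in (math.floor(x), math.ceil(x)):
        candidates += [(-ab) % m, -(ab % m)]
    # at least one candidate residue lands in (-a, a)
    assert any(abs(v) < a for v in candidates)
```

The check exercises several roots between a and b; the assertion holding on all of them is exactly the "rounding a root gives a solution" behaviour that Theorem 5.7 turns into a bound on φ_Z(x).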
We again do not take the base, precision or exponent range as constants in our analysis.

According to Theorem 3.5, provided that the preconditions hold, applying Algorithm 1 to the bounds produced by the traditional algorithm gives optimal bounds on the factors. This takes O(1) arithmetic operations and computations of fl⁻¹, which is far below the exponential worst case we proposed in Conjecture 2.27 for merely iterating the classical algorithm.

Although the preconditions of Algorithm 1 almost always hold when β = 2, they must still be checked. Otherwise, even though the algorithm always returns either a feasible number or +∞, the value returned may not in fact be the next feasible number (or, in the case of +∞ being infeasible, the problem instance may still be satisfiable).

In the event that some or all of the inputs are subnormal, or we are working in a higher base, we still have some options for using these results. By Lemma 4.18, the next plausible number is always a lower bound on the next feasible number, and Algorithm 2 works correctly for subnormal input, since there is no such distinction over the unbounded floating-point numbers. Furthermore, it can be shown that Algorithm 2 returns a lower bound on the next plausible number even in higher bases. However, we make no claims about the efficiency of iterating Algorithm 2 to find the next feasible number under such conditions.

Now, as we demonstrated in Example 2.29, iterating the classical algorithm is insufficient to find optimal bounds on solution sets in the general case. That is, given a problem instance (X, Y, Z) where X, Y, Z ⊆ F are floating-point intervals, iterating F until a fixed point is reached may not produce optimal bounds on Z. We proposed solving this by using binary search to find the greatest and least values in Z for which the instance is satisfiable. (As a reminder, an instance is satisfiable if and only if its optimal bounds are nonempty.)
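Before the pseudocode, it helps to pin down the core search problem concretely. The sketch below is a brute-force reference (our own testing aid, not one of the paper's algorithms) for the specification of Algorithm 3 below: the least m ≥ n whose residue (−ab mod m) or −(ab mod m) lands in an interval, here encoded as a pair (lo, hi).

```python
# Brute-force reference ("oracle") for the divisor search that the
# O(1) algorithms below implement: the least m >= n such that
# (-a*b mod m) in [lo, hi] or -(a*b mod m) in [lo, hi].
# Python's % is the floored mod used in the paper.  The function
# name and the (lo, hi) interval encoding are our own choices.

def next_divisor_in_bounds_ref(a, b, n, lo, hi, limit=10**6):
    ab = a * b
    for m in range(n, n + limit):
        if lo <= (-ab) % m <= hi or lo <= -(ab % m) <= hi:
            return m
    return None  # no solution found within the search limit
```

Under the preconditions of Algorithm 3 (0 < |a| ≤ |n| ≤ |b| ≤ |2a|, an interval containing zero with diameter at least |a|), Lemma 5.6 guarantees that a solution exists near the first root, so the loop terminates quickly; the oracle is nonetheless useful for cross-checking the constant-time implementation on random instances.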
Naively, this search takes O(p + log(e_max − e_min + 1)) satisfiability queries. However, we can do better. Lemma 3.8 says that if we have at least β values in Z, then almost any value is feasible for Z. Thus, in most cases, we only need to consider the first and last β values of Z to obtain optimal intervals. When β = 2, this means that we typically only need to trim at most one value from each endpoint of Z. Note that this situation only occurs when the problem instance is satisfiable; otherwise, our optimal bounds on X and Y will be empty, and thus so will the optimal bounds on Z.

Algorithm 1 Finding the next feasible number.
Require: fl equitable; normal floating-point number x; floating-point interval Z; z ∈ Z such that β^(e_min) ≤ |z/x| ≤ max F; either β = 2 or 1 < |z/ufp(z)| ≤ 2
Ensure: returns inf Feas(Z) ∩ [x, +∞]
1: function NextFeasible(x, z, Z)
2:   if x ∈ Feas(Z) then                ▷ using Lemma 3.6
3:     return x
4:   end if
5:   x_b := NextPlausible(x, z, Z)      ▷ by Theorem 4.22, if a solution exists, it is either x_b or x_c
6:   if x_b ∈ Feas(Z) then              ▷ check that x_b ∈ F and then use Lemma 3.6
7:     return x_b
8:   end if
9:   x_c := NextPlausible(|x_b|, z, Z)
10:  if x_c ∈ Feas(Z) then
11:    return x_c
12:  end if
13:  return +∞                          ▷ Feas(Z) ∩ [x, +∞] is empty by Theorem 4.22
14: end function

Algorithm 2 NextPlausible
Require: fl equitable; x ∈ F∗; floating-point interval Z; nonzero and finite z ∈ Z; either β = 2 or 1 < |z/ufp(z)| ≤ 2
Ensure: returns φ_Z(x)
1: function NextPlausible(x, z, Z)
2:   M_x := x/ulp(x)
3:   M_z := z/ulp(z)
4:   I := (fl⁻¹[Z] − z)/ulp(z)
5:   if |M_x| > |M_z| then
6:     a := M_z
7:     b := β^p
8:     I := I β^p
9:   else
10:    a := β^(p−1)
11:    b := M_z
12:    I := I β^(p−1)
13:  end if
14:  M := NextDivisorInBounds(a, b, M_x, I)    ▷ by Theorems 4.23 and 5.7
15:  return M ulp(x)
16: end function

7 Conclusion

In this paper, we have rigorously presented the first step towards an efficient solution to the problem of computing optimal interval bounds on the variables
of the floating-point constraint x ⊗ y = z. The algorithms presented here are sound, complete, and efficient for binary floating-point arithmetic involving normal numbers. This has immediate implications

Algorithm 3 NextDivisorInBounds
Require: integers a, b, n such that 0 < |a| ≤ |n| ≤ |b| ≤ |2a|; interval I containing zero such that diam I ≥ |a|
Ensure: returns min {m ∈ Z | m ≥ n ∧ ((−ab mod m) ∈ I ∨ −(ab mod m) ∈ I)}
1: function NextDivisorInBounds(a, b, n, I)
2:   if (−ab mod n) ∈ I ∨ −(ab mod n) ∈ I then
3:     return n
4:   end if
5:   r := NextQuadraticModLinearRootFloor(−1, −|ab|, n)
6:   if (−ab mod r) ∉ I ∧ −(ab mod r) ∉ I then
7:     return r + 1              ▷ by Lemma 5.6, since there are no solutions in [n, r]
8:   end if
9:   q := (−n² − |ab|)/n         ▷ solution is in [n, r]; find minimum using Lemma 5.2
10:  if ab > 0 then
11:    c := NextQuadraticPointWithinBounds(−1, −⌈q⌉, −ab, n, I)
12:    d := NextQuadraticPointWithinBounds(−1, −⌊q⌋, −ab, n, I)
13:  else
14:    c := NextQuadraticPointWithinBounds(1, ⌊q⌋, −ab, n, I)
15:    d := NextQuadraticPointWithinBounds(1, ⌈q⌉, −ab, n, I)
16:  end if
17:  return min {c, d}
18: end function

Algorithm 4 NextQuadraticModLinearRootFloor
Require: nonzero integers a, c, n
Ensure: returns ⌊min {x | x ∈ [n, +∞) ∧ (ax² + c mod x) = 0}⌋
1: function NextQuadraticModLinearRootFloor(a, c, n)
2:   q := (an² + c)/n
3:   R₁ := QuadraticRootsFloor(a, −⌊q⌋, c)
4:   R₂ := QuadraticRootsFloor(a, −⌈q⌉, c)
5:   return min {m ∈ R₁ ∪ R₂ | m ≥ n}        ▷ by Lemma A.4
6: end function

Algorithm 5 NextQuadraticPointWithinBounds
Require: integers a, b, c, n where a ≠ 0; interval I
Ensure: returns inf {m ∈ Z | m ≥ n ∧ am² + bm + c ∈ I}
1: function NextQuadraticPointWithinBounds(a, b, c, n, I)
2:   if an² + bn + c ∈ I then
3:     return n
4:   end if
5:   I := I ∩ Z
6:   if I = ∅ then
7:     return +∞
8:   end if
9:   R₁ := QuadraticRootsFloor(a, b, c − max I)
10:  R₂ := QuadraticRootsFloor(a, b, c − min I)
11:  R := R₁ ∪ R₂
12:  R := {r + 1 | r ∈ R}         ▷ solutions are ceilings of roots per Lemma A.2
13:  return min {m ∈ R
∪ R | m ≥ n ∧ am² + bm + c ∈ I} ∪ {+∞}
14: end function

Algorithm 6 QuadraticRootsFloor
Require: integers a, b, c where a ≠ 0
Ensure: returns {⌊x⌋ | x ∈ R ∧ ax² + bx + c = 0}
1: function QuadraticRootsFloor(a, b, c)
2:   if a < 0 then
3:     a := −a
4:     b := −b
5:     c := −c
6:   end if
7:   Δ := b² − 4ac
8:   if Δ < 0 then
9:     return ∅
10:  end if
11:  r₁ := ⌊(−b + ⌊√Δ⌋)/(2a)⌋      ▷ by applying Lemma A.1 to the quadratic formula
12:  r₂ := ⌊(−b − ⌈√Δ⌉)/(2a)⌋
13:  return {r₁, r₂}
14: end function

for floating-point decision procedures based on interval constraint solving. We hope that the techniques developed here will be useful in future study of floating-point theory. As suggested earlier, a number of questions remain open in this topic:

• What is the optimal time complexity for computing feasible numbers in general (i.e. in all bases and with subnormals)? Can we asymptotically outperform brute-force search? Does an efficient algorithm exist for this task?
• Although the algorithms in this paper are relatively brief, the work needed to develop them is much longer. Is there a shorter derivation of this solution?
• The results of this paper rely on a mixture of integer and floating-point reasoning. Can these results be stated naturally using only floating-point numbers? If so, is there a version of this algorithm which uses only floating-point arithmetic?
• Separately, Sect. 5 contains results which may be useful for solving mod constraints beyond the scope of this paper. Can they be applied more broadly?

Acknowledgements This research was supported by an Australian Government Research Training Program (RTP) Scholarship.
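As a closing illustration, the floor-of-roots computation in Algorithm 6 translates directly into exact integer arithmetic. The Python sketch below is our own rendering, not the paper's code: math.isqrt gives ⌊√Δ⌋ without floating-point error, and Lemma A.1 (⌊⌊x⌋/n⌋ = ⌊x/n⌋ for n > 0) justifies replacing √Δ by its floor (respectively ceiling) inside the outer floor, since Python's // is floored division.

```python
import math

# Exact integer sketch of QuadraticRootsFloor: the floors of the real
# roots of a*x^2 + b*x + c = 0 with integer coefficients.

def quadratic_roots_floor(a, b, c):
    if a < 0:                            # normalise so that a > 0
        a, b, c = -a, -b, -c
    disc = b * b - 4 * a * c
    if disc < 0:
        return set()                     # no real roots
    s = math.isqrt(disc)                 # floor of sqrt(disc), exact
    s_up = s if s * s == disc else s + 1 # ceiling of sqrt(disc)
    r1 = (-b + s) // (2 * a)             # floor of the larger root
    r2 = (-b - s_up) // (2 * a)          # floor of the smaller root
    return {r1, r2}
```

For example, x² − 5x + 6 has roots 2 and 3, and x² − 2 has roots ±√2 with floors 1 and −2; both come out exactly, with no rounding error even for coefficients far beyond double precision.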
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A Additional Proofs

This section contains some additional results used to show the correctness of Algorithms 4 to 6. There is also a proof of a minor property on the digital expansions of floating-point quotients.

Lemma A.1 Let x ∈ R and n ∈ Z. If n > 0, then ⌊⌊x⌋/n⌋ = ⌊x/n⌋.

Proof Suppose n > 0 and let δ₁ = x − ⌊x⌋ and δ₂ = (⌊x⌋ mod n). Since n is a positive integer and ⌊x⌋ is an integer, we have 0 ≤ δ₂ ≤ n − 1. Hence 0 ≤ δ₁ + δ₂ < n and thus ⌊(δ₁ + δ₂)/n⌋ = 0. By the definition of mod, we also have (⌊x⌋ − δ₂)/n = ⌊⌊x⌋/n⌋, and thus

⌊x/n⌋ = ⌊(⌊x⌋ − δ₂ + δ₁ + δ₂)/n⌋
      = ⌊⌊x⌋/n⌋ + ⌊(δ₁ + δ₂)/n⌋
      = ⌊⌊x⌋/n⌋.

Lemma A.2 Let f be a continuous real function and let n ∈ Z and a, b ∈ R. Let M = {m ∈ Z | m ≥ n ∧ f(m) ∈ [a, b]}. If M is nonempty, then either min M = n or min M = ⌈x⌉ for some x ∈ R such that either f(x) = a or f(x) = b.

Proof If f(n) ∈ [a, b], the result follows trivially. Suppose instead that M is nonempty and m > n where m = min M. Then a ≤ f(m) ≤ b and either f(m − 1) < a or f(m − 1) > b by the minimality of m. Suppose f(m − 1) < a.
Then, by the intermediate value theorem, there is some x ∈ (m − 1, m] such that f(x) = a. Since ⌈x⌉ = m, the result follows. Suppose f(m − 1) > b instead. Then, similarly, there is some x ∈ (m − 1, m] such that f(x) = b. Since ⌈x⌉ = m again, we are finished.

Lemma A.3 Let a, c ∈ R∗ and let q : R∗ → R be a function such that q(t) = (at² + c)/t. For all x ∈ R∗, there exists some y ∈ [x, +∞) such that q is continuous over [x, y] and q(y) ∈ {⌊q(x)⌋, ⌈q(x)⌉}.

Proof In the following, note that q is continuous over its domain and q(t) = at + c/t for all t ∈ R∗. We proceed by cases.

Let x ∈ R∗ and suppose x is positive. Suppose a is also positive. Then

lim_{t→∞} q(t) = lim_{t→∞} at + lim_{t→∞} c/t = ∞ + 0 = +∞,

and thus q grows without bound. Since it is also continuous over (0, +∞), it therefore attains ⌈q(x)⌉ at some y ∈ [x, +∞), and hence the result follows.

Suppose a is negative instead. Then lim_{t→∞} q(t) = −∞ similarly, and hence q has no lower bound. Since it is continuous over (0, +∞), we thus have q(y) = ⌊q(x)⌋ for some y ∈ [x, +∞), and hence the result follows.

Now, suppose x is negative instead. Suppose c is positive. Then

lim_{t→0⁻} q(t) = lim_{t→0⁻} at + lim_{t→0⁻} c/t = 0 + (−∞) = −∞,

and therefore q has no lower bound over (−∞, 0). Since it is also continuous over (−∞, 0), we thus have q(y) = ⌊q(x)⌋ for some y ∈ [x, 0), as desired.

Suppose c is negative instead. Then lim_{t→0⁻} q(t) = +∞, and hence q has no upper bound over (−∞, 0). Since q is continuous over (−∞, 0), it follows that q(y) = ⌈q(x)⌉ for some y ∈ [x, 0).

Therefore the result holds in all cases, and we are done.

Lemma A.4 Let a, c ∈ R∗ and let q : R∗ → R be a function such that q(t) = (at² + c)/t. Let x ∈ R∗ and R = {t ∈ [x, +∞) | q(t) ∈ Z}. Then R has a minimum, and either q(r) = ⌊q(x)⌋ or q(r) = ⌈q(x)⌉, where r = min R.

Proof Let S be a set such that S = {t ∈ [x, +∞) | q(t) ∈ {⌊q(x)⌋, ⌈q(x)⌉}}.
Clearly, S ⊆ R. For all t ∈ R∗ and u ∈ R, we have q(t) = u if and only if at² − ut + c = 0. Therefore, for any particular u, there are at most 2 choices of t that satisfy the equation, and hence S has at most 4 elements. By Lemma A.3, there is some y ∈ [x, +∞) such that q is continuous over [x, y] and q(y) ∈ {⌊q(x)⌋, ⌈q(x)⌉}, and so y ∈ S. Therefore S is nonempty and finite, and hence has a minimum.

Now, let s = min S. Since S ⊆ R, we thus have s ∈ R. Let t ∈ R ∩ [x, s]. Then q(t) is an integer by definition, and hence either q(t) ≥ ⌈q(x)⌉ or q(t) ≤ ⌊q(x)⌋. Therefore, since q is continuous over [x, y] and thus also continuous over [x, t], the intermediate value theorem implies that either q(u) = ⌈q(x)⌉ or q(u) = ⌊q(x)⌋ for some u ∈ [x, t]. Therefore u ∈ S, and hence s ≤ u by definition. Since u ≤ t ≤ s, we thus have s = t = u and hence R ∩ [x, s] = {s}, so min R = s. By the definition of S, we have either q(s) = ⌊q(x)⌋ or q(s) = ⌈q(x)⌉, so we are done.

Corollary A.4.1 Let a, c ∈ R∗ and let q : R∗ → R be a function such that q(t) = (at² + c)/t. For all x ∈ R∗,

min {t ∈ [x, +∞) | (at² + c mod t) = 0} = min {t ∈ [x, +∞) | q(t) ∈ {⌊q(x)⌋, ⌈q(x)⌉}}.

Lemma A.5 For all x, y ∈ F∗∞, if β is prime, then either x/y ∈ F∞ or x/y has no finite representation in base β.

Proof Let x, y ∈ F∗∞ and suppose β is prime and x/y has a finite representation in base β. We shall prove that x/y ∈ F∞. Let M_x = x/ulp(x) and M_y = y/ulp(y). Then M_x/M_y has a finite base-β representation, and thus M_x/M_y = n/β^k for some n, k ∈ Z. If k ≤ 0, then

|n| = |M_x/M_y| β^k ≤ |M_x/M_y| ≤ |M_x| < β^p.

Suppose k > 0 instead, and suppose without loss of generality that β does not divide n and that M_x/M_y is in lowest terms. Then, since n = M_x β^k / M_y is an integer and M_x has no common factors with M_y, it follows that β^k / M_y is an integer.
Therefore, since β is prime and M_y is an integer, |M_y| is a power of β. However, since β is not a factor of n, it follows that |M_y| = β^k, and thus |n| = |M_x| < β^p again. Therefore M_x/M_y = n/β^k ∈ F∗∞ by definition in all cases, and so x/y ∈ F∞.

Appendix B Glossary

B.1 List of Terms

diameter  the greatest distance between any pair of a set's elements
exponent  the integer e in ±d₁.d₂…d_p × β^e
equitable  see Definition 3.11; roughly, a rounding function under which no floating-point number has an especially small preimage; alternatively, a rounding function which does not allocate too many reals to any floating-point number
faithful  a rounding function is faithful if it always rounds to either the nearest number above or below (i.e. does not jump over floating-point numbers)
feasible  x is feasible for Z if x is a floating-point factor of some z ∈ Z
floating-point factor  x is a floating-point factor of z if x ⊗ y = z for some floating-point number y
floating-point interval  a set of consecutive floating-point numbers
floating-point number  see Definition 2.6; roughly, a number of the form ±d₁.d₂…d_p × β^e
integral significand  the significand multiplied by β^(p−1), making it an integer
normal  a floating-point number which can be written without leading zeros with an exponent between e_min and e_max; equivalently, a floating-point number with magnitude no less than β^(e_min)
plausible  the unbounded floating-point numbers x and y are plausible for Z if fl(xy) ∈ Z
quantum  the place value of a 1 in the last place of a floating-point number
quantum exponent  the exponent of the quantum (the quantum is always a power of the base)
rounding function  a function from R to F
significand  the ±d₁.d₂…d_p part of ±d₁.d₂…
d_p × β^e
subnormal  nonzero, but smaller than any normal number
unit in first place  the place value of a 1 in the first digit of a number written in base-β scientific notation; equally, β^e where e is the exponent of the number
unit in last place  the place value of a 1 in the last digit of a number written in base-β precision-p scientific notation; equally, β^(e−p+1) where e is the exponent of the number

B.2 List of Symbols

β  Base (alt. radix)
p  Precision
m  Significand
M  Integral significand
e  Exponent
q  Quantum exponent: satisfies q = e − p + 1
E(x)  Exponent of x
Q(x)  Quantum exponent of x
⊗  Rounded multiplication: x ⊗ y = fl(xy)
fl  Any nondecreasing and faithful rounding function
RD(x)  x rounded downward (i.e. to the nearest number below)
RU(x)  x rounded upward (i.e. to the nearest number above)
RN(x)  x rounded to the nearest number; no rule is specified for breaking rounding ties
fl⁻¹[Z]  Preimage of Z under rounding; the set of extended real numbers that round to some z ∈ Z
Feas(Z)  Set of all floating-point numbers feasible for Z
Plaus(Z)  Set of all unbounded floating-point numbers plausible for Z
φ_Z(x)  The least element of Plaus(Z) greater than or equal to x
ufp(x)  Unit in first place of x
ulp(x)  Unit in last place of x
F(X, Y, Z)  Classical interval-based narrowing of X, Y, Z; see Definition 2.22
diam X  Diameter of the set X
(x mod y)  Remainder of floored division of x by y
Z  Integers
Z∗  Nonzero integers
R  Real numbers
R∗  Nonzero real numbers
R̄  Real numbers with +∞ and −∞
R̄∗  Nonzero real numbers with +∞ and −∞
F∗  Nonzero finite floating-point numbers
F  Finite floating-point numbers
F̄  Finite floating-point numbers with +∞ and −∞
F∗∞  Nonzero finite unbounded floating-point numbers
F∞  Finite unbounded floating-point numbers
F̄∞  Finite unbounded floating-point numbers with +∞ and −∞

References

1.
Andrlon, M., Schachte, P., Søndergaard, H., Stuckey, P.J.: Optimal bounds for ﬂoating-point addition in constant time. In: 2019 IEEE 26th Symposium on Computer Arithmetic (ARITH), pp. 159–166 (2019). https://doi.org/10.1109/ARITH.2019.00038 2. Bagnara, R., Carlier, M., Gori, R., Gotlieb, A.: Exploiting binary ﬂoating-point representations for con- straint propagation. INFORMS J. Comput. 28(1), 31–46 (2016). https://doi.org/10.1287/ijoc.2015.0663 3. Barrett, G.: Formal methods applied to a ﬂoating-point number system. IEEE Trans. Softw. Eng. 15(5), 611–621 (1989). https://doi.org/10.1109/32.24710 4. Boldo, S., Melquiond, G.: Computer arithmetic and formal proofs: verifying ﬂoating-point algorithms with the Coq system. ISTE Press, London (2017). https://doi.org/10.1016/C2015-0-01301-6 5. Boldo, S., Melquiond, G.: Flocq: a uniﬁed library for proving ﬂoating-point algorithms in Coq. In: 2011 IEEE 20th Symposium on Computer Arithmetic, pp. 243–252 (2011). https://doi.org/10.1109/ARITH. 2011.40 6. Boldo, S., Muñoz, C.A.: A high-level formalization of ﬂoating-point number in PVS. NASA contractor report 2006-214298, NASA Langley Research Center, Hampton (2006). https://ntrs.nasa.gov/citations/ 20070003560. Accessed 04-08-2022 7. Brain, M., Niemetz, A., Preiner, M., Reynolds, A., Barrett, C., Tinelli, C.: Invertibility conditions for ﬂoating-point formulas. In: Dillig, I., Tasiran, S. (eds.) Computer Aided Veriﬁcation: 31st International Conference, CAV 2019, New York City, July 15–18, 2019, Proceedings, Part II. Lecture Notes in Computer Science, vol. 11562, pp. 116–136. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25543-5_8 8. Brain, M., Tinelli, C., Rümmer, P., Wahl, T.: An automatable formal semantics for IEEE-754 ﬂoating- point arithmetic. In: Proceedings of the 22nd IEEE Symposium on Computer Arithmetic, pp. 160–167. IEEE Computer Society, (2015). https://doi.org/10.1109/ARITH.2015.26 9. 
Carreño, V.A., Miner, P.S.: Speciﬁcation of the IEEE-854 ﬂoating-point standard in HOL and PVS. In: Higher Order Logic Theorem Proving and Its Applications. Category B proceedings, Aspen Grove, (1995). https://web.archive.org/web/20000919054331/ 10. Carreño, V.A.: Interpretation of IEEE-854 ﬂoating-point standard and deﬁnition in the HOL system. NASA technical memorandum 110189, NASA Langley Research Center, Hampton (1995). https://ntrs. nasa.gov/api/citations/19960008463/downloads/19960008463.pdf. Accessed 03- 08-2022 11. Daumas, M., Rideau, L., Théry, L.: A generic library for ﬂoating-point numbers and its application to exact computing. In: Boulton, R.J., Jackson, P.B. (eds.) Theorem Proving in Higher Order Logics: 14th 123 11 Page 52 of 52 M. Andrlon International Conference, TPHOLs 2001, Edinburgh, September 3–6, 2001. Proceedings. Lecture Notes in Computer Science, vol. 2152, pp. 169–184. Springer, Berlin (2001). https://doi.org/10.1007/3-540- 44755-5_13 12. Gallois-Wong, D., Boldo, S., Cuoq, P.: Optimal inverse projection of ﬂoating-point addition. Numer. Algorithms 83(3), 957–986 (2020). https://doi.org/10.1007/s11075-019-00711-z 13. Goldberg, D.: What every computer scientist should know about ﬂoating-point arithmetic. ACM Comput. Surv. 23(1), 5–48 (1991). https://doi.org/10.1145/103162.103163 14. Harrison, J.: A machine-checked theory of ﬂoating point arithmetic. In: Theorem Proving in Higher Order Logics:12th International Conference, TPHOLs ’99, Nice, France, September 14–17, 1999, Proceedings. Lecture Notes in Computer Science, vol. 1690, pp. 113–130. Springer, Berlin (1999). https://doi.org/10. 1007/3-540-48256-3_9 15. Harrison, J.: Floating point veriﬁcation in HOL. In: Thomas Schubert, E., Windley, P.J., Alves-Foss, J. (eds.) Higher order logic theorem proving and its applications: 8th International Workshop, Aspen Grove, September 11–14, 1995. Proceedings. Lecture Notes in Computer Science, vol. 971, pp. 186– 99. Springer, Berlin (1995). 
https://doi.org/10.1007/3-540-60275-5_65 16. Harrison, J.: Floating-point veriﬁcation using theorem proving. In: Bernardo, M., Cimatti, A. (eds.) Formal Methods for Hardware Veriﬁcation: 6th International School on Formal Methods for the Design of Computer, Communication, and Software Systems, SFM 2006, Bertinoro, May 22–27, 2006, Advances Lectures. Lecture Notes in Computer Science, vol. 3965, pp. 211–242. Springer, Berlin (2006). https:// doi.org/10.1007/11757283_8 17. Jacobi, C., Berg, C.: Formal veriﬁcation of the VAMP ﬂoating point unit. Formal Methods Syst. Des. 26(3), 227–266 (2005). https://doi.org/10.1007/s10703-005-1613-y 18. Jacobsen, C., Solovyev, A., Gopalakrishnan, G.: A parameterized ﬂoating-point formalizaton in HOL Light. Electron. Notes Theor. Comput. Sci. 317, 101–107 (2015). https://doi.org/10.1016/j.entcs.2015. 10.010 19. Loiseleur, P.: Formalisation en Coq de la norme IEEE 754 sur l’arithmétique à virgule ﬂottante. Research report, École normale supérieure, Paris, (1997).https://web.archive.org/web/20001003100321/http:// www.lri.fr/%7Eloisel/rapport-stage-dea.ps.gz 20. Melquiond, G.: Floating-point arithmetic in the Coq system. Inform. Comput. 216, 14–23 (2012). https:// doi.org/10.1016/j.ic.2011.09.005 21. Michel, C.: Exact projection functions for ﬂoating point number constraints. In: Seventh International Symposium on Artiﬁcial Intelligence and Mathematics. AI&M 15-2002, Fort Lauderdale (2002) 22. Miner, P.S.: Deﬁning the IEEE-854 ﬂoating-point standard in PVS. Technical Report NASA-TM-110167 (1995). https://ntrs.nasa.gov/citations/19950023402. Accessed 04-08-2022 23. Miner, P.S.: Formal speciﬁcation of IEEE ﬂoating-point arithmetic using PVS. IFAC Proc. Vol. 28(25), 31–36 (1995). https://doi.org/10.1016/S1474-6670(17)44820-8 24. Muller, J.M., Brunie, N., de Dinechin, F., Jeannerod, C.P., Joldes, M., Lefèvre, V., Melquiond, G., Revol, N., Torres, S.: Handbook of Floating-Point Arithmetic, 2nd edn. Birkhäuser, Cham (2018). https://doi. 
org/10.1007/978-3-319-76526-6 25. Pan, J., Levitt, K.N.: A formal speciﬁcation of the IEEE ﬂoating-point standard with application to the veriﬁcation of F. In: 1990 Conference Record Twenty-Fourth Asilomar Conference on Signals, Systems and Computers, 1990., vol. 1, p. 505 (1990). https://doi.org/10.1109/ACSSC.1990.523389 26. Rümmer, P., Wahl, T.: An SMT-LIB theory of binary ﬂoating-point arithmetic. In: Informal Proceedings of the 8th International Workshop on Satisﬁability Modulo Theories (SMT) at FLoC, Edinburgh p. 14 (2010) 27. Rump, S., Ogita, T., Oishi, S.: Accurate ﬂoating-point summation part I: faithful rounding. SIAM J. Sci. Comput. 31(1), 189–224 (2008). https://doi.org/10.1137/050645671 28. Std, I.E.E.E.: 754–2019: IEEE Standard for Floating-Point Arithmetic. Technical report, IEEE (2019) 29. Yu, L.: A formal model of IEEE ﬂoating point arithmetic. Archive of Formal Proofs (2013) 30. Ziv, A., Aharoni, M., Asaf, S.: Solving range constraints for binary ﬂoating-point instructions. In: Proceed- ings of the 16th IEEE Symposium on Computer Arithmetic. ARITH’03, pp. 158–164. IEEE Computer Society, Los Alamitos (2003). https://doi.org/10.1109/ARITH.2003.1207674 Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional afﬁliations.
Journal of Automated Reasoning – Springer Journals
Published: Mar 1, 2023
Keywords: Floating-point equation solving; Interval multiplication; Decision procedures