Finding Normal Binary Floating-Point Factors Efficiently

Mak Andrlon (mnazecic@student.unimelb.edu.au)
School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia

Abstract: Solving the floating-point equation x ⊗ y = z, where x, y and z belong to floating-point intervals, is a common task in automated reasoning for which no efficient algorithm is known in general. We show that it can be solved by computing a constant number of floating-point factors, and give a fast algorithm for computing successive normal floating-point factors of normal floating-point numbers in radix 2. This leads to an efficient procedure for solving the given equation, running in time of the same order as floating-point multiplication.

Keywords: Floating-point equation solving · Interval multiplication · Decision procedures

1 Introduction

Floating-point arithmetic has long been the prevailing system used for numerical computing, allowing efficient and precise calculation on modern hardware. Whilst it closely approximates real arithmetic, it does not always satisfy its laws. These sporadic exceptions often confuse programmers and complicate mathematical reasoning to the point that deciding even the simplest statements can take excessive amounts of time. Our interest stems from the area of formal software verification, where automated reasoning is used to prove that programs satisfy their specifications. Although modern verification systems can reason about floating-point arithmetic, it has much less support than integer or bit-vector arithmetic. This makes proving correctness far more difficult and drastically limits practical applications of formal methods. Our work seeks to fill these gaps.

Denoting floating-point multiplication by ⊗, we study the following problem:

Problem 1 Let X, Y, Z be intervals over the floating-point numbers and let V = {(x, y, z) | x ∈ X, y ∈ Y, z ∈ Z, x ⊗ y = z}. What is the smallest box (i.e. Cartesian product of intervals) enclosing V?
We can give a simplified example without getting into the details of floating-point arithmetic:

Example 1.1 For the sake of exposition, we will use decimal numbers with two decimal places, rounding the results of arithmetic operations downward. For example, since 0.50 · 1.01 = 0.505, we have 0.50 ⊗ 1.01 = 0.50, where ⊗ is rounded multiplication. Let x and y be two-place decimal numbers, and suppose x ⊗ y = 5.00 where 2.20 ≤ x ≤ 2.50 and 1.00 ≤ y ≤ 2.50. These bounds are clearly quite loose. For instance, the lower bound on y is too small: if x = 2.50 and y = 1.00, then x ⊗ y = 2.50, which is much less than the desired 5.00.

First, we construct an equivalent condition that uses real multiplication instead of floating-point multiplication. This can be accomplished with a simple transformation. Since the numbers that round downward to 5.00 are exactly those in [5.00, 5.01), it follows that x ⊗ y = 5.00 is equivalent to 5.00 ≤ xy < 5.01. By getting rid of the ⊗ operator, we can now reason using real arithmetic:

1. Multiplying x and y and combining their bounds, we get 2.20 ≤ xy ≤ 6.25.
2. This is worse than our existing bounds on xy, so we cannot improve them.
3. Dividing the bounds on xy by y, we obtain 5.00/2.50 ≤ x < 5.01/1.00.
4. This also does not improve our existing bounds on x.
5. Dividing by x instead, we obtain 5.00/2.50 ≤ y < 5.01/2.20.
6. We can thus improve our bounds on y to 2.00 ≤ y < 2.2772…, where the digits 72 repeat.

If we were allowed the full range of the reals, we would now be finished, stating that 5.00 ≤ xy < 5.01 where 2.20 ≤ x ≤ 2.50 and 2.00 ≤ y < 2.2772…. We can easily show that these bounds are tight for real multiplication:

• If x = 2.20 and y < 2.2772…, then xy < 5.01. If y ≥ 2.2772… instead, then xy ≥ 5.01.
• If x = 2.50 and y = 2.00, then xy = 5.00.

Having used each of the bounds on x, y and xy, we know that this set of bounds is tight on its own. However, we also require that x and y are decimal numbers with two decimal places.
Thus, since y < 2.2772…, we can further conclude that y ≤ 2.27. Since the bounds all have two decimal places now, we would ideally like to be done. If we check, though, we find that the bounds are no longer tight: if x = 2.20 and y = 2.27, then xy = 4.994, which is too small, and therefore x ≥ 2.21. We can continue tightening even further. If x = 2.21 instead and y = 2.27, then xy = 5.0167, which is too large. Therefore y ≤ 2.26, and so on. From this point, division does not help us as much as it did earlier. Since the product of the bounds on x and y is so close to the optimal result, all of the quotients differ by less than 0.01, and so the process is equivalent to an exhaustive search. After a few more iterations, we eventually reach the optimal solution: 2.33 ≤ x ≤ 2.50 and 2.00 ≤ y ≤ 2.15. This clearly satisfies our bounds, since 2.33 · 2.15 = 5.0095 ∈ [5.00, 5.01). However, proving optimality is tedious without further results, requiring us to iterate the above argument for every decimal until we reach the optimum.

The toy example above illustrates the limits of current floating-point interval reasoning. As demonstrated, the analogous problem over the reals is straightforward, and the steps in the example suggest a general algorithm. If we were working with the much larger IEEE 754 64-bit floating-point numbers instead, we could easily be forced to iterate over millions of values: for a concrete example, take 12738103310254127 ≤ x ≤ 12738103379848963, y unrestricted and z = 2 − 1 under rounding to nearest. Although x ⊗ y = z has no solution under these conditions, it will take 69,594,839 steps to prove this using the traditional interval-based approach. In general, the traditional algorithm struggles to prove unsatisfiability if the real relaxation of the problem is solvable, since the optimal intervals for an unsatisfiable system are empty.
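The optimal bounds claimed above for the toy system can be confirmed by brute force. The sketch below is a minimal model in plain Python (the helper names are ours, not the paper's): it implements two-place downward-rounded multiplication with exact rationals and exhaustively enumerates the solution set.

```python
from fractions import Fraction as Fr

def round_down_2dp(x):
    """Round a nonnegative rational down to two decimal places."""
    return Fr(int(x * 100), 100)

def otimes(x, y):
    """Toy rounded multiplication: two-place decimals, rounded downward."""
    return round_down_2dp(x * y)

# Since 0.50 * 1.01 = 0.505, rounding down gives 0.50:
assert otimes(Fr(50, 100), Fr(101, 100)) == Fr(50, 100)

# Exhaustively solve x (x) y = 5.00 with 2.20 <= x <= 2.50, 1.00 <= y <= 2.50.
sols = [(Fr(i, 100), Fr(j, 100))
        for i in range(220, 251)
        for j in range(100, 251)
        if otimes(Fr(i, 100), Fr(j, 100)) == Fr(5)]
xs = [x for x, _ in sols]
ys = [y for _, y in sols]

# The optimal box matches the bounds derived in the text:
assert (min(xs), max(xs)) == (Fr(233, 100), Fr(250, 100))  # 2.33 <= x <= 2.50
assert (min(ys), max(ys)) == (Fr(2), Fr(215, 100))         # 2.00 <= y <= 2.15
```

Such a search is only feasible because the toy format is tiny; as noted above, the analogous search over IEEE 754 64-bit numbers can require tens of millions of steps.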
By contrast, the method we present in this paper arrives at the answer immediately. We consider the following to be the primary contributions of this paper:

• An algorithm to efficiently compute the next normal floating-point factor of any floating-point number in radix 2.
  – This algorithm is also partially applicable in higher bases. We do not assume a particular radix in any of our results.
• When the above algorithm is applicable, an efficient procedure for finding optimal interval bounds on variables of floating-point constraints of the form x ⊗ y = z.

In addition, we believe the following secondary contributions are of interest:

• A demonstration that “round-trip” rounding errors have useful, nontrivial piecewise polynomial extensions to the reals.
• A characterization of upward and downward floating-point rounding in terms of the remainder of floored division.
• A floating-point predicate for deciding, in any base, whether a number is a floating-point factor of another floating-point number.
• Results on solving certain kinds of interval constraints over integer remainders.

1.1 Related Work

Although floating-point arithmetic has existed for a long time, work on solving floating-point equations and inequalities has been sporadic and has mostly been concerned with binary floating-point arithmetic. Michel [21] developed partial solutions for floating-point equations of the form x ◦ y = z where ◦ is one of rounded addition, subtraction, multiplication, or division, as well as a complete solution for rounded square root. His work contains an early description of the classical interval-based algorithm as it applies to the floating-point numbers. As an early example, only normal, binary floating-point numbers and the standard rounding modes were considered. Ziv et al. [30] showed that, under mild conditions, if z can take at least 2 different values, then the equation x ⊗ y = z has at least one solution in binary floating-point arithmetic.
However, their work does not treat other bases, nor the case where z is fixed. Bagnara et al. [2] gave some extremal bounds on x and y in terms of the bounds on z for binary floating-point arithmetic. However, the resulting bounds are very weak, since they do not account for any of the bounds on x or y. At present, there are no known efficient algorithms for computing the solution set of x ⊗ y = z in general. Indeed, in terms of similar efforts, the only basic operation which has been solved (apart from square root) is addition [1, 12]. Note that the classical algorithm is optimal for any strictly increasing (or decreasing) single-variable real function.

1.2 Outline

In Sect. 2, we introduce our notation and develop the formalisms used in this paper. We also give an account of the basic properties of floating-point arithmetic that we will need, and precisely define the classical interval-based algorithm. Due to the substantial amount of notation, a glossary is supplied for reference in Appendix B. In Sect. 3, we relate optimal intervals, the division-based algorithm and floating-point factors, and simplify the problem by narrowing down the conditions under which the division-based algorithm does not produce an optimal result. In Sect. 4, we then develop conditions for numbers being factors based on the rounding error in the product. Then, in Sect. 5, we show that we can control that rounding error under certain conditions and thus directly find floating-point factors. Combining these in Sect. 6, we give an algorithm to compute such factors efficiently and thus solve the problem. Appendix A contains some ancillary proofs used in the results of Sect. 6. In Sect. 7, we conclude and discuss open problems.

2 Preliminaries

Following convention, we denote the integers by Z, the reals by R, and the extended reals by R̄ = R ∪ {+∞, −∞}, and the same sets excluding zero by Z*, R* and R̄*. The power set of a set X is denoted by P(X).
We write f[X] for the image of X under a function f, and f⁻¹[X] for the preimage of X under f. To simplify exposition, we will denote the images of X × Y and X × {y} under a binary operator ◦ simply by X ◦ Y and X ◦ y, respectively. We will use such notation most frequently in the context of intervals.

We also discuss the size of intervals in terms of their width on the extended real line. To this end, we define a notion of diameter for subsets of the extended reals:

Definition 2.1 (Set diameter) For any X ⊆ R̄, we define

diam X = sup({d(x, y) | x, y ∈ X} ∪ {0}),

where d : R̄ × R̄ → R̄ is the function given by d(x, y) = 0 if x = y, and d(x, y) = |x − y| otherwise.

The diameter satisfies the following properties for all sets X ⊆ R̄:

diam ∅ = 0 (diameter of empty set) (1)
for all x ∈ R̄, diam {x} = 0 (diameter of singleton) (2)
if inf X < sup X, then diam X = sup X − inf X (3)
for all a ∈ R*, diam aX = |a| diam X (absolute homogeneity) (4)
for all b ∈ R, diam(X + b) = diam X (translation invariance) (5)

Importantly, the diameter of an interval is equal to the distance between its endpoints:

Lemma 2.2 For all a, b ∈ R̄, if a < b, then diam [a, b] = diam (a, b] = diam [a, b) = diam (a, b) = b − a.

Additionally, if two intervals overlap, then the larger one contains at least one of the endpoints of the smaller one:

Lemma 2.3 Let a, b ∈ R̄ and let I ⊆ R̄ be an interval. If [a, b] ∩ I is nonempty and diam I > b − a, then either a ∈ I or b ∈ I.

Since the subject of this work is solving equations similar to xy = z, we will often need to use division. However, division alone does not provide what we need. For instance, if y = z = 0, then any finite and nonzero x satisfies the equation, even though z/y = 0/0 is undefined. In order to succinctly describe these solution sets, we introduce a special “division” operator:

Definition 2.4 For any Y, Z ⊆ R̄, we define Z “/” Y = {x ∈ R̄ | ∃y ∈ Y (xy ∈ Z)}.
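The behaviour of this operator on finite sets of rationals can be modelled directly. The sketch below is a toy illustration in plain Python (the function name and the `all_reals` flag are our own encoding, not the paper's): it splits Z “/” Y into ordinary quotients by the nonzero denominators plus the special case 0 ∈ Y, 0 ∈ Z, which contributes every real x.

```python
from fractions import Fraction

def special_div(Z, Y):
    """Toy model of Z "/" Y (Definition 2.4) for finite sets of rationals.
    Returns (quotients, all_reals): the ordinary quotients by nonzero
    denominators, plus a flag for the case 0 in Y and 0 in Z, in which
    every real x satisfies x * 0 = 0 with 0 in Z."""
    quotients = {z / y for y in Y if y != 0 for z in Z}
    all_reals = 0 in Y and 0 in Z
    return quotients, all_reals

# If y = z = 0, every finite x is a solution, even though 0/0 is undefined:
q, allr = special_div({Fraction(0)}, {Fraction(0)})
assert q == set() and allr

# With nonzero denominators, "/" reduces to ordinary division (property-style):
q, allr = special_div({Fraction(6)}, {Fraction(2), Fraction(3)})
assert q == {Fraction(3), Fraction(2)} and not allr
```

This mirrors how the remark following the properties below computes Z “/” Y as an ordinary division together with a handful of special cases.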
The following properties then hold for all Y, Z ⊆ R̄ and a ∈ R*:

aZ “/” Y = Z “/” (Y/a), (6)
Z “/” (Y ∩ R*) = Z / (Y ∩ R*), (7)
Z “/” 0 = R if 0 ∈ Z, and Z “/” 0 = ∅ otherwise, (8)
Z “/” +∞ = (0, +∞] if +∞ ∈ Z and −∞ ∉ Z; [−∞, 0) if +∞ ∉ Z and −∞ ∈ Z; R̄* if +∞ ∈ Z and −∞ ∈ Z; and ∅ otherwise. (9)

Furthermore, for all Y, Y₁, Y₂, Z, Z₁, Z₂ ⊆ R̄, we also have the following properties:

Z “/” (Y₁ ∪ Y₂) = (Z “/” Y₁) ∪ (Z “/” Y₂), (10)
(Z₁ ∪ Z₂) “/” Y = (Z₁ “/” Y) ∪ (Z₂ “/” Y). (11)

Remark Note that, since Y = (Y ∩ R*) ∪ (Y \ R*), we have

Z “/” Y = (Z / (Y ∩ R*)) ∪ (Z “/” (Y \ R*)).

It is also worth noting that Y \ R* = Y ∩ {0, −∞, +∞}. This allows us to compute Z “/” Y as the union of an ordinary division and at most three special cases as described above.

We also define the box operator as follows:

Definition 2.5 (Box closure) A box is any Cartesian product of (extended) real intervals. For any set X ⊆ R̄ⁿ, we define its box closure box X as the unique smallest n-dimensional box containing X as a subset.

For any sets X, Y ⊆ R̄ⁿ we have:

X ⊆ box X (extensivity) (12)
if X ⊆ Y, then box X ⊆ box Y (monotonicity) (13)
box box X = box X (idempotence) (14)

Existence and uniqueness follow since every nonempty subset of R̄ is bounded and thus has both a supremum and an infimum. Furthermore, for all subsets A₁, …, A_n of R̄, where the product is the Cartesian product, we have the following property:

box (A₁ × ⋯ × A_n) = box A₁ × ⋯ × box A_n (distributivity) (15)

That is, the box of a product of sets is exactly the product of the boxes of those sets.

Remark A 1-dimensional box is an interval, a 2-dimensional box is a rectangle, a 3-dimensional box is a cuboid, and so on. Also note that a box does not necessarily contain its boundary: an open interval is still an interval, and therefore a box. However, if X is a finite subset of R, then box X = [min X, max X]. In general, if X is finite (or otherwise contains its boundary), then box X is a Cartesian product of closed intervals.
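For finite point sets, the box closure is just coordinatewise minima and maxima. A minimal sketch in plain Python (the function name is ours), checking the distributivity property on a small example:

```python
import itertools

def box(points):
    """Box closure of a finite set of n-dimensional points: the smallest
    Cartesian product of closed intervals containing them (the finite
    case of Definition 2.5)."""
    return [(min(d), max(d)) for d in zip(*points)]

# Distributivity (15): the box of a product of sets equals the product
# of the boxes of those sets.
A1, A2 = {1, 4, 2}, {-3, 0}
prod = list(itertools.product(A1, A2))
assert box(prod) == [(1, 4), (-3, 0)]
assert box(prod) == [(min(A1), max(A1)), (min(A2), max(A2))]
```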
2.1 Floating-Point Arithmetic

Much work has been done on specifying floating-point arithmetic in various formal systems [3–6, 8–11, 14–20, 22, 23, 25, 26, 29]. However, to the best of the author’s knowledge, there is no well-established and precise informal language of floating-point theory. For the sake of completeness and rigor, the rest of this section will be devoted to defining our terms. Readers interested in a deeper introduction may wish to consult a reference such as Muller et al. [24] or Goldberg [13].

2.1.1 Floating-Point Numbers, Briefly

A finite floating-point number is usually written in the style of scientific notation as ±d₁.d₂d₃⋯d_p × β^e, where each digit dᵢ lies in {0, …, β − 1}. The ±d₁.d₂d₃⋯d_p part is the significand, the number of digits p is the precision, β is the base (commonly 2 or 10), and e is the exponent. For example, with precision p = 3 and base β = 2, we can write the decimal number 0.75 as either 1.10 × 2⁻¹ or 0.11 × 2⁰. Of the two representations, the former is preferred, as it is normalized—that is, it has no leading zeros—and is therefore unique.

With a higher precision we can represent more numbers. For instance, we need p ≥ 4 to represent the binary number 1.101 exactly. Regardless of precision, however, there are still limits imposed by the choice of base. Just as 1/3 lacks a finite decimal expansion, for any given base there are numbers that cannot be represented. Notably, the decimal 0.1 is 0.000110011… in binary (the block 0011 repeats), and thus it has no representation in any binary floating-point system. As an example, if we take β = 2 and p = 3, then 1.10 × 2⁻⁴ and 1.11 × 2⁻⁴ are consecutive floating-point numbers, but they correspond to the decimal numbers 0.09375 and 0.109375, respectively. Consequently, the sum, product or ratio of two precision-p base-β floating-point numbers is rarely itself a precision-p base-β floating-point number, and thus computing with them typically involves accepting some degree of rounding error.
In ordinary usage, we minimize rounding error by rounding to nearest; that is, by choosing the floating-point number closest to the real number we are rounding. Other rounding modes such as rounding upward, downward, or towards zero (also known as truncation) are typically only used in situations where the direction of the rounding is more important than the accuracy of a single operation. For example, the simplest way to avoid underapproximating the set of true results in interval arithmetic is to always round upper bounds upward and lower bounds downward.

2.1.2 Definitions and Terminology

The arithmetic described in this work is compatible with and generalizes that of the IEEE 754 standard, with two exceptions: we omit signed zeros and “NaN” values, because they are essentially irrelevant to the problem but add clutter. Appendix B contains a summary of the definitions introduced here. We begin with a construction of the floating-point numbers.

Definition 2.6 (Floating-point numbers) A floating-point format is a quadruple of integers (β, p, e_min, e_max) such that β ≥ 2, p ≥ 1, and e_min ≤ e_max, where β is the base, p is the precision, e_min is the minimum exponent, and e_max is the maximum exponent. For convenience, we also define the minimum quantum exponent q_min = e_min − p + 1 and the maximum quantum exponent q_max = e_max − p + 1. Given these quantities, we define the set F* of finite nonzero floating-point numbers as follows:

F* = {M · β^(e−p+1) | M, e ∈ Z, 0 < |M| < β^p, e_min ≤ e ≤ e_max}.

From here, we define the set F of finite floating-point numbers and the set F̄ of floating-point numbers:

F = F* ∪ {0}, F̄ = F ∪ {−∞, +∞}.

To avoid repetition, from this point forward we will assume that the sets of floating-point numbers correspond to some floating-point format (β, p, e_min, e_max).
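For intuition, a small format can be enumerated directly from this definition. The sketch below (plain Python with exact rationals; the format parameters (2, 3, −2, 2) are chosen arbitrarily for illustration) builds F and checks its largest finite and smallest positive elements:

```python
from fractions import Fraction

def finite_floats(beta, p, emin, emax):
    """Enumerate F, the finite floating-point numbers of the format
    (beta, p, emin, emax), directly from Definition 2.6."""
    F = {Fraction(0)}
    for e in range(emin, emax + 1):
        for M in range(1, beta**p):            # 0 < |M| < beta^p
            x = Fraction(M) * Fraction(beta) ** (e - p + 1)
            F.update({x, -x})                  # closure under negation
    return F

F = finite_floats(2, 3, -2, 2)
qmax, qmin = 2 - 3 + 1, -2 - 3 + 1
assert max(F) == (2**3 - 1) * Fraction(2) ** qmax          # (beta^p - 1) beta^qmax = 7
assert min(x for x in F if x > 0) == Fraction(2) ** qmin   # beta^qmin = 1/16
```

The two assertions anticipate the closed-form expressions for the extreme values given just below.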
From the above definition, we immediately obtain the following:

Lemma 2.7 F*, F and F̄ are each closed under negation.

Furthermore, the largest finite floating-point number and the smallest positive floating-point number have simple expressions:

max F* = max F = (β^p − 1)β^(q_max) (16)
min (F̄ ∩ (0, +∞]) = β^(q_min) (17)

We further categorize the elements of F as follows:

Definition 2.8 (Normal and subnormal numbers) A nonzero finite floating-point number x ∈ F* is normal if |x| ≥ β^(e_min), or subnormal otherwise.

The normal numbers are so called because they each have a normalized representation in scientific notation, restricting the exponent to {e_min, …, e_max}. By contrast, a subnormal number can only be written without leading zeros by choosing an exponent below e_min. Note that we do not consider zero to be normal or subnormal.

A number written in scientific notation has an exponent and significand. Similarly, each finite floating-point number has an associated exponent and significand determined by the floating-point format. We will now define these concepts and generalize them to the finite reals.

Although positive and negative zero correspond to the same real value, they are distinct, since 1/+0 = +∞ and 1/−0 = −∞. However, the “/” operator allows us to sidestep the issue of division by zero. In IEEE 754 parlance, the “quantum” of a floating-point number is the value of a unit in the last digit of its significand, often referred to as the “unit in last place”. This is necessarily an integer power of β, and the quantum exponent is the exponent of that power.

Definition 2.9 (Exponent and significand) The functions E : R → Z and Q : R → Z are defined as follows:

E(x) = ⌊log_β |x|⌋ if |x| ≥ β^(e_min), and E(x) = e_min if |x| < β^(e_min);
Q(x) = E(x) − p + 1.

Given x ∈ R, the integers E(x) and Q(x) are the exponent and quantum exponent of x, respectively.
The significand of x is the number x·β^(−E(x)). If x·β^(−Q(x)) is an integer, then it is the integral significand [24] of x. To justify these names, we establish that these indeed compute the exponent of a number written as a product of a significand and a power of β, respecting the minimum exponent:

Lemma 2.10 (Exponent law) Let m ∈ R and e ∈ Z such that |m| < β and e ≥ e_min. If either |m| ≥ 1 or e = e_min, then E(mβ^e) = e. Otherwise, E(mβ^e) < e.

Proof First, note that if |mβ^e| ≥ β^(e_min), then

E(mβ^e) = ⌊log_β |mβ^e|⌋ = ⌊log_β |m| + e⌋ = ⌊log_β |m|⌋ + e.

We now proceed by cases:

• Suppose |m| ≥ 1. Then |mβ^e| ≥ β^e ≥ β^(e_min) and ⌊log_β |m|⌋ = 0, and thus E(mβ^e) = e.
• Suppose e = e_min and |m| < 1. Then |mβ^e| < β^(e_min) and so we have E(mβ^e) = e_min = e immediately by the definition of E.
• Suppose e > e_min and |m| < 1. If |mβ^e| ≥ β^(e_min), then ⌊log_β |m|⌋ < 0 and hence E(mβ^e) = ⌊log_β |m|⌋ + e < e. If |mβ^e| < β^(e_min) instead, then E(mβ^e) = e_min < e.

Since the result holds in all cases, we are finished.

Expanding on the above, the function E also satisfies the following properties for all x, y ∈ R:

E(x) = E(−x) (evenness) (18)
if |x| ≤ |y|, then E(x) ≤ E(y) (piecewise monotonicity) (19)
if x ∈ F, then e_min ≤ E(x) ≤ e_max (exponent bound) (20)
|x·β^(−E(x))| < β (significand bound) (21)
1 ≤ |x·β^(−E(x))| if and only if β^(e_min) ≤ |x| (normality) (22)

From the above properties, the definition of Q immediately implies the following:

Q(x) = Q(−x) (evenness) (23)
if |x| ≤ |y|, then Q(x) ≤ Q(y) (piecewise monotonicity) (24)
if x ∈ F, then q_min ≤ Q(x) ≤ q_max (quantum exponent bound) (25)
|x·β^(−Q(x))| < β^p (integral significand bound) (26)
β^(p−1) ≤ |x·β^(−Q(x))| iff β^(e_min) ≤ |x| (normality) (27)

Additionally, the following fact establishes an important connection between integral significands and the finite floating-point numbers:

if x ∈ F, then x·β^(−Q(x)) ∈ Z (integrality) (28)

In other words, every finite
floating-point number has an integral significand.

Definition 2.11 (Unit in first/last place [27]) The functions ufp, ulp : R → R are defined as follows:

ufp(x) = β^(⌊log_β |x|⌋) if x ≠ 0, and ufp(x) = 0 if x = 0;
ulp(x) = ufp(x)·β^(1−p).

For any x ∈ R, we refer to ufp(x) as the unit in first place of x, and to ulp(x) as the unit in last place of x.

For all x, y ∈ R and n ∈ Z, the following statements hold:

ufp(x) ≤ |x| (lower bound) (29)
if x ≠ 0, then ufp(x) ≤ |x| < β·ufp(x) (bounds) (30)
ufp(x) = |x| iff x = 0 or |x| = β^k for some k ∈ Z (exactness) (31)
ufp(x·β^n) = ufp(x)·β^n (scaling) (32)
ufp(x)·ufp(y) = ufp(x·ufp(y)) (product law) (33)
if y ≠ 0, then ufp(x)/ufp(y) = ufp(x/ufp(y)) (quotient law) (34)

There is also a close relationship between ufp and E. In particular, the base-β logarithm of the unit in first place is the exponent (if exponents were unbounded):

if |x| ≥ β^(e_min), then ufp(x) = β^(E(x)) (35)
if |x| < β^(e_min), then ufp(x) < β^(E(x)) (36)

The following result allows us to further simplify computing the unit in first place of a quotient by separating out the significands:

Lemma 2.12 For all x, y ∈ R*, we have

ufp(x/y) = ufp(m_x/m_y) · ufp(x)/ufp(y),

where m_x = x/ufp(x), m_y = y/ufp(y), and ufp(m_x/m_y) = 1/β if |m_x| < |m_y|, and 1 otherwise.

In a similar vein, we have the following lemma on the exponents of quotients:

Lemma 2.13 Let x, z ∈ R* and let m_x = x·β^(−E(x)) and m_z = z·β^(−E(z)). If β^(e_min) ≤ |z/x|, then

E(z/x) = E(z) − E(x) + log_β ufp(m_z/m_x).

Note that these definitions are independent of exponent range. In particular, readers should be aware that they do not treat subnormal numbers specially, unlike the Q and E functions. In effect, they assume unbounded exponents.

Proof Suppose β^(e_min) ≤ |z/x|. Then,

β^(E(z/x)) = ufp(z/x) = ufp((m_z·β^(E(z)))/(m_x·β^(E(x)))) = ufp(m_z/m_x)·β^(E(z)−E(x)).

Taking logarithms, we obtain the desired result.
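These functions are straightforward to prototype. A sketch in plain Python (toy format parameters assumed for illustration; the floor-of-log is computed by exact repeated division to avoid floating-point log pitfalls):

```python
from fractions import Fraction

BETA, P, EMIN = 2, 3, -2   # toy format parameters (our assumption)

def floor_log_beta(a):
    """Exact floor(log_BETA |a|) for a nonzero rational a."""
    a, e = Fraction(abs(a)), 0
    while a >= BETA:
        a /= BETA; e += 1
    while a < 1:
        a *= BETA; e -= 1
    return e

def E(x):
    """Exponent function from Definition 2.9 (clamped below at EMIN)."""
    return floor_log_beta(x) if abs(x) >= Fraction(BETA) ** EMIN else EMIN

def Q(x):
    return E(x) - P + 1

def ufp(x):
    """Unit in first place (Definition 2.11); note: no exponent clamping."""
    return Fraction(BETA) ** floor_log_beta(x) if x != 0 else Fraction(0)

def ulp(x):
    return ufp(x) * Fraction(BETA) ** (1 - P)

x = Fraction(3, 4)                           # 0.75 = 1.10_2 x 2^-1
assert E(x) == -1 and Q(x) == -3
assert ufp(x) == Fraction(1, 2) and ulp(x) == Fraction(1, 8)
assert x * Fraction(BETA) ** (-Q(x)) == 6    # integral significand, cf. (28)
```

Note how `ufp` has no `EMIN` clamp while `E` does, matching the remark above that ufp and ulp assume unbounded exponents.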
Having defined the basic structure of F̄, we now turn to the definition of the computational operations involving the floating-point numbers.

Definition 2.14 (Rounding) A rounding function is any function f : R̄ → F̄. The downward rounding function RD : R̄ → F̄ and the upward rounding function RU : R̄ → F̄ are defined as follows:

RD(x) = max {y ∈ F̄ | y ≤ x},
RU(x) = min {y ∈ F̄ | y ≥ x}.

A rounding function f is faithful if and only if f(x) ∈ {RD(x), RU(x)} for all x ∈ R̄.

It is easy to show that both RD and RU are nondecreasing functions. (Nondecreasing functions are also known as monotonic, (weakly) increasing, (weakly) order-preserving, or isotone.) In addition, they satisfy the following properties for all x ∈ R̄:

RD(x) ≤ x ≤ RU(x) (lower/upper bound) (37)
RU(−x) = −RD(x) (duality) (38)
x ∈ F̄ if and only if RU(x) = RD(x) (exactness) (39)

In addition, the following gap property holds for all x ∈ R̄:

RU(x) − RD(x) = ∞ if |x| > max F; 0 if x ∈ F̄; and β^(Q(x)) otherwise. (40)

We also give a partial definition of rounding to nearest, which is the default rounding mode used in practice:

Definition 2.15 (Rounding to nearest) A rounding function f rounds to nearest if |f(x) − x| = min_{y ∈ F} |y − x| for all x ∈ [min F, max F].

We denote by RN an arbitrary nondecreasing and faithful rounding to nearest. Note that this is not a complete definition! In particular, it does not give a tie-breaking rule for when x is the exact midpoint between two floating-point numbers. It also does not fully specify how to round values outside [min F, max F]. We leave these unspecified since IEEE 754 specifies two different round-to-nearest modes with different rules in these cases. The two tie-breaking rules used in IEEE 754 are “ties to even” (choose the result with an even integral significand) and “ties away from zero” (choose the larger magnitude result). The former is the default mode and must be supported. The latter mode need only be available for decimal systems.
Additionally, a “ties to zero” (truncation) rule is used for the “augmented” addition, subtraction and multiplication operations added in IEEE 754-2019 [28], which return the rounding error alongside the principal result.

In particular, even though +∞ and −∞ are infinitely far apart from any finite number, according to IEEE 754, they may still result from rounding finite values to nearest. More specifically, it states that the rounding to nearest of any x such that |x| ≥ β^(e_max)·(β − β^(1−p)/2) is the infinity with the same sign as x. Note that this bound is the exact midpoint of max F and β^(e_max+1).

We shall denote by fl an arbitrary nondecreasing faithful rounding function. This condition is satisfied by most practical rounding modes, including RD, RU, and rounding to nearest as defined above. The following properties hold for all x ∈ R̄:

x ∈ F̄ if and only if fl(x) = x (exactness) (41)
fl(fl(x)) = fl(x) (idempotence) (42)

Throughout this paper, we will frequently use the preimages of floating-point intervals under rounding. We will often rely on the following lemma:

Lemma 2.16 Let X ⊆ F̄. If X is a nonempty floating-point interval, then fl⁻¹[X] is an extended real interval such that fl⁻¹[X] ⊇ [min X, max X].

Proof Suppose X is a nonempty floating-point interval, and suppose to the contrary that fl⁻¹[X] is not an interval. Then there are some a, b ∈ fl⁻¹[X] and y ∈ R̄ such that a < y < b and y ∉ fl⁻¹[X]. Since fl is nondecreasing, it follows that fl(a) ≤ fl(y) ≤ fl(b). However, since fl(a), fl(b) ∈ X where X is a floating-point interval, we thus have fl(y) ∈ X. Contradiction! Hence fl⁻¹[X] is an interval.

Now, since fl is the identity over the floating-point numbers, fl(x) = x for all x ∈ X and hence fl⁻¹[X] ⊇ X. Since F̄ is finite, X is also finite, and hence [min X, max X] is the smallest interval containing X as a subset.
Therefore, since fl⁻¹[X] is an interval, it follows that fl⁻¹[X] ⊇ [min X, max X].

To ensure the generality of our results, we define floating-point multiplication in terms of fl rather than fixing a specific rounding function. (Although more precise results can sometimes be found by considering a specific rounding function, the limitations of the method described in this paper stem from the unpredictability of upward- and downward-rounded quotients, which are independent of fl.)

Definition 2.17 (Multiplication) The floating-point multiplication operator ⊗ is defined by x ⊗ y = fl(xy) for all x, y ∈ F̄ such that the product xy is defined.

Note that the extended real product is defined if and only if the factors do not include both zero and an infinity. In particular, the product of an extended real with a nonzero (finite) real number is always defined. As a consequence, however, R̄ is not closed under multiplication, and so neither is F̄. In fact, neither F nor F̄ are necessarily closed under floating-point multiplication, since it is possible for a sufficiently large product to round to infinity or a sufficiently small nonzero product to round to zero.

However, since fl is nondecreasing, we can still carry over some important ordering properties from the reals.

Lemma 2.18 Let a ∈ F̄ and x, y ∈ F̄ such that x ≤ y. If a > 0, then a ⊗ x ≤ a ⊗ y. If a < 0, then a ⊗ y ≤ a ⊗ x.

Since F̄ is a subset of R̄, it is totally ordered by the usual order relation. As such, there is a simple definition of an interval over F̄:

Definition 2.19 (Floating-point interval) A floating-point interval is any set X ∩ F̄ where X is an extended real interval.

With the exception of [−∞, +∞] ∩ F̄, due to the gaps between floating-point numbers, every floating-point interval has many representations in terms of extended real intervals.
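Definitions 2.14 and 2.17 can be prototyped directly over a small format. A minimal sketch in plain Python (the toy format and helper names are ours; exact rationals stand in for the reals):

```python
from fractions import Fraction

def finite_floats(beta=2, p=3, emin=-2, emax=2):
    """Finite floating-point numbers F of a small toy format."""
    F = {Fraction(0)}
    for e in range(emin, emax + 1):
        for M in range(1, beta**p):
            x = Fraction(M) * Fraction(beta) ** (e - p + 1)
            F.update({x, -x})
    return sorted(F)

F = finite_floats()

def RD(x):
    """Downward rounding: the greatest floating-point number <= x."""
    below = [y for y in F if y <= x]
    return max(below) if below else -float("inf")

def RU(x):
    """Upward rounding: the least floating-point number >= x."""
    above = [y for y in F if y >= x]
    return min(above) if above else float("inf")

def otimes(x, y, fl=RD):
    """Floating-point multiplication x (x) y = fl(xy), Definition 2.17."""
    return fl(x * y)

x = Fraction(3, 4)                       # 1.10_2 x 2^-1
assert otimes(x, x) == Fraction(1, 2)    # 0.5625 rounds down to 0.5
assert RU(x * x) == Fraction(5, 8)       # ...and up to 0.625
```

The two final assertions also illustrate the gap property (40): here RU(xx) − RD(xx) = 1/8 = β^(Q(xx)).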
However, since F̄ is finite, each of its nonempty subsets has a minimum and a maximum, making it trivial to find the smallest interval containing it. For all a, b ∈ R̄, we have the following:

if a, b ∈ F̄ and a ≤ b, then a, b ∈ [a, b] ∩ F̄ (tightness) (43)
[a, b] ∩ F̄ = [RU(a), RD(b)] ∩ F̄ (rounding) (44)

That is, if the bounds are in F̄, they are necessarily the tightest possible bounds, and intersecting an interval with F̄ corresponds to rounding its bounds “inward”.

2.2 The Classical Iterative Procedure

As shown in Example 1.1, we can use real division to correctly improve the bounds on the factors of a floating-point product. Although it did not find optimal bounds immediately, the division-based approach shown in the example is still essential to solving the problem efficiently, so let us describe it more rigorously.

In the example at the start, we reasoned in terms of inequalities. However, a general method will be easier to describe in terms of sets. Let X, Y, Z ⊆ F̄ be floating-point intervals and let x ∈ X. First, note that by the definition of floating-point multiplication, for all y ∈ Y, we have x ⊗ y ∈ Z if and only if xy ∈ fl⁻¹[Z]. This greatly simplifies things by allowing us to characterize x using extended real division in most cases. More precisely, using our augmented “division”, we have x ⊗ y ∈ Z for some y ∈ Y if and only if x ∈ fl⁻¹[Z] “/” Y. We relate this to our problem statement as follows:

Definition 2.20 (Solution sets and optimality) A problem instance is any triple (X, Y, Z) where X, Y, Z ⊆ F̄. The solution set of (X, Y, Z) is the set of triples V = {(x, y, z) | x ∈ X, y ∈ Y, z ∈ Z, x ⊗ y = z}. A problem instance is satisfiable if and only if its solution set is nonempty. The triple (X, Y, Z) has optimal bounds if and only if box V = X × Y × Z.

Lemma 2.21 Let X, Y, Z ⊆ F̄ and let V be the solution set of (X, Y, Z). Then box V = box X′ × box Y′ × box Z′, where

X′ = X ∩ (fl⁻¹[Z] “/” Y),
Y′ = Y ∩ (fl⁻¹[Z] “/” X),
Z′ = Z ∩ (X ⊗ Y).
Proof Let V₁, V₂ and V₃ be the sets containing, for each triple in V, its first, second, or third element, respectively. Then, box V = box V₁ × box V₂ × box V₃. Now, by the definition of “/”, it is easy to check that V₁ = X′, V₂ = Y′, and V₃ = Z′, and the result follows immediately.

Although useful, it is impractical to directly use the above result to solve the problem. Since Y is a floating-point interval, it has many gaps, and so it is very expensive to compute the set of quotients. These gaps were invisible in Example 1.1 because we were working with inequalities, rather than the exact sets. Translated in terms of sets, what we did corresponds to enlarging the set of denominators Y to the extended real interval box Y. By enlarging X and Z similarly, we obtain a precise definition of our algorithm:

Definition 2.22 (Classical iteration) We define the function F : P(R̄)³ → P(R̄)³ by F(X, Y, Z) = (X′, Y′, Z′) where

X′ = X ∩ (fl⁻¹[Z] “/” box Y),
Y′ = Y ∩ (fl⁻¹[Z] “/” box X),
Z′ = Z ∩ fl[(box X)(box Y)].

Since box Y is a superset of Y, it follows that fl⁻¹[Z] “/” box Y is a superset of fl⁻¹[Z] “/” Y, and so this substitution is sound, though it may not produce an optimal result. Importantly, however, a quotient of intervals is easy to compute and represent, since it must itself be either an interval or a union of two disjoint intervals.

Lemma 2.23 (Soundness) Let X, Y, Z ⊆ F̄ and let x ∈ X, y ∈ Y, z ∈ Z. If x ⊗ y = z, then x ∈ X′, y ∈ Y′, z ∈ Z′ where (X′, Y′, Z′) = F(X, Y, Z).

Proof Suppose x ⊗ y = z. Then, since x ⊗ y = fl(xy) ∈ fl[(box X)(box Y)] and z ∈ Z, we have z ∈ Z′ by definition. Since z ∈ Z, we also have xy ∈ fl⁻¹[Z] by definition. Thus, since y ∈ box Y, we have x ∈ fl⁻¹[Z] “/” box Y by definition, and hence x ∈ X′. Since x ∈ box X, we also similarly have y ∈ Y′, as desired.
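The division step of this iteration is just arithmetic on interval endpoints. A minimal sketch over the toy system of Example 1.1 (plain Python; there fl rounds down to two decimal places, so fl⁻¹[{5.00}] = [5.00, 5.01)):

```python
from fractions import Fraction as Fr

X = (Fr(220, 100), Fr(250, 100))        # 2.20 <= x <= 2.50
Y = (Fr(100, 100), Fr(250, 100))        # 1.00 <= y <= 2.50
Zlo, Zhi = Fr(500, 100), Fr(501, 100)   # fl^-1[{5.00}] = [5.00, 5.01)

# Y' = Y intersect (fl^-1[Z] "/" box X): divide the bounds on xy by the
# bounds on x (everything is positive here, so no sign cases arise).
ylo = max(Y[0], Zlo / X[1])   # 5.00 / 2.50
yhi = min(Y[1], Zhi / X[0])   # 5.01 / 2.20

assert ylo == Fr(2)           # steps 5-6 of Example 1.1: 2.00 <= y
assert yhi == Fr(501, 220)    # ... and y < 5.01/2.20 = 2.2772... (72 repeating)
```

This reproduces the single real-division pass of the example; the full iteration additionally intersects with F̄ (rounding the bounds inward, property (44)) and repeats until the sets stop shrinking.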
It is easy to prove that iterating F converges in finite time, since the sets involved are finite and we only shrink them:

Lemma 2.24 (Termination) For all X, Y, Z ⊆ F, the sequences of sets given by (X_1, Y_1, Z_1) = F(X, Y, Z) and (X_n, Y_n, Z_n) = F(X_{n−1}, Y_{n−1}, Z_{n−1}) for n > 1 are eventually constant.

Proof Since F is finite, X_1, Y_1 and Z_1 are finite. Therefore, since the definition of F implies that the sequences X_n, Y_n and Z_n are monotonically decreasing under the subset relation, they must be eventually constant.

Furthermore, individual applications of F can be computed quickly:

Lemma 2.25 (Time complexity) If X, Y, and Z are floating-point intervals or are each a disjoint union of an all-nonnegative and an all-nonpositive floating-point interval, then F(X, Y, Z) can be computed using O(1) arithmetic operations and computations of the preimage of a floating-point interval under fl.

Remark Although the number of arithmetic operations needed does not increase with the problem parameters, the individual cost of computing them does. In the sign-significand-exponent encoding, floating-point numbers (in any base) take up O(p + log(e_max − e_min + 1)) bits. If fl belongs to one of the usual families of rounding functions (e.g. RD, RU, or RN), both it and its preimage can be computed in time linear in the length of its input. Thus, floating-point addition can be computed in time linear in the length of the addends, and multiplying two floating-point numbers takes time O(M(p) + log(e_max − e_min + 1)), where M(p) is the time complexity of multiplying two p-digit integers. Correctly rounded division is more difficult, but it can be computed using Newton–Raphson division in the same asymptotic time as multiplication. Thus, for most choices of fl, computing F takes time O(M(p) + log(e_max − e_min + 1)).
Note that we do not take the base, precision or exponent range as constants in our analysis, as doing so tends to collapse the time complexity of floating-point algorithms to O(1), precluding any meaningful comparison. We also do not assume a specific multiplication algorithm, as there are many to choose from with different characteristics.

However, given that what we are computing is a relaxation of the original problem, it is unclear how quickly the iteration converges, or even whether it necessarily converges to the optimum. Certainly, it can be quite slow in some cases, as we saw in Example 1.1. Although we do not have a concrete answer, we offer the following conjecture on the rate of convergence:

Conjecture 2.26 Let X, Y, Z ⊆ F be floating-point intervals, and define the sequences of sets (X_1, Y_1, Z_1) = F(X, Y, Z) and (X_n, Y_n, Z_n) = F(X_{n−1}, Y_{n−1}, Z_{n−1}) for n > 1. Then, for all m > 2, the sets X_m \ X_{m+1}, Y_m \ Y_{m+1} and Z_m \ Z_{m+1} contain at most 4 members each.

That is, we conjecture that the rate of convergence is constant. The intuition is as follows: after two iterations, we will have propagated the bounds from one factor to the other, and back again. After that, however, the bounds are within one floating-point number of the optimum of the real relaxation, so there is no significant progress to be gained from interval reasoning. Instead, each step only removes at most one floating-point number from each end of each interval in the sets. (Each set after the first iteration may be a disjoint union of two intervals, if the initial intervals contain both positive and negative numbers.)

This is clearly rather slow, so we make a further conjecture: that, in the worst case, iterating F takes a number of steps exponential in the precision to converge:

Conjecture 2.27 Denote by F_n the floating-point numbers with format (β, n, e_min, e_max) for each precision n ≥ 2, and let fl_n be an associated nondecreasing, faithful rounding function.
Let F_n be F parameterized over fl_n. Let μ_n(X, Y, Z) = m be the least positive integer m such that F_n^{m+1}(X, Y, Z) = F_n^m(X, Y, Z), where F_n^k denotes F_n composed with itself k times and X, Y, Z ⊆ F_n. If f : {2, 3, ...} → Z is a function such that

f(n) = max {μ_n(X, Y, Z) | X, Y, Z are floating-point intervals over F_n},

then f(n) = Ω(β^{cn}) for some positive real c.

Remark Proving an upper bound is trivial, as it is of the same order of steps as brute-force search. The lower bound is tricky. In order to prove it, we essentially need to prove that bad (i.e. exponentially slow) instances exist in every precision. It would perhaps be easier to show that f is not dominated by any function of lesser order. For this, we merely need an infinite number of precisions with bad instances. A promising candidate is the set of all instances of the form (X, F, {β^p − 1}), where X is the widest floating-point interval centered on β^{p−1}(β^p − 1) such that the instance is unsatisfiable. (The unsatisfiable example we gave earlier for 64-bit IEEE floating-point numbers is one such instance.) Note that to obtain this complexity result rigorously, we must define a family of rounding functions suitably indexed by floating-point format, as well as a matching indexed variant of F.

Leaving aside the rate of convergence, we would also like to know what exactly we are converging to. We can show that whenever X and Y do not contain zero, convergence implies that the bounds on X and Y are optimal. In such cases, we obtain explicit solutions to the equation x ⊗ y = z where x or y is an endpoint of X or Y, respectively.

Lemma 2.28 (Conditional partial optimality) Let X, Y, Z ⊆ F. If Z is a floating-point interval, neither X nor Y contains zero, and F(X, Y, Z) = (X, Y, Z), then one of the following statements holds:

1. X, Y and Z are empty.
2.
The elements of X have the same sign as the elements of Y, the elements of Z are nonnegative, and min X ⊗ max Y ∈ Z and max X ⊗ min Y ∈ Z.
3. The elements of X have the opposite sign of the elements of Y, the elements of Z are nonpositive, and min X ⊗ min Y ∈ Z and max X ⊗ max Y ∈ Z.

Proof Suppose the conditions hold. Then, if any of X, Y, or Z are empty, the definition of F trivially implies that they are all empty, so the result holds. Suppose instead that X, Y and Z are nonempty. Since X is an interval not containing zero, it is a subset of either [−∞, 0) or (0, +∞], and thus the elements of X all have the same sign. Similarly, we also find that every element of Y has the same sign. We now proceed by cases.

Suppose that the elements of X have the same sign as the elements of Y. Then fl(xy) ≥ 0 for all x ∈ X and y ∈ Y where the product is defined. Therefore, by the definition of F, the elements of Z are all nonnegative.

1. Let x ∈ X and y ∈ Y be such that min X · y ∈ fl⁻¹[Z] and x · max Y ∈ fl⁻¹[Z]. (These exist by the definition of “/”.)
(a) Suppose min X > 0 and max Y > 0. Then, since min X ≤ x and y ≤ max Y, multiplying and rounding gives
min Z ≤ fl(min X · y) ≤ min X ⊗ max Y ≤ fl(x · max Y) ≤ max Z,
and since Z is a floating-point interval, we thus have min X ⊗ max Y ∈ Z.
(b) Suppose min X < 0 and max Y < 0 instead. Then, similarly, we obtain
fl(x · max Y) ≤ min X ⊗ max Y ≤ fl(min X · y),
and thus min X ⊗ max Y ∈ Z.
2. Let x ∈ X and y ∈ Y instead be such that max X · y ∈ fl⁻¹[Z] and x · min Y ∈ fl⁻¹[Z].
(a) Suppose max X > 0 and min Y > 0. Then, since x ≤ max X and min Y ≤ y,
fl(x · min Y) ≤ max X ⊗ min Y ≤ fl(max X · y),
and thus max X ⊗ min Y ∈ Z.
(b) Suppose max X < 0 and min Y < 0 instead. Then, since x ≤ max X and min Y ≤ y,
fl(max X · y) ≤ max X ⊗ min Y ≤ fl(x · min Y),
and thus max X ⊗ min Y ∈ Z.
Suppose instead that the elements of X have the opposite sign of the elements of Y. Then fl(xy) ≤ 0 for all x ∈ X and y ∈ Y, and thus the elements of Z are all nonpositive.

1. Let x ∈ X and y ∈ Y be such that min X · y ∈ fl⁻¹[Z] and x · min Y ∈ fl⁻¹[Z].
(a) Suppose min X > 0 and min Y < 0. Then, since min X ≤ x and min Y ≤ y, multiplying and rounding gives
min Z ≤ fl(x · min Y) ≤ min X ⊗ min Y ≤ fl(min X · y) ≤ max Z,
and therefore min X ⊗ min Y ∈ Z.
(b) Suppose min X < 0 and min Y > 0 instead. Then, similarly, we obtain
fl(min X · y) ≤ min X ⊗ min Y ≤ fl(x · min Y),
and thus min X ⊗ min Y ∈ Z.
2. Let x ∈ X and y ∈ Y instead be such that max X · y ∈ fl⁻¹[Z] and x · max Y ∈ fl⁻¹[Z].
(a) Suppose max X > 0 and max Y < 0. Then, since x ≤ max X and y ≤ max Y,
fl(max X · y) ≤ max X ⊗ max Y ≤ fl(x · max Y),
and so max X ⊗ max Y ∈ Z.
(b) Suppose max X < 0 and max Y > 0 instead. Then, similarly,
fl(x · max Y) ≤ max X ⊗ max Y ≤ fl(max X · y),
and thus max X ⊗ max Y ∈ Z.

Since we have exhausted all cases, the proof is complete.

Remark We can use this result in the general case by splitting X and Y into positive and negative parts, though doing this does mean that we may need to solve multiple smaller instances to solve the overall instance. Alternatively, we could reformulate F to split X and Y appropriately.

Remark Note that if there are no solutions, this implies that iterating F eventually proves it by converging to empty sets.

Now, although the above result is very informative, it does not tell us about the bounds on Z when F converges. We now show that, unfortunately, iterating F does not always produce optimal bounds:

Example 2.29 Consider β = 10, p = 2 and e_min ≤ 0 ≤ e_max. (That is, 1-place decimal floating-point numbers.) Let X = [1.2, 1.4] ∩ F, Y = [1.2, 1.3] ∩ F and Z = [1.5, 1.8] ∩ F, and suppose fl is rounding to nearest, ties to even. We first show that F(X, Y, Z) = (X, Y, Z).
In the following, note that fl⁻¹[Z] = (1.45, 1.85], and that X̄ and Ȳ denote the real intervals [min X, max X] and [min Y, max Y]:

fl⁻¹[Z] “/” Ȳ = (1.45, 1.85] / [1.2, 1.3] = (1.1153846…, 1.5416…] ⊇ X,
fl⁻¹[Z] “/” X̄ = (1.45, 1.85] / [1.2, 1.4] = (1.0357142…, 1.5416…] ⊇ Y,
fl[X̄Ȳ] = fl[[1.2, 1.4] · [1.2, 1.3]] = fl[[1.44, 1.82]] = [1.4, 1.8] ∩ F ⊇ Z.

Table 1  Real and rounded products of the elements of X and Y

 ·    1.2   1.3   1.4        ⊗    1.2   1.3   1.4
1.2   1.44  1.56  1.68      1.2   1.4   1.6   1.7
1.3   1.56  1.69  1.82      1.3   1.6   1.7   1.8

From the definition of F, the above confirms that F(X, Y, Z) = (X, Y, Z) indeed. It is also easy to verify that the bounds on X and Y are optimal, as per Lemma 2.28:

min X ⊗ max Y = fl(1.2 · 1.3) = fl(1.56) = 1.6 ∈ Z,
max X ⊗ min Y = fl(1.4 · 1.2) = fl(1.68) = 1.7 ∈ Z.

However, if we enumerate all the products of the members of X and Y, we find that none of them rounds to min Z = 1.5. From Table 1, we see that X ⊗ Y = {1.4, 1.6, 1.7, 1.8}, and so Z ∩ (X ⊗ Y) = [1.6, 1.8] ∩ F is a proper subset of Z that is in fact a floating-point interval. Therefore X × Y × Z is not the smallest box enclosing the solution set.

Example 2.29 demonstrates that the classical algorithm does not necessarily produce optimal results. However, there is a simple strategy we can use to remedy this. To find optimal bounds on Z—that is, to find the least and greatest values z ∈ Z such that x ⊗ y = z is satisfied by some x ∈ X and y ∈ Y—we can use binary search. Now, testing satisfiability is simple if we can find optimal bounds on any part: if a problem instance is unsatisfiable, then the optimal bounds are all empty sets. The key, then, is to find an efficient optimization method. In this paper, we develop such a procedure for the bounds on factors.

3 Simplifying the Problem

3.1 Sufficient and Necessary Conditions

In order to simplify the problem, we will need to precisely determine the conditions under which the division-based algorithm produces optimal results.
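The failure observed in Example 2.29 can also be checked mechanically. The sketch below uses an illustrative toy implementation of round-to-nearest-even (not the paper's code) to enumerate all products of X and Y and confirm that no pair rounds to 1.5:

```python
from fractions import Fraction as Q
from itertools import product

BETA, P = 10, 2  # the format of Example 2.29

def fl(x):
    """Round a positive rational to nearest, ties to even integral significand."""
    q = 0
    while x / Q(BETA) ** q >= BETA ** P:
        q += 1
    while x / Q(BETA) ** q < BETA ** (P - 1):
        q -= 1
    m = x / Q(BETA) ** q
    lo, hi = int(m), int(m) + 1
    if m - lo < hi - m or (m - lo == hi - m and lo % 2 == 0):
        return Q(lo) * Q(BETA) ** q
    return Q(hi) * Q(BETA) ** q

X = [Q('1.2'), Q('1.3'), Q('1.4')]
Y = [Q('1.2'), Q('1.3')]
Z = [Q('1.5'), Q('1.6'), Q('1.7'), Q('1.8')]

rounded_products = {fl(x * y) for x, y in product(X, Y)}  # the set X (⊗) Y
optimal_Z = sorted(set(Z) & rounded_products)             # Z ∩ (X ⊗ Y)
```

Running this yields rounded_products = {1.4, 1.6, 1.7, 1.8} and optimal_Z = [1.6, 1.7, 1.8], matching Table 1 and showing that min Z = 1.5 is unreachable.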
In the second half of Example 1.1, we encountered certain combinations of values for which the equation x ⊗ y = z has no solutions over the floating-point numbers. Unlike with the reals, there is no guarantee that we can find a value for x satisfying the equality for any given y and z. The existence of a solution is necessary for any optimal bound, however, and so the following definition will be useful:

Definition 3.1 (Floating-point factors) A floating-point number x is a floating-point factor of a floating-point number z if and only if x ⊗ y = z for some y ∈ F. Given a set Z ⊆ F, we say that x is feasible for Z if and only if x is a floating-point factor of some z ∈ Z. The set of all floating-point factors of the members of Z is denoted Feas(Z).

For the sake of brevity, we will typically simply write “factor” when it is clear from context that we are discussing floating-point factors. It is worth noting that sets of factors satisfy the following closure property:

Lemma 3.2 For any Z ⊆ F, the set Feas(Z) is closed under negation.

Proof Let Z ⊆ F. If Feas(Z) is empty, it is trivially closed under negation. Suppose Feas(Z) is nonempty instead, and let x ∈ Feas(Z). Then there is some y ∈ F such that x ⊗ y ∈ Z. Therefore,
−x ⊗ −y = fl(−x · −y) = fl(xy) = x ⊗ y ∈ Z,
and so −x ∈ Feas(Z).

Corollary 3.2.1 For any x ∈ F and Z ⊆ F, we have max {y ∈ Feas(Z) | y ≤ x} = −min {y ∈ Feas(Z) | y ≥ −x}.

Remark The above corollary implies that we only need one procedure to compute both the minimum and maximum of [a, b] ∩ Feas(Z) for any a, b ∈ F.

The following two lemmas are crucial. First, we prove that the real quotients from the division-based algorithm can still provide us with useful bounds.

Lemma 3.3 Let x ∈ F̄, and let Y, Z ⊆ F where Y is a floating-point interval. If fl(xy) ∈ Z for some real y with min Y ≤ y ≤ max Y, then
1. x ⊗ s ≥ min Z for some s ∈ Y, and
2. x ⊗ t ≤ max Z for some t ∈ Y.
Proof Suppose fl(xy) ∈ Z for some real y with min Y ≤ y ≤ max Y. Since min Y and max Y are floating-point numbers, by the monotonicity of rounding we have min Y ≤ RD(y) ≤ max Y and min Y ≤ RU(y) ≤ max Y. Since Y is a floating-point interval, it thus follows that RD(y) ∈ Y and RU(y) ∈ Y.

We shall first deal with the special cases of x being zero or infinite. Suppose x = 0. Since xy is defined by assumption, y must be finite, and hence xy = 0. Thus fl(xy) = 0. Since y is finite, at least one of RD(y) and RU(y) is finite, and thus either x · RD(y) = 0 or x · RU(y) = 0. Therefore either x ⊗ RD(y) = 0 or x ⊗ RU(y) = 0. Since fl(xy) = 0 and min Z ≤ fl(xy) ≤ max Z, the result follows.

Suppose x = +∞ or x = −∞ instead. Since xy is defined by assumption, y must be nonzero, and hence either xy = +∞ or xy = −∞. Therefore fl(xy) = xy. Since y is nonzero, at least one of RU(y) and RD(y) is nonzero. If RD(y) is nonzero, then it has the same sign as y and hence xy = x · RD(y). Similarly, if RU(y) is nonzero, then it has the same sign as y and hence xy = x · RU(y). Therefore either x ⊗ RD(y) = fl(xy) or x ⊗ RU(y) = fl(xy), and hence the result.

We now handle the remaining case of x being finite and nonzero. Suppose x ∈ F* instead. Then multiplication by x is always defined. Note that min Y ≤ y ≤ max Y and min Z ≤ fl(xy) ≤ max Z. We shall consider each sign of x separately:

• Suppose x is positive. Multiplying by x, it follows that x min Y ≤ xy ≤ x max Y, and by the monotonicity of rounding, we have x ⊗ min Y ≤ fl(xy) ≤ x ⊗ max Y. Combining these bounds with our previous bounds on fl(xy), we obtain
x ⊗ min Y ≤ fl(xy) ≤ max Z,
min Z ≤ fl(xy) ≤ x ⊗ max Y,
and hence the result holds.
• Suppose x is negative. Then, similarly, we have x ⊗ min Y ≥ fl(xy) ≥ x ⊗ max Y. Therefore, since fl(xy) ∈ Z, it follows that
x ⊗ min Y ≥ fl(xy) ≥ min Z,
max Z ≥ fl(xy) ≥ x ⊗ max Y,
and thus we obtain the result.
Since we have proved the result for all possible cases, we are finished.

Next, we show that the solution set is exactly the set of feasible quotients:

Lemma 3.4 Let x ∈ F, and let Y, Z ⊆ F be floating-point intervals. Then there exists y ∈ Y such that x ⊗ y ∈ Z if and only if
1. x is feasible for Z, and
2. fl(xa) ∈ Z for some real a with min Y ≤ a ≤ max Y.

Proof If x ⊗ y ∈ Z for some y ∈ Y, then x ∈ Feas(Z) by definition, and fl(xy) = x ⊗ y ∈ Z where y ∈ Y trivially. Suppose instead that x is feasible for Z, and that fl(xa) ∈ Z for some real a with min Y ≤ a ≤ max Y. Then, by definition, there is some w ∈ F such that x ⊗ w ∈ Z. Further, by Lemma 3.3, we have x ⊗ s ≥ min Z and x ⊗ t ≤ max Z for some s, t ∈ Y. If w ∈ Y, then the result follows immediately. Since Y is a floating-point interval, if instead w ∉ Y, then either w < min Y or w > max Y. For the following, note that min Z ≤ x ⊗ w ≤ max Z. We shall divide the problem into four cases, depending on the sign of x and the region w lies in:

1. Suppose w < min Y. Then w < s and w < t.
(a) Suppose x ≥ 0. Multiplying, we obtain x ⊗ w ≤ x ⊗ t, and thus min Z ≤ x ⊗ t ≤ max Z. Since Z is a floating-point interval, it follows that x ⊗ t ∈ Z.
(b) Suppose x < 0. Then, we can multiply w < s by x to obtain x ⊗ w ≥ x ⊗ s, and thus max Z ≥ x ⊗ s ≥ min Z. Since Z is a floating-point interval, we have x ⊗ s ∈ Z.
2. Suppose w > max Y. Then w > s and w > t.
(a) Suppose x ≥ 0. Then x ⊗ w ≥ x ⊗ s, and thus x ⊗ s ∈ Z by the reasoning of case (1)(b).
(b) Suppose x < 0. Then x ⊗ w ≤ x ⊗ t, and hence x ⊗ t ∈ Z by the reasoning of case (1)(a).

Since we have either x ⊗ s ∈ Z or x ⊗ t ∈ Z in all cases, the result follows.

Remark This is not a solution on its own. Note that a is not necessarily a floating-point number!

This brings us to our first major result. The following theorem breaks the problem of computing optimal bounds into two simpler problems: computing bounds using the division-based algorithm, and then shrinking those bounds to the nearest feasible values.
Theorem 3.5 Let X, Y, Z ⊆ F be nonempty floating-point intervals, and let

X′ = X ∩ (fl⁻¹[Z] “/” Ȳ),
X̂ = {x ∈ X | ∃y ∈ Y (x ⊗ y ∈ Z)},

where Ȳ denotes the extended real interval [min Y, max Y]. Then X̂ = X′ ∩ Feas(Z).

Proof By Lemma 3.4, for all x ∈ F, we have x ∈ X̂ if and only if x ∈ X, x ∈ Feas(Z) and x ∈ fl⁻¹[Z] “/” Ȳ, and the result follows immediately.

Remark This result implies that the division-based algorithm produces optimal bounds if and only if it produces feasible bounds. That is, min X̂ = min X′ if and only if min X′ is feasible for Z. Therefore, the difficult case is when the bounds produced by the division-based algorithm are infeasible.

Theorem 3.5 shows that we can solve Problem 1 by using the classical division-based algorithm to compute the set of quotients and then intersecting it with the set of factors. As such, the remainder of this paper will focus on solving this new problem:

Problem 2 Let x ∈ F and let Z ⊆ F be a floating-point interval. What is the least (greatest) number feasible for Z that is no less (greater) than x?

3.2 Narrowing Down Infeasibility

Due to Theorem 3.5, we would like to devise a method to find the factors nearest any given non-factor. To that end, we now identify some more practical conditions for feasibility. The following lemma gives a simple and direct test for feasibility, and also shows that floating-point multiplication and division are closely related:

Lemma 3.6 Let x ∈ F* and let Z ⊆ F be a floating-point interval. Let z ∈ Z. Then x is feasible for Z if and only if either x ⊗ RD(z/x) ∈ Z or x ⊗ RU(z/x) ∈ Z.

Proof If either x ⊗ RD(z/x) ∈ Z or x ⊗ RU(z/x) ∈ Z, then x is trivially feasible for Z. For the other half, suppose x is feasible for Z. Then x ⊗ y ∈ Z for some y ∈ F. Now, let Y = fl⁻¹[Z]/x. Then y ∈ Y and z/x ∈ Y, and since Z is a floating-point interval, it follows that Y is an interval.
Therefore, if y ≤ z/x, then y ≤ RD(z/x) ≤ z/x, and so RD(z/x) ∈ Y. If y > z/x instead, then y ≥ RU(z/x) ≥ z/x, and thus RU(z/x) ∈ Y. Since either RD(z/x) ∈ Y or RU(z/x) ∈ Y, we have either x ⊗ RD(z/x) ∈ Z or x ⊗ RU(z/x) ∈ Z, as desired.

Corollary 3.6.1 Let x ∈ F* and z ∈ F. Then x is a floating-point factor of z if and only if either x ⊗ RD(z/x) = z or x ⊗ RU(z/x) = z.

Additionally, it is worth mentioning that the above serves as a manual proof of a general statement of the automatically derived “invertibility condition” for multiplication of Brain et al. [7].

Although it is clear that if the set of products is large, there will be many solutions, it is unclear how large exactly it needs to be. The following results give a more precise idea of how its diameter relates to the set of solutions. In the following lemma, we describe a lower bound on the diameter sufficient to ensure feasibility in all cases.

Lemma 3.7 Let Z ⊆ F be a floating-point interval, and let x ∈ F* and z ∈ Z. If |z/x| ≤ max F and diam fl⁻¹[Z] > |x|β^{Q(z/x)}, then x is feasible for Z.

Proof If z/x ∈ F, then x ⊗ (z/x) = z, and hence x is feasible for Z. Suppose z/x ∉ F instead, and suppose |z/x| ≤ max F and diam fl⁻¹[Z] > |x|β^{Q(z/x)}. First, note that RD(z/x) < z/x < RU(z/x), and hence z lies strictly between x·RD(z/x) and x·RU(z/x). More precisely, let I = x · [RD(z/x), RU(z/x)]. Then z ∈ I, and hence I ∩ fl⁻¹[Z] is nonempty. Since Z is a floating-point interval, fl⁻¹[Z] is an interval. We shall now show that fl⁻¹[Z] is strictly wider than I:

diam I = |x|(RU(z/x) − RD(z/x)) = |x|β^{Q(z/x)} < diam fl⁻¹[Z].

Therefore, since I and fl⁻¹[Z] are overlapping intervals, by Lemma 2.3, it follows that either min I ∈ fl⁻¹[Z] or max I ∈ fl⁻¹[Z]. Hence either x ⊗ RU(z/x) ∈ Z or x ⊗ RD(z/x) ∈ Z, and so x is feasible for Z.
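Corollary 3.6.1 turns factor testing into at most two multiplications. The following is a hedged sketch for the same toy positive decimal format (β = 10, p = 2) used earlier; `fl`, `rd` and `ru` are illustrative stand-ins for the rounding functions, covering only positive normal values, and are not the paper's definitions.

```python
from fractions import Fraction as Q

BETA, P = 10, 2

def decompose(x):
    """Return (m, q) with x = m * BETA**q and m in [BETA**(P-1), BETA**P)."""
    q = 0
    while x / Q(BETA) ** q >= BETA ** P:
        q += 1
    while x / Q(BETA) ** q < BETA ** (P - 1):
        q -= 1
    return x / Q(BETA) ** q, q

def fl(x):
    """Round to nearest, ties to even integral significand."""
    m, q = decompose(x)
    lo, hi = int(m), int(m) + 1
    if m - lo < hi - m or (m - lo == hi - m and lo % 2 == 0):
        return Q(lo) * Q(BETA) ** q
    return Q(hi) * Q(BETA) ** q

def rd(x):
    """Round down to the format."""
    m, q = decompose(x)
    return Q(int(m)) * Q(BETA) ** q

def ru(x):
    """Round up to the format."""
    m, q = decompose(x)
    mi = int(m) if m == int(m) else int(m) + 1
    return Q(mi) * Q(BETA) ** q

def is_factor(x, z):
    """Corollary 3.6.1: x is a factor of z iff one of the two candidate
    rounded quotients multiplies back to z exactly."""
    return fl(x * rd(z / x)) == z or fl(x * ru(z / x)) == z
```

For instance, is_factor(1.2, 1.6) holds because fl(1.2 · 1.3) = 1.6, while is_factor(1.2, 1.5) fails: no two-digit y makes fl(1.2y) = 1.5.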
The next lemma builds on the previous result to give a width independent of the denominator under which virtually all values are feasible.

Lemma 3.8 Let Z ⊆ F be a floating-point interval, and let x ∈ F* and z ∈ Z. If β^{e_min} ≤ |z/x| ≤ max F and diam fl⁻¹[Z] ≥ β^{Q(z)+1}, then x is feasible for Z.

Proof Suppose β^{e_min} ≤ |z/x| ≤ max F and diam fl⁻¹[Z] ≥ β^{Q(z)+1}. Then,

|x|β^{Q(z/x)} = |x|β^{E(z/x)−p+1}
             ≤ |x||z/x|β^{1−p}
             = |z|β^{1−p}
             < β^{E(z)+1}β^{1−p}
             = β^{Q(z)+1}
             ≤ diam fl⁻¹[Z].

Therefore, by Lemma 3.7, x is feasible for Z.

Remark For any ordinary rounding function (e.g. RD, RU, or RN), the requirement that diam fl⁻¹[Z] ≥ β^{Q(z)+1} roughly corresponds to Z containing at least β floating-point numbers with the same exponent as z (cf. Ziv et al. [30]).

The next lemma provides an even stronger bound, under the condition that the significand of the numerator is no greater in magnitude than the significand of the denominator:

Lemma 3.9 Let Z ⊆ F be a floating-point interval, and let x ∈ F* and z ∈ Z. Let m_x = xβ^{−E(x)} and m_z = zβ^{−E(z)}. If diam fl⁻¹[Z] ≥ β^{Q(z)}, |m_z| ≤ |m_x|, and β^{e_min} ≤ |z/x| ≤ max F, then x is feasible for Z.

Proof Suppose β^{e_min} ≤ |z/x| ≤ max F. If |m_z| = |m_x|, then |z/x| = β^{E(z)−E(x)} and hence z/x ∈ F, so x is trivially feasible for Z. Suppose |m_z| < |m_x| instead, and also suppose diam fl⁻¹[Z] ≥ β^{Q(z)}. Then log_β ufp(m_z/m_x) ≤ −1, and so by Lemma 2.13 it follows that E(z/x) ≤ E(z) − E(x) − 1. Thus,

|x|β^{Q(z/x)} = |x|β^{E(z/x)−p+1}
             ≤ |x|β^{E(z)−E(x)−p}
             = |m_x|β^{Q(z)−1}
             < β^{Q(z)}
             ≤ diam fl⁻¹[Z].

Therefore, by Lemma 3.7, x is feasible for Z.

Fig. 1 Preimages of 1 and 1.75 under floating-point rounding, with (β, p, e_min, e_max) = (2, 3, −2, 1). RNE is rounding to nearest, breaking ties by choosing numbers with even integral significands.

Any “ordinary” rounding function almost always satisfies the condition on the diameter of the preimage.
Specifically, we always have diam fl⁻¹[{z}] = β^{Q(z)} unless |z| is normal and a power of β. This can be seen in Fig. 1; note the different diameters of the preimages of 1 under rounding.

Note that Lemmas 3.7 and 3.9 are both easier to apply when using standard rounding modes, or more specifically, when the preimage under rounding of any floating-point number is not too narrow. Although the usual rounding functions are both faithful and monotonic, and these are indeed very useful properties, they are not enough on their own:

Example 3.10 We shall construct an ill-behaved rounding function which is still both faithful and monotonic. Let f : R̄ → F̄ be a function such that

f(x) = RD(x) if x ≤ max F,
f(x) = +∞ otherwise.

Since RD is nondecreasing and faithful, it is easy to see that f is also nondecreasing and faithful. However, the only real that it rounds to max F is max F itself. That is, we have f⁻¹[{max F}] = {max F}, and so the preimage under rounding has zero diameter. Further, given x, y ∈ F, we have x ⊗ y = max F if and only if xy = max F exactly. This means that finding the floating-point factors of max F requires integer factorization. Specifically, since max F = (β^p − 1)β^{q_max}, the floating-point factors of max F are exactly the floating-point numbers with a nonnegative exponent (i.e. integers) whose integral significands divide β^p − 1.

As demonstrated above, faithfulness and monotonicity are not enough to ensure that a rounding function has a tractable preimage. We therefore restrict our attention to a better-behaved class of rounding functions. In particular, we are interested in rounding functions that allocate to each floating-point number at least half of the gap around it:

Definition 3.11 (Equitable rounding) A rounding function f is equitable if it is faithful, monotonic, and diam f⁻¹[{x}] ≥ β^{Q(x)−k} for all x ∈ F*, where k = 1 if |xβ^{−E(x)}| = 1 and k = 0 otherwise.
Remark Note that we do not require the preimages under rounding of 0, +∞, and −∞ to have a minimum diameter. This is because they already have every nonzero finite floating-point number as a floating-point factor, and so we do not need their preimages at all.

Given this definition, the following is then straightforward (albeit tedious) to prove:

Lemma 3.12 RD and RU are equitable rounding functions.

However, RN is not necessarily equitable, since it is unclear what rounding to nearest means for numbers outside [min F, max F], and thus our definition does not fully specify the preimages of min F and max F under RN. For instance, we could choose RN such that RN⁻¹[{x}] = ((x + x⁻)/2, x] when x = max F, where x⁻ denotes the greatest floating-point number less than x. This is clearly not equitable, since then diam RN⁻¹[{x}] = β^{q_max}/2. However, this inconvenience is not as important as it may seem. As mentioned earlier, IEEE 754-style rounding to nearest specifies that a number rounds to an infinity if and only if its magnitude is no less than (max F + β^{e_max+1})/2. This in turn means that the preimages of min F and max F under rounding have a diameter of β^{q_max}, as required for equitability. In particular, the preimages of min F and max F under any equitable rounding to nearest are at least as large as those under IEEE 754-style rounding:

Lemma 3.13 RN is an equitable rounding function if and only if RN⁻¹[{+∞}] ⊆ I and RN⁻¹[{−∞}] ⊆ −I, where I = [(max F + β^{e_max+1})/2, +∞].

Therefore, since IEEE 754-style systems are virtually the only ones in use, we do not consider equitability to be an especially onerous requirement.

Now, under the assumption of equitable rounding, we can substantially refine the conclusion of Lemma 3.9.

Lemma 3.14 Let x ∈ F* and z ∈ F*, and let m_x = xβ^{−E(x)} and m_z = zβ^{−E(z)}. If fl is equitable and β^{e_min} ≤ |z/x| ≤ max F and |m_z| ≤ |m_x|, then either |m_z| = 1 or x is a floating-point factor of z.

Proof Suppose fl is equitable and β^{e_min} ≤ |z/x| ≤ max F and |m_z| ≤ |m_x|.
Suppose also that |m_z| ≠ 1. Then, by the definition of equitable rounding, we have diam fl⁻¹[{z}] ≥ β^{Q(z)}. Therefore, by Lemma 3.9, it follows that x is feasible for {z}, and hence the result.

Remark Note that here we show that x is a factor of z, whereas Lemmas 3.7 and 3.9 only prove that it is feasible for some superset Z.

Combining the previous results, we obtain a much clearer picture of what it means when a number is infeasible:

Theorem 3.15 Let Z ⊆ F be a floating-point interval, and let x ∈ F* and z ∈ Z. Let m_x = xβ^{−E(x)} and m_z = zβ^{−E(z)}. If fl is equitable and |z/x| ≤ max F and x is not feasible for Z, then either
1. |z/x| < β^{e_min}, or
2. 1 = |m_z| < |m_x| and diam fl⁻¹[Z] < β^{Q(z)}, or
3. |m_x| < |m_z| and diam fl⁻¹[Z] < β^{Q(z)+1}.

Proof Follows immediately from Lemmas 3.7 and 3.14.

Fig. 2 Graphs of exact products of floating-point divisors x and rounded quotients RD(z/x) and RU(z/x), where (β, p, e_min, e_max) = (10, 2, −1, 1) and z = 2.5. The x-axis shows all positive finite floating-point numbers, spaced equally.

Remark When rounding the quotient produces a subnormal number, as in case (1), the loss of significant digits can drastically magnify the error and make x infeasible unless Z is exceptionally wide. This makes subnormal quotients difficult to handle in general, since the difference between z and x ⊗ fl(z/x) may be as large as x itself.

Since we assume fl is equitable, case (2) is the only instance where x is infeasible and |m_z| ≤ |m_x|. Although it is a very constrained special case, it will take some work to dispense with, as we shall see later. The most common cause of infeasibility is case (3). In this instance, we have |m_x| < |m_z|, but Z is too narrow to provide a direct solution, likely due to Z containing fewer than β numbers.
Note that this means that having more than one number in Z usually suffices for feasibility when β = 2.

Theorem 3.15 gives a straightforward classification of the cases where infeasibility can occur. Although we are not able to give a full treatment of the subnormal quotient case in this paper, we will present an efficient solution for the other two cases. In the next section, we develop criteria for feasibility that are more amenable to a computational solution.

4 Analyzing the Error

When computing a real number, we are often forced to approximate it due to the limitations of the number system. For systems as complicated as the floating-point numbers, the error in this approximation—the rounding error—can seem unpredictable. The notion is simple: given a rounding function f, the rounding error of some x ∈ R with respect to f is just f(x) − x. However, it is difficult to tell from this what values the rounding error can take and how it varies with x.

In our case, we are more precisely interested in a “round-trip” error. Roughly speaking, according to Lemma 3.6, a floating-point number is feasible for a given set if and only if we can undo division by using multiplication. In other words, it is feasible if and only if the error in the “round-trip” is small enough. For any given floating-point numerator z and denominator x, any error in attempting to round-trip multiplication using division stems from rounding the quotient z/x. This is easy to show: since fl is assumed to be faithful, if z/x is also a floating-point number, then x ⊗ fl(z/x) = z exactly. Consequently, any instance where x ⊗ fl(z/x) ≠ z implies that z/x is not exactly representable. This is the common case, since the quotient of two floating-point numbers rarely fits into the available number of digits.¹ However, the product is also rounded, so an inexact quotient may still be close enough.
For instance, if we round downward, it is sufficient that the product lies in [z, z⁺), where z⁺ is the least floating-point number greater than z. A number of such examples can be seen in Fig. 2, as well as a few instances where the product has no error (note that the graph is centered on z). Other patterns are visible as well, such as periodic behavior and a reduction in error whenever the significand of x is greater than the significand of z.

In the remainder of this section, we shall develop ways of manipulating this round-trip error, with the goal of using integer optimization to find floating-point factors. The key insight comes from Lemma 3.6, which tells us that feasibility can be decided using only fl and upward- or downward-rounded quotients. Further, since we assume fl is faithful, we also know that multiplication is also either rounded upward or downward. Therefore, we only really need to be concerned with rounding errors relative to RU and RD to solve the problem.

However, the definitions of RU and RD are very cumbersome to manipulate directly. In order to make progress, we must first find a more practical description of rounding. We will do this by expressing rounding error in terms of the remainder of floored division. Furthermore, for the sake of simplicity, we will consider rounding over floating-point numbers with unbounded exponents.² We will also show how this relates to the original case.

4.1 Further Preliminaries

4.1.1 The Mod Operator

We shall use a generalized notion of the remainder of a division. Given a numerator n, denominator d, and quotient q, a remainder r satisfies n = qd + r and |r| < |d|. These quantities are traditionally integers, but we extend the operation to the real numbers. Given n and d, we shall determine the remainder r by choosing the quotient q = ⌊n/d⌋. Thus, we define the binary operator mod as follows:

Definition 4.1 (Remainder of floored division) For any numerator x ∈ R and denominator y ∈ R*, we define (x mod y) = x − y⌊x/y⌋.
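Definition 4.1 extends directly to exact rational arithmetic. The following sketch (function name illustrative) implements it and exercises a few of its basic consequences:

```python
from fractions import Fraction as Q
from math import floor

def mod(x, y):
    """(x mod y) = x - y*floor(x/y), for rational x and nonzero rational y."""
    return x - y * floor(x / y)
```

The remainder takes the sign of the denominator, e.g. mod(3.5, 1) = 0.5 but mod(3.5, −1) = −0.5, and adding whole multiples of the denominator to the numerator leaves the result unchanged, as Lemma 4.2 below states.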
More precisely, if β is prime (such as when β = 2), then the quotient either fits or has no finite base-β representation. For a proof, see Lemma A.5. Due to the difficulties with subnormal numbers mentioned in the remark to Theorem 3.15, the specific approach presented here does not benefit from restricting exponents, and doing so would drastically complicate the presentation.

Remark To avoid excessive parenthesization, the mod operator defined here is nonassociative and has the lowest precedence. That is, within the parentheses, everything before mod is the first argument, and everything after mod is the second argument.

Since this is an extension of the usual notion of the remainder to real numerators and denominators, not all the same properties will hold exactly. However, there is still a close connection with modular arithmetic, which the next two lemmas will illustrate. Firstly, the remainder is periodic in the numerator, with a period equal to the denominator. That is, we can add or subtract integer multiples of the denominator to the numerator without affecting the result:

Lemma 4.2 For any x ∈ R, y ∈ R*, and n ∈ Z, (x + ny mod y) = (x mod y).

Proof Follows directly by a straightforward computation.

Secondly, if we restrict ourselves wholly to the integers and fix the denominator, equality of remainders is equivalent to congruence:

Lemma 4.3 For all a, b ∈ Z, n ∈ Z*, we have (a mod n) = (b mod n) if and only if a ≡ b (mod |n|).

Proof If (a mod n) = (b mod n), then a − n⌊a/n⌋ = b − n⌊b/n⌋, and hence a ≡ b (mod |n|). Suppose instead that a ≡ b (mod |n|). Then there exist p, q, r ∈ Z such that a = pn + r and b = qn + r. As (pn + r mod n) = (qn + r mod n) = (r mod n), the result follows.

As might be expected from being so closely related to congruence, the mod operator has many useful properties, though we shall only need a few.
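Both lemmas are easy to check numerically. A small sketch (Python; `fmod_floor` is an illustrative name of ours, and for integers Python's `%` already implements the floored-division remainder):

```python
import math

def fmod_floor(x, y):
    # (x mod y) = x - y*floor(x/y), as in Definition 4.1
    return x - y * math.floor(x / y)

# Lemma 4.2: adding an integer multiple of the denominator to the
# numerator leaves the remainder unchanged (real arguments allowed).
assert fmod_floor(2.5 + 7 * 1.25, 1.25) == fmod_floor(2.5, 1.25)

# Lemma 4.3: over the integers, equal remainders <=> congruence mod |n|.
a, b, n = 17, -25, -6
assert (a % n == b % n) == ((a - b) % abs(n) == 0)
```

The values above are exactly representable in binary floating point, so the equalities hold without rounding error.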
The following properties hold for all x ∈ R and a, y ∈ R*:

|(x mod y)| < |y| (upper bound) (45)
(ax mod ay) = a(x mod y) (distributive law) (46)
if x, y ∈ Z, then (x mod y) ∈ Z (integrality) (47)
if (x mod y) ≠ 0, then (−x mod y) = y − (x mod y) (complement) (48)

It is also worth noting that the remainder (when nonzero) has the same sign as the denominator:

y(x mod y) ≥ 0 (49)

Finally, the following results simplify some common cases of the numerator:

(0 mod y) = 0 (50)
(x mod y) = 0 if and only if x = ny for some n ∈ Z (51)
(x mod y) = x if and only if 0 ≤ x/y < 1 (52)
(x mod y) = x if and only if xy ≥ 0 and |x| < |y| (53)

The proofs of these statements are simple and follow easily from the definition of mod.

4.1.2 Floating-Point Numbers with Unbounded Exponents

We define these sets almost identically to the bounded case. For compatibility with their bounded counterparts, we inherit the base β and precision p:

Definition 4.4 (Unbounded floating-point numbers) We define the following sets:

F*_∞ = {Mβ^q | M, q ∈ Z, 0 < |M| < β^p},
F_∞ = F*_∞ ∪ {0},
F̄_∞ = F_∞ ∪ {−∞, +∞}.

Remark Note that the unbounded sets are strict supersets of their bounded counterparts. That is, we have F* ⊂ F*_∞, F ⊂ F_∞, and F̄ ⊂ F̄_∞.

The unbounded floating-point numbers satisfy the following properties for all x ∈ F_∞ and n ∈ Z:

if |n| ≤ β^p, then n ∈ F_∞ (54)
if β^(e_min) ≤ |x| ≤ max F, then x ∈ F* (55)
xβ^n ∈ F_∞ (scaling by powers of β) (56)
if x ∈ F*_∞, then x/ulp(x) ∈ Z (integral significand) (57)

Proving the existence of a unique upward or downward rounding takes a little more work in the unbounded case, since the sets are infinite, so their subsets might not have a minimum or maximum. For instance, the set (0, 1] ∩ F_∞ has no minimum, since it contains every β^(−n) where n ≥ 0. However, infinite bounded subsets occur only around zero:

Lemma 4.5 Let a, b ∈ R. If 0 ∉ [a, b], then [a, b] ∩ F_∞ is a finite set.
Proof Let S = [a, b] ∩ F_∞. If S is empty, then the result follows trivially. Suppose S is nonempty instead and suppose that 0 ∉ [a, b]. Then a ≤ b, and a and b must be nonzero and have the same sign. We shall consider the case of each sign separately.

Suppose a and b are positive. Let e_a = ⌊log_β a⌋ and e_b = ⌊log_β b⌋, and let

T = {Mβ^q | M, q ∈ Z, 0 < M < β^p, e_a − p + 1 ≤ q ≤ e_b}.

Clearly, T is finite. Now, let s ∈ S. Since s is positive, by definition we have s = Mβ^q for some M, q ∈ Z where 0 < M < β^p. Thus, to show that s ∈ T, it suffices to show that e_a − p + 1 ≤ q ≤ e_b. Now,

β^q ≤ Mβ^q = s ≤ b < β^(e_b + 1),

so q < e_b + 1 and hence q ≤ e_b. Similarly,

β^(e_a) ≤ a ≤ s = Mβ^q < β^(q + p),

so e_a − p < q and hence e_a − p + 1 ≤ q. Therefore s ∈ T, and hence S ⊆ T, and since T is finite, so is S.

Suppose instead a and b are negative. Then −b and −a are positive, so 0 ∉ [−b, −a], and thus [−b, −a] ∩ F_∞ is finite. Since F_∞ is closed under negation, we have [−b, −a] ∩ F_∞ = −([a, b] ∩ F_∞) = −S, and hence S is also finite.

Corollary 4.5.1 For any x ∈ R, the set {y ∈ F_∞ | y ≤ x} has a maximum and the set {y ∈ F_∞ | y ≥ x} has a minimum.

We can now define downward and upward rounding over the unbounded floating-point numbers:

Definition 4.6 (Unbounded rounding) We define the functions RD_∞, RU_∞ : R → F_∞ as follows:

RD_∞(x) = max {y ∈ F_∞ | y ≤ x},
RU_∞(x) = min {y ∈ F_∞ | y ≥ x}.

As expected, these functions are equal to their bounded counterparts over the range of the normal numbers:

Lemma 4.7 For all x ∈ R, if β^(e_min) ≤ |x| ≤ max F, then RD(x) = RD_∞(x) and RU(x) = RU_∞(x).

We also have a similar gap property to Eq. 40:

Lemma 4.8 For all x ∈ R, we have either RD_∞(x) = RU_∞(x) = x or RU_∞(x) − RD_∞(x) = ulp(x).

Note that the gap is always finite, since having unbounded exponents means that every finite real rounds to another finite real. Only +∞ and −∞ round to themselves.
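For β = 2 these definitions can be prototyped exactly with rational arithmetic. The sketch below is ours (the names `ufp`, `rd_inf`, `ru_inf` and the toy precision p = 4 are not from the paper); it rounds onto {M·2^q : M, q ∈ Z, 0 < |M| < 2^p} using exact `Fraction` values, so no double-rounding can interfere:

```python
from fractions import Fraction

def ufp(x):
    """Largest power of 2 not exceeding |x| (x must be nonzero)."""
    x, u = abs(Fraction(x)), Fraction(1)
    while u > x:
        u /= 2
    while 2 * u <= x:
        u *= 2
    return u

def rd_inf(x, p=4):
    """RD_inf for beta = 2, precision p: greatest y = M*2^q with y <= x."""
    x = Fraction(x)
    if x == 0:
        return x
    step = ufp(x) / 2 ** (p - 1)   # ulp(x)
    return x - x % step            # Fraction % is the floored-division mod

def ru_inf(x, p=4):
    """RU_inf for beta = 2, precision p: least y = M*2^q with y >= x."""
    x = Fraction(x)
    if x == 0:
        return x
    step = ufp(x) / 2 ** (p - 1)
    return x + (-x) % step
```

The two return expressions are precisely the remainder formulas for RD_∞ and RU_∞ proved later in Lemma 4.19.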
4.1.3 Plausibility

In order to effectively translate our problem into the realm of unbounded floating-point numbers, we will also need a counterpart to our earlier concept of feasibility. Note that we keep the same rounding function: even though fl is not surjective with respect to the unbounded floating-point numbers, it is compatible, since it already maps every extended real to a bounded floating-point number.

Definition 4.9 (Plausibility) Given Z ⊆ F̄ and x ∈ F̄_∞, we say that x is plausible for Z if and only if fl(xy) ∈ Z for some y ∈ F̄_∞. We denote the set of all unbounded floating-point numbers plausible for Z by Plaus(Z).

Remark If Z ⊆ F̄, then we immediately have Z ⊆ Feas(Z) ⊆ Plaus(Z).

Similarly to the sets of factors, we have the following closure property:

Lemma 4.10 For all Z ⊆ F̄, the set Plaus(Z) is closed under negation and scaling by powers of β.

Due to exponents being unrestricted, we have the following:

Lemma 4.11 For all Z ⊆ F̄, we have 1 ∈ Plaus(Z) if and only if Z is nonempty.

Perhaps most importantly, we have an analogue to Lemma 3.6 (the proof is identical, mutatis mutandis):

Lemma 4.12 Let Z ⊆ F̄ be a floating-point interval, and let x ∈ F*_∞ and z ∈ Z. Then x ∈ Plaus(Z) if and only if either fl(x RD_∞(z/x)) ∈ Z or fl(x RU_∞(z/x)) ∈ Z.

We also have an analogue to Lemma 3.9:

Lemma 4.13 Let Z ⊆ F̄ be a floating-point interval, and let x ∈ F*_∞ and z ∈ Z. Let m_x = x/ulp(x) and m_z = z/ulp(z). If diam fl⁻¹[Z] ≥ ulp(z) and |m_z| ≤ |m_x|, then x is plausible for Z.

As expected, feasibility and plausibility agree when the quotients are normal:

Lemma 4.14 Let Z ⊆ F̄ be a floating-point interval, and let x ∈ F* and z ∈ Z. If β^(e_min) ≤ |z/x| ≤ max F, then x ∈ Feas(Z) if and only if x ∈ Plaus(Z).

Proof If x ∈ Feas(Z), then x ∈ Plaus(Z) trivially.
Suppose x ∈ Plaus(Z) and β^(e_min) ≤ |z/x| ≤ max F. Then RD(z/x) = RD_∞(z/x) and RU(z/x) = RU_∞(z/x), so by Lemma 4.12, we have either x ⊗ RD(z/x) = fl(x RD_∞(z/x)) ∈ Z, or x ⊗ RU(z/x) = fl(x RU_∞(z/x)) ∈ Z, and therefore x ∈ Feas(Z) by definition.

Since exponents are unbounded, there is always a next and a previous plausible number. We will find it useful to have a function giving the smallest plausible number greater than some lower bound. (Note that F̄ being finite means that we cannot have a similar total function for the feasible numbers.) First, we prove its existence:

Lemma 4.15 Let Z ⊆ F̄ be a floating-point interval, and let x ∈ R*. Let A = {y ∈ Plaus(Z) | y ≥ x}. Then A is nonempty and has a minimum if and only if Z is nonempty.

Proof If Z is empty, then Plaus(Z) is trivially empty and hence so is A. Suppose instead that Z is nonempty, let z ∈ Z, and let t = z/s, where

s = β ufp(x) if x > 0, and s = −ufp(x) otherwise.

Clearly, x ≤ s. Now, since |s| is a power of β and z ∈ F̄, we have s, t ∈ F̄_∞. Therefore fl(st) = fl(z) = z, and hence s, t ∈ Plaus(Z) by definition, and so s ∈ A.

Now, let L = [x, s] ∩ Plaus(Z) and L′ = [x, s] ∩ F_∞. Since x and s have the same sign by definition, by Lemma 4.5, it follows that L′ is finite. Hence, since s ∈ L ⊆ L′, it follows that L is also finite, and since it is nonempty, it has a minimum. Now, let R = (s, +∞] ∩ Plaus(Z). Since min L ≤ s, it follows that min L < r for all r ∈ R. Thus, since A = L ∪ R, we have min L ≤ a for all a ∈ A, and hence min A = min L.

With this result, we can now define our "round up to plausible" function:

Definition 4.16 For each nonempty Z ⊆ F̄, we define the function φ_Z : R* → F_∞ by φ_Z(x) = min {y ∈ Plaus(Z) | y ≥ x}.

It is easy to see from the definition that these functions are nondecreasing:

Lemma 4.17 For all Z ⊆ F̄ and x, y ∈ R*, if x ≤ y, then φ_Z(x) ≤ φ_Z(y).

Thus, since the plausible numbers are a superset of the feasible numbers, we have the following:

Lemma 4.18 Let Z ⊆ F̄, x ∈ R* and y ∈ Feas(Z).
Then, if x ≤ y, we have φ_Z(x) ≤ y. That is, each of these functions gives a lower bound on the next feasible number.

4.2 Relating Feasibility, Plausibility, and Rounding Errors

We begin by giving expressions for rounding in terms of remainders. Note that the following result directly gives a rounding error term.

Lemma 4.19 For all x ∈ R*, we have

RD_∞(x) = x − (x mod ulp(x)),
RU_∞(x) = x + (−x mod ulp(x)).

Proof Let x̲ = x − (x mod ulp(x)), x̄ = x + (−x mod ulp(x)), and M = x/ulp(x). Then, rearranging using the definition of mod, we obtain x̲ = ulp(x)⌊M⌋ and x̄ = ulp(x)⌈M⌉, and thus x̲ ≤ x ≤ x̄. Since β^(p−1) ≤ |M| < β^p, it follows that |⌊M⌋| ≤ β^p and |⌈M⌉| ≤ β^p. Thus ⌊M⌋, ⌈M⌉ ∈ F_∞, and since ulp(x) is a power of β, we have x̲, x̄ ∈ F_∞. Thus, by the definitions of RD_∞ and RU_∞,

x̲ ≤ RD_∞(x) ≤ x ≤ RU_∞(x) ≤ x̄.

Now, if M ∈ Z, then x̲ = x̄ and hence the result follows immediately. Suppose M ∉ Z instead. Then x ∉ F_∞, and so RU_∞(x) − RD_∞(x) = ulp(x) = x̄ − x̲. Therefore,

x̲ ≤ RD_∞(x) ≤ x ≤ RD_∞(x) + ulp(x) ≤ x̲ + ulp(x),

and hence x̲ = RD_∞(x) and x̄ = RU_∞(x), as desired.

The above result allows us to manipulate the rounding error using the various laws obeyed by the mod operator. The following result gives us an expression for the round-trip error between any two consecutive plausible numbers.

Lemma 4.20 Let Z ⊆ F̄ be a floating-point interval and z ∈ Z. Let x ∈ R*. If z ∈ F*, then for all t ∈ [x, φ_Z(x)], we have ufp(x) ≤ |t| ≤ β ufp(x) and

t RD_∞(z/t) = z − (z mod t ulp(z/x)),
t RU_∞(z/t) = z + (−z mod t ulp(z/x)).

Proof Suppose z ∈ F* and let t ∈ [x, φ_Z(x)]. Let m_x = x/ufp(x), m_z = z/ufp(z), and m_t = t/ufp(t). By Lemma 4.19 and the distributivity of multiplication over mod,

t RD_∞(z/t) = z − (z mod t ulp(z/t)),
t RU_∞(z/t) = z + (−z mod t ulp(z/t)),

so for the equalities it suffices to show that ulp(z/t) = ulp(z/x), or equivalently, that ufp(z/t) = ufp(z/x).
Note that if ufp(x) = ufp(t) and ufp(m_z/m_t) = ufp(m_z/m_x), then by Lemma 2.12, we have

ufp(z/t) = ufp(m_z/m_t) ufp(z)/ufp(t)
= ufp(m_z/m_x) ufp(z)/ufp(x)
= ufp(z/x).

Also note that if ufp(t) = ufp(x), then ufp(x) ≤ |t| < β ufp(t) immediately. We proceed by cases. In the following, note that by Lemma 4.10, every power of β and every scaling of z by a power of β is plausible for Z.

• Suppose |m_z| < m_x. Then by Lemma 2.12, we have ufp(m_z/m_x) = 1/β. Then, since x < β ufp(x) ∈ Plaus(Z), we have t ≤ β ufp(x).

– Suppose t < β ufp(x). Then, since ufp(x) ≤ x ≤ t, we have ufp(t) = ufp(x) and hence |m_z| < m_x ≤ m_t. Therefore, by Lemma 2.12, we have ufp(m_z/m_t) = 1/β as well, as desired.

– Suppose t = β ufp(x) instead. Then,

ufp(z/t) = ufp(z/(β ufp(x))) = (1/β) ufp(z)/ufp(x) = ufp(m_z/m_x) ufp(z)/ufp(x) = ufp(z/x).

• Suppose 0 < m_x ≤ |m_z| instead. Then, since |m_z| ufp(x) ∈ Plaus(Z), we have x ≤ t ≤ |m_z| ufp(x) < β ufp(x). Thus ufp(t) = ufp(x) and hence m_x ≤ m_t = t/ufp(x) ≤ |m_z|. Thus by Lemma 2.12, we also have ufp(m_z/m_t) = ufp(m_z/m_x) = 1, as desired.

• Suppose −|m_z| ≤ m_x < 0 instead. Then, since −ufp(x) ∈ Plaus(Z), we have x ≤ t ≤ −ufp(x), so ufp(t) = ufp(x) and hence −|m_z| ≤ m_x ≤ m_t ≤ −1. Therefore, by Lemma 2.12, we have ufp(m_z/m_t) = ufp(m_z/m_x) = 1, as desired.

• Suppose m_x < −|m_z| instead. Then ufp(m_z/m_x) = 1/β by Lemma 2.12, and since −|m_z| ufp(x) ∈ Plaus(Z), we have x ≤ t ≤ −|m_z| ufp(x), and thus ufp(x) = ufp(t).

– Suppose t < −|m_z| ufp(x). Then ufp(m_z/m_t) = ufp(m_z/m_x) = 1/β, and hence ufp(z/t) = ufp(z/x), as desired.

– Suppose t = −|m_z| ufp(x) instead. Then,

t ulp(z/x) = −|m_z| ufp(x) ulp(z/x)
= −|m_z| ufp(x) ufp(m_z/m_x) (ufp(z)/ufp(x)) β^(1−p)
= −|m_z| ufp(z) β^(−p)
= −|z|/β^p,

hence z is an integer multiple of t ulp(z/x). Thus

(z mod t ulp(z/x)) = (−z mod t ulp(z/x)) = 0,
and so the rounding error is zero. Similarly, since

z/t = z/(−|m_z| ufp(x)) = −sgn(z) ufp(z)/ufp(x),

it is a (possibly negated) power of β, and thus in F_∞. Therefore, t RD_∞(z/t) = t RU_∞(z/t) = t · (z/t) = z exactly, and hence the result follows.

We have thus proven the result in all cases, so we are finished.

The following lemma constrains the feasibility of tiny quotients. Essentially, if any especially small quotients are feasible, so are any larger ones, up to a limit. This result will be useful in determining whether there are any floating-point factors around zero. Although we cannot easily find subnormal factors, we can often eliminate them when they do not exist.

Lemma 4.21 Let Z ⊆ F̄ be a floating-point interval, x, y ∈ F, and z ∈ Z. If y ∈ Feas(Z) and |y| ≤ |x| ≤ |z|/max F, then x ∈ Feas(Z).

Proof If |x| = |z|/max F exactly, then |x| ⊗ max F = |z| and hence x ∈ Feas(Z) by Lemma 3.2. Suppose y ∈ Feas(Z) and |y| ≤ |x| < |z/max F| instead. Then |z/x| > max F, and hence z is nonzero. We first prove the result for positive x and y, before then generalizing to all cases. In the following, note that by Lemma 3.6, we have either y ⊗ RU(z/y) ∈ Z or y ⊗ RD(z/y) ∈ Z.

Suppose z > 0. Since |z/x| > max F and 0 < y ≤ x by assumption, we have RD(z/y) = RD(z/x) = max F and RU(z/y) = +∞. Therefore, if y ⊗ RU(z/y) = +∞ ∈ Z, then x ⊗ +∞ = +∞ ∈ Z also, and hence x is feasible for Z. If y ⊗ RD(z/y) ∈ Z instead, then, since y ≤ x, multiplying by RD(z/y) gives

min Z ≤ y ⊗ RD(z/y) ≤ x ⊗ RD(z/y) = x ⊗ RD(z/x) ≤ z ≤ max Z,

and since Z is a floating-point interval, it follows that x ⊗ RD(z/x) ∈ Z. Therefore x is feasible for Z.

Suppose z < 0 instead. Then, we similarly have RD(z/y) = −∞ and RU(z/y) = RU(z/x) = min F.
Hence, if y ⊗ RD(z/y) = −∞ ∈ Z, then x ⊗ −∞ = −∞ ∈ Z, and so x is feasible for Z. If y ⊗ RU(z/y) ∈ Z instead, then since y ≤ x, multiplying by RU(z/y) gives

max Z ≥ y ⊗ RU(z/y) ≥ x ⊗ RU(z/y) = x ⊗ RU(z/x) ≥ z ≥ min Z,

and since Z is a floating-point interval, it follows that x ⊗ RU(z/x) ∈ Z. Therefore x is feasible for Z. We have thus shown the result for positive x and y.

Now, suppose instead that y is negative (x can have any sign). Then |y| is feasible for Z by Lemma 3.2, and thus |x| is feasible for Z by the earlier result. Therefore |x| is feasible for Z regardless of the sign of y, and hence by Lemma 3.2, so is x.

The following theorem allows us to reduce the question of finding feasible numbers to that of finding plausible numbers, when applicable:

Theorem 4.22 Let Z ⊆ F̄ be a floating-point interval, and let x ∈ F and z ∈ Z. Let

A = {y ∈ Feas(Z) | y ≥ x},
B = {y ∈ Plaus(Z) | y ≥ x},
C = {y ∈ Plaus(Z) | y ≥ |x|}.

If x is normal and β^(e_min) ≤ |z/x| ≤ max F, then either A is empty or min A ∈ {min B, min C}.

Proof Suppose x is normal, β^(e_min) ≤ |z/x| ≤ max F, and A is nonempty. Let a = min A, b = min B, and c = min C. Since Feas(Z) ⊆ Plaus(Z), we have a ∈ B and thus x ≤ b ≤ a.

We first show that if c ≤ max F, then c ∈ Feas(Z). Let t = z/β^(E(z/x)). By Lemma 4.10, since |z| ∈ Plaus(Z), we have |t| ∈ Plaus(Z). Now, since |z/x| ≥ β^(e_min), we have |z/x| ≥ β^(E(z/x)). Taking reciprocals and multiplying by |z|, we obtain |x| ≤ |z|/β^(E(z/x)) = |t|, and therefore |t| ∈ C. Thus we obtain |x| ≤ c ≤ |t|, and taking reciprocals and multiplying by |z| again, it follows that

max F ≥ |z/x| ≥ |z/c| ≥ |z/t| = β^(E(z/x)) ≥ β^(e_min).

Therefore, by Lemma 4.14, since β^(e_min) ≤ |x| ≤ c, if c ≤ max F, then c ∈ F* and thus c ∈ Feas(Z).

Next, we dispose of two trivial cases:

• Suppose 0 ∈ Feas(Z). Then 0 ⊗ y ∈ Z for some y ∈ F and hence 0 ∈ Z. Thus 0 ⊗ x = 0 ∈ Z, and so x ∈ Feas(Z). Therefore a ≤ x, and hence x = a = b.
• Suppose +∞ ∈ Feas(Z). Then +∞ ⊗ y ∈ Z for some nonzero y ∈ F. If y > 0, then +∞ ⊗ y = +∞ = +∞ ⊗ x, and so x ∈ Feas(Z). If y < 0, then +∞ ⊗ y = −∞ = −∞ ⊗ x, and so x ∈ Feas(Z) again. Thus a ≤ x and so x = a = b.

We can now proceed to the proof of the main statement. Suppose 0 ∉ Feas(Z) and +∞ ∉ Feas(Z) instead. Then a < +∞, and so a ≤ max F. We shall divide the problem into two cases, depending on the sign of x.

Suppose x > 0. Then B = C, and hence b = c ≤ a ≤ max F. Therefore c ∈ Feas(Z) by our earlier result, so a ≤ c also, and hence a = b = c, as desired.

Suppose x < 0 instead. Then, since x is normal, we have x ≤ −β^(e_min). Since −β^(e_min) ∈ Plaus(Z) by Lemma 4.10, we have −β^(e_min) ∈ B and thus x ≤ b ≤ −β^(e_min). Therefore β^(e_min) ≤ |b| ≤ |x| ≤ max F, and hence b ∈ F* and β^(e_min) ≤ |z/x| ≤ |z/b|. Thus, if |z/b| ≤ max F, then b ∈ Feas(Z) by Lemma 4.14 and hence a = b.

Suppose |z/b| > max F instead. Then |b| < |z|/max F. We further divide this into two subcases, depending on the sign of a. Note that a ≠ 0, since 0 ∉ Feas(Z) by assumption.

• Suppose a is negative. Then b ≤ a < 0, and thus 0 < |a| ≤ |b| < |z|/max F. Since a ∈ Feas(Z) by definition, Lemma 4.21 implies that b ∈ Feas(Z) also, and thus a = b.

• Suppose a is positive. Then [x, 0] ∩ Feas(Z) is empty, and since Feas(Z) is closed under negation, it follows that a > |x|. Thus, since a ∈ Plaus(Z), we have a ∈ C and hence c ≤ a ≤ max F. Therefore c ∈ Feas(Z) by our earlier result, so a ≤ c and therefore a = c, as desired.

Therefore the result holds in all cases, and we are done.

We now give the main result of this section. The following theorem reduces the problem to an optimization problem over the integral significands of plausible numbers.

Theorem 4.23 Let Z ⊆ F̄ be a floating-point interval and z ∈ Z ∩ F*.
Then, for all x ∈ F*_∞, we have φ_Z(x) = M ulp(x), where

M = min {N ∈ A ∪ B | N ≥ M_x},
A = {N ∈ Z* | −(M_z β^(p−k) mod N) ∈ I β^(p−k)},
B = {N ∈ Z* | (−M_z β^(p−k) mod N) ∈ I β^(p−k)},
I = (fl⁻¹[Z] − z)/ulp(z),
k = 0 if |M_x| > |M_z|, and k = 1 otherwise,
M_x = x/ulp(x),
M_z = z/ulp(z).

Proof Let y = φ_Z(x) and M_y = y/ulp(x). Since y is plausible by definition, we have either fl(y RD_∞(z/y)) ∈ Z or fl(y RU_∞(z/y)) ∈ Z by Lemma 4.12. Therefore, by Lemma 4.20, we have either

z − (z mod y ulp(z/x)) ∈ fl⁻¹[Z], or
z + (−z mod y ulp(z/x)) ∈ fl⁻¹[Z].

Since ulp(M_z/M_x) = β^(k−p) by Lemma 2.12, subtracting z and then dividing by ulp(z) · ulp(M_z/M_x) gives us either

−(M_z β^(p−k) mod M_y) ∈ I β^(p−k), or
(−M_z β^(p−k) mod M_y) ∈ I β^(p−k).

By Lemma 4.20, we have ufp(x) ≤ |y| ≤ β ufp(x) and thus β^(p−1) ≤ |M_y| ≤ β^p. If |M_y| < β^p, then ufp(y) = ufp(x), and hence M_y is an integer. If |M_y| = β^p instead, then M_y is again an integer, and thus M_y ∈ A ∪ B in all cases. Hence, since x ≤ y and thus M_x ≤ M_y by definition, we obtain M_x ≤ M ≤ M_y. Hence β^(p−1) ≤ |M| ≤ β^p and x ≤ M ulp(x) ≤ y, and therefore M ulp(x) ∈ F_∞. Then, by Lemma 4.20, we have

t RD_∞(z/t) = z − (z mod t ulp(z/x)),
t RU_∞(z/t) = z + (−z mod t ulp(z/x)),

where t = M ulp(x). Next we prove that t ∈ Plaus(Z). Substituting the definition of I into the definitions of A and B and rearranging, we obtain

A = {N ∈ Z* | z − (z mod N ulp(z)β^(k−p)) ∈ fl⁻¹[Z]},
B = {N ∈ Z* | z + (−z mod N ulp(z)β^(k−p)) ∈ fl⁻¹[Z]}.

Now, since β^(k−p) = ulp(M_z/M_x), we have

M ulp(z)β^(k−p) = M ulp(z) ulp(M_z/M_x) = M ulp(x) ulp(z/x) = t ulp(z/x).

Hence, since M ∈ A ∪ B, we have either

z − (z mod t ulp(z/x)) = t RD_∞(z/t) ∈ fl⁻¹[Z], or
z + (−z mod t ulp(z/x)) = t RU_∞(z/t) ∈ fl⁻¹[Z].

Since t ∈ F_∞, we thus have t ∈ Plaus(Z). Therefore y ≤ t by definition, and hence y = t = M ulp(x), as desired.
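The minimization in Theorem 4.23 can be prototyped by brute force over candidate significands. The toy sketch below is ours (β = 2, p = 4; the function name and the arguments lo, hi standing for the scaled interval I·β^(p−k) are illustrative, not from the paper), and it only covers positive significands:

```python
def next_plausible_significand(Mx, Mz, k, lo, hi, p=4, beta=2):
    """Brute-force version of M = min {N in A ∪ B | N >= M_x}:
    the least integer N >= Mx such that -(s mod N) or (-s mod N)
    lies in [lo, hi], where s = Mz * beta**(p-k) and [lo, hi]
    stands for I * beta**(p-k). Returns None if no N qualifies."""
    s = Mz * beta ** (p - k)
    # the proof confines the answer to [beta^(p-1), beta^p]
    for N in range(Mx, beta ** p + 1):
        if lo <= -(s % N) <= hi or lo <= (-s) % N <= hi:
            return N
    return None
```

For example, with M_z = 13, k = 1 and the scaled interval [−2, 2], starting from M_x = 9 the search skips 9 through 12 and stops at N = 13, whose remainder is exactly zero.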
By transforming the problem of finding plausible numbers into an integer optimization prob- lem, we make it much more straightforward to analyze. Even so, the optimization problem presented in Theorem 4.23 is unusual, since what is being optimized is a remainder. However, as we saw earlier in Fig. 2, there are patterns in the round-trip error. In the next section, we will make use of them to bound the error and thus solve the optimization problem efficiently. 5 Minimizing Remainders of a Constant Divided by a Variable In the previous section, we reduced the problem of finding factors to an optimization problem over remainders where the dividend is constant but the divisor is not. The underlying problem is not specific to the floating-point numbers, and we can state it generally as follows: Problem 3 Given a rational x and an interval I containing zero, what are the integers n such that (x mod n) ∈ I ? Specifically, given a fixed integer m, what is the least (or greatest) such n where n ≥ m (or n ≤ m)? Naively, this problem can be solved by exhaustive search on the divisors in at most |x|+ 1 trials, but this is clearly impractical. We do not have a provably efficient solution to the general problem, and it may well be the case that one does not exist at all. However, the specific case we are interested in can be solved in a constant number of arithmetic operations. To better grasp the problem, let us first observe how the remainder varies with the divisor. Plotting (−1000 mod n) in Fig. 3, we see parabolas near n = 30, as well as between n = 20 and n = 25. After n = 40, a pattern of “stepped” parabolas emerges, with an increasing number of “steps” later on. Finally, the parabolas start disappearing entirely around n = 60 and give way to a linear pattern. As one might expect from rounding being expressible in terms of remainder, these patterns are also visible in Fig. 2. 
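A naive solver for Problem 3 simply scans the divisors, which is exactly the exhaustive search described above as impractical for large |x|. A minimal sketch (Python; the function name is ours, and Python's integer `%` already computes the floored-division remainder):

```python
def remainders_in_interval(x, lo, hi, n_min, n_max):
    """All nonzero integers n in [n_min, n_max] with (x mod n) in [lo, hi],
    where mod is the floored-division remainder of Definition 4.1."""
    return [n for n in range(n_min, n_max + 1)
            if n != 0 and lo <= x % n <= hi]
```

With I = [0, 0] and a positive integer dividend this is factorization, as noted above; plotting the returned remainders against n reproduces the piecewise linear and parabolic pattern of Fig. 3.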
This behavior may seem an unusual coincidence at first, but it occurs for any choice of numerator. Algebraically, the reason is remarkably simple. By the definition of the mod operator, we have

(−1000 mod n) = −1000 − n⌊−1000/n⌋.

Thus, if ⌊−1000/n⌋ is constant over a certain interval, then (−1000 mod n) is linear in n over that interval. If ⌊−1000/n⌋ changes linearly, then (−1000 mod n) is instead quadratic in n over that interval. This relationship can be clearly seen in Fig. 3, comparing the plots above and below the x-axis. (For instance, if x is a positive integer and I = [0, 0], Problem 3 is equivalent to factorization.)

Fig. 3 Graphs of (−1000 mod n) and ⌊−1000/n⌋, respectively above and below the x-axis. The upper graph follows a parabola whenever the lower graph behaves linearly (e.g. around n = 30). Similarly, linear segments in the upper graph correspond to constant segments in the lower (e.g. near n = 80).

The "stepped parabolas" correspond to values of n where the slope of ⌊−1000/n⌋ is close to linear, but less than 1, and so ⌊−1000/n⌋ does not change at every step. More formally, consider integers a, n, i such that n and n + i are nonzero, and let q = ⌊a/n⌋:

Constant quotient If ⌊a/(n + i)⌋ = q, then
(a mod n + i) = a − (n + i)q = (a mod n) − iq.

Linear quotient If there is some integer d such that ⌊a/(n + i)⌋ = q + di, then
(a mod n + i) = a − (n + i)(q + di) = (a mod n) − (di² + (q + nd)i).

Note that the above still holds if we permit the variables to take arbitrary real values. In this way we can also find parameters corresponding to each parabola inside the "stepped" segments. In our work, we will focus on linear and quadratic segments. These are especially useful, since we already have the tools to work with them. The following observations are key: Specifically, it corresponds to the second case with d = 1/w, where w is the number of integers m satisfying k = ⌊a/m⌋.
We can see this below the x-axis in Fig. 3, where w is the "width" of a level (e.g. around n = 80, we have w = 7).

1. Given a linear or quadratic segment, we can directly check it for a satisfying divisor by solving a linear or quadratic equation, respectively.
2. The graph of remainders can always be partitioned into such segments.
3. Therefore, we can search for a solution one segment at a time, rather than one divisor at a time.

With a good (that is, small) partition, we can shrink the search space significantly. However, this is still an exhaustive search, so proving the absence of a solution requires checking every segment. Furthermore, it is not clear that it is necessarily possible to construct a partition that is asymptotically smaller than the magnitude of the dividend. For the specific case we are interested in, however, the number of segments does not matter: for a suitable partition, we can show that the remainder at either the start or the end of any segment always satisfies the bounds. More specifically, according to Theorem 4.23, we are interested in remainders of the form (−ab mod n) or (ab mod n) for nonzero integers a, b, n. We now study the solution space under these special conditions.

5.1 Conditions for a Solution via Extrapolation

Given nonzero integers a and b, we construct an appropriate partition using a function f : R* → R such that f(x) = (−x² − |ab| mod x). Clearly, we have f(n) = (−|ab| mod n) for all nonzero integers n. Since the numerator is a quadratic itself, the interval between each pair of consecutive roots of f is a quadratic segment (Fig. 4, bottom). Now, note the following:

1. A segment contains solutions if and only if the value of least magnitude within it satisfies the bounds.
2. In each segment, the value of least magnitude is either at the start or the end of the segment.
3.
The start and the end of each segment lie between two consecutive roots of f, and are obtained by rounding those roots. Therefore, it suffices to prove that if f(x) = 0, then either (−ab mod ⌊x⌋) or (−ab mod ⌈x⌉) lies within the bounds. Note that a linear mapping does not suffice: although it fits better when x is large (this can be seen in Fig. 4), it does not provide sufficiently strong bounds when x is small.

The next result further explains the behavior of real extensions of remainders. In particular, we see that a quadratic extension makes quadratic segments have constant quotients.

Lemma 5.1 Let I be an interval not containing zero and let f be a real-valued function continuous over I. If (f(x) mod x) ≠ 0 for all x ∈ I, then ⌊f(x)/x⌋ = ⌊f(y)/y⌋ for all x, y ∈ I.

Proof We proceed by contraposition. Suppose ⌊f(x)/x⌋ ≠ ⌊f(y)/y⌋ for some x, y ∈ I. Without loss of generality, we assume that ⌊f(x)/x⌋ < ⌊f(y)/y⌋. Let q : I → R be a function such that q(t) = f(t)/t. Then, since f is continuous over I, so is q. If ⌊q(y)⌋ ≤ q(x), then ⌊q(y)⌋ ≤ ⌊q(x)⌋ by monotonicity, which is a contradiction. Therefore q(x) < ⌊q(y)⌋ ≤ q(y), and hence by the intermediate value theorem there is some w ∈ (x, y] ⊆ I such that q(w) = ⌊q(y)⌋. Thus f(w) is an integer multiple of w, and so (f(w) mod w) = 0, as desired.

Corollary 5.1.1 Let I be an interval not containing zero and let f be a real-valued function continuous over I. If (f(x) mod x) ≠ 0 for all x ∈ I, then (f(x) mod x) = f(x) − xq for all x ∈ I, where q is an integer such that q = ⌊f(y)/y⌋ for all y ∈ I.

(For instance, we can trivially assign each pair of consecutive divisors its own segment.)

Fig. 4 Graphs of (−1000 mod x) and (−x² − 1000 mod x). The dots indicate points where x is an integer.

The following lemma allows us to express the remainders of interest as quadratics.
Lemma 5.2 Let a, b ∈ Z and let f, g, h : R* → R be functions such that

f(t) = (−t² − |ab| mod t),
g(t) = (−ab mod t),
h(t) = −(ab mod t).

Let I be an interval not containing zero. Let q = ⌊(−x² − |ab|)/x⌋ for some x ∈ I. If f(t) ≠ 0 for all t ∈ I, then for all n ∈ I ∩ Z,

f(n) = −n² − qn − |ab|,
g(n) = f(n) if ab ≥ 0, and g(n) = n − f(n) otherwise,
h(n) = g(n) − n.

Proof Suppose f(t) ≠ 0 for all t ∈ I and let n ∈ I ∩ Z. Then, by Lemma 5.1 we immediately have f(n) = −n² − qn − |ab|. Since n is an integer, we also have f(n) = (−|ab| mod n). Thus, since f(n) is nonzero, it follows that n does not divide ab. Therefore g(n) and h(n) are nonzero, and hence

g(n) − n = −(n − (−ab mod n)) = −(ab mod n) = h(n).

If ab ≥ 0, then f(n) = (−ab mod n) = g(n). If ab < 0, then f(n) = (ab mod n) = −h(n) = n − g(n), and thus g(n) = n − f(n), as desired.

We now proceed to demonstrate the conditions under which the roots of the quadratic mapping provide a solution. We first derive a simpler expression for the quadratic passing through a given root.

Lemma 5.3 Let a, b, δ ∈ R and let x ∈ R*. Then −(x + δ)² + (x + δ)k − ab = −δ² − (x − ab/x)δ, where k = (x² + ab)/x.

Remark Note that (−x² − ab mod x) = 0 if and only if (x² + ab)/x is an integer.

We now bound the magnitude of that quadratic in terms of the difference between the factors in the numerator.

Lemma 5.4 Let a, b, x ∈ R. If 1 < a + 1 ≤ |x| ≤ b − 1, then |x − ab/x| < b − a − 1.

Proof Let f : R* → R be a function such that f(t) = t − ab/t and suppose 1 < a + 1 ≤ |x| ≤ b − 1. Since a and b are positive, we have ab > 0 and so f is strictly increasing over (−∞, 0) and over (0, +∞). Hence f(a + 1) ≤ f(|x|) ≤ f(b − 1). Now, since b − 1 and a are positive, we have

(b − a − 1) − f(b − 1) = (b − a − 1) − (b − 1) + ab/(b − 1) = ab/(b − 1) − a = a/(b − 1) > 0,

and hence b − a − 1 > f(b − 1).
Similarly, since b and a + 1 are positive,

−(b − a − 1) − f(a + 1) = −(b − a − 1) − (a + 1) + ab/(a + 1) = ab/(a + 1) − b = −b/(a + 1) < 0,

and so −(b − a − 1) < f(a + 1). Therefore, |f(|x|)| < b − a − 1. If x > 0, then |x| = x and the result follows immediately. If x < 0 instead, we have |f(|x|)| = |f(−x)| = |−f(x)| = |f(x)|, and so the result follows.

Remark We assume that a + 1 ≤ |x| ≤ b − 1 in order to obtain the necessary bound. When a and b are integers, if x = a or x = b, then (−ab mod x) = 0, which lies within the bounding interval by assumption.

The following lemma is an auxiliary result.

Lemma 5.5 Let x ∈ R and δ ∈ (0, 1). If δx and (δ − 1)(x + 1) are integers, then they are both nonzero and have opposite signs.

Proof Suppose δx and (δ − 1)(x + 1) are integers. If x = 0, then (δ − 1)(x + 1) = δ − 1 ∉ Z, which is a contradiction. If 0 < |x| ≤ 1, then δx ∉ Z, which is also a contradiction. Therefore |x| > 1, and hence x + 1 is nonzero and has the same sign as x. Therefore, since δ − 1 is negative, it follows that δx and (δ − 1)(x + 1) are nonzero and have opposite signs.

We now show that rounding a root of the quadratic extension gives a solution for a sufficiently wide interval. However, we assume that the factors in the numerator are within a factor of two of each other. Otherwise, the bound from Lemma 5.4 grows too quickly to be useful. Luckily, this assumption always holds when β = 2, which is the most common case by far.

Lemma 5.6 Let a, b ∈ Z and let x ∈ R*. Let I be an interval containing zero. If 0 < a ≤ |x| ≤ b, (−x² − ab mod x) = 0, b ≤ 2a and diam I ≥ a, then for some x̂₁, x̂₂ ∈ {⌊x⌋, ⌈x⌉},

• either (−ab mod x̂₁) ∈ I or −(ab mod x̂₁) ∈ I, and
• either (ab mod x̂₂) ∈ I or −(−ab mod x̂₂) ∈ I.

Proof Suppose 0 < a ≤ |x| ≤ b, (−x² − ab mod x) = 0, b ≤ 2a and diam I ≥ a. We shall first dispose of some trivial cases. If x ∈ Z, then ⌊x⌋ = ⌈x⌉ = x, so

(−ab mod ⌊x⌋) = (−ab mod ⌈x⌉) = (−x² − ab mod x) = 0 ∈ I,

and hence the result.
Since 0 ∈ I, if a < |x| < a + 1 instead, then either |⌊x⌋| = a or |⌈x⌉| = a, and hence the result holds trivially. Similarly, if b − 1 < |x| < b, then either |⌊x⌋| = b or |⌈x⌉| = b, and the result follows trivially again.

Suppose x ∉ Z and a + 1 < |x| < b − 1 instead, and let f : R → R be a function such that f(t) = −t² + kt − ab, where k = (x² + ab)/x. Then, since (−x² − ab mod x) = 0, it follows that k is an integer and thus f(t) is an integer whenever t is an integer. Now, let δ = ⌈x⌉ − x and y = ab/x − x − δ. Then, by Lemma 5.3,

f(⌈x⌉) = −⌈x⌉² + k⌈x⌉ − ab
= −(x + δ)² + k(x + δ) − ab
= −δ² − (x − ab/x)δ
= δ(ab/x − x − δ)
= δy.

Since x is not an integer, it follows that ⌈x⌉ − ⌊x⌋ = 1 and hence ⌊x⌋ = x + δ − 1. Thus, by Lemma 5.3 again,

f(⌊x⌋) = −⌊x⌋² + k⌊x⌋ − ab
= −(x + (δ − 1))² + k(x + (δ − 1)) − ab
= −(δ − 1)² − (x − ab/x)(δ − 1)
= (δ − 1)(ab/x − x − (δ − 1))
= (δ − 1)(y + 1).

Now, since b ≤ 2a, by Lemma 5.4, we have |ab/x − x| < b − a − 1 ≤ a − 1. Hence, applying the triangle inequality,

|f(⌈x⌉)| = |δy| < |ab/x − x − δ| ≤ |ab/x − x| + δ < |ab/x − x| + 1 < a,

and

|f(⌊x⌋)| = |(δ − 1)(y + 1)| < |ab/x − x − (δ − 1)| ≤ |ab/x − x| + (1 − δ) < |ab/x − x| + 1 < a,

and

|f(⌈x⌉) − f(⌊x⌋)| = |δy − (δ − 1)(y + 1)| = |y + 1 − δ| = |ab/x − x + 1 − 2δ| ≤ |ab/x − x| + |1 − 2δ| < |ab/x − x| + 1 < a.

Hence the distance between each of f(⌈x⌉), f(⌊x⌋) and 0 is strictly less than a ≤ diam I. Let c = min R and d = max R where R = {f(⌊x⌋), f(⌈x⌉)}. Then diam[c, d] = d − c < a. Since f(⌊x⌋) and f(⌈x⌉) are both integers and δ ∈ (0, 1), it follows by Lemma 5.5 that δy and (δ − 1)(y + 1) are nonzero and have opposite signs, and hence c < 0 < d. Thus both [c, d] and I contain zero, and since diam[c, d] < a ≤ diam I, we have either c ∈ I or d ∈ I by Lemma 2.3, and therefore either f(⌊x⌋) ∈ I or f(⌈x⌉) ∈ I. Similarly, since [−d, −c] also contains zero and has the same diameter as [c, d], we also have either −f(⌊x⌋) ∈ I or −f(⌈x⌉) ∈ I.
Now, in the following, note that we have both |f(⌊x⌋)| < a < |⌊x⌋| and |f(⌈x⌉)| < a < |⌈x⌉|, which drastically simplifies working with mod.

• Suppose f(⌈x⌉) and x have the same sign. Then −f(⌊x⌋) also has the same sign, and −f(⌈x⌉) and f(⌊x⌋) have the opposite sign. Therefore,

f(⌈x⌉) = (f(⌈x⌉) mod ⌈x⌉) = (−ab mod ⌈x⌉),
−f(⌊x⌋) = (−f(⌊x⌋) mod ⌊x⌋) = (ab mod ⌊x⌋),

and thus

−f(⌈x⌉) = −(−ab mod ⌈x⌉),
f(⌊x⌋) = −(ab mod ⌊x⌋).

Hence either (−ab mod ⌈x⌉) ∈ I or −(ab mod ⌊x⌋) ∈ I, and also either (ab mod ⌊x⌋) ∈ I or −(−ab mod ⌈x⌉) ∈ I.

• Suppose instead f(⌈x⌉) and x have opposite signs. Then f(⌊x⌋) and −f(⌈x⌉) have the same sign as x, and hence −f(⌊x⌋) has the opposite sign to x. Therefore,

f(⌊x⌋) = (f(⌊x⌋) mod ⌊x⌋) = (−ab mod ⌊x⌋),
−f(⌈x⌉) = (−f(⌈x⌉) mod ⌈x⌉) = (ab mod ⌈x⌉),

and thus

−f(⌊x⌋) = −(−ab mod ⌊x⌋),
f(⌈x⌉) = −(ab mod ⌈x⌉).

Hence either (−ab mod ⌊x⌋) ∈ I or −(ab mod ⌈x⌉) ∈ I, and also either (ab mod ⌈x⌉) ∈ I or −(−ab mod ⌊x⌋) ∈ I.

Therefore, for some x̂₁, x̂₂ ∈ {⌊x⌋, ⌈x⌉}, we have either (−ab mod x̂₁) ∈ I or −(ab mod x̂₁) ∈ I, and also either (ab mod x̂₂) ∈ I or −(−ab mod x̂₂) ∈ I.

Combining the above result with Theorem 4.23, we can finally show that, for equitable rounding functions, roots of the quadratic extension give a bound on plausible integral significands. With the help of Lemma 5.2, this will let us compute plausible numbers efficiently in base 2.

Theorem 5.7 Let Z ⊆ F be a floating-point interval, and let z ∈ Z ∩ F* and x ∈ F*. Let f : R* → R be a function such that f(t) = (−t² − |M_z|β^(p−k) mod t), where

k = 0 if |M_x| > |M_z|, and k = 1 otherwise,
M_x = x/ulp(x),
M_z = z/ulp(z).

Let r ∈ R* be a root of f. If fl is equitable, r ≥ M_x, and either β^(p−1) < |M_z| ≤ 2β^(p−1) or β = 2, then φ_Z(x) ≤ ⌈r⌉ ulp(x).

Proof Suppose the conditions hold and let I = (fl⁻¹[Z] − z)/ulp(z) and M = φ_Z(x)/ulp(x). We divide the proof into two major cases based on the value of M_z, and then handle the different values of k within those cases.
Suppose β^(p−1) < |M_z| ≤ 2β^(p−1). Then, since fl is equitable, by definition we have

diam fl⁻¹[Z] ≥ diam fl⁻¹[{z}] ≥ β^Q(z) ≥ ulp(z),

and therefore,

diam(Iβ^(p−k)) = β^(p−k) diam((fl⁻¹[Z] − z)/ulp(z)) = β^(p−k) diam(fl⁻¹[Z])/ulp(z) ≥ β^(p−k).

If k = 0, then |M_z| < |M_x|, so Lemma 4.13 implies that x is plausible for Z, and thus φ_Z(x) = x = M_x ulp(x) ≤ r ulp(x) ≤ ⌈r⌉ ulp(x).

Suppose k = 1 instead, and let a = β^(p−k) and b = |M_z|. Then b ≤ 2a = 2β^(p−1) by assumption and diam(Iβ^(p−k)) ≥ a. We now consider each sign of M_z:

• Suppose M_z > 0. Then b = M_z, and hence by Lemma 5.6, we have either −(ab mod r̂) = −(M_z β^(p−k) mod r̂) ∈ Iβ^(p−k) or (−ab mod r̂) = (−M_z β^(p−k) mod r̂) ∈ Iβ^(p−k) for some r̂ ∈ {⌊r⌋, ⌈r⌉}, and hence M ≤ r̂ ≤ ⌈r⌉ by Theorem 4.23.

• Suppose M_z < 0. Then b = −M_z, and by Lemma 5.6 we have either (ab mod r̂) = (−M_z β^(p−k) mod r̂) ∈ Iβ^(p−k) or −(−ab mod r̂) = −(M_z β^(p−k) mod r̂) ∈ Iβ^(p−k) for some r̂ ∈ {⌊r⌋, ⌈r⌉}, and hence M ≤ r̂ ≤ ⌈r⌉ by Theorem 4.23.

Therefore M ≤ ⌈r⌉ in both cases, and multiplying by ulp(x), we obtain φ_Z(x) ≤ ⌈r⌉ ulp(x).

Finally, suppose β = 2 and |M_z| = β^(p−1) instead. If k = 1, then |M_x| = |M_z|, so x is trivially plausible for Z and hence φ_Z(x) = x ≤ ⌈r⌉ ulp(x) immediately. Suppose k = 0 instead, and let a = |M_z| and b = β^(p−k). Then, since β = 2, we have b = 2a. Since fl is equitable,

diam fl⁻¹[Z] ≥ diam fl⁻¹[{z}] ≥ β^(Q(z)−1) ≥ ulp(z)/β,

and therefore,

diam(Iβ^(p−k)) = β^p diam((fl⁻¹[Z] − z)/ulp(z)) = β^p diam(fl⁻¹[Z])/ulp(z) ≥ β^(p−1) = a.

We now complete the proof using Lemma 5.6 and Theorem 4.23. By the same argument as earlier, we have M ≤ r̂ ≤ ⌈r⌉ for some r̂ ∈ {⌊r⌋, ⌈r⌉}, and thus φ_Z(x) ≤ ⌈r⌉ ulp(x), as desired.

Remark If r is the least root of f greater than M_x, then f is nonzero over (M_x, r). Thus, according to Lemma 5.2, the restrictions of M ↦ −(M_z β^(p−k) mod M) and M ↦ (−M_z β^(p−k) mod M) to (M_x, r) ∩ Z are quadratics.
Hence, if neither x nor ⌊r⌋ ulp(x) is plausible for Z, then we necessarily have φ_Z(x) = ⌈r⌉ ulp(x). Otherwise, φ_Z(x) ≤ ⌊r⌋ ulp(x), so we can compute φ_Z(x) by solving the quadratics from Lemma 5.2 to find the earliest point where they lie in the interval (fl⁻¹[Z] − z)/ulp(z).

As discussed above, combining the bound given by Theorem 5.7 with Lemma 5.2 allows us to solve Problem 1 for normal numbers in binary floating-point arithmetic. We next turn to assembling our results into an algorithm.

6 Algorithms

Theorem 3.5 states that the classical division-based algorithm can be used to produce optimal bounds for floating-point arithmetic if we can devise some way of finding the feasible floating-point numbers nearest any given infeasible number. If the bounds and their quotients are within the range of the normal numbers, Theorem 4.22 tells us that we can do this by instead finding the nearest plausible numbers. Finally, for equitable rounding functions in base 2 (and other bases under certain conditions), Theorem 5.7 allows us to efficiently find the next plausible number. Putting these together, we can derive an algorithm for efficiently computing optimal bounds on the factors of a binary floating-point multiplication.

Algorithms 1 to 6 form a complete pseudocode implementation of the results of this paper. They are annotated with the properties they satisfy and justifications for them. The proofs of correctness for Algorithms 4 to 6 require some additional minor results, which are included in Appendix A. We thus obtain the following:

Theorem 6.1 Algorithms 1 to 6 are correct. Algorithms 3 to 6 can be computed in O(1) arithmetic operations. Algorithms 1 and 2 can be computed in O(1) arithmetic operations and computations of the preimage of a floating-point interval under fl.
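The quadratics that these algorithms solve come from Lemma 5.2. As a numerical illustration (with a hypothetical product ab = 1000 and the interval I = [43, 44], chosen by us so that f has no root in I), the mod maps really do collapse to a single quadratic:

```python
# Lemma 5.2 in action: on an interval I containing no root of
# f(t) = (-t^2 - |ab|) mod t, f agrees with one fixed quadratic.
ab = 1000  # hypothetical product a*b

def f(n):
    # Python's % is the floored mod used throughout the paper.
    return (-n * n - abs(ab)) % n

# The quotient (-t^2 - ab)/t is strictly monotone on I = [43, 44], is not an
# integer at either endpoint, and has floor -67 at both endpoints, so it
# crosses no integer on I: f has no root there.
assert (-43 * 43 - ab) // 43 == (-44 * 44 - ab) // 44 == -67

q = (-43 * 43 - ab) // 43          # q from x = 43 in I, as in Lemma 5.2
for n in (43, 44):
    assert f(n) == -n * n - q * n - abs(ab)   # f(n) = -n^2 - qn - |ab|
    g = f(n) if ab >= 0 else n - f(n)
    assert g == (-ab) % n                     # g(n) = (-ab mod n)
    assert g - n == -(ab % n)                 # h(n) = g(n) - n
```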
Remark As in Lemma 2.25, for typical choices of fl, these algorithms have time complexity O(M(p) + log(e_max − e_min + 1)), where M(p) is the time complexity of multiplying two p-digit integers. We again do not take the base, precision or exponent range as constants in our analysis.

According to Theorem 3.5, given that the preconditions hold, applying Algorithm 1 to the bounds produced by the traditional algorithm gives optimal bounds on the factors. This takes O(1) arithmetic operations and computations of fl⁻¹, which is much less than the exponential worst case we proposed in Conjecture 2.27 for simply iterating the classical algorithm.

Although the preconditions of Algorithm 1 almost always hold when β = 2, they must still be checked. Otherwise, even though the algorithm always returns either a feasible number or +∞, the value returned may not in fact be the next feasible number (or, in the case of +∞ being infeasible, the problem instance may still be satisfiable).

In the event that some or all of the inputs are subnormal or we are working in a higher base, we still have some options for using these results. By Lemma 4.18, the next plausible number is always a lower bound on the next feasible number, and Algorithm 2 works correctly for subnormal input, since there is no such distinction over the unbounded floating-point numbers. Furthermore, it can be shown that Algorithm 2 returns a lower bound on the next plausible number even in higher bases. However, we make no claims about the efficiency of iterating Algorithm 2 to find the next feasible number under such conditions.

Now, as we demonstrated in Example 2.29, iterating the classical algorithm is insufficient to find optimal bounds on solution sets in the general case. That is, given a problem instance (X, Y, Z) where X, Y, Z ⊆ F are floating-point intervals, iterating F until a fixed point is reached may not produce optimal bounds on Z.
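The algorithms below manipulate integral significands M_x = x/ulp(x) and M_z = z/ulp(z) rather than floats directly. For β = 2 and p = 53 (IEEE binary64) these quantities are exactly computable; a small Python sketch for normal inputs (helper names are ours, not from the paper):

```python
import math

P = 53  # precision of IEEE binary64

def ulp(x):
    # Unit in the last place: 2**(E(x) - p + 1), where E(x) is the exponent of
    # x written as +-1.d...d * 2**E(x). Valid for normal, nonzero, finite x.
    assert x != 0.0 and math.isfinite(x)
    e = math.frexp(x)[1] - 1   # frexp returns a mantissa in [0.5, 1)
    return math.ldexp(1.0, e - P + 1)

def integral_significand(x):
    # M_x = x / ulp(x): an integer with 2**(p-1) <= |M_x| < 2**p for normal x.
    m = x / ulp(x)
    assert m == int(m)
    return int(m)

M = integral_significand(1.5)      # 1.5 = 1.1 (binary) * 2**0
assert M == 3 * 2 ** 51 and 2 ** 52 <= abs(M) < 2 ** 53
```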
We proposed solving this by using binary search to find the greatest and least values in Z for which the instance is satisfiable. (As a reminder, an instance is satisfiable if and only if its optimal bounds are nonempty.) Naively, this search takes O(p + log(e_max − e_min + 1)) satisfiability queries. However, we can do better. Lemma 3.8 says that if we have at least β values in Z, then almost any value is feasible for Z. Thus, in most cases, we only need to consider the first and last β values of Z to obtain optimal intervals. When β = 2, this means that we typically only need to trim at most one value from each endpoint of Z. Note that this situation only occurs when the problem instance is satisfiable—otherwise, our optimal bounds on X and Y will be empty, and thus so will the optimal bounds on Z.

Algorithm 1 Finding the next feasible number.
Require: fl equitable; normal floating-point number x; floating-point interval Z; z ∈ Z such that β^(e_min) ≤ |z/x| ≤ max F; either β = 2 or 1 < |z/ufp(z)| ≤ 2
Ensure: returns inf(Feas(Z) ∩ [x, +∞])
1: function NextFeasible(x, z, Z)
2:   if x ∈ Feas(Z) then  ▹ using Lemma 3.6
3:     return x
4:   end if
5:   x_b := NextPlausible(x, z, Z)  ▹ by Theorem 4.22, if a solution exists, it is either x_b or x_c
6:   if x_b ∈ Feas(Z) then  ▹ check that x_b ∈ F and then use Lemma 3.6
7:     return x_b
8:   end if
9:   x_c := NextPlausible(|x_b|, z, Z)
10:  if x_c ∈ Feas(Z) then
11:    return x_c
12:  end if
13:  return +∞  ▹ Feas(Z) ∩ [x, +∞] is empty by Theorem 4.22
14: end function

Algorithm 2 NextPlausible
Require: fl equitable; x ∈ F*; floating-point interval Z; nonzero and finite z ∈ Z; either β = 2 or 1 < |z/ufp(z)| ≤ 2
Ensure: returns φ_Z(x)
1: function NextPlausible(x, z, Z)
2:   M_x := x/ulp(x)
3:   M_z := z/ulp(z)
4:   I := (fl⁻¹[Z] − z)/ulp(z)
5:   if |M_x| > |M_z| then
6:     a := M_z
7:     b := β^p
8:     I := Iβ^p
9:   else
10:    a := β^(p−1)
11:    b := M_z
12:    I := Iβ^(p−1)
13:  end if
14:  M := NextDivisorInBounds(a, b, M_x, I)
15:  return M ulp(x)  ▹ by Theorems 4.23 and 5.7
16: end function

7 Conclusion

In this paper, we have rigorously presented the first step towards an efficient solution to the problem of computing optimal interval bounds on the variables of the floating-point constraint x ⊗ y = z. The algorithms presented here are sound, complete, and efficient for binary floating-point arithmetic involving normal numbers. This has immediate implications

Algorithm 3 NextDivisorInBounds
Require: integers a, b, n such that 0 < |a| ≤ |n| ≤ |b| ≤ |2a|; interval I containing zero such that diam I ≥ |a|
Ensure: returns min{m ∈ Z | m ≥ n ∧ ((−ab mod m) ∈ I ∨ −(ab mod m) ∈ I)}
1: function NextDivisorInBounds(a, b, n, I)
2:   if (−ab mod n) ∈ I ∨ −(ab mod n) ∈ I then
3:     return n
4:   end if
5:   r := NextQuadraticModLinearRootFloor(−1, −|ab|, n)
6:   if (−ab mod r) ∉ I ∧ −(ab mod r) ∉ I then
7:     return r + 1  ▹ by Lemma 5.6, since there are no solutions in [n, r]
8:   end if
9:   q := (−n² − |ab|)/n  ▹ solution is in [n, r]; find minimum using Lemma 5.2
10:  if ab > 0 then
11:    c := NextQuadraticPointWithinBounds(−1, −⌈q⌉, −ab, n, I)
12:    d := NextQuadraticPointWithinBounds(−1, −⌊q⌋, −ab, n, I)
13:  else
14:    c := NextQuadraticPointWithinBounds(1, ⌊q⌋, −ab, n, I)
15:    d := NextQuadraticPointWithinBounds(1, ⌈q⌉, −ab, n, I)
16:  end if
17:  return min{c, d}
18: end function

Algorithm 4 NextQuadraticModLinearRootFloor
Require: nonzero integers a, c, n
Ensure: returns ⌊min{x | x ∈ [n, +∞) ∧ (ax² + c mod x) = 0}⌋
1: function NextQuadraticModLinearRootFloor(a, c, n)
2:   q := (an² + c)/n
3:   R₁ := QuadraticRootsFloor(a, −⌊q⌋, c)
4:   R₂ := QuadraticRootsFloor(a, −⌈q⌉, c)
5:   return min{m ∈ R₁ ∪ R₂ | m ≥ n}  ▹ by Lemma A.4
6: end function

Algorithm 5 NextQuadraticPointWithinBounds
Require: integers a, b, c, n where a ≠ 0; interval I
Ensure: returns inf{m ∈ Z | m ≥ n ∧ am² + bm + c ∈ I}
1: function NextQuadraticPointWithinBounds(a, b, c, n, I)
2:   if an² + bn + c ∈ I then
3:     return n
4:   end if
5:   I := I ∩ Z
6:   if
I = ∅ then
7:     return +∞
8:   end if
9:   R₁ := QuadraticRootsFloor(a, b, c − max I)
10:  R₂ := QuadraticRootsFloor(a, b, c − min I)
11:  R := R₁ ∪ R₂
12:  R′ := {r + 1 | r ∈ R}  ▹ solutions are ceilings of roots per Lemma A.2
13:  return min({m ∈ R ∪ R′ | m ≥ n ∧ am² + bm + c ∈ I} ∪ {+∞})
14: end function

Algorithm 6 QuadraticRootsFloor
Require: integers a, b, c where a ≠ 0
Ensure: returns {⌊x⌋ | x ∈ R ∧ ax² + bx + c = 0}
1: function QuadraticRootsFloor(a, b, c)
2:   if a < 0 then
3:     a := −a
4:     b := −b
5:     c := −c
6:   end if
7:   Δ := b² − 4ac
8:   if Δ < 0 then
9:     return ∅
10:  end if
11:  r₁ := ⌊(−b + ⌊−√Δ⌋)/(2a)⌋  ▹ by applying Lemma A.1 to the quadratic formula
12:  r₂ := ⌊(−b + ⌊√Δ⌋)/(2a)⌋
13:  return {r₁, r₂}
14: end function

for floating-point decision procedures based on interval constraint solving. We hope that the techniques developed here will be useful in future study of floating-point theory.

As suggested earlier, a number of questions remain open in this topic:

• What is the optimal time complexity for computing feasible numbers in general (i.e. in all bases and with subnormals)? Can we asymptotically outperform brute-force search? Does an efficient algorithm exist for this task?
• Although the algorithms in this paper are relatively brief, the work required to develop them is much longer. Is there a shorter derivation of this solution?
• The results of this paper rely on a mixture of integer and floating-point reasoning. Can these results be stated naturally using only floating-point numbers? If so, is there a version of this algorithm which uses only floating-point arithmetic?
• Separately, Sect. 5 contains results which may be useful for solving mod constraints beyond the scope of this paper. Can they be applied more broadly?

Acknowledgements This research was supported by an Australian Government Research Training Program (RTP) Scholarship.
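Algorithm 6 needs the exact floors ⌊(−b ± √Δ)/(2a)⌋. One way to realize this in pure integer arithmetic, combining an integer square root with Lemma A.1, is sketched below (our rendering, not the paper's code):

```python
import math

def quadratic_roots_floor(a, b, c):
    # Floors of the real roots of a*x^2 + b*x + c = 0 (a != 0), computed
    # exactly in integer arithmetic, mirroring Algorithm 6.
    if a < 0:                      # normalize so that 2a > 0
        a, b, c = -a, -b, -c
    disc = b * b - 4 * a * c
    if disc < 0:
        return set()
    s = math.isqrt(disc)           # s = floor(sqrt(disc))
    # floor(-sqrt(disc)) is -s when disc is a perfect square, else -s - 1.
    neg = -s if s * s == disc else -s - 1
    # Lemma A.1 (floor(x/n) = floor(floor(x)/n) for integer n > 0) lets the
    # outer floor become integer floor division by 2a.
    r1 = (-b + neg) // (2 * a)     # floor((-b - sqrt(disc)) / (2a))
    r2 = (-b + s) // (2 * a)       # floor((-b + sqrt(disc)) / (2a))
    return {r1, r2}

assert quadratic_roots_floor(1, 0, -2) == {-2, 1}   # roots -sqrt(2), sqrt(2)
assert quadratic_roots_floor(1, -3, 2) == {1, 2}    # roots 1 and 2
```

Because `math.isqrt` is exact for arbitrary-precision integers, this avoids the rounding pitfalls of computing √Δ in floating point.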
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A Additional Proofs

This section contains some additional results used to show the correctness of Algorithms 4 to 6. There is also a proof of a minor property of the digital expansions of floating-point quotients.

Lemma A.1 Let x ∈ R and n ∈ Z. If n > 0, then ⌊⌊x⌋/n⌋ = ⌊x/n⌋.

Proof Suppose n > 0 and let δ₁ = x − ⌊x⌋ and δ₂ = (⌊x⌋ mod n). Since n is a positive integer and ⌊x⌋ is an integer, we have 0 ≤ δ₂ ≤ n − 1. Hence 0 ≤ δ₁ + δ₂ < n and thus ⌊(δ₁ + δ₂)/n⌋ = 0. By the definition of mod, we also have (⌊x⌋ − δ₂)/n = ⌊⌊x⌋/n⌋ and thus

⌊x/n⌋ = ⌊((⌊x⌋ − δ₂) + δ₁ + δ₂)/n⌋
= (⌊x⌋ − δ₂)/n + ⌊(δ₁ + δ₂)/n⌋
= ⌊⌊x⌋/n⌋ + ⌊(δ₁ + δ₂)/n⌋
= ⌊⌊x⌋/n⌋.

Lemma A.2 Let f be a continuous real function and let n ∈ Z and a, b ∈ R. Let M = {m ∈ Z | m ≥ n ∧ f(m) ∈ [a, b]}. If M is nonempty, then either min M = n or min M = ⌈x⌉ for some x ∈ R such that either f(x) = a or f(x) = b.

Proof If f(n) ∈ [a, b], the result follows trivially. Suppose instead that M is nonempty and m > n where m = min M. Then a ≤ f(m) ≤ b and either f(m − 1) < a or f(m − 1) > b by assumption. Suppose f(m − 1) < a.
Then, by the intermediate value theorem, there is some x ∈ (m − 1, m] such that f(x) = a. Since ⌈x⌉ = m, the result follows. Suppose f(m − 1) > b instead. Then, similarly, there is some x ∈ (m − 1, m] such that f(x) = b. Since ⌈x⌉ = m again, we are finished.

Lemma A.3 Let a, c ∈ R* and let q : R* → R be a function such that q(t) = (at² + c)/t. For all x ∈ R*, there exists some y ∈ [x, +∞) such that q is continuous over [x, y] and q(y) ∈ {⌈q(x)⌉, ⌊q(x)⌋}.

Proof In the following, note that q is continuous over its domain and q(t) = at + c/t for all t ∈ R*. We proceed by cases. Let x ∈ R* and suppose x is positive.

Suppose a is also positive. Then,

lim_{t→∞} q(t) = lim_{t→∞} at + lim_{t→∞} c/t = +∞ + 0 = +∞,

and thus q grows without bound. Since it is also continuous over (0, +∞), it therefore attains ⌈q(x)⌉ at some y ∈ [x, +∞), and hence the result follows.

Suppose a is negative instead. Then lim_{t→∞} q(t) = −∞ similarly, and hence q has no lower bound. Since it is continuous over (0, +∞), we thus have q(y) = ⌊q(x)⌋ for some y ∈ [x, +∞), and hence the result follows.

Now, suppose x is negative instead. Suppose c is positive. Then,

lim_{t→0⁻} q(t) = lim_{t→0⁻} at + lim_{t→0⁻} c/t = 0 − ∞ = −∞,

and therefore q has no lower bound over (−∞, 0). Since it is also continuous over (−∞, 0), we thus have q(y) = ⌊q(x)⌋ for some y ∈ [x, 0), as desired.

Suppose c is negative instead. Then lim_{t→0⁻} q(t) = +∞, and hence q has no upper bound over (−∞, 0). Since q is continuous over (−∞, 0), it follows that q(y) = ⌈q(x)⌉ for some y ∈ [x, 0). Therefore the result holds in all cases, and we are done.

Lemma A.4 Let a, c ∈ R* and let q : R* → R be a function such that q(t) = (at² + c)/t. Let x ∈ R* and R = {t ∈ [x, +∞) | q(t) ∈ Z}. Then R has a minimum, and either q(r) = ⌈q(x)⌉ or q(r) = ⌊q(x)⌋ where r = min R.

Proof Let S be a set such that S = {t ∈ [x, +∞) | q(t) ∈ {⌈q(x)⌉, ⌊q(x)⌋}}.
Clearly, S ⊆ R. For all t ∈ R* and u ∈ R, we have q(t) = u if and only if at² − ut + c = 0. Therefore, for any particular u, there are at most 2 choices of t that satisfy the equation, and hence S has at most 4 elements. By Lemma A.3, there is some y ∈ [x, +∞) such that q is continuous over [x, y] and q(y) ∈ {⌈q(x)⌉, ⌊q(x)⌋}, and so y ∈ S. Therefore S is nonempty and finite, and hence has a minimum.

Now, let s = min S. Since S ⊆ R, we thus have s ∈ R. Let t ∈ R ∩ [x, s]. Then q(t) is an integer by definition, and hence either q(t) ≥ ⌈q(x)⌉ or q(t) ≤ ⌊q(x)⌋. Therefore, since q is continuous over [x, y] and thus also continuous over [x, t], the intermediate value theorem implies that either q(u) = ⌈q(x)⌉ or q(u) = ⌊q(x)⌋ for some u ∈ [x, t]. Therefore u ∈ S, and hence s ≤ u by definition. Since u ≤ t ≤ s, we thus have s = t = u and hence R ∩ [x, s] = {s}, so min R = s. By the definition of S, we have either q(s) = ⌈q(x)⌉ or q(s) = ⌊q(x)⌋, so we are done.

Corollary A.4.1 Let a, c ∈ R* and let q : R* → R be a function such that q(t) = (at² + c)/t. For all x ∈ R*,

min{t ∈ [x, +∞) | (at² + c mod t) = 0} = min{t ∈ [x, +∞) | q(t) ∈ {⌈q(x)⌉, ⌊q(x)⌋}}.

Lemma A.5 For all x, y ∈ F*, if β is prime, then either x/y ∈ F̂* or x/y has no finite representation in base β.

Proof Let x, y ∈ F* and suppose β is prime and x/y has a finite representation in base β. We shall prove that x/y ∈ F̂*. Let M_x = x/ulp(x) and M_y = y/ulp(y). Then M_x/M_y has a finite base-β representation, and thus M_x/M_y = n/β^k for some n, k ∈ Z. If k ≤ 0, then

|n| = |M_x/M_y|β^k ≤ |M_x/M_y| ≤ |M_x| < β^p.

Suppose k > 0 instead, and suppose without loss of generality that β does not divide n and that M_x/M_y is in lowest terms. Then, since n = M_x β^k/M_y is an integer and M_x has no common factors with M_y, it follows that β^k/M_y is an integer.
Therefore, since β is prime and M_y is an integer, |M_y| is a power of β. However, since β is not a factor of n, it follows that |M_y| = β^k, and thus |n| = |M_x| < β^p again. Therefore M_x/M_y = n/β^k ∈ F̂* by definition in all cases, and so x/y ∈ F̂*.

Appendix B Glossary

B.1 List of Terms

diameter: the greatest distance between any pair of a set’s elements
exponent: the integer e in ±d₁.d₂…d_p × β^e
equitable: see Definition 3.11; roughly, a rounding function under which no floating-point number has an especially small preimage; alternatively, a rounding function which does not allocate too many reals to any floating-point number
faithful: a rounding function is faithful if it always rounds to either the nearest number above or below (i.e. does not jump over floating-point numbers)
feasible: x is feasible for Z if x is a floating-point factor of some z ∈ Z
floating-point factor: x is a floating-point factor of z if x ⊗ y = z for some floating-point number y
floating-point interval: a set of consecutive floating-point numbers
floating-point number: see Definition 2.6; roughly, a number of the form ±d₁.d₂…d_p × β^e
integral significand: the significand multiplied by β^(p−1), making it an integer
normal: a floating-point number which can be written without leading zeros with an exponent between e_min and e_max; equivalently, a floating-point number with magnitude no less than β^(e_min)
plausible: the unbounded floating-point numbers x and y are plausible for Z if fl(xy) ∈ Z
quantum: the place value of a 1 in the last place of a floating-point number
quantum exponent: the exponent of the quantum (the quantum is always a power of the base)
rounding function: a function from R to F
significand: the ±d₁.d₂…d_p part of ±d₁.d₂…d_p × β^e
subnormal: nonzero, but smaller than any normal number
unit in first place: the place value of a 1 in the first digit of a number written in base-β scientific notation; equally, β^e where e is the exponent of the number
unit in last place: the place value of a 1 in the last digit of a number written in base-β precision-p scientific notation; equally, β^(e−p+1) where e is the exponent of the number

B.2 List of Symbols

β Base (alt. radix)
p Precision
m Significand
M Integral significand
e Exponent
q Quantum exponent: satisfies q = e − p + 1
E(x) Exponent of x
Q(x) Quantum exponent of x
⊗ Rounded multiplication: x ⊗ y = fl(xy)
fl Any nondecreasing and faithful rounding function
RD(x) x rounded downward (i.e. to the nearest number below)
RU(x) x rounded upward (i.e. to the nearest number above)
RN(x) x rounded to the nearest number; no rule is specified for breaking rounding ties
fl⁻¹[Z] Preimage of Z under rounding; the set of extended real numbers that round to some z ∈ Z
Feas(Z) Set of all floating-point numbers feasible for Z
Plaus(Z) Set of all unbounded floating-point numbers plausible for Z
φ_Z(x) The least element of Plaus(Z) greater than or equal to x
ufp(x) Unit in first place of x
ulp(x) Unit in last place of x
F(X, Y, Z) Classical interval-based narrowing of X, Y, Z; see Definition 2.22
diam X Diameter of the set X
(x mod y) Remainder of floored division of x by y
Z Integers
Z* Nonzero integers
R Real numbers
R* Nonzero real numbers
R̄ Real numbers with +∞ and −∞
R̄* Nonzero real numbers with +∞ and −∞
F* Nonzero finite floating-point numbers
F Finite floating-point numbers
F̄ Finite floating-point numbers with +∞ and −∞
F̂* Nonzero finite unbounded floating-point numbers
F̂ Finite unbounded floating-point numbers
F̂̄ Finite unbounded floating-point numbers with +∞ and −∞

References

1.
Andrlon, M., Schachte, P., Søndergaard, H., Stuckey, P.J.: Optimal bounds for floating-point addition in constant time. In: 2019 IEEE 26th Symposium on Computer Arithmetic (ARITH), pp. 159–166 (2019). https://doi.org/10.1109/ARITH.2019.00038 2. Bagnara, R., Carlier, M., Gori, R., Gotlieb, A.: Exploiting binary floating-point representations for con- straint propagation. INFORMS J. Comput. 28(1), 31–46 (2016). https://doi.org/10.1287/ijoc.2015.0663 3. Barrett, G.: Formal methods applied to a floating-point number system. IEEE Trans. Softw. Eng. 15(5), 611–621 (1989). https://doi.org/10.1109/32.24710 4. Boldo, S., Melquiond, G.: Computer arithmetic and formal proofs: verifying floating-point algorithms with the Coq system. ISTE Press, London (2017). https://doi.org/10.1016/C2015-0-01301-6 5. Boldo, S., Melquiond, G.: Flocq: a unified library for proving floating-point algorithms in Coq. In: 2011 IEEE 20th Symposium on Computer Arithmetic, pp. 243–252 (2011). https://doi.org/10.1109/ARITH. 2011.40 6. Boldo, S., Muñoz, C.A.: A high-level formalization of floating-point number in PVS. NASA contractor report 2006-214298, NASA Langley Research Center, Hampton (2006). https://ntrs.nasa.gov/citations/ 20070003560. Accessed 04-08-2022 7. Brain, M., Niemetz, A., Preiner, M., Reynolds, A., Barrett, C., Tinelli, C.: Invertibility conditions for floating-point formulas. In: Dillig, I., Tasiran, S. (eds.) Computer Aided Verification: 31st International Conference, CAV 2019, New York City, July 15–18, 2019, Proceedings, Part II. Lecture Notes in Computer Science, vol. 11562, pp. 116–136. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25543-5_8 8. Brain, M., Tinelli, C., Rümmer, P., Wahl, T.: An automatable formal semantics for IEEE-754 floating- point arithmetic. In: Proceedings of the 22nd IEEE Symposium on Computer Arithmetic, pp. 160–167. IEEE Computer Society, (2015). https://doi.org/10.1109/ARITH.2015.26 9. 
Carreño, V.A., Miner, P.S.: Specification of the IEEE-854 floating-point standard in HOL and PVS. In: Higher Order Logic Theorem Proving and Its Applications. Category B proceedings, Aspen Grove, (1995). https://web.archive.org/web/20000919054331/ 10. Carreño, V.A.: Interpretation of IEEE-854 floating-point standard and definition in the HOL system. NASA technical memorandum 110189, NASA Langley Research Center, Hampton (1995). https://ntrs. nasa.gov/api/citations/19960008463/downloads/19960008463.pdf. Accessed 03- 08-2022 11. Daumas, M., Rideau, L., Théry, L.: A generic library for floating-point numbers and its application to exact computing. In: Boulton, R.J., Jackson, P.B. (eds.) Theorem Proving in Higher Order Logics: 14th 123 11 Page 52 of 52 M. Andrlon International Conference, TPHOLs 2001, Edinburgh, September 3–6, 2001. Proceedings. Lecture Notes in Computer Science, vol. 2152, pp. 169–184. Springer, Berlin (2001). https://doi.org/10.1007/3-540- 44755-5_13 12. Gallois-Wong, D., Boldo, S., Cuoq, P.: Optimal inverse projection of floating-point addition. Numer. Algorithms 83(3), 957–986 (2020). https://doi.org/10.1007/s11075-019-00711-z 13. Goldberg, D.: What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23(1), 5–48 (1991). https://doi.org/10.1145/103162.103163 14. Harrison, J.: A machine-checked theory of floating point arithmetic. In: Theorem Proving in Higher Order Logics:12th International Conference, TPHOLs ’99, Nice, France, September 14–17, 1999, Proceedings. Lecture Notes in Computer Science, vol. 1690, pp. 113–130. Springer, Berlin (1999). https://doi.org/10. 1007/3-540-48256-3_9 15. Harrison, J.: Floating point verification in HOL. In: Thomas Schubert, E., Windley, P.J., Alves-Foss, J. (eds.) Higher order logic theorem proving and its applications: 8th International Workshop, Aspen Grove, September 11–14, 1995. Proceedings. Lecture Notes in Computer Science, vol. 971, pp. 186– 99. Springer, Berlin (1995). 
https://doi.org/10.1007/3-540-60275-5_65 16. Harrison, J.: Floating-point verification using theorem proving. In: Bernardo, M., Cimatti, A. (eds.) Formal Methods for Hardware Verification: 6th International School on Formal Methods for the Design of Computer, Communication, and Software Systems, SFM 2006, Bertinoro, May 22–27, 2006, Advances Lectures. Lecture Notes in Computer Science, vol. 3965, pp. 211–242. Springer, Berlin (2006). https:// doi.org/10.1007/11757283_8 17. Jacobi, C., Berg, C.: Formal verification of the VAMP floating point unit. Formal Methods Syst. Des. 26(3), 227–266 (2005). https://doi.org/10.1007/s10703-005-1613-y 18. Jacobsen, C., Solovyev, A., Gopalakrishnan, G.: A parameterized floating-point formalizaton in HOL Light. Electron. Notes Theor. Comput. Sci. 317, 101–107 (2015). https://doi.org/10.1016/j.entcs.2015. 10.010 19. Loiseleur, P.: Formalisation en Coq de la norme IEEE 754 sur l’arithmétique à virgule flottante. Research report, École normale supérieure, Paris, (1997).https://web.archive.org/web/20001003100321/http:// www.lri.fr/%7Eloisel/rapport-stage-dea.ps.gz 20. Melquiond, G.: Floating-point arithmetic in the Coq system. Inform. Comput. 216, 14–23 (2012). https:// doi.org/10.1016/j.ic.2011.09.005 21. Michel, C.: Exact projection functions for floating point number constraints. In: Seventh International Symposium on Artificial Intelligence and Mathematics. AI&M 15-2002, Fort Lauderdale (2002) 22. Miner, P.S.: Defining the IEEE-854 floating-point standard in PVS. Technical Report NASA-TM-110167 (1995). https://ntrs.nasa.gov/citations/19950023402. Accessed 04-08-2022 23. Miner, P.S.: Formal specification of IEEE floating-point arithmetic using PVS. IFAC Proc. Vol. 28(25), 31–36 (1995). https://doi.org/10.1016/S1474-6670(17)44820-8 24. Muller, J.M., Brunie, N., de Dinechin, F., Jeannerod, C.P., Joldes, M., Lefèvre, V., Melquiond, G., Revol, N., Torres, S.: Handbook of Floating-Point Arithmetic, 2nd edn. Birkhäuser, Cham (2018). 
https://doi. org/10.1007/978-3-319-76526-6 25. Pan, J., Levitt, K.N.: A formal specification of the IEEE floating-point standard with application to the verification of F. In: 1990 Conference Record Twenty-Fourth Asilomar Conference on Signals, Systems and Computers, 1990., vol. 1, p. 505 (1990). https://doi.org/10.1109/ACSSC.1990.523389 26. Rümmer, P., Wahl, T.: An SMT-LIB theory of binary floating-point arithmetic. In: Informal Proceedings of the 8th International Workshop on Satisfiability Modulo Theories (SMT) at FLoC, Edinburgh p. 14 (2010) 27. Rump, S., Ogita, T., Oishi, S.: Accurate floating-point summation part I: faithful rounding. SIAM J. Sci. Comput. 31(1), 189–224 (2008). https://doi.org/10.1137/050645671 28. Std, I.E.E.E.: 754–2019: IEEE Standard for Floating-Point Arithmetic. Technical report, IEEE (2019) 29. Yu, L.: A formal model of IEEE floating point arithmetic. Archive of Formal Proofs (2013) 30. Ziv, A., Aharoni, M., Asaf, S.: Solving range constraints for binary floating-point instructions. In: Proceed- ings of the 16th IEEE Symposium on Computer Arithmetic. ARITH’03, pp. 158–164. IEEE Computer Society, Los Alamitos (2003). https://doi.org/10.1109/ARITH.2003.1207674 Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Journal of Automated Reasoning – Springer Journals
Published: Mar 1, 2023