30 March 2018

Floating-point numbers

Often floating-point numbers lead to confusion and frustration. Unfortunately, these problems cannot be avoided and to properly work with floating-point values a basic understanding is required.

How floating-point numbers are stored
Floating point decimal values generally do not have an exact binary representation. This is a side effect of how the FPU represents and processes floating point numbers. The storage format for Double and Single is the same as expected by the FPU-registers of the CPU. This ensures consistency and fast reading from and writing to memory. The problem however, is how to store a floating-point value in a binary computer. This is solved by storing a floating-point number as a formula. There are two types of floating-point numbers: Float (or Single) and Double. The difference is their size in bytes, and therefore the minimum and maximum values that can be stored. Another difference is the higher accuracy for a Double. The maximum number a Float (taking 4-bytes) can store is much less than a Double (taking 8-bytes) can store. Since floating-point numbers can have an infinite number of values, you cannot store all of them in either 4 bytes (float) or 8 bytes (double). To be able to store as much numbers as possible, with as much accuracy as possible, another approach is necessary. A floating-point value is stored as a formula:

X = (-1)^sign * 2^(exponent - bias) * (1 + fraction * 2^-23)

The formula contains 3 variables (sign, exponent, fraction) and one constant (bias).The bias for single-precision numbers is 127 and 1,023 (decimal) for double-precision numbers. The values of these formula-variables are stored in either 4 bytes for a Single or 8 bytes for a Double.

The next example uses an user defined type Sfloat to illustrate the storage of the Float data type. The values for fraction and exponent, together with a sign bit, are stored in the 4 bytes. By using a Union we can assign a value to a Single variable and then use Sfloat to dump the 4 bytes that make up the Float:

Type Sfloat
  fraction As Bits 23   // fractional part
  exponent As Bits  8   // exponent + 127
  sign     As Bits  1   // sign bit
Type TFloat Union       // sizeof() = 4
  value As Float
  sf    As Sfloat
Dim fv As TFloat
fv.value = 2.0     : DumpFloat(fv)
fv.value = 0.0     : DumpFloat(fv)
fv.value = -345.01 : DumpFloat(fv)
' Do test some more ..

Proc DumpFloat(ByRef tflt As TFloat)
  Global Const bias As Int = 127        ' Standard IEEE
  Global Const frexp As Float = 2 ^ -23 ' Standard IEEE
  Dim flt!

  Debug "> DumpFloat:"; tflt.value;
  With tflt.sf
    Debug " (sign =";.sign; " exponent ="; .exponent; _
      " fraction =";.fraction;")"
    Debug "Binary format: ";Bin(.sign, 1)` _
      Bin(.exponent, 8)`Bin(.fraction, 23)

    ' Reconstruct value from Sfloat using formula:
    flt! = ((-1) ^ .sign Mul 2 ^ (.exponent - bias)) _
      * (1 + (.fraction * frexp))
    Debug "Float reconstructed ="; flt!

The output of the demo is:

> DumpFloat: 2 (sign = 0 exponent = 128 fraction = 0)
Binary format: 0 10000000 00000000000000000000000
Float reconstructed = 2
> DumpFloat: 0 (sign = 0 exponent = 0 fraction = 0)
Binary format: 0 00000000 00000000000000000000000
Float reconstructed = 0
> DumpFloat:-345.01 (sign = 1 exponent = 135 fraction = 2916680)
Binary format: 1 10000111 01011001000000101001000
Float reconstructed =-345.01

What does this tell us? A Float (or Double) is stored and described using 3 components in the bits of either a 4 or 8 bytes type. Due to the limited storage of all these components only an approximation of the decimal value can be ‘described’. When a floating point value is assigned to a variable the value is dissected into these 3 components. To get back to the original floating-point number these 3 components are substituted in this standardized formula.

Effect of floating point values
A floating-point value is stored by a description, not by its value. This makes it inherently inaccurate. Even common decimal fractions, such as 0.0001 cannot be represented exactly in binary, only fractional numbers of the form n/f where f is an integer power of 2 can be expressed exactly with a finite number of bits. Examples are 1/4, 7/16, 3/128, of each f is a power of 2.
The inaccuracy may increase slightly when a floating point value is loaded into a 80-bits FPU register. The FPU fills the remaining bits, because the 80-bits representation is different from the 32-bits Single or 64-bits Double format. Moving a value out of the FPU register will round the value back to fit in either a Single or a Double. Storing and transporting may add to the inaccuracy of of the value.

The following example shows what happens when the small error in representing 0.0001 propagates to the sum:

Dim dSum As Double, i As Int
For i = 1 To 10000
  dSum = dSum + 0.0001
Next i
Debug dSum     ' = 0.999999999999906

Theoretically the sum should be 1.0.

Not only the calculations suffer from inaccuracy, comparisons with floating point numbers are equally problematic. The following example demonstrates a ‘forbidden’ comparison between a floating-point constant number and the result of a calculation:

Global Double dVal1, dVal2
dVal1 = 69.82
dVal2 = 69.20 + 0.62
Assert dVal1 == dVal2  ' Not equal

This throws an ASSERT exception, because the assertion that dVal1 and dVal2 are equal fails.
A comparison between two floating-point constants of the same type is allowed. For instance, the GFA-BASIC runtime returns a Single constant from DllVersion (2.33; 2.341; etc.). This constant may be compared to a literal Single constant (note the exclamation mark, without it 2.33 is a double!):

If DllVersion == 2.33! MsgBox "This is version 2.33"

Never compare two different data types, a Single to to Double, or a floating-point to an integer, These comparisons will most certainly fail (unless they can be described using a finite number of bits, see above). Any comparison to a floating point will most likely fail, because the comparison is executed in the FPU expanding the values to 80-bits. The same is true for the comparison of the results of two floating point calculations, it will most certainly fail.

The most logical solution for floating-point comparison of type Double is the use of the special operator NEAR, which uses only 7 decimal digits from both expressions for the comparison. In practice the expressions are compared as if they are both of Single precision.

Improve floating-point consistency in calculations
If your application expects multiple fp-calculations, it is necessary to keep the intermediate values in the proper data format, otherwise small errors are propagated through the calculations. For multiple floating-point calculations the FPU uses the intermediate results that it holds in the 80-bits FPU registers. However, these 80-bits calculations do not reflect the data types involved, the Single or Double. Due to the extra level of accuracy multiple calculations may produce unexpected results. The compiler setting ‘Improve floating-point consistency’ inserts code to load and write immediate results from and to memory in the appropriate type. This decreases program speed, but improves the chance for an expected result of the calculation. Make sure the ‘Improve floating-point consistency’ is checked always (unless you know exactly what you’re doing).

Floating-point values are inherently inaccurate, you might want to avoid them as much as possible. Instead use integers when ever possible, or otherwise use Currency, which is an integer value as well. The Currency data type exactly stores up to 19 digits, with 4 digits after the decimal point.

No comments:

Post a Comment