20. How do I represent non-integer quantities?#

  • floating point

  • IEEE double format

20.1. What about a fixed point?#

Let’s experiment with an 8 bit representation: 1 bit for the sign, the next 4 bits for the whole-number part, and the last 3 bits for the fractional part.

so then the number:

01000001

would be

0-1000-001

positive, 8.1
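We can check this decoding with a short sketch (the slice positions follow the 1-4-3 split described above, and reading the last 3 bits as a single decimal digit of tenths is just this example’s convention):

bits = '01000001'
sign = -1 if bits[0] == '1' else 1    # first bit: sign
whole = int(bits[1:5], 2)             # next 4 bits: whole-number part
tenths = int(bits[5:], 2)             # last 3 bits: one decimal digit, read as tenths
sign * (whole + tenths / 10)          # 8.1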

With this scheme we can represent the digits 0-7 on the right hand side (3 bits) and 0-15 on the left hand side (4 bits), so we get

import itertools
# whole-number part 0-15 (4 bits) and fractional digit 0-7 (3 bits), read as tenths
num_list = [str(n + f) for n, f in itertools.product(range(16), [i/10 for i in range(8)])]
', '.join(num_list) + ', ' + ', '.join(['-' + n for n in num_list])
len(num_list)*2

This gives the same 256 bit patterns as any other 8 bit layout (we can check with the integer scheme below), but the largest magnitude we can represent is only 15.7, a much smaller range than the 0-127 we could cover with 1 bit for sign and 7 bits for an integer.

(2**7)*2

Another way we could represent numbers with a fixed point is to use base two fractions: \(\frac{1}{2}\), \(\frac{1}{4}\), \(\frac{1}{8}\), etc.

In this fixed point scheme we would have, for example, that 0101.1010 represents

\[ 0\cdot 2^3 + 1\cdot 2^2 + 0\cdot 2^1 + 1\cdot 2^0 + 1\cdot \frac{1}{2^{1}} + 0\cdot \frac{1}{2^{2}} + 1\cdot \frac{1}{2^{3}} + 0\cdot \frac{1}{2^{4}} = 4 + 1 + \frac{1}{2} + \frac{1}{8} = 5\frac{5}{8} = 5.625 \]

In this case we still have a small range of numbers, but it’s a different set of possible numbers.
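Here is a small sketch of that conversion, splitting the string at the point (the variable names are just for illustration):

bits = '0101.1010'
whole, frac = bits.split('.')
# integer part in base 2, plus each fractional bit weighted by 1/2, 1/4, 1/8, ...
int(whole, 2) + sum(int(b) * 2**-(i + 1) for i, b in enumerate(frac))   # 5.625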

20.2. Floating Point Notation#

We can write numbers in many different forms. We have already written integers in several different bases.

For example in scientific contexts, we often write numbers (in base 10) like:

\[ 3.45 \times 10^2 = 345 \]

We can use a similar form to represent numbers in binary, using base 2 instead of base 10.
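For example (an illustrative value, not anything special), the binary number \(101_2 = 5\) can be written as

\[ 1.01_2 \times 2^2 = 101_2 = 5 \]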

20.3. Floating point numbers are not all exact#

Let’s look at an example: we add .1 together 3 times, and the result is very close to the .3 we expect.

.1 + .1 + .1

However, if we check whether it is equal to .3, it is not.

.1 + .1 + .1 == .3

This is because the floating point representation is an approximation: many decimal numbers, including .1 and .3, cannot be expressed exactly, so what is stored is only the closest representable value. However, the display truncates since usually we only want a few significant digits.
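We can see the stored values by asking the display for more digits than it shows by default (20 digits here is an arbitrary choice):

format(.1 + .1 + .1, '.20f')
format(.3, '.20f')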
Even rounding does not make it work.

round(.1,1) + round(.1,1) + round(.1,1) == round(.3,1)
round(.1 + .1+ .1 ,1)
.1
num = .1
num

20.4. Floating point IEEE standard#

Now, let’s see how these numbers are actually represented.

IEEE Floating Point is what basically everyone uses, but it is technically a choice that hardware manufacturers can make.

  • Initially released in 1985

  • Revised in 2008 after a 7 year process that expanded the standard

  • Revised in 2019 after a 4 year process that mostly clarified it

IBM mainframes use their own representation based on hexadecimal.

The next revision is projected for 2028.

We will look at the double precision float, called binary64 in the current standard.

It was officially called double in the original standard, so it is commonly called that.

In this format we have 1 bit for the sign, 11 bits for the exponent, and 52 bits for the fraction part.

float bit image
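We can inspect these bits from Python with the struct module (a small sketch; the slice positions follow the 1-11-52 layout above):

import struct
# pack 1.5 as a big-endian IEEE 754 binary64 and show all 64 bits
bits = ''.join(f'{byte:08b}' for byte in struct.pack('>d', 1.5))
bits[0], bits[1:12], bits[12:]   # sign, exponent, and fraction fields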

20.4.1. How do we read a number like this?#

If the sign bit is \(s\), the number represented by the 11 exponent bits is \(e\), and the 52 fraction bits are numbered \(b_{51}\) down to \(b_0\) (so the rightmost bit is \(b_0\)), then the value is:

\[ (-1)^s \left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023} \]

Note that the powers in the sum are negative (\(2^{-i}\)), so we are working with fractions instead of integers in the sum.

So if we had:

0 01111111111 0000000000000000000000000000000000000000000000000000

it would represent:

\[ (-1)^0 \left(1 + 0 \cdot 2^{-1} + 0 \cdot 2^{-2} + \ldots + 0 \cdot 2^{-51} + 0 \cdot 2^{-52}\right) \times 2^{1023-1023} = 1 \times (1 + 0) \times 2^0 = 1 \times 1 = 1.0 \]

or

0 01111111111 1000000000000000000000000000000000000000000000000000

it would represent:

\[ (-1)^0 \left(1 + 1\cdot 2^{-1} + 0\cdot 2^{-2} + \ldots + 0\cdot 2^{-51} + 0\cdot 2^{-52}\right) \times 2^{1023-1023} = 1 \times \left(1 + \frac{1}{2}\right) \times 2^0 = 1.5 \times 1 = 1.5 \]

0b01111111111                # the exponent field as an integer: 1023
float.hex(1.5)
0b01111111111 - 1023         # subtract the bias to get the actual exponent: 0
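As a check, here is a small sketch that applies the formula above to a 64-character bit string (the function name is ours, and it ignores special cases like zero, subnormals, infinity, and NaN):

def decode_binary64(bits):
    # bits: 64 characters of 0/1 -- 1 sign bit, 11 exponent bits, 52 fraction bits
    s = int(bits[0])
    e = int(bits[1:12], 2)
    frac = sum(int(b) * 2**-(i + 1) for i, b in enumerate(bits[12:]))
    return (-1)**s * (1 + frac) * 2**(e - 1023)

decode_binary64('0' + '01111111111' + '1' + '0'*51)   # the 1.5 example above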

20.4.2. How do we get numbers into this form?#

Now, let’s return to our example of .1.

First we take the sign of the number for the sign bit, but then to get the exponent and fraction, we have more work to do.

Let’s take the equation we had from the standard:

\[ \left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023}\]

Think about the fraction part and how we add fractions: by reaching a common denominator. Since the denominators are all powers of two, we can use the largest one, \(2^{52}\).

\[ \left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023}\]
\[ \left( 1 + \frac{b_{51}}{2^{1}} + \frac{b_{50}}{2^{2}} + \cdots + \frac{b_{1}}{2^{51}} + \frac{b_{0}}{2^{52}} \right) \times 2^{e-1023} \]

Now with a common denominator:

\[ \left(\frac{2^{52}}{2^{52}} + \frac{2^{51} b_{51}}{2^{52}} + \frac{2^{50} b_{50}}{2^{52}} + \cdots + \frac{2^{1} b_{1}}{2^{52}} + \frac{2^0 b_{0}}{2^{52}} \right) \times 2^{e-1023} \]

So the numerator becomes a 53 bit binary number (the 52 fraction bits plus the leading 1) over a denominator of \(2^{52}\); let’s call that numerator \(J\).

\[ \frac{J}{2^{52}} \times 2^{e-{1023}} \]

We can then combine the powers of 2:

\[ \frac{J}{2^{52-e+1023}} \]

So, in order to return to the .1 that we want to represent, we can write it as a fraction and then approximate it in the form above.

\[\frac{1}{10} \approx \frac{J}{2^N}\]
\[J \approx \frac{2^N}{10}\]

Since we want to use exactly 53 bits to represent \(J\), we want \(\frac{2^N}{10}\) to be greater than or equal to \(2^{52}\) and less than \(2^{53}\).

\[ 2^{52} \leq \frac{2^N}{10} < 2^{53} \]

Since \(10 = 8 + 2 = 2^3 + 2^1\), we know that \(2^3 < 10 < 2^4\), so we can say that

\[ \frac{2^N}{2^4} < \frac{2^N}{10} < \frac{2^N}{2^3} \]

\[ 2^{N-4} < \frac{2^N }{10} < 2^{N-3} \]

so if we want:

\[ 2^{52} \leq \frac{2^N}{10} < 2^{53} \]

then the best \(N\) is 56 (so that \(2^{N-4} = 2^{52}\) and \(2^{N-3} = 2^{53}\)), but we can check it.

2**52 <= 2**56 //10 < 2**53

Now we can get the number we will use for \(J\). We want our fraction to be as close to .1 as possible, so:

q,r = divmod(2**56,10)

Then we check the remainder to decide if we should round up by 1 or not.

r

\( 6 > 5 = \frac{10}{2}\) so we round up

q+1

Then we can check the length in bits of that number:

len(bin(q+1)) - 2   # subtract 2 for the '0b' prefix
bin(q+1)

We need \(52-e+1023 = 56\), so \(e\) must be:

52-56+1023

Python doesn’t provide a binary representation of floats, but it does provide a hex representation.

float.hex(.1)

If we take the binary representation of \(q+1\) from above, it matches the part before the p:

0b1 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010

After the p is the exponent: \(-4 = 1019-1023\), which matches what we found.

(q+1)/2**56 == .1

This confirms that the approximation we found is the same as the float representation of .1.

format(.1, '.17f')

We can also use built-in classes, like fractions.Fraction and decimal.Decimal, to get at the needed quantities.
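For example, a quick check (reusing the \(q\) computed above) that the exact value Python stores for .1 is the fraction we constructed:

from fractions import Fraction
q, r = divmod(2**56, 10)
# the stored value of .1 is exactly (q+1)/2**56; Fraction reduces it to lowest terms
Fraction.from_float(.1) == Fraction(q + 1, 2**56)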

20.5. Example In Python#

Python 3.8.3 (default, Jul  2 2020, 11:26:31) 
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 2**11
2048
>>> 2**10
1024
>>> .1 + .1 + .1
0.30000000000000004
>>> .1 + .1 + .1 == .3
False
>>> .3
0.3
>>> .1
0.1
>>> round(.1,1) *3 ==round(.3,1)
False
>>> round(.1*3, 1) 
0.3
>>> q,r =divmod(2**56,10)
>>> q
7205759403792793
>>> r
6
>>> J = q+1
>>> len(bin(J))
55
>>> 52-56+1023
1019
>>> float.hex(.1)
'0x1.999999999999ap-4'
>>> 1019-1023
-4
>>> (q+1)/2**56 == .1
True
>>> from decimal import Decimal
>>> from fractions import Fraction
>>> Fraction.from_float(.1)
Fraction(3602879701896397, 36028797018963968)
>>> q+1/2
7205759403792794.0
>>> q+1/4
7205759403792793.0
>>> (q+1)2
  File "<stdin>", line 1
    (q+1)2
         ^
SyntaxError: invalid syntax
>>> (q+1)/2
3602879701896397.0
>>> Fraction.from_float(.1)
Fraction(3602879701896397, 36028797018963968)
>>> exit()

20.6. Review today’s class#

Free. You may do the following as a bonus practice badge though.

  1. Write a C program to compare values as doubles and as floats (single precision/32 bit) to see that this comparison issue comes from the IEEE standard and is not language specific. Make notes and comparisons about its behavior and include it in a code cell in cdouble.md

20.7. Prepare for Next Class#

None, come as you are.

20.8. More Practice#

Free. You may do the following as a bonus practice badge though.

  1. Write a C program to compare values as doubles and as floats (single precision/32 bit) to see that this comparison issue comes from the IEEE standard and is not language specific. Make notes and comparisons about its behavior and include it in a code cell in cdouble.md

  2. In floatexpt.md design an experiment using the fractions.Fraction class in Python that helps illustrate why .1*3 == .3 evaluates to False but .1*4 == .4 evaluates to True.

20.9. Experience Report Evidence#

Nothing separate

20.10. Questions After Today’s Class#

Post manually for a community badge.