21. How does a computer represent non-integer quantities?#

So far we have looked at how computers represent integers in binary and how they can do math using logic gates.

A lot of the operations we do require representing quantities that are not whole numbers.

21.1. Let’s try a fixed point#

Let’s experiment with an 8-bit representation: 1 bit for the sign, the next 4 bits for the integer part, and the last 3 bits for the fractional part.

So the number:

01000001

would be interpreted like:

0-1000.001

positive, 8.1
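We can sketch this interpretation in Python (the `decode_fixed` helper is just an illustrative name for these notes, not a standard function):

```python
# decode our hypothetical 8-bit sign/integer/fraction fixed point format
def decode_fixed(bits):
    sign = '-' if bits[0] == '1' else ''
    whole = int(bits[1:5], 2)   # 4 bits for the whole-number part
    frac = int(bits[5:], 2)     # 3 bits written after the point
    return f"{sign}{whole}.{frac}"

decode_fixed('01000001')   # '8.1'
```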

In this scheme we can represent the numbers 0-7 in the fractional (right) part and 0-15 in the integer (left) part, so we get:

import itertools
num_list = [str(n)+'.'+str(f) for n,f in itertools.product(range(16),range(8))]

', '.join(num_list) + ', ' + ', '.join(['-'+ n for n in num_list])
'0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 8.0, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 9.0, 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 10.0, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 11.0, 11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 12.0, 12.1, 12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 13.0, 13.1, 13.2, 13.3, 13.4, 13.5, 13.6, 13.7, 14.0, 14.1, 14.2, 14.3, 14.4, 14.5, 14.6, 14.7, 15.0, 15.1, 15.2, 15.3, 15.4, 15.5, 15.6, 15.7, -0.0, -0.1, -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -1.0, -1.1, -1.2, -1.3, -1.4, -1.5, -1.6, -1.7, -2.0, -2.1, -2.2, -2.3, -2.4, -2.5, -2.6, -2.7, -3.0, -3.1, -3.2, -3.3, -3.4, -3.5, -3.6, -3.7, -4.0, -4.1, -4.2, -4.3, -4.4, -4.5, -4.6, -4.7, -5.0, -5.1, -5.2, -5.3, -5.4, -5.5, -5.6, -5.7, -6.0, -6.1, -6.2, -6.3, -6.4, -6.5, -6.6, -6.7, -7.0, -7.1, -7.2, -7.3, -7.4, -7.5, -7.6, -7.7, -8.0, -8.1, -8.2, -8.3, -8.4, -8.5, -8.6, -8.7, -9.0, -9.1, -9.2, -9.3, -9.4, -9.5, -9.6, -9.7, -10.0, -10.1, -10.2, -10.3, -10.4, -10.5, -10.6, -10.7, -11.0, -11.1, -11.2, -11.3, -11.4, -11.5, -11.6, -11.7, -12.0, -12.1, -12.2, -12.3, -12.4, -12.5, -12.6, -12.7, -13.0, -13.1, -13.2, -13.3, -13.4, -13.5, -13.6, -13.7, -14.0, -14.1, -14.2, -14.3, -14.4, -14.5, -14.6, -14.7, -15.0, -15.1, -15.2, -15.3, -15.4, -15.5, -15.6, -15.7'

That is it, all of the numbers we can represent in this way.

len(num_list)*2
256

Important

correction

It is still the same number of values, but they’re spread out differently.

The integers would be spread out uniformly, but these are not.

(2**7)*2
256

21.2. Binary fractions#

Another option, instead of the fixed point we just tried, is to use base-two fractions, just like how we represent fractional numbers in decimal. Instead of the places to the right of the point being \(\frac{1}{10}\), \(\frac{1}{100}\), \(\frac{1}{1000}\), they are \(\frac{1}{2}\), \(\frac{1}{4}\), \(\frac{1}{8}\), etc.

With this kind of fixed point we would have, for example, 0101.1010, which would be

\[ 0\cdot 2^3 + 1\cdot 2^2 + 0\cdot 2^1 + 1\cdot 2^0 + 1\cdot \frac{1}{2^1} + 0\cdot \frac{1}{2^2} + 1\cdot \frac{1}{2^3} + 0\cdot \frac{1}{2^4} = 4 + 1 + \frac{1}{2} + \frac{1}{8} = 5 + \frac{5}{8} = 5.625 \]

In this case we still have a small set of numbers, but it’s a different set of possible numbers.
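A short sketch of evaluating such a binary fraction in Python (the `binary_fraction` name is just for illustration):

```python
# evaluate a binary string with a binary point, e.g. '0101.1010'
def binary_fraction(s):
    whole, frac = s.split('.')
    value = int(whole, 2)                  # integer part, base 2
    for i, bit in enumerate(frac, start=1):
        value += int(bit) * 2**-i          # place values 1/2, 1/4, 1/8, ...
    return value

binary_fraction('0101.1010')   # 5.625
```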

21.3. Review: Scientific Notation#

We can write numbers in many different forms. We have written integers in many different bases so far.

For example in scientific contexts, we often write numbers (in base 10) like:

\[ 3.45 \times 10^2 = 345 \]

or

\[ 3.45 \times 10^{-2} = .0345 \]

Sometimes we shorten these to 3.45e2.

In math, we do not have to decide explicitly in advance how many place values we are going to use, but to build a computer we do. So we have to decide how many bits we will use and how to use each one of them.

21.4. Floating point numbers are not all exact#

Before we dig into the details, let’s review why this matters.

Julia Evans asked people on Mastodon for examples of how floats do weird things and then wrote a blog post summarizing eight examples, with a link to the Mastodon thread with even more replies.

She also wrote a zine with complementary resources. If you would like a print copy of the zine, email Dr. Brown.

Let’s look at an example: we add .1 together 3 times, and we do not get exactly the result we expect.

.1 + .1 + .1
0.30000000000000004

And if we check, it does not equal .3:

.1 + .1 + .1 == .3
False

This is because the floating point representation is an approximation: most real numbers cannot be expressed exactly, so any given value is stored as the nearest number that can be. The display usually truncates this, since we typically want only a few significant digits.
Even rounding does not make it work.

round(.1,1) + round(.1,1) + round(.1,1) == round(.3,1)
False
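One way to see what is going on is Python’s fractions.Fraction class, which can recover the exact rational value a float actually stores (a quick sketch; the numbers here come from the 64-bit representation we derive below):

```python
from fractions import Fraction

# the exact value stored for the literal 0.1 is not 1/10
print(Fraction(0.1))                      # 3602879701896397/36028797018963968
print(Fraction(0.1) == Fraction(1, 10))   # False
```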

21.5. Floating point IEEE standard#

Now, let’s see how these numbers are actually represented.

IEEE Floating Point is what basically everyone uses, but it is technically a choice that hardware manufacturers can make.

  • Initially proposed in 1985

  • Revised in 2008 after a 7-year process that expanded it

  • Revised in 2019 after a 4-year process that mostly clarified it

IBM mainframes use their own representation based on hexadecimal.

The next revision is projected for 2028.

It includes specifications for using varying numbers of bits. In different programming languages, different types are represented with different formats from the standard.

In C/C++, for example, float is a single precision floating point number, using 32 bits; double is double precision, using 64 bits.

In Python, float is 64 bits on a 64-bit computer.

In JavaScript, there is no separate integer type; all numbers are floats.

21.5.1. Double Standard#

This is a double precision float, or binary64 in the current standard.

It was officially called double in the original standard, so it is commonly called that.

In this case we will use:

  • 1 bit for sign

  • 11 bits for exponent and

  • 52 bits for the fraction part

float bit image

21.5.2. How do we read a number like this?#

Say the sign bit is \(s\), the number represented by the exponent bits is \(e\), and the 52 fraction bits are numbered so that the rightmost is \(b_0\) and the leftmost is \(b_{51}\).

Then we read it as a value like:

\[ (-1)^s \left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023} \]

Note that this is \(2^{-i}\) so we are working with fractions instead of integers in the sum.

So if we had:

0 01111111111 0000000000000000000000000000000000000000000000000000

it would represent:

\[ (-1)^0 \left(1 + 0 \cdot 2^{-1} + \ldots + 0 \cdot 2^{-51}+ 0 \cdot 2^{-52} \right)\times 2^{1023-1023} = 1 \times (1 + 0) \times 2^0 = 1 \times 1 = 1.0 \]

In English:

  • the sign bit is zero, so it is a positive number

  • the exponent bits represent 1023, so we end up multiplying by \(2^{1023-1023} = 2^0 = 1\)

  • the fraction bits represent 0, so we end up with just the 1 in the equation

visualized

and

0 01111111111 0100000000000000000000000000000000000000000000000000

would represent:

\[ (-1)^0 \left(1 + 0\cdot 2^{-1} + 1\cdot 2^{-2} + 0\cdot 2^{-3} + \ldots + 0\cdot 2^{-51} + 0\cdot 2^{-52} \right) \times 2^{1023-1023} = 1 \times \left(1 + \frac{1}{4}\right) \times 2^0 = 1.25 \times 1 = 1.25 \]

You can see this visually using float exposed
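We can also sketch the read-a-double formula as a small Python function (`decode_double` is an illustrative name for these notes, not a standard one; it handles only normal numbers, not special cases like zero or infinity):

```python
# decode a 64-character bit string using the formula above
def decode_double(bits):
    s = int(bits[0])                       # sign bit
    e = int(bits[1:12], 2)                 # 11 exponent bits
    # each fraction bit b contributes b * 2^-(position), left to right
    frac = sum(int(b) * 2**-(i + 1) for i, b in enumerate(bits[12:]))
    return (-1)**s * (1 + frac) * 2**(e - 1023)

decode_double('0' + '01111111111' + '0' * 52)          # 1.0
decode_double('0' + '01111111111' + '01' + '0' * 50)   # 1.25
```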

Tip

the URL of that link is float.exposed/0x3ff4000000000000

3ff4000000000000 is the number we want to view, in hex. Since the first 12 bits of the 64-bit number are the sign and exponent, the first 3 hex characters represent those; the remaining 13 characters represent the 52 bits of the fraction.

So 3ff = 001111111111 and then 4000000000000 = 0100000000000000000000000000000000000000000000000000

Python can give you the hex representation of a float:

float.hex(1.25)
'0x1.4000000000000p+0'

This matches what we saw above, though it takes some re-arranging; working that out is good practice.
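We can also pull the raw bits out ourselves with the standard struct module, by reinterpreting the float’s 8 bytes as an integer; the result matches the hex in the float.exposed URL above:

```python
import struct

# pack 1.25 as a little-endian double, unpack the same bytes as a 64-bit int
bits = struct.unpack('<Q', struct.pack('<d', 1.25))[0]
hex(bits)   # '0x3ff4000000000000'
```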

21.5.3. How do we get numbers into this form?#

Now, let’s return to our example of .1.

First we take the sign of the number for the sign bit, but then to get the exponent and fraction, we have more work to do.

To figure this out, we’ll start by manipulating the format that we need to get our number into:

\[ \left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023}\]

Think about the fraction part and how we add fractions: by reaching a common denominator. Since the denominators are all powers of two, we can use the last one, \(2^{52}\), as our common denominator:

\[ \left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023}= \left( 1 + \frac{b_{51}}{2^{1}} + \frac{b_{50}}{2^{2}} + \cdots + \frac{b_{1}}{2^{51}} + \frac{b_{0}}{2^{52}} \right) \times 2^{e-1023} \]

Now with a common denominator:

\[ \left(\frac{2^{52}}{2^{52}} + \frac{2^{51} b_{51}}{2^{52}} + \frac{2^{50} b_{50}}{2^{52}} + \cdots + \frac{2^{1} b_{1}}{2^{52}} + \frac{2^0 b_{0}}{2^{52}} \right) \times 2^{e-{1023}} \]
\[ \left(\frac{2^{52}+ 2^{51} b_{51}+ 2^{50} b_{50}+ \cdots + 2^{1} b_{1} + 2^0 b_{0}}{2^{52}} \right) \times 2^{e-{1023}} \]

So we can treat the numerator as a binary number with 53 bits (the 52 fraction bits plus 1); let’s call that number \(J\).

\[ \frac{J}{2^{52}} \times 2^{e-{1023}} \]

We can then combine the powers of 2, and we’ll call that combined exponent \(N\):

\[ \frac{J}{2^{52-e+1023}} = \frac{ J }{2^N}\]

So, in order to return to the .1 that we want to represent, we can write our target number as a fraction and then approximate it in the form above.

\[\frac{1}{10} \approx \frac{ J }{2^N}\]

Let’s isolate \(J\) by multiplying both sides by \(2^N\):

\[J \approx \frac{2^N }{10}\]

We want to use exactly 53 bits to represent \(J\) because it needs to be represented by the part of the float standard that is \(1+\sum_{i=1}^{52} b_{52-i}2^{-i}\): 52 bits that can be either 0 or 1, plus a leftmost bit that is a fixed 1 from the \(1+\) in the definition.

We can use this to find what \(N\) needs to be first.

We want \(J \approx \frac{2^N }{10}\) to be greater than or equal to \(2^{52}\) and less than \(2^{53}\), or equivalently:

\[ 2^{52} \leq \frac{2^N }{10} < 2^{53} \]

Since \(10 = 8+2 = 2^3 +2^1\), we know \(2^3<10<2^4\), so we can bound the fraction above and below:

\[ \frac{2^N }{16} = \frac{2^N }{2^4} < \frac{2^N }{10} < \frac{2^N }{2^3} = \frac{2^N }{8} \]

and then we can re-write the middle comparison like this:

\[ 2^{N-4} < \frac{2^N }{10} < 2^{N-3} \]

so if we want:

\[ 2^{52} \leq \frac{2^N }{10} < 2^{53} \]

then the best \(N\) is 56, because that makes the bounds \(2^{N-4} = 2^{52}\) and \(2^{N-3} = 2^{53}\) line up with the range we want.

We can also check that this does what we want.

2**52 <= 2**56 //10 < 2**53
True
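The same \(N\) can also be found by a brute-force search instead of the inequality argument (a sketch for checking our reasoning, not how anything is done in practice):

```python
# find the smallest N where the quotient 2**N // 10 reaches 53 bits
N = 0
while 2**N // 10 < 2**52:
    N += 1
N   # 56
```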

Now we can get the number we will use for \(J\). So far, mathematically, we would have:

\[J \approx \frac{2^{56}} {10}\]

but we need to find an integer value, so we divide and separate the quotient and remainder.

q,r = divmod(2**56,10)

Next we check the remainder to decide if we should round up by 1 or not.

r
6

\( 6 > 5 = \frac{10}{2}\), so we round up: since the remainder is more than half of the divisor, the true quotient is closer to the next number up, so we increment.

J = q+1
J
7205759403792794

Then we can check the length in bits of that number.

Python can print it for us:

bin(J)
'0b11001100110011001100110011001100110011001100110011010'

Then we note that the first 2 characters are not part of the binary representation, but an indicator that it is a binary string, so we subtract 2:

len(bin(J))-2 
53

Which is 53, as expected!

To get the number actually stored in the fraction field, we drop the leftmost bit. Remember, we found \(J\) including the extra 1 that sits outside the sum of the fraction bits.

significand = J-2**52
significand
2702159776422298

or in binary:

bin(significand)
'0b1001100110011001100110011001100110011001100110011010'
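As a quick check, Python integers have a built-in bit_length method, so we can confirm the bit counts without counting string characters (using the values computed above):

```python
J = 7205759403792794        # the 53-bit value computed above
significand = J - 2**52     # drop the leftmost (implicit) bit

print(J.bit_length())             # 53, with the implicit leading 1
print(significand.bit_length())   # 52, after dropping it
```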

Now, we go back to the \(e\). We need \(52-e+1023 = 56\), so we solve to find \(e = 52+1023-56\).

e = 52-56+1023
e
1019

So our representation of .1 would be:

exp_bin = bin(e)[2:].zfill(11)
sig_bin = bin(significand)[2:].zfill(52)
'0' + exp_bin + sig_bin
'0011111110111001100110011001100110011001100110011001100110011010'

Dropping the first two characters of each bin result removes the 0b syntax indicator, which is not part of the actual binary; zfill then pads each field to its full width, 11 exponent bits and 52 fraction bits.

see on float exposed

Python doesn’t provide a binary representation of floats directly, but it does provide a hex representation:

float.hex(.1)
'0x1.999999999999ap-4'

If we take the binary above, it matches the part before the p: 0b1 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010

After the p is the exponent: \(-4 = 1019-1023\), which matches the approximation we found.

We can also check like this:

(J)/2**56 == .1
True

This confirms that the approximation we found is the same as the float representation of .1.
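As one more sanity check (not part of the derivation itself), the struct module lets us compare our hand-built bit string to the bytes Python actually stores for .1:

```python
import struct

# reinterpret 0.1's 8 bytes as a 64-bit integer, then format all 64 bits
actual = struct.unpack('<Q', struct.pack('<d', 0.1))[0]
format(actual, '064b')
# '0011111110111001100110011001100110011001100110011001100110011010'
```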

21.6. Prepare for Next Class#

none

21.7. Review today’s class#

  1. Write a C program to compare values as doubles and as floats (single precision/32 bit) to see that this comparison issue is related to the IEEE standard and is not language specific. Make notes and comparisons about its behavior and include the program in a code cell in cdouble.md

21.8. More Practice#

  1. Write a C program to compare values as doubles and as floats (single precision/32 bit) to see that this comparison issue is related to the IEEE standard and is not language specific. Make notes and comparisons about its behavior and include the program in a code cell in cdouble.md

  2. In floatexpt.md design an experiment using the fractions.Fraction class in Python that helps illustrate why .1*3 == .3 evaluates to False but .1*4 == .4 evaluates to True. Your file must include both code and interpretation of the results.

21.9. Experience Report Evidence#

21.10. Questions After Today’s Class#

21.10.1. Why can’t systems just interpret decimal numbers like humans can?#

Because the numbers have to be represented mechanically or electrically for a machine to compute with them.

21.10.2. My question is how bad can the approximations get when calculating floats? How much precision gets lost when dealing with fractional numbers?#

The quality of the approximation varies depending on the particular number.

At the bottom of float exposed you can see how close the next value is.

For 1 (the link above) the next number is a little more than 2.22e-16 away.

For 1532.625 the next value is 1.22e-4 away.

For 4503599627370497.0 to 9007199254740991.0, the next value is 1.0 away; that is, in that range we can only represent integers, nothing in between.

21.10.3. Did the people that came from an academia background typically design and implement these low-level systems or people in industry?#

A lot of this work was a deep collaboration between academia and industry. Some of it occurred in universities; some of it was at places like Bell Labs, a research institute inside the for-profit Bell Telephone Company.

21.10.4. How can I practice what we learned today for better understanding?#

This is a hard topic and it will come back up in CSC411.

Also, I recommend reading these notes carefully, and trying the float.exposed site.

21.10.5. Does this process of going through all the bits increase in time with more bytes? Or is there a set number of bits?#

This standard is a fixed size of 64 bits, but there are others with different numbers of bits.

21.10.6. How different was the initial standard from the one that exists today?#

The initial standard defined fewer basic formats; more were introduced in 2008. Official IEEE standards are not free, so they are somewhat harder to read easily.

21.10.7. How are we able to assign 0.1 to a variable despite that?#

We do not actually get .1 when doing math; we get a value close to, but not exactly, .1, unless your programming environment has a different base type.