CHAPTER 01

CHAPTER 01.05: FLOATING POINT REPRESENTATION: IEEE-754 Single Precision Representation: Part 1 of 2

In this segment, I'm going to talk briefly about the IEEE 754 standard. We're only going to talk about how we represent a single-precision number. The IEEE 754 standard is used to standardize the representation of numbers in various computers, as well as the arithmetic operations of multiplication, addition, division, and subtraction on those computers. So the point of the standard is that different computers, whether you're using a VAX, a Cray, a PC, or a Macintosh, will all represent numbers in a similar fashion, and the same goes for the arithmetic operations we do on those computers.

Now, there's a nice paper, and you can find the link in the PowerPoint presentation for this segment. If you go to the keyword floating point on the numerical methods website, you will see a link to the PowerPoint, but you can also get it right from here if you just want to punch the URL directly into your browser. This is a very good paper on what every computer scientist should know about floating point arithmetic, and in my opinion, even if you are not a computer scientist, you can gain quite a bit of knowledge about floating point arithmetic from it. It's a long paper, about 150 to 175 pages. I don't expect you to go through all 175 pages, but what I would like is for you to skim through it, look at some of the initial details and some of the summary details, and I think you will learn quite a bit about how floating point arithmetic works, and how the IEEE 754 standard works for floating point operations.

And, again, we are limiting our discussion to the single-precision number. We're not even talking about arithmetic operations, just the single-precision format, so that you get a good feel for it. We already talked about floating point representation in a hypothetical ten-bit word; here, we are going to use the thirty-two bits which are actually used for single precision in real life, and see how that is different.

Now, you have thirty-two bits for your single-precision number. You use one bit for the sign of the number, then eight bits for the biased exponent, which we'll talk about in a little bit, and then twenty-three bits for the mantissa. What does that mean? The -1 raised to the power s, where s is the sign bit and can be either 0 or 1, dictates whether it's a negative number or a positive number: s equal to 0 gives a positive number, and s equal to 1 gives a negative number. The mantissa is the twenty-three bits which go right here, those 0s and 1s go right here, and what you're seeing is that before the radix point there is a 1. What does that mean? We know that in scientific notation we use a nonzero digit before the decimal point. For a base-10 number, that nonzero digit has to be 1, 2, 3, 4, 5, 6, 7, 8, or 9, depending on what the number is, but in binary the only nonzero digit we have is 1. So the 1 is automatically assumed to be before the radix point, and that's why you don't put that 1 in the mantissa part of the number at all: because it is already assumed, there's no need to store it, and you are saving some space and also increasing the precision of your numbers by doing that. Now, so far as the exponent is concerned, it's a biased exponent.
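To make the 1 + 8 + 23 bit layout concrete, here is a minimal Python sketch that pulls the three fields out of a number stored in single precision. The function name `split_single` and the example value -6.5 are just illustrations chosen here, not something from the slides, and the exponent field is shown exactly as stored, that is, still biased.

```python
import struct

def split_single(x):
    """Return (sign, stored_exponent, mantissa_bits) for x as a 32-bit float."""
    # Pack x as a big-endian single-precision number, then reread the
    # same four bytes as one 32-bit unsigned integer.
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31                       # 1 bit
    stored_exponent = (word >> 23) & 0xFF   # 8 bits, still biased
    mantissa_bits = word & 0x7FFFFF         # 23 bits, leading 1 is not stored
    return sign, stored_exponent, mantissa_bits

# -6.5 in binary is -1.101 x 2^2, so only the "101" after the radix point
# should show up at the front of the mantissa field.
sign, stored_exponent, mantissa_bits = split_single(-6.5)
print(sign, format(stored_exponent, "08b"), format(mantissa_bits, "023b"))
# prints: 1 10000001 10100000000000000000000
```

Notice that the 1 before the radix point appears nowhere in the twenty-three mantissa bits; only the digits after the radix point are stored.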
What we mean by biased exponent is that you have to subtract 127 from the exponent which is stored here. What you find in the biased exponent is that no bit is used for the sign of the exponent, and yet you are going to need negative exponents: let's suppose you have 2 to the power -50. How are we going to represent that in an exponent which has no bit for its sign? The way it is done is that you bias it: whatever number is stored in the exponent field, you subtract 127 from it to get the actual exponent which is being represented. Going the other way, if you have the actual exponent as something, you add 127 to it, and then you store the result right here in the eight bits of the biased exponent. So for 2 to the power -50, you would store -50 plus 127, which is 77.
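The add-127 and subtract-127 steps can be sketched in a few lines of Python. The helper names `to_stored`, `to_actual`, and `value_from_fields` are hypothetical, made up for this illustration, and the sketch assumes an ordinary normalized number: it ignores the exponent patterns of all 0s and all 1s, which the standard reserves for special cases.

```python
BIAS = 127  # bias for the 8-bit exponent field in single precision

def to_stored(actual_exponent):
    """Exponent as written into the 8-bit field: add the bias."""
    return actual_exponent + BIAS

def to_actual(stored_exponent):
    """Exponent the stored field stands for: subtract the bias."""
    return stored_exponent - BIAS

def value_from_fields(sign, stored_exponent, mantissa_bits):
    """Rebuild the number from its three fields (normalized numbers only)."""
    # Put back the assumed 1 before the radix point; the 23 stored bits
    # are the fraction after it.
    significand = 1 + mantissa_bits / 2**23
    return (-1) ** sign * significand * 2.0 ** to_actual(stored_exponent)

print(to_stored(-50), format(to_stored(-50), "08b"))  # prints: 77 01001101
print(to_actual(77))                                  # prints: -50
print(value_from_fields(1, 129, 0x500000))            # prints: -6.5
```

So an exponent of -50 never needs a sign bit: it is stored as the nonnegative number 77, and subtracting 127 recovers -50.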