Review: Representing Text and Integers


This is a review of prerequisite material for CS 351 on representing data and manipulating integers.

Introduction

The fundamental unit of data storage is a switch, and information is encoded in these switches by strategically turning them on or off.

Let's say you have a row of 8 switches in the following positions:

OFF ON OFF ON ON ON OFF ON

For visual convenience, let's use 0 to represent OFF and 1 to represent ON:

01011101

At this point, we usually refer to each switch as a bit (binary digit). Thus, this is an 8-bit pattern.

So what does our sequence of 0's and 1's mean?

Trick question! It doesn't have any inherent meaning. We have to agree on what it represents and a system for how to interpret it.

You can use bit patterns to represent anything, but two very important possibilities are text and integers. Most of this document is about integers, but we'll start by briefly looking at text.

Representing Text

If we decide to use bit patterns to represent text, we should agree on some standard ways to encode characters of text into sequences of bits. One of the most common such systems is ASCII.

ASCII is a scheme for mapping 8-bit patterns to characters of text. (Strictly speaking, ASCII defines 7-bit codes, but each character is normally stored in an 8-bit byte.) It's just a lookup table. This table tells us that, if we decide to interpret our bit pattern as a character of ASCII text, we should interpret it as the character

']' (a closing square bracket).
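If you'd like to see this for yourself, here's a minimal C sketch; it assumes your system uses an ASCII-compatible character encoding, which is nearly universal today:

    #include <stdio.h>

    int main(void) {
        char c = 0x5D;        /* 0x5D is hexadecimal for the bit pattern 01011101 */
        printf("%c\n", c);    /* prints ] on an ASCII-compatible system */
        return 0;
    }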

The more bytes (8-bit chunks) of storage you have, the more text you can represent. If you have 4 GB of memory, and you want to fill it entirely with text for some reason, you can store approximately 4 billion (really 4 × 2^30) letters. (Insert your Game of Thrones jokes here.)

If you look at the ASCII table, you won't find characters from non-Western alphabets. For example, there are no Korean, Thai, Arabic, or Hindi characters.

This omission isn't solely due to cultural insensitivity. The fundamental problem is that there are only 256 possible 8-bit patterns, which isn't nearly enough to allow us to encode all of the characters from all of these alphabets.

Therefore, schemes for representing these characters will need to use more than 8 bits for a single character. Unicode is a family of encodings that go beyond the ASCII character set.

Converting between different text encodings is the bane of many a real-world programmer's existence. It's outside the scope of this hardware class, but I'd recommend this Joel Spolsky article to anyone considering a software engineering career.

The rest of this document is about how to encode integers using bit patterns and how to operate on these encodings.

Overview: Representing Integers

We can also assign bit patterns to represent integers. This requires a system for deciding two things: how many bits to use for each integer (its size), and how to map the possible bit patterns to integer values (its format).

The arithmetic-logic units (ALUs) built into processors will offer hardware support for the most common integer sizes and formats, and that's what we'll learn about in this class.

Programming languages, in turn, support various integer formats. For example, C and C++ support different sizes of integers via the char, short, int, long, and long long data types. They also support two different ways of interpreting those bit patterns: signed (the default) or unsigned. There's a note about C integer types at the end of this document.
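As a rough sketch, here are the corresponding C declarations; the comments give the minimum sizes the C standard guarantees, not fixed values:

    char      c;    /* at least 8 bits  */
    short     s;    /* at least 16 bits */
    int       i;    /* at least 16 bits (32 on most modern systems) */
    long      l;    /* at least 32 bits */
    long long ll;   /* at least 64 bits */

    unsigned int u; /* same size as int, but only non-negative values */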

If your programming language uses an integer size or format that isn't built into your hardware, that's no problem. When your source code is translated into machine code, your desired operations will be translated into a combination of the low-level operations supported by the processor. This means that your code will run more slowly than it would if you used the native integer sizes and types, but you probably won't notice unless you're doing serious computation.

If you want to do something even weirder that's not supplied by your programming language's integer data types, that's also no problem. You'll have to write your own integer class (in an object-oriented language) or integer functions by using your programming language's data types as building blocks. Then, that in turn will get translated to machine code.

That said, just as with text, a few standard sizes and formats dominate in real-world hardware and software.

Representing Integers: Size

In modern hardware, the registers and ALUs support either 32- or 64-bit integers.

In CS 351, our LEGv8 ISA uses 64-bit integers. In other words, it grabs 8-byte chunks of memory and treats them as a unit.

Remember that each byte has its own address. So when you go to fetch an 8-byte integer from memory, you might load the bytes at addresses 0x800, 0x801, 0x802, 0x803, 0x804, 0x805, 0x806, and 0x807 into a 64-bit register. You'll then treat these 8 bytes as a single, unified value.

As a side note, there's some dispute about how these 8 bytes should get loaded into the 64-bit register. In other words, should the byte at 0x800 be the most significant (leftmost) or least significant (rightmost) 8 bits of the 64-bit integer? This is a hotly contested issue called endianness; Wikipedia has a great diagram of how the two different options work. LEGv8, like ARM, can be operated in either big-endian or little-endian mode; to make the choice, the OS has to write a value to a special hardware control register.

As a SIDE side note, this can be a source of really exciting bugs when copying data across machines that have different endianness; see the "Endianness in networking" section of the Wikipedia article.
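Here's a small C sketch (not LEGv8-specific) you could run to see which byte order your own machine uses; it stores a 64-bit value and then inspects the first byte as it sits in memory:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        uint64_t value = 0x0102030405060708;   /* 8 distinct bytes */
        uint8_t bytes[8];
        memcpy(bytes, &value, sizeof(value));  /* copy the raw bytes as stored in memory */

        /* Little-endian machines store the least significant byte (08) first;
           big-endian machines store the most significant byte (01) first. */
        printf("first byte in memory: %02X\n", (unsigned)bytes[0]);
        printf("this machine is %s-endian\n", bytes[0] == 0x08 ? "little" : "big");
        return 0;
    }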

Representing Integers: Format

Once we've decided on how many bytes we'll use to represent integers, we still have to decide how to map the different possible bit patterns to integer values.

Some sacrifice is clearly involved: it's just not possible to map the infinite set of integers onto a very finite set of possible n-bit patterns.

How many integers can we map onto an n-bit pattern? Well, let's see: 1 bit gives 2 possible patterns (0 and 1), 2 bits give 4 (00, 01, 10, 11), 3 bits give 8 (000 through 111), and so on.

The pattern is that we have 2^n possible bit patterns for a given value of n. For 32-bit integers, that's 2^32 (about 4 billion) possible values to choose from.

If you know your binary (and you should!), you may feel like these bit patterns are begging to be interpreted as binary numbers. For the 3-bit patterns above, for example, it seems very natural to let 000 represent 0, 001 represent 1, 010 represent 2, etc.

You're not alone! This is a very common way to represent integers. This is called an unsigned integer representation. It lets you represent the integers from 0 to 2^n - 1 using an n-bit pattern. Verify this formula with the 2-bit and 3-bit examples above before continuing.
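Here's a quick C illustration of that range formula, using the fixed-width types from <stdint.h>:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint8_t  all_ones8  = 0xFF;     /* 11111111: the largest 8-bit unsigned value */
        uint16_t all_ones16 = 0xFFFF;   /* sixteen 1s: the largest 16-bit unsigned value */

        printf("%u\n", (unsigned)all_ones8);    /* 255   == 2^8  - 1 */
        printf("%u\n", (unsigned)all_ones16);   /* 65535 == 2^16 - 1 */
        return 0;
    }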

There is a drawback to this system, though: you can't represent negative numbers. And if you think about the programs you have written, you've probably used the value -1 much more than the value 4,000,000,000. So a system that lets you represent 4 billion but not -1 arguably has misplaced priorities.

The point of a signed representation is to allow you to represent both positive and negative values. In a signed format, you can represent values as small as -2^(n-1) and as large as 2^(n-1) - 1. So you have 2^(n-1) possible negative integers, 0, and 2^(n-1) - 1 possible positive integers. That adds up to our familiar total of 2^n possible bit patterns.
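The C library exposes these limits as constants; a small sketch:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* 8-bit signed range: -2^7 .. 2^7 - 1 */
        printf("%d .. %d\n", INT8_MIN, INT8_MAX);     /* -128 .. 127 */

        /* 16-bit signed range: -2^15 .. 2^15 - 1 */
        printf("%d .. %d\n", INT16_MIN, INT16_MAX);   /* -32768 .. 32767 */
        return 0;
    }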

In CS 252, you probably practiced converting numbers by hand between the signed and unsigned formats, so I won't spend time on the mechanics. Here are the high-level things you need to remember:

The sign bit

If you interpret an n-bit binary number as a signed integer, you can tell at a glance whether or not it's negative by looking at its most significant/leftmost bit (called the sign bit). If this bit is a 1, your number is negative. If this bit is a 0, your number is either positive or zero.

For example, you can tell that the 8-bit signed integer 10001100 is negative and the 8-bit signed integer 00100111 is non-negative by looking at only their leftmost bits.
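In C, you could check the sign bit of an 8-bit pattern with a mask; this is just a sketch, and is_negative8 is an illustrative name:

    #include <stdio.h>
    #include <stdint.h>

    /* An 8-bit pattern represents a negative signed value exactly when its leftmost bit is 1. */
    int is_negative8(uint8_t bits) {
        return (bits & 0x80) != 0;    /* 0x80 == 10000000 masks off everything but the sign bit */
    }

    int main(void) {
        printf("%d\n", is_negative8(0x8C));   /* 10001100 -> 1 (negative) */
        printf("%d\n", is_negative8(0x27));   /* 00100111 -> 0 (non-negative) */
        return 0;
    }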

Negating numbers

You can negate a signed binary number (multiply it by -1) by flipping all the bits and then mathematically adding 1 to the result. Verify that the 8-bit signed value 11111111 translates to -1 by applying this procedure and making sure you get 00000001 as the result. Then apply the same procedure and verify that you get -1 (11111111) again.
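Here's a small C sketch of that procedure applied to the example above; it works on the raw 8-bit pattern with unsigned arithmetic so the bit manipulation is explicit:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint8_t x = 0xFF;                      /* 11111111, i.e. -1 as an 8-bit signed value */

        uint8_t negated = (uint8_t)(~x + 1);   /* flip every bit, then add 1 */
        printf("%02X\n", (unsigned)negated);   /* prints 01: negating -1 gives 1 */

        negated = (uint8_t)(~negated + 1);     /* apply the same procedure again */
        printf("%02X\n", (unsigned)negated);   /* prints FF: negating 1 gives -1 back */
        return 0;
    }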

Magnitude of signed numbers

With a positive n-bit number, you know that the more leading 0s you have, the closer that number is to 0 (the smaller it is). So at a glance, you know that 00000101 is smaller than 01110001.

With negative n-bit numbers, there is a corresponding rule: the more leading 1s you have, the closer the number is to 0 (the larger it is).

To test the extremes of this rule, note that 11111111 (-1) is as close to 0 as a negative integer can get, and it consists of all 1s. By contrast, 10000000 has no leading 1s other than the sign bit, and it is the most negative number you can represent in 8 bits: -128.

This rule lets you determine at a glance that 10001101 is farther from zero (more negative) than 11110010, since it has fewer leading 1s.

Signed and Unsigned Arithmetic

The flip-and-add-1 method of negating a number is called two's complement. You can imagine other schemes for representing negative numbers, so why did this one become the universal standard?

One compelling reason is that adding signed numbers is just like adding unsigned numbers, in terms of the mechanical bit manipulations. In other words, whether 10001010 and 00011001 are supposed to represent signed or unsigned 8-bit values, their sum is the same: 10100011. How we interpret this bit pattern, of course, depends on whether the initial values were supposed to be signed or unsigned.
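You can confirm this in C by doing the addition once on the raw bytes and then interpreting the result both ways; a sketch (the cast to int8_t relies on the two's-complement behavior of real-world machines):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint8_t a = 0x8A, b = 0x19;          /* 10001010 and 00011001 */
        uint8_t sum = (uint8_t)(a + b);      /* one bit-level addition */

        printf("bit pattern: %02X\n", (unsigned)sum);   /* A3 == 10100011 */
        printf("as unsigned: %u\n", (unsigned)sum);     /* 163 */
        printf("as signed:   %d\n", (int)(int8_t)sum);  /* -93 */
        return 0;
    }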

In fact, the only difference between signed and unsigned arithmetic is how we detect overflow. It is a fact of life that, no matter what system of representation you use, the sum of two n-bit integers might not fit in n bits. So how do you know when the stored n-bit sum does not accurately reflect the true arithmetic result?

Unsigned Overflow

Detecting unsigned overflow is easy, and it resembles the arithmetic operations you're used to: if there is a 1 carried over from adding the leftmost bits, then you have overflow, since there's nowhere to store this 1.
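In C you can't see the carry bit directly, but you can detect the same condition by checking whether the wrapped sum came out smaller than one of the operands; a sketch (unsigned_add_overflows8 is just an illustrative name):

    #include <stdint.h>

    /* Returns 1 if a + b does not fit in 8 bits, i.e. a 1 was carried out of the leftmost bit. */
    int unsigned_add_overflows8(uint8_t a, uint8_t b) {
        uint8_t sum = (uint8_t)(a + b);   /* the stored result wraps around modulo 256 */
        return sum < a;                   /* it wrapped around exactly when there was a carry out */
    }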

Signed Overflow

Signed overflow can't possibly be that simple. For example, 11111111 + 11111111 = 11111110 definitely results in a 1 carried over from the leftmost bit. However, if we decode this operation, it's -1 + -1 = -2, which is the answer we'd expect. What's going on?

Remember that with negative integers, leading 1s are sort of analogous to leading 0s with positive integers. We don't panic when a 0 is carried over from the leftmost bit when we're adding positive values, and we shouldn't panic when a 1 is carried over when adding negative values.

OK, so when should we panic? It turns out that a reliable indicator of signed overflow can be found by looking at the leftmost (sign) bit of the operands and the result, rather than the carry bit. Here are the rules:

- If you add two numbers whose sign bits are 0 (non-negative) and the result has a sign bit of 1, you have overflow.
- If you add two numbers whose sign bits are 1 (negative) and the result has a sign bit of 0, you have overflow.
- If the two operands have different sign bits, overflow cannot happen.

In other words, you have signed overflow if the sign bits of the operands agree with each other but don't agree with the sign bit of the result.
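Here's a sketch of that rule in C (signed_add_overflows8 is just an illustrative name; the arithmetic is done on unsigned bytes so we can inspect the wrapped 8-bit result directly):

    #include <stdint.h>

    /* Returns 1 if a + b overflows as an 8-bit signed addition: the operands' sign bits
       agree with each other but disagree with the sign bit of the 8-bit result. */
    int signed_add_overflows8(int8_t a, int8_t b) {
        uint8_t ua = (uint8_t)a, ub = (uint8_t)b;                   /* work on the raw bytes */
        uint8_t sum = (uint8_t)(ua + ub);                           /* wrapped 8-bit result */
        uint8_t sa = ua & 0x80, sb = ub & 0x80, ss = sum & 0x80;    /* the three sign bits */
        return (sa == sb) && (sa != ss);
    }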

Bonus: C/C++ Integer Types

Fun fact: the basic C/C++ integer types can vary in size from one processor and compiler to another! The C language standard makes guarantees about the minimum and relative sizes of some of the data types, but that's it.

This Wikipedia table and the paragraphs right below it have concise summaries of the rules.

The C/C++ expression sizeof(int) will tell you the size (in bytes) of an int on your system.
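For example, this small program reports the sizes your particular compiler and platform use:

    #include <stdio.h>

    int main(void) {
        /* These sizes are implementation-defined; the output reports what YOUR system uses. */
        printf("char:      %zu bytes\n", sizeof(char));
        printf("short:     %zu bytes\n", sizeof(short));
        printf("int:       %zu bytes\n", sizeof(int));
        printf("long:      %zu bytes\n", sizeof(long));
        printf("long long: %zu bytes\n", sizeof(long long));
        return 0;
    }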

In practice, the size of a char is 8 bits on any implementation that's not deliberately weird. For the moment, 32 bits is by far the most common size of an int, but that is the part most likely to change over time.