Welcome to Data Compression!

Ever wondered how you can send a high-quality photo to a friend in seconds, or how thousands of songs fit onto a tiny smartphone? The secret is data compression. In this chapter, we will explore how computers "shrink" data to save space and time. Don't worry if this seems a bit technical at first—we’ll break it down step-by-step!

What is Data Compression?

Data compression is the process of reducing the size of a file. It involves using an algorithm to pack the same information into fewer bits.

Why do we need it?

We compress data for two main reasons:

1. Storage Space: Smaller files take up less room on your hard drive or phone memory.
2. Transmission Speed: Smaller files travel faster over the internet. This means websites load quicker and videos stream without buffering as much.

Analogy: Think of a sleeping bag. When it's rolled out, it takes up a lot of room. When you stuff it into its small compression sack, it’s the same sleeping bag, but it's much easier to carry and store!

Quick Review: Why Compress?

• Uses less storage space.
Faster to download or send over a network.
• Allows more data to be stored on a device.

Run Length Encoding (RLE)

Run Length Encoding (RLE) is a simple form of compression that works by looking for "runs" of identical data. Instead of storing every single piece of data, it stores the data value and how many times it repeats.

How it works

Imagine a row of pixels in a simple black-and-white image where 0 is white and 1 is black:
Original data: 0000011100000011

Without compression, this takes 16 bits. With RLE, we group them into frequency/data pairs:

• Five 0s
• Three 1s
• Six 0s
• Two 1s

The compressed version becomes: 5 0 3 1 6 0 2 1

Common Mistake to Avoid: Always remember the order! It is usually the count (frequency) followed by the value. Don't mix them up, or the computer won't be able to reconstruct the original file!

Key Takeaway

RLE is most effective when data has lots of repeating patterns, such as simple icons or images with large areas of the same color.

Huffman Coding

Huffman coding is a more clever way to compress data, especially text. It is based on frequency—how often a character appears.

The Core Concept

In standard ASCII, every character uses 7 or 8 bits, regardless of how often it's used. In Huffman coding, we give short binary codes to characters that appear often (like 'e' or 'space') and longer codes to characters that appear rarely (like 'z' or 'q').

Interpreting a Huffman Tree

To find the binary code for a character, you follow the path from the top (the root) of a Huffman Tree down to the character.

• Every time you go Left, you add a 0.
• Every time you go Right, you add a 1.

Example: If you go Left, then Right, then Left to reach the letter 'A', its Huffman code is 010.

Did you know? Because common letters have very short codes (like just '0' or '10'), the total number of bits used for a whole sentence is much lower than using 8 bits for every single letter!

Calculating Compression Savings

In your exam, you might be asked to compare uncompressed data (ASCII) with Huffman compressed data.

Step-by-Step: Comparing Bits

1. Uncompressed: Count the number of characters in the string and multiply by 8 (assuming 8-bit ASCII).
Formula: \( \text{Total Bits} = \text{Number of characters} \times 8 \)

2. Huffman Compressed: Use the provided tree to find the code for each letter. Count the number of bits in each code and add them all up.

3. The Saving: Subtract the compressed total from the uncompressed total.

Example:
To store the word "BEE" using 8-bit ASCII:
\( 3 \text{ characters} \times 8 \text{ bits} = 24 \text{ bits} \).
Using Huffman, if 'B' is 11 and 'E' is 0:
B (11) + E (0) + E (0) = 4 bits.
Total saved: 20 bits!

Quick Review Box

ASCII: Fixed length (usually 8 bits per character).
Huffman: Variable length (common characters = short codes; rare characters = long codes).
RLE: Frequency/data pairs (e.g., 4 'A's becomes 4 A).

Summary of Key Terms

Data Compression: Reducing file size to save space or transmission time.
Run Length Encoding (RLE): A method that replaces repeating data with a count and a value.
Huffman Coding: A method that uses a tree to assign shorter binary codes to more frequent characters.
Frequency: How often a specific piece of data or character occurs.
Bit: The smallest unit of data in a computer (a 0 or a 1).

Don't worry if the Huffman trees look a bit like a spider web at first! Just remember: start at the top, and follow the branches to the letter you need. You've got this!