17-18 Text Compression

The text I chose to compress:
A_tutor_who_tooted_the_flute_Tried_to_tutor_two_tooters_to_toot_Said_the_two_to_their_tutor,_“Is_it_harder_to_toot_Or_to_tutor_two_tooters_to_toot?”

My compression:
A_☄_who_☀ed☇flute_Tried☆★Said☇☃_☂_their_☄,_“Is_it_harder★Or☆_☂_☀?”

The dictionary:
☀ toot
☂ to
☃ two
☄ tu☂r
★ _☂_☀_
☆ _☂_☄_☃_☀ers
☇ _the_

Compression Numbers:
Compressed text size: 66 bytes
Dictionary size: 41 bytes
Total: 107 bytes
Original text size: 148 bytes
Compression: 27.7%
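
One way to double-check these numbers is a short Python sketch that expands the compressed text and recomputes the sizes. The function names `expand` and `compression_stats` are made up for illustration. The ★ and ☇ entries are inferred from decoding the compressed text against the original. The sketch also assumes each character (symbols included) counts as one byte, which is what the reported sizes appear to do.

```python
ORIGINAL = ("A_tutor_who_tooted_the_flute_Tried_to_tutor_two_tooters_to_toot_"
            "Said_the_two_to_their_tutor,_“Is_it_harder_to_toot_"
            "Or_to_tutor_two_tooters_to_toot?”")
COMPRESSED = "A_☄_who_☀ed☇flute_Tried☆★Said☇☃_☂_their_☄,_“Is_it_harder★Or☆_☂_☀?”"

# Dictionary entries may themselves contain symbols (nested entries).
DICTIONARY = {
    "☀": "toot",
    "☂": "to",
    "☃": "two",
    "☄": "tu☂r",
    "★": "_☂_☀_",          # inferred: _to_toot_
    "☆": "_☂_☄_☃_☀ers",    # _to_tutor_two_tooters
    "☇": "_the_",          # inferred
}


def expand(text, dictionary):
    """Replace symbols until none remain (assumes no circular entries)."""
    while any(symbol in text for symbol in dictionary):
        for symbol, value in dictionary.items():
            text = text.replace(symbol, value)
    return text


def compression_stats(original, compressed, dictionary):
    """Return (compressed size, dictionary size, total, percent saved).

    Assumption: every character counts as one byte, and each dictionary
    entry costs its symbol plus its replacement text.
    """
    dict_size = sum(len(symbol) + len(value) for symbol, value in dictionary.items())
    total = len(compressed) + dict_size
    percent = round(100 * (len(original) - total) / len(original), 1)
    return len(compressed), dict_size, total, percent


print(expand(COMPRESSED, DICTIONARY) == ORIGINAL)              # True
print(compression_stats(ORIGINAL, COMPRESSED, DICTIONARY))     # (66, 41, 107, 27.7)
```

Under those assumptions the sketch reproduces the figures above: 66 + 41 = 107 bytes against an original of 148, so 41 bytes (27.7%) are saved.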

Process:

  1. Looked for parts of words that repeated (e.g. “th” in “the”, “that”, “those”, etc.)
  2. Looked for phrases that repeated (using symbols already in the dictionary, if possible)
  3. Looked for words that repeated (using symbols already in the dictionary, if possible)
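
The pattern hunt in these steps can also be roughed out in code. The sketch below is only an illustration, not part of the activity: `repeated_substrings` is a hypothetical helper that counts every substring within a length range and keeps those that occur more than once. It counts overlapping occurrences, so treat the counts as an upper bound on how many replacements are actually possible.

```python
from collections import Counter


def repeated_substrings(text, min_len=2, max_len=12):
    """Count substrings of length min_len..max_len that occur more than once.

    Overlapping occurrences are counted, so the result is a list of
    candidates to inspect rather than an exact replacement count.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {sub: c for sub, c in counts.items() if c > 1}


sample = "Tried_to_tutor_two_tooters_to_toot"
candidates = repeated_substrings(sample)
# Show the longest repeated candidates first.
for sub in sorted(candidates, key=len, reverse=True)[:5]:
    print(repr(sub), candidates[sub])
```

Students doing this by hand are effectively running the same search by eye; the code just makes the candidate list explicit.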

Challenges students may encounter include:
Deciding when a substitution is worth making, since some substitutions lower the compression rate once the dictionary entry is counted.
Finding the patterns, and realizing that symbols already in the dictionary can be used as part of a new word or phrase they are compressing.

I will encourage students to use trial and error, and to keep checking the compression numbers to see what is worth compressing. I will also remind students to consider the dictionary size, since it is part of the file size!

My students had some fun with this one, each of them trying to get better compression than the others. I have two sections, and the inter-section rivalry was fun. Of course, keeping it from getting out of hand is always an issue, but for a single event it was more fun and motivation than anything.

My heuristic:

  1. I looked for patterns, and compressed phrases that repeated. This might include patterns from items I added to the dictionary.
    My compression rate was 35.9%

There is a tipping point to the text compression. If you put every word in the dictionary, the compression rate is not as high, because each dictionary entry adds to the total size. I might ask students what characteristics made a heuristic work well, and what characteristics made a heuristic ineffective.
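
The tipping point can be put into rough numbers. The sketch below assumes a dictionary entry costs its symbol plus the text it stands for, and that each character counts as one byte; `net_savings` is a hypothetical helper name, not something from the activity.

```python
def net_savings(phrase_len, occurrences, symbol_len=1):
    """Bytes saved by replacing a phrase with a symbol, after dictionary cost.

    Each replacement saves (phrase_len - symbol_len) bytes in the text;
    the dictionary entry costs (symbol_len + phrase_len) bytes.
    A negative result means the entry makes the file bigger.
    """
    saved_in_text = occurrences * (phrase_len - symbol_len)
    dictionary_cost = symbol_len + phrase_len
    return saved_in_text - dictionary_cost


print(net_savings(4, 1))  # -2: a 4-letter word used once costs more than it saves
print(net_savings(4, 2))  # 1: used twice, it barely pays off
print(net_savings(2, 3))  # 0: a 2-letter word needs 4 uses to come out ahead
print(net_savings(2, 4))  # 1
```

Under these assumptions, short or rare words are the ones that push the rate down, which may be a useful starting point when asking students why one heuristic beat another.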