THE MEMORY OF THE CITY
Syngenesis
Huffman coding in human communication
2012-12-09 00:57:06


One of the easiest mistakes for a fledgling language designer to make is to render every part of the language elaborate and unique. There are plenty of examples of this throughout the real history of the world, particularly in the form of pictographic writing systems, but there are few, if any, alive today. Why?



BJ945 cuneiform, shot by Charles Tilford. CC BY-NC-SA.

The answer is simple: humans are incredibly, incredibly lazy. But not exactly in the way you might expect.

This phenomenon is most prominent among writing systems, but it can (and certainly does) occur with nearly every form of communication. The phonology of Greek has simplified extensively since the fifth century BC, in part due to external influences but also because of its own heavy usage. Cuneiform, over its lengthy tenure, simplified from an elaborate system of pictograms into an ideographic system (comparable to Chinese or late Egyptian), then grew extensively regularized, and was eventually simplified in how it was used, into a special stylus-based alphabet, the Ugaritic script.

There are many reasons for a method of communication to be pressured into simplicity, and ease of use is only one of them. Korean, for example, which has a very efficient phonology, is spoken very quickly, perhaps in part to make up for its regular and bulky inflections. Even apparently very stable methods of communication, like the Latin alphabet, have undergone abbreviations and complete reworkings to improve performance. (Tironian notes may come as a particular surprise to the language reformer, as they are quite a radical departure from normal writing and yet enjoyed success for many centuries.) These systems of shorthand fell by the wayside as the speed of the printing press rapidly removed the widespread need for handwriting of such performance, but even so there are examples of early typographers (1, 2) preserving some quirks.

So what is Huffman coding, and how does it pertain to writing and language? In relatively simple terms, Huffman coding is a method of lossless data compression that works by reserving the shortest patterns for the most common elements, without creating ambiguity: no element's pattern is allowed to be the beginning of another's. (In natural languages, of course, true ambiguity is context-dependent, so you may be able to get away with a little bit of it, but it should be kept in mind.)
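For the curious, here is a minimal sketch of the idea in Python. It is only an illustration: the function names `huffman_codes` and `is_prefix_free` are my own, the "elements" are simply the characters of a sample string, and the "patterns" are strings of 0s and 1s.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Map each symbol in text to a bit string; common symbols get short ones."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs a non-empty pattern.
        return {next(iter(freq)): "0"}

    # Heap entries are (weight, tiebreaker, {symbol: pattern so far}).
    # The unique tiebreaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreaker = len(heap)

    while len(heap) > 1:
        # Merge the two rarest groups, lengthening every pattern inside them.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreaker, merged))
        tiebreaker += 1

    return heap[0][2]


def is_prefix_free(codes):
    """True if no pattern is the beginning of another (no ambiguity)."""
    values = sorted(codes.values())
    # After sorting, a pattern that begins another one sits right before it.
    return all(not b.startswith(a) for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    sample = "the memory of the city"
    for symbol, code in sorted(huffman_codes(sample).items(),
                               key=lambda item: len(item[1])):
        print(repr(symbol), code)
```

Run on any sample of text, the frequent symbols come out with the shortest patterns and the rare ones with the longest, which is exactly the pressure described above.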

If you're making a writing system, for example, chances are you'll want to assign meanings to a simple vertical line ( | ) and a simple horizontal one (—), as a significant portion of natural writing systems do. These may not be purely linguistic meanings (after all, horizontal lines in the Latin alphabet denote punctuation), but they should probably be used. A writing system where all of the characters are extremely complex loops and swirls, or have half a dozen strokes each, may make sense in a very formal, decorative context, but the average human is not going to want to use such a system on a regular basis.

The prominent exceptions to this, like Chinese, get away with complex glyphs because of the extremely high branching factor (many glyphs, which all need to be distinguished) and the small number of glyphs necessary to communicate a thought. Similarly, syllabaries like Yi or Linear B are allowed some leeway because their characters represent more than one sound. Keep in mind, though, that even Japanese, whose kana are among the most widely used syllabaries in the world, assigns meanings to several single-stroke glyphs.

Ambiguity in a writing system can occur very easily, such as when a language contains both "|" and "||" as glyphs, and permits "|" to be used twice in a row. A designer must be careful to prevent such ambiguity unless it is extremely obvious from context—a problem not unknown to English speakers struggling with the confusion between "Ill" and "III". (Does the newspaper headline "New English King George lll; Makes Way to France" mean that King George is ill, or simply the third in line?)
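A quick way to test a draft glyph inventory for this kind of problem is to count how many ways a run of strokes can be read. The sketch below is a hypothetical helper of my own (`count_readings`), and it assumes glyphs can be written as plain strings of stroke characters, which is a simplification of real handwriting.

```python
def count_readings(text, glyphs):
    """Count the ways text can be split into a sequence of glyphs."""
    glyphs = {g for g in glyphs if g}  # ignore duplicates and empty glyphs
    # ways[i] holds the number of readings of the first i characters.
    ways = [0] * (len(text) + 1)
    ways[0] = 1  # the empty string has exactly one (empty) reading
    for end in range(1, len(text) + 1):
        for glyph in glyphs:
            start = end - len(glyph)
            if start >= 0 and text[start:end] == glyph:
                ways[end] += ways[start]
    return ways[len(text)]


if __name__ == "__main__":
    # An inventory with both "|" and "||" is ambiguous...
    print(count_readings("||", ["|", "||"]))
    # ...while one with "|" and "-" is not.
    print(count_readings("||", ["|", "-"]))
```

Any result above 1 means a reader has to lean on context to decide which glyphs were meant.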

Grammars, however, are rife with ambiguous stems. English's verbs are so degenerate that we have to use nominative pronouns in every single sentence to know what we're talking about; otherwise, most statements would look like imperatives. In general, a language will simplify its inventory of inflections until syntax and context are critical to keeping the meaning clear. As a result, few languages retain a completely flexible word order; this trend is prominent in Vulgar Latin and its descendants.

Vocabulary is another area in which cumbersomeness of terms quickly becomes a problem. It is easy as a language author to create a compound word that is unambiguous in meaning and grammatically sound, but would never be used by a language speaker on a regular basis because it is simply too lengthy. In English, we solve this problem with the band-aids of initialisms (or acronyms if we're lucky); many languages from farther east in Europe use the much more sensible approach of combining the first syllables of each word, such as the German "Stasi" from "Staatssicherheit" (as in "Ministerium für Staatssicherheit"). You may also witness frozen contractions and slurring, such as English's own "e'er" for "ever", where the endings of words survive, simply because there was no other word in the way to stop speakers from doing so. (Of course, this doesn't stop homophones and homonyms from spreading like wildfire, but context generally overcomes any problems they may cause.)

Parallels exist in computing, as well, and can be credited for the appeal of RISC processors and the C programming language.

A constructed language does not, by any stretch of the imagination, have to be completely 'boiled down' to be good—but the degree to which it has been boiled down should correspond to the culture in which it is forming. A slow, plodding writing system and intricate grammar mean that the people using them generally have time and patience to think through what they are communicating. This may not even be the whole civilization, but merely a subsection of it—many dialects of Greek became possessions of the wealthy and educated as they passed out of broader usage. Such systems of communication generally exhibit fossilization (Italian notwithstanding), and cannot quite be counted as alive.

Despite this, a fully living language will always favour some sense of simplicity, and this is true of all natural languages—it's just that the definition of what is simple (whether that be compact writing, fast writing, a simple phonology, or accessible pictographs) changes as the language community matures. As a world-builder developing your own cultures, you should always be aware of what your citizens value when they speak to each other.
Syngenesis   8452.6 tgc / 2012.938 ce