Claude Shannon demonstrated how to generate "English-looking" text using Markov chains and how this gives a satisfactory representation of the statistical structure of a message. He then uses this model as a framework for defining "information sources" and how they should be measured.
Transcript
Now, Shannon had just finished developing his theories related to cryptography and therefore was well aware that human communication is a mix of randomness and statistical dependencies. Letters in our messages are obviously dependent on previous letters to some extent. And in 1948, he published a groundbreaking paper, A Mathematical Theory of Communication. In it, he uses Markov models as the basis for how we can think about communication.
And he starts with a toy example. Imagine you encounter a bunch of text written in an alphabet of A, B, and C. Perhaps you know nothing about this language, but you notice that A's seem to clump together while B's and C's do not. He then shows that you could design a machine to generate similar-looking text using a Markov chain.
And he starts off with a zeroth-order approximation, which means we just select each symbol, A, B, or C, independently and at random to form a sequence. However, notice that this sequence doesn't look like the original. He then shows you could do a bit better with a first-order approximation, where the letters are still chosen independently, but according to the probability of each letter in the original sequence. This is slightly better, as A's are now more likely, but it still doesn't capture much structure.
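To make these first two steps concrete, here is a minimal Python sketch; the letter probabilities are illustrative assumptions for the toy A/B/C language, not Shannon's actual figures.

```python
import random

# Made-up letter frequencies for the toy A/B/C language (A is most common);
# Shannon's actual probabilities are not reproduced here.
letter_probs = {"A": 0.5, "B": 0.25, "C": 0.25}

def zeroth_order(n):
    # Zeroth order: each symbol is equally likely, independent of everything else.
    return "".join(random.choice("ABC") for _ in range(n))

def first_order(n):
    # First order: symbols are still independent, but drawn with the
    # frequencies observed in the original text.
    letters = list(letter_probs)
    weights = [letter_probs[c] for c in letters]
    return "".join(random.choices(letters, weights=weights, k=n))

print(zeroth_order(30))
print(first_order(30))
```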
The next step is key. A second-order approximation takes into account each pair of letters which can occur. And in this case, we need three states: the first state represents all pairs which begin with A, the second all pairs that begin with B, and the third all pairs that begin with C.
And notice now that the A cup contains many AA tiles, which makes sense, since the conditional probability of an A after an A is higher in our original message. Now we can easily generate a sequence using this second-order model, as follows. We start anywhere and pick a tile, write down or output its first letter, and move to the cup named by its second letter. Then we pick a new tile and repeat this process indefinitely.
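Here is one way to sketch that cup-and-tile procedure in Python; the pair counts are assumptions chosen so that AA tiles dominate, as in the toy example, rather than Shannon's actual numbers.

```python
import random

# Illustrative tile counts for the toy A/B/C language; the exact numbers are
# assumptions, chosen so that AA pairs dominate as in the original message.
pair_counts = {
    "AA": 8, "AB": 2, "AC": 2,
    "BA": 3, "BB": 1, "BC": 2,
    "CA": 3, "CB": 2, "CC": 1,
}

# Sort the tiles into "cups": one cup (state) per first letter of the pair.
cups = {}
for pair, count in pair_counts.items():
    cups.setdefault(pair[0], []).append((pair, count))

def second_order(n, state="A"):
    out = []
    for _ in range(n):
        tiles, weights = zip(*cups[state])
        tile = random.choices(tiles, weights=weights, k=1)[0]
        out.append(tile[0])   # write down the tile's first letter
        state = tile[1]       # move to the cup named by its second letter
    return "".join(out)

print(second_order(40))
```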
Notice that this sequence is starting to look very similar to the original message, because this model is capturing the conditional dependencies between letters. And if we want to do even better, we could move to a third-order approximation, which takes into account groups of three letters, or trigrams; this would need nine states. But next, Shannon applies this exact same logic to actual English text, using statistics that were known for letters, pairs, trigrams, et cetera.
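As a rough sketch of how such a third-order (trigram) model could be estimated from a text and then sampled, consider the Python below; the sample string is just a placeholder, and the restart-on-dead-end rule is an implementation assumption rather than anything from the paper.

```python
import random
from collections import Counter, defaultdict

def train_third_order(text):
    # Count how often each letter follows each two-letter pair (trigram statistics).
    model = defaultdict(Counter)
    for i in range(len(text) - 2):
        model[text[i:i + 2]][text[i + 2]] += 1
    return model

def generate(model, n):
    state = random.choice(list(model))
    out = list(state)
    for _ in range(n):
        nexts = model.get(state)
        if not nexts:                          # dead end: restart from a random state
            state = random.choice(list(model))
            nexts = model[state]
        letters, counts = zip(*nexts.items())
        nxt = random.choices(letters, weights=counts, k=1)[0]
        out.append(nxt)
        state = state[1] + nxt                 # slide the two-letter window forward
    return "".join(out)

sample_text = "any reasonably long stretch of ordinary english text will do here"
print(generate(train_third_order(sample_text), 60))
```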
And he shows the same progression from zeroth-order, random letters, to first-order, second-order, and third-order sequences. He then goes on to try the same thing using words instead of letters. And he writes, "The resemblance to ordinary English text increases quite noticeably at each of the above steps." Indeed, these machines were producing meaningless text, yet it contained approximately the same statistical structure you'd see in actual English. Shannon then proceeds to define a quantitative measure of information, as he realizes that the amount of information in a message must be tied up in the design of the machine that could be used to generate similar-looking sequences.
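The word-level version is the same recipe with words in place of letters. Here is a rough sketch of a word-pair (second-order word) model, assuming a placeholder corpus string; none of these names or data come from Shannon's paper.

```python
import random
from collections import Counter, defaultdict

def train_word_pairs(text):
    # Word-level second-order model: count which words follow which.
    words = text.split()
    model = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model

def generate_words(model, n):
    word = random.choice(list(model))
    out = [word]
    for _ in range(n - 1):
        nexts = model.get(word)
        if not nexts:                          # last word of the corpus: restart
            word = random.choice(list(model))
            nexts = model[word]
        candidates, counts = zip(*nexts.items())
        word = random.choices(candidates, weights=counts, k=1)[0]
        out.append(word)
    return " ".join(out)

corpus = "any sufficiently long sample of ordinary english prose can stand in for the corpus here"
print(generate_words(train_word_pairs(corpus), 20))
```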
Which brings us to his concept of entropy.