Prediction and Entropy of Printed English

By C. E. SHANNON

(Manuscript Received Sept. 15, 1950)

A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known. Results of experiments in prediction are given, and some properties of an ideal predictor are developed.

1. INTRODUCTION

In a previous paper¹ the entropy and redundancy of a language have been defined. The entropy is a statistical parameter which measures, in a certain sense, how much information is produced on the average for each letter of a text in the language. If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy $H$ is the average number of binary digits required per letter of the original language. The redundancy, on the other hand, measures the amount of constraint imposed on a text in the language due to its statistical structure, e.g., in English the high frequency of the letter E, the strong tendency of H to follow T or of U to follow Q. It was estimated that when statistical effects extending over not more than eight letters are considered the entropy is roughly 2.3 bits per letter, the redundancy about 50 per cent.

Since then a new method has been found for estimating these quantities, which is more sensitive and takes account of long range statistics, influences extending over phrases, sentences, etc. This method is based on a study of the predictability of English; how well can the next letter of a text be predicted when the preceding $N$ letters are known.
The results of some experiments in prediction will be given, and a theoretical analysis of some of the properties of ideal prediction. By combining the experimental and theoretical results it is possible to estimate upper and lower bounds for the entropy and redundancy. From this analysis it appears that, in ordinary literary English, the long range statistical effects (up to 100 letters) reduce the entropy to something of the order of one bit per letter, with a corresponding redundancy of roughly 75%. The redundancy may be still higher when structure extending over paragraphs, chapters, etc. is included. However, as the lengths involved are increased, the parameters in question become more erratic and uncertain, and they depend more critically on the type of text involved.

¹ C. E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, v. 27, pp. 379-423, 623-656, July, October, 1948.

2. ENTROPY CALCULATION FROM THE STATISTICS OF ENGLISH

One method of calculating the entropy $H$ is by a series of approximations $F_0, F_1, F_2, \ldots$, which successively take more and more of the statistics of the language into account and approach $H$ as a limit. $F_N$ may be called the $N$-gram entropy; it measures the amount of information or entropy due to statistics extending over $N$ adjacent letters of text. $F_N$ is given by

$$F_N = -\sum_{i,j} p(b_i, j) \log_2 p_{b_i}(j) = -\sum_{i,j} p(b_i, j) \log_2 p(b_i, j) + \sum_i p(b_i) \log_2 p(b_i) \tag{1}$$

in which:
  $b_i$ is a block of $N-1$ letters [$(N-1)$-gram]
  $j$ is an arbitrary letter following $b_i$
  $p(b_i, j)$ is the probability of the $N$-gram $b_i, j$
  $p_{b_i}(j)$ is the conditional probability of letter $j$ after the block $b_i$, and is given by $p(b_i, j)/p(b_i)$.

The equation (1) can be interpreted as measuring the average uncertainty (conditional entropy) of the next letter $j$ when the preceding $N-1$ letters are known. As $N$ is increased, $F_N$ includes longer and longer range statistics and the entropy, $H$, is given by the limiting value of $F_N$ as $N \to \infty$:

$$H = \lim_{N \to \infty} F_N \tag{2}$$

The $N$-gram entropies $F_N$ for small values of $N$ can be calculated from standard tables of letter, digram and trigram frequencies.² If spaces and punctuation are ignored we have a twenty-six letter alphabet and $F_0$ may be taken (by definition) to be $\log_2 26$, or 4.7 bits per letter. $F_1$ involves letter frequencies and is given by

$$F_1 = -\sum_{i=1}^{26} p(i) \log_2 p(i) = 4.14 \text{ bits per letter.} \tag{3}$$

The digram approximation $F_2$ gives the result

$$\begin{aligned}
F_2 &= -\sum_{i,j} p(i, j) \log_2 p_i(j) \\
    &= -\sum_{i,j} p(i, j) \log_2 p(i, j) + \sum_i p(i) \log_2 p(i) \\
    &= 7.70 - 4.14 = 3.56 \text{ bits per letter.}
\end{aligned} \tag{4}$$

² Fletcher Pratt, "Secret and Urgent," Blue Ribbon Books, 1942.

THE BELL SYSTEM TECHNICAL JOURNAL, JANUARY 1951

The trigram entropy is given by

$$\begin{aligned}
F_3 &= -\sum_{i,j,k} p(i, j, k) \log_2 p_{ij}(k) \\
    &= -\sum_{i,j,k} p(i, j, k) \log_2 p(i, j, k) + \sum_{i,j} p(i, j) \log_2 p(i, j) \\
    &= 11.0 - 7.7 = 3.3
\end{aligned} \tag{5}$$

In this calculation the trigram table² used did not take into account trigrams bridging two words, such as WOW and OWO in TWO WORDS. To compensate partially for this omission, corrected trigram probabilities $p(i, j, k)$ were obtained from the probabilities $p'(i, j, k)$ of the table by the following rough formula:

$$p(i, j, k) = \frac{2.5}{4.5}\, p'(i, j, k) + \frac{1}{4.5}\, r(i)\, p(j, k) + \frac{1}{4.5}\, p(i, j)\, s(k)$$

where $r(i)$ is the probability of letter $i$ as the terminal letter of a word and $s(k)$ is the probability of $k$ as an initial letter. Thus the trigrams within words (an average of 2.5 per word) are counted according to the table; the bridging trigrams (one of each type per word) are counted approximately by assuming independence of the terminal letter of one word and the initial digram in the next or vice versa. Because of the approximations involved here, and also because of the fact that the sampling error in identifying probability with sample frequency is more serious, the value of $F_3$ is less reliable than the previous numbers.
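The second form of equation (1) can be sketched in a few lines of code: $F_N$ is the entropy of the $N$-gram distribution minus the entropy of the $(N-1)$-gram distribution. The function names below, and the choice to wrap the sample text around on itself so that the $(N-1)$-gram counts are exact marginals of the $N$-gram counts, are assumptions made for this illustration and are not taken from the paper, which worked from published frequency tables.

```python
import math
from collections import Counter

def block_entropy(text: str, n: int) -> float:
    """Entropy in bits of the empirical distribution of n-letter blocks.

    Assumption for this sketch: the text is treated as circular, so every
    position starts exactly one block and the (n-1)-gram counts are the
    marginals of the n-gram counts.
    """
    if n == 0:
        return 0.0
    wrapped = text + text[:n - 1]
    counts = Counter(wrapped[k:k + n] for k in range(len(text)))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def ngram_entropy(text: str, n: int) -> float:
    """F_N from the second form of (1): H(N-grams) minus H((N-1)-grams)."""
    return block_entropy(text, n) - block_entropy(text, n - 1)

# F_0 is taken by definition as log2 of the alphabet size.
f0_26_letters = math.log2(26)

sample = "THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG READING LAMP ON THE DESK"
for n in (1, 2, 3):
    print(n, round(ngram_entropy(sample, n), 3))
```

A sample this short badly underestimates the higher-order entropies, which is the sampling-error problem the text mentions for $F_3$; meaningful values need frequency tables drawn from long samples.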
Since tables of $N$-gram frequencies were not available for $N > 3$, $F_4$, $F_5$, etc. could not be calculated in the same way. However, word frequencies have been tabulated³ and can be used to obtain a further approximation. Figure 1 is a plot on log-log paper of the probabilities of words against frequency rank. The most frequent English word "the" has a probability .071 and this is plotted against 1. The next most frequent word "of" has a probability of .034 and is plotted against 2, etc. Using logarithmic scales both for probability and rank, the curve is approximately a straight line with slope $-1$; thus, if $p_n$ is the probability of the $n$th most frequent word, we have, roughly

$$p_n = \frac{.1}{n} \tag{6}$$

Zipf⁴ has pointed out that this type of formula, $p_n = k/n$, gives a rather good approximation to the word probabilities in many different languages. The formula (6) clearly cannot hold indefinitely since the total probability $\sum p_n$ must be unity, while $\sum_{n=1}^{\infty} .1/n$ is infinite. If we assume (in the absence of any better estimate) that the formula $p_n = .1/n$ holds out to the $n$ at which the total probability is unity, and that $p_n = 0$ for larger $n$, we find that the critical $n$ is the word of rank 8,727. The entropy is then:

$$-\sum_{n=1}^{8727} p_n \log_2 p_n = 11.82 \text{ bits per word,} \tag{7}$$

or 11.82/4.5 = 2.62 bits per letter since the average word length in English is 4.5 letters. One might be tempted to identify this value with $F_{4.5}$, but actually the ordinate of the $F_N$ curve at $N = 4.5$ will be above this value. The reason is that $F_4$ or $F_5$ involves groups of four or five letters regardless of word division. A word is a cohesive group of letters with strong internal statistical influences, and consequently the $N$-grams within words are more restricted than those which bridge words.

[Fig. 1: Relative frequency against rank for English words. Log-log plot of word frequency against word order; labeled points include THE, OF, AND, TO, I, OR, SAY, REALLY, QUALITY.]

³ G. Dewey, "Relative Frequency of English Speech Sounds," Harvard University Press, 1923.
⁴ G. K. Zipf, "Human Behavior and the Principle of Least Effort," Addison-Wesley Press, 1949.
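The truncation procedure just described can be sketched directly: accumulate $p_n = .1/n$ until the total probability reaches unity, summing $-p_n \log_2 p_n$ along the way. The function name and its parameter `k` are hypothetical, introduced only for this illustration. Run as written, the summation reaches unity at a rank somewhat above 12,000 and gives a little under 10 bits per word, so it does not reproduce the figures 8,727 and 11.82 quoted in the text; it should be read as an illustration of the procedure rather than a reproduction of the published numbers.

```python
import math

def zipf_cutoff_entropy(k: float = 0.1):
    """Truncate p_n = k/n at the first rank where the total reaches 1.

    Returns (cutoff rank, entropy in bits per word). The last term pushes
    the total very slightly past 1; no renormalization is attempted.
    """
    total = 0.0
    entropy = 0.0
    n = 0
    while total < 1.0:
        n += 1
        p = k / n
        total += p
        entropy -= p * math.log2(p)
    return n, entropy

cutoff, bits_per_word = zipf_cutoff_entropy()
# 4.5 letters is the average English word length used in the text.
bits_per_letter = bits_per_word / 4.5
print(cutoff, round(bits_per_word, 2), round(bits_per_letter, 2))
```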
The effect of this is that we have obtained, in 2.62 bits per letter, an estimate which corresponds more nearly to, say, $F_5$ or $F_6$.

A similar set of calculations was carried out including the space as an additional letter, giving a 27 letter alphabet. The results of both 26- and 27-letter calculations are summarized below:

                F_0     F_1     F_2     F_3     F_word
  26 letter     4.70    4.14    3.56    3.3     2.62
  27 letter     4.76    4.03    3.32    3.1     2.14

The estimate of 2.3 for $F_8$, alluded to above, was found by several methods, one of which is the extrapolation of the 26-letter series above out to that point. Since the space symbol is almost completely redundant when sequences of one or more words are involved, the values of $F_N$ in the 27-letter case will be 4.5/5.5 or .818 of $F_N$ for the 26-letter alphabet when $N$ is reasonably large.

3. PREDICTION OF ENGLISH

The new method of estimating entropy exploits the fact that anyone speaking a language possesses, implicitly, an enormous knowledge of the statistics of the language. Familiarity with the words, idioms, cliches and grammar enables him to fill in missing or incorrect letters in proof-reading, or to complete an unfinished phrase in conversation. An experimental demonstration of the extent to which English is predictable can be given as follows: Select a short passage unfamiliar to the person who is to do the predicting. He is then asked to guess the first letter in the passage. If the guess is correct he is so informed, and proceeds to guess the second letter. If not, he is told the correct first letter and proceeds to his next guess. This is continued through the text. As the experiment progresses, the subject writes down the correct text up to the current point for use in predicting future letters. The result of a typical experiment of this type is given below. Spaces were included as an additional letter, making a 27 letter alphabet. The first line is the original text; the second line contains a dash for each letter correctly guessed.
In the case of incorrect guesses the correct letter is copied in the second line.

  (1) THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG
  (2) ----ROO------NOT-V-----I------SM----OBL----
  (1) READING LAMP ON THE DESK SHED GLOW ON
  (2) REA----------O------D----SHED-GLO--O--
  (1) POLISHED WOOD BUT LESS ON THE SHABBY RED CARPET
  (2) P-L-S-----O---BU--L-S--O------SH-----RE--C------          (8)
Of a total of 129 letters, 89 or 69% were guessed correctly. The errors, as would be expected, occur most frequently at the beginning of words and syllables where the line of thought has more possibility of branching out. It might be thought that the second line in (8), which we will call the reduced text, contains much less information than the first. Actually, both lines contain the same information in the sense that it is possible, at least in principle, to recover the first line from the second. To accomplish this we need an identical twin of the individual who produced the sequence. The twin (who must be mathematically, not just biologically identical) will respond in the same way when faced with the same problem. Suppose, now, we have only the reduced text of (8). We ask the twin to guess the passage. At each point we will know whether his guess is correct, since he is guessing the same as the first twin and the presence of a dash in the reduced text corresponds to a correct guess. The letters he guesses wrong are also available, so that at each stage he can be supplied with precisely the same information the first twin had available.

[Fig. 2: Communication system using reduced text. Block diagram with stages labeled ORIGINAL TEXT, COMPARISON, REDUCED TEXT, COMPARISON, ORIGINAL TEXT.]

The need for an identical twin in this conceptual experiment can be eliminated as follows. In general, good prediction does not require knowledge of more than $N$ preceding letters of text, with $N$ fairly small. There are only a finite number of possible sequences of $N$ letters. We could ask the subject to guess the next letter for each of these possible $N$-grams.
The complete list of these predictions could then be used both for obtaining the reduced text from the original and for the inverse reconstruction process.

To put this another way, the reduced text can be considered to be an encoded form of the original, the result of passing the original text through a reversible transducer. In fact, a communication system could be constructed in which only the reduced text is transmitted from one point to the other. This could be set up as shown in Fig. 2, with two identical prediction devices.

An extension of the above experiment yields further information concerning the predictability of English. As before, the subject knows the text up to the current point and is asked to guess the next letter. If he is wrong, he is told so and asked to guess again. This is continued until he finds the correct letter. A typical result with this experiment is shown below. The first line is the original text and the numbers in the second line indicate the guess at which the correct letter was obtained.

  (1) T  H  E  R  E     I  S     N  O     R  E  V  E  R  S  E     O  N     A     M  O  T  O  R  C  Y  C  L  E     A
  (2) 1  1  1  5  1  1  2  1  1  2  1  1  15 1  17 1  1  1  2  1  3  2  1  2  2  7  1  1  1  1  4  1  1  1  1  1  3  1
  (1) F  R  I  E  N  D     O  F     M  I  N  E     F  O  U  N  D     T  H  I  S     O  U  T
  (2) 8  6  1  3  1  1  1  1  1  1  1  1  1  1  1  6  2  1  1  1  1  1  1  2  1  1  1  1  1  1
  (1) R  A  T  H  E  R     D  R  A  M  A  T  I  C  A  L  L  Y     T  H  E     O  T  H  E  R     D  A  Y
  (2) 4  1  1  1  1  1  1  11 5  1  1  1  1  1  1  1  1  1  1  1  6  1  1  1  1  1  1  1  1  1  1  1  1  1          (9)

Out of 102 symbols the subject guessed right on the first guess 79 times, on the second guess 8 times, on the third guess 3 times, the fourth and fifth guesses 2 each and only eight times required more than five guesses. Results of this order are typical of prediction by a good subject with ordinary literary English. Newspaper writing, scientific work and poetry generally lead to somewhat poorer scores.

The reduced text in this case also contains the same information as the original.
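The guess-number reduction of (9), and its reversal by an identical prediction device, can be sketched with a mechanical predictor standing in for the human subject. Everything named below is an assumption made for illustration: the predictor looks at only one preceding letter, its guessing order comes from digram counts in a small training string, ties are broken alphabetically, and the text is assumed to be preceded by a space. The point of the sketch is only that the same table of guessing orders serves both for producing the reduced text and for recovering the original.

```python
from collections import Counter, defaultdict

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # 26 letters plus the space

def build_guess_order(training: str) -> dict:
    """For each preceding letter, list the alphabet in decreasing order of
    how often each letter followed it in the training text."""
    follow = defaultdict(Counter)
    for prev, nxt in zip(training, training[1:]):
        follow[prev][nxt] += 1
    order = {}
    for prev in ALPHABET:
        # Ties and unseen letters fall back to alphabet order, so the
        # ranking is deterministic and identical at both ends.
        order[prev] = sorted(
            ALPHABET, key=lambda c: (-follow[prev][c], ALPHABET.index(c))
        )
    return order

def reduce_text(text: str, order: dict, start: str = " ") -> list:
    """Replace each letter by the number of the guess that finds it."""
    prev, ranks = start, []
    for ch in text:
        ranks.append(order[prev].index(ch) + 1)
        prev = ch
    return ranks

def restore_text(ranks: list, order: dict, start: str = " ") -> str:
    """Invert reduce_text using the same table of guessing orders."""
    prev, out = start, []
    for r in ranks:
        ch = order[prev][r - 1]
        out.append(ch)
        prev = ch
    return "".join(out)

training = ("THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG READING LAMP ON THE "
            "DESK SHED GLOW ON POLISHED WOOD BUT LESS ON THE SHABBY RED CARPET")
order = build_guess_order(training)
sample = "THERE IS NO REVERSE ON A MOTORCYCLE"
reduced = reduce_text(sample, order)
print(reduced)
print(restore_text(reduced, order))
```

A one-letter predictor trained on so little text guesses far worse than the human subject of (9); a longer context and larger tables would be needed to approach those scores.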
Again utilizing the identical twin we ask him at each stage to guess as many times as the number given in the reduced text and recover in this way the original. To eliminate the human element here we must ask our subject, for each possible $N$-gram of text, to guess the most probable next letter, the second most probable next letter, etc. This set of data can then serve both for prediction and recovery.

Just as before, the reduced text can be considered an encoded version of the original. The original language, with an alphabet of 27 symbols, A, B, ..., Z, space, has been translated into a new language with the alphabet 1, 2, ..., 27. The translating has been such that the symbol 1 now has an extremely high frequency. The symbols 2, 3, 4 have successively smaller frequencies and the final symbols 20, 21, ..., 27 occur very rarely. Thus the translating has simplified to a considerable extent the nature of the statistical structure involved. The redundancy which originally appeared in complicated constraints among groups of letters, has, by the translating process, been made explicit to a large extent in the very unequal probabilities of the new symbols. It is this, as will appear later, which enables one to estimate the entropy from these experiments.

In order to determine how predictability depends on the number $N$ of preceding letters known to the subject, a more involved experiment was carried out. One hundred samples of English text were selected at random from a book, each fifteen letters in length. The subject was required to guess the text, letter by letter, for each sample as in the preceding experiment. Thus one hundred samples were obtained in which the subject had available 0, 1, 2, 3, ..., 14 preceding letters. To aid in prediction the subject made such use as he wished of various statistical tables, letter, digram and trigram tables, a table of the frequencies of initial letters in words, a list of the frequencies of common words and a dictionary. The samples in this experiment were from "Jefferson the Virginian" by Dumas Malone.
These results, together with a similar test in which 100 letters were known to the subject, are summarized in Table I. The column corresponds to the number of preceding letters known to the subject plus one; the row is the number of the guess. The entry in column $N$ at row $S$ is the number of times the subject guessed the right letter at the $S$th guess when $(N-1)$ letters were known. For example, the entry 19 in column 6, row 2, means that with five letters known the correct letter was obtained on the second guess nineteen times out of the hundred.

TABLE I

          1     2     3    4    5    6    7    8    9   10   11   12   13   14   15  100
   1    18.2  29.2   36   47   51   58   48   66   66   67   62   58   66   72   60   80
   2    10.7  14.8   20   18   13   19   17   15   13   10    9   14    9    6   18    7
   3     8.6  10.0   12  ...
   4     6.7   8.6    7
   5     6.5   7.1    1
   6     5.8   5.5    4
   7     5.6   4.5    3
   8     5.2   3.6    2
   9     5.0   3.0    4
  10     4.3   2.6    2
  11     3.1   2.2    2
  12     2.8   1.9    4
  13     2.4   1.5  ...
  14     2.3   1.2
  15     2.1   1.0
  16     2.0    .9
  17     1.6    .7
  18     1.6    .5
  19     1.6    .4
  20     1.3    .3
  21     1.2    .2
  22      .8    .1
  23      .3    .1
  24      .1    .0
  25      .1
  26      .1
  27      .1

  (Columns 1 and 2 are complete. The entries of column 3 below row 12, and of columns 4 through 100 below row 2, are not legible in this copy.)
The first two columns of this table were not obtained by the experimental procedure outlined above but were calculated directly from the known letter and digram frequencies. Thus with no known letters the most probable symbol is the space (probability .182); the next guess, if this is wrong, should be E (probability .107), etc. These probabilities are the frequencies with which the right guess would occur at the first, second, etc., trials with best prediction. Similarly, a simple calculation from the digram table gives the entries in column 2 when the subject uses the table to best advantage. Since the frequency tables are determined from long samples of English, these two columns are subject to less sampling error than the others.

It will be seen that the prediction gradually improves, apart from some statistical fluctuation, with increasing knowledge of the past as indicated by the larger numbers of correct first guesses and the smaller numbers of high rank guesses.

One experiment was carried out with "reverse" prediction, in which the subject guessed the letter preceding those already known. Although the task is subjectively much more difficult, the scores were only slightly poorer. Thus, with two 101 letter samples from the same source, the subject obtained the following results:

  No. of guess     1    2    3    4    5    6    7    8   >8
  Forward         70   10    7    2    2    3    3    0    4
  Reverse         66    7    4    4    6    2    1    2    9

Incidentally, the $N$-gram entropy $F_N$ for a reversed language is equal to that for the forward language as may be seen from the second form in equation (1). Both terms have the same value in the forward and reversed cases.

4. IDEAL N-GRAM PREDICTION

The data of Table I can be used to obtain upper and lower bounds to the $N$-gram entropies $F_N$. In order to do this, it is necessary first to develop some general results concerning the best possible prediction of a language when the preceding $N$ letters are known. There will be for the language a set of conditional probabilities $p_{i_1, i_2, \ldots, i_{N-1}}(j)$. This is the probability when the $(N-1)$ gram $i_1, i_2, \ldots, i_{N-1}$ occurs that the next letter will be $j$. The best guess for the next letter, when this $(N-1)$ gram is known to have occurred, will be that letter having the highest conditional probability. The second guess should be that with the second highest probability, etc. A machine or person guessing in the best way would guess letters in the order of decreasing conditional probability. Thus the process of reducing a text with such an ideal predictor consists of a mapping of the letters into the numbers from 1 to 27 in such a way that the most probable next letter [conditional on the known preceding $(N-1)$ gram] is mapped into 1, etc. The frequency of 1's in the reduced text will then be given by

$$q_1^N = \sum p(i_1, i_2, \ldots, i_{N-1}, j) \tag{10}$$

where the sum is taken over all $(N-1)$ grams $i_1, i_2, \ldots, i_{N-1}$, the $j$ being the one which maximizes $p$ for that particular $(N-1)$ gram. Similarly, the frequency of 2's, $q_2^N$, is given by the same formula with $j$ chosen to be that letter having the second highest value of $p$, etc.

On the basis of $N$-grams, a different set of probabilities for the symbols in the reduced text, $q_1^{N+1}, q_2^{N+1}, \ldots, q_{27}^{N+1}$, would normally result. Since this prediction is on the basis of a greater knowledge of the past, one would expect the probabilities of low numbers to be greater, and in fact one can prove the following inequalities:

$$\sum_{i=1}^{S} q_i^{N+1} \ge \sum_{i=1}^{S} q_i^{N}, \qquad S = 1, 2, \ldots \tag{11}$$

This means that the probability of being right in the first $S$ guesses when the preceding $N$ letters are known is greater than or equal to that when only $(N-1)$ are known, for all $S$. To prove this, imagine the probabilities $p(i_1, i_2, \ldots, i_N, j)$ arranged in a table with $j$ running horizontally and all the $N$-grams vertically. The table will therefore have 27 columns and $27^N$ rows. The term on the left of (11) is the sum of the $S$ largest entries in each row, summed over all the rows. The right-hand member of (11) is also a sum of entries from this table in which $S$ entries are taken from each row but not necessarily the $S$ largest. This follows from the fact that the right-hand member would be calculated from a similar table with $(N-1)$ grams rather than $N$-grams listed vertically. Each row in the $N-1$ gram table is the sum of 27 rows of the $N$-gram table, since:

$$p(i_2, i_3, \ldots, i_N, j) = \sum_{i_1=1}^{27} p(i_1, i_2, \ldots, i_N, j). \tag{12}$$

The sum of the $S$ largest entries in a row of the $N-1$ gram table will equal the sum of the $27S$ selected entries from the corresponding 27 rows of the $N$-gram table only if the latter fall into $S$ columns. For the equality in (11) to hold for a particular $S$, this must be true of every row of the $N-1$ gram table. In this case, the first letter of the $N$-gram does not affect the set of the $S$ most probable choices for the next letter, although the ordering within the set may be affected. However, if the equality in (11) holds for all $S$, it follows that the ordering as well will be unaffected by the first letter of the $N$-gram. The reduced text obtained from an ideal $N-1$ gram predictor is then identical with that obtained from an ideal $N$-gram predictor.

Since the partial sums

$$Q_S^N = \sum_{i=1}^{S} q_i^N, \qquad S = 1, 2, \ldots \tag{13}$$

are monotonic increasing functions of $N$, $< 1$ for all $N$, they must all approach limits as $N \to \infty$.
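The inequalities (11) can be checked numerically on a small artificial language. The sketch below is an illustration under stated assumptions, not part of the paper: it uses a four-letter alphabet, draws an arbitrary trigram distribution at random, obtains the digram and single-letter tables by summing out the first letter as in (12), and computes the guess-number frequencies of (10) for each table. All function and variable names are hypothetical.

```python
import itertools
import random

def guess_rank_frequencies(joint: dict, alphabet_size: int) -> list:
    """q_1, q_2, ... for an ideal predictor.

    `joint` maps tuples (context..., next_letter) to probabilities. Within
    each context row the entries are sorted in decreasing order; the s-th
    largest entry of every row contributes to q_s, as in equation (10).
    """
    rows = {}
    for gram, p in joint.items():
        rows.setdefault(gram[:-1], []).append(p)
    q = [0.0] * alphabet_size
    for probs in rows.values():
        for rank, p in enumerate(sorted(probs, reverse=True)):
            q[rank] += p
    return q

def marginalize_first(joint: dict) -> dict:
    """Sum out the first letter of every gram, as in equation (12)."""
    out = {}
    for gram, p in joint.items():
        out[gram[1:]] = out.get(gram[1:], 0.0) + p
    return out

def partial_sums(q: list) -> list:
    """The partial sums Q_S of equation (13)."""
    return list(itertools.accumulate(q))

random.seed(0)
alphabet = "abcd"  # a toy alphabet standing in for the 27 symbols
weights = {g: random.random() for g in itertools.product(alphabet, repeat=3)}
norm = sum(weights.values())
trigram = {g: w / norm for g, w in weights.items()}
digram = marginalize_first(trigram)
unigram = marginalize_first(digram)

q_two_known = guess_rank_frequencies(trigram, len(alphabet))
q_one_known = guess_rank_frequencies(digram, len(alphabet))
q_none_known = guess_rank_frequencies(unigram, len(alphabet))

for name, q in (("two known", q_two_known),
                ("one known", q_one_known),
                ("none known", q_none_known)):
    print(name, [round(x, 3) for x in partial_sums(q)])
```

Each printed row of partial sums should be no smaller, entry by entry, than the row below it, which is the content of (11) for this toy distribution.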
Their first differences must therefore approach limits as $N \to \infty$, i.e., the $q_i^N$ approach limits, $q_i^\infty$. These may be interpreted as the relative frequency of correct first, second, ..., guesses with knowledge of the entire (infinite) past history of the text.

The ideal N-gram predictor can be considered, as has been pointed out, to be a transducer which operates on the language translating it into a sequence of numbers running from 1 to 27. As such it has the following two properties:

1. The output symbol is a function of the present input (the predicted next letter when we think of it as a predicting device) and the preceding (N-1) letters.
2. It is instantaneously reversible. The original input can be recovered by a suitable operation on the reduced text without loss of time. In fact, the inverse operation also operates on only the (N-1) preceding symbols of the reduced text together with the present output.

The above proof that the frequencies of output symbols with an N-1 gram predictor satisfy the inequalities:

$$\sum_{i=1}^{S} q_i^{N} \ge \sum_{i=1}^{S} q_i^{N-1}, \qquad S = 1, 2, \ldots, 27 \tag{14}$$

can be applied to any transducer having the two properties listed above. In fact we can imagine again an array with the various (N-1) grams listed vertically and the present input letter horizontally. Since the present output is a function of only these quantities there will be a definite output symbol which may be entered at the corresponding intersection of row and column. Furthermore, the instantaneous reversibility requires that no two entries in the same row be the same. Otherwise, there would be ambiguity between the two or more possible present input letters when reversing the translation. The total probability of the S most probable symbols in the output, say $\sum_{i=1}^{S} r_i$, will be the sum of the probabilities for S entries in each row, summed over the rows, and consequently is certainly not greater than the sum of the S largest entries in each row. Thus we will have

$$\sum_{i=1}^{S} q_i^{N} \ge \sum_{i=1}^{S} r_i, \qquad S = 1, 2, \ldots, 27 \tag{15}$$

In other words ideal prediction as defined above enjoys a preferred position among all translating operations that may be applied to a language and which satisfy the two properties above. Roughly speaking, ideal prediction collapses the probabilities of various symbols to a small group more than any other translating operation involving the same number of letters which is instantaneously reversible.

Sets of numbers satisfying the inequalities (15) have been studied by Muirhead in connection with the theory of algebraic inequalities (Hardy, Littlewood and Polya, "Inequalities," Cambridge University Press, 1934). If (15) holds when the $q_i^N$ and $r_i$ are arranged in decreasing order of magnitude, and also $\sum_{i=1}^{27} q_i^N = \sum_{i=1}^{27} r_i$ (this is true here since the total probability in each case is 1), then the first set, $q_i^N$, is said to majorize the second set, $r_i$. It is known that the majorizing property is equivalent to either of the following properties:

1. The $r_i$ can be obtained from the $q_i^N$ by a finite series of "flows." By a flow is understood a transfer of probability from a larger $q$ to a smaller one, as heat flows from hotter to cooler bodies but not in the reverse direction.
2. The $r_i$ can be obtained from the $q_i^N$ by a generalized "averaging" operation. There exists a set of non-negative real numbers, $a_{ij}$, with $\sum_j a_{ij} = \sum_i a_{ij} = 1$ and such that

$$r_i = \sum_j a_{ij}\, q_j^N. \tag{16}$$

5. ENTROPY BOUNDS FROM PREDICTION FREQUENCIES

If we know the frequencies of symbols in the reduced text with the ideal N-gram predictor, $q_i^N$, it is possible to set both upper and lower bounds to the N-gram entropy, $F_N$, of the original language. These bounds are as follows:

$$\sum_{i=1}^{27} i\,(q_i^N - q_{i+1}^N) \log i \;\le\; F_N \;\le\; -\sum_{i=1}^{27} q_i^N \log q_i^N. \tag{17}$$

The upper bound follows immediately from the fact that the maximum possible entropy in a language with letter frequencies $q_i^N$ is $-\sum q_i^N \log q_i^N$. Thus the entropy per symbol of the reduced text is not greater than this. The N-gram entropy of the reduced text is equal to that for the original language, as may be seen by an inspection of the definition (1) of $F_N$. The sums involved will contain precisely the same terms although, perhaps, in a different order. This upper bound is clearly valid, whether or not the prediction is ideal.
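As an illustration of how the two bounds in (17) are evaluated from a set of prediction frequencies, here is a minimal sketch. It assumes logarithms to base 2 (so the results are in bits) and takes the frequency just beyond the last rank to be zero; the eight-value distribution is only an illustrative example, and the function name is not from the paper.

```python
import math

def entropy_bounds(q):
    """Return (lower, upper) bounds of relation (17), in bits per letter.

    `q` holds the frequencies of 1st, 2nd, ... guesses being correct.
    They are sorted into decreasing order, and the frequency just past
    the last rank is treated as zero (an assumption made explicit here).
    """
    q = sorted(q, reverse=True)
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    padded = q + [0.0]
    lower = sum(i * (padded[i - 1] - padded[i]) * math.log2(i)
                for i in range(1, len(q) + 1))
    return lower, upper

# Illustrative monotonic distribution (sums to 1).
q_example = [0.60, 0.20, 0.05, 0.05, 0.025, 0.025, 0.025, 0.025]
lower, upper = entropy_bounds(q_example)
print(f"lower bound = {lower:.3f} bits, upper bound = {upper:.3f} bits")

# For a rectangular (uniform) distribution the two bounds coincide.
flat_lower, flat_upper = entropy_bounds([0.25] * 4)
print(f"uniform over 4: lower = {flat_lower:.3f}, upper = {flat_upper:.3f}")
```

The uniform case shows why the bounds are tight only for rectangular distributions: every term of the lower-bound sum vanishes except the last, which equals the logarithm of the number of equally likely symbols.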
The lower bound is more difficult to establish. It is necessary to show that with any selection of N-gram probabilities $p(i_1, i_2, \ldots, i_N)$, we will have

$$\sum_{i=1}^{27} i\,(q_i^N - q_{i+1}^N) \log i \;\le\; F_N. \tag{18}$$

The left-hand member of the inequality can be interpreted as follows: Imagine the $q_i^N$ arranged as a sequence of lines of decreasing height (Fig. 3). The actual $q_i^N$ can be considered as the sum of a set of rectangular distributions as shown. The left member of (18) is the entropy of this set of distributions. Thus, the $i$th rectangular distribution has a total probability of $i\,(q_i^N - q_{i+1}^N)$. The entropy of the distribution is $\log i$. The total entropy is then

$$\sum_{i=1}^{27} i\,(q_i^N - q_{i+1}^N) \log i. \tag{19}$$

The problem, then, is to show that any system of probabilities $p(i_1, \ldots, i_N)$, with best prediction frequencies $q_i$, has an entropy $F_N$ greater than or equal to that of this rectangular system, derived from the same set of $q_i$.

Fig. 3. Rectangular decomposition of a monotonic distribution. (Panels: original distribution; rectangular decomposition.)

The $q_i$ as we have said are obtained from the $p(i_1, \ldots, i_N)$ by arranging each row of the table in decreasing order of magnitude and adding vertically. Thus the $q_i$ are the sum of a set of monotonic decreasing distributions. Replace each of these distributions by its rectangular decomposition. Each one is replaced then (in general) by 27 rectangular distributions; the $q_i$ are the sum of $27 \times 27^N$ rectangular distributions, of from 1 to 27 elements, and all starting at the left column. The entropy for this set is less than or equal to that of the original set of distributions since a termwise addition of two or more distributions always increases entropy. This is actually an application of the general theorem that $H_y(x) \le H(x)$ for any chance variables $x$ and $y$. The equality holds only if the distributions being added are proportional. Now we may add the different components of the same width without changing the entropy (since in this case the distributions are proportional). The result is that we have arrived at the rectangular decomposition of the $q_i$ by a series of processes which decrease or leave constant the entropy, starting with the original N-gram probabilities. Consequently the entropy of the original system $F_N$ is greater than or equal to that of the rectangular decomposition of the $q_i$. This proves the desired result.

It will be noted that the lower bound is definitely less than $F_N$ unless each row of the table has a rectangular distribution. This requires that for each possible (N-1) gram there is a set of possible next letters each with equal probability, while all other next letters have zero probability.

Fig. 4. Upper and lower experimental bounds for the entropy of 27-letter English. (Horizontal axis: number of letters, 0 to 15 and 100; curves: upper bound, lower bound.)

It will now be shown that the upper and lower bounds for $F_N$ given by (17) are monotonic decreasing functions of N. This is true of the upper bound since the $q_i^{N+1}$ majorize the $q_i^N$ and any equalizing flow in a set of probabilities increases the entropy. To prove that the lower bound is also monotonic decreasing we will show that the quantity

$$U = \sum_i i\,(q_i - q_{i+1}) \log i \tag{20}$$

is increased by an equalizing flow among the $q_i$. Suppose a flow occurs from $q_i$ to $q_{i+1}$, the first being changed by $\Delta q$ (a negative quantity) and the latter increased by the same magnitude. Then three terms in the sum change and the change in $U$ is given by

$$\Delta U = \left[-(i-1)\log(i-1) + 2i \log i - (i+1)\log(i+1)\right] \Delta q. \tag{21}$$

The term in brackets has the form $-f(x-1) + 2f(x) - f(x+1)$ where $f(x) = x \log x$. Now $f(x)$ is a function which is concave upward for positive $x$, since $f''(x) = 1/x > 0$. The bracketed term is twice the difference between the ordinate of the curve at $x = i$ and the ordinate of the midpoint of the chord joining $i - 1$ and $i + 1$, and consequently is negative. Since $\Delta q$ also is negative, the change in $U$ brought about by the flow is positive. An even simpler calculation shows that this is also true for a flow from $q_1$ to $q_2$ or from $q_{26}$ to $q_{27}$ (where only two terms of the sum are affected). It follows that the lower bound based on the N-gram prediction frequencies $q_i^N$ is greater than or equal to that calculated from the N + 1 gram frequencies $q_i^{N+1}$.
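The sign argument around (20) and (21) can be verified numerically. The sketch below is illustrative only: it assumes base-2 logarithms, a hypothetical four-value distribution, and a small flow from the second to the third frequency; the helper names are not from the paper.

```python
import math

def U(q):
    """Quantity (20): sum of i (q_i - q_{i+1}) log i, with q past the end = 0."""
    padded = list(q) + [0.0]
    return sum(i * (padded[i - 1] - padded[i]) * math.log2(i)
               for i in range(1, len(q) + 1))

def equalizing_flow(q, i, amount):
    """Move `amount` of probability from q_i to q_{i+1} (ranks are 1-based)."""
    new = list(q)
    new[i - 1] -= amount
    new[i] += amount
    return new

def bracket(i):
    """Bracketed term of (21): -f(i-1) + 2 f(i) - f(i+1), f(x) = x log x."""
    def f(x):
        return x * math.log2(x) if x > 0 else 0.0
    return -f(i - 1) + 2 * f(i) - f(i + 1)

q_before = [0.50, 0.30, 0.15, 0.05]   # hypothetical decreasing frequencies
amount = 0.05                          # size of the flow from q_2 to q_3
q_after = equalizing_flow(q_before, 2, amount)

delta_U = U(q_after) - U(q_before)
predicted = bracket(2) * (-amount)     # (21) with delta q = -amount

print(f"bracket(2) = {bracket(2):.4f}")
print(f"change in U = {delta_U:.5f}, predicted by (21) = {predicted:.5f}")
```

Because $x \log x$ is convex, the bracketed term is negative for every interior rank, so any flow of this kind raises $U$.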
This departure consisted I marine cable of electron tube r, t~~ cable laying machinery ani cable, and which, over an exter not require servicing for the pUl circuit elements. The repeater about three inches in diameter cable diameter of a little over an the taper at each end is about: it can conform to the curvature ( in the laying gear on the cable, in Fig. 1. EXPERIMENTAL BOUNDS FOR ENGLISH Working from the data of Table I, the upper and lower bounds were calculated from relations (17). The data were first smoothed somewhat to overcome the worst sampling fluctuations. The low numbers in this table are the least reliable and these were averaged together in groups. Thus, in column 4, the 47, 18 and 14 were not changed but the remaining group totaling 21 was divided uniformly over the rows from 4 to 20. The upper and lower bounds given by (17) were then calculated" for each column giving the following results: Column Upper. Lower. 1 2 3 4 S 6 7 8 9 10 It 12 13 14 15 100 4.033.423.0 2.6 2.72.22.81.8 1.92.1 2.2 2.3 2.1 1.72.1 1.3 3.192.50 2.1 1.71.71.31.81.01.0 1.0 1.3 1,3 1,2 .91.2 .6' It is evident that there is ~till considerable sampling error in these figures due to identifying the observed sample frequencies with the prediction probabilities. It must also be remembered that the lower bound was proved only for the ideal predictor, while the frequencies used here are from human prediction. Some rough calculations, however, indicate that the discrepangr between the actual F N and the lower bound with ideal prediction (due to the failure to have rectangular distributions of _conditional probability) more than compensates for the failure of human subjects to predict in the ideal manner. Thus we feel reasonably confident of both bounds apart from sampling errors. The values given above are plotted against LV in Fig. 4. ACKNOWLEDGMENT The writer is indebted to Mrs. Mary E. Shannon and to Dr. B. M. 
Oliver for help with the experimental work and for a number of suggestions and. criticisms concerning the theoretical aspects of 'this paper. ...,. Th~ new cable system, com!" . Amencan Telephone and Teleg __ .~~,,:,the development of telephonic ( '.,.an~ Cuba, which has presented dlllons make it difficult, if not ,:':nethods of communication. One ..'lI} Florida that would permit thE " the stretch of water between Flo . as 6,000 feet in depth and whi, .1.lsed. The practical solution has !?'C Bell System toll lines at Mi line (With Some water crossings). -.,' abo t 100. ';'-""h ,',' ~ n.m., by submarine c .~"avmg a single coaxial circuit, ins .' '~:-;
