Distance Analysis of English Texts. I. Shakespeare's Sonnets.
Milan Kunz, Jurkovičova 13, 63800 Brno, The Czech Republic (kunzmilan@seznam.cz)
Summary
Distances between identical symbols in information strings (biological sequences, language texts, computer programs (*.exe files)) are described, with varying precision, by four distributions: exponential, Weibull, lognormal, and negative binomial. The correlations are sometimes highly significant. Here the distances between signs in Shakespeare's Sonnets are analyzed. Some distance tests revealed specific formal features of the Sonnets.
INTRODUCTION
Statistical properties of information distributions, especially their extreme skewness, raised the notion of their specificity (Haitun 1982a, b, c). Determining frequencies of symbols or words was a time consuming task suitable for shortening unbearably long time periods (Yule 1944).
These linguistic studies had some practical value, too: learning languages by starting with the most frequent words and phrases, and attributing texts to authors.
The distances between identical counted objects are the inverse function of their frequencies.
Distances between identical symbols exist in all information strings with any number of symbols or their k-tuples (words). Their manual counting was even more troublesome than counting words. Therefore such studies were made only for neighbor symbols where the local transitivity (frequencies of 2-tuples, e. g. ab) was studied for example by Harary and Paper (1957).
Time intervals between consecutive patent applications of patentees (Kunz 1987), and time intervals between consecutive publications (Kunz 1993) were determined for some small samples.
A stochastic generation of a string of m repetitions of an alphabet of n symbols is conventionally modeled by tossing a die with n sides.
A coin, a die with two sides, is the first nontrivial model. When a coin is tossed, sequences of different lengths appear in which one result prevails. The distribution of the lengths of sequences between successive events (head or tail) in all possible runs is known as the negative binomial distribution.
The negative binomial distribution is the inverse of the binomial distribution. It evaluates frequencies of distances between consecutive binary symbols in their strings.
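The coin model can be illustrated by a short simulation. This is only a sketch under stated assumptions: tosses are independent, the event has a fixed probability p, and the function name and the seed are hypothetical. In that case the distance d between consecutive events has probability p(1-p)^(d-1), which is the simplest (geometric) case of the negative binomial distribution, with mean distance 1/p.

```python
import random

def simulate_distances(p, n_symbols, seed=0):
    """Toss a biased coin n_symbols times and return the distances
    between consecutive occurrences of the event (assumed independent)."""
    rng = random.Random(seed)  # fixed seed, chosen only for reproducibility
    distances = []
    last = None
    for i in range(1, n_symbols + 1):
        if rng.random() < p:
            if last is not None:
                distances.append(i - last)
            last = i
    return distances

p = 0.5
dist = simulate_distances(p, 100_000)
mean_distance = sum(dist) / len(dist)          # close to 1 / p
share_of_ones = dist.count(1) / len(dist)      # close to p
print(round(mean_distance, 2), round(share_of_ones, 2))
```

Deviations of a real text from this memoryless picture, such as too few or too many doubled letters, are what the tests reported later pick up.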
This distribution was a statistical curiosity until some decades ago, since its evaluation was a rather difficult task (Irwing 1963), because its distribution function does not exist in a closed form. Now it is included in standard statistical software packages (STATGRAPHICS, Statistical Graphics Corporation).
The distances between symbols in a codone and English and Czech text were analyzed by Kunz and Rádl (1998). I then analyzed the distances between numerals in the first 10000 digits of the number e and the distances in the artificial codone based on the number e (Kunz 2000).
The purpose of this study is, first, to determine the statistical properties of Shakespeare's sonnets and to gain some knowledge of how the poet used the language, and then to find out whether distance analysis can reveal differences between prosaic and poetic texts.
Results
Shakespeare's sonnets were obtained as an ASCII file from the Internet (Project Gutenberg). Their numbering and dividing rows were stripped, as well as doubled or tripled spaces, using MS-Word. After these formal corrections, the file contains 93772 signs including spaces and 76092 signs without spaces, in 2155 lines and 17582 words.
This means that the mean length of a word is 4.327 signs (including apostrophes and punctuation marks), and the mean length of a verse is 43.51 signs with spaces, 35.31 signs without spaces, or 8.159 words.
After this, the distances were determined by a program written by Rádl. The string is first indexed with the position index i (i going from 1 to m) of each individual symbol, and then the differences between the position indexes of consecutive occurrences of the same symbol are determined. These differences are the topological distances between identical symbols. The sets of these values were evaluated by different statistical tests. The program counts all signs, including the spacebar, return, and punctuation marks.
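The indexing procedure just described can be sketched in a few lines. This is not Rádl's program, whose source is not given here; it is a minimal illustration, with a hypothetical function name, of taking differences between the 1-based positions of consecutive identical symbols.

```python
from collections import defaultdict

def symbol_distances(text):
    """Map each symbol to the list of distances between its
    consecutive occurrences (positions are indexed from 1)."""
    last_seen = {}
    distances = defaultdict(list)
    for i, ch in enumerate(text, start=1):
        if ch in last_seen:
            distances[ch].append(i - last_seen[ch])
        last_seen[ch] = i
    return dict(distances)

# Hypothetical sample string; every sign, including the space, is counted.
sample = "shall i compare thee"
d = symbol_distances(sample)
print(d["l"], d[" "], d["e"])
```

A doubled letter such as "ll" gives the distance 1, and symbols that occur only once contribute no distance at all.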
When the numbering of the sonnets is replaced by a sign, the length of the sonnets can be determined as the distances between these signs. The results are tabulated as follows:
Table 1
Length of sonnets.
Chisquare test. The normal distribution. Mean: 649.47, standard deviation 22.1.
Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare |
546 | 611.818 | 4 | 6.8 | 1.1575 |
611.818 | 620.909 | 8 | 8.3 | .0107 |
620.909 | 630.000 | 11 | 14.0 | .6496 |
630.000 | 639.091 | 23 | 20.0 | .4373 |
639.091 | 648.182 | 26 | 24.2 | .1268 |
648.182 | 657.273 | 32 | 24.8 | 2.0709 |
657.273 | 666.364 | 16 | 21.5 | 1.4148 |
666.364 | 675.455 | 20 | 15.8 | 1.1271 |
675.455 | 684.545 | 8 | 9.8 | .3296 |
684.545 | 649 | 6 | 8.7 | .8193 |
Chisquare = 8.14362 with 7 degrees of freedom. Significance level = 0.320101.
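The chisquare value of Table 1 can be recomputed from its observed and expected frequencies. The sketch below uses the rounded expected frequencies printed in the table, so it reproduces the reported 8.14362 only approximately; the count of degrees of freedom assumes that two parameters (mean and standard deviation) were estimated for the normal distribution.

```python
# Observed and expected frequencies copied from Table 1.
observed = [4, 8, 11, 23, 26, 32, 16, 20, 8, 6]
expected = [6.8, 8.3, 14.0, 20.0, 24.2, 24.8, 21.5, 15.8, 9.8, 8.7]

def chi_square(obs, exp):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

stat = chi_square(observed, expected)
# Assumption: classes - 1 - number of fitted parameters (2 for the normal).
dof = len(observed) - 1 - 2
print(round(stat, 2), dof, sum(observed))
```

The observed frequencies sum to 154, the number of sonnets.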
The length of sonnets is slightly bimodal: a valley exists between the narrow central peak and the second one. The differences in length in this area, which includes about one half of all sonnets, are about 9 signs in 14 verses, that is, about 2 words.
Then the distances between the individual letters were determined, first separately for the lower case and the upper case (when enough occurrences were available), then taken together.
From all available implemented distributions, only four distributions gave significant results, the exponential distribution, the Weibull distribution, the lognormal distribution, and the negative binomial distribution, as before.
The actual values (mean, standard deviation, skewness, kurtosis, distribution parameters etc.) were determined only in some instances.
The spacebar
The distances between consecutive spacebars that are greater than 1 determine the number of words of the length corresponding to the distance minus one. There are 17680 spacebars after corrections. This number differs somewhat from the direct count of words. The results are tabulated as follows. Cumulating the frequencies of shorter distances improved the fit in some cases, since below the limit the counts are scattered and the differences can balance themselves.
Table 2 The number of words of different lengths
Length | Number | Type of distribution, chisquare value |
1 | 547 | LN, 0.253 |
2 | 2870 | NB, 0, over 8 = 0.521 |
3 | 3212 | NB, 0, over 16 = 0.208 |
4 | 4012 | NB, 0.091 + 0.873 |
5 | 2714 | NB, 0, over 11 = 0.208 |
6 | 1744 | EX, 0.069 |
7 | 1073 | WE, 0.208 |
8 | 692 | NB, 0.415 |
9 | 394 | WE, 0.305 |
10 | 190 | NB, 0.540 |
11 | 69 | WE, 0.670 |
12 | 31 | EX, 0.591 |
13 | 15 | few data |
14 | 13 | |
15 | 2 | |
16 | 1 | |
17 | 1 | |
18 | 1 |
The distribution of word lengths seems to have a lognormal shape, but this guess was not tested.
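The relation between spacebar distances and word lengths used for Table 2 can be sketched as follows. The function name and the sample line are hypothetical, and the sketch makes a simplifying assumption: only words bounded by a space on both sides are seen, so words next to a line break would be missed.

```python
from collections import Counter

def word_length_counts(text, sep=" "):
    """Count word lengths as (distance between consecutive separators) - 1,
    skipping distance 1, which is a doubled separator."""
    positions = [i for i, ch in enumerate(text, start=1) if ch == sep]
    counts = Counter()
    for a, b in zip(positions, positions[1:]):
        gap = b - a
        if gap > 1:
            counts[gap - 1] += 1
    return counts

# Hypothetical sample with a leading and a trailing space,
# so that every word is enclosed by two separators.
line = " from fairest creatures we desire increase "
counts = word_length_counts(line)
print(sorted(counts.items()))
```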
Notes to some results:
The distribution of one letter words is poorly correlated by the lognormal distribution. There are two peaks and one valley. The main peak between distances 51-60 is slightly above the length of one verse (21 occurrences against 12 expected). It alone makes 45.9 % of the chi-square test value.
The distribution of two letter words is fairly correlated by the negative binomial distribution, similarly to the next three groups. The chisquare value is low due to their low repetition within distance one (292 occurrences against 466.2 expected). This alone makes 52.8 % of the chi-square test value. There is a long peak within distances 5-8 (1179 occurrences against 991.2 expected). This makes another 29.9 % of the low chi-square test value.
The distribution of three letter words is fair with the forced lower limit 11.
There are too many words of length four, so the program failed to perform the test. It was necessary to split the file into two equal parts and to test both parts separately. The results are shown in the following tables, as an example of the other results.
Table 3
The distribution of distances between words of length 4. The first half.
Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare |
at or below | 1.500 | 476 | 464.1 | .3045 |
1.500 | 2.500 | 349 | 356.9 | .1770 |
2.500 | 3.500 | 284 | 274.5 | .3268 |
3.500 | 4.500 | 207 | 211.1 | .0811 |
4.500 | 5.500 | 172 | 162.4 | .5691 |
5.500 | 6.500 | 131 | 124.9 | .2988 |
6.500 | 7.500 | 90 | 96.1 | .3815 |
7.500 | 8.500 | 68 | 73.9 | .4672 |
8.500 | 9.500 | 54 | 56.8 | .1397 |
9.500 | 10.500 | 32 | 43.7 | 3.1314 |
10.500 | 11.500 | 17 | 33.6 | 8.2070 |
11.500 | 12.500 | 22 | 25.8 | .5728 |
12.500 | 13.500 | 24 | 19.9 | .8541 |
13.500 | 14.500 | 25 | 15.3 | 6.1677 |
14.500 | 15.500 | 13 | 11.8 | .1310 |
15.500 | 16.500 | 11 | 9.0 | .4232 |
16.500 | 17.500 | 4 | 7.0 | 1.2559 |
17.500 | 18.500 | 8 | 5.3 | 1.3132 |
18.500 | 20.500 | 10 | 7.3 | 1.0175 |
20.500 | 38 | 13 | 10.5 | .5743 |
Chisquare = 26.3937 with 18 degrees of freedom. Significance level = 0.09109
The chisquare value is rather low. But inspecting its constituents, we see that there are only 49 occurrences of distances 10 and 11 against 77.3 expected, and 25 occurrences of distance 14 against 15.3 expected. These deviations involve only about one percent of all occurrences of words, but they make 66.3 % of the chisquare value.
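The share of single classes in the total chisquare, quoted here and in many later notes, is simple arithmetic on the last column of the table. A minimal sketch with the three classes of Table 3 named in the text (the dictionary keys are only labels chosen for this illustration):

```python
# Chisquare contributions of the classes for distances 10, 11 and 14 (Table 3).
contributions = {
    "9.5-10.5": 3.1314,
    "10.5-11.5": 8.2070,
    "13.5-14.5": 6.1677,
}
total_chisquare = 26.3937  # total reported under Table 3

share_percent = 100 * sum(contributions.values()) / total_chisquare
print(round(share_percent, 1))
```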
Table 4
The distribution of distances between words of length 4. The second half.
Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare |
at or below | 1.500 | 445 | 446.8 | .00743 |
1.500 | 2.500 | 350 | 347.1 | .02428 |
2.500 | 3.500 | 266 | 269.6 | .04885 |
3.500 | 4.500 | 224 | 209.5 | 1.01059 |
4.500 | 5.500 | 161 | 162.7 | .01785 |
5.500 | 6.500 | 127 | 126.4 | .00294 |
6.500 | 7.500 | 88 | 98.2 | 1.05587 |
7.500 | 8.500 | 74 | 76.3 | .06749 |
8.500 | 9.500 | 65 | 59.2 | .55874 |
9.500 | 10.500 | 47 | 46.0 | .02073 |
10.500 | 11.500 | 25 | 35.8 | 3.23328 |
11.500 | 12.500 | 31 | 27.8 | .37515 |
12.500 | 13.500 | 23 | 21.6 | .09429 |
13.500 | 14.500 | 16 | 16.8 | .03435 |
14.500 | 15.500 | 14 | 13.0 | .07401 |
15.500 | 16.500 | 16 | 10.1 | 3.42717 |
16.500 | 17.500 | 5 | 7.9 | 1.03815 |
17.500 | 18.500 | 4 | 6.1 | .72436 |
18.500 | 20.500 | 8 | 8.4 | .02124 |
20.500 | 22.500 | 4 | 5.1 | .23064 |
22.500 | 44 | 9 | 7.7 | .20717 |
Chisquare = 12.2746 with 19 degrees of freedom. Sig. level = 0.873556.
Chisquare is almost excellent. Inspecting its constituents, we see again that there are only 25 occurrences of distance 11 against 35.8 expected and 16 occurrences of distance 16 against 10.1 expected. This difference makes less than one percent of all occurrences.
When the consistency of both parts is tested by the two sample analysis, the null hypothesis is not rejected.
The distribution of five letter words is poor due to their low repeating within distance one (273 occurrences against 368.3 expected). This makes 36.8 % of the chi-square test value. There is a peak within the distance 2 (377 occurrences against 318.3 expected). This makes 16.1 % of the poor chi-square test value. Other deviations are minor.
The exponential distribution of six letter words is poor due to their low repeating within distance one (140 occurrences against 163.8 expected). This makes 17.4 % of the chi-square test value. There is a peak within distances 6-13 (639 occurrences against 588.6 expected). This makes 22.2 % of the poor chi-square test value.
The distribution of seven letter words, similarly to the following odd-length words, is described by the Weibull distribution. The correlation is poor. There exist two shortages, within distance one 27 (19 occurrences against 27.2 expected), and then within distances 80-92 (6 occurrences against 11.9 expected). Together they make 65.4 % of the chi-square test value.
The distribution of eight letter words, as well as ten letter words is again the negative binomial distribution. The shortage of these words within distances 133-146 (1 occurrence against 5.8 expected) contributes 34 % of the chi-square test value. The tail is longer (14 occurrences against 8.8 expected). This makes 26.1 % of the poor chi-square test value.
The distribution of nine letter words is fairly correlated. The shortage of these words within distances 133-146 (1 occurrence against 5.8 expected) contributes 34 % of the chi-square test value.
The distribution of ten letter words is fairly correlated, too. The shortage of these words within distances 51-75 (16 occurrences against 26.2 expected) contributes one half of the chi-square test value.
The distributions of longer words are well correlated, or the tests failed due to few data.
Distances between points and commas
Distances between punctuation marks show the length of sentences or clauses.
The distribution of the points gives an interesting result:
Table 5
The negative binomial distribution of distances between points.
Chisquare Test
Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare |
at or below | 35.250 | 32 | 98.2 | 44.5925 |
35.250 | 69.500 | 56 | 78.2 | 6.2818 |
69.500 | 103.750 | 126 | 64.2 | 59.3820 |
103.750 | 138.000 | 29 | 52.8 | 10.7259 |
138.000 | 172.250 | 80 | 43.4 | 30.8804 |
172.250 | 206.500 | 75 | 35.7 | 43.3816 |
206.500 | 240.750 | 13 | 29.3 | 9.0786 |
240.750 | 275.000 | 30 | 24.1 | 1.4485 |
275.000 | 309.250 | 11 | 19.8 | 3.9122 |
309.250 | 343.500 | 21 | 16.3 | 1.3718 |
343.500 | 377.750 | 27 | 13.4 | 13.8755 |
377.750 | 412.000 | 4 | 11.0 | 4.4493 |
412.000 | 446.250 | 4 | 9.0 | 2.8067 |
446.250 | 480.500 | 7 | 7.4 | .0245 |
480.500 | 514.750 | 3 | 6.1 | 1.5784 |
514.750 | 549.000 | 8 | 5.0 | 1.7739 |
549.000 | 617.500 | 8 | 7.5 | .0317 |
617.500 | 686.000 | 2 | 5.1 | 1.8629 |
686 | 734 | 1 | 10.6 | 8.6593 |
Chisquare = 246.117 with 17 d.f. Sig. level = 0
The mean distance is 174.62. This makes almost exactly four verses. The oscillations correspond to the number of verses. The comma is the most frequently used punctuation mark for dividing the verses:
Table 6
The negative binomial distribution of distances between commas.
Chisquare Test
Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare |
2 | 12.485 | 139 | 177.8 | 8.4545 |
12.485 | 23.970 | 364 | 328.5 | 3.8466 |
23.970 | 35.455 | 273 | 368.4 | 24.7261 |
35.455 | 46.939 | 500 | 289.7 | 152.7396 |
46.939 | 58.424 | 167 | 247.9 | 26.3870 |
58.424 | 69.909 | 123 | 169.1 | 12.5859 |
69.909 | 81.394 | 125 | 132.8 | .4609 |
81.394 | 92.879 | 134 | 85.4 | 27.6319 |
92.879 | 104.364 | 50 | 64.3 | 3.1786 |
104.364 | 115.848 | 22 | 40.0 | 8.1144 |
115.848 | 127.333 | 30 | 29.4 | .0134 |
127.333 | 138.818 | 30 | 17.9 | 8.1625 |
138.818 | 150.303 | 8 | 12.9 | 1.8772 |
150.303 | 161.788 | 5 | 7.8 | .9872 |
161.788 | 173.273 | 7 | 5.5 | .3881 |
173.273 | 268 | 10 | 9.6 | .0179 |
Chisquare = 279.572 with 14 d.f. Sig. level = 0
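Distances in signs can be converted to verses with the mean verse length given in the Results. The sketch below assumes the value 43.51 signs per verse (with spaces); whether the return sign should be added to each verse is not stated, so the conversion is approximate. The function name is hypothetical.

```python
MEAN_VERSE_LENGTH = 43.51  # signs per verse including spaces, from the Results

def distance_in_verses(distance, verse_length=MEAN_VERSE_LENGTH):
    """Express a distance measured in signs in units of mean verse length."""
    return distance / verse_length

points_mean = distance_in_verses(174.62)   # mean distance between points
# The most populated comma class of Table 6 (500 observed) spans 35.455-46.939 signs.
comma_peak_brackets_verse = 35.455 < MEAN_VERSE_LENGTH < 46.939
print(round(points_mean, 2), comma_peak_brackets_verse)
```

Under this assumption the mean distance between points is about four verses, and the strongest comma class contains the length of one verse.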
Distances between individual letters
The results for all letters are presented in the form of the table, where the frequencies of all symbols are given and the significance of the performed chi-square tests. Then the commentaries to all symbols of the alphabet are given.
Table 7 Survey of results
Notes:
EX = exponential distribution
WE = Weibull distribution
LN = lognormal distribution
NB = negative binomial distribution
* = the test was not made, because there were not enough data
Statistic = XX, the chi-square test
Symbol | Small | Capital | Both |
a | 4571, EX, 0 | 367, EX, 0.664 | 4938, EX, 0 |
b | 1085, EX, 0.036 | 144, EX, 0.809 | 1229, WE, 0.087 |
c | 1311, NB, 0.358 | 31, EX, 0.041 | 1342, EX, 0.522 |
d | 2724, EX, 0 | 38, EX, 0.190 | 2762, NB, 0 |
e | 9219, NB, 0 | 23, EX, 0.186 | 9242, NB, 0 |
f | 1556, NB, 0.263 | 107, EX, 0.316 | 1663, NB, 0.993 |
g | 1342, EX, 0.038 | 16* | 1358, NB, 0.091 |
h | 5002, EX, 0 | 65, EX, 0.867 | 5067, EX, 0 |
i | 4232, EX, 0 | 443, LN, 0.883 | 4675, EX, 0 |
j | 66, LN, 0.604 | 2* | 68, LN, 0.604 |
k | 547, EX, 0.011 | 6* | 552, EX, 0.011 |
l | 3033, EX, 0 | 58, EX, 0.237 | 3091, EX, 0 |
m | 2004, WE, 0.671 | 90, WE, 0.098 | 2094, WE, 0.670 |
n | 4445, NB, 0 | 73, EX, 0.826 | 4518, NB, 0 |
o | 5579, NB, 0 | 127, LN, 0.685 | 5706, NB, 0 |
p | 986, NB, 0 | 24* | 1010, NB, 0 |
q | 51, EX, 0.739 | 0 | 51, EX, 0.739 |
r | 4165, NB, 0 | 17, EX, 0.573 | 4182, NB, 0 |
s | 4846, NB, 0 | 141, LN, 0.672 | 4987, NB, 0 |
t | 6754, NB, 0 | 459, EX, 0.197 | 7213, NB, 0 |
u | 2299, EX, 0 | 21, EX, 0.785 | 2320, EX, 0.008 |
v | 924, EX, 0.008 | 1* | 925, EX, 0.008 |
w | 1645, EX, 0 | 252, EX, 0.630 | 1897, EX, 0 |
x | 60, EX, 0.926 | 0 | 60, EX, 0.926 |
y | 1951, LN, 0 | 34, EX, 0.470 | 1985, EX, 0 |
z | 20, EX, 0.931 | 0 | 20, EX, 0.931 |
The Weibull distribution is the best one only in the case of the letter m. The lognormal distribution correlates 4 cases: the capital letters I, O, and S, and both cases of the letter j (there are only two capital J). The exponential distribution is the best in most of the performed tests, and the negative binomial distribution in 9 cases.
The fit varied from the excellent, for example f with the chi-square value 0.994, to practically zero values, as at the most frequent vowels.
The differences between experimental and calculated values were usually great at the shortest distances (1 to 10). Raising the lower limit by pooling these distances increased the significance of the chi-square tests in some cases. Sometimes the significance improved dramatically, see below.
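The pooling of the shortest distances can be sketched as follows. This is only an illustration under an assumption: all distances up to a chosen lower limit are merged into a single class before the test. The rule actually applied by the statistical package is not stated, and the function name and the sample data are hypothetical.

```python
from collections import Counter

def pool_lower_distances(distances, lower_limit):
    """Count distances, merging every distance <= lower_limit into one class."""
    counts = Counter()
    for d in distances:
        key = d if d > lower_limit else f"<= {lower_limit}"
        counts[key] += 1
    return counts

# Hypothetical distances; the scattered short ones fall into one pooled class.
sample_distances = [1, 2, 2, 3, 11, 12, 12, 30]
pooled = pool_lower_distances(sample_distances, lower_limit=10)
print(dict(pooled))
```

The pooled class keeps the total number of observations unchanged, so deviations of opposite sign among the short distances can cancel.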
Now, the commentaries to the individual letters follow.
A
The frequency of the capital A allowed a separate test. The result with the exponential distribution is good, even if there is too high a repetition within one verse distances (90 occurrences against 75.8 expected). This makes 39.5 % of the chi-square test value.
The distribution of distances between lower case a and both cases (a + A) seems to be exponential, at least their tails fit. The distribution cannot be satisfactorily described by a simple function due to fluctuations of frequencies between odd and even distances. This feature can be documented by pooling the lower distances between both cases (a + A):
Table 8
Chisquare values of pooled lower distances
over | 1. part | 2. part | 3. part |
26 | 0.0229 | ||
27 | 0.1371 | 0.1187 | |
28 | 0.3054 | 0.0976 | |
29 | 0.0027 | 0.2432 | 0.0158 |
30 | 0.1481 | 0.1872 | |
31 | 0.0835 |
The fluctuations between odd and even distances are not constant.
Correlating observed frequencies of the same classes of distances of one part against those of the other part gives a fairly linear plot (due to the span of values in the logarithmic scale). The two sample analysis shows that the parts are from one whole.
1. part : 2. part | 1. part : 3. part | 2. part : 3. part | |
a | 0.3790 | 0.5597 | 0.7611 |
a+ A | 0.9379 | 0.6070 | 0.6530 |
B
The distribution of distances between upper case B is exponential. There is a peak within distances 438-655 which corresponds to 10-12 verses (90 occurrences against 70 expected). This makes nearly two thirds of the chi-square test value.
The exponential distribution of distances between lower case b is worsened by a shortage within distances 277-337 (9 occurrences against 22.4 expected). This makes more than one third of the chi-square test value. There are too few doubled bb (5 occurrences against 12.5 expected), which contributes 19.1 % of the chi-square test value, and too many occurrences (256 against 224.6 expected) within distances 32-62. This again makes 18.7 % of the chi-square test value. The combination of both cases changed the form of the distribution to Weibull. There are too few doubled bb (5 occurrences against 11.9 expected), which contributes 24.1 % of the chi-square test value. Other deviations have a minor weight.
C
The distribution of this letter varies between exponential (c chi-square value 0.3726, c + C chi-square value 0.5223) and negative binomial (c chi-square value 0.3580, c + C chi-square value 0.4790). In both cases, there is a shortage in the range 143-169 (34 occurrences against 48.8 expected, or 26 occurrences against 38.8 expected). This makes 29.3 % and 24.8 % of the chi-square test value, respectively.
D
Here, too, either the negative binomial or the exponential distribution was applicable, with many deviations. There are too few doubled dd (18 occurrences against 79.2 [d] or 81.4 [d + D] expected), which contributes 62.4 % or 63.6 % of the chi-square test value, respectively. The exponential distribution is fair over the limit 30 (the chi-square test value 0.168).
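The expected number of doubled letters can be checked by a back-of-envelope calculation. The sketch assumes a memoryless model in which a letter with n occurrences in a text of m signs follows itself with probability n/m; this is an assumption for illustration, not the fitting procedure of the statistical package, and the function name is hypothetical. The counts are taken from Table 7 and the text length from the Results.

```python
TEXT_LENGTH = 93772  # signs including spaces, from the Results

def expected_doubles(n_occurrences, text_length=TEXT_LENGTH):
    """Expected number of distance-one repetitions if each sign were
    drawn independently with probability n_occurrences / text_length."""
    p = n_occurrences / text_length
    return n_occurrences * p

d_lower = expected_doubles(2724)   # lower case d
d_both = expected_doubles(2762)    # d + D
print(round(d_lower, 1), round(d_both, 1))
```

Both estimates lie close to the expected values 79.2 and 81.4 quoted above, against only 18 doubled dd observed.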
E
There are relatively few E compared with the number of e.
The distribution of distances between lower case e and both cases (e + E) seems to be the negative binomial, at least their tails fit. The distribution cannot be satisfactorily described by a simple function due to fluctuations of frequencies between odd and even distances. This feature can be documented by pooling the lower distances between e:
Table 9
Chisquare values of pooled lower distances
over | 1. part | 2. part | 3. part | 4. part |
10 | 0.1558 | |||
11 | 0.2949 | 0.2949 | 0.1672 | |
12 | 0.4356 | 0.4356 | ||
13 | 0.5101 | 0.5101 | 0.2719 | |
14 | 0.3687 | 0.3687 | ||
18 | 0.2918 | |||
20 | 0.5060 | 0.5060 | ||
22 | 0.2500 | 0.2500 | ||
23 | 0.2442 | 0.2442 | ||
24 | 0.5547 | 0.5543 | ||
25 | 0.5541 | 0.5541 | ||
26 | 0.5541 | 0.5730 | ||
30 | 0.5134 |
The same for both cases (e + E):
Table 10
Chisquare values of pooled lower distances
over | 1. part | 2. part | 3. part | 4. part | 5. part |
9 | 0.1517 | 0.2414 | |||
10 | 0.2299 | 0.3077 | |||
11 | 0.2926 | 0.1859 | 0.1621 | ||
12 | 0.4118 | 0.2559 | 0.2267 | ||
13 | 0.5131 | 0.2456 | |||
15 | 0.5712 | ||||
17 | 0.4662 | ||||
19 | 0.6346 | ||||
20 | 0.5014 | ||||
25 | 0.5485 | ||||
28 | 0.3799 | ||||
29 | 0.6237 | ||||
30 | 0.5134 | 0.5175 |
The fluctuations between odd and even distances are not constant.
Correlating observed frequencies of the same classes of distances of one part against those of the other part gives a fairly linear plot (due to the span of values in the logarithmic scale). The two sample analysis shows that the parts cannot always be considered as parts of one whole.
Table 11
The two sample analysis of e distance tests
The differences between values in brackets are significant, the null hypothesis should be rejected.
2. part | 3. part | 4. part | |
1. part | [0.0007] | 0.9100 | 0.0559 |
2. part | [0.0006] | 0.1533 | |
3. part | [0.0460] |
The first part corresponds to the third part, and correlates badly with the second and the fourth part.
Table 12
The two sample analysis of e + E
The differences between values in brackets are significant, the null hypothesis should be rejected.
2. part | 3. part | 4. part | 5. part | |
1. part | 0.7108 | [0.0009] | 0.7964 | 0.0625 |
2. part | [0.0028] | 0.5228 | 0.1304 | |
3. part | [0.0004] | 0.1511 | ||
4. part | [0.0371] |
The first part corresponds to the second and fourth parts, and correlates badly with the third and fifth parts. As an example, the output of the test between parts 1 and 3 is given:
Table 13 The two sample analysis of e + E
WEE1.var1 | WEE3.var1 | Pooled | |
Sample Statistics: Number of Obs. | 1855 | 1840 | 3695 |
Average | 9.83881 | 10.8168 | 10.3258 |
Variance | 69.1978 | 90.626 | 79.8684 |
Std. Deviation | 8.31852 | 9.51977 | 8.93691 |
Median | 7 | 8 | 8 |
Conf. Interval For Diff. in Means: 95 Percent
Equal Vars.) Sample 1 - Sample 2 -1.55467 -0.4014 3693 D.F.
Unequal Vars.) Sample 1 - Sample 2 -1.55499 -0.401083 3619.9 D.F.
Ratio of Variances = 0.763554
Conf. Interval for Ratio of Variances: 0 Percent
Sample 1 Sample 2
Hypothesis Test for H0: Diff = 0 Computed t statistic = -3.32614
vs Alt: NE Sig. Level = 8.89197E-4
at Alpha = 0.05 so reject H0.
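The t statistic in the output above can be recomputed from the printed sample statistics. The sketch uses the standard pooled-variance (equal variances) two sample t test; the function name is hypothetical, and the small difference from the printed value comes from rounding of the inputs.

```python
import math

def pooled_t(n1, mean1, var1, n2, mean2, var2):
    """Two sample t statistic with pooled variance (equal variances assumed)."""
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / dof
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, dof, pooled_var

# Number of observations, average and variance of parts 1 and 3 (Table 13).
t, dof, pooled_var = pooled_t(1855, 9.83881, 69.1978, 1840, 10.8168, 90.626)
print(round(t, 3), dof, round(pooled_var, 3))
```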
Inspecting both tables, it seems that it should be possible to find, with a greater precision, the parts where the distribution of e differs.
F
It is not necessary to add any notes to the excellent fit of this letter. But it is rather interesting how the scattered F improved the distribution of the lower case f.
G
The distribution of the both case g + G is again more regular than the distribution of the lower case g. Over distances 25, the chi-square test value for g is 0.9004, for g + G 0.4215, only. Few occurrences of gg (8 occurrences against 19.7 expected) make about one third of the chi-square test value.
H
There is a peak within distances 40-49 (83 occurrences against 49.8 expected). This makes two fifths of the chi-square test value. The second, smaller peak lies within distances 21-30 (224 occurrences against 182.6 expected). This makes about one fifth of the chi-square test value. The lognormal distribution of this letter is shorter than expected (7 occurrences over 86 against 12.7 expected). This makes more than one fourth of the chi-square test value.
I
The lognormal distribution of the capital I is excellent, the significance of the chi-square test is 0.883. There is a peak within distances 20-26 verses (9 occurrences against 6 expected). This makes one third of the chi-square test value.
The lognormal distribution of both cases (i + I) is poor. There is a shortage of distances 14-24 (287 occurrences against 326.1 expected). This makes about one fifth of the chi-square test value. There is a peak within distances 47-58 (52 occurrences against 40.7 expected). This makes one sixth of the chi-square test value. The lognormal distribution of this letter is shorter than expected (9 occurrences over 96 against 20.5 expected). This makes about one third of the chi-square test value.
J
The lognormal distribution of this letter is without any greater deviations.
K
The exponential distribution of this letter is poorly correlated due to many repeatings in the second verses (distances 51-100, 122 occurrences against 104.3 expected). This makes 19.1 % of the chi-square test value. There is a peak within distances 12-13 verses (14 occurrences against 5.5 expected). This makes about one half of the chi-square test value.
L
The exponential distribution of this letter (lower case) is distorted by many double ll (248 and 280 occurrences against 48.7 or 47.9 expected, respectively in two parts). This makes more than 90 % of the total very high chi-square test value. The correlation of the first half is good, when the lower limit is set over 13:
Lower limit | 13 | 14 | 15 | 16 | 17 |
Chi-square | 0.4105 | 0.8272 | 0.6526 | 0.6787 | 0.6220 |
The correlation of the second half is good, only when the lower limit is set over 29:
Lower limit | 29 | 30 | 31 |
Chi-square | 0.1805 | 0.8129 | 0.6622 |
The exponential distribution of both case l + L is distorted by many double ll, which again makes more than 90 % of the total very high chi-square test value. The correlation of the first half is good, when the lower limit is set over 12:
Limit | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
Chi | 0.2553 | 0.7107 | 0.5649 | 0.66849 | 0.6532 | 0.3830 | 0.1721 | 0.1721 |
The correlation of the second half is good, only when the lower limit is set over 29:
Lower limit | 29 | 30 | 31 | 32 |
Chi-square | 0.2235 | 0.8180 | 0.6332 | 0.3332 |
The consistency of both parts is good (l1/l2) = 0.5984, (l + L1)/(l + L2) = 0.6922. Even the comparisons of (l1/l + L1) = 0.6785, (l2/l + L2) = 0.5816 are acceptable.
M
This letter is correlated best by the Weibull distribution. Even the negative binomial distribution is acceptable (the chi-square test value 0.582). The doubled mm fit excellently with the negative binomial distribution, but they form a peak in the Weibull distribution (42 occurrences against 30.7 expected). This makes more than one third of the chi-square test value. There is a shortage of distances 100-113 (both models, 50 occurrences against 63.9 (WE) or 61.5 (NB) expected). This makes one quarter of the chi-square test value (WE). The combination with the upper case M decreased the weights of these fluctuations somewhat.
N
The distribution of n and (n + N) was divided into two parts, which were slightly different. The negative binomial distribution of this letter is distorted by the shortage of doubled nn (only 11 % of expected, which makes two thirds of the chi-square test value). The chi-square test values improved by pooling lower distances differently: 1. part of n over 20 = 0.917, 2. part of n over 24 = 0.706, 1. part of (n + N) over 25 = 0.925, 2. part (n + N) over 24 = 0.938.
O
The lognormal distribution of capital O has a peak between 29-41 verses (20 occurrences against 14.3 expected). This makes 73.3 % of the chi-square test value. The distribution of o and (o + O) was again divided into two parts, which were slightly different. The negative binomial distribution of this letter is distorted by the shortage of doubled oo (only 51.8 % of expected, which makes one half of the chi-square test value). The chi-square test values improved by pooling lower distances differently: 1. part of o over 21 = 0.216, 2. part of o over 12 = 0.356, 1. part of (o + O) over 29 = 0.101, 2. part (o + O) over 11 = 0.472.
P
The negative binomial distribution of this letter is distorted by the surplus of doubled pp (44 occurrences against 10.4 expected), which makes 92.7 % of the chi-square test value. The chi-square test values improved by forced lower limit 10 to 0.296. The tail over 161 is almost perfect.
Q
The exponential distribution gives no opportunity for some comments.
R
The distribution of r and (r + R) was divided into two parts, which were slightly different. The negative binomial distribution of this letter is distorted by the shortage of doubled rr (only 24.8 % of expected, which makes about one half of the chi-square test value). The chi-square test values improved by pooling lower distances differently: 1. part of r over 30 = 0.241, 2. part of r over 42 = 0.671, 1. part of (r + R) over 30 = 0.186, 2. part (r + R) over 40 = 0.541.
S
The lognormal distribution of the capital S is good, the significance of the chi-square test is distorted by a peak within distances of 10-14 verses (24 occurrences against 19.4 expected). This contributes one quarter to the chi-square test value. The distribution of the lower case s and (s + S) was divided into two parts, which were slightly different. The negative binomial distribution of this letter is distorted significantly by the shortage of doubled ss only in the 1. part (91 occurrences against 125.7 expected). This makes 15.2 % of the chi-square test value. The sign s appears less than expected within distances 2-6 (801 occurrences against 874 expected), and more than expected within distances 6-12 (933 occurrences against 835.9 expected). The chi-square test value improved by pooling lower distances: 1. part of s over 30 = 0.167, 2. part of s over 30 = 0.660. Here a shortage of the distances 49-53 appeared (52 occurrences against 78.6 expected). This makes 75.9 % of the chi-square test value. The distribution of both cases (s + S) is similarly the negative binomial one, fair with pooled lower distances: 1. part of s over 20 = 0.521, 2. part of s over 30 = 0.525.
T
The exponential distribution of the capital T has a peak within distances 157-260 (103 occurrences against 84.9 expected). This contributes 23.7 % of the chi-square test value. The distribution is then shorter than expected (9 occurrences over 523 against 21.6 expected). This makes 46 % of the chi-square test value. The distribution of the lower case t is the negative binomial one. When divided into three parts, all parts show the shortage of doubled tt (11-17 % of expected), which makes 66-71 % of the chi-square test value. The first part fitted excellently over 13, the significance of the chi-square test is 0.919; the tail over 25 of the second part gives a fair chi-square test value 0.509, whereas the same tail of the third part has the chi-square test value only 0.0019.
The distribution of both (t + T) is the negative binomial one. All three parts show the shortage of doublets Tt + tt (9.8-15 % of expected), which makes 61.8-71.9 % of the chi-square test value. The first part fitted well over 12, the significance of the chi-square test is 0.798; the tail over 25 of the second part gives a fair chi-square test value 0.529, whereas the same tail of the third part has the chi-square test value only 0.003, and it is better correlated as the negative binomial distribution.
U
There are no doubled uu (0 occurrences against 55.7 expected). This makes 73.3 % of the chi-square test value of the exponential distribution. When the lower limit is set to 20, the chi-square test value improves to 0.263. The exponential distribution of both (u + U), divided into three parts, again correlates differently in each part. The first part fitted poorly over 20, where the significance of the chi-square test is 0.105. The tail over 20 of the second part gives a good chi-square test value of 0.747, whereas the tail over 13 of the third part has a chi-square test value of 0.512.
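The share quoted for the missing doublets can be checked by hand. The following is a minimal sketch, assuming the usual Pearson form in which each distance class contributes (observed - expected)^2 / expected to the statistic. The function name `bin_contribution` is hypothetical, and the total computed at the end is only implied by the quoted 73.3 % share; it is not a figure reported in the text.

```python
def bin_contribution(observed: float, expected: float) -> float:
    """Pearson chi-square term for one distance class: (O - E)^2 / E."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (observed - expected) ** 2 / expected


# Doubled uu: 0 occurrences against 55.7 expected (figures from the text).
uu_term = bin_contribution(0, 55.7)

# If this term is 73.3 % of the statistic, the implied total follows by division.
# The total is an inference for illustration, not a value reported in the paper.
implied_total = uu_term / 0.733

print(round(uu_term, 1), round(implied_total, 1))
```

When the observed count is zero, the term equals the expected count itself, so a single empty class with a large expectation can dominate the whole test.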
V
The exponential distribution has a shortage of distances up to 32 (223 occurrences against 256.8 expected). This contributes 39 % of the chi-square test value. Then there follows a peak within distances 33-76 (368 occurrences against 312.5 expected). This contributes 33.8 % of the chi-square test value. The tail over 50 fits well, with a chi-square test value of 0.402.
W
The exponential distribution of the upper case W gives a good fit. There is a peak within distances 113-224 (69 occurrences against 55.6 expected). This alone makes 45.4 % of the chi-square test value. The exponential distribution of the lower case w gives an acceptable fit over 10 (the chi-square test value is 0.350). There are no doubled ww (59 % of the chi-square test value). Combining (w + W) somewhat improved the fit; the absence of ww makes 61.9 % of the chi-square test value, since the sample is greater. There is a shortage of the distances 118-131 (about 3 verses, 28 occurrences against 45.1 expected). This makes 10.6 % of the chi-square test value. Over 15 the chi-square test value is 0.462.
X
The exponential distribution is almost perfect.
Y
The exponential distribution of the upper case Y gives a good fit. It somewhat improves the very poor lognormal distribution of the lower case y. There is a long peak within distances 74-117 (272 occurrences against 200.2 expected). This alone makes 48.2 % of the chi-square test value. The lognormal distribution of this letter is shorter than expected (25 occurrences over 205 against 52.6 expected). This makes 33.5 % of the chi-square test value.
Z
The exponential distribution is almost perfect.
I also tried to find the distribution of distances between words or groups of signs. As an example, I counted the frequencies of All (10 occurrences), all (121 occurrences) and *all (as in call, shall, etc., 209 occurrences). The distribution of distances between occurrences of the determiner all is the Weibull one; the chi-square test value is 0.448 with 121 occurrences.
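The counting of distances between occurrences of a word or a group of signs can be sketched as follows. This is only an illustration and not the program used for the study: the function name `occurrence_distances`, the use of regular expressions, the sample sentence, and the choice to measure the distance in characters between the starts of consecutive matches are all assumptions.

```python
import re
from typing import List


def occurrence_distances(text: str, pattern: str) -> List[int]:
    """Distances (in characters) between starts of consecutive matches of pattern."""
    starts = [m.start() for m in re.finditer(pattern, text)]
    return [b - a for a, b in zip(starts, starts[1:])]


# Invented sample sentence, not a line from the Sonnets.
sample = "all is well, and all shall call for all"

# Whole word 'all' only; matching is case-sensitive, so 'All' is a separate class.
word_all = occurrence_distances(sample, r"\ball\b")

# Words merely ending in 'all' (call, shall, ...), the '*all' group of the text.
suffix_all = occurrence_distances(sample, r"\b\w+all\b")

print(word_all, suffix_all)
```

Keeping the pattern as a parameter allows the same counting to be applied to single letters, to words, and to groups such as *all without changing the function.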
Discussion
The corrections (removing superfluous spaces) in some cases worsened the fits when compared with the preliminary tests made with the raw text, as if the writer's errors were a part of the scheme leading to some distribution of distances between symbols.
In verses, the repetition of some letters at certain intervals is intentional, since they form rhymes. In the statistics, however, this feature is blurred by occurrences of the same letters within verses. The verse structure of the text revealed itself in the use of points.
The excessive repetition of the capital A at distances of one verse (90 occurrences against 75.8 expected) is due mostly to sonnet number 66, where 11 verses start with "And". This initial "And" is repeated in other sonnets, too, and in combination with other initial A it makes the peak. This distortion must be considered intentional.
Some distributions of distances between consonants are highly regular, especially their tails, if the low distances inside words are pooled. They are described with varying precision by four distributions: exponential, Weibull, lognormal and negative binomial. Sometimes it is rather difficult to decide which distribution gives the better fit.
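The effect of pooling the low distances can be illustrated with a small sketch. It assumes that pooling means merging all distance classes up to a chosen limit into a single class before the Pearson sum is formed; the text does not spell out the exact procedure. The function names and the toy counts are invented for illustration and are not data from the Sonnets.

```python
from typing import Dict, List, Tuple


def pool_low_distances(counts: Dict[int, float], limit: int) -> List[Tuple[str, float]]:
    """Merge all distance classes <= limit into one class; keep the rest unchanged."""
    pooled = sum(c for d, c in counts.items() if d <= limit)
    tail = [(str(d), c) for d, c in sorted(counts.items()) if d > limit]
    return [(f"<={limit}", pooled)] + tail


def chi_square(observed: List[float], expected: List[float]) -> float:
    """Pearson statistic: sum of (O - E)^2 / E over matching classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Toy counts, invented for illustration only: no doublets at distance 1,
# an excess at distance 2, as happens with letters that are rarely doubled.
obs = {1: 0, 2: 30, 3: 22, 4: 12, 5: 8}
exp = {1: 24, 2: 18, 3: 14, 4: 10, 5: 6}

raw = chi_square([obs[d] for d in sorted(obs)], [exp[d] for d in sorted(exp)])

obs_p = pool_low_distances(obs, 2)
exp_p = pool_low_distances(exp, 2)
pooled = chi_square([c for _, c in obs_p], [c for _, c in exp_p])

print(round(raw, 2), round(pooled, 2))
```

Pooling removes the contribution of the missing doublets but also reduces the number of classes, so the significance has to be read with correspondingly fewer degrees of freedom.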
The splitting of the statistics of some frequent letters, which was a necessity due to the insufficient memory of the software used, showed new possibilities of the distance analysis.
Since there are statistically significant differences between the parts, it seems that the Sonnets are not a single work, but a collection of sonnets including different parts. No attempt was made to synchronize the statistical analysis with an analysis of subject and style.
If the results are compared with the published example (Kunz & Rádl, 1998) of a scientific paper, then some differences can be observed. In both cases, the vowels, except u, are poorly fitted. In both cases, the letter f gave a nearly ideal fit.
Consonants with the worse fit in the Sonnets are: b, c, d, g, h, k, l, v, and w. Consonants with the better fit in the Sonnets are: m, x, and z. Since there are only few data for study, it can only be speculated whether this is caused by the different use of these consonants in rhymes, which could produce the observed peaks and fluctuations.
It can be concluded that the analysis of distances between lexical units in a text could become a useful method of text analysis.
REFERENCES
Haitun, S. D. (1982a) Stationary Scientometric Distributions I: Different Approximations. Scientometrics, 4, 5 - 25.
Haitun, S. D. (1982b) Stationary Scientometric Distributions II: Non Gaussian Nature of Scientific Activities. Scientometrics, 4, 89 - 101.
Haitun, S. D. (1982c) Stationary Scientometric Distributions III: The Role of the Zipf Distribution. Scientometrics, 5, 375 - 395.
Harary, F.; Paper, H. H. (1957) Toward a General Calculus of Phonemic Distribution, Language, 33, 143 - 169.
Irwing, J. O. (1963) The Place of Mathematics in Medical and Biological Statistics, J. Royal. Statistical Soc. A, 126, 1 - 45.
Kunz, M. (1987) Time Spectra of Patent Information, Scientometrics, 11, 163 - 173.
Kunz, M. (1993) About metrics of bibliometrics, J. Chem. Inform. Comput. Sci., 33, 193 - 196.
Kunz, M.; Rádl, Z. (1998) Distribution of Distances in Information Strings, J. Chem. Inform. Comput. Sci., 38, 374 - 378.
Kunz, M. (2000) Number e as a model gene (atlas.cz.mujweb\veda\kunzmilan).
Yule, G. U. (1944) The Statistical Study of Literary Vocabulary, Cambridge University Press, Cambridge.