Distance Analysis of German Texts. III. Use of Words and Punctuation Marks by Johann Wolfgang Goethe.
Milan Kunz (kunzmilan@seznam.cz), March 2003
Abstract
Distances between words of the same length and the use of punctuation marks in four of Goethe’s works are studied. Besides the four statistical distributions found earlier (exponential, Weibull, lognormal, and negative binomial), some distances correlate with the Erlang distribution.
INTRODUCTION
This is a continuation of the study of the statistical properties of distances between identical symbols in information strings, here in Goethe’s works. The same technique was used as before. Surprisingly, some sets correlated best with the Erlang distribution, a fifth distribution in addition to the four found earlier to be suitable for English and German texts: the exponential distribution, the Weibull distribution, the lognormal distribution, and the negative binomial distribution. The Erlang distribution coincides with the exponential distribution if its parameter alpha is 1. In that case it has fewer degrees of freedom and therefore a lower chi-square test value.
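The coincidence of the Erlang and exponential distributions at alpha = 1 can be checked numerically. The following is only a minimal sketch, assuming the usual Erlang density with an integer shape alpha and a rate lam; the function names and the rate value 0.05 are hypothetical and are not taken from the study.

```python
import math

def erlang_pdf(x, alpha, lam):
    """Erlang density with integer shape alpha and rate lam (assumed parametrisation)."""
    return lam ** alpha * x ** (alpha - 1) * math.exp(-lam * x) / math.factorial(alpha - 1)

def exponential_pdf(x, lam):
    """Exponential density with rate lam."""
    return lam * math.exp(-lam * x)

lam = 0.05  # hypothetical rate, i.e. a mean distance of 20 symbols

# With alpha = 1 both densities agree at every distance checked.
same_at_alpha_1 = all(
    math.isclose(erlang_pdf(x, 1, lam), exponential_pdf(x, lam))
    for x in range(1, 200)
)
```

For alpha = 2 the two densities differ, which is why the Erlang fit is an additional candidate only when its shape parameter departs from 1.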
The actual values (mean, standard deviation, skewness, kurtosis, distribution parameters etc.) are of little interest, since they differ considerably.
Results
The distances between full stops (points) determine the lengths of sentences. Full stops are used as follows:
Work | Frequency | Distribution | Chi-square (significance)
Werther 1 | 662 | EX | 0, 0.1318 over 250
Werther 2 | 886 | WE | 0.0371
Faust 1 | 1349 | EX, ER | 0
Faust 2 | 2169 | ER | 0, 0.2597 over 300
Werther 1: There are too few short sentences of up to 61 letters (109 occurrences against 184.6 expected). This contributes 31.9 % of the chi-square test value. The peak within distances 123-243 (243 occurrences against 154.6 expected) represents 53.1 % of the chi-square test value.
Werther 2: Compared with the other three works, all deviations from the expected values are rather small.
Faust 1: There are too few short sentences of up to 41 letters (131 occurrences against 334.9 expected). This contributes 44.6 % of the chi-square test value. A peak follows within distances 41-181 (898 occurrences against 627.7 expected), making another 44.5 % of the chi-square test value.
Faust 2: There are too few short sentences of up to 50 letters (245 occurrences against 402.7 expected). This contributes 49.7 % of the chi-square test value. A peak follows within distances 61-150 (1268 occurrences against 1060 expected), making another 32.9 % of the chi-square test value.
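The distances between consecutive full stops discussed above can be obtained from the positions of the symbol in the text. The sketch below is an illustration only: it assumes that a distance is the difference between the positions of two consecutive occurrences, and the function name and the sample string are hypothetical.

```python
def symbol_distances(text, symbol):
    """Return the distances between consecutive occurrences of symbol in text."""
    positions = [i for i, ch in enumerate(text) if ch == symbol]
    # Each distance is the difference between two neighbouring positions.
    return [b - a for a, b in zip(positions, positions[1:])]

# Hypothetical sample; a real analysis would read the whole work.
sample = "Wie froh bin ich. Bester Freund. Was ist das Herz."
sentence_distances = symbol_distances(sample, ".")
```

The same function applies to commas, colons, semicolons and the other marks treated below by changing the second argument.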
The commas are used as follows. In both parts of Faust they are too frequent, and each set had to be divided into two parts:
Work | Frequency | Distribution | Chi-square (significance)
Werther 1 | 2017 | ER | 0, 0.5314 over 100
Werther 2 | 2461 | EX, ER | 0, EX 0.1023 over 100
Faust 1, first part | 1414 | EX, ER | 0, 0.5097 over 80
Faust 1, second part | 1385 | EX, ER | 0, 0.5006 over 70
Faust 2, first part | 2572 | EX, ER | 0
Faust 2, second part | 2595 | EX, ER | 0
Werther 1: There are too few commas within short distances of up to 12 letters (114 occurrences against 158.7 expected). This contributes 17.4 % of the chi-square test value. The peak within distances 13-35 (739 occurrences against 600.7 expected) represents 47.8 % of the chi-square test value.
Werther 2: In contrast to Werther 1, there are too many commas within short distances of up to 15 letters (277 occurrences against 233.4 expected). This contributes 16.1 % of the chi-square test value. Beyond that, there are many smaller fluctuations.
Faust 1: In the first part, there are too few short distances of up to 12 letters (109 occurrences against 282.1 expected). This contributes 55.8 % of the chi-square test value. A peak follows within distances 13-48 (696 occurrences against 520.3 expected), making another 32.8 % of the chi-square test value.
In the second part, there are too few short distances till 49 letters (316 occurrences against 399.3 expected). This contributes 45.6 % of the chi-square test value. A peak follows within distances 26-49 (582 occurrences against 472.7 expected), making another 39.6 % of the chi-square test value. The two parts differ; the t-test gave a significance of practically zero.
Faust 2: In the first part, there are too few short distances of up to 22 letters (715 occurrences against 859.4 expected). This contributes 41.2 % of the chi-square test value. A peak follows within distances 23-88 (1393 occurrences against 1179.4 expected), making another 39.6 % of the chi-square test value.
In the second part, there are too few short distances of up to 14 letters (310 occurrences against 627 expected). This contributes 54.8 % of the chi-square test value. A long peak follows within distances 15-96 (1932 occurrences against 1543.7 expected), making another 35.5 % of the chi-square test value. The two parts differ; the t-test gave a value of only 0.006.
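The comparison of two parts of one work by a t-test can be sketched as follows. The text does not say which variant of the t-test was used, so Welch's form for unequal variances is an assumption here, as is the normal approximation of the significance, which is reasonable only for samples as large as the sets above. The data and function names are hypothetical.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

def two_sided_p_normal(t):
    """Two-sided significance from the normal approximation (large samples only)."""
    return 2.0 * (1.0 - statistics.NormalDist().cdf(abs(t)))

# Hypothetical comma distances for the two parts of one work.
first_part = [12, 30, 25, 41, 18, 22, 35, 27, 16, 33]
second_part = [45, 52, 38, 61, 47, 55, 40, 58, 49, 44]
t_value = welch_t(first_part, second_part)
p_value = two_sided_p_normal(t_value)
```

A small significance, as reported for both parts of Faust, indicates that the mean distances of the two parts differ.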
Another punctuation mark, the colon, is used by Goethe as follows:
Work | Frequency | Distribution | Chi-square (significance)
Werther 1 | 70 | LN | 0.1424
Werther 2 | 69 | LN | 0.6645
Faust 1 | 1001 | LN | 0.0137
Faust 2 | 1287 | LN | 0.0299
Werther 1: There are too few colons within distances 1475-2211 (3 occurrences against 7.7 expected). This slight difference contributes 73.9 % of the chi-square test value.
Werther 2: The correlation is very good:
Chi-square test
Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chi-square
63 | 579.895 | 24 | 24.9 | 0.03235
579.895 | 1158.789 | 12 | 14.7 | 0.48108
1158.789 | 1737.684 | 8 | 8.2 | 0.00421
1737.684 | 2316.579 | 7 | 5.1 | 0.69371
2316.579 | 3474.368 | 7 | 5.9 | 0.20867
3474.368 | 5789.947 | 7 | 5.1 | 0.71024
5789.947 | 9405 | 4 | 5.2 | 0.25962
Chisquare = 2.38988 with 4 degrees of freedom. Significance level = 0.664458
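The total and the significance level quoted above can be reproduced from the tabulated contributions. The sketch below copies the per-class values from the table and uses the closed form of the chi-square upper tail for 4 degrees of freedom; the variable and function names are illustrative, and the shares are computed in the same sense as the percentages quoted in the notes.

```python
import math

# Per-class chi-square contributions copied from the Werther 2 colon table.
contributions = [0.03235, 0.48108, 0.00421, 0.69371, 0.20867, 0.71024, 0.25962]

chi_square = sum(contributions)

def chi2_sf_df4(x):
    """Upper-tail probability of the chi-square distribution with 4 degrees of freedom."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

significance = chi2_sf_df4(chi_square)

# Share of each class in the total chi-square value, in percent.
shares = [100.0 * c / chi_square for c in contributions]
```

Four degrees of freedom are consistent with seven classes, one constraint on the total, and two fitted lognormal parameters, although the text does not state this derivation explicitly.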
Faust 1: There are too few colons within distances 247-368 letters (71 occurrences against 95.2 expected). This contributes 35.8 % of the chi-square test value. The tail at distances over 1104 (28 occurrences against 16.8 expected) makes another 56.8 % of the chi-square test value.
Faust 2: The peak within distances 792-934 (26 occurrences against 14.4 expected) makes 39.3 % of the chi-square test value. The mean distances in both parts are different.
Another punctuation mark, the semicolon, is used as follows:
Work | Frequency | Distribution | Chi-square (significance)
Werther 1 | 118 | WE | 0.4679
Werther 2 | 138 | WE | 0.5370
Faust 1 | 292 | LN | 0.4603
Faust 2 | 746 | EX | 0.0080, 0.2284 over 200
Werther 1: The correlation is rather good considering the total number of semicolons.
Werther 2: There are too many semicolons within distances 1051-1400 letters (20 occurrences against 14.7 expected). This contributes 47.1 % of the chi-square test value. The shortage of semicolons within distances 1751-2100 letters (4 occurrences against 6.9 expected) makes another 29.9 % of the chi-square test value.
Faust 1: There are too few distances 229-456 (63 occurrences against 77.3 expected). This contributes 34.4 % of the chi-square test value. The peak within distances 685-912 (32 occurrences against 25.7 expected) makes another 19.9 % of the chi-square test value.
Faust 2: There are too few semicolons within short distances of up to 132 letters (184 occurrences against 220 expected). This contributes 23.2 % of the chi-square test value. A peak follows within distances 132-394 (322 occurrences against 263.1 expected), making another 52.6 % of the chi-square test value.
The exclamation mark is used as follows:
Work | Frequency | Distribution | Chi-square (significance)
Werther 1 | 213 | WE | 0.0203
Werther 2 | 407 | LN | 0.5082
Faust 1 | 1150 | LN | 0.0036
Faust 2 | 1022 | WE | 0, 0.2433 over 200
Werther 1: There are too few exclamation marks within distances 634-950 (7 occurrences against 18.3 expected). This contributes 59.7 % of the chi-square test value.
Werther 2: There are too many exclamation marks within distances 1700-2366 letters (10 occurrences against 5.6 expected). This contributes 67.2 % of the chi-square test value.
Faust 1: There are too many distances of 272-452 letters (122 occurrences against 90.2 expected). This contributes 35.3 % of the chi-square test value. There are five other larger disturbances.
Faust 2: There are too many distances 2-123 letters (501 occurrences against 435.3 expected). This contributes 25.7 % of the chi-square test value. Then follows a shortage within distances 124-368 (236 occurrences against 323 expected) making another 30.7 % of the chi-square test value.
The quotation marks are used as follows:
Work | Frequency | Distribution | Chi-square (significance)
Werther 1 | 296 | LN | 0.0043
Werther 2 | 314 | LN | 0.0031
Faust 1 | 49 | WE | 0.0001
Faust 2 | 20 | EX | 0.0001
Werther 1: There are too many quotation marks with distances over 1560 (17 occurrences against 8.9 expected). This contributes 90.4 % of the chi-square test value.
Werther 2: Again, there are too many quotation marks with distances over 1680 (20 occurrences against 11 expected). This contributes 56.7 % of the chi-square test value.
Faust 1 and Faust 2: Quotation marks are used only exceptionally, and the results of the tests are poor.
The space character
Distances greater than 1 between consecutive spaces determine the number of words whose length equals the distance minus one. There are 40931 spaces before any correction; some of them are used for formatting. The results of the tests are tabulated below. Cumulating the frequencies of shorter distances improved the fit in some cases, since below that limit the counts are scattered and the differences can balance each other.
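The relation between space distances and word lengths described above can be sketched as follows. This is only an illustration under the stated rule (word length = distance minus one, distance 1 = double space); the function name and the sample line are hypothetical, and punctuation attached to a word is counted as part of it.

```python
from collections import Counter

def word_length_counts(text):
    """Count words by length from the distances between consecutive spaces."""
    space_positions = [i for i, ch in enumerate(text) if ch == " "]
    distances = [b - a for a, b in zip(space_positions, space_positions[1:])]
    # A distance of 1 is a double space; larger distances enclose one word.
    return Counter(d - 1 for d in distances if d > 1)

# Hypothetical line, padded with a leading and trailing space,
# with one double space used as formatting.
sample = " Wer reitet so  rasch durch Nacht und Wind "
counts = word_length_counts(sample)
```

To study the repetition of words of one length, the same distance calculation would then be applied to the positions of those words within the sequence of all words.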
Table: The number of words of different length in Werther 1
Distance (word length + 1) | Number | Type of distribution, chi-square value
1 | 472 | EX, 0.9525 over 60
2 | 10 | no test
3 | 1089 | ER, 0.1692 over 26, EX 0.1901 over 30
4 (part 1) | 2138 | EX, 0
4 (part 2) | 2134 | EX, 0
5 | 2058 | EX, 0.1960 over 35
6 | 1887 | ER, 0, 0.5319 over 9
7 | 1590 | EX, 0.2301
8 | 1076 | EX, 0, (0.4948 over 10)
9 | 812 | ER, 0.8186
10 | 691 | EX, 0.2325, (0.6036 over 20)
11 | 599 | EX, 0.4286, (0.8193 over 30)
12 | 419 | EX, 0.7772
13 | 345 | WE, 0.1418
14 | 263 | EX, 0.4985
15 | 188 | EX, 0.9746
16 | 151 | EX, 0.9876
17 | 93 | EX, 0.0953
18 | 82 | EX, 0.2608
19 | 52 | EX, 0.4301
20 | 46 | EX, 0.6338
21 | 32 | EX, 0.4337
22 | 22 | EX, 0.5763
23 | 10 | no test
24 | 17 |
25 | 11 |
26 | 4 |
27 | 3 |
28 | 1 |
29 | 0 |
30 | 2 |
31 | 0 |
32 | 0 |
33 | 4 |
35 | 1 |
Words classified by their length are distributed mostly according to the exponential distribution. The Erlang distribution performed better at only three distances, and there with alpha equal to 1. At the other distances it has fewer degrees of freedom and thus a poorer chi-square test value. The Weibull distribution was applicable only at distance 13.
The distribution of word lengths seems to have a lognormal shape, but this guess was not tested.
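The expected counts quoted in the notes come from a fitted distribution. How they arise for the exponential case can be sketched as below. This is an assumption-laden illustration, not the procedure of the study: the rate is estimated as the reciprocal of the mean distance, the classes are half-open intervals, the continuous exponential distribution is applied to discrete distances, and the data, class limits and function names are hypothetical.

```python
import math

def exponential_expected(distances, class_limits):
    """Expected counts per class for an exponential fit with rate 1/mean."""
    n = len(distances)
    rate = n / sum(distances)  # maximum-likelihood estimate of the rate
    expected = []
    for lower, upper in zip(class_limits, class_limits[1:]):
        probability = math.exp(-rate * lower) - math.exp(-rate * upper)
        expected.append(n * probability)
    return expected

def observed_counts(distances, class_limits):
    """Observed counts per class; each class is the half-open interval [lower, upper)."""
    return [
        sum(1 for d in distances if lower <= d < upper)
        for lower, upper in zip(class_limits, class_limits[1:])
    ]

# Hypothetical distances between words of one length.
distances = [3, 8, 1, 15, 6, 22, 4, 9, 2, 31, 7, 12, 5, 18, 10]
limits = [0, 5, 10, 20, math.inf]
observed = observed_counts(distances, limits)
expected = exponential_expected(distances, limits)
```

Comparing `observed` with `expected` class by class gives the "occurrences against expected" pairs of the kind listed in the notes.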
Notes to some results:
There are no three consecutive spaces. This makes 39.9 % of the chi-square test value. Double spaces repeat most often within distances 18-26 (90 occurrences against 61.2 expected). This also makes 39.9 % of the chi-square test value. The tail fits almost perfectly.
1. letter words are too few.
2. letter words do not follow one another immediately as often as expected (37 occurrences against 70.4 expected). This makes 36.4 % of the chi-square test value. The second greatest disturbance is the peak within distances 10-14 (180 occurrences against 144.8 expected). This makes 19.6 % of the chi-square test value.
3. letter words. It was necessary to divide the three-letter words into two parts. In the first one, three-letter words do not repeat as often as the exponential distribution requires (503 occurrences against 700.2 expected). This makes 35.6 % of the chi-square test value. Three-letter words repeat more often than expected within distances 2-6 (1188 occurrences against 938.7 expected). This makes 42.9 % of the chi-square test value.
Similarly, in the second part, three-letter words do not repeat as often as the exponential distribution requires (499 occurrences against 688.4 expected). This makes 32.6 % of the chi-square test value. Three-letter words repeat more often than expected within distances 2-6 (1308 occurrences against 1050.0 expected). This makes 35.7 % of the chi-square test value. Both parts are similar.
4. letter words. The peak within distances 10-12 (202 occurrences against 144.2 expected) makes 47.3 % of the chi-square test value.
5. letter words. The Erlang distribution shape is disturbed by surplus of distances 6-8 (313 occurrences against 231.8 expected). This makes 45.8 % of the chi-square test value. The second peak within distances 13-15 (137 occurrences against 103 expected) makes 18.1 % of the chi-square test value.
6. letter words. The Erlang distribution is acceptable, too, the chi-square test value is 0.1788. There are many smaller disturbances.
7. letter words. The Erlang distribution is acceptable, too. The disturbances are similar to those of the five-letter words: there is a surplus of distances 5-8 (208 occurrences against 165.5 expected). This makes 22.1 % of the chi-square test value. The second peak within distances 12-15 (141 occurrences against 104 expected) makes 26.6 % of the chi-square test value.
8. letter words. The correlation is good either with the exponential distribution or the Erlang distribution. The greatest disturbance is a shortage of distances 30-35 (28 occurrences against 40.2 expected). This makes 36.8 % of the chi-square test value.
9. letter words are too many within distances 13-18 (117 occurrences against 88.5 expected). This makes 46.6 % of the chi-square test value.
10. letter words. The correlation is good, no greater individual disturbance.
11. letter words. The correlation is good either with the exponential distribution or the Erlang distribution. The greatest disturbance is a surplus of distances 30-43 (72 occurrences against 60.2 expected). This makes 36.2 % of the chi-square test value.
12. letter words. There exists a cumulation of distances 38-49 (52 occurrences against 39.1 expected) between two shortages. This makes 26.8 % of the chi-square test value.
13. letter words. The peak within distances 39-57 (46 occurrences against 36.9 expected) makes 26.7 % of the chi-square test value.
14. letter words. Almost perfect fit is disturbed by four surplus distances 125-149 (15 occurrences against 11.1 expected) making 64.4 % of the chi-square test value.
15. letter words. Almost perfect fit.
16. letter words are too many within distances 211-260 (13 occurrences against 6.9 expected). This makes 43.8 % of the chi-square test value. The shortage of distances 311-410 (2 occurrences against 6.8 expected) contributes another 27.5 %.
17. letter words distribution is worsened by the shortage of distances 276-350 (1 occurrence against 6.4 expected). This makes 70.5 % of the chi-square test value.
Longer words need no special comment.
Table: The number of words of different length in Werther 2
Distance (word length + 1) | Number | Type of distribution, chi-square value
1 | 826 | LN, 0.010
2 | 36 | EX, 0.0016
3 | 1519 | EX, 0.0906
4 (part 1) | 2473 | NB, 0, 0.0789 over 6
4 (part 2) | 2442 | NB, 0.1218
5 | 2406 | EX, 0, 0.2977 over 19
6 | 2227 | EX, 0, 0.7886 over 23
7 | 2115 | EX, 0.0163, 0.8818 over 28
8 | 1318 | EX, 0, 0.4475 over 21
9 | 1078 | EX, 0.0134, 0.2713 over 10
10 | 803 | EX, 0.1163, 0.6598 over 10
11 | 663 | EX, 0.1504, 0.4166 over 35
12 | 508 | EX, 0.5634
13 | 374 | EX, 0.6435
14 | 307 | EX, 0.3717
15 | 224 | EX, 0.3327
16 | 181 | EX, 0.3950
17 | 97 | EX, 0.6397
18 | 94 | EX, 0.6265
19 | 70 | EX, 0.9034
20 | 54 | EX, 0.8483
21 | 37 | EX, 0.6048
22 | 19 | EX, 0.5360
23 | 27 | WE, 0.4955
24 | 20 | EX, 0.7692
25 | 12 |
26 | 5 |
27 | 3 |
28 | 3 |
29 | 3 |
30 | 1 |
31 | 3 |
32 | 0 |
33 | 0 |
34 | 1 |
35 | 2 |
Words classified by their length are distributed mostly according to the exponential distribution. The Weibull distribution, the negative binomial distribution and the lognormal distribution each performed better in only one case.
The distribution of word lengths seems to have a lognormal shape, but this guess was not tested.
Notes to some results:
Double spaces repeat more often than expected within distances 30-49 (141 occurrences against 110.5 expected). This makes 34.0 % of the chi-square test value.
1. letter words are too few for commenting.
2. letter words follow too often at distances 9-13 (271 occurrences against 234.2 expected). This makes 27.8 % of the chi-square test value. The second greatest disturbance is the shortage of distances 19-27 (165 occurrences against 195 expected). This makes another 23 % of the chi-square test value.
3. letter words. It was necessary to divide the three-letter words into two parts. In the first part, three-letter words do not repeat as often as the exponential distribution requires (579 occurrences against 648.4 expected). This makes 15.4 % of the chi-square test value. On the contrary, they repeat more often than expected within distance 4 (314 occurrences against 260.4 expected). This makes 22.4 % of the chi-square test value.
Similarly, in the second part, three-letter words do not repeat as often as the exponential distribution requires (496 occurrences against 566.4 expected). This makes 33.3 % of the chi-square test value. They again repeat more often than expected within distance 3 (372 occurrences against 334.1 expected). This makes 16.3 % of the chi-square test value. Both parts are different according to the t-test.
4. letter words. There are too few distances 2-3 (481 occurrences against 615.4 expected). This makes 35.7 % of the chi-square test value. The peak within distances 4-6 follows immediately (525 occurrences against 437.8 expected). It makes 21.1 % of the chi-square test value.
5. letter words. The exponential distribution shape is disturbed by a shortage of distances 2-3 (391 occurrences against 534.4 expected). This makes 34.2 % of the chi-square test value. The peak within distances 4-12 (1064 occurrences against 886.4 expected) makes 35.6 % of the chi-square test value.
6. letter words. There is a shortage of distances 18-20 (79 occurrences against 100.2 expected). This makes 15.5 % of the chi-square test value. The greatest disturbance is a peak of distances 30-33 (40 occurrences against 25.8 expected). This makes 27.2 % of the chi-square test value. This peak is cumulated from distances 27-29 (23 occurrences against 36.2 expected, 16.6 % of the chi-square test value) as well as from distances 34-36 (12 occurrences against 18.3 expected, 7.5 % of the chi-square test value).
7. letter words. The exponential distribution has a step, there is a surplus of distances 10-14 (227 occurrences against 173.4 expected) which makes 30.1 % of the chi-square test value. Immediately, there are too few distances 15-18 (95 occurrences against 130.1 expected). This makes 17.2 % of the chi-square test value.
8. letter words. There is a surplus of distances 10-14 (172 occurrences against 135.3 expected) which makes 29.4 % of the chi-square test value. Immediately, there are too few distances 15-18 (89 occurrences against 106.2 expected). This makes 8.3 % of the chi-square test value.
9. letter words. They are too few within distances 25-35 (70 occurrences against 94.9 expected). This makes 29.8 % of the chi-square test value.
10. letter words. The greatest disturbance is a shortage of distances 56-68 (27 occurrences against 36.4 expected). This makes 25 % of the chi-square test value.
11. letter words. They repeat less often than expected (3 occurrences against 7 expected). This makes 23.4 % of the chi-square test value. There are too few distances 104-117 (7 occurrences against 12.9 expected). This makes 27.6 % of the chi-square test value. The third disturbance is a surplus of distances 162-190 (12 occurrences against 7.6 expected). This makes 26.7 % of the chi-square test value.
12. letter words. The surplus of distances 136-154 (17 occurrences against 9.7 expected) makes 49.9 % of the chi-square test value.
13. letter words. The shortage of distances 41-59 (16 occurrences against 28.2 expected) makes 42.3 % of the chi-square test value.
14. letter words. The greatest disturbance is a shortage of distances 168-201 (4 occurrences against 10.2 expected). This makes 44.7 % of the chi-square test value. The surplus of distances 236-335 (19 occurrences against 12.5 expected) makes 40.4 % of the chi-square test value.
15. letter words. The shortage of distances 381-475 (3 occurrences against 5.6 expected) makes 47.7 % of the chi-square test value.
16. letter words. The shortage of distances 166-220 (5 occurrences against 9.8 expected) makes 44.8 % of the chi-square test value.
17. letter words. Almost perfect fit.
18. letter words distribution is worsened by the surplus of distances 190-284 (12 occurrences against 7.3 expected). This makes 70.7 % of the chi-square test value.
19-24. letter words. Without commentary.
Table: The number of words of different length in Faust 1
Distance (word length + 1) | Number | Type of distribution, chi-square value
1 | 517 | WE, 0.081
2 | 9 | no test
3 | 1651 | EX, 0.6693 over 5
4 (part 1) | 1975 | LN, 0, 0.1409 over 14, ER, 0
4 (part 2) | 1977 | EX, 0, 0.0790 over 15, ER, 0
4 (part 3) | 1970 | EX, 0, 0.0486 over 6, ER, 0
5 (part 1) | 1715 | EX, 0.4765 over 8
5 (part 2) | 1713 | EX, 0.3503 over 12
6 (part 1) | 1516 | EX, 0, 0.2243 over 22
6 (part 2) | 1521 | EX, 0, 0.4584 over 25
7 (part 1) | 1258 | EX, 0
7 (part 2) | 1243 | EX, 0.0611 over 12
8 | 1493 | EX, 0, 0.0916 over 36
9 | 1327 | EX, 0.8304
10 | 975 | EX, 0.4978
11 | 886 | WE, 0.3004
12 | 687 | WE, 0.4225
13 | 543 | WE, 0.7522
14 | 459 | EX, 0.6260
15 | 352 | EX, 0.6119
16 | 311 | EX, 0.1383
17 | 215 | EX, 0.3247
18 | 195 | WE, 0.2284
19 | 136 | WE, 0.3370
20 | 106 | EX, 0.2361
21 | 89 | WE, 0.5363
22 | 75 | WE, 0.3443
23 | 68 | WE, 0.8442
24 | 64 | LN, 0.2896
25 | 61 | EX, 0.6823
26 | 55 | EX, 0.9420
27 | 32 | EX, 0.6312
28 | 24 | no test
29 | 28 | LN, 0.3247
30 | 23 | EX, 0.4197
31 | 11 |
32 | 11 |
33 | 7 |
Words classified by their length are distributed mostly according to the exponential distribution (16 distances) and according to the Weibull distribution at 9 distances. The lognormal distribution was suitable in three cases. The Erlang distribution with the parameter alpha = 2 was sometimes applicable but performed worse than the exponential distribution.
The distribution of word lengths seems to have a lognormal shape, but this guess was not tested. There are 19 words longer than 35 letters.
Notes to some results:
Double spaces repeat more often than expected within distances 222-276 (12 occurrences against 6.6 expected). This makes 32.2 % of the chi-square test value.
1. letter words. No test.
2. letter words. They repeat less often than expected (56 occurrences against 104.2 expected). This makes 53.1 % of the chi-square test value. The second greatest disturbance is the peak of distances 39-43 (45 occurrences against 31.4 expected). This makes 14.1 % of the chi-square test value.
3. letter words. It was necessary to divide the three-letter words into three parts. In the first part, three-letter words repeat more often than expected (379 occurrences against 128.3 expected). This makes 75 % of the chi-square test value. On the contrary, distances 2-4 occur less often than expected (874 occurrences against 1176.3 expected). This makes 12.6 % of the chi-square test value. In the second part, three-letter words do not repeat as often as the exponential distribution requires (441 occurrences against 596.4 expected). This makes 71 % of the chi-square test value. Three-letter words repeat more often than expected within distances 2-5 (1063 occurrences against 850.6 expected). This makes 20.1 % of the chi-square test value. Similarly, in the third part, three-letter words do not repeat as often as the exponential distribution requires (414 occurrences against 577.4 expected). This makes 38.5 % of the chi-square test value. Three-letter words repeat more often than expected at distance 3 (306 occurrences against 228.1 expected). This makes 22.1 % of the chi-square test value. The first and third parts are different according to the t-test.
4. letter words. It was necessary to divide the four-letter words into two parts. In the first part, four-letter words repeat less often than expected (223 occurrences against 311.7 expected). This makes 38.1 % of the chi-square test value. Other deviations have a weight of less than 5 % of the chi-square test value. In the second part, four-letter words repeat as often as the exponential distribution requires. They repeat more often than expected within distances 6-8 (274 occurrences against 215.7 expected). This makes 32.9 % of the chi-square test value. The second peak within distances 13-15 (112 occurrences against 82.6 expected) makes 21.2 % of the chi-square test value. The parts are similar according to the t-test.
5. letter words. It was necessary to divide the five-letter words into two parts. In the first part, five-letter words form a peak within distances 6-8 (247 occurrences against 188.7 expected), which makes 29.2 % of the chi-square test value. They repeat less often than expected within distances 21-22 (11 occurrences against 33.9 expected). This makes 25.1 % of the chi-square test value. In the second part, five-letter words repeat more often than expected within distances 6-8 (277 occurrences against 189 expected). This makes 54.1 % of the chi-square test value. Other deviations have a weight of at most 10 % of the chi-square test value. Both parts are similar.
6. letter words. It was necessary to divide the six-letter words into two parts. In the first part, six-letter words form a peak within distances 5-8 (316 occurrences against 237.9 expected), which makes 48.6 % of the chi-square test value. They repeat more often than expected within distances 44-50 (17 occurrences against 8.3 expected). This makes 17.3 % of the chi-square test value. In the second part, six-letter words repeat less often than expected within distances 8-11 (133 occurrences against 164.8 expected). This makes 21.4 % of the chi-square test value. The tail over 54 is longer (13 occurrences against 6.8 expected). This makes 20.1 % of the chi-square test value. Both parts are similar.
7. letter words. Seven letter words repeat more often than expected (107 occurrences against 85.5 expected). This makes 14.5 % of the chi-square test value. The tail over 96 is longer (11 occurrences against 5 expected). This makes 19 % of the chi-square test value.
8. letter words. There is a shortage of distances 30-35 (52 occurrences against 63.4 expected) which makes 17.9 % of the chi-square test value.
9. letter words. They repeat more often than expected (49 occurrences against 36.8 expected). This makes 23.2 % of the chi-square test value. Then, they repeat more often than expected within distances 102-114 (12 occurrences against 7.4 expected). This makes 16.1 % of the chi-square test value. The tail over 133 is shorter (2 occurrences against 5.8 expected). This makes 14.1 % of the chi-square test value.
10. letter words. The greatest disturbance is a shortage of distances 40-62 (103 occurrences against 132.5 expected). This makes 47.5 % of the chi-square test value.
11. letter words. There are five deviations from the Weibull shape with a weight greater than 10 %.
12. letter words. The shortage of distances 29-41 (60 occurrences against 74.3 expected) makes 32.5 % of the chi-square test value.
13. letter words. The slight shortage of distance 1 (5 occurrences against 8.3 expected) makes 15.9 % of the chi-square test value. Another shortage of distances 23-43 (84 occurrences against 97.7 expected) makes 23.9 % of the chi-square test value.
14. letter words. There are five deviations with the weight greater than 10 %.
15. letter words. The surplus of distances 43-66 (61 occurrences against 47.5 expected) makes 25.9 % of the chi-square test value. The shortage of distances 67-138 (59 occurrences against 81.1 expected) makes 41.3 % of the chi-square test value.
16. letter words. The peak within distances 284-324 (12 occurrences against 5.8 expected) makes 40.1 % of the chi-square test value.
17. letter words. Their distribution is worsened by a shortage of repetitions at distances up to 3 (3 occurrences against 8.5 expected). This makes 34 % of the chi-square test value.
18. letter words distribution is worsened by the shortage of distances 89-131 (10 occurrences against 16.9 expected). This makes 31.2 % of the chi-square test value.
19. letter words. Their distribution is worsened by a surplus of repetitions at distances up to 67 (34 occurrences against 26.4 expected). This makes 27.3 % of the chi-square test value. The shortage of distances 68-134 (11 occurrences against 19.6 expected) makes 46.7 % of the chi-square test value.
20. letter words. The greatest disturbance is a shortage of distances 276-350 (2 occurrences against 6.4 expected). This makes 74.5 % of the chi-square test value.
Longer words are without commentary.
Table: The number of words of different length in Faust 2
Distance (word length + 1) | Number | Type of distribution, chi-square value
1 | 564 | WE, 0.0560
2 | 21 | EX, 0.5460
3 | 2440 | EX, 0, 0.8358 over 9
4 (part 1) | 1937 | EX, ER, 0
4 (part 2) | 1936 | EX, ER, 0
4 (part 3) | 1942 | EX, ER, 0
4 (part 4) | 1947 | EX, ER, 0
5 (part 1) | 2193 | EX, 0, 0.5716 over 8
5 (part 2) | 2154 | EX, 0.3318
6 (part 1) | 1994 | WE, 0, 0.5323 over 36
6 (part 2) | 1983 | EX, 0, 0.2209 over 25
7 (part 1) | 1747 | EX, 0.0022, 0.7718 over 9
7 (part 2) | 1749 | EX, 0, 0.2069 over 20
8 | 2450 | EX, 0.0001, 0.2328 over 30
9 | 1986 | EX, 0, 0.1125 over 20
10 | 1697 | EX, 0.0855
11 | 1474 | EX, 0.0967
12 | 1153 | EX, 0.1290
13 | 944 | EX, 0.2676
14 | 830 | EX, 0.2952
15 | 649 | WE, 0.6559
16 | 513 | WE, 0.6186
17 | 460 | WE, 0.2818
18 | 355 | WE, 0.3906
19 | 294 | WE, 0.9199
20 | 190 | WE, 0.4318
21 | 153 | WE, 0.9954
22 | 120 | EX, 0.2064
23 | 126 | EX, 0.2746
24 | 102 | EX, 0.9203
25 | 86 | EX, 0.2004
26 | 80 | LN, 0.5909
27 | 48 | EX, 0.8261
28 | 42 | EX, 0.8938
29 | 32 | EX, 0.5481
30 | 33 | LN, 0.6155
31 | 23 | EX, 0.6014
32 | 22 | EX, 0.0515
33 and over | 32 | LN, 0.1380
Words classified by their length are distributed mostly according to the exponential distribution (22 distances) and according to the Weibull distribution at 8 distances. The lognormal distribution was suitable in three cases. The Erlang distribution with the parameter alpha = 2 was sometimes applicable but performed worse than the exponential distribution.
The distribution of word lengths seems to have a lognormal shape, but this guess was not tested. There are 32 words longer than 32 letters.
Notes to some results:
Double spaces repeat less often than expected within distances 132-160 (14 occurrences against 22.3 expected). This makes 18.5 % of the chi-square test value. The tail over 332 is longer than expected (16 occurrences against 9.4 expected). This makes 28.4 % of the chi-square test value.
1. letter words. No commentary.
2. letter words. They repeat less often than expected (66 occurrences against 157.8 expected). This makes 59.6 % of the chi-square test value. A peak follows at distances 2-12 (1313 occurrences against 1196.7 expected). This makes 18.6 % of the chi-square test value.
3. letter words. It was necessary to divide the three-letter words into four parts. In the first part, three-letter words repeat less often than expected (322 occurrences against 529.4 expected). This makes 43.9 % of the chi-square test value. On the contrary, distances 2-7 occur more often than expected (1279 occurrences against 1015.1 expected). This makes 39.2 % of the chi-square test value. In the second part, three-letter words do not repeat as often as the exponential distribution requires (337 occurrences against 508.9 expected). This makes 35.7 % of the chi-square test value. Immediately afterwards, three-letter words repeat more often than expected, especially within distances 4-6 (602 occurrences against 433.9 expected). This makes 41.2 % of the chi-square test value. Similarly, in the third part, three-letter words do not repeat as often as the exponential distribution requires (324 occurrences against 536.6 expected). This makes 40 % of the chi-square test value. Three-letter words repeat more often than expected within distances 2-6 (1088 occurrences against 812.1 expected). This makes 49 % of the chi-square test value. In the fourth part, three-letter words repeat less often than expected (350 occurrences against 545.4 expected). This makes 39.1 % of the chi-square test value. On the contrary, distances 2-7 occur more often than expected (1192 occurrences against 932.9 expected). This makes 47.7 % of the chi-square test value. The second part differs from the other ones, significantly so from the third and fourth ones according to the t-test.
4. letter words. It was necessary to divide four letter words into two parts. In the first part, four letter words repeat less often than expected (262 occurrences against 384.5 expected). This makes 70.5 % of the chi-square test value. Four letter words repeat more often than expected within distances 2-7 (1138 occurrences against 972 expected). This makes 16 % of the chi-square test value. In the second part, four letter words repeat nearly as often as the exponential distribution requires. They repeat more often than expected within distances 2-5 (769 occurrences against 703.3 expected). This makes 45.4 % of the chi-square test value. The parts are different according to the t-test.
5. letter words. It was necessary to divide five letter words into two parts. In the first part, five letter words repeat less often than expected till the distance 3 (573 occurrences against 716 expected). This makes 28.5 % of the chi-square test value. The peak within distances 4-12 (985 occurrences against 803.4 expected) makes 46 % of the chi-square test value. In the second part, five letter words repeat less often than expected till the distance 3 (563 occurrences against 660.8 expected). This makes 24 % of the chi-square test value. The peak within distances 4-9 (725 occurrences against 598.9 expected) makes 51.3 % of the chi-square test value. Both parts are different according to the t-test.
6. letter words. It was necessary to divide six letter words into two parts. In the first part, six letter words form a peak within distances 11-14 (209 occurrences against 166.6 expected) which makes 29.3 % of the chi-square test value. In the second part, six letter words repeat more often than expected within distances 8-10 (239 occurrences against 162.3 expected). This makes 49.3 % of the chi-square test value. Both parts are different according to the t-test.
7. letter words. Seven letter words repeat less often than expected within distances 2-6 (550 occurrences against 649.7 expected). This makes 34.2 % of the chi-square test value.
8. letter words. There is a peak of distances 18-22 (211 occurrences against 154 expected) which makes 38.7 % of the chi-square test value.
9. letter words. They repeat less often than expected till distance 7 (471 occurrences against 523.7 expected). This makes 23.1 % of the chi-square test value. The peak within distances 8-21 (609 occurrences against 557.8 expected) makes 20.8 % of the chi-square test value.
10. letter words. The greatest disturbance is a shortage of distances 102-107 (0 occurrences against 5.2 expected). This makes 18.1 % of the chi-square test value.
11. letter words. The peak of distances 124-135 (14 occurrences against 7.5 expected) makes 31.9 % of the chi-square test value.
12. letter words. The surplus of distances 124-135 (14 occurrences against 7.5 expected) makes 31.9 % of the chi-square test value.
13. letter words. The surplus of distances 11-28 (313 occurrences against 276.2 expected) makes 24.1 % of the chi-square test value. The shortage of distances 39-47 (58 occurrences against 75.1 expected) makes 18.4 % of the chi-square test value.
14. letter words. There are six deviations with the weight greater than 10 %.
15. letter words. The surplus of distances 198-216 (9 occurrences against 5.8 expected) makes 20.2 % of the chi-square test value.
16. letter words. There are four deviations with the weight greater than 10 %.
17. letter words. Their distribution is worsened by shortage of their repeating (2 occurrences against 6.4 expected). This makes 25.4 % of the chi-square test value.
18. letter words. Their distribution is worsened by the shortage of distances 257-329 (7 occurrences against 14.9 expected). This makes 52.2 % of the chi-square test value.
19. letter words. Almost perfect fit.
20. letter words. The greatest disturbance is a shortage of distances 357-421 (3 occurrences against 7.8 expected). This makes 42.1 % of the chi-square test value.
21. letter words. Almost perfect fit.
Longer words are left without commentary, since their deviations from the expected values involve only a few occurrences.
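The percentages quoted in the items above can be illustrated with a minimal sketch. It assumes that the chi-square test value is the usual Pearson sum of (observed - expected)^2 / expected over the distance bins, which the text does not state explicitly. The function name chi_square_shares and the bins in the usage note are hypothetical.

```python
def chi_square_shares(observed, expected):
    """Return the Pearson chi-square sum and each bin's percentage share of it.

    Assumption: the statistic is sum((O - E)**2 / E) over the bins.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    # Contribution of each bin to the statistic.
    terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    total = sum(terms)
    if total == 0:
        # Perfect fit: no bin contributes anything.
        return 0.0, [0.0 for _ in terms]
    shares = [100.0 * t / total for t in terms]
    return total, shares
```

Under this assumption, a single bin with 0 occurrences against 5.2 expected contributes (0 - 5.2)^2 / 5.2 = 5.2 to the sum. Calling chi_square_shares([8, 12], [10.0, 10.0]) on two made-up bins gives a sum of 0.8, with each bin carrying half of it.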
Discussion
The insufficient capacity of the software used for long lists forced the splitting of too frequent signs. The splitting was made before determining distances. Surprisingly, the obtained parts are not always comparable, since the split parts contain different numbers of signs. This leads to different mean distances between them.
Some distributions of distances between punctuation marks are highly regular, especially their tails, if the low distances inside words are pooled. They are described, with differing precision, by five distributions: exponential, Erlang, Weibull, lognormal and negative binomial.
As noted in the introduction, the Erlang distribution coincides with the exponential distribution if its parameter alpha is 1. It then has fewer degrees of freedom and therefore a lower chi-square test value. Sometimes the parameter alpha was found to be 2, but the chi-square test value was poorer than for the exponential distribution because of a too long tail.
Some significant deviations from the expected values are produced by only a few occurrences. This may sometimes be due to repeated phrases, or to less care on the part of the author. This conclusion should be confirmed by stylistic analysis.
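The effect of splitting before determining distances can be sketched as follows. The sketch assumes that a distance is counted in words between successive words of the same length, so that distance 1 means an immediate repetition, and that the text is cut into contiguous parts of equal size. Neither detail is confirmed by the source, and the function names are hypothetical.

```python
def length_distances(words, n):
    """Distances (in words) between successive words of length n."""
    positions = [i for i, w in enumerate(words) if len(w) == n]
    return [b - a for a, b in zip(positions, positions[1:])]


def split_then_measure(words, n, parts):
    """Cut the word list into contiguous parts, then measure each part alone.

    Distances that would span a cut are lost, and the parts may hold
    different numbers of words of length n.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size = max(1, -(-len(words) // parts))  # ceiling division, at least 1
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    return [length_distances(chunk, n) for chunk in chunks]


def mean(values):
    """Arithmetic mean, or None for an empty list."""
    return sum(values) / len(values) if values else None
```

For the line "ich bin der Geist der stets verneint und das mit Recht", split into two parts, the three letter words give the distances [1, 1, 2] in the first part and [1, 1] in the second, so the mean distances of the two parts already differ.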
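The coincidence of the two distributions for alpha equal to 1 can be checked numerically. The sketch below uses the standard textbook density of the Erlang distribution with an integer shape parameter (called alpha in the text) and a rate parameter. The parametrization used by the fitting software is not given in the source, so the function names and arguments are assumptions.

```python
import math


def erlang_pdf(x, alpha, rate):
    """Erlang density with integer shape alpha >= 1 and rate > 0, for x >= 0."""
    return (rate ** alpha) * (x ** (alpha - 1)) * math.exp(-rate * x) / math.factorial(alpha - 1)


def exponential_pdf(x, rate):
    """Exponential density with rate > 0, for x >= 0."""
    return rate * math.exp(-rate * x)


# With alpha = 1 the factor x**(alpha - 1) / (alpha - 1)! equals 1,
# so both densities agree at every distance.
for distance in (1, 5, 20, 100):
    assert math.isclose(erlang_pdf(distance, 1, 0.05), exponential_pdf(distance, 0.05))
```

With alpha equal to 2 the density becomes rate^2 * x * exp(-rate * x), which starts at zero for x = 0 instead of at its maximum.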
REFERENCES
1. Kunz M., See papers of this series on the page.