Distance Analysis of German Texts. II. Capital Letters in Johann Wolfgang Goethe: "Die Leiden des jungen Werther"
Milan Kunz (kunzmilan@seznam.cz)
January 1, 2003
Abstract
Distances between identical symbols in information strings (biological, language, computer programs (*.exe files)) are described with varying precision by four distributions: exponential, Weibull, lognormal, and negative binomial. The correlations are sometimes highly significant. Here, the distances between capital letters in both parts of Die Leiden des jungen Werther are analyzed. Most of the distances are correlated rather well with the exponential and Weibull distributions.
INTRODUCTION
This is a continuation of the study of statistical properties of distances between identical symbols in information strings in the German language (1).
The book was obtained from Project Gutenberg (promo.net) as the files 7ljw110.zip and 7ljw210.zip. The files were unzipped and stripped of the introductory English notes.
After that, the first part has 112779 bytes and the second part has 136477 bytes. The first part contains 108953 signs including spaces, 92653 signs without spaces in 1912 lines and 30651 words. This means that the mean length of a word is 5.219 signs. The second part contains 131613 signs including spaces, 111654 signs without spaces in 2431 lines. This means that the mean length of a word is 5.199 signs.
The analysis is limited to the capital letters, which are sufficiently frequent because all nouns begin with them in German. There are 4190 capital letters in the first part (4.52 %), and 5167 capital letters in the second part (4.63 %).
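The quoted shares of capital letters can be checked with a few lines of arithmetic. The sketch below assumes that the percentages refer to the counts of signs without spaces (92653 and 111654); the helper name `percent` is only illustrative.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# Counts taken from the text; the denominator (signs without spaces)
# is an assumption about how the percentages were obtained.
share_part1 = percent(4190, 92653)
share_part2 = percent(5167, 111654)

print(round(share_part1, 2), round(share_part2, 2))
```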
After these formal corrections, the distances were determined by a program written by Rádl. The string is first indexed with the position index i (i going from 1 to m) of each individual symbol in the string, and then the differences of these position indexes are determined. The differences are considered to be the topological distances between identical symbols. The sets of these values were evaluated by different statistical tests. The program counts all signs, including spaces, line breaks, and punctuation marks.
Of all the implemented distributions available, only three gave significant results: the exponential distribution, the Weibull distribution, and the lognormal distribution.
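The indexing and differencing step can be sketched as follows. This is not Rádl's program, only a minimal illustration of the described procedure; the function names and the sample string are hypothetical.

```python
from collections import defaultdict

def symbol_distances(text):
    """Return {symbol: [distances between consecutive occurrences]}.

    Positions are 1-based and every character is counted (spaces,
    line breaks, punctuation), as described in the text.
    """
    last_seen = {}
    distances = defaultdict(list)
    for i, ch in enumerate(text, start=1):
        if ch in last_seen:
            distances[ch].append(i - last_seen[ch])
        last_seen[ch] = i
    return dict(distances)

def capital_distances(text):
    """Keep only the distance lists that belong to capital letters."""
    return {s: d for s, d in symbol_distances(text).items() if s.isupper()}

# Illustrative sample, not taken from the analysed files.
sample = "Die Leiden des jungen Werther. Der Brief."
result = capital_distances(sample)
print(result)
```

Symbols that occur only once produce no distance and therefore do not appear in the result.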
The actual values (mean, standard deviation, skewness, kurtosis, distribution parameters etc.) are of little interest, since they differ considerably.
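How a set of distances can be compared with a fitted distribution is sketched below for the exponential case only. The sketch assumes a fit by the sample mean and classes of equal probability; the statistical package actually used may have chosen the classes differently, and `exponential_chi_square` is a hypothetical name.

```python
import math

def exponential_chi_square(distances, n_bins=8):
    """Fit an exponential distribution by the sample mean and compare
    observed with expected counts in equal-probability classes.

    Returns a list of (observed, expected, contribution) tuples and
    the total chi-square statistic.
    """
    n = len(distances)
    mean = sum(distances) / n
    # Upper class limits chosen so that each class has probability
    # 1/n_bins under the fitted exponential; the last class is open.
    edges = [-mean * math.log(1 - k / n_bins) for k in range(1, n_bins)]
    observed = [0] * n_bins
    for d in distances:
        k = sum(1 for e in edges if d > e)  # index of the class of d
        observed[k] += 1
    expected = n / n_bins
    rows = [(o, expected, (o - expected) ** 2 / expected) for o in observed]
    total = sum(r[2] for r in rows)
    return rows, total
```

The share of each class in the total, `contribution / total`, corresponds to the percentages of the chi-square test value quoted in the commentaries on the individual letters.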
Results
Distances between individual letters
The results for all letters are presented in a table, which gives the frequencies of all symbols and the significance of the chi-square tests performed. Commentaries on all symbols of the alphabet follow. The values in square brackets show the corresponding values for the combined lower and upper cases.
Table 1 Survey of results
Notes:
EX = exponential distribution
WE = Weibull distribution
LN = lognormal distribution
* = the test was not made, since not enough data were available
Statistic = χ², the chi-square test
t-statistics = similarity of both parts; significance = the parts differ significantly if the significance of the test is below 0.05 (marked *)
Symbol | 1. part | 2. part | t-statistics | significance
A | 328, EX, 0.3156 | 420, WE, 0.6483 | 0.7142 | 0.4753
B | 186, EX, 0.4774 | 241, EX, 0.1599 | 0.7481 | 0.4554
C | 9, no test | 20, no test | |
D | 168, WE, 0.4179 | 256, EX, 0.1092 | 2.2706 | 0.0225*
E | 224, WE, 0.0800 over 150 | 292, EX, 0.4785 | 0.8486 | 0.3965
F | 159, WE, 0.9724 | 201, EX, 0.5742 | 0.3295 | 0.7419
G | 282, LN, 0.6541 | 391, WE, 0.1663 | 1.7961 | 0.0729
H | 213, WE, 0.5098 | 309, EX, 0.4749 | 1.9116 | 0.0565
I | 150, EX, 0.7268 | 197, LN, 0.8459 | 0.7329 | 0.4641
J | 48, EX, 0.0158 | 65, WE, 0.3805 | 0.4081 | 0.6839
K | 183, LN, 0.9746 | 154, EX, 0.9769 | -0.27345 | 0.0066*
L | 232, EX, 0.2997 | 289, EX, 0.5847 | 0.3299 | 0.7416
M | 252, EX, 0.7693 | 271, EX, 0.3892 | -0.8587 | 0.3909
N | 86, WE, 0.8971 | 123, WE, 0.6795 | 0.9166 | 0.3604
O | 52, LN, 0.3473 | 65, WE, 0.7167 | -0.2199 | 0.8263
P | 75, EX, 0.7079 | 70, LN, 0.0740 | 1.3896 | 0.1668
Q | 7, too few data | 6, too few data | |
R | 43, EX, 0.7437 | 69, EX, 0.8206 | -1.0846 | 0.2805
S | 381, EX, 0.5862 | 674, WE, 0.0990 | -5.5684 | 0*
T | 164, EX, 0.8283 | 210, EX, 0.2624 | 0.5809 | 0.5616
U | 114, WE, 0.2882 | 158, WE, 0.6131 | -0.8232 | 0.3999
V | 441, WE, 0.5323 | 130, WE, 0.6963 | 12.9539 | 0*
W | 318, WE, 0.7819 | 455, EX, 0.0797 | -2.0085 | 0.0449*
X | 0 | 0 | |
Y | 0 | 0 | |
Z | 75, WE, 0.8856 | 101, WE, 0.8155 | -0.6671 | 0.5056
Among the upper-case letters, the Weibull distribution is the best fit for 9 letters in the first part and for 9 letters in the second part as well. The lognormal distribution fits best in 3 and 2 cases, respectively, the exponential distribution in 10 and 11 of the performed tests, and the negative binomial distribution in no case.
Sometimes the difference between the fits is small and more than one distribution is applicable. The significance of the chi-square test is sometimes practically zero, and it increases only when the shortest distances are pooled, that is, when the lowest class limit is moved to a greater distance.
Commentaries on the individual letters follow.
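Pooling of classes before a chi-square test can be sketched as follows. The sketch merges neighbouring classes from the shortest distances upward until each pooled class reaches a minimum expected count; the threshold of 5 is a common rule of thumb and an assumption here, not a value stated in the study, and `pool_classes` is a hypothetical name.

```python
def pool_classes(observed, expected, min_expected=5.0):
    """Merge neighbouring classes, starting with the shortest distances,
    until each pooled class has an expected count >= min_expected."""
    pooled_obs, pooled_exp = [], []
    acc_o, acc_e = 0, 0.0
    for o, e in zip(observed, expected):
        acc_o += o
        acc_e += e
        if acc_e >= min_expected:
            pooled_obs.append(acc_o)
            pooled_exp.append(acc_e)
            acc_o, acc_e = 0, 0.0
    # Fold any remainder (the sparse tail) into the last pooled class.
    if acc_o > 0 or acc_e > 0:
        if pooled_obs:
            pooled_obs[-1] += acc_o
            pooled_exp[-1] += acc_e
        else:
            pooled_obs.append(acc_o)
            pooled_exp.append(acc_e)
    return pooled_obs, pooled_exp

# Illustrative counts, not data from the study.
obs, exp = pool_classes([1, 2, 10, 3, 1], [2.0, 3.5, 9.0, 2.0, 1.0])
print(obs, exp)
```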
A
There are too few A in the first part within distances 539-646 (7 occurrences against 17.9 expected), which makes 60.1 % of the chi-square test value. In the second part, such a shortage lies within distances 520-726: there are 26 such distances against 36.8 expected, which contributes 52.5 % of the chi-square test value. These distances correspond to 8 to 11 lines of usual length and can be connected with the length of Faust's replicas.
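The percentages quoted in these commentaries can be read as the share of one class in the Pearson chi-square sum, where each class contributes (observed - expected)² / expected. The sketch below works through the first-part figures for A under that assumption; the implied total is derived from the quoted 60.1 % and is not a value reported in the study.

```python
def class_contribution(observed, expected):
    """Contribution of a single class to the Pearson chi-square sum."""
    return (observed - expected) ** 2 / expected

# First part, letter A, distances 539-646: 7 observed, 17.9 expected.
contribution = class_contribution(7, 17.9)

# If this class makes 60.1 % of the test value, the total would be:
implied_total = contribution / 0.601

print(round(contribution, 2), round(implied_total, 2))
```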
B
The distribution of distances between upper case B in both parts is exponential. In the first part, a surplus exists within distances 993 to 1157 (13 compared with 8.4 expected), which makes one third of the chi-square test value. In the second part, there is a valley, 1 occurrence against 6 expected within distances 1268-1742, which makes 31.5 % of the chi-square test value, and immediately, there are too many B within distances 1426-1742 (15 occurrences against 7.8 expected, 42 occurrences against 54 expected, respectively), which contributes 51.1 % of the chi-square test value.
C
This letter could not be correlated due to few occurrences.
D
The shape of the distribution of distances between the upper case D in both parts is exponential. In the first part, there are too few distances 453-669 (14 occurrences against 23.8 expected), which contributes 48.8 % of the chi-square test value. In the second part, such a shortage lies within distances 600-920 (20 occurrences against 36.9 expected), which contributes 58.9 % of the chi-square test value. The result of the t-statistic shows a significant difference in the use of words starting with the upper case D. The ratio of occurrences is only 65.62 %.
E
The distribution of distances between the upper case E in the first part is a Weibull one; the exponential distribution gives a worse result (the chi-square test value is 0.2540). The second part is correlated better with the exponential distribution. There are four significant deviations from the shape of the fit in the first part, and three in the second part. The greatest one in the first part is a surplus within distances 1300-1633 (12 compared with 7.7 expected), which makes 33.9 % of the chi-square test value. A surplus of distances 760 to 920 (23 compared with 16.2 expected) makes 33.9 % of the chi-square test value, too.
F
The distribution of distances between the upper case F in the first part is a Weibull one. The fit is almost perfect. The second part is correlated better with the exponential distribution.
Here, the distribution has a shortage of distances 951-1188 (10 occurrences against 14.3 expected). This difference makes 22.3 % of the chi-square test value. Then the peak within distances 1188 to 1663 (24 compared with 16.8 expected) makes 54.5 % of the chi-square test value.
G
The distribution of the capital G is correlated in the first part with the lognormal distribution; the Weibull distribution gives a poorer result (the chi-square test value is 0.5261). The second part is correlated with the Weibull distribution or with the exponential distribution (the chi-square test value is 0.1517). The greatest difference in the first part is a shortage of distances 305 to 456 (34 observed compared with 41.7 expected), which makes 23.6 % of the chi-square test value. Then the peak within distances 457 to 608 follows (35 compared with 28.1 expected), which makes another 28.7 % of the chi-square test value. The distribution in the second part has a lower chi-square test value, but the greatest difference between observed (52) and expected values (73.6) of the distances 267-443 makes 23.3 % of the chi-square test value. The second peak within distances 886 to 1150 (24 compared with 15.3 expected) makes another 31.7 % of the chi-square test value.
H
The distribution of the capital H in the first part is correlated with the Weibull distribution, but the tail is longer (13 observed occurrences against 8.2 expected). This difference makes 27.9 % of the corresponding chi-square test value. The second part is correlated with the exponential distribution. There are too many H within distances 900-1140 (21 occurrences against 16.1 expected), which contributes 27.3 % of the chi-square test value. The shortage that follows within distances 1141-1380 (5 distances against 9.1 expected) contributes 33.7 % of the chi-square test value.
I
The distribution of the capital I in the first part is correlated with the exponential distribution rather well. The second part is correlated with the lognormal distribution. There are too many I within distances 745-992 (17 occurrences against 12.6 expected) which contributes 58.3 % of the chi-square test value.
J
There are rather few J in the first part. It was not possible to do all tests. The second part has the Weibull distribution. The greatest difference between observed and expected values, for distances 3620-4977 (5 distances against 8.5 expected), makes 47.2 % of the chi-square test value.
K
The distribution of distances between consecutive letters in the first part has almost perfect lognormal shape, and in the second part exponential shape. The statistics of both parts differ significantly.
L
The occurrences of capital L in both parts are correlated with the exponential distribution. In the first part, there are too many distances 300-466 (46 occurrences against 36.6 expected). This makes 25.4 % of the chi-square test value. Then a shortage within distances 467 to 633 follows (18 compared with 25.7 expected), which makes another 24.0 % of the chi-square test value. In the second part, a shortage within distances 897 to 1008 (4 compared with 8.8 expected) makes 25.2 % of the chi-square test value, and a peak of distances 1121-1232 (9 occurrences against 5.4 expected) another 23.6 % of the chi-square test value.
M
The upper case M is correlated using the exponential distribution in both parts. In the first part, there are too few distances 1050-1258 (5 occurrences against 8.7 expected). This makes 246.7 % of the chi-square test value. Two peaks are in the second part, the first one between distances 160-280 (52 occurrences against 42.7 expected) makes 15.8 % of the chi-square test value, the second one between distances 1120-1240 (11 occurrences against 5.9 expected) makes 34.6 % of the chi-square test value.
N
The upper case N in the first part is correlated using the exponential distribution, but the Weibull distribution gives an acceptable fit (the chi-square test value is 0.7378). The second part is correlated with the Weibull distribution. There is a shortage within distances 929 to 1404 (12 occurrences compared with 16.5 expected). This makes 53.7 % of the chi-square test value.
O
The distribution of O in the first part is correlated with the lognormal distribution, in the second part with the Weibull distribution. The greatest distortion is in the first part within distances 667-1500 (8 occurrences against 12.6 expected). This makes 51.2 % of the chi-square test value.
P
The upper case P distribution in the first part is correlated with the exponential distribution, in the second part with the lognormal distribution. In the first part, there are too few distances 812-1216 (7 occurrences against 10.4 expected). This makes 38.5 % of the chi-square test value. The tail is longer (14 occurrences against 10.6 expected over 2838). This makes another 36.5 % of the chi-square test value. In the second part, there are too few distances 1369-2157 (3 occurrences against 8.2 expected). This makes 63.7 % of the chi-square test value. The tail is longer (14 occurrences against 9.9 expected over 3736). This makes another 32.1 % of the chi-square test value.
Q
This letter could not be correlated due to few occurrences.
R
The upper case R is correlated well with the exponential distribution; therefore, no comments are necessary.
S
The upper case S in the first part is correlated using the exponential distribution, but the Weibull distribution gives practically the same chi-square test value (0.5681). The second part is correlated with the Weibull distribution only poorly. In the first part, the greatest difference is too many repetitions within distances 108 to 261 (125 occurrences against 108.9 expected), which makes 51.1 % of the chi-square test value. The second part fluctuates; the greatest difference is the longer tail (11 occurrences against 6.1 expected over 952). This makes 28.9 % of the chi-square test value. The t-statistic shows that both parts are different.
T
The distribution of the capital T has the exponential shape. In the first part, there are too few distances 662-992 (15 occurrences against 23.7 expected). This makes 64.3 % of the chi-square test value. In the second part, the shortage of distances 467-800 (28 occurrences against 41.1 expected) makes 34.0 % of the chi-square test value. Then there are too many repetitions within distances 1467 to 2300 (24 occurrences against 14.7 expected). This makes 30.1 % of the chi-square test value.
U
The distribution of the capital U has the Weibull shape in both parts. In the first part, there are too many distances 273-543 (30 occurrences against 21.1 expected). This makes 50.3 % of the chi-square test value. In the second part, the shortage of distances 656-873 (11 occurrences against 15.6 expected) makes 21.4 % of the chi-square test value, another shortage of distances 1092-1309 (6 occurrences against 9.2 expected) makes 17.4 % of the chi-square test value.
V
The distribution of the capital V has the Weibull shape in both parts, but the first part is correlated less well. There are too many distances 379-441 (30 occurrences against 22.6 expected). This makes 24.0 % of the chi-square test value. Then there are too few repetitions within distances 442 to 504 (11 occurrences against 17.4 expected). This makes 23.5 % of the chi-square test value. The third fluctuation is a peak of distances 757-881 (12 occurrences against 7.8 expected). This makes 22.3 % of the chi-square test value. In the second part, there is only one shortage of distances 1556-1814 (1 occurrence against 5.3 expected), which makes 75.0 % of the chi-square test value. The t-statistic shows that both parts are different.
W
The distribution of the upper case W in the first part is correlated well with the Weibull distribution, in the second part with the exponential distribution. In the first part, there is a shortage of distances 440-731 (38 occurrences against 47.6 expected). This difference alone makes 49.0 % of the chi-square test value. In the second part, the shortage of distances 424-704 (50 occurrences against 65.6 expected) makes 31.1 % of the chi-square test value. The shortage of distances 986-1126 (1 occurrence against 5.8 expected) makes 31.3 % of the chi-square test value. The t-statistic shows that both parts are different.
X
This letter could not be correlated because it does not occur as a capital in either part.
Y
This letter could not be correlated because it does not occur as a capital in either part.
Z
The Weibull distribution of this letter gives no opportunity for commenting.
Discussion
The t-statistic shows that five capital letters (D, K, S, V, W) are used differently in the two parts. This means that their distributions differ not only in shape but also in their means and standard deviations.
Without analyzing the frequency vocabularies, it can be conjectured for the capital D that this is connected with the length of sentences in both parts. This can be tested.
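A comparison of the mean distances in the two parts can be sketched with a two-sample t statistic. The text does not say which variant of the test was used; the sketch below assumes Welch's form (unequal variances), and the function name and sample values are illustrative only.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference of two sample means."""
    var_a = variance(sample_a) / len(sample_a)  # squared standard error of mean a
    var_b = variance(sample_b) / len(sample_b)  # squared standard error of mean b
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)

# Illustrative distance lists, not data from the study.
t = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 4))
```

With samples as large as those in the table, the significance of such a statistic is close to that obtained from the normal distribution.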
The exponential distribution is most frequently applicable in both parts, whereas the Weibull distribution is less frequent. This can be compared with the results of both parts of Faust where the Weibull distribution is used as the most frequent one.
This is in accord with my conjecture about how the shape of the distribution evolves.
REFERENCES
1. Kunz M., See following papers of this series on the page.