Distance Analysis of German Texts. I. Capital Letters in "Johann Wolfgang Goethe: Faust"

Milan Kunz (kunzmilan@seznam.cz) December, 2002

Abstract

Distances between identical symbols in information strings (biological sequences, natural language, and computer programs such as *.exe files) are described, with varying precision, by four distributions: exponential, Weibull, lognormal, and negative binomial. The correlations are sometimes highly significant. Here the distances between capital letters in both parts of Faust are analyzed. Most of the distances correlate rather well with the Weibull distribution, and the fit is better in the second part.

INTRODUCTION

This study continues the series on the statistical properties of distances between identical symbols in information strings (1, 2, 3, 4), now applied to the German language.

The tragedy was obtained from Project Gutenberg (promo.net) as the files fau110.zip and fau210.zip. The files were unzipped and stripped of the introductory English notes. After that, the first part has 197759 bytes and the second part 301506 bytes. The first part contains 183741 signs including spaces, or 158436 signs without spaces, in 7003 lines and 30651 words, so the mean length of a word is 5.169 signs. The second part contains 281916 signs including spaces, or 245401 signs without spaces, in 9794 lines and 44653 words, so the mean length of a word is 5.496 signs.
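The mean word lengths follow from dividing the number of signs without spaces by the number of words. A minimal check in Python, using only the counts reported above (the function and variable names are illustrative, not part of the original study):

```python
# Counts reported in the text for the two parts of Faust.
parts = {
    "first part": {"signs_without_spaces": 158436, "words": 30651},
    "second part": {"signs_without_spaces": 245401, "words": 44653},
}


def mean_word_length(signs_without_spaces: int, words: int) -> float:
    """Mean number of signs per word."""
    return signs_without_spaces / words


for name, counts in parts.items():
    # Print the mean word length rounded to three decimals.
    print(name, round(mean_word_length(**counts), 3))
```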

The analysis is limited to the capital letters, which are sufficiently frequent because every verse begins with one and, in German, so does every substantive. Moreover, the names of the dramatis personae are written entirely in capital letters.

After these formal corrections, the distances were determined by a program written by Rádl. Each symbol in the string is first assigned a position index i (running from 1 to m), and then the differences between the position indexes of the same symbol are determined. These differences are taken as the topological distances between identical symbols. The resulting sets of values were evaluated by different statistical tests. The distance-counting program counts all signs, including spaces, line breaks, and punctuation marks.
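The indexing and differencing steps can be sketched in a few lines of Python. This is not Rádl's program, which is not reproduced here; it assumes that a distance is the difference between the position indexes of two consecutive occurrences of the same symbol, so that a doubled letter such as AA gives the distance 1. The function names are hypothetical.

```python
from collections import defaultdict


def symbol_distances(text: str) -> dict[str, list[int]]:
    """Distances between consecutive occurrences of each symbol.

    Every sign receives a position index from 1 to m, so spaces,
    line breaks and punctuation marks are counted as well.
    """
    last_position: dict[str, int] = {}
    distances: dict[str, list[int]] = defaultdict(list)
    for i, symbol in enumerate(text, start=1):
        if symbol in last_position:
            # Difference of the position indexes of the same symbol.
            distances[symbol].append(i - last_position[symbol])
        last_position[symbol] = i
    return dict(distances)


def capital_letter_distances(text: str) -> dict[str, list[int]]:
    """Keep only the distance lists that belong to capital letters."""
    return {s: d for s, d in symbol_distances(text).items() if s.isupper()}
```

For example, `capital_letter_distances("Der Mann und Der Mond")` yields one distance of 13 for D and one of 13 for M, because the spaces between the words are counted too.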

Of all the distributions implemented, only three gave significant results: the exponential distribution, the Weibull distribution, and the lognormal distribution.
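For orientation, the three distributions can be written as cumulative distribution functions, from which the expected number of distances in a cell of a chi-square test follows. The sketch below uses the standard textbook forms; the statistical software used in the study is not named, and any parameter values passed to these functions are placeholders, not fitted values from this paper.

```python
import math


def exponential_cdf(x: float, mean: float) -> float:
    """Exponential distribution with the given mean."""
    return 1.0 - math.exp(-x / mean) if x > 0 else 0.0


def weibull_cdf(x: float, shape: float, scale: float) -> float:
    """Two-parameter Weibull distribution."""
    return 1.0 - math.exp(-((x / scale) ** shape)) if x > 0 else 0.0


def lognormal_cdf(x: float, mu: float, sigma: float) -> float:
    """Lognormal distribution; mu and sigma describe log(x)."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def expected_frequency(cdf, lower: float, upper: float, n: int) -> float:
    """Expected number of distances in the cell (lower, upper] among n."""
    return n * (cdf(upper) - cdf(lower))
```

With a shape parameter of 1 the Weibull distribution reduces to the exponential one, which is why the two fits are sometimes almost equally good.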

The actual values (mean, standard deviation, skewness, kurtosis, distribution parameters etc.) are of little interest, since they differ considerably.

Results

 

Distances between individual letters

The results for all letters are presented in Table 1, which gives the frequency of each symbol and the significance of the chi-square test performed. Commentaries on the individual letters of the alphabet follow. The values in square brackets are the corresponding values for the combined lower and upper cases.

Table 1 Survey of results

Notes:

EX = exponential distribution

WE = Weibull distribution

LN = lognormal distribution

* = the test was not made because there were not enough data

Statistic = XX, the chi-square test

 

| Symbol | 1. part | 2. part |
|---|---|---|
| A | 1004, WE, 0.0716 over 30 | 1252, WE, 0.2004 over 30 |
| B | 532, EX, 0.0378 | 865, WE, 0.1156 |
| C | 140, LN, 0.5634 | 228, WE, 0.5551 |
| D | 1234, EX, 0.1093 (0.7025 over 42) | 2128, EX, 0.2062 over 115 |
| E | 2107, WE, 0.0800 over 150 | 2309, WE, 0 |
| F | 717, WE, 0.0052 | 849, WE, 0.0488 |
| G | 838, WE, 0.8727 | 1263, WE, 0.4506 |
| H | 1281, WE, 0.7650 over 100 | 1676, WE, 0.9065 over 100 |
| I | 999, WE, 0.0029 | 1112, WE, 0.0011 |
| J | 111, WE, 0.5597 | 149, WE, 0.8104 |
| K | 419, WE, 0.6968 | 758, WE, 0.6541 |
| L | 772, WE, 0.0397 | 1056, WE, 0.1407 |
| M | 1070, EX, 0.4112 over 50 | 1117, EX, 0.1641 |
| N | 468, WE, 0.1156 | 1024, WE, 0.5042 over 30 |
| O | 535, LN, 0.4837 | 723, WE, 0.7251 |
| P | 700, WE, 0 | 891, WE, 0.1676 |
| Q | 27, too few | 20, too few |
| R | 673, WE, 0.0282 | 1046, WE, 0.1164 |
| S | 1996, WE, 0 | 1393, WE, 0.1533; 1377, WE, 0.0632 |
| T | 1121, LN, 0.0143 | 1055, WE, 0.6975 |
| U | 778, LN, 0.0406 | 1043, WE, 0.8235 over 250 |
| V | 267, WE, 0.3976 | 505, WE, 0.0926 |
| W | 920, EX, 0.6532 | 1427, WE, 0 |
| X | 20, too few | 20, no test |
| Y | 21, too few | 78, LN, 0.0288 |
| Z | 236, WE, 0.4787 | 423, WE, 0.2730 |

 

For the upper case, the Weibull distribution gives the best fit for 15 letters in the first part and for 20 letters in the second part. The lognormal distribution fits best in 4 and 2 cases, respectively, the exponential distribution in 4 and 2 of the performed tests, and the negative binomial distribution in no case.

Sometimes the difference between the fits is small and more than one distribution is applicable. The chi-square test values are sometimes practically zero, and the significance of the test increases only when the lower limit is moved to greater distances and the shorter distances are pooled.
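The pooling of short distances can be illustrated with a small binning routine. It is only a sketch of one reading of the procedure: all distances at or below a forced lower limit go into a single first cell, the following cells have equal width, and the remaining long distances form an open tail cell. The function name and its parameters are hypothetical, and the statistical software used in the study may have chosen its cells differently.

```python
def bin_distances(distances, lower_limit, width, n_cells):
    """Count integer distances in cells with a forced lower limit.

    Cell 0 pools every distance at or below lower_limit, cells 1 to
    n_cells cover intervals of equal width above it, and the last
    cell pools the tail beyond lower_limit + n_cells * width.
    """
    counts = [0] * (n_cells + 2)
    for d in distances:
        if d <= lower_limit:
            counts[0] += 1
        else:
            # Index of the interval (lower + k*width, lower + (k+1)*width].
            k = (d - lower_limit - 1) // width
            counts[min(k, n_cells) + 1] += 1
    return counts
```

Raising `lower_limit` merges more of the short distances into the first cell, so irregularities among the shortest distances, such as missing doubled letters, no longer dominate the test.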

The commentaries on the individual letters follow.

A

The frequency statistics for the capital A in the first part are shown in the following table.

Chisquare Test

| Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare |
|---|---|---|---|---|
| at or below | 1.000 | 0 | 16.1 | 16.07790 |
| 1.000 | 91.290 | 438 | 450.7 | .36056 |
| 91.290 | 181.581 | 213 | 200.2 | .81908 |
| 181.581 | 271.871 | 129 | 116.1 | 1.44080 |
| 271.871 | 362.161 | 92 | 72.2 | 5.41398 |
| 362.161 | 452.452 | 46 | 46.7 | .01177 |
| 452.452 | 542.742 | 24 | 31.1 | 1.60181 |
| 542.742 | 633.032 | 15 | 21.0 | 1.72948 |
| 633.032 | 723.323 | 11 | 14.5 | .82759 |
| 723.323 | 813.613 | 13 | 10.1 | .85715 |
| 813.613 | 903.903 | 6 | 7.1 | .16343 |
| 903.903 | 994.194 | 3 | 5.0 | .81168 |
| 994.194 | 1174.774 | 6 | 6.2 | .00449 |
| above 1174.774 | | 8 | 7.1 | .12023 |

Chisquare = 30.24 with 11 d.f. Sig. level = 1.45274E-3

A doubled AA occurs in neither part. This absence accounts for 53.11 % (40.5 % in the second part) of the chi-square test value for the Weibull distribution. There is also a surplus of A within distances 272-362 in the first part (92 occurrences against 72.2 expected); in the second part the surplus lies within distances 272-542 (210 such distances against 184.2 expected). It contributes 17.9 % (14.8 %) of the chi-square test value. This distance corresponds to 7 to 9 (12.5) verses of usual length and may be connected with the length of Faust's speeches.
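The percentage contributions quoted in these commentaries can be recomputed from the observed and expected frequencies. The following sketch uses the values of the chi-square table for the capital A in the first part. Because the expected frequencies are printed with one decimal only, the recomputed terms agree with the table only approximately. The degrees of freedom are taken as the number of cells minus one minus the two fitted Weibull parameters, which is the usual rule for a goodness-of-fit test and reproduces the 11 d.f. reported.

```python
# (observed, expected) frequencies for capital A, first part.
cells = [
    (0, 16.1), (438, 450.7), (213, 200.2), (129, 116.1), (92, 72.2),
    (46, 46.7), (24, 31.1), (15, 21.0), (11, 14.5), (13, 10.1),
    (6, 7.1), (3, 5.0), (6, 6.2), (8, 7.1),
]


def chi_square_terms(cells):
    """Contribution (observed - expected)^2 / expected of each cell."""
    return [(obs - exp) ** 2 / exp for obs, exp in cells]


terms = chi_square_terms(cells)
total = sum(terms)

# Share of each cell in the total chi-square value, in percent.
shares = [100.0 * t / total for t in terms]

# Cells minus one, minus two fitted Weibull parameters (assumed rule).
degrees_of_freedom = len(cells) - 1 - 2

print(round(total, 2), degrees_of_freedom)
print(round(shares[0], 1), round(shares[4], 1))
```

The first share belongs to the missing doubled AA, the fifth to the surplus within distances 272-362.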

 

B

The distribution of distances between upper case B in the first part is exponential, but the Weibull distribution gives an only slightly worse chi-square test value, 0.0256. In both parts there are too few B within distances 501-600 (507-634 in the second part): 16 occurrences against 31.3 expected, and 42 occurrences against 54 expected, respectively. This contributes 34 % (17.3 %) of the chi-square test value. The tails of the two parts differ significantly. The first part shows a surplus within distances 801-900, 37 occurrences against 22.9 expected, which makes 40.8 % of the chi-square test value. The second part has a peak within distances 1268-1647, 26 occurrences against 14.5 expected, which makes 58.3 % of the chi-square test value.

C

The shape of the distribution of distances between upper case C differs between the two parts. The first part is described better by the exponential distribution, the second part by the Weibull one. The irregularities are rather small.

D

Here the exponential distribution is applicable. The chi-square test values are poor, but in the second part the shortage of D within distances up to 28 makes 40.6 % of the chi-square test value, and the surplus within distances 29 to 56 makes 49.6 %. These repetitions can be identified with the definite articles. The distribution of distances in the first part is more regular.

 

E

The poor Weibull correlation shows a surplus within distances 112 to 168 in the first part (223 against 169 expected), which makes 8.8 % of the chi-square test value, and a surplus within distances 203 to 339 in the second part (253 against 197.1 expected), which makes 9.4 % of the chi-square test value.

 

F

The distribution of the capital F correlates only poorly with the Weibull distribution. The distribution in the first part has a longer tail, 18 occurrences against 8.4 expected. This difference makes 54.6 % of the high total chi-square test value. The distribution in the second part has a peak within distances 128 to 254 (215 against 180 expected), which makes 39.6 % of the chi-square test value.

 

G

The distribution of the capital G correlates rather well with the Weibull distribution. The greatest difference in the first part is a shortage of distances 624 to 680 (5 observed against 11 expected), which makes 37.1 % of the chi-square test value. The distribution in the second part has a lower chi-square test value; there the greatest difference, between the observed (31) and expected (22.2) numbers of distances 588 to 650, makes 23.3 % of the chi-square test value.

The test for the first part is given here, with a different forced lower limit:

Chisquare Test

| Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare |
|---|---|---|---|---|
| 1.000 | 20.000 | 71 | 69.9 | .016478 |
| 20.000 | 80.000 | 183 | 181.3 | .015043 |
| 80.000 | 140.000 | 154 | 140.2 | 1.360459 |
| 140.000 | 200.000 | 109 | 107.4 | .024999 |
| 200.000 | 260.000 | 71 | 81.9 | 1.450514 |
| 260.000 | 320.000 | 58 | 62.3 | .300457 |
| 320.000 | 380.000 | 49 | 47.4 | .057248 |
| 380.000 | 440.000 | 33 | 35.9 | .239124 |
| 440.000 | 500.000 | 28 | 27.2 | .021419 |
| 500.000 | 560.000 | 18 | 20.6 | .334739 |
| 560.000 | 620.000 | 16 | 15.6 | .009678 |
| 620.000 | 680.000 | 5 | 11.8 | 3.924649 |
| 680.000 | 740.000 | 9 | 8.9 | .000627 |
| 740.000 | 800.000 | 7 | 6.7 | .009778 |
| 800.000 | 860.000 | 6 | 5.1 | .161780 |
| 860.000 | 980.000 | 8 | 6.7 | .233659 |
| above 980 | | 13 | 8.9 | 1.918322 |

Chisquare = 10.079 with 14 d.f. Sig. level = 0.756387

 

 

H

The distribution of the capital H correlates only poorly with the Weibull distribution, but the shortage of doubled HH makes 77.6 % (69.2 %) of the chi-square test value. The tails, however, correlate rather well:

| Chisquare value | 1. part | 2. part |
|---|---|---|
| over 50 | 0.0851 | 0.7499 |
| over 100 | 0.7650 | 0.9065 |

 

I

The distribution of the capital I correlates poorly with the Weibull distribution. Both parts have longer tails (24 observed occurrences against 12.8 expected in the first part, and 13 occurrences against 5.5 expected in the second part). These differences make 37.4 % and 26.5 % of the corresponding chi-square test values.

 

J

The distribution of the letter J is the Weibull one. In the first part, the greatest difference between observed and expected values lies within distances 3620-4977 (5 distances against 6.2 expected) and makes 48.9 % of the chi-square test value.

 

K

The Weibull fit for this letter is worsened in the first part by a shortage of distances 1127-1267 (3 occurrences against 8.7 expected), which makes 40.0 % of the chi-square test value. In the second part there is a shortage of distances 662-835 (32 occurrences against 44.3 expected), which makes 49.9 % of the chi-square test value.

 

L

The occurrences of the capital L correlate with the Weibull distribution. In the first part there are too few distances 394-656 (68 occurrences against 95.8 expected), which makes 62 % of the chi-square test value. The tail of the distribution is longer (17 occurrences against 11 expected), which makes another 24.5 %. In the second part there is a surplus of distances 76-149 (245 occurrences against 211.2 expected), which makes 28.3 % of the chi-square test value. The shortage within distances 446 to 594 (65 occurrences against 84.7 expected) makes another 24.9 %.

 

M

The upper case M is fitted by the exponential distribution. In this case there are too many doubled MM (12 occurrences against 6.2 expected); the chi-square test value therefore improves to 0.4116 for distances over 50. In the first part there are also too many distances 166-220 (139 occurrences against 111.6 expected), which makes 22.4 % of the chi-square test value. In contrast, there are too few distances 384-494 (33 occurrences against 53.7 expected), which makes 27.5 % of the chi-square test value.

The test for the first part is given here:

Chisquare Test

| Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare |
|---|---|---|---|---|
| at or below | 1.000 | 12 | 6.2 | 5.3734 |
| 1.000 | 55.806 | 286 | 290.9 | .0831 |
| 55.806 | 110.613 | 214 | 211.4 | .0330 |
| 110.613 | 165.419 | 151 | 153.6 | .0426 |
| 165.419 | 220.226 | 139 | 111.6 | 6.7475 |
| 220.226 | 275.032 | 78 | 81.1 | .1150 |
| 275.032 | 329.839 | 48 | 58.9 | 2.0130 |
| 329.839 | 384.645 | 38 | 42.8 | .5348 |
| 384.645 | 439.452 | 17 | 31.1 | 6.3808 |
| 439.452 | 494.258 | 16 | 22.6 | 1.9188 |
| 494.258 | 549.065 | 22 | 16.4 | 1.9067 |
| 549.065 | 603.871 | 8 | 11.9 | 1.2891 |
| 603.871 | 658.677 | 9 | 8.7 | .0133 |
| 658.677 | 713.484 | 9 | 6.3 | 1.1656 |

Chisquare = 30.1571 with 14 d.f. Sig. level = 7.26057E-3

 


N

The upper case N is fitted by the Weibull distribution. In the first part there are too many distances 1126-1267 (14 occurrences against 8 expected), which makes 28.4 % of the chi-square test value. In the second part there is a peak of distances 156-310 (223 occurrences against 189.1 expected), which makes 31 % of the chi-square test value.

 

O

The distribution of O correlates with the lognormal distribution in the first part and with the Weibull distribution in the second part. The greatest distortion is in the second part, within distances 198-394 (171 occurrences against 152.5 expected), and makes 42.3 % of the chi-square test value.

P

The Weibull fit for the upper case P is much better in the second part, but it fluctuates slightly in seven of the sixteen groups. In the first part there are too many distances 198-394 (131 occurrences against 94.2 expected), which makes 22.4 % of the chi-square test value.

 

Q

This letter could not be correlated because it occurs too rarely.

 

R

The upper case R correlates with the Weibull distribution. In the first part there are too few doubled RR (11 occurrences against 28.3 expected), which makes 52.6 % of the chi-square test value. In the second part the shortage is not great (10 occurrences against 13.6 expected). There the greatest difference is a surplus of repetitions within distances 246 to 368 (146 occurrences against 120.4 expected), which makes 32.5 % of the chi-square test value.

 

S

The Weibull distribution of the capital S is distorted mostly by too few doubled SS: 1 occurrence against 28.5 expected in the first part, and 3 against 16.9 in the second part. This makes 45.1 % and 45.2 % of the chi-square test value, respectively. Apart from this, the greatest difference in the first part is a surplus of repetitions within distances 448 to 503 (17 occurrences against 8.6 expected), which makes 14.1 % of the chi-square test value.

 

T

The distribution of the capital T has the Weibull shape only in the second part; the first part correlates with the lognormal distribution. In the first part there are too many distances 521-594 (24 occurrences against 15.2 expected), which makes 19.2 % of the chi-square test value. In the second part, the shortage of distances 1227-1594 (5 occurrences against 11.5 expected) makes 47.3 % of the chi-square test value.

 

U

The distribution of the capital U again has the Weibull shape only in the second part; the first part correlates better with the lognormal distribution. In the first part there are too many distances 160-318 (236 occurrences against 213.9 expected), which makes 42 % of the chi-square test value. In the second part, the peak of distances 124-246 (277 occurrences against 223 expected) makes 41.2 % of the chi-square test value.

 

 

V

The distribution of the capital V has the Weibull shape in both parts, but the first part correlates better. In the first part there are too few distances 457-684 (26 occurrences against 34.5 expected), which makes 25 % of the chi-square test value. In the second part, the shortage of distances 336-693 (106 occurrences against 129.1 expected) makes 27.9 % of the chi-square test value. The peak within distances 1050-1228 (30 occurrences against 20.8 expected) contributes another 24.8 %.

 

W

The distribution of the upper case W correlates well with the exponential distribution in the first part; in the second part the Weibull distribution gives only an approximate fit. In the first part there is a shortage of distances 300-383 (58 occurrences against 69.7 expected). This difference alone makes 25.4 % of the chi-square test value. In the second part, the shortage of distances 214-372 (218 occurrences against 271.4 expected) makes 24.1 % of the chi-square test value. The tail of the distribution is longer (27 occurrences against 18.1 expected).

 

X

This letter could not be correlated because it occurs too rarely.

 

Y

The first part could not be correlated because the letter occurs too rarely. The second part is approximated only poorly by the lognormal distribution.

 

Z

The Weibull distribution of this letter is distorted in the first part by a surplus of distances 1034-1366 (30 occurrences against 21.1 expected), which makes 57.7 % of the chi-square test value. The second part is less regular: the shortage of distances 1057-1478 (24 occurrences against 37.5 expected) makes 41.3 % of the chi-square test value.

Discussion

The old German custom of starting all substantives with a capital letter gives an excellent opportunity to study the capital letters alone. Moreover, the names of the dramatis personae are written in capital letters, too. The capital letters are frequent enough to give significant results, yet they avoid the technical difficulty that arises when the software cannot handle long lists and the lists must be split.

The most surprising result is the form of the distribution of distances. The Weibull distribution appears as the main distribution, especially in the second part.

Considering the long time that Goethe devoted to polishing the work, only rather few faults interfere with a perfect form.

REFERENCES

1. Kunz M., see the following papers of this series on the page.