Distance Analysis of German Texts. III. Use of Words and Punctuation Marks by Johann Wolfgang Goethe.

Milan Kunz (kunzmilan@seznam.cz) March, 2003

Abstract

Distances between words of the same length and the use of punctuation marks in four works by Goethe are studied. Besides the four statistical distributions used before (exponential, Weibull, lognormal, and negative binomial), some distance sets correlate with the Erlang distribution.

INTRODUCTION

This is a continuation of the study of the statistical properties of distances between identical symbols in information strings, here applied to Goethe’s works. The same technique was used as before. Surprisingly, some sets correlated best with the Erlang distribution, a fifth distribution in addition to the four found earlier to be suitable for English and German texts: the exponential distribution, the Weibull distribution, the lognormal distribution, and the negative binomial distribution. The Erlang distribution coincides with the exponential distribution if its parameter alpha is 1. The fit then has fewer degrees of freedom and therefore a lower chi-square test value.
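The statement that the Erlang distribution coincides with the exponential one for alpha = 1 can be checked numerically. The following is a minimal sketch, assuming the usual shape/rate parameterization of both densities; the function names and the rate value are illustrative and do not come from the original analysis.

```python
import math


def exponential_pdf(x, rate):
    """Density of the exponential distribution with the given rate."""
    return rate * math.exp(-rate * x)


def erlang_pdf(x, alpha, rate):
    """Density of the Erlang distribution with integer shape alpha."""
    return (rate ** alpha * x ** (alpha - 1) * math.exp(-rate * x)
            / math.factorial(alpha - 1))


rate = 0.05  # hypothetical rate, i.e. a mean distance of 20 letters
for x in (1, 10, 50, 200):
    # With alpha = 1 the extra factors reduce to 1 and both densities agree.
    assert math.isclose(erlang_pdf(x, 1, rate), exponential_pdf(x, rate))
```

For alpha = 2 the density starts at zero for distance 0 and rises to a peak, which is the shape that can accommodate a shortage of very short distances.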

The actual values (mean, standard deviation, skewness, kurtosis, distribution parameters etc.) are of little interest, since they differ considerably.
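As a reminder of the technique, the distances between identical symbols can be collected in a few lines. This is only a sketch, under the assumption that a distance is the difference between the positions of two consecutive occurrences of the same symbol, with every character of the string counted. The counting convention used in the study may differ in detail, and the function name and the sample string are hypothetical.

```python
def symbol_distances(text, symbol):
    """Distances between consecutive occurrences of `symbol` in `text`."""
    positions = [i for i, ch in enumerate(text) if ch == symbol]
    # Each distance is the gap between one occurrence and the next.
    return [b - a for a, b in zip(positions, positions[1:])]


# Hypothetical sample string, not taken from the analysed works.
sample = "Ja, nein, vielleicht. Oder doch, wer weiss."
comma_distances = symbol_distances(sample, ",")
point_distances = symbol_distances(sample, ".")
```

The resulting lists are the raw material to which the distributions are fitted.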

RESULTS

The distances between points (full stops) determine the length of sentences. The points are used as follows:

Work         Frequency   Distribution   Chi-square (significance)
Werther 1    662         EX             0, 0.1318 over 250
Werther 2    886         WE             0.0371
Faust 1      1349        EX, ER         0
Faust 2      2169        ER             0, 0.2597 over 300

Werther 1: There are too few short sentences of up to 61 letters (109 occurrences against 184.6 expected). This contributes 31.9 % of the chi-square test value. The peak within distances 123-243 (243 occurrences against 154.6 expected) represents 53.1 % of the chi-square test value.

Werther 2: Compared with the other three works, all differences from the expected values are rather small.

Faust 1: There are too few short sentences of up to 41 letters (131 occurrences against 334.9 expected). This contributes 44.6 % of the chi-square test value. A peak then follows within distances 41-181 (898 occurrences against 627.7 expected), making another 44.5 % of the chi-square test value.

Faust 2: There are too few short sentences of up to 50 letters (245 occurrences against 402.7 expected). This contributes 49.7 % of the chi-square test value. A peak then follows within distances 61-150 (1268 occurrences against 1060 expected), making another 32.9 % of the chi-square test value.
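The expected occurrences quoted in these notes follow from the fitted distribution. The sketch below shows how such a value could be obtained for one class of distances, assuming an exponential fit with a given mean and continuous class limits. The function name, the sample size, and the mean are hypothetical and are not taken from the data.

```python
import math


def expected_exponential_count(n, mean, lower, upper=math.inf):
    """Expected number of distances in [lower, upper) under an exponential fit."""
    # Difference of the survival function exp(-x / mean) at the two limits.
    return n * (math.exp(-lower / mean) - math.exp(-upper / mean))


# Hypothetical set: 1000 distances with a mean of 100 letters.
n, mean = 1000, 100.0
limits = [0, 50, 100, 200, math.inf]
expected = [expected_exponential_count(n, mean, a, b)
            for a, b in zip(limits, limits[1:])]
```

Comparing such expected counts with the observed ones, class by class, gives the deviations described in the notes.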

The commas are used as follows. In both parts of Faust, they are so frequent that it was necessary to divide each set into two parts:

Work               Frequency   Distribution   Chi-square (significance)
Werther 1          2017        ER             0, 0.5314 over 100
Werther 2          2461        EX, ER         0, EX 0.1023 over 100
Faust 1, part 1    1414        EX, ER         0, 0.5097 over 80
Faust 1, part 2    1385        EX, ER         0, 0.5006 over 70
Faust 2, part 1    2572        EX, ER         0
Faust 2, part 2    2595        EX, ER         0

Werther 1: There are too few commas within short distances of up to 12 letters (114 occurrences against 158.7 expected). This contributes 17.4 % of the chi-square test value. The peak within distances 13-35 (739 occurrences against 600.7 expected) represents 47.8 % of the chi-square test value.

Werther 2: In contrast to the first part, there are too many commas within short distances of up to 15 letters (277 occurrences against 233.4 expected). This contributes 16.1 % of the chi-square test value. There are also many smaller fluctuations.

Faust 1: In the first part, there are too few short distances of up to 12 letters (109 occurrences against 282.1 expected). This contributes 55.8 % of the chi-square test value. A peak then follows within distances 13-48 (696 occurrences against 520.3 expected), making another 32.8 % of the chi-square test value.

In the second part, there are too few short distances of up to 49 letters (316 occurrences against 399.3 expected). This contributes 45.6 % of the chi-square test value. A peak then follows within distances 26-49 (582 occurrences against 472.7 expected), making another 39.6 % of the chi-square test value. The two parts differ: the t-test gave a significance of practically zero.

Faust 2: In the first part, there are too few short distances of up to 22 letters (715 occurrences against 859.4 expected). This contributes 41.2 % of the chi-square test value. A peak then follows within distances 23-88 (1393 occurrences against 1179.4 expected), making another 39.6 % of the chi-square test value.

In the second part, there are too few short distances of up to 14 letters (310 occurrences against 627 expected). This contributes 54.8 % of the chi-square test value. A long peak then follows within distances 15-96 (1932 occurrences against 1543.7 expected), making another 35.5 % of the chi-square test value. The two parts differ: the t-test gave a value of only 0.006.
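The comparison of two parts can be sketched as follows. The text does not say which variant of the t-test was used, so Welch's form for unequal variances is assumed here, and the two samples are invented for illustration only. The Python standard library has no t distribution, so the sketch stops at the statistic and its approximate degrees of freedom.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    na, nb = len(sample_a), len(sample_b)
    va = statistics.variance(sample_a) / na  # squared standard error of mean a
    vb = statistics.variance(sample_b) / nb  # squared standard error of mean b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical comma distances for two halves of a text (not data from the paper).
first_half = [12, 30, 18, 25, 40, 22, 15, 33]
second_half = [20, 45, 38, 52, 29, 41, 36, 48]
t_value, dof = welch_t(first_half, second_half)
```

A large absolute value of the statistic corresponds to a small significance, which is the situation reported for the two parts of Faust.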

Another punctuation mark, the colon, is used by Goethe as follows:

Work         Frequency   Distribution   Chi-square (significance)
Werther 1    70          LN             0.1424
Werther 2    69          LN             0.6645
Faust 1      1001        LN             0.0137
Faust 2      1287        LN             0.0299

Werther 1: There are too few colons within distances 1475-2211 (3 occurrences against 7.7 expected). This slight difference contributes 73.9 % of the chi-square test value.

Werther 2: The fit is very good:

Chi-square test

Lower limit   Upper limit   Observed frequency   Expected frequency   Chi-square
63            579.895       24                   24.9                 0.03235
579.895       1158.789      12                   14.7                 0.48108
1158.789      1737.684      8                    8.2                  0.00421
1737.684      2316.579      7                    5.1                  0.69371
2316.579      3474.368      7                    5.9                  0.20867
3474.368      5789.947      7                    5.1                  0.71024
5789.947      9405          4                    5.2                  0.25962

Chisquare = 2.38988 with 4 degrees of freedom. Significance level = 0.664458
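The summary line can be re-derived from the printed contributions. The sketch below sums the last column of the table and evaluates the upper tail of the chi-square distribution. It uses the closed form that holds for an even number of degrees of freedom, which is an implementation choice made here and not necessarily what the original software did.

```python
import math

# Per-class chi-square contributions as printed in the table above.
contributions = [0.03235, 0.48108, 0.00421, 0.69371, 0.20867, 0.71024, 0.25962]


def chi2_sf_even_df(x, df):
    """Upper-tail probability of the chi-square distribution for even df."""
    if df <= 0 or df % 2:
        raise ValueError("closed form holds only for positive even df")
    half = x / 2.0
    # exp(-x/2) times the first df/2 terms of the series of exp(x/2).
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))


chi_square = sum(contributions)
significance = chi2_sf_even_df(chi_square, 4)
# Percentage of the total contributed by each class.
shares = [100.0 * c / chi_square for c in contributions]
```

With the values above, the sum is 2.38988 and the significance rounds to 0.6645, in agreement with the reported line. The `shares` list gives the percentage contributed by each class, which is presumably how statements such as "contributes 73.9 % of the chi-square test value" are to be read.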

Faust 1: There are too few colons within distances 247-368 letters (71 occurrences against 95.2 expected). This contributes 35.8 % of the chi-square test value. The tail over distance 1104 (28 occurrences against 16.8 expected) makes another 56.8 % of the chi-square test value.

Faust 2: The peak within distances 792-934 (26 occurrences against 14.4 expected) makes 39.3 % of the chi-square test value. The mean distances in both parts are different.

A further punctuation mark, the semicolon, is used as follows:

Work         Frequency   Distribution   Chi-square (significance)
Werther 1    118         WE             0.4679
Werther 2    138         WE             0.5370
Faust 1      292         LN             0.4603
Faust 2      746         EX             0.0080, 0.2284 over 200

Werther 1: The fit is rather good considering the total number of semicolons.

Werther 2: There are too many semicolons within distances 1051-1400 letters (20 occurrences against 14.7 expected). This contributes 47.1 % of the chi-square test value. The shortage of semicolons within distances 1751-2100 letters (4 occurrences against 6.9 expected) makes another 29.9 % of the chi-square test value.

Faust 1: There are too few distances 229-456 (63 occurrences against 77.3 expected). This contributes 34.4 % of the chi-square test value. The peak within distances 685-912 (32 occurrences against 25.7 expected) makes another 19.9 % of the chi-square test value.

Faust 2: There are too few semicolons within short distances of up to 132 letters (184 occurrences against 220 expected). This contributes 23.2 % of the chi-square test value. A peak then follows within distances 132-394 (322 occurrences against 263.1 expected), making another 52.6 % of the chi-square test value.

The exclamation mark is used as follows:

Work         Frequency   Distribution   Chi-square (significance)
Werther 1    213         WE             0.0203
Werther 2    407         LN             0.5082
Faust 1      1150        LN             0.0036
Faust 2      1022        WE             0, 0.2433 over 200

Werther 1: There are too few exclamation marks within distances 634-950 (7 occurrences against 18.3 expected). This contributes 59.7 % of the chi-square test value.

Werther 2: There are too many exclamation marks within distances 1700-2366 letters (10 occurrences against 5.6 expected). This contributes 67.2 % of the chi-square test value.

Faust 1: There are too many distances 272-452 letters (122 occurrences against 90.2 expected). This contributes 35.3 % of the chi-square test value. There are 5 other larger disturbances.

Faust 2: There are too many distances 2-123 letters (501 occurrences against 435.3 expected). This contributes 25.7 % of the chi-square test value. Then follows a shortage within distances 124-368 (236 occurrences against 323 expected) making another 30.7 % of the chi-square test value.

The quotation marks are used as follows:

Work         Frequency   Distribution   Chi-square (significance)
Werther 1    296         LN             0.0043
Werther 2    314         LN             0.0031
Faust 1      49          WE             0.0001
Faust 2      20          EX             0.0001

Werther 1: There are too many quotation marks with distances over 1560 (17 occurrences against 8.9 expected). This contributes 90.4 % of the chi-square test value.

Werther 2: Again, there are too many quotation marks with distances over 1680 (20 occurrences against 11 expected). This contributes 56.7 % of the chi-square test value.

Faust 1 and Faust 2: Quotation marks are used only exceptionally; the results of the tests are poor.

The spacebar

The distances between consecutive spaces (spacebar characters) that are greater than 1 determine the number of words whose length equals this distance minus one. There are 40931 spaces without corrections. Some of them are used as formatting tools. The results of the tests are tabulated below. Cumulating the frequencies of shorter distances improved the fit in some cases, since below that limit the counts are scattered and the differences can balance each other.
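The rule relating distances between spaces to word lengths can be sketched as follows. The sketch assumes that any characters between two spaces, including attached punctuation marks, count toward the word length; whether the original counts treated punctuation this way is not stated. The function name and the sample text are hypothetical.

```python
from collections import Counter


def word_length_counts(text):
    """Count words by length from distances between consecutive spaces."""
    positions = [i for i, ch in enumerate(text) if ch == " "]
    distances = [b - a for a, b in zip(positions, positions[1:])]
    # A distance d > 1 between two spaces encloses a word of d - 1 characters;
    # d == 1 means two adjacent spaces and encloses no word.
    return Counter(d - 1 for d in distances if d > 1)


# Hypothetical sample, padded with spaces so that the first and last word
# are enclosed as well; it contains one double space.
sample = " wer reitet so  spaet durch nacht und wind "
counts = word_length_counts(sample)
```

Double spaces produce a distance of 1 and are therefore kept apart from the word counts.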

Table. The number of words of different length in Werther 1

Length        Number   Type of distribution, chi-square value
1             472      EX, 0.9525 over 60
2             10       no test
3             1089     ER, 0.1692 over 26, EX, 0.1901 over 30
4 (part 1)    2138     EX, 0
4 (part 2)    2134     EX, 0
5             2058     EX, 0.1960 over 35
6             1887     ER, 0, 0.5319 over 9
7             1590     EX, 0.2301
8             1076     EX, 0, (0.4948 over 10)
9             812      ER, 0.8186
10            691      EX, 0.2325, (0.6036 over 20)
11            599      EX, 0.4286, (0.8193 over 30)
12            419      EX, 0.7772
13            345      WE, 0.1418
14            263      EX, 0.4985
15            188      EX, 0.9746
16            151      EX, 0.9876
17            93       EX, 0.0953
18            82       EX, 0.2608
19            52       EX, 0.4301
20            46       EX, 0.6338
21            32       EX, 0.4337
22            22       EX, 0.5763
23            10       no test
24            17
25            11
26            4
27            3
28            1
29            0
30            2
31            0
32            0
33            4
35            1

The words classified according to their length are distributed mostly according to the exponential distribution. The Erlang distribution performed better at only three distances, but with alpha equal to 1. At the other distances, it has fewer degrees of freedom and thus a poorer chi-square test value. The Weibull distribution was applicable only at distance 13.

The distribution of word lengths seems to have a lognormal shape, but this guess was not tested.

Notes to some results:

There are no three consecutive spaces. This makes 39.9 % of the chi-square test value. Double spaces repeat most often within distances 18-26 (90 occurrences against 61.2 expected). This also makes 39.9 % of the chi-square test value. The tail fits almost perfectly.

1. letter words are too few.

2. letter words do not follow immediately as often as expected (37 occurrences against 70.4 expected). This makes 36.4 % of the chi-square test value. The second greatest disturbance is the peak within distances 10-14 (180 occurrences against 144.8 expected). This makes 19.6 % of the chi-square test value.

3. letter words. It was necessary to divide 3. letter words into two parts. In the first one, 3. letter words do not repeat as often as the exponential distribution requires (503 occurrences against 700.2 expected). This makes 35.6 % of the chi-square test value. 3. letter words repeat more often than expected within distances 2-6 (1188 occurrences against 938.7 expected). This makes 42.9 % of the chi-square test value.

Similarly in the second part, 3. letter words do not repeat as often as the exponential distribution requires (499 occurrences against 688.4 expected). This makes 32.6 % of the chi-square test value. 3. letter words repeat more often than expected within distances 2-6 (1308 occurrences against 1050.0 expected). This makes 35.7 % of the chi-square test value. Both parts are similar.

4. letter words. The peak within distances 10-12 (202 occurrences against 144.2 expected) makes 47.3 % of the chi-square test value.

5. letter words. The Erlang distribution shape is disturbed by a surplus of distances 6-8 (313 occurrences against 231.8 expected). This makes 45.8 % of the chi-square test value. The second peak within distances 13-15 (137 occurrences against 103 expected) makes 18.1 % of the chi-square test value.

6. letter words. The Erlang distribution is acceptable, too, the chi-square test value is 0.1788. There are many smaller disturbances.

7. letter words. The Erlang distribution is acceptable, too. The disturbances are similar to those for 5. letter words: there is a surplus of distances 5-8 (208 occurrences against 165.5 expected). This makes 22.1 % of the chi-square test value. The second peak within distances 12-15 (141 occurrences against 104 expected) makes 26.6 % of the chi-square test value.

8. letter words. The correlation is good either with the exponential distribution or the Erlang distribution. The greatest disturbance is a shortage of distances 30-35 (28 occurrences against 40.2 expected). This makes 36.8 % of the chi-square test value.

9. letter words are too many within distances 13-18 (117 occurrences against 88.5 expected). This makes 46.6 % of the chi-square test value.

10. letter words. The correlation is good, with no larger individual disturbance.

11. letter words. The correlation is good either with the exponential distribution or the Erlang distribution. The greatest disturbance is a surplus of distances 30-43 (72 occurrences against 60.2 expected). This makes 36.2 % of the chi-square test value.

12. letter words. There exists a cumulation of distances 38-49 (52 occurrences against 39.1 expected) between two shortages. This makes 26.8 % of the chi-square test value.

13. letter words. The peak within distances 39-57 (46 occurrences against 36.9 expected) makes 26.7 % of the chi-square test value.

14. letter words. Almost perfect fit is disturbed by four surplus distances 125-149 (15 occurrences against 11.1 expected) making 64.4 % of the chi-square test value.

15. letter words. Almost perfect fit.

16. letter words are too many within distances 211-260 (13 occurrences against 6.9 expected). This makes 43.8 % of the chi-square test value. The shortage of distances 311-410 (2 occurrences against 6.8 expected) contributes another 27.5 %.

17. letter words distribution is worsened by the shortage of distances 276-350 (1 occurrence against 6.4 expected). This makes 70.5 % of the chi-square test value.

Longer words need no special comment.

Table. The number of words of different length in Werther 2

Length        Number   Type of distribution, chi-square value
1             826      LN, 0.010
2             36       EX, 0.0016
3             1519     EX, 0.0906
4 (part 1)    2473     NB, 0, 0.0789 over 6
4 (part 2)    2442     NB, 0.1218
5             2406     EX, 0, 0.2977 over 19
6             2227     EX, 0, 0.7886 over 23
7             2115     EX, 0.0163, 0.8818 over 28
8             1318     EX, 0, 0.4475 over 21
9             1078     EX, 0.0134, 0.2713 over 10
10            803      EX, 0.1163, 0.6598 over 10
11            663      EX, 0.1504, 0.4166 over 35
12            508      EX, 0.5634
13            374      EX, 0.6435
14            307      EX, 0.3717
15            224      EX, 0.3327
16            181      EX, 0.3950
17            97       EX, 0.6397
18            94       EX, 0.6265
19            70       EX, 0.9034
20            54       EX, 0.8483
21            37       EX, 0.6048
22            19       EX, 0.5360
23            27       WE, 0.4955
24            20       EX, 0.7692
25            12
26            5
27            3
28            3
29            3
30            1
31            3
32            0
33            0
34            1
35            2

The words classified according to their length are distributed mostly according to the exponential distribution. The Weibull distribution, the negative binomial distribution, and the lognormal distribution each performed better in only one case.

The distribution of word lengths seems to have a lognormal shape, but this guess was not tested.

Notes to some results:

Double spaces repeat more often than expected within distances 30-49 (141 occurrences against 110.5 expected). This makes 34.0 % of the chi-square test value.

1. letter words are too few for commenting.

2. letter words follow too often at distances 9-13 (271 occurrences against 234.2 expected). This makes 27.8 % of the chi-square test value. The second greatest disturbance is the shortage of distances 19-27 (165 occurrences against 195 expected). This makes another 23 % of the chi-square test value.

3. letter words. It was necessary to divide 3. letter words into two parts. In the first part, 3. letter words do not repeat as often as the exponential distribution requires (579 occurrences against 648.4 expected). This makes 15.4 % of the chi-square test value. In contrast, they repeat more often than expected at distance 4 (314 occurrences against 260.4 expected). This makes 22.4 % of the chi-square test value.

Similarly in the second part, 3. letter words do not repeat as often as the exponential distribution requires (496 occurrences against 566.4 expected). This makes 33.3 % of the chi-square test value. They again repeat more often than expected at distance 3 (372 occurrences against 334.1 expected). This makes 16.3 % of the chi-square test value. Both parts are different according to the t-test.

4. letter words. There are too few distances 2-3 (481 occurrences against 615.4 expected). This makes 35.7 % of the chi-square test value. A peak within distances 4-6 follows immediately (525 occurrences against 437.8 expected). It makes 21.1 % of the chi-square test value.

5. letter words. The exponential distribution shape is disturbed by a shortage of distances 2-3 (391 occurrences against 534.4 expected). This makes 34.2 % of the chi-square test value. The peak within distances 4-12 (1064 occurrences against 886.4 expected) makes 35.6 % of the chi-square test value.

6. letter words. There is a shortage of distances 18-20 (79 occurrences against 100.2 expected). This makes 15.5 % of the chi-square test value. The greatest disturbance is a peak of distances 30-33 (40 occurrences against 25.8 expected). This makes 27.2 % of the chi-square test value. This peak is accumulated at the expense of distances 27-29 (23 occurrences against 36.2 expected, 16.6 % of the chi-square test value) as well as of distances 34-36 (12 occurrences against 18.3 expected, 7.5 % of the chi-square test value).

7. letter words. The exponential distribution has a step: there is a surplus of distances 10-14 (227 occurrences against 173.4 expected), which makes 30.1 % of the chi-square test value. Immediately after it, there are too few distances 15-18 (95 occurrences against 130.1 expected). This makes 17.2 % of the chi-square test value.

8. letter words. There is a surplus of distances 10-14 (172 occurrences against 135.3 expected), which makes 29.4 % of the chi-square test value. Immediately after it, there are too few distances 15-18 (89 occurrences against 106.2 expected). This makes 8.3 % of the chi-square test value.

9. letter words. They are too few within distances 25-35 (70 occurrences against 94.9 expected). This makes 29.8 % of the chi-square test value.

10. letter words. The greatest disturbance is a shortage of distances 56-68 (27 occurrences against 36.4 expected). This makes 25 % of the chi-square test value.

11. letter words. They repeat less often than expected (3 occurrences against 7 expected). This makes 23.4 % of the chi-square test value. There are too few distances 104-117 (7 occurrences against 12.9 expected). This makes 27.6 % of the chi-square test value. The third disturbance is a surplus of distances 162-190 (12 occurrences against 7.6 expected). This makes 26.7 % of the chi-square test value.

12. letter words. The surplus of distances 136-154 (17 occurrences against 9.7 expected) makes 49.9 % of the chi-square test value.

13. letter words. The shortage of distances 41-59 (16 occurrences against 28.2 expected) makes 42.3 % of the chi-square test value.

14. letter words. The greatest disturbance is a shortage of distances 168-201 (4 occurrences against 10.2 expected). This makes 44.7 % of the chi-square test value. The surplus of distances 236-335 (19 occurrences against 12.5 expected) makes 40.4 % of the chi-square test value.

15. letter words. The shortage of distances 381-475 (3 occurrences against 5.6 expected) makes 47.7 % of the chi-square test value.

16. letter words. The shortage of distances 166-220 (5 occurrences against 9.8 expected) makes 44.8 % of the chi-square test value.

17. letter words. Almost perfect fit.

18. letter words distribution is worsened by the surplus of distances 190-284 (12 occurrences against 7.3 expected). This makes 70.7 % of the chi-square test value.

19-24. letter words. Without commentary.

Table. The number of words of different length in Faust 1

Length        Number   Type of distribution, chi-square value
1             517      WE, 0.081
2             9        no test
3             1651     EX, 0.6693 over 5
4 (part 1)    1975     LN, 0, 0.1409 over 14, ER, 0
4 (part 2)    1977     EX, 0, 0.0790 over 15, ER, 0
4 (part 3)    1970     EX, 0, 0.0486 over 6, ER, 0
5 (part 1)    1715     EX, 0.4765 over 8
5 (part 2)    1713     EX, 0.3503 over 12
6 (part 1)    1516     EX, 0, 0.2243 over 22
6 (part 2)    1521     EX, 0, 0.4584 over 25
7 (part 1)    1258     EX, 0
7 (part 2)    1243     EX, 0.0611 over 12
8             1493     EX, 0, 0.0916 over 36
9             1327     EX, 0.8304
10            975      EX, 0.4978
11            886      WE, 0.3004
12            687      WE, 0.4225
13            543      WE, 0.7522
14            459      EX, 0.6260
15            352      EX, 0.6119
16            311      EX, 0.1383
17            215      EX, 0.3247
18            195      WE, 0.2284
19            136      WE, 0.3370
20            106      EX, 0.2361
21            89       WE, 0.5363
22            75       WE, 0.3443
23            68       WE, 0.8442
24            64       LN, 0.2896
25            61       EX, 0.6823
26            55       EX, 0.9420
27            32       EX, 0.6312
28            24       no test
29            28       LN, 0.3247
30            23       EX, 0.4197
31            11
32            11
33            7

The words classified according to their length are distributed mostly according to the exponential distribution (16 distances) and according to the Weibull distribution at 9 distances. The lognormal distribution was suitable in three cases. The Erlang distribution with the parameter alpha = 2 was sometimes applicable but performed worse than the exponential distribution.

The distribution of word lengths seems to have a lognormal shape, but this guess was not tested. There are 19 words longer than 35 letters.

Notes to some results:

Double spaces repeat more often than expected within distances 222-276 (12 occurrences against 6.6 expected). This makes 32.2 % of the chi-square test value.

1. letter words. No test.

2. letter words. They repeat less often than expected (56 occurrences against 104.2 expected). This makes 53.1 % of the chi-square test value. The second greatest disturbance is the peak of distances 39-43 (45 occurrences against 31.4 expected). This makes 14.1 % of the chi-square test value.

3. letter words. It was necessary to divide three letter words into three parts. In the first part, 3. letter words repeat more often than expected (379 occurrences against 128.3 expected). This makes 75 % of the chi-square test value. In contrast, distances 2-4 occur less often than expected (874 occurrences against 1176.3 expected). This makes 12.6 % of the chi-square test value.

In the second part, 3. letter words do not repeat as often as the exponential distribution requires (441 occurrences against 596.4 expected). This makes 71 % of the chi-square test value. 3. letter words repeat more often than expected within distances 2-5 (1063 occurrences against 850.6 expected). This makes 20.1 % of the chi-square test value.

Similarly in the third part, 3. letter words do not repeat as often as the exponential distribution requires (414 occurrences against 577.4 expected). This makes 38.5 % of the chi-square test value. 3. letter words repeat more often than expected at distance 3 (306 occurrences against 228.1 expected). This makes 22.1 % of the chi-square test value. The first and third parts are different according to the t-test.

4. letter words. It was necessary to divide four letter words into two parts. In the first part, four letter words repeat less often than expected (223 occurrences against 311.7 expected). This makes 38.1 % of the chi-square test value. Other deviations have a weight of less than 5 % of the chi-square test value. In the second part, four letter words repeat as often as the exponential distribution requires. They repeat more often than expected within distances 6-8 (274 occurrences against 215.7 expected). This makes 32.9 % of the chi-square test value. The second peak within distances 13-15 (112 occurrences against 82.6 expected) makes 21.2 % of the chi-square test value. The parts are similar according to the t-test.

5. letter words. It was necessary to divide five letter words into two parts. In the first part, five letter words form a peak within distances 6-8 (247 occurrences against 188.7 expected), which makes 29.2 % of the chi-square test value. They repeat less often than expected within distances 21-22 (11 occurrences against 33.9 expected). This makes 25.1 % of the chi-square test value. In the second part, five letter words repeat more often than expected within distances 6-8 (277 occurrences against 189 expected). This makes 54.1 % of the chi-square test value. Other deviations have a weight of at most 10 % of the chi-square test value. Both parts are similar.

6. letter words. It was necessary to divide six letter words into two parts. In the first part, six letter words form a peak within distances 5-8 (316 occurrences against 237.9 expected), which makes 48.6 % of the chi-square test value. They repeat more often than expected within distances 44-50 (17 occurrences against 8.3 expected). This makes 17.3 % of the chi-square test value. In the second part, six letter words repeat less often than expected within distances 8-11 (133 occurrences against 164.8 expected). This makes 21.4 % of the chi-square test value. The tail over 54 is longer (13 occurrences against 6.8 expected). This makes 20.1 % of the chi-square test value. Both parts are similar.

7. letter words. Seven letter words repeat more often than expected (107 occurrences against 85.5 expected). This makes 14.5 % of the chi-square test value. The tail over 96 is longer (11 occurrences against 5 expected). This makes 19 % of the chi-square test value.

8. letter words. There is a shortage of distances 30-35 (52 occurrences against 63.4 expected) which makes 17.9 % of the chi-square test value.

9. letter words. They repeat more often than expected (49 occurrences against 36.8 expected). This makes 23.2 % of the chi-square test value. Then, they repeat more often than expected within distances 102-114 (12 occurrences against 7.4 expected). This makes 16.1 % of the chi-square test value. The tail over 133 is shorter (2 occurrences against 5.8 expected). This makes 14.1 % of the chi-square test value.

10. letter words. The greatest disturbance is a shortage of distances 40-62 (103 occurrences against 132.5 expected). This makes 47.5 % of the chi-square test value.

11. letter words. There are five deviations from the Weibull shape with a weight greater than 10 %.

12. letter words. The shortage of distances 29-41 (60 occurrences against 74.3 expected) makes 32.5 % of the chi-square test value.

13. letter words. The slight shortage of distance 1 (5 occurrences against 8.3 expected) makes 15.9 % of the chi-square test value. Another shortage of distances 23-43 (84 occurrences against 97.7 expected) makes 23.9 % of the chi-square test value.

14. letter words. There are five deviations with the weight greater than 10 %.

15. letter words. The surplus of distances 43-66 (61 occurrences against 47.5 expected) makes 25.9 % of the chi-square test value. The shortage of distances 67-138 (59 occurrences against 81.1 expected) makes 41.3 % of the chi-square test value.

16. letter words. The peak within distances 284-324 (12 occurrences against 5.8 expected) makes 40.1 % of the chi-square test value.

17. letter words. Their distribution is worsened by a shortage of repetitions at distances up to 3 (3 occurrences against 8.5 expected). This makes 34 % of the chi-square test value.

18. letter words distribution is worsened by the shortage of distances 89-131 (10 occurrences against 16.9 expected). This makes 31.2 % of the chi-square test value.

19. letter words. Their distribution is worsened by their repeating at distances up to 67 (34 occurrences against 26.4 expected). This makes 27.3 % of the chi-square test value. The shortage of distances 68-134 (11 occurrences against 19.6 expected) makes 46.7 % of the chi-square test value.

20. letter words. The greatest disturbance is a shortage of distances 276-350 (2 occurrences against 6.4 expected). This makes 74.5 % of the chi-square test value.

Longer words are without commentary.

Table. The number of words of different length in Faust 2

Length        Number   Type of distribution, chi-square value
1             564      WE, 0.0560
2             21       EX, 0.5460
3             2440     EX, 0, 0.8358 over 9
4 (part 1)    1937     EX, ER, 0
4 (part 2)    1936     EX, ER, 0
4 (part 3)    1942     EX, ER, 0
4 (part 4)    1947     EX, ER, 0
5 (part 1)    2193     EX, 0, 0.5716 over 8
5 (part 2)    2154     EX, 0.3318
6 (part 1)    1994     WE, 0, 0.5323 over 36
6 (part 2)    1983     EX, 0, 0.2209 over 25
7 (part 1)    1747     EX, 0.0022, 0.7718 over 9
7 (part 2)    1749     EX, 0, 0.2069 over 20
8             2450     EX, 0.0001, 0.2328 over 30
9             1986     EX, 0, 0.1125 over 20
10            1697     EX, 0.0855
11            1474     EX, 0.0967
12            1153     EX, 0.1290
13            944      EX, 0.2676
14            830      EX, 0.2952
15            649      WE, 0.6559
16            513      WE, 0.6186
17            460      WE, 0.2818
18            355      WE, 0.3906
19            294      WE, 0.9199
20            190      WE, 0.4318
21            153      WE, 0.9954
22            120      EX, 0.2064
23            126      EX, 0.2746
24            102      EX, 0.9203
25            86       EX, 0.2004
26            80       LN, 0.5909
27            48       EX, 0.8261
28            42       EX, 0.8938
29            32       EX, 0.5481
30            33       LN, 0.6155
31            23       EX, 0.6014
32            22       EX, 0.0515
33 and over   32       LN, 0.1380

The words classified according to their length are distributed mostly according to the exponential distribution (22 distances) and according to the Weibull distribution at 8 distances. The lognormal distribution was suitable in three cases. The Erlang distribution with the parameter alpha = 2 was sometimes applicable but performed worse than the exponential distribution.

The distribution of word lengths seems to have a lognormal shape, but this guess was not tested. There are 32 words longer than 32 letters.

Notes to some results:

Double spaces repeat less often than expected within distances 132-160 (14 occurrences against 22.3 expected). This makes 18.5 % of the chi-square test value. The tail over 332 is longer than expected (16 occurrences against 9.4 expected). This makes 28.4 % of the chi-square test value.

1. letter words. No commentary.

2. letter words. They repeat less often than expected (66 occurrences against 157.8 expected). This makes 59.6 % of the chi-square test value. A peak then follows at distances 2-12 (1313 occurrences against 1196.7 expected). This makes 18.6 % of the chi-square test value.

3. letter words. It was necessary to divide three letter words into four parts. In the first part, 3. letter words repeat less often than expected (322 occurrences against 529.4 expected). This makes 43.9 % of the chi-square test value. On contrary, distances 2-7 occur more often than expected (1279 occurrences against 1015.1 expected). This makes 39.2 % of the chi-square test value. In the second part, 3. letter words repeat less often as the exponential distribution requires (337 occurrences against 508.9 expected). This makes 35.7 % of the chi-square test value. Then immediately 3. letter words repeat more often then expected, especially within distances 4-6 (602 occurrences against 433.9 expected). This makes 41.2 % of the chi-square test value. Similarly in the third part, 3. letter words do not repeat as often as the exponential distribution requires (324 occurrences against 536.6 expected). This makes 40 % of the chi-square test value. 3. letter words repeat more often then expected within distances 2-6 (1088 occurrences against 812.1 expected). This makes 49 % of the chi-square test value. In the fourth part, 3. letter words repeat less often than expected (350 occurrences against 545.4 expected). This makes 39.1 % of the chi-square test value. On contrary, distances 2-7 occur more often than expected (1192 occurrences against 932.9 expected). This makes 47.7 % of the chi-square test value. The second part differs from other ones, significantly from the third and fourth ones according to the t-test.

4. letter words. It was necessary to divide four letter words into two parts. In the first part, four letter words repeat less often than expected (262 occurrences against 384.5 expected). This makes 70.5 % of the chi-square test value. Four letter words repeat more often than expected within distances 2-7 (1138 occurrences against 972 expected). This makes 16 % of the chi-square test value. In the second part, four letter words repeat nearly as often as the exponential distribution requires. They repeat more often than expected within distances 2-5 (769 occurrences against 703.3 expected). This makes 45.4 % of the chi-square test value. The parts are different according to the t-test.

5. letter words. It was necessary to divide five letter words into two parts. In the first part, five letter words repeat less often than expected till the distance 3 (573 occurrences against 716 expected). This makes 28.5 % of the chi-square test value. The peak within distances 4-12 (985 occurrences against 803.4 expected) makes 46 % of the chi-square test value. In the second part, five letter words repeat less often than expected till the distance 3 (563 occurrences against 660.8 expected). This makes 24 % of the chi-square test value. The peak within distances 4-9 (725 occurrences against 598.9 expected) makes 51.3 % of the chi-square test value. Both parts are different according to the t-test.

6. letter words. It was necessary to divide six letter words into two parts. In the first part, six letter words form a peak within distances 11-14 (209 occurrences against 166.6 expected), which makes 29.3 % of the chi-square test value. In the second part, six letter words repeat more often than expected within distances 8-10 (239 occurrences against 162.3 expected). This makes 49.3 % of the chi-square test value. Both parts are different according to the t-test.

7. letter words. Seven letter words repeat less often than expected within distances 2-6 (550 occurrences against 649.7 expected). This makes 34.2 % of the chi-square test value.

8. letter words. There is a peak of distances 18-22 (211 occurrences against 154 expected) which makes 38.7 % of the chi-square test value.

9. letter words. They repeat less often than expected till distance 7 (471 occurrences against 523.7 expected). This makes 23.1 % of the chi-square test value. The peak within distances 8-21 (609 occurrences against 557.8 expected) makes 20.8 % of the chi-square test value.

10. letter words. The greatest disturbance is a shortage of distances 102-107 (0 occurrences against 5.2 expected). This makes 18.1 % of the chi-square test value.

11. letter words. The peak of distances 124-135 (14 occurrences against 7.5 expected) makes 31.9 % of the chi-square test value.

12. letter words. The surplus of distances 124-135 (14 occurrences against 7.5 expected) makes 31.9 % of the chi-square test value.

13. letter words. The surplus of distances 11-28 (313 occurrences against 276.2 expected) makes 24.1 % of the chi-square test value. The shortage of distances 39-47 (58 occurrences against 75.1 expected) makes 18.4 % of the chi-square test value.

14. letter words. There are six deviations with the weight greater than 10 %.

15. letter words. The surplus of distances 198-216 (9 occurrences against 5.8 expected) makes 20.2 % of the chi-square test value.

16. letter words. There are four deviations with the weight greater than 10 %.

17. letter words. Their distribution is worsened by a shortage of repeats (2 occurrences against 6.4 expected). This makes 25.4 % of the chi-square test value.

18. letter words. Their distribution is worsened by the shortage of distances 257-329 (7 occurrences against 14.9 expected). This makes 52.2 % of the chi-square test value.

19. letter words. Almost perfect fit.

20. letter words. The greatest disturbance is a shortage of distances 357-421 (3 occurrences against 7.8 expected). This makes 42.1 % of the chi-square test value.

21. letter words. Almost perfect fit.

Longer words are left without commentary, since the deviations from the expected values consist of only a few occurrences.
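The percentages quoted in the commentaries above can be read as the share of one interval of distances in the total chi-square sum. The following is only a minimal sketch, assuming the usual Pearson term (O - E)^2 / E for each interval. The function names and the three-interval counts are hypothetical and are not taken from the tables of this paper. Since the paper pools several distances into one interval, this sketch need not reproduce its percentages exactly.

```python
def chi_square_terms(observed, expected):
    """Return the Pearson term (O - E)**2 / E for each interval."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return [(o - e) ** 2 / e for o, e in zip(observed, expected)]


def chi_square_shares(observed, expected):
    """Return the percentage share of each interval in the chi-square sum."""
    terms = chi_square_terms(observed, expected)
    total = sum(terms)
    if total == 0:
        # Perfect fit: no interval contributes anything.
        return [0.0 for _ in terms]
    return [100.0 * t / total for t in terms]


# Hypothetical counts (illustration only, not data from the paper).
observed = [60, 130, 10]
expected = [80.0, 100.0, 20.0]

shares = chi_square_shares(observed, expected)
for o, e, s in zip(observed, expected, shares):
    print(f"{o} occurrences against {e} expected: {s:.1f} % of chi-square")
```

A large share can come from a large absolute difference in a frequent interval or from a small difference in a rare one, which is why a tail with only a few occurrences can dominate the test value.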

Discussion

The insufficient capacity of the software used for long lists forced the splitting of too frequent signs. The splitting was made before the distances were determined. Surprisingly, the obtained parts are not always comparable, since the split parts contain different numbers of signs. This leads to different mean distances between them.
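The effect of splitting can be illustrated with a small sketch. It assumes that the text is cut by position into parts of equal length and that a distance is the difference between the positions of two consecutive occurrences of the same sign. Neither assumption is stated in this paper, and the function names and the toy string are hypothetical.

```python
def distances(sequence, sign):
    """Distances between consecutive occurrences of `sign` in `sequence`."""
    positions = [i for i, s in enumerate(sequence) if s == sign]
    return [b - a for a, b in zip(positions, positions[1:])]


def mean_distances_of_parts(sequence, sign, n_parts):
    """Split first, then determine the distances inside each part.

    The distance spanning a cut is lost, because each part is
    processed on its own.
    """
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    if not sequence:
        return []
    size = -(-len(sequence) // n_parts)  # ceiling division
    means = []
    for start in range(0, len(sequence), size):
        d = distances(sequence[start:start + size], sign)
        means.append(sum(d) / len(d) if d else None)
    return means


# Toy string (illustration only): points are dense in the first half
# and sparse in the second half.
text = "ab.ab.ab.ab." + "abcde.abcde."
print(mean_distances_of_parts(text, ".", 2))
```

Both halves of the toy string have the same length, but the first contains four points and the second only two, so their mean distances differ.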

Some distributions of distances between punctuation marks are highly regular, especially their tails, if the low distances inside words are pooled. They are described with varying precision by five distributions: exponential, Erlang, Weibull, lognormal and negative binomial.

As noted in the introduction, the Erlang distribution coincides with the exponential distribution if its parameter alpha is 1. Then it has fewer degrees of freedom and therefore a lower chi-square test value. Sometimes the parameter alpha was found to be 2, but the chi-square test value was poorer than for the exponential distribution due to a too long tail.
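The coincidence of both distributions at alpha = 1 can be checked numerically. The sketch below assumes the standard continuous Erlang density with an integer shape alpha and a rate parameter. The name `rate` and the numerical values are assumptions made for illustration, and the software used for the fits in this paper may parametrize the distribution differently.

```python
import math


def exponential_pdf(x, rate):
    """Density of the exponential distribution with the given rate."""
    return rate * math.exp(-rate * x)


def erlang_pdf(x, alpha, rate):
    """Density of the Erlang distribution.

    `alpha` is a positive integer shape parameter; `rate` is assumed
    here to play the same role as the rate of the exponential density.
    """
    return (rate ** alpha * x ** (alpha - 1) * math.exp(-rate * x)
            / math.factorial(alpha - 1))


rate = 0.2  # hypothetical value, not fitted to any text
for x in (0.5, 1.0, 3.0, 10.0):
    # With alpha = 1 the two densities are identical.
    assert math.isclose(erlang_pdf(x, 1, rate), exponential_pdf(x, rate))
    print(x, erlang_pdf(x, 1, rate), erlang_pdf(x, 2, rate))
```

With alpha = 2 the density starts at zero and rises to a maximum at 1/rate before it decays, whereas the exponential density is highest at zero.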

Some significant deviations from the expected values are made by few occurrences. This can sometimes be caused by repeated phrases, or by less care on the part of the author. This conclusion should be confirmed by stylistic analysis.

REFERENCES

1. Kunz M., See papers of this series on the page.