Distance Analysis of English Texts. III. ARTHUR CONAN DOYLE: A STUDY IN SCARLET.

Milan Kunz (kunzmilan@atlas.cz) August, 2002

Abstract

Distances between identical symbols in information strings (biological sequences, language texts, computer programs such as *.exe files) are described, with varying precision, by four distributions: exponential, Weibull, lognormal and negative binomial. The fits are sometimes highly significant. Here the distances between signs in the novel by A. C. Doyle are analyzed. Some distance tests revealed specific formal features of the text.

INTRODUCTION

This study continues the investigation of statistical properties of distances between identical symbols in information strings (1, 2, 3). Doyle's novel was obtained in RTF form and converted to plain text (*.txt) using MS Word. Some formatting, such as headlines, remained unchanged. The resulting file has 238430 bytes. It contains 230112 signs including spaces, or 189183 signs without spaces, in 4159 lines and 42549 words. The mean length of a word is therefore 4.446 signs (including apostrophes and punctuation marks). For some letters, the list of distances was split into several approximately equal parts, since the statistical software used, Statgraphics (in the version I work with), cannot handle very long lists.

After these formal corrections, the distances were determined by a program written by Rádl. Each symbol in the string is first indexed with its position index i (i going from 1 to m), and then the differences between the position indexes of consecutive occurrences of the same symbol are determined. These differences are taken as the topological distances between identical symbols. The sets of these values were evaluated by different statistical tests. The program counts all signs, including spaces, returns, and punctuation marks.
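The indexing step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Rádl's program: the function name `symbol_distances` is hypothetical, and the sketch assumes that a distance is the difference between the positions of two consecutive occurrences of the same sign.

```python
from collections import defaultdict

def symbol_distances(text):
    """Map each sign to the list of distances between its consecutive occurrences.

    Positions are counted from 1, and every sign (letters, spaces,
    punctuation, line breaks) is treated alike.
    """
    last_seen = {}                  # sign -> position of its previous occurrence
    distances = defaultdict(list)   # sign -> list of position differences
    for position, sign in enumerate(text, start=1):
        if sign in last_seen:
            distances[sign].append(position - last_seen[sign])
        last_seen[sign] = position
    return dict(distances)

sample = "a study in scarlet"
gaps = symbol_distances(sample)
print(gaps["a"], gaps[" "])
```

An immediately repeated sign, such as "aa", gives the distance 1 in this convention.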

Of all the distributions implemented in the software, only four gave significant results: the exponential distribution, the Weibull distribution, the lognormal distribution, and the negative binomial distribution, as before.

The actual values (mean, standard deviation, skewness, kurtosis, distribution parameters etc.) are of little interest, since they differ considerably.

Results

The distances between full stops (points) determine the lengths of sentences. There are 2393 full stops, mostly used as punctuation marks, except in some abbreviations (e.g. M.D.).

The distribution is of Weibull type, a = 1.75916, b = 108.163.
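A quick plausibility check of these parameters can be sketched as follows. The sketch assumes that a is the shape and b the scale parameter of the Weibull distribution, which the text does not state explicitly; the function names are illustrative. Under that reading, the mean of the fitted distribution can be compared with the rough average spacing of full stops, 230112 signs divided by 2393 full stops.

```python
import math

def weibull_cdf(x, shape, scale):
    """P(distance <= x) for a Weibull distribution."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-((x / scale) ** shape))

def weibull_mean(shape, scale):
    """Mean of the Weibull distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)

a, b = 1.75916, 108.163        # fitted values quoted in the text
fitted_mean = weibull_mean(a, b)
rough_mean = 230112 / 2393     # signs in the file / number of full stops
print(round(fitted_mean, 1), round(rough_mean, 1))
```

Both values come out near 96 signs, which supports, but does not prove, the assumed reading of a and b.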

Chi-square = 10.4418 for distances over 52; the significance level was 0.72929 with 14 degrees of freedom. There is a shortage of full stops at distances 68-82 (237 occurrences against 265 expected). This alone makes 28.4 % of the chi-square test value.
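The share of one range in the chi-square value can be reproduced approximately from the quoted counts. The sketch below assumes that the range 68-82 is treated as a single cell and uses the rounded figures from the text, so the result lands close to the reported 28.4 % rather than exactly on it. The function names are illustrative.

```python
def chi_square_term(observed, expected):
    """Contribution of one cell to the chi-square statistic."""
    return (observed - expected) ** 2 / expected

def share_of_chi_square(observed, expected, total_chi_square):
    """Percentage of the total chi-square value due to one cell."""
    return 100.0 * chi_square_term(observed, expected) / total_chi_square

share = share_of_chi_square(237, 265, 10.4418)
print(f"{share:.1f} %")
```

The same calculation underlies the percentages quoted throughout the commentary on individual signs.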

The other punctuation mark, the semicolon (108 occurrences), is distributed as follows:

Chisquare Test

Lower Limit       Upper Limit   Observed    Expected    Chisquare
                                Frequency   Frequency
at or below       667.619       39          38.3        0.01274
667.619           1334.238      23          23.4        0.00610
1334.238          2000.857      12          13.1        0.08969
2000.857          2667.476      10          8.2         0.41749
2667.476          3334.095      4           5.5         0.39519
3334.095          4667.333      7           6.7         0.01289
4667.333          6667.190      5           5.1         0.00387
above 6667.190                  8           7.8         0.00706

Chisquare = 0.945022 with 5 d.f. Sig. level = 0.966877
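The summary line can be checked from the table itself. The sketch below adds up the per-cell chi-square contributions and converts the sum to a significance level. It assumes the usual rule that the degrees of freedom equal the number of cells minus one minus the number of fitted parameters (two for the lognormal distribution), which agrees with the 5 d.f. reported. The upper-tail probability is computed with a plain series for the incomplete gamma function so that no statistical library is needed; the function name is illustrative.

```python
import math

def chi_square_sf(x, df):
    """Upper-tail probability of the chi-square distribution.

    Uses the power series of the regularized lower incomplete gamma
    function, which is adequate for the moderate values met here.
    """
    a, half_x = df / 2.0, x / 2.0
    if half_x <= 0:
        return 1.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= half_x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return 1.0 - lower

# Per-cell contributions copied from the semicolon table.
terms = [0.01274, 0.00610, 0.08969, 0.41749, 0.39519, 0.01289, 0.00387, 0.00706]
chi2 = sum(terms)
df = len(terms) - 1 - 2        # cells - 1 - fitted lognormal parameters
level = chi_square_sf(chi2, df)
print(round(chi2, 4), df, round(level, 4))
```

Recomputing the contributions from the rounded observed and expected frequencies gives a slightly different sum, because the expected values in the table are rounded to one decimal place.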

This is an example of an almost perfect lognormal distribution.

There are many commas. Therefore, the file was split into 6 parts. The distribution of distances was exponential, with a different quality of fit in each part, as tabulated:

Part    Number    Chisquare   Note
1       475       0.6716      Peak 85-120, valley 155-190
2       476       0.2991      Peak 107-127
3       483       0.5055      Peak 86-106
4       477       0.0802      Peak 115-143
5       476       0.6593      Over 100, lower tail
6       478       0.0287      Peak 71-125

There are no immediately repeated commas (",,"), which contributes 21.2-50.78 % of the chi-square test value.

The two way sample analysis shows how the parts differ:

Part    2         3         4         5         6
1       0.355     0.091     0.122     0.374     0.177
2                 0.397     0.480     0.977     0.656
3                           0.447     0.364     0.654
4                                     0.621     0.761
5                                               0.621

Note: The asterisk shows the statistically significant difference between tested parts.

The use of commas does not differ greatly between the parts.

The spacebar

The distances between consecutive spaces that are greater than 1 determine the number of words whose length equals the distance minus one. There are 40931 spaces, without corrections. Some of them are used for formatting. The results are tabulated as follows. Pooling the frequencies of the shorter distances improved the fit in some cases, since below the cut the counts are scattered and the differences can balance each other.

Table 2. The number of words of different lengths

Length  Number  Type of distribution, chisquare value
1       1593    LN, 0.305
2       6427    LN (divided)
3       8109    LN (divided)
4       6309    EX (divided)
5       4364    EX (divided)
6       3237    NB, 0.024, over 8 = 0.810
7       2692    NB, 0.053, over 4 = 0.230
8       2002    NB, 0.137, over 8 = 0.455
9       1540    NB, 0.126, over 6 = 0.871
10      1043    NB, 0.072, over 25 = 0.605
11      698     EX, 0.358
12      455     EX, 0.251
13      289     WE, 0.676
14      193     WE, 0.686
15      122     WE, 0.097
16      81      WE, 0.531
17      45      WE, 0.207
18      30      too few data
19      13      too few data
20      4       too few data

The distribution of word lengths seems to have a lognormal shape, but this guess was not tested.
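The way word lengths follow from the distances between consecutive spaces can be sketched as below. The function name is hypothetical, and the sketch makes two simplifying assumptions: only the space sign separates words (line breaks are ignored), and a word before the first space or after the last one is not counted. Punctuation attached to a word counts toward its length, as in the text.

```python
from collections import Counter

def word_lengths_from_spaces(text):
    """Count word lengths as (distance between consecutive spaces) - 1."""
    positions = [i for i, sign in enumerate(text, start=1) if sign == " "]
    gaps = [later - earlier for earlier, later in zip(positions, positions[1:])]
    # A distance of 1 means two adjacent spaces, i.e. no word in between.
    return Counter(gap - 1 for gap in gaps if gap > 1)

sample = " it was a  study "
lengths = word_lengths_from_spaces(sample)
print(sorted(lengths.items()))
```

In the sample, the doubled space between "a" and "study" produces a distance of 1 and is skipped, which mirrors the remark that some spaces serve only as formatting.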

Notes to some results:

The distribution of one-letter words is fitted by the lognormal distribution. There are two peaks, at distances 70-81 and 128-138 (36 and 9 occurrences against 28.4 and 5.6 expected, respectively). Each makes about 14.7 % of the chi-square test value. There is a shortage of distances 13-23 (375 occurrences against 406.9 expected), which contributes about 17.9 % to the chi-square test value. The tail of the distribution is shorter than expected (4 occurrences against 9.1 expected). This makes about 20.7 % of the chi-square test value.

The distribution of two-letter words is fitted poorly by the lognormal distribution. The set was divided into 4 parts. Of these, the second part gives the best fit: the chi-square test value is 0.3346 over 11, 0.8075 over 12, and 0.4011 over 13. These words follow each other more often than corresponds to the shape of the distribution. This makes 52.7-92.7 % of the chi-square test value.

The two way sample analysis shows that only the first and second parts are similar:

Part    2         3         4
1       0.537     *0        *0.010
2                 *0        *0.037
3                           *0.032

The distribution of three-letter words was divided into four parts, too. The parts are fitted by different distributions:

Part  Type  The chi-square test value (over)      Note
1     LN    0.348 (12), 0.553 (13), 0.273 (14)    repetitions make 91.3 %
2     EX    0.130 (16-17)                         peak 3-4 makes 45 %
3     NB    0.148 (11)                            shortage of repetitions makes 43.8 %, peak 3-4 makes 39.1 %
4     LN    0.036 (12), 0.495 (13), 0.147 (14)    repetitions make 66.5 %

The two way sample analysis shows that only the second and fourth parts are similar:

Part    2         3         4
1       *0        *0        *0
2                 *0        0.688
3                           *0

The distribution of four-letter words was divided into four parts. These words follow each other more often than corresponds to the shape of the exponential distribution, but not by much: this makes at most 28.2 % of the chi-square test value, in the first part. The parts are fitted as follows:

Part  The chi-square test value (over)      Note
1     0.005 (15), 0.332 (16), 0.141 (17)    shortage of distances 15-18 makes 25.1 %
2     0.258 (9), 0.617 (10), 0.276 (11)     peak 6-8 makes 28.5 %
3     0.257 (12-13)                         peak 6-8 makes 31.7 %
4     0                                     peak 2-6 makes 62.3 %

The two way sample analysis shows that only the second and third parts and the third and fourth ones are similar:

Part    2         3         4
1       *0.003    *0        *0
2                 0.302     *0.041
3                           0.283

The distribution of five-letter words was divided into three parts. These words follow each other slightly more often than corresponds to the shape of the exponential distribution. The parts are fitted as follows:

Part  The chi-square test value (over)    Note
1     0.071 (2), 0.389 (3), 0.148 (4)     shortage of distances 17-32 makes 51.3 %
2     0.002                               peak 6-8 makes 35.8 %
3     0                                   peak 8-11 makes 41.7 %

The two way sample analysis shows that the parts are similar:

Part    2         3
1       0.723     0.250
2                 0.723

The negative binomial fit for six-letter words is fair; it is worsened by short distances (the chi-square test value is 0.681 over 7, 0.810 over 8, 0.420 over 9).

The distribution of seven-letter words is described by the negative binomial distribution (the chi-square test value = 0.0529; it improves somewhat, to 0.2304, over 4). The tail is longer than expected (18 occurrences against 10.1 expected above 82). This makes 24.9 % of the chi-square test value.

The distribution of eight-letter words is also negative binomial (the chi-square test value = 0.137; it improves somewhat, to 0.455, over 4). The tail is again longer (16 occurrences against 9.8 expected over 1.7). This makes 15.6 % of the chi-square test value.

The distribution of nine-letter words is also negative binomial (the chi-square test value = 0.126 improves to 0.871 over 6). The shortage of these words within distances 88-96 (7 occurrences against 13.9 expected) contributes 16.0 % of the chi-square test value.

The distribution of ten-letter words is fitted fairly well by the negative binomial distribution (the chi-square test value = 0.072 improves to 0.605 over 25). The shortage of these words within distances 19-28 (107 occurrences against 132.4 expected) contributes 17.0 % of the chi-square test value. There are more distances 145-172 than expected (23 occurrences against 10.1 expected). This makes 31.9 % of the chi-square test value.

The distribution of eleven-letter words is fitted fairly well by the exponential distribution (the chi-square test value = 0.358). The shortage of these words within distances 160-186 (11 occurrences against 16.5 expected) contributes 18.3 % of the chi-square test value. There are more distances 28-54 than expected (183 occurrences against 159.0 expected). This makes 36.7 % of the chi-square test value.

The distributions of longer words are fitted well by the Weibull distribution. As an example, the results for the 16-letter words are given:

Chisquare Test

Lower Limit       Upper Limit   Observed    Expected    Chisquare
                                Frequency   Frequency
at or below       200.000       29          30.0        0.03323
200.000           400.000       17          16.7        0.00504
400.000           600.000       13          10.7        0.47842
600.000           800.000       7           7.2         0.00345
800.000           1200.000      5           8.2         1.27208
above 1200.000    3374          10          8.2         0.41292

Chisquare = 2.20513 with 3 d.f. Sig. level = 0.530938

Distances between individual letters

The results for all letters are presented in a table giving the frequencies of all symbols and the significance of the chi-square tests performed. Commentaries on all letters of the alphabet follow. The values in square brackets are the corresponding values for the combined lower and upper cases.
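The three frequency columns of such a survey (lower case, upper case, both cases) can be obtained with a simple count. The sketch below is an illustration only; the function name is hypothetical and it covers just the 26 letters of the English alphabet.

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Return {letter: (small, capital, both)} for the letters a-z."""
    counts = Counter(text)
    table = {}
    for letter in string.ascii_lowercase:
        small = counts[letter]
        capital = counts[letter.upper()]
        table[letter] = (small, capital, small + capital)
    return table

freq = letter_frequencies("A Study in Scarlet")
print(freq["a"], freq["s"], freq["t"])
```

With counts produced this way, the third column is always the sum of the first two, which gives a simple consistency check for a tabulated survey.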

Table 7 Survey of results

Notes:

EX = exponential distribution

WE = Weibull distribution

LN = lognormal distribution

NB = negative binomial distribution

* = the test was not made, since there were not enough data

Statistic = XX, the chi-square test

Symbol  Small               Capital             Both
a       14387, EX, 0        251, WE, 0.913      14640, EX, NB, 0
b       2429, EX, 0         113, WE, 0.574      2542, WE, 0
c       4403, NB, 0.300     126, WE, 0.932      4524, NB
d       8210, NB, EX        146, LN, 0.048      8356, NB
e       22812, NB, 0        84, WE, 0.465       22895, NB, 0
f       3773, WE, 0         269, WE, 0.008      4042, WE, 0
g       3494, WE, 0.296     99, WE, 0.924       3593, WE
h       11954, NB, 0        445, WE, 0.371      12399, WE, NB
i       1152, LN, 0.137     1180, LN, 0.128     12332, EX, 0
j       127, WE, 0.576      108, WE, 0.031      235, WE, 0.346
k       1296, WE, 0.033     10, no test         1306, WE, 0.041
l       6797, WE            173, EX, 0.194      4970, WE
m       4569, EX            164, WE, 0.458      4733, EX
n       12201, EX           304, WE, 0.072      12505, EX
o       13843, EX           101, WE, 0.077      13944, EX
p       2867, WE            69, WE, 0.356       2936, WE
q       136, EX, 0.441      2, no test          138, EX, 0.433
r       10793, EX           204, WE, 0.109      10997, EX
s       12680, WE, EX       262, WE, 0.878      12942, WE, EX
t       15486, EX           525, WE, 0.4522     16011, EX
u       5047, EX            193, WE, 0.455      5076, EX
v       1735, WE            11, no test         1747, WE, 0.020
w       4335, EX, WE        260, WE, 0.709      4595, EX
x       278, WE, 0.130      no test             -
y       3349, EX            323, WE, 0.267      3672, EX
z       EX                  no test             133, EX, 0.161

For the upper case, the Weibull distribution is the best one for 16 letters. The lognormal distribution fits only 2 cases, the exponential distribution is the best in 3 of the performed tests, and the negative binomial distribution in no case.

For the lower case, the Weibull distribution is the best one for 8 letters. The lognormal distribution fits only 1 case, the exponential distribution is the best in 13 of the performed tests, and the negative binomial distribution in 4 cases. For the combined cases, the Weibull distribution is the best one for 10 letters. The lognormal distribution fits no case, the exponential distribution is the best in 12 of the performed tests, and the negative binomial distribution in no case. Sometimes the difference between the fits is small and more than one distribution is applicable. The chi-square values are sometimes practically zero, and only raising the lowest distance considered, by pooling the shorter distances, increases the significance of the chi-square tests. The commentaries on the individual letters follow.
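The effect of pooling the shortest distances can be illustrated with a small sketch. It assumes that an entry such as "over 8" means that all cells up to that distance are merged into one cell before the chi-square value is computed. The observed and expected counts below are invented for the illustration and are not taken from the novel; the function names are hypothetical.

```python
def pool_below(cells, cut):
    """Merge all cells with upper limit <= cut into a single cell.

    cells: list of (upper_limit, observed, expected), sorted by upper_limit.
    """
    pooled_observed = sum(o for u, o, e in cells if u <= cut)
    pooled_expected = sum(e for u, o, e in cells if u <= cut)
    rest = [(u, o, e) for u, o, e in cells if u > cut]
    return [(cut, pooled_observed, pooled_expected)] + rest

def chi_square(cells):
    """Chi-square statistic summed over the cells."""
    return sum((o - e) ** 2 / e for _, o, e in cells)

# Hypothetical counts: no immediate repeats, a surplus at distances 2 and 3.
cells = [(1, 0, 30.0), (2, 40, 28.0), (3, 35, 26.0), (4, 22, 24.0), (5, 20, 22.0)]
pooled = pool_below(cells, 3)
before, after = chi_square(cells), chi_square(pooled)
print(round(before, 2), round(after, 2))
```

In this example the deficit at distance 1 and the surplus at distances 2 and 3 largely cancel once the cells are merged, so the statistic drops sharply. Pooling also reduces the number of cells, and with it the degrees of freedom.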

A

The frequency of the capital A allowed a separate test. A fair result was obtained with the exponential distribution (the chi-square test value 0.378). The excellent fit with the Weibull distribution (the chi-square test value 0.913) is worsened by too many occurrences within distances 2376 to 2850 (12 occurrences against 7.4 expected), which makes 26.0 % of the chi-square test value.

The distribution of distances between the lower case a is exponential, except that there are practically no repeated aa. This fact contributes 58.4-77.3 % to the chi-square value. The lower case a repeats too often within distances 6-14 (1.185-1.336 times the expected values).
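A rough calculation shows why the missing aa weighs so heavily. The sketch below assumes an exponential distribution whose mean is approximated by the text length divided by the number of occurrences, and it treats "distance at or below 1" as the first cell. Both are simplifications: the fits in the paper were made per part, and the cell limits used by the software are not stated. The function name is hypothetical.

```python
import math

def expected_short_distances(n_distances, mean_distance, upper=1.0):
    """Expected number of distances <= upper under an exponential fit."""
    return n_distances * (1.0 - math.exp(-upper / mean_distance))

n_signs = 230112      # signs in the file, including spaces
n_a = 14387           # lower case a
mean_gap = n_signs / n_a
expected_aa = expected_short_distances(n_a - 1, mean_gap)
print(round(mean_gap, 1), round(expected_aa))
```

Under these assumptions several hundred immediate repeats would be expected over the whole text, so a cell with practically none contributes a large share of the chi-square value.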

The two way sample analysis shows how the parts of the lower case differ:

Part    2         3         4         5         6
1       0.368     *0.034    *0.026    *0.000    0.390
2                 0.217     0.178     *0.008    0.956
3                           0.895     0.160     0.191
4                                     0.209     0.155
5                                               *0.006

Note: The asterisk shows the significant difference between tested parts.

The first sixth differs significantly from the third to fifth ones. The fifth differs significantly from the second and sixth, too.

The most important disturbances from the shape of the distribution in all parts are tabulated:

Part    Range     Observed  Expected  % of chisquare
1       6-20      1285      1015.5    29.7
2       6-14      997       728.4     18.5
3       6-9       470       390.1     8.5
4       6-9       483       390.9     11.7
5       7-12      600       453.6     20.5
6       14-17     293       183.9     25.5

The lower case a repeats too often within one to three words.

The distances between both cases (a + A) are fitted poorly by different distributions. Again, there are practically no repeated Aa. This fact contributes 58-77.6 % to the chi-square value.

The first sixth of a fits well with the negative binomial distribution with distances pooled up to 16 (the chi-square value = 0.592). Other parts give much worse fits, and other distributions (the exponential distribution and the negative binomial distribution) give a better fit.

The two way sample analysis of both cases (a + A) gives worse results than that of the lower case a:

Part    2         3         4         5         6
1       0.457     *0.025    *0.023    *0.000    0.245
2                 0.131     0.120     *0.004    0.679
3                           0.943     0.162     0.263
4                                     0.193     0.242
5                                               *0.011

The first sixth differs significantly from three parts (the third to fifth), and its consistency with the other parts is low, too. The most important disturbances in all parts are tabulated again:

Part    Type    Range     Observed  Expected  % of chisquare
1       NB      25-27     74        87.9      12.2
                49-54     12        24.7      36.3
2       NB      6-18      1210      1017.8    17.1
3       EX      6-21      1337      1135.1    19.6
4       EX      6-13      844       713.2     14.0
5       LN      29-39     221       172.1     29.3
6       EX      14-17     300       187.1     25.9

B

The distribution of distances between upper case B is Weibull. The distribution of distances over 20 between lower case b is exponential; the chi-square test value is then 0.614. There are too few b within distances 129-150 (106 occurrences against 128 expected), which contributes 19.3 % of the chi-square test value. Conversely, there are too many b within distances 282-324 (63 occurrences against 46 expected), which contributes 34.3 % of the chi-square test value. The distribution of distances over 20 between (b + B) is exponential; the chi-square test value is then 0.921. But here the Weibull distribution gives an even better chi-square test value, 0.927. The fit is worsened by too many (b + B) within distances 295-316 (31 occurrences against 20.7 expected), which contributes 40.9 % of the chi-square test value. There are too few (b + B) within distances 422-442 (1 occurrence against 5.4 expected), which contributes 28.5 % of the chi-square test value.

Including B improved the fit, the disturbances lessened and shifted to longer distances.

C

The distribution of distances between upper case C is the Weibull one (the chi-square test value is very good, 0.932).

The distribution of distances of the lower case of this letter (and c + C) is described well by three distributions, exponential, negative binomial and Weibull.

The distances between lower case c were split into 3 parts. The results are tabulated:

Part  Type     Chisquare             Range     Observed  Expected  % of chisquare
1     NB       0.298, 0.817 over 5   76-87     57        68.8      14.9
2     EX, NB   0.365, 0.349          191-238   15        23.2      25.3
3     NB, EX   0.954, 0.954          146-169   45        36.9      39.9

The parts are rather different, as two way sample analysis shows:

 

          2. part   3. part
1. part   *0.043    *0.000
2. part             0.137

The distances between (c + C) were split into 3 parts, too. The results are tabulated as follows:

Part  Type     Chisquare             Range     Observed  Expected  % of chisquare
1     NB       0.488                 72-83     53        68.4      18.7
                                     203-226   3         8.2       17.7
2     EX, NB   0.284, 0.230          49-72     245       217.6     26.3
3     EX, NB   0.761, 0.769          146-169   45        36.6      25.9
                                     265-598   6         11.1      31.1

The parts are rather different, as two way sample analysis shows:

 

2. part

          2. part   3. part
1. part   *0.045    *0.000
2. part             0.118

Combining both cases worsened the fit. It is difficult to choose between the exponential distribution and the negative binomial distribution; both give practically identical results.

D

Here the exponential distribution and the negative binomial are applicable. The chi-square test values are as follows:

Part        Exponential        Negative binomial
d1          0
d2          over 20 = 0.247
d3                             over 33 = 0.354
d4          over 30 = 0.329
[d + D]1    0
[d + D]2    over 19 = 0.763
[d + D]3    over 31 = 0.395
[d + D]4    over 22 = 0.683

The frequency of the capital D allowed a separate test. The lognormal distribution fits poorly; the chi-square value is only 0.048, since there are too many occurrences within distances 1274 to 1909 (20 occurrences against 12.6 expected), which makes 34.3 % of the chi-square test value. The tail is shorter than expected, only 1 occurrence against 5.1 expected, which contributes another 25.9 % of the chi-square test value.

There are too few repeating dd (Dd). This fact contributes 42.4 - 70.5 % (32.3-63.8 %) to the high chi-square values given in the table above.

The two way sample analysis shows that the parts of the lower case d are different:

Part    2         3         4
1       0.587     *0.018    0.050
2                 0.072     0.163
3                           0.686

The third part differs significantly from the first part. Only the third and the fourth parts are similar.

There are always fewer doubled dd than corresponds to the exponential form, which makes 42-70.5 % of the chi-square test value.

The combined [d + D] gives somewhat different results. The two way sample analysis shows that the parts of [d + D] are different, too:

Part    2         3         4
1       0.316     *0.022    *0.016
2                 0.208     0.163
3                           0.873

The third and fourth parts differ significantly from the first part. Only the third and fourth parts are close.

There are always fewer doubled Dd than corresponds to the exponential form (0-10 occurrences against 23.5-37 expected), which makes 32.3-63.8 % of the chi-square test value.

E

There are relatively few E compared with the great number of e. The distribution of distances between lower case e and both cases (e + E) is mostly negative binomial; some parts fit the lognormal or exponential distributions better:

Part        Negative binomial
e1          over 15 = 0.538
e2          over 12 = 0.137
e3          over 12 = 0.054
e4          EX
e5          0
e6          over 12 = 0.063
e7          over 14 = 0.066
e8          over 20 = 0.093
[e + E]1    over 15 = 0.529
[e + E]2    over 15 = 0.135
[e + E]3    over 14 = 0.052
[e + E]4    0
[e + E]5    over 17 = 0.112
[e + E]6    0
[e + E]7    over 13 = 0.102
[e + E]8    over 17 = 0.131

The two way sample analysis failed due to too large samples.

F

The distribution of the capital F, of the lower case f, and of [f + F] is fitted by the Weibull distribution. The sets of the lower case f and of [f + F] were each divided into two parts, which are rather different (the two way sample analysis gives 0.0002 and 0.0048, respectively).

The distribution of this letter is distorted by too many double ff [Ff] (e.g. 96 occurrences against 28.7 expected). This makes 84.9 % of the total very high chi-square test value.

G

The distribution of the capital G is fitted by the Weibull distribution rather well. Including it affects the distribution of the lower case g, which was divided into two parts, differently in each part:

Part      The chi-square test value
g         0.320     0.709
g + G     0.296     0.024

The most important distortions:

Part      Range     Observed  Expected  % of chisquare
g1        19-36     325       295.5     15.4
          277-294   11        6.1       20.9
g2        71-117    278       319.5     40.3
[g+G]1    277-294   11        5.9       32.9
[g+G]2    88-104    80        114.1     31.5

H

The distribution of the capital H is correlated with the Weibull distribution rather well:

Chisquare Test

Lower Limit       Upper Limit   Observed    Expected    Chisquare
                                Frequency   Frequency
at or below       212.074       169         160.5       0.454082
212.074           423.148       96          95.3        0.005133
423.148           634.222       58          61.7        0.226729
634.222           845.296       36          40.9        0.585048
845.296           1056.370      26          27.4        0.073882
1056.370          1267.444      26          18.5        2.992872
1267.444          1478.519      8           12.6        1.695590
1478.519          1689.593      5           8.6         1.533205
1689.593          1900.667      5           5.9         0.147582
1900.667          2322.815      7           6.9         0.000840
above 2322.815                  9           6.5         0.957956

Chisquare = 8.67292 with 8 d.f. Sig. level = 0.370635

The surplus of distances 1057-1267 is followed by the shortage of longer distances.

The frequency of h made it necessary to split the set into four parts for the evaluation. They fitted the negative binomial distribution badly (the 1st part has the chi-square value 0.315 over 30), and they were still too long for the two way sample analysis. [h + H] was split for the evaluation into six parts, which also fitted the negative binomial distribution badly (e.g. the 3rd part has the chi-square value 0.115 over 27).

The two way sample analysis shows how the parts are different:

Part    2         3         4         5         6
1       *0.001    *0        *0        *0        *0
2                 0.780     *0        *0        0.370
3                           *0        *0        0.533
4                                     *0.005    *0
5                                               *0

I

The distribution of the capital I is fitted poorly by the lognormal distribution. The greatest disturbance is a shortage of counts within distances 305-607 (102 occurrences against 125 expected), which contributes 39.1 % of the chi-square test value. The tail is longer over distances 1516 (17 occurrences against 10.2 expected), which contributes another 49.5 % of the chi-square test value.

The frequency of the lower case i made splitting necessary. The parts are fitted poorly by the exponential distribution, the best being the 5th part (the chi-square test value 0.701 over 5), and they pass the two way sample analysis as follows:

Part    2         3         4         5         6
1       *0.014    *0.026    0.068     *0.015    *0.002
2                 0.815     0.635     0.947     0.441
3                           0.631     0.864     0.316
4                                     0.506     0.132
5                                               0.397

Only the first part differs significantly from the others, since the result with the fourth part is only slightly above the limit of rejection. There are no repeating ii. This makes 55.6-78.2 % of the chi-square test value. The distribution is more skewed, there exists always a surplus of intermediate distances:

Part    Range     Observed  Expected  % of chisquare
i1      6-21      860       744.5     15.1
i2      7-18      709       567.2     23.3
i3      7-28      954       803.1     23.1
i4      7-26      950       871.8     6.2
i5      7-31      1077      979.5     11.8
i6      8-28      930       795.7     23.1

Including I changed the results of the two way sample analysis as follows:

Part    2         3         4         5         6
1       *0.001    *0.002    *0        *0        *0.022
2                 0.756     0.176     0.169     0.230
3                           0.091     0.084     0.453
4                                     0.984     *0.017
5                                               *0.014

In most cases, the similarity is worse. Only the fourth and fifth parts are less different. There are no repeating Ii. This makes 60.5-70.1 % of the chi-square test value. The distribution is more skewed, there exists always a surplus of intermediate distances:

Part      Range     Observed  Expected  % of chisquare
[i+I]1    5-14      735       604.5     21.6
[i+I]2    7-12      495       371.2     24.0
[i+I]3    6-15      728       616.0     13.8
[i+I]4    8-21      831       687.9     21.3
[i+I]5    6-30      1256      1084.7    18.2
[i+I]6    7-23      1028      861.2     24.0

J

The distribution of this letter is the Weibull one. The Weibull fit for the lower case j is better than for both cases [j + J]. There are too many distances 874-1310 (22 occurrences against 16.7 expected). This makes 25.0 % of the chi-square test value. The distances 2619-3055 deviate as well (10 occurrences against 6.4 expected). This makes 31.2 % of the chi-square test value. Combining both cases worsened the fit. There are too many distances 963-1284 (52 occurrences against 40.4 expected). This makes 44.2 % of the chi-square test value.

K

The Weibull fit for this letter is bad. There are no repeated kk [Kk]. This makes 18.5 [19.3] % of the chi-square test value.

L

The occurrences of the capital L are fitted by the exponential distribution. There are too many distances 2653-3173 (14 occurrences against 11.3 expected). This makes 54.1 % of the chi-square test value.

The frequency of l and [l + L] made splitting necessary.

The parts are fitted by the Weibull distribution. It is distorted by many double ll [Ll]. This makes 74.8-79.9 % [74.8-81.2 %] of the total chi-square test value. Above different cut distances, the parts fit rather well, see table:

Part      Cut     Chisquare
l 1       11      0.903
2         35      0.967
3         24      0.967
4         35      0.208
5         20      0.097
l+L 1     11      0.987
2                 0
3         30      0.925
4         36      0.092
5         10      0.579

The parts of l pass the two way sample analysis, as follows:

Part    2         3         4         5
1       0.850     0.518     *0.013    *0.041
2                 0.647     *0.023    0.065
3                           0.074     0.173
4                                     0.670

The fourth part differs significantly from the first and second ones, the first from the fifth one.

Including L changed the results only slightly, see table:

Part    2         3         4         5
1       0.831     0.430     *0.012    *0.039
2                 0.567     *0.022    0.067
3                           0.095     0.219
4                                     0.661

M

The upper case M is fitted well by the Weibull distribution (the chi-square test value is 0.458).

The lower case m, divided into 3 parts, is best fitted by the exponential distribution (1st part: the chi-square test value over 44 is 0.798; 2nd part: the chi-square test value = 0.443; 3rd part: the chi-square test value = 0.137). The doubled mm fits excellently only in the second part; in the other parts, repeated mm is scarcer than expected.

The parts of the distribution of m are different. The two way sample analysis shows following results:

Part    2         3
1       *0.001    *0.004
2                 0.712

The combined [m + M], divided into 3 parts, is likewise best fitted by the exponential distribution (1st part: the chi-square test value over 14 is 0.849; 2nd part: the chi-square test value = 0.096; 3rd part: the chi-square test value = 0.082). The doubled Mm fits only in the second part; in the other parts, it is scarcer than expected.

The parts of the distribution of [m + M] are different, too. The two way sample analysis shows following results:

Part    2         3
1       *0.006    *0.004
2                 0.835

N

The upper case N is correlated using the Weibull distribution (the chi-square test value is 0.072). There are too few distances 1600-2000 (7 occurrences against 13.3 expected). This makes 25.6 % of the chi-square test value.

The distribution of n and (n + N) was divided into seven parts.

The distribution of this letter is distorted by too few double nn [Nn] (e.g. 10 occurrences against 93 expected). This makes 44.0-67.9 % of the total very high chi-square test value. In some parts there are rather large disturbances:

Part    Range     Observed  Expected  % of chisquare  Chisquare
1       6-10      419       283.0     38.8            0.152 over 10
3       18-22     202       134.8     28.2            0.242 over 20
4       6-9       330       265.5     13.8            0.073 over 16
5       6-16      644       736.6     15.2            0.114 over 10
6       6-22      944       784.7     27.0            0.117 over 37

The two way sample analysis shows following results:

Part    2         3         4         5         6         7
1       *0.012    0.097     0.355     0.882     0.780     0.146
2                 0.388     0.107     *0.008    0.107     0.274
3                           0.457     0.071     0.457     0.823
4                                     0.284     0.508     0.598
5                                               0.667     0.110
6                                                         0.230

[n + N]:

In some parts there are rather large disturbances:

Part    Range     Observed  Expected  % of chisquare  Chisquare
1       6-10      425       288.5     32.8            0.094 over 9
2       6-10      359       273.8     18.2            0.074 over 31
3       18-22     198       136.3     24.4            0.736 over 25
4       6-9       332       269.0     12.6            0.171 over 35
5       7-16      657       545.4     21.3            0.114 over 10
6       6-22      958       795.9     26.5            0.138 over 37

The disturbances have slightly less weight than for the lower case n.

The two way sample analysis shows for [n + N] the following results:

Part    2         3         4         5         6         7
1       *0.014    0.169     0.209     0.987     0.595     0.078
2                 0.278     0.234     *0.015    0.051     0.480
3                           0.911     0.170     0.384     0.701
4                                     0.209     0.457     0.621
5                                               0.589     0.080
6                                                         0.211

Both sets are alike; including N did not change the results of the two way sample analysis dramatically. The second part differs significantly from the first and fifth ones.

O

The distribution of O can be correlated also with the Weibull distribution (the chi-square test value 0.077). There are too many distances 3572-4714 (14 occurrences against 7.3 expected). This makes 35.1 % of the chi-square test value.

The distribution of o and (o + O) was divided into seven parts, which correlated poorly with the exponential distribution.

The distribution of this letter is distorted by too few double oo [Oo] only slightly (at most in the first part of o, 72 occurrences against 119 expected). This makes 40.6 % of the total very high chi-square test value. Here are the greatest disturbances:

Part    Range     Observed  Expected  % of chisquare  Chisquare
1       7-11      427       369.2     19.8            0.122 over 25
2       2-5       336       420.8     19.0            0.089 over 17
        6-9       395       325.7     16.4
        14-22     421       346.1     18.8
3       2-4       233       334.1     25.9            0.117 over 10
        11-14     262       184.6     27.5
4       7-18      778       653.0     36.5            0.652 over 26
5       14-22     522       346.3     31.8            0.084 over 20
6       2-5       310       403.2     22.7
        14-22     432       344.4     27.3
7       14-22     408       345.8     18.2            0.155 over 30

In four parts, an excess of distances 14-22 occurs.

The two way sample analysis shows following results:

Part    2         3         4         5         6         7
1       0.608     0.276     0.149     0.390     *0.042    0.656
2                 0.565     0.342     0.726     0.124     0.941
3                           0.689     0.824     0.323     0.511
4                                     0.541     0.575     0.302
5                                               0.231     0.669
6                                                         0.103

The first part differs significantly from the sixth one.

[o + O]:

In some parts are rather great disturbances:

Part  Range    Observed  Expected  % of chi-square  Chi-square
1     6-9      404       328.7     22.5             0.162 over 31
      18-22    205       150.6     18.4
2     39-43    69        42.1      18.2             0.074 over 31
3     6-9      378       318.6     17.2             0.248 over 45
      27-30    125       91.7      18.7
4     7-12     480       378.7     31.2
5     14-22    426       344.6     31.0             0.128 over 25
6     2-5      308       402.7     21.0
      14-22    432       343.3     24.3
7     14-22    411       344.1     18.4

 

The excess of distances 14-22 occurs again. The disturbances have slightly less weight than for the lower case o.

The two way sample analysis shows for [o + O] the following results:

Part  2       3       4       5       6       7
1     0.443   0.128   0.114   0.229   0.050   0.5378
2             0.449   0.456   0.770   0.229   0.873
3                     0.930   0.640   0.653   0.354
4                             0.584   0.722   0.319
5                                     0.359   0.648
6                                             0.169

There is no significant difference between parts.

P

The upper case P also correlated with the Weibull distribution (the chi-square test value = 0.356). The Weibull distribution of p and [p + P] is distorted by too many doubled pp [Pp] (69.8-80.8 [72.2-73.6] % of the chi-square test value). The other disturbances offer little to comment on. Both sets were divided into three parts. The test did not reveal any dissimilarity between them.

Q

This letter also correlated with the Weibull distribution (the chi-square test value = 0.367 for q, 0.389 for [q + Q]), but the exponential distribution gives a better fit (the chi-square test value = 0.441 for q, 0.433 for [q + Q]).

R

The upper case R correlated with the Weibull distribution (the chi-square test value = 0.109). The fit is worsened by too many repetitions within distances 1000 to 2000 (45 occurrences against 33.3 expected), which makes 40.4 % of the chi-square test value.

The distributions of r and [r + R] were divided into five parts. They fit the exponential distribution.

The most important disturbances from the shape of the distribution in all parts of r are tabulated:

Part  Range    Observed  Expected  % of chi-square  Chi-square
1     1        31        91.9      25.1             0.103 over 24
      2-5      263       401.5     29.8
      6-20     973       791.6     26.9
2     1        36        94.5      30.2             0.097 over 20
      8-21     831       689.8     26.5
3     1        45        105.6     23.8             0.195 over 34
      10-18    628       471.9     41.4
4     2-5      305       464.5     48.0
5     10-18    570       454.1     32.2

 

The two way sample analysis shows that the parts of the lower case r are rather different:

Part  2       3       4       5
1     0.351   *0      *0      0.178
2             *0      *0      0.671
3                     0.334   *0
4                             *0

The third and fourth parts differ significantly from all other parts, but not from each other (0.334).

The most important disturbances from the shape of the distribution in all parts of [r + R] are tabulated:

Part  Range    Observed  Expected  % of chi-square  Chi-square
1     1        31        99.8      24.4             0.099 over 19
      2-5      254       393.3     31.2
      6-20     957       779.8     25.9
2     1        36        93.2      31.3
      8-21     831       682.7     27.1
3     1        45        104.0     27.6             0.239 over 17
      10-18    617       467.0     41.5
4     2-5      297       456.5     48.7
5     10-18    562       448.8     31.6

 

The two way sample analysis shows that the parts of [r + R] are rather different:

Part  2       3       4       5
1     0.220   *0      *0      0.115
2             *0      *0      0.716
3                     0.400   *0
4                             *0

The third and fourth parts differ significantly from all other parts. Combining r with R did not change the distribution significantly.

S

The Weibull distribution of the capital S is distorted mostly by a valley within distances 1600-2000 (10 occurrences against 14.2 expected). This makes 50.8 % of the chi-square test value.

The distributions of the lower case s and [s + S] were divided into five parts.

The most important disturbances from the shape of the distribution in all parts of s are tabulated:

Part  Range    Observed  Expected   % of chi-square  Chi-square
1     1        115       70.1 (WE)  44.1             WE 0.669 over 33, EX 0.569 over 30
      26-33    171       226.8      21.1
2     2-5      285       419.6      44.5             EX 0.921 over 22
      16-20    284       210.3      26.6
3     1        96        63.7       25.1             WE 0.105 over 18
      2-6      294       374.6      26.5
4     2-6      338       455.2      36.2
      7-12     453       353.1      38.6
5     2-7      452       529.0      26.2             EX 0.155 over 10

The two way sample analysis shows that the parts of the lower case s are rather similar, except the fifth part which differs from the first and third parts:

Part  2       3       4       5
1     0.300   0.845   0.130   *0.019
2             0.216   0.620   0.175
3                     0.087   *0.011
4                             0.389

The most important disturbances from the shape of the distribution in all parts of [s + S] are tabulated:

Part  Range    Observed  Expected  % of chi-square  Chi-square
1     1        115       72.1      40.1             WE, 0.234 over 32
      26-33    170       231.0     25.5
2     2-5      303       376.7     22.9             WE, 0.273 over 32
3     2-5      310       450.2     48.7             EX, 0.091 over 28
      11-15    348       276.8     20.4
4     2-6      355       474.7     32.8             EX, 0.531 over 31
      7-12     474       366.3     34.4
5                                                   no test

The two way sample analysis shows that the parts of [s + S] are rather similar:

Part  2       3       4       5
1     0.215   0.833   0.59    0.188
2             0.306   0.502   0.241
3                     0.094   0.197
4                             0.275


T

The distribution of the capital T has the Weibull shape. There are more distances 544-679 than expected (54 occurrences against 43.3 expected). This makes 26.2 % of the chi-square test value.

The distribution of the lower case t, as well as that of [t + T], is divided into six parts.

The most important disturbances from the shape of the exponential distribution in all parts of t are tabulated:

Part  Range    Observed  Expected  % of chi-square  Chi-square
1     1        82        193.2     46.1             0.429 over 31
      5-16     1410      1204.3    34.7
2     1        72        203.3     44.2             0.461 over 24
      5-16     1513      1226.8    35.2
3     1        82        204.4     50.2             0.119 over 47
      23-29    279       225.9     10.4
4     1        64        199.8     38.0             0.490 over 36
      5-20     1808      1448.8    37.3
5     1        61        208.1     58.6             0.176 over 33
      5-16     1462      1238.1    23.8

 

There are too few doubled tt in all parts. Moreover, the shape of the distribution is rather sharp in the range 5-20, except in the third part.

The most important disturbances from the shape of the distribution in all parts of [t + T] are tabulated:

Part  Range    Observed  Expected  % of chi-square  Chi-square
1     1        82        205.6     44.4             0.117 over 25
      5-16     1481      1243.5    29.8
2     1        73        219.1     42.6             0.383 over 25
      5-16     1604      1272.4    38.4
3     1        82        218.2     54.2             0.039 over 21
      5-16     1590      1384.1    22.1
4     1        64        213.1     36.9             0.205 over 45
      5-20     1896      1498.6    36.4
5     1        60        221.0     54.1             0.075 over 37
      5-16     1533      1276.5    26.0

 

The results are not changed too much.

U

The distribution of the capital U has the Weibull shape.

The sets u and [u + U] were each divided into two parts, which are similar (the two sample analysis 0.326 [0.318]). There are no doubled uu or Uu (0 [0] occurrences against 54.0-55.6 [54.6, 56.2] expected). This makes 55.4, 44.4 [56.4, 44.1] % of the chi-square test value of the exponential distribution. When the lower limit is set to 30 for the first u set [28 for the first u + U set], the chi-square test value of the first part of u [u + U] is improved to 0.701 [0.865]. The second parts give poorer fits.
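How an expected count for a range of distances could follow from a fitted exponential distribution is sketched below. This is only an illustration: the bin limits (0 to 1 for doubled letters), the function name expected_count and the numbers in the usage lines are assumptions, not values taken from Statgraphics or from this analysis.

```python
import math


def expected_count(n_distances, mean_distance, lower, upper):
    """Expected number of distances in (lower, upper] under an
    exponential distribution with the given mean.

    Assumption: the bin is treated as a continuous interval; the
    software used in the paper may place the bin edges differently.
    """
    p = math.exp(-lower / mean_distance) - math.exp(-upper / mean_distance)
    return n_distances * p


# Hypothetical part with 1000 distances and a mean distance of 10 signs.
doubled = expected_count(1000, 10.0, 0.0, 1.0)      # distance 1, i.e. a doubled letter
tail = expected_count(1000, 10.0, 30.0, math.inf)   # all distances above 30
```

A letter that is never doubled leaves the first bin empty, so under this model the whole expected count for that bin goes into the chi-square sum. Raising the lower limit removes that bin from the fit.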

V

The Weibull distribution of [v + V] is poor. There are no doubled vv (0 occurrences against 9.8 expected). This makes 46.2 % of the chi-square test value. The tail is longer (18 occurrences over 634 against 10.7 expected). This makes another 23.9 % of the chi-square test value.

W

The Weibull distribution of the upper case W gives a good fit. There is a shortage of the distances 1840-2160 (5 occurrences against 9.4 expected). This small difference alone makes 44.5 % of the chi-square test value.

The sets w and [w + W] were divided into three parts. The exponential distribution of w gives a fair fit, see the following table:

Part  Range    Observed  Expected  % of chi-square  Chi-square
1     1        2         25.6      57.3             0.758 over 20
      108-125  37        57.2      18.7
2     1        1         26.3      58.6             0.242 over 10
3     1        0         29.3      52.3             0.711 over 45
      31-45    249       203.2     18.4

 

The two way sample analysis shows that the third part of w differs from the first two thirds:

Part  2       3
1     0.440   *0
2             *0.007


For [w + W], the exponential distribution is best in the first two thirds; the last third is better correlated with the Weibull distribution, see the following table:

Part  Range    Observed  Expected  % of chi-square  Chi-square
1     1        2         28.9      59.4             0.581 over 10
      16-30    309       276.7     8.9
2     1        1         29.7      64.0             0.250 over 10
3     1        0         16.7      44.8             0.228 over 19
      166-180  16        9.2       13.6

 

The two way sample analysis shows that the third part of [w + W] differs from the first two thirds:

Part  2       3
1     0.471   *0.001
2             *0.015

X

The Weibull distribution of [x + X] is poor. There is a shortage of the distances 1141-1368 (8 occurrences against 16.6 expected). This makes 32.5 % of the chi-square test value.


Y

The exponential distribution of the upper case Y gives an acceptable fit. There is a peak of the distances 2216-2585 (10 occurrences against 6.2 expected). This minor difference makes 30.8 % of the chi-square test value.

The sets y and [y + Y] were divided into two parts. The exponential distribution of y gives a poor fit, see the following table:

Part  Range    Observed  Expected  % of chi-square  Chi-square
1     1        1         25.6      60.0             0.436 over 70
2     1        1         22.9      46.0             0.161 over 40
      136-185  84        104.5     20.0

 

The two way sample analysis shows that they differ significantly (test value 0.003).

The exponential distribution of [y + Y] gives a poor fit, too, see the following table:

Part  Range    Observed  Expected  % of chi-square  Chi-square
1     1        1         28.2      71.7             0.162 over 30
2     1        1         25.1      46.7             0.200 over 40
      139-185  82        103.5     18.9

 

The two way sample analysis shows that they differ significantly (test value 0.001).

Z

The exponential distribution of this letter is distorted by too few occurrences within distances 638-1273 (17 occurrences against 28.4 expected). This makes 49.9 % of the chi-square test value.

Discussion

The insufficient capacity of the software used for long lists forced the splitting of the most frequent signs. The splitting was made before determining distances. Surprisingly, the obtained parts are not always comparable, since the split parts contain different numbers of signs. This leads to different mean distances between them.
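The effect of splitting before determining distances can be sketched on a toy string. The functions distances and split_then_measure are hypothetical helpers written for this illustration. They are not the program by Rádl used in the study, and cutting the text into chunks of equal length is an assumption about how the splitting was done.

```python
def distances(text, symbol):
    """Differences between the 1-based positions of successive
    occurrences of `symbol` in `text`."""
    positions = [i for i, ch in enumerate(text, start=1) if ch == symbol]
    return [b - a for a, b in zip(positions, positions[1:])]


def split_then_measure(text, symbol, n_parts):
    """Cut the text into roughly equal chunks first, then determine
    the distances inside each chunk separately."""
    size = -(-len(text) // n_parts)  # ceiling division
    chunks = [text[k:k + size] for k in range(0, len(text), size)]
    return [distances(chunk, symbol) for chunk in chunks]


# Toy example: the sign 'a' is dense in the first half, sparse in the second.
toy = "aaaaaa" + "axxxxa"
whole = distances(toy, "a")             # one list for the unsplit string
parts = split_then_measure(toy, "a", 2) # one list per chunk
```

In this toy case the first chunk yields five distances of 1 and the second a single distance of 5, so the two parts have very different sizes and mean distances. The distance that spans the cut is also lost, because each chunk is measured on its own.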

Some distributions of distances between consonants are highly regular, especially their tails, if the low distances inside words are pooled. They are described with a different precision with four distributions: exponential, Weibull, lognormal and negative binomial. Sometimes it is rather difficult to decide which distribution is the better one for fitting.

If the results are compared with the published analyses of Shakespeare's Sonnets and Matthew's Gospel, many differences can be observed. Doyle used words differently than older authors. In particular, the Weibull distribution appears more often.

Some peaks are obviously results of repeated phrases. This conclusion should be confirmed by stylistic analysis.

REFERENCES

1. Kunz M., See papers of this series on the page.