Distance Analysis of English Texts. III. ARTHUR CONAN DOYLE: A STUDY IN SCARLET.

Milan Kunz (kunzmilan@atlas.cz) August, 2002

Abstract

Distances between identical symbols in information strings (biological, language, computer programs (*.exe files)) are described with varying precision by four distributions: exponential, Weibull, lognormal and negative binomial. The correlations are sometimes highly significant. Here the distances between signs in the novel by A. C. Doyle are analyzed. Some distance tests revealed specific formal features of the text.

INTRODUCTION

This study continues the investigation of the statistical properties of distances between identical symbols in information strings (1, 2, 3). Doyle's novel was obtained in RTF form. Using MS Word, the text was converted into plain *.txt. Some formatting, such as headlines, remained unchanged. The resulting file has 238430 bytes. It contains 230112 signs including spaces, or 189183 signs without spaces, in 4159 lines and 42549 words. This means that the mean length of a word is 4.446 signs (including apostrophes and punctuation marks). For some letters, the list of distances was split into several approximately equal parts, since the statistical software used, Statgraphics (in the version I work with), does not handle very long lists.
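The splitting of a long distance list into approximately equal parts can be sketched as follows. This is only an illustration: the function name `split_into_parts` and the sample list are hypothetical, and the source does not say how the actual cut points were chosen.

```python
def split_into_parts(values, parts):
    """Split a list into `parts` consecutive chunks whose sizes differ by at most one."""
    base, extra = divmod(len(values), parts)
    chunks, start = [], 0
    for k in range(parts):
        # The first `extra` chunks receive one additional element.
        size = base + (1 if k < extra else 0)
        chunks.append(values[start:start + size])
        start += size
    return chunks


distances = list(range(1, 11))  # hypothetical distance list
chunks = split_into_parts(distances, 3)
print([len(c) for c in chunks])  # [4, 3, 3]
```

The chunks keep the original order, so each part corresponds to a consecutive stretch of the text.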

After these formal corrections, the distances were determined by a program written by Rádl. The string is first indexed with the position index i (i going from 1 to m) of each individual symbol in the string, and then the differences of these position indexes are determined. The differences are considered to be the topological distances between the same symbols. The sets of these values were evaluated by different statistical tests. The program counts all signs, including spacebar, return, and punctuation marks.
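The indexing and differencing described above can be sketched in a few lines. This is not Rádl's program, which is not reproduced in the source; the function name `symbol_distances` and the sample string are assumptions made for illustration.

```python
from collections import defaultdict


def symbol_distances(text):
    """Return {symbol: [distances between consecutive occurrences]}.

    Positions are indexed 1..m; a distance of 1 means the symbol
    is immediately repeated.
    """
    last_seen = {}
    distances = defaultdict(list)
    for i, symbol in enumerate(text, start=1):
        if symbol in last_seen:
            distances[symbol].append(i - last_seen[symbol])
        last_seen[symbol] = i
    return dict(distances)


sample = "a study in scarlet"  # hypothetical input string
result = symbol_distances(sample)
print(result["s"])  # [9]
```

Every sign is treated alike, so spaces and punctuation marks receive their own distance lists, as in the description above.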

Of all the implemented distributions, only four gave significant results, as before: the exponential distribution, the Weibull distribution, the lognormal distribution, and the negative binomial distribution.

The actual values (mean, standard deviation, skewness, kurtosis, distribution parameters etc.) are of little interest, since they differ considerably.

Results

The distances between points (full stops) determine the length of sentences. There are 2393 points, mostly used as punctuation marks, except in some abbreviations (e.g. M.D.).

The distribution is of Weibull type, a = 1.75916, b = 108.163.
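As a plausibility check, the mean implied by these parameters can be compared with the mean distance between points. The sketch below assumes that a is the shape and b the scale parameter, which the source does not state explicitly; the function names are hypothetical.

```python
import math


def weibull_cdf(x, shape, scale):
    """Cumulative distribution function of the two-parameter Weibull distribution."""
    return 1.0 - math.exp(-((x / scale) ** shape))


def weibull_mean(shape, scale):
    """Mean of the Weibull distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


a, b = 1.75916, 108.163          # assumed: a = shape, b = scale
fitted_mean = weibull_mean(a, b)
observed_mean = 230112 / 2393    # signs in the text divided by the number of points
print(round(fitted_mean, 1), round(observed_mean, 1))
```

Under this assumption both values come out near 96 signs per sentence, which supports reading a as the shape and b as the scale.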

Chisquare = 10.4418 over 52, significance level was 0.72929 with 14 degrees of freedom. There is a shortage of points at distances 68-82 (237 occurrences against 265 expected). This alone makes 28.4 % of the chi-square test value.
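The share of one class in the chi-square value follows from the usual term (observed - expected)^2 / expected. A minimal sketch using the numbers quoted above (the function name is hypothetical, and the expected count is the rounded value from the text):

```python
def chi_square_term(observed, expected):
    """Contribution of one class to the chi-square statistic."""
    return (observed - expected) ** 2 / expected


term = chi_square_term(237, 265)   # shortage at distances 68-82
share = 100 * term / 10.4418       # percent of the total chi-square value
print(round(term, 3), round(share, 1))
```

The result is about 28 %, in agreement with the quoted 28.4 % up to the rounding of the expected count.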

The other punctuation mark, the semicolon (108 occurrences), is used as follows:

Chisquare Test

Lower	Upper	Observed	Expected
Limit	Limit	Frequency	Frequency	Chisquare
at or below	667.619	39	38.3	0.01274
667.619	1334.238	23	23.4	0.00610
1334.238	2000.857	12	13.1	0.08969
2000.857	2667.476	10	8.2	0.41749
2667.476	3334.095	4	5.5	0.39519
3334.095	4667.333	7	6.7	0.01289
4667.333	6667.190	5	5.1	0.00387
above	6667.190	8	7.8	0.00706

Chisquare = 0.945022 with 5 d.f. Sig. level = 0.966877

This is an example of the almost perfect lognormal distribution.
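The table above can be checked by recomputing the chi-square sum and its significance level. The sketch below uses only the standard library; the series for the incomplete gamma function is a swapped-in textbook method, not the routine used by Statgraphics, and the function names are hypothetical. The 5 degrees of freedom are consistent with 8 classes minus one, minus the two fitted lognormal parameters.

```python
import math


def regularized_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def chi_square_significance(chi_square, dof):
    """Upper tail probability of the chi-square distribution."""
    return 1.0 - regularized_lower_gamma(dof / 2.0, chi_square / 2.0)


observed = [39, 23, 12, 10, 4, 7, 5, 8]
expected = [38.3, 23.4, 13.1, 8.2, 5.5, 6.7, 5.1, 7.8]  # rounded values from the table
total = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
level = chi_square_significance(0.945022, 5)
print(round(total, 2), round(level, 3))
```

Because the tabulated expected frequencies are rounded to one decimal, the recomputed sum differs slightly from 0.945022, while the significance level computed from the quoted chi-square value agrees with the reported 0.966877.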

There are many commas. Therefore, the file was split into 6 parts. The distribution of distances was exponential, with a different quality of fit in each part, as tabulated:

Part	Number	Chisquare	Note
1	475	0.6716	Peak 85-120, valley 155-190
2	476	0.2991	Peak 107-127
3	483	0.5055	Peak 86-106
4	477	0.0802	Peak 115-143
5	476	0.6593	Over 100, lower tail
6	478	0.0287	Peak 71-125

There are no immediately repeated commas (",,"), which contributes 21.2-50.78 % of the chi-square test value.

The two way sample analysis shows how the parts differ:

Part	2	3	4	5	6
1	0.355	0.091	0.122	0.374	0.177
2		0.397	0.480	0.977	0.656
3			0.447	0.364	0.654
4				0.621	0.761
5					0.621

Note: The asterisk shows the statistically significant difference between tested parts.

The use of commas does not differ greatly between the parts.
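The source does not state which test the two way sample analysis applies. As one possible stand-in, and purely as an assumption, two parts can be compared by the two-sample Kolmogorov-Smirnov statistic, the largest gap between their empirical distribution functions. The function name and the sample lists below are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in set(a) | set(b):
        fa = bisect_right(a, x) / len(a)   # share of sample_a at or below x
        fb = bisect_right(b, x) / len(b)   # share of sample_b at or below x
        d = max(d, abs(fa - fb))
    return d


part_1 = [3, 5, 8, 13, 21]    # hypothetical distance lists
part_2 = [4, 6, 9, 40, 60]
print(ks_statistic(part_1, part_2))
```

A statistic near zero indicates similar parts; converting it into a significance level requires the Kolmogorov distribution, which is omitted here.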

The spacebar

The distances between consecutive spacebars greater than 1 determine the number of words of the length corresponding to this distance minus one. There are 40931 spacebars without corrections. Some of them are used as formatting tools. The results are tabulated as follows. Cumulating the frequencies of shorter distances improved the fit in some cases, since below the cut the counts are scattered and the differences can balance each other.
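The cumulating of shorter distances (written as "over 8" and similar in the tables) can be sketched as follows. Whether the cut value itself belongs to the pooled class is an assumption here, and the function name and sample data are hypothetical.

```python
from collections import Counter


def pool_short_distances(distances, cut):
    """Count distances, merging every distance <= cut into one class keyed by `cut`."""
    counts = Counter()
    for d in distances:
        counts[max(d, cut)] += 1
    return dict(sorted(counts.items()))


sample = [1, 2, 2, 5, 8, 9, 9, 12]      # hypothetical distances
print(pool_short_distances(sample, 8))  # {8: 5, 9: 2, 12: 1}
```

After pooling, the first class carries the sum of all short distances, so their individual deviations no longer enter the chi-square value separately.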

Table 2 The number of words with the different length

Length	Number	Type of distribution, chisquare value
1	1593	LN, 0.305
2	6427	LN (divided)
3	8109	LN (divided)
4	6309	EX (divided)
5	4364	EX (divided)
6	3237	NB, 0.024, over 8 = 0.810
7	2692	NB, 0.053, over 4 = 0.230
8	2002	NB, 0.137, over 8 = 0.455
9	1540	NB, 0.126, over 6 = 0.871
10	1043	NB, 0.072, over 25 = 0.605
11	698	EX, 0.358
12	455	EX, 0.251
13	289	WE, 0.676
14	193	WE, 0.686
15	122	WE, 0.097
16	81	WE, 0.531
17	45	WE, 0.207
18	30	too few data
19	13	too few data
20	4	too few data

The distribution of word lengths seems to have the lognormal shape, but this guess was not tested.
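The relation between spacebar distances and word lengths used for Table 2 can be illustrated with a short sketch. The function name and the sample string are hypothetical; as in the description above, a distance of 1 (a doubled spacebar) yields no word, and punctuation marks attached to a word count toward its length.

```python
from collections import Counter


def word_length_counts(text):
    """Count word lengths as (distance between consecutive spacebars) - 1."""
    positions = [i for i, ch in enumerate(text, start=1) if ch == " "]
    gaps = (later - earlier for earlier, later in zip(positions, positions[1:]))
    return Counter(gap - 1 for gap in gaps if gap > 1)


sample = "it is a  study in scarlet "   # hypothetical text with a doubled spacebar
counts = word_length_counts(sample)
print(sorted(counts.items()))  # [(1, 1), (2, 2), (5, 1), (7, 1)]
```

Only words enclosed by two spacebars are counted, so the first word of the sample is missing from the result.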

Notes to some results:

The distribution of one letter words is correlated by the lognormal distribution. There are two peaks at distances 70-81 and 128-138 (36 (9) occurrences against 28.4 (5.6) expected). Each makes about 14.7 % of the chi-square test value. There is a shortage of distances 13-23 (375 occurrences against 406.9 expected), which contributes about 17.9 % to the chi-square test value. The distribution is shorter than expected (4 occurrences against 9.1 expected). This makes about 20.7 % of the chi-square test value.

The distribution of two letter words is correlated poorly by the lognormal distribution. The set was divided into 4 parts. Of these, the second part gives the best fit: the chi-square test value is 0.3346 over 11, 0.8075 over 12, and 0.4011 over 13. These words follow each other more often than corresponds to the shape. This makes 52.7-92.7 % of the chi-square test value.

The two way sample analysis shows that only the first and second parts are similar:

Part	2	3	4
1	0.537	*0	*0.010
2		*0	*0.037
3			*0.032

The distribution of three letter words was divided into four parts, too. The parts are correlated by different distributions:

Part	Type	The chi-square test value (over)	Note
1	LN	0.348 (12), 0.553 (13), 0.273 (14)	repeatings make 91.3 %
2	EX	0.130 (16-17)	peak 3-4 makes 45 %
3	NB	0.148 (11)	shortage of repeatings makes 43.8 %, peak 3-4 makes 39.1 %
4	LN	0.036 (12), 0.495 (13), 0.147 (14)	repeatings make 66.5 %

The two way sample analysis shows that only the second and fourth parts are similar:

Part	2	3	4
1	*0	*0	*0
2		*0	0.688
3			*0

The distribution of four letter words was divided into four parts. These words are following each other more often than corresponds to the shape of the exponential distribution but not too much, at most 28.2 % of the chi-square test value in the first part. The parts are correlated as follows:

Part	The chi-square test value (over)	Note
1	0.005 (15), 0.332 (16), 0.141 (17)	shortage of distances 15-18 makes 25.1 %,
2	0.258 (9), 0.617 (10), 0.276 (11)	peak 6-8 makes 28.5 %
3	0.257 (12-13)	peak 6-8 makes 31.7 %
4	0	peak 2-6 makes 62.3 %

The two way sample analysis shows that only the second and third parts and the third and fourth ones are similar:

Part	2	3	4
1	*0.003	*0	*0
2		0.302	*0.041
3			0.283

The distribution of five letter words was divided into three parts. These words follow each other slightly more often than corresponds to the shape of the exponential distribution. The parts are correlated as follows:

Part	The chi-square test value (over)	Note
1	0.071 (2), 0.389 (3), 0.148 (4)	shortage of distances 17-32 makes 51.3 %,
2	0.002	peak 6-8 makes 35.8 %
3	0	peak 8-11 makes 41.7 %

The two way sample analysis shows that the parts are similar:

Part	2	3
1	0.723	0.250
2		0.723

The negative binomial distribution of six letter words is fair, worsened by short distances (the chi-square test value is 0.681 over 7, 0.810 over 8, 0.420 over 9).

The distribution of seven letter words is described by the negative binomial distribution (the chi-square test value = 0.0529. It is somewhat improved over 4 to 0.2304). The tail is longer than expected (18 occurrences against 10.1 expected above 82). This makes 24.9 % of the chi-square test value.

The distribution of eight letter words is also the negative binomial one (the chi-square test value = 0.137; it is somewhat improved over 4 to 0.455). The tail is again longer (16 occurrences against 9.8 expected over 1.7). This makes 15.6 % of the chi-square test value.

The distribution of nine letter words is also the negative binomial one (the chi-square test value = 0.126 is improved over 6 to 0.871). The shortage of these words within distances 88-96 (7 occurrences against 13.9 expected) contributes 16.0 % of the chi-square test value.

The distribution of ten letter words is fairly correlated by the negative binomial distribution (the chi-square test value = 0.072 is improved over 25 to 0.605). The shortage of these words within distances 19-28 (107 occurrences against 132.4 expected) contributes 17.0 % of the chi-square test value. There are more distances 145-172 than expected (23 occurrences against 10.1 expected). This makes 31.9 % of the chi-square test value.

The distribution of eleven letter words is fairly correlated by the exponential distribution (the chi-square test value = 0.358). The shortage of these words within distances 160-186 (11 occurrences against 16.5 expected) contributes 18.3 % of the chi-square test value. There are more distances 28-54 than expected (183 occurrences against 159.0 expected). This makes 36.7 % of the chi-square test value.

The distributions of longer words are well correlated with the Weibull distribution. As an example, the results with the 16 letter words are given:

Chisquare Test

Lower	Upper	Observed	Expected
Limit	Limit	Frequency	Frequency	Chisquare
at or below	200.000	29	30.0	0.03323
200.000	400.000	17	16.7	0.00504
400.000	600.000	13	10.7	0.47842
600.000	800.000	7	7.2	0.00345
800.000	1200.000	5	8.2	1.27208
above 1200.000	3374	10	8.2	0.41292

Chisquare = 2.20513 with 3 d.f. Sig. level = 0.530938

Distances between individual letters

The results for all letters are presented in the form of the table, where the frequencies of all symbols are given and the significance of the performed chi-square tests. Then the commentaries to all symbols of the alphabet are given. The values in the square brackets show the corresponding values of the combined lower and upper cases.

Table 7 Survey of results

Notes:

EX = exponential distribution

WE = Weibull distribution

LN = lognormal distribution

NB = negative binomial distribution

* = the test was not made, since there were not enough data

Statistic = XX, the chi-square test

Symbol	Small	Capital	Both
a	14387,EX, 0	251, WE, 0.913	14640, EX, NB, 0
b	2429, EX, 0	113,WE, 0.574	2542, WE, 0
c	4403, NB, 0.300	126, WE, 0.932	4524, NB
d	8210, NB, EX	146, LN, 0.048	8356, NB
e	22812, NB, 0	84, WE, 0.465	22895, NB, 0
f	3773,WE, 0	269, WE, 0.008	4042, WE, 0
g	3494, WE, 0.296	99, WE, 0.924	3593, WE,
h	11954, NB, 0	445, WE, 0.371	12399, WE, NB
i	11152, LN, 0.137	1180, LN, 0.128	12332, EX, 0
j	127, WE, 0.576	108, WE, 0.031	235, WE, 0.346
k	1296, WE, 0.033	10, no test	1306, WE, 0.041
l	6797, WE	173, EX, 0.194	4970, WE
m	4569, EX	164, WE, 0.458	4733, EX
n	12201, EX	304, WE, 0.072	12505, EX
o	13843, EX	101, WE, 0.077	13944,EX
p	2867, WE	69, WE, 0.356	2936, WE
q	136, EX, 0.441	2, no test	138, EX, 0.433
r	10793, EX	204, WE, 0.109	10997, EX
s	12680, WE, EX	262, WE, 0.878	12942, WE, EX
t	15486, EX	525, WE, 0.4522	16011, EX
u	5047, EX	193, WE, 0.455	5076, EX
v	1735, WE	11, no test	1747, WE, 0.020
w	4335, EX, WE	260, WE, 0.709	4595, EX
x	278, WE, 0.130	no test	-
y	3349, EX,	323, WE, 0.267	3672, EX
z	EX	no test	133, EX, 0.161

For the upper case, the Weibull distribution is the best one for 16 letters. The lognormal distribution fits only 2 cases, the exponential distribution is the best in 3 of the performed tests, and the negative binomial distribution in none.

For the lower case, the Weibull distribution is the best one for 8 letters. The lognormal distribution fits only 1 case, the exponential distribution is the best in 13 of the performed tests, and the negative binomial distribution in 4 cases. For the combined cases, the Weibull distribution is the best one for 10 letters. The lognormal distribution fits no case, the exponential distribution is the best in 12 of the performed tests, and the negative binomial distribution in none. Sometimes the difference between the fits is small and more than one distribution is applicable. The chi-square values are sometimes practically zero, and only raising the lowest distance class by pooling the shorter distances increases the significance of the chi-square tests. Now, the commentaries to the individual letters follow.

The frequency of the capital A allowed a separate test. A fair result was obtained with the exponential distribution (the chi-square test value 0.378). The excellent fit with the Weibull distribution (the chi-square test value 0.913) is worsened by too many repetitions at distances 2376 to 2850 (12 occurrences against 7.4 expected), which makes 26.0 % of the chi-square test value.

The distribution of distances between the lower case a is exponential, except that there are practically no repeated aa. This fact contributes 58.4-77.3 % to the chi-square value. The lower case a repeats too often at distances 6-14 (1.185-1.336 times the expected values).

The two way sample analysis shows how the parts of the lower case differ:

Part	2	3	4	5	6
1	0.368	*0.034	*0.026	*0.000	0.390
2		0.217	0.178	*0.008	0.956
3			0.895	0.160	0.191
4				0.209	0.155
5					*0.006

Note: The asterisk shows the significant difference between tested parts.

The first sixth differs significantly from the third to fifth ones. The fifth sixth differs significantly from the second and sixth ones, too.

The most important disturbances from the shape of the distribution in all parts are tabulated:

Part	Range	Observed	Expected	% of chisquare
1	6-20	1285	1015.5	29.7
2	6-14	997	728.4	18.5
3	6-9	470	390.1	8.5
4	6-9	483	390.9	11.7
5	7-12	600	453.6	20.5
6	14-17	293	183.9	25.5

The lower case a repeats too often within one to three words.

The distances between both cases (a + A) are fitted poorly by different distributions. Again, there are practically no repeated Aa. This fact contributes 58-77.6 % to the chi-square value.

The first sixth of a fits well with the negative binomial distribution with pooled distances to 16 (the chi-square value = 0.592). Other parts give much worse fits, and other distributions (the exponential distribution and the negative binomial distribution) give a better fit.

The two way sample analysis of both cases (a + A) gives worse results than the lower case a:

Part	2	3	4	5	6
1	0.457	*0.025	*0.023	*0.000	0.245
2		0.131	0.120	*0.004	0.679
3			0.943	0.162	0.263
4				0.193	0.242
5					*0.011

The first sixth differs significantly from three parts, and its consistency with the other parts is low, too. The most important disturbances in all parts are tabulated again:

Part	Type	Range	Observed	Expected	% of chisquare
1	NB	25-27	74	87.9	12.2
		49-54	12	24.7	36.3
2	NB	6-18	1210	1017.8	17.1
3	EX	6-21	1337	1135.1	19.6
4	EX	6-13	844	713.2	14.0
5	LN	29-39	221	172.1	29.3
6	EX	14-17	300	187.1	25.9

The distribution of distances between upper case B is Weibull. The distribution of distances over 20 between lower case b is exponential, the chi-square test value is then 0.614. There are too few b within distances 129-150 (106 occurrences against 128 expected), which contributes 19.3 % of the chi-square test value. Conversely, there are too many b within distances 282-324 (63 occurrences against 46 expected), which contributes 34.3 % of the chi-square test value. The distribution of distances over 20 between (b + B) is exponential, the chi-square test value is then 0.921. But here the Weibull distribution gives an even better chi-square test value, 0.927. The fit is worsened by too many (b + B) within distances 295-316 (31 occurrences against 20.7 expected), which contributes 40.9 % of the chi-square test value. There are too few (b + B) within distances 422-442 (1 occurrence against 5.4 expected), which contributes 28.5 % of the chi-square test value.

Including B improved the fit, the disturbances lessened and shifted to longer distances.

The distribution of distances between upper case C is the Weibull one (the chi-square test value is very good, 0.932).

The distribution of distances of the lower case of this letter (and c + C) is described well by three distributions, exponential, negative binomial and Weibull.

The distances between lower case c were split into 3 parts. The results are tabulated:

Part	Type	Chisquare	Range	Observed	Expected	% of chisquare
1	NB	0.298, 0.817 over 5	76-87	57	68.8	14.9
2	EX, NB	0.365, 0.349	191-238	15	23.2	25.3
3	NB, EX	0.954, 0.954	146-169	45	36.9	39.9

The parts are rather different, as two way sample analysis shows:

	2. part	3. part
1. part	*0.043	*0.000
2. part		0.137

The distances between (c + C) were split into 3 parts, too. The results are tabulated as follows:

Part	Type	Chisquare	Range	Observed	Expected	% of chisquare
1	NB	0.488	72-83	53	68.4	18.7
			203-226	3	8.2	17.7
2	EX, NB	0.284, 0.230	49-72	245	217.6	26.3
3	EX, NB	0.761, 0.769	146-169	45	36.6	25.9
			265-598	6	11.1	31.1

The parts are rather different, as two way sample analysis shows:

	2. part	3. part
1. part	*0.045	*0.000
2. part		0.118

Combining both cases worsened the fit. It is difficult to choose between the exponential distribution and the negative binomial distribution, both give practically the identical results.

For the letter d, the exponential distribution and the negative binomial distribution are applicable. The chi-square test values are as follows:

Part	Exponential	Negative binomial
d1	0
d2	over 20 = 0.247
d3		over 33 = 0.354
d4	over 30 = 0.329
[d + D]1	0
[d + D]2	over 19 = 0.763
[d + D]3	over 31 = 0.395
[d + D]4	over 22 = 0.683

The frequency of the capital D allowed a separate test. The lognormal distribution correlates poorly; the chi-square value is only 0.048, since there are too many repetitions at distances 1274 to 1909 (20 occurrences against 12.6 expected), which makes 34.3 % of the chi-square test value. The tail is shorter than expected, only 1 occurrence against 5.1 expected, which contributes another 25.9 % of the chi-square test value.

There are too few repeated dd (Dd). This fact contributes 42.4-70.5 % (32.3-63.8 %) to the high chi-square values given in the table above.

The two way sample analysis shows that the parts of the lower case d are different:

Part	2	3	4
1	0.587	*0.018	0.050
2		0.072	0.163
3			0.686

The third part differs significantly from the first part. Only the third and the fourth parts are similar.

There are always fewer doubled dd than corresponds to the exponential form, which makes 42-70.5 % of the chi-square test value.

The combined [d + D] gives somewhat different results. The two way sample analysis shows that the parts of [d + D] are different, too:

Part	2	3	4
1	0.316	*0.022	*0.016
2		0.208	0.163
3			0.873

The third and fourth parts differ significantly from the first part. Only the third and fourth parts are close.

There are always fewer doubled Dd than corresponds to the exponential form (0-10 occurrences against 23.5-37 expected), which makes 32.3-63.8 % of the chi-square test value.

There are relatively few E compared with the great number of e. The distribution of distances between lower case e and both cases (e + E) is mostly the negative binomial; some parts fit the lognormal or exponential distributions better:

Part	Negative binomial
e1	over 15 = 0.538
e2	over 12 = 0.137
e3	over 12 = 0.054
e4	EX
e5	0
e6	over 12 = 0.063
e7	over 14 = 0.066
e8	over 20 = 0.093
[e + E]1	over 15 = 0.529
[e + E]2	over 15 = 0.135
[e + E]3	over 14 = 0.052
[e + E]4	0
[e + E]5	over 17 = 0.112
[e + E]6	0
[e + E]7	over 13 = 0.102
[e + E]8	over 17 = 0.131

The two way sample analysis failed due to too large samples.

The distribution of the capital F, of the lower case f, and of [f + F] is correlated with the Weibull distribution. The sets of the lower case f and of [f + F] were each divided into two parts, which are rather different (the two way sample analysis results are 0.0002 and 0.0048, respectively).

The distribution of this letter is distorted by too many double ff [Ff] (e.g. 96 occurrences against 28.7 expected). This makes 84.9 % of the total very high chi-square test value.

The distribution of the capital G is correlated with the Weibull distribution rather well. It affects the distribution of the lower case g, divided into two parts, in both parts differently:

Part	The chi-square test value
g	0.320	0.709
g + G	0.296	0.024

The most important distortions:

Part	Range	Observed	Expected	% of chisquare
g1	19-36	325	295.5	15.4
	277-294	11	6.1	20.9
g2	71-117	278	319.5	40.3
[g+G]1	277-294	11	5.9	32.9
[g+G]2	88-104	80	114.1	31.5

The distribution of the capital H is correlated with the Weibull distribution rather well:

Chisquare Test

Lower	Upper	Observed	Expected
Limit	Limit	Frequency	Frequency	Chisquare
at or below	212.074	169	160.5	0.454082
212.074	423.148	96	95.3	0.005133
423.148	634.222	58	61.7	0.226729
634.222	845.296	36	40.9	0.585048
845.296	1056.370	26	27.4	0.073882
1056.370	1267.444	26	18.5	2.992872
1267.444	1478.519	8	12.6	1.695590
1478.519	1689.593	5	8.6	1.533205
1689.593	1900.667	5	5.9	0.147582
1900.667	2322.815	7	6.9	0.000840
above 2322.815		9	6.5	0.957956

Chisquare = 8.67292 with 8 d.f. Sig. level = 0.370635

The surplus of distances 1057-1267 is followed by the shortage of longer distances.

The frequency of h made it necessary to split the set for the evaluation into four parts, which correlated badly with the negative binomial distribution (the 1st part has the chi-square value 0.315 over 30), but they were still too long for the two way sample analysis. [h + H] was split for the evaluation into six parts, which correlated badly with the negative binomial distribution (e.g. the 3rd part has the chi-square value 0.115 over 27).

The two way sample analysis shows how the parts are different:

Part	2	3	4	5	6
1	*0.001	*0	*0	*0	*0
2		0.780	*0	*0	0.370
3			*0	*0	0.533
4				*0.005	*0
5					*0

The distribution of the capital I is correlated poorly with the lognormal distribution. The greatest disturbance is a shortage of counts at distances 305-607 (102 occurrences against 125 expected), which contributes 39.1 % of the chi-square test value. The tail is longer over distances 1516 (17 occurrences against 10.2 expected), which contributes another 49.5 % of the chi-square test value.

The frequency of the lower case i made splitting necessary. The parts are poorly correlated with the exponential distribution, the 5th part being the best (the chi-square test value 0.701 over 5), and they pass the two way sample analysis as follows:

Part	2	3	4	5	6
1	*0.014	*0.026	0.068	*0.015	*0.002
2		0.815	0.635	0.947	0.441
3			0.631	0.864	0.316
4				0.506	0.132
5					0.397

Only the first part differs significantly from the others, since the result with the fourth part is only slightly above the limit of rejection. There are no repeating ii. This makes 55.6-78.2 % of the chi-square test value. The distribution is more skewed, there exists always a surplus of intermediate distances:

Part	Range	Observed	Expected	% of chisquare
i1	6-21	860	744.5	15.1
i2	7-18	709	567.2	23.3
i3	7-28	954	803.1	23.1
i4	7-26	950	871.8	6.2
i5	7-31	1077	979.5	11.8
i6	8-28	930	795.7	23.1

Including I changed the results of the two way sample analysis as follows:

Part	2	3	4	5	6
1	*0.001	*0.002	*0	*0	*0.022
2		0.756	0.176	0.169	0.230
3			0.091	0.084	0.453
4				0.984	*0.017
5					*0.014

In most cases, the similarity is worse. Only the fourth and fifth parts are less different. There are no repeating Ii. This makes 60.5-70.1 % of the chi-square test value. The distribution is more skewed, there exists always a surplus of intermediate distances:

Part	Range	Observed	Expected	% of chisquare
[i+I]1	5-14	735	604.5	21.6
[i+I]2	7-12	495	371.2	24.0
[i+I]3	6-15	728	616.0	13.8
[i+I]4	8-21	831	687.9	21.3
[i+I]5	6-30	1256	1084.7	18.2
[i+I]6	7-23	1028	861.2	24.0

The distribution of the letter j is the Weibull one. The Weibull distribution of the lower case j is better correlated than both cases [j + J]. There are too many distances 874-1310 (22 occurrences against 16.7 expected). This makes 25.0 % of the chi-square test value. Conversely, there are too few distances 2619-3055 (10 occurrences against 6.4 expected). This makes 31.2 % of the chi-square test value. Combining both cases worsened the fit. There are too many distances 963-1284 (52 occurrences against 40.4 expected). This makes 44.2 % of the chi-square test value.

The Weibull distribution fits the letter k poorly. There are no repeated kk [Kk]. This makes 18.5 [19.3] % of the chi-square test value.

The occurrences of capital L are correlated by the exponential distribution. There are too many distances 2653-3173 (14 occurrences against 11.3 expected). This makes 54.1 % of the chi-square test value.

The frequency of l and [l + L] made splitting necessary.

The parts are correlated with the Weibull distribution. It is distorted by many double ll [Ll]. This makes 74.8-79.9 % [74.8-81.2 %] of the total chi-square test value. The parts fit over different distances rather well, see table:

Part	Cut	Chisquare
l 1	11	0.903
2	35	0.967
3	24	0.967
4	35	0.208
5	20	0.097
l+L 1	11	0.987
2		0
3	30	0.925
4	36	0.092
5	10	0.579

The parts of l pass the two way sample analysis, as follows:

Part	2	3	4	5
1	0.850	0.518	*0.013	*0.041
2		0.647	*0.023	0.065
3			0.074	0.173
4				0.670

The fourth part differs significantly from the first and second ones, the first from the fifth one.

Including L changed the results only slightly, see table:

Part	2	3	4	5
1	0.831	0.430	*0.012	*0.039
2		0.567	*0.022	0.067
3			0.095	0.219
4				0.661

The upper case M is correlated well using the Weibull distribution (the chi-square test value is 0.458).

The lower case m, divided into 3 parts, is correlated best with the exponential distribution (1. part the chi-square test value over 44 is 0.798, 2. part the chi-square test value = 0.443, 3. part the chi-square test value = 0.137). The doubled mm fit excellently only in the second part; in the other parts, the repeated mm is scarcer than expected.

The parts of the distribution of m are different. The two way sample analysis shows the following results:

Part	2	3
1	*0.001	*0.004
2		0.712

The combined [m + M], divided into 3 parts, is correlated best with the exponential distribution (1. part the chi-square test value over 14 is 0.849, 2. part the chi-square test value = 0.096, 3. part the chi-square test value = 0.082). The doubled Mm fit only in the second part; in the other parts, they are scarcer than expected.

The parts of the distribution of [m + M] are different, too. The two way sample analysis shows following results:

Part	2	3
1	*0.006	*0.004
2		0.835

The upper case N is correlated using the Weibull distribution (the chi-square test value is 0.072). There are too few distances 1600-2000 (7 occurrences against 13.3 expected). This makes 25.6 % of the chi-square test value.

The distribution of n and (n + N) was divided into seven parts.

The distribution of this letter is distorted by too few double nn [Nn] (e.g. 10 occurrences against 93 expected). This makes 44.0-67.9 % of the total very high chi-square test value. Some parts show rather great disturbances:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	6-10	419	283.0	38.8	0.152 over 10
3	18-22	202	134.8	28.2	0.242 over 20
4	6-9	330	265.5	13.8	0.073 over 16
5	6-16	644	736.6	15.2	0.114 over 10
6	6-22	944	784.7	27.0	0.117 over 37
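The expected number of doubled nn quoted above (93) can be roughly reproduced under a simple geometric model, in which each sign is the letter n with probability equal to its overall frequency. This model is an assumption made for illustration, not the fitting procedure of Statgraphics; the function name is hypothetical, and the text length is taken as the 230112 signs including spaces.

```python
def expected_doubles(letter_count, text_length, parts=1):
    """Expected number of distance-1 repeats in one part under a geometric model."""
    p = letter_count / text_length             # chance that the next sign is the letter
    distances_per_part = letter_count / parts  # about one distance per occurrence
    return distances_per_part * p


# Lower case n: 12201 occurrences, list split into seven parts.
print(round(expected_doubles(12201, 230112, parts=7), 1))
```

The sketch gives about 92 expected repeats per part, close to the quoted 93, so the shortage of doubled nn is measured against roughly this baseline.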

The two way sample analysis shows following results:

Part	2	3	4	5	6	7
1	*0.012	0.097	0.355	0.882	0.780	0.146
2		0.388	0.107	*0.008	0.107	0.274
3			0.457	0.071	0.457	0.823
4				0.284	0.508	0.598
5					0.667	0.110
6						0.230

[n + N]:

In some parts are rather great disturbances:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	6-10	425	288.5	32.8	0.094 over 9
2	6-10	359	273.8	18.2	0.074 over 31
3	18-22	198	136.3	24.4	0.736 over 25
4	6-9	332	269.0	12.6	0.171 over 35
5	7-16	657	545.4	21.3	0.114 over 10
6	6-22	958	795.9	26.5	0.138 over 37

The disturbances have slightly less weight than at the lower case n.

The two way sample analysis shows for [n + N] the following results:

Part	2	3	4	5	6	7
1	*0.014	0.169	0.209	0.987	0.595	0.078
2		0.278	0.234	*0.015	0.051	0.480
3			0.911	0.170	0.384	0.701
4				0.209	0.457	0.621
5					0.589	0.080
6						0.211

Both sets are alike; including N did not change the results of the two way sample analysis dramatically. The second part differs significantly from the first and fifth ones.

The distribution of O can be correlated also with the Weibull distribution (the chi-square test value 0.077). There are too many distances 3572-4714 (14 occurrences against 7.3 expected). This makes 35.1 % of the chi-square test value.

The distribution of o and (o + O) was divided into seven parts, which correlated poorly with the exponential distribution.

The distribution of this letter is distorted by too few double oo [Oo] only slightly (at most in the first part of o, 72 occurrences against 119 expected). This makes 40.6 % of the total very high chi-square test value. Here are the greatest disturbances:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	7-11	427	369.2	19.8	0.122 over 25
2	2-5	336	420.8	19.0	0.089 over 17
	6-9	395	325.7	16.4
	14-22	421	346.1	18.8
3	2-4	233	334.1	25.9	0.117 over 10
	11-14	262	184.6	27.5
4	7-18	778	653.0	36.5	0.652 over 26
5	14-22	522	346.3	31.8	0.084 over 20
6	2-5	310	403.2	22.7
	14-22	432	344.4	27.3
7	14-22	408	345.8	18.2	0.155 over 30

In four parts, an excess of distances 14-22 occurs.

The two way sample analysis shows following results:

Part	2	3	4	5	6	7
1	0.608	0.276	0.149	0.390	*0.042	0.656
2		0.565	0.342	0.726	0.124	0.941
3			0.689	0.824	0.323	0.511
4				0.541	0.575	0.302
5					0.231	0.669
6						0.103

The first part differs significantly from the sixth one.

[o + O]:

In some parts are rather great disturbances:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	6-9	404	328.7	22.5	0.162 over 31
	18-22	205	150.6	18.4
2	39-43	69	42.1	18.2	0.074 over 31
3	6-9	378	318.6	17.2	0.248 over 45
	27-30	125	91.7	18.7
4	7-12	480	378.7	31.2
5	14-22	426	344.6	31.0	0.128 over 25
6	2-5	308	402.7	21.0
	14-22	432	343.3	24.3
7	14-22	411	344.1	18.4

The excess of distances 14-22 occurs again. The disturbances have slightly less weight than at the lower case o.

The two way sample analysis shows for o + O the following results:

Part	2	3	4	5	6	7
1	0.443	0.128	0.114	0.229	0.050	0.5378
2		0.449	0.456	0.770	0.229	0.873
3			0.930	0.640	0.653	0.354
4				0.584	0.722	0.319
5					0.359	0.648
6						0.169

There is no significant difference between parts.

The upper case P is also correlated using the Weibull distribution (the chi-square test value = 0.356). The Weibull distribution of p and [p + P] is distorted by too many pp [Pp] (69.8-80.8 [72.2-73.6] % of the chi-square test value). The other disturbances give only minor opportunities for commenting. Both sets were divided into 3 parts. The test did not reveal their dissimilarity.

The letter q is also correlated using the Weibull distribution (the chi-square test value = 0.367 at q, 0.389 at [q + Q]), but the exponential distribution gives a better fit (the chi-square test value = 0.441 at q, 0.433 at [q + Q]).

The upper case R is correlated with the Weibull distribution (the chi-square test value = 0.109). The fit is worsened by too many repetitions at distances 1000 to 2000 (45 occurrences against 33.3 expected), which makes 40.4 % of the chi-square test value.

The distribution of r and [r + R] was divided into five parts. They fit with the exponential distribution.

The most important disturbances from the shape of the distribution in all parts of r are tabulated:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	1	31	91.9	25.1	0.103 over 24
	2-5	263	401.5	29.8
	6-20	973	791.6	26.9
2	1	36	94.5	30.2	0.097 over 20
	8-21	831	689.8	26.5
3	1	45	105.6	23.8	0.195 over 34
	10-18	628	471.9	41.4
4	2-5	305	464.5	48.0
5	10-18	570	454.1	32.2

The two way sample analysis shows that the parts of the lower case r are rather different:

Part	2	3	4	5
1	0.351	*0	*0	0.178
2		*0	*0	0.671
3			0.334	*0
4				*0

The third and fourth part differ significantly from all other parts.

The most important disturbances from the shape of the distribution in all parts of [r + R] are tabulated:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	1	31	99.8	24.4	0.099 over 19
	2-5	254	393.3	31.2
	6-20	957	779.8	25.9
2	1	36	93.2	31.3
	8-21	831	682.7	27.1
3	1	45	104.0	27.6	0.239 over 17
	10-18	617	467.0	41.5
4	2-5	297	456.5	48.7
5	10-18	562	448.8	31.6

The two way sample analysis shows that the parts of [r + R] are rather different:

Part	2	3	4	5
1	0.220	*0	*0	0.115
2		*0	*0	0.716
3			0.400	*0
4				*0

The third and fourth part differ significantly from all other parts. Combining r with R changed the distribution insignificantly.

The Weibull distribution of the capital S is distorted mostly by a valley within distances 1600-2000 (10 occurrences against 14.2 expected). This makes 50.8 % of the chi-square test value.

The distribution of the lower case s and [s + S] was divided into five parts.

The most important disturbances from the shape of the distribution in all parts of s are tabulated:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	1	115	70.1 (WE)	44.1	WE 0.669 over 33, EX 0.569 over 30
	26-33	171	226.8	21.1
2	2-5	285	419.6	44.5	EX 0.921 over 22
	16-20	284	210.3	26.6
3	1	96	63.7	25.1	WE 0.105 over 18
	2-6	294	374.6	26.5
4	2-6	338	455.2	36.2
	7-12	453	353.1	38.6
5	2-7	452	529.0	26.2	EX 0.155 over 10

The two way sample analysis shows that the parts of the lower case s are rather similar, except the fifth part which differs from the first and third parts:

Part	2	3	4	5
1	0.300	0.845	0.130	*0.019
2		0.216	0.620	0.175
3			0.087	*0.011
4				0.389

The most important disturbances from the shape of the distribution in all parts of [s + S] are tabulated:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	1	115	72.1	40.1	WE, 0.234 over 32
	26-33	170	231.0	25.5
2	2-5	303	376.7	22.9	WE, 0.273 over 32
3	2-5	310	450.2	48.7	EX, 0.091 over 28
	11-15	348	276.8	20.4
4	2-6	355	474.7	32.8	EX, 0.531 over 31
	7-12	474	366.3	34.4
5					no test

The two way sample analysis shows that the parts of [s + S] are rather similar:

Part	2	3	4	5
1	0.215	0.833	0.59	0.188
2		0.306	0.502	0.241
3			0.094	0.197
4				0.275

The distribution of the capital T has the Weibull shape. There is an excess of distances 544-679 (54 occurrences against 43.3 expected). This makes 26.2 % of the chi-square test value.

The distributions of the lower case t and of [t + T] were divided into six parts.

The most important disturbances from the shape of the exponential distribution in all parts of t are tabulated:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	1	82	193.2	46.1	0.429 over 31
	5-16	1410	1204.3	34.7
2	1	72	203.3	44.2	0.461 over 24
	5-16	1513	1226.8	35.2
3	1	82	204.4	50.2	0.119 over 47
	23-29	279	225.9	10.4
4	1	64	199.8	38.0	0.490 over 36
	5-20	1808	1448.8	37.3
5	1	61	208.1	58.6	0.176 over 33
	5-16	1462	1238.1	23.8

There are too few doubled tt in all parts. Moreover, the shape of the distribution is rather sharp in the range 5-20, except in the third part.

The most important disturbances from the shape of the distribution in all parts of [t + T] are tabulated:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	1	82	205.6	44.4	0.117 over 25
	5-16	1481	1243.5	29.8
2	1	73	219.1	42.6	0.383 over 25
	5-16	1604	1272.4	38.4
3	1	82	218.2	54.2	0.039 over 21
	5-16	1590	1384.1	22.1
4	1	64	213.1	36.9	0.205 over 45
	5-20	1896	1498.6	36.4
5	1	60	221.0	54.1	0.075 over 37
	5-16	1533	1276.5	26.0

The results are not changed too much.

The distribution of the capital U has the Weibull shape.

The sets of u and [u + U] were divided into two parts, which are similar (the two sample analysis 0.326 [0.318]). There are no doubled uu or Uu (0 [0] occurrences against 54.0-55.6 [54.6, 56.2] expected). This makes 55.4, 44.4 [56.4, 44.1] % of the chi-square test value of the exponential distribution. When the lower limit is set to 30 for the first u set [28 for the first u + U set], the chi-square test value of the first part of u [u + U] improves to 0.701 [0.865]. The second parts give poorer fits.
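
The weight of the missing doubled letters can be made plausible by a small sketch. It assumes a continuous exponential distribution with a given mean distance and that the distance 1 falls into the class from 0 to 1; the class limits actually used by the statistical software are not stated in the text. The number of distances and the mean distance are hypothetical.

```python
import math


def expected_count(n, mean, lower, upper):
    """Expected number of n exponential distances falling between lower and upper."""
    return n * (math.exp(-lower / mean) - math.exp(-upper / mean))


n_distances = 2000    # hypothetical number of distances
mean_distance = 40.0  # hypothetical mean distance

doubled = expected_count(n_distances, mean_distance, 0.0, 1.0)

# With no observed doubled letters, the Pearson contribution
# (0 - E)^2 / E reduces to the expected count E itself.
contribution = (0 - doubled) ** 2 / doubled

print(round(doubled, 1), round(contribution, 1))
```

An empty class therefore adds its whole expected count to the chi-square test value, which explains why raising the lower limit improves the fit.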

The Weibull distribution of [v + V] is poor. There are no doubled vv (0 occurrences against 9.8 expected). This makes 46.2 % of the chi-square test value. The tail is longer than expected (18 occurrences over 634 against 10.7 expected). This makes another 23.9 % of the chi-square test value.

The Weibull distribution of the upper case W gives a good fit. There is a shortage of the distances 1840-2160 (5 occurrences against 9.4 expected). This small difference alone makes 44.5 % of the chi-square test value.

The sets w and [w + W] were divided into three parts. The exponential distribution of w gives a fair fit, see the following table:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	1	2	25.6	57.3	0.758 over 20
	108-125	37	57.2	18.7
2	1	1	26.3	58.6	0.242 over 10
3	1	0	29.3	52.3	0.711 over 45
	31-45	249	203.2	18.4

The two way sample analysis shows that the third part of w differs from the first two thirds:

Part	2	3
1	0.440	*0
2		*0.007

For [w + W], the exponential distribution is best in the first two thirds; the last third is better correlated by the Weibull distribution, see the following table:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	1	2	28.9	59.4	0.581 over 10
	16-30	309	276.7	8.9
2	1	1	29.7	64.0	0.250 over 10
3	1	0	16.7	44.8	0.228 over 19
	166-180	16	9.2	13.6

The two way sample analysis shows that the third part of [w + W] differs from the first two thirds:

Part	2	3
1	0.471	*0.001
2		*0.015

The Weibull distribution of [x + X] is poor. There is a shortage of the distances 1141-1368 (8 occurrences against 16.6 expected). This makes 32.5 % of the chi-square test value.

The exponential distribution of the upper case Y gives an acceptable fit. There is a peak of the distances 2216-2585 (10 occurrences against 6.2 expected). This minor difference makes 30.8 % of the chi-square test value.

The sets y and [y + Y] were divided into two parts. The exponential distribution of y gives a poor fit, see the following table:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	1	1	25.6	60.0	0.436 over 70
2	1	1	22.9	46.0	0.161 over 40
	136-185	84	104.5	20.0

The two way sample analysis shows that they differ significantly (test value 0.003).

The exponential distribution of [y + Y] gives a poor fit, too, see the following table:

Part	Range	Observed	Expected	% of chisquare	Chisquare
1	1	1	28.2	71.7	0.162 over 30
2	1	1	25.1	46.7	0.200 over 40
	139-185	82	103.5	18.9

The two way sample analysis shows that they differ significantly (test value 0.001).

The exponential distribution of this letter is distorted by few occurrences within distances 638-1273 (17 occurrences against 28.4 expected). This makes 49.9 % of the chi-square test value.

Discussion

The insufficient capacity of the used software for long lists forced the splitting of the most frequent signs. The splitting was made before determining distances. Surprisingly, the obtained parts are not always comparable, since the split parts contain different numbers of signs. This leads to different mean distances in them.
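
The effect of splitting can be illustrated by a minimal sketch. It follows the description of the distance counting given in the Introduction (position indexes from 1 to m, differences of successive indexes of the same symbol), but it is not the program used in this study. Splitting the text into chunks of equal length is an assumption, and all function names are hypothetical.

```python
from math import ceil


def distances(text, symbols):
    """Differences between successive 1-based positions of the given symbols."""
    positions = [i for i, ch in enumerate(text, start=1) if ch in symbols]
    return [b - a for a, b in zip(positions, positions[1:])]


def split_text(text, parts):
    """Cut the text into chunks of (nearly) equal length."""
    size = max(1, ceil(len(text) / parts))
    return [text[i:i + size] for i in range(0, len(text), size)]


def part_summary(text, symbols, parts):
    """Number of distances and mean distance in each chunk of the text."""
    summary = []
    for chunk in split_text(text, parts):
        d = distances(chunk, symbols)
        mean = sum(d) / len(d) if d else None
        summary.append((len(d), mean))
    return summary


sample = "abracadabra"
print(distances(sample, "a"))        # [3, 2, 2, 3]
print(part_summary(sample, "a", 2))  # [(2, 2.5), (1, 3.0)]
```

Even in this tiny example the two halves contain different numbers of the sign and have different mean distances, and the distance crossing the cut is lost.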

Some distributions of distances between consonants are highly regular, especially their tails, if the low distances inside words are pooled. They are described with varying precision by four distributions: exponential, Weibull, lognormal and negative binomial. Sometimes it is rather difficult to decide which distribution gives the better fit.

If the results are compared with the published analyses of Shakespeare's Sonnets and Matthew's Gospel, many differences can be observed. Doyle used words differently than the older authors. In particular, the Weibull distribution appears more often.

Some peaks are probably the result of repeated phrases. This conclusion should be confirmed by stylistic analysis.

REFERENCES

1. Kunz M., See papers of this series on the page.