Distance Analysis of English Texts. I. Shakespeare's Sonnets.

Milan Kunz, Jurkovičova 13, 63800 Brno, The Czech Republic, (kunzmilan@seznam.cz)

Summary

Distances between identical symbols in information strings (biological, language, computer programs (*.exe files)) are described with varying precision by four distributions: exponential, Weibull, lognormal and negative binomial. The correlations are sometimes highly significant. Here, distances between signs in Shakespeare's Sonnets are analyzed. Some distance tests revealed specific formal features of the Sonnets.

Introduction

Statistical properties of information distributions, especially their extreme skewness, raised the notion of their specificity (Haitun 1982a, b, c). Determining frequencies of symbols or words was a time consuming task suitable for shortening unbearably long time periods (Yule 1944).

These linguistic studies had some practical value, too: learning languages by starting with the most frequent words and phrases, and attributing texts to authors.

Distances between identical counted objects are the inverse function to frequencies.

Distances between identical symbols exist in all information strings with any number of symbols or their k-tuples (words). Counting them manually was even more troublesome than counting words. Therefore such studies were made only for neighboring symbols, where the local transitivity (frequencies of 2-tuples, e.g. ab) was studied, for example, by Harary and Paper (1957).

Time intervals between consecutive patent applications of patentees (Kunz 1987), and time intervals between consecutive publications (Kunz 1993) were determined for some small samples.

A stochastic generation of a string of m repetitions of an alphabet of n symbols is conventionally modeled by tossing a die with n sides.

A coin is the first nontrivial model, a die with two sides. When a coin is tossed, runs of different lengths appear in which one result prevails. The distribution of the lengths of sequences between successive events (head or tail) in all possible runs is known as the negative binomial distribution.

The negative binomial distribution is the inverse to the binomial distribution. It evaluates frequencies of distances between consecutive binary symbols in their strings.
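The coin model can be sketched numerically. The snippet below is only an illustration under an assumption not stated in the text: tosses are independent and the chosen side appears with a fixed probability p. Under that assumption the distance between successive occurrences follows the geometric distribution, the simplest case of the negative binomial distribution. The function name is hypothetical.

```python
def distance_probability(d, p):
    """Probability that the next occurrence of a symbol lies at distance d.

    Assumes independent draws with a fixed probability p per position
    (an assumption made for this sketch, not a claim about real texts).
    """
    if d < 1:
        raise ValueError("distance must be at least 1")
    return p * (1.0 - p) ** (d - 1)


# Fair coin: probabilities of distances 1..5 between successive heads.
coin = [distance_probability(d, 0.5) for d in range(1, 6)]

# The mean distance approaches 1/p (here 2) as more terms are summed.
mean_distance = sum(d * distance_probability(d, 0.5) for d in range(1, 200))
```

For a symbol with relative frequency p, the same formula gives the expected share of doubled symbols (distance 1) as p itself, which is the kind of expectation the observed doublets are compared with later in the text.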

This distribution was a statistical curiosity until some decades ago, since its evaluation was a rather difficult task (Irwing 1963), because its distribution function does not exist in a closed form. Now it is included in standard statistical software packages (STATGRAPHICS, Statistical Graphics Corporation).

The distances between symbols in a codone and English and Czech text were analyzed by Kunz and Rádl (1998). I then analyzed the distances between numerals in the first 10000 digits of the number e and the distances in the artificial codone based on the number e (Kunz 2000).

The purpose of this study is, first, to determine the statistical properties of Shakespeare's sonnets and to gain some knowledge of how the poet used the language, and then to find out whether distance analysis can reveal some differences between prosaic and poetic texts.

Results

Shakespeare's sonnets were obtained as an ASCII file from the Internet (Project Gutenberg). Their numbering and dividing rows were stripped, as well as doubled or tripled spacebars, using MS-Word. After these formal corrections, the file contains 93772 signs including spaces, or 76092 signs without spaces, in 2155 lines and 17582 words.

This means that the mean length of a word is 4.327 signs (including apostrophes and punctuation marks), and the mean length of a verse is 43.51 signs (with spaces) or 35.31 signs (without spaces), that is, 8.159 words.

After this, the distances were determined by a program written by Rádl. The string is first indexed with the position index i (i going from 1 to m) of each individual symbol in the string, and then the differences of these position indexes are determined for consecutive occurrences of the same symbol. These differences are the topological distances between the same symbols. The sets of these values were evaluated by different statistical tests. The program counting distances counts all signs, including spacebar, return, and punctuation marks.
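The mean values quoted above follow directly from the four counts. A small check, with variable names chosen here for illustration:

```python
# Counts reported for the corrected file.
signs_with_spaces = 93772
signs_without_spaces = 76092
lines = 2155
words = 17582

mean_word_length = signs_without_spaces / words          # about 4.33 signs
mean_verse_with_spaces = signs_with_spaces / lines       # about 43.51 signs
mean_verse_without_spaces = signs_without_spaces / lines  # about 35.31 signs
words_per_verse = words / lines                           # about 8.16 words
```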
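The indexing procedure can be sketched as follows. This is not Rádl's program, which is not reproduced in the text, but a minimal Python version of the same idea; the function name is an assumption made here.

```python
from collections import defaultdict


def symbol_distances(text):
    """Return, for every symbol, the distances between its consecutive
    occurrences. Positions are counted from 1, and every sign (letters,
    spaces, line breaks, punctuation) is treated as a symbol."""
    last_position = {}
    distances = defaultdict(list)
    for i, symbol in enumerate(text, start=1):
        if symbol in last_position:
            distances[symbol].append(i - last_position[symbol])
        last_position[symbol] = i
    return dict(distances)


distances = symbol_distances("to be or not to be")
```

In this short example the letter o sits at positions 2, 7, 11 and 15, so its distances are 5, 4 and 4. A symbol that occurs only once has no distance and does not appear in the result.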

Replacing the numbering of sonnets by a sign, the length of sonnets can be determined as the distances between these signs. The results are tabulated as follows:

Table 1

Length of sonnets. Chisquare test.

The normal distribution. Mean: 649.47, standard deviation 22.1.

Lower	Upper	Observed	Expected
Limit	Limit	Frequency	Frequency	Chisquare
546	611.818	4	6.8	1.1575
611.818	620.909	8	8.3	.0107
620.909	630.000	11	14.0	.6496
630.000	639.091	23	20.0	.4373
639.091	648.182	26	24.2	.1268
648.182	657.273	32	24.8	2.0709
657.273	666.364	16	21.5	1.4148
666.364	675.455	20	15.8	1.1271
675.455	684.545	8	9.8	.3296
684.545	649	6	8.7	.8193

Chisquare = 8.14362 with 7 degrees of freedom. Significance level = 0.320101.

The length of sonnets is slightly bimodal: between the central narrow peak and the second one there is a valley. The differences in length in this area, which includes about one half of all sonnets, are about 9 signs in 14 verses, that is, about 2 words.

Then the distances between the individual letters were determined, first separately for the lower case and the upper case (when enough occurrences were available), then taken together.
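The chi-square statistic of Table 1 can be recomputed from the observed and expected frequencies. The sketch below uses the rounded expected values printed in the table, so it reproduces the reported statistic only approximately; the significance level would additionally need the chi-square distribution function, which is not computed here.

```python
# Observed and expected class frequencies copied from Table 1.
observed = [4, 8, 11, 23, 26, 32, 16, 20, 8, 6]
expected = [6.8, 8.3, 14.0, 20.0, 24.2, 24.8, 21.5, 15.8, 9.8, 8.7]

# Contribution of each class: (observed - expected)^2 / expected.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

number_of_sonnets = sum(observed)
```

The sum of the contributions is close to the tabulated 8.14, and the observed frequencies add up to the 154 sonnets.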

From all available implemented distributions, only four distributions gave significant results, the exponential distribution, the Weibull distribution, the lognormal distribution, and the negative binomial distribution, as before.

The actual values (mean, standard deviation, skewness, kurtosis, distribution parameters etc.) were determined only in some instances.

The spacebar

The distances between consecutive spacebars greater than 1 determine the number of words of the length corresponding to the distance minus one. There are 17680 spacebars after corrections. This number differs somewhat from the direct count of words. The results are tabulated as follows. Cumulating the frequencies of shorter distances improved the fit in some cases, since below the limit the counts are scattered and the differences can balance themselves.

Table 2 The number of words with the different length

Length	Number	Type of distribution, chisquare value
1	547	LN, 0.253
2	2870	NB, 0, over 8 = 0.521
3	3212	NB, 0, over 16 = 0.208
4	4012	NB, 0.091 + 0.873
5	2714	NB, 0, over 11 = 0.208
6	1744	EX, 0.069
7	1073	WE, 0.208
8	692	NB, 0.415
9	394	WE, 0.305
10	190	NB, 0.540
11	69	WE, 0.670
12	31	EX, 0.591
13	15	few data
14	13
15	2
16	1
17	1
18	1

The distribution of length of words seems to have the lognormal shape, but this guess was not tested.
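The rule that a word length equals the distance between consecutive spacebars minus one can be sketched as below. The function name is illustrative, and the sketch counts only words enclosed by two spaces, so a first or last word without a surrounding space is missed.

```python
from collections import Counter


def word_length_counts(text):
    """Count word lengths from the distances between consecutive spaces."""
    positions = [i for i, sign in enumerate(text, start=1) if sign == " "]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    # A gap of 1 means two adjacent spaces and encloses no word.
    return Counter(gap - 1 for gap in gaps if gap > 1)


counts = word_length_counts(" to be or not ")
```

Here the spaces sit at positions 1, 4, 7, 10 and 14, giving three words of length 2 and one word of length 3.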

Notes to some results:

The distribution of one letter words is poorly correlated by the lognormal distribution. There are two peaks and one valley. The main peak between distances 51-60 is slightly above the length of one verse (21 occurrences against 12 expected). It alone makes 45.9 % of the chi-square test value.

The distribution of two letter words is fairly correlated by the negative binomial distribution, similarly as the next 3 groups. The chisquare value is low due to their repeating within distance one (292 occurrences against 466.2 expected). This alone makes 52.8 % of the chi-square test value. There is a long peak within distances 5-8 (1179 occurrences against 991.2 expected). This makes another 29.9 % of the low chi-square test value.

The distribution of three letter words is fair with the forced lower limit 11.

There are too many words of length four, so the program failed to perform the test. It was necessary to split the file into two equal parts and to test both parts separately. The results are shown in the following tables, as an example of the other results.

Table 3

The distribution of distances between words of length 4. The first half.

Lower	Upper	Observed	Expected
Limit	Limit	Frequency	Frequency	Chisquare
at or below	1.500	476	464.1	.3045
1.500	2.500	349	356.9	.1770
2.500	3.500	284	274.5	.3268
3.500	4.500	207	211.1	.0811
4.500	5.500	172	162.4	.5691
5.500	6.500	131	124.9	.2988
6.500	7.500	90	96.1	.3815
7.500	8.500	68	73.9	.4672
8.500	9.500	54	56.8	.1397
9.500	10.500	32	43.7	3.1314
10.500	11.500	17	33.6	8.2070
11.500	12.500	22	25.8	.5728
12.500	13.500	24	19.9	.8541
13.500	14.500	25	15.3	6.1677
14.500	15.500	13	11.8	.1310
15.500	16.500	11	9.0	.4232
16.500	17.500	4	7.0	1.2559
17.500	18.500	8	5.3	1.3132
18.500	20.500	10	7.3	1.0175
20.500	38	13	10.5	.5743

Chisquare = 26.3937 with 18 degrees of freedom. Significance level = 0.09109

The chisquare value is rather low. But inspecting its constituents, we see that there are only 49 distances of 10 and 11 words against 77.3 expected, and 25 distances of 14 words against 15.3 expected. These two differences make only one percent of all occurrences of words but 66.3 % of the chisquare value.
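The share of the chi-square value attributed to these classes can be checked from the last column of Table 3. A short sketch, using the tabulated contributions of the classes for distances 10, 11 and 14 and the reported total:

```python
# Chi-square contributions copied from Table 3.
contribution_distance_10 = 3.1314
contribution_distance_11 = 8.2070
contribution_distance_14 = 6.1677
total_chi_square = 26.3937

share_percent = 100 * (
    contribution_distance_10
    + contribution_distance_11
    + contribution_distance_14
) / total_chi_square
```

The three classes together account for about 66.3 % of the total, in agreement with the text.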

Table 4

The distribution of distances between words of length 4. The second half.

Lower	Upper	Observed	Expected
Limit	Limit	Frequency	Frequency	Chisquare
at or below	1.500	445	446.8	.00743
1.500	2.500	350	347.1	.02428
2.500	3.500	266	269.6	.04885
3.500	4.500	224	209.5	1.01059
4.500	5.500	161	162.7	.01785
5.500	6.500	127	126.4	.00294
6.500	7.500	88	98.2	1.05587
7.500	8.500	74	76.3	.06749
8.500	9.500	65	59.2	.55874
9.500	10.500	47	46.0	.02073
10.500	11.500	25	35.8	3.23328
11.500	12.500	31	27.8	.37515
12.500	13.500	23	21.6	.09429
13.500	14.500	16	16.8	.03435
14.500	15.500	14	13.0	.07401
15.500	16.500	16	10.1	3.42717
16.500	17.500	5	7.9	1.03815
17.500	18.500	4	6.1	.72436
18.500	20.500	8	8.4	.02124
20.500	22.500	4	5.1	.23064
22.500	44	9	7.7	.20717

Chisquare = 12.2746 with 19 degrees of freedom. Sig. level = 0.873556.

Chisquare is almost excellent. Inspecting its constituents, we see again that there are only 25 distances of 11 words against 35.8 expected and 16 distances of 16 words against 10.1 expected. This difference makes less than one percent of all occurrences.

When the consistency of both parts is tested by the two sample analysis, the null hypothesis is not rejected.

The distribution of five letter words is poor due to their low repeating within distance one (273 occurrences against 368.3 expected). This makes 36.8 % of the chi-square test value. There is a peak within the distance 2 (377 occurrences against 318.3 expected). This makes 16.1 % of the poor chi-square test value. Other deviations are minor.

The exponential distribution of six letter words is poor due to their low repeating within distance one (140 occurrences against 163.8 expected). This makes 17.4 % of the chi-square test value. There is a peak within distances 6-13 (639 occurrences against 588.6 expected). This makes 22.2 % of the poor chi-square test value.

The distribution of seven letter words, as for the following odd-length words, is described by the Weibull distribution. The correlation is poor. There exist two shortages, within distance one 27 (19 occurrences against 27.2 expected), and then within distances 80-92 (6 occurrences against 11.9 expected). They together make 65.4 % of the chi-square test value.

The distribution of eight letter words, as well as ten letter words is again the negative binomial distribution. The shortage of these words within distances 133-146 (1 occurrence against 5.8 expected) contributes 34 % of the chi-square test value. The tail is longer (14 occurrences against 8.8 expected). This makes 26.1 % of the poor chi-square test value.

The distribution of nine letter words is fairly correlated. The shortage of these words within distances 133-146 (1 occurrence against 5.8 expected) contributes 34 % of the chi-square test value.

The distribution of ten letter words is fairly correlated, too. The shortage of these words within distances 51-75 (16 occurrences against 26.2 expected) contributes one half of the chi-square test value.

The distributions of longer words are well correlated, or the tests failed due to few data.

Distances between points and commas

Distances between punctuation marks show the length of sentences or clauses.

An interesting result is obtained for the distribution of the points:

Table 5

The negative binomial distribution of distances between points. Chisquare Test

Lower	Upper	Observed	Expected
Limit	Limit	Frequency	Frequency	Chisquare
at or below	35.250	32	98.2	44.5925
35.250	69.500	56	78.2	6.2818
69.500	103.750	126	64.2	59.3820
103.750	138.000	29	52.8	10.7259
138.000	172.250	80	43.4	30.8804
172.250	206.500	75	35.7	43.3816
206.500	240.750	13	29.3	9.0786
240.750	275.000	30	24.1	1.4485
275.000	309.250	11	19.8	3.9122
309.250	343.500	21	16.3	1.3718
343.500	377.750	27	13.4	13.8755
377.750	412.000	4	11.0	4.4493
412.000	446.250	4	9.0	2.8067
446.250	480.500	7	7.4	.0245
480.500	514.750	3	6.1	1.5784
514.750	549.000	8	5.0	1.7739
549.000	617.500	8	7.5	.0317
617.500	686.000	2	5.1	1.8629
686	734	1	10.6	8.6593

Chisquare = 246.117 with 17 d.f. Sig. level = 0

The mean distance is 174.62. This is almost exactly four verses (4 × 43.51 = 174.04 signs). The oscillations correspond to the number of verses. The comma is the most frequently used punctuation mark for dividing the verses:

Table 6

The negative binomial distribution of distances between commas. Chisquare Test

Lower	Upper	Observed	Expected
Limit	Limit	Frequency	Frequency	Chisquare
2	12.485	139	177.8	8.4545
12.485	23.970	364	328.5	3.8466
23.970	35.455	273	368.4	24.7261
35.455	46.939	500	289.7	152.7396
46.939	58.424	167	247.9	26.3870
58.424	69.909	123	169.1	12.5859
69.909	81.394	125	132.8	.4609
81.394	92.879	134	85.4	27.6319
92.879	104.364	50	64.3	3.1786
104.364	115.848	22	40.0	8.1144
115.848	127.333	30	29.4	.0134
127.333	138.818	30	17.9	8.1625
138.818	150.303	8	12.9	1.8772
150.303	161.788	5	7.8	.9872
161.788	173.273	7	5.5	.3881
173.273	268	10	9.6	.0179

Chisquare = 279.572 with 14 d.f. Sig. level = 0

Distances between individual letters

The results for all letters are presented in the form of the table, where the frequencies of all symbols are given and the significance of the performed chi-square tests. Then the commentaries to all symbols of the alphabet are given.

Table 7 Survey of results

Notes:

EX = exponential distribution

WE = Weibull distribution

LN = lognormal distribution

NB = negative binomial distribution

* = the test was not made, since there were not enough data

Statistic = XX, the chi-square test

Symbol	Small	Capital	Both
a	4571, EX, 0	367, EX, 0.664	4938, EX, 0
b	1085, EX, 0.036	144, EX, 0.809	1229, WE, 0.087
c	1311, NB, 0.358	31, EX, 0.041	1342, EX, 0.522
d	2724, EX, 0	38, EX, 0.190	2762, NB, 0
e	9219, NB, 0	23, EX, 0.186	9242, NB, 0
f	1556, NB, 0.263	107, EX, 0.316	1663, NB, 0.993
g	1342, EX, 0.038	16*	1358, NB, 0.091
h	5002, EX, 0	65, EX, 0.867	5067, EX, 0
i	4232, EX, 0	443, LN, 0.883	4675, EX, 0
j	66, LN, 0.604	2*	68, LN, 0.604
k	547, EX, 0.011	6*	552, EX, 0.011
l	3033, EX, 0	58, EX, 0.237	3091, EX, 0
m	2004, WE, 0.671	90, WE, 0.098	2094, WE, 0.670
n	4445, NB, 0	73, EX, 0.826	4518, NB, 0
o	5579, NB, 0	127, LN, 0.685	5706, NB, 0
p	986, NB, 0	24*	1010, NB, 0
q	51, EX, 0.739	0	51, EX, 0.739
r	4165, NB, 0	17, EX, 0.573	4182, NB, 0
s	4846, NB, 0	141, LN, 0.672	4987, NB, 0
t	6754, NB, 0	459, EX, 0.197	7213, NB, 0
u	2299, EX, 0	21, EX, 0.785	2320, EX, 0.008
v	924, EX, 0.008	1*	925, EX, 0.008
w	1645, EX, 0	252, EX, 0.630	1897, EX, 0
x	60, EX, 0.926	0	60, EX, 0.926
y	1951, LN, 0	34, EX, 0.470	1985, EX, 0
z	20, EX, 0.931	0	20, EX, 0.931

The Weibull distribution is the best one only in the case of the letter m. The lognormal distribution is the best in 4 cases: the capital letters I, O, and S, and both cases of the letter j (there are only two capital J). The exponential distribution is the best in most of the performed tests, and the negative binomial distribution in 9 cases.

The fit varied from excellent, for example f with the chi-square value 0.994, to practically zero values, as for the most frequent vowels.

The differences between experimental and calculated values were usually great at the shortest distances (1 to 10). Adjusting the lowest possible value to greater distances by pooling these distances increased the significance of the chi-square tests in some cases. The significance sometimes improved dramatically, see below.
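The effect of pooling can be illustrated with the first four classes of Table 3. This is only a sketch of the idea, with hypothetical function names; the way the statistical package pools classes internally is not described in the text, and the values below use the rounded expected frequencies of the table.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def pool_lower(observed, expected, n_classes):
    """Merge the first n_classes classes into a single class."""
    pooled_observed = [sum(observed[:n_classes])] + list(observed[n_classes:])
    pooled_expected = [sum(expected[:n_classes])] + list(expected[n_classes:])
    return pooled_observed, pooled_expected


# Distances 1 to 4 between four letter words (Table 3, first half).
observed = [476, 349, 284, 207]
expected = [464.1, 356.9, 274.5, 211.1]

unpooled = chi_square(observed, expected)
pooled = chi_square(*pool_lower(observed, expected, 2))
```

The surplus at distance 1 and the shortage at distance 2 nearly cancel when the two classes are merged (825 observed against 821 expected), so the pooled statistic is about half of the unpooled one.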

Now, the commentaries to the individual letters follow.

The frequency of the capital A allowed a separate test. The result with the exponential distribution is good, even if there is too high a repetition within one-verse distances (90 occurrences against 75.8 expected). This makes 39.5 % of the chi-square test value.

The distribution of distances between lower case a and both case (a + A) seems to be the exponential, at least their tails fit. The distribution cannot be satisfactorily described by a simple function due to fluctuations of frequencies between odd and even distances. This feature can be documented by pooling the lower distances between both case (a + A):

Table 8 Chisquare values of pooled lower distances

over	1. part	2. part	3. part
26		0.0229
27	0.1371	0.1187
28	0.3054	0.0976
29	0.0027	0.2432	0.0158
30	0.1481		0.1872
31			0.0835

The fluctuations between odd and even distances are not constant.

Correlating the observed frequencies of the same classes of distances of one part against those of another part gives a fairly linear plot (due to the span of values in the logarithmic scale). The two sample analysis shows that the parts are from one whole.

	1. part : 2. part	1. part : 3. part	2. part : 3. part
a	0.3790	0.5597	0.7611
a+ A	0.9379	0.6070	0.6530

The distribution of distances between upper case B is exponential. There is a peak within distances 438-655 which corresponds to 10-12 verses (90 occurrences against 70 expected). This makes nearly two thirds of the chi-square test value.

The exponential distribution of distances between lower case b is worsened by a shortage within distances 277-337 (9 occurrences against 22.4 expected). This makes more than one third of the chi-square test value. There are too few doubled bb (5 occurrences against 12.5 expected), which contributes 19.1 % of the chi-square test value, and too many occurrences (256 against 224.6 expected) within distances 32-62. This again makes 18.7 % of the chi-square test value. The combination of both cases changed the form of the distribution to Weibull. There are too few doubled bb (5 occurrences against 11.9 expected), which contributes 24.1 % of the chi-square test value. Other deviations have a minor weight.

The distribution of the letter c varies between exponential (c chi-square value 0.3726, c + C chi-square value 0.5223), and negative binomial (c chi-square value 0.3580, c + C chi-square value 0.4790). In both cases, there is a shortage in the range 143-169 (34 occurrences against 48.8 expected, or 26 occurrences against 38.8 expected). This makes 29.3 % and 24.8 % of the chi-square test value, respectively.

For the letter d, too, the negative binomial or exponential distributions were applicable, with many deviations. There are too few doubled dd (18 occurrences against 79.2 [d] or 81.4 [d + D] expected), which contributes 62.4 % or 63.6 % of the chi-square test value, respectively. The exponential distribution is fair over the limit 30 (the chi-square test value 0.168).

There are relatively few E compared with the number of e.

The distribution of distances between lower case e and both case (e + E) seems to be the negative binomial, at least their tails fit. The distribution cannot be satisfactorily described by a simple function due to fluctuations of frequencies between odd and even distances. This feature can be documented by pooling the lower distances between e:

Table 9 Chisquare values of pooled lower distances

over	1. part	2. part	3. part	4. part
10			0.1558
11	0.2949	0.2949		0.1672
12	0.4356	0.4356
13	0.5101	0.5101		0.2719
14	0.3687	0.3687
18				0.2918
20	0.5060	0.5060
22	0.2500	0.2500
23	0.2442	0.2442
24	0.5547	0.5543
25	0.5541	0.5541
26	0.5541	0.5730
30				0.5134

and between both case (e + E):

Table 10 Chisquare values of pooled lower distances

over	1. part	2. part	3. part	4. part	5. part
9		0.1517	0.2414
10		0.2299			0.3077
11		0.2926		0.1859	0.1621
12	0.4118			0.2559	0.2267
13	0.5131				0.2456
15	0.5712
17			0.4662
19			0.6346
20	0.5014
25	0.5485
28					0.3799
29					0.6237
30				0.5134	0.5175

The fluctuations between odd and even distances are not constant.

Table 11 The two way sample analysis of e distance tests

The differences between values in brackets are significant, the null hypothesis should be rejected.

	2. part	3. part	4. part
1. part	[0.0007]	0.9100	0.0559
2. part		[0.0006]	0.1533
3. part			[0.0460]

The first part corresponds to the third part, and correlates badly with the second part and also with the fourth part.

Table 12 The two sample analysis of e + E

The differences between values in brackets are significant, the null hypothesis should be rejected.

	2. part	3. part	4. part	5. part
1. part	0.7108	[0.0009]	0.7964	0.0625
2. part		[0.0028]	0.5228	0.1304
3. part			[0.0004]	0.1511
4. part				[0.0371]

The first part corresponds to the second and fourth parts, and correlates badly with the third and fifth parts. As an example, the output of the test between parts 1 and 3 is given:

Table 13 The two sample analysis of e + E

	WEE1.var1	WEE3.var1	Pooled
Sample Statistics: Number of Obs.	1855	1840	3695
Average	9.83881	10.8168	10.3258
Variance	69.1978	90.626	79.8684
Std. Deviation	8.31852	9.51977	8.93691
Median	7	8	8

Difference between Means = -0.978034

Conf. Interval For Diff. in Means: 95 Percent

Equal Vars.) Sample 1 - Sample 2 -1.55467 -0.4014 3693 D.F.

Unequal Vars.) Sample 1 - Sample 2 -1.55499 -0.401083 3619.9 D.F.

Ratio of Variances = 0.763554

Conf. Interval for Ratio of Variances: 0 Percent

Sample 1 Sample 2

Hypothesis Test for H0: Diff = 0 Computed t statistic = -3.32614

vs Alt: NE Sig. Level = 8.89197E-4

at Alpha = 0.05 so reject H0.
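The main figures of this output can be recomputed from the sample statistics alone. The sketch below applies the standard two sample t formulas to the values of Table 13; it is not the code of the statistical package, and the variable names are chosen here for illustration.

```python
import math

# Sample statistics copied from Table 13 (parts 1 and 3 of e + E).
n1, mean1, var1 = 1855, 9.83881, 69.1978
n2, mean2, var2 = 1840, 10.8168, 90.626

# Pooled variance and t statistic under the equal-variance assumption.
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_stat = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Degrees of freedom for unequal variances (Welch-Satterthwaite formula).
a, b = var1 / n1, var2 / n2
welch_df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

variance_ratio = var1 / var2
```

These reproduce the pooled variance 79.8684, a t statistic of about -3.326, about 3619.9 degrees of freedom for unequal variances, and the ratio of variances 0.7636.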

Inspecting both tables, it seems that it would be possible to locate, with greater precision, the parts where the distribution of e differs.

It is not necessary to add any notes to the excellent fit of the letter f. But it is rather interesting how the scattered F improved the distribution of the lower case f.

The distribution of the both case g + G is again more regular than the distribution of the lower case g. Over distances 25, the chi-square test value for g is 0.9004, for g + G 0.4215, only. Few occurrences of gg (8 occurrences against 19.7 expected) make about one third of the chi-square test value.

There is a peak within distances 40-49 (83 occurrences against 49.8 expected). This makes two fifths of the chi-square test value. The second smaller peak lies within distances 21-30 (224 occurrences against 182.6 expected). This makes about one fifth of the chi-square test value. The lognormal distribution of this letter is shorter than expected (7 occurrences over 86 against 12.7 expected). This makes more than one fourth of the chi-square test value.

The lognormal distribution of the capital I is excellent, the significance of the chi-square test is 0.883. There is a peak within distances 20-26 verses (9 occurrences against 6 expected). This makes one third of the chi-square test value.

The lognormal distribution of the both case i + I is poor. There is a shortage of distances 14-24 (287 occurrences against 326.1 expected). This makes about one fifth of the chi-square test value. There is a peak within distances 47-58 (52 occurrences against 40.7 expected). This makes one sixth of the chi-square test value. The lognormal distribution of this letter is shorter than expected (9 occurrences over 96 against 20.5 expected). This makes about one third of the chi-square test value.

The lognormal distribution of the letter j shows no major deviations.

The exponential distribution of the letter k is poorly correlated due to many repetitions in the second verses (distances 51-100, 122 occurrences against 104.3 expected). This makes 19.1 % of the chi-square test value. There is a peak within distances 12-13 verses (14 occurrences against 5.5 expected). This makes about one half of the chi-square test value.

The exponential distribution of the lower case l is distorted by many double ll (248 and 280 occurrences against 48.7 or 47.9 expected, respectively in two parts). This makes more than 90 % of the total very high chi-square test value. The correlation of the first half is good, when the lower limit is set over 13:

Lower limit	13	14	15	16	17
Chi-square	0.4105	0.8272	0.6526	0.6787	0.6220

The correlation of the second half is good, only when the lower limit is set over 29:

Lower limit	29	30	31
Chi-square	0.1805	0.8129	0.6622

The exponential distribution of both case l + L is distorted by many double ll, which again makes more than 90 % of the total very high chi-square test value. The correlation of the first half is good, when the lower limit is set over 12:

Limit	12	13	14	15	16	17	18	19
Chi	0.2553	0.7107	0.5649	0.66849	0.6532	0.3830	0.1721	0.1721

The correlation of the second half is good, only when the lower limit is set over 29:

Lower limit	29	30	31	32
Chi-square	0.2235	0.8180	0.6332	0.3332

The consistency of both parts is good (l1/l2) = 0.5984, (l + L1)/(l + L2) = 0.6922. Even the comparisons of (l1/l + L1) = 0.6785, (l2/l + L2) = 0.5816 are acceptable.

The letter m is best correlated with the Weibull distribution. Even the negative binomial distribution is acceptable (the chi-square test value 0.582). The doubled mm fit excellently with the negative binomial distribution but they form a peak in the Weibull distribution (42 occurrences against 30.7 expected). This makes more than one third of the chi-square test value. There is a shortage of distances 100-113 (both models, 50 occurrences against 63.9 (WE) or 61.5 (NB) expected). This makes one quarter of the chi-square test value (WE). The combination with the upper case M decreased weights of these fluctuations somewhat.

The distribution of n and (n + N) was divided into two parts, which were slightly different. The negative binomial distribution of this letter is distorted by the shortage of doubled nn (only 11 % of expected, which makes two thirds of the chi-square test value). The chi-square test values improved by pooling lower distances differently: 1. part of n over 20 = 0.917, 2. part of n over 24 = 0.706, 1. part of (n + N) over 25 = 0.925, 2. part (n + N) over 24 = 0.938.

The lognormal distribution of capital O has a peak between 29-41 verses (20 occurrences against 14.3 expected). This makes 73.3 % of the chi-square test value. The distribution of o and (o + O) was again divided into two parts, which were slightly different. The negative binomial distribution of this letter is distorted by the shortage of doubled oo (only 51.8 % of expected, which makes one half of the chi-square test value). The chi-square test values improved by pooling lower distances differently: 1. part of o over 21 = 0.216, 2. part of o over 12 = 0.356, 1. part of (o + O) over 29 = 0.101, 2. part (o + O) over 11 = 0.472.

The negative binomial distribution of the letter p is distorted by the surplus of doubled pp (44 occurrences against 10.4 expected), which makes 92.7 % of the chi-square test value. The chi-square test value improved to 0.296 with the forced lower limit 10. The tail over 161 is almost perfect.

The exponential distribution of the letter q calls for no comments.

The distribution of r and (r + R) was divided into two parts, which were slightly different. The negative binomial distribution of this letter is distorted by the shortage of doubled rr (only 24.8 % of expected, which makes about one half of the chi-square test value). The chi-square test values improved by pooling lower distances differently: 1. part of r over 30 = 0.241, 2. part of r over 42 = 0.671, 1. part of (r + R) over 30 = 0.186, 2. part (r + R) over 40 = 0.541.

The lognormal distribution of the capital S is good, the significance of the chi-square test is distorted by a peak within distances of 10-14 verses (24 occurrences against 19.4 expected). This contributes one quarter to the chi-square test value. The distribution of the lower case s and (s + S) was divided into two parts, which were slightly different. The negative binomial distribution of this letter is distorted significantly by the shortage of doubled ss only in the 1. part (91 occurrences against 125.7 expected). This makes 15.2 % of the chi-square test value. The sign s appears less than expected within distances 2-6 (801 occurrences against 874 expected), and more than expected within distances 6-12 (933 occurrences against 835.9 expected). The chi-square test value improved by pooling lower distances: 1. part of s over 30 = 0.167, 2. part of s over 30 = 0.660. Here appeared a shortage of the distances 49-53 (52 occurrences against 78.6 expected). This makes 75.9 % of the chi-square test value. The distribution of the both case s + S is similarly the negative binomial one, fair with pooled lower distances: 1. part of s over 20 = 0.521, 2. part of s over 30 = 0.525.

The exponential distribution of the capital T has a peak within distances 157-260 (103 occurrences against 84.9 expected). This contributes 23.7 % of the chi-square test value. The distribution is then shorter than expected (9 occurrences over 523 against 21.6 expected). This makes 46 % of the chi-square test value. The distribution of the lower case t is the negative binomial one. When divided into three parts, all parts show the shortage of doubled tt (11-17 % of expected), 66-71 % of the chi-square test value. The first part fitted excellently over 13, the significance chi-square test value is 0.919, the tail over 25 of the second part gives a fair chi-square test value 0.509, whereas the same tail of the third part has the chi-square test value only 0.0019.

The distribution of both (t + T) is the negative binomial one. All three parts show the shortage of doublets Tt + tt (9.8-15 % of expected), 61.8-71.9 % of the chi-square test value. The first part fitted well over 12, the significance chi-square test value is 0.798, the tail over 25 of the second part gives a fair chi-square test value 0.529, whereas the same tail of the third part has the chi-square test value only 0.003, and it is better correlated as the negative binomial distribution.

There are no doubled uu (0 occurrences against 55.7 expected). This makes 73.3 % of the chi-square test value of the exponential distribution. When the lower limit is set to 20, the chi-square test value is improved to 0.263. The exponential distribution of both (u + U), divided into three parts, correlates differently, again. The first part fitted poorly over 20, the significance chi-square test value is 0.105, the tail over 20 of the second part gives a good chi-square test value 0.747, whereas the tail over 13 of the third part has the chi-square test value 0.512.

The exponential distribution of the letter v has a shortage of distances up to 32 (223 occurrences against 256.8 expected). This contributes 39 % of the chi-square test value. Then there follows a peak within distances 33-76 (368 occurrences against 312.5 expected). This contributes 33.8 % of the chi-square test value. The tail over 50 fits well with the chi-square test value 0.402.

The exponential distribution of the upper case W gives a good fit. There is a peak of the distances 113-224 (69 occurrences against 55.6 expected). This alone makes 45.4 % of the chi-square test value. The exponential distribution of the lower case gives an acceptable fit over 10 (the chi-square test value 0.350). There are no doubled ww (59 % of the chi-square test value). Combined (w + W) improved somewhat the fit, the absence of ww makes 61.9 % of the chi-square test value, since the sample is greater. There is a shortage of the distances 118-131 (about 3 verses, 28 occurrences against 45.1 expected). This makes 10.6 % of the chi-square test value. Over 15 the chi-square test value is 0.462.

The exponential distribution is almost perfect.

The exponential distribution of the upper case Y gives a good fit. It somewhat improves the very poor lognormal distribution of the lower case. There is a long peak within distances 74-117 (272 occurrences against 200.2 expected). This alone makes 48.2 % of the chi-square test value. The lognormal distribution of this letter has a shorter tail than expected (25 occurrences over 205 against 52.6 expected). This makes 33.5 % of the chi-square test value.

The exponential distribution is almost perfect.

I also tried to find the distribution of distances between words or groups of signs. As an example, I took the frequencies of All (10 occurrences), all (121) and *all (as in call, shall etc., 209 occurrences). The distribution of distances between occurrences of the determiner all is the Weibull one; the chi-square test value is 0.448 with 121 occurrences.
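Counting such distances can be sketched in a few lines. The example below is only an illustration under stated assumptions: the distance is taken as the difference between the starting positions of consecutive occurrences, measured in characters, and the function name, the sample sentence and the two regular expressions are hypothetical. The counting rules actually used for the Sonnets may differ.

```python
import re

def occurrence_distances(text, pattern):
    """Distances (in characters) between starts of consecutive matches of pattern."""
    positions = [m.start() for m in re.finditer(pattern, text)]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]

# Hypothetical sample sentence, not taken from the Sonnets.
sample = "All is well; we shall call all, and all shall fall."

# The separate word "all" in lower case only (the capitalized "All" is a different class).
whole_word = occurrence_distances(sample, r"\ball\b")

# The group "*all": "all" at the end of a longer word (call, shall etc.).
word_ending = occurrence_distances(sample, r"\Ball\b")

print(whole_word, word_ending)
```

The word boundaries in the patterns keep the three classes apart, so that "shall" is counted only in the *all group and never as the determiner.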

DISCUSSION

The corrections (removal of superfluous spaces) in some cases worsened the fits when compared with preliminary tests made with the raw text, as if the writer's errors were a part of the scheme leading to some distribution of distances between symbols.

In verse, the repetition of some letters at certain intervals is intentional, since they form rhymes. In the statistics, however, this feature is blurred by the occurrences of the same letters within verses. The verse structure of the text revealed itself in the use of points.

The too frequent repetition of the capital A at distances of one verse (90 occurrences against 75.8 expected) is due mostly to sonnet number 66, where 11 verses start with "And". This initial "And" recurs in other sonnets, too, and in combination with other initial A's it makes the peak. This distortion must be considered intentional.

Some distributions of distances between consonants are highly regular, especially their tails, if the low distances inside words are pooled. They are described with differing precision by four distributions: exponential, Weibull, lognormal and negative binomial. Sometimes it is rather difficult to decide which distribution gives the better fit.
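The pooling of low distances can be illustrated with the simplest random model. The sketch below assumes a geometric distribution, the discrete counterpart of the exponential one, in which a symbol with relative frequency p occurs independently at every position. The function names and the numbers in the usage lines are hypothetical; this is not the fitting procedure of the software used here.

```python
def expected_counts(n_distances, p, max_distance):
    """Expected numbers of distances 1..max_distance under the geometric model."""
    return [n_distances * p * (1 - p) ** (d - 1)
            for d in range(1, max_distance + 1)]

def pooled_and_tail(n_distances, p, limit):
    """Expected count of the pooled distances 1..limit and of the tail over limit."""
    pooled = n_distances * (1 - (1 - p) ** limit)
    tail = n_distances * (1 - p) ** limit
    return pooled, tail

# Hypothetical letter: relative frequency 0.1, 1000 measured distances,
# distances 1-3 (inside words) pooled into one class.
pooled, tail = pooled_and_tail(1000, 0.1, 3)
first_three = expected_counts(1000, 0.1, 3)

print(first_three, pooled, tail)
```

Pooling replaces several short-distance classes, which are disturbed by the rules of word formation, with a single class, so that the test is decided mainly by the regular tail.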

The splitting of the statistics of some frequent letters, which was necessary because of the insufficient memory of the software used, showed new possibilities of the distance analysis.

Since there are statistically significant differences between the parts, it seems that the Sonnets are not a single work but a collection of sonnets comprising different parts. No attempt was made to synchronize the statistical analysis with an analysis of subject and style.

If the results are compared with the published example of a scientific paper (Kunz & Rádl, 1998), then some differences can be observed. In both cases, the vowels, except u, are poorly fitted. In both cases, the letter f gave a nearly ideal fit.

Consonants with the worse fit in the Sonnets are: b, c, d, , g, h, k, l, v, and w. Consonants with the better fit in the Sonnets are: m, x, and z. Since there are only few data for study, it can only be speculated whether this is caused by the different use of these consonants in rhymes, which could produce the observed peaks and fluctuations.

It can be concluded that the analysis of distances between lexical units in a text could become a useful method of text analysis.

REFERENCES

Haitun, S. D. (1982a) Stationary Scientometric Distributions I: Different Approximations. Scientometrics, 4, 525.

Haitun, S. D. (1982b) Stationary Scientometric Distributions II: Non Gaussian Nature of Scientific Activities. Scientometrics, 4, 89 - 101.

Haitun, S. D. (1982c) Stationary Scientometric Distributions III: The Role of the Zipf Distribution. Scientometrics, 5, 375 - 395.

Harary, F.; Paper, H. H. (1957) Toward a General Calculus of Phonemic Distribution, Language, 33, 143 - 169.

Irwing, J. O. (1963) The Place of Mathematics in Medical and Biological Statistics, J. Royal. Statistical Soc. A, 126, 1 - 45.

Kunz, M. (1987) Time Spectra of Patent Information, Scientometrics, 11, 163 - 173.

Kunz, M. (1993) About metrics of bibliometrics, J. Chem. Inform. Comput. Sci., 33, 193 - 196.

Kunz, M.; Rádl, Z. (1998) Distribution of Distances in Information Strings, J. Chem. Inform. Comput. Sci., 38, 374 - 378.

Kunz, M. (2000) Number e as a model gene (atlas.cz.mujweb\veda\kunzmilan)

Yule, G. U. (1944) The Statistical Study of Literary Vocabulary, Cambridge University Press, Cambridge.