Distance Analysis of English Texts. I. Shakespeare's Sonnets.

Milan Kunz, Jurkovičova 13, 63800 Brno, The Czech Republic, (kunzmilan@seznam.cz)

Summary

Distances between identical symbols in information strings (biological, language, computer programs (*.exe files)) are described, with varying precision, by four distributions: exponential, Weibull, lognormal and negative binomial. The correlations are sometimes highly significant. Here the distances between signs in Shakespeare's Sonnets are analyzed. Some distance tests revealed specific formal features of the Sonnets.

INTRODUCTION

Statistical properties of information distributions, especially their extreme skewness, raised the notion of their specificity (Haitun 1982a, b, c). Determining frequencies of symbols or words was a time consuming task suitable for shortening unbearably long time periods (Yule 1944).

These linguistic studies had some practical value, too: learning languages by starting with the most frequent words and phrases, and attributing texts to authors.

The inverse of the frequencies are the distances between identical counted objects.

Distances between identical symbols exist in all information strings with any number of symbols or their k-tuples (words). Their manual counting was even more troublesome than counting words. Therefore such studies were made only for neighbor symbols where the local transitivity (frequencies of 2-tuples, e. g. ab) was studied for example by Harary and Paper (1957).

Time intervals between consecutive patent applications of patentees (Kunz 1987), and time intervals between consecutive publications (Kunz 1993) were determined for some small samples.

The stochastic generation of a string of m repetitions of an alphabet of n symbols is conventionally modeled by tossing a die with n sides.

A coin, a die with two sides, is the first nontrivial model. When a coin is tossed, sequences of different lengths appear in which one result prevails. The distribution of the lengths of sequences between successive events (head or tail) in all possible runs is known as the negative binomial distribution.

The negative binomial distribution is the inverse of the binomial distribution. It evaluates the frequencies of distances between consecutive binary symbols in their strings.
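The coin model can be illustrated with a minimal sketch. It assumes independent tosses with a fixed probability p of a head, in which case the distance between consecutive heads follows the geometric distribution, the special case of the negative binomial distribution with a single awaited event. The function names, the seed and the number of tosses are illustrative choices, not part of the original study.

```python
import random
from collections import Counter


def geometric_pmf(d, p):
    """Probability that the next head appears exactly d tosses after the previous one."""
    return p * (1.0 - p) ** (d - 1)


def simulated_distances(p, n_tosses, seed=0):
    """Toss a coin n_tosses times and return the distances between consecutive heads."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    positions = [i for i in range(n_tosses) if rng.random() < p]
    return [b - a for a, b in zip(positions, positions[1:])]


distances = simulated_distances(p=0.5, n_tosses=100_000)
counts = Counter(distances)
for d in range(1, 6):
    observed = counts[d] / len(distances)
    # Observed relative frequency next to the geometric prediction
    print(d, round(observed, 3), round(geometric_pmf(d, 0.5), 3))
```

For a fair coin the mean distance between heads should be close to 1/p = 2; for a letter of a text, p would be replaced by the relative frequency of that letter.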

This distribution was a statistical curiosity till some decades ago since its evaluation was a rather difficult task (Irwing 1963), because its distribution function does not exist in a closed form. Now it is included in standard statistical software program packages (STATGRAPHICS, Statistical Graphics Corporation).

The distances between symbols in a codone and English and Czech text were analyzed by Kunz and Rádl (1998). I then analyzed the distances between numerals in the first 10000 digits of the number e and the distances in the artificial codone based on the number e (Kunz 2000).

The purpose of this study is, first, to determine the statistical properties of Shakespeare's sonnets and to gain some knowledge of how the poet used the language, and then to find out whether distance analysis can reveal differences between prosaic and poetic texts.

Results

Shakespeare's sonnets were obtained as an ASCII file from the Internet (Project Gutenberg). Their numbering and dividing rows were stripped, as well as doubled or tripled spacebars, using MS-Word. After these formal corrections, the file contains 93772 signs including spaces, 76092 signs without spaces, in 2155 lines and 17582 words.

It means that the mean length of a word is 4.328 signs (including apostrophes and punctuation marks), and the mean length of a verse is 43.51 signs (with spaces) or 35.31 signs (without spaces), that is, 8.159 words.
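The mean values follow directly from the counts reported above; a short check of the arithmetic, using only those four counts (the variable names are illustrative):

```python
# Counts reported for the corrected text of the sonnets
signs_with_spaces = 93772
signs_without_spaces = 76092
lines = 2155
words = 17582

mean_word_length = signs_without_spaces / words          # signs per word
verse_length_with_spaces = signs_with_spaces / lines     # signs per verse, spaces included
verse_length_without_spaces = signs_without_spaces / lines
words_per_verse = words / lines

for name, value in [
    ("mean word length", mean_word_length),
    ("verse length with spaces", verse_length_with_spaces),
    ("verse length without spaces", verse_length_without_spaces),
    ("words per verse", words_per_verse),
]:
    print(f"{name}: {value:.3f}")
```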

After this, the distances were determined by a program elaborated by Rádl. The string is at first indexed with the position index i (i going from 1 to m) of each individual symbol in the string, and then the differences of these position indexes are determined. The differences are the topological distances between the same symbols. The sets of these values were evaluated by different statistical tests. The program counting distances counts all signs, including spacebar, return, and punctuation marks.
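The indexing procedure can be sketched as follows. This is not Rádl's program, only a minimal illustration of the described principle: each symbol is given its position index i, and the difference between the indexes of two consecutive occurrences of the same symbol is recorded. The function name and the sample string are hypothetical.

```python
from collections import defaultdict


def symbol_distances(text):
    """Return, for every symbol, the list of distances between its consecutive occurrences."""
    last_seen = {}                  # symbol -> position index of its previous occurrence
    distances = defaultdict(list)
    for i, symbol in enumerate(text, start=1):   # position index i runs from 1 to m
        if symbol in last_seen:
            distances[symbol].append(i - last_seen[symbol])
        last_seen[symbol] = i
    return dict(distances)


sample = "shall i compare thee"
result = symbol_distances(sample)
print(result["e"])   # [4, 1]: e stands at positions 15, 19 and 20
print(result[" "])   # [2, 8]: the spacebar is counted like any other sign
```

Symbols that occur only once have no distance and therefore do not appear in the result.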

When the numbering of the sonnets is replaced by a sign, the lengths of the sonnets can be determined as the distances between these signs. The results are tabulated as follows:

Table 1

Length of sonnets. Chisquare test.

The normal distribution. Mean: 649.47, standard deviation 22.1.

Lower limit   Upper limit   Observed frequency   Expected frequency   Chisquare
546           611.818        4                    6.8                 1.1575
611.818       620.909        8                    8.3                  .0107
620.909       630.000       11                   14.0                  .6496
630.000       639.091       23                   20.0                  .4373
639.091       648.182       26                   24.2                  .1268
648.182       657.273       32                   24.8                 2.0709
657.273       666.364       16                   21.5                 1.4148
666.364       675.455       20                   15.8                 1.1271
675.455       684.545        8                    9.8                  .3296
684.545       649            6                    8.7                  .8193

Chisquare = 8.14362 with 7 degrees of freedom. Significance level = 0.320101.
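The reported significance levels can be recomputed from the chi-square statistic and the degrees of freedom. The sketch below is not the routine of the statistical package; it assumes integer degrees of freedom and uses the standard finite-sum expressions for the upper tail of the chi-square distribution, so that only the Python standard library is needed. The function name is illustrative.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability (significance level) of chi-square x with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: two-sided normal tail plus a finite sum of odd powers of chi
    chi = math.sqrt(x)
    tail = math.erfc(chi / math.sqrt(2.0))
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term = chi
    total = chi if df >= 3 else 0.0
    for r in range(2, (df - 1) // 2 + 1):
        term *= x / (2 * r - 1)
        total += term
    return tail + 2.0 * density * total


# Statistic and degrees of freedom as reported for the length of sonnets
print(round(chi2_sf(8.14362, 7), 4))
```

The same function can be applied to the other tests reported in this paper, as long as the statistic and the degrees of freedom are given.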

The distribution of the lengths of the sonnets is slightly bimodal: a valley exists between the narrow central peak and the second one. The differences of the length in this area, which includes about one half of all sonnets, are about 9 signs in 14 verses, that is, about 2 words.

Then the distances between the individual letters were determined, at first separately for the lower case and the upper case (when enough occurrences were available), then for both taken together.

From all available implemented distributions, only four distributions gave significant results, the exponential distribution, the Weibull distribution, the lognormal distribution, and the negative binomial distribution, as before.

The actual values (mean, standard deviation, skewness, kurtosis, distribution parameters etc.) were determined only in some instances.

The spacebar

The distances between consecutive spacebars greater than 1 determine the number of words of the length corresponding to the distance minus one. There are 17680 spacebars after the corrections. This number differs somewhat from the direct count of words. The results are tabulated as follows. Cumulating the frequencies of shorter distances improved the fit in some cases, since below the pooling limit the counts are scattered and the differences can balance each other.
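The relation between spacebar distances and word lengths can be sketched as follows. The function name is hypothetical, and the sketch treats a line break as any other sign, which is an assumption about how the text was prepared.

```python
from collections import Counter


def word_lengths_from_spaces(text):
    """Count word lengths as (distance between consecutive spacebars) minus one."""
    positions = [i for i, sign in enumerate(text) if sign == " "]
    distances = [b - a for a, b in zip(positions, positions[1:])]
    # A distance of 1 is a doubled spacebar and encloses no word
    return Counter(d - 1 for d in distances if d > 1)


print(word_lengths_from_spaces("to be or not to"))
```

Only the words enclosed between two spacebars are counted, so the first and the last word of the string are missed; this is one possible source of a small difference against a direct count of words.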

Table 2 The number of words of different lengths

Length   Number   Type of distribution, chisquare value
1        547      LN, 0.253
2        2870     NB, 0, over 8 = 0.521
3        3212     NB, 0, over 16 = 0.208
4        4012     NB, 0.091 + 0.873
5        2714     NB, 0, over 11 = 0.208
6        1744     EX, 0.069
7        1073     WE, 0.208
8        692      NB, 0.415
9        394      WE, 0.305
10       190      NB, 0.540
11       69       WE, 0.670
12       31       EX, 0.591
13       15       few data
14       13
15       2
16       1
17       1
18       1

The distribution of length of words seems to have the lognormal shape, but this guess was not tested.

Notes to some results:

The distribution of one letter words is poorly correlated by the lognormal distribution. There are two peaks and one valley. The main peak between distances 51-60 is slightly above the length of one verse (21 occurrences against 12 expected). It alone makes 45.9 % of the chi-square test value.

The distribution of two letter words is fairly correlated by the negative binomial distribution, similarly to the next 3 groups. The chisquare value is low due to their repeating within distance one (292 occurrences against 466.2 expected). This alone makes 52.8 % of the chi-square test value. There is a long peak within distances 5-8 (1179 occurrences against 991.2 expected). This makes another 29.9 % of the low chi-square test value.

The distribution of three letter words is fair with the forced lower limit 11.

There are too many words of length four, and the program failed to perform the test. It was necessary to split the file into two equal parts and to test both parts separately. The results are shown in the following tables, as an example of the other results.

Table 3

The distribution of distances between words of length 4. The first half.

Lower limit   Upper limit   Observed frequency   Expected frequency   Chisquare
at or below    1.500        476                  464.1                 .3045
 1.500         2.500        349                  356.9                 .1770
 2.500         3.500        284                  274.5                 .3268
 3.500         4.500        207                  211.1                 .0811
 4.500         5.500        172                  162.4                 .5691
 5.500         6.500        131                  124.9                 .2988
 6.500         7.500         90                   96.1                 .3815
 7.500         8.500         68                   73.9                 .4672
 8.500         9.500         54                   56.8                 .1397
 9.500        10.500         32                   43.7                3.1314
10.500        11.500         17                   33.6                8.2070
11.500        12.500         22                   25.8                 .5728
12.500        13.500         24                   19.9                 .8541
13.500        14.500         25                   15.3                6.1677
14.500        15.500         13                   11.8                 .1310
15.500        16.500         11                    9.0                 .4232
16.500        17.500          4                    7.0                1.2559
17.500        18.500          8                    5.3                1.3132
18.500        20.500         10                    7.3                1.0175
20.500        38             13                   10.5                 .5743

Chisquare = 26.3937 with 18 degrees of freedom. Significance level = 0.09109

The significance level is rather low. But inspecting the constituents of the chisquare value, we see that there are only 49 distances of 10 and 11 words against 77.3 expected, and 25 distances of 14 words against 15.3 expected. These two differences make only one percent of all occurrences of words but 66.3 % of the chisquare value.
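The share of the chisquare value carried by these classes can be checked from the tabulated components of Table 3; a minimal sketch (the variable names are illustrative):

```python
# Chisquare components of the three deviating classes of Table 3
components = {
    "9.5-10.5": 3.1314,
    "10.5-11.5": 8.2070,
    "13.5-14.5": 6.1677,
}
total_chisquare = 26.3937   # reported total for the first half

share = sum(components.values()) / total_chisquare
print(f"{100 * share:.1f} % of the chisquare value")
```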

Table 4

The distribution of distances between words of length 4. The second half.

Lower limit   Upper limit   Observed frequency   Expected frequency   Chisquare
at or below    1.500        445                  446.8                 .00743
 1.500         2.500        350                  347.1                 .02428
 2.500         3.500        266                  269.6                 .04885
 3.500         4.500        224                  209.5                1.01059
 4.500         5.500        161                  162.7                 .01785
 5.500         6.500        127                  126.4                 .00294
 6.500         7.500         88                   98.2                1.05587
 7.500         8.500         74                   76.3                 .06749
 8.500         9.500         65                   59.2                 .55874
 9.500        10.500         47                   46.0                 .02073
10.500        11.500         25                   35.8                3.23328
11.500        12.500         31                   27.8                 .37515
12.500        13.500         23                   21.6                 .09429
13.500        14.500         16                   16.8                 .03435
14.500        15.500         14                   13.0                 .07401
15.500        16.500         16                   10.1                3.42717
16.500        17.500          5                    7.9                1.03815
17.500        18.500          4                    6.1                 .72436
18.500        20.500          8                    8.4                 .02124
20.500        22.500          4                    5.1                 .23064
22.500        44              9                    7.7                 .20717

Chisquare = 12.2746 with 19 degrees of freedom. Sig. level = 0.873556.

The significance level is almost excellent. Inspecting the constituents of the chisquare value, we see again that there are only 25 distances 11 against 35.8 expected and 16 distances 16 against 10.1 expected. This difference makes less than one percent of all occurrences.

When consistency of both parts is tested by the two sample analysis, the zero hypothesis shall not be rejected.

The distribution of five letter words is poor due to their low repeating within distance one (273 occurrences against 368.3 expected). This makes 36.8 % of the chi-square test value. There is a peak within the distance 2 (377 occurrences against 318.3 expected). This makes 16.1 % of the poor chi-square test value. Other deviations are minor.

The exponential distribution of six letter words is poor due to their low repeating within distance one (140 occurrences against 163.8 expected). This makes 17.4 % of the chi-square test value. There is a peak within distances 6-13 (639 occurrences against 588.6 expected). This makes 22.2 % of the poor chi-square test value.

The distribution of seven letter words, like that of the following words of odd length, is described by the Weibull distribution. The correlation is poor. There exist two shortages, within distance one (19 occurrences against 27.2 expected), and then within distances 80-92 (6 occurrences against 11.9 expected). Together they make 65.4 % of the chi-square test value.

The distribution of eight letter words, as well as ten letter words is again the negative binomial distribution. The shortage of these words within distances 133-146 (1 occurrence against 5.8 expected) contributes 34 % of the chi-square test value. The tail is longer (14 occurrences against 8.8 expected). This makes 26.1 % of the poor chi-square test value.

The distribution of nine letter words is fairly correlated. The shortage of these words within distances 133-146 (1 occurrence against 5.8 expected) contributes 34 % of the chi-square test value.

The distribution of ten letter words is fairly correlated, too. The shortage of these words within distances 51-75 (16 occurrences against 26.2 expected) contributes one half of the chi-square test value.

The distributions of longer words are well correlated, or the tests failed due to few data.

Distances between points and commas

Distances between punctuation marks show the length of sentences or clauses.

Here is given the interesting result with the distribution of the points:

Table 5

The negative binomial distribution of distances between points. Chisquare Test

Lower limit   Upper limit   Observed frequency   Expected frequency   Chisquare
at or below    35.250        32                   98.2                44.5925
 35.250        69.500        56                   78.2                 6.2818
 69.500       103.750       126                   64.2                59.3820
103.750       138.000        29                   52.8                10.7259
138.000       172.250        80                   43.4                30.8804
172.250       206.500        75                   35.7                43.3816
206.500       240.750        13                   29.3                 9.0786
240.750       275.000        30                   24.1                 1.4485
275.000       309.250        11                   19.8                 3.9122
309.250       343.500        21                   16.3                 1.3718
343.500       377.750        27                   13.4                13.8755
377.750       412.000         4                   11.0                 4.4493
412.000       446.250         4                    9.0                 2.8067
446.250       480.500         7                    7.4                  .0245
480.500       514.750         3                    6.1                 1.5784
514.750       549.000         8                    5.0                 1.7739
549.000       617.500         8                    7.5                  .0317
617.500       686.000         2                    5.1                 1.8629
686           734             1                   10.6                 8.6593

Chisquare = 246.117 with 17 d.f. Sig. level = 0

The mean distance is 174.62. This makes almost exactly four verses (4 × 43.51 = 174.04 signs). The oscillations correspond to the number of verses. The comma is the most frequently used punctuation mark for dividing the verses:

Table 6

The negative binomial distribution of distances between commas. Chisquare Test

Lower limit   Upper limit   Observed frequency   Expected frequency   Chisquare
  2            12.485       139                  177.8                  8.4545
 12.485        23.970       364                  328.5                  3.8466
 23.970        35.455       273                  368.4                 24.7261
 35.455        46.939       500                  289.7                152.7396
 46.939        58.424       167                  247.9                 26.3870
 58.424        69.909       123                  169.1                 12.5859
 69.909        81.394       125                  132.8                   .4609
 81.394        92.879       134                   85.4                 27.6319
 92.879       104.364        50                   64.3                  3.1786
104.364       115.848        22                   40.0                  8.1144
115.848       127.333        30                   29.4                   .0134
127.333       138.818        30                   17.9                  8.1625
138.818       150.303         8                   12.9                  1.8772
150.303       161.788         5                    7.8                   .9872
161.788       173.273         7                    5.5                   .3881
173.273       268            10                    9.6                   .0179

Chisquare = 279.572 with 14 d.f. Sig. level = 0

Distances between individual letters

The results for all letters are presented in the form of the table, where the frequencies of all symbols are given and the significance of the performed chi-square tests. Then the commentaries to all symbols of the alphabet are given.

Table 7 Survey of results

Notes:

EX = exponential distribution

WE = Weibull distribution

LN = lognormal distribution

NB = negative binomial distribution

* = the test was not made, since not enough of data

Statistic = XX, the chi-square test

Symbol   Small             Capital          Both
a        4571, EX, 0       367, EX, 0.664   4938, EX, 0
b        1085, EX, 0.036   144, EX, 0.809   1229, WE, 0.087
c        1311, NB, 0.358   31, EX, 0.041    1342, EX, 0.522
d        2724, EX, 0       38, EX, 0.190    2762, NB, 0
e        9219, NB, 0       23, EX, 0.186    9242, NB, 0
f        1556, NB, 0.263   107, EX, 0.316   1663, NB, 0.993
g        1342, EX, 0.038   16*              1358, NB, 0.091
h        5002, EX, 0       65, EX, 0.867    5067, EX, 0
i        4232, EX, 0       443, LN, 0.883   4675, EX, 0
j        66, LN, 0.604     2*               68, LN, 0.604
k        547, EX, 0.011    6*               552, EX, 0.011
l        3033, EX, 0       58, EX, 0.237    3091, EX, 0
m        2004, WE, 0.671   90, WE, 0.098    2094, WE, 0.670
n        4445, NB, 0       73, EX, 0.826    4518, NB, 0
o        5579, NB, 0       127, LN, 0.685   5706, NB, 0
p        986, NB, 0        24*              1010, NB, 0
q        51, EX, 0.739     0                51, EX, 0.739
r        4165, NB, 0       17, EX, 0.573    4182, NB, 0
s        4846, NB, 0       141, LN, 0.672   4987, NB, 0
t        6754, NB, 0       459, EX, 0.197   7213, NB, 0
u        2299, EX, 0       21, EX, 0.785    2320, EX, 0.008
v        924, EX, 0.008    1*               925, EX, 0.008
w        1645, EX, 0       252, EX, 0.630   1897, EX, 0
x        60, EX, 0.926     0                60, EX, 0.926
y        1951, LN, 0       34, EX, 0.470    1985, EX, 0
z        20, EX, 0.931     0                20, EX, 0.931

The Weibull distribution is the best one only in the case of the letter m. The lognormal distribution fits best in 4 cases: the capital letters I, O, and S, and the letter j in both cases (there are only two capital J). The exponential distribution is the best in most of the performed tests, and the negative binomial distribution in 9 cases.

The fit varied from the excellent, for example f with the chi-square value 0.994, to practically zero values, as at the most frequent vowels.

The differences between experimental and calculated values were usually great at the shortest distances (1 till 10). Adjusting the lowest possible value to greater distances by pooling these distances increased the significance of the chi-square tests in some cases. The significance improved dramatically sometimes, see below.

Now, the commentaries to the individual letters follow.

A

The frequency of the capital A allowed a separate test. The result with the exponential distribution is good, even if the repeating within one-verse distances is too high (90 occurrences against 75.8 expected). This makes 39.5 % of the chi-square test value.

The distribution of distances between lower case a and both case (a + A) seems to be the exponential, at least their tails fit. The distribution cannot be satisfactorily described by a simple function due to fluctuations of frequencies between odd and even distances. This feature can be documented by pooling the lower distances between both case (a + A):

Table 8 Chisquare values of pooled lower distances

over   1. part   2. part   3. part
26               0.0229
27     0.1371    0.1187
28     0.3054    0.0976
29     0.0027    0.2432    0.0158
30     0.1481              0.1872
31                         0.0835

The fluctuations between odd and even distances are not constant.

Correlating observed frequencies of the same classes of distances of one part against the same of the other part gives fairly linear plot (due to the span of values in the logarithmic scale). The two way sample analysis shows that the parts are from one whole.

 

        1. part : 2. part   1. part : 3. part   2. part : 3. part
a       0.3790              0.5597              0.7611
a + A   0.9379              0.6070              0.6530

B

The distribution of distances between upper case B is exponential. There is a peak within distances 438-655 which corresponds to 10-12 verses (90 occurrences against 70 expected). This makes nearly two thirds of the chi-square test value.

The exponential distribution of distances between lower case b is worsened by a shortage within distances 277-337 (9 occurrences against 22.4 expected). This makes more than one third of the chi-square test value. There are too few doubled bb (5 occurrences against 12.5 expected), which contributes 19.1 % of the chi-square test value, and too many occurrences (256 against 224.6 expected) within distances 32-62. This again makes 18.7 % of the chi-square test value. The combination of both cases changed the form of the distribution to Weibull. There are too few doubled bb (5 occurrences against 11.9 expected), which contributes 24.1 % of the chi-square test value. Other deviations have a minor weight.

C

The distribution of this letter varies between exponential (c chi-square value 0.3726, c + C chi-square value 0.5223), and negative binomial (c chi-square value 0.3580, c + C chi-square value 0.4790). In both cases, there is a shortage in the range 143-169 (34 occurrences against 48.8 expected, or 26 occurrences against 38.8 expected). This makes 29.3 %, and 24.8 % of the chi-square test value, respectively.

D

Here also the negative binomial or exponential distribution were applicable, with many deviations. There are too few doubled dd (18 occurrences against 79.2 [d] or 81.4 [d + D] expected), which contributes 62.4 % or 63.6 % of the chi-square test value, respectively. The exponential distribution is fair over the limit 30 (the chi-square test value 0.168).
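The order of magnitude of the expected number of doubled letters can be estimated independently of the statistical package. The sketch below assumes the simplest random model: a letter with n occurrences in a string of m signs follows every sign with the probability p = n/m, so that about (n - 1) p of its n - 1 distances should equal one. The text length of 93772 signs is taken from the Results section; the function name is hypothetical, and the values fitted by the package need not coincide exactly with this estimate.

```python
def expected_doublets(n_occurrences, text_length):
    """Expected number of distances equal to one for a randomly placed letter."""
    p = n_occurrences / text_length       # relative frequency of the letter
    return (n_occurrences - 1) * p        # n - 1 distances, each equal to 1 with probability p


TEXT_LENGTH = 93772   # signs including spaces

for label, count in [("d", 2724), ("d + D", 2762)]:
    print(label, round(expected_doublets(count, TEXT_LENGTH), 1))
```

Under this model the lower case d would be expected to double about 79 times, against the 18 occurrences observed.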

E

There are relatively few E comparing with the number of e.

The distribution of distances between lower case e and both case (e + E) seems to be the negative binomial, at least their tails fit. The distribution cannot be satisfactorily described by a simple function due to fluctuations of frequencies between odd and even distances. This feature can be documented by pooling the lower distances between e:

Table 9 Chisquare values of pooled lower distances

over   1. part   2. part   3. part   4. part
10                         0.1558
11     0.2949    0.2949              0.1672
12     0.4356    0.4356
13     0.5101    0.5101              0.2719
14     0.3687    0.3687
18                                   0.2918
20     0.5060    0.5060
22     0.2500    0.2500
23     0.2442    0.2442
24     0.5547    0.5543
25     0.5541    0.5541
26     0.5541    0.5730
30                                   0.5134

and between both cases (e + E):

Table 10 Chisquare values of pooled lower distances

over   1. part   2. part   3. part   4. part   5. part
9                0.1517    0.2414
10               0.2299                        0.3077
11               0.2926              0.1859    0.1621
12     0.4118                        0.2559    0.2267
13     0.5131                                  0.2456
15     0.5712
17                         0.4662
19                         0.6346
20     0.5014
25     0.5485
28                                             0.3799
29                                             0.6237
30                                   0.5134    0.5175

The fluctuations between odd and even distances are not constant.

Correlating observed frequencies of the same classes of distances of one part against the same of the other part gives fairly linear plot (due to the span of values in the logarithmic scale). The two way sample analysis shows that the parts can not always be considered as parts from one whole.

 

Table 11 The two way sample analysis of e distance tests

The differences between values in brackets are significant, the zero hypothesis should be rejected.

          2. part    3. part    4. part
1. part   [0.0007]   0.9100     0.0559
2. part              [0.0006]   0.1533
3. part                         [0.0460]

The first part corresponds to the third part, and correlates badly with the second, and fourth part, too.

 

Table 12 The two sample analysis of e + E

The differences between values in brackets are significant, the zero hypothesis should be rejected.

          2. part   3. part    4. part    5. part
1. part   0.7108    [0.0009]   0.7964     0.0625
2. part             [0.0028]   0.5228     0.1304
3. part                        [0.0004]   0.1511
4. part                                   [0.0371]

The first part corresponds to the second and fourth parts, and correlates badly with the third and fifth parts. As an example, the output of the test between parts 1 and 3 is given:

Table 13 The two sample analysis of e + E

Sample Statistics   WEE1.var1   WEE3.var1   Pooled
Number of Obs.      1855        1840        3695
Average             9.83881     10.8168     10.3258
Variance            69.1978     90.626      79.8684
Std. Deviation      8.31852     9.51977     8.93691
Median              7           8           8

Difference between Means = -0.978034

Conf. Interval For Diff. in Means: 95 Percent

(Equal Vars.) Sample 1 - Sample 2 -1.55467 -0.4014 3693 D.F.

(Unequal Vars.) Sample 1 - Sample 2 -1.55499 -0.401083 3619.9 D.F.

Ratio of Variances = 0.763554

Conf. Interval for Ratio of Variances: 0 Percent

Sample 1 Sample 2

Hypothesis Test for H0: Diff = 0 Computed t statistic = -3.32614

vs Alt: NE Sig. Level = 8.89197E-4

at Alpha = 0.05 so reject H0.
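The t statistic of Table 13 can be recomputed from the summary statistics alone. The sketch assumes the usual two-sample t test with pooled variance (the equal variances case of the output); the function name is illustrative, and small differences in the last digits come from the rounding of the tabulated averages.

```python
import math


def pooled_t(n1, mean1, var1, n2, mean2, var2):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom, pooled variance)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    standard_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / standard_error, df, pooled_var


# Summary statistics of parts 1 and 3 as tabulated in Table 13
t, df, pooled_var = pooled_t(1855, 9.83881, 69.1978, 1840, 10.8168, 90.626)
print(round(t, 3), df, round(pooled_var, 4))
```

The result agrees with the reported pooled variance, the 3693 degrees of freedom and the computed t statistic to about three decimal places.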

Inspecting both tables, it seems that it would be possible to locate, with a greater precision, the parts where the distribution of e differs.

F

It is not necessary to add any notes to the excellent fit of this letter. But it is rather interesting how the scattered F improved the distribution of the lower case f.

G

The distribution of the both case g + G is again more regular than the distribution of the lower case g. Over distances 25, the chi-square test value for g is 0.9004, for g + G 0.4215, only. Few occurrences of gg (8 occurrences against 19.7 expected) make about one third of the chi-square test value.

H

There is a peak within distances 40-49 (83 occurrences against 49.8 expected). This makes two fifth of the chi-square test value. The second smaller peak lies within distances 21-30 (224 occurrences against 182.6 expected). This makes about one fifth of the chi-square test value. The lognormal distribution of this letter is shorter than expected (7 occurrences over 86 against 12.7 expected). This makes more than one fourth of the chi-square test value.

I

The lognormal distribution of the capital I is excellent, the significance of the chi-square test is 0.883. There is a peak within distances 20-26 verses (9 occurrences against 6 expected). This makes one third of the chi-square test value.

The lognormal distribution of the both case i + I is poor. There is a shortage of distances 14-24 (287 occurrences against 326.1 expected). This makes about one fifth of the chi-square test value. There is a peak within distances 47-58 (52 occurrences against 40.7 expected). This makes one sixth of the chi-square test value. The lognormal distribution of this letter is shorter than expected (9 occurrences over 96 against 20.5 expected). This makes about one third of the chi-square test value.

J

The lognormal distribution of this letter is without any greater deviations.

K

The exponential distribution of this letter is poorly correlated due to many repeatings in the second verses (distances 51-100, 122 occurrences against 104.3 expected). This makes 19.1 % of the chi-square test value. There is a peak within distances 12-13 verses (14 occurrences against 5.5 expected). This makes about one half of the chi-square test value.

L

The exponential distribution of this letter (lower case) is distorted by many double ll (248 and 280 occurrences against 48.7 or 47.9 expected, respectively in two parts). This makes more than 90 % of the total very high chi-square test value. The correlation of the first half is good, when the lower limit is set over 13:

Lower limit   13       14       15       16       17
Chi-square    0.4105   0.8272   0.6526   0.6787   0.6220

The correlation of the second half is good, only when the lower limit is set over 29:

Lower limit   29       30       31
Chi-square    0.1805   0.8129   0.6622

The exponential distribution of both case l + L is distorted by many double ll, which again makes more than 90 % of the total very high chi-square test value. The correlation of the first half is good, when the lower limit is set over 12:

Limit   12       13       14       15        16       17       18       19
Chi     0.2553   0.7107   0.5649   0.66849   0.6532   0.3830   0.1721   0.1721

The correlation of the second half is good, only when the lower limit is set over 29:

Lower limit   29       30       31       32
Chi-square    0.2235   0.8180   0.6332   0.3332

The consistency of both parts is good (l1/l2) = 0.5984, (l + L1)/(l + L2) = 0.6922. Even the comparisons of (l1/l + L1) = 0.6785, (l2/l + L2) = 0.5816 are acceptable.

M

This letter is correlated best with the Weibull distribution. Even the negative binomial distribution is acceptable (the chi-square test value 0.582). The doubled mm fit excellently with the negative binomial distribution but they form a peak in the Weibull distribution (42 occurrences against 30.7 expected). This makes more than one third of the chi-square test value. There is a shortage of distances 100-113 (both models, 50 occurrences against 63.9 (WE) or 61.5 (NB) expected). This makes one quarter of the chi-square test value (WE). The combination with the upper case M decreased the weights of these fluctuations somewhat.

N

The distribution of n and (n + N) was divided into two parts, which were slightly different. The negative binomial distribution of this letter is distorted by the shortage of doubled nn (only 11 % of expected, which makes two thirds of the chi-square test value). The chi-square test values improved by pooling lower distances differently: 1. part of n over 20 = 0.917, 2. part of n over 24 = 0.706, 1. part of (n + N) over 25 = 0.925, 2. part (n + N) over 24 = 0.938.

O

The lognormal distribution of capital O has a peak between 29-41 verses (20 occurrences against 14.3 expected). This makes 73.3 % of the chi-square test value. The distribution of o and (o + O) was again divided into two parts, which were slightly different. The negative binomial distribution of this letter is distorted by the shortage of doubled oo (only 51.8 % of expected, which makes one half of the chi-square test value). The chi-square test values improved by pooling lower distances differently: 1. part of o over 21 = 0.216, 2. part of o over 12 = 0.356, 1. part of (o + O) over 29 = 0.101, 2. part (o + O) over 11 = 0.472.

P

The negative binomial distribution of this letter is distorted by the surplus of doubled pp (44 occurrences against 10.4 expected), which makes 92.7 % of the chi-square test value. The chi-square test values improved by forced lower limit 10 to 0.296. The tail over 161 is almost perfect.

Q

The exponential distribution gives no opportunity for some comments.

R

The distribution of r and (r + R) was divided into two parts, which were slightly different. The negative binomial distribution of this letter is distorted by the shortage of doubled rr (only 24.8 % of expected, which makes about one half of the chi-square test value). The chi-square test values improved by pooling lower distances differently: 1. part of r over 30 = 0.241, 2. part of r over 42 = 0.671, 1. part of (r + R) over 30 = 0.186, 2. part (r + R) over 40 = 0.541.

S

The lognormal distribution of the capital S is good; the significance of the chi-square test is distorted by a peak within distances of 10-14 verses (24 occurrences against 19.4 expected). This contributes one quarter to the chi-square test value. The distribution of the lower case s and (s + S) was divided into two parts, which were slightly different. The negative binomial distribution of this letter is distorted significantly by the shortage of doubled ss only in the 1. part (91 occurrences against 125.7 expected). This makes 15.2 % of the chi-square test value. The sign s appears less than expected within distances 2-6 (801 occurrences against 874 expected), and more than expected within distances 6-12 (933 occurrences against 835.9 expected). The chi-square test value improved by pooling lower distances: 1. part of s over 30 = 0.167, 2. part of s over 30 = 0.660. Here appeared a shortage of the distances 49-53 (52 occurrences against 78.6 expected). This makes 75.9 % of the chi-square test value. The distribution of both cases s + S is similarly the negative binomial one, fair with pooled lower distances: 1. part of s over 20 = 0.521, 2. part of s over 30 = 0.525.

T

The exponential distribution of the capital T has a peak within distances 157-260 (103 occurrences against 84.9 expected). This contributes 23.7 % of the chi-square test value. The distribution is then shorter than expected (9 occurrences over 523 against 21.6 expected). This makes 46 % of the chi-square test value. The distribution of the lower case t is the negative binomial one. When divided into three parts, all parts show the shortage of doubled tt (11-17 % of expected), 66-71 % of the chi-square test value. The first part fitted excellently over 13, the significance chi-square test value is 0.919, the tail over 25 of the second part gives a fair chi-square test value 0.509, whereas the same tail the third part has the chi-square test value only 0.0019.

The distribution of both (t + T) is the negative binomial one. All three parts show the shortage of doublets Tt + tt (9.8-15 % of expected), 61.8-71.9 % of the chi-square test value. The first part fitted good over 12, the significance chi-square test value is 0.798, the tail over 25 of the second part gives a fair chi-square test value 0.529, whereas the same tail the third part has the chi-square test value only 0.003, and it is better correlated as the negative binomial distribution.

U

There are no doubled uu (0 occurrence against 55.7 expected). This makes 73.3 % of the chi-square test value of the exponential distribution. When the lower limit is set to 20, the chi-square test value is improved to 0.263. The exponential distribution of both (u + U), divided into three parts, correlates differently, again. The first part fitted poorly over 20, the significance chi-square test value is 0.105, the tail over 20 of the second part gives a good chi-square test value 0.747, whereas the tail over 13 the third part has the chi-square test value 0.512.

V

The exponential distribution has a shortage of distances up to 32 (223 occurrences against 256.8 expected). This contributes 39 % of the chi-square test value. Then follows a peak within distances 33-76 (368 occurrences against 312.5 expected). This contributes 33.8 % of the chi-square test value. The tail over 50 fits well, with a chi-square test value of 0.402.

W

The exponential distribution of the upper case W gives a good fit. There is a peak of the distances 113-224 (69 occurrences against 55.6 expected). This alone makes 45.4 % of the chi-square test value. The exponential distribution of the lower case w gives an acceptable fit over 10 (chi-square test value 0.350). There are no doubled ww (59 % of the chi-square test value). Combining both cases (w + W) somewhat improved the fit; the absence of ww makes 61.9 % of the chi-square test value, since the sample is greater. There is a shortage of the distances 118-131 (about 3 verses, 28 occurrences against 45.1 expected). This makes 10.6 % of the chi-square test value. Over 15 the chi-square test value is 0.462.

X

The exponential distribution is almost perfect.

Y

The exponential distribution of the upper case Y gives a good fit. It somewhat improves the very poor lognormal distribution of the lower case y. There is a long peak within distances 74-117 (272 occurrences against 200.2 expected). This alone makes 48.2 % of the chi-square test value. The lognormal distribution of this letter is shorter than expected (25 occurrences over 205 against 52.6 expected). This makes 33.5 % of the chi-square test value.

Z

The exponential distribution is almost perfect.

I also tried to find the distribution of distances between words or groups of signs. As an example, I counted the frequencies of All (10), all (121) and *all (as in call, shall, etc., 209 occurrences). The distribution of distances between occurrences of the determiner all is the Weibull one; the chi-square test value is 0.448 with 121 occurrences.
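Distances between words or groups of signs can be collected in the same way as distances between single signs. The following minimal sketch assumes that a distance is the difference of the starting positions of two successive occurrences, counted in characters; the function name, the regular expressions and the sample sentence are hypothetical.

```python
import re


def pattern_distances(text, pattern):
    """Distances (in characters) between successive matches of a regex pattern."""
    positions = [m.start() for m in re.finditer(pattern, text)]
    return [b - a for a, b in zip(positions, positions[1:])]


# Hypothetical sample; the Sonnets themselves are not reproduced here.
sample = "all men call; all shall fall, and all is well"

# The separate word "all" (lower case only).
print(pattern_distances(sample, r"\ball\b"))     # [14, 20]

# The group *all: longer words ending in "all", such as call or shall.
print(pattern_distances(sample, r"\b\w+all\b"))  # [10, 6]
```

The resulting lists can be tabulated into classes and fitted like the letter distances above.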

Discussion

The corrections (removing superfluous spaces) in some cases worsened the fits when compared with preliminary tests made with the raw text, as if the writer's errors were a part of the scheme leading to some distribution of distances between symbols.

In verses, the repetition of some letters at certain intervals is intentional, since they form rhymes. In the statistics, however, this feature is blurred by their occurrences within verses. The verse structure of the text revealed itself in the use of points.

The excessively high repetition of the capital A within one-verse distances (90 occurrences against 75.8 expected) is due mostly to sonnet number 66, where 11 verses start with "And". This starting "And" repeats in other sonnets, too, and in combination with other starting A makes the peak. This distortion must be considered as intentional.

Some distributions of distances between consonants are highly regular, especially their tails, if the low distances inside words are pooled. They are described with a different precision with four distributions: exponential, Weibull, lognormal and negative binomial. Sometimes it is rather difficult to decide which distribution is the better one for fitting.

The splitting of the statistics of some frequent letters, which was a necessity due to the insufficient memory of the software used, showed new possibilities of the distance analysis.

Since there are statistically significant differences between the parts, it seems that the Sonnets are not a single work, but a collection of sonnets including different parts. No attempt was made to synchronize the statistical analysis with an analysis of subject and style.

If the results are compared with the published example of a scientific paper (Kunz & Rádl, 1998), then some differences can be observed. In both cases, the vowels, except u, are poorly fitted. In both cases, the letter f gave a nearly ideal fit.

Consonants with a worse fit in the Sonnets are: b, c, d, , g, h, k, l, v, and w. Consonants with a better fit in the Sonnets are: m, x, and z. Since there are only few data for study, it can only be speculated whether this is caused by the different use of these consonants in rhymes, which could produce the observed peaks and fluctuations.

It can be concluded that the analysis of distances between lexical units in a text could become a useful method of text analysis.
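The splitting can be illustrated by a minimal sketch. It assumes that the text itself is cut into consecutive parts of equal length and that the distances are counted inside each part separately; how the parts were actually formed is not stated here, and the function names are hypothetical.

```python
def letter_distances(text, letter):
    """Distances between successive occurrences of one sign (a doublet gives 1)."""
    positions = [i for i, ch in enumerate(text) if ch == letter]
    return [b - a for a, b in zip(positions, positions[1:])]


def split_distances(text, letter, parts):
    """Distance lists of one sign, counted separately in consecutive text parts."""
    size = -(-len(text) // parts)  # ceiling division: length of one part
    return [
        letter_distances(text[i * size:(i + 1) * size], letter)
        for i in range(parts)
    ]


print(letter_distances("abcaabca", "a"))    # [3, 1, 3]
print(split_distances("abcaabca", "a", 2))  # [[3], [3]]
```

In this sketch a distance spanning the cut between two parts is lost (the doublet aa in the example), which is one of the ways in which split statistics can differ from the statistics of the whole text.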

REFERENCES

Haitun, S. D. (1982a) Stationary Scientometric Distributions I: Different Approximations. Scientometrics, 4, 5 - 25.

Haitun, S. D. (1982b) Stationary Scientometric Distributions II: Non Gaussian Nature of Scientific Activities. Scientometrics, 4, 89 - 101.

Haitun, S. D. (1982c) Stationary Scientometric Distributions III: The Role of the Zipf Distribution. Scientometrics, 5, 375 - 395.

Harary, F.; Paper, H. H. (1957) Toward a General Calculus of Phonemic Distribution, Language, 33, 143 - 169.

Irwing, J. O. (1963) The Place of Mathematics in Medical and Biological Statistics, J. Royal. Statistical Soc. A, 126, 1 - 45.

Kunz, M. (1987) Time Spectra of Patent Information, Scientometrics, 11, 163 - 173.

Kunz, M. (1993) About metrics of bibliometrics, J. Chem. Inform. Comput. Sci., 33, 193 - 196.

Kunz, M.; Rádl, Z. (1998) Distribution of Distances in Information Strings, J. Chem. Inform. Comput. Sci., 38, 374 - 378.

Kunz, M. (2000) Number e as a model gene (atlas.cz.mujweb\veda\kunzmilan)

Yule, G. U. (1944) The Statistical Study of Literary Vocabulary, Cambridge University Press, Cambridge.