Number e as a model gene. Distribution of distances

Milan Kunz

December 23, 2000, corrected version March, 2003

Abstract

The number e, expressed in base four, was studied as a model of a gene. The distribution of distances between individual codons and nucleic acids shows some specific features, but these features cannot be considered completely different from those of a natural gene.

Introduction

The enigma of RNA and DNA is now being solved by reading them. In a chemical analysis, the nucleotides appear like letters on the tape of a telecommunication device, and we read them as an unknown language. We know that RNA contains instructions for the synthesis of proteins from amino acids in cells.

A new problem has emerged: how could these chemical structures, RNA, DNA, and proteins, have appeared? How could RNA be formed from its components, the nucleotides, and what conditions were favorable for its synthesis? Many other questions must be answered before we understand life and its origin.

There are different possibilities for where life could have started: in the plasma of interstellar space, in droplets in the Earth's atmosphere, in water, perhaps under high pressure in the deep sea, or on catalytic surfaces.

We may suppose that polymerization of the four nucleotides produced chains of their copolymers, forming the primitive RNA. Subsequent mutations led to more sophisticated forms capable of self-reproduction in microdroplets acting as primitive cells.

The primitive RNA should be a stochastic copolymer, since the ordering of its monomers was accidental. Such an RNA can be modeled either with a random number generator (replacing the conventional symbols A, C, G, and T by the numbers 0, 1, 2, and 3) or with a well-defined sequence of four symbols in which the symbols are distributed randomly.

There are practically infinitely many such sequences. It seems better to use a well-defined infinite sequence, since it allows results to be checked.

For this purpose, the number e, the base of natural logarithms, is suitable. The number e is obtained as the infinite sum of the terms 1/n!, where n! is the factorial of n. It can be calculated to as many places as necessary.
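The series for e can be summed with plain integer arithmetic. The following is a minimal sketch, not the program used to obtain the digits in this study; the function name `e_digits` and the number of guard digits are assumptions made for illustration.

```python
def e_digits(n_digits: int) -> str:
    """Return e truncated to n_digits decimal places, as a string."""
    guard = 10                          # assumed number of extra digits absorbing truncation error
    scale = 10 ** (n_digits + guard)    # fixed-point scale: the integer `scale` represents 1
    total = 0
    term = scale                        # scaled value of 1/0!
    n = 0
    while term > 0:
        total += term                   # add the scaled term 1/n!
        n += 1
        term //= n                      # next term: 1/n! = (1/(n-1)!) / n
    digits = str(total // 10 ** guard)  # drop the guard digits
    return digits[0] + "." + digits[1:]


print(e_digits(50))
```

Because every term is truncated downward, the guard digits are needed so that the accumulated error does not reach the digits that are returned.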

Studies of its digits have not shown any pattern in their distribution (see this page).

Since the number e can be expressed in any number base, from binary to hexadecimal, it can be written with four symbols and is thus a suitable model of RNA.

Statistical analysis of RNA and DNA is usually based on determining the frequencies of their components: nucleotides, their triplets, and amino acids, and sometimes pairs of nucleotides or longer motifs.

I proposed to study distances between consecutive items in strings [1].

When tossing a coin, a stochastic binary sequence is obtained. The number of occurrences of either heads or tails is the model of the binomial distribution and of the normal distribution. The distribution of distances between consecutive 0s or 1s is known as the negative binomial distribution. Before the personal computer, it was a mere mathematical curiosity, since calculating it by hand is very tedious.
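The distance statistic can be illustrated on a simulated coin. The sketch below assumes that the distance is the difference between the positions of two consecutive occurrences of the same symbol, and it compares the counts with the geometric distribution P(d) = p(1-p)^(d-1), which is the special case r = 1 of the negative binomial distribution. The helper name `distances`, the seed, and the sample size are illustrative choices.

```python
import random
from collections import Counter


def distances(seq, symbol):
    """Distances between consecutive occurrences of `symbol` in `seq`."""
    positions = [i for i, s in enumerate(seq) if s == symbol]
    return [b - a for a, b in zip(positions, positions[1:])]


random.seed(1)                                   # fixed seed so the run is repeatable
tosses = [random.randint(0, 1) for _ in range(100_000)]
d = distances(tosses, 1)
counts = Counter(d)

p = 0.5                                          # probability of the symbol for a fair coin
for k in range(1, 7):
    expected = len(d) * p * (1 - p) ** (k - 1)   # geometric expectation for distance k
    print(k, counts[k], round(expected, 1))
```

For a four-symbol sequence the same helper applies with p = 1/4, provided the symbols are equally probable.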

I found that the distribution of distances between consecutive numerals in the number e is described reasonably well by the negative binomial distribution. Therefore, this number appears to be a suitable model of an RNA that could be formed by a stochastic process.

Results

The number e (hereafter the e-gene), calculated to 100000 decimal places, was obtained from J. Ventluka [2]. It was transformed to the quaternary base and studied using programs written by Ing. Zdeněk Rádl, CSc.
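The transformation to base four can be sketched as follows. This is not the original program: it assumes that only the fractional part of e is converted, that the digits 0, 1, 2, 3 map to A, C, G, T in that order, and it uses Python's `decimal` module instead of an external digit file. The function name `e_gene` and the guard precision are illustrative.

```python
from decimal import Decimal, localcontext


def e_gene(n_symbols: int, alphabet: str = "ACGT") -> str:
    """First n_symbols base-4 digits of the fractional part of e, written as letters."""
    # Each base-4 digit needs log10(4) ~ 0.602 decimal digits; 20 more are kept as a guard.
    prec = int(n_symbols * 0.60206) + 20
    scale = 10 ** (prec - 10)
    with localcontext() as ctx:
        ctx.prec = prec
        frac = Decimal(1).exp() - 2      # fractional part of e (assumption: "2." is dropped)
        f = int(frac * scale)            # fixed-point numerator of the fraction
    out = []
    for _ in range(n_symbols):
        f *= 4                           # shift one base-4 place to the left
        digit, f = divmod(f, scale)      # integer part is the next base-4 digit
        out.append(alphabet[digit])
    return "".join(out)


print(e_gene(60))
```

In base four, e begins 2.23133201..., so under the stated mapping the sequence starts with G, T, C, T, T, G, A, C.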

First, the distances between symbols were determined and analyzed, then the distances between the 64 triplets, and then between the codons coding the amino acids. For statistical analysis, the program STATGRAPHICS was used. Unfortunately, STATGRAPHICS was found not to reproduce calculations exactly; for example, it was not possible to force it to use the same number of degrees of freedom in repeated calculations. Therefore, the obtained chi-square values are sometimes poorly comparable.
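The triplet distances can be sketched in the same way as the symbol distances. The code below is an illustration, not the original program: it assumes a non-overlapping reading frame starting at the first symbol and measures distances in units of triplets. The optional `groups` argument is a hypothetical way to lump several codons under one label, as is done later for amino acids.

```python
from collections import defaultdict


def codon_distances(gene, groups=None):
    """Distances between consecutive occurrences of each codon (or codon group)."""
    # Assumption: non-overlapping triplets read from position 0; a trailing remainder is ignored.
    codons = [gene[i:i + 3] for i in range(0, len(gene) - 2, 3)]
    last_seen = {}
    result = defaultdict(list)
    for index, codon in enumerate(codons):
        key = groups.get(codon, codon) if groups else codon
        if key in last_seen:
            result[key].append(index - last_seen[key])   # distance counted in triplets
        last_seen[key] = index
    return dict(result)


print(codon_distances("AAACCCAAAGGGAAACCC"))
print(codon_distances("TTCAAATTTAAATTC", {"TTC": "F", "TTT": "F"}))
```

A codon that occurs only once produces no distance and therefore does not appear in the result.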

The e-gene was compared with results obtained earlier for the gene FRAX 52 (n-gene) [1].

First, the frequencies of all 64 codons were determined.

The results are tabulated in the following four tables, one for each first symbol of the triplet. The rows are arranged according to the second symbol of the triplet and the columns according to the third symbol. The row sums (S) give the frequencies of all triplets starting with a given pair of symbols, and the column sums give the frequencies of all triplets with a given first and third symbol.
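The layout of these tables can be reproduced with a short sketch. It assumes non-overlapping triplets read from the first symbol; the function name `codon_tables` and the returned structure are illustrative, not taken from the original programs.

```python
from collections import Counter


def codon_tables(gene, alphabet="ACGT"):
    """For each first symbol, return (rows, row_sums, col_sums) of codon counts.

    Rows follow the second symbol and columns the third symbol of the triplet.
    """
    counts = Counter(gene[i:i + 3] for i in range(0, len(gene) - 2, 3))
    tables = {}
    for first in alphabet:
        rows = [[counts[first + second + third] for third in alphabet]
                for second in alphabet]
        row_sums = [sum(row) for row in rows]          # triplets starting with first+second
        col_sums = [sum(col) for col in zip(*rows)]    # triplets with given first and third symbol
        tables[first] = (rows, row_sums, col_sums)
    return tables


rows, row_sums, col_sums = codon_tables("AAAAACACAAAA")["A"]
print(rows, row_sums, col_sums)
```

In each table the row sums and the column sums must add up to the same grand total, which gives a simple consistency check on tabulated counts.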

Table A

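| 2\3 | A | C | G | T | S |
|---|---|---|---|---|---|
| A | 48 | 58 | 46 | 45 | 197 |
| C | 55 | 52 | 51 | 57 | 225 |
| G | 54 | 46 | 54 | 39 | 193 |
| T | 45 | 53 | 56 | 49 | 253 |
| S | 202 | 209 | 217 | 190 | 818 |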

Table C

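| 2\3 | A | C | G | T | S |
|---|---|---|---|---|---|
| A | 50 | 58 | 55 | 64 | 227 |
| C | 52 | 55 | 65 | 63 | 335 |
| G | 46 | 46 | 57 | 57 | 206 |
| T | 62 | 60 | 47 | 50 | 219 |
| S | 210 | 219 | 224 | 234 | 842 |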

 

Table G

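| 2\3 | A | C | G | T | S |
|---|---|---|---|---|---|
| A | 68 | 45 | 62 | 65 | 240 |
| C | 52 | 57 | 47 | 44 | 200 |
| G | 44 | 63 | 55 | 53 | 215 |
| T | 53 | 60 | 54 | 45 | 212 |
| S | 217 | 215 | 218 | 207 | 867 |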

Table T

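| 2\3 | A | C | G | T | S |
|---|---|---|---|---|---|
| A | 52 | 58 | 53 | 65 | 228 |
| C | 54 | 55 | 44 | 47 | 200 |
| G | 44 | 58 | 58 | 51 | 208 |
| T | 43 | 56 | 55 | 54 | 268 |
| S | 193 | 227 | 210 | 217 | 847 |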

Generally, some column or row sums are larger than others. This does not seem to be significant, but I did not make the necessary tests.

The best-fitting distributions for all triplets in both genes are listed in the following table. At most the two best fits are given; sometimes more than two exist, but not all possibilities were tested. If the first attempt was satisfactory, no further tests were made.

Table Correlations of Distances

Explanations:

NB - the negative binomial distribution,

EX - the exponential distribution,

ER - the Erlang distribution,

LN - the lognormal distribution,

WE - the Weibull distribution.

The chi-square significance level is given as its first three decimal places (for example, 920 stands for 0.920).

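| Codon | n-gene, type | n-gene, chi-square | e-gene, type | e-gene, chi-square |
|---|---|---|---|---|
| AAA | WE | 001 | ER | 821 |
| AAC | WE | 001 | WE | 750 |
| AAG | WE | 168 | NB, ER | 928, 795 |
| AAT | EX, NB | 276 | ER | 750 |
| ACA | EX, WE | 885 | WE | 690 |
| ACC | EX | 774 | ER | 361 |
| ACG | ER | 603 | WE | 490 |
| ACT | NB, EX | 863 | LN | 664 |
| AGA | EX | 517 | WE | 892 |
| AGC | EX, LN | 273 | LN | 321 |
| AGG | ER | 545 | LN | 780 |
| AGT | WE | 021 | EX, ER | 721, 555 |
| ATA | WE, LN | 444 | NB | 060 |
| ATC | EX, NB | 238 | NB, EX | 726 |
| ATG | ER | 738 | EX | 902 |
| ATT | ER | 306 | ER, NB | 366, 281 |
| CAA | ER | 728 | LN | 312 |
| CAC | WE | 647 | NB | 920 |
| CAG | ER, EX | 231, 213 | EX | 192 |
| CAT | WE, EX | 585 | WE, NB | 516 |
| CCA | EX | 263 | EX | 752 |
| CCC | EX | 719 | NB, EX | 481 |
| CCG | WE | 602 | WE | 684 |
| CCT | EX | 643 | LN | 552 |
| CGA | EX, ER | 823, 630 | WE | 272 |
| CGC | WE | 184 | WE | 460 |
| CGG | WE, LN | 854 | WE | 624 |
| CGT | EX | 614 | EX | 849 |
| CTA | WE | 814 | EX | 580 |
| CTC | WE | 666 | EX | 191 |
| CTG | WE, LN, EX | 225 | EX, ER | 705, 656 |
| CTT | EX, ER | 342, 343 | EX | 380 |
| GAA | WE | 352 | EX | 354 |
| GAC | ER | 464 | ER, EX | 351, 270 |
| GAG | WE | 852 | EX, ER | 653, 620 |
| GAT | EX, NB | 750 | LN | 524 |
| GCA | WE | 585 | NB | 131 |
| GCC | ER | 828 | NB | 519 |
| GCG | ER | 767 | NB | 117 |
| GCT | ER | 675 | EX | 518 |
| GGA | EX | 021 | LN | 422 |
| GGC | EX, ER | 192, 192 | EX | 390 |
| GGG | EX | 085 | LN | 423 |
| GGT | WE, EX | 852 | NB, EX | 726 |
| GTA | ER | 131 | LN | 477 |
| GTC | WE | 306 | ER, EX | 640, 632 |
| GTG | ER | 717 | EX, NB | 555 |
| GTT | EX | 311 | NB | 512 |
| TAA | EX | 229 | NB | 758 |
| TAC | WE | 323 | NB | 302 |
| TAG | ER | 400 | NB | 696 |
| TAT | EX | 105 | WE | 073 |
| TCA | NB | 877 | WE | 903 |
| TCC | WE, EX | 622 | ER | 676 |
| TCG | EX, ER | 934, 615 | EX | 630 |
| TCT | NB, EX | 623 | EX | 101 |
| TGA | EX | 707 | WE | 219 |
| TGC | LN | 594 | LN | 664 |
| TGG | WE, EX | 991, 937 | WE | 755 |
| TGT | EX, ER | 978, 937 | EX | 741 |
| TTA | WE | 924 | EX, ER | 577, 359 |
| TTC | WE | 255 | EX | 967 |
| TTG | ER, EX, NB | 838, 813, 812 | LN | 418 |
| TTT | WE | 238 | EX, NB | 637 |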

The best fits are tabulated as follows:

 

| Distribution | n-gene | e-gene |
|---|---|---|
| The negative binomial distribution | 3 + 4 | 13 + 4 |
| The exponential distribution | 22 + 9 | 20 + 5 |
| The Erlang distribution | 14 + 5 | 7 + 5 |
| The lognormal distribution | 1 + 4 | 11 |
| The Weibull distribution | 24 + 1 | 13 |

The Weibull distribution and the exponential distribution are the most frequent in FRAX 52. The Erlang distribution is the best one for 14 codons and is applicable to 5 other codons, mostly together with the exponential distribution. The negative binomial distribution and the lognormal distribution are rare. In the artificial e-gene, all distributions are applicable, and the exponential distribution is the most frequent one.

The Erlang distribution was not tested in the first version of this communication.

Sometimes it is not possible to decide which distribution gives a better fit, when the results are evaluated by chi-square test. The goodness of fit was usually decreased by some local deviations between expected and observed values.

The following table gives the correlation between the best fits of the 64 codons in both genes according to the chi-square test.

The first best fits, comparing n-gene against e-gene

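| n\e | EX | ER | LN | NB | WE | S |
|---|---|---|---|---|---|---|
| EX | 6 | 2 | 5 | 4 | 5 | 22 |
| ER | 4 | 2 | 4 | 3 | 1 | 14 |
| LN | 0 | 0 | 1 | 0 | 0 | 1 |
| NB | 1 | 0 | 1 | 0 | 1 | 3 |
| WE | 9 | 3 | 0 | 6 | 6 | 24 |
| S | 20 | 7 | 11 | 13 | 13 | 64 |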

The triplet distance distributions of FRAX 52 and the e-gene coincide mostly for the exponential distribution and the Weibull distribution, but the two sets differ.

Correlations of the best-fit distributions within each set are given in the following two tables:

n-gene

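| 1\2 | EX | ER | LN | NB | WE | S |
|---|---|---|---|---|---|---|
| EX | x | 5 | 1 | 4 | 1 | 11 |
| ER | 2 | x | 0 | 1 | 0 | 3 |
| LN | 1 | 0 | x | 0 | 0 | 1 |
| NB | 2 | 0 | 0 | x | 0 | 2 |
| WE | 5 | 0 | 3 | 0 | x | 8 |
| S | 10 | 5 | 4 | 5 | 1 | 25 |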

e-gene

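| 1\2 | EX | ER | LN | NB | WE | S |
|---|---|---|---|---|---|---|
| EX | x | 4 | 0 | 2 | 0 | 6 |
| ER | 2 | x | 0 | 1 | 0 | 3 |
| LN | 0 | 0 | x | 0 | 0 | 0 |
| NB | 3 | 1 | 0 | x | 0 | 4 |
| WE | 0 | 0 | 0 | 1 | x | 1 |
| S | 5 | 5 | 0 | 4 | 0 | 14 |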

The artificial e-gene correlations are better defined: there are fewer triplets where two distributions give approximately the same fit.

There appears to be a great difference between the two genes, although its statistical significance was not verified. Clearly, the e-gene is most often best fitted by the exponential distribution and the negative binomial distribution, and the n-gene by the exponential distribution and the Weibull distribution. Because the individual symbols mostly follow the negative binomial distribution, forming triplets appears to shift the distribution of distances first toward the exponential distribution, then toward the Weibull distribution, and finally toward the lognormal distribution.

The range of chi-squares was 0.001 to 0.991 for the n-gene and 0.060 to 0.928 for the e-gene.

It seems that there is no correlation of the results of both genes.

In the natural gene, the exponential distribution matches with the negative binomial distribution and the Weibull distribution; in the e-gene, the exponential distribution matches with the negative binomial distribution. This is only a rough estimate, since not all possible matches were evaluated.

Supposing that the starting stochastic distribution has the form of the negative binomial distribution, it changes to the exponential one, then to the Weibull distribution, and eventually to the lognormal distribution.

This is blurred by deviations from the ideal forms of the distributions, such as local surpluses or shortages of some distances.

Distribution of amino acids in e-gene

Different triplets are lumped into amino acids. The results are in the following table.

| Acid, code symbol | Codons, best distribution, chi-square 0.xxx | Correlation with the best distribution, chi-square 0.xxx |
|---|---|---|
| Alanine, A | GCA WE 586, GCC ER 828, GCG ER 767, GCT ER 676 | EX 563, WE 450 |
| Arginine, R | AGA EX 517, AGG ER 545, CGA EX 823, CGC WE 184, CGG WE 854, CGT EX 614 | NB 708 |
| Asparagine, N | GAC ER 621, GAT WE 852 | WE 137 |
| Aspartic acid, D | AAC WE 001, AAT EX 277 | LN 479 |
| Cysteine, C | TGC LN 664, TGT EX 978, ER 974 | NB 265, ER 265 |
| Glutamic acid, E | GAA ER 353, GAG WE 852 | WE 965, NB 858, ER 949 |
| Glutamine, Q | CAA ER 728, CAG ER 213 | NB 891, EX 818, WE 780 |
| Glycine, G | GGA EX 022, GGC ER 192, GGG EX 086, GGT WE 852 | NB 475 |
| Histidine, H | CAC WE 647, CAT WE 585 | WE 278 |
| Isoleucine, I | ATA WE 445, ATC EX 238, ATT ER 306 | WE 560, EX 572 |
| Leucine, L | CTA EX 580, CTC EX 191, CTG EX 380, CTT EX 705, TTA WE 577, TTG LN 418 | NB 045 |
| Lysine, K | AAA NB 211, AAG NB 928 | LN 469, NB 425 |
| Methionine, M | ATG | EX 902 |
| Phenylalanine, F | TTC NB 967, TTT NB 625 | NB 472, WE 442 |
| Proline, P | CCA EX 752, CCC NB 495, CCG WE 484, CCT LN 552 | EX 911, NB 855 |
| Serine, S | TCA WE 608, TCC EX 152, TCG EX 630, TCT EX 101 | NB 363 |
| Threonine, T | ACA WE 690, ACC NB 232, ACG EX 035, ACT WE 084 | WE 162 |
| Tryptophan, W | TGG | WE 755 |
| Tyrosine, Y | TAC NB 302, TAT WE 073 | NB 687, WE 618 |
| Valine, V | GTA EX 345, GTC EX 632, GTG EX 555, GTT WE 362 | EX 799, WE 709 |

Lumping codons does not always have the same effect on the results. Sometimes the fit is improved, as for glutamine, whose constituent codons have a worse fit than their sum, but mostly the fit is worse than could be expected from the parts.

The following results for the e-gene can be shown as typical examples. A very good fit:

Chi-square test of the codon CAC, the negative binomial distribution

| Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare |
|---|---|---|---|---|
| at or below | 22.056 | 18 | 18.3 | .00478 |
| 22.056 | 43.111 | 11 | 12.1 | .09185 |
| 43.111 | 64.167 | 10 | 8.4 | .30737 |
| 64.167 | 85.222 | 6 | 5.8 | .00406 |
| 85.222 | 127.333 | 8 | 6.9 | .17300 |
| above 127.333 | | 5 | 6.5 | .34851 |

Chisquare = 0.929577 with 4 d.f. Sig. level = 0.92028

This is an example of a very good fit. The deviations from expected values are rather small.
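The reported significance levels can be checked from the tabulated contributions. The sketch below sums the per-class chi-square contributions of the CAC table and evaluates the upper tail of the chi-square distribution with the closed form that holds for an even number of degrees of freedom, exp(-x/2) multiplied by the sum of (x/2)^k / k! for k from 0 to df/2 - 1. The function name `chi2_sf_even` is an illustrative choice, and the code is not part of STATGRAPHICS.

```python
import math


def chi2_sf_even(x: float, df: int) -> float:
    """Upper-tail probability of the chi-square distribution for even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even number of degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total


# Per-class contributions as tabulated for the codon CAC
cac = [0.00478, 0.09185, 0.30737, 0.00406, 0.17300, 0.34851]
chi2 = sum(cac)
print(round(chi2, 5), round(chi2_sf_even(chi2, 4), 4))

# Phenylalanine: reported chi-square 7.61193 with 8 d.f.
print(round(chi2_sf_even(7.61193, 8), 4))
```

Both results agree with the significance levels quoted in this section (0.92028 and 0.472265) to about four decimal places. Tests with an odd number of degrees of freedom, such as the serine test with 9 d.f., need the general incomplete gamma function instead.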

Chisquare Test of phenylalanine

| Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare |
|---|---|---|---|---|
| at or below | 7.619 | 18 | 22.5 | .91090 |
| 7.619 | 14.238 | 20 | 17.9 | .24252 |
| 14.238 | 20.857 | 10 | 12.4 | .46637 |
| 20.857 | 27.476 | 12 | 11.7 | .00742 |
| 27.476 | 34.095 | 10 | 9.3 | .05148 |
| 34.095 | 40.714 | 6 | 6.4 | .03072 |
| 40.714 | 47.333 | 12 | 6.1 | 5.76046 |
| 47.333 | 60.571 | 8 | 8.2 | .00414 |
| 60.571 | 73.810 | 5 | 5.3 | .02254 |
| above 73.810 | | 9 | 10.1 | .11538 |

Chisquare = 7.61193 with 8 d.f. Sig. level = 0.472265

A peak appeared here at distances 41 to 47: there were 12 such distances instead of the expected 6.

Chisquare Test of serine

| Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare |
|---|---|---|---|---|
| at or below | 1.000 | 22 | 26.3 | .6904 |
| 1.000 | 5.360 | 87 | 83.8 | .1187 |
| 5.360 | 9.720 | 56 | 58.0 | .0673 |
| 9.720 | 14.080 | 54 | 48.0 | .7493 |
| 14.080 | 18.440 | 29 | 25.3 | .5487 |
| 18.440 | 22.800 | 9 | 17.5 | 4.1118 |
| 22.800 | 27.160 | 15 | 14.5 | .0193 |
| 27.160 | 31.520 | 11 | 7.6 | 1.4996 |
| 31.520 | 35.880 | 4 | 5.3 | .3055 |
| 35.880 | 44.600 | 4 | 6.7 | 1.0620 |
| above 44.600 | | 7 | 5.1 | .6665 |

Chisquare = 9.83917 with 9 d.f. Sig. level = 0.363661

This example is the opposite of phenylalanine: a valley appeared here at distances 19 to 22, contributing almost half of the chi-square.
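The share of a single class in the total can be recomputed from the observed and expected frequencies with the usual formula (O - E)^2 / E. The sketch below uses the serine frequencies as tabulated; because the expected frequencies are rounded to one decimal place, the recomputed values differ slightly from the tabulated contributions. The variable names are illustrative.

```python
# Observed and expected frequencies of the serine distance classes, as tabulated
observed = [22, 87, 56, 54, 29, 9, 15, 11, 4, 4, 7]
expected = [26.3, 83.8, 58.0, 48.0, 25.3, 17.5, 14.5, 7.6, 5.3, 6.7, 5.1]

# Contribution of each class to the chi-square statistic
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
total = sum(contributions)

# Class with the largest deviation (the "valley" at distances 19 to 22)
worst = max(range(len(contributions)), key=contributions.__getitem__)
share = contributions[worst] / total

print(worst, round(contributions[worst], 3), round(total, 3), round(share, 3))
```

The largest contribution comes from the sixth class (index 5), where 9 distances were observed against 17.5 expected.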

Conclusion

The number e as a model of a gene shares some statistical properties with the natural gene.

Literature

1. M. Kunz, Z. Rádl: Distribution of Distances in Information Strings, J. Chem. Inform. Comput. Sci., 38, 374-378.