Number e as a model gene. Distribution of distances
Milan Kunz
December 23, 2000, corrected version March, 2003
Abstract
The number e, converted to base four, was studied as a model of a gene. The distributions of distances between individual codons, and between the amino acids they code for, show some specific features, but these features cannot be considered completely different from those of a natural gene.
Introduction
The enigma of RNA and DNA is now being solved by reading them. In chemical analysis, ribonucleic acids appear as letters on the tape of a telecommunication device. We read them as an unknown language. We know that RNA contains instructions for the synthesis of proteins from amino acids in cells.
A new problem has emerged: how could these chemical structures, RNA, DNA, and proteins, have appeared? How could RNA be formed from its components, the ribonucleotides, and what conditions were favorable for its synthesis? Many other questions must be answered before we understand life and its origin.
There are different possibilities for where life could have started: in the plasma of interstellar space, in droplets in the Earth's atmosphere, in water, perhaps under high pressure in the deep sea, or on catalytic surfaces.
We may suppose that polymerization of the four ribonucleotides produced chains of their copolymers, forming the primitive RNA. Subsequent mutations led to more sophisticated forms capable of self-reproduction in microdroplets acting as primitive cells.
The primitive RNA should be a stochastic copolymer, since the ordering of its monomers was accidental. Such an RNA can be modeled either with a random number generator (replacing the conventional symbols A, C, G, and T for the nucleotides by the numbers 0, 1, 2, and 3) or with a well-defined sequence of four symbols in which the symbols are distributed randomly.
There are practically infinitely many such sequences. It seems better to use a well-defined infinite sequence, since this makes it possible to check the results.
For this purpose, the number e, the base of natural logarithms, is suitable. The number e is obtained as the infinite sum of the terms 1/n!, where n! is the factorial. It can be calculated to as many places as necessary.
Studies of its digits have not shown any pattern in their distribution (see this page).
Since e can be expressed in any number base, from binary to hexadecimal, its expansion in base four is a suitable model of RNA.
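The construction can be illustrated with a minimal sketch. It sums the terms 1/n! exactly, writes the result in base four, and maps the digits 0, 1, 2, 3 to A, C, G, T as described above. The function names are hypothetical, and whether the integer part of e belongs to the model sequence is an assumption left open here; for 100000 decimal places a faster big-integer method would be needed.

```python
from fractions import Fraction

def e_fraction(n_terms: int) -> Fraction:
    """Partial sum of 1/n! for n = 0 .. n_terms - 1, kept as an exact fraction."""
    total = Fraction(0)
    term = Fraction(1)              # 1/0!
    for n in range(n_terms):
        total += term
        term /= (n + 1)             # turn 1/n! into 1/(n+1)!
    return total

def base4_digits(x: Fraction, n_digits: int) -> str:
    """Integer part and the first n_digits fractional digits of x in base four."""
    integer_part = x.numerator // x.denominator
    frac = x - integer_part
    digits = []
    for _ in range(n_digits):
        frac *= 4
        d = frac.numerator // frac.denominator
        digits.append(str(d))
        frac -= d
    return f"{integer_part}." + "".join(digits)

def to_letters(digits: str, alphabet: str = "ACGT") -> str:
    """Map base-four digits to letters (0=A, 1=C, 2=G, 3=T); other characters are skipped."""
    return "".join(alphabet[int(d)] for d in digits if d.isdigit())

# 30 terms are far more than enough for ten base-four digits.
print(base4_digits(e_fraction(30), 10))
```

Passing only the fractional digits to `to_letters` gives a sequence without the leading integer part; which choice matches the sequence actually analyzed is not stated in the text.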
Statistical analysis of RNA and DNA is usually based on determining the frequencies of their components: the nucleotides, their triplets, and the amino acids, and sometimes pairs of nucleotides or longer motifs.
I proposed studying the distances between consecutive occurrences of the same item in a string [1].
Tossing a coin produces a stochastic binary sequence. The number of occurrences of either heads or tails is the model of the binomial distribution and of the normal distribution. The distribution of distances between consecutive 0s or consecutive 1s is known as the negative binomial distribution. Before personal computers it was a mere mathematical curiosity, since calculating it by hand is very tedious.
I found that the distribution of distances between consecutive occurrences of a numeral in the number e is described more or less well by the negative binomial distribution. Therefore, this number appears to be a suitable model of an RNA that could be formed by a stochastic process.
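The distance measure can be sketched as follows. The helper names are hypothetical. The comparison model is an assumption: if every position carries the symbol independently with probability p, the distance to the next occurrence follows the geometric distribution, which is the simplest case of the negative binomial distribution.

```python
def distances(seq: str, symbol: str) -> list[int]:
    """Distances between consecutive occurrences of `symbol` in `seq`."""
    positions = [i for i, s in enumerate(seq) if s == symbol]
    return [b - a for a, b in zip(positions, positions[1:])]

def geometric_expected(p: float, n: int, max_d: int) -> list[float]:
    """Expected counts of the distances 1..max_d among n distances,
    assuming independent positions with symbol probability p."""
    return [n * p * (1 - p) ** (d - 1) for d in range(1, max_d + 1)]

seq = "GTCTTGACCC"                   # short illustrative string, not the e-gene
print(distances(seq, "T"))           # positions 1, 3, 4
print(distances(seq, "C"))           # positions 2, 7, 8, 9
print(geometric_expected(0.25, 100, 3))
```

Observed distance counts can then be compared class by class with the expected counts, for instance by a chi-square test.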
Results
The number e (hereafter the e-gene), calculated to 100000 decimal places, was obtained from J. Ventluka [2]. It was converted to the quaternary base and studied using programs written by Ing. Zdeněk Rádl, CSc. First the distances between symbols were determined and analyzed, then the distances between the 64 triplets, and then the distances between the codons coding the amino acids. The program STATGRAPHICS was used for the statistical analysis. Unfortunately, STATGRAPHICS did not reproduce calculations exactly; for example, it could not be forced to use the same number of degrees of freedom in repeated calculations. The chi-square values obtained are therefore sometimes poorly comparable.
The e-gene was compared with results obtained earlier for the gene FRAX 52 (the n-gene) [1].
First, the frequencies of all 64 codons were determined.
The results are given in the following four tables, one for each first symbol of the triplet. The rows are arranged by the second symbol of the triplet and the columns by the third symbol. The row sums (S) give the frequencies of all triplets starting with a given pair of symbols, and the column sums give the frequencies of all triplets with a given first and last symbol.
Table A
A | A | C | G | T | S
A | 48 | 58 | 46 | 45 | 197
C | 55 | 52 | 51 | 57 | 225
G | 54 | 46 | 54 | 39 | 193
T | 45 | 53 | 56 | 49 | 253
S | 202 | 209 | 217 | 190 | 818
Table C
A | A | C | G | T | S
A | 50 | 58 | 55 | 64 | 227
C | 52 | 55 | 65 | 63 | 335
G | 46 | 46 | 57 | 57 | 206
T | 62 | 60 | 47 | 50 | 219
S | 210 | 219 | 224 | 234 | 842
Table G
A | A | C | G | T | S
A | 68 | 45 | 62 | 65 | 240
C | 52 | 57 | 47 | 44 | 200
G | 44 | 63 | 55 | 53 | 215
T | 53 | 60 | 54 | 45 | 212
S | 217 | 215 | 218 | 207 | 867
Table T
A | A | C | G | T | S
A | 52 | 58 | 53 | 65 | 228
C | 54 | 55 | 44 | 47 | 200
G | 44 | 58 | 58 | 51 | 208
T | 43 | 56 | 55 | 54 | 268
S | 193 | 227 | 210 | 217 | 847
Generally, some column or row sums are larger than others. This does not seem to be significant, but I did not make the necessary tests.
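The tabulation used for the four frequency tables can be sketched as follows. It is an illustration under stated assumptions, not the program actually used: the function names are hypothetical, and the triplets are assumed to be read without overlap from the first symbol of the sequence.

```python
from collections import Counter

def codon_counts(seq: str) -> Counter:
    """Count non-overlapping triplets read from position 0 (one reading frame)."""
    usable = len(seq) - len(seq) % 3        # drop an incomplete last triplet
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))

def table_for_first(counts: Counter, first: str, alphabet: str = "ACGT"):
    """Table for triplets starting with `first`: rows follow the second symbol,
    columns the third; each row ends with its sum, the last row holds column sums."""
    rows = []
    for second in alphabet:
        row = [counts[first + second + third] for third in alphabet]
        rows.append(row + [sum(row)])
    col_sums = [sum(r[j] for r in rows) for j in range(len(alphabet) + 1)]
    return rows + [col_sums]

counts = codon_counts("AACAACAAGT")         # short illustrative string
for row in table_for_first(counts, "A"):
    print(row)
```

Because each row carries its own sum, the last entry of the final row is the total number of triplets starting with the chosen symbol, which gives a quick consistency check on a table.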
The fits for all triplets in both genes are listed in the following table. As a rule, at most the two best fits are given; sometimes more exist, but not all possibilities were tested. If the first attempt was satisfactory, no further tests were made.
Table Correlations of Distances
Explanations:
NB - the negative binomial distribution,
EX - the exponential distribution,
ER - the Erlang distribution,
LN - the lognormal distribution,
WE - the Weibull distribution,
Chisquare is given as the first 3 decimal places of the significance level of the chi-square test (for example, 920 stands for 0.920).
Codon | n-gene, type | n-gene, chi-square | e-gene, type | e-gene, chi-square
AAA | WE | 001 | ER | 821
AAC | WE | 001 | WE | 750
AAG | WE | 168 | NB, ER | 928, 795
AAT | EX, NB | 276 | ER | 750
ACA | EX, WE | 885 | WE | 690
ACC | EX | 774 | ER | 361
ACG | ER | 603 | WE | 490
ACT | NB, EX | 863 | LN | 664
AGA | EX | 517 | WE | 892
AGC | EX, LN | 273 | LN | 321
AGG | ER | 545 | LN | 780
AGT | WE | 021 | EX, ER | 721, 555
ATA | WE, LN | 444 | NB | 060
ATC | EX, NB | 238 | NB, EX | 726
ATG | ER | 738 | EX | 902
ATT | ER | 306 | ER, NB | 366, 281
CAA | ER | 728 | LN | 312
CAC | WE | 647 | NB | 920
CAG | ER, EX | 231, 213 | EX | 192
CAT | WE, EX | 585 | WE, NB | 516
CCA | EX | 263 | EX | 752
CCC | EX | 719 | NB, EX | 481
CCG | WE | 602 | WE | 684
CCT | EX | 643 | LN | 552
CGA | EX, ER | 823, 630 | WE | 272
CGC | WE | 184 | WE | 460
CGG | WE, LN | 854 | WE | 624
CGT | EX | 614 | EX | 849
CTA | WE | 814 | EX | 580
CTC | WE | 666 | EX | 191
CTG | WE, LN, EX | 225 | EX, ER | 705, 656
CTT | EX, ER | 342, 343 | EX | 380
GAA | WE | 352 | EX | 354
GAC | ER | 464 | ER, EX | 351, 270
GAG | WE | 852 | EX, ER | 653, 620
GAT | EX, NB | 750 | LN | 524
GCA | WE | 585 | NB | 131
GCC | ER | 828 | NB | 519
GCG | ER | 767 | NB | 117
GCT | ER | 675 | EX | 518
GGA | EX | 021 | LN | 422
GGC | EX, ER | 192, 192 | EX | 390
GGG | EX | 085 | LN | 423
GGT | WE, EX | 852 | NB, EX | 726
GTA | ER | 131 | LN | 477
GTC | WE | 306 | ER, EX | 640, 632
GTG | ER | 717 | EX, NB | 555
GTT | EX | 311 | NB | 512
TAA | EX | 229 | NB | 758
TAC | WE | 323 | NB | 302
TAG | ER | 400 | NB | 696
TAT | EX | 105 | WE | 073
TCA | NB | 877 | WE | 903
TCC | WE, EX | 622 | ER | 676
TCG | EX, ER | 934, 615 | EX | 630
TCT | NB, EX | 623 | EX | 101
TGA | EX | 707 | WE | 219
TGC | LN | 594 | LN | 664
TGG | WE, EX | 991, 937 | WE | 755
TGT | EX, ER | 978, 937 | EX | 741
TTA | WE | 924 | EX, ER | 577, 359
TTC | WE | 255 | EX | 967
TTG | ER, EX, NB | 838, 813, 812 | LN | 418
TTT | WE | 238 | EX, NB | 637
The best fits are tabulated as follows (number of codons for which the distribution is the best fit + number of further codons for which it is also applicable):
Distribution | n-gene | e-gene
The negative binomial distribution | 3 + 4 | 13 + 4
The exponential distribution | 22 + 9 | 20 + 5
The Erlang distribution | 14 + 5 | 7 + 5
The lognormal distribution | 1 + 4 | 11
The Weibull distribution | 24 + 1 | 13
The Weibull and exponential distributions are the most frequent for FRAX 52. The Erlang distribution is the best fit for 14 codons and is applicable to 5 others, mostly together with the exponential distribution. The negative binomial and lognormal distributions are rare. For the artificial e-gene all the distributions are applicable, and the exponential distribution is the most frequent.
The Erlang distribution was not tested in the first version of this communication.
Sometimes the chi-square test cannot decide which distribution gives the better fit. The goodness of fit was usually reduced by local deviations between the expected and observed values.
The following table cross-tabulates the best fits of the 64 codons in the two genes according to the chi-square test.
The first best fits, comparing n-gene against e-gene
n\e | EX | ER | LN | NB | WE | S
EX | 6 | 2 | 5 | 4 | 5 | 22
ER | 4 | 2 | 4 | 3 | 1 | 14
LN | 0 | 0 | 1 | 0 | 0 | 1
NB | 1 | 0 | 1 | 0 | 1 | 3
WE | 9 | 3 | 0 | 6 | 6 | 24
S | 20 | 7 | 11 | 13 | 13 | 64
The triplet distance distributions of FRAX 52 and the e-gene coincide mostly for the exponential distribution and the Weibull distribution (6 codons each). Otherwise the two sets differ.
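A cross-tabulation of this kind can be sketched as follows. Only the first four codons of the table of fits are entered, as an illustration; the function name is hypothetical, and the first distribution listed for a codon is assumed to be its best fit.

```python
from collections import Counter

# (codon, first best fit in the n-gene, first best fit in the e-gene)
first_fits = [
    ("AAA", "WE", "ER"),
    ("AAC", "WE", "WE"),
    ("AAG", "WE", "NB"),
    ("AAT", "EX", "ER"),
]

def cross_tabulate(rows):
    """Count the codons for each (n-gene fit, e-gene fit) pair."""
    return Counter((n_fit, e_fit) for _, n_fit, e_fit in rows)

table = cross_tabulate(first_fits)
for (n_fit, e_fit), count in sorted(table.items()):
    print(n_fit, e_fit, count)
```

Entering all 64 codons in the same way would yield the cells of the full table; the diagonal pairs, such as WE with WE, are the codons on which both genes agree.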
The co-occurrence of the first and second best-fit distributions within each set (rows: first fit, columns: second fit) is given in the following two tables:
n-gene
1\2 | EX | ER | LN | NB | WE | S
EX | x | 5 | 1 | 4 | 1 | 11
ER | 2 | x | 0 | 1 | 0 | 3
LN | 1 | 0 | x | 0 | 0 | 1
NB | 2 | 0 | 0 | x | 0 | 2
WE | 5 | 0 | 3 | 0 | x | 8
S | 10 | 5 | 4 | 5 | 1 | 25
e-gene
1\2 | EX | ER | LN | NB | WE | S
EX | x | 4 | 0 | 2 | 0 | 6
ER | 2 | x | 0 | 1 | 0 | 3
LN | 0 | 0 | x | 0 | 0 | 0
NB | 3 | 1 | 0 | x | 0 | 4
WE | 0 | 0 | 0 | 1 | x | 1
S | 5 | 5 | 0 | 4 | 0 | 14
The fits for the artificial e-gene are better defined: there are fewer triplets for which two distributions give approximately the same fit (14 against 25).
There appears to be a great difference between the two genes, although its statistical significance was not verified. Clearly, the e-gene is best fitted mostly by the exponential and negative binomial distributions, and the n-gene by the exponential and Weibull distributions. Because the symbols mostly follow the negative binomial distribution, forming triplets shifts the distribution of distances first toward the exponential distribution, then toward the Weibull distribution, and finally toward the lognormal distribution.
The chi-square values ranged
for the n-gene from 0.001 to 0.991,
for the e-gene from 0.060 to 0.928.
There seems to be no correlation between the results for the two genes.
In the natural gene, the exponential distribution is paired with the negative binomial distribution and the Weibull distribution; in the e-gene, the exponential distribution is paired with the negative binomial distribution. This is only a rough estimate, since not all possible pairings were evaluated.
Supposing that the starting stochastic distribution has the form of the negative binomial distribution, it changes to the exponential distribution, then to the Weibull distribution, and eventually to the lognormal distribution.
This picture is blurred by deviations from the ideal forms of the distributions, such as local surpluses or shortages of some distances.
Distribution of amino acids in the e-gene
Different triplets are lumped into amino acids. The results are given in the following table.
Amino acid, code symbol | Codons, best distribution, chi-square 0.xxx | Correlation with the best distribution, chi-square 0.xxx
Alanine, A | GCA WE 586, GCC ER 828, GCG ER 767, GCT ER 676 | EX 563, WE 450
Arginine, R | AGA EX 517, AGG ER 545, CGA EX 823, CGC WE 184, CGG WE 854, CGT EX 614 | NB 708
Aspartic acid, D | GAC ER 621, GAT WE 852 | WE 137
Asparagine, N | AAC WE 001, AAT EX 277 | LN 479
Cysteine, C | TGC LN 664, TGT EX 978, ER 974 | NB 265, ER 265
Glutamic acid, E | GAA ER 353, GAG WE 852 | WE 965, NB 858, ER 949
Glutamine, Q | CAA ER 728, CAG ER 213 | NB 891, EX 818, WE 780
Glycine, G | GGA EX 022, GGC ER 192, GGG EX 086, GGT WE 852 | NB 475
Histidine, H | CAC WE 647, CAT WE 585 | WE 278
Isoleucine, I | ATA WE 445, ATC EX 238, ATT ER 306 | WE 560, EX 572
Leucine, L | CTA EX 580, CTC EX 191, CTG EX 380, CTT EX 705, TTA WE 577, TTG LN 418 | NB 045
Lysine, K | AAA NB 211, AAG NB 928 | LN 469, NB 425
Methionine, M | ATG | EX 902
Phenylalanine, F | TTC NB 967, TTT NB 625 | NB 472, WE 442
Proline, P | CCA EX 752, CCC NB 495, CCG WE 484, CCT LN 552 | EX 911, NB 855
Serine, S | TCA WE 608, TCC EX 152, TCG EX 630, TCT EX 101 | NB 363
Threonine, T | ACA WE 690, ACC NB 232, ACG EX 035, ACT WE 084 | WE 162
Tryptophan, W | TGG | WE 755
Tyrosine, Y | TAC NB 302, TAT 073 | NB 687, WE 618
Valine, V | GTA EX 345, GTC EX 632, GTG EX 555, GTT WE 362 | EX 799, WE 709
Lumping codons does not always have the same effect on the results. Sometimes the fit improves, as for glutamine, whose constituent codons fit worse than their sum, but mostly the fit is worse than would be expected from the parts.
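Lumping can be sketched as pooling the positions of all codons of one amino acid before the distances are taken. The function names are hypothetical. Measuring distances in triplet units within a single reading frame is an assumption, since the text does not state the unit.

```python
def codon_positions(seq: str, codons: set[str]) -> list[int]:
    """Indices, in triplet units, at which any of `codons` occurs in the frame."""
    usable = len(seq) - len(seq) % 3
    triplets = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [k for k, t in enumerate(triplets) if t in codons]

def codon_distances(seq: str, codons: set[str]) -> list[int]:
    """Distances between consecutive occurrences of any codon in the set."""
    pos = codon_positions(seq, codons)
    return [b - a for a, b in zip(pos, pos[1:])]

seq = "CAATTTCAGCAAGGGGGGCAG"                 # short illustrative string
print(codon_distances(seq, {"CAA"}))          # one codon alone
print(codon_distances(seq, {"CAG"}))          # the other codon alone
print(codon_distances(seq, {"CAA", "CAG"}))   # both glutamine codons lumped
```

The lumped distances are not a simple union of the separate ones, because occurrences of one codon split the gaps of the other; this is one reason the fit of the sum can differ from the fits of the parts.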
The following results for the e-gene are typical examples. A very good fit:
Chisquare test of the codon CAC. The negative binomial distribution
Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare
at or below | 22.056 | 18 | 18.3 | .00478
22.056 | 43.111 | 11 | 12.1 | .09185
43.111 | 64.167 | 10 | 8.4 | .30737
64.167 | 85.222 | 6 | 5.8 | .00406
85.222 | 127.333 | 8 | 6.9 | .17300
above 127.333 | | 5 | 6.5 | .34851
Chisquare = 0.929577 with 4 d.f. Sig. level = 0.92028
This is an example of a very good fit. The deviations from the expected values are rather small.
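The arithmetic behind such a table can be checked with a short sketch. The function names are hypothetical. Because the expected frequencies are printed to one decimal place only, the recomputed statistic is close to, but not identical with, the printed value; the closed form used for the significance level holds only for an even number of degrees of freedom.

```python
import math

def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def sig_level_even_df(chi2: float, df: int) -> float:
    """Upper-tail probability of the chi-square distribution for even df,
    from the closed form exp(-x/2) * sum over k < df/2 of (x/2)^k / k!."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = chi2 / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

observed = [18, 11, 10, 6, 8, 5]
expected = [18.3, 12.1, 8.4, 5.8, 6.9, 6.5]    # rounded values from the table
print(chi_square(observed, expected))           # near the printed 0.929577
print(sig_level_even_df(0.929577, 4))           # near the printed 0.92028
```

For odd degrees of freedom a statistics library would be the practical choice.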
Chisquare Test of phenylalanine
Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare
at or below | 7.619 | 18 | 22.5 | .91090
7.619 | 14.238 | 20 | 17.9 | .24252
14.238 | 20.857 | 10 | 12.4 | .46637
20.857 | 27.476 | 12 | 11.7 | .00742
27.476 | 34.095 | 10 | 9.3 | .05148
34.095 | 40.714 | 6 | 6.4 | .03072
40.714 | 47.333 | 12 | 6.1 | 5.76046
47.333 | 60.571 | 8 | 8.2 | .00414
60.571 | 73.810 | 5 | 5.3 | .02254
above 73.810 | | 9 | 10.1 | .11538
Chisquare = 7.61193 with 8 d.f. Sig. level = 0.472265
A peak appeared here between distances 41 and 47: 12 such distances instead of the expected 6.
Chisquare Test of serine
Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare
at or below | 1.000 | 22 | 26.3 | .6904
1.000 | 5.360 | 87 | 83.8 | .1187
5.360 | 9.720 | 56 | 58.0 | .0673
9.720 | 14.080 | 54 | 48.0 | .7493
14.080 | 18.440 | 29 | 25.3 | .5487
18.440 | 22.800 | 9 | 17.5 | 4.1118
22.800 | 27.160 | 15 | 14.5 | .0193
27.160 | 31.520 | 11 | 7.6 | 1.4996
31.520 | 35.880 | 4 | 5.3 | .3055
35.880 | 44.600 | 4 | 6.7 | 1.0620
above 44.600 | | 7 | 5.1 | .6665
Chisquare = 9.83917 with 9 d.f. Sig. level = 0.363661
This example is the opposite of phenylalanine: a valley appeared here at distances 19 to 22, contributing almost half of the chi-square.
Conclusion
The number e, as a model of a gene, shares some statistical properties with the natural gene.
Literature
1. M. Kunz, Z. Rádl: Distribution of Distances in Information Strings, J. Chem. Inform. Comput. Sci., 38, 374-378.