A note about DNA sequences

 

The complete sequencing of the human genome, the rice genome, and similar achievements has aroused interest in problems connected with the interpretation of these results.

Randic proposed a condensed representation of DNA primary sequences. These representations take the form of 4x4 matrices M whose entries m_ij are the numbers of transitions from base i to base j, that is, how often base i is immediately followed by base j.
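The counting of transitions described above can be sketched in a few lines of Python. This is only an illustration: the function name transition_matrix, the row and column order A, C, G, T, and the toy sequence are assumptions of the sketch, not taken from Randic's paper.

```python
# Order of the bases along rows and columns (an assumption of this sketch).
BASES = "ACGT"

def transition_matrix(seq):
    """Return the 4x4 matrix whose entry [i][j] counts how often
    base i is immediately followed by base j in seq."""
    index = {base: k for k, base in enumerate(BASES)}
    m = [[0] * 4 for _ in range(4)]
    for left, right in zip(seq, seq[1:]):
        m[index[left]][index[right]] += 1
    return m

# Hypothetical toy sequence, not taken from any real gene.
example = "ATGGTGCACC"
for base, row in zip(BASES, transition_matrix(example)):
    print(base, row)
```

A sequence of n bases contains n - 1 neighbouring pairs, so the entries of the matrix always sum to n - 1.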

In effect, Randic proposed using Markov matrices.

In his fundamental paper, Markov studied the transitions of symbols in Pushkin's poem Eugene Onegin, reducing the task to the binary case of vowels and consonants. A whole branch of mathematics stemmed from this study.

As a criterion for comparing different results, Randic proposed the Euclidean distance between the observed matrices. In this paper, I try to analyze some aspects of this problem. The analysis is explained for nucleic bases, but the results are valid for consecutive triplets and amino acids as well.
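As a minimal sketch of such a comparison, the Euclidean distance between two matrices can be computed by treating each matrix as a flat vector of its entries. The function name euclidean_distance and the two small matrices are hypothetical and serve only as an illustration.

```python
import math

def euclidean_distance(m1, m2):
    """Euclidean distance between two equally sized matrices,
    treating each matrix as a flat vector of its entries."""
    return math.sqrt(sum((a - b) ** 2
                         for row1, row2 in zip(m1, m2)
                         for a, b in zip(row1, row2)))

# Two hypothetical 2x2 count matrices, for illustration only.
p = [[1, 2], [3, 4]]
q = [[1, 0], [3, 1]]

# The squared differences are 0, 4, 0 and 9, so the distance is sqrt(13).
print(euclidean_distance(p, q))
```

The same function applies unchanged to 4x4 matrices of nucleic bases or to larger matrices for triplets and amino acids.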

Matrices of expected events

Analysis of a sequence gives the numbers of the four nucleic bases A, C, G, and T.

The first trivial equation is

         A + C + G + T = n

where n is the total number of nucleic bases in the DNA sequence.

We define the row vector N as (A, C, G, T) and use it to form the square matrix (N^T N)_exp. This matrix is symmetric. Its entries give the expected numbers of neighbours AC, AG, AT, etc. The squared values A^2, C^2, etc. on the diagonal are somewhat greater than the expected values X(X - 1) of the transitions AA, etc.; the difference can be neglected for large X. The matrix (N^T N)_exp has just one nonzero eigenvalue, since it is generated by a single vector.
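The properties claimed for this matrix can be checked numerically. The sketch below uses hypothetical base counts, chosen only for illustration; it builds the outer product of N with itself, confirms the symmetry, and verifies that N is an eigenvector whose eigenvalue A^2 + C^2 + G^2 + T^2 equals the trace, while a vector orthogonal to N is mapped to zero.

```python
# Hypothetical base counts (A, C, G, T), chosen only for illustration.
N = [3, 1, 4, 2]

# Outer product: entry [i][j] is N[i] * N[j].
outer = [[x * y for y in N] for x in N]

# The matrix is symmetric.
symmetric = all(outer[i][j] == outer[j][i]
                for i in range(4) for j in range(4))

# N is an eigenvector with eigenvalue N.N = A^2 + C^2 + G^2 + T^2.
eigenvalue = sum(x * x for x in N)
image = [sum(outer[i][j] * N[j] for j in range(4)) for i in range(4)]

# The trace equals the same number, consistent with all other
# eigenvalues being zero.
trace = sum(outer[i][i] for i in range(4))

# A vector orthogonal to N (3*1 + 1*(-3) = 0) is mapped to the zero vector.
v = [1, -3, 0, 0]
image_v = [sum(outer[i][j] * v[j] for j in range(4)) for i in range(4)]

print(symmetric, eigenvalue, trace, image, image_v)
```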

The matrix (N^T N)_exp serves as a baseline for the matrix of the observed transitions (N^T N)_obs. Both matrices can be normalized by dividing by the numbers (A, C, G, T), that is, by dividing each entry by the counts of the two bases it involves.

For (N^T N)_exp, the result is the unit matrix JJ^T (all entries equal to 1), where J is the unit column.

For (N^T N)_obs, the resulting matrix elements are different numbers. Their deviation from 1 characterizes the specificity of the DNA, its departure from the most probable values.
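One possible reading of this normalization is sketched below. The helper normalize, the base counts, and the observed matrix are all hypothetical. The factor n applied to the observed counts is an assumption of the sketch: the observed entries sum to about n, whereas the products of base counts sum to n^2, so some rescaling is needed before a value of 1 can mean "as frequent as expected".

```python
def normalize(matrix, counts, scale=1.0):
    """Divide entry [i][j] by counts[i] * counts[j] and multiply by scale."""
    size = len(counts)
    return [[scale * matrix[i][j] / (counts[i] * counts[j])
             for j in range(size)] for i in range(size)]

# Hypothetical base counts (A, C, G, T) and observed transitions.
counts = [2, 3, 4, 1]
n = sum(counts)
observed = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]

# Expected matrix: the outer product of the count vector with itself.
expected = [[x * y for y in counts] for x in counts]

# Every normalized expected entry equals 1, i.e. the matrix J J^T.
ones = normalize(expected, counts)

# Assumption: the factor n puts the observed counts on the expected scale.
ratios = normalize(observed, counts, scale=n)

for row in ratios:
    print([round(r, 2) for r in row])
```

Entries above 1 then mark transitions that occur more often than the base composition alone would suggest, and entries below 1 mark transitions that occur less often.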

For example, Table 2 of Randic gives the condensed matrix for exon 1-92:

         4    2    7    4
         1    2   15    2
         8    9   12    6
         3    7    2    6

 

The table of expected values (with some rounding errors):

         3    4    7    3
         4    4    8    4
         8    8   13    7
         3    4    7    3

 

The difference (expected - observed):

        -1    2    0   -1
         3    2   -7    2
        -2   -1    1    1
         0   -3    5   -3
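The calculation behind these tables can be retraced roughly as follows. This is a sketch under two assumptions that the text does not state: the row sums of the observed matrix stand in for the base counts, and the products of counts are divided by the total so that they are on the same scale as the observed transitions. The rounded output can be compared against the printed tables; small deviations are to be expected.

```python
# Observed condensed matrix as printed above (rows and columns in the
# order of the printed table).
observed = [
    [4, 2, 7, 4],
    [1, 2, 15, 2],
    [8, 9, 12, 6],
    [3, 7, 2, 6],
]

# Assumption: the row sums stand in for the base counts.
counts = [sum(row) for row in observed]
total = sum(counts)

# Expected counts N[i] * N[j] / total; the division by the total is an
# assumption made to match the scale of the observed counts.
expected = [[ci * cj / total for cj in counts] for ci in counts]

# Rounded expected values and rounded differences (expected - observed).
expected_rounded = [[round(e) for e in row] for row in expected]
difference = [[round(e - o) for e, o in zip(erow, orow)]
              for erow, orow in zip(expected, observed)]

for erow, drow in zip(expected_rounded, difference):
    print(erow, drow)
```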

 

The normalized matrices can be subtracted from the unit diagonal matrix I. This operation gives the actual Markov operator [I - (N^T N)_norm].

The matrix (N^T N)_obs can be compared with the Randic matrix of observed transitions.

Unfortunately, the observed matrices have complex eigenvalues and their interpretation is difficult.