WIENER NUMBER DISINTEGRATED

Milan Kunz

Of the different decompositions of the Wiener number, the Altenburg polynomial gives the best results. The inverse matrix technique gives exact correlations inside training sets and acceptable predictions for test sets, provided that all subgraphs are used as topological descriptors together with distances. The standard deviation is greater for heptanes than for hexanes. The results show that physical interpretations of the original Wiener correlation can be fallacious.

1. Introduction

Fifty years ago, when Wiener [1] published his correlation of paraffin boiling points with a structural feature known as distances, an era of topological indices began. They have gained a strong position among the techniques used for correlating and predicting physicochemical properties of molecules. Because of its great economic importance for the pharmaceutical industry, the topic has appeared in several thousand publications.

Before computers, because linear correlations with one variable are convincing and free of technical difficulties, the aim was to characterize a molecule by only one number. But no universal index was found. Now it is customary to combine several indices to obtain good fits [2,3]. Another extreme appeared: too many topological indices were introduced and they intercorrelate [4-8]. Therefore, orthogonal sets of topological indices are constructed [9-11].

The Wiener index correlates successfully with the boiling points of alkanes and with their other properties. Distances of different lengths seem to have specific effects on physical properties [1,4,5,12-14]. Distances between atoms correspond to moments of electronic orbitals [14].

Trinajstic and his coworkers [15-21] calculated the geometric analogue of the Wiener index, that is, polynomial coefficients and eigenvalues of distance matrices of alkanes, both topological and geometrical. Some authors proposed the use of the eigenvalues of graph matrices [2,22-24] and/or the coefficients of their polynomials [25,26] as topological indices.

Topological indices are objective invariants characterizing molecules, and that should be their advantage over substituent constants, which are derived from empirical correlations. Their disadvantage is that they are usually not connected with properties by physical theories that could be tested objectively, and therefore the sole criterion for their acceptance is the correlation coefficient between a proposed index and some property of molecules.

Unique numbers characterizing molecules should overcome the problems of multiple correlations. The technical problems are now obsolete thanks to computers, but the problems connected with the precision of such correlations remain.

2. Inverse matrix technique

One technique for overcoming difficulties with evaluating the significance of multiple correlation coefficients is the use of training and test sets. The efficiency of a correlation is simply verified as its capability to predict independent cases. This is mostly used with neural networks, but it can be applied to the inverse matrix technique, too. Since inverting nonsingular matrices is a routine procedure, it is now quite easy to use.

If M is a matrix of some indicators (their row sums can give an index), x is a vector of weights of the indicators, and b is a vector of correlated properties

Mx = b

then

x = M⁻¹b

provided that M is regular. In practice it must be well conditioned to give a reasonable inverse. It is in principle possible to obtain exact correlations with physical properties within the training set even with matrices of random numbers, but the predictions will be erratic. If we weight row sums of random numbers by an index known from linear correlations to be effective, e.g. by the Wiener index, then we again get exact correlations within the training set, and the predicted values will have large random errors; nevertheless they may be acceptable. When a decomposition of an index into the indicators has a physical meaning, a linear combination of indicator weights should give a good prediction for the test set.
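
The behaviour described above can be sketched numerically. The following Python fragment is an illustration only, with fictitious random indicators and properties; it shows that a square training matrix reproduces its training set exactly while saying nothing reliable about a test set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 8 "molecules", each described by 6 random indicators;
# the first 6 molecules form the training set, the last 2 the test set.
M_all = rng.random((8, 6))
b_all = rng.random(8) * 100          # fictitious "boiling points"

M_train, b_train = M_all[:6], b_all[:6]
x = np.linalg.solve(M_train, b_train)   # x = M^-1 b

# Training compounds are reproduced exactly, up to rounding errors ...
assert np.allclose(M_train @ x, b_train)

# ... but predictions for the test compounds are erratic, because the
# random indicators have no causal relation to the property.
print(M_all[6:] @ x - b_all[6:])
```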

As an example, consider the boiling points of hydrogen and the C1-C3 alkanes, with all plain subgraphs of the molecular graphs used as the indicators:

Indicator       H2   C   C-C   C-C-C   Cycle
Hydrogen         1   0    0      0       0
Methane          2   1    0      0       0
Ethane           3   2    1      0       0
Propane          4   3    2      1       0
Cyclopropane     3   3    3      3       1

Inverse matrix           Vector b (°C)

 1   0   0   0   0         -259.2
-2   1   0   0   0          357.4
 1  -2   1   0   0          -25.2
 0   1  -2   1   0          -27.1
 0   0   3  -3   1         -104.7

The last term represents the difference between the prediction 137.7 °C of the boiling point of cyclopropane and its experimental value -33 °C. The prediction is based on the fact that cyclopropane contains 6 hydrogens, 3 carbons, 3 C-C bonds and 3 different paths of length 2. The effect of cyclization is too great to give a useful prediction.
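
The tabulated example can be verified directly. In this sketch the boiling points are commonly quoted experimental values, which differ slightly from those used in the original calculation, so the acyclic prediction for cyclopropane comes out near +140 °C rather than exactly 137.7 °C:

```python
import numpy as np

# Indicator matrix from the table above: rows = H2, CH4, C2H6, C3H8,
# cyclopropane; columns = H2 units, C, C-C, C-C-C, cycle.
M = np.array([[1, 0, 0, 0, 0],
              [2, 1, 0, 0, 0],
              [3, 2, 1, 0, 0],
              [4, 3, 2, 1, 0],
              [3, 3, 3, 3, 1]], dtype=float)

M_inv = np.linalg.inv(M)
assert np.allclose(M_inv @ M, np.eye(5))   # matches the tabulated inverse

# Approximate experimental boiling points (°C); illustrative values.
b = np.array([-252.9, -161.5, -88.6, -42.1, -33.0])
x = M_inv @ b                              # weights of the indicators

# Prediction for cyclopropane from the acyclic indicators alone
# (first four columns of its row) overshoots badly:
pred = M[4, :4] @ x[:4]
print(round(pred, 1))                      # far above the experimental -33 °C
```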

3. Indicator vectors

Between the n atoms of a molecule there exist n(n-1)/2 distances. The topological distance dij is the number of bonds between atoms i and j. Geometric distances between the centers of atoms depend on the configuration of the molecule; they must be calculated for some typical configurations of molecules.

The Wiener index W can be formulated as the Altenburg polynomial [27]

W = Σ nk dk    (1)

where nk is the number of paths of length dk in the molecule. The Altenburg polynomial gives the gyration tensor of an alkane, which roughly corresponds to its volume and is the sum of squared distances. Recently Gutman and Körtvélyesi [28] claimed that the Wiener index is a measure of the surface of molecules. The values nk were used by Kvasnička [29] as some of the descriptors in a neural network.
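
The quantities nk in eq. (1) can be obtained by breadth-first search over the hydrogen-depleted graph. A minimal Python sketch, using 2-methylbutane as an example (the vertex numbering is arbitrary):

```python
from collections import Counter, deque

def distance_counts(adj):
    """BFS from every vertex; return {d_k: n_k} over unordered pairs."""
    n = len(adj)
    counts = Counter()
    for s in range(n):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for t in range(s + 1, n):      # count each pair once
            counts[dist[t]] += 1
    return counts

# Hydrogen-depleted 2-methylbutane: chain C0-C1-C2-C3, methyl C4 on C1.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
nk = distance_counts(adj)
W = sum(n * d for d, n in nk.items())  # Altenburg form of eq. (1)
print(dict(nk), W)                     # {1: 4, 2: 4, 3: 2} and W = 18
```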

Another objective decomposition of the Wiener index is its representation by the eigenvalues of matrices whose trace is twice the Wiener index. The sum of the eigenvalues is identical with the sum of the diagonal elements. The distance matrix D of a molecule is defined [30] as a square matrix whose elements dij are the distances between the atoms; the diagonal elements are zero. It is possible to formulate matrices (Q - D) having on the diagonal the sums of the distances from the given vertex to all other vertices. For acyclic graphs two other kinds of matrices have traces related to the Wiener index, namely the quadratic form of the path and walk matrices WTW [31] and the Eichinger matrices E [32-35]. The relations of the different graph matrices are summarized in TABLE I.

TABLE I Graph matrices

I the diagonal unit matrix

JT the transposed unit vector column

S the incidence matrix of an oriented graph

sij = -1 if the arc i goes out of the vertex j

sij = 1 if the arc i goes into the vertex j

STS the Laplace-Kirchhoff matrix STS = V - A

V the diagonal matrix of vertex degrees vj

A the adjacency matrix

aij = 1 if vertices i and j are adjacent, aij = 0 otherwise

W the path (walk) matrix, defined for trees [31] as incidences of paths with arcs or edges

WTW the inverse matrix of nSST

D the distance matrix

dij the number of arcs on the path between vertices i and j

Q the diagonal matrix of distance sums in D

E the Eichinger matrix, the generalized inverse [32]

ESTS = nI - JJT

Distance matrices have one positive eigenvalue and (n-1) negative eigenvalues whose sum annihilates the positive eigenvalue. The sum of the squared eigenvalues of the distance matrix is equal to the sum of its squared elements, that is, to its gyration tensor [27,36,37], and these eigenvalues were included in the study. Together with the eigenvalues of the distance matrices, their inverse values were tested. In the case of WTW matrices, these are the eigenvalues of the matrix SST and simultaneously of the Laplace-Kirchhoff matrix STS. In the case of topological distance matrices, their inverses are the perturbed Laplace-Kirchhoff matrices [33].
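
These spectral identities are easy to verify numerically, e.g. for the topological distance matrix of the n-butane carbon skeleton (the path on 4 vertices):

```python
import numpy as np

# Topological distance matrix of the path P4 (n-butane skeleton).
D = np.array([[0, 1, 2, 3],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [3, 2, 1, 0]], dtype=float)

eig = np.linalg.eigvalsh(D)

# One positive eigenvalue; the (n-1) negative ones cancel it in the sum.
assert np.sum(eig > 0) == 1
assert abs(eig.sum()) < 1e-10          # the trace of D is zero

# Sum of squared eigenvalues equals the sum of squared elements of D
# (twice the "gyration tensor" over unordered pairs).
assert np.isclose((eig ** 2).sum(), (D ** 2).sum())
print(eig)
```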

Other tested vectors were the polynomial combinations of the eigenvalues, that is, the coefficients of the matrix polynomial. Their sum, known as the Hosoya index [25], correlates with boiling points.

4. Correlations of alkane boiling points

At first we used heptanes as training and test sets. For the training sets we used k compounds with k independent values of the indicator vectors, which gave the matrix M. Multiplying the inverse matrix M⁻¹ by the column vector of the corresponding boiling points, we obtained the vector of weights x which, when multiplied by the rows of values characterizing the compounds of the whole set, gave the predicted boiling points. If the matrix M was well conditioned and had a reasonable inverse, the boiling points of the training sets were always reproduced exactly, up to rounding errors much smaller than the best experimental errors. The quadratic form of the difference between the predicted and experimental boiling point vectors, normalized to one compound (the standard deviation), was used as the criterion of fit.

The results obtained with heptanes are compiled in TABLE II. From training sets of 6 or 7 compounds, 2 or 3 others were predicted. The results for geometric and topological distances are not strictly comparable, because the training sets contained different heptanes, as they were taken from the tables of the original papers [16-20].

TABLE II
Predictions of boiling points of heptane isomers

Used vectors                                  Standard deviation (°C)
Eigenvalues of WTW                                   22.11
Eigenvalues of WTW + c                               21.51
Ordered diagonal values of WTW                       75.95
Eigenvalues of (Q-D)                                152.80
Eigenvalues of (Q-D) + c                                *
Ordered diagonal values of Q                         30.46
Inverse eigenvalues of (Q-D)                        119.80
Inverse eigenvalues of (Q-D) + c                     14.47
Eigenvalues of topol. D [21]                         44.82
Squared eigenvalues of topol. D [21]                 31.86
Eigenvalues of topol. D + c [21]                     12.28
Inverse eigenvalues of topol. D + c [21]              4.44
Polynomial coefficients of topol. D [21]            252.04
Eigenvalues of geom. D [17]                           9.61
Eigenvalues of geom. D + c [17]                       6.95
Inverse eigenvalues of geom. D + c [17]               4.43
Polynomial coefficients of geom. D [17]              33.94
Polynomial coefficients of geom. D + c [17]           2.94
Altenburg polynomial                                    *

Notes: +c = a unit column c = 1 is added to the matrix; * = the correlation matrix was singular or nearly singular.

Some erratic predicted values show that a causal relation of these indicators with the observed physical property does not exist at all. The decompositions of the Wiener number (eigenvalues and ordered diagonal values of WTW, Q, etc.) gave deviations greater than 10 °C, which can be considered a random result. The geometric indicators behaved better. Nevertheless, no exact prediction of boiling points was obtained by any set of eigenvalues. The results could be improved, as will be discussed later, by linear combinations, but this was not the aim of this study.

The Altenburg polynomial was remarkable. It failed in the heptane tests because its matrix was singular; it was necessary to remove the constant column of six C-C distances. Then, from a training set of 5 heptane boiling points, the other 4 were calculated with a standard deviation of 3.14 °C. The Altenburg polynomial was therefore tested further.

For calculations of the boiling points of all lower alkanes up to the heptanes, distance matrices of hydrogen-depleted alkanes were used, as is customary. Here a constant column could be added, but the results could not be improved by replacing it in the matrix by the number of hydrogens, that is, by the number of C-H distances. An analysis of the inverse matrices M⁻¹ showed that their first columns have the elements 1, -2, +1 and 0.5, -1.5, +1, respectively. This gives zero as the sum of the first 3 columns of a matrix M. This conclusion was confirmed by replacing methane in the correlations by a dummy compound characterized by a variable boiling point, which weighted the first column. Variations of the weight from -100 to 10 had no effect on the calculated boiling points of the other alkanes.

When all 22 alkanes from methane up to the heptanes were calculated from a training set containing propane through n-heptane and two other heptane isomers, the predicted ethane boiling point was -87 °C and that of methane -141 °C.

Attempts to increase the training sets by introducing the numbers of H-C-H or H-C-C-H distances as indicators characterizing additional compounds failed; the resulting matrices were always singular. The information about these distances is redundant: the numbers of such paths in a molecule are linear combinations of the C-C distances, which is not obvious for branched alkanes. Standard deviations in the whole set were very sensitive to the choice of the training set and were too high to be useful.

A linear combination of the Altenburg polynomial cannot reproduce the boiling points of alkanes exactly. Let us suppose that the set of intermolecular distances exactly determines the boiling points of hexanes. The difference of the boiling points of heptanes against their predecessors must then be a linear combination of the distance vectors. The difference of distances between 2-methylpentane and 2,3-dimethylpentane is the same as between 2,3-dimethylbutane and 2,2,3-trimethylbutane, namely 1*1, 2*2, 3*3, but the differences of the boiling points are 29.51 and 22.86 °C, respectively. This contradicts the assumption of linearity.

Better results are obtained when alkanes are divided into sets according to the length of their chains. The boiling point of 2,2,3-trimethylbutane was predicted from a training set of only 4 lower substituted butanes with an error of 0.61 °C; 3 heptanes predicted from a training set of only 5 substituted pentanes had a standard deviation of 1.58 °C, which is half the standard deviation inside the heptane set itself.

The correlation vectors are:

butane set:  -27.25 + 94.54 n1 - 6.60 n2 + 1.62 n3

pentane set: -21.47 + 76.63 n1 - 5.19 n2 + 1.92 n3 - 1.06 n4

Correlations obtained with one small training set can be improved by a linear combination technique. Choosing other training sets, the boiling points previously predicted with errors are determined exactly (provided that the correlation matrix is not singular), and the mean of the correlation vectors from the different training sets gives a value midway between the predicted and true boiling points.

5. Correlations with all subgraphs

To the distances, or linear chain subgraphs, all subgraphs including the graph itself can be added. The weight obtained for the graph itself is the error of the prediction; it decreases with n, see TABLE III, where the alkanes are divided into the training set and two test sets. At first the values of the branched hexanes were predicted from the lower alkanes matrix, then the values of the branched heptanes.

TABLE III
Values of the additive scheme for lower alkanes

The training set:
Alkane        Value b (°C)
Me              -161.00
Et               233.37
Pr               -26.04
n-Bu              -4.53
n-Pe              -5.20
n-Hex             -3.75
n-Hep             -3.39
i-Bu              10.29
i-Pe               2.40
neo-Pe            -3.90

The test set I:
Alkane        Value b (°C)
2-MePe             0.92
3-MePe             0.83
2,2-diMeBu         0.36
2,3-diMeBu         1.35
Mean and standard deviation: 0.865 +/- 0.352

The test set II:
Alkane        Value b (°C)
2-MeHex            1.63
3-MeHex           -0.02
2,2-diMePe         0.49
3,3-diMePe        -0.56
2,3-diMePe        -1.66
2,4-diMePe         1.56
3-EtPe            -0.71
Mean and standard deviation: 0.104 +/- 1.123

The inverse matrix technique was found to be effective. The standard deviation of the predicted boiling points of the 7 branched heptanes was 1.123 °C from the 13-parameter equation. The number of parameters could be decreased to 10 by using for all branched hexanes their mean value 0.865. Since the other 13 boiling points were reproduced exactly, the standard deviation was 0.375 °C for the whole set. By adjusting some vectors b to decrease the greatest differences, it was possible to improve this result somewhat.

6. Geometric Wiener indices correlations

The inverse matrix technique presumes that all weights x are equal for all molecules, as are the distances dk in the topological Altenburg polynomial. In contrast, the distance weights in the geometric Altenburg polynomial, except the first two, depend on the configurations of the molecules. The distance weights of consecutive bonds are changed by rotations around bonds. This effect can be compared with a neural network: each isomer has its own network, which works slightly differently from the others when the molecule pursues its optimal configuration. The interatomic distances always lie in certain ranges, which are wider at greater distances because unclosed rings can appear which shorten the distances. The coefficients dk of the geometric Altenburg polynomial are arithmetic means of true distances, and a calculation of the geometric Wiener index using the Altenburg polynomial probes this neural network model of molecules. Alternatively, the geometric Wiener index, which correlates with physical properties, can be used as their model. For example, for n-pentane the polynomial has coefficients in the ranges [17]

W = (1.534-1.537)n1 + (2.543-2.574)n2 + (3.138-3.941)n3 + (3.766-5.087)n4

When the lowest distance weights of unsubstituted pentanes, which apply to the (g+g+) conformer (energetically strained compared with the (aa) conformer), are used for the calculation of W of their alkyl substituted derivatives, too high estimates are obtained. Distances between atoms in branched alkanes are smaller than in linear ones. Nevertheless, even in the most strained molecules they cannot be less than the distance between adjacent atoms.

When I tried to calculate the geometric Wiener indices using the inverse matrix technique with the Altenburg polynomial, I obtained, for example, for two different training sets of substituted hexanes the weights

1.7597c - 3.0682n1 + 2.43n2 + 3.210n3 + 6.1963n4 + 5.1825n5

1.2453c + 0.0182n1 + 2.43n2 - 0.905n3 + 8.2538n4 + 5.1825n5

which reproduced the training sets exactly and the test sets satisfactorily, at best when both weight sets were pooled. Because the distance weights must be positive, as shown above, the result with negative weights is physically unacceptable. The identical weights at n2 and n5 in both sets are remarkable.

It is not a fault of the inverse matrix technique that it gives false weights; it is a result of the properties of the system. When the topological Wiener index for pentanes was calculated, the computed weights were exact, provided that the input data were exact. An error of about 6 % in one value (error vector: 2, 0, 0, 0, 0) changed the weight vector dramatically:

correct 1c + 0.9999n1 + 2n2 + 3n3 + 4n4

error 8.5c - 24n1 + 4n2 + 5n3 + 6n4
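
This sensitivity can be reproduced with a small sketch. The training set below (propane through 2-methylbutane) is illustrative and not necessarily the set used in the original calculation; the target b is the topological Wiener index itself, so the exact weights are the distance lengths:

```python
import numpy as np

# Altenburg indicator matrix (constant column c, then n1..n4) and the
# topological Wiener indices of an illustrative alkane training set.
M = np.array([[1, 2, 1, 0, 0],    # propane,        W = 4
              [1, 3, 2, 1, 0],    # n-butane,       W = 10
              [1, 3, 3, 0, 0],    # isobutane,      W = 9
              [1, 4, 3, 2, 1],    # n-pentane,      W = 20
              [1, 4, 4, 2, 0]],   # 2-methylbutane, W = 18
             dtype=float)
b = np.array([4.0, 10.0, 9.0, 20.0, 18.0])

x = np.linalg.solve(M, b)
print(x)        # exact weights: the distance lengths 0, 1, 2, 3, 4

# A single perturbed input value (error vector 2, 0, 0, 0, 0)
# changes the weight vector dramatically:
x_err = np.linalg.solve(M, b + np.array([2.0, 0, 0, 0, 0]))
print(x_err)    # weights become 12, -5, 4, 5, 6
```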

A bundle of multidimensional weight vectors is replaced by a single vector which does not lie inside the bundle but crosses it.

7. Discussion

Randic [11] challenged chemometricians, medicinal chemists and graph theorists to answer the question of what the best k-parameter regressions are. Since the best result in the literature till now was a standard deviation of 2.24 °C, the inverse matrix technique gives good results.

Unfortunately, the standard deviation increases from branched hexanes to branched heptanes, despite the fact that the difference against subgraphs is relatively smaller at heptanes than at hexanes, and the mean difference (experimental boiling points minus predicted ones) is smaller at heptanes than at hexanes, too.

It is necessary to explain this. A plausible answer is that heptanes have greater freedom: their conformations can be more varied, and so are their physical properties. Moreover, they have more roots. The distances in a molecule are not equivalent; they are split into distances from different atoms. Even collective physical properties, as e.g. the boiling point is, can be weighted sums of properties of individual atoms. For monosubstituted alkanes (alcohols), the boiling points correlate with distances from the oxygen atom [38]; the distances between carbon atoms can be neglected.

The development of science is not a straight path. If somebody, say a hundred years ago, could have predicted, from the knowledge of the 7 boiling points of the alkanes from propane up to the heptanes, from their structures, and from the simple fact that ethane has 6 hydrogen-carbon distances and 1 C-C distance, that its boiling point should be about -87 °C, and that methane's boiling point, with its 4 C-H bonds, should be about -141 °C, such a prophecy would have been considered by chemists an accomplishment comparable with the calculation of the position of a new planet. But at that time such computations were difficult, because they are based on inversions of comparatively large matrices. The achievement of today is only technical.

Sophisticated polynomial coefficients and eigenvalues of different matrices representing molecules gave no improvement over the simplest representation of topological distances, the Altenburg polynomial. If correlations with all eigenvalues are poor, they cannot be improved by choosing only some special values, because the discarded values could only mend the correlation or they would have negligible weights.

Boiling points of alkane molecules depend on the topological distances between all their carbon atoms. The number of hydrogen atoms behaves as a constant, but it gives a different weight to the C-C distances in correlations.

Boiling points of alkanes correlate satisfactorily with their Altenburg polynomials only inside narrow sets characterized by the length of the chain. This, together with the results for the modeled property, the geometric Wiener index, leads to the conclusion that the atoms in a molecule behave as a molecular neural network. The weights which each molecule gives to the different inputs, here distances, depend on the structure of the network. Therefore the output properties, here boiling points, are not and cannot be a strictly linear function of the inputs. Trinajstic et al. [20] observed a better correlation of the boiling points of alkanes with the topological Wiener index than with the geometric one. This can be explained by the fact that the topological weights are constant for all isomers and the final correlation with their sum is more uniform.

Molecular networks having a similar structure can be successfully modeled by neural networks, as shown by Kvasnička [39]. The inverse matrix technique gives an opportunity to test whether descriptors are effective before their application in neural networks or in constructions of topological indices. It can even replace these methods.

A somewhat unfortunate finding is that even a successful multiple correlation does not guarantee a physical meaning of the found parameters if different molecules behave differently. Therefore discussions of the physical effects of particular distances on alkane boiling points should be based on arguments other than correlation results alone.

8. References

1. H. Wiener, J. Am. Chem. Soc., 69 (1947) 17-20.

2. A.T. Balaban, D. Ciubotariu and O. Ivanciuc, Commun. Math. Chem. (MATCH), 25 (1990) 41-70.

3. A.T. Balaban, D. Ciubotariu and M. Medeleneau, J. Chem. Inf. Comput. Sci., 31 (1991) 517-523.

4. H. Narumi and H. Hosoya, Bull. Chem. Soc. Jpn. 58, (1985) 1778-1786.

5. Y.Gao and H. Hosoya, Bull. Chem. Soc. Jpn. 61, (1988) 3093-3102.

6. D. Hnyk, Collect. Czech. Chem. Commun. 55, (1990) 55-62.

7. S.C. Basak, G.J.Niemi and G.D. Veith, J. Math. Chem., 7 (1991) 243-272.

8. R. Todeschini, R. Cazar and E. Colling, Chemom. Intell. Lab. Syst., 15 (1992) 51-59.

9. M. Randic, J. Comput. Chem., 12 (1991) 970-980.

10. M. Randic, New. J. Chem., 15 (1991) 517-525.

11. M. Randic, J. Mol. Struct. (Theochem), 233 (1991) 45-59.

12. M. Randic and C.L. Wilkins, Int. J. Quantum Chem., 18 (1980) 1005-1027.

13. B. Jerman-Blazic, M. Randic and J. Zerovnik, QSAR Strategies Des. Bioact. Comp. Proc. Eur. Symp. Quant. Struct. Act. Relat. 5th. 1984, pp. 39-48.

14. J.K. Burdett, Acc. Chem. Res., 21 (1988) 189-194.

15. B. Bogdanov, S. Nikolic and N. Trinajstic, J. Math. Chem., 3 (1989) 299-309.

16. M. Randic, B. Jerman-Blazic and N. Trinajstic, Computers Chem., 14 (1990) 237-246.

17. S. Nikolic, N. Trinajstic, Z. Mihalic and S. Carter, Chem. Phys. Letters, 179 (1991) 21-28.

18. N. Bosnjak, Z. Mihalic and N. Trinajstic, J. Chromat., 540 (1991) 430-440.

19. Z. Mihalic and N. Trinajstic, J. Mol. Struct. (Theochem.), 78 (1991) 65-78.

20. Z. Mihalic, S. Nikolic and N. Trinajstic, J. Chem. Inf. Comput. Sci., 32 (1992) 28-37.

21. Z. Mihalic, D. Veljan, D. Amic, S. Nikolic, D. Plavsic, N. Trinajstic, J. Math. Chem., 11 (1992) 223-258.

22. K. Balasubramanian, Chem. Phys. Lett., 169 (1990) 224-228.

23. K. Balasubramanian, J. Comput. Chem., 11 (1990) 829-836.

24. J.F. Caputo and K.J. Cook, J. Pharm. Res., 6 (1989) 809-812.

25. H. Hosoya, Bull. Chem. Soc. Jpn., 44 (1971) 2332-2339.

26. L. Xu, J. Serb. Chem. Soc., 57 (1992) 483-493.

27. K. Altenburg, Kolloid Z., 178 (1961) 112-119.

28. I. Gutman, T. Körtvélyesi, Z. Naturforsch. 50a (1995) 669.

29. V. Kvasnička, J. Math. Chem., 6, (1991) 63-76.

30. D.H. Rouvray, in A.T. Balaban (Editor), Chemical Applications of Graph Theory, Academic Press, London, 1976, pp. 175-221.

31. M. Kunz, Coll. Czech. Chem. Commun., 54 (1989) 2148-2155.

32. M. Kunz, J. Math. Chem., 13 (1993) 145-151.

33. M. Kunz, J. Math. Chem., 9 (1992) 297-305.

34. J.S. Rutherford, Acta Cryst., B46 (1990) 289-292.

35. B.C. Eichinger, Macromolecules, 13 (1980) 1-11.

36. B.C. Eichinger, Macromolecules, 18 (1985) 211-216.

37. G. Wei and B.C. Eichinger, Macromolecules, 22 (1989) 3429-3435.

38. M. Kunz, J. Chem. Inf. Comput. Sci., in press.

39. V. Kvasnička, Chemometrics III, Brno, 1993.