Original version; the published paper differs in places.

About Metrics of Bibliometrics

MILAN KUNZ

Chemopetrol, Research Institute of Macromolecular Chemistry

65649 Brno, Czechoslovakia

Received May 14, 1992

It is shown that bibliometric incidence matrices can be treated as vectors in nm-dimensional space and characterized by the statistics of their singular values. The case of a personal bibliography is demonstrated.

INTRODUCTION

Without sound metrics, any science trying to measure its object would be lost, like a soul without a living body. Horace knew that there is a measure in all things, but Protagoras did not explain how it came about that the measure of all things is man, and thus it is necessary to arrange conferences about this question1.

Bibliometrics is only one of many sciences sharing the same suffix with biometrics, technometrics, scientometrics, chemometrics, etc., whose aim is to measure its object exactly. It has achieved many successes; some statistical patterns were discovered and proclaimed as laws2. There arose a need for their theoretical and philosophical interpretation.

Haitun3-5 tried to show that human activities differ from their physical base by the infinite moments of their characteristic distribution functions. This is, unfortunately, not true, because humans are not immortal and their work cannot be infinite6,7. Speculations about the character of information lead to Khursin's conjectures8. He built a complete system of scientific hierarchies resembling Smyth's results based on measurements of the Great Pyramid9.

Information theory already has its mathematical ground. Rashewsky10 proposed long ago that information forms a hypersurface in multidimensional space. This technique is already used in coding theory11, in factor analysis of citation studies12-14, and in the analysis of databases15,16, but philosophical and conceptual consequences for information laws were not drawn from the mathematical formalism.

This situation is caused by difficulties connected with the notion of multidimensional spaces. In quantum chemistry their application is common; nevertheless, even specialists have difficulties with their counterintuitive properties17. The long disputations about the localization of microparticles were not reflected in the question: where is information localized?

It should interest readers of this journal that the mathematical formalism describing information, or directly information itself in the form of messages, is identical with the formalism used for describing chemical compounds as graphs18. This is not surprising: information in the form of messages is only a result of physicochemical processes in our brains, a trigger of these processes. And conversely, a chemical compound can have effects similar to a message. If we consider literature as an external memory, it is an extension of the brain which is accessible to direct inspection and thorough analysis.

INCIDENCE MATRICES

The essential problem connected with applications of algebra to the analysis of information strings is rooted in the fact that information strings are noncommutative: p+a+t is different from t+a+p. To overcome this difficulty, we must use a suitable formal representation of information using matrices.

In linear algebra, matrices are linear operators transforming one vector into another:

y = Mx.

We will limit ourselves to the specific case when

x = I

where I is the unit diagonal matrix. Then y is identical with the matrix M itself. The results are elementary but not trivial. A message is an operator whose task is to change somebody's mind.

A string of symbols over an alphabet of n symbols forming a message can be interpreted as a naive matrix N having in each row just one unit vector

ej = (01, 02, ..., 1j, ..., 0n)

corresponding to the given letter j. This string is mapped into the space of words, notions, or names. When some bibliometric analysis is made, we select some vectors, e.g. authors, as characteristic features and count their occurrences in a given set. This can be formalized as a projection of the matrix N onto the unit row vector JT, where the superscript T means transposition. This naive formalism showed19,20 that information is governed by two groups of cyclic permutations Sm and Sn, represented by the unit permutation matrices Pm and Pn acting on the information matrix N independently from the left and from the right, PnNPm. These symmetries can be separated by finding the two quadratic forms

PnTNTNPn and PmNNTPmT

This led to a simple proof that the Boltzmann Hn and Shannon Hm entropy functions are distinct and additive.
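As an illustration, the naive formalism is easy to reproduce numerically (a minimal numpy sketch; the string ACABA anticipates the example given below, and all variable names are ours):

    import numpy as np

    alphabet = ['A', 'B', 'C']
    message = "ACABA"

    # Naive matrix N: one row per position in the message,
    # each row a unit vector e_j marking the letter used.
    N = np.zeros((len(message), len(alphabet)))
    for i, ch in enumerate(message):
        N[i, alphabet.index(ch)] = 1.0

    # Projection onto the unit row vector JT: column sums = letter counts.
    J = np.ones(len(message))
    print(J @ N)      # [3. 1. 1.]  (counts of A, B, C)

    # Columns of a naive matrix are orthogonal, so NTN is diagonal
    # and carries the same counts.
    print(N.T @ N)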

We can continue systematically, building more complicated matrices and corresponding multidimensional spaces. Matrices having in each row just two unit elements, either as sums (ej + ei) or as differences (ej - ei), are known as the incidence matrices of unoriented graphs G or oriented graphs S, respectively. All applications of graph theory in chemistry are connected with them or with their quadratic forms21. Nevertheless, however simple these matrices seem to be, some of their properties were found only recently22.
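For a concrete feel (a hedged sketch; the three-vertex path graph is an arbitrary choice), such incidence matrices can be written down directly from an edge list; the quadratic form STS of the oriented matrix is the familiar graph Laplacian:

    import numpy as np

    # Path graph 1-2-3: two edges on three vertices (0-indexed).
    edges = [(0, 1), (1, 2)]
    n = 3

    G = np.zeros((len(edges), n))   # unoriented: rows e_i + e_j
    S = np.zeros((len(edges), n))   # oriented:   rows e_j - e_i
    for row, (i, j) in enumerate(edges):
        G[row, i] += 1.0
        G[row, j] += 1.0
        S[row, i] -= 1.0
        S[row, j] += 1.0

    # STS is the graph Laplacian: vertex degrees on the diagonal,
    # -1 for each edge off the diagonal.
    print(S.T @ S)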

All statistical linguistics and bibliometrics is based on counting distinctive words in sets of messages. When Lotka counted authors in the Chemical Abstracts Index, he ignored coauthors. This simplification is not possible, e.g., in cocitation studies, where we need to connect papers which are cited together. For that we need matrices having in each row an arbitrary number of nonzero elements, representing in rows papers and in columns authors, or in rows citing papers and in columns cited references. It would be advantageous if we were not limited to unit symbols: just as we express the importance of different words by shouting them, we can express their weight by other numbers.

MATRICES AND THEIR PROJECTIONS

A matrix M with elements mij is a vector in mn-dimensional space, known in statistical mechanics as the phase space. It describes a stochastic system completely, but such a detailed description is too complicated. Either we see details, but then we cannot grasp the whole, or we get a picture of the whole without details. For example, in a system of gas molecules we do not feel individual molecules but only their mean motion, as wind and temperature. In information vectors we can read all the words, but we try to find some parameters characterizing an information system as a whole, for example the mean productivity of individual authors or the importance of different fields. These parameters should be found by statistical treatment of generalized incidence matrices M. As examples, three types of incidence matrices can be given:

NT = | 1 0 1 0 1 |        NTJ = | 3 |
     | 0 0 0 1 0 |              | 1 |
     | 0 1 0 0 0 |              | 1 |

(word ACABA). This matrix has in each column (transposed row) just one nonzero symbol.

MT = | 1 1 1 1 0 |        MTJ = | 4 |
     | 1 0 0 0 0 |              | 1 |
     | 0 0 1 0 1 |              | 2 |

This matrix has in each row from 1 to n unit symbols. This notation is used in music for different simultaneous tones.

M1T = | 0.5 1 0.8 1 0 |        M1TJ = | 3.3 |
      | 0.5 0 0   0 0 |               | 0.5 |
      | 0   0 0.2 0 1 |               | 1.2 |

(weighted matrix M1T). The column sums are always 1. The weights can be equal, as 0.5 + 0.5, or unequal, as 0.8 + 0.2.

A matrix vector is simplified if we find its projections into its subspaces, either into the subspace of its columns or into the subspace of its rows. This is easily done by finding its scalar products with the unit row vector JT, i.e. JTM, and with the unit column vector J, i.e. MJ, respectively. These scalar products are just the column or row sums of matrix elements, as in our examples, where the transposed form was presented.
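In numpy terms (a small sketch reusing the example matrix above; the unit vectors J are modeled as vectors of ones):

    import numpy as np

    MT = np.array([[1, 1, 1, 1, 0],
                   [1, 0, 0, 0, 0],
                   [0, 0, 1, 0, 1]])
    M = MT.T                 # 5 rows (e.g. publications) x 3 columns

    print(np.ones(5) @ M)    # JTM, column sums: [4. 1. 2.]
    print(M @ np.ones(3))    # MJ, row sums: [2. 1. 2. 1. 1.]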

The relations of a matrix vector to both projections are shown in Fig. 1. The original nm-dimensional vector M lies somewhere in the Hilbert space on a sphere with the radius

L = (Σi Σj mij2)1/2

The traces of both quadratic forms MTM and MMT are equal. This is just a result of the rules for multiplication of matrices, or for finding quadratic forms of vectors. Thus

L2(M) = Tr(MTM) = Tr(MMT)

where Tr(M) is the trace of a matrix, the sum of its diagonal elements.
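Both equalities can be verified directly (a sketch with the same example matrix; np.trace sums the diagonal):

    import numpy as np

    MT = np.array([[1, 1, 1, 1, 0],
                   [1, 0, 0, 0, 0],
                   [0, 0, 1, 0, 1]])
    M = MT.T

    L2 = (M ** 2).sum()        # squared length of the matrix vector
    print(L2)                  # 7
    print(np.trace(M.T @ M))   # 7
    print(np.trace(M @ M.T))   # 7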

The difference between the trace vectors and both projection vectors is made up by the off-diagonal elements of both quadratic forms. The diagonal elements and the off-diagonal elements form a right triangle in multidimensional space.

If an information matrix is naive, all its columns are orthogonal and all off-diagonal elements of the quadratic form NTN are zeroes; the right triangle reduces to a straight line. Finding this quadratic form, we transform a message into its statistics: we know which words were used and how many times, but we cannot tell the meaning of the message. If off-diagonal elements exist in the quadratic form, the trace has the same length as the original matrix vector M, but it does not coincide with it. Such a matrix vector is better represented by the eigenvalues of the quadratic forms MTM or MMT (both forms have equal eigenvalues). These eigenvalues are the squares of the singular values of the original matrix M; for brevity, they are referred to simply as singular values below.
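The right-triangle relation itself can be checked numerically: the squared length of a projection equals the trace of the corresponding quadratic form plus the sum of its off-diagonal elements (a sketch, again with the example M):

    import numpy as np

    MT = np.array([[1, 1, 1, 1, 0],
                   [1, 0, 0, 0, 0],
                   [0, 0, 1, 0, 1]])
    M = MT.T

    Q = M.T @ M                        # quadratic form MTM
    MJ = M @ np.ones(M.shape[1])       # row-sum projection

    offdiag = Q.sum() - np.trace(Q)
    print(np.trace(Q), offdiag)        # 7 4   (the two legs, squared)
    print((MJ ** 2).sum())             # 11.0  (the hypotenuse, squared)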

When we speak of the symmetry of information matrices, we rotate an information vector in a fixed coordinate system and consider all matrices obtained by such permutations to be equivalent. They lie on a spherical orbit.

When we search for eigenvalues, we leave the matrix vector in its position and rotate the coordinate system, trying to find such a combination of unit vectors ej in which the matrix M appears as a diagonal vector. This combination is known as the eigenvectors or factors and is explained in any textbook of chemometrics.

There is still another possibility of interpreting the relations between both quadratic forms. We can form an (m + n)-dimensional space joining the m rows and n columns of an information matrix and construct the adjacency matrix A as a block matrix whose diagonal blocks 0 are zero matrices and whose off-diagonal blocks are the matrix M and its transpose. The adjacency matrix is symmetrical, and its quadratic form coincides with its square A2, which splits into two diagonal blocks MTM and MMT:


A = | 0   M |
    | MT  0 |
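A hedged numpy sketch of this block construction (np.block assembles A; the checks confirm the splitting of A2):

    import numpy as np

    MT = np.array([[1, 1, 1, 1, 0],
                   [1, 0, 0, 0, 0],
                   [0, 0, 1, 0, 1]])
    M = MT.T
    m, n = M.shape

    A = np.block([[np.zeros((m, m)), M],
                  [M.T, np.zeros((n, n))]])

    A2 = A @ A
    print(np.allclose(A2[:m, :m], M @ M.T))   # True: upper block is MMT
    print(np.allclose(A2[m:, m:], M.T @ M))   # True: lower block is MTM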

There seems to be a paradox: the matrix MTM, with n rows and columns, corresponds to the projection of the matrix M into m-dimensional space, and the m-dimensional square matrix MMT to the projection of the matrix M into n-dimensional space. This discrepancy is explained by the nature of the elements of both matrices.

If M is the incidence matrix of publications and authors, then in the space of authors the measure of MTM is their publications: on the main diagonal the shares of individual authors appear, and the off-diagonal elements show publications common to given pairs of authors. The measure in the space of publications is their authors. It is well known that in both subspaces it is possible to determine distances between authors or publications as paths in a graph22, which gives local properties of the system described by the matrix M. These distances are connected in an intricate manner with the inverses of both quadratic forms23-25.

It is customary to characterize the position of an information vector by a function. Entries in Chemical Abstracts were indexed, and Lotka26 made statistics from the Authors Index. He counted the number nk of authors having mk publications and then expressed these numbers as a function nk = f(mk), which should describe the matrix vector M, its position in multidimensional space. This approximation is good if we deal with a naive matrix N, whose column sums JTN coincide with the column sums of its quadratic form JTNTN. In the general case, the diagonal values of the quadratic form do not coincide with its eigenvalues. For our examples we have:

MTM = | 4 1 2 |
      | 1 1 0 |
      | 2 0 2 |

The diagonal values are 4, 2, 1; the eigenvalues are 5.40, 1.32, 0.28. For the weighted matrix M1TM1 the diagonal values are similarly 2.89, 1.04, 0.25 and the eigenvalues 2.93, 1.03, 0.23. Here the difference is small, but both sets of values differ from the simple sums. For the unweighted matrix the difference is great enough to be investigated.
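These values are straightforward to reproduce (a sketch entering the quadratic forms as given above; np.linalg.eigvalsh is the symmetric eigensolver):

    import numpy as np

    MTM = np.array([[4, 1, 2],
                    [1, 1, 0],
                    [2, 0, 2]])
    print(np.sort(np.diag(MTM))[::-1])             # [4 2 1]
    print(np.linalg.eigvalsh(MTM)[::-1].round(2))  # [5.4  1.32 0.28]

    M1T = np.array([[0.5, 1, 0.8, 1, 0],
                    [0.5, 0, 0,   0, 0],
                    [0,   0, 0.2, 0, 1]])
    G = M1T @ M1T.T                                # weighted form M1TM1
    print(np.sort(np.diag(G))[::-1])               # [2.89 1.04 0.25]
    print(np.linalg.eigvalsh(G)[::-1].round(2))    # [2.93 1.03 0.23]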

Now comes the point: we know that eigenvalues of matrices characterizing physical objects determine their physical and chemical properties. They are more important than explicit matrix elements. If so, these parameters can be more important also for information systems described by incidence matrices, and therefore distributions of singular values could be more interesting than distributions obtained by direct counts. Studies of distributions of singular values were already made, but for other purposes, to determine the rank of correlation matrices. Here they identify the structure of the information field27,28.

If more than one entry occurs in a row of an incidence matrix of authorships, publications are authored by collectives. There exist many studies of different aspects of collective authorship. We can ask how collective authorship affects the extremely skewed statistical distribution known as the Lotka law. There are essentially three possibilities of treating such matrices: to attribute to each coauthor a full authorship; to weight the authorships, either evenly or unevenly; and, in the extreme, to give the full merit to only one author, as Lotka26 did for practical reasons. Pao recommended the senior author29. An incidence matrix is naivized by such a procedure, but our picture of the system is distorted, and it is necessary to find techniques to overcome this fault.
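A minimal sketch of the even weighting (illustrative data; each publication's unit of credit is split equally among its coauthors, so each row sums to 1, and column sums give the weighted authorship credits):

    import numpy as np

    # Toy publications x authors incidence matrix (invented for illustration).
    M = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [1, 1, 1, 1]], dtype=float)

    W = M / M.sum(axis=1, keepdims=True)   # divide each row by its author count

    print(W.sum(axis=1))   # [1. 1. 1.]  every publication counts once
    print(W.sum(axis=0))   # weighted authorship credit of each author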

In some cases Pao's recommendation is not applicable, as with personal bibliographies. They are connected not by a common subject but by a common author. At the beginning he is usually a junior author, and only later does he become the senior author. Moreover, coauthored publications are shared with the bibliographies of his coauthors.

We use the bibliography of the first 150 publications of Ivan Gutman30. He was a member of the Zagreb group working on eigenvalue problems of chemical graphs. The bibliography matrix is formed by 150 rows (publications) and 32 columns (coauthors). The coauthorship is characterized by the following values:

Number of coauthors       1    2    3    4    5     Σ 32
Number of publications    64   49   26   8    3     Σ 150

The arithmetic mean of the number of coauthors is 1.91. The incidence matrix is sparse; it has only approximately 2 nonzero elements in a row. More than two fifths of the publications Gutman wrote alone, but 48 with his tutor Trinajstic. With 13 coauthors he has only one common publication (see Table 1). There the distributions of unweighted and evenly weighted authorships are shown on a logarithmic scale, together with the corresponding singular values.
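The quoted mean follows directly from the table above (a one-line check in the same numpy style):

    import numpy as np

    authors = np.array([1, 2, 3, 4, 5])
    papers = np.array([64, 49, 26, 8, 3])
    print((authors * papers).sum() / papers.sum())   # 287/150 = 1.913...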

The distribution shows a typical pattern of extremely skewed information distributions. Whether this shape is common for personal bibliographies or specific to an exceptional author cannot be decided without comparisons with other cases. Therefore it is not as important to find some analytical function as to show how the distribution of singular values differs from the distribution of authorships.

Both singular value distributions have a singularity of 5 zero values, and the skewness of the other values is much less pronounced than for the direct counts. This differs from the 13 collaborators having only 1 common publication, but it can be compared with the 5 coauthors whose weighted coauthorship values are only 0.2. They collaborated with Gutman only once, on publications coauthored by 5 authors.

DISCUSSION

An exercise in linear algebra has shown that man is a measure of all things in the subspace of things only. In the subspace of man, things measure the importance of people, be it words, publications, citations, or money. Of course, we must at first relate such subspaces by some incidence matrices. These relations already exist; we are only unable to formulate the corresponding matrices.

Behind the apparent parameters obtained by simple bibliometric counting there exist hidden parameters which can be calculated from the corresponding explicit or implicit matrices. A special branch of mathematical chemistry has studied the problem of eigenvalues of chemical graphs for many decades because of their importance.

If the patterns of science are becoming more and more complicated, then bibliometrics cannot solve its tasks by simplifying its problems, but only by improving its methods. It will soon be obsolete to evaluate statistics obtained by computers with simple correlations, as when results were gathered by tedious hand counts.

When a significant difference between simple counts and singular values was found for a sparse matrix of coauthorships, a much greater difference must be expected for citation matrices with tens of nonzero elements in each row. It is now relatively easy to calculate singular values and to study their distributions. Computers need not be used only for registering millions of chemical compounds and for unveiling their properties; similar techniques could uncover the mysteries of the chemical literature itself.
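For large sparse matrices such computations are indeed routine today; a hedged sketch (assuming scipy is available; the matrix here is random stand-in data, and svds returns the k largest singular values of a sparse matrix):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # Stand-in for a large citing-papers x cited-references matrix.
    C = sparse_random(10000, 5000, density=0.002, random_state=0)

    # The 20 largest singular values; note that svds yields singular
    # values proper, whose squares are the eigenvalues of CTC used
    # in the text. Their distribution can be studied exactly as the
    # authorship counts were above.
    _, s, _ = svds(C, k=20)
    print(np.sort(s)[::-1])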

ACKNOWLEDGMENT

I thank Professor I. Gutman (University of Kragujevac, Serbia) for numerous reprints and Dr. L. Quoniam (Centre de recherche rétrospective, Marseille) for suggesting some improvements of the manuscript and for valuable reprints.

REFERENCES

(1) Elkana, Y.; Lederberg, J.; Merton, R.K.; Thackray, A.; Zuckerman, H., Eds. Towards a Metric of Science; Wiley: New York, 1978.

(2) White, H.D.; McCain, K.W. In Annual Review of Information Science and Technology, Vol. 24; Williams, M.E., Ed.; Elsevier: Amsterdam, 1989; pp 119-186.

(3) Haitun, S.D. Scientometrics 1982, 4, 5.

(4) Haitun, S.D. Scientometrics 1982, 4, 89.

(5) Haitun, S.D. Scientometrics 1982, 5, 375.

(6) Kunz, M. Scientometrics 1988, 13, 25.

(7) Kunz, M. Scientometrics 1990, 18, 179.

(8) Khursin, L.A. Nauchno-Tekhnicheskaya Informatsiya, Ser. 2 1970, 10.

(9) Edwards, I.E.S. The Pyramids of Egypt; Penguin Books: Harmondsworth, 1961.

(10) Rashewsky, N. Bull. Math. Biophys. 1950, 12, 359.

(11) Hamming, R.W. Coding and Information Theory; Prentice Hall: New York, 1980.

(12) Pinski, G.; Narin, F. Information Processing Management 1976, 12, 297.

(13) Noma, E. Information Processing Management 1982, 18, 43.

(14) Noma, E. J. Am. Soc. Inf. Sci. 1984, 35, 29.

(15) Dou, H.; Hassanaly, P. World Patent Information 1991, 4, 223.

(16) Quoniam, L. In La Veille Technologique; Desvals, H., Dou, H., Eds.; Dunod: Paris, 1992; pp 243-262.

(17) Mezey, P.G. Potential Energy Hypersurfaces; Elsevier: Amsterdam, 1987.

(18) Randic, M. J. Chem. Inf. Comput. Sci. 1992, 32, 57.

(19) Kunz, M. Information Processing Management 1982, 18, 43.

(20) Kunz, M. In Problems in Quantum Physics II, Gdansk '89; Mizerski, J., Posiewnik, A., Pykacz, J., Zukovski, M., Eds.; World Scientific: Singapore, 1990; p 377.

(21) Rouvray, D.H. In:

(22) Kunz, M. J. Math. Chem. 1992, 9, 297.

(23) Kunz, M. J. Math. Chem., in press.

(24) Lotka, A. J. Washington Acad. Sci. 1926, 16, 317.

(25) Odda, T. Annals New York Acad. Sci. 1979, 328, 166.

(26) Kunz, M. Collect. Czech. Chem. Commun. 1989, 54, 2148.

(27) Malinowski, E.R. J. Chemometrics 1987, 1, 33.

(28) Quoniam, L.; Dou, H.; Hassanaly, P.; Mille, G. Analusis 1991, 19, 148.

(29) Pao, M.L. Information Processing Management 1983, 21, 305.

(30) Gutman, I. Scientific Publications of Ivan Gutman (1-100), (101-150); personal communication.

(31) Trinajstic, N. J. Math. Chem. 1988, 2, 197.


Figure 1 System of vectors of the information incidence matrix M.

M information vector, a string of words which leads our mind to some state. We can choose just some parts, such as authors, references, or key words, which then replace the original information.

MJ projection of the matrix M into m-dimensional space. If the matrix M is a text, it has in each row just one symbol (word), and then MJ = J. J is the unit vector which forms row sums of matrix elements.

JTM projection of the matrix M into n-dimensional space. Its elements are the column sums of elements of the information matrix M. In both projections we abstract some features of the original information; we get statistics.

Tr(MTM), Tr(MMT) trace vectors of the corresponding quadratic forms. They have the same length as the matrix vector M.

λj vector of singular values of the matrix M, or eigenvalues of both quadratic forms. It is the matrix vector M in rotated coordinates.

(MMT)-1 inverse vectors, if they are finite. Their importance for information vectors has not been investigated.

MJ, Tr(MTM), and the off-diagonal elements of MTM form a right triangle in the Hilbert space. The second triangle is formed by JTM, Tr(MMT), and the off-diagonal elements of MMT.

Table 1 Coauthorship statistics of Ivan Gutman's publications

Logarithmic    Unweighted authorships       Weighted authorships
scale          Matrix sums    Singular      Matrix sums    Singular
log m          JTM            values        JTM1           values

< -9           0              5             0              5
-4 to -9       0              0             0              0
-3             0              0             5              4
-2             0              0             9              8
-1             0              0             4              9
0              13             5             2              1
1              7              2             5              3
2              3              11            4              0
3              5              7             1              0
4              2              1             0              1
5              0              1             1              0
6              1              0             0              0
7              0              0             1              0
8              1              0             0              0

Notes: The first 150 publications30. Extremely skewed information distributions are modeled most simply by the truncated lognormal distribution7; therefore the logarithmic scale is used to form classes according to the sums m of unweighted or evenly weighted authorships. The distribution of coauthorships is modelled satisfactorily, but the singular values show a singularity corresponding to the authors with the lowest degree of collaboration.