A draft of the publication

STATISTICAL ANALYSIS OF A PERSONAL BIBLIOGRAPHY

Milan KUNZ and Ivan GUTMAN

Jurkovičova 13, 63800 Brno, Czech Republic, and

Faculty of Science, University of Kragujevac, Yugoslavia

(Received 1998)

A statistical analysis is performed on a personal bibliography containing 450 items, continuously published over a period of more than two decades. Extremely skewed distributions of journals, coauthors and intervals between consecutive publications are observed. The author whose bibliography was examined comments on the conclusions drawn.

Introduction

A philosophical problem that arose in modern physics, namely that an observation changes the state of the observed system, applies to scientometrics as well. Scientists pragmatically change their behaviour once they become aware that their performance is evaluated by indicators such as counts of publications or citations. In this paper we go a step further: after certain observations are made, the "observed system" changes its state by becoming an active participant in the "observation".

In a recent work, one of the present authors used the other author's bibliography for illustrative purposes in a study of certain bibliographical features. In fact, the list of I.G.'s first 150 published scientific papers was utilized, but otherwise I.G. did not participate in the research in any way. Thus I.G. was the object of that scientific study.

After becoming aware of the paper, I.G. proposed a challenging experiment. First, the examination of his publication list would be repeated, employing the same analytical techniques, but this time using the updated version, which at present contains somewhat more than 450 entries. This would result in certain conclusions about I.G.'s scientific activities. Second, a comparison of the earlier results with the newly acquired ones would imply further conclusions about I.G. and the evolution of his position in, and approach to, research. (Up to this stage I.G. would again be just a passive object of statistical examination.) However, in the third part of the experiment, I.G. would critically evaluate and judge the correctness of the conclusions obtained, point out where (and why) they fail and explain "the story behind the story". Thus he would turn into an active participant in the study.

This paper reports the outcome of our experiment. The first part of the paper was produced solely by M.K., without any interference by the second author. After M.K.'s text was completed, I.G. added his comments to it. It is worth noting that the two coauthors have never met in person, nor had they any previous scientific cooperation.

Theoretical background

Bibliographies of prolific authors seem to have remained outside the scope of statistical studies; only changes in the customs of writing headlines have been evaluated, by means of a sophisticated entropy measure.

A personal bibliography can be compared with any other bibliography; what defines it is its subject (the author) rather than an object. Thus, extremely skewed distributions of coauthors can be expected. Nevertheless, this depends on the habits of the main author: if he works with a stable staff, the distribution can be narrow.

I. Gutman's bibliography was used primarily for studying the effect of multiple authorship on the difference between simple counts of authorships and the eigenvalues of the quadratic form of the adjacency matrix of the bipartite graph relating papers and authors. The use of eigenvalues may be regarded as a practical joke, because the scientific activities of I.G. happen to be connected with the eigenvalues of matrices of chemical compounds.

A list of publications forms a matrix, e.g.:

Number   Authors      Journals   Chronology   Difference
                                 (Days)       (Dt)
1        1 1 1 0 0    1 0 0          1            -
2        1 1 0 1 0    0 1 0         31           30
3        1 0 0 0 0    1 0 0         73           42
4        1 0 0 0 1    0 0 1        205          132

The submatrix "Authors" will be denoted by G. It can be weighted so that the weights of a paper with several authors sum to one. Until now, statistical studies have used the column sums of the incidence matrices G. For unweighted matrices, these are identical with the diagonal elements of the corresponding quadratic form G^T G. There are two such quadratic forms, and they map the original matrix into different subspaces: the subspace of authors, where the elements of G^T G count publications, and the subspace of publications, where the elements of G G^T count authors.

Both quadratic forms can be divided into a diagonal matrix V of vertex degrees and a matrix of off-diagonal elements, which is known as the adjacency matrix A.
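The construction can be illustrated on the four-paper example matrix given above. The following is only a minimal sketch using NumPy; the variable names are ours, and the assumption that the first column of "Authors" is the main author is an interpretation of the example, not something the example states.

```python
import numpy as np

# Submatrix "Authors" of the example: rows = publications 1..4,
# columns = authors (assumed: the first column is the main author).
G = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
])

# The two quadratic forms.
author_form = G.T @ G  # 5 x 5: diagonal = papers per author, off-diagonal = joint papers
paper_form = G @ G.T   # 4 x 4: diagonal = authors per paper, off-diagonal = shared authors

# For an unweighted 0/1 matrix the column sums equal the diagonal of G^T G.
assert np.array_equal(G.sum(axis=0), np.diag(author_form))

# Split a quadratic form into vertex degrees V and adjacencies A.
V = np.diag(np.diag(author_form))
A_authors = author_form - V

# Weighted variant: each paper contributes a total weight of one.
G_weighted = G / G.sum(axis=1, keepdims=True)

print(np.diag(author_form))  # papers per author
print(np.diag(paper_form))   # authors per paper
```

The weighted matrix G_weighted is shown only to make the normalization concrete; the quadratic forms discussed in the text are those of the unweighted G.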

Alternatively, the matrix G and its transpose G^T can be written in block form as the adjacency matrix A of a bipartite graph of adjacencies between publications and authors:

\[
A = \begin{pmatrix} 0 & G \\ G^{T} & 0 \end{pmatrix}
\]

where 0 stands for the zero matrix.

Both quadratic forms appear as the diagonal blocks of the second power of the adjacency matrix, A^2:

\[
A^{2} = \begin{pmatrix} G G^{T} & 0 \\ 0 & G^{T} G \end{pmatrix}
\]

The adjacencies (the off-diagonal elements) distort the simple counts obtained as column sums. The matrices are therefore characterized by their eigenvalues and eigenvectors. Eigenvalues are obtained when a matrix is transformed into diagonal form by multiplication with its eigenvectors. The sum of the eigenvalues is equal to the trace of the matrix, i.e., to the sum of its diagonal elements. The eigenvalues of a squared matrix are the squares of the eigenvalues of the original matrix. Consequently, the nonzero eigenvalues of A are, apart from sign, the singular values of the incidence matrix G (G itself is not symmetric and, in general, not square, so its own eigenvalues, where defined, may be complex). The adjacency matrix A has zero trace, therefore the sum of its eigenvalues must be zero. Because it is the matrix of a bipartite graph, its spectrum is symmetric: the nonzero eigenvalues appear in pairs of opposite sign. Both diagonal blocks of A^2 have the same eigenvalues (except for the zero eigenvalues).
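These spectral properties can be checked numerically on the small example matrix. The sketch below assumes NumPy and reuses the four-paper, five-author example; it is an illustration only and not the computation performed for the actual bibliography.

```python
import numpy as np

# Example incidence matrix (publications x authors) from the text.
G = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
], dtype=float)
m, n = G.shape

# Adjacency matrix of the bipartite graph publications-authors.
A = np.block([
    [np.zeros((m, m)), G],
    [G.T, np.zeros((n, n))],
])

# The square of A is block diagonal with the two quadratic forms.
A2 = A @ A
assert np.allclose(A2[:m, :m], G @ G.T)
assert np.allclose(A2[m:, m:], G.T @ G)
assert np.allclose(A2[:m, m:], 0.0)

# A is symmetric, so eigvalsh applies; the result is sorted ascending.
eig_A = np.linalg.eigvalsh(A)
singular_values = np.linalg.svd(G, compute_uv=False)  # sorted descending

# Zero trace and a spectrum symmetric about zero.
assert np.isclose(eig_A.sum(), 0.0)
assert np.allclose(eig_A, -eig_A[::-1])

# The largest eigenvalues of A coincide with the singular values of G,
# and their squares are the nonzero eigenvalues of both quadratic forms.
assert np.allclose(eig_A[::-1][:min(m, n)], singular_values)
eig_author_form = np.linalg.eigvalsh(G.T @ G)[::-1][:min(m, n)]
assert np.allclose(eig_author_form, singular_values ** 2)
```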

The distribution of the eigenvalues of the quadratic form of the submatrix "Authors" must be more skewed than that of the simple counts, because in the limiting case (all m publications coauthored by all n authors) there are (n-1) zero eigenvalues and a single eigenvalue equal to mn. Since some eigenvalues are less than unity and some may be equal to zero, the truncation of the observed authorships, which causes difficulties when Lotka-type distributions are evaluated by means of the lognormal distribution, is removed. Another singularity, formed by the zero eigenvalues, then appears.
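The limiting case can be verified directly. In the sketch below the sizes m = 6 and n = 4 are arbitrary illustrative values, not data from the bibliography.

```python
import numpy as np

# Hypothetical limiting case: every one of m papers is signed by all n authors.
m, n = 6, 4
G_full = np.ones((m, n))

# Quadratic form in the subspace of authors: every entry equals m.
author_form = G_full.T @ G_full

# Eigenvalues in ascending order: (n - 1) zeros and a single value m * n.
eigenvalues = np.linalg.eigvalsh(author_form)
print(np.round(eigenvalues, 6))

# The simple counts (column sums) are all equal to m, i.e. not skewed at all,
# while the spectrum concentrates the whole trace m * n in one eigenvalue.
simple_counts = G_full.sum(axis=0)
```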

Calculating the eigenvalues and eigenvectors of large matrices (of order a few hundred or more) is a difficult task that is hard to accomplish on commonly available personal computers. The original list of 150 publications with about 30 coauthors was about the maximum that could be handled. For the new list it was necessary to use special facilities.

"Chronology" is another form of indexing publications, in which they are registered according to the dates when they were received by editors. The differences, i.e. the time intervals between consecutive publications, are irregular. If each day is coded as one (a paper was received) or zero (none was), the intervals count the days with zeros between ones. Therefore, these counts obey the negative binomial distribution. Put otherwise, they form a distribution inverse to the Lotka distribution.

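The relation between the "Chronology" column, the "Difference" column and the day-by-day record of zeros and ones can be sketched as follows. The day numbers are those of the four-paper example; the binary record is our own illustrative encoding.

```python
import numpy as np

# "Chronology" column of the example: day numbers on which papers were received.
days = np.array([1, 31, 73, 205])

# "Difference" column: intervals between consecutive papers.
intervals = np.diff(days)
print(intervals)

# Day-by-day record: one on a day when a paper was received, zero otherwise.
record = np.zeros(days.max(), dtype=int)
record[days - 1] = 1

# Each interval is one more than the run of zeros between two consecutive ones.
zero_runs = np.diff(np.flatnonzero(record)) - 1
assert np.array_equal(zero_runs + 1, intervals)
```

With this encoding, papers received on the same day cannot be told apart in the binary record, so a count per day would be needed to keep the zero intervals discussed later in connection with Table 4.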
Observed values

The statistical analysis reported in this work is based on two lists of publications of Ivan Gutman, containing his first 150 papers (completed between September 1971 and October 1980) and 450 papers (completed until March 1993). In what follows we refer to them as lists A and B, respectively.

In Table 1 the papers are classified according to the number of coauthors. They are compared with a list obtained from an accidental sample. Gutman's degree of cooperation is similar to the usual degree of coauthorship in the field. A somewhat higher ratio of one-man papers corresponds to his interest in mathematical problems. It seems that, after his formative years, his willingness to work together with others is increasing: while the number of papers increased three times, the list of his coauthors increased four times. [See comment A.]

Table 1. Statistical data of I.G.'s bibliography: coauthorships; "list A" and "list B" pertain to the first 150 publications and to the updated list of 450 publications, respectively; the comparative sample was obtained by an accidental count from J. Chem. Inf. Comput. Sci., J. Chemometrics, J. Comput. Chem., Commun. Math. Chem. (MATCH) and a chemometrics conference proceedings.

Number of coauthors     1     2     3     4     5   more
list A                 64    44    26     8     3     0
list B                174   149    86    30     7     4
normalized to 150      58    50    29    10     2     1
comparative sample     48    56    32    21    15     6
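The row "normalized to 150" can be reproduced from the list B row by simple rescaling. The sketch below also computes the share of one-man papers in list B and in the comparative sample, which is the quantity behind the remark on the "somewhat higher ratio of one-man papers"; the variable names are ours.

```python
# Rows of Table 1 (columns: 1, 2, 3, 4, 5, more).
list_b = [174, 149, 86, 30, 7, 4]
comparative = [48, 56, 32, 21, 15, 6]

# Rescale list B (450 papers) to the size of list A (150 papers).
scale = 150 / sum(list_b)
normalized = [round(count * scale) for count in list_b]
print(normalized)

# Share of papers in the first column (one-man papers).
single_share_b = list_b[0] / sum(list_b)
single_share_comparative = comparative[0] / sum(comparative)
print(f"list B: {single_share_b:.1%}, comparative sample: {single_share_comparative:.1%}")
```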

 

The distribution of coauthors is compiled in Table 2. I.G. is a professor at a provincial university in Yugoslavia. His specialty is the application of graph theory in chemistry. He started to publish with his tutor, but very soon he left his native nest and traveled through the world.

Table 2. Statistical data of I.G.'s bibliography: distribution of coauthors; the second and third rows give the number of scholars (other than I.G.) who occur as coauthors in exactly i papers

i         1   2   3   4   5   6   7   8  10  13  14  16  19  21  39  48  53
list A   13   7   2   1   2   2   1   0   1   1   0   0   0   0   0   1   0
list B   66  28  13   7   5   5   4   1   0   1   2   1   1   1   1   0   1

In nine consecutive years of cooperation with his tutor Trinajstic, he published 156 papers (yearly mean 17.3), and in the following eleven years 271 (yearly mean 24.6). [See comment B.] He was able to cooperate with prominent scientists in the field, but all such cooperations were short, except for his joint work with a Norwegian group, where a mutual interest existed. [See comment C.] The distribution of his coauthors is extremely skewed, but its tail is not as long as could be expected from the original count.

The development of the ratio "number of coauthors/number of publications" has an interesting form. The ratio relates the size of the group the author cooperates with to his productivity and to his capability to gain new coworkers. It started at the value 3, then decreased to approximately 1.94 after the formative years (list A) and later increased to about 2.06 (list B). He could find new coauthors with the same vigor as new themes, keeping pace with his prolificacy. This is only partially a result of his visits abroad, because there are many Yugoslav names in his list. It seems that he was able to work even with undergraduate students and guide them to scientific work. On the other hand, he has not yet found his successor. [See comment D.]

The distribution of coauthors is as good or as bad as any Lotka-type one. Similarly, as in the analysis of texts, where we get an extremely skewed distribution of words, I. G. got, living his biography, an extremely skewed distribution of coauthors. He is, maybe, able to explain each case, but it is doubtful that he could explain the final shape. [See comment E.]

The distribution of journals is shown in Table 3. I.G. published not only in national chemical or mathematical journals and in international specialized journals, but also tried to increase his visibility by sending papers to diverse exotic journals, despite the difficulties connected with their differing formal requirements. [See comment F.] It would be interesting to compare this behaviour with that of other authors, but it seems to be unique.

Table 3. Statistical data of I.G.'s bibliography: distribution of publications between journals; the second and third rows give the number of journals occurring exactly i times in the lists A and B, respectively

i         1   2   3   4   5   6   7   8   9  10  11  12  13  15  20  22  29
list A   16   4   3   0   1   1   1   0   0   0   2   1   1   0   0   1   1
list B   25   4   5   3   3   3   2   1   1   1   0   1   0   1   1   2   1

i        34  41  47  58   Books   Chapters
list A    0   0   0   0     -        -
list B    1   1   1   1     6       10

The distribution is skewed, but the supply of journals is smaller than that of coauthors. Nevertheless, I.G. roughly doubled his list of journals.
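Table 3 can be cross-checked by summing its rows: the row sums give the number of distinct journals, and the sums weighted by i give the number of journal papers. A minimal sketch, with variable names of our own choosing:

```python
# Table 3: number of journals occurring exactly i times (both parts of the table).
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 20, 22, 29, 34, 41, 47, 58]
list_a = [16, 4, 3, 0, 1, 1, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
list_b = [25, 4, 5, 3, 3, 3, 2, 1, 1, 1, 0, 1, 0, 1, 1, 2, 1, 1, 1, 1, 1]
books_b, chapters_b = 6, 10

# Number of distinct journals in each list.
journals_a = sum(list_a)
journals_b = sum(list_b)

# Number of journal papers: each journal contributes i papers.
papers_a = sum(i * count for i, count in zip(times, list_a))
papers_b = sum(i * count for i, count in zip(times, list_b))

print(journals_a, journals_b, journals_b / journals_a)
print(papers_a, papers_b + books_b + chapters_b)
```

Counted this way, list B contains 58 journals against 32 in list A, and its journal papers together with the books and chapters add up to the 450 items of the bibliography.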

Next we examined the time intervals between consecutive submissions of papers for publication. Because these data are not available, we used instead the dates when the papers were received by the respective journals. [See comment G.]

The distribution of these intervals is shown in Table 4. This is an inverse of both previous distributions. The scale in Table 4 is logarithmic, and there is a visible peak with a slight singularity for simultaneously received publications. I.G.'s tendency to submit more than one paper simultaneously was more pronounced at the early stages of his career (list A). [See comment H.]

Table 4. Statistical data of I.G.'s bibliography: distribution of intervals between consecutive submissions for publication (list B)

days       0    1    2   3-4   5-8   9-16   17-32   33-64   more
number    21   20   25    43    71    130      91      41      7

Because the exact dates of receipt were sometimes not recorded, such dates were arbitrarily set midway between the nearest preceding and the nearest subsequent dates. The distribution is slightly bimodal, as previously observed in the case of patents. Some publications are produced in batches: either a long one is split into several parts, or a solution is fruitful and can be applied to more problems. [See comment I.] The arithmetic mean of the consecutive intervals is 17.5 days. Only 7 intervals were longer than 64 days.
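The figures of Table 4 can be checked for consistency with the stated mean. The sketch below also divides each count by the width of its class, which shows how the logarithmic scale shapes the peak; the class widths are read off the table header and the variable names are ours.

```python
# Table 4: intervals (in days) between consecutive submissions, list B.
classes = ["0", "1", "2", "3-4", "5-8", "9-16", "17-32", "33-64", "more"]
counts = [21, 20, 25, 43, 71, 130, 91, 41, 7]
widths = [1, 1, 1, 2, 4, 8, 16, 32]  # days per class; the open class "more" has none

# 450 papers give 449 consecutive intervals.
n_intervals = sum(counts)

# Total time span implied by the stated arithmetic mean of 17.5 days.
mean_days = 17.5
span_years = n_intervals * mean_days / 365.25

# Class with the highest count on the logarithmic scale.
modal_class = classes[counts.index(max(counts))]

# Counts per single day within each closed class.
per_day = [count / width for count, width in zip(counts, widths)]

print(n_intervals, round(span_years, 1), modal_class)
print([round(value, 2) for value in per_day])
```

The implied span of about 21.5 years agrees with the period from September 1971 to March 1993 covered by list B.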

It was not surprising that the least productive month was August, but the low score of April (see Table 5) was unexpected. It does not correlate with the frequency of zero productivities, which was highest in summer. [See comment J.]

Table 5. Statistical data of I.G.'s bibliography: distribution of submissions for publication between months (list B)

month      1    2    3    4    5    6    7    8    9   10   11   12
number    39   50   47   27   40   31   32   23   44   41   39   37
zeros      5    5    2    3    4    4    6    8    3    4    4    2
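One way to quantify the comparison of the two rows of Table 5 is an ordinary Pearson correlation coefficient. This is only a sketch and not the procedure used in the analysis above; reading the row "zeros" as the number of years in which the month had no submission is our assumption.

```python
import numpy as np

# Table 5: submissions per calendar month (list B) and the row "zeros"
# (assumed: number of years in which the month had no submission).
months = np.arange(1, 13)
number = np.array([39, 50, 47, 27, 40, 31, 32, 23, 44, 41, 39, 37])
zeros = np.array([5, 5, 2, 3, 4, 4, 6, 8, 3, 4, 4, 2])

# The two least productive months.
least_productive = months[np.argsort(number)[:2]]

# Pearson correlation between monthly totals and the frequency of zero months.
r = np.corrcoef(number, zeros)[0, 1]

print(number.sum(), least_productive, round(r, 2))
```

With only twelve data points such a coefficient is a rough descriptive summary and should not be over-interpreted.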

Comments by the "observed system"

A. There are several reasons for cooperation between scientists that eventually result in coauthorship. Some are plainly rational: work in research teams, helping younger colleagues to start their own scientific activity, cooperation based on (and required by) joint research grants, etc. Sometimes, however, the motivation is quite different: friendship, acquaintance with people from far-away countries (many of whom the author will never see), the wish to have a joint paper with a scientific celebrity, as well as some less-easy-to-confess ones, such as sex, money and kinship. In the case of I.G. a dominant motive for joint publications is simply the joy of having joint publications.

B. To avoid misunderstanding, in the first nine years I.G. published jointly with his tutor "only" 51 papers. In the subsequent eleven years they have only two more joint publications. The reasons for such a dramatic change are complex, but can be rationalized by the simple observation that every original researcher must one day leave his teacher and endeavour to find his own place in science.

C. Drawing conclusions about the length and depth of a cooperation solely from the number of published joint papers is misleading. Some successful cooperations have lasted longer than two decades, but produced "only" 3-5 joint papers. There are cooperations which did not result in a single publication, but were nevertheless quite useful (scientifically) for both parties.

D. The latter two inferences are perfectly correct. Both are immediate consequences of I.G.'s appointment at a "provincial" university in which teaching is mainly at undergraduate level. Ambitious and talented young men are usually attracted to more prestigious academic centers.

E. From I.G.'s point of view, the relation between him and any of the numerous coauthors is too personal and too complex to be described by means of a distribution curve. Every coauthorship is a long and separate story with its own peculiarities. Thus I.G. is neither able nor inclined to try to "explain the final shape" of the "distribution of coauthors".

F. Publishing papers in the most outstanding scientific journals (often referred to as "international") is the ambition of every scientist. For different reasons a part of his production is directed towards domestic ("national") journals. Scientists having many publications in both international and national journals may, sometimes, afford to send some of their writings to "obscure" or "exotic" journals. I.G. very much enjoyed having papers published in Afrikaans, Bulgarian, Chinese, Finnish, Hungarian, ..., although he was well aware that this would not at all "increase his visibility".

G. Conclusions based on the dates when a paper is received in the editorial office of a journal have a serious methodological pitfall. First, papers are sometimes submitted for publication long after the respective research has been completed. (In I.G.'s case the biggest such delay was 11.5 years.) Second, papers are often rejected by one journal and eventually resubmitted to, and published by, another. In such cases the first date of submission is lost and the recorded date may be shifted relative to the "true" value by many months or even years.

H. In view of what is explained in point G, the occurrence of "singularities" caused by "simultaneously received publications" should be considered as a mere coincidence without any statistical (or any other) significance.

I. A more prosaic reason why "some publications are produced in batches" is that in certain periods of the year the author may be too busy (because of teaching, for instance) to be able to write down his results. Therefore these results slowly accumulate and, at a convenient moment, they are released in "batches". In I.G.'s case this often happened when he was visiting Western Europe or USA, where the conditions for work were much superior to what he had at his home University.

J. These results of the statistical analysis can be rationalized by realizing that at Yugoslav universities the teaching activity reaches its peak around April. The greatest values in Table 5 are found around the ends of the summer and winter vacations (September and October, February and March). This agrees with the claim that the bulk of the author's writing of scientific papers (which should not be identified with his scientific research!) is done during the school vacations.

K. As a final comment we wish to point out that a reliable insight into an individual's scientific activity (even if it consists of several hundreds of bibliographic units) cannot be achieved solely by means of statistical analysis. Such an analysis is certainly more valuable when larger groups of individuals are examined. In the case of a single person some of the inferences reveal properties of that person which were concealed otherwise. Most of the inferences, however, are those which are expected to hold for (almost) everybody in the group of individuals examined (in our case: for scientists). The analysis is unable to reproduce, or even to account for, most of the individual deviations from a "standard" behaviour. Sometimes the analysis oversimplifies the actual situation and "forces" the individual to fit into some pre-established patterns. Therefore, a great deal of caution is needed with the interpretation of the results of such an analysis.

In spite of all this, there is no doubt that a significant number of details of a scientist's life, work, conditions for work, cultural and political environment, style, interests, affinities, motivations, psychology, etc. can be reconstructed by a statistical analysis of his scientific production. The present example of such an analysis of a personal bibliography may thus be considered a clear illustration of the methods, scope and potential of scientometrics.

References

1. M. KUNZ, About Metrics of Bibliometrics, Journal of Chemical Information and Computer Sciences, 33 (1993) 193.

2. S. D. HAITUN, On a Method of Investigating Scientific Bibliography - the Case of Publications by Planck and Einstein (in Russian), in: Chelovek nauki, Nauka, Moskva, 1974, pp. 214-228.

3. M. KUNZ, A Matrix Theory of Information, in: H. Kretschmer (Ed.), Fourth International Conference on Bibliometrics, Informetrics and Scientometrics, September 11-15, 1993, Berlin, Book of Abstracts, Part I.

4. M. KUNZ, Can the Lognormal Distribution be Rehabilitated?, Scientometrics, 18 (1990) 179.

5. M. KUNZ, Time Spectra of Patent Information, Scientometrics, 11 (1987) 163.

6. D. de BEAVER, R. ROSEN, Studies in Scientific Collaboration. Part I. The Professional Origins of Scientific Co-authorships, Scientometrics, 1 (1978) 65.