Fourier Analysis of a Personal Bibliography.

Milan Kunz

Jurkovičova 13, 63800 Brno, The Czech Republic

Zdeňek Rádl

Rennenská 2, 60200 Brno, The Czech Republic

Ivan Gutman

Faculty of Science, University of Kragujevac, Yugoslavia

Received

The distribution of the time intervals between consecutive items in I.G.'s publication list (450 publications) is well described by the Weibull distribution. This distribution also characterizes the distances between some symbols in his texts. Fourier analysis gave frequencies corresponding to the range of mean intervals between consecutive publications and to some longer intervals. It could be speculated that the short-period frequencies were induced by the author and the longer ones by the academic surroundings, but the significance of the frequencies is not confirmed by statistical tests.

1. INTRODUCTION

Most chemical compounds have been prepared and described only once; very few are studied intensively. Such patterns are considered characteristic of information distributions. Understanding all properties of information could be important for solving problems related to chemistry, such as the mystery of the genetic code. For this purpose, new techniques must be used.

Among the many chemical applications of topological indices, the Wiener index has a special role. Its importance was explained by the fact that, for acyclic molecules, this sum of distances between atoms is equal to traces of inverse matrices of quadratic forms of incidence matrices.

In strings of symbols, the distances between symbols of one kind can be considered to follow the negative binomial distribution, and their set the negative polynomial distribution. A hat-shaped distribution of time intervals between consecutive patents in the polymer branch was found (4).

Statistical counts were made of the personal bibliography of one of the authors (I.G.), containing 450 items. Its first 150 items were used as an example before. Fourier analysis was then applied to the data. This method was used recently by McGrath for investigating the life beat of a library (5). In addition to Fourier analysis, some other tests were made, especially for establishing the form of the distribution of intervals between consecutive publications.

Fourier analysis

A function describing a time series can be decomposed into a sum of sinusoidal waves with different frequencies and amplitudes. The observed series of n data is expressed as a sum of periodic components, each consisting of a cosine part and a sine part. The combined amplitude of the two parts determines the weight of the period; the inverse of the period is the frequency.
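The decomposition can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure implemented in the program used by the authors: the function name `periodogram`, the scaling of the weight (squared amplitude divided by n) and the demonstration series are all hypothetical choices.

```python
import math

def periodogram(series):
    """Return (period, weight) pairs for the Fourier frequencies k/n, k = 1..n//2."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the constant term
    result = []
    for k in range(1, n // 2 + 1):
        # cosine and sine parts of the component with frequency k/n
        a = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        b = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        weight = (a * a + b * b) / n  # assumed scaling; conventions differ
        result.append((n / k, weight))  # period is the inverse of the frequency
    return result

# Hypothetical test series: a pure cosine with a period of 8 samples
demo = [math.cos(2 * math.pi * t / 8) for t in range(64)]
best_period, best_weight = max(periodogram(demo), key=lambda pw: pw[1])
```

For the artificial series all of the weight falls on the component with period 8, so the largest weight identifies the period that was built into the data.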

The first question to be solved is the admissibility of Fourier analysis for the given task. Its use implies that authors are regarded as emitters of publishable information. Our experience with this technique comes from chemistry, where it is used for studying objects whose spectrum does not change in billions of years. Unfortunately, men do not have such convenient properties. Another difference is that molecules emit different waves simultaneously, whereas authors "emit" their publications consecutively, but this distinction vanishes in a record.

By analogy with chemical examples, the frequencies could be divided into inner frequencies, depending on the capabilities of the author, his field and the amount of work needed for creating a standard publication, and conditioned frequencies, depending on the immediate surroundings.

We applied Fourier analysis in two ways:

a) to publication counts in time periods,

b) to the time-intervals between consecutive publications, counted in days.

We used the commercially available program "Statgraphics" from Statistical Graphics Corporation to analyze the data and to find the weights of all frequencies.

Since daily counts were mostly empty (I.G. had about 20 publications per year), it was necessary to filter them and collect the data into longer periods. We tried weekly, fortnightly, monthly and quarterly periods. Monthly intervals were defined as 365 days/12 periods. The two analyses are reciprocal, since the counts of publications are an inverse function of the intervals.
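The two kinds of input series described above can be prepared as in the following sketch. The function names and the five demonstration dates are hypothetical, and counting periods from the first date is an assumption; the original data handling is not described in more detail.

```python
from datetime import date

def intervals_in_days(dates):
    """Intervals between consecutive publications, counted in days."""
    ordered = sorted(dates)
    return [(b - a).days for a, b in zip(ordered, ordered[1:])]

def counts_per_period(dates, period_days):
    """Publication counts in consecutive periods, starting at the first date."""
    ordered = sorted(dates)
    start = ordered[0]
    span = (ordered[-1] - start).days
    counts = [0] * (int(span // period_days) + 1)
    for d in ordered:
        counts[int((d - start).days // period_days)] += 1
    return counts

# Hypothetical submission dates, for illustration only
demo_dates = [date(1990, 1, 1), date(1990, 1, 1), date(1990, 1, 4),
              date(1990, 1, 25), date(1990, 3, 1)]

intervals = intervals_in_days(demo_dates)          # [0, 3, 21, 35]
weekly = counts_per_period(demo_dates, 7)          # one-week windows
monthly = counts_per_period(demo_dates, 365 / 12)  # month = 365 days / 12
```

In the weekly series, the two publications received on the same day and the one received three days later fall into the same window, while the long gaps appear as runs of zero counts.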

Intervals between publications can be understood as a measure of the work needed for producing one publication. This work is, of course, not uniform, but differs randomly from the mean. Each author has different abilities and puts a different amount of inventive energy into his output.

The waves obtained are necessarily connected with the intervals between consecutive publications. Publications received by the publishers on the same day do not induce any wave. Publications received within 1 to 6 days of each other are counted in the same or the next week and are also ineffective. Longer periods between submissions induce weeks with zero counts, which are the lows of the resulting waves, provided that such longer intervals do not appear randomly but are induced, e.g., by the academic year. It can be expected that the intervals with the highest occurrences could induce corresponding frequencies even if these intervals were distributed in a random order.

Results

The distribution of intervals between I.G.'s 450 publications is well described by the Weibull distribution with parameters

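$$x = a \left( \ln \frac{1}{1-P} \right)^{1/c}$$

where P is the cumulative probability obtained from i and n, i is the cumulative frequency, n the number of publications, and 1/c = 0.966459 (s = 0.08276).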

The linearity of the fit is good; its correlation coefficient is r = 0.09967. Nevertheless, a lognormal distribution would also be acceptable if the singularity (zeroes) were removed by adding 1 to all values.
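A linearized Weibull fit of this kind can be sketched as follows. Taking logarithms of the Weibull relation gives ln x = ln a + (1/c) ln ln[1/(1-P)], so 1/c is the slope of a straight line. The plotting position P = i/(n+1), the skipping of zero intervals and the function name `weibull_fit` are assumptions made for this illustration; the source does not state how these points were handled.

```python
import math

def weibull_fit(intervals):
    """Least-squares fit of ln x against ln ln[1/(1-P)]; returns (a, 1/c, r)."""
    xs = sorted(intervals)
    n = len(xs)
    u, v = [], []
    for i, x in enumerate(xs, start=1):
        if x <= 0:
            continue  # assumption: zero intervals are skipped, ln(0) is undefined
        p = i / (n + 1)  # assumed plotting position
        u.append(math.log(math.log(1.0 / (1.0 - p))))
        v.append(math.log(x))
    m = len(u)
    mean_u, mean_v = sum(u) / m, sum(v) / m
    s_uu = sum((ui - mean_u) ** 2 for ui in u)
    s_vv = sum((vi - mean_v) ** 2 for vi in v)
    s_uv = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    slope = s_uv / s_uu                      # estimate of 1/c
    intercept = mean_v - slope * mean_u      # estimate of ln a
    r = s_uv / math.sqrt(s_uu * s_vv)        # correlation coefficient
    return math.exp(intercept), slope, r

# Hypothetical data: exact Weibull quantiles with a = 16 and 1/c = 0.9
demo = [16.0 * math.log(1.0 / (1.0 - i / 51.0)) ** 0.9 for i in range(1, 51)]
a_hat, inv_c_hat, r = weibull_fit(demo)
```

Because the demonstration data lie exactly on the Weibull line, the fit returns the parameters used to generate them; real interval data would scatter around the line.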

The Weibull distribution is known from the statistical theory of reliability and appeared here, as noted by Feller (6), somewhat secretly.

The mean interval length changed during the 20-year period. We divided I.G.'s career into two parts. In the first part (9 years) the mean of the intervals was calculated to be 21.05 days, in the second part (11 years) it was 14.81 days, the common mean being 16.22 days. The cumulative chart of publications revealed that the increase in the number of publications was remarkably regular and slightly concave, and that it can be approximated by two linear segments that intersect at 12.89 years. In the first part the yearly output was 18.62 publications (interval length 19.60 days), in the second part the yearly output was 29.52 publications (interval length 12.36 days).
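As a check on the figures for the two linear segments, the quoted interval lengths follow from the yearly outputs if a 365-day year is assumed (the divisor is an assumption; it reproduces the quoted values):

$$\frac{365}{18.62} \approx 19.60\ \text{days}, \qquad \frac{365}{29.52} \approx 12.36\ \text{days}.$$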

In the weekly periodogram, five of the seven most important frequencies are within the range of these mean interval lengths. Shorter intervals, even though they are more frequent, did not have any effect, as expected.

All prominent frequencies possess matches in the weekly counts. Fortnightly and quarterly counts gave similar results.

Weights of the frequencies for one-week windows show that short waves corresponding to intervals between the median and the mean are most effective. There are 5 waves between 15 and 21.5 days. Then a doublet wave follows at one month, together with waves at approximately 1.5, 2 and 3 months. The next waves are a doublet at 15 weeks and a single wave at 20 weeks. The longer waves appear to be more prominent, but this anomaly is explained by the fact that they describe the least-squares differences between larger numbers, obtained as sums over shorter periods. They are produced by relatively few longer periods of idleness and/or by their groupings.

Extremely high weights also appeared in the Fourier analysis of the interval lengths between consecutive publications, owing to their span from 0 to 94. The most important frequencies appear to be induced by periods of 9 to 27 publications, the longest wave, 27.624, being harmonically related to 13.812 (twice its period). When we recalculated these periods in weeks, using the mean interval of 16.22 days, we found closely corresponding frequencies in the weekly counts, except for the harmonic.
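One reading of this recalculation is that a period expressed in numbers of publications is multiplied by the mean interval and divided by seven. The short sketch below follows that reading; the function name is hypothetical and the conversion itself is an interpretation of the text, not a documented procedure.

```python
MEAN_INTERVAL_DAYS = 16.22  # common mean interval reported in the text

def period_in_weeks(period_in_publications, mean_interval_days=MEAN_INTERVAL_DAYS):
    """Convert a period counted in publications into weeks (assumed conversion)."""
    return period_in_publications * mean_interval_days / 7.0

short_wave = period_in_weeks(13.812)  # about 32 weeks
long_wave = period_in_weeks(27.624)   # about 64 weeks, twice the shorter period
```

Under this reading the two waves correspond to roughly 32 and 64 weeks.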

An interesting feature of the analysis is the appearance of bands of neighboring frequencies with high weights (not necessarily maximal ones) and of doublets, in which two close frequencies with high weights are separated by one or a few frequencies with low weights. These low-weight frequencies are sometimes associated with prime numbers or numbers possessing few divisors.

When we tried to establish the statistical significance of the frequencies obtained, using the cumulative periodogram and the Kolmogorov-Smirnov limit for an even distribution of frequencies, only the part of the periodogram with cycle lengths of about 17-18 days touched the 75% border. Therefore the observed frequencies are not statistically significant.
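A test of this type can be sketched as follows. For a purely random series the normalized cumulative periodogram rises evenly, as k/q over the q frequencies, and the largest departure from that line is compared with a Kolmogorov-Smirnov border of the form K/sqrt(q). The coefficients in `KS_COEFFICIENT` are the usual asymptotic values and are an assumption here, as are the function names and the simplified handling of the deviation (evaluated only at the Fourier frequencies); the exact limits used by the program are not given in the text.

```python
import math

# Assumed asymptotic Kolmogorov-Smirnov coefficients for the given borders
KS_COEFFICIENT = {0.75: 1.02, 0.90: 1.22, 0.95: 1.36, 0.99: 1.63}

def periodogram_weights(series):
    """Unnormalized periodogram weights for frequencies k/n, k = 1..(n-1)//2."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    weights = []
    for k in range(1, (n - 1) // 2 + 1):
        a = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        b = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        weights.append(a * a + b * b)
    return weights

def max_deviation(series):
    """Largest gap between the cumulative periodogram and the even line k/q."""
    weights = periodogram_weights(series)
    q = len(weights)
    total = sum(weights)
    running, worst = 0.0, 0.0
    for k, w in enumerate(weights, start=1):
        running += w
        worst = max(worst, abs(running / total - k / q))
    return worst, q

def exceeds_border(series, level=0.75):
    """True if the deviation crosses the border for the given level."""
    deviation, q = max_deviation(series)
    return deviation > KS_COEFFICIENT[level] / math.sqrt(q)

# Hypothetical series with one strong cycle (period 8), for illustration only
demo = [math.cos(2 * math.pi * t / 8) for t in range(64)]
deviation, q = max_deviation(demo)
significant = exceeds_border(demo, 0.75)
```

The artificial series crosses the border easily because all of its weight sits at one frequency; a series whose cumulative periodogram only touches the 75% border, as reported above, gives no such evidence.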

Discussion

The Weibull distribution is applied in industry for measuring and estimating the service life of machine parts. If we use this analogy, we must interpret a publication as a failure of a problem "in service". When the problem is cracked, it is published. Because the author is usually interested in several problems simultaneously, we observe only parts of the service life. For studying the whole life of an idea, from the moment when it is conceived to the time when the respective scientific paper is submitted for publication, it would be necessary to keep exact diaries and make them available to scholars involved in scientometrics. In reality this will hardly ever happen. The data that can be acquired with relatively little difficulty are the intervals between consecutive publications. The aim of this work was to show that their study makes sense.

It is technically possible to write a mathematical paper within a day or a week, given knowledge of the literature, technical skills and facilities. But not every mathematician has so many bright ideas at hand as to be able to start a new paper immediately after having completed the previous one. Periods of idleness appear, spent waiting for inspiration, if new problems are not ready in store. The ability to work simultaneously on different tasks is important for the productivity of an author. Moreover, and more importantly, not every attempt to solve a task is successful and fruitful. Thus an additional difference between authors is their success rate.

An experimental paper contains a conventional amount of work. This work can be done by co-workers, who can check references and even write the publication. Inventive researchers employing large teams can be very prolific, but they need to be very inventive to find suitable tasks for their teams. The first condition for prolificness is the ability to generate new conceptions. Unsuccessful attempts to solve problems are demoralizing; easy tasks are boring.

A difficulty with considering an author as an emitter is that this emitter changes during his career. The impetus to publish usually decreases with the number of publications. A prominent author receives honorary duties which, instead of relieving him of creative efforts, wear down his energy. It can be guessed that the short-period frequencies are induced by the author himself, namely by his efforts to be productive, whereas the longer ones are induced by the academic surroundings, the tasks he has as a professor, personal and family affairs, etc. These cycles are, unfortunately, statistically unprovable by the method presently employed.

We could speculate over the data obtained and make further guesses, but without comparative samples this would be a futile task. It would be necessary to analyze several similar bibliographies before all their intricacies could be understood. But even the first experiences indicate that the problem is worth studying.

The existence of intervals between consecutive information units in information strings is not limited to personal bibliographies or patents. Preliminary studies of the distributions of intervals between letters in Czech and English texts showed that they can be described by skewed distributions, namely negative binomial, exponential, lognormal and Weibull. The technique of evaluating these intervals could be useful for analysing the information encoded in DNA, RNA and proteins.
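Intervals between occurrences of one symbol in a string can be collected as in the sketch below. Measuring the interval as the difference of character positions is an assumption, and the function name and the sample string are hypothetical; the preliminary studies mentioned above are not described in enough detail to reproduce them.

```python
from collections import Counter

def symbol_intervals(text, symbol):
    """Distances between consecutive occurrences of one symbol in a string."""
    positions = [i for i, ch in enumerate(text) if ch == symbol]
    return [b - a for a, b in zip(positions, positions[1:])]

# Hypothetical sample string, for illustration only
demo_text = "information distributions"
gaps = symbol_intervals(demo_text, "i")  # [8, 5, 4, 4]
histogram = Counter(gaps)                # empirical distribution of the intervals
```

The resulting histogram is the empirical interval distribution to which a skewed distribution could then be fitted; the same function applies unchanged to a nucleotide or amino acid sequence written as a string.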

References

(1) Kunz, M.; Plots against Information Laws. Science and Science of Science, 1995, 7, 91-97.

(2) Kunz, M.; Path and Walk Matrices of Trees. Collect. Czech. Chem. Commun., 1989, 54, 2148-2155.

(3) Kunz, M.; A Moebius Inversion of the Ulam Subgraphs Conjecture. J. Math. Chem., 1992, 9, 297-305.

(4) Kunz, M.; Time Spectra of Patent Information. Scientometrics, 1987, 11, 163-173.

(5) McGrath, W. E.; Periodicity in academic library circulation, a spectral analysis, in H. Kretschmer, Ed., Fourth International Conference on Bibliometrics, Informetrics and Scientometrics, September 11-15, 1993, Berlin, Book of Abstracts, Part I.

(6) Feller, W.; An Introduction to Probability Theory and its Applications, Vol. II, J. Wiley and Sons, New York, 1971.