Monday, October 7, 2013

Impact Factors, Again

On The Big Bang Theory, Sheldon says something like, "That might well appear to be serendipitous to someone unfamiliar with the law of large numbers." Well, I have a passing familiarity with the law of large numbers, but I still find it serendipitous that a major paper on impact factors and citation patterns should come out in Science the day after my little blog post on journal impact factors in archaeology. Moreover, the paper comes from the laboratory of my favorite physicist, Albert-László Barabási, who in my opinion is producing some of the finest research in the current wave of social physics. His research has focused on power-law distributions and scale-free and fractal patterns in social phenomena, such as networks and human movement. He is also an alarmingly talented writer, as I have said before in this blog (see my August 21, 2010 post on his book Bursts).

The article that came out on Friday is

Wang, Dashun, Chaoming Song, and Albert-László Barabási (2013). Quantifying Long-Term Scientific Impact. Science Vol. 342, pp. 127-132. doi:10.1126/science.1237825.

The authors present a model that purports to predict the long-term impact of scientific articles. The model incorporates three factors. 1) "Preferential attachment" is a positive feedback dynamic in which a quantity, such as wealth, citations, or social connections, grows for each individual in proportion to its already existing value, so that the rich get richer, highly cited papers accumulate even more citations, and the well-connected develop more associations. Preferential attachment is important because it is often invoked to explain the spontaneous emergence of power-law distributions and scale-free patterns. The distribution of citations has often been modeled as a power law, hence the logic of including preferential attachment as a process. 2) "Aging" accounts for the decline in citations to an article over time. The authors show that when preferential attachment is controlled for by considering only papers with the same number of citations, the rate of subsequent citations decays over time following a lognormal distribution. 3) "Fitness" is defined as the "inherent differences between papers, accounting for the perceived novelty and importance of a discovery." This, of course, is difficult to measure because it is complex, intangible, and subjective, and it ultimately depends on the collective views of the scientific community.
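To make the three ingredients concrete, here is a minimal Python sketch. It is not the authors' code. It assumes that the rate at which a paper gains citations is the product of its fitness, its current citation count plus an offset m (so that an uncited paper can still be cited), and a lognormal aging curve. The function names and the parameter values (fitness = 2.0, mu = 1.0, sigma = 1.0, m = 30) are my own illustrative choices, not fitted values. Under that assumption the rate equation can be solved in closed form, and the sketch checks the closed form against a crude numerical integration.

```python
import math


def lognormal_pdf(t, mu, sigma):
    """Aging curve: lognormal density in the paper's age t (years)."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))


def lognormal_cdf(t, mu, sigma):
    """Fraction of the aging curve that has elapsed by age t."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def citation_rate(c, t, fitness, mu, sigma, m):
    """Assumed rate: fitness x preferential attachment x aging."""
    return fitness * (c + m) * lognormal_pdf(t, mu, sigma)


def cumulative_citations(t, fitness, mu, sigma, m):
    """Closed-form solution of dc/dt = citation_rate with c(0) = 0."""
    return m * (math.exp(fitness * lognormal_cdf(t, mu, sigma)) - 1.0)


def integrate_citations(t_end, fitness, mu, sigma, m, steps=20000):
    """Crude forward integration of the rate, used only as a sanity check."""
    dt = t_end / steps
    c = 0.0
    for k in range(steps):
        t_mid = (k + 0.5) * dt
        c += citation_rate(c, t_mid, fitness, mu, sigma, m) * dt
    return c


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    params = dict(fitness=2.0, mu=1.0, sigma=1.0, m=30.0)
    for years in (5, 10, 30):
        exact = cumulative_citations(years, **params)
        approx = integrate_citations(years, **params)
        print(years, round(exact, 1), round(approx, 1))
```

As the paper ages, the elapsed share of the aging curve approaches 1, so in this sketch the total number of citations a paper will ever collect depends only on its fitness and m: m * (exp(fitness) - 1).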

The model seems to do a good job of describing citation patterns over time, both for individual papers and for journals. For example, given the first five years of citation data for a paper, the model predicts with considerable precision the next 25 years of citations. If the first 10 years of data are used, the prediction improves markedly. A major novelty of the model is the incorporation of preferential attachment, which seems to account for its greater success compared to previously proposed models, which do not include a preferential attachment dynamic.

Removing preferential attachment from Barabási's model yields a simpler, "lognormal" model that, like other existing models, works well for papers with small numbers of citations, but seriously under-predicts the long-term accumulation of citations. Another way of thinking of this might be that the dynamical process has two regimes. The first few (7 or 8) citations that a paper receives are subject to random effects, but after that preferential attachment kicks in and dominates the process of securing additional citations. This is not to say that the initial citations are truly random, but merely that the cumulative effects of the processes in total yield a match to a lognormally distributed random variable. Beyond the threshold, the citation process is dominated by different factors. For example, hypothetically, the first few citations to an article might come predominantly from articles in the same journal or closely allied ones, read by one community, but as citations accumulate the probability rises of being cited in journals with different audiences, which would sharply increase the visibility of the article, thereby setting in motion the preferential attachment regime. This kind of explanation might even be tested.
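A toy calculation shows why dropping the feedback matters little for lightly cited papers and a great deal for heavily cited ones. The Python sketch below is my own simplification, not necessarily the "lognormal" model tested in the paper. It assumes a citation rate of fitness x (c + m) x lognormal aging with feedback, and fitness x m x lognormal aging without it. The names and parameter values are hypothetical.

```python
import math


def aging_cdf(t, mu, sigma):
    """Share of a lognormal aging curve that has elapsed by age t (years)."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def with_feedback(t, fitness, mu, sigma, m):
    """Cumulative citations when the rate scales with (c + m)."""
    return m * (math.exp(fitness * aging_cdf(t, mu, sigma)) - 1.0)


def without_feedback(t, fitness, mu, sigma, m):
    """Cumulative citations when the rate ignores citations already received."""
    return m * fitness * aging_cdf(t, mu, sigma)


def underprediction_ratio(t, fitness, mu=1.0, sigma=1.0, m=30.0):
    """How many times larger the feedback total is than the no-feedback total."""
    return (with_feedback(t, fitness, mu, sigma, m)
            / without_feedback(t, fitness, mu, sigma, m))


if __name__ == "__main__":
    # Hypothetical fitness values: a low-impact and a high-impact paper.
    for fitness in (0.1, 1.0, 3.0):
        print(fitness, round(underprediction_ratio(30, fitness), 2))
```

Because exp(x) - 1 is close to x when x is small, the two versions nearly coincide for low-fitness papers and pull apart rapidly as fitness grows, which is the pattern of under-prediction described above.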

The Wang et al. article set me to wondering whether we are nearing the day when one could engineer an article to be highly cited. I think the answer is still "no" because the critical part of the model is fitness, which remains a hot mess of imponderables. To address this, one could, hypothetically, pick the research topic of one's article through a social psychological analysis designed to identify a subject of optimal interest to the scientific community. I don’t know whether that is really possible, but I doubt it would be desirable. I fear it would preempt the element of serendipity that typifies most of the really significant and interesting scientific discoveries. 
