For many years now, I’ve had a good grip on what the h-index is all about. If you would like to follow this blogpost about the g-index, then please make sure that you already understand the h-index. I’ve recently had a story published with Library Connect, which elaborates on my user-friendly description of the h-index. There are now many measures similar to the h-index. Some are simple to understand, like the i10-index, which is just the number of papers you have published which have had 10 or more citations. Others are more difficult to understand because they attempt to do something more sophisticated, and perhaps they actually do a better job than the h-index alone: it is probably wise to use a few of them in combination, depending on your purpose and your understanding of the metrics. If you enjoy getting to grips with all of these measures then there’s a paper reviewing 108 author-level bibliometric indicators which will be right up your street!
If you don’t enjoy these metrics so much but feel that you should try to understand them better, and you’re struggling, then perhaps this blogpost is for you! I won’t even think about looking at the algorithms behind Google PageRank inspired metrics, but the g-index is one metric that even professionals who are not mathematically minded can understand. For me, understanding the g-index began with the excellent Publish or Perish website and book, but even this left me frowning. Wikipedia’s entry was completely unhelpful to me, I might add.
In preparation for a recent webinar on metrics, I redoubled my efforts to get the g-index into a manageable explanation. On the advice of my co-presenter from the webinar, Andrew Plume, I went back to the original paper which proposed the g-index: Egghe, L., “Theory and practice of the G-index”, Scientometrics, vol. 69, no. 1 (2006), pp. 131–152.
Sadly, I could not find an open access version, and when I did read the paper, I found it peppered with precisely the sort of formulae that make librarians like me want to run a mile in the opposite direction! However, I found a way to present the g-index at that webinar, which built nicely on my explanation of the h-index. Or so I thought! Follow-up questions from the webinar showed where I had left gaps in my explanation, and so this blogpost is my second attempt to explain the g-index in a way that leaves no room for puzzlement.
I’ll begin with my slide from the webinar:
I read out the description at the top of the table, which seemed to make sense to me. I explained that I needed the four columns to calculate the g-index, reading off the titles of each column. I explained that in this instance, the g-index would be 6… but I neglected to say why: row 6 is the last row in my table where the total number of citations (my right hand column) is higher than or equal to the square of g.
Why did I not say this? Because I was so busy trying to explain that we can forget about the documents that have had no citations… oh dear! (More on those “zero cites” papers later.) In my defence, this is exactly the same as saying that the citations received altogether must be at least g squared, but when presenting something that is meant to be de-mystifying, the more descriptions, the better! So, again: the g-index in my table above is the document number (g) where the total number of citations is greater than or equal to the square of g (also known as g squared).
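If a few lines of code are easier to follow than a table, here is a minimal Python sketch of that rule. It is only an illustration, not anyone's official implementation. The function name `g_index` is my own choice, and the citation counts are invented: only the top paper with 50 cites and the total of 93 match Professor X, while the other per-paper figures are made up so that they add up. The sketch follows the approach used in this post, where documents with no citations are left out.

```python
def g_index(citations):
    """Return the largest g for which the g most-cited papers
    have at least g squared citations between them.

    Papers with zero citations are dropped first, which is the
    "ever cited papers" reading used in this post.
    """
    cited = sorted((c for c in citations if c > 0), reverse=True)
    total = 0
    g = 0
    for rank, cites in enumerate(cited, start=1):
        total += cites              # running total of citations
        if total >= rank * rank:    # compare with g squared
            g = rank
    return g


# Hypothetical counts: 50 + 20 + 10 + 8 + 3 + 2 = 93, plus four uncited papers.
professor_x = [50, 20, 10, 8, 3, 2, 0, 0, 0, 0]
print(g_index(professor_x))  # 6
```

The running total reaches 93 at the sixth paper, and 93 is at least 36 (6 squared), so the answer is 6. The uncited papers never get a row of their own in this version.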
On reflection, for the rows where there were “0 cites” I should have written “does not count” instead of “93” in the “Total number of citations” column, as people naturally asked afterwards why the g-index of my Professor X was not 9. In my presentation I had tried to explain what would happen if the documents with 0 citations had actually had a citation each, which would have yielded a g-index of 9, but I was not clear enough. I should have had a second slide to show this:
Here we can see that the g-index would be 9, because in the 9th row the total number of citations is higher than g squared, but in the 10th row the total number of citations is less than g squared.
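The same what-if can be checked row by row with a short Python sketch. The numbers are assumptions for illustration: six cited papers that add up to 93 (only the 50 comes from Professor X's record as described here), plus four papers that are imagined to have picked up one citation each.

```python
from itertools import accumulate

# Hypothetical what-if data: the four formerly uncited papers now have 1 cite each.
what_if = [50, 20, 10, 8, 3, 2, 1, 1, 1, 1]

# Running total of citations, most-cited paper first.
running_totals = list(accumulate(sorted(what_if, reverse=True)))

for g, total in enumerate(running_totals, start=1):
    verdict = "ok" if total >= g * g else "too few"
    print(f"g={g:2d}  total cites={total:3d}  g squared={g * g:3d}  {verdict}")

# The g-index is the last row where the total is at least g squared.
g_index = max(
    (g for g, total in enumerate(running_totals, start=1) if total >= g * g),
    default=0,
)
print(g_index)  # 9
```

Row 9 has a running total of 96 against a g squared of 81, so it passes. Row 10 has 97 against 100, so it fails.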
My “0 cites” was something of a complication and a red herring, and yet it is also a crucial concept, because there are many, many papers out there with 0 citations, and so there will be many researchers with papers that have 0 citations.
I also found, when I went back to that original paper by Egghe, that it has a “Note added in proof” which describes a variant where papers with zero citations, or indeed fictitious papers, are included in the calculation in order to provide a higher g-index score. However, I have not used the variant. In the original paper Egghe refers to “T”, which is the total number of documents, or as he described it, “the total number of ever cited papers”. Documents that have never been cited cannot be part of “T”, and that’s why my explanation of the g-index excludes those documents with 0 citations. I believe that Egghe kept this because it is a feature of the h-index which he valued, i.e. representing the most highly cited papers in a single number, which is why I did not use the variant.
However, others have used the variant in their descriptions of the g-index and the way they have calculated it in their papers, especially in more recent papers that I’ve come across, so this confuses our understanding of exactly what the g-index is. Perhaps that’s why the Wikipedia entry talks about an “average” because the inclusion of fictitious papers does seem to me more like calculating an average. No wonder it took me such a long time to feel that I understood this metric satisfactorily!
My advice is: whenever you read about a g-index in future, be sure that you understand what is included in “T”, i.e. which documents qualify to be included in the calculation. There are at least three possibilities:
- Documents that have been cited.
- Documents that have been published but may or may not have been cited.
- Entirely fictitious documents that have never been published and act as a kind of “filler” for rows in our table to help us see which “g squared” is closest to the total number of citations!
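To see how much the choice of “T” matters, here is a hedged Python sketch that calculates a g-index under each of the three possibilities in the list above. The function name, the `documents` parameter and its three labels (`"cited"`, `"published"`, `"padded"`) are all my own inventions for illustration, and the author's citation counts are made up. The `"padded"` option is only my reading of the fictitious-documents variant: it adds enough zero-citation rows that the table cannot run out before g squared overtakes the total.

```python
import math


def g_index(citations, documents="cited"):
    """Largest g where the top g documents have at least g squared citations.

    documents="cited"     -> only documents that have been cited
    documents="published" -> every document in the data set, cited or not
    documents="padded"    -> as "published", plus fictitious zero-cite rows
    """
    counts = sorted(citations, reverse=True)
    if documents == "cited":
        counts = [c for c in counts if c > 0]
    elif documents == "padded":
        # Add filler rows until the table is long enough for g squared
        # to exceed the total number of citations.
        rows_needed = math.isqrt(sum(counts)) + 1
        counts += [0] * max(0, rows_needed - len(counts))
    elif documents != "published":
        raise ValueError(f"unknown documents option: {documents!r}")

    total, g = 0, 0
    for rank, cites in enumerate(counts, start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g


# A made-up author: six cited papers (93 cites in all) and two uncited ones.
author = [50, 20, 10, 8, 3, 2, 0, 0]
print(g_index(author, "cited"))      # 6
print(g_index(author, "published"))  # 8
print(g_index(author, "padded"))     # 9
```

The same eight papers give three different answers, which is exactly why it is worth checking which documents a particular source has counted.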
I say “at least” because of course these documents are the ones in the data set that you are using, and there will also be variability there: from one data set to another and over time, as data sets get updated. In many ways, this is no different from other bibliometric measures: understanding which documents and citations are counted is crucial to understanding the measure.
Do I think that we should use the variant or not? In Egghe’s Note, he pointed out that it made no difference to the key finding of his paper which explored the works of prestigious authors. I think that in my example, if we want to do Professor X justice for the relatively highly cited article with 50 cites, then we would spread the total of citations out across the documents with zero citations and allow him a g-index of 9. That is also what the g-index was invented to do, to allow more credit for highly cited articles. However, I’m not a fan of counting fictitious documents. So I would prefer that we stick to a g-index where “T” is “all documents that have been published and which exist in the data set, whether or not they have been cited.” So not my possibility no. 1 which is how I actually described the g-index, and not my possibility no. 3 which is how I think Wikipedia is describing it. This is just my opinion, though… and I’m a librarian rather than a bibliometrician, so I can only go back to the literature and keep reading.
One final thought: why do librarians need to understand the g-index anyway? It’s not all that well used, so perhaps it’s not necessary to understand it. And yet, knowledge and understanding of some of the alternatives to the h-index and what they are hoping to reflect will help to ensure that you and the people who you advise, be they researchers or university administrators, will all use the h-index appropriately – i.e. not on its own!