This paper advances science overlay mapping processes. The intent is to provide the research communities using scientometrics with an improved methodology to generate overlay maps (Rafols, Porter, & Leydesdorff, 2010). An overlay map is a global map of science over which a subset of publications is projected, thus allowing the visualization of the disciplinary scope of the scientific production of a given organization, individual, territory, etc. Such maps can help analysts and readers grasp the mix of disciplines engaging a given topic or the portfolio of research interests reflected in the publication (sub)set of an organization (see Wallace and Rafols (2015) for a discussion of research portfolios).
The paper briefly overviews the heritage of the use of Web of Science subject categories (WCs) and of science overlay mapping. It then presents enhanced methodology to generate the maps, followed by examples to illustrate novel application opportunities. The paper updates the visualization process and provides an advanced 2015 basemap.
In order to understand the multidisciplinary profile of publication sets, disciplinary or sub-disciplinary categories can be assigned to the publications. These categories can then be used to represent the position of a publication set in the overall structure of science—i.e. to overlay a specific research activity onto the map of science (Rafols, Porter, & Leydesdorff, 2010).
One method to assign publications to a disciplinary category is to rely on the journal of the publication as an estimate of the scientific field. However, disciplines and fields of science develop above the level of individual journals. Scientometricians proposed the normalization of citations in terms of journal categories (ISI Subject Categories, now known as Web of Science Categories)—as proxies of scientific fields defined above the level of individual journals—in a series of publications during the 1980s (e.g. Schubert, Glänzel, & Braun, 1986; Schubert, Glänzel, & Braun, 1989; Vinkler, 1986).
Using these categories, Moed, de Bruin, and van Leeuwen (1995) further developed the “crown indicator” at the Center for Science and Technology Studies (CWTS) in Leiden, which was later improved as the “Mean Normalized Citation Score” (MNCS). This indicator remains based on the same subject categories, and it is currently the most widely used method to provide normalized comparisons across scientific areas.
The WCs tagged to the 11,000+ journals covered by the Science Citation Index (SCI) and the Social Sciences Citation Index (SSCI) are assigned by indexers on the basis of a number of criteria, including field experts’ judgment of relevance to a given field, the journal’s title, and its citation patterns (Bensman & Leydesdorff, 2009). As of 2015, there are 227 WCs covering SCI and SSCI. Pudovkin and Garfield (2002) described the methods used by the ISI (then provided by Thomson Reuters, and now Clarivate Analytics), and concluded that in many fields these categories are “sufficient;” but “in many areas of research these “classifications” are crude and do not permit the user to quickly learn which journals are most closely related” (p. 1113). Boyack, Börner, and Klavans (2007) estimated that the assignment of WCs is correct in approximately 50% of cases across the file. That said, the “correct” assignment based on detailed article content would usually be proximate.
On the basis of a comparison of this classification with algorithmically generated ones, Rafols and Leydesdorff (2009, p. 1830) concluded that the WCs can be used for aggregate statistical purposes (i.e. above 100 or so publications, depending on the desired granularity), but are not well-suited for detailed analyses (e.g. to assess an individual’s research). The WCs sometimes cover similar sets of journals, for example in the domain of biomedicine. In other cases, the categories added by an indexer cover areas that could be considered as separate sub-disciplines or subfields (Leydesdorff & Bornmann, 2016; van Eck et al., 2013). In the case of interdisciplinary publications, problems of imprecise or potentially erroneous classifications can be expected (Rafols & Meyer, 2010).
In scientometric evaluations, journals are sometimes attributed percentages proportional to the categories under which they are subsumed. These multiple categories have also been considered indicators of the interdisciplinarity of journals (Bordons, Bravo, & Barrigon, 2004; Katz & Hicks, 1995; Morillo, Bordons, & Gomez, 2001).
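The proportional attribution described here can be illustrated with a short sketch. It assumes that each publication is represented simply by the list of WCs assigned to its journal, and that a publication's unit weight is divided equally among those WCs; the function name and the sample records are hypothetical and are not taken from the toolkit.

```python
from collections import defaultdict


def fractional_wc_counts(publications):
    """Spread each publication's unit weight equally over its journal's WCs.

    `publications` is an iterable of lists of category names (an assumed,
    simplified record format). Returns a dict: category -> fractional count.
    """
    totals = defaultdict(float)
    for wcs in publications:
        if not wcs:
            continue  # records without any category are skipped
        share = 1.0 / len(wcs)
        for wc in wcs:
            totals[wc] += share
    return dict(totals)


# Hypothetical publication set: four papers, two of them in journals
# that carry two categories each.
pubs = [
    ["Economics"],
    ["Economics", "Statistics & Probability"],
    ["Statistics & Probability", "Mathematics, Applied"],
    ["Economics"],
]
counts = fractional_wc_counts(pubs)
for wc, value in sorted(counts.items()):
    print(f"{wc}: {value:.2f}")
```

With this kind of counting, the category totals add up to the number of publications, so the resulting values can serve as node weights in an overlay without double-counting papers from multi-category journals.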
Notwithstanding these issues, WCs are a main basis for scientometric analyses. The use of these journal categories has become conventional among scientometricians (e.g. Rehn et al., 2014), including use to assess research portfolios.
The field/subfield classification of Scopus is available in the journal list from
WCs can also be considered “macro-journals” representing fields and subfields of science. Their sub-disciplinary level of detail fits well with a US National Academies recommendation for study of interdisciplinarity (2005). The current (2015 WoS data) matrix of 227 WCs citing one another can be decomposed using multivariate (e.g. clustering) analysis. It can also be analyzed as a network using, for example, community-finding algorithms. Initially (see Leydesdorff & Rafols, 2009; Rafols, Porter, & Leydesdorff, 2010), we used 2007 data to develop a global map of science. At that time, drawing a map using the approximately 10,000 journals in the database was technically not feasible due, among other things, to the cluttering of the labels on the screen. This problem was elegantly solved by VOSviewer (which became available in 2009), by allowing interactive zoom in/out functionality in the visualization (Klavans & Boyack, 2009; van Eck & Waltman, 2010).
Earlier maps were developed into an overlay-toolkit.
Pajek is a network analysis and visualization program freely available for non-commercial usage at
We now make some choices differently from the ones we made some ten years ago. The wide use by a variety of stakeholders (including not only some researchers, but also scientometric students and practitioners) and requests for a current database, together with technical improvements in visualization during recent years, lead us to revise the overlay basemaps and toolkit based on the most recent version of the
We use the combined set of the
Numbers of journals and Web of Science categories in SCI and SSCI. The journal
|Index||Journals||WCs|
|SCI||8,778||177|
|SSCI||3,212||57|
|Sum||11,990||234|
|Total||11,365||227|
|Overlap||625||6|
The set of WCs covering SCI and SSCI has expanded from 224 in 2010 to 227 in 2015. The three newly added WCs are: “Audiology & speech-language pathology,” “Green & sustainable science & technology,” and “Logic.” The former WC—“Biology, miscellaneous”—was no longer in use in 2010 and, therefore, not included in the analysis; it is also absent from the 2015 data and the current maps.
Using dedicated software, the matrix of 227 × 227 cells was generated on the basis of whole-number citation counting. As previously, we normalize this matrix using the cosine function. However, the default VOSviewer setting normalizes using Zitt, Bassecoulard, and Okubo’s so-called “probabilistic activity index” (PAI) (2000). PAI is equal to the ratio between observed and expected values in a contingency table based on a probability calculus (Equations (1) and (2)):
In the context of VOSviewer, this measure is renamed as the “association strength” (van Eck & Waltman, 2009).
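The two normalizations contrasted here can be sketched on a toy matrix. This is an illustration under stated assumptions, not the dedicated software or VOSviewer's own code: the matrix is assumed to hold citing categories in rows and cited categories in columns, the cosine is taken between rows (citing profiles), and the observed/expected ratio uses expected = row total × column total / grand total, as in a standard contingency table. Equations (1) and (2) are not reproduced, and any scaling constants VOSviewer applies are not modeled. All names and numbers are hypothetical.

```python
import math


def cosine_matrix(m):
    """Cosine similarity between the rows of a square matrix."""
    n = len(m)
    norms = [math.sqrt(sum(v * v for v in row)) for row in m]
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] and norms[j]:
                dot = sum(a * b for a, b in zip(m[i], m[j]))
                out[i][j] = dot / (norms[i] * norms[j])
    return out


def observed_expected(m):
    """Observed cell value divided by its expectation under independence."""
    n = len(m)
    row_totals = [sum(row) for row in m]
    col_totals = [sum(col) for col in zip(*m)]
    grand_total = sum(row_totals)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if row_totals[i] and col_totals[j]:
                expected = row_totals[i] * col_totals[j] / grand_total
                out[i][j] = m[i][j] / expected
    return out


# Toy 3 x 3 citation matrix (rows cite columns); values are made up.
citations = [
    [10, 2, 0],
    [3, 8, 1],
    [0, 1, 5],
]
cos = cosine_matrix(citations)
ratio = observed_expected(citations)
for name, matrix in (("cosine", cos), ("observed/expected", ratio)):
    print(name)
    for row in matrix:
        print("  " + "  ".join(f"{v:.3f}" for v in row))
```

The cosine compares whole citing profiles and is bounded between 0 and 1 for non-negative counts, whereas the observed/expected ratio is computed cell by cell and equals 1 wherever a cell matches its expectation.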
Unlike the cosine, which is symmetrical, PAI can be used to normalize asymmetrically the vertical and horizontal dimensions of a matrix. However, this possible advantage is not exploited in VOSviewer because the matrix is first made symmetric using the sums of lower and upper triangle values (cell_ij + cell_ji) in a new matrix. The cosine-normalized matrix remains worth investigating, because one is able to show the difference between the citation as the current activity (citing)
Taking these issues into consideration, we first develop the citing-side, cosine-normalized map using 2015 data and VOSviewer visualization with default parameter values. This map is a “descendant” of our previous maps; a strong relationship can be seen in comparing Figure 1 (our 2015 basemap from VOSviewer) with Figure A-1 in Appendix A (the 2010 basemap from Pajek). A routine for making overlays on the basis of the map (“wc15.exe”) is provided at
Rao-Stirling diversity is a measure that takes into account the variety, balance, and disparity of categories in a distribution. In the case of publication or patent portfolios, the categories can be WCs or IPC classes, respectively. The indicator is defined as Equation (3) (Rao, 1982; Stirling, 2007):
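For reference, the Rao-Stirling measure is commonly written in the form below. The notation is an assumption about the symbols intended here: p_i and p_j are the proportions of the portfolio in categories i and j, and d_ij is the disparity (distance) between the two categories, for example one minus their cosine similarity.

```latex
\Delta = \sum_{i,j\,(i \neq j)} p_i \, p_j \, d_{ij}
```

Whether the sum runs over ordered pairs (i, j) or over unordered pairs (i < j) changes the value by a factor of two, so the convention should be stated when scores are compared across studies.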
Zhang, Rousseau, and Glänzel (2016) and Garner et al. (2013) argue that 2
As an additional resource, one can feed the citation matrix of 227 WCs (citing
The routines also provide cluster and vector files for cos15.paj and matrix.paj made available on the website for Pajek, respectively (as previously). Pajek and Gephi contain a suite of tools for network analysis and visualization such as various decompositions, layouts, and visualization options. Using Pajek or Gephi, for example, one can also obtain the results of the Louvain algorithm (Blondel et al., 2008) for the decomposition in a format that can again be visualized in programs such as VOSviewer or Gephi.
Using VOSviewer, the user can change the number of clusters by changing the resolution parameter and running the clustering algorithm again. Using default values, both maps (i.e. cosine-normalized or not) show five clusters, but chi-square statistics reject the null hypothesis that the two classifications are similar (Cramer’s
As mentioned in the previous section, the cosine-similarity matrix for the WCs provides both the basis for locating the WCs as nodes in science maps (Figure 1), and the basis to calculate measures of diversity. Footnotes 7 and 8 remind the users of Stirling’s measure and how it can be calculated using the 227-by-227 WC cosine-similarity matrix (see Rafols, Porter, & Leydesdorff (2010) for details).
Porter and colleagues introduced measures of interdisciplinarity and multidisciplinarity called “Integration scores” and “Specialization scores,” extended by Carley and Porter to “Diffusion scores” as well (Carley & Porter, 2012; Porter et al., 2007; Porter & Rafols, 2009). For a given set of publications from WoS, Specialization scores indicate the disciplinary diversity of the set based on the distribution of their WCs. Integration scores reflect the diversity of those publications’ cited references, again using the cited WCs. Downloading the “cited references” of a given WoS search set allows one to pursue this metric. Conversely, Diffusion scores reflect the diversity of the disciplines citing a given set of papers, based on the citing journals’ WCs. This requires a citation search and data downloading from WoS.
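The scores described above differ only in which WC distribution is measured: the set's own WCs, the WCs of its cited references, or the WCs of the citing journals. A minimal sketch follows. It assumes that diversity is computed as the sum of p_i × p_j × (1 − cosine similarity) over ordered pairs of distinct categories; the function, the three-category similarity values, and the counts are hypothetical and do not come from the website routines or the VantagePoint scripts.

```python
def rao_stirling(counts, similarity):
    """Rao-Stirling diversity of a category distribution.

    counts:     dict category -> (possibly fractional) count
    similarity: dict frozenset({a, b}) -> cosine similarity in [0, 1]
    Sums p_i * p_j * (1 - s_ij) over ordered pairs with i != j
    (an assumed convention; unordered pairs would give half the value).
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    cats = [c for c, n in counts.items() if n > 0]
    score = 0.0
    for i in cats:
        for j in cats:
            if i == j:
                continue
            p_i = counts[i] / total
            p_j = counts[j] / total
            score += p_i * p_j * (1.0 - similarity[frozenset((i, j))])
    return score


# Hypothetical cosine similarities among three categories.
sim = {
    frozenset(("A", "B")): 0.8,
    frozenset(("A", "C")): 0.1,
    frozenset(("B", "C")): 0.2,
}

# Distribution of the publications' own WCs versus their cited WCs.
own_wcs = {"A": 6, "B": 4}
cited_wcs = {"A": 5, "B": 3, "C": 2}

specialization_input = rao_stirling(own_wcs, sim)
integration_input = rao_stirling(cited_wcs, sim)
print(f"diversity of own WCs:   {specialization_input:.3f}")
print(f"diversity of cited WCs: {integration_input:.3f}")
```

In this toy case the cited references reach into a distant category (C), so their diversity is higher than that of the publications' own categories, which sit in two closely related WCs.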
These scores are different instances of the Rao-Stirling diversity measures (Footnote 7) (Stirling, 2007). As introduced earlier in this section, one can obtain the Specialization score (Rao-Stirling diversity for the WCs represented in the WoS search set) along with a science overlay map if desired, directly from the script provided at
Integration or Diffusion scores need more detailed computation. Scripts have been prepared to run in VantagePoint software.
Scripts available at
As introduced earlier in this paper, and enabled at the website (
Another way to compute the maps is to use VantagePoint (
The website provides the option to generate either five-cluster science overlay maps or finer-scaled (color-differentiated) 18-cluster overlay maps. Both cluster solutions were generated in VOSviewer, using its clustering algorithm.
Our previous clustering solutions were generated using factor analyses in SPSS, resulting in 4 “metadisciplines” (see Appendix Figure A-1) and 19 “macro-disciplines” for 2010 base data.
Our intent here is to present a range of maps to illustrate differences that the new science overlay mapping can convey. We hope that these promote thinking of additional uses of science overlay mapping, potentially augmented by enabling calculation of diversity measures (e.g. Specialization and Integration scores) with the same tool suite.
Figures 2 and 3 compare two multinational companies’ research publications in WoS for 2010–2015. Both show biomedical and physical science strengths. Unlike Unilever, Pfizer also has a pronounced portfolio in “economics” and “statistics and probability” as fields of science. These visualizations facilitate exploration of shared and complementary research interests, potentially of use in considering collaboration (as well as tracking competition) among organizations or nations.
Figures 4, 5, and 6 present three contrasting university profiles. Patterns stand out quite boldly among the engineering-oriented Georgia Tech, the social science emphases of the London School of Economics, and the full spectrum University of Amsterdam research. In contrast to Figure 5, Figure A-4 (Appendix) presents the same data using an 18-cluster map that facilitates finer comparisons.
Usually one would want to focus more tightly—e.g. on a particular research unit or even on an individual researcher’s work (say to ascertain complementarity with another research group or emphases of a funding program). As one step in that direction, contrast the emphases seen in Figure 4 to its subset for one department of Georgia Tech, the School of Public Policy, shown in Figure 7.
Conversely, one can observe even broader research profiles; Figure A-5 does so for a country, South Africa. Not surprisingly, one sees a very broad spectrum of research activity at this level. One could pursue this via further analyses, e.g. to identify researchers active in a particular sub-domain as spotted on a map. We envision various uses for such technical intelligence, ranging from identifying others who pursue one’s area of interest to identifying complementary strengths for research center development.
Figures 2 to 7 map the research outputs of a given organization. One can map other WoS search sets as well. For instance, in a study of the outputs and impacts of an NSF research program on Human & Social Dynamics (HSD), science overlay mapping was useful for those assessing the merits of that program to see the diversity of the publications generated by HSD support. However, it was even more interesting to see the spread of papers citing those publications across the disciplines. Those showed that this funding from the Social, Behavioral & Economic Sciences Directorate was actively cited beyond those social sciences by natural sciences and engineering (Garner et al., 2013).
Another appealing opportunity arises in mapping topical searches. Figure 8 illustrates this for an emerging energy technology, dye-sensitized solar cells (DSSCs), dominated by materials science and related research. “Big Data” (using a first approach) (Figure 9) shows a strong concentration in Computer Science and related fields, but note the remarkable breadth of publication, as virtually all fields consider how Big Data and Analytics can enhance their R&D. Such research profiling could support funding agencies’ confirmation of interdisciplinary research programs.
This article bolsters science overlay mapping as a tool for researchers and analysts to help understand the disciplinary profiles of organizations, funding programs, topics, or other types of publication sets. Visualization of the disciplinary profile, operationalized at the sub-discipline level of 227 Web of Science Categories (WCs), can now offer an adjustable, “bird’s-eye” view of the fields involved. By choosing the 18-cluster option (Figure A-3) or the five-cluster option (Figure 1), one can show the analysis at a narrow or broad level of disciplinary description.
We use a cosine-normalized basemap in this paper’s examples, but note the option of a non-normalized matrix that can default to VOSviewer’s internal normalization scheme for a different presentation (e.g. Figure A-2). We favor the cosine normalization as (1) yielding more intuitive results, (2) consistent with our prior overlay maps (see Figure A-1), (3) shown to be consistent with consensus science mapping (e.g. various renditions by Klavans & Boyack (in press) and others (Klavans & Boyack, 2009)), and (4) conducive to use in calculating diversity indexes (Rao-Stirling). Comparing to Figure A-1 also shows the general continuity between the previous Pajek visualization and the current VOSviewer one. It also shows some differences, both in the visual rendition and in node localizations. We now favor VOSviewer for its ease of use and the accessible richness of its visualization options.
As illustrated in the case examples, these science overlay maps can provide a quick and intuitive perspective on the disciplinary profiles of organizations. As explained in Rafols, Porter, and Leydesdorff (2010) (see also Leydesdorff & Bornmann, 2016; Rafols & Leydesdorff, 2009; Rafols & Meyer, 2010; van Eck et al., 2013), the main downside of this visualization tool is the lack of accuracy in the WCs, which nevertheless are the most widely used and easily available classification system. As shown in a previous study (Rafols, Porter, & Leydesdorff, 2010), the lack of accuracy of WCs is less problematic at a relatively high level of aggregation. Most errors in locating specific research are nearby in the mapping. For fine-grained descriptions, article-based clustering is preferred (Waltman & van Eck, 2012). However, article-based clustering does not match WC-based mapping for communicating which fields are engaged and to what degree.
We believe these new science overlay maps open opportunities for future research. For one, exploration of the differences between the global science maps over time (e.g. between the 2010 and 2015 basemaps) shows promise to elucidate real shifts in global research emphases. For instance, is medical science becoming more closely related to biological sciences and less linked to chemistry? The basemaps appear to evolve slowly, as shown by the fact that the underlying 2010 and 2015 citation matrices among WCs are very similar (QAP correlation
In stepping through the case analyses, we have pointed to a variety of appealing applications for the science
Search results of Big Data.
|No.||Search strategy||Big Data search terms (search conducted on 1/27/2016 by Alan)||Hits – 2006–2015|
|1||Core lexical query||TS = (“Big Data” or Bigdata or “Map Reduce” or MapReduce or Hadoop or Hbase or Nosql or Newsql)||8,602|
|2||Expanded lexical query||TS = ((Big Near/1 Data or Huge Near/1 Data) or “Massive Data” or “Data Lake” or “Massive Information” or “Huge Information” or “Big Information” or “Large-scale Data” or Petabyte or Exabyte or Zettabyte or “Semi-Structured Data” or “Semistructured Data” or “Unstructured Data”)||11,798|
|TS = (“Cloud Comput*” or “Data Min*” or “Analytic*” or “Privacy” or “Data Manag*” or “Social Media*” or “Machine Learning” or “Social Network*” or “Security” or “Twitter*” or “Predict*” or “Stream*” or “Architect*” or “Distributed Comput*” or “Business Intelligence” or “GPU” or “Innovat*” or “GIS” or “Real-Time” or “Sensor Network*” or “Smart Grid*” or “Complex Network*” or “Genomics” or “Parallel Comput*” or “Support Vector Machine” or “SVM” or “Distributed” or “Scalab*” or “Time Serie*” or “Data Science” or “Informatics*” or “OLAP”)||3,113,113 (part A AND part B = 7,696)|
|3||#1 OR (#2 AND #3); 2006–2016||SCI = 4,673; SSCI = 1,026, of which 541 are not also in SCI –download 541; AHCI (not in SSCI) = 45 down; CPCI-S & CPCI-SSH = 6,267 (of which 6,093 not in SCI-SSCI – download) – hit 5,000 limit. so split – download 6,093; BCI-S & BCI-SSH = 376 – download all (ignore possible overlaps)|
|ESCI – search #1 = 0; so leave that dB out; ** save the separate downloaded into VP files in case we want to analyze sometime – note trend behavior for 2015 differs greatly from SCI/SSCI (UP) to CPCI’s (DOWN). I think due largely to incomplete indexing at this date in WoS. Also saved the combo – 11,728 total – removed dups to get 11,684 (saved with the component files on the flash memory).|