This page outlines building citation networks from a seed corpus to obtain a full corpus for topic modeling.


One of the main challenges I've encountered in natural language processing of scientific literature is getting a representative collection of articles. Searching for articles that contain a given term (e.g. carbon nanotube) can miss relevant articles that simply don't happen to use that phrase. To address this problem I've been experimenting with building citation networks from the Semantic Scholar Open Research Corpus.

  • Load the Semantic Scholar Open Research Corpus into a MySQL database, allowing quick local access to metadata for ~25 million papers
  • Quickly retrieve a set of paper abstracts from a simple search term (see the query sketch below), then build a citation network from it
  • Perform topic modeling on the resulting dataset with a CorEx topic model
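
As a rough illustration of the first two steps, a seed set of abstracts could be pulled from the local MySQL mirror with a simple keyword query. The table and column names below (`papers`, `paper_id`, `title`, `abstract`, `year`) are assumptions for illustration, not the actual schema used in the repository.

```python
# Minimal sketch: query a local MySQL mirror of the Semantic Scholar
# Open Research Corpus for seed papers matching a phrase.
# Table and column names are assumed for illustration.
import pymysql

connection = pymysql.connect(
    host="localhost", user="user", password="password", database="s2_corpus"
)

query = """
    SELECT paper_id, title, abstract, year
    FROM papers
    WHERE abstract LIKE %s
    LIMIT 3000
"""

with connection.cursor() as cursor:
    cursor.execute(query, ("%geographic imaging system%",))
    seed_papers = cursor.fetchall()

print(f"Retrieved {len(seed_papers)} seed papers")
```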

See the repository here for the code used to generate the plots below and for more information about the dataset.

Building a citation network for “geographic imaging system”

The Semantic Scholar dataset includes each paper's references and citations, allowing us to build citation graphs. The 2D layouts shown here are computed with the Force Atlas algorithm.
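
As a minimal sketch of the graph construction, assuming each paper record carries a list of outbound reference IDs (the `outCitations` field in the Semantic Scholar metadata), the citation graph can be assembled with networkx; the field names here are illustrative.

```python
# Sketch: build a directed citation graph with networkx from paper records
# that have an "id" and a list of outbound reference ids ("outCitations").
import networkx as nx

def build_citation_graph(papers):
    """papers: iterable of dicts with 'id' and 'outCitations' keys."""
    graph = nx.DiGraph()
    ids_in_corpus = {p["id"] for p in papers}
    for paper in papers:
        graph.add_node(paper["id"])
        for ref_id in paper.get("outCitations", []):
            # Only keep edges between papers already in the corpus
            if ref_id in ids_in_corpus:
                graph.add_edge(paper["id"], ref_id)
    return graph
```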

Citation network algorithm

  1. Obtain 3000 papers for the phrase “geographic imaging system”
  2. Keep the largest connected graph
  3. Trim to the 300 most connected papers, based on the number of edges (node degree)
  4. Grow the graph by adding each paper's references and citations, then trim again to a progressively larger size each round (a sketch of this loop is shown below).
This shows the citation network being built up through successive rounds of the algorithm.
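
A rough sketch of the grow-and-trim loop is below, using networkx. The `get_links` helper, which would look up a paper's references and citations (e.g. from the local database), and the trimming criterion are assumptions for illustration rather than the repository's exact implementation.

```python
import networkx as nx

def largest_connected_component(graph):
    """Keep only the largest (weakly) connected component."""
    biggest = max(nx.weakly_connected_components(graph), key=len)
    return graph.subgraph(biggest).copy()

def trim_to_most_connected(graph, n_keep):
    """Keep the n_keep nodes with the most edges (highest degree)."""
    top = sorted(graph.degree, key=lambda kv: kv[1], reverse=True)[:n_keep]
    return graph.subgraph(node for node, _ in top).copy()

def grow_and_trim(graph, get_links, n_keep):
    """One growth round: add references/citations of current nodes, then trim.

    get_links(paper_id) is an assumed helper returning the ids that a paper
    cites and is cited by.
    """
    grown = graph.copy()
    for paper_id in list(graph.nodes):
        for linked_id in get_links(paper_id):
            grown.add_edge(paper_id, linked_id)
    grown = largest_connected_component(grown)
    return trim_to_most_connected(grown, n_keep)

# Example: start from the trimmed seed graph and grow over several rounds,
# trimming to a progressively larger size each time (sizes are illustrative).
# for n_keep in (1000, 3000, 10000):
#     graph = grow_and_trim(graph, get_links, n_keep)
```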

Resulting final literature dataset (~10000 abstracts)

Below is the final dataset of papers with the growth round indicated by the color.

Topic modeling on resulting literature dataset

A topic modeling pipeline is then applied to this final collection of papers (a simplified sketch follows the list):

  1. The starting text is the title and abstract concatenated
  2. Remove the 130 most common words in the overall Semantic Scholar dataset (e.g. ‘science’, ‘research’, ‘et al’), along with general stopwords
  3. Apply Mat2Vec text processing to intelligently handle chemical formulas
  4. Perform Porter stemming (testing -> test)
  5. Form bigrams (natural language -> natural_language)
  6. Perform topic modeling with a Correlation Explanation (CorEx) topic model with 50 topics. In my experience, CorEx generally forms better topics than LDA, and it also allows for anchoring topics, though I don’t use anchoring here.
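
A simplified sketch of this pipeline is below, using scikit-learn's CountVectorizer, NLTK's Porter stemmer, and the corextopic package. The stopword handling, Mat2Vec step, and bigram construction here are simplified stand-ins (bigrams come from `ngram_range=(1, 2)`), so details differ from the repository's actual code.

```python
# Simplified sketch: stopword removal, Porter stemming, bigram features,
# and a 50-topic CorEx model. Mat2Vec processing is omitted here.
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from corextopic import corextopic as ct

stemmer = PorterStemmer()

def preprocess(text, extra_stopwords):
    """Lowercase, drop stopwords and overly common corpus words, then stem."""
    tokens = [t for t in text.lower().split() if t not in extra_stopwords]
    return " ".join(stemmer.stem(t) for t in tokens)

def fit_corex(docs, extra_stopwords, n_topics=50):
    """docs: list of 'title + abstract' strings; extra_stopwords: assumed set
    of general stopwords plus the ~130 overly common corpus words."""
    cleaned = [preprocess(d, extra_stopwords) for d in docs]
    vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2),
                                 max_features=20000)
    doc_word = vectorizer.fit_transform(cleaned)
    words = list(vectorizer.get_feature_names_out())
    model = ct.Corex(n_hidden=n_topics, seed=1)
    model.fit(doc_word, words=words)
    return model, vectorizer

# Example: print the top words of the first few topics
# model, _ = fit_corex(docs, extra_stopwords)
# for i in range(5):
#     print(i, [w for w, *_ in model.get_topics(topic=i, n_words=8)])
```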

Hot topics

By looking at the trend in each topic’s probability over the years, we can find which topics are ‘hot’ and which are ‘cold’.

This shows the top 10 topics from the CorEx topic model, ranked by the slope of each topic’s probability over the past 5 years.
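
One way to rank topics by trend, assuming a DataFrame of mean topic probabilities with years as the index and one column per topic, is a simple linear fit over the last five years; this is an illustrative sketch rather than necessarily the exact method used for the plot.

```python
import numpy as np
import pandas as pd

def rank_hot_topics(topic_prob_by_year: pd.DataFrame, n_years=5, top_k=10):
    """Rank topics by the slope of their mean yearly probability.

    topic_prob_by_year: rows indexed by year, one column per topic,
    values are that year's mean topic probability (assumed input format).
    """
    recent = topic_prob_by_year.sort_index().iloc[-n_years:]
    years = recent.index.to_numpy(dtype=float)
    slopes = {
        topic: np.polyfit(years, recent[topic].to_numpy(), deg=1)[0]
        for topic in recent.columns
    }
    # Largest positive slope first: the 'hottest' topics
    return sorted(slopes, key=slopes.get, reverse=True)[:top_k]
```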

Interactive topic visualization

Below is an interactive visualization of the topic model. This visualization was developed as part of the MLEF 2021 program and is described further here.