This page is a programmatically generated report of the results of topic modeling algorithms applied to a collection of scientific literature. Expand the sections below for more information. The code can be found on GitHub. You can find other topic modeling reports here.
The dataset used in this work is the Semantic Scholar Open Research Corpus. The full dataset includes paper metadata (title, abstract, citation info, etc.) for over 25 million papers, stored locally in a ~100 GB SQLite database. Topic modeling is performed on a subset of this data, obtained by finding papers that contain the regular expression 'natural language processing' in the title or abstract.
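As a rough illustration of this filtering step, a subset like this can be pulled from SQLite by registering a REGEXP function; the table and column names below ('papers', 'title', 'abstract') and the database filename are assumptions, not the actual schema used here.

```python
import re
import sqlite3

conn = sqlite3.connect("semantic_scholar.sqlite")

# SQLite has no built-in REGEXP operator, so back one with Python's re module.
# SQLite calls regexp(pattern, text) when evaluating "text REGEXP pattern".
def regexp(pattern, text):
    return text is not None and re.search(pattern, text) is not None

conn.create_function("REGEXP", 2, regexp)

# Pull the subset of papers whose title or abstract matches the query phrase.
rows = conn.execute(
    """
    SELECT title, abstract
    FROM papers
    WHERE title REGEXP ? OR abstract REGEXP ?
    """,
    ("natural language processing", "natural language processing"),
).fetchall()
```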
Note: the 25 million papers mentioned above are a temporarily downselected version of the entire Semantic Scholar dataset. The dataset includes topic tags created by the now-defunct Microsoft Academic (MA); for the initial phase of development we have reduced the database size by downselecting to papers containing the following MA topics: 'Chemistry', 'Computer Science', 'Engineering', 'Physics', 'Materials Science', 'Mathematics', 'Economics', 'Geology', 'Environmental Science'. For more information see here.
Topic modeling refers to machine learning algorithms that find collections of words (topics) that describe a corpus (collection of documents). The topic modeling is performed with either a Correlation Explanation (CorEx) topic model or a Gensim Latent Dirichlet Allocation (LDA) topic model.
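For the LDA case, a minimal Gensim sketch looks roughly like the following; the tokenized documents and the number of topics are placeholders, not the settings used for this report.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# 'docs' stands in for the preprocessed corpus: one list of tokens per abstract.
docs = [
    ["topic", "model", "corpus", "word"],
    ["languag", "process", "corpus", "model"],
]

dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit an LDA topic model; num_topics here is an illustrative choice.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=10, passes=5)

# Inspect the most probable words for each topic.
for topic_id, words in lda.show_topics(num_topics=10, num_words=8, formatted=False):
    print(topic_id, [word for word, _ in words])
```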
Below is a plot of several topics extracted from the corpus and their relative probabilities over time. The most probable words of each topic are shown above its figure. These words showcase the text processing that combines different variations of a word (e.g. 'strategy', 'strategic') into one root 'stem'. The probability is calculated for each year by summing the given topic's probability over every paper published that year, then normalizing so that the sum of all topic probabilities for that year is 1.
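A minimal sketch of that per-year normalization, assuming a document-topic probability matrix and an array of publication years (both are placeholder data below), might look like this:

```python
import numpy as np

# Placeholder inputs: 'doc_topic' is an (n_papers, n_topics) array of per-paper
# topic probabilities; 'years' holds each paper's publication year.
doc_topic = np.random.dirichlet(np.ones(5), size=100)
years = np.random.randint(2000, 2021, size=100)

yearly = {}
for year in np.unique(years):
    # Sum each topic's probability over all papers published that year...
    totals = doc_topic[years == year].sum(axis=0)
    # ...then normalize so the topic probabilities for that year sum to 1.
    yearly[year] = totals / totals.sum()
```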
The nodes of the network represent the topics of the topic model. Click them for more information.
Each node represents a topic and each edge between two topics represents the likelihood of the two topics appearing in an abstract together.
The layout and scale of the axes are determined by the networkx spring_layout algorithm. The algorithm tries to keep the n nodes a distance of 1/sqrt(n) apart, which ends up determining the scale of the axes. Since this scale is not particularly informative, the axis tick marks are omitted.
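A small sketch of computing such a layout with networkx; the example graph and its edge weights are illustrative only.

```python
import networkx as nx

# Tiny example topic graph with edge weights standing in for topic co-occurrence.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 0.8), (1, 2, 0.3), (0, 2, 0.5)])

# spring_layout defaults to an optimal node distance k = 1/sqrt(n).
pos = nx.spring_layout(G, weight="weight", seed=42)

# 'pos' maps each node to an (x, y) coordinate used for plotting; the absolute
# scale carries no meaning, which is why the tick marks are omitted.
```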
NODE COLOR: community assigned by the Louvain community detection algorithm.
NODE SIZE: overall probability of the topic appearing in the analyzed collection of texts as a whole.
NODE OPACITY: probability of the topic appearing in the analyzed collection of texts over the past 5 years (opacity = 0.5 + 0.5*recent_probability).
EDGE THICKNESS: logarithmically scaled values of the topic covariance matrix, related to how often the two topics appear together in a given paper.
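A rough sketch of how these visual attributes could be assembled into a networkx graph is shown below; the variable names (topic_prob, recent_prob, topic_cov) and the exact scaling are assumptions rather than the report's actual code.

```python
import math
import networkx as nx

# Placeholder inputs for three topics.
topic_prob = {0: 0.40, 1: 0.35, 2: 0.25}                 # overall topic probabilities
recent_prob = {0: 0.50, 1: 0.20, 2: 0.30}                # probabilities over the past 5 years
topic_cov = {(0, 1): 0.02, (1, 2): 0.005, (0, 2): 0.01}  # topic covariance entries

G = nx.Graph()
for topic, prob in topic_prob.items():
    G.add_node(
        topic,
        size=prob,                               # NODE SIZE
        opacity=0.5 + 0.5 * recent_prob[topic],  # NODE OPACITY
    )

min_cov = min(topic_cov.values())
for (i, j), cov in topic_cov.items():
    G.add_edge(
        i, j,
        weight=cov,                           # raw covariance, used for community detection
        width=1.0 + math.log(cov / min_cov),  # EDGE THICKNESS: logarithmically scaled
    )

# NODE COLOR: Louvain communities (networkx >= 2.8 ships an implementation).
for color, members in enumerate(nx.community.louvain_communities(G, weight="weight", seed=0)):
    for node in members:
        G.nodes[node]["color"] = color
```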