Topic Modeling Report for "Satellite Imag(ing)"

This page is a programmatically generated report of the results of topic modeling algorithms applied to a collection of scientific literature. Expand the sections below for more information. The code can be found on GitHub. You can find other topic modeling reports here.

The dataset used in this work is the Semantic Scholar Open Research Corpus. The full dataset includes the paper metadata (title, abstract, citation info, etc.) for over 25 million papers, stored locally in a ~100 GB SQLite database. Topic modeling is performed on a subset of this data, obtained by finding papers that contain the regular expression 'satellite imag' in the title or abstract.
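The report doesn't show the subsetting query itself, but since 'satellite imag' is a fixed substring, a plain SQL LIKE filter is enough (SQLite has no built-in REGEXP function by default). Below is a minimal sketch; the table and column names ('papers', 'title', 'abstract') and the database filename are assumptions, not the report's actual schema:

```python
import sqlite3

# Hypothetical schema: 'papers', 'title', and 'abstract' are assumed
# names; the report does not specify the layout of the ~100 GB database.
conn = sqlite3.connect("semantic_scholar.sqlite")
rows = conn.execute(
    """
    SELECT title, abstract FROM papers
    WHERE title LIKE '%satellite imag%'
       OR abstract LIKE '%satellite imag%'
    """
).fetchall()
print(f"{len(rows)} papers mention 'satellite imag'")
```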

Left: the number of annual publications in the full Semantic Scholar Open Research Corpus database. Right: the percentage of papers containing the regular expression 'satellite imag' in the title or abstract.

Note: the 25 million papers mentioned above are a temporarily downselected version of the entire Semantic Scholar dataset. The dataset includes topic tags created by the now-defunct Microsoft Academic (MA); for the initial phase of development we have reduced the database size by downselecting to papers containing the following MA topics: 'Chemistry', 'Computer Science', 'Engineering', 'Physics', 'Materials Science', 'Mathematics', 'Economics', 'Geology', 'Environmental Science'. For more information see here.

Topic modeling refers to machine learning algorithms that find collections of words (topics) that describe a corpus (collection of documents). The topic modeling is performed with either a Correlation Explanation (CorEx) topic model or a Gensim Latent Dirichlet Allocation (LDA) topic model.
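As a concrete illustration of the LDA path, here is a minimal gensim sketch. The two toy abstracts and the whitespace tokenization stand in for the report's actual preprocessing pipeline (which includes stemming, as noted below):

```python
from gensim import corpora, models

# Hypothetical stand-in documents; the real corpus is the 'satellite
# imag' subset of abstracts described above.
abstracts = ["satellite imagery land cover classification",
             "deep learning for satellite image segmentation"]
docs = [a.lower().split() for a in abstracts]

dictionary = corpora.Dictionary(docs)            # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.show_topics(num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])       # most probable words per topic
```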

A schematic of a general topic modeling algorithm, where each word and document has some probability of being associated with a topic. One caveat is that for CorEx topic models, each word belongs to only one topic.

Below is a plot of some of the topics extracted from the corpus and their relative probabilities over time. The most probable words of each topic are shown above its figure. These words showcase the text processing that combines different variations of a word (e.g. 'strategy', 'strategic') into one root 'stem'. The probability is calculated for each year by summing the given topic's probability over each paper, then normalizing so that the sum of all topic probabilities for that year is 1.
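The report doesn't include the aggregation code, but the computation it describes is straightforward. A sketch, assuming a doc_topic matrix of per-paper topic probabilities and a years array of publication years (both hypothetical names):

```python
import numpy as np

def topic_share_by_year(doc_topic, years):
    """doc_topic: (n_docs, n_topics) per-paper topic probabilities.
    years: (n_docs,) publication year of each paper."""
    shares = {}
    for year in np.unique(years):
        totals = doc_topic[years == year].sum(axis=0)  # sum each topic over that year's papers
        shares[year] = totals / totals.sum()           # normalize: topic shares sum to 1 per year
    return shares
```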

The topics sorted by the largest positive slope over the last five years, i.e. trending topics.
The topics sorted by the largest negative slope over the last five years.
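The report doesn't say how the slope is estimated; a plausible sketch uses a least-squares linear fit over the last five years of each topic's normalized share (shares as returned by topic_share_by_year above):

```python
import numpy as np

def recent_slopes(shares, window=5):
    """Least-squares slope of each topic's share over the last `window` years."""
    years = sorted(shares)[-window:]
    series = np.array([shares[y] for y in years])  # (window, n_topics)
    return np.polyfit(years, series, deg=1)[0]     # first row of coefficients = slopes

# Largest positive slope first, i.e. trending topics:
# trending = np.argsort(recent_slopes(shares))[::-1]
```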

Interactive Topic Model Plot

The nodes of the network represent the topics of the topic model. Click them for more information.

Each node represents a topic, and each edge between two topics represents the likelihood of the two topics appearing together in an abstract. The layout and scale of the axes are determined by the networkx spring_layout algorithm. The algorithm tries to keep the n nodes a distance of 1/sqrt(n) apart, which ends up determining the scale of the axes. Since this scale is not particularly informative, the axis tick marks are omitted.
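For reference, a minimal spring_layout call on a toy graph; the 1/sqrt(n) spacing mentioned above is networkx's default when the k parameter is not given:

```python
import networkx as nx

# Toy three-node graph standing in for the topic network; in the real
# report the edge weights would come from the topic covariance matrix.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 0.8), (1, 2, 0.3), (0, 2, 0.1)])

pos = nx.spring_layout(G, seed=42)  # defaults to k = 1/sqrt(n) node spacing
print(pos)                          # node -> (x, y) on an arbitrary scale
```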

NODE COLOR: community assigned by the Louvain community detection algorithm.
NODE SIZE: overall probability of the topic appearing in the analyzed collection of texts as a whole.
NODE OPACITY: probability of the topic appearing in the analyzed collection of texts over the past 5 years (opacity = 0.5 + 0.5*recent_probability).
EDGE THICKNESS: logarithmically scaled values of the topic covariance matrix, related to how often the two topics show up together in a given paper (see the sketch below).
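Putting the four attributes together, here is a hedged sketch of how they could be computed. The input arrays are randomly generated stand-ins for the fitted model's outputs, the node-size scale factor is arbitrary, and louvain_communities requires networkx >= 2.8:

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Hypothetical stand-ins: overall_prob / recent_prob hold each topic's
# corpus-wide and last-5-years probability; topic_cov is the topic
# covariance matrix described above.
rng = np.random.default_rng(0)
n = 6
overall_prob = rng.dirichlet(np.ones(n))
recent_prob = rng.dirichlet(np.ones(n))
topic_cov = rng.random((n, n))

G = nx.complete_graph(n)
communities = louvain_communities(G, seed=1)           # NODE COLOR
node_size = 2000 * overall_prob                        # NODE SIZE (arbitrary scale)
node_opacity = 0.5 + 0.5 * recent_prob                 # NODE OPACITY
edge_width = [np.log1p(abs(topic_cov[u, v]))           # EDGE THICKNESS (log scaled)
              for u, v in G.edges()]
```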