Using topological text analysis for COVID-19 Open Research Challenge

My take on COVID-19 Kaggle challenge analysis of scientific white papers. This research is a first step to help specialist in virusology, pharmacy and microbiology to find answers to the problem

Edward Kibardin
March 24, 2020

On 12 March the White House and a coalition of leading research groups prepared the COVID-19 Open Research Dataset in response to the coronavirus pandemic. The dataset consists of over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. Kaggle page for this challenge can be found here:

Calling all subject matter experts

If you are a specialist in virusology, pharmacy or microbiology, we would love to give you FREE access to the below analysis on DataRefiner platform to help speed up scientific research and tackle the virus. Please contact us on

The dataset was created by the Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine — National Institutes of Health, in coordination with The White House Office of Science and Technology Policy

The challenge

The challenge have specified a number of questions for the community:

What is known about transmission, incubation, and environmental stability?
What do we know about natural history, transmission, and diagnostics for the virus? What have we learned about infection prevention and control?

What do we know about COVID-19 risk factors?
What have we learned from epidemiological studies?

What do we know about virus genetics, origin, and evolution?
What do we know about the virus origin and management measures at the human-animal interface?

What do we know about non-pharmaceutical interventions?
What is known about equity and barriers to compliance for non-pharmaceutical interventions?

What do we know about vaccines and therapeutics?
What has been published concerning research and development and evaluation efforts of vaccines and therapeutics?

What do we know about diagnostics and surveillance?
What has been published concerning systematic, holistic approach to diagnostics (from the public health surveillance perspective to being able to predict clinical outcomes)?

Data Preprocessing

We used both abstract and full text of the documents for this analysis. Often the documents in this dataset have only abstract or just full text. Merging them together helped to present the full picture.

Data quality dashboard from kaggle page for the challenge

Abstracts in scientific papers are normally quite complex and one way to analyse them is to split them into sentences. In this case each sentence is a small piece of information, which should be enough for automatic analysis.

This is an example of one of the abstracts:

‘The basic reproduction number of an infectious agent is the average number of infections one case can generate over the course of the infectious period, in a naïve, uninfected population. It is well-known that the estimation of this number may vary due to several methodological issues, including different assumptions and choice of parameters, utilized models, used datasets and estimation period. With the spreading of the novel coronavirus (2019-nCoV) infection, the reproduction number has been found to vary, reflecting the dynamics of transmission of the coronavirus outbreak as well as the case reporting rate. Due to significant variations in the control strategies, which have been changing over time, and thanks to the introduction of detection technologies that have been rapidly improved, enabling to shorten the time from infection/symptoms onset to diagnosis, leading to faster confirmation of the new coronavirus cases, our previous estimations on the transmission risk of the 2019-nCoV need to be revised.’

For our analysis we split it into sentences:

As you can see, each sentence is actually enough to express a piece of information needed for researchers to at least partially answer a question. In total we’ve got 291281 sentences — that is enough for analysis.

Analysis of sentences

All sentences were analysed and presented as a topological map below, each point on the map is one or more sentences, similar sentences are placed close to each other. Clusters are formed automatically and represent different themes in the text.

Topological map of all sentences from the dataset

As we trained the model, in addition to automatic clusterisation, we configured it to extract sentences with keywords like “origin”, “stability”, “evolution”, “risk” and “transmission”. We separated these clusters on the map manually using a semi-supervised approach.

Let’s have a look at the “origin” cluster and what knowledge we can extract from it.

Top keywords for “origin” cluster

These top keywords were automatically extracted from the sentences, we show top 20 here, the rest will be used to find top summary sentences that describe the cluster the best way.

Top sentences selected automatically for origin cluster

Top sentences describing the cluster ‘origin’ have been identified automatically too and are quite accurate.

Let’s take a look at another cluster, — this time the one, which was formed completely automatically, — “coronavirus sars / sars coronavirus / coronavirus mers”.

Top keywords for “coronavirus sars / sars coronavirus / coronavirus mers” cluster
User investigates top summary sentences for automatically discovered cluster “coronavirus sars / sars coronavirus / coronavirus mers”

Analysis of sentences related to “transmission” keyword as a separate map

In addition to analysis on a whole set of sentences, the system is able to process sentences related to a particular keyword, for example “transmission”. Below is a map of 8198 sentences related to the keyword “transmission”:

Map for “transmission” keyword

The dataset consists not only of coronavirus white papers but a large number of papers about other recent viruses such as “SARS” (Severe acute respiratory syndrome-related coronavirus), “MARS” (Middle East respiratory syndrome-related coronavirus), “TGEV” Transmissible gastroenteritis virus and some other. We have reviewed different clusters separately to understand all the key nuggets of information about transmission of this and similar diseases.

Description of “infectious disease / method use / use mathematical” cluster
Description of “transmissible gastroenteritis / cell / virus” cluster
Description of “disease outbreak / control effort / infectious disease” cluster
Description of “estimate probability / model / rate / transmission rate” cluster
Description of “control measure / reduce / human / sars” cluster
Description of “infectious disease / number contact” cluster

The system automatically identified the clusters of key words and segmented the dataset among them. The closer the clusters are, the more similar terms and topics they have. This helps to understand high-level grouping and to find meta-clusters in the data.


Topological segmentation of text helps researchers to systematise the knowledge in a vast amount of textual information available right now. While performing the segmentation and analysing each segment, the whole volume of data gets split into clusters, which simplifies the analysis and understanding of the data. The identified keywords and summary sentences help to find the most representative pieces of information and speed up the search for an answer.

Analysis for this article was performed using the DataRefiner platform. We’ve been using this technology in a number of companies not just for text analysis but also for user activity and IoT segmentation. Get in touch if you’d like to learn more or book a demo.

**** is a UK based company that specialises in analysis and segmentation of complex data like user activity, sensor or texts. The DataRefiner platform is the result of many years of refining approaches discussed here, but is applied across a wide variety of industries, including aviation, social networks, fraud detection and more. For more information regarding your industry, contact us on

AI for AI (artificial insemination) — Deep Topological Analysis for sensors
Find out how TDA and Deep Learning classifies events and discovers hidden patterns in IoT activity sensors’ data streams
March 10, 2020
Using topological text analysis for COVID-19 Open Research Challenge
My take on COVID-19 Kaggle challenge analysis of scientific white papers. This research is a first step to help specialist in virusology, pharmacy and microbiology to find answers to the problem
March 24, 2020
What 6.5 million of #coronavirus tweets reveal about people thoughts during the pandemic
We apply Topological Data Analysis and Deep Learning to a large volume of text data to reveal hidden patterns in discussions
June 15, 2020