AWS Public Sector Blog

The COVID-19 infodemic: How Novetta uses machine learning to analyze unproven narratives on social media

The COVID-19 pandemic is driving a parallel “infodemic”: the rapid spread of competing and often harmful narratives about the virus. Social media plays a central role in this infodemic, serving as a forum for the spread and evolution of theories and beliefs with origins in broadcast, print, online news, blogs, and other digital arenas.

The ability of decision-makers to understand the role of social media in spreading ideas and beliefs is limited by the scale of activity on these platforms, where users produce millions of posts per day. To navigate this tsunami of information, analysts supporting time-sensitive missions have often had to rely on coarse, limited analytic approaches. Assessments that require accurate, granular analysis, by contrast, demand labor-intensive methods. Either way, decision-makers are left without the information they need in fast-moving situations.

As the COVID-19 infodemic grew, Novetta used Amazon Web Services (AWS) to help produce a capability that overcomes this tradeoff between accuracy and speed in social media analysis. The result, dubbed Rapid Narrative Analysis (RNA), achieves accuracy by using human expertise at critical stages of analysis, and uses machine learning (ML) models to diagnose the severity of the spread of key narratives at the speed needed to take effective action.

Using RNA, we compared the severity of three prominent, unproven assertions within the Twitter discussion of COVID-19:

  1. COVID-19 is a biological weapon
  2. 5G is responsible for COVID-19
  3. Unproven remedies cure COVID-19

To do this, RNA compared the size of the online communities (i.e., the number of users) expressing belief or disbelief in each assertion, and then measured the rate of growth of these communities over time.
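As a rough illustration of the two measurements described above, the sketch below counts distinct users per community and computes a period-over-period growth rate. It is not Novetta's implementation: the record layout, the sample data, the integer week numbers, and the function names `community_sizes` and `growth_rate` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records: (user_id, week, narrative, stance). Illustrative data only.
posts = [
    ("u1", 1, "bioweapon", "belief"),
    ("u2", 1, "bioweapon", "belief"),
    ("u2", 1, "bioweapon", "belief"),     # same user posting twice counts once
    ("u3", 1, "bioweapon", "disbelief"),
    ("u1", 2, "bioweapon", "belief"),
    ("u4", 2, "bioweapon", "belief"),
    ("u5", 2, "bioweapon", "belief"),
]


def community_sizes(records):
    """Return the number of distinct users per (narrative, stance, week)."""
    members = defaultdict(set)
    for user, week, narrative, stance in records:
        members[(narrative, stance, week)].add(user)
    return {key: len(users) for key, users in members.items()}


def growth_rate(previous, current):
    """Relative change in community size between two periods (None if undefined)."""
    if previous == 0:
        return None
    return (current - previous) / previous


sizes = community_sizes(posts)
week1 = sizes[("bioweapon", "belief", 1)]
week2 = sizes[("bioweapon", "belief", 2)]
rate = growth_rate(week1, week2)
print(week1, week2, rate)
```

Counting users rather than posts keeps a handful of prolific accounts from inflating the apparent size of a community.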

How RNA harnesses the power of machine learning

Novetta’s RNA approach begins with creating a small amount of high-quality, human-processed data to train a collection of machine learning models. We determined that as few as 500 human-coded tweets (approximately eight hours of human effort) were enough training data for our models to maintain high accuracy while analyzing hundreds of thousands of tweets. By training on top of state-of-the-art language models, we achieved an F1 score of 85 percent despite the small training set. The F1 score is the harmonic mean of precision and recall, a standard measure of how well a classification model performs.
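For readers unfamiliar with the metric, here is a minimal sketch of how an F1 score is computed for a single class from human-coded labels and model predictions. The labels and sample data are invented for illustration, and the source does not state whether the reported score was computed per class or averaged across classes.

```python
def f1_score(y_true, y_pred, positive="belief"):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical human-coded labels and model predictions for six tweets.
human = ["belief", "belief", "belief", "disbelief", "disbelief", "belief"]
model = ["belief", "belief", "disbelief", "disbelief", "belief", "belief"]
score = f1_score(human, model)
print(score)
```

Because it balances precision against recall, F1 is more informative than plain accuracy when one stance is much rarer than the others.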

To human-code the data, RNA used Novetta’s Mission Analytics user interface to ingest and label a subset of tweets that were representative of the topics of interest. This training data was then used to train a model, which was applied to more than 700,000 tweets. The model ran on an Amazon Elastic Compute Cloud (Amazon EC2) GPU Spot Instance, launched from an Amazon Machine Image (AMI) and configured with AdaptNLP, Novetta’s open-source natural language processing (NLP) framework. To achieve these results at scale, Novetta’s Machine Learning Center of Excellence (ML-COE) used numerous AWS cloud computing services, including AWS Deep Learning AMIs, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon Elastic Container Registry (Amazon ECR).

The shortcomings of virality-only measures in social media analysis

While motivated by the need to address the COVID-19 infodemic, our approach made significant improvements to social media analysis generally. Specifically, Novetta’s approach moves beyond “virality,” a core concept underlying dominant approaches to social media analysis defined as “the tendency of an image, video, or piece of information to be circulated rapidly and widely from one Internet user to another.”

The concept of virality has often led to a myopic focus in social media analytics on discrete markers—such as hashtags, links to outside domains, or keywords. The measured prominence of certain sets of markers is used to infer the prominence of particular perspectives or beliefs that roughly correlate with those markers—essentially, the more that a marker is seen, the more prominent support for that perspective or belief is inferred to be.

However, this approach often fails to capture the complex and dynamic ways these markers are used, co-opted, and satirized in the public space. Users share links to stories and perspectives they both agree and disagree with, and trolls and bots use trending hashtags to hijack conversations and introduce entirely unrelated ideas.

Introducing virulence: measuring the “believability” of social media narratives

RNA moves beyond “virality” to an analytic framework more akin to that of “virulence” in medicine, which is concerned with “the severity or harmfulness of a disease or poison.” In social media analysis, this framework puts the emphasis on measuring how widely ideas are believed online by assessing the full language of users’ posts, not just how often specific markers appear.

Community Predictions

In a matter of hours, RNA collected and diagnosed more than 700,000 tweets to assess the virulence of each of the three target narratives. The discussion of the assertion that “COVID-19 is a biological weapon” had a significantly higher share of believers (59%) than the discussions of the other two assertions. This result is striking in comparison to the assertion that “5G is responsible for COVID-19,” which was the only discussion with considerably more disbelieving participants (45%) than believers (35%).
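The shares reported above are proportions of participants in each discussion. A minimal sketch of that arithmetic follows, assuming each user has already been assigned a single stance per narrative; the third `"neutral"` category, the sample users, and the function name `stance_shares` are assumptions for the example, not details from the analysis.

```python
from collections import Counter


def stance_shares(user_stances):
    """Return each stance's share of the users taking part in one discussion."""
    counts = Counter(user_stances.values())
    total = sum(counts.values())
    return {stance: count / total for stance, count in counts.items()}


# Hypothetical discussion with 20 users, one stance per user.
stances = {}
stances.update({f"believer_{i}": "belief" for i in range(7)})
stances.update({f"skeptic_{i}": "disbelief" for i in range(9)})
stances.update({f"other_{i}": "neutral" for i in range(4)})

shares = stance_shares(stances)
for stance, share in sorted(shares.items()):
    print(f"{stance}: {share:.0%}")
```

Believer and disbeliever shares need not sum to 100 percent, since some participants may express neither position.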

RNA also mapped hidden features in the development of the COVID-19 Twitter discussion over time. Such features are generally not revealed by traditional “hashtag” analysis. For the “COVID-19 is a biological weapon” discussion, the number of new users entering the conversation and expressing belief in the assertion continued to grow even after the total number of participants in the conversation began to decline.
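The pattern described above, new believers rising while overall participation falls, can be surfaced by tracking two series side by side. The sketch below is one possible way to do that under stated assumptions: the record layout, the integer day numbers, the sample data, and both function names are hypothetical.

```python
def new_believers_per_day(records):
    """Count users whose first post expressing belief falls on each day."""
    first_belief = {}
    for user, day, stance in sorted(records, key=lambda r: r[1]):
        if stance == "belief" and user not in first_belief:
            first_belief[user] = day
    counts = {}
    for day in first_belief.values():
        counts[day] = counts.get(day, 0) + 1
    return counts


def participants_per_day(records):
    """Count distinct users posting on each day, whatever their stance."""
    users = {}
    for user, day, _stance in records:
        users.setdefault(day, set()).add(user)
    return {day: len(members) for day, members in users.items()}


# Hypothetical records: (user_id, day, stance). Illustrative data only.
records = [
    ("u1", 1, "belief"), ("u2", 1, "disbelief"), ("u3", 1, "disbelief"),
    ("u4", 1, "disbelief"), ("u5", 1, "neutral"),
    ("u1", 2, "belief"), ("u6", 2, "belief"), ("u7", 2, "belief"),
    ("u2", 2, "disbelief"),
    ("u8", 3, "belief"), ("u9", 3, "belief"), ("u10", 3, "belief"),
]

new_believers = new_believers_per_day(records)
participants = participants_per_day(records)
for day in sorted(participants):
    print(day, participants[day], new_believers.get(day, 0))
```

In this toy data, total participation falls each day while the count of first-time believers rises, which a simple hashtag volume count would not reveal.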

Using RNA to improve public health

Our approach is designed to integrate into an organization’s information lifecycle, enabling more informed decisions to address harmful misinformation based on its virulence and the ability to evaluate the effectiveness of those decisions in near real-time.

Novetta is engaged with Africa CDC to apply the RNA process to Twitter and Facebook discussion of COVID-19 vaccines and related misinformation. This work will provide national governments across the African continent with early and ongoing assessments of dangerous narratives influencing opinion and behavior toward vaccines and related public health measures.

Watch the VentureBeat webinar on finding new ways to operate and respond to the pandemic with AI and ML. Read more about Novetta’s NMA solution in Novetta product owner David Cyprian’s VentureBeat interview. And check out machine learning on AWS.

Elliot Stewart


Elliot Stewart is an open source analyst at Novetta using digital media analytics to support defense, legal, and public health actors. Novetta delivers scalable advanced analytic and technical solutions to address challenges of national and global significance. Focused on mission success, Novetta pioneers disruptive technologies in machine learning, data analytics, full-spectrum cyber, open source analytics, cloud engineering, DevSecOps, and multi-INT analytics for defense, intelligence community, and federal law enforcement. Novetta is headquartered in McLean, VA with over 1,300 employees across the U.S.