
Overview
This dataset is intended to support research on machine learning techniques for detecting malware. It includes metadata and EMBER-v2 features for approximately 10 million benign and 10 million malicious Portable Executable files, with disarmed but otherwise complete files for all malware samples. All samples are labeled using Sophos in-house labeling methods and have features extracted using the EMBER-v2 feature set, as well as metadata obtained via the pefile Python library, detection counts obtained via ReversingLabs telemetry, and additional behavioral tags that indicate the rough behavior of the samples.
Features and programs
Open Data Sponsorship Program
Pricing
This is a publicly available data set. No subscription is required.
Delivery details
AWS Data Exchange (ADX)
AWS Data Exchange is a service that helps AWS customers easily share and manage data entitlements from other organizations at scale.
Open data resources
Available with or without an AWS account.
- How to use: To access these resources, reference the Amazon Resource Name (ARN) using the AWS Command Line Interface (CLI).
- Description: Sophos/ReversingLabs 20 million sample dataset
- Resource type: S3 bucket
- Amazon Resource Name (ARN): arn:aws:s3:::sorel-20m/
- AWS region: us-west-2
- AWS CLI access (No AWS account required): aws s3 ls --no-sign-request s3://sorel-20m/
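The CLI access line above follows directly from the ARN: the `arn:aws:s3:::` prefix is replaced by `s3://`, and `--no-sign-request` tells the AWS CLI to send the request without credentials. A minimal Python sketch of that mapping is shown here; the helper names `s3_uri_from_arn` and `anonymous_ls_command` are hypothetical and not part of any AWS tooling, and the sketch only builds the command rather than running it.

```python
import shlex
from typing import List

S3_ARN_PREFIX = "arn:aws:s3:::"


def s3_uri_from_arn(arn: str) -> str:
    """Turn an S3 ARN such as 'arn:aws:s3:::bucket/prefix' into 's3://bucket/prefix'."""
    if not arn.startswith(S3_ARN_PREFIX):
        raise ValueError(f"not an S3 ARN: {arn!r}")
    return "s3://" + arn[len(S3_ARN_PREFIX):]


def anonymous_ls_command(arn: str) -> List[str]:
    """Build the argument list for an unsigned `aws s3 ls` on the given ARN.

    Hypothetical helper for illustration; it does not contact AWS.
    """
    return ["aws", "s3", "ls", "--no-sign-request", s3_uri_from_arn(arn)]


# ARN taken from the listing above.
cmd = anonymous_ls_command("arn:aws:s3:::sorel-20m/")
print(shlex.join(cmd))

# To actually list the bucket (requires the AWS CLI to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting issues if it is later passed to `subprocess.run`.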
Resources
Vendor resources
Support
Contact
Managed By
Sophos AI
How to cite
Sophos/ReversingLabs 20 Million malware detection dataset was accessed on DATE from https://registry.opendata.aws/sorel-20m.
License
See the Terms of Use