Multi Token Completion

This dataset provides masked sentences together with the multi-token phrases that were masked out of them. We offer 3 datasets: a general-purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from PubMed abstracts. Please be aware that the PubMed data does not reflect the most current or accurate data available from NLM, because it is not being updated.

For these datasets, the columns provided for each datapoint are as follows:

text: the original sentence
span: the span (phrase) that is masked out
span_lower: the lowercase version of span
range: the range in the text string that will be masked out (this is important because span might appear more than once in text)
freq: the corpus frequency of span_lower
masked_text: the masked version of text, in which span is replaced with [MASK]

Additionally, we provide a small (3K) dataset with human annotations.
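To illustrate why the range column matters when span occurs more than once in text, here is a minimal Python sketch. It assumes that range holds a (start, end) pair of character offsets with an exclusive end; the listing does not state the stored format, so that layout, the mask_span helper, and the example sentence are all hypothetical.

```python
def mask_span(text, span_range, mask_token="[MASK]"):
    """Replace the characters in span_range with mask_token.

    Assumption: span_range is (start, end) in character offsets, end exclusive.
    The actual encoding of the dataset's range column may differ.
    """
    start, end = span_range
    return text[:start] + mask_token + text[end:]


# Hypothetical example row: the span appears twice in the sentence.
text = "New York is north of New York Harbor."
span = "New York"

# Locate the second occurrence and build its (start, end) range.
second_start = text.index(span, 1)
rng = (second_start, second_start + len(span))

# Masking by range targets exactly the intended occurrence.
masked = mask_span(text, rng)

# Masking by string search alone would hit the first occurrence instead.
naive = text.replace(span, "[MASK]", 1)

print(masked)
print(naive)
```

Under these assumptions, masking by range replaces the second occurrence, while a plain string replacement masks the first one, which is why masked_text cannot be reconstructed reliably from text and span alone.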

Overview

Features and programs

Open Data Sponsorship Program

This dataset is part of the Open Data Sponsorship Program, an AWS program that covers the cost of storage for publicly available high-value cloud-optimized datasets.

Pricing

This is a publicly available data set. No subscription is required.

Usage information

Delivery details

AWS Data Exchange (ADX)

AWS Data Exchange is a service that helps AWS customers easily share and manage data entitlements from other organizations at scale.

Open data resources

Available with or without an AWS account.

How to use: To access these resources, reference the Amazon Resource Name (ARN) using the AWS Command Line Interface (CLI).

Description: multi-token-completion Datasets
Resource type: S3 bucket
Amazon Resource Name (ARN): arn:aws:s3:::multi-token-completion
AWS region: us-east-1
AWS CLI access (No AWS account required): aws s3 ls --no-sign-request s3://multi-token-completion/

Resources

Vendor resources

View this dataset on Github

Support

License

Datasets are published under CC-NC-SA-3.0. Human evaluation is published under CC-SA-4.0.
