AWS Machine Learning Blog

Create a Word-Pronunciation sequence-to-sequence model using Amazon SageMaker

Amazon SageMaker seq2seq offers you a very simple way to make use of the state-of-the-art encoder-decoder architecture (including the attention mechanism) for your sequence-to-sequence tasks. You just need to prepare your sequence data in recordio-protobuf format and your vocabulary mapping files in JSON format, and then upload them to Amazon Simple Storage Service (Amazon S3). The Amazon SageMaker built-in algorithm handles building and training the deep learning (DL) architecture for you.

In this blog post, we walk you through our sample Amazon SageMaker Word-Pronunciation notebook to get you familiarized with SageMaker seq2seq. In the English language, there is an implicit but complicated rule for decoding the pronunciation of a word from its spelling. We are going to model this rule using sequences of letters of the alphabet as source input and their corresponding sequences of phonemes as target output. This kind of model is referred to as sequence to sequence (seq2seq).

The application of the seq2seq model is not limited to the relationship between word and pronunciation. Other use cases would be English sentences to German sentences, texts to summaries, texts to titles, questions to answers, and so on. This blog post will show you how to deploy your own sequence data and develop your own custom seq2seq model on Amazon SageMaker.

Amazon SageMaker seq2seq is based on the Sockeye package, a framework for seq2seq modeling with Apache MXNet. For more information, see this AWS Blog post. However, SageMaker seq2seq uses a different input data format and renames some hyperparameters so that they work effectively in Amazon SageMaker.

Data: Carnegie Mellon University Pronunciation Dictionary (cmudict-0.7b)

In this blog post, we use the Carnegie Mellon University Pronunciation Dictionary (cmudict-0.7b) data. A phonetic dictionary, which is a mapping of vocabulary words to sequences of phonemes, is important in a Text-to-Speech system. The original raw data can be found at the following link:

Dataset: http://www.speech.cs.cmu.edu/cgi-bin/cmudict

cmudict-0.7b is a pronunciation dictionary for North American English, which contains more than 134,000 word-pronunciation pairs. One of the advantages of using this dataset for our sequence-to-sequence task is that we can easily validate the results quantitatively and qualitatively.

A pair of word-pronunciation sequences looks like the following example.

Source: ACKNOWLEDGEMENT
Target: AE0 K N AA1 L IH0 JH M AH0 N T

In this example, the source word "Acknowledgement" is made up of 15 letters of the alphabet, while the corresponding target pronunciation consists of 11 phonemes. Some words in the dataset have much longer sequences of letters and phonemes than "Acknowledgement," while others are much shorter. The maximum lengths of the word and phoneme sequences in the dataset are 34 and 32, respectively.
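If you want to reproduce these statistics yourself, a minimal sketch along the following lines works. It assumes the raw file has been saved locally as cmudict-0.7b; the ";;;" comment lines, two-space separator, and Latin-1 encoding reflect the published cmudict-0.7b format.

import re

words, phonemes = [], []
with open("cmudict-0.7b", encoding="latin-1") as f:
    for line in f:
        if line.startswith(";;;"):                    # skip header/comment lines
            continue
        word, phones = line.rstrip().split("  ", 1)   # word and pronunciation are separated by two spaces
        word = re.sub(r"\(\d+\)$", "", word)          # drop alternate-pronunciation suffixes such as "WORD(1)"
        words.append(list(word))                      # sequence of letters
        phonemes.append(phones.split())               # sequence of phonemes

print("pairs:", len(words))
print("max word length:", max(len(w) for w in words))
print("max phoneme length:", max(len(p) for p in phonemes))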

In this blog post we show you how to train a seq2seq model using the cmudict-0.7b dataset so that the model can predict a sequence of phonemes out of a sequence of input letters.

The sample Amazon SageMaker Word Pronunciation Example can be found here:

Upload SageMaker-Seq2Seq-word-pronunciation.ipynb to your notebook instance, along with create_vocab_proto.py and record_pb2.py, because the notebook calls these two Python scripts.

Preprocess the dataset

The seq2seq model takes two types of sequence data: source sequences and their paired target sequences. In our example, the source and target are the word and its corresponding pronunciation. Each word is a sequence of letters of the alphabet and each pronunciation is a sequence of phonemes. The model also requires string-to-integer mappings that tokenize human-readable sequences into machine-readable sequences.

Amazon SageMaker seq2seq expects the following four input files: vocab.src.json, vocab.trg.json, train.rec, and val.rec. vocab.src.json and vocab.trg.json are the string-to-integer vocabulary mappings in JSON format. train.rec and val.rec are recordio-protobuf files that contain the tokenized source-target pairs for the training and validation data, respectively. We are going to generate these files below.

Vocabulary files

After downloading the dataset, tokenize the letters of the alphabet and the phonemes into numbers, as shown in the following example.

Source:['A', 'C', 'K', 'N', 'O', 'W', 'L', 'E', 'D', 'G', 'E', 'M', 'E', 'N', 'T']
Target:['AE0', 'K', 'N', 'AA1', 'L', 'IH0', 'JH', 'M', 'AH0', 'N', 'T']

Source:[14 34 61 64 66 88 62 38 36 49 38 63 38 64 78]
Target:[18 61 64 16 62 53 60 63 21 64 78]

The letter A is mapped to 14, C to 34, K to 61, N to 64, and so on. This string-to-integer mapping becomes the vocabulary used in the JSON files. In this example, we create a joint string-to-integer mapping for both source and target sequences (for example, K = 61 and N = 64), which is used to generate both vocab.src.json and vocab.trg.json. The following example demonstrates this:

vocab_dict = {
 '<pad>': 0,'<unk>': 1,'<s>': 2,'</s>': 3,'0': 4,'1': 5,'2': 6,
 '3': 7,'4': 8,'5': 9,'6': 10,'7': 11,'8': 12,'9': 13,'A': 14,
 'AA0': 15,'AA1': 16,'AA2': 17,'AE0': 18,'AE1': 19,'AE2': 20,
 'AH0': 21,'AH1': 22,'AH2': 23,'AO0': 24,'AO1': 25,'AO2': 26,
 'AW0': 27,'AW1': 28,'AW2': 29,'AY0': 30,'AY1': 31,'AY2': 32,
 'B': 33,'C': 34,'CH': 35,'D': 36,'DH': 37,'E': 38,'EH0': 39,
 'EH1': 40,'EH2': 41,'ER0': 42,'ER1': 43,'ER2': 44,'EY0': 45,
 'EY1': 46,'EY2': 47,'F': 48,'G': 49,'H': 50,'HH': 51,'I': 52,
 'IH0': 53,'IH1': 54,'IH2': 55,'IY0': 56,'IY1': 57,'IY2': 58,
 'J': 59,'JH': 60,'K': 61,'L': 62,'M': 63,'N': 64,'NG': 65,
 'O': 66,'OW0': 67,'OW1': 68,'OW2': 69,'OY0': 70,'OY1': 71,
 'OY2': 72,'P': 73,'Q': 74,'R': 75,'S': 76,'SH': 77,'T': 78,
 'TH': 79,'U': 80,'UH0': 81,'UH1': 82,'UH2': 83,'UW0': 84,
 'UW1': 85,'UW2': 86,'V': 87,'W': 88,'X': 89,'Y': 90,'Z': 91,
 'ZH': 92
}

In this example, 93 tokens (letters, digits, phonemes, and four special symbols) are mapped. You may notice that the first four indices are reserved for the special symbols (<pad> : padding, <unk> : unknown token, <s> : beginning of sequence, and </s> : end of sequence). Amazon SageMaker seq2seq expects this input format, so our own vocabulary mapping must start at index 4.

PAD_SYMBOL = "<pad>" #0
UNK_SYMBOL = "<unk>" #1
BOS_SYMBOL = "<s>" #2
EOS_SYMBOL = "</s>" #3

VOCAB_SYMBOLS = [PAD_SYMBOL, UNK_SYMBOL, BOS_SYMBOL, EOS_SYMBOL]
vocab_dict[PAD_SYMBOL] = 0
vocab_dict[UNK_SYMBOL] = 1
vocab_dict[BOS_SYMBOL] = 2
vocab_dict[EOS_SYMBOL] = 3
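
For reference, the remaining entries could be filled in along these lines. This is a hypothetical sketch rather than the notebook's exact code; all_tokens stands for the set of every letter, digit, and phoneme observed in the dataset.

# Hypothetical: all_tokens is the set of every letter, digit, and phoneme in the dataset.
for i, token in enumerate(sorted(all_tokens), start=len(VOCAB_SYMBOLS)):
    vocab_dict[token] = i  # data tokens start at index 4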

Here is how we generate source and target JSON vocabulary files.

import json
with open('vocab.src.json', 'w') as fp:
    json.dump(vocab_dict, fp, indent=4, ensure_ascii=False)
        
with open('vocab.trg.json', 'w') as fp:
    json.dump(vocab_dict, fp, indent=4, ensure_ascii=False)

Now we have vocab.src.json and vocab.trg.json.
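
As an optional sanity check (not part of the sample notebook), you can reload one of the files and confirm a few of the mappings shown earlier:

with open('vocab.src.json') as fp:
    vocab_check = json.load(fp)
print(vocab_check['A'], vocab_check['K'], vocab_check['N'])  # expected: 14 61 64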

Recordio-protobuf files

We are going to generate the recordio-protobuf files for the training and validation data. Recordio-protobuf is the standard data I/O format expected by Amazon SageMaker built-in algorithms. In the notebook, we provide a helper function called write_to_file for you.

file_type = 'train'
output_file = "train.rec"
write_to_file(trainX, trainY, file_type, output_file)
file_type = 'validation'
output_file = "val.rec"
write_to_file(valX, valY, file_type, output_file)

The write_to_file function takes stacks of source sequences (for example, trainX, a NumPy array) and their pairwise target sequences (for example, trainY, a NumPy array) as inputs, and then yields one output protobuf file (for example, train.rec). Stacks of sequences look like the examples that follow.

Stack of tokenized source sequences:
[array([36, 80, 36, 14, 75]) array([62, 52, 62, 14])
 array([88, 38, 52, 64, 49, 14, 75, 78]) array([… …]

Stack of tokenized target sequences:
[array([36, 85, 36, 42]) array([62, 57, 62, 21])
 array([88, 31, 65, 49, 15, 75, 78]) array([… …]
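
For reference, the following hedged sketch shows one way such tokenized stacks could be produced from the raw sequences by using vocab_dict. It is not the notebook's exact code, and the words and phonemes lists are assumed to come from the preprocessing step above.

import numpy as np

def tokenize(sequence, vocab):
    # Map each token to its integer ID, falling back to <unk> for unseen tokens.
    return np.array([vocab.get(token, vocab['<unk>']) for token in sequence])

sources = np.array([tokenize(w, vocab_dict) for w in words], dtype=object)
targets = np.array([tokenize(p, vocab_dict) for p in phonemes], dtype=object)

# A simple 90/10 split into training and validation stacks.
split = int(0.9 * len(sources))
trainX, valX = sources[:split], sources[split:]
trainY, valY = targets[:split], targets[split:]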

Recordio-protobuf compresses the data into a smaller size, which helps cut down the time required for data transfer. Our write_to_file helper function borrows several subfunctions from create_vocab_proto.py and record_pb2.py, which can be found at the following link:

After we have vocab.src.json, vocab.trg.json, train.rec, and val.rec, we are ready to upload them to Amazon S3 buckets.

Upload the four files to Amazon S3

Specify the S3 bucket and folder prefix. The Region of the S3 bucket should be the same Region as the notebook instance and the training and inference hosting instances. Name the folder prefix "seq2seq/word-pronunciation".

# S3 bucket and prefix
bucket = '<your_s3_bucket_name_here>'
prefix = 'seq2seq/word-pronunciation'  
# i.e.'<your_s3_bucket>/seq2seq/word-pronunciation'

Upload train.rec to the train folder, val.rec to the validation folder, and the two vocabulary files to the vocab folder.

import boto3

def upload_to_s3(bucket, prefix, channel, file):
    s3 = boto3.resource('s3')
    key = prefix + "/" + channel + '/' + file
    with open(file, "rb") as data:
        s3.Bucket(bucket).put_object(Key=key, Body=data)

upload_to_s3(bucket, prefix, 'train', 'train.rec') 
#/<your s3 bucket>/seq2seq/word-pronunciation/train/train.rec
upload_to_s3(bucket, prefix, 'validation', 'val.rec') 
#/<your s3 bucket>/seq2seq/word-pronunciation/validation/val.rec 
upload_to_s3(bucket, prefix, 'vocab', 'vocab.src.json') 
#/<your s3 bucket>/seq2seq/word-pronunciation/vocab/vocab.src.json
upload_to_s3(bucket, prefix, 'vocab', 'vocab.trg.json') 
#/<your s3 bucket>/seq2seq/word-pronunciation/vocab/vocab.trg.json

The following screenshot shows the training data file (train.rec) in Amazon S3.

The following screenshot shows the validation data file (val.rec) in Amazon S3.

The following screenshot shows the vocabulary data files (vocab.src.json and vocab.trg.json) in Amazon S3.

Training the Word Pronunciation model

Note: The training takes approximately two hours on an ml.p2.xlarge training instance.

First, we obtain the Docker container image for the built-in algorithm. This training image contains the encoder-decoder model architecture used for our sequence-to-sequence task. This is where the magic happens.

containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/seq2seq:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/seq2seq:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/seq2seq:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/seq2seq:latest'}
region_name = boto3.Session().region_name   # the AWS Region of this notebook instance
container = containers[region_name]
print('Using SageMaker Seq2Seq container: {} ({})'.format(container, region_name))

Then, we create a dictionary that passes all training parameters, such as file locations, instructions for the training instance, and other hyperparameters, to the training job.

job_name = 'seq2seq-wrd-phn-p2-xlarge-' + strftime("%Y-%m-%d-%H-%M", gmtime())
print("Training job", job_name)

create_training_params = \
{
    "AlgorithmSpecification": {
        "TrainingImage": container,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/{}/".format(bucket, prefix)
    },
    "ResourceConfig": {
        # Seq2Seq does not support multiple machines. Currently, it only supports single machine, multiple GPUs
        "InstanceCount": 1,
        "InstanceType": "ml.p2.xlarge", # We suggest one of ["ml.p2.16xlarge", "ml.p2.8xlarge", "ml.p2.xlarge"]
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        # Please refer to the documentation for complete list of parameters
        "max_seq_len_source": str(source_sequence_length),
        "max_seq_len_target": str(target_sequence_length),
        "optimized_metric": "bleu", 
        "batch_size": "64", # Please use a larger batch size (256 or 512) if using ml.p2.8xlarge or ml.p2.16xlarge
        "checkpoint_frequency_num_batches": "1000",
        "rnn_num_hidden": "512",
        "num_layers_encoder": "1",
        "num_layers_decoder": "1",
        "num_embed_source": "512",
        "num_embed_target": "512",
        "checkpoint_threshold": "3",
        #"max_num_batches": "2100"
        # Training will stop after 2100 iterations/batches.
        # This is just for demo purposes. Remove the above parameter if you want a better model.
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 48 * 3600
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/train/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
        },
        {
            "ChannelName": "vocab",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/vocab/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/validation/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
        }
    ]
}

sagemaker_client = boto3.Session().client(service_name='sagemaker')
sagemaker_client.create_training_job(**create_training_params) 

Inside the seq2seq model, there are two blocks: an encoder and a decoder. Each block consists of an embedding layer and one or more recurrent neural network (RNN) layers. You can find the corresponding instructions for those layers in the "HyperParameters" key covered earlier (that is, "rnn_num_hidden": "512", "num_layers_encoder": "1", "num_layers_decoder": "1", "num_embed_source": "512", "num_embed_target": "512"). In addition, the attention mechanism described as concat in the paper by Luong et al. is already implemented in the SageMaker seq2seq model by default: the default value of the parameter "rnn_attention_type" is mlp, which is referred to as concat in Luong et al. Likewise, the value bilinear is referred to as general in the paper.
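
For example, if you want to try the bilinear (general) attention instead of the default mlp (concat), you can set the corresponding hyperparameter before creating the training job. The snippet below is just an illustration of where the setting goes; refer to the documentation for the full list of valid values.

# Optional: switch the attention type from the default "mlp" (concat) to "bilinear" (general).
create_training_params["HyperParameters"]["rnn_attention_type"] = "bilinear"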

Look at "InputDataConfig". That is where the locations of all the input files are specified.

After the training is initiated, monitor the training status by executing the following command.

### Please keep on checking the status until this says "Completed". ###

status = sagemaker_client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print(status)
# if the job failed, determine why
if status == 'Failed':
    message = sagemaker_client.describe_training_job(TrainingJobName=job_name)['FailureReason']
    print('Training failed with the following error: {}'.format(message))
    raise Exception('Training job failed') 

If it returns InProgress, the training is still ongoing. Training takes approximately two hours on an ml.p2.xlarge instance with the parameters given earlier. After it returns Completed, the training has finished successfully, and you will find an output model artifact (model.tar.gz) in your Amazon S3 bucket.
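
Rather than rerunning the status check yourself, you can optionally let a boto3 waiter block until the job reaches a terminal state (this is not part of the sample notebook, and requires a boto3 version that includes the SageMaker waiters):

# Polls describe_training_job until the job completes, stops, or fails.
sagemaker_client.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)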

Here are the contents of model.tar.gz.

symbol.json is the model structure file, whereas params.best holds the model weights. decode.source and decode.target are text files in which tokenized source and target sequences are listed line by line; they are constructed from the validation dataset. You can also choose the sample size of decode.source and decode.target by using the "bleu_sample_size" parameter. The series of decode.output.XXXXX files keeps track of the prediction outputs at every checkpoint. The metrics files also keep track of the changes in training and validation accuracy.
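
If you want to inspect these files locally, a hedged sketch like the following downloads and lists the artifact. It assumes the default SageMaker output layout of {prefix}/{job_name}/output/model.tar.gz under the S3OutputPath configured earlier; the exact path is also returned by describe_training_job, as shown in the next section.

import tarfile

s3 = boto3.resource('s3')
artifact_key = '{}/{}/output/model.tar.gz'.format(prefix, job_name)
s3.Bucket(bucket).download_file(artifact_key, 'model.tar.gz')

with tarfile.open('model.tar.gz') as tar:
    print(tar.getnames())   # e.g., symbol.json, params.best, decode.source, decode.target, ...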

Hosting inference

We now use the trained model to perform inference. There are three steps. First, we create an Amazon SageMaker model from the training output.

%%time

sage = boto3.client('sagemaker')

info = sage.describe_training_job(TrainingJobName=job_name)
model_name=job_name
model_data = info['ModelArtifacts']['S3ModelArtifacts']

print(model_name)
print(model_data)

primary_container = {
    'Image': container,
    'ModelDataUrl': model_data
}

create_model_response = sage.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

Then, we create the endpoint configuration. The endpoint configuration also contains information about the type and number of EC2 inference instances to use when hosting the model. Although we use the free-tier ml.m4.xlarge CPU instance for hosting in this example, we recommend the ml.p2.xlarge GPU instance, which is more stable and can process requests much faster.

from time import gmtime, strftime

endpoint_config_name = 'Seq2SeqEndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_config_name)
create_endpoint_config_response = sage.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m4.xlarge', #####
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

Finally, we create the endpoint that can be validated and incorporated into production applications. It takes about 10-15 minutes initially to set up the endpoint. After the endpoint is set up, you can use it for subsequent requests without any additional delay.

%%time
import time

endpoint_name = 'Seq2SeqEndpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)
create_endpoint_response = sage.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = sage.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

# wait until the status has changed
sage.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)

# print the status of the endpoint
endpoint_response = sage.describe_endpoint(EndpointName=endpoint_name)
status = endpoint_response['EndpointStatus']
print('Endpoint creation ended with EndpointStatus = {}'.format(status))

if status != 'InService':
    raise Exception('Endpoint creation failed.')

Here is how to perform inference. We know that the words "tapeworm" and "tapdance" are not listed in the original cmudict-0.7b dataset, and needless to say, neither are the words "supercalifragilistic" and "expialidocious".

words_infr = ["car",
        "cat",
        "tapeworm",
        "tapdance",
        "supercalifragilistic",
        "expialidocious"]

payload = {"instances" : []}
for word_infr in words_infr:
    
    payload["instances"].append({"data" : " ".join(list(word_infr.upper()))})

response = runtime.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='application/json', 
                                   Body=json.dumps(payload))

response = response["Body"].read().decode("utf-8")
response = json.loads(response)
print(response)
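
To pair each input word with its predicted pronunciation, you can iterate over the returned predictions. This snippet assumes the response has the form {"predictions": [{"target": ...}, ...]}, which is the format used by the SageMaker seq2seq sample notebooks; adjust the keys if your response differs.

for word_infr, pred in zip(words_infr, response.get('predictions', [])):
    print("'{}': '{}'".format(word_infr, pred.get('target', '')))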

Here is the inference result.

'car': 'K AA1 R'
'cat': 'K AE1 T'
'tapeworm': 'T EY1 P W ER2 M'
'tapdance': 'T AE1 P D AE2 N S'
'supercalifragilistic': 'S UW2 P ER0 K AE2 L AH0 F R AE1 JH AH0 L IH2 S T IH0 K' 
'expialidocious': 'EH2 K S P IY2 AH0 L AH0 D OW1 SH AH0 S'

Conclusion

Amazon SageMaker seq2seq is a supervised learning algorithm in which the input is a sequence of tokens and the output generated is another sequence of tokens. We walked through our end-to-end sample notebook: we trained a word-pronunciation model using the cmudict-0.7b dataset and hosted it for inference. You may want to try a different set of hyperparameters to examine the changes in prediction accuracy. Or, if you are looking for a plug-and-play solution, check out Amazon Polly, an API-driven service that enables you to quickly integrate speech into your applications. Amazon Polly supports lexicons that allow you to easily customize the pronunciation of words.

Now you can develop your own custom seq2seq model by bringing in your own paired sequence data (for example, text summarization, question-answer modeling, title generation from text, machine translation, and so on). You are now able to generate train.rec, val.rec, vocab.src.json, and vocab.trg.json, upload them to Amazon S3, and run the training. Amazon SageMaker already provides a sample seq2seq notebook that can help you build an English-German machine translation model based on language data provided by the Machine Translation Group at UEDIN. We encourage you to take a look at that sample notebook as well. Or try Amazon Translate, a neural machine translation service that is currently in Preview.


Additional Reading

This related post provides an overview of NMT and then shows how to use Sockeye to train a minimal NMT model with attention.


About the Authors

Tatsuya Arai, Ph.D., is a biomedical engineer turned deep learning data scientist on the Amazon Machine Learning Solutions Lab team. He believes in the true democratization of AI and that the power of AI shouldn't be exclusive to computer scientists or mathematicians.

Orchid Majumder is a Software Engineer on the SageMaker Algorithms Team. Previously he worked as a Software Engineer on the Amazon India Seller Experience Team. In his leisure time, he likes reading fiction or watching cricket.

Saurabh Gupta is an Applied Scientist with AWS Deep Learning. He earned his MS in AI and Machine Learning from UC San Diego. He is currently working on building Natural Language Processing algorithms for Amazon SageMaker.

David Arpin is AWS’s AI Platforms Selection Leader and has a background in managing Data Science teams and Product Management.

Sunil Mallya is a Senior Solutions Architect on the AWS Deep Learning team. He helps our customers build machine learning and deep learning solutions to advance their businesses. In his spare time, he enjoys cooking, sailing, and building self-driving RC cars.