Amazon Polly Update – Time-Driven Prosody and Asynchronous Synthesis

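Are you enjoying the Polly-powered audio on the latest posts on the AWS blog, including the recent DeepLens Challenge and Storage Gateway Recap? Listening to a draft post as synthesized speech while preparing it for the blog is an effective way to get a feel for the flow of the text.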

Today AWS is announcing two new Amazon Polly features:

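Time-Driven Prosody – You can now specify the desired duration of the synthesized speech for part or all of the input text.
Asynchronous Synthesis – You can now process large blocks of text and store the synthesized speech in Amazon S3 with a single call.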

Both features are available today. Let's take a closer look.

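Time-Driven Prosody
Suppose you are producing a multilingual version of a video or a self-running presentation. You write the script, record the video in one language, and then use Amazon Translate and Amazon Polly to create audio tracks in the other languages. To keep each language in sync with the visual content, you need fine-grained control over the duration of each segment. That is where this new feature comes in. You can now specify the maximum desired duration for any segment, and Polly adjusts the speech rate to limit the length of each segment.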

Using Amazon Polly's Joanna voice with no other options, the audio for the preceding paragraph runs for 19 seconds:

<speak>
  In order to keep each language in sync with the visual content, 
  you need to exercise fine-grained control over the duration of
  each segment. That's where this new feature comes in. You can 
  now specify the maximum desired duration for any desired segments, 
  counting on Polly to adjust the speech rate in order to limit 
  the length of each segment.
</speak>

You can use the <prosody> tag with the amazon:max-duration attribute to limit the length to 15 seconds:

<speak>
  <prosody amazon:max-duration="15s">
    In order to keep each language in sync with the visual content, 
    you need to exercise fine-grained control over the duration of
    each segment. That's where this new feature comes in. You can 
    now specify the maximum desired duration for any desired segments, 
    counting on Polly to adjust the speech rate in order to limit 
    the length of each segment.
  </prosody>
</speak>

Using multiple <prosody> tags gives you finer control over timing:

<speak>
  <prosody amazon:max-duration="10s">
    In order to keep each language in sync with the visual content, 
    you need to exercise fine-grained control over the duration of
    each segment. 
  </prosody>
  <prosody amazon:max-duration="7s">
    That's where this new feature comes in. You can now specify 
    the maximum desired duration for any desired segments, 
    counting on Polly to adjust the speech rate in order to limit 
    the length of each segment.
  </prosody>
</speak>
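
When there are many segments, writing this markup by hand gets tedious. The following is a minimal Python sketch, not part of the Polly tooling, that assembles the same structure from (text, seconds) pairs. The function name timed_ssml is hypothetical, and the sketch assumes whole-second limits written in the "10s" form used above.

```python
from xml.sax.saxutils import escape


def timed_ssml(segments):
    """Build an SSML string from (text, max_seconds) pairs.

    Each pair becomes one <prosody amazon:max-duration="Ns"> element.
    Hypothetical helper for illustration only.
    """
    parts = ["<speak>"]
    for text, max_seconds in segments:
        if not isinstance(max_seconds, int) or max_seconds <= 0:
            raise ValueError("max_seconds must be a positive whole number")
        parts.append(f'  <prosody amazon:max-duration="{max_seconds}s">')
        # Escape &, < and > so plain text cannot break the markup.
        parts.append(f"    {escape(text)}")
        parts.append("  </prosody>")
    parts.append("</speak>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(timed_ssml([
        ("In order to keep each language in sync with the visual content, "
         "you need to exercise fine-grained control over the duration of "
         "each segment.", 10),
        ("That's where this new feature comes in.", 7),
    ]))
```

Because the segments are emitted side by side and never inside one another, the output cannot contain nested time-limited tags.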

The Spanish version of this English text (courtesy of Amazon Translate) is slightly longer and, under the same 15-second limit, is spoken noticeably faster:

<speak>
  <prosody amazon:max-duration="15s">
    Para mantener cada idioma sincronizado con el contenido
    visual, es necesario ejercer un control detallado sobre
    la duración de cada segmento. Ahí es donde entra esta 
    nueva característica. Ahora puede especificar la 
    duración máxima deseada para los segmentos deseados, 
    contando con que Polly ajuste la velocidad de voz para 
    limitar la longitud de cada segmento.
  </prosody>
</speak>

The text in each time-limited <prosody> tag can be up to 1500 characters long, and nesting is not allowed (inner tags are ignored). To keep the audio comprehensible, Polly speeds it up by a factor of at most 5.
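
These limits are easy to check before sending a request. The sketch below is a hypothetical pre-flight check, not a Polly API: it assumes you already know a segment's unconstrained duration (for example, by synthesizing it once without a limit) and computes the speed-up that a given maximum duration would require. For the example above, fitting 19 seconds of speech into 15 seconds needs a factor of 19 / 15, or about 1.27, well under the ceiling of 5.

```python
MAX_CHARS = 1500    # per time-limited <prosody> tag, as stated above
MAX_SPEEDUP = 5.0   # largest speed-up factor Polly applies, as stated above


def required_speedup(natural_seconds, max_seconds):
    """Factor by which speech must be sped up to fit in max_seconds."""
    return natural_seconds / max_seconds


def check_segment(text, natural_seconds, max_seconds):
    """Return a list of problems; an empty list means the segment looks fine.

    natural_seconds is an assumption supplied by the caller: the duration
    of the segment when synthesized with no time limit.
    """
    problems = []
    if len(text) > MAX_CHARS:
        problems.append(f"text is {len(text)} characters; limit is {MAX_CHARS}")
    factor = required_speedup(natural_seconds, max_seconds)
    if factor > MAX_SPEEDUP:
        problems.append(
            f"needs a {factor:.2f}x speed-up; ceiling is {MAX_SPEEDUP:g}x"
        )
    return problems


if __name__ == "__main__":
    print(round(required_speedup(19, 15), 2))  # 1.27
    print(check_segment("short text", natural_seconds=19, max_seconds=3))
```

A segment that fails the second check cannot be expected to fit the requested duration, so the practical fix is to shorten the text or relax the limit.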

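Asynchronous Synthesis
This feature makes it easier to use Polly to generate speech audio from long-form content such as articles and books, by letting you process up to 100,000 characters of text at a time through asynchronous requests. The synthesized speech is delivered to the S3 bucket of your choice, and failure notifications can optionally be routed to an Amazon Simple Notification Service (SNS) topic of your choice. The generated audio can be up to 6 hours long and is generally ready within minutes. In addition to the 100,000 characters of text, each request can include a further 100,000 characters of Speech Synthesis Markup Language (SSML) markup.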
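
Since text and markup are budgeted separately, it can help to measure both before submitting a long document. The sketch below is a rough estimate only: it treats everything inside angle brackets as markup and everything else as text. How Polly itself counts whitespace and character entities is not covered here, so treat the result as an approximation. The function names are hypothetical.

```python
import re

TEXT_LIMIT = 100_000     # characters of text per asynchronous request
MARKUP_LIMIT = 100_000   # additional characters of SSML markup

TAG = re.compile(r"<[^>]+>")  # anything that looks like an SSML tag


def count_chars(ssml):
    """Return (text_chars, markup_chars) for an SSML string.

    Approximation: tags count as markup, everything else as text.
    """
    text_chars = len(TAG.sub("", ssml))
    markup_chars = len(ssml) - text_chars
    return text_chars, markup_chars


def fits_async_limits(ssml):
    """True if the estimated counts are within both limits."""
    text_chars, markup_chars = count_chars(ssml)
    return text_chars <= TEXT_LIMIT and markup_chars <= MARKUP_LIMIT


if __name__ == "__main__":
    sample = "<speak>Hi</speak>"
    print(count_chars(sample))        # (2, 15)
    print(fits_async_limits(sample))  # True
```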

Each asynchronous request creates a new speech synthesis task. You can start and manage tasks from the Polly console, the CLI (start-speech-synthesis-task), or the API (StartSpeechSynthesisTask).
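
As an illustration of the API route, the following sketch builds the parameters for StartSpeechSynthesisTask and, in a separate function, passes them to the AWS SDK for Python (boto3). The bucket name, topic ARN, and helper names are placeholders chosen for this example, and start_task assumes that boto3 is installed and AWS credentials are configured.

```python
def build_task_request(ssml, bucket, voice_id="Joanna", sns_topic_arn=None):
    """Assemble keyword arguments for StartSpeechSynthesisTask.

    bucket and sns_topic_arn are caller-supplied placeholders.
    """
    params = {
        "Text": ssml,
        "TextType": "ssml",
        "VoiceId": voice_id,
        "OutputFormat": "mp3",
        "OutputS3BucketName": bucket,
    }
    if sns_topic_arn:
        # Optional: route task notifications to an SNS topic.
        params["SnsTopicArn"] = sns_topic_arn
    return params


def start_task(params):
    """Submit the task and return its ID (requires boto3 and credentials)."""
    import boto3  # third-party SDK, imported lazily so the rest runs without it

    polly = boto3.client("polly")
    response = polly.start_speech_synthesis_task(**params)
    return response["SynthesisTask"]["TaskId"]


if __name__ == "__main__":
    request = build_task_request(
        "<speak>Hello from a long document.</speak>",
        bucket="my-polly-output",  # hypothetical bucket name
    )
    print(sorted(request))
    # task_id = start_task(request)  # uncomment to call AWS for real
```

The returned task ID can then be passed to get_speech_synthesis_task to check on the task's progress.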

To test this feature, I made a plain-text version of an AWS book that is no longer in use, added a few SSML tags, and converted it to valid XML. I then opened the Polly console, clicked Text-to-Speech, pasted in the XML, and clicked Synthesize to S3.