Transcribe Podcast Serverlessly using Cloud Build and Cloud Speech-to-Text

Henry Suryawirawan included in gcp serverless

05-May-2020 1008 words 5 minutes

/images/2020/20200505-transcribe-podcast-serverlessly/featured-image.png

Have you listened to a great podcast lately and wished that the transcript was available? Find out how you can generate a podcast transcript using GCP services serverlessly.

_____

My Podcast Story

I joined the podcast craze late and only started enjoying podcasts about a year ago. I used to find it difficult to stay focused, or even awake, while listening to them. As someone who prefers to consume information by reading (and sometimes highlighting text), I struggled to learn from audio alone. When I learned about James Clear’s habit stacking technique, I applied it to my podcast listening by combining it with my running/walking exercise habit. As a result, podcasts have now become one of my favorite learning mediums. There are so many interesting podcasts available for free, covering a variety of topics that I can choose from depending on my mood and curiosity that day. However, I still face one challenge when listening to podcasts.

After listening to a good podcast, I sometimes want to review the content and make personal notes for my learning and future reference. Unfortunately, many good podcasts do not provide a transcript, which makes it difficult for me to continue learning without spending extra minutes or hours re-listening to the content. Apart from making the content easier to review, transcripts have other benefits, such as making the podcast accessible to hearing-impaired and non-native English listeners and making its content searchable.

Cloud Speech-to-Text Experiment

/images/gcp-icons/Cloud-Speech-to-Text.png — Cloud Speech-to-Text

Faced with this challenge, I spent some time one weekend experimenting with the GCP Cloud Speech-to-Text machine learning API to generate podcast transcripts automatically. I submitted an audio file through a gcloud command and received the generated transcript within minutes. The resulting transcript was not 100% accurate, but it was good enough for my purpose. In my experience, its accuracy varies a lot depending on the quality of the podcast audio, including the clarity of the speakers’ accents and the amount of background noise.

I generated the transcript by setting up a Google Compute Engine (GCE) instance manually and then executing the following steps:

1. Download the podcast episode from a URL.
2. Convert the audio encoding from MP3 to FLAC using FFmpeg.
3. Copy the FLAC file to a GCS bucket.
4. Submit the FLAC file to the Speech-to-Text recognition API.
5. Wait until the Speech-to-Text operation completes.
6. Once it completes, retrieve the generated transcription result in JSON format.
7. Use jq to merge the transcript texts from the JSON file into a TXT file.
8. Upload the transcript TXT file to the GCS bucket.
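
The steps above can be sketched as a single shell script. This is a minimal sketch, not necessarily the exact commands used on the instance: the `transcribe` function name and its argument order are hypothetical, `--format=json` on the describe call is added here so that jq always receives JSON, and the script assumes that wget, FFmpeg, jq, and the Cloud SDK are installed and authenticated.

```shell
#!/usr/bin/env bash
# Sketch of the manual flow; requires wget, ffmpeg, jq, gsutil, and gcloud.
set -euo pipefail

# Usage: transcribe AUDIO_URL GCS_FOLDER_PATH FILENAME
transcribe() {
  local audio_url="$1"
  local gcs_folder="${2%/}"   # drop a trailing slash, if any
  local filename="$3"
  local operation_id

  # 1. Download the podcast episode.
  wget -O "${filename}.mp3" "${audio_url}"

  # 2. Convert MP3 to single-channel FLAC.
  ffmpeg -i "${filename}.mp3" -ac 1 "${filename}.flac"

  # 3. Copy the FLAC file to the GCS bucket.
  gsutil cp "${filename}.flac" "${gcs_folder}/${filename}.flac"

  # 4. Start recognition asynchronously and keep the operation ID.
  operation_id="$(gcloud ml speech recognize-long-running \
    "${gcs_folder}/${filename}.flac" \
    --language-code=en-US --async --format="value(name)")"

  # 5. Wait until the operation completes.
  gcloud ml speech operations wait "${operation_id}"

  # 6. Retrieve the transcription result as JSON.
  gcloud ml speech operations describe "${operation_id}" --format=json \
    > "${filename}.json"

  # 7. Merge the transcript fragments into a text file.
  jq -r '.response.results[].alternatives[].transcript' \
    "${filename}.json" > "${filename}.txt"

  # 8. Upload the JSON and TXT results to the GCS bucket.
  gsutil cp "${filename}.json" "${filename}.txt" "${gcs_folder}/"
}

# Run only when all three arguments are supplied.
if [ "$#" -eq 3 ]; then
  transcribe "$@"
fi
```

Saved under a hypothetical name such as transcribe.sh, it would be invoked with the audio URL, the GCS folder path, and the output file name as its three arguments.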

Note

You can find the explanation on why you need to convert the audio encoding by reading this doc. You can also look at the other audio encoding alternatives supported by Cloud Speech-to-Text.

Moving to Cloud Build

/images/gcp-icons/Cloud-Build.png — Cloud Build

Not satisfied with doing it manually, I spent the next few iterations making the end-to-end process as simple to run as possible. Eventually, I settled on using Cloud Build to run the entire process serverlessly. Below, I outline how I use this approach to transcribe a podcast. You can find the source code in my “audio-transcriber-cloud-build” GitHub repository.

Note

I am using the first episode of the GCP Podcast ("We Got a Podcast!") as the example to transcribe. Please note that the GCP Podcast already provides a transcript for each episode, and it licenses all of its content under Creative Commons.

We can take the step-by-step process outlined above and translate the steps into a cloudbuild.yaml build configuration file.

steps:
  # 1. Download audio mp3 file
  - name: gcr.io/cloud-builders/wget
    args: ['-O', '${_FILENAME}.mp3', '${_AUDIO_URL}']

  # 2. Convert audio from mp3 to flac
  - name: linuxserver/ffmpeg
    args: ['-i', '${_FILENAME}.mp3', '-ac', '1', '${_FILENAME}.flac']

  # 3. Copy the flac file to GCS bucket
  - name: gcr.io/cloud-builders/gsutil
    args: ['cp', '${_FILENAME}.flac', '${_GCS_FOLDER_PATH}/${_FILENAME}.flac']

  # 4. Run speech-to-text API to transcribe the flac file from GCS bucket, save the operation ID to file
  - name: gcr.io/cloud-builders/gcloud
    entrypoint: /bin/bash
    args:
      - -c
      - gcloud ml speech recognize-long-running ${_GCS_FOLDER_PATH}/${_FILENAME}.flac --language-code=en-US --async --format="value(name)" > operation-id.txt

  # 5. Wait for the speech-to-text API operation to finish
  - name: gcr.io/cloud-builders/gcloud
    entrypoint: /bin/bash
    args:
      - -c
      - gcloud ml speech operations wait `cat operation-id.txt`

  # 6. Get the speech-to-text transcription result
  - name: gcr.io/cloud-builders/gcloud
    entrypoint: /bin/bash
    args:
      - -c
      - gcloud ml speech operations describe `cat operation-id.txt` > ${_FILENAME}.json

  # 7. Use jq to get the transcription text into txt file
  - name: stedolan/jq
    entrypoint: /bin/bash
    args:
      - -c
      - jq <${_FILENAME}.json '.response.results[].alternatives[].transcript' -r > ${_FILENAME}.txt

  # 8. Upload the transcription json and txt file to GCS bucket
  - name: gcr.io/cloud-builders/gsutil
    args: ['cp', '${_FILENAME}.json', '${_FILENAME}.txt', '${_GCS_FOLDER_PATH}/']

substitutions:
  _AUDIO_URL: https://eps-dot-gcppodcast.appspot.com/dl/Google.Cloud.Platform.Podcast.Episode.1.mp3
  _GCS_FOLDER_PATH: gs://bucket-name/gcp-podcast
  _FILENAME: episode1
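
To make step 7 more concrete, here is a small example of the jq filter applied to a hand-written JSON file. The sample only mimics the shape that the filter expects (a `response.results` array whose entries contain `alternatives`); the `merge_transcript` helper, the file names, and the transcript text are made up for illustration, and a real operation result contains more fields than shown here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# merge_transcript JSON_FILE TXT_FILE
# Writes every transcript fragment found in JSON_FILE to TXT_FILE, one per line.
merge_transcript() {
  jq -r '.response.results[].alternatives[].transcript' "$1" > "$2"
}

workdir="$(mktemp -d)"

# Hand-written sample that imitates the structure of the operation result.
cat > "${workdir}/sample.json" <<'EOF'
{
  "response": {
    "results": [
      {"alternatives": [{"transcript": "Hello and welcome to the podcast.", "confidence": 0.9}]},
      {"alternatives": [{"transcript": "Today we talk about transcripts.", "confidence": 0.8}]}
    ]
  }
}
EOF

if command -v jq >/dev/null 2>&1; then
  merge_transcript "${workdir}/sample.json" "${workdir}/sample.txt"
  cat "${workdir}/sample.txt"
else
  echo "jq is not installed; skipping the demo" >&2
fi
```

Each entry in `results` becomes one line of the text file, and the `-r` flag prints the raw strings without JSON quotes.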

After the cloudbuild.yaml is ready, we can then submit it to Cloud Build by running a single gcloud command.

gcloud builds submit .
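
Instead of editing the substitutions block for every episode, the defaults can also be overridden at submit time with the `--substitutions` flag of `gcloud builds submit`. The snippet below is a minimal sketch under a few assumptions: the `submit_transcription` helper is hypothetical, the default bucket path is the placeholder from the configuration above and has to be replaced with a bucket you own, and the values must not contain commas because the flag takes a comma-separated list.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Usage: submit_transcription AUDIO_URL FILENAME [GCS_FOLDER_PATH]
# Submits the build from the current directory, overriding the
# substitution defaults defined in cloudbuild.yaml.
submit_transcription() {
  local audio_url="$1"
  local filename="$2"
  local gcs_folder="${3:-gs://bucket-name/gcp-podcast}"   # placeholder bucket

  gcloud builds submit . \
    --config=cloudbuild.yaml \
    --substitutions="_AUDIO_URL=${audio_url},_FILENAME=${filename},_GCS_FOLDER_PATH=${gcs_folder}"
}

# Run only when at least the URL and the file name are supplied.
if [ "$#" -ge 2 ]; then
  submit_transcription "$@"
fi
```

With this helper, transcribing another episode only needs a different audio URL and file name on the command line, while cloudbuild.yaml stays unchanged.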

You may need to wait for some time until the build completes, depending on the duration of your audio file. Once the build completes, you can find the resulting transcript in the same GCS bucket folder as the source audio 🎉

/images/2020/20200505-transcribe-podcast-serverlessly/transcript-in-bucket.jpg — Transcript file in GCS bucket

Pricing

To understand the pricing details of the core resources used (Cloud Build, Cloud Speech-to-Text, and Cloud Storage), please refer to their respective pricing pages.

All of them are eligible for the Google Cloud Free Tier, which allows you to use resources for free up to specific limits.

Try it!

You can try running the same transcription process in your own GCP project by clicking the Open in Google Cloud Shell button above (also available from the GitHub repository) and then following the walkthrough tutorial. Do not forget to customize the substitution variables at the bottom of cloudbuild.yaml for the podcast or audio that you want to transcribe.

Please let me know how it goes for you, and share what you learn and any interesting experiences!
