Adding Watson Speech-to-Text to your Android App

This post is about adding Watson Speech-to-Text to a native Android app. Speech-to-Text is available as a service on IBM Cloud (formerly Bluemix). You will integrate the service into our favourite chatbot, “The WatBOT”, using the Watson Developer Cloud Android SDK with minimal lines of code.

Why Watson Speech-to-Text?

The IBM® Speech to Text service provides an Application Programming Interface (API) that lets you add speech transcription capabilities to your applications. To transcribe the human voice accurately, the service leverages machine intelligence to combine information about grammar and language structure with knowledge of the composition of the audio signal. The service continuously returns and retroactively updates the transcription as more speech is heard.

The service’s “Overview for developers” documentation introduces the three interfaces it provides: a WebSocket interface, an HTTP REST interface, and an asynchronous HTTP interface (beta).

Input Features

  • Languages: Supports Brazilian Portuguese, French, Japanese, Mandarin Chinese, Modern Standard Arabic, Spanish, UK English, and US English.
  • Models: For most languages, supports both broadband (for audio that is sampled at a minimum rate of 16 kHz) and narrowband (for audio that is sampled at a minimum rate of 8 kHz) models.
  • Audio formats: Transcribes Free Lossless Audio Codec (FLAC), Linear 16-bit Pulse-Code Modulation (PCM), Waveform Audio File Format (WAV), Ogg format with the opus codec, mu-law (or u-law) audio data, or basic audio.
  • Audio transmission: Lets the client pass as much as 100 MB of audio to the service as a continuous stream of data chunks or as a one-shot delivery, passing all of the data at one time. With streaming, the service enforces various timeouts to preserve resources.
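As a rough illustration of the model and transmission constraints listed above, the sketch below picks a model family from the audio sampling rate and checks a payload against the 100 MB limit. The class and method names are hypothetical helpers, not part of the Watson SDK. The model identifiers assume the service's `<language>_BroadbandModel` / `<language>_NarrowbandModel` naming, and reading "100 MB" as 100 × 1024 × 1024 bytes is also an assumption.

```java
// Hypothetical helper for illustration only; not part of the Watson SDK.
final class SttInputCheck {
    // Minimum sampling rates taken from the feature list above.
    static final int BROADBAND_MIN_HZ = 16_000;
    static final int NARROWBAND_MIN_HZ = 8_000;
    // Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
    static final long MAX_AUDIO_BYTES = 100L * 1024 * 1024;

    // Returns a model name such as "en-US_BroadbandModel".
    // Note: not every language offers both model families.
    static String chooseModel(String language, int sampleRateHz) {
        if (sampleRateHz >= BROADBAND_MIN_HZ) {
            return language + "_BroadbandModel";
        }
        if (sampleRateHz >= NARROWBAND_MIN_HZ) {
            return language + "_NarrowbandModel";
        }
        throw new IllegalArgumentException(
                "Sampling rate below 8 kHz: " + sampleRateHz);
    }

    // True if the audio is small enough to pass to the service.
    static boolean fitsServiceLimit(long audioBytes) {
        return audioBytes >= 0 && audioBytes <= MAX_AUDIO_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(chooseModel("en-US", 16_000)); // en-US_BroadbandModel
        System.out.println(chooseModel("en-US", 8_000));  // en-US_NarrowbandModel
        System.out.println(fitsServiceLimit(5L * 1024 * 1024)); // true
    }
}
```

Audio recorded from a phone microphone is typically sampled at 16 kHz or higher, so a chatbot like WatBOT would usually land on the broadband model.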

Output Features

  • Speaker labels (beta): Recognizes different speakers from narrowband audio in US English, Spanish, or Japanese. This feature provides a transcription that labels each speaker’s contributions to a multi-participant conversation.
  • Keyword spotting (beta): Identifies spoken phrases from the audio that match specified keyword strings with a user-defined level of confidence. This feature is especially useful when individual words or topics from the input are more important than the full transcription. For example, it can be used with a customer support system to determine how to route or categorize a customer request.
  • Word alternatives (beta), confidence, and timestamps: Reports alternative words that are acoustically similar to the words that it transcribes, confidence levels for each of the words that it transcribes, and timestamps for the start and end of each word.
  • Maximum alternatives and interim results: Returns alternative and interim transcription results. The former provide different possible hypotheses; the latter represent interim hypotheses as the transcription progresses. In both cases, the service indicates final results in which it has the greatest confidence.
  • Profanity filtering: Censors profanity from US English transcriptions by default. You can use the filtering to sanitize the service’s output.
  • Smart formatting (beta): Converts dates, times, numbers, phone numbers, and currency values in final transcripts of US English audio into more readable, conventional forms.
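To make the output features more concrete, here is a minimal sketch of how several of them might be switched on as query parameters of an HTTP recognition request. The parameter names (`keywords`, `keywords_threshold`, `max_alternatives`, `timestamps`, `word_confidence`, `smart_formatting`) are given as I understand the service's HTTP interface and should be checked against the current API reference. The class, the method names, and the sample keywords are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Watson SDK.
final class SttQuerySketch {

    // Joins the parameters into a URL-encoded query string, in insertion order.
    static String buildQuery(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(ex);
        }
    }

    // Assumed parameter names; the keyword values are made up for the example.
    static Map<String, String> exampleOptions() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("keywords", "refund,delivery");  // keyword spotting
        p.put("keywords_threshold", "0.5");    // user-defined confidence level
        p.put("max_alternatives", "3");        // alternative hypotheses
        p.put("timestamps", "true");           // start/end time of each word
        p.put("word_confidence", "true");      // per-word confidence
        p.put("smart_formatting", "true");     // readable dates, numbers, etc.
        return p;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(exampleOptions()));
    }
}
```

When you use the Android SDK instead of raw HTTP, the same options are normally set through the SDK's recognition options object rather than assembled by hand.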

Integrating STT into an existing Android app

  • Add the RECORD_AUDIO permission to AndroidManifest.xml

  • Open build.gradle (app) and add the Watson SDK entries under dependencies

  • Add an image (mic) as an asset under res/mipmap
  • Open res/layout/content_chat_room.xml and add a button that shows the mic image

  • Add the entries that request permission from the user to access the microphone and record audio

  • Add the speech-recording code to MainActivity.java, outside onCreate

  • Add the private helper methods to complete the integration
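The permission steps above can be sketched in plain Java. This is not the post's original code: it is a minimal, framework-free helper that mirrors the check an activity would typically perform inside `onRequestPermissionsResult` after calling `ActivityCompat.requestPermissions`. The class name, method name, and request code are hypothetical. The two constants copy the values of `Manifest.permission.RECORD_AUDIO` and `PackageManager.PERMISSION_GRANTED` so that the sketch runs outside Android.

```java
// Hypothetical helper for illustration only; mirrors Android constants
// so the logic can be run and tested without the Android framework.
final class RecordPermissionSketch {
    // Value of android.Manifest.permission.RECORD_AUDIO; the same string is
    // used in the <uses-permission> entry of the manifest.
    static final String RECORD_AUDIO = "android.permission.RECORD_AUDIO";
    // Value of android.content.pm.PackageManager.PERMISSION_GRANTED.
    static final int PERMISSION_GRANTED = 0;
    // Hypothetical request code chosen by the app.
    static final int REQUEST_RECORD_AUDIO = 101;

    // Returns true only if this is our request and RECORD_AUDIO was granted.
    static boolean isRecordAudioGranted(int requestCode,
                                        String[] permissions,
                                        int[] grantResults) {
        if (requestCode != REQUEST_RECORD_AUDIO) {
            return false;
        }
        int n = Math.min(permissions.length, grantResults.length);
        for (int i = 0; i < n; i++) {
            if (RECORD_AUDIO.equals(permissions[i])) {
                return grantResults[i] == PERMISSION_GRANTED;
            }
        }
        // Empty arrays mean the request was cancelled by the user.
        return false;
    }

    public static void main(String[] args) {
        boolean granted = isRecordAudioGranted(
                REQUEST_RECORD_AUDIO,
                new String[] {RECORD_AUDIO},
                new int[] {PERMISSION_GRANTED});
        System.out.println(granted); // true
    }
}
```

In the app itself, the mic button's click handler would only start recording once a check like this succeeds; otherwise it would request the permission again or explain to the user why it is needed.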

To see exactly where each of the above entries goes, refer to the complete MainActivity.java.

Also, check how to integrate Text-to-Speech here


Vidyasagar Machupalli

Polyglot & Pragmatic Programmer • Developer Advocate, IBM • Microsoft MVP • Intel Software Innovator • DZone MVB