Human Transcription in the Voice Recognition Software Age

The day that students long for and professional transcriptionists dread is the day voice recognition software understands us. For half of my life, I have thought that day was just around the corner. One Christmas in the early nineties, I went to Columbus, Ohio to visit my grandparents. On a slushy grey afternoon, we ventured out to the original Wendy’s and the Center of Science and Industry. As I kicked the snow off my boots, I noticed a floor-to-ceiling glass case in the center of the museum’s atrium. Incandescent lights illuminated a large, vanilla desktop computer. Children shoved one another to get close to a microphone anchored in front. When I got my turn, I spoke into the microphone: “Ashley is here from Plano, Texas for Christmas.” On the black computer screen, green letters slowly followed the blinking cursor: “Ashy is ear from plain oh Texas for Christmas.” Though the machine was clearly incapable of deciphering my every utterance, I was sure that someday soon I would be able to go to Circuit City and buy a computer that would type my homework for me. Someday, voice recognition software may be able to offer truly accurate transcription.

High school, college, and graduate school passed, and my technology became smaller, simpler, and smarter. My smartphone plays my favorite podcasts on demand, keeps me up to date with email and social networking, and even sends positive affirmations from my favorite self-help guru. So when I heard about new voice recognition apps, I couldn’t wait to try them out. I didn’t want to overreach, so I recorded part of Wolf Blitzer’s introduction to the GOP debate in Jacksonville, Florida. Wolf said:

“…Florida and the Hispanic Leadership Network will also ask questions. I’ll follow up and try to guide the discussion. Candidates, I will try to make sure each of you gets your fair share of questions. You’ll have one minute to answer, 30 seconds for follow-ups and rebuttals, and I’ll certainly make sure you get time to respond if you’re singled out for criticism.”

The voice recognition program recorded my audio and emailed me a transcript. The transcript it sent read:

“…Autotrader are you we’re still at work will also have questions I’ll follow up and running by the dozen chocolate shortage of you get your fair share of questions 130 seconds for followup models call servlet make sure you get find responder santa barbara craigslist alabama counties.”

For now, human transcriptionists still create the most accurate transcripts by far. Still, maybe I wasn’t holding my phone close enough to the TV, so I decided to try again with Mitt Romney’s statement on immigration from the same debate, in which he said:

“Well, you’ve just heard the last two speakers also indicate that they support the concept of self-deportation. It’s very simply this, which is for those who come into the country legally, they would be given an identification card that points out they’re able to work here and then you have an E-verify system that’s effective and efficient so employers can determine who is legally here and if employers hire someone without a card, or without checking to see if it’s been counterfeited, then those employers would be severely sanctioned.”

The voice recognition transcript read:

“Are you destroy the last 2 speakers also indicated they support pick up to self deprecation persepolis which is close to come into the country legally they would be given out of the cage card points other able to work you’re coming at you verify system is affected medication orders to determine to sleep earlier employers hire someone without a card checking to see whether to the counterfeit is the most employers abusive relationship did you do that people don’t come here illegally.”

I gave the software one last try with Newt Gingrich’s response on immigration. Newt said:

“Because, in the original conversations about deportation, the position I took, which he attacked pretty ferociously, was that grandmothers and grandfathers aren’t going to be successfully deported. We’re not — we as a nation are not going to walk into some family — and by the way, they’re going to end up in a church, which will declare them a sanctuary. We’re not going to walk in there and grab a grandmother out and then kick them out.”

The voice recognition transcript read:

“Because I surgeons about the borders charter wichita pretty ferocious was the grandmothers in grandfather’s on 26 s for the border or not really is a nation are not gonna walk in the sun valley 1 with the church westwood plaza los angeles wanna go to walk in there and grandma remind me to call but I think you are you.”

Needless to say, my new app made for some amusing cocktail party conversation, but it could not be trusted with transcription. I wondered why my navigation app and automated airline responders could understand my speech when these transcripts were so terrible. I found out that phone-answering voice recognition systems have very limited vocabularies and are thus more likely to identify a request correctly. For instance, a voice recognition system that answers phones could be programmed to recognize only the digits zero through nine (for extension numbers), “agent,” “book a flight,” “existing reservation,” and “operator.” Voice recognition applications with larger vocabularies tend to be accurate about 85 percent of the time when used by experts. Maybe Wolf Blitzer, Mitt Romney, and Newt Gingrich were not speaking American English at an expert level in the Jacksonville debate.
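
To see why a tiny vocabulary helps, here is a minimal Python sketch. It is only an illustration and not how any real phone system or app works: it compares text rather than audio, using the standard library's `difflib.get_close_matches`. The command list comes from the examples above; the function name `match_command` and the 0.6 cutoff are assumptions made for this sketch. With only four phrases to choose from, even a badly garbled guess usually lands on the right one.

```python
from difflib import get_close_matches

# Hypothetical menu for a phone-answering system, taken from the examples above.
COMMANDS = ["agent", "book a flight", "existing reservation", "operator"]


def match_command(heard, commands=COMMANDS, cutoff=0.6):
    """Snap a noisy guess to the closest known command, or return None.

    `cutoff` is an assumed similarity threshold between 0 and 1; anything
    less similar than that is treated as "not understood".
    """
    matches = get_close_matches(heard.lower(), commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(match_command("book a fright"))  # book a flight
    print(match_command("operater"))       # operator
    print(match_command("xyzzy"))          # None
```

A dictation app has no such short list to fall back on. Every word in the language is a candidate, so a near miss such as "self-deportation" can just as easily come out as "self deprecation."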

Fifteen years after my first encounter with voice recognition software, I still feel like its day is just around the corner. Or, perhaps, “jiu-jitsu Rhonda corn, y’all.”