Most people realize that emails and other digital communications they once considered private can now become part of their permanent record.
But even as they increasingly use apps that understand what they say, most people don’t realize that the words they speak are not so private anymore, either.
Top-secret documents from the archive of former NSA contractor Edward Snowden show the National Security Agency can now automatically recognize the content within phone calls by creating rough transcripts and phonetic representations that can be easily searched and stored.
The documents show NSA analysts celebrating the development of what they called “Google for Voice” nearly a decade ago.
Though perfect transcription of natural conversation apparently remains the Intelligence Community’s “holy grail,” the Snowden documents describe extensive use of keyword searching as well as computer programs designed to analyze and “extract” the content of voice conversations, and even use sophisticated algorithms to flag conversations of interest.
The documents include vivid examples of the use of speech recognition in war zones like Iraq and Afghanistan, as well as in Latin America. But they leave unclear exactly how widely the spy agency uses this ability, particularly in programs that pick up considerable numbers of conversations involving people who live in the United States or are U.S. citizens.
A 2008 document from the Snowden archive shows that transcribing news broadcasts was already working well seven years ago, using a program called Enhanced Video Text and Audio Processing:
A version of the system the NSA uses is now even available commercially.
The NSA has repeatedly rolled out new and improved speech recognition systems for more than a decade.
The first-generation tool, which made keyword-searching of vast amounts of voice content possible, was rolled out in 2004 and code-named RHINEHART.
“Voice word search technology allows analysts to find and prioritize intercept based on its intelligence content,” says an internal 2006 NSA memo entitled “For Media Mining, the Future Is Now!”
A newer, more sophisticated product called VoiceRT, rolled out by the NSA’s Human Language Technology (HLT) program office, was first introduced in Baghdad in 2006. The goal, according to another 2006 memo, was to use voice processing technology to be able to “index, tag and graph” all intercepted communications.
The memo says an “important enhancement under development is the ability for this HLT capability to predict what intercepted data might be of interest to analysts based on the analysts’ past behavior.”
To Phillip Rogaway, a professor of computer science at the University of California, Davis, keyword-search is probably the “least of our problems.” In an email to The Intercept, Rogaway warned that “When the NSA identifies someone as ‘interesting’ based on contemporary NLP [Natural Language Processing] methods, it might be that there is no human-understandable explanation as to why beyond: ‘his corpus of discourse resembles those of others whom we thought interesting’; or the conceptual opposite: ‘his discourse looks or sounds different from most people’s.’”
If the algorithms NSA computers use to identify threats are too complex for humans to understand, Rogaway wrote, “it will be impossible to understand the contours of the surveillance apparatus by which one is judged. All that people will be able to do is to try your best to behave just like everyone else.”
A 2009 memo from the NSA’s British partner, GCHQ, describes how “NSA have had the BBN speech-to-text system Byblos running at Fort Meade for at least 10 years. (Initially they also had Dragon.) During this period they have invested heavily in producing their own corpora of transcribed SIGINT in both American English and an increasing range of other languages.” (GCHQ also noted that it had its own small corpora of transcribed voice communications, most of which happened to be “Northern Irish accented speech.”)
VoiceRT, in turn, was surpassed a few years after its launch. According to the intelligence community’s “Black Budget” for fiscal year 2013, VoiceRT was decommissioned and replaced in 2011 and 2012, so that by 2013, NSA could operationalize a new system.
This system, apparently called SPIRITFIRE, could handle more data, faster. SPIRITFIRE would be “a more robust voice processing capability based on speech-to-text keyword search and paired dialogue transcription.”
According to a 2011 memo, “How is Human Language Technology (HLT) Progressing?”, NSA that year deployed “HLT Labs” to Afghanistan, NSA facilities in Texas and Georgia, and listening posts in Latin America run by the Special Collection Service, a joint NSA/CIA unit that operates out of embassies and other locations.
A June 2006 NSA PowerPoint presentation describing the role of VoiceRT:
In a 2011 article, “Finding Nuggets — Quickly — in a Heap of Voice Collection, From Mexico to Afghanistan,” an intelligence analysis technical director from NSA Texas described the “rare life-changing instance” when he learned about human language technology, and its ability to “find the exact traffic of interest within a mass of collection.”
What’s less clear from the archive is how extensively this capability is used to transcribe or otherwise index and search voice conversations that primarily involve what the NSA terms “U.S. persons.”
The NSA did not answer a series of detailed questions about automated speech recognition, even though an NSA “classification guide” that is part of the Snowden archive explicitly states that “The fact that NSA/CSS has created HLT models” for speech-to-text processing as well as gender, language and voice recognition, is “UNCLASSIFIED.”
Spying on international telephone calls has always been a staple of NSA surveillance, but the requirement that an actual person do the listening meant it was effectively limited to a tiny percentage of the total traffic. By leveraging advances in automated speech recognition, the NSA has entered the era of bulk listening.
And this has happened with no apparent public oversight, hearings or legislative action. Congress hasn’t shown signs of even knowing that it’s going on.
An additional 2006 NSA document, published with this article but not mentioned in it:
Source Documents (PDF):
Coming Soon! A Tool that Enables Non-Linguists to Analyze Foreign-TV News Programs (October 23, 2008)
For Media Mining, the Future is Now! (August 1, 2006)
For Media Mining, the Future is Now! (conclusion) (August 7, 2006)
SIRDCC Speech Technology WG assessment of current STT technology (December 7, 2009)
“Black Budget” — FY 2013 Congressional Budget Justification/National Intelligence Program, pp. 360-364 (February 2012)
“Black Budget” — FY 2013 Congressional Budget Justification/National Intelligence Program, p. 262 (February 2012)
How Is Human Language Technology (HLT) Progressing? (September 26, 2011)
RT10 Overview (June 2006)
Finding Nuggets – Quickly – in a Heap of Voice Collection, From Mexico to Afghanistan (May 25, 2011)
Classification Guide for Human Language Technology (HLT) Models (May 18, 2011)
Dealing With a ‘Tsunami’ of Intercept (August 29, 2006)