[Overseas DS] What an ‘Endless Conversation’ with Werner Herzog Can Teach Us about AI

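[Overseas DS] presents the views of industry experts as reported in leading data science publications abroad. Our Data Science Management Research Institute (MDSA R&D) has a content partnership in progress under which the original English text is published alongside.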

“The AI-generated conversation between Werner Herzog and Slavoj Žižek is certainly entertaining, but it is also a warning that a ‘crisis of misinformation’ is approaching.”

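Filmmaker Werner Herzog (left) and philosopher Slavoj Žižek are in different places in real life, but on the website Infinite Conversation, ‘clones’ that carry their voices keep up a deep discussion using words generated by artificial intelligence. / Photo: K. Dowling/Getty Images (left); Nejc Trpin/Alamy Stock Photo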

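Giacomo Miceli, an Italian American computer scientist and entrepreneur, launched the website Infinite Conversation last October. There, the German filmmaker Werner Herzog and the Slovenian philosopher Slavoj Žižek carry on a discussion across a wide range of topics. Listening to them, you notice each man’s distinctive English accent and habits of word choice, which makes the exchange feel fairly ‘real.’ In fact, the conversation is fake. The voices are deepfakes, and not only the ‘distinctive accents’ but also the words being spoken are entirely the work of AI.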

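In an essay contributed to Scientific American, Miceli said he built the conversation “as a warning.” As so-called machine learning has advanced, deepfakes, meaning incredibly realistic but fake images, videos or speech, have become too easy to create and their quality too good. At the same time, language-generating AI has gained the ability to produce large quantities of text quickly and cheaply. Miceli said that, combined, these technologies can do more than stage an infinite conversation: they can do enough to drown us in an ‘ocean of disinformation.’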

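Machine learning, which is advancing at a dazzling pace, is an AI technique that uses large quantities of data to ‘train’ an algorithm so that it improves as it repeatedly performs a particular task. It is pushing every sector of information technology to new levels, including speech synthesis, meaning systems that produce utterances humans can understand. Miceli, who is interested in the liminal space between humans and machines, says he has always found speech synthesis a fascinating application. So the fact that advances in machine learning let speech synthesis and voice cloning improve dramatically over the past few years (after a long period of small, incremental improvements) was enough to catch his attention.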

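Miceli said Infinite Conversation got started when he stumbled across Coqui TTS, an exemplary speech synthesis program. Many projects in the digital domain begin with ‘discovering’ a previously unknown software library or open-source program, and this one was no exception. When he found this tool kit, along with an active user community and plenty of documentation, Miceli says he felt he had everything he needed to clone a ‘famous’ voice.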

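As an admirer of Werner Herzog’s work, persona and worldview, Miceli had always been captivated by his voice and way of speaking. He is hardly alone: pop culture has turned Herzog into a literal cartoon. (His cameos and collaborations include The Simpsons, Rick and Morty and Penguins of Madagascar, titles almost everyone has heard of.) So when it came to choosing a voice to ‘tinker with,’ Miceli says Herzog’s was the best option, especially once he realized he would have to listen to that voice for hours on end. Miceli remarked that “it’s almost impossible to get tired of hearing his (Herzog’s) dry speech and heavy German accent, which convey a gravitas that can’t be ignored.”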

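Building a training set for cloning Herzog’s voice was, by Miceli’s account, the easiest part of the process. Herzog’s interviews, voice-overs and audiobook work supply hundreds of hours of speech that can be harvested to train a machine-learning model (or, in this project, to fine-tune an existing one). A machine-learning algorithm’s output generally improves over ‘epochs,’ cycles in which the neural network is trained on all of the training data, and the algorithm samples its results at the end of each epoch. The researcher can review those samples to evaluate how well the program is progressing. Miceli says that ‘hearing’ the model improve with each epoch in Werner Herzog’s synthetic voice felt like watching that voice being ‘born’ in the digital realm, a birth in the metaphorical sense.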

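Once he had a satisfactory Herzog voice, Miceli started on a second voice and says he intuitively picked Slavoj Žižek. In Miceli’s assessment, Žižek, like Herzog, has an interesting, quirky accent, a relevant presence in the intellectual sphere and connections with the world of cinema, and he has achieved a degree of popular stardom thanks to his polemical fervor and sometimes controversial ideas.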

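Even at this point, however, Miceli says he still was not sure what final form the project would take. He says he was taken by surprise by how easy and smooth the whole voice-cloning process was, and realized it was a warning to anyone who would pay attention. Deepfakes have become too good and too easy to make. Just this month, Microsoft announced a new speech synthesis tool called VALL-E that, its researchers claim, can imitate any voice given only about three seconds of recorded audio. Miceli added: “We’re about to face a crisis of trust, and we’re utterly unprepared for it.”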

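To emphasize this technology’s capacity to produce large quantities of disinformation, Miceli settled on the idea of an ‘endless conversation.’ All he needed was a large language model fine-tuned on texts written by each of the two participants, plus a simple program to control the back-and-forth so that the conversation would flow naturally and believably.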

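At its core, a language model predicts the next word in a sequence given the words already present. With abundant transcripts of a person’s conversations, a language model can be fine-tuned to replicate the ‘style’ and ‘concepts’ that person is likely to speak about. Miceli decided to use one of the leading commercial language models available, and says that was when he realized it was already possible to generate a fake dialogue, including its synthetic voice form, in less time than it takes to listen to it. That is where the project’s name, Infinite Conversation, came from. After a couple of months of work, Miceli published Infinite Conversation online last October, and starting February 11 the project can also be seen at the Misalignment Museum, an art installation in San Francisco.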

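Miceli said that once all the pieces fell into place, he marveled at something he had not anticipated when he started the project. The ‘chatbot’ versions of Herzog and Žižek often converse about philosophy and aesthetics, as their real-life counterparts do, and because these topics are esoteric, listeners can temporarily overlook the occasional nonsense the model generates. As examples of such inconsistencies, Miceli noted that AI Žižek sometimes sees the famous director Alfred Hitchcock as a ‘genius’ and at other times as a ‘cynical manipulator,’ and that although the real Herzog notoriously hates chickens, his AI imitator sometimes speaks about them ‘compassionately.’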

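Miceli said that “because actual postmodern philosophy can read as muddled, a problem Žižek himself noted, the lack of clarity in the Infinite Conversation can be interpreted as profound ambiguity rather than impossible contradictions,” and he counted this as something that probably contributed to the project’s overall success.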

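Several hundred of Infinite Conversation’s visitors have listened for more than an hour, and some have kept the conversation playing far longer. Closing his essay, Miceli wrote: “As I mention on the website, my hope for visitors of the Infinite Conversation is that they not dwell too seriously on what is being said by the chatbots, but gain awareness of this technology and its consequences; if this AI-generated chatter seems plausible, imagine the realistic-sounding speeches that could be used to tarnish the reputations of politicians, scam business leaders or simply distract people with misinformation that sounds like human-reported news.”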

An AI-generated conversation between Werner Herzog and Slavoj Žižek is definitely entertaining, but it also illustrates the crisis of misinformation beginning to befall us

On the website Infinite Conversation, the German filmmaker Werner Herzog and the Slovenian philosopher Slavoj Žižek are having a public chat about anything and everything. Their discussion is compelling, in part, because these intellectuals have distinctive accents when speaking English, not to mention a tendency toward eccentric word choices. But they have something else in common: both voices are deepfakes, and the text they speak in those distinctive accents is being generated by artificial intelligence.

I built this conversation as a warning. Improvements in what’s called machine learning have made deepfakes—incredibly realistic but fake images, videos or speech—too easy to create, and their quality too good. At the same time, language-generating AI can quickly and inexpensively churn out large quantities of text. Together, these technologies can do more than stage an infinite conversation. They have the capacity to drown us in an ocean of disinformation.

Machine learning, an AI technique that uses large quantities of data to “train” an algorithm to improve as it repetitively performs a particular task, is going through a phase of rapid growth. This is pushing entire sectors of information technology to new levels, including speech synthesis, systems that produce utterances that humans can understand. As someone who is interested in the liminal space between humans and machines, I’ve always found it a fascinating application. So when those advances in machine learning allowed voice synthesis and voice cloning technology to improve in giant leaps over the past few years—after a long history of small, incremental improvements—I took note.

Infinite Conversation got started when I stumbled across an exemplary speech synthesis program called Coqui TTS. Many projects in the digital domain begin with finding a previously unknown software library or open-source program. When I discovered this tool kit, accompanied by a flourishing community of users and plenty of documentation, I knew I had all the necessary ingredients to clone a famous voice.

As an appreciator of Werner Herzog’s work, persona and worldview, I’ve always been drawn by his voice and way of speaking. I’m hardly alone, as pop culture has made Herzog into a literal cartoon: his cameos and collaborations include The Simpsons, Rick and Morty and Penguins of Madagascar. So when it came to picking someone’s voice to tinker with, there was no better option—particularly since I knew I would have to listen to that voice for hours on end. It’s almost impossible to get tired of hearing his dry speech and heavy German accent, which convey a gravitas that can’t be ignored.

Building a training set for cloning Herzog’s voice was the easiest part of the process. Between his interviews, voice-overs and audiobook work there are literally hundreds of hours of speech that can be harvested for training a machine-learning model—or in my case, fine-tuning an existing one. A machine-learning algorithm’s output generally improves in “epochs,” which are cycles through which the neural network is trained with all the training data. The algorithm can then sample the results at the end of each epoch, giving the researcher material to review in order to evaluate how well the program is progressing. With the synthetic voice of Werner Herzog, hearing the model improve with each epoch felt like witnessing a metaphorical birth, with his voice gradually coming to life in the digital realm.
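The idea of training in epochs and sampling at the end of each one can be sketched in a few lines of Python. This is only a toy illustration, not the Coqui TTS training code (which the essay does not show): the function name `train_with_epoch_samples`, the one-parameter model, the learning rate and the data are all assumptions made up for the example. Instead of a synthetic voice clip, the "sample" here is simply the model's output for one fixed probe input.

```python
import random

def train_with_epoch_samples(data, epochs=5, lr=0.05, seed=0):
    """Fit y = w * x by stochastic gradient descent.

    After every epoch (one full pass over the data) a checkpoint is
    recorded so a person can review how training is progressing.
    """
    rng = random.Random(seed)
    w = 0.0
    history = []
    for epoch in range(1, epochs + 1):
        examples = list(data)
        rng.shuffle(examples)               # visit the data in a new order
        for x, y in examples:               # one epoch = every example once
            grad = 2 * (w * x - y) * x      # derivative of squared error
            w -= lr * grad
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        sample = w * 1.0                    # output for a fixed probe input
        history.append({"epoch": epoch, "loss": loss, "sample": sample})
    return w, history

# Toy data generated from y = 3x (hypothetical, for illustration only).
toy_data = [(x / 10, 3.0 * x / 10) for x in range(1, 11)]
weight, log = train_with_epoch_samples(toy_data)
for entry in log:
    print(entry)
```

Reading the printed checkpoints in order plays the same role as listening to one synthesized clip per epoch: the loss shrinks and the probe output drifts toward its target value of 3.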

Once I had a satisfactory Herzog voice, I started working on a second voice and intuitively picked Slavoj Žižek. Like Herzog, Žižek has an interesting, quirky accent, a relevant presence within the intellectual sphere and connections with the world of cinema. He has also achieved somewhat popular stardom, in part thanks to his polemical fervor and sometimes controversial ideas.

At this point, I still wasn’t sure what the final format of my project was going to be—but having been taken by surprise by how easy and smooth the whole process of voice-cloning was, I knew it was a warning to anyone who would pay attention. Deepfakes have become too good and too easy to make; just this month, Microsoft announced a new speech synthesis tool called VALL-E that, researchers claim, can imitate any voice based on just three seconds of recorded audio. We’re about to face a crisis of trust, and we’re utterly unprepared for it.

In order to emphasize this technology’s capacity to produce large quantities of disinformation, I settled on the idea of a never-ending conversation. I only needed a large language model—fine-tuned on texts written by each of the two participants—and a simple program to control the back-and-forth of the conversation, so that its flow would feel natural and believable.
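A program that controls the back-and-forth could look roughly like the Python sketch below. It is an assumption about the general shape of such a controller, not the code behind the website: `run_conversation`, `stub_reply` and the `context_turns` parameter are hypothetical names, and the stub stands in for a call to a fine-tuned language model followed by speech synthesis.

```python
from typing import Callable, List, Sequence, Tuple

Turn = Tuple[str, str]  # (speaker, text)

def run_conversation(
    speakers: Sequence[str],
    generate_reply: Callable[[str, List[Turn]], str],
    max_turns: int,
    context_turns: int = 4,
) -> List[Turn]:
    """Alternate between speakers, feeding each one the recent history."""
    history: List[Turn] = []
    for i in range(max_turns):
        speaker = speakers[i % len(speakers)]   # strict turn-taking
        context = history[-context_turns:]      # only the latest turns
        text = generate_reply(speaker, context)
        history.append((speaker, text))
    return history

def stub_reply(speaker: str, context: List[Turn]) -> str:
    """Hypothetical stand-in for a per-speaker fine-tuned language model."""
    if not context:
        return f"{speaker} opens the discussion."
    previous_speaker, _ = context[-1]
    return f"{speaker} responds to {previous_speaker}."

dialogue = run_conversation(["Herzog", "Žižek"], stub_reply, max_turns=4)
for speaker, text in dialogue:
    print(f"{speaker}: {text}")
```

Passing only the last few turns as context is one simple way to keep each reply tied to what was just said while letting the loop run for as long as desired.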

At their very core, language models predict the next word in a sequence, given a series of words already present. By fine-tuning a language model, it is possible to replicate the style and concepts that a specific person is likely to speak about, provided that you have abundant conversation transcripts for that individual. I decided to use one of the leading commercial language models available. That’s when it dawned on me that it’s already possible to generate a fake dialogue, including its synthetic voice form, in less time than it takes to listen to it. This provided me with an obvious name for the project: Infinite Conversation. After a couple of months of work, I published it online last October. The Infinite Conversation will also be displayed, starting February 11, at the Misalignment Museum art installation in San Francisco.
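Next-word prediction can be shown at a very small scale with a bigram counter in Python. This is a deliberately simplified stand-in: commercial language models are large neural networks, not count tables, and the corpus string and the function names `train_bigram`, `predict_next` and `generate` are invented for this sketch.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, which words follow it and how often."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def generate(counts, start, length):
    """Greedily extend `start` one predicted word at a time."""
    out = [start.lower()]
    while len(out) < length:
        following = predict_next(counts, out[-1])
        if following is None:
            break
        out.append(following)
    return " ".join(out)

# Made-up toy corpus; a real system would learn from far more text.
corpus = "the voice is dry the voice is heavy the accent is german"
model = train_bigram(corpus)
print(predict_next(model, "the"))
print(generate(model, "the", 3))
```

In this toy picture, fine-tuning on one person's transcripts loosely corresponds to updating the counts with that person's own sentences, so the predicted continuations shift toward their vocabulary and phrasing.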

Once all the pieces fell into place, I marveled at something that hadn’t occurred to me when I started the project. Like their real-life personas, my chatbot versions of Herzog and Žižek converse often around topics of philosophy and aesthetics. Because of the esoteric nature of these topics, the listener can temporarily ignore the occasional nonsense that the model generates. For example, AI Žižek’s view of Alfred Hitchcock alternates between seeing the famous director as a genius and as a cynical manipulator; in another inconsistency, the real Herzog notoriously hates chickens, but his AI imitator sometimes speaks about the fowl compassionately. Because actual postmodern philosophy can read as muddled, a problem Žižek himself noted, the lack of clarity in the Infinite Conversation can be interpreted as profound ambiguity rather than impossible contradictions.

This probably contributed to the overall success of the project. Several hundred of the Infinite Conversation’s visitors have listened for over an hour, and in some cases people have tuned in for much longer. As I mention on the website, my hope for visitors of the Infinite Conversation is that they not dwell too seriously on what is being said by the chatbots, but gain awareness of this technology and its consequences; if this AI-generated chatter seems plausible, imagine the realistic-sounding speeches that could be used to tarnish the reputations of politicians, scam business leaders or simply distract people with misinformation that sounds like human-reported news.

But there is a bright side. Infinite Conversation visitors can join a growing number of listeners who report that they use the soothing voices of Werner Herzog and Slavoj Žižek as a form of white noise to fall asleep. That’s a usage of this new technology I can get into.