Director/author’s statement: My film Recorded as Stated by Me (V. Shashkov, 2024), released on Amazon Prime at the end of April 2025, is set in post-mobilization Russia and traces cultural resistance during wartime – from poetry readings in a bar and street memorials to a “philosophy café” – all documented in intimate proximity. To preserve the presence and emotional cadence of real voices for international viewers, I used fully AI-generated and AI-translated dubbing across languages. This approach raised new questions around authorship, authenticity, and creative mediation. The text below reflects on the ethical and creative tensions of AI voice dubbing in nonfiction cinema, based on my experience directing the documentary.

* * *

Preparation for AI Dubbing

For main characters, AI models typically perform well, as there is enough data to capture nuances in tone and delivery – three minutes of one character’s voice is typically sufficient. The real challenge lies with minor characters who deliver only a few lines, often totaling no more than 40 seconds. Such limited datasets can lead to subpar AI-generated dubbing, as the model struggles to replicate the unique vocal character.

I had no time for such preparation. Ideally, if you know in advance that a film will use AI dubbing, it’s essential to not only obtain consent for voice use, but also to record about three minutes of clean, uninterrupted monologue. This provides a more robust dataset for training and improves the final dubbing quality.
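One way to check the three-minute guideline before training any voice model is to total each character's clean speech from a dialogue list. The Python sketch below is only an illustration under stated assumptions: the segment format, the function names, and the sample timings are hypothetical and do not come from any dubbing tool or from the film; only the three-minute target is taken from the text above.

```python
from collections import defaultdict

# Target from the text above: roughly three minutes of clean speech per voice.
TARGET_SECONDS = 180.0


def speech_seconds_per_speaker(segments):
    """Sum clean-speech duration per speaker.

    `segments` is a hypothetical list of (speaker, start_sec, end_sec)
    tuples, for example exported by hand from the editing timeline.
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        if end < start:
            raise ValueError(f"segment for {speaker!r} ends before it starts")
        totals[speaker] += end - start
    return dict(totals)


def speakers_needing_more_audio(segments, target=TARGET_SECONDS):
    """Return {speaker: missing_seconds} for voices below the target."""
    totals = speech_seconds_per_speaker(segments)
    return {
        speaker: target - total
        for speaker, total in totals.items()
        if total < target
    }


# Illustrative timings only, not taken from the film.
segments = [
    ("protagonist", 0.0, 95.0),
    ("protagonist", 120.0, 230.0),
    ("bartender", 300.0, 325.0),
    ("bartender", 410.0, 425.0),
]

for speaker, missing in speakers_needing_more_audio(segments).items():
    print(f"{speaker}: record about {missing:.0f} more seconds")
```

With these sample timings, the protagonist has 205 seconds of speech and passes, while the minor character has only 40 seconds and is flagged as needing about 140 more.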

To ensure high-quality output, all voice lines for each character should be isolated onto individual audio tracks during post-production. For example, if you use Sennheiser lavaliers during shooting, that simplifies the process. It becomes more complicated when recording with a boom mic, as in the case of the “Philosophy Café” or the “Student Debates at the Bar” scenes in the film. In such cases, separating voices into tracks becomes time-consuming.

Emotional Expressiveness in AI Dubbing

While AI excels at technical reproduction, it struggles with emotional authenticity. Isolating lines on a single track is not enough. To preserve emotional depth, AI must be trained using samples of the characters’ varied emotional states throughout the film.

Consider a scene where the protagonist runs through a forest memorial surrounded by graves, gasping for breath and stumbling over words. To retain the emotional impact, the AI model must be specifically trained on this scene’s audio – on strained breathing, broken speech, and raw emotional delivery. Similarly, when characters sing, scream, or shift emotionally, the model must be trained on these moments separately.

If AI dubbing is anticipated, teams should intentionally capture these “emotional fingerprints.” Isolated recordings of fear, joy, anger, or sorrow provide rich training data that enables emotionally faithful dubbing.

Managing Phrase Length in AI Dubbing

One of the recurring challenges with AI dubbing is its tendency to either slow down or speed up phrases inaccurately, especially when translated text differs in length from the original script.

It’s no secret that “automatic dubbing” is best suited for creating a rough canvas or base model, but to achieve meaningful and accurate results, every phrase must be customized to align with the intended meaning and tone. Adjusting phrase length is critical here. In my early experiences working with dubbing interfaces, I often adjusted phrases by trimming or padding them manually. This was the most time-consuming part of the project – until the specialized dubbing software received an update that included a drag-and-drop tool for adjusting phrase lengths during final editing.

For example, if a source-language phrase takes two to three seconds but the translation runs longer, the interface allows you to stretch or compress the AI voice while preserving natural intonation and inserting filler sounds like “uh” or “hmm” if they were present in the original track. This avoids the robotic or rushed sound that can result from improper timing. Additionally, this tool is particularly valuable in scenes where dialogue overlaps with other sound effects or music, as it provides the flexibility to fit the dubbed audio seamlessly into the mix.
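The timing decision behind this kind of adjustment can be sketched as a simple ratio between the dubbed phrase and the slot left by the original. The Python below is a minimal sketch, not the logic of any particular dubbing interface: the function names and the 15% tolerance band are assumptions chosen for illustration.

```python
def stretch_ratio(source_seconds, dubbed_seconds):
    """Playback-speed factor that makes the dubbed line fit the source slot.

    Above 1.0 the dubbed line must be sped up (compressed);
    below 1.0 it must be slowed down (stretched).
    """
    if source_seconds <= 0 or dubbed_seconds <= 0:
        raise ValueError("durations must be positive")
    return dubbed_seconds / source_seconds


def fit_advice(source_seconds, dubbed_seconds, tolerance=0.15):
    """Suggest how to handle one phrase.

    `tolerance` is an assumed comfort band (15% here), not an industry
    standard: inside it, time-stretching alone is likely to sound natural.
    """
    ratio = stretch_ratio(source_seconds, dubbed_seconds)
    if abs(ratio - 1.0) <= tolerance:
        return ratio, "adjust timing"
    if ratio > 1.0:
        return ratio, "shorten translation"
    return ratio, "lengthen translation or add pause"


# A 2.5-second original whose translation runs 3.4 seconds.
ratio, advice = fit_advice(2.5, 3.4)
print(f"speed factor {ratio:.2f}: {advice}")
```

In this example the translation would need to play about 1.36 times faster to fit, which falls outside the assumed band, so the sketch suggests rewording the line rather than compressing the voice.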

With such tools, production teams can reduce post-production workload while achieving naturalistic dubbing results.

Translation Differences and Their Impact on AI Dubbing

When translating, for instance, from Russian to English, the length and structure of phrases can differ significantly. This variation is especially noticeable in key moments, such as transitions and scenes with overlapping crowd sounds.

In Recorded as Stated by Me, the translated English version ended up roughly eight minutes shorter than the Russian version due to pacing and scene transitions. This demonstrates how translation intricacies can extend beyond language, impacting the film’s rhythm, emotional weight, and narrative flow.

Interestingly, AI dubbing technologies have shown surprising proficiency in preserving the “naturalness” of speech, including subtle conversational elements like hesitations, filler words (e.g., “uh,” “well,” “you know”), and pauses. These nuances add a layer of authenticity to the performance, making it feel more human and less robotic.

However, this naturalism can also be a double-edged sword. Filler words and pauses, while enhancing realism, may unintentionally stretch scenes or disrupt tightly edited transitions. For instance, a random “umm” or “well” inserted into a critical pause might throw off the pacing of a dramatic scene.

Precision Adjustments in AI Dubbing: Express Tips

AI dubbing allows for a high degree of customization, enabling precise adjustments to enhance the naturalness and emotional impact of dialogue. Here are some key techniques:

Indicating Intentional Pauses: Use ellipses (…) to prompt the AI to add reflective or suspenseful pauses.

Stress Correction: To control word stress (especially in surnames the AI model does not know), repeat the vowel or consonant in the stressed syllable (e.g., Druzhiiinin) to guide correct pronunciation. Any unnatural elongation can be adjusted later in post-production to refine the flow and maintain naturalism.

Blending with Original Audio: In scenes where the dubbed line is identical to the original (international claims, quotes in the original language, etc.), overlay the AI output with the actor’s real voice for a textured, hybrid effect. I used this technique in poetic sequences.
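The pause and stress tips are plain text edits made before a line is sent to the voice model, so they can be scripted. The Python sketch below assumes nothing about any specific dubbing service: the helper names are hypothetical, the sample sentence is invented, and whether a given model responds to ellipses or repeated letters still has to be checked by ear.

```python
ELLIPSIS = "\u2026"  # the single-character ellipsis used in the tip above


def mark_stress(word, index, repeat=3):
    """Repeat the letter at `index` so the voice model lengthens that syllable.

    `index` is the 0-based position of the stressed vowel or consonant.
    """
    if not 0 <= index < len(word):
        raise IndexError("index is outside the word")
    if not word[index].isalpha():
        raise ValueError(f"{word[index]!r} is not a letter")
    return word[:index] + word[index] * repeat + word[index + 1:]


def add_pause(line, after):
    """Insert an ellipsis right after the first occurrence of `after`."""
    position = line.find(after)
    if position == -1:
        raise ValueError(f"{after!r} not found in line")
    cut = position + len(after)
    return line[:cut] + ELLIPSIS + line[cut:]


# The surname example from the text: stress on the first "i" (index 5).
print(mark_stress("Druzhinin", 5))  # Druzhiiinin

# Invented sample sentence, not a line from the film.
print(add_pause("We waited and nobody spoke.", "waited"))
```

Keeping these edits in small functions makes it easy to apply the same spelling of a surname consistently across every line in which it appears.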

This approach enables filmmakers to craft dubbed performances that feel as authentic and engaging as the original, ensuring both technical precision and emotional depth.

Scenes That Are (Nearly) Impossible to Dub with AI

Songs: AI struggles to sync emotion and timing with music. Use subtitles or re-record with a singer performing in the dubbing language.

Overlapping Voices or Multivoice Scenes: AI can’t yet handle layered speech. Handle each voice individually, layering them manually during post-production. This also applies to distant shouts or off-screen voices; it’s easier to isolate the character’s voice and apply effects (e.g., equalization) to simulate distance.

Wordplay, Puns, and Linguistic Humor: Cultural humor often gets lost in translation. Manual professional adaptation is essential.

Scenes with Real-World Noise Interference: In scenes with heavy ambient noise (e.g., city streets, protests), AI voices may sound artificial. Blend with environmental soundscapes or use post-processing.

Authorship

While the technical challenges of AI dubbing can be overcome, copyright issues present a more systemic obstacle to sociocultural development. I use the term “copyright” exclusively in the context of the creative industries: cinema, video art, and DIY content for YouTube. We must understand that right holders will continue to fight vigorously for their rights – at least until the costs of enforcement begin to exceed the profits from upholding copyright. Because monitoring for violations is itself unprofitable, copyright enforcement often becomes a matter of random punishment. Neural networks resolve the problem of violation, but access to original sources important for cultural development is still lost.

Rare new films with Beatles tracks don’t automatically become cult classics. In my case, the long, opaque process of acquiring author and performer rights blocked me from sharing a track by the little-known Swedish noise artist Iannis Xenakis. His music could’ve intensified the memorial-running scene. Instead, I had to find a “similar-sounding” alternative, which weakened the emotional impact. At the time, music-generating AIs were still in their infancy. Today, I could recreate the track without breaching copyright or long-term rights assignments.

So, who benefits the most from AI voice dubbing? Primarily the producer, through reduced time and budget costs. But for the audience, the gain is far less certain.

Piracy was once the only way to give the public access to creativity. Copyright protects elite information privilege – not only by extracting maximum profit but also by artificially limiting access.

To avoid future legal pressure on platforms that host training data (or prompts that reference copyrighted music), I propose the following: Ensure the most aggressive permissible copyright enforcement only over the shortest time span. Collect the same profits in five years that we now collect in seventy. During this window, activate all enforcement and punishment mechanisms. Survive those five years and then release the work into the public domain. The shorter the enforcement, the more justifiable its severity.

AI dubbing doesn’t merely imitate speech – it mediates presence, memory, and meaning. It helps nonfiction filmmakers reach new audiences, but at a cost: a redefinition of authorship and a redistribution of creative agency. We’re not just translating voices – we’re reconstructing the self.

* * *

All images appear in the film and appear here courtesy of the director.