Between Diegesis and Mimesis: Voice-Over Narration in Fiction Film

The Lady from Shanghai (Welles, 1947): Voice-over at the intersection of language, image, and reflection

* * *

Cinema is, fundamentally, mimetic. The cinema was established as a photographic medium long before sound, and with it the spoken word, was linked to it. Photography as the basis of cinema’s signification practice asserts mimesis as the cinema’s primary mode of representation. André Bazin, in What Is Cinema?, and Stanley Cavell, in The World Viewed, have adequately demonstrated this fact.1 The addition of the spoken word to narrative film in the 1920s and 1930s served to complete the mimesis when the speech clearly emanates from a character on screen or on the immediate periphery of the action (the character might be speaking from off screen, but if we are aware of his or her presence, if the character is a participant in the sequence, we do not question the source of the spoken text, and thus mimesis is intact).

However, the voice-over, as it is used in narrative film, presents a unique set of circumstances regarding signification, mimesis, and diegesis as these distinctions apply to a filmic text. The spoken text in a film, when linked causally (i.e., directly) to the image, may be considered a part of the mimesis of the film. However, the voice-over, when it occurs apart from the image on screen, signifies as a language-event, as opposed to the image-event, which signifies differently, in that the linkage between language and image is associative (or indirect). This brief essay will seek to clarify the distinctions between mimetic and diegetic elements in a filmic text, isolate the voice-over as a unique event located between the strict poles of mimesis and diegesis, demonstrate the linkages between the image-event and the language-event, and examine specific texts that demonstrate these distinctions.

In the recent book The Reality of Film: Theories of Filmic Reality, Richard Rushton claims that, rather than refer to a reality outside of itself, a film constitutes its own reality, and thus its own mode of being, which is not secondary (referential), but primary (as is our experience of “reality” outside the text of the film). That is, films are a part of reality, but are not abstracted from reality, and thus constitute a “filmic reality” (2). The nature of this filmic reality is that it may not necessarily be self-contained, but it is necessarily self-referential. The thesis of Rushton’s book is, as he writes, “to dispel the myth that there is anything behind films. Its central point is quite simply that films are enough – they do not need to be something else; there does not need to be anything behind a film that would be better or more significant than a film as it is. Filmic reality – the reality of film – is film as it is” (8). Rushton’s argument (following Gilles Deleuze and Cavell, after a fashion), as intriguing as it is, goes too far. Film is always in search of correspondence with the world to which it adheres; this is the nature of realism in film. In Rushton’s estimation, realism is an operating modality of film; I would argue that realism is what constitutes film. The question is not one of realism vs. formalism (or other oppositional aesthetic positions), but degrees of seemliness of the reality of our experience. No doubt films constitute their own reality – Bazin and Cavell agree on this point – but that reality adheres to a known reality common to the capabilities of our own experience of the world outside the film (we could not recognize the filmic material if we had no basis for comparison; that is, our catalog of experiences of reality. In this sense I would place Christian Metz’s “imaginary signifier” not in the apparatus of the cinema but in the reality of our experiences). Rushton’s book is helpful because it furthers the investigation into the nature of this adherence; the technique of voice-over, as utilized in narrative films, helps us consider the nature of the linkage between film and the world, as well as a determination of mimetic and diegetic elements of the filmic text.

To begin, I would separate the two fundamental aspects of narrative film: the language-event and the image-event, which combined constitute the filmic text. The essential basis of cinema is the image-event, which (as previously mentioned) is grounded in photography; or, if we prefer, a photographic presentation of the world. Cavell is correct in his assertions regarding the indexicality of the image in a filmic text, that “[photographs] have answers in reality” (24); likewise, Bazin, that “photography enjoys a certain advantage [over painting] in [the] . . . transference of reality from the thing to its reproduction” (14). In my expression image-event, the term event is intended to signify the fact of our experience of the image, as well as the way in which the image signifies its meaning to us. The expression language-event is meant to isolate those language aspects of the film that are neither an immediate part of the mise en scène nor intimately (or apparently) linked to it. In these cases, the language-event, disconnected as it is from a corresponding image, signifies (i.e., makes meaning) as language, not as image; that is, it cannot be said to be part of the image-event. This is a fundamental condition of the voice-over in narrative fiction film.

The mode of filmic representation is mimetic in the classical sense of showing, as opposed to diegesis, which is representation by telling, or narrating. In each condition, something is represented (or if you like, presented again, an expression that preserves the inherent reality of the filmic text while also addressing the necessary fact of that text’s connection to a reality outside itself); the nature of mimesis is the fact that something is referred to (no matter how “real” a filmic text is, it cannot refer only to itself for meaning). The image-event is obviously mimetic in that it corresponds (the term is significant) with an independent reality that exists outside of ourselves and thus necessitates our recognition of it as being outside of ourselves; that is, reality that can be referred to as appreciable to our apparatus for perception but also stands apart as an event outside this apparatus after the initial (or authentic) perception has occurred. This is not a superficial point in consideration of the referentiality of the image-event; the image-event in cinema affects our perceptual apparatus in such a way that mimesis occurs through a resemblance to (or re-assemblage of) the world without necessitating another form of media (those aspects of the image-event that do not readily correspond to our experience, such as jump cuts, cross-fades, shifts in camera perspective, etc., are rightly termed non-diegetic, if we accept non-diegetic to be a subset of diegesis, and not the opposite of diegesis). The relationship between an image and its referent is one of non-ambiguous adherence, which C. S. Peirce terms a direct sign. That is, the image does not signify by means of its approximation to its referent; for example, an image of a whale is only an image of a whale, and we cannot mistake it for anything else. But the image of the whale is an image of an actual whale, the reality of the whale is transferred to the image of the whale, and there is no ambiguity between the two. Thus, when we see an image on screen, we know it to be what it is in itself.

No doubt the relationship between a word and its referent is fraught with ambiguity; the adherence between the signifier and the signified of the linguistic sign generates a semantic field that in turn affects potential meaning associations. The linguistic sign (following Peirce) requires three components: signifier, signified, and interpretant, while the image requires only the image and its indexical referent. The linkage between the image and referent is direct; the same linkage(s) in the linguistic sign are interpretive. (An image in a film may be “interpreted” by virtue of its symbolism, metaphor, allegory, etc., but these are tropes, and do not affect the nature of the linkage between image and referent.)

When the cinema was a silent medium, the function of the image-event was clear. If language existed, it did so in the form of title cards, which obviously were textual. Even without spoken language, the nature of cinema was mimetic; mimesis in this case determines the referential nature of the image-event. The addition of spoken language to the image-event only completes the sense of mimesis; the image still signifies as an image-event. But the relationship between language and image requires further consideration.

When language accompanies image in a way that is consistent with our authentic experience of the world, we may still consider the representational mode mimetic, and the language becomes a part of the image-event. For example, when a character on screen is speaking and we hear his or her language, the point of emanation is clear; even in the shot/reverse shot pattern of edited dialogue, while the speaker may not be on screen (the camera may cut to the other character on the screen to gauge his or her reaction to what is being said), we are still in no doubt as to who is speaking and how to attribute that spoken text. However, when the speaker of a text is not identified, or is not present to the image-event, or is abstracted from the image-event, we encounter a different experience of language in film: the language-event.

The language-event is distinct from the image-event in that it is not clearly attributed to a character in the mise en scène. That is, the language is separate from the image-event, and the linkage between the image and voice-over is relegated to the realm of interpretation. The voice-over is indeed over; it is layered onto the image-event (we do not think of “voice-beyond” or “voice-other”). Separated from the image-event, the voice-over becomes a language-event, and as such signifies not as image but as language.2 The language-event is not strictly a part of the mimesis of the film, since it signifies as language, but because mimesis is the fundamental representational mode of cinema, the language-event is linked in some way to the mimesis. We may say that the language-event exists between mimesis and diegesis; it signifies as language and its representational modality is diegetic, but it is, by necessity, associated with the fundamental mimesis of the film. The language-event in cinema occurs most commonly in the form of voice-over.

For the purposes of this study, I will qualify voice-over as disembodied narration over an image-event. The voice-over is not specifically non-diegetic. Because the voice-over is a language-event, it represents as diegetic; because it is associated with the image-event, it is part of the mimesis of the filmic text. Voice-over in narrative cinema takes different aspects; I will attempt to qualify three: descriptive/discursive; descriptive/interpretive; and interior monologue.

Descriptive/Discursive

In this mode, the voice-over serves to introduce, provide context, or establish background, most commonly at the beginning of a narrative film. This is quite commonly seen in B-films (in particular, it seems, science fiction B-films); sometimes this application of voice-over results simply from an inability (often budgetary) on the part of the filmmakers to show the necessary images.3 Or, a brief narrative introduction may serve to set the tone and historical (or, if you will, literary) context of a film, such as in The Magnificent Ambersons (Welles, 1942) or How Green Was My Valley (Ford, 1941).

The Magnificent Ambersons. Welles’ voice-over introduces the context for the narrative.

How Green Was My Valley. Huw’s adult voice-over opens the film.

In Huw’s voice-over the village today . . .

. . . becomes, via a cross-fade, the village of his boyhood.

However, this manner of voice-over can be used more interestingly. Ingmar Bergman uses a voice-over narration at the beginning of Wild Strawberries (1957) in order for the central character, Isak Borg (Victor Sjöström), to narrate the content of his dream (which is beyond his waking, or rational, perception), which establishes the sense of alienation that haunts him and drives the narrative tension of the film.

Wild Strawberries. In Isak Borg’s narration of his dream, he is further disembodied by his encounter with his own corpse.

In this example, a fact of the voice-over as a language-event is key to Bergman’s usage; because the voice-over comes from outside the text, it registers as a disembodied voice, and thus emanates from a position of authority (the voice-over as a language-event in this regard is not subject to verifiability; that is, the voice-over as a signifying language-event cannot lie). In Monsieur Verdoux (1947), Charles Chaplin uses an introductory narration to establish the action of the film, which happens in flashback; the narration is provided by the character Verdoux (Chaplin) over the image of his own grave (Verdoux’s last speech in the film, before his execution, suggests that he will haunt those responsible for his death; in this way the narrative comes full circle).

Monsieur Verdoux. Absent body, present voice.

In Billy Wilder’s Sunset Blvd (1950), the film is, again, introduced by a dead man; in this case, Joe Gillis (William Holden) introduces the narrative context over a shot of his own corpse floating in Norma Desmond’s swimming pool.

Sunset Blvd. Narrated by the body in the pool.

The switch to third person highlights the irony of present-tense narrative from a dead man.

In these examples the voice-over signifies as a language-event due to its narrative/discursive purpose; however, the diegesis is closely linked with the image-event; we see the body (or, in the case of Monsieur Verdoux, the grave marker that indicates the absent body: a double absence) from which the narration emanates; thus, the signification is linguistic, but, unlike the direct use of dialogue (in which the language is directly linked to the image), signification in this sense is somewhere between diegesis and mimesis.

Descriptive/Interpretive

Ingmar Bergman’s Winter Light (1963) opens with Pastor Tomas (Gunnar Björnstrand), in medium close-up, beginning the service of the Eucharist. Then Bergman fades to a perspective outside the church; we hear the text of Tomas’ liturgy, but in contrast we see the cold, desolate waste of the winter landscape. We hear the language, but we no longer have direct connection to the emanation; we may say that the voice speaks over the image, and also that the image appears under the voice. Thus, Bergman splits the filmic text between the language-event and the image-event, and this split, and its demand for an interpretive modality (intended to separate the language of the liturgy from the image of the warmth and sanctity of the church) reflects the split operating within Tomas himself, between the intent of the liturgy as sacrament and the hollowness of the language as he performs it:

Winter Light. “Thy kingdom come, Thy will be done, on earth as it is in heaven.” Long shot of the church from across a snow-covered field; POV is oblique to the front of the church.

CROSS-FADE INTO:

“Give us this day our daily bread and forgive us our trespasses . . .

. . . as we forgive those who trespass against us.” Long shot of the church (slightly closer) from across the snow-covered field; POV is directly in front of the church, which is framed between two bare trees.

CROSS-FADE INTO:

“And lead us not into temptation, but deliver us from evil. For Thine is the kingdom and the power and the glory forever.”
Long shot of the church from across a frozen pond. The voice-over carries over the fade into a return shot of Tomas, now in medium close-up (in profile), completing the prayer.

The language-event and the image-event converge at the close of the prayer. The voice-over signifies as language, while the concurrent images, without an apparent linkage, signify as an image-event. A linkage may be implied only in the interpretive realm; that is, the text of the language may be interpreted to comment directly (ironically, satirically) on the images. In this way the language and images interpret each other, but do not signify as a unified filmic text.

Interior Monologue

The use of voice-over to show interior monologue is perhaps the most complex case of language/image signification. This usage of voice-over, at its most simplistic, is a holdover from literature and drama, and in its weakest application functions in the same way. For example, in the Tay Garnett melodrama Cause for Alarm! (1951), Ellen Jones (Loretta Young) searches her home for a possibly incriminating letter; we hear her narration, in voice-over, of her observation that her husband is dead.

Cause for Alarm! Ellen’s voice-over: “That man lying there was George, my husband. He was dead. He died trying to kill me.” The voice-over merely repeats the action on screen.

This is a poor use of interior monologue for several reasons. First, the film has already established the possibility that the letter may have been taken or lost, so Jones’ anxiety is clear to us; second, the voice-over distracts from the mise en scène, so that we must scrutinize the text of the narration to determine whether there is another, less obvious reason for her fear (there isn’t); third, the voice-over merely describes the images on the screen, and is wholly unnecessary. In this case the interior monologue is a restating of the obvious (at best) and evidence of an inability on the part of the director and actor to communicate visually the intended signification (at worst).

However, interior monologue can be handled in a way that does not sever the mimesis. In Alfred Hitchcock’s Vertigo (1958), Judy (Kim Novak) writes a letter to Scottie (James Stewart) revealing the truth about the mysterious “Madeleine.”

Vertigo. Judy’s voice-over begins in medias res: “And so you found me.”

As she writes, we hear her read the text of the letter in voice-over, which seems unnecessary, because Hitchcock could easily show us the text on screen. In this instance, though, as we hear Judy’s voice, we realize that though the letter is intended for Scottie, the narrative is a confession to herself about her dual “Judy/Madeleine” identity (another example of the motif of doubling/reflection in the film). (Significantly, Judy tears up the letter when she finishes writing it; again, the confession is for herself, not for Scottie.) In a critical scene in Woody Allen’s Crimes and Misdemeanors (1989), Judah Rosenthal (Martin Landau) reflects alone, late at night in his home, on the choice of whether to have his affair with a flight attendant (Anjelica Huston) exposed, jeopardizing his medical career and social standing, or to have the woman killed. This is an obvious opportunity for use of interior monologue.

Crimes and Misdemeanors

However, Allen has Judah discuss the dilemma with an apparition of his brother Ben (Sam Waterston), a rabbi.

The appearance of Ben’s “voice” in Judah’s voice-over corresponds with the flash of lightning.

The scene thus signifies as an image-event; Allen stretches strict plausibility (normally our sense of moral guidance does not take human form and appear in our living room in the middle of the night) for the sake of the mimesis.

Also, the voice-over as interior monologue can expand our understanding of a character and a character’s motivations. In Orson Welles’ The Lady from Shanghai (1947), we hear the interior monologue of Michael O’Hara (Welles) as he negotiates his way through the tangled plot involving femme fatale Elsa Bannister (Rita Hayworth), her husband Arthur Bannister (Everett Sloane), and their seemingly countless hangers-on.

The Lady from Shanghai. Michael’s voice-over interprets his frame of mind . . .

. . . as well as his self-perception.

O’Hara’s narration is necessary for us to understand his motivations, especially since his actions defy logic; in this way Welles uses the interior monologue to communicate the discrepancy between O’Hara’s better judgment (in voice-over) and his inability to control himself (his actions on screen); or rather, what he thinks he should do (indirect reflections that signify as a language-event) and what he actually does (which signify as image-event). This duality between interior monologue and exterior action is reflected in the film’s use of mirror imagery and in O’Hara’s perfectly pitched rhetorical question that closes the film.

Welles’ use of interior monologue also demonstrates a problem inherent in the technique: inconsistency of point of view. Aside from the occasional experimental narrative (such as Robert Montgomery’s 1947 film Lady in the Lake), cinema operates in the third person; however, interior monologue is by necessity first-person narrative. If the language-event (the monologue) and the image-event (the images on screen) are intended to occur simultaneously (as in Cause for Alarm! or The Lady from Shanghai), we are presented with a rupture of the mimesis, if not a rupture of spatial narrative logic: a character reflects on or narrates action he or she is in the process of performing. This rupture is evident in Cause for Alarm!, in which we see Ellen Jones performing the actions she is narrating (thus making the narration redundant); the simultaneity also emphasizes the fact that we are seeing a character signify the same act in image and in language (if Ellen’s voice-over did not accompany her actions, the narrative would signify as language and we would be spared the rupture). The redundancy in signification is disruptive because of the fundamental fact that image and language signify in radically different ways.

However, Welles uses interior monologue in The Lady from Shanghai not as a narrative device but as an interpretive device; Michael O’Hara’s running narrative sounds like an interpretation of his actions (and thus reflects the distinct dilemma he faces within himself between reason and desire). This interpretive slant signals the voice-over as a language-event, connected to the image-event in the interpretive realm; thus, there is no rupture in the mimesis, though the running narrative is, in fact, an interior monologue.

As André Bazin has demonstrated, photography (film) necessarily refers to the world which it presents. The strict classical distinction between diegesis and mimesis is complicated by the fact of the photographic indexicality of film. Certainly a filmic text creates a world, but it cannot be said to be entirely self-contained; the world of the filmic text exists in the ambiguous area between presentation and representation. This fundamental referentiality of the image skews our understanding of diegesis and mimesis as these terms apply not to literature but to the unique medium of the motion picture. The various manifestations and practices of voice-over allow us a way into these questions and a footing upon which to begin a more thorough critical qualification of these distinctions.

Works Cited

Bazin, André. What Is Cinema? Volume 1. Trans. Hugh Gray. Berkeley: University of California Press, 1967.

Cavell, Stanley. The World Viewed: Reflections on the Ontology of Film. Enlarged edition. Cambridge, MA: Harvard University Press, 1979.

Rushton, Richard. The Reality of Film: Theories of Filmic Reality. Manchester, UK: Manchester University Press, 2011.

1. For the sake of brevity, I will refer the reader to Cavell, The World Viewed, chapters 2 and 3, and Bazin, What Is Cinema? vol. 1, chapter 1. Also, I explain this idea in more detail in my article “Toward a Semiotics of Poetry and Film: Meaning-Making and Extra-Linguistic Signification.” Literature/Film Quarterly 40:1 (January 2012): 46-53.
2. When a word appears on-screen as a signifier of linguistic content (i.e., language in a letter), words signify as language. When a word appears in a non-diegetic context (i.e., a logo or a name), it signifies as image, since it registers as part of the image-event.
3. For example, Earth vs the Flying Saucers (Sears, 1956) and It! The Terror from Beyond Space (Cahn, 1958); it may be argued that these films utilize an initial descriptive voice-over to establish the context for the supernatural or a condition that lies outside of our own sense of authenticity; but note that A-films such as The Thing (From Another World) (Hawks, 1951) and The Day the Earth Stood Still (Wise, 1951), find mimetic (visual) means to establish this same context.

— Bill Scalia

Bill Scalia holds a PhD in American Literature from Louisiana State University. He has published essays on literature and film in the journals Religion and Literature, Literature/Film Quarterly, The Mark Twain Annual, and contributed a chapter on Ingmar Bergman for the anthology Faith and Spirituality in Masters of World Cinema. Also, he edited the anthology Classic Critical Views: Ralph Waldo Emerson. Dr. Scalia teaches writing and literature at St. Mary’s Seminary & University in Baltimore, Maryland.
