Emotional stimuli, in particular fearful faces, are preferentially processed in the visual system. Most laboratory studies present emotional faces unimodally, although real-life emotional experience typically involves input from multiple sensory channels, such as faces paired with voices. In this study, we therefore investigated how concurrent emotional voices influence the preferential processing of emotional faces. We used the breaking continuous flash suppression (b-CFS) paradigm to quantify biased early visual processing: In CFS, pictures of emotional faces presented to one eye are initially inaccessible to conscious awareness because perception is dominated by a dynamic mask presented to the other eye. We presented fearful, happy, and neutral faces either unimodally or paired with non-linguistic vocalizations (fearful, happy, or neutral). Thirty-six healthy participants were asked to indicate when the face reached conscious awareness. We replicated the finding that fearful faces overall broke suppression faster, indicating that they are processed preferentially. In addition, all faces broke suppression faster when they were paired with voices. Interestingly, faces paired with neutral and happy voices broke suppression the fastest, followed by faces paired with fearful voices. These results demonstrate the preferential processing of emotional faces, particularly threat-related cues, early in the visual stream. Moreover, they show that the visual processing of faces is modulated by auditory stimulation.