Previous psycholinguistic studies have demonstrated that, during language comprehension, people tend to fixate the visual object to which the spoken input refers. However, it remains unclear how the visual scene, especially the semantic consistency between an object and its background, affects the word-object mapping process during comprehension. Two visual world paradigm experiments were conducted to investigate how scene consistency dynamically influences language-driven eye movements in a speech comprehension task and a scene comprehension task. In each trial, participants listened to a spoken sentence while viewing a picture containing two critical objects: a mentioned target object (e.g., tiger), which was embedded in a consistent (e.g., field), inconsistent (e.g., sky), or blank background, and an unmentioned non-target object (e.g., eagle), which was always consistent with its background. The results showed that the proportion of fixations on inconsistent targets was higher than that on consistent targets, and that task demand affected the strength and direction of this inconsistency effect both before and after the target was mentioned. In summary, spoken language, scene-based knowledge, and task demand jointly determine eye movements during audio-visual integration for comprehension.