This voice conversion device comprises: a language information extraction unit that extracts, from a source voice signal, language information corresponding to the speech content; an appearance feature extraction unit that extracts, from a captured image of a person, an appearance feature representing characteristics of that person's face; and a converted voice generation unit that generates a converted voice on the basis of the language information and the appearance feature.
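The data flow among the three units can be illustrated with a minimal sketch. It shows only how the units are wired together; the abstract does not disclose how each unit is implemented. All names here (`VoiceConversionDevice`, `toy_language_info`, `toy_appearance_feature`, `toy_generator`) and the toy arithmetic inside them are hypothetical stand-ins, not the method of the disclosed device. In practice each unit would presumably be a trained model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class VoiceConversionDevice:
    """Wires the three units named in the abstract (hypothetical interface)."""
    extract_language_info: Callable[[Sequence[float]], List[Vector]]
    extract_appearance_feature: Callable[[Sequence[Sequence[float]]], Vector]
    generate_converted_voice: Callable[[List[Vector], Vector], List[float]]

    def convert(self, source_voice: Sequence[float],
                face_image: Sequence[Sequence[float]]) -> List[float]:
        # 1. Language information comes from the source voice signal only.
        language_info = self.extract_language_info(source_voice)
        # 2. The appearance feature comes from the captured face image only.
        appearance = self.extract_appearance_feature(face_image)
        # 3. The converted voice is generated from both.
        return self.generate_converted_voice(language_info, appearance)


def toy_language_info(signal: Sequence[float], frame: int = 2) -> List[Vector]:
    # Assumption: one 1-D vector per frame (mean absolute amplitude).
    frames = [signal[i:i + frame] for i in range(0, len(signal), frame)]
    return [[sum(abs(s) for s in f) / len(f)] for f in frames if f]


def toy_appearance_feature(image: Sequence[Sequence[float]]) -> Vector:
    # Assumption: mean pixel intensity stands in for a learned face embedding.
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels)]


def toy_generator(language_info: List[Vector], appearance: Vector) -> List[float]:
    # Assumption: scale each frame value by the appearance feature.
    return [frame_vec[0] * appearance[0] for frame_vec in language_info]


device = VoiceConversionDevice(toy_language_info, toy_appearance_feature, toy_generator)
converted = device.convert([1.0, -1.0, 2.0, -2.0], [[0.5, 0.5]])
print(converted)  # [0.5, 1.0]
```

Passing the units in as callables keeps the sketch independent of any particular model: the only thing it fixes is that the generator receives both the language information and the appearance feature, as the abstract states.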