Natural human-computer interaction (HCI) has long received wide attention. As an important aspect of HCI, voice conversion, which transforms the voice of one speaker so that it is perceived as the voice of another speaker, is significant for speech personalization. Most methods proposed in the existing literature assume the availability of parallel training sentences, that is, speech data with the same linguistic content from the source and target speakers; these methods are referred to as text-dependent voice conversion. The requirement for a parallel database is inconvenient and sometimes hard to fulfill, which restricts practical applications. Text-independent voice conversion is a principal way to solve this problem, and it is the main focus of this paper. Voice conversion can be separated into a training stage and a conversion stage, and the training stage in turn consists of data alignment and model training. The main problem in text-independent voice conversion is data alignment with a nonparallel training database. This paper proposes new supervised data alignment methods for text-independent voice conversion that use phonetic information as a constraint during alignment: phoneme-cluster-based linear alignment and self-organizing iterative learning. In phoneme-cluster-based linear alignment, a mapping between the source and target parameter spaces is established using weighted linear alignment based on common phonetic clusters. The phoneme clusters common to the source and target speech serve as anchors for mapping the source speaker's vectors onto the parameter space of the target speaker. Several of the phonetic clusters nearest to each vector are taken into account simultaneously to ensure mapping continuity. Furthermore, to fine-tune and improve the alignment, a nonlinear data alignment that uses a self-organizing iterative learning algorithm is proposed. 
The result of the linear alignment is used to initialize the iterative learning. The algorithm establishes an optimal balance between the phonetic constraint and the preservation of topology, and it thus maintains alignment accuracy and stability. Because the nonlinear alignment results are self-organized, the underlying internal structures of the source and target spaces can be associated. As an extension of the algorithm to cross-lingual voice conversion, a manifold expansion algorithm is used in the nonlinear data...
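
The phoneme-cluster-based linear alignment described above can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the inverse-distance weights, the function name `linear_align`, and the parameter `n_nearest` are assumptions made for illustration only and are not the paper's exact method. Each source vector is shifted by a weighted average of the source-to-target offsets of its nearest common phoneme cluster centroids, so that neighboring vectors receive similar shifts.

```python
import numpy as np


def linear_align(src, src_centroids, tgt_centroids, n_nearest=3, eps=1e-8):
    """Map source vectors toward the target space using shared phoneme clusters.

    Hypothetical sketch: row k of src_centroids and tgt_centroids is assumed
    to be the centroid of the same phoneme cluster for each speaker.
    """
    src = np.asarray(src, dtype=float)
    cs = np.asarray(src_centroids, dtype=float)
    ct = np.asarray(tgt_centroids, dtype=float)
    n_nearest = min(n_nearest, len(cs))

    # Distance from every source vector to every source centroid: shape (N, K).
    dist = np.linalg.norm(src[:, None, :] - cs[None, :, :], axis=2)

    # Indices and distances of the n_nearest anchor clusters per vector.
    idx = np.argsort(dist, axis=1)[:, :n_nearest]
    near = np.take_along_axis(dist, idx, axis=1)

    # Inverse-distance weights (an assumed choice), normalized to sum to 1.
    weights = 1.0 / (near + eps)
    weights /= weights.sum(axis=1, keepdims=True)

    # Offset of each anchor from the source space to the target space.
    shift = ct - cs

    # Weighted average of the offsets of the selected anchors.
    return src + np.einsum("nk,nkd->nd", weights, shift[idx])
```

In this sketch a vector that coincides with a source centroid is mapped almost exactly onto the matching target centroid, while vectors between clusters receive a blend of the neighboring offsets. The mapped vectors could then serve as the starting point for the iterative refinement step.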