RANDOM CYCLE LOSS AND ITS APPLICATION TO VOICE CONVERSION (SPOT/DEMO/SUPPLEMENT)
Haoran Sun, Dong Wang, Lantian Li, Chen Chen, Thomas Fang Zheng
This is a supplemental page for Random Cycle Loss and Its Application to Voice Conversion (paper, code). Speech disentanglement aims to decompose the independent causal factors of speech signals into separate codes. Perfect disentanglement benefits a broad range of speech processing tasks. This paper presents a simple but effective disentanglement approach based on a cycle consistency loss and random factor substitution. This leads to a novel random cycle (RC) loss that enforces analysis-and-resynthesis consistency, a main principle of reductionism. We theoretically demonstrate that the proposed RC loss can achieve independent codes if well optimized, which in turn leads to superior disentanglement when combined with an information bottleneck (IB). Extensive simulation experiments were conducted to understand the properties of the RC loss, and experimental results on voice conversion further demonstrate the practical merit of the proposal.
Section 1 : Random cycle loss
(1) Analysis-and-resynthesis principle
· Analysis and resynthesis is a primary principle in reductionism. The underlying belief is that once a phenomenon can be well explained by independent factors, recombining the factors can lead to a new and valid phenomenon, where 'valid' means that the new phenomenon can be explained in the same way as the existing observations. For example, once scientists know that different materials are composed of atoms, it becomes possible to construct new materials by combining atoms in new ways, and the new materials can be decomposed into atoms following the same decomposition rule. The picture is from a TACC report, credited to Aaron Dubrow (https://www.tacc.utexas.edu/-/collision-chemistry).

(2) Analysis-and-resynthesis principle for information disentanglement
· The analysis-and-resynthesis principle should be followed by an ideal information disentanglement. As shown in the following picture, two valid data points x1 and x2 are each disentangled into two informational codes by the encoder f. One can resynthesize a new data point by selecting each component of the code from either x1 or x2 and passing it through the decoder g. If f perfectly disentangles the informational factors, the resynthesized data point is decomposed back into the original code components when passed through the encoder f again.
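The consistency property above can be illustrated with a toy sketch in NumPy. All names here are hypothetical: we use an invertible linear map as the encoder and its inverse as the decoder, so that disentanglement is "perfect" by construction and re-encoding a resynthesized sample recovers the mixed code exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy encoder/decoder pair: an invertible linear map and its
# inverse, so analysis (f) and resynthesis (g) are exact inverses.
A = rng.normal(size=(4, 4))

def f(x):  # encoder: data -> code
    return A @ x

def g(z):  # decoder: code -> data
    return np.linalg.solve(A, z)

x1, x2 = rng.normal(size=4), rng.normal(size=4)
z1, z2 = f(x1), f(x2)

# Resynthesis: take the first code block from x1 and the rest from x2.
z_mix = np.concatenate([z1[:2], z2[2:]])
x_new = g(z_mix)

# Analysis of the resynthesized data yields the same mixed code back.
assert np.allclose(f(x_new), z_mix)
```

With a trained, imperfect encoder/decoder pair, the equality becomes a penalty to minimize rather than an identity, which is exactly what the random cycle loss below does.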

(3) Random consistency loss
· We design a random consistency loss for the IB-based information disentanglement system, e.g., an auto-encoder. When the data X passes through f to obtain Z, some components of Z are randomly substituted by the corresponding components of the codes of other data samples. This random substitution is denoted RFS (random factor substitution). The RFS code Z' passes through the decoder g and then once again through the encoder f, obtaining Z''. The cycle loss measures the difference between the code Z'' of the reconstructed speech and the RFS code Z'.
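The cycle described above can be sketched in a few lines of NumPy. The linear encoder/decoder and the split of the code into a "content" block and a "class" block are illustrative assumptions, not the paper's actual networks; only the computation pattern (encode, RFS, decode, re-encode, compare) follows the description.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy encoder/decoder (untrained linear maps), standing in for
# the neural networks f and g of the paper.
W_enc = rng.normal(size=(6, 8))
W_dec = rng.normal(size=(8, 6))

def f(x):  # encoder: (batch, 8) data -> (batch, 6) codes
    return x @ W_enc.T

def g(z):  # decoder: codes -> data
    return z @ W_dec.T

X = rng.normal(size=(4, 8))  # a mini-batch of data samples
Z = f(X)

# Random factor substitution (RFS): replace one code block (here the last 3
# dims, standing for the "class" factor) with codes of permuted samples.
perm = rng.permutation(len(Z))
Z_rfs = Z.copy()
Z_rfs[:, 3:] = Z[perm, 3:]

# Cycle: decode the RFS code Z', re-encode to get Z'', penalize the mismatch.
Z_cycle = f(g(Z_rfs))
rc_loss = np.mean((Z_cycle - Z_rfs) ** 2)
print(rc_loss)  # driven toward 0 when f and g satisfy the consistency
```

In training, `rc_loss` would be minimized jointly with the reconstruction loss of the auto-encoder.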
Section 2 : Simulation
(1) Mutual information (MI) reduction
· We use a conditional auto-encoder (CAE) as the information disentanglement model, where the class label is used as the conditional variable of the decoder. Ideally, if the disentanglement is perfect, the code should be independent of the class. We measure the MI reduction of the CAE during the training process when different regularization losses are involved. AD denotes the adversarial loss, MI denotes the mutual information loss (vCLUB), and RC denotes the random cycle loss.
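To make the measured quantity concrete, here is a minimal plug-in MI estimator for discrete variables. This is only an illustration of what "MI between code and class" means; the paper's continuous codes require an estimator such as vCLUB, and the function name below is our own.

```python
import numpy as np

def mutual_information(labels_a, labels_b):
    """Plug-in MI estimate (in nats) between two discrete label arrays."""
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    joint = np.zeros((len(a_vals), len(b_vals)))
    for i, j in zip(a_idx, b_idx):
        joint[i, j] += 1.0
    joint /= joint.sum()                       # empirical joint distribution
    pa = joint.sum(axis=1, keepdims=True)      # marginal of a
    pb = joint.sum(axis=0, keepdims=True)      # marginal of b
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

# A code fully determined by the class carries maximal MI ...
y = np.array([0, 0, 1, 1, 2, 2])
assert mutual_information(y, y) > 0
# ... while a code independent of the class carries (near) zero MI.
z = np.array([0, 1, 0, 1, 0, 1])
print(mutual_information(y, z))
```

A regularizer that succeeds in disentangling drives this quantity between the code and the class label toward zero during training.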

(2) Conversion example
· Conversion examples. Top-left: vanilla CAE; top-right: CAE + adversarial loss; bottom-left: CAE + MI loss; bottom-right: CAE + random cycle loss. In each figure, the blue solid line is the data sample providing class information, and the orange dashed line is the data sample providing content information. The green dash-dot line is the reconstructed data with the resynthesized code, i.e., the conversion result. The red dashed line is the converted data with the class information shifted off, so that we can easily check whether the content is well retained in the conversion. It can be seen that the RC loss produces much better reconstruction than the vanilla CAE and the CAEs with AD and MI losses. In particular, it retains the content information better than CAE+AD and CAE+MI.


(3) Training stability
· The conversion result along the training process. Top-left: vanilla CAE; top-right: CAE + adversarial loss; bottom-left: CAE + MI loss; bottom-right: CAE + random cycle loss. In each figure, the blue and orange lines represent the data samples providing class information and content information, respectively; the green and red lines are the reconstructions of these two samples. The purple line represents the converted sample. It can be seen that with the RC loss, the conversion quickly converges to a stable status and remains unchanged. In contrast, the CAE baseline cannot converge to a good condition, and CAE+AD and CAE+MI converge slowly and are less stable than CAE+RC.


Section 3 : Demos of SpeechFlow&CycleFlow
Below are audio clips of several kinds of voice conversion, which correspond to Section 6.5 in the paper. Similar to the demo website of SpeechFlow, for each source-target pair you can select the aspect(s) you wish to convert, and the corresponding converted speech will load automatically. The selectable aspects are:
- Style to be converted: corresponding to "Style conversion (pitch + rhythm)" in section 6.5;
- Timbre to be converted: corresponding to "Timbre conversion (timbre only)" in section 6.5;
- Style and timbre to be converted: corresponding to "Full conversion (timbre + pitch + rhythm)" in section 6.5.
Utterance 1
Source Speech

Target Speech

Choose What To Convert:
SpeechFlow

CycleFlow

Utterance 2
Source Speech

Target Speech

Choose What To Convert:
SpeechFlow

CycleFlow

Utterance 3
Source Speech

Target Speech

Choose What To Convert:
SpeechFlow

CycleFlow

Utterance 4
Source Speech

Target Speech

Choose What To Convert:
SpeechFlow

CycleFlow

Utterance 5
Source Speech

Target Speech

Choose What To Convert:
SpeechFlow

CycleFlow

Utterance 6
Source Speech

Target Speech

Choose What To Convert:
SpeechFlow

CycleFlow
