Please use this identifier to cite or link to this item: https://open.uns.ac.rs/handle/123456789/15353
DC Field | Value | Language
dc.contributor.author | Suzić, Siniša | en
dc.contributor.author | Delić, Tijana | en
dc.contributor.author | Pekar, Darko | en
dc.contributor.author | Delić, Vlado | en
dc.contributor.author | Sečujski, Milan | en
dc.date.accessioned | 2020-03-03T14:59:36Z | -
dc.date.available | 2020-03-03T14:59:36Z | -
dc.date.issued | 2019-01-01 | en
dc.identifier.issn | 1785-8860 | en
dc.identifier.uri | https://open.uns.ac.rs/handle/123456789/15353 | -
dc.description.abstract | © 2019, Budapest Tech Polytechnical Institution. All rights reserved. The paper proposes a novel deep neural network (DNN) architecture aimed at improving the expressiveness of text-to-speech (TTS) synthesis by learning the properties of a particular speech style from a multi-speaker, multi-style speech corpus and transplanting it into the speech of a new speaker whose actual speech in the target style is missing from the training corpus. In most research on this topic, speech styles are identified with corresponding emotional expressions, which is the approach adopted in this research as well, and the entire process is conventionally referred to as "emotion transplantation". The proposed architecture builds on the concept of the shared hidden layer DNN architecture, originally used for multi-speaker modelling, principally by introducing the style code as an auxiliary input. In this way, the mapping between linguistic and acoustic features performed by the DNN is made style dependent. The results of both subjective and objective evaluation of the quality of the synthesized speech, as well as of the quality of style reproduction, show that when the emotional speech data available for training are limited, the performance of the proposed system represents a small but clear improvement over the state of the art. The baseline system is based on the standard approach, which uses both a speaker code and a style code as auxiliary inputs. | en
dc.relation.ispartof | Acta Polytechnica Hungarica | en
dc.title | Style transplantation in neural network-based speech synthesis | en
dc.type | Journal/Magazine Article | en
dc.identifier.doi | 10.12700/APH.16.6.2019.6.11 | en
dc.identifier.scopus | 2-s2.0-85068819967 | en
dc.identifier.url | https://api.elsevier.com/content/abstract/scopus_id/85068819967 | en
dc.relation.lastpage | 189 | en
dc.relation.firstpage | 171 | en
dc.relation.issue | 6 | en
dc.relation.volume | 16 | en
item.grantfulltext | none | -
item.fulltext | No Fulltext | -
crisitem.author.dept | Fakultet tehničkih nauka, Departman za energetiku, elektroniku i telekomunikacije | -
crisitem.author.dept | Fakultet tehničkih nauka | -
crisitem.author.dept | Fakultet tehničkih nauka, Departman za energetiku, elektroniku i telekomunikacije | -
crisitem.author.dept | Fakultet tehničkih nauka, Departman za energetiku, elektroniku i telekomunikacije | -
crisitem.author.parentorg | Fakultet tehničkih nauka | -
crisitem.author.parentorg | Univerzitet u Novom Sadu | -
crisitem.author.parentorg | Fakultet tehničkih nauka | -
crisitem.author.parentorg | Fakultet tehničkih nauka | -
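The abstract describes conditioning a shared hidden layer DNN on an auxiliary style code, so that the same shared weights map linguistic to acoustic features differently per style. A minimal sketch of that conditioning idea, using only the standard library; all dimensions, names, and weights are illustrative assumptions, not taken from the paper:

```python
import math
import random

random.seed(0)

# Illustrative dimensions (assumptions, not from the paper)
LING_DIM = 10     # linguistic feature vector size
STYLE_DIM = 3     # one-hot style code, e.g. neutral / happy / sad
HIDDEN = 16       # shared hidden layer size
ACOUSTIC_DIM = 5  # predicted acoustic feature vector size

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

# Shared weights: common to all speakers and styles.
W1 = rand_matrix(LING_DIM + STYLE_DIM, HIDDEN)
W2 = rand_matrix(HIDDEN, ACOUSTIC_DIM)

def forward(linguistic, style_code):
    """Map linguistic features to acoustic features, conditioned on style.

    Appending the style code to the input makes the learned
    linguistic-to-acoustic mapping style dependent, while the
    hidden-layer weights themselves stay shared."""
    x = linguistic + style_code  # concatenate auxiliary style input
    h = [math.tanh(sum(xi * W1[i][j] for i, xi in enumerate(x)))
         for j in range(HIDDEN)]                       # shared hidden layer
    return [sum(hj * W2[j][k] for j, hj in enumerate(h))
            for k in range(ACOUSTIC_DIM)]              # acoustic prediction

ling = [random.uniform(-1, 1) for _ in range(LING_DIM)]
neutral = [1.0, 0.0, 0.0]
happy = [0.0, 1.0, 0.0]

# Same linguistic input, different style codes -> different acoustic output.
out_neutral = forward(ling, neutral)
out_happy = forward(ling, happy)
```

Because the style code enters as an input rather than selecting a separate network, a style learned from other speakers can in principle be applied to a speaker for whom no data in that style exists, which is the "transplantation" the abstract refers to.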
Appears in Collections:FTN Publikacije/Publications

SCOPUS™ Citations: 10 (checked on May 10, 2024)
Page view(s): 32 (last week: 7, last month: 2; checked on May 10, 2024)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.