Jan ANGUITA Javier HERNANDO Alberto ABAD
Jacobian Adaptation (JA) has been successfully used in Automatic Speech Recognition (ASR) systems to adapt the acoustic models from the training to the testing noise conditions. In this work we present an improvement of JA for speaker verification, where a specific training noise reference is estimated for each speaker model. The new proposal, which will be referred to as Model-dependent Noise Reference Jacobian Adaptation (MNRJA), has consistently outperformed JA in our speaker verification experiments.
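As an illustration of the adaptation step the abstract refers to, here is a minimal sketch of standard first-order Jacobian adaptation applied to a log-spectral mean vector. It is not the MNRJA method itself: the function name, the argument names, and the choice of the log-spectral domain are assumptions made for this example. The only link to the proposal is that `noise_ref` may be supplied separately for each speaker model instead of being shared by all models.

```python
import math

def jacobian_adapt(mean_train, noise_ref, noise_test):
    """First-order adaptation of a noisy-speech log-spectral mean (hypothetical helper).

    mean_train: log-spectral mean of noisy speech seen in training, one value per band
    noise_ref:  log-spectral training noise reference, one value per band
    noise_test: log-spectral noise estimate for the testing condition, one value per band
    """
    adapted = []
    for y, n_ref, n_test in zip(mean_train, noise_ref, noise_test):
        # With y = log(exp(s) + exp(n)), the derivative dy/dn equals exp(n - y).
        # The value is clamped to 1.0 in case the reference exceeds the model mean.
        jac = min(1.0, math.exp(n_ref - y))
        # Linear (first-order) correction around the training noise reference.
        adapted.append(y + jac * (n_test - n_ref))
    return adapted
```

When the testing noise equals the reference, the mean is returned unchanged; a louder testing noise shifts each band upward in proportion to how much noise already dominated that band in training.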
Javier R. SAETA Javier HERNANDO
Selecting the most representative utterances from a speaker is essential for automatic enrollment in speaker verification to perform well. Model quality measures and threshold estimation methods mainly deal with the scarcity of data and the difficulty of obtaining impostor data in real applications. Conventional methods estimate the quality of the training utterances only after the model has been created. In that case, it is not possible to ask the user for more utterances during the training session if they are needed; a new training session must be started. This is especially problematic in applications where only one or two enrollment sessions are allowed. In this paper, a new on-line quality method based on a male and a female Universal Background Model (UBM) is introduced. The two models act as a reference for new utterances: they indicate whether the utterances belong to the same speaker and, at the same time, provide a measure of their quality. The estimation of the verification threshold is also strongly influenced by the previous selection of the speaker's utterances. In this context, potential outliers, i.e., client scores that lie far from the mean, can lead to wrong estimates of the client mean and variance. To alleviate this problem, several efficient threshold estimation methods based on removing or weighting scores are proposed. Before the threshold is estimated, the client scores classified as outliers are removed, pruned or weighted, which improves the subsequent estimates. Text-dependent experiments have been carried out on a multi-session telephone database in Spanish. The database was recorded by the authors and contains 184 speakers.
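The outlier handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the threshold form (client mean minus `alpha` times the client standard deviation), the rule of discarding scores more than `k` standard deviations from the mean, and the names `prune_outliers` and `estimate_threshold` are all hypothetical choices made for this example.

```python
import statistics

def prune_outliers(scores, k=2.0):
    """Drop client scores farther than k standard deviations from the mean (assumed rule)."""
    if len(scores) < 3:
        return list(scores)  # too few scores to judge outliers
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    if sigma == 0:
        return list(scores)  # all scores identical, nothing to remove
    return [s for s in scores if abs(s - mu) <= k * sigma]

def estimate_threshold(scores, alpha=1.0, k=2.0):
    """Speaker-dependent threshold from client scores only (assumed form: mean - alpha * std)."""
    kept = prune_outliers(scores, k)
    mu = statistics.mean(kept)
    sigma = statistics.pstdev(kept)
    return mu - alpha * sigma

# Six consistent client scores plus one distant score.
client_scores = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, -3.0]
print(prune_outliers(client_scores))
print(estimate_threshold(client_scores))
```

Without pruning, the single distant score lowers the mean and inflates the standard deviation, so the resulting threshold falls well below every genuine client score. After pruning, the estimate stays close to the cluster of consistent scores.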