Computer Science - Sound Publications (50)


Computer Science - Sound Publications

The Vocal Joystick Vowel Corpus, by Washington University, was used to study monophthongs pronounced by native English speakers. The objective of this study was to quantitatively measure the extent at which speech recognition methods can distinguish between similar sounding vowels. In particular, the phonemes /\textipa{@}/, /{\ae}/, /\textipa{A}:/ and /\textipa{2}/ were analysed. Read More

We formulated and implemented a procedure to generate aliasing-free excitation source signals based on the Fujisaki- Ljungqvist model. It uses a new antialiasing filter in the contin- uous time domain followed by an IIR digital filter for response equalization. We introduced a general designing procedure of cosine series to design the new antialiasing function. Read More

Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNN) are able to extract higher level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer term temporal context in the audio signals. Read More

With the development of speech synthesis techniques, automatic speaker verification systems face the serious challenge of spoofing attack. In order to improve the reliability of speaker verification systems, we develop a new filter bank based cepstral feature, deep neural network filter bank cepstral coefficients (DNN-FBCC), to distinguish between natural and spoofed speech. The deep neural network filter bank is automatically generated by training a filter bank neural network (FBNN) using natural and synthetic speech. Read More

This work presents a novel framework based on feed-forward neural network for text-independent speaker classification and verification, two related systems of speaker recognition. With optimized features and model training, it achieves 100% classification rate in classification and less than 6% Equal Error Rate (ERR), using merely about 1 second and 5 seconds of data respectively. Features with stricter Voice Active Detection (VAD) than the regular one for speech recognition ensure extracting stronger voiced portion for speaker recognition, speaker-level mean and variance normalization helps to eliminate the discrepancy between samples from the same speaker. Read More

The mechanism proposed here is for real-time speaker change detection in conversations, which firstly trains a neural network text-independent speaker classifier using in-domain speaker data. Through the network, features of conversational speech from out-of-domain speakers are then converted into likelihood vectors, i.e. Read More

Musical source separation methods exploit source-specific spectral characteristics to facilitate the decomposition process. Kernel Additive Modelling (KAM) models a source applying robust statistics to time-frequency bins as specified by a source-specific kernel, a function defining similarity between bins. Kernels in existing approaches are typically defined using metrics between single time frames. Read More

This research was conducted to develop a method to identify voice utterance. For voice utterance that encounters change caused by aging factor, with the interval of 10 to 25 years. The change of voice utterance influenced by aging factor might be extracted by MFCC (Mel Frequency Cepstrum Coefficient). Read More

Korea University Intelligent Signal Processing Lab. (KU-ISPL) developed speaker recognition system for SRE16 fixed training condition. Data for evaluation trials are collected from outside North America, spoken in Tagalog and Cantonese while training data only is spoken English. Read More

Chord recognition systems use temporal models to post-process frame-wise chord preditions from acoustic models. Traditionally, first-order models such as Hidden Markov Models were used for this task, with recent works suggesting to apply Recurrent Neural Networks instead. Due to their ability to learn longer-term dependencies, these models are supposed to learn and to apply musical knowledge, instead of just smoothing the output of the acoustic model. Read More

Several recent polyphonic music transcription systems have utilized deep neural networks to achieve state of the art results on various benchmark datasets, pushing the envelope on framewise and note-level performance measures. Unfortunately we can observe a sort of glass ceiling effect. To investigate this effect, we provide a detailed analysis of the particular kinds of errors that state of the art deep neural transcription systems make, when trained and tested on a piano transcription task. Read More

In a recent conference paper, we have reported a rhythm transcription method based on a merged-output hidden Markov model (HMM) that explicitly describes the multiple-voice structure of polyphonic music. This model solves a major problem of conventional methods that could not properly describe the nature of multiple voices as in polyrhythmic scores or in the phenomenon of loose synchrony between voices. In this paper we present a complete description of the proposed model and develop an inference technique, which is valid for any merged-output HMMs for which output probabilities depend on past events. Read More

Hidden Markov model based various phoneme recognition methods for Bengali language is reviewed. Automatic phoneme recognition for Bengali language using multilayer neural network is reviewed. Usefulness of multilayer neural network over single layer neural network is discussed. Read More

Interaction with the world requires an organism to transform sensory signals into representations in which behaviorally meaningful properties of the environment are made explicit. These representations are derived through cascades of neuronal processing stages in which neurons at each stage recode the output of preceding stages. Explanations of sensory coding may thus involve understanding how low-level patterns are combined into more complex structures. Read More

We report investigations into speaker classification of larger quantities of unlabelled speech data using small sets of manually phonemically annotated speech. The Kohonen speech typewriter is a semi-supervised method comprised of self-organising maps (SOMs) that achieves low phoneme error rates. A SOM is a 2D array of cells that learn vector representations of the data based on neighbourhoods. Read More

Most of the previous approaches to lyrics-to-audio alignment used a pre-developed automatic speech recognition (ASR) system that innately suffered from several difficulties to adapt the speech model to individual singers. A significant aspect missing in previous works is the self-learnability of repetitive vowel patterns in the singing voice, where the vowel part used is more consistent than the consonant part. Based on this, our system first learns a discriminative subspace of vowel sequences, based on weighted symmetric non-negative matrix factorization (WS-NMF), by taking the self-similarity of a standard acoustic feature as an input. Read More

This work aims to investigate the use of deep neural network to detect commercial hobby drones in real-life environments by analyzing their sound data. The purpose of work is to contribute to a system for detecting drones used for malicious purposes, such as for terrorism. Specifically, we present a method capable of detecting the presence of commercial hobby drones as a binary classification problem based on sound event detection. Read More

To stretch a music piece to a given length is a common demand in people's daily lives, e.g., in audio-video synchronization and animation production. Read More

Behavioral annotation using signal processing and machine learning is highly dependent on training data and manual annotations of behavioral labels. Previous studies have shown that speech information encodes significant behavioral information and be used in a variety of automated behavior recognition tasks. However, extracting behavior information from speech is still a difficult task due to the sparseness of training data coupled with the complex, high-dimensionality of speech, and the complex and multiple information streams it encodes. Read More

In this paper, a novel architecture for a deep recurrent neural network, residual LSTM is introduced. A plain LSTM has an internal memory cell that can learn long term dependencies of sequential data. It also provides a temporal shortcut path to avoid vanishing or exploding gradients in the temporal domain. Read More

The higher order differential energy operator (DEO), denoted via $\Upsilon_k(x)$, is an extension to the second order famous Teager-Kaiser operator. The DEO helps measuring the higher order gauge of energy of a signal which is useful for AM-FM demodulation. However, the energy criterion defined by the DEO is not compliant with the presumption of positivity of energy. Read More

We propose a new deep network for audio event recognition, called AENet. In contrast to speech, sounds coming from audio events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of clear sub-word units that are present in speech. Read More

Speechreading is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible acoustic speech signal from silent video frames of a speaking person. The proposed CNN generates sound features for each frame based on its neighboring frames. Read More

Traditional speech enhancement techniques modify the magnitude of a speech in time-frequency domain, and use the phase of a noisy speech to resynthesize a time domain speech. This work proposes a complex-valued Gaussian process latent variable model (CGPLVM) to enhance directly the complex-valued noisy spectrum, modifying not only the magnitude but also the phase. The main idea that underlies the developed method is the modeling of short-time Fourier transform (STFT) coefficients across the time frames of a speech as a proper complex Gaussian process (GP) with noise added. Read More

There is a common observation that audio event classification is easier to deal with than detection. So far, this observation has been accepted as a fact and we lack a careful analysis. In this paper, we reason the rationale behind this fact and, more importantly, leverage them to benefit the audio event detection task. Read More

We introduce a dataset for facilitating audio-visual analysis of musical performances. The dataset comprises a number of simple multi-instrument musical pieces assembled from coordinated but separately recorded performances of individual tracks. For each piece, we provide the musical score in MIDI format, the audio recordings of the individual tracks, the audio and video recording of the assembled mixture, and ground-truth annotation files including frame-level and note-level transcriptions. Read More

In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time. We show that our model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure is able to capture underlying sources of variations in the temporal sequences over very long time spans, on three datasets of different nature. Human evaluation on the generated samples indicate that our model is preferred over competing models. Read More

Most of the existing studies on voice conversion (VC) are conducted in acoustically matched conditions between source and target signal. However, the robustness of VC methods in presence of mismatch remains unknown. In this paper, we report a comparative analysis of different VC techniques under mismatched conditions. Read More

About 90 percent of people with Parkinson's disease (PD) experience decreased functional communication due to the presence of voice and speech disorders associated with dysarthria that can be characterized by monotony of pitch (or fundamental frequency), reduced loudness, irregular rate of speech, imprecise consonants, and changes in voice quality. Speech-language pathologists (SLPs) work with patients with PD to improve speech intelligibility using various intensive in-clinic speech treatments. SLPs also prescribe home exercises to enhance generalization of speech strategies outside of the treatment room. Read More

This paper addresses the problem of Target Activity Detection (TAD) for binaural listening devices. TAD denotes the problem of robustly detecting the activity of a target speaker in a harsh acoustic environment, which comprises interfering speakers and noise (cocktail party scenario). In previous work, it has been shown that employing a Feed-forward Neural Network (FNN) for detecting the target speaker activity is a promising approach to combine the advantage of different TAD features (used as network inputs). Read More

In this work, we propose a two-dimensional Head-Related Transfer Function (HRTF)-based robust beamformer design for robot audition, which allows for explicit control of the beamformer response for the entire three-dimensional sound field surrounding a humanoid robot. We evaluate the proposed method by means of both signal-independent and signal-dependent measures in a robot audition scenario. Our results confirm the effectiveness of the proposed two-dimensional HRTF-based beamformer design, compared to our previously published one-dimensional HRTF-based beamformer design, which was carried out for a fixed elevation angle only. Read More

We introduce a novel approach to studying animal behaviour and the context in which it occurs, through the use of microphone backpacks carried on the backs of individual free-flying birds. These sensors are increasingly used by animal behaviour researchers to study individual vocalisations of freely behaving animals, even in the field. However such devices may record more than an animals vocal behaviour, and have the potential to be used for investigating specific activities (movement) and context (background) within which vocalisations occur. Read More

This paper describes a computational model of loudness variations in expressive ensemble performance. The model predicts and explains the continuous variation of loudness as a function of information extracted automatically from the written score. Although such models have been proposed for expressive performance in solo instruments, this is (to the best of our knowledge) the first attempt to define a model for expressive performance in ensembles. Read More

In this paper, we describe three neural network (NN) based EEG-Speech (NES) models that map the unspoken EEG signals to the corresponding phonemes. Instead of using conventional feature extraction techniques, the proposed NES models rely on graphic learning to project both EEG and speech signals into deep representation feature spaces. This NN based linear projection helps to realize multimodal data fusion (i. Read More

We propose a new algorithm for time stretching music signals based on the theory of nonstationary Gabor frames. The algorithm extends the techniques of the classical phase vocoder by incorporating adaptive time-frequency representations and adaptive phase locking. Applying a preliminary onset detection algorithm, the obtained time-frequency representation implies good time resolution for the onsets and good frequency resolution for the sinusoidal components. Read More

This paper describes the LIA speaker recognition system developed for the Speaker Recognition Evaluation (SRE) campaign. Eight sub-systems are developed, all based on a state-of-the-art approach: i-vector/PLDA which represents the mainstream technique in text-independent speaker recognition. These sub-systems differ: on the acoustic feature extraction front-end (MFCC, PLP), at the i-vector extraction stage (UBM, DNN or two-feats posteriors) and finally on the data-shifting (IDVC, mean-shifting). Read More

In an attempt at exploring the limitations of simple approaches to the task of piano transcription (as usually defined in MIR), we conduct an in-depth analysis of neural network-based framewise transcription. We systematically compare different popular input representations for transcription systems to determine the ones most suitable for use with neural networks. Exploiting recent advances in training techniques and new regularizers, and taking into account hyper-parameter tuning, we show that it is possible, by simple bottom-up frame-wise processing, to obtain a piano transcriber that outperforms the current published state of the art on the publicly available MAPS dataset -- without any complex post-processing steps. Read More

Chord recognition systems depend on robust feature extraction pipelines. While these pipelines are traditionally hand-crafted, recent advances in end-to-end machine learning have begun to inspire researchers to explore data-driven methods for such tasks. In this paper, we present a chord recognition system that uses a fully convolutional deep auditory model for feature extraction. Read More

In this demo we show a novel approach to score following. Instead of relying on some symbolic representation, we are using a multi-modal convolutional neural network to match the incoming audio stream directly to sheet music images. This approach is in an early stage and should be seen as proof of concept. Read More

This paper demonstrates the feasibility of learning to retrieve short snippets of sheet music (images) when given a short query excerpt of music (audio) -- and vice versa --, without any symbolic representation of music or scores. This would be highly useful in many content-based musical retrieval scenarios. Our approach is based on Deep Canonical Correlation Analysis (DCCA) and learns correlated latent spaces allowing for cross-modality retrieval in both directions. Read More

We explore frame-level audio feature learning for chord recognition using artificial neural networks. We present the argument that chroma vectors potentially hold enough information to model harmonic content of audio for chord recognition, but that standard chroma extractors compute too noisy features. This leads us to propose a learned chroma feature extractor based on artificial neural networks. Read More

The use of deep learning to solve problems in literary arts has been a recent trend that has gained a lot of attention and automated generation of music has been an active area. This project deals with the generation of music using raw audio files in the frequency domain relying on various LSTM architectures. Fully connected and convolutional layers are used along with LSTM's to capture rich features in the frequency domain and increase the quality of music generated. Read More

Some glottal analysis approaches based upon linear prediction or complex cepstrum approaches have been proved to be effective to estimate glottal source from real speech utterances. We propose a new approach employing both an all-pole odd-order linear prediction to provide a coarse estimation and phase decomposition based causality/anti-causality separation to generate further refinements. The obtained measures show that this method improved performance in terms of reducing source-filter separation in estimation of glottal flow pulses (GFP). Read More

This paper proposed a class of novel Deep Recurrent Neural Networks which can incorporate language-level information into acoustic models. For simplicity, we named these networks Recurrent Deep Language Networks (RDLNs). Multiple variants of RDLNs were considered, including two kinds of context information, two methods to process the context, and two methods to incorporate the language-level information. Read More

We introduce a method for imposing higher-level structure on generated, polyphonic music. A Convolutional Restricted Boltzmann Machine (C-RBM) as a generative model is combined with gradient descent constraint optimization to provide further control over the generation process. Among other things, this allows for the use of a "template" piece, from which some structural properties can be extracted, and transferred as constraints to newly generated material. Read More

This paper introduces a new paradigm for sound source lo-calization referred to as virtual acoustic space traveling (VAST) and presents a first dataset designed for this purpose. Existing sound source localization methods are either based on an approximate physical model (physics-driven) or on a specific-purpose calibration set (data-driven). With VAST, the idea is to learn a mapping from audio features to desired audio properties using a massive dataset of simulated room impulse responses. Read More

This paper presented our work on applying Recurrent Deep Stacking Networks (RDSNs) to Robust Automatic Speech Recognition (ASR) tasks. In the paper, we also proposed a more efficient yet comparable substitute to RDSN, Bi- Pass Stacking Network (BPSN). The main idea of these two models is to add phoneme-level information into acoustic models, transforming an acoustic model to the combination of an acoustic model and a phoneme-level N-gram model. Read More

State-of-the-art i-vector based speaker verification relies on variants of Probabilistic Linear Discriminant Analysis (PLDA) for discriminant analysis. We are mainly motivated by the recent work of the joint Bayesian (JB) method, which is originally proposed for discriminant analysis in face verification. We apply JB to speaker verification and make three contributions beyond the original JB. Read More

In this paper, we investigate DCTNet for audio signal classification. Its output feature is related to Cohen's class of time-frequency distributions. We introduce the use of adaptive DCTNet (A-DCTNet) for audio signals feature extraction. Read More

Several methods exist for a computer to generate music based on data including Markov chains, recurrent neural networks, recombinancy, and grammars. We explore the use of unit selection and concatenation as a means of generating music using a procedure based on ranking, where, we consider a unit to be a variable length number of measures of music. We first examine whether a unit selection method, that is restricted to a finite size unit library, can be sufficient for encompassing a wide spectrum of music. Read More