Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning

Recently, there has been an increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. One approach is the attention-based encoder-decoder framework that learns a mapping between variable-length input and output sequences in one step using a purely data-driven method. The attention model has often been shown to improve the performance over another end-to-end approach, the Connectionist Temporal Classification (CTC), mainly because it explicitly uses the history of the target character without any conditional independence assumptions. However, we observed that the attention model has shown poor results especially in noisy condition and is hard to be trained in the initial training stage with long input sequences, as compared with CTC. This is because the attention model is too flexible to predict proper alignments in such cases due to the lack of left-to-right constraints as used in CTC. This paper presents a novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue. An experiment on the WSJ and CHiME-4 tasks demonstrates its advantages over both the CTC and attention-based encoder-decoder baselines, showing 6.6-10.3% relative improvements in Character Error Rate (CER).

Similar Publications

A recurrent neural network model of phonological pattern learning is proposed. The model is a relatively simple neural network with one recurrent layer, and displays biases in learning that mimic observed biases in human learning. Single-feature patterns are learned faster than two-feature patterns, and vowel or consonant-only patterns are learned faster than patterns involving vowels and consonants, mimicking the results of laboratory learning experiments. Read More

Emojis are ideograms which are naturally combined with plain text to visually complement or condense the meaning of a message. Despite being widely used in social media, their underlying semantics have received little attention from a Natural Language Processing standpoint. In this paper, we investigate the relation between words and emojis, studying the novel task of predicting which emojis are evoked by text-based tweet messages. Read More

We investigate the use of pivot languages for phrase-based statistical machine translation (PB-SMT) between related languages with limited parallel corpora. We show that subword-level pivot translation via a related pivot language is: (i) highly competitive with the best direct translation model and (ii) better than a pivot model which uses an unrelated pivot language, but has at its disposal large parallel corpora to build the source-pivot (S-P) and pivot-target (P-T) translation models. In contrast, pivot models trained at word and morpheme level are far inferior to their direct counterparts. Read More

Topic models can provide us with an insight into the underlying latent structure of a large corpus of documents. A range of methods have been proposed in the literature, including probabilistic topic models and techniques based on matrix factorization. However, in both cases, standard implementations rely on stochastic elements in their initialization phase, which can potentially lead to different results being generated on the same corpus when using the same parameter values. Read More

Affiliations: 1Dept. of Computer Science, Sun Yat-Sen University, GuangZhou, China., 2Dept. of Computer Science, Sun Yat-Sen University, GuangZhou, China., 3Dept. of Computer Science, Sun Yat-Sen University, GuangZhou, China., 4Dept. of Computer Science, Sun Yat-Sen University, GuangZhou, China.

Topic models have been widely used in discovering latent topics which are shared across documents in text mining. Vector representations, word embeddings and topic embeddings, map words and topics into a low-dimensional and dense real-value vector space, which have obtained high performance in NLP tasks. However, most of the existing models assume the result trained by one of them are perfect correct and used as prior knowledge for improving the other model. Read More

Medical errors are leading causes of death in the US and as such, prevention of these errors is paramount to promoting health care. Patient Safety Event reports are narratives describing potential adverse events to the patients and are important in identifying and preventing medical errors. We present a neural network architecture for identifying the type of safety events which is the first step in understanding these narratives. Read More

The Vocal Joystick Vowel Corpus, by Washington University, was used to study monophthongs pronounced by native English speakers. The objective of this study was to quantitatively measure the extent at which speech recognition methods can distinguish between similar sounding vowels. In particular, the phonemes /\textipa{@}/, /{\ae}/, /\textipa{A}:/ and /\textipa{2}/ were analysed. Read More

Hand-engineered feature sets are a well understood method for creating robust NLP models, but they require a lot of expertise and effort to create. In this work we describe how to automatically generate rich feature sets from simple units called featlets, requiring less engineering. Using information gain to guide the generation process, we train models which rival the state of the art on two standard Semantic Role Labeling datasets with almost no task or linguistic insight. Read More

This paper focuses on unsupervised modeling of morphological families, collectively comprising a forest over the language vocabulary. This formulation enables us to capture edgewise properties reflecting single-step morphological derivations, along with global distributional properties of the entire forest. These global properties constrain the size of the affix set and encourage formation of tight morphological families. Read More

We present an unsupervised explainable word embedding technique, called EVE, which is built upon the structure of Wikipedia. The proposed model defines the dimensions of a semantic vector representing a word using human-readable labels, thereby it readily interpretable. Specifically, each vector is constructed using the Wikipedia category graph structure together with the Wikipedia article link structure. Read More