Formant analysis in dysphonic patients and automatic Arabic digit speech recognition
© Muhammad et al; licensee BioMed Central Ltd. 2011
Received: 21 October 2010
Accepted: 30 May 2011
Published: 30 May 2011
Background and objective
There has been growing interest in the objective assessment of speech in dysphonic patients for classifying the type and severity of voice pathologies using automatic speech recognition (ASR). The aim of this work was to study the accuracy of a conventional ASR system (with a Mel-frequency cepstral coefficient (MFCC) based front end and a hidden Markov model (HMM) based back end) in recognizing the speech of people with pathological voices.
Materials and methods
The speech samples of 62 dysphonic patients with six different types of voice disorders and 50 normal subjects were analyzed. The Arabic spoken digits were taken as an input. The distribution of the first four formants of the vowel /a/ was extracted to examine deviation of the formants from normal.
Recognition accuracy was 100% for Arabic digits spoken by normal speakers. However, accuracy dropped significantly when the digits were spoken by voice-disordered subjects. Moreover, no significant improvement in ASR performance was observed in a subset of the voice-disordered individuals who underwent treatment.
The results of this study revealed that the current ASR technique is not a reliable tool for recognizing the speech of dysphonic patients.
Among the tasks for which machines may simulate human behavior, automatic speech recognition (ASR) has been foremost since the advent of computers. A device that understands speech, however, requires a machine capable of making complex decisions and, practically, one that can operate as quickly as a human. As a result, ASR has grown rapidly relative to other areas of pattern recognition (PR), based in large part on the power of computers to capture a relevant signal and transform it into pertinent information, i.e., to recognize patterns in the speech signal.
There has been growing interest in the objective assessment of acoustic variables in dysphonic patients in recent years. Voice pathology detection and classification is a topic that has interested the international voice community. Most of the work in this field concentrates on automatically diagnosing the pathology using digital signal processing methods [3–6]. For example, in the study of Dibazar et al, five different vocal pathologies were detected using MFCCs and fundamental frequencies. In their study, the highest recognition sensitivity was achieved for vocal fold paralysis, while the lowest sensitivity was for hyperfunctional voice disorders.
In another study, Dubuisson et al discriminated normal from pathological voices using correlations between different types of acoustic descriptors. The descriptors were of two types, temporal and spectral: temporal descriptors included energy, mean, standard deviation, and zero-crossing rate, while spectral descriptors included delta, mean, several moments, spectral decrease, roll-off, etc. They found that using the spectral decrease, the first spectral tri-stimulus in the Bark scale, and their correlation led to correct classification rates on sustained vowels of 94.7% for pathological voices and 89.5% for normal ones; that is, 94.7% of the pathological voices were classified as pathological and 89.5% of the normal voices were classified as normal. The higher rate for pathological voices arises because the authors used features inspired by voice pathology assessment and because the number of normal voice samples was much lower than the number of pathological samples.

The performance of linear predictive coding (LPC)-based spectral analysis in discriminating pathological voices of speakers affected by vocal fold edema was evaluated in the study of Costa et al. Their results show that the LPC-based cepstral method represents well the changes in the vocal tract caused by vocal fold edema. In another study, short-term cepstral estimation of glottal noise from voice signals was used to discriminate pathological voices from normal voices. Glottal noise estimates correlated weakly with jitter and shimmer for pathological voices and not significantly for normal voices. Miyamoto et al investigated pose-robust audio-visual speech recognition for a person with articulation disorders resulting from cerebral palsy. They used multiple acoustic frames (MAF) as the acoustic feature and an active appearance model (AAM) as the visual feature in their system.
Their proposed audio-visual method resulted in an improvement of 7.3% in the word recognition rate at 5 dB signal-to-noise ratio compared to the audio-only method.
All of the above-mentioned studies used only the sustained vowel /a/ as input. A comparative evaluation of sustained vowels versus continuous speech for acoustically discriminating pathological voices was carried out by Parsa et al. In their experiment, classification of voice pathology was easier for sustained vowels than for continuous speech. On the other hand, in the study of Middag et al, automated intelligibility assessment was performed with context-dependent phonological features using 50 consonant-vowel-consonant (CVC) words from six different types of voice-disordered speakers. Their evaluation revealed that the root mean squared error of the discrepancies between perceived and computed intelligibility can be as low as 8 on a scale of 0 to 100. Automatic recognition of Polish words was carried out in the study of Wielgat et al, where the input was speech from voice-disordered Polish children. They used MFCCs and human factor cepstral coefficients (HFCC) to recognize words containing easily confused phonemes; in their experiment, HFCC performed better than MFCC. In a recent work, an automatic recognition system was used to evaluate speech disorders in head and neck cancer, where the speakers were German natives. Intelligibility was quantified by speech recognition on recordings of a standard text read by laryngectomized patients with cancer of the larynx or hypopharynx and by patients who had suffered from oral cancer. Both patient groups showed significantly lower word recognition rates than an age-matched control group.
In the current study, a conventional ASR system was used to evaluate six different types of voice-disordered patients speaking Arabic digits. MFCCs were used as features, and a GMM (Gaussian mixture model)/HMM (hidden Markov model) was used as the classifier. The recognition results were analyzed by type of disorder. Effects on performance before and after clinical management in a subset of the disordered voices were also investigated. Finally, the first four formants (F1, F2, F3, and F4) of the vowel /a/ present in the digits were extracted to compare formant distortion across the different voice disorders. We believe that this is the first work to examine the accuracy of ASR on Arabic speech of people with pathological voices. The comparison of ASR performance between pre- and post-management (surgical or medical) conditions may also be of interest to other language communities now investigating ASR as a means of examining treatment outcomes.
Materials and methods
Table 1. Arabic digits used in the study, with the number of syllables per digit.
Table 2. Details of speech samples used for training and testing (number of speakers per group).
Vocal fold cysts are subepidermal epithelial-lined sacs located within the lamina propria, and may be mucus retention or epidermoid in origin. The voice often sounds diplophonic (particularly with epidermoid cysts), whereby there is great pitch instability (Figure 1).
Laryngopharyngeal reflux disease (LPRD) is the retrograde movement of gastric contents (acid and enzymes, such as pepsin) into the laryngopharynx, leading to symptoms referable to the larynx, hypopharynx, and nasopharynx. Symptoms include dysphonia, globus pharyngeus, mild dysphagia, chronic cough, excessive throat mucus, chronic throat clearing, etc. (Figure 2).
Spasmodic dysphonia (SD) is a neuromuscular disorder. It is characterized by strained, strangled, and interrupted voice, with pitch and phonatory breaks and difficulty coordinating respiration with phonation.
Sulcus vocalis is a linear depression on the mucosal cover of the vocal folds, parallel to the free border. It is of variable depth, and usually bilateral and symmetrical. It inhibits complete closure of the vocal folds and causes stiffness in the vocal fold mucosa (Figure 3).
Vocal fold nodules are defined as bilateral symmetric epithelial swelling of the anterior/mid third of the true vocal folds. These are seen in children, adolescents, and female adults working in professions with high voice demands. Vocal fold nodules frequently interfere with vocal fold closure, so dysphonia is a common symptom (Figure 4).
Vocal fold polyps are usually unilateral, occasionally pedunculated masses encountered on the true vocal fold. They occur more often in males, and they often occur after intense intermittent voice abuse. Polyps result in excess air egress during phonation, and are associated with earlier vocal fatigue, frequent voice breaks in singers, and worsening dysphonia (Figure 5).
Experiments with ASR system
The experiments in this work were conducted on a connected-phoneme task comprising isolated Arabic digits. Each phoneme was modeled by a three-state HMM with left-to-right state transitions. Observation probability density functions were modeled using GMMs, and the number of mixtures per state was varied among 1, 4, 8, and 16. All training and recognition experiments were implemented with the HTK package. Training was performed using normal speech, while testing was performed using both normal and voice-disordered speech.
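The left-to-right topology described above can be sketched as follows. This is only an illustration of the transition structure: the probability values and the function name are ours, and in practice HTK estimates the transition probabilities from the training data.

```python
def make_left_to_right_transitions(n_states=3, self_loop=0.6):
    """Build an n_states x n_states transition matrix in which each
    state may only stay where it is or advance to the next state,
    mirroring the left-to-right phoneme HMM topology.
    The self-loop probability is an illustrative placeholder."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i < n_states - 1:
            A[i][i] = self_loop            # remain in the same state
            A[i][i + 1] = 1.0 - self_loop  # advance to the next state
        else:
            A[i][i] = 1.0                  # final state absorbs
    return A

A = make_left_to_right_transitions()
# Every row is a probability distribution, and no backward
# transitions exist (all entries below the diagonal are zero).
for row in A:
    assert abs(sum(row) - 1.0) < 1e-9
```

The key property is that zeros below the diagonal forbid backward jumps, which is what makes the model "left-to-right".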
The parameters of the system were: a 16 kHz sampling rate with 16-bit sample resolution, a 25-millisecond Hamming window with a step size of 10 milliseconds, and a pre-emphasis coefficient of 0.97. As features, 12 MFCCs and 24 coefficients (12 MFCCs plus their delta coefficients) were used.
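Given these settings, the number of analysis frames per utterance follows directly from the window and step sizes. A minimal sketch (the function name is ours):

```python
def num_frames(n_samples, sr=16000, win_ms=25, step_ms=10):
    """Number of full analysis windows for the framing scheme in the
    text: 16 kHz sampling, 25 ms window, 10 ms step."""
    win = int(sr * win_ms / 1000)    # 400 samples per window
    step = int(sr * step_ms / 1000)  # 160 samples per step
    if n_samples < win:
        return 0                     # too short for even one window
    return 1 + (n_samples - win) // step

print(num_frames(16000))  # 1 second of speech -> 98 frames
```

Each frame would then yield one 12- or 24-dimensional MFCC feature vector.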
A second experiment was also carried out in which pre- and post-management samples of 16 voice-disordered patients who underwent treatment were compared, in order to see whether ASR performance improved after management. Twelve patients received surgical intervention (2 with vocal fold cysts, 5 with spasmodic dysphonia, and 5 with sulcus vocalis) and four patients were medically treated (LPRD).
In addition, a formant-based analysis of the Arabic vowel /a/ was carried out for the different types of voice-disordered speech. The first four formants of the vowel /a/ present in the Arabic digits were analyzed; this vowel occurs in all ten Arabic digits (Table 1). Three frames in the middle of the vowel /a/ in each utterance were considered in order to minimize co-articulation effects. The middle frames were detected manually, and their formant values were averaged to determine the final four formants. Praat software was used for the voice analysis of the samples in this study.
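The averaging step described above can be sketched as follows. The per-frame formant values here are hypothetical stand-ins for what Praat would report, not data from the study:

```python
def average_formants(frames):
    """Average per-formant values over the selected middle frames.
    frames: list of (F1, F2, F3, F4) tuples in Hz, one per frame."""
    n = len(frames)
    return tuple(sum(f[k] for f in frames) / n for k in range(4))

# Hypothetical formant tracks for the three middle frames of /a/.
mid_frames = [(700, 1200, 2500, 3500),
              (710, 1190, 2510, 3490),
              (690, 1210, 2490, 3510)]
print(average_formants(mid_frames))  # (700.0, 1200.0, 2500.0, 3500.0)
```

Averaging over three middle frames smooths frame-to-frame tracking jitter while staying clear of the co-articulated vowel edges.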
Results and discussion
Table 3. Best accuracy (%) obtained in the different voice disorder groups, by type of voice disorder; the best-performing number of Gaussian mixtures varied between groups (e.g., 8 or 16, or 1 or 16).
In terms of the number of Gaussian mixtures, no single mixture count consistently gave the best performance: for some disorders the best performance was achieved with 1 mixture, for some with 4, and for others with 8 or 16. In all of the disorders, 24 MFCC features performed better than 12. This indicates that adding time derivatives to the features can improve performance with disordered voices, as it does with normal voices.
Table 4. Best accuracy (%) obtained pre- and post-management, by type of voice disorder.
Formant analysis of /a/ for different types of voicing disorders
In Arabic, there are six vowels: /a/, /i/, /u/ and their longer counterparts /a:/, /i:/, /u:/. Some researchers count eight Arabic vowels in total by adding two diphthongs, and this is normally considered to be the case for modern standard Arabic (MSA). By changing the shape of the vocal tract, different resonating frequencies are produced. Each of the preferred resonating frequencies of the vocal tract (corresponding to a peak in the frequency response curve) is known as a formant. These are usually referred to as F1 for the first formant, F2 for the second formant, F3 for the third formant, and so on.
Table 5. Comparison between the first four formants of the vowel /a/ in the tested Arabic digits in different voicing disorders (P values from Student's t-test for each formant).
The speech obtained from all voice-disordered patients had formant values that deviated significantly (P < 0.05, Student's t-test) from those of normal subjects. For instance, vocal fold cyst patients had an F1 of 528 Hz, which was 182 Hz lower than the F1 of normal subjects. Similarly, the F2 of the same disorder (1145 Hz) was 347 Hz lower than normal. F1 deviated most from normal in vocal fold cysts, followed by spasmodic dysphonia; the same was true for F2. F3, which corresponds to phoneme quality, deviated most from normal in the vocal fold polyp group, while F4, which corresponds to voice quality, deviated most in vocal fold cysts. This information can be embedded in an ASR system to correctly recognize the type of voice disorder from a sample of pathological voice.
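The quoted deviations can be reconstructed arithmetically. Note that the normal-group means below (about 710 Hz for F1 and 1492 Hz for F2) are implied by the reported gaps rather than stated directly in the text:

```python
# Formant means in Hz. The cyst values are those quoted in the text;
# the normal values are back-calculated from the reported deviations
# (528 + 182 and 1145 + 347), not taken from the study's tables.
normal = {"F1": 710, "F2": 1492}
cyst = {"F1": 528, "F2": 1145}

deviation = {f: normal[f] - cyst[f] for f in normal}
print(deviation)  # {'F1': 182, 'F2': 347}
```

The same subtraction per formant and per disorder group would reproduce the full deviation comparison discussed above.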
The formant values were not consistent even within the same type of voice disorder: they varied between samples and sometimes within the same sample. Sustained vowel sounds of voice-disordered people exhibit a large range of behavior, which can be characterized as nearly periodic or regular vibration, aperiodic or irregular vibration, or sound with no apparent vibration at all. All can be accompanied by varying degrees of noise, which can be described as "breathiness". Voice disorders therefore commonly exhibit two characteristic phenomena compared to normal voices: increased vibrational aperiodicity and increased breathiness.
Table 5 shows that F1 had a standard deviation of 324 Hz in sulcus vocalis patients and 377 Hz in vocal fold polyp patients. These high standard deviations indicate the unstable nature of the formants within each type of voice disorder. For every formant in each type of disorder, the standard deviation exceeded 100 Hz, with the exception of F2 in the nodule and polyp cases. This is one of the reasons for the low recognition accuracy of Arabic digits uttered by voice-disordered patients. Moreover, we find a relation between recognition accuracy and the standard deviation of the first two formants of /a/ (which are more significant than F3 and F4 in ASR) across the different types of disorder. For example, sulcus vocalis and polyps have higher F1 standard deviations (324 and 377, respectively) and lower recognition accuracies (less than 62%), whereas nodules have a lower standard deviation (104) and higher recognition accuracy (84%). For F2, sulcus vocalis has the highest standard deviation (108) and nodules one of the lowest (75).
The vowel formants can be studied further with a view to embedding them in the feature extraction module of an Arabic ASR system designed for pathological voices. It is worth noting that every word and syllable in the Arabic language must contain at least one vowel. This analysis is expected to be helpful in future Arabic speech processing tasks such as vowel and speech recognition and the classification of voice disorders.
Arabic digit ASR performance was evaluated in six different voice disorders. Recognition accuracy varied between 56% and 82.50% in the disordered groups, while it was 100% in normal subjects. Performance was also checked in the post-management condition, where there was no significant increase in recognition accuracy. In addition, the first four formants of the /a/ sound in the Arabic digits were analyzed in these voice-disordered conditions. The formant values varied significantly across and within the voice disorder groups. The results of this study reveal that the current ASR technique is far from reliable in recognizing the speech of dysphonic patients. More studies are needed to identify meaningful features in order to improve ASR performance on pathological speech. Our future work includes evaluating the performance of the ASR system by (i) using acoustic models trained on speech samples of the disorders, (ii) comparing different sets of feature vectors, and (iii) selecting optimal features.
This work was supported by the Research Chair of Voice And Swallowing Disorders (RCVASD), King Abdul Aziz University Hospital, King Saud University, Riyadh, Saudi Arabia.
- O'Shaughnessy D: Automatic speech recognition: history, methods and challenges. Pattern Recognition 2008, 41:2965–2979. doi:10.1016/j.patcog.2008.05.008
- Manfredi C, Kob M: New trends in voice pathology detection and classification. Biomedical Signal Processing and Control 2009, 4:171–172. doi:10.1016/j.bspc.2009.07.001
- Dibazar AA, Berger TW, Narayanan SS: Pathological voice assessment. Proc. IEEE Engineering in Medicine and Biology Society (EMBS) Annual International Conference, NY; 2006.
- Dubuisson T, Dutoit T, Gosselin B, Remacle M: On the use of the correlation between acoustic descriptors for the normal/pathological voices discrimination. EURASIP Journal on Advances in Signal Processing 2009, Article ID 173967.
- Costa SC, Neto BGA, Fechine JM, Correia S: Parametric cepstral analysis for pathological voice assessment. Proc. 2008 ACM Symposium on Applied Computing 2008, 1410–1414.
- Michaelis D, Frohlich M, Strube HW: Selection and combination of acoustic features for the description of pathologic voices. Journal of the Acoustical Society of America 1998, 103(3):1628–1639. doi:10.1121/1.421305
- Miyamoto C, Komai Y, Takiguchi T, Ariki Y, Li I: Multimodal speech recognition of a person with articulation disorders using AAM and MAF. 2010 IEEE International Workshop on Multimedia Signal Processing, 517–520.
- Parsa V, Jamieson DG: Acoustic discrimination of pathological voice: sustained vowels versus continuous speech. Journal of Speech, Language, and Hearing Research 2001, 44:327–339. doi:10.1044/1092-4388(2001/027)
- Middag C, Martens JP, Nuffelen GV, Bodt M: Automated intelligibility assessment of pathological speech using phonological features. EURASIP Journal on Advances in Signal Processing 2009, Article ID 629030.
- Wielgat R, Zieliński TP, Woźniak T, Grabias S, Krol D: Automatic recognition of pathological phoneme production. Folia Phoniatr Logop 2008, 60:323–331.
- Maier A, Haderlein T, et al.: Automatic speech recognition systems for the evaluation of voice and speech disorders in head and neck cancer. EURASIP Journal on Audio, Speech, and Music Processing 2010, Article ID 926951.
- Alotaibi YA, Mamun KA, Muhammad G: Noise effect on Arabic alphadigits in automatic speech recognition. Proc. 2009 International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV'09), Las Vegas; 2009.
- Jiang J, Stern J, Chen HJ, et al.: Vocal efficiency measurements in subjects with vocal polyps and nodules: a preliminary report. Ann Otol Rhinol Laryngol 2004, 113:277–282.
- Koufman JA: The otolaryngologic manifestations of gastroesophageal reflux disease (GERD): a clinical investigation of 225 patients using ambulatory 24-hour pH monitoring and an experimental investigation of the role of acid and pepsin in the development of laryngeal injury. Laryngoscope 1991, 101:1–78.
- Hirano M, Yoshida T, Tanaka S, et al.: Sulcus vocalis: functional aspects. Ann Otol Rhinol Laryngol 1990, 99:679–683.
- Ludlow CL, Naunton RF, Terada S, et al.: Successful treatment of selected cases of abductor spasmodic dysphonia using botulinum toxin injection. Otolaryngol Head Neck Surg 1991, 104:849–855.
- Bouchayer M, Cornut G, Witzig E, et al.: Epidermoid cysts, sulci, and mucosal bridges of the true vocal cord: a report of 157 cases. Laryngoscope 1985, 95:1087–1094.
- Young S: The HTK Book (for HTK Version 3.4). Cambridge University Engineering Department; 2006.
- Praat: doing phonetics by computer [http://www.fon.hum.uva.nl/praat/]
- Dailey SH, Ford CN: Surgical management of sulcus vocalis and vocal fold scarring. Otolaryngol Clin North Am 2006, 39(1):23–42. doi:10.1016/j.otc.2005.10.012
- Elshafei M: Toward an Arabic text-to-speech system. The Arabian Journal for Science and Engineering 1991, 16(4B):565–583.
- Alotaibi Y, Husain A: Formant based analysis of spoken Arabic vowels. In BioID_MultiComm 2009, LNCS 5707. Edited by Fierrez J, et al.; 170–177.
- Newman DL, Verhoeven J: Frequency analysis of Arabic vowels in connected speech. Antwerp Papers in Linguistics 2002, 100:77–86.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.