Introduction
This is an introductory post on the problem "Detection of Non-verbal Speech Segments", which is being studied as part of Red Hen Summer of Code (RHSoC).
In this post, I will discuss my understanding of the importance of non-verbal segments and their detection.
What are non-verbal sounds?
First of all, what are verbal sounds?
Before going into non-verbal sounds, let us first understand verbal sounds, with which most of us are familiar. Then we can understand non-verbal sounds better.
Verbal sounds: a fancy name for the sounds we produce in our daily conversation.
When we speak, we utter a number of words in sequence. The person on the other side understands this speech (provided, of course, that he knows the language) based on the meaning carried by each word. This is called linguistic information, as each word we produce is a combination of certain sounds, and each sound follows certain rules of production. For instance, the word "speech" is a combination of the sounds 's', 'p', 'i' and 'ch'. Here, 's' is a fricative (a sound type produced with turbulent airflow), 'p' is a stop (a sound type produced with a complete closure at some position in the oral cavity), 'i' is a vowel (a sound type produced with a significant opening of the oral cavity), and 'ch' is an affricate (a stop released into a fricative).
So what's all this?
These are the terms used to describe the sounds we produce as part of our speech.
Verbal sounds:
Verbal sounds refer to the sounds that make up the different words of our language. Each sound has its own rules of production, and the meaning conveyed by our speech depends on the sequence of sounds we produce.
So, let us now discuss what non-verbal sounds are.
Non-verbal sounds:
Non-verbal sounds are defined in many ways in the literature. Here, the definition is based on the way I am going to work on them in this project.
In our daily conversations, apart from speaking, we laugh, cry, shout, and produce sounds such as deep breathing to show exhaustion, yawns, sighs, coughs, etc. All these sounds are produced by humans and do not convey any meaning explicitly, but they do provide information about the physical, mental and emotional state of a person. These speech segments are referred to as non-verbal speech sounds.
One interesting difference between verbal and non-verbal speech sounds is that a person cannot understand verbal content if he is not familiar with the language, whereas, irrespective of the language, we can perceive non-verbal speech sounds and gain some knowledge about the emotional or physical state of the speaker.
Now that we have some understanding of non-verbal speech sounds, let us discuss the project details and the work done so far.
Detecting non-verbal speech sounds:
The main aim of this project is to detect different non-verbal speech sounds that are present in the News Corpus provided by Red Hen Lab.
In the Red Hen News Corpus, non-verbal speech segments are not annotated. Hence, we either have to annotate the speech files ourselves or develop unsupervised algorithms to detect the different non-verbal speech segments.
Annotating such a large database is not an affordable option, so we have to develop unsupervised algorithms to detect the different non-verbal segments; these algorithms can then serve as a front end for training models with better performance.
I started by developing an unsupervised algorithm for the detection of laughter segments, one of the most common non-verbal speech sounds.
Approach:
The algorithm for detecting laughter segments is developed by considering the features and common patterns exhibited by most laughter segments.
The main steps involved in the algorithm are:
1) Pre-processing
2) Feature extraction
3) Decision logic
4) Post-processing
1) Pre-processing: Most laughter segments are voiced. Hence, the first step of the algorithm is voiced/non-voiced segmentation of the speech signal. After this segmentation, only the voiced segments are considered for further analysis.
2) Feature extraction: In this step, only the voiced segments obtained in the first step are considered. Acoustic features based on the excitation source and vocal tract system characteristics of laughter segments are extracted for detection.
3) Decision logic: A decision is obtained for every feature extracted in step 2, by applying a threshold to the feature value (the threshold is different for each feature).
4) Post-processing: This step obtains the boundaries of the laughter segments based on the decisions from step 3. A rough sketch of how these four steps fit together is given below.
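To make the steps concrete, here is a minimal, self-contained sketch of how such a pipeline could look in Python. Note that the short-time energy and zero-crossing-rate features, the thresholds, and the run-merging logic are my own illustrative stand-ins, not the actual excitation-source and vocal-tract features used in the project.

```python
# A minimal sketch of the four-step laughter-detection pipeline.
# The features and thresholds below are illustrative assumptions only.
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping, equal-length frames."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def detect_laughter(x, fs, frame_ms=25, hop_ms=10):
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    frames = frame_signal(x, frame_len, hop)

    # 1) Pre-processing: crude per-frame voiced/unvoiced decision using
    #    short-time energy (a placeholder for the real segmentation).
    energy = np.sum(frames ** 2, axis=1)
    voiced = energy > 0.1 * energy.max()

    # 2) Feature extraction on voiced frames only; zero-crossing rate is
    #    a stand-in for the actual excitation/vocal-tract features.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

    # 3) Decision logic: one threshold per feature (values made up here).
    candidate = voiced & (zcr > 0.05)

    # 4) Post-processing: convert frame-level decisions into
    #    (start_sample, end_sample) boundaries of contiguous runs.
    segments, start = [], None
    for i, c in enumerate(candidate):
        if c and start is None:
            start = i
        elif not c and start is not None:
            segments.append((start * hop, i * hop + frame_len))
            start = None
    if start is not None:
        segments.append((start * hop, len(x)))
    return segments
```

For example, calling detect_laughter(signal, 16000) on a 16 kHz mono recording would return a list of candidate (start, end) sample indices; in practice, each of the four placeholder components would be replaced by the project's own methods.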
Coding part:
I began with the pre-processing step, i.e., voiced/non-voiced segmentation of the given speech signal. This segmentation is obtained based on the energy of the speech signal in the lower frequency regions and the stability of the fundamental frequency values in the presence of noise. A simplified sketch of these two cues is given below.
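As an illustration of these two cues, here is a simplified sketch. The low-frequency band edge (1 kHz), the pitch search range, and the thresholds are assumptions made for this example; the actual implementation may differ.

```python
# A simplified sketch of the two voiced/unvoiced cues described above:
# (a) low-frequency energy per frame, and (b) stability of a crude
# autocorrelation-based fundamental-period estimate across frames.
# Band edge, pitch range, and thresholds are illustrative assumptions.
import numpy as np

def low_band_energy(frame, fs, f_hi=1000.0):
    """Energy of a windowed frame below f_hi Hz, via the FFT."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return np.sum(np.abs(spectrum[freqs <= f_hi]) ** 2)

def pitch_period(frame, fs, f0_min=80.0, f0_max=400.0):
    """Crude fundamental-period estimate (in samples) from the
    autocorrelation peak within a plausible pitch range."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    return lo + np.argmax(ac[lo:hi])

def voiced_frames(x, fs, frame_ms=25, hop_ms=10):
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    frames = [x[i : i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

    energies = np.array([low_band_energy(f, fs) for f in frames])
    periods = np.array([pitch_period(f, fs) for f in frames])

    # Cue 1: the frame carries significant low-frequency energy.
    energetic = energies > 0.1 * energies.max()

    # Cue 2: the period estimate changes little between neighbouring
    # frames, which keeps the decision stable in the presence of noise.
    stable = np.abs(np.diff(periods, prepend=periods[0])) < 0.2 * periods
    return energetic & stable
```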
To do:
Code to extract features for laughter detection needs to be written.