Chapter 1: Background

Contents

  1.1 Vocoding
    1.1.1 The speech chain
    1.1.2 Speech production
  1.2 Joint source and channel coding
    1.2.1 Tandem source and channel coding
    1.2.2 Joint coding

While mainly a Matlab and C exercise in its realisation, this project is based on two theories: speech-optimised coding, also referred to as vocoding, which adapts standard audio signal coding to the specific properties of speech; and joint source and channel coding, a novel theory of general signal coding that goes in exactly the opposite direction to the one established by the Shannon separation theorem.

In this chapter, both theories are summarised, in order to give the reader a minimum background on the subject of this dissertation.

1.1 Vocoding

According to Scott McCloud, all forms of media are motivated by our inability to communicate directly from mind to mind. The idea is that a thought in our mind is translated into the physical world, travels through it, hopefully as untouched as possible, and is received and perceived as a thought again. Speech is one of the many media¹ conveying this indirect, imperfect and apparently not perfectible mind-to-mind communication [7].

1.1.1 The speech chain

Among all types of audio data encoding, voice coding stands apart for a simple reason: voice signals are produced by a very specific source, the human vocal organs. Consequently, careful analysis of those organs, and of the way voice signals are produced, can help design adapted and optimised coding schemes that achieve better results, as far as encoding efficiency and quality are concerned, than schemes designed for general audio signals.

Let us first take a look at an example of a speech chain, explained in detail in [1] and illustrated below. As in other layered systems, e.g. the OSI network stack, we can identify several interacting layers. The top, most abstract layer is the linguistic layer, located in the brain, where the message the speaker wants to convey is formed, the sentence prepared, the words ordered.

The next level is the physiological level: based on the information prepared in the brain, several parts of the body work in conjunction to produce an audible sound emitted at the mouth. The sound waves then propagate through the air (at least in most common cases) to the listener’s ear. This is the acoustic level.

Figure 1.1: The speech chain (from [1])

Upon reaching the listener’s ear, the acoustic waves stimulate the listener’s auditory organs, which translate the acoustic excitation into a nervous message for the listener’s brain; this operation is considered to take place at the physiological level again. Finally, the nervous message from the listener’s ear reaches the listener’s brain, where it is decoded into meaningful information, back at the linguistic level.

Notice that the acoustic waves emitted by the speaker also reach the speaker’s own ears, where they are likewise decoded through the physiological and linguistic layers. This works as a feedback link, allowing the speaker better control over the sounds emitted. It is therefore no surprise that deaf people have trouble achieving a clear pronunciation.

1.1.2 Speech production

Careful analysis of the physiological production of speech provides a satisfactory basis for artificial speech synthesis.

The first step in speech production is the contraction of the lungs, with the help of the diaphragm, creating a flow of air through the vocal tract. The glottis is the first organ the flow of air goes through. There, depending on the sound uttered, the flow may or may not be turned into a quasi-periodic signal. If it is, the utterance is said to be voiced. In that case, the fundamental frequency of the signal is called the pitch. It is linked to the general perception of the speaker’s tone of voice: it is generally higher for children and women, and lower for men. Typical values for the pitch range from about a hundred hertz to a few hundred hertz. If the flow of air is left untouched, the utterance is said to be non-voiced. Some utterances also consist of a mix of voiced and non-voiced sounds, in variable proportions.

Next, the flow reaches the velum. When it is open, acoustic coupling with the nasal cavity occurs and the utterance is said to be nasal: consider, as an example, the sound of the consonants ‘m’ and ‘n’ in the English language.

Figure 1.2: The primary articulators of the vocal tract (from [1])

Finally, the tongue and lips provide fine shaping of the air flow, producing a wide range of sounds.

For speech coding purposes, the physiological production of speech is separated into two parts. The first is the excitation, which consists of either white noise, corresponding to non-voiced sounds, or a periodic signal at an adequate frequency, corresponding to voiced sounds, or a mix of both. The second is the vocal tract, which, for the needs of vocoding, can be modelled as a time-varying filter shaping the excitation.

Encoding thus consists in extracting three basic pieces of information from the speech source: the voicing information, indicating the relative amounts of voiced and non-voiced signal in the excitation; the pitch information, when a voiced component is present in the excitation; and the vocal tract filter parameters.

Figure 1.3: The source-filter model

Most vocoders are based on the source-filter model, differing only in the practical implementation choices for the various components. For example, the mixing stage is commonly designed as a hard decision between voiced and non-voiced. While allowing for greater simplicity at both the analysis and synthesis stages, this approach slightly reduces the quality of the synthesised signal.
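To make the model concrete, here is a minimal Matlab sketch of one synthesis frame under the source-filter model, using a hard voicing decision. Every numerical value (sampling rate, frame length, pitch, filter coefficients) is an illustrative assumption; a real vocoder would estimate the filter from the analysed speech frame, for instance by linear prediction.

```matlab
% Minimal source-filter synthesis sketch: one frame, hard V/UV decision.
% All parameter values below are illustrative assumptions.
fs     = 8000;                   % sampling frequency (Hz)
N      = 160;                    % one 20 ms frame
pitch  = 120;                    % assumed pitch for the voiced case (Hz)
voiced = true;                   % hard voicing decision for this frame

if voiced
    e = zeros(N, 1);             % voiced excitation: impulse train
    e(1:round(fs/pitch):N) = 1;  % one impulse per pitch period
else
    e = randn(N, 1);             % non-voiced excitation: white noise
end

a = [1 -1.3 0.8];                % made-up all-pole vocal tract filter;
s = filter(1, a, e);             % shape the excitation through it
```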

1.2 Joint source and channel coding

1.2.1 Tandem source and channel coding

As a general contract, digital transmission is expected to provide efficiency, reliability and privacy. Putting privacy aside, we can see that the goals of efficiency and reliability call for contradictory approaches.

Indeed, efficiency, gauged as the ratio between the amount of information transmitted and the energy required to transmit it, requires that redundancy be stripped from a given source prior to transmission. This operation of optimising the efficiency of data to be transmitted is called source coding. On the other hand, reliability of transmission requires some structured redundancy to be left in the transmitted data, so that the correct data can be deduced from corrupted data. This is called channel coding.

Let us consider as an example the English language and the Arabic numeral system, as in: “There are 23 students present today.”² Natural language is extremely redundant, and indeed a few errors in the above sentence still let the reader grasp its meaning: “Tkere are 23 stndents pretent today.” On the other hand, our numbering system bears no redundancy at all, and even a single error totally changes the conveyed meaning: “There are 93 students present today.” Here, the small class has become a big amphitheatre!

At the same time, all the redundancy present in natural language severely reduces its information density: we just need to compare “twenty-three” to its non-redundant counterpart “23”.
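The same contrast can be played out in a few lines of Matlab. The sketch below (a toy illustration, not an efficient code) protects four already compact bits with a (3,1) repetition channel code: the added, structured redundancy lets a majority vote correct a single transmission error, exactly as the redundancy of English absorbed the typos above.

```matlab
% Toy channel coding example: structured redundancy corrects an error.
bits = [1 0 1 1];                         % compact, source-coded bits
tx   = repelem(bits, 3);                  % (3,1) repetition channel code
rx   = tx;
rx(5) = 1 - rx(5);                        % one transmission error
hat  = sum(reshape(rx, 3, []), 1) >= 2;   % majority vote for each bit
isequal(double(hat), bits)                % -> true despite the error
```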

However, this apparent opposition of goals is, partly, resolved by a theorem proved by Shannon: source and channel coding can be separated into two disjoint operations without any loss of performance. Indeed, given a certain source of data, there is a minimum capacity a channel must have to carry this data reliably. The theorem states that over any channel satisfying this condition, the data can be transmitted with arbitrarily high reliability, using a source coder and a channel coder designed independently of each other.
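For reference, a common informal statement of the separation theorem, for a memoryless source and channel with one channel use per source symbol, is the following (the notation is mine, not that of a particular reference):

```latex
% A source of entropy rate $H$ (bits per symbol) can be transmitted
% reliably over a channel of capacity $C$ (bits per channel use) by a
% tandem of independently designed source and channel codes, provided
H < C .
% Conversely, if $H > C$, no coding scheme, joint or separate, achieves
% arbitrarily small error probability.
```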

Now this theorem is not as miraculous as it seems, for it is hampered by several drawbacks. First of all, it gives no clue as to how to obtain an optimal solution. Second, it makes no assumption of practicality: given that most near-optimal solutions obtained so far have been far from practical, requiring either excessive delay or excessive computation, it is reasonable to assume that an optimal solution would require an arbitrarily impractical implementation. Further, the theorem does not prove the uniqueness of the solution, which leaves other directions open for investigation.

1.2.2 Joint coding

Hence the idea of joint source and channel coding: since the separate optimisation of source and channel coding described by the above theorem comes with no guarantee of uniqueness, one can try to optimise the two coding steps jointly, and see whether that leads to a better practicality-to-performance ratio.

One hint as to a specific direction to take comes from noticing that, since practical source coding does not remove all redundancy from the source, a decoder can take advantage of that residual redundancy for protection against channel errors. A widely investigated application of this idea is known as structured codeword assignment, where the codewords are assigned in such a way that errors introduced at the channel level create the least possible distortion after decoding.
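As a toy measurement of what structured codeword assignment manipulates, the sketch below compares the mean squared distortion caused by a single bit error in the index, for two assignments of eight uniform quantiser levels to 3-bit codewords: natural binary and Gray code. The numbers are illustrative; the point is that changing the assignment alone, with everything else fixed, changes the distortion.

```matlab
% Distortion caused by single index-bit errors, for two assignments.
levels  = (0:7)';                          % toy quantiser levels
natural = 0:7;                             % natural binary assignment
gray    = bitxor(0:7, bitshift(0:7, -1));  % Gray-code assignment

for assign = {natural, gray}
    idx = assign{1};
    d = 0;
    for k = 1:8                            % for each transmitted level...
        for b = 0:2                        % ...flip each of its 3 bits
            recv = bitxor(idx(k), bitshift(1, b));
            d = d + (levels(k) - levels(idx == recv))^2;
        end
    end
    fprintf('mean distortion per bit error: %.2f\n', d / 24);
end
```

(Interestingly, for this uniform toy case the natural binary assignment comes out ahead of Gray code, which matches the classical observation that natural binary behaves well for mean squared error on uniform quantisers; for arbitrary codebooks and channels, finding a good assignment requires a search, which is what the IA adjustment techniques described below perform.)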

At the channel level, since channel coding does not protect against all errors, source coding can be designed to be robust against decoding errors. Channel-optimised source encoders, aware of the properties of the channel they are coding for, explicitly include error protection during source coding.

Other options include intelligent decoding, where the decoder knows the characteristics of the source it is decoding and can thus discard unlikely decoded signals, and unequal error protection, where data is more or less heavily protected against errors depending on its sensitivity.

To illustrate the different possible approaches to joint coding, consider a typical data transmission diagram. Raw data from the source first undergoes some transformation f, producing the source samples. The samples are then quantised (operation Q), and the quantised samples are assigned a binary representation (operation IA: index assignment). Finally, the signal is channel-encoded (operation CC).

Figure 1.4: The transmission model

At the exit of the transmitter, the signal is modulated (operation m), sent over the channel, which adds transmission noise N, and demodulated (operation m⁻¹). At the receiver, the operations complementary to those of the transmitter take place: the signal is first channel-decoded (operation CD); the quantised samples are then recovered by inverting the index assignment; finally, the samples are de-quantised and sent through the inverse transformation f⁻¹.
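The whole chain of Figure 1.4 can be mimicked end to end in a few lines. Every block below is a deliberately simple stand-in chosen for this sketch (identity transform, 8-level uniform quantiser, natural binary index assignment, (3,1) repetition code, BPSK over an additive-noise channel), not the components of any particular system.

```matlab
% Toy end-to-end walk through the transmission model of Figure 1.4.
rng(1);
x  = randn;                              % source sample (f = identity here)
q  = min(max(round(2*x) + 4, 0), 7);     % Q   : 8-level uniform quantiser
b  = bitget(q, 3:-1:1);                  % IA  : 3-bit natural binary index
c  = repelem(b, 3);                      % CC  : (3,1) repetition code
s  = 1 - 2*c;                            % m   : BPSK symbols +/-1
r  = s + 0.4*randn(size(s));             % channel adds noise N
ch = r < 0;                              % m^-1: hard demodulation
bh = sum(reshape(ch, 3, []), 1) >= 2;    % CD  : majority decoding
qh = double(bh) * [4; 2; 1];             % IA^-1: bits back to an index
xh = (qh - 4) / 2;                       % Q^-1, f^-1: reconstructed sample
```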

Given these operations, several levels of focus are available when attempting to joint-code data. At the narrowest level, only optimisation of the index assignment is considered. These techniques are categorised as IA adjustment, which strives to optimise the index assignment operation described above so as to make it robust against channel errors. Examples of these techniques are BSA (the binary switching algorithm), where permutations of a given codebook are examined to see whether they yield a lower overall distortion, and SAA (simulated annealing), where controlled randomness is introduced into the refinement of the index assignment, in order to reach a global distortion minimum instead of getting stuck in a local one.
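Here is a minimal sketch of the BSA idea, under simplifying assumptions of mine (a scalar codebook, and a channel model where only single bit errors in the 3-bit index matter): starting from an initial assignment, every pairwise swap of codewords is tried, and a swap is kept whenever it lowers the expected distortion.

```matlab
function idx = bsa_sketch()
% Sketch of the binary switching algorithm (BSA) under toy assumptions.
levels = sort(randn(8, 1));              % toy codebook: 8 scalar levels
idx = 0:7;                               % initial index assignment
improved = true;
while improved
    improved = false;
    for i = 1:8
        for j = i+1:8
            trial = idx;
            trial([i j]) = trial([j i]); % try swapping two codewords
            if expDist(levels, trial) < expDist(levels, idx)
                idx = trial;             % keep distortion-reducing swaps
                improved = true;
            end
        end
    end
end
end

function d = expDist(levels, idx)
% Mean squared distortion over all single bit errors in the index.
d = 0;
for k = 1:8
    for b = 0:2
        recv = bitxor(idx(k), bitshift(1, b));
        d = d + (levels(k) - levels(idx == recv))^2;
    end
end
d = d / 24;
end
```

SAA differs essentially in the acceptance rule: a swap that increases distortion is occasionally accepted, with a probability that decreases over time, so that the search can escape local minima.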

A broader focus considers the joint optimisation of IA and quantisation. This is also known as “zero-redundancy quantisation”, in the sense that the data rate that would be allocated to error protection is instead used to refine the quantiser. Examples include COSQ (channel optimised scalar quantisation) and its vector counterpart COVQ (channel optimised vector quantisation), which generalise the Lloyd-Max algorithm (respectively the Linde-Buzo-Gray algorithm in the vector case) to noisy channels. Another example is the self-organizing hypercube, SOHC, referred to as the self-organizing map, SOM, in the scalar case.
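For a flavour of how the Lloyd-Max conditions generalise, here are the two optimality conditions of a COSQ for mean squared error, written with encoder cells $S_i$, decoder levels $c_j$ and channel transition probabilities $P(j \mid i)$ (the notation is mine):

```latex
% Optimal decoder: each reproduction level is the conditional mean of
% the source given the received index,
c_j = \frac{\sum_i P(j \mid i) \int_{S_i} x \, p(x) \, dx}
           {\sum_i P(j \mid i) \int_{S_i} p(x) \, dx} ,
% and optimal encoder: each sample is mapped to the index minimising
% the channel-averaged distortion,
i(x) = \arg\min_i \sum_j P(j \mid i) \, (x - c_j)^2 .
% On a noiseless channel, P(j|i) reduces to \delta_{ij} and both
% conditions become the usual Lloyd-Max centroid and
% nearest-neighbour rules.
```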

Finally, the broadest focus aims at the joint optimisation of IA, quantisation and modulation. An example of this is the MORVQ (modulation organized vector quantisation) algorithm, where the signal space is directly mapped onto a two-dimensional modulation space using the Kohonen algorithm.
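The Kohonen update at the heart of such schemes is compact. The sketch below is an illustration in the spirit of MORVQ rather than the algorithm itself: sixteen codevectors, whose indices live on a 4x4 grid (which one may identify with a 16-point modulation constellation), are trained so that codevectors close in signal space end up close on the grid. The learning-rate and neighbourhood-width schedules that a real implementation would anneal are kept fixed here for brevity.

```matlab
% Minimal Kohonen self-organising map on a 4x4 grid of indices.
rng(0);
[gx, gy] = meshgrid(1:4, 1:4);           % grid positions of the 16 indices
w = randn(16, 2);                        % codevectors for 2-D source vectors
eta = 0.1; sigma = 1.0;                  % fixed learning rate and width
for t = 1:5000
    x = randn(1, 2);                     % training vector from the source
    [~, win] = min(sum((w - x).^2, 2));  % best-matching codevector
    g2 = (gx(:) - gx(win)).^2 + (gy(:) - gy(win)).^2;
    h = exp(-g2 / (2*sigma^2));          % neighbourhood on the index grid
    w = w + eta * h .* (x - w);          % Kohonen update rule
end
```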

  ¹ Note that the medium analysed by Scott McCloud in his very good analysis [7] happens to be comics. Of course this is irrelevant to this dissertation, but I thought his proposition interesting enough to be mentioned here.
  ² This is of course a wink to Dr Wrenn’s Information theory and coding course, though I think some of his twenty-three students were sleeping.