Chapter 2: A real life application

Contents

The TCENLP vocoder
1. Principles of the TCENLP vocoder
2. Quantitative data of the TCENLP vocoder
The CDMA IS-95A standard
Our home-made channel coding attempt

Given the conditions of the project, I had to decide, jointly with Fernando, the exchange student, and Arturo Veloz and Jean-Marc Boucher, the two professors responsible for the project, on a “close-to-real-life” application. In particular, the thesis work of Seyed Zahir Azami [4] had already studied joint coding techniques on a binary channel.

After exchanging points of view on the subject, and taking into account the limited time allocated to the project as well as the tools and support available, we decided to use the vocoder developed by Fernando as our source encoder, and a Rayleigh fading channel. Furthermore, we would use the CDMA IS-95A standard for modulation over the fading channel, as it was available for use in Matlab.

2.1 The TCENLP vocoder

The TCENLP vocoder is a non-linear analysis-by-synthesis vocoder. As developed by Fernando Villavicencio for his MSc thesis, it is the implementation of the architecture proposed by L. Wu, M. Niranjan and F. Fallside in a paper published in 1994 in the IEEE Transactions on Speech and Audio Processing [9].

2.1.1 Principles of the TCENLP vocoder

Consistent with common linear prediction coders (such as the DoD’s LPC-10) and the source-filter model above (see §1.1.2), the vocoder first aims at isolating the short-term modulation of the speech. This modulation is the one induced by the vocal tract, running from the vocal cords to the lips. By non-linearly filtering out this modulation, a residual is obtained. This is called the short-term prediction (STP) analysis of the signal. The second step is then to analyse the residual and find out whether it has pseudo-periodic characteristics. If it does (in the case of a voiced sound), the vocoder proceeds to extract the base frequency, or pitch, of the signal; if not (in the case of an unvoiced sound), the residual will already be close to white noise. In any case, after analysis, and possible filtering out, of the pseudo-periodic characteristics, a noise-like residual (i.e. one carrying no information) is obtained. This second step is the long-term prediction (LTP) analysis of the signal.

The final step of the encoding is to represent the last residual in some manner. As the coder’s name indicates, this is done by choosing a “representative” residual from a codebook and transmitting the corresponding code. The codebook is created at design time, and the encoder and the decoder share the same one. Choosing a different (but close enough) residual instead of the one obtained after filtering of course introduces some measure of distortion. The aim is to minimize the distortion while keeping the codebook size small enough that transmitting the representative code does not degrade the bit-rate performance of the encoder too much.

Now, considering that the encoder is also designed on an analysis-by-synthesis basis, the steps described above take place in a closed loop, as described in figure 2.1: after the analysis described above, the input signal is re-synthesised from the parameters obtained and compared to the original. The general idea of the analysis-by-synthesis design is to minimize the difference between the original signal and the synthesised signal according to some metric. In general, it is possible to know, based on the results of the comparison, how to tweak the analysis parameters to obtain better results.

Figure-2.1 Basic concept of analysis-by-synthesis (from [8])

More precisely, the one parameter that can be adjusted in a code-excited coding scheme is the choice of the representative residual from the codebook. In coders where linear filtering occurs, it is possible to organize the codebook so that it forms a partition of the signal space, where each entry of the codebook represents a cell. This supposes that a topology can be established on the signal space, which is usually possible. The algorithms used to establish the partition are based on the original Lloyd-Max algorithm [4].

However, in the case of the TCENLP, the non-linearity of the involved filters makes it impossible to establish a partition of the residual space. As a consequence, this specific encoder has to exhaustively consider all entries of the codebook, in order to determine the one that, at synthesis time, produces the minimal distortion with the original signal. The need for an exhaustive search makes the encoder extremely expensive in terms of required processing power.
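The exhaustive search described above can be sketched as follows. This is only a minimal illustration: the `synthesize` function is a hypothetical stand-in for the vocoder's non-linear (neural network) filters, and the codebook, weights and squared-error metric are assumptions chosen to keep the example small, not values taken from the TCENLP implementation.

```python
import math

def synthesize(excitation, weights):
    """Hypothetical non-linear synthesis filter (NOT the TCENLP predictors):
    each output is the excitation sample plus a tanh-squashed combination
    of the previous outputs."""
    out = []
    for x in excitation:
        history = out[-len(weights):][::-1]  # most recent output first
        feedback = sum(w * y for w, y in zip(weights, history))
        out.append(x + math.tanh(feedback))
    return out

def squared_error(a, b):
    """Distortion metric assumed here: sum of squared sample differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exhaustive_search(target, codebook, weights):
    """Synthesise every codebook entry and keep the index of the closest one."""
    best_index, best_error = None, math.inf
    for index, excitation in enumerate(codebook):
        error = squared_error(target, synthesize(excitation, weights))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error

# Toy 4-entry codebook of 4-sample excitations (illustrative values only).
codebook = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, -0.5, 0.5, -0.5],
    [0.2, 0.4, 0.6, 0.8],
]
weights = [0.6, -0.3]

# Build a target frame that entry 2 reproduces exactly, then search for it.
target = synthesize(codebook[2], weights)
best_index, best_error = exhaustive_search(target, codebook, weights)
print(best_index, best_error)
```

Every candidate has to pass through the synthesis filter before it can be compared with the target, so the cost grows linearly with the codebook size and with the cost of one synthesis.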

Figure-2.2 Analysis-by-synthesis in a source-filter model (from [8])

2.1.2 Quantitative data of the TCENLP vocoder

Finally, it is worth noting, as we will use it later on, the specific data output by the encoder, as developed by Fernando Villavicencio. The encoder processes frames of speech data sampled at 8 kHz, each consisting of 64 samples, i.e. one frame every 8 ms. Every operation of the source-filter model analysis is performed on each frame, with the exception of the short-term analysis, which needs to consider two frames at once and is thus performed only once every two frames. As a result, the encoder outputs an encoded frame every two input frames, that is every 16 ms.

Short-term analysis is carried out every two frames. It results in choosing a suitable non-linear (neural network) predictor from a predefined book of predictors. As the codebook holds 64 entries, the index is transmitted as a 6-bit value.

Long-term analysis is carried out for every input frame. As with the short-term analysis, it leads to the choice of a suitable non-linear predictor from a fixed book of predictors³. As the book holds 64 entries, the index is transmitted as a 6-bit value. This analysis also extracts the pitch value for the input frame, which is transmitted as an 8-bit value. Thus, in an output frame, long-term analysis requires twice the sum of the above, i.e. 2 × (6 + 8) = 28 bits.

As we said, after both short- and long-term analysis, the input data has been filtered into almost-white noise. In order to choose a suitable substitute for that residual from the excitation codebook, the residual is first normalized (e.g. to unit power). The normalizing gain is evaluated for every input frame and transmitted as a 5-bit value. Finally, the index representing the chosen codebook entry is transmitted as a 6-bit value (indicating a codebook with 64 entries). Overall, this last step requires 2 × (5 + 6) = 22 bits per output frame.

Thus, each output frame counts 56 bits (6 + 28 + 22), yielding an output rate of 3.5 kbps, a 94.5% reduction compared with the 64 kbps of the standard PCM rate⁴.
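The bit budget above can be checked with a few lines of arithmetic. The sketch below uses only the figures given in this section; the one assumption is that the 64 kbps PCM reference corresponds to 8-bit samples at 8 kHz, and the variable names are purely illustrative.

```python
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_FRAME = 64
FRAMES_PER_OUTPUT = 2          # one encoded frame every two input frames

stp_bits = 6                                   # one STP index per output frame
ltp_bits = FRAMES_PER_OUTPUT * (6 + 8)         # LTP index + pitch, per input frame
excitation_bits = FRAMES_PER_OUTPUT * (5 + 6)  # gain + excitation index, per input frame
total_bits = stp_bits + ltp_bits + excitation_bits

# Output frames are produced every 2 * 64 samples, i.e. every 16 ms.
samples_per_output = FRAMES_PER_OUTPUT * SAMPLES_PER_FRAME
bit_rate_bps = total_bits * SAMPLE_RATE_HZ / samples_per_output

pcm_rate_bps = SAMPLE_RATE_HZ * 8              # assumption: 8-bit PCM at 8 kHz
reduction = 1 - bit_rate_bps / pcm_rate_bps

print(total_bits, bit_rate_bps, round(100 * reduction, 1))
```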

2.2 The CDMA IS-95A standard

2.2.1 Rationale for the choice of the standard

The CDMA IS-95A standard is one of the North American mobile phone standards, based on a CDMA (Code Division Multiple Access) scheme. While not necessarily common on a worldwide scale, and while not necessarily assured of a long future, this is a sound, real-world standard.

As a mobile phone data transmission standard, it is designed to efficiently transmit voice data over multipath fading (Rice and Rayleigh) channels. Furthermore, an implementation of the standard is available in Matlab’s Simulink. Since Fernando was developing his encoder in Matlab, which I was eventually supposed to use in a simulation, and since we wanted a real-world modulation technology suited to fading channels, it appeared that, despite its uncertain future and limited worldwide acceptance, the standard was a good choice for the project.

2.2.2 Overview of CDMA

CDMA (Code Division Multiple Access) is a modulation technique based on spectrum spreading. By carefully operating the spreading, it also allows multiple-access to a channel, ensuring that data belonging to different users is pseudo-orthogonal to each other.

An important feature of CDMA schemes is that signals sent over the same physical channel (though in different logical channels) will appear mixed both in the time and frequency domains. Separation of the different logical channels is achieved at the receiver by correlating the received signal with the spreading pseudo-random sequence that the logical channel of interest was encoded with. The property of pseudo-orthogonality ensures that only the signal of interest will be restored, while all other signals will only contribute to noise, as they are not de-spread in bandwidth. Filtering with a narrow band-pass filter, centred on the decoded signal, allows further reduction of the noise contributed by all the other signals.

Thus channel assignment is determined by creating adequate sets of pseudo-orthogonal, pseudo-random sequences. The sampling frequency of the sequences, also known as the chip rate, is chosen so that the bandwidth of the spread signal is several times that of the original signal. Assuming that synchronization is established between an emitter (of several channels, each spread by its own set of pseudo-random sequences) and a receiver, the receiver will be tuned to one logical channel by using the corresponding spreading sequence, in sync with the emitter, to de-spread the received signal.
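The spreading and correlation steps described above can be illustrated with a toy baseband example. This is a sketch under simplifying assumptions: perfect synchronization, no noise or fading, and chip sequences drawn from Python's `random` module rather than from the generators an actual CDMA system would use. The spreading factor and the helper names are hypothetical.

```python
import random

CHIPS_PER_BIT = 64  # spreading factor, assumed for illustration

def pn_sequence(seed, length):
    """Pseudo-random +/-1 chip sequence; the seed identifies the logical channel."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(length)]

def spread(bits, code):
    """Map bits to +/-1 symbols and multiply each symbol by the whole chip sequence."""
    return [(1 if b else -1) * chip for b in bits for chip in code]

def despread(signal, code):
    """Correlate each bit-long slice with the code and decide on the sign."""
    n = len(code)
    bits = []
    for start in range(0, len(signal), n):
        correlation = sum(s * c for s, c in zip(signal[start:start + n], code))
        bits.append(1 if correlation > 0 else 0)
    return bits

code_a = pn_sequence(1, CHIPS_PER_BIT)
code_b = pn_sequence(2, CHIPS_PER_BIT)
bits_a = [1, 0, 1, 1, 0]
bits_b = [0, 0, 1, 0, 1]

# Both logical channels share the physical channel: their chips simply add up,
# so they overlap completely in time and in frequency.
channel = [a + b for a, b in zip(spread(bits_a, code_a), spread(bits_b, code_b))]

# Correlating with one code restores that channel; the other one only adds
# a small residual to the correlation.
recovered_a = despread(channel, code_a)
recovered_b = despread(channel, code_b)
print(recovered_a, recovered_b)
```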

One of the advantages of spread-spectrum modulation is that it is naturally resistant to multipath fading, since each path can be demodulated independently of the others. However, this is not perfect and there are still cases when fading affects the demodulated signal.

In CDMA systems, the bandwidth of the spreading sequence is chosen to be much greater than that of the transmitted signal, so that the bandwidth of the spread signal is roughly that of the spreading sequence. Thus the cross-correlation properties of the modulated signals are rather equivalent to those of the spreading sequences. As such, the condition of pseudo-orthogonality, which is equivalent to good cross-correlation properties—i.e. cross-correlation of two signals yields a large bandwidth, low energy, preferably noise-like signal—is implemented on the spreading sequences, independently of the transmitted signals, which greatly simplifies the design of such systems.

2.2.3 Fundamentals of IS-95A

The IS-95A standard, as a CDMA system, naturally implements the properties above. Being an industry standard, it also implements more features in order to ensure data integrity, and adapts the CDMA principles to the particular case of mobile phone communication, where there is a pronounced asymmetry between the two sides of the communication link. While the base station suffers virtually no energy restriction, and thus can emit several channels at high power, the mobile phones live on a small battery and have to restrict themselves to managing only the channels they are concerned with, at the lowest manageable output power.

The following two diagrams show the forward and reverse channels (from the base station to the mobile phone and vice versa, respectively).

Figure-2.3 IS-95A forward channel diagram

Figure-2.4 IS-95A reverse channel diagram

We can see that besides the spreading and modulation steps belonging to the CDMA scheme as described above, there are also several steps of channel coding (that is, error protection) taking place. However, as we will see later, in our case we are interested in providing our own channel coding scheme, so we do not use the channel coding facilities provided by the standard. On the diagrams above, we drop the data path running from the CRC generator to the scrambler (in the forward channel) or the interleaver (in the reverse channel), and the corresponding path on the receiver side.

The IS-95A standard implements the CDMA scheme with the following characteristics. First of all, the bandwidth of the pseudo-random sequences is chosen so that, after modulation, the signal has a bandwidth of 1.25 MHz, which corresponds to one tenth of the total band of frequencies allocated to a mobile operator.

Furthermore, the standard uses two sets of pseudo-random codes. The so-called short PN codes are a pair of sequences, of length 2¹⁵ symbols, used to modulate the signal into in-phase and quadrature components. Note that different base stations operating at the same frequency can use the same sequences by simply using different offsets into the sequences. The long PN code, of length 2⁴²-1 symbols, is used for spreading, scrambling and power control on the reverse link.
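Pseudo-random codes of this kind are commonly produced by linear feedback shift registers (LFSRs). The sketch below is not the IS-95A generator: it assumes a tiny degree-4 register (recurrence a[n+4] = a[n+1] XOR a[n], i.e. the primitive polynomial x⁴ + x + 1) purely to show two properties on a sequence short enough to inspect, namely the period of 2ⁿ − 1 and the low correlation between a sequence and its shifted copies, which is what makes offsets into a common sequence usable as distinct codes.

```python
def lfsr_sequence(taps, state, length):
    """Fibonacci LFSR: output the oldest bit, feed back the XOR of the tapped bits.

    `state` holds the register contents, oldest bit first; `taps` lists the
    positions (counted from the oldest bit) that are XORed together.
    """
    state = list(state)
    out = []
    for _ in range(length):
        out.append(state[0])
        feedback = 0
        for t in taps:
            feedback ^= state[t]
        state = state[1:] + [feedback]
    return out

def circular_correlation(chips, shift):
    """Correlation between a +/-1 sequence and a cyclically shifted copy of itself."""
    n = len(chips)
    return sum(chips[i] * chips[(i + shift) % n] for i in range(n))

# Illustrative degree-4 register (an assumption, not a polynomial from the standard).
period = 2 ** 4 - 1
seq = lfsr_sequence(taps=(0, 1), state=(1, 0, 0, 0), length=2 * period)

# Map bits {0, 1} to chips {+1, -1} over one period.
chips = [1 - 2 * b for b in seq[:period]]

print(seq[:period])
print([circular_correlation(chips, k) for k in range(period)])
```

The correlation peaks at the sequence length for a zero shift and stays at −1 for every other shift, so a receiver locked to one offset sees the other offsets as weak, noise-like interference.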

In addition to these codes, the standard also uses a set of mutually orthogonal codes, called the Walsh codes, which will ensure the mutual-orthogonality property central to the CDMA scheme.

In addition to the traffic channels, which carry the data, the standard defines a few other (logical) channels for the proper operation of the system: on the forward link, the pilot and sync channels provide channel estimation and synchronization, thus allowing correct demodulation of the traffic channels. On the forward and reverse link, a channel is dedicated to control information, spontaneous requests from the mobile station and paging.

The base station simultaneously transmits all the traffic channels, plus the three control channels (pilot, sync and paging). To allow individual decoding by the receiving mobile station, each traffic channel is spread by its own Walsh code sequence. Furthermore, as in all mobile communication systems, each traffic channel is granted its own power, in order to optimise the overall transmitted power. In the CDMA system, this is ensured by adding the spread traffic channels together after scaling each by a factor corresponding to its intended transmitted power. Note that the factor is determined with respect to the intended transmitted power per bit. If a traffic channel operates at a lower bit rate, bits are repeated in order to obtain the same transmitted bit rate, but the scaling factor applied to the channel is modified accordingly.
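The Walsh spreading and per-channel power weighting described above can be sketched as follows. The example assumes an ideal, noiseless and perfectly synchronized link, and omits the PN scrambling, bit repetition and control channels. The Walsh codes are built with the Sylvester-Hadamard construction; 8-chip codes are used to keep the example small (IS-95A itself uses 64-chip Walsh codes), and the channel list and gains are hypothetical.

```python
def walsh_codes(order):
    """Rows of the 2**order x 2**order Sylvester-Hadamard matrix, as +/-1 chips."""
    rows = [[1]]
    for _ in range(order):
        rows = [r + r for r in rows] + [r + [-c for c in r] for r in rows]
    return rows

def spread(symbols, code, gain):
    """Multiply each +/-1 symbol by the channel's Walsh code, scaled by its gain."""
    return [gain * s * chip for s in symbols for chip in code]

def despread(signal, code):
    """Correlate each symbol-long slice with the code, normalised by its length."""
    n = len(code)
    return [sum(x * c for x, c in zip(signal[i:i + n], code)) / n
            for i in range(0, len(signal), n)]

codes = walsh_codes(3)  # 8 mutually orthogonal codes of 8 chips

# Hypothetical traffic channels: (Walsh index, gain, +/-1 symbols).
channels = [
    (1, 1.0, [1, -1, 1, 1]),
    (2, 0.5, [-1, -1, 1, -1]),
    (5, 2.0, [1, 1, -1, -1]),
]

# The base station adds the scaled, spread channels into one composite signal.
composite = [0.0] * (4 * len(codes[0]))
for index, gain, symbols in channels:
    for i, chip in enumerate(spread(symbols, codes[index], gain)):
        composite[i] += chip

# Each mobile correlates with its own Walsh code and recovers gain * symbol;
# the other channels cancel exactly because the codes are orthogonal.
recovered = {index: despread(composite, codes[index]) for index, _, _ in channels}
print(recovered)
```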

Finally, both links use a rake receiver to acquire the transmitted data. The aim of the rake is to efficiently manage the effects of multipath fading during transmission. While standard receivers would greatly suffer from the fading incurred, rake receivers are able to individually isolate each path component of a signal and recombine them. On the forward link, the pilot channel allows for coherent detection in the (mobile) receiver, while on the reverse link, the absence of such a channel requires the (base) receiver to implement a non-coherent detection scheme.

2.3 Our home-made channel coding attempt

Given our source coder, the TCENLP vocoder, and our channel, beginning at the spreading and modulation stages of the IS-95A CDMA standard, what was left to do was to implement our attempt at a channel encoder. Based on information gathered mainly from the Zahir Azami thesis [4], we decided to attempt a simple index assignment optimisation on the codebooks prepared in the source encoder.

On the other hand, the vocoder was developed by my colleague Fernando Villavicencio in the framework of his own master’s thesis, so that he was not keen on including external work into his implementation. As a consequence, we decided to develop the “joint” channel encoder as a separate component, which admittedly seems to contradict the joint approach.

Figure-2.5 Joint “disjoint” channel coding

The vocoder, as implemented by Fernando, outputs the following data every two input frames:

two excitation indexes, indicating the best choice excitation vector for each input frame,
two LTP indexes, indicating the best choice set of parameters for the non-linear long term prediction filter for each frame,
one STP index, indicating the best choice set of parameters for the short term prediction filter for the double frame made of both frames,
two pitch values, as extracted by the long term predictor, one for each frame, and
two gain values, as used in order to normalize each frame.

We decided to treat each of these parameters separately, according to their nature. The values (gain and pitch) could be translated into a representation that would be more robust than the binary representation against channel errors.

On the other hand, the indexes (excitation, LTP and STP) could be reassigned in order to minimize data distortion in case of index error. In other words, given a measure on the data space (e.g. in the case of the excitation codebook, a Euclidean measure on the vectors of samples), and a measure on the index space, ideally adapted to take into account the errors introduced during transmission over the channel, we want to try and give both spaces the same topology, so that two data vectors that are close in the data space, according to the measure defined there, are represented by indexes that are close in the index space, according to the measure defined there.

Once the reassignment is computed, we can apply it out of the vocoder itself, in a disjoint component. Thus, the apparently disjoint channel encoder does perform a joint coding operation.
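To make the idea of index reassignment concrete, here is a minimal sketch. It is not the method of [4] nor the one eventually applied to the vocoder: it assumes a squared Euclidean measure on the data space, a channel that only produces equally likely single-bit errors on the index (so the measure on the index space is the Hamming distance), and a naive greedy search by pairwise swaps. The toy codebook and all function names are hypothetical.

```python
import itertools

def distortion(a, b):
    """Squared Euclidean distance between two codebook vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def single_bit_cost(codebook, assignment, bits):
    """Total distortion over all single-bit index errors.

    assignment[k] is the index transmitted for codebook entry k.
    """
    entry_of = {index: k for k, index in enumerate(assignment)}
    cost = 0.0
    for k, index in enumerate(assignment):
        for bit in range(bits):
            received = entry_of[index ^ (1 << bit)]  # entry decoded after the bit flip
            cost += distortion(codebook[k], codebook[received])
    return cost

def improve_by_swaps(codebook, assignment, bits):
    """Greedy local search: keep swapping two indexes while it lowers the cost."""
    assignment = list(assignment)
    best = single_bit_cost(codebook, assignment, bits)
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(len(assignment)), 2):
            assignment[i], assignment[j] = assignment[j], assignment[i]
            cost = single_bit_cost(codebook, assignment, bits)
            if cost < best:
                best, improved = cost, True
            else:
                # Undo the swap if it did not help.
                assignment[i], assignment[j] = assignment[j], assignment[i]
    return assignment, best

# Toy scalar codebook with 4 entries (2-bit indexes), deliberately badly ordered.
codebook = [[0.0], [3.0], [1.0], [2.0]]
initial = [0, 1, 2, 3]

initial_cost = single_bit_cost(codebook, initial, bits=2)
optimised, optimised_cost = improve_by_swaps(codebook, initial, bits=2)
print(initial_cost, optimised, optimised_cost)
```

The reassignment is just a permutation table, which is why it can be applied outside the vocoder: the encoder side maps each index through the table before transmission and the decoder side applies the inverse mapping. A greedy swap search like this one can stall in a local minimum on larger codebooks.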

Given the choice of this approach, we have several problems to solve in order to apply it. We need of course to determine a suitable measure on the different data spaces. While it might seem trivial that the Euclidean measure is well suited to scalar or vector data such as the pitch and gain values, or the excitation vectors, we might have more trouble defining a measure on the STP or LTP filter parameters, which are no more than an arbitrarily ordered set of parameters for the non-linear short- and long-term prediction filters.

On the other side of the channel encoder, we also need to determine an adequate choice of a measure on the index spaces. As we said, this measure should be determined in order to take into account the distortion introduced by the channel, so that the index re-assignment is channel-optimised. This determination should at the very least require a statistical analysis of the channel, in order to try and determine statistical properties of the errors introduced upon transmission.

³ Note that how this book is obtained is also part of the design of the TCENLP encoder, though not relevant in this dissertation. Suffice it to say that a training of the encoder takes place prior to use.
⁴ Please note that the values presented in this section do differ from the values presented by F. Villavicencio in [8]. The reason for this apparent inconsistency is that my work was based on Fernando’s encoder as it was before the tuning that took place for the writing of his thesis.