Chapter 3: Implementation and results

Contents

  1. IS-95A and fading channels
  2. The TCENLP vocoder
  3. Interfacing the components
  4. The channel part of the joint coding

As stated earlier, the professors in charge of the project, Jean-Marc Boucher at ENST de Bretagne and Arturo Veloz at Cinvestav, together with both students, Fernando Villavicencio at Cinvestav and myself, decided on the components that would surround my work on joint coding: Fernando's implementation of the TCENLP vocoder described in [9] as the source coder, and the IS-95A CDMA standard as a binary channel over a fading channel. The source coder was chosen because, by design, the TCENLP already had interesting joint-coding features. The channel standard, on the other hand, was chosen because it came from the "real world", it met the requirements stated in the project outline [A.2], and an implementation was available in Matlab.

Unfortunately, implementation of a complete, running system took most of the time allocated to the project, so that while implementation details are numerous, results are few. This is most likely due to the following combination of factors: first, as time passed, it appeared that the project definition might have been too ambitious. Furthermore, while the project started rather late, towards the end of April, I still had to learn the required points of theory. Finally, the broad openness of the project allowed many options to be explored, while it might have proven wiser to restrict the focus to some particular aspect.

As we said, implementation of a complete system, as shown in Figure-2.5, took place in Matlab, and more precisely in the Simulink tool shipped with Release 12 of Matlab. As described in its own documentation, Simulink is "a software package for modelling, simulating, and analysing dynamical systems". In particular, it allows models to be described via a very intuitive graphical user interface, making it a very practical tool for developing, and later simulating, models.

We shall examine in turn implementation choices and details for the channel, our source of data, our attempt at joint (channel) coding and our data analysis tools.

3.1 IS-95A and fading channels

The project outline states that we want to conduct our study of joint channel coding for speech signals on fading channels. While Simulink provides such channels, and it is easy to set up a system integrating one, standard modulation techniques do not allow reasonable error rates for transmission over such a channel. As an example, let us consider a system where a source of modulated signals emits the constant value j. After transmission through a fading channel including both a line-of-sight path, with Rician characteristics, and a multipath component, with Rayleigh characteristics, the signal is plotted on the complex plane.


Figure-3.1 A fading channel example from Simulink

The system is represented in Figure-3.1 above, as it appears in Simulink. While most details are not relevant to this dissertation, we can appreciate the clarity of the diagram. Of course, this is only a simplistic example, but the Simulink user interface is a great help when dealing with more complex models. The "Rician Fading" and "Multipath Rayleigh Fading" blocks appearing on the diagram are also built into Simulink.

On Figure-3.2 below, the output of the combination of fading channels appears. The red dot is the input signal, while the blue curve is the output signal. An immediate remark is that a great deal of distortion obviously occurs. To grasp how much of a hindrance this is, imagine a binary phase-shift keying (BPSK) modulation scheme. The signal emitted by a BPSK modulator would be represented on the graph below as the red dot and its mirror image with respect to the real axis (i.e. the values ±j). A simple decoding scheme is to observe the sign of the received signal's imaginary part and decide accordingly which value was emitted. This is basically what the BPSK demodulator does. However, with the distortion caused by the fading channels, as represented by the blue curve below, this analysis is no longer possible. Such a simplistic scheme would statistically fail to detect the right input signal as often as not, leading to a 50% bit error rate (BER) between the (binary) input and output of the system.


Figure-3.2 Effect of a fading channel over a constant signal

For this very reason, we decided that we needed a modulation scheme for our fading channels that would resist their degrading effects. Since we did not consider it worthwhile to develop such a model during the project (even if "develop" only meant implementing an existing specified system), we decided to use the available "real-world" system that came with Simulink: an implementation of the North-American CDMA mobile telephony standard IS-95A, along with examples of full communication channels. Although the standard specifies a complete channel-coding system along with the modulation, we decided that, since our project was about joint source-channel coding, it would not make much sense to use the provided channel-coding scheme. In the end, we only kept the modulation and spreading components of the IS-95A standard, operating over an analogue channel with Rician and Rayleigh fading properties, as well as additive white Gaussian noise (AWGN). This set of components shall from now on be referred to simply as "our channel" [5].

Our channel is thus a binary channel, for it accepts binary data as input and outputs binary data. The efficiency of the spreading modulation in IS-95A, as mentioned in 2.2.2, gives our channel rather satisfying properties. Indeed, analysis shows that the BER introduced by the channel is slightly less than 2%. Of course, one might argue that this is still a rather large value, considering the field of application, but compared to the results given by simple, though robust, modulation schemes such as BPSK, it is quite satisfying.

As stated in 2.3, we would also need an analysis of the statistical properties of the errors introduced by our channel, in order to design measures adapted to the data transmitted over it. While I was able to gather data about the channel, in particular about the distribution of errors in time, I did not have time either to analyse it or to put it to use in determining an ad hoc measure.

3.2 The TCENLP vocoder

Unfortunately, during the development of our project, the TCENLP vocoder was not yet completed. As such, it was not possible, though it was also developed under Matlab, to integrate it into our project. Since we had decided anyway to use it for the project, if only to let Fernando re-use the project as a base for an eventual PhD, we had no choice but to develop an emulator, a stub standing in for the original vocoder.

For that purpose, Fernando provided me with a model of the expected output data. As we said earlier, the model specifies output frames of 56 bits, every 16 ms. In order to correctly manage the data along the communication line, I had to choose an arbitrary layout for the data within a frame, which came out as described in Table 1 below. The specification also came with statistical information about each field in the output frame, in order to simulate the true encoder more closely. Note however that this information is at best guesswork, and does not come from an analysis of the output of the true encoder, which, again, was not ready for such an analysis.

Table 1. Description of the TCENLP emulator output frame

Bits    Field   Description
0-5     STP     Index of the chosen non-linear short-term predictor for both input frames corresponding to this output frame.
6-13    Pitch1  Pitch value found for the first input frame corresponding to this output frame.
14-19   Exc1    Index of the excitation chosen from the excitation codebook for the first input frame.
20-25   LTP1    Index of the chosen non-linear long-term predictor for the first input frame.
26-30   Gain1   Gain computed for the residual after STP and LTP filtering, for the first input frame.
31-38   Pitch2  Idem Pitch1, for the second input frame that produced this output frame.
39-44   Exc2    Idem Exc1, for the second input frame.
45-50   LTP2    Idem LTP1, for the second input frame.
51-55   Gain2   Idem Gain1, for the second input frame.

While the determination of scalar values, such as the gain or pitch, was relatively easy, it proved more difficult to set up a satisfying scheme for the codebook index values, namely the STP, LTP and Exc fields. We decided that the gain would cover its full range uniformly. For the pitch, a random draw would first determine whether the field would be 0 or not: a null value indicates a non-voiced segment of speech, while a non-null value indicates a voiced segment. For a voiced segment, the pitch value would be drawn from a normal distribution of mean 50 and variance 10, which corresponds more or less to the expected pitch variation in speech produced by an adult male. In the statistically rare case of a pitch value out of range (normal distributions not being bounded), clipping would occur. Note that we decided to represent pitch values as 8-bit unsigned integers, giving a range of 0-255. On the other hand, the parameters of the distribution ensure that most of the chosen values fall in the range 40-60, so that we are admittedly "wasting" some output bandwidth. Remember though that this is only a crude first-try approach, most likely refined in the final version of the true vocoder.

For the index fields, STP, LTP and excitation, the idea was that the distribution of each would likely be normal: while a reduced set of indexes would represent most of the input data, the rest would cover less likely, more specific cases. This observation is consistent with several systems we had studied, in particular the linear prediction coder variant of the US Department of Defense's LPC-10.

However, we also wanted to take into account that the indexing was arbitrary at every step. In other words, Fernando's implementation of the TCENLP did not guarantee that the most likely indexes would be clustered together. The idea for the emulator was thus to choose the indexes according to a normal distribution, but only after applying a permutation to them.
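
A sketch of that idea in C: indexes are drawn from a discretised, clipped normal distribution centred on the codebook, then mapped through a fixed but arbitrary permutation. The codebook size, the spread, and the Fisher-Yates shuffle are illustrative assumptions, not Fernando's actual parameters.

```c
#include <stdlib.h>

#define CODEBOOK_SIZE 64            /* illustrative size */

static int permutation[CODEBOOK_SIZE];

/* Build a fixed, arbitrary permutation once (Fisher-Yates shuffle). */
static void init_permutation(unsigned seed)
{
    int i;
    srand(seed);
    for (i = 0; i < CODEBOOK_SIZE; i++)
        permutation[i] = i;
    for (i = CODEBOOK_SIZE - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = permutation[i];
        permutation[i] = permutation[j];
        permutation[j] = t;
    }
}

/* Map a standard normal deviate z to a permuted codebook index. */
static int draw_index(double z)
{
    int raw = (int)(CODEBOOK_SIZE / 2 + (CODEBOOK_SIZE / 8.0) * z);
    if (raw < 0)              raw = 0;                 /* clip */
    if (raw >= CODEBOOK_SIZE) raw = CODEBOOK_SIZE - 1;
    return permutation[raw];
}
```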

Finally, in order to take full advantage of the features offered by Simulink, we decided to implement the emulator in C, within the framework for developing Simulink components. However, standard C offers few mathematical functions. According to [5], the only random distribution available in standard C is the uniform one, and even its quality is not always satisfying. For that reason, we decided that the statistical part of our implementation, that is, the determination of the different required distributions and permutations, should be obtained by interfacing with Matlab through its C API.

3.3 Interfacing the components

Once the channel, understood in its broadest sense, and the source of data were ready, the only thing left to implement was, logically enough, the disjoint attempt at joint coding (as explained in 2.3). However, another issue had to be solved first. Indeed, our choices for the different components of the project left us having to interface two components with seemingly incompatible data flows.

The TCENLP vocoder, as we pointed out earlier, outputs 56-bit frames every 16 ms, amounting to a bit rate of 3.5 kbps. On the other hand, the IS-95A standard specifications require 576-bit frames every 20 ms, amounting to a total bit rate of 28.8 kbps. While a 28.8 kbps flow can certainly carry the data of a 3.5 kbps flow (setting aside, for this project, the obvious waste of bandwidth incurred), we had to devise a way to achieve it in practice. In particular, we had to satisfy regularity constraints on the flows: the flow of data out of the channel, into the TCENLP decoder, had to exhibit the same structure of 56-bit frames every 16 ms as the flow out of the encoder. Reasons include that the decoder, if and when ready, would need to work on the fly on a steady flow of data, as well as our ability to compare input and output data given the flow architecture in Simulink.

Thus, the requirements for a “rate adaptation” component are the following, using the symbols on Figure-3.3:

  1. The same component must be able to adapt flow 1 to flow 2 and flow 2 to flow 1.
  2. The component accepts and outputs data in frames (even of one bit length only).
  3. The component must be able to admit any frame size at any rate for both flows.


Figure-3.3 Bit flow adapter

Furthermore, the component shall comply with the following requirements, ensuring, in the case of our project, the integrity of the data reaching the last decoding stages. Referring to Figure-3.4 and assuming that the bit rate of flow 2 is greater than that of flow 1 [6], the requirements are as follows:

  4. The data in flow 1’ must be, notwithstanding a possible delay, the same as the data in flow 1. In other words, the black box must act as nothing more than a delay component.
  5. The frames of flow 1’, though possibly delayed with respect to those of flow 1, must individually contain the same data. In other words, the flow of bits must not lose its synchronisation with the flow of frames.


Figure-3.4 Bit flow adapter and “de-adapter”

As a consequence of the first requirement, the component has two working modes: if the bit rate of the input flow is less than that of the output flow, it transfers all the data from the input flow to the output flow, adding padding as necessary. In the opposite situation, the component selects some of the data from the input flow and transfers it to the output flow, without adding any padding. Because of requirement 4, of course, the data dropped in the second case must correspond exactly to the padding added in the first case, so that the only data going untouched through the system described in Figure-3.4 is the original data of flow 1.

Thus, the component combines the abilities of a delay line and a bit selector, whether the selection serves to choose where to insert padding or, on the contrary, which data to drop. Satisfying requirement 5 and ensuring synchronisation of data and frames is only a matter of carefully choosing the size of the delay lines at de-adaptation. The component has no way to determine this factor by itself, but it is a constant of the system (in our project at least, flow 2 incurs a known delay) and can thus be passed to the component as a parameter.

As a consequence of the above requirements, the bit selection algorithm has to work symmetrically: the bits selected to receive padding in the output frame at the adaptation stage must be exactly the bits dropped from the input frame at the de-adaptation stage. The algorithm I came up with goes along the following lines. Throughout the life of the component, the input and output bit rates are given and constant; let rate be the ratio of the smaller bit rate over the larger. Considering the adaptation stage, for example, the idea is then simply to place one input bit every 1/rate output bits. Of course, 1/rate will generally not be an integer. Thus, also maintaining a real offset offset into the output frame, we place the input bits at positions floor(offset + n/rate) for every non-negative integer n such that offset + n/rate falls within the output frame. The offset is then updated to the first position past the output frame, modulo the output frame size.

In C, this algorithm translates into a rather simple for loop. Given a bit mask mask, whose size sz is the size of the frame in which the selection takes place (the output frame at the adaptation stage, the input frame at the de-adaptation stage), given the offset offset mentioned above, and given the coefficient rate defined as the ratio of the smaller bit rate over the larger bit rate, the loop that selects the adequate bits within the mask, marking them as 1, is the following [7]:

#include <math.h>                     /* for floor() */

double ptr;
/* Select one bit position every 1/rate, starting at offset. */
for (ptr = offset; ptr < sz; ptr += 1/rate)
{
    mask[(int)floor(ptr)] = 1;
}
/* Carry the fractional overshoot into the next frame. */
offset = ptr - sz;

Figure-3.5 The core of the frame adaptation algorithm

Once implemented in C within the Simulink framework (for further detail, refer to [10]), the component was integrated into our model as part of the channel. Thus, we could consider our model as integrating a source of data (the TCENLP vocoder emulator) and a channel with compatible data formats.

3.4 The channel part of the joint coding

Mainly based on ideas gathered from S. Zahir Azami's thesis [4], I decided to try to implement added protection on the data output by the TCENLP vocoder. As we pointed out earlier, the vocoder contains three codebooks. Two of them assign an index to a set of parameters for non-linear filters, the short- and long-term predictors. The third assigns an index to the residual of the signal after filtering through both predictors. According to [4], we can consider both the codebook values and their keys as elements of vector spaces. In particular, in the case of the excitation codebook, the residuals are vectors of audio data, naturally members of vector spaces with as many dimensions as they contain samples.

By choosing appropriate measures, or distances, on these vector spaces, we give them a topology. The "joint coding idea" is then to manage to give, after re-organisation, the same topology to both spaces. In other words, we try to ensure that two close members of the data space are represented by two close members of the index space. Thus, errors introduced on the indexes during transmission over a noisy channel are not further amplified in the data space.

However, at that point, much of the project's allocated time had been spent setting up other parts of our system, and I did not have much time left for this matter. Successfully applying this algorithm to the different codebooks of the TCENLP required that I define some sort of measure on the data. For the short- and long-term predictors, the members of the codebooks were sets of non-linear filter (neural network) parameters, so that defining a measure on them did not seem an easy task.

Thus, to try to obtain results before the end of the project, I decided, after consultation with Fernando and Arturo, to try a similar idea on the scalar data we had: the gain and pitch fields of each frame (see Table 1). In Fernando's implementation, these fields were simply encoded in natural binary form. However, when transmitted over the channel we set up, which introduces binary errors, this data would be mangled. We thus thought that, since the errors were introduced in binary form, the natural binary representation was perhaps not the best adapted to the channel. Given the property of Gray codes that any two consecutive values are at Hamming distance one, we thought that Gray coding might prove more resistant than the original encoding to the errors introduced by the channel, as far as the decoded data was concerned.
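
For reference, the binary-to-Gray conversion and its inverse can be written in a few lines of C; these are the standard closed forms, equivalent to what the Simulink data-mapping block computes.

```c
/* Gray encoding: consecutive values always differ in exactly one bit. */
static unsigned gray_encode(unsigned b)
{
    return b ^ (b >> 1);
}

/* Inverse mapping: each decoded bit is the XOR of all higher Gray bits. */
static unsigned gray_decode(unsigned g)
{
    unsigned b = 0;
    for (; g; g >>= 1)
        b ^= g;
    return b;
}
```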

The other problem we had to solve was designing adequate tools to evaluate the impact of our attempt. The ideal solution would have been to compare original and restored (after vocoding, transmission over the noisy channel, and decoding) speech samples. This was unfortunately not possible, since we did not have a functional vocoder, but only an emulator of it. Thus, given our system, we could only directly compare the data of interest, the gain and pitch values. Assuming that these have a rather direct influence on the decoded speech, an adapted measure is simply the Euclidean one. By comparing the distance between the data of interest before and after transmission through the fading channel, both with and without our encoding scheme, we could get a rough idea of whether the idea was viable.


Figure-3.6 The channel coding testing model

For that purpose, we created the system in Figure-3.6 above. The TCENLP block holds the emulator for the vocoder, as described in 3.2. The "Channel-coded" block holds the fading channel along with the modulating part of the IS-95A standard, as described in 3.1, and the frame adaptation component described in 3.3. As we can see, the data out of the vocoder travels along two different paths. The top one is merely a delay path and leaves the data untouched, save for a delay equal to the one introduced by the bottom path. The bottom path, on the other hand, performs the data encoding and decoding (for the gain and pitch fields only) and processes the encoded data through the channel. Both paths finally join in a mixing component that regroups data from both paths, keeping only the gain and pitch fields from the bottom path and all the other fields from the top path. Thus, the data reaching the Rx label has only its gain and pitch fields altered, while all the others remain untouched. While this does not bring much in our configuration, it would allow isolating the effects of the changed fields in the TCENLP decoder, once we had it running.

The channel coding blocks perform the conversion from binary representation to Gray representation for the pitch and gain fields of the frame, and back. These operations did not require development of Simulink components in C, but could be achieved using the available base blocks: selector blocks isolate the fields to be converted, which are then processed through an adequately configured data-mapping component. The isolated fields are finally concatenated back together.


Figure-3.7 The channel coding block

Finally, the data analysis block compares the transmitted and received data, in terms of bit error rate for the whole frame (which is in turn representative of the bit error rate for the fields of interest, since the others are transmitted error-free) and mean Euclidean distance between the transmitted and received fields.
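
Both measures are straightforward; a minimal C sketch of what the analysis block computes (with my own function names, and absolute difference as the one-dimensional Euclidean case) could read:

```c
#include <math.h>

/* Bit error rate between transmitted and received bit arrays. */
static double bit_error_rate(const char *tx, const char *rx, int nbits)
{
    int i, errors = 0;
    for (i = 0; i < nbits; i++)
        if (tx[i] != rx[i])
            errors++;
    return (double)errors / nbits;
}

/* Mean Euclidean distance between scalar field values; in one dimension
   this reduces to the mean absolute difference. */
static double mean_euclidean(const double *tx, const double *rx, int n)
{
    int i;
    double sum = 0.0;
    for (i = 0; i < n; i++)
        sum += fabs(tx[i] - rx[i]);
    return sum / n;
}
```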

For the data to be meaningful, we developed three versions of the system. The first version has the physical channels (within our big channel block) configured to introduce no error. This is to test that the system is correct and does not introduce errors by design. The second and third version both carry an operational channel component that introduces error as shown in 3.1. However, while the third version performs the channel coding scheme that we came up with, the second version performs no channel coding. This is to try and determine the effects of our scheme.


Figure-3.8 Relative “gain” in Euclidean error for the pitch and gain fields

However, the results obtained proved quite surprising. While we expected the Euclidean error between original and transmitted fields to drop when enabling our coding scheme, the simulation showed it rising instead. Figure-3.8 plots, for the pitch and gain fields, the “gain” in error as computed by the simulation against time. In both cases we actually observe a negative “gain”: the error increases, by up to 25% for the pitch field.

Further study, over simpler channels, all showed similar results: replacing binary coding with Gray coding for the scalar data of our system did not reduce the measured Euclidean error, and in most cases even worsened it, thus shattering the preconceived idea that had led us to the experiment described. Furthermore, a more careful theoretical analysis showed that what we had done was to map Gray coding (which does have nice properties when considered in a vector space with as many dimensions as the code has bits) onto a one-dimensional space, the real line, and that in doing so we could not take advantage of its Hamming-proximity properties. This did not guarantee that we would get worse results, but it did guarantee that we would not get better ones.
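
This analysis can be checked exhaustively: over all 8-bit values and all single-bit channel errors, the squared decoding error under Gray coding is on average larger than under natural binary coding. The sketch below performs that enumeration; it is my after-the-fact verification, not part of the original simulation.

```c
static unsigned genc(unsigned b) { return b ^ (b >> 1); }

static unsigned gdec(unsigned g)
{
    unsigned b = 0;
    for (; g; g >>= 1)
        b ^= g;
    return b;
}

/* Sum the squared decoding error over every 8-bit value and every
   single-bit error, for natural binary and for Gray coding. */
static void flip_error_sums(double *bin_sq, double *gray_sq)
{
    unsigned v, bit;
    *bin_sq = *gray_sq = 0.0;
    for (v = 0; v < 256; v++) {
        for (bit = 0; bit < 8; bit++) {
            int db = (int)(v ^ (1u << bit)) - (int)v;            /* binary */
            int dg = (int)gdec(genc(v) ^ (1u << bit)) - (int)v;  /* Gray   */
            *bin_sq  += (double)db * db;
            *gray_sq += (double)dg * dg;
        }
    }
}
```

The underlying reason is that flipping Gray bit i decodes to the original value with bits 0 through i inverted, so a single high-order channel error can move the decoded scalar almost across its whole range; this is the one-dimensional effect that defeated our scheme.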

It is interesting to mention that the fundamentally flawed idea on which this experiment was based had been supported by the professor in charge in Mexico. It thus came as a surprise to both of us to find, upon closer examination, that it was indeed invalid and that there was no reason to expect any improvement in the Euclidean error of the coded scalar values.

  5. Note by the way that S. Zahir Azami, in [4], refers to the ambiguity of what we call a channel, which, depending on the authors, varies from just an analogue channel, to an analogue channel plus a modulation step, and sometimes even a channel coding step as well.
  6. This assumption simply reflects the fact that if the bit rate of flow 2 in Figure-3.4 were less than that of flow 1, passage through the adapter would drop some of the data of flow 1, and neither of the following requirements could be respected.
  7. Note that the floor function used in the algorithm is part of the ISO C standard, is also described for reference in [5], and does what is expected of it, i.e. returns the floor of its argument, that is, the largest integer value less than or equal to it.