In this tutorial we will talk about mel-frequency feature generation. I assume that the reader is familiar with the Discrete Fourier Transform, so I will omit the conversion from the time domain to the frequency domain.
For most audio processing tasks, working with frequency-domain data instead of the original time series is more intuitive and more useful. Mel-frequency features are a transformed version (we will see how the transformation is done and the motivation for it) of the frequencies produced by the Discrete Fourier Transform. This transformation is useful for audio processing tasks and reduces the feature dimension. Let's assume that we have an audio signal.
%matplotlib inline
import os
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

# Read the recording from the current working directory.
cur_dir = os.getcwd()
wav_fname = os.path.join(cur_dir, 'OSR_us_000_0010_8k.wav')
sr, x = wavfile.read(wav_fname)

n_sample = x.shape[0]
t_end_sec = n_sample / sr

# Plot the raw waveform.
ts = np.linspace(0, t_end_sec, n_sample)
plt.figure()
plt.plot(ts, x)
plt.title('Audio Sample (in Time-Domain)')
plt.xlabel('Time (sec)')
plt.ylabel('Magnitude')
plt.show()
The time-series representation of the audio signal can be seen in the figure above. From the figure, we can tell that some sections are silent and in some sections the speaker talks, but beyond that we cannot say much. Now, consider the spectrogram image of the same recording.
from scipy import signal

# Spectrogram parameters: 400-sample windows with no overlap.
n_fft = 400
noverlap = 0
nperseg = 400
f, t, Sxx = signal.spectrogram(x, sr, nperseg=nperseg, noverlap=noverlap, nfft=n_fft)
# A small epsilon avoids log10(0) for completely silent bins.
Sxx_log = np.log10(Sxx + np.finfo(float).eps)
plt.figure()
plt.imshow(Sxx_log, cmap='jet', origin='lower', aspect='auto', extent=[t[0], t[-1], f[0], f[-1]])
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.title("Spectrogram of the Record")
plt.colorbar()
plt.show()
The spectrogram image shows more information. It tells which parts are silent and which parts contain speech (similar to the time-domain version). Additionally, it shows which frequencies the speech contains and the magnitude of each frequency component. The frequency-domain representation is more useful for audio processing tasks.
However, from the figure above, we can see that the spectrogram image has a lot of redundancy. After 2000 Hz there is really no change. To see the trend better, consider the plot of the energy of each frequency component for the figure above.
# Total (log) energy of each frequency bin over the whole recording.
freq_energies = np.sum(Sxx_log, axis=1)
plt.plot(f, freq_energies)
plt.ylabel('Energy')
plt.xlabel('Frequency [Hz]')
plt.show()
It is better to smooth the plot out to see the trend. Below, you can see a smoothed version of the plot above.
# Smooth with a 30-tap moving-average filter.
n_avg_filter = 30
avg_filter = np.ones(n_avg_filter)/n_avg_filter
freq_energies_smoothed = np.convolve(freq_energies, avg_filter, mode='valid')
# 'valid' convolution shortens the output, so trim the frequency axis to match.
plt.plot(f[n_avg_filter//2: -n_avg_filter//2+1], freq_energies_smoothed)
plt.ylabel('Energy')
plt.xlabel('Frequency [Hz]')
plt.show()
From the figure above, we can see that energy decreases with a logarithmic trend. This means that in a typical audio signal (especially speech), most of the action takes place at lower frequencies; higher frequencies carry less information.
Hence it is redundant and wasteful to use the spectrogram image as-is for features. It is a very sparse representation of the audio signal: it doesn't emphasize the sections where more information is contained, and similarly doesn't de-emphasize the sections where less information is contained.
Studies on human audio perception[1] observed a similar pattern. Our ear is more alert to changes at lower frequencies, but cannot differentiate well between higher frequencies.
In experiments where listeners were asked to list pitches that sound equally spaced from one another, the table below (or similar tables) was obtained.
Hz | 20 | 160 | 394 | 670 | 1000 | 1420 | 1900 | 2450 | 3120 | 4000 | 5100 | 6600 | 9000 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mel | 0 | 250 | 500 | 750 | 1000 | 1250 | 1500 | 1750 | 2000 | 2250 | 2500 | 2750 | 3000 |
Here Hz represents the actual frequency played, and mel represents the equidistant pitches as judged by humans. If the table above is fitted to a curve (the fit may not produce exactly the formula given), the transformations below can be obtained to convert between Hertz and mel frequencies.
m = 2595 * log10(1 + f/700)
f = 700 * (10^(m/2595) - 1)
For these conversions, we can use the code below.
def hz_to_mel(freq):
    # Convert frequency in Hz to the mel scale.
    return 2595 * np.log10(1 + freq/700)

def mel_to_hz(mel):
    # Convert mel-scale values back to Hz.
    return 700 * (10**(mel/2595) - 1)
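As a quick sanity check, these functions agree with the table above (1000 Hz maps to roughly 1000 mel) and invert each other:
print(hz_to_mel(1000))             # ~1000 mel, matching the table
print(mel_to_hz(hz_to_mel(2450)))  # ~2450 Hz, the round trip recovers the input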
If we were to take 40 equidistant mel frequencies covering the range between 0 and 4000 Hz, we would obtain the plot below.
n_mel = 40
mel_max = hz_to_mel(4000)
mels = np.linspace(0, mel_max, n_mel)
freqs = mel_to_hz(mels)
plt.figure()
plt.plot(freqs, mels, '*')
plt.xlabel('Frequencies')
plt.ylabel('Mel Frequencies')
plt.show()
This plot shows the mapping between frequencies and their corresponding mel frequencies (it is basically an enriched, plotted version of the table constructed in the experiments). With this information we can calculate a more condensed representation of the spectrogram image. For easy reference, let's recreate the spectrogram image.
plt.imshow(Sxx_log, cmap='jet', origin='lower', aspect='auto', extent=[t[0], t[-1], f[0], f[-1]])
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.title("Spectrogram of the Record")
plt.colorbar()
plt.show()
The y-axis of the spectrogram image covers the frequencies between 0 and 4000 Hz and has size 201 (n_fft//2 + 1). One of the main motivations of the mel-frequency transformation is to reduce dimension and de-emphasize irrelevant sections (decrease sparsity in the data). Let's convert the spectrogram above to 40 mel-frequency features. This transformation would turn the 201x672 original spectrogram image into a 40x672 mel-spectrogram image.
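We can verify the frequency dimension directly (the number of time frames, 672 here, depends on the length of the recording):
# The frequency axis of the spectrogram has n_fft//2 + 1 bins.
assert Sxx_log.shape[0] == n_fft//2 + 1
print(Sxx_log.shape)  # (201, n_frames)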
During dimension reduction we want to squeeze as much information as possible into the mel-frequency representation. Hence the representation should consist of equidistant mel frequencies.
mel_min = hz_to_mel(0)
mel_max = hz_to_mel(4000)
mel_frequencies = np.linspace(mel_min, mel_max, n_mel+2)
In the calculation above we found the lower and upper boundaries in mel units, then took equally spaced points between them (40 interior points, or 42 mel frequencies in total when the two boundaries are included). The resulting frequency to mel-frequency mapping can be seen in the plot below.
freqs = mel_to_hz(mel_frequencies)
plt.figure()
plt.plot(freqs, mel_frequencies, '*')
plt.xlabel('Frequencies')
plt.ylabel('Mel Frequencies')
plt.show()
One way to convert the spectrogram image to a mel-spectrogram image is as follows. We now have frequency values that are equidistant on the mel scale. Find the spectrogram bin of each of these frequencies, then select those bins from the spectrogram image. The code below accomplishes this.
freq_max = sr/2
n_dim = n_fft//2+1 # 201
# Map each boundary frequency to its spectrogram row index. Clipping to
# n_dim-1 guards against an out-of-range index when a frequency lands
# exactly on freq_max.
indices = np.minimum(np.floor(n_dim * freqs/freq_max).astype(np.int32), n_dim - 1)
mel_naive = Sxx_log[indices, :]
plt.imshow(mel_naive, cmap='jet', origin='lower', aspect='auto', extent=[t[0], t[-1], mel_frequencies[0], mel_frequencies[-1]])
plt.ylabel('Mel Frequency')
plt.xlabel('Time [sec]')
plt.title("Mel-spectrogram of the Record (Naive version)")
plt.colorbar()
plt.show()
What we did above is essentially select around 40 of the 201 rows in the original spectrogram image. This may seem plausible, and the final image is OK. However, it is arbitrary and prone to losing information. Assume we have found that rows 0, 1, 3, 5, 7, 9 ... 190, 200 should be used as representatives. This operation doesn't consider any information in rows 2, 4, ... 191, 192, etc.
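We can make this concrete by counting how many of the spectrogram rows the naive selection actually touches:
# Rows of the spectrogram used by the naive selection above.
used_rows = np.unique(indices)
print(f'{used_rows.size} of {n_dim} rows used, {n_dim - used_rows.size} ignored')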
What we really want is to fuse the values at intermediate frequencies into the final result. This way the final representation is more robust and contains information from all of the rows. One way to do this is to use a triangular filter between adjacent bins. If the neighboring bin indices are 23, 26, and 29, the value for the 26th bin would be a weighted sum of bins 23 through 29, where the weight is 1 at the center (26), the weights are 0 at the tails (23 and 29), and the in-between weights change linearly. The image further below shows the weights for our case (40 mel frequencies constructed over 201 frequency bins); first, a single filter is sketched for the example indices.
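As a small illustration, using the hypothetical 23/26/29 bin indices from the paragraph above, a single triangular filter can be built like this:
tri = np.zeros(n_dim)
l, c, r = 23, 26, 29  # hypothetical left edge, center, right edge
for idx in range(l, c + 1):
    tri[idx] = (idx - l) / (c - l)  # rising edge: 0 at l, 1 at c
for idx in range(c, r + 1):
    tri[idx] = (r - idx) / (r - c)  # falling edge: 1 at c, 0 at r
print(tri[l:r + 1])  # [0, 1/3, 2/3, 1, 2/3, 1/3, 0]
The full filter bank below applies the same construction to every adjacent triple of boundary indices.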
filter_bank = np.zeros((n_mel, n_dim))
for i in range(n_mel):
    cur_weights = np.zeros(n_dim, )
    # Left edge, center, and right edge of the i-th triangular filter.
    l_idx = indices[i]
    c_idx = indices[i+1]
    r_idx = indices[i+2]
    # Rising edge: weight grows linearly from 0 at l_idx to 1 at c_idx.
    l_range = (c_idx-l_idx)
    for idx in range(l_idx, c_idx):
        cur_weights[idx] = (idx-l_idx)/l_range
    # Falling edge: weight shrinks linearly from 1 at c_idx to 0 at r_idx.
    r_range = (r_idx-c_idx)
    for idx in range(c_idx, r_idx):
        cur_weights[idx] = (r_idx-idx)/r_range
    filter_bank[i, :] = cur_weights
plt.figure()
plt.imshow(filter_bank, cmap=plt.get_cmap("gray"), aspect='auto', origin='lower')
plt.xlabel("Frequency Indices")
plt.ylabel("Mel-Frequency Indices")
plt.title("Filter Bank Weights")
plt.colorbar()
plt.show()
From the image we can see that at lower frequencies the filter is local and considers very few bins, while at higher frequencies the filter spreads out and considers larger sections. We can think of it like this: at the frequencies where the human ear is sensitive (and information is high), we fuse small sections so as not to mix useful information; as sensitivity (and information) decreases, we fuse larger sections, since they are not very helpful anyway.
Please note that if we sum the weight coefficients along the y-axis, we obtain 1 for every interior column (the assertion below checks indices 1 through 189, i.e. the frequency bins that lie between the first and last filter centers). The triangular filters fuse the frequencies such that every frequency is considered in the final representation and their relative ratios are unchanged (none of them is emphasized or de-emphasized).
# Interior columns sum to 1; a tolerance is used since exact float
# equality can be fragile.
assert np.allclose(np.sum(filter_bank, axis=0)[1:190], np.ones(189, ))
When we apply this weighting to the original spectrogram image, we obtain the mel-spectrogram image. The code to do so is as follows.
# Each mel row is a weighted sum of spectrogram rows: a matrix product.
mel_img = np.dot(filter_bank, Sxx_log)
plt.figure()
plt.imshow(mel_img, cmap='jet', origin='lower', aspect='auto', extent=[t[0], t[-1], mel_frequencies[0], mel_frequencies[-1]])
plt.ylabel('Mel Frequency')
plt.xlabel('Time [sec]')
plt.title("Mel Spectrogram Image")
plt.colorbar()
plt.show()
In the mel-spectrogram image above we can see that the redundant sections are smaller compared to the original spectrogram image. The general trends are also similar to the original image: we have not distorted the image, although we have reduced the dimension by 80% (from 201 to 40).
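The reduction is easy to confirm from the array shapes (the number of time frames depends on the length of the recording):
print(Sxx_log.shape)  # (201, n_frames): original spectrogram
print(mel_img.shape)  # (40, n_frames): mel-spectrogram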
That's it :)
I have tried to explain the motivation and rationale behind the mel-spectrogram and mel-frequency features. Without knowing the assumptions and rationale, one cannot use this technique effectively. For instance, the transformation used here is based on experiments on human audio perception. Hence, when using mel features outside human-audio tasks, one should check whether the assumptions of this setting hold for the task at hand.
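As a side note, in practice you would usually rely on a library implementation rather than building the filter bank by hand. Assuming librosa is installed, a sketch like the one below should produce a comparable filter bank (htk=True selects the 2595*log10 formula used in this tutorial; librosa defaults to a slightly different Slaney-style mel scale and normalizes the filters, so the weights will not match ours exactly):
import librosa

# 40-filter mel bank for an 8 kHz signal with a 400-point FFT.
lib_bank = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mel, htk=True)
print(lib_bank.shape)  # (40, 201), same layout as our filter_bank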
[1] https://en.wikipedia.org/wiki/Mel_scale
[2] https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html