Project overview
Speech is a fundamental part of human connection, but for those living with dysarthria – a condition that results from neuro-motor disorders that cause disturbances in muscular control during articulation of speech[1] – it poses a unique set of challenges.
In this blog post we focus on the intricate task of analyzing dysarthric speech using Dynamic Time Warping (DTW). DTW is a method for comparing two sequences (in this case, audio sequences) that do not line up perfectly in time: it finds an optimal alignment between two time series by warping them along the time dimension. We wanted to explore this method for analyzing dysarthric speech because it allows us not only to match the like-sounding phonemes, phoneme durations and pitch features of a typical speaker to those of an atypical speaker, but also to identify which speech features atypical speakers are unable, or less able, to produce.
Understanding DTW
Before we get into the weeds of speech processing and analysis, let’s first break down the fundamentals of dynamic time warping. As mentioned previously, DTW allows us to compare two time series that do not perfectly align with each other. To demonstrate this, let’s create two time series, A and B:
A = [4, 2, 2, 4, 5, 3, 2]
B = [3, 1, 0, 3, 3, 1, 0, 1, 1, 2]
In this example, time series A and B are of different lengths. This is intentional: when we analyze atypical speech, some speakers may stutter or take more time pronouncing each syllable of a word or phrase. This is where DTW shines. It generates a warping path W that maps the elements of A to the elements of B so as to minimize the distance between them, regardless of the difference in length.
The code used to generate these plots can be found at: https://builtin.com/data-science/dynamic-time-warping

We can fill in the accumulated cost matrix D – from which the optimal path is then read off – cell by cell with the recurrence:
D[i, j] = |A[i] – B[j]| + min(D[i – 1, j – 1], D[i – 1, j], D[i, j – 1])
A and B are our two time series, with i and j the respective index positions, and D is our accumulated cost (distance) matrix. The minimum in the right half of the equation is taken over the three previously computed neighboring cells, shown in the diagram on the left, and determines whether the path moves horizontally, vertically or diagonally.
For example, for D[1, 1] the formula gives us: D[1, 1] = |2 – 1| + min(D[0, 0], D[0, 1], D[1, 0]) = 1 + min(1, 4, 2) = 2. Since the minimum value in the right half of the equation comes from D[0, 0], the path moves diagonally from D[0, 0] to D[1, 1].
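To make the recurrence concrete, here is a minimal NumPy sketch – separate from the plotting code linked above – that fills the accumulated cost matrix for A and B and backtracks the warping path:

import numpy as np

def dtw_cost_and_path(A, B):
    """Fill the accumulated cost matrix D with the recurrence above and
    backtrack the warping path from the bottom-right corner."""
    n, m = len(A), len(B)
    D = np.zeros((n, m))
    D[0, 0] = abs(A[0] - B[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Previously computed neighbors: diagonal, vertical, horizontal
            neighbors = []
            if i > 0 and j > 0:
                neighbors.append(D[i - 1, j - 1])
            if i > 0:
                neighbors.append(D[i - 1, j])
            if j > 0:
                neighbors.append(D[i, j - 1])
            D[i, j] = abs(A[i] - B[j]) + min(neighbors)
    # Backtrack from the bottom-right corner, always stepping to the cheapest neighbor
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((D[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((D[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((D[i, j - 1], (i, j - 1)))
        _, (i, j) = min(candidates)
        path.append((i, j))
    return D, path[::-1]

A = [4, 2, 2, 4, 5, 3, 2]
B = [3, 1, 0, 3, 3, 1, 0, 1, 1, 2]
D, path = dtw_cost_and_path(A, B)
print(D[1, 1])   # 2.0, matching the worked example above
print(path)      # the warping path W: every element of A maps to at least one element of B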
If we map the two time series onto each other we get the diagram on the right. If we imagine our time series as audio samples from two speakers, we would expect the aligned points to correspond to things like matching phonemes.
Although this example highlights the benefits of the DTW algorithm, the diagram also reveals an area where DTW falters: every point MUST be mapped to at least one point of the other series. So what does this mean if we have outlier points? Regardless of whether outliers are present, dynamic time warping will map every point of our two time series. An improved version of the algorithm, Drop-DTW, was proposed by Nikita Dvornik et al.; as the name suggests, it drops outlier data points as part of the alignment.
You can read more on Drop-DTW here.

Applying DTW to our Dataset
First, let’s install and import our dependencies:
import os
import matplotlib.lines  # needed for matplotlib.lines.Line2D in the warping-path plot below
import matplotlib.pyplot as plt
import librosa, librosa.display
import IPython.display as ipd
import numpy as np
import pandas as pd
from datasets import load_dataset, Audio
from encodec.utils import convert_audio
from encodec import EncodecModel
import torchaudio, torch
Processing our Dataset
To apply DTW to atypical and typical speech we utilized the UA-Speech dataset. This dataset was published by the University of Illinois, with contributions by Heejin Kim, Simone Frame, Harsh Vardhan Sharma and Xi Zhou. Subjects in the dataset include 15 speakers with Cerebral Palsy and 13 age-matched healthy controls. Each speaker’s recordings contain the following prompts:
- Digits (10 words x 3 reps): “One, two, three….”
- Letters (26 words x 3 reps): the 26 letters of the international radio alphabet: “Alpha, Bravo, Charlie,…”
- Computer Commands (19 words x 3 reps): common computer commands, e.g. “paragraph,…”
- Common Words (100 words x 3 reps): the 100 most common words in the Brown corpus: “the, of, and,…”
- Uncommon Words (300 words x 1 rep): 300 words selected from Project Gutenberg novels using an algorithm that sought to maximize biphone diversity, e.g. “naturalization, faithfulness, frugality,…”
There are three folders – noisereduce, normalized and original – which provide the same speaker utterances with varying levels of background noise and overall clarity. For this example we will be using the noise-reduced set, as it provides the clearest version of each speaker without background noise or microphone static. Locate the directory where your data is stored and choose which speakers’ utterances you will be using.
speaker_M05 = DATA_PATH + "/audio/noisereduce/M05"
speaker_F04 = DATA_PATH + "/audio/noisereduce/F04"
control_CM05 = DATA_PATH + "/audio/noisereduce/CM05"
control_CF04 = DATA_PATH + "/audio/noisereduce/CF04"
The speaker codes are as follows: M – Male, F – Female, CM – Control Male, CF – Control Female. For our analysis we handpicked a few prompts that contain a variety of phonetic sounds as well as varying levels of pronunciation difficulty. Each utterance is associated with a code, which can be found in the downloaded dataset (e.g. “Seven” -> “D7”).
utterance_dict = {'Seven': 'D7',
                  'Whiskey': 'LW',
                  'Quebec': 'LQ',
                  'Paragraph': 'C11',
                  'your': 'CW37',
                  'there': 'CW40',
                  'people': 'CW78',
                  'naturalization': 'B1_UW1',
                  'Pennsylvania': 'B1_UW17',
                  'Iroquois': 'B3_UW10'}
Ultimately, to perform DTW on audio samples you only need the paths to the audio files. For the sake of reproducibility and readability, we opted to use a dictionary with the phrase as the key and the associated wav file as the value. The block below builds one of these dictionaries for each speaker we will use.
def create_speaker_dict(speaker_wavfiles, utterance_dict):
    speaker_dict = {}
    files = os.listdir(speaker_wavfiles)
    for phrase, utterance_code in utterance_dict.items():
        idx = [i for i, s in enumerate(files) if utterance_code in s]
        # Get first instance only
        idx = idx[0]
        speaker_dict[phrase] = files[idx]
    return speaker_dict
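The later code blocks reference per-speaker dictionaries such as M05_dict and CM05_dict; they can be built with the helper above, for example:

M05_dict = create_speaker_dict(speaker_M05, utterance_dict)
F04_dict = create_speaker_dict(speaker_F04, utterance_dict)
CM05_dict = create_speaker_dict(control_CM05, utterance_dict)
CF04_dict = create_speaker_dict(control_CF04, utterance_dict)
print(M05_dict['Whiskey'])   # wav filename for speaker M05's first recording of "Whiskey"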
Visualizing Audio Data with Spectrograms
Visualizing audio data is a crucial step in understanding speech patterns. We use spectrograms to represent the frequency content of our audio samples over time. The visual representation aids in identifying distinctive features and patterns in dysarthric speech. Spectrograms are especially helpful in highlighting nuances that may not be apparent through conventional auditory analysis.
def plot_magnitude_spectrum(speaker, speaker_path, speaker_dict, title):
    """
    speaker: the name of the speaker
    speaker_path: the path to the speaker's utterances
    speaker_dict: dictionary containing the word being uttered and the wav file
    title: the word being uttered
    """
    utterance, sr = librosa.load(os.path.join(speaker_path, speaker_dict[title]))
    display = librosa.stft(utterance)
    S_db = librosa.amplitude_to_db(np.abs(display), ref=np.max)
    fig, ax = plt.subplots()
    img = librosa.display.specshow(S_db, x_axis='time', y_axis='linear', ax=ax)
    graph_title = title + " " + speaker
    ax.set(title=graph_title)
    fig.colorbar(img, ax=ax, format="%+2.f dB")
    plt.show()
    ipd.display(ipd.Audio(os.path.join(speaker_path, speaker_dict[title])))
for i in range(len(CM05_dict)):
    plot_magnitude_spectrum("CM05", control_CM05, CM05_dict, list(CM05_dict.keys())[i])
    plot_magnitude_spectrum("M05", speaker_M05, M05_dict, list(M05_dict.keys())[i])
Running this script plots spectrograms for each sample speaker and their respective control speaker. The ipd.display() call also embeds the audio of each speaker’s utterance of a given phrase, so there is an audible reference for each spectrogram.


The phonetic breakdown of the word ‘whiskey’ in the ARPAbet (Advanced Research Projects Agency phonetic alphabet) is W IH S K IY. Our first spectrogram shows the speaker with Cerebral Palsy saying the word ‘whiskey’. Compared to our control speaker, we can see this speaker takes more time, drawing out the overall enunciation of the word. When we listen to the recording we can also hear some difficulty in pronouncing the “W IH” sound, which is fairly common amongst the other speakers with Cerebral Palsy in the dataset.
Generate Audio Codes
To create our time series for each speaker, we utilize Meta’s EnCodec audio compression architecture. We use it because a raw waveform 3 seconds in length at a sample rate of 16 kHz is a vector of length 48,000 – far too long to work with comfortably. By using EnCodec we can compress our audio waveforms into much smaller sequences of discrete codes while retaining key information. Conceptually this is similar to using Singular Value Decomposition (SVD) for image compression. To learn more about Meta’s EnCodec architecture, you can visit their GitHub page here. EnCodec has also been added to the Hugging Face Transformers library, which you can find here.
def process_audio_codes(speaker_path, file_path):
    model = EncodecModel.encodec_model_24khz()
    audio_path = os.path.join(speaker_path, file_path)
    wav, sr = torchaudio.load(audio_path)
    wav = convert_audio(wav, sr, model.sample_rate, model.channels)
    wav = wav.unsqueeze(0)
    # Extract discrete codes from EnCodec
    with torch.no_grad():
        encoded_frames = model.encode(wav)
    codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1)  # [B, n_q, T]
    codes_np = codes.cpu().detach().numpy()
    codes_np = codes_np.squeeze()
    return codes_np, sr
codes_M05, sr = process_audio_codes(speaker_M05, M05_dict['Whiskey'])
codes_CM05, sr = process_audio_codes(control_CM05, CM05_dict['Whiskey'])
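As an optional sanity check on how much shorter the code sequence is than the raw waveform, we can compare their shapes (the exact numbers will depend on the recording’s length and on EnCodec’s bandwidth setting):

# Length of the raw waveform vs. length of the EnCodec code sequence
whiskey_raw, _ = librosa.load(os.path.join(speaker_M05, M05_dict['Whiskey']))
print(whiskey_raw.shape)   # tens of thousands of samples for a few seconds of audio
print(codes_M05.shape)     # (n_q, T): a few codebooks over far fewer time steps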
To simplify the dynamic time warping process, we can utilize the librosa.sequence.dtw() method. This creates our cost matrix D for us, in addition to returning the warp path between our two time series.
D, wp = librosa.sequence.dtw(X=codes_CM05, Y=codes_M05, metric='cosine')
Next, we utilize librosa.frames_to_time() to convert frame counts to time (seconds). This method takes in our warp path, sampling rate and the hop length, which is the number of samples between successive frames. Then we can plot our graph.
hop_size = 256
wp_s = librosa.frames_to_time(wp, sr=sr, hop_length=hop_size)
fig, ax = plt.subplots()
img = librosa.display.specshow(D, x_axis='time', y_axis='time', sr=sr,
                               cmap='gray_r', hop_length=hop_size, ax=ax)
ax.plot(wp_s[:, 1], wp_s[:, 0], marker='o', color='r')
ax.set(title='Warping Path on Acc. Cost Matrix $D$',
       xlabel='Time $(X_2)$', ylabel='Time $(X_1)$')
fig.colorbar(img, ax=ax)

Similar to the DTW heatmap example, we can see the warp path between our two time series (X1 is the dysarthric speaker and X2 is our control speaker). If we were to use the same audio sequence on both our X and Y axes, we would see the warping path as a straight diagonal line over the cost matrix. As we would expect from analyzing the spectrograms, our two time series do not perfectly align, as shown by the small horizontal and vertical segments of our path and the much longer segment starting around (1.6, 1.4).
It is not apparent from this graph what our dysarthric speaker is saying during the segment that gets mapped to the control speaker, but we can get a clearer idea by using the warping path to plot the two audio waveforms together.
whiskey_M05, sr = librosa.load(os.path.join(speaker_M05,M05_dict['Whiskey']))
whiskey_CM05, sr = librosa.load(os.path.join(control_CM05,CM05_dict['Whiskey']))
fig = plt.figure(figsize=(16, 8))
plt.subplot(2, 1, 1)
librosa.display.waveshow(whiskey_CM05, sr=sr)
plt.title('Control Speaker (Whiskey)')
ax1 = plt.gca()
plt.subplot(2, 1, 2)
librosa.display.waveshow(whiskey_M05, sr=sr)
plt.title('Dysarthric Speaker (Whiskey)')
ax2 = plt.gca()
plt.tight_layout()
trans_figure = fig.transFigure.inverted()
lines = []
arrows = 30
points_idx = np.int16(np.round(np.linspace(0, wp.shape[0] - 1, arrows)))
for tp1, tp2 in wp[points_idx] * hop_size / sr:
    # get position on axis for a given index-pair
    coord1 = trans_figure.transform(ax1.transData.transform([tp1, 0]))
    coord2 = trans_figure.transform(ax2.transData.transform([tp2, 0]))
    # draw a line
    line = matplotlib.lines.Line2D((coord1[0], coord2[0]),
                                   (coord1[1], coord2[1]),
                                   transform=fig.transFigure,
                                   color='r')
    lines.append(line)
fig.lines = lines
plt.tight_layout()
The code block above draws our warping path directly on the time-domain signals, using red lines to connect corresponding time positions in the two recordings. This highlights which parts of the waveforms map to each other.

Next Steps
Now that we have several ways to view and analyze dysarthric speech, where do we go from here? Ideally, we can use the results of dynamic time warping to reconstruct dysarthric speech so that it better matches the speech patterns of a non-impaired speaker. However, dysarthric speakers will produce outlier speech patterns, such as stuttering or an inability to pronounce certain phonemes, that will not align to control speakers under regular DTW. Further research into the Drop-DTW algorithm could prove beneficial for addressing these outliers. In its current form, Drop-DTW has been used to align video to video and text to video, so for this work we would need to repurpose its codebase to handle audio-to-audio sequences.
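As a very rough starting point – and only an approximation of what Drop-DTW actually does, since the real algorithm learns drop costs and decides which elements to drop jointly with the alignment – one could simply discard EnCodec frames that are far (in cosine distance) from every frame of the control sequence, then run ordinary DTW on what remains. The helper and threshold below are hypothetical, for illustration only:

from scipy.spatial.distance import cdist

def drop_then_dtw(X, Y, drop_threshold=0.5):
    """Naive stand-in for Drop-DTW: drop frames of Y whose cosine distance
    to every frame of X exceeds a threshold, then run plain DTW.
    X, Y are feature matrices of shape (n_features, n_frames)."""
    dist = cdist(Y.T, X.T, metric='cosine')      # pairwise distances, (frames_Y, frames_X)
    keep = dist.min(axis=1) < drop_threshold     # keep frames with at least one close match in X
    Y_kept = Y[:, keep]
    D, wp = librosa.sequence.dtw(X=X, Y=Y_kept, metric='cosine')
    return D, wp, keep

# Hypothetical usage with the EnCodec codes computed earlier
D_drop, wp_drop, kept = drop_then_dtw(codes_CM05.astype(float), codes_M05.astype(float))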