Have you ever considered how computers can understand words that sound alike, even if they are spelled differently? In the world of sound and audio, especially when dealing with names or spoken words in noisy environments – perhaps even the sounds of a bustling street – phonetic matching becomes incredibly valuable. This is where algorithms like SoundEx come into play.
SoundEx is a phonetic algorithm for indexing names by sound, as pronounced in English. It’s particularly useful when searching databases where you might not know the exact spelling of a name but have a rough idea of how it sounds. Let’s delve into a JavaScript implementation of the SoundEx algorithm to understand how it works and its potential applications.
What is the SoundEx Algorithm?
The SoundEx algorithm works by converting a word into a four-character code. This code is built based on the first letter of the word and the phonetic sounds of the subsequent consonants. Vowels and certain consonants are ignored as they often contribute less to the perceived sound of a word, especially in name variations.
Here’s a breakdown of the SoundEx encoding process:
- Retain the First Letter: The first letter of the word is always retained as the first character of the SoundEx code.
- Consonant Grouping: Subsequent consonants are replaced by digits based on the following groups:
  - 1: B, F, P, V
  - 2: C, G, J, K, Q, S, X, Z
  - 3: D, T
  - 4: L
  - 5: M, N
  - 6: R
- Ignore Certain Letters: The vowels (A, E, I, O, U, Y) and the consonants H, W are typically ignored unless they are the first letter.
- Reduce Consecutive Duplicates: If two or more consecutive letters have the same SoundEx code, they are reduced to a single digit.
- Zero Padding: The resulting code is truncated to four characters. If it’s less than four characters, it’s padded with zeros to reach a four-character length.
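The steps above can be sketched in a few lines of JavaScript. This is a minimal illustration rather than the full implementation discussed in the next section: the names SOUNDEX_DIGITS and classicSoundex are made up for this example, and it assumes H and W are treated exactly like vowels (as placeholders that are dropped after duplicate removal).

```javascript
// Minimal sketch of the listed steps (classic four-character code).
// Assumption: H and W behave like vowels, i.e. they separate consonants
// and are dropped after duplicate removal.
var SOUNDEX_DIGITS = {
  B: "1", F: "1", P: "1", V: "1",
  C: "2", G: "2", J: "2", K: "2", Q: "2", S: "2", X: "2", Z: "2",
  D: "3", T: "3",
  L: "4",
  M: "5", N: "5",
  R: "6"
};

function classicSoundex(word) {
  // Keep letters only, in upper case.
  var letters = String(word).toUpperCase().replace(/[^A-Z]/g, "");
  if (letters.length === 0) {
    return "";
  }
  var digits = "";
  var previous = "";
  for (var i = 0; i < letters.length; i++) {
    // Vowels, Y, H and W have no table entry and become the placeholder "0".
    var code = SOUNDEX_DIGITS[letters.charAt(i)] || "0";
    // Collapse runs of the same code into a single digit.
    if (code !== previous) {
      digits += code;
    }
    previous = code;
  }
  // Drop the first letter's own code, then drop the "0" placeholders.
  digits = digits.substring(1).replace(/0/g, "");
  // Pad with zeros and cut to four characters.
  return (letters.charAt(0) + digits + "000").substring(0, 4);
}

console.log(classicSoundex("Robert"));  // R163
console.log(classicSoundex("Rupert"));  // R163
console.log(classicSoundex("Pfister")); // P236
```

"Robert" and "Rupert" collapse to the same code, and in "Pfister" the F disappears because it shares the digit 1 with the leading P.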
JavaScript Implementation of SoundEx
Below is a JavaScript function that implements the SoundEx algorithm. This code is adapted from a publicly available version and is designed for clarity and functionality.
function SoundEx(WordString, LengthOption, CensusOption) {
var TmpStr;
var WordStr = "";
var CurChar;
var LastChar;
var SoundExLen = 10;
var WSLen;
var FirstLetter;
if (CensusOption) {
LengthOption = 4;
}
if (LengthOption != undefined) {
SoundExLen = LengthOption;
}
if (SoundExLen > 10) {
SoundExLen = 10;
}
if (SoundExLen < 4) {
SoundExLen = 4;
}
if (!WordString) {
return ("");
}
WordString = WordString.toUpperCase();
/* Clean and tidy */
WordStr = WordString;
WordStr = WordStr.replace(/[^A-Z]/gi, " "); // rpl non-chars w space
WordStr = WordStr.replace(/^\s*/g, ""); // remove leading space
WordStr = WordStr.replace(/\s*$/g, ""); // remove trailing space
/* Some of our own improvements */
if (!CensusOption) {
/* v1.0e: GH at beginning of word has G-sound (e.g., ghost) */
WordStr = WordStr.replace(/^GH/g, "G"); // Change leading GH to G
WordStr = WordStr.replace(/DG/g, "G"); // Change DG to G
WordStr = WordStr.replace(/GH/g, "H"); // Change GH to H
WordStr = WordStr.replace(/GN/g, "N"); // Change GN to N
WordStr = WordStr.replace(/KN/g, "N"); // Change KN to N
WordStr = WordStr.replace(/PH/g, "F"); // Change PH to F
WordStr = WordStr.replace(/MP([STZ])/g, "M$1"); // Drop P from MP if followed by S, T, or Z
WordStr = WordStr.replace(/^PS/g, "S"); // Change leading PS to S
WordStr = WordStr.replace(/^PF/g, "F"); // Change leading PF to F
WordStr = WordStr.replace(/MB/g, "M"); // Change MB to M
WordStr = WordStr.replace(/TCH/g, "CH"); // Change TCH to CH
}
/* The above improvements may
* have changed this first letter
*/
FirstLetter = WordStr.substr(0, 1);
/* in case 1st letter is
* an H or W and we're in
* CensusOption = 1
*/
if (FirstLetter == "H" || FirstLetter == "W") {
TmpStr = WordStr.substr(1);
WordStr = "-";
WordStr += TmpStr;
}
/* In properly done census
* SoundEx the H and W will
* be squeezed out before
* performing the test
* for adjacent digits
* (this differs from how
* 'real' vowels are handled)
*/
if (CensusOption == 1) {
WordStr = WordStr.replace(/[HW]/g, ".");
}
/* Begin Classic SoundEx */
WordStr = WordStr.replace(/[AEIOUYHW]/g, "0");
WordStr = WordStr.replace(/[BPFV]/g, "1");
WordStr = WordStr.replace(/[CSGJKQXZ]/g, "2");
WordStr = WordStr.replace(/[DT]/g, "3");
WordStr = WordStr.replace(/[L]/g, "4");
WordStr = WordStr.replace(/[MN]/g, "5");
WordStr = WordStr.replace(/[R]/g, "6");
/* Properly done census:
* squeeze H and W out
* before doing adjacent
* digit removal.
*/
if (CensusOption == 1) {
WordStr = WordStr.replace(/\./g, ""); // remove the "." placeholders left for H and W
}
/* Remove extra equal adjacent digits */
WSLen = WordStr.length;
LastChar = "";
TmpStr = "";
// removed v10c djr: TmpStr = "-"; /* rplcng skipped first char */
for (var i = 0; i < WSLen; i++) {
CurChar = WordStr.charAt(i);
if (CurChar == LastChar) {
TmpStr += " ";
} else {
TmpStr += CurChar;
LastChar = CurChar;
}
}
WordStr = TmpStr;
WordStr = WordStr.substr(1); /* Drop first letter code */
WordStr = WordStr.replace(/\s/g, ""); /* remove spaces */
WordStr = WordStr.replace(/0/g, ""); /* remove zeros */
WordStr += "0000000000"; /* pad with zeros on right */
WordStr = FirstLetter + WordStr; /* Add first letter of word */
WordStr = WordStr.substr(0, SoundExLen); /* size to taste */
return (WordStr);
}
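This function, SoundEx(WordString, LengthOption, CensusOption), takes a word as input (WordString) and optionally lets you specify the length of the SoundEx code (LengthOption) and whether to use census-style SoundEx (CensusOption). LengthOption is clamped to the range 4 to 10 and defaults to 10, so the function returns a ten-character code unless told otherwise. Setting CensusOption forces the classic length of 4 and skips the extra spelling adjustments (GH, KN, PH, and so on).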
Let’s see some examples of how this function works:
alert(SoundEx("paul") + " " + SoundEx("ball"));
alert(SoundEx("paul").substring(1) + " " + SoundEx("ball").substring(1));
alert(SoundEx("car") + " " + SoundEx("truck"));
alert(SoundEx("car").substring(1) + " " + SoundEx("truck").substring(1));
These examples show how words that sound similar, like “paul” and “ball”, receive codes that match after the first letter, while words that sound different, such as “car” and “truck”, get different digits. The first letter is always preserved, and the digits that follow represent the consonant sounds, so stripping the first character with substring(1) compares the digits alone.
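The examples above use the default mode. The census-style option changes one detail that is easy to miss: H and W are removed before adjacent duplicates are collapsed, whereas vowels keep duplicates apart. The sketch below isolates just that rule. It is a simplified stand-in, not the function above: soundexWithHWRule and its censusStyle flag are hypothetical names, and the code always returns four characters.

```javascript
// Sketch only: soundexWithHWRule and censusStyle are hypothetical names.
function soundexWithHWRule(word, censusStyle) {
  var letters = String(word).toUpperCase().replace(/[^A-Z]/g, "");
  if (letters.length === 0) {
    return "";
  }
  if (censusStyle) {
    // Census rule: H and W after the first letter vanish before duplicate
    // removal, so the consonants on either side of them can merge.
    letters = letters.charAt(0) + letters.substring(1).replace(/[HW]/g, "");
  }
  var coded = letters
    .replace(/[AEIOUYHW]/g, "0") // placeholders that keep duplicates apart
    .replace(/[BFPV]/g, "1")
    .replace(/[CGJKQSXZ]/g, "2")
    .replace(/[DT]/g, "3")
    .replace(/L/g, "4")
    .replace(/[MN]/g, "5")
    .replace(/R/g, "6")
    .replace(/(\d)\1+/g, "$1"); // collapse runs of the same digit
  // Drop the first letter's code and the placeholders, then pad and cut.
  var digits = coded.substring(1).replace(/0/g, "");
  return (letters.charAt(0) + digits + "000").substring(0, 4);
}

console.log(soundexWithHWRule("Ashcraft", true));  // A261
console.log(soundexWithHWRule("Ashcraft", false)); // A226
```

In "Ashcraft" the S and C share the digit 2. With the census rule the H between them is removed first, so the two digits merge; without it the H acts as a separator and both digits survive.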
Applications of SoundEx
While initially designed for census and genealogical research to handle variations in name spellings, the SoundEx algorithm has broader applications, including:
- Search Engines: Improving search accuracy by matching words that sound alike, even with misspellings.
- Data Matching and Deduplication: Identifying records that likely refer to the same entity despite slight name variations.
- Phonetic Dictionaries: As a basis for creating phonetic dictionaries for speech recognition or synthesis systems.
- Street Sound Analysis (Indirect): SoundEx works on spelled words, not on audio, so it is not a tool for “Street Names” in the geographical sense or for raw recordings. In the context of “streetsounds.net,” however, it could be used to group audio recordings by the phonetic similarity of spoken words captured in street environments, once those words have been transcribed to text. For example, it could help identify different instances of someone saying a particular street name when the transcriptions vary because of pronunciation or background noise.
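As an illustration of the data matching and deduplication use case, names can be bucketed by their code so that likely variants land in the same group. The following is a sketch under stated assumptions: soundexKey is a compact four-character encoder written only for this example, groupBySound is a hypothetical helper, and the sample names are invented.

```javascript
// Hypothetical helper: a compact four-character encoder used as grouping key.
function soundexKey(name) {
  var s = String(name).toUpperCase().replace(/[^A-Z]/g, "");
  if (!s) {
    return "";
  }
  var coded = s
    .replace(/[AEIOUYHW]/g, "0")
    .replace(/[BFPV]/g, "1")
    .replace(/[CGJKQSXZ]/g, "2")
    .replace(/[DT]/g, "3")
    .replace(/L/g, "4")
    .replace(/[MN]/g, "5")
    .replace(/R/g, "6")
    .replace(/(\d)\1+/g, "$1"); // collapse runs of the same digit
  return (s.charAt(0) + coded.substring(1).replace(/0/g, "") + "000").substring(0, 4);
}

// Hypothetical helper: collect names that share a code into one bucket.
function groupBySound(names) {
  var groups = {};
  names.forEach(function (name) {
    var key = soundexKey(name);
    if (!groups[key]) {
      groups[key] = [];
    }
    groups[key].push(name);
  });
  return groups;
}

var groups = groupBySound(["Smith", "Smyth", "Smythe", "Schmidt", "Jonson", "Johnson"]);
console.log(groups);
// { S530: [ 'Smith', 'Smyth', 'Smythe', 'Schmidt' ], J525: [ 'Jonson', 'Johnson' ] }
```

Names in the same bucket are only candidates for a match. Because the code is coarse, a real deduplication step would normally confirm each candidate pair with a stricter comparison.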
Conclusion
The SoundEx algorithm provides a simple and fast way to perform approximate phonetic matching on English names and words. This JavaScript implementation offers a practical starting point for developers looking to incorporate phonetic searching or matching into their applications. Whether you’re working on genealogical research, improving search functionality, or even analyzing transcribed voices from the street, understanding algorithms like SoundEx and their limits is worthwhile.