**Soundex Algorithm is a very simple and basic phonetic algorithm that groups together the names that sound alike but are spelled differently, for example, Jacky or Jackie. Soundex mainly encodes the consonants of the word; a vowel will not be encoded unless it is the first letter of the word and generates four-character codes based upon the pronunciation of the word. And these generated codes can be used to to match two words to determine whether they sound alike. It is also a standard feature of many popular databases like DB2, Oracle, PostgreSQL, MySQL, SQLite and MS SQL Server etc. **

Background

Background

**It is the best-known phonetic matching scheme, developed by Margaret King Odell and Robert C. Russell, and patented in 1918, but came to prominence in the 1960s when it was described in Donald Knuth’s The Art of Computer Programming.**

## The Algorithm

**American Soundex algorithm, a variant of the original**,** is followed by the most SQL languages**.** It applies a set of rules to a string to generate the four-character code. The encoding steps are as follows:**

1. Retain the first letter of the name and drop all other occurrences of a, e, h, i, o, u, w, y.

2. Replace consonants with digits as follows (after the first letter):

* b, f, p, v => 1* c, g, j, k, q, s, x, z => 2 * d, t => 3 * l => 4 * m, n => 5 * r => 6 |

**American Soundex Mapping Code**

**3. Two adjacent letters with the same number are coded as a single number. 4. Continue until you have one letter and three numbers. If you run out of letters, fill in 0s until there are three numbers.For example, “SAINT” and “SANT” are both reduced to S530 or AMERICA and AMERIKA both reduced to A562 i.e. two similar sounding names match if they have the same soundex representation. Another example can be given as “CALCUTTA” and “KOLKATA” will be reduce to C423 and K423 accordingly.**

Now all these things can also be found on wikipedia. So what’s new??? |

## A New Method for Code Generation

**Here it is our new Soundex code generating methodology. Given below algorithm is a small deviation**

**from the American Soundex.****Using this new method policy, we can reduce the 4 characters Soundex code to 3 characters. How???**

**Now if you are found of mathematics then you should know that the number can be represented in many forms like hexadecimal, decimal, binary etc. Now in the binary format, the number can be represented by only two digits – 0 and 1; in octal it can be represented by 0 to 7 while in hexadecimal it can be represented by 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E and F. Same way if the number radix is 6, then it can be represented via 0 to 5.**

**Number System Conversion** (base 6 to hexadecimal)

**Now think 312 **

**is base 6 number format. Now convert (**

**)6 number into the decimal number (means base 10). Here the number will reduce automatically.****312****312****will become**

**116****in base 10 or decimal number system.**

**(312)6 = (116)10**

**Now again convert this decimal number into the hexadecimal number so number 116 in base 10 will become 74 in hexadecimal. Thus, (116)10 = (74)16**

**In short,**

**(312)6 = (116)10** = **(74)16**

**(51)6 = (31)10** = **(1F)16**

**(51)6 = (31)10**

**(1F)16**

### How number system helps to reduce the code length?

**Now you may wonder what the use of all these things here, but let me tell you it can help you to reduce the Soundex code length from 4 to 3 and even the storage size in the database in case if you are storing or indexing the Soundex code for faster searching process. In our new method policy, the encoding of the groups of the labial consonants starts from 0 rather than 1.**

* b, f, p, v => 0* c, g, j, k, q, s, x, z => 1 * d, t => 2 * l => 3 * m, n => 4 * r => 5 |

**New Soundex Mapping Code**

**Now let’s generate the Soundex code for CALCUTTA/KOLKATA following the American Soundex steps with applying new mapping code policy. Here “CALCUTTA” and “KOLKATA” will be reduce to C312 and K312 accordingly. Now remove first char ‘C’ or ‘K’ from the code so remain number will be 312 only. Now consider 312 number as base 6 and apply base 6 to hexadecimal number system conversion, we get new number 74 of base 16. Now append 74 to ‘C’ or ‘K’ and finally the new code will become C74 and K74 respectively for “CALCUTTA” and “KOLKATA”. I hope now you understand how we reduce the Soundex code for KOLKATA to C74 from C312.**

## New Soundex Algorithm Steps

**It applies a set of rules to a string to generate the four-character code. The encoding steps are as follows:**

**1. Retain the first letter of the name and drop all other occurrences of a, e, h, i, o, u, w, y.2. Replace consonants with digits as follows (after the first letter):**

*** b, f, p, v => 0* c, g, j, k, q, s, x, z => 1* d, t => 2* l => 3* m, n => 4* r => 5**

**Two adjacent letters with the same number are coded as a single number.****Continue until you have one letter and three numbers. Also there will be no padding with 0 even if the code length is less than 3.****Assume 3 digit numbers as base 6 format and convert into hexadecimal format.****Append the first letter with 2 digit hexadecimal number.**

**Comparision of American Soundex Code and New Soundex Code:**

*American Soundex Code:*

AHMEDABAD = A531

AMDAVAD = A531

BANGALORE = B524

BENGALURU = B524

MYSORE = M260

MYSURU = M260

DELHI = D400

DILLI = D400

ALPHABET = A411

ALFABAT = A411

AMERICA = A562

AMERIKA = A562

*Our 4 length Soundex code*

AHMEDABAD = A420

AMDAVAD = A420

BANGALORE = B413

BENGALURU = B413

MYSORE = M150

MYSURU = M150

DELHI = D3

DILLI = D3

ALPHABET = A300

ALFABAT = A300

AMERICA = A451

AMERIKA = A451

*Final & converted 3 length Soundex code*

AHMEDABAD = A9c

AMDAVAD = A9c

BANGALORE = B99

BENGALURU = B99

MYSORE = Mb

MYSURU = Mb

DELHI = D3

DILLI = D3

ALPHABET = A6c

ALFABAT = A6c

AMERICA = Aaf

AMERIKA = Aaf

## New Soundex Algorithm Implementation in Java

```
/*
*
* Coded by Virag R. Shah
* Subject:-- Shrinking Soundex Code
* Compiled & Run on JDK 11.0.4
*
*/
import java.util.HashMap;
public class SndX {
public static String soundex(String source){
/*
* New Soundex Mapping
*
* b, f, p, v => 0
* c, g, j, k, q, s, x, z => 1
* d, t => 2
* l => 3
* m, n => 4
* r => 5
*
* Giving no code number to A,E,I,O,U,Y,W,H
*/
//Putting code hash
HashMap hm=new HashMap();
hm.put("A",""); //
hm.put("B","0");//0
hm.put("C","1");//1
hm.put("D","2");//2
hm.put("E",""); //
hm.put("F","0");//0
hm.put("G","1");//1
hm.put("H","");//
hm.put("I","");//
hm.put("J","1");//1
hm.put("K","1");//1
hm.put("L","3");//3
hm.put("M","4");//4
hm.put("N","4");//4
hm.put("O","");//
hm.put("P","0");//0
hm.put("Q","1");//1
hm.put("R","5");//5
hm.put("S","1");//1
hm.put("T","2");//2
hm.put("U","");//
hm.put("V","0");//0
hm.put("W","");//
hm.put("X","1");//1
hm.put("Y","");//
hm.put("Z","1");//1
String firstChar=source.substring(0,1);
String res="";
String temp=source.substring(1);
for(int i=0;i < temp.length();i++){
String m=temp.substring(i,i+1);
String f=(String)hm.get(m);
res=res+f;
}
if(res.length() < 3)
res=convertToHex(res.substring(0, res.length()));
else
res=convertToHex(res.substring(0, 3));
res=firstChar+res;
return res;
}
//convert number from base 6 to hex
public static String convertToHex(String numb){
//first convert number base 6 radix format to decimal format
Integer i=Integer.parseInt(numb, 6);
//convert number to hex format
String hexas=Integer.toHexString(i);
return hexas;
}
// removing adjacent duplicate characters from the word
public static String rmAdjacentChar(String word) {
if(word. length() < 2)
return word;
if(word. charAt(0) == word. charAt(1))
word = rmAdjacentChar(word. substring(1));
word = word. substring(0, 1) + rmAdjacentChar(word. substring(1));
return word;
}
public static void main(String arg[]){
System.out.println(soundex(rmAdjacentChar("AHMEDABAD")));
System.out.println(soundex(rmAdjacentChar("AMDAVAD")));
System.out.println(soundex(rmAdjacentChar("BANGALORE")));
System.out.println(soundex(rmAdjacentChar("BENGALURU")));
System.out.println(soundex(rmAdjacentChar("MYSORE")));
System.out.println(soundex(rmAdjacentChar("MYSURU")));
System.out.println(soundex(rmAdjacentChar("DELHI")));
System.out.println(soundex(rmAdjacentChar("DILLI")));
System.out.println(soundex(rmAdjacentChar("ALPHABET")));
System.out.println(soundex(rmAdjacentChar("ALFABAT")));
System.out.println(soundex(rmAdjacentChar("AMERICA")));
System.out.println(soundex(rmAdjacentChar("AMERIKA")));
}
}
```

## CONCLUSION AND SCOPE:

Here I have presented a new Soundex code generator. Also here I do not attempt to address any issues of Soundex algorithm but rather represent a simple approach to minimize the length of the Soundex code. Kindly note there are some databases like MySQL where Soundex function returns more than 4 characters. You can still apply the number format conversion but the chance of the code reduction is not 100% in that case. In the case, if you are not indexing or caching or storing the Soundex code then it is an overhead of the unnecessary processes and both the time and space complexity is higher. In the case of caching, for the first time code generation it will take more time. But after that the time process for matching any name from the cache is faster as it has to compare 3 characters rather than 4.