Shrinking Soundex Code

research article Study

Soundex Algorithm is a very simple and basic phonetic algorithm that groups together the names that sound alike but are spelled differently, for example, Jacky or Jackie. Soundex mainly encodes the consonants of the word; a vowel will not be encoded unless it is the first letter of the word and generates four-character codes based upon the pronunciation of the word. And these generated codes can be used to to match two words to determine whether they sound alike. It is also a standard feature of many popular databases like DB2, Oracle, PostgreSQL, MySQL, SQLite and MS SQL Server etc.


Background

It is the best-known phonetic matching scheme, developed by Margaret King Odell and Robert C. Russell, and patented in 1918, but came to prominence in the 1960s when it was described in Donald Knuth’s The Art of Computer Programming.

The Algorithm

American Soundex algorithm, a variant of the original, is followed by the most SQL languages. It applies a set of rules to a string to generate the four-character code. The encoding steps are as follows:


1. Retain the first letter of the name and drop all other occurrences of a, e, h, i, o, u, w, y.
2. Replace consonants with digits as follows (after the first letter):

* b, f, p, v => 1
* c, g, j, k, q, s, x, z => 2
* d, t => 3
* l => 4
* m, n => 5
* r => 6


American Soundex Mapping Code

3. Two adjacent letters with the same number are coded as a single number.
4. Continue until you have one letter and three numbers. If you run out of letters, fill in 0s until there are three numbers.


For example, “SAINT” and “SANT” are both reduced to S530 or AMERICA and AMERIKA both reduced to A562 i.e. two similar sounding names match if they have the same soundex representation. Another example can be given as “CALCUTTA” and “KOLKATA” will be reduce to C423 and K423 accordingly.

Now all these things can also be found on wikipedia. So what’s new???

 A New Method for Code Generation

Here it is our new Soundex code generating methodology. Given below algorithm is a small deviation from the American Soundex. Using this new method policy, we can reduce the 4 characters Soundex code to 3 characters. How???

Now if you are found of mathematics then you should know that the number can be represented in many forms like hexadecimal, decimal, binary etc. Now in the binary format, the number can be represented by only two digits – 0 and 1;  in octal it can be represented by 0 to 7 while in hexadecimal it can be represented by 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E and F. Same way if the number radix is 6, then it can be represented via 0 to 5.

Number System Conversion (base 6 to hexadecimal)

Now think 312 is base 6 number format. Now convert (312)6 number into the decimal number (means base 10). Here the number will reduce automatically. 312 will become 116 in base 10 or decimal number system.

(312)6 = (116)10

Now again convert this decimal number into the hexadecimal number so number 116 in base 10 will become 74 in hexadecimal. Thus, (116)10 = (74)16

In short,

(312)6 = (116)10 = (74)16

(51)6 = (31)10 = (1F)16

How number system helps to reduce the code length?

Now you may wonder what the use of all these things here, but let me tell you it can help you to reduce the Soundex code length from 4 to 3 and even the storage size in the database in case if you are storing or indexing the Soundex code for faster searching process. In our new method policy, the encoding of the groups of the labial consonants starts from 0 rather than 1.

* b, f, p, v => 0
* c, g, j, k, q, s, x, z => 1
* d, t => 2
* l => 3
* m, n => 4
* r => 5

New Soundex Mapping Code

Now let’s generate the Soundex code for CALCUTTA/KOLKATA following the American Soundex steps with applying new mapping code policy. Here “CALCUTTA” and “KOLKATA” will be reduce to C312 and K312 accordingly. Now remove first char ‘C’ or ‘K’ from the code so remain number will be 312 only. Now consider 312 number as base 6 and apply base 6 to hexadecimal number system conversion, we get new number 74 of base 16. Now append 74 to ‘C’ or ‘K’ and finally the new code will become C74 and K74 respectively for “CALCUTTA” and “KOLKATA”. I hope now you understand how we reduce the Soundex code for KOLKATA to C74 from C312.

New Soundex Algorithm Steps

It applies a set of rules to a string to generate the four-character code. The encoding steps are as follows:

1. Retain the first letter of the name and drop all other occurrences of a, e, h, i, o, u, w, y.
2. Replace consonants with digits as follows (after the first letter):

* b, f, p, v => 0
* c, g, j, k, q, s, x, z => 1
* d, t => 2
* l => 3
* m, n => 4
* r => 5

  1. Two adjacent letters with the same number are coded as a single number.
  2. Continue until you have one letter and three numbers. Also there will be no padding with 0 even if the code length is less than 3.
  3. Assume 3 digit numbers as base 6 format and convert into hexadecimal format.
  4. Append the first letter with 2 digit hexadecimal number.

Comparision of American Soundex Code and New Soundex Code:

American Soundex Code:

AHMEDABAD = A531
AMDAVAD = A531
BANGALORE = B524
BENGALURU = B524
MYSORE = M260
MYSURU = M260
DELHI = D400
DILLI = D400
ALPHABET = A411
ALFABAT = A411
AMERICA = A562
AMERIKA = A562

Our 4 length Soundex code

AHMEDABAD = A420
AMDAVAD = A420
BANGALORE = B413
BENGALURU = B413
MYSORE = M150
MYSURU = M150
DELHI = D3
DILLI = D3
ALPHABET = A300
ALFABAT = A300
AMERICA = A451
AMERIKA = A451

Final & converted 3 length Soundex code

AHMEDABAD = A9c
AMDAVAD = A9c
BANGALORE = B99
BENGALURU = B99
MYSORE = Mb
MYSURU = Mb
DELHI = D3
DILLI = D3
ALPHABET = A6c
ALFABAT = A6c
AMERICA = Aaf
AMERIKA = Aaf

New Soundex Algorithm Implementation in Java

/*
 * 
 * Coded by Virag R. Shah
 * Subject:-- Shrinking Soundex Code
 * Compiled & Run on JDK 11.0.4
 * 
 */



import java.util.HashMap;

public class SndX {	
	
	public static String soundex(String source){
		/* 
		 * New Soundex Mapping
		 *  
		 * b, f, p, v => 0
		 * c, g, j, k, q, s, x, z => 1
		 * d, t => 2
		 * l => 3
		 * m, n => 4
		 * r => 5
		 * 
		 * Giving no code number to A,E,I,O,U,Y,W,H 
		 */
		
		//Putting code hash
				
		HashMap hm=new HashMap();
		  hm.put("A",""); //
		  hm.put("B","0");//0
		  hm.put("C","1");//1
		  hm.put("D","2");//2
		  hm.put("E",""); //
		  hm.put("F","0");//0
		  hm.put("G","1");//1
		  hm.put("H","");//
		  hm.put("I","");//
		  hm.put("J","1");//1
		  hm.put("K","1");//1
		  hm.put("L","3");//3
		  hm.put("M","4");//4
		  hm.put("N","4");//4
		  hm.put("O","");//
		  hm.put("P","0");//0
		  hm.put("Q","1");//1
		  hm.put("R","5");//5
		  hm.put("S","1");//1
		  hm.put("T","2");//2
		  hm.put("U","");//
		  hm.put("V","0");//0
		  hm.put("W","");//
		  hm.put("X","1");//1
		  hm.put("Y","");//
		  hm.put("Z","1");//1
	
		
		
		String firstChar=source.substring(0,1);
		
		String res="";
		String temp=source.substring(1);
		
		for(int i=0;i < temp.length();i++){
			String m=temp.substring(i,i+1);
			String f=(String)hm.get(m);
			res=res+f;
		}

		if(res.length() < 3)
		res=convertToHex(res.substring(0, res.length()));
		else
		res=convertToHex(res.substring(0, 3));
		
		res=firstChar+res;
		
		return res;
		
			
	}
	
	//convert number from base 6 to hex
	public static String convertToHex(String numb){
		//first convert number base 6 radix format to decimal format
		Integer i=Integer.parseInt(numb, 6);
		//convert number to hex format
		String hexas=Integer.toHexString(i);
		return hexas;
	}

    // removing adjacent duplicate characters from the word
    public static String rmAdjacentChar(String word) {
        if(word. length() < 2)
        return word;
        if(word. charAt(0) == word. charAt(1))
        word = rmAdjacentChar(word. substring(1));
        word = word. substring(0, 1) + rmAdjacentChar(word. substring(1));
        return word;
    }
    
	public static void main(String arg[]){
	    System.out.println(soundex(rmAdjacentChar("AHMEDABAD")));
		System.out.println(soundex(rmAdjacentChar("AMDAVAD")));
		System.out.println(soundex(rmAdjacentChar("BANGALORE")));
		System.out.println(soundex(rmAdjacentChar("BENGALURU")));
		System.out.println(soundex(rmAdjacentChar("MYSORE")));
		System.out.println(soundex(rmAdjacentChar("MYSURU")));
		System.out.println(soundex(rmAdjacentChar("DELHI")));
		System.out.println(soundex(rmAdjacentChar("DILLI")));
		System.out.println(soundex(rmAdjacentChar("ALPHABET")));
		System.out.println(soundex(rmAdjacentChar("ALFABAT")));
		System.out.println(soundex(rmAdjacentChar("AMERICA")));
		System.out.println(soundex(rmAdjacentChar("AMERIKA")));
	}
}

CONCLUSION AND SCOPE:

Here I have presented a new Soundex code generator. Also here I do not attempt to address any issues of Soundex algorithm but rather represent a simple approach to minimize the length of the Soundex code. Kindly note there are some databases like MySQL where Soundex function returns more than 4 characters. You can still apply the number format conversion but the chance of the code reduction is not 100% in that case. In the case, if you are not indexing or caching or storing the Soundex code then it is an overhead of the unnecessary processes and both the time and space complexity is higher. In the case of caching, for the first time code generation it will take more time. But after that the time process for matching any name from the cache is faster as it has to compare 3 characters rather than 4.

References

SOUNDEX | IBM Db2
SOUNDEX (Transact-SQL) - SQL Server | Microsoft Docs
SOUNDEX | Oracle Database Online Documentation
SOUNDEX | Wikipedia
Soundex System | National Archives
MySQL SOUNDEX() function