pdanese
2006-09-06 19:50:31 UTC
1 of the table columns contains “company names” and I want to get a list of each company in alphabetical order.
Normally, I’d do something like “select distinct company_name from table order by 1”
However, there is no uniform input method for company name:
typos are common (e.g. citigrup instead of citigroup)
some companies are listed with and without “inc.” (e.g. Wal-Mart, Inc. vs Wal-Mart vs Wal-Mart, Incorporated, etc).
I also don’t control user input so I can’t “fix” the problem at the front-end.
I must not be the first person to encounter this issue.
Is there a commonly used method/algorithm for removing superfluous (or misspelled) information and making the values consistent with each other?
Do you just have to make a huge mapping table? i.e. a table that says “anything that begins with Wal-mart or sounds like Wal-mart” is now listed as “Wal-Mart”?
Or is there another way?
Thanks!