Question:
organizing imperfect data in a database table?
pdanese
2006-09-06 19:50:31 UTC
I have a database table w 500,000 user-entered records.

1 of the table columns contains “company names” and I want to get a list of each company in alphabetical order.

Normally, I’d do something like “select distinct company_name from table order by 1”

However, there is no uniform input method for company name:
typos are common (e.g. citigrup instead of citigroup)
some companies are listed with and without “inc.” (e.g. Wal-Mart, Inc. vs Wal-Mart vs Wal-Mart, Incorporated, etc).

I also don’t control user input so I can’t “fix” the problem at the front-end.

I must not be the first person to encounter this issue.
Is there a commonly used method/algorithm for removing superfluous (or misspelled) information and making the values consistent with each other?

Do you just have to make a huge mapping table? i.e. a table that says “anything that begins with Wal-mart or sounds like Wal-mart” is now listed as “Wal-Mart”?

Or is there another way?

Thanks!
Three answers:
sheeple_rancher
2006-09-06 22:00:21 UTC
Various governments in all countries have lists of company names. Example:

http://www.sunbiz.org/corpweb/inquiry/cormenu.html

Some organizations sell large compilations of these lists on DVD as well as via online databases. Example:

http://selectory.com/



It should be possible to write some code to validate all names against a large list and have it spit out or mark the "mystery meat" names for special treatment. It will not be all that easy, but will certainly be a lot easier than hand-inspecting 500,000 entries.



Comparisons should be case insensitive and you should cleanse each record of extra spaces and crap.
amit_online
2006-09-06 21:08:13 UTC
There really is no single method or algorithm which will rectify this problem. The problem is compounded by the fact that you are having to deal with company names (proper nouns) so you cant do a word lookup in a dictionary to see if its a proper word or not.



The weakest point of any system is the point of data entry, as thats the only point where a human being enters data into the system and thats where most of the mistakes happen. No amount of coding or logic can ever compensate for bad data entry.



Perhaps you can do a grouping on the company name and store off the unique company names into another table. That table can have another field which will be empty by default. You can do some lookup or some processing to process each of the extracted company names to see if its valid or not and assign the proper company name for that value into the second field. eg.



fld1 fld2

**** ****

Wal-mart Walmart Inc

Walmurt Walmart Inc.



And then you can overwrite the company name field in the original table with fld2 from the temp table using fld1 as a lookup field. Not a very satisfactory solution , but I cant think of anything better
ungkog
2006-09-06 21:03:20 UTC
1. make a new table for company name with all data correct.



2. make/change entries into capital letters



3. remove all punctuations and other symbols



4. make company name the primary key to avoid multiple entry



5. manipulate your db to remove double entry. (sql and programming language)



_____________________



At your user interface, map user input to this new table so that there are no wrong entry.


This content was originally posted on Y! Answers, a Q&A website that shut down in 2021.
Loading...