Question:
Best data structure for storing frequency of words in a text file?
Seven answers:
Ratchetr
2012-12-06 17:36:00 UTC
FWIW, below is an implementation in C#. I ran it against a downloaded text-only copy of the Bible.



I timed how long it took. The results are below.



In C#, a Dictionary is implemented internally as a hash map.



Just curious, have you written and timed your program yet, or just designed it? If you have written it, how long is it taking to run? How long do you expect it to take?



I would say that 452 milliseconds to read the file, parse out the words and store them in the Dictionary is very efficient. And the 16 millisecond sort certainly is.



If you aren't seeing numbers in that ballpark with your hash table and quick sort, then there is probably something wrong in your implementation.





using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static void Main()
{
    var sr = File.ReadLines( @"c:\temp\biblewords.txt" );

    Dictionary<string, int> d = new Dictionary<string, int>();
    var sep = new char[] { '\t', ' ', '.', ':', ';', ',', '(', ')', '!', '?' };

    var startTime = Environment.TickCount;
    foreach (var line in sr)
    {
        var words = line.Split(sep, StringSplitOptions.RemoveEmptyEntries);

        foreach (var word in words)
        {
            // count each word case-insensitively
            var lw = word.ToLower();
            if (!d.ContainsKey(lw))
                d.Add(lw, 0);
            d[lw]++;
        }
    }

    Console.WriteLine("Dictionary built in {0} milliseconds", Environment.TickCount - startTime);

    startTime = Environment.TickCount;

    // sort by frequency, most common first
    var sorted = d.OrderByDescending(n => n.Value).ToList();

    Console.WriteLine("Sorted in {0} milliseconds", Environment.TickCount - startTime);

    Console.WriteLine("\nTotal words {0}", d.Sum(n => n.Value));
    Console.WriteLine("Total unique words {0}", d.Count);

    Console.WriteLine("\nTop Ten:");
    for (int i = 0; i < 10; i++)
        Console.WriteLine("{0}\t{1}", sorted[i].Key, sorted[i].Value);
}



Output:

Dictionary built in 452 milliseconds

Sorted in 16 milliseconds



Total words 789632

Total unique words 12897



Top Ten:

the 63924

and 51695

of 34617

to 13561

that 12913

in 12667

he 10420

shall 9838

unto 8997

for 8971
2016-08-03 14:46:44 UTC
Put the documents in a DB. If they download one, use:

$sqlUpdate = " update `".Table_name."` SET `".Table_name."`.`hits` = `".Table_name."`.`hits`+1 where `".Table_name."`.`identification` = '" . $tempID."' ";

and after that simply list the biggest 'hits' :) cheers T.
deonejuan
2012-12-06 16:48:56 UTC
You don't say what language, but C# and Java have Collections. Here's Java...



import java.util.*;

public class Freq {

    public static void main(String[] args) {

        Map<String, Integer> m = new HashMap<String, Integer>();

        // Initialize frequency table from command line
        for (String a : args) {
            Integer freq = m.get(a);
            m.put(a, (freq == null) ? 1 : freq + 1);
        }

        System.out.println(m.size() + " distinct words:");
        System.out.println(m);
    }
}



The only tricky thing about this program is the second argument of the put statement. That argument is a conditional expression that has the effect of setting the frequency to one if the word has never been seen before or one more than its current value if the word has already been seen. Try running this program with the command:



Running this short program from the command-line terminal...

java Freq if it is to be it is up to me to delegate



The program yields the following output.



8 distinct words:

{to=3, delegate=1, be=1, it=2, up=1, if=1, me=1, is=2}



I don't know what you mean by efficiency. I've used such a structure with the public domain ENABLE word dictionary -- 180k words.



The Bible is 593,493 words in the Old Testament and 181,253 in the New Testament, giving 774,746 words.
Dave P
2012-12-06 16:48:53 UTC
I did something similar a while ago with JavaScript but can't find it now. This is the gist of how I used two arrays: one to store words as they are first encountered, and a second to increment the count each time the word is seen again. Finally, the arrays are combined for sorting.







// words have been split from text, feed each in turn into:



var idx = wordArray.indexOf(currentWord);

if (idx > -1)
{
    // word has already been found and a count added to the array;
    // increment count in the count array
    countArray[idx]++;
}
else
{
    // word not seen before, so add it to the word array and
    // initialise countArray to 1 at the same index
    var n = wordArray.length;
    wordArray[n] = currentWord;
    countArray[n] = 1;
}



// next word.





At the end you will have two arrays, one containing all the words, the other containing the count for each word at the same index. To use an array sort on this, they'd probably have to be combined into a two-dimensional array:



var combinedArray = [];

for (var i = 0; i < wordArray.length; i++)
{
    combinedArray[i] = [];               // each row is [word, count]
    combinedArray[i][0] = wordArray[i];
    combinedArray[i][1] = countArray[i];
}



(I think that doing this at the end will be much faster than loading the words and counts into a two dimensional array initially as the search for words already added would be slower in a 2-d array)



Finally, sort by count using a normal sort function such as:



combinedArray.sort(sort2DimArray);



function sort2DimArray(a, b)
{
    // this sorts the array using the second element (the count)
    return ((a[1] < b[1]) ? -1 : ((a[1] > b[1]) ? 1 : 0));
}





If the code above got clipped, I've pasted it here:



http://pastebin.com/wFv0JP7L
?
2012-12-06 16:34:11 UTC
I would just create a table with 2 columns. One column would be the word itself, and the other would be the frequency of the word. While looping through your "Bible", I would add new words and assign a count of 1, but if the word exists, I would increment the count.
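A minimal C# sketch of that idea, assuming the words have already been split out of the text (the names table and words here are just illustrative):

// two "columns": the word (key) and its frequency (value)
var table = new Dictionary<string, int>();

foreach (var word in words)          // words: the parsed words from the text
{
    if (table.ContainsKey(word))
        table[word]++;               // word already exists: increment the count
    else
        table.Add(word, 1);          // new word: start its count at 1
}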
Chris
2012-12-06 16:33:32 UTC
I'd say you want an associative array (aka a Dictionary). An associative array is just like an array except that instead of having a single value indexed by an integer, you have a value indexed by another value.



For example, using C#:



A Normal array:



string[] foo = new string[] {"apple", "peach", "pear"};



In this normal array of strings, foo[1] would have the value "peach".



An associative array:



Dictionary<string, int> foo = new Dictionary<string, int>();

foo.Add("apple", 1);



The first line declares the dictionary to have a string key and an integer value.

The second line adds a new entry to the dictionary with a key of "apple" and a value of 1.



So, assuming you had your next word in _word, you can update your histogram like so:



if (foo.ContainsKey(_word)) {

    foo[_word]++;

} else {

    foo.Add(_word, 1);

}



Note that we used a generic dictionary (by specifying the data types used for the key and the value).

Note that most modern languages have an associative array, but it may be called something else and the syntax will likely be different.



C# supports some very powerful extensions that allow you to order the dictionary by specifying an anonymous function similar to the following:



foo.OrderBy(n => n.Value);



The "n => n.value" is a lambda function that says:



Given an member of foo (which is a KeyValuePair), call that instance n and return n.value... ie. sort on the value field of the KeyValuePair.



The compiler can infer all the types involved here so the source code remains very clean and simple.
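For example (a small illustrative snippet, assuming foo is the dictionary built above), you could list the words from most to least frequent like so:

foreach (var entry in foo.OrderByDescending(n => n.Value))
    Console.WriteLine("{0}: {1}", entry.Key, entry.Value);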
husoski
2012-12-06 18:16:02 UTC
A few search hits tell me that there are fewer than 13,000 distinct English words in the King James Bible, so the quicksort of that list is not likely to be your problem. Make sure your hash table has about twice that many entries at the start if you want to get best performance.
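For instance, if the hash table is something like the C# Dictionary used in the other answers, you can hint at that size when it is created (26,000 here is just an illustrative figure, roughly twice the expected 13,000 distinct words):

// pre-size the hash table to roughly twice the expected number of distinct words
var d = new Dictionary<string, int>(26000);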



If your hash function is any good, the n log n performance of quicksort is almost guaranteed. The table itself is in randomized order and is incredibly unlikely to have O(n^2) performance.



So, you are already using what I think are the generically best data structures.



If, against all odds, it is the sort that is a bottleneck for you, you can improve the sort-by-frequency by simply using an array of pointers indexed by frequency. Each of those is the head of a linked list of words with exactly that frequency. The average number of collisions should be quite small, with the longest lists of duplicates in the very low frequency range, or so I'd expect. Anything that, against expectations, occurred more often than the table size goes into an "overflow" list.



This is essentially a radix sort, with a very large radix.



With custom code, that should perform pretty well. With an OOP implementation, calling 5000 (or so) constructors to initialize an array of linked list objects is likely to exceed the sort time. (A C/C++ pointer array can be blasted to all zeros in just half that number of memory store cycles, assuming a 64-bit memory bus and 32-bit pointers.)
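A rough C# sketch of that bucket idea, assuming the counts already live in a Dictionary<string, int> named d as in the other answers (MaxCount and the variable names are just illustrative choices, not part of the original answer):

// buckets[f] holds the words that occurred exactly f times;
// anything above MaxCount goes into a small overflow list
const int MaxCount = 5000;
var buckets = new List<string>[MaxCount + 1];
var overflow = new List<KeyValuePair<string, int>>();

foreach (var pair in d)
{
    if (pair.Value <= MaxCount)
    {
        if (buckets[pair.Value] == null)
            buckets[pair.Value] = new List<string>();
        buckets[pair.Value].Add(pair.Key);
    }
    else
    {
        overflow.Add(pair);   // rare, very high-frequency words
    }
}

// walk from the highest frequency down to emit words most-frequent first
foreach (var pair in overflow.OrderByDescending(p => p.Value))
    Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);

for (int f = MaxCount; f >= 1; f--)
    if (buckets[f] != null)
        foreach (var word in buckets[f])
            Console.WriteLine("{0}\t{1}", word, f);

Building the buckets is a single pass over the dictionary, and the final walk visits each bucket once, which is the large-radix radix sort described above; only the small overflow list needs a comparison sort.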


This content was originally posted on Y! Answers, a Q&A website that shut down in 2021.