you have to write a specification for what defines a word and what defines whitespace (a word boundary). for instance, a word can include certain ranges of characters in the Unicode character set. to see what those ranges look like on windows, open the character map in any of these ways:
[windows-logo-flag-key]-R, type charmap, hit Enter, or
start, run, type in charmap, hit Enter, or
start, type in charmap and hit Enter, or
start, all programs, accessories, system tools, charmap.
some characters act as word boundaries: space, newline, paragraph (for a word processor), hyphen and characters like it, period, comma, and the quote sets of various types used in different languages. there are 4 types of quotes for english: the one used for typesetting and documents, the 2 in ASCII (' and "), and the accent grave `, called a backquote, if you choose to use it.
that's for generic Unicode documents. alternatively, you can analyze the character set and the general types of documents that actually come through, and base your word counting system on that.
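to make the idea of a specification concrete, here is a minimal sketch in python. everything in it is an assumption for illustration, not a fixed rule: `is_word_char`, `INTRA_WORD` and `scan_words` are hypothetical names, and the choice that letters, combining marks and decimal digits form words (while hyphens, periods, commas and quotes end them) is just one possible specification.

```python
import unicodedata

# assumption: an apostrophe between two word characters stays inside the word,
# so "don't" is one word; both the ASCII and the typographic form are allowed
INTRA_WORD = {"'", "\u2019"}


def is_word_char(ch):
    """Return True for letters (L*), combining marks (M*) and decimal digits (Nd)."""
    category = unicodedata.category(ch)
    return category[0] in ("L", "M") or category == "Nd"


def scan_words(text):
    """Yield each word in text; every non-word character acts as a boundary."""
    word = []
    for i, ch in enumerate(text):
        if is_word_char(ch):
            word.append(ch)
        elif (ch in INTRA_WORD and word
              and i + 1 < len(text) and is_word_char(text[i + 1])):
            # apostrophe with a word character on both sides: keep it
            word.append(ch)
        elif word:
            # boundary reached: emit the word collected so far
            yield "".join(word)
            word = []
    if word:
        yield "".join(word)


words = list(scan_words("don't stop, it's well-known."))
print(len(words), words)
```

with this particular specification "well-known" counts as two words. if your documents should treat it as one, the hyphen would move into `INTRA_WORD`, which is exactly the kind of decision the specification has to record.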
after the analysis, you write a lexical analyzer (token scanner) which takes the document file in question as input and picks out and returns the words from its text. the input can come in several formats: zipped XML files (as is currently the case with microsoft office), in which case you can use a 7-zip or pkzip library to unzip the files and libxml or xerces to parse the XML and then extract the text; a binary file; a plain ASCII file; or a UTF-8-encoded text file.
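as a rough illustration of the zipped XML case, here is a sketch that uses python's standard `zipfile` and `xml.etree.ElementTree` modules in place of the 7-zip/pkzip and libxml/xerces libraries mentioned above. it assumes a .docx file, where the main text lives in `word/document.xml` with paragraphs in `w:p` elements and text runs in `w:t` elements. `extract_docx_text` and `read_plain_text` are hypothetical helper names, and things like headers, footnotes and text boxes are ignored here.

```python
import zipfile
import xml.etree.ElementTree as ET

# namespace used by WordprocessingML elements such as w:p and w:t
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(source):
    """Return the body text of a .docx (path or file-like), one line per paragraph."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # a paragraph's text is split across several w:t runs; join them
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def read_plain_text(path, encoding="utf-8"):
    """Read an ASCII or UTF-8 text file (ASCII is a subset of UTF-8)."""
    with open(path, encoding=encoding) as handle:
        return handle.read()
```

either function returns a plain string, which can then be handed to the token scanner. a binary format would need its own extractor in front of the same scanner.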