How might a computer count words?

Question:

Crabman

2012-01-05 22:15:24 UTC

You know that we can easily count words using Microsoft Word.

But I wonder how programmers made such a function? How does it work? How is it possible that a computer can count words? I don't think the computer really counts the words... so does the computer count the spaces between words? If so, how does the computer count those spaces? Wow... I really think computer programmers are remarkable

Five answers:

2012-01-06 00:14:30 UTC

Basically, you got the idea right.

When we, as humans, count how many words there are in a document, we are actually counting the number of groups formed by contiguous characters. As long as characters stick together, we treat them as a word.

What separates words is generally known as whitespace. This is not limited to the "space" character (the character you get by pressing the spacebar), but also includes all other "invisible" characters such as the newline character (when you press Enter), the tab character and so on.

In programming, we have some nifty "tools" that can identify whitespace very efficiently. One of the most common is the regular expression. Once we can identify whitespace within the document, we simply "slice up" the text at the position of each whitespace, a process commonly known as "splitting". What we get are slices of characters that stick together. We count the number of slices, which is the number of words.

I am a boy. == Cut into ==>

1. I

2. am

3. a

4. boy.

Slices count: 4 ==> 4 words.
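The splitting idea above can be sketched in a few lines of Python. This is only an illustration of the technique, not how Microsoft Word actually does it, and the function name count_words_by_split is made up for this example.

```python
import re

def count_words_by_split(text):
    """Count words by slicing the text at every run of whitespace."""
    # \s+ matches one or more whitespace characters (space, tab, newline, ...).
    slices = re.split(r"\s+", text.strip())
    # Splitting an empty string yields [""], so drop empty slices.
    return len([s for s in slices if s])

print(count_words_by_split("I am a boy."))  # 4
```

Note that "boy." keeps its full stop here: splitting on whitespace alone does not strip punctuation, it only counts the slices.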

This is the general idea. Note that word counting for some languages may be different (for example, a single Chinese character is actually counted as a word). For those, we'd need a different counting algorithm.
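As a rough sketch of what such a different algorithm might look like, the Python below counts each Chinese character as one word and falls back to whitespace splitting for everything else. The character range and the function name count_words_mixed are assumptions made for this example; a real word processor would cover more ranges and more rules.

```python
import re

# Assumption: "Chinese character" means the basic CJK Unified Ideographs
# block (U+4E00 to U+9FFF). Real software handles more ranges than this.
CJK_CHAR = r"[\u4e00-\u9fff]"

def count_words_mixed(text):
    """Count each CJK character as one word, plus runs of other non-space text."""
    cjk_count = len(re.findall(CJK_CHAR, text))
    # Replace CJK characters with spaces so they also act as word boundaries.
    rest = re.sub(CJK_CHAR, " ", text)
    return cjk_count + len(rest.split())

print(count_words_mixed("I am a boy."))  # 4
```

With this rule, a mixed string such as "Hello" followed by two Chinese characters counts as three words.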

Hope this helps.

And yes. Programmers are remarkable. Don't let anyone tell you different.

2012-01-06 08:55:47 UTC

But I wonder how programmers made such a function?

They wrote a programme to do that task.

How does it work?

Very well.

How is it possible that a computer can count words?

Scan through a line of text looking for alphabetic characters. When you find one, add one to the count of words, then skip through all the alphabetic characters that follow it. Then reset the word-found flag and scan through punctuation and white space. Repeat until the end of the input text is found.

Easy.
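A minimal Python sketch of that scan, assuming that only alphabetic characters make up a word (the function name count_words_by_scanning is made up for this example):

```python
def count_words_by_scanning(text):
    """Scan character by character, counting each run of letters as a word."""
    count = 0
    in_word = False  # the "word found" flag
    for ch in text:
        if ch.isalpha():
            if not in_word:
                count += 1       # first letter of a new word
                in_word = True
        else:
            in_word = False      # punctuation or white space ends the word
    return count

print(count_words_by_scanning("I am a boy."))  # 4
```

Under this rule an apostrophe ends a word, so "don't" is counted as two words, and digits are not counted at all. A real word counter would need extra rules for those cases.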

I don't think the computer really counts the words...

It can. It can skip the spaces between words. It knows which characters are alphabetic, which are white space and which are punctuation.

Wow... I really think computer programmers are remarkable

Yep, programmers can be really cool. Programmes too!

Computers can be fascinating. Programming them is great fun!

2012-01-05 22:37:14 UTC

You have to make a specification for what defines a word and what defines whitespace (a word boundary). For instance, a word can include certain ranges of characters in the Unicode character set. To see what that looks like, open Character Map in any of these ways:

[windows-logo-flag-key]-R, type in charmap, hit Enter, or

start, run, type in charmap, hit Enter, or

start, type in charmap and hit Enter, or

start, all programs, accessories, system tools, charmap.

Some characters act as word boundaries: space, newline, paragraph mark (for a word processor), hyphen and characters like it, period, comma, and the quote characters of various types used in different languages (there are 4 types for English: one used for typesetting and documents, 2 in ASCII, ' and ", and the grave accent `, called a backquote, if you choose to use it).

That's for generic Unicode documents. You can instead analyze the character set and the general types of document that come through, and base your word-counting system on that.
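One way such a specification could be written down is with Unicode general categories. The Python sketch below assumes a very simple rule: letters (categories starting with L) and digits (categories starting with N) belong to words, and everything else is a boundary. The names is_word_char and tokenize are made up for this example.

```python
import unicodedata

def is_word_char(ch):
    """Assumed rule: letters (L*) and numbers (N*) belong to words."""
    return unicodedata.category(ch)[0] in ("L", "N")

def tokenize(text):
    """Yield words, treating every non-word character as a boundary."""
    current = []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            yield "".join(current)
            current = []
    if current:
        yield "".join(current)

print(list(tokenize("Hello, world! It's 2012.")))
# ['Hello', 'world', 'It', 's', '2012']
```

Because the apostrophe is punctuation under this rule, "It's" comes out as two tokens. Whether that is right depends on the specification you choose, which is the point of writing one first.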

After the analysis, you write a lexical analyzer (token scanner) which picks out and returns words from the input, which is the document file in question. The input file can be zipped XML (as is currently the case with Microsoft Office), in which case you can use a 7-Zip or PKZIP library to unzip the file and libxml or Xerces to parse the XML (and then extract the text); or it can be a binary file, a plain ASCII file, or a UTF-8-encoded text file.
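For the zipped-XML case, here is a hedged Python sketch that uses the standard library's zipfile and xml.etree.ElementTree modules in place of the 7-Zip/PKZIP and libxml/Xerces libraries named above. It assumes the text lives in word/document.xml inside w:t elements, as in .docx files. The helper build_sample_docx is hypothetical and only builds a stripped-down archive for the demo, not a complete Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by .docx files for paragraphs and text runs.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def extract_docx_text(file_obj):
    """Return the text of a .docx given a path or a binary file object."""
    with zipfile.ZipFile(file_obj) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        runs = [t.text or "" for t in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)

def build_sample_docx(paragraph_texts):
    """Hypothetical helper: build a minimal in-memory archive for the demo."""
    body = "".join(
        "<w:p><w:r><w:t>%s</w:t></w:r></w:p>" % t for t in paragraph_texts
    )
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>%s</w:body></w:document>' % body
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf

sample = build_sample_docx(["I am a boy.", "So are you."])
text = extract_docx_text(sample)
print(len(text.split()))  # 7
```

Once the text is extracted, any of the counting approaches in this thread can be applied to it. Here the words are simply split on whitespace.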

2012-01-05 22:42:27 UTC

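It can be written like this in C#:

using System;

string paragraph = "your paragraph should go here";
int words = 0;    // runs of letters and digits
int letters = 0;
int digits = 0;
int others = 0;   // spaces, punctuation and everything else
bool inWord = false;

foreach (char oneChar in paragraph)
{
    if (Char.IsLetter(oneChar))
    {
        letters++;
    }
    else if (Char.IsDigit(oneChar))
    {
        digits++;
    }
    else
    {
        others++;
    }

    // A word starts where a letter or digit follows a non-word character.
    bool isWordChar = Char.IsLetterOrDigit(oneChar);
    if (isWordChar && !inWord)
    {
        words++;
    }
    inWord = isWordChar;
}

Console.WriteLine("Found number of words " + words);
Console.WriteLine("Found number of letters " + letters);
Console.WriteLine("Found number of digits " + digits);
Console.WriteLine("Found number of spaces and other characters " + others);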

2014-11-19 01:03:18 UTC

difficult situation. research in google and yahoo. this can assist!

This content was originally posted on Y! Answers, a Q&A website that shut down in 2021.
