Measuring languages on the Internet by counting word occurrences with search engines.

Daniel PIMIENTA <pimienta@funredes.org>
FUNREDES
Dominican Republic

Daniel PRADO <ulat2a@calva.net>
Union Latine
France

Marcelo SZTRUM <sztrum@worldnet.fr>
Union Latine
France

Abstract

Many linguistic areas of the world are interested in measuring their presence, progress and trends on the Internet. What is needed, then, is a replicable way to measure the presence of languages and cultures in the main information spaces (WWW and Usenet).

The paper presents an original methodology for such measurement, which has been refined since its first application in 1995 (when it started by comparing the presence of French to that of English) and which now offers reliable results for Spanish, Portuguese, Italian and Romanian.

The methodology uses the most powerful search engines to count the number of occurrences of a carefully selected set of words in each of the chosen languages. A number of linguistic obstacles exist, which call for a meticulous and systematic selection of the words. Once the word sample is established following these criteria, the statistical results become extremely convincing.
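
As a rough illustration of the kind of computation this implies, the sketch below takes per-word occurrence counts for equivalent words in two languages and summarizes them as a mean ratio and its spread. It is a minimal sketch under stated assumptions, not the authors' exact procedure: the function name `presence_ratio`, the `HITS` table, and all numbers in it are hypothetical, and in practice the counts would come from search engine queries.

```python
from statistics import mean, stdev

# Hypothetical occurrence counts (made-up numbers, NOT results from the paper).
# Each key is a concept; each value maps a language code to the count
# reported for the equivalent word in that language.
HITS = {
    "monday":   {"en": 1000, "fr": 30},
    "february": {"en": 800,  "fr": 20},
    "heart":    {"en": 500,  "fr": 10},
}


def presence_ratio(hits, lang, reference="en"):
    """Return (mean, sample standard deviation) of the per-word ratios
    count[lang] / count[reference] over the word sample."""
    ratios = [
        counts[lang] / counts[reference]
        for counts in hits.values()
        # Skip concepts with a missing count or zero reference occurrences.
        if lang in counts and counts.get(reference)
    ]
    if not ratios:
        raise ValueError("no usable word pairs in the sample")
    spread = stdev(ratios) if len(ratios) > 1 else 0.0
    return mean(ratios), spread


if __name__ == "__main__":
    ratio, spread = presence_ratio(HITS, "fr")
    print(f"fr/en ratio: {ratio:.3f} (std dev {spread:.3f})")
```

A small spread relative to the mean would suggest that the selected words behave consistently, which is one way to read the claim that a well-chosen word sample gives convincing statistics.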

The paper explains the steps of the methodology and presents the results. The scope is extended to German on the occasion of INET99.