Software checks languages used on the Internet

More than 6,000 languages are spoken around the world, but how many are in use on the Internet?
To help preserve a balanced use of languages, a Thai research team at the Thai Computational Linguistics Laboratory (TCL), a research unit under the National Institute of Information and Communications Technology in Japan, has developed what it calls Web Language Engineering (WLE) to help identify the languages used on the Internet. WLE is software that automatically identifies the languages of documents and Web pages.
It is now in use in a project to survey the overall status of languages used on the Internet. Called the Language Observatory Project, it is intended to help policy makers and webmasters in each country understand the actual status of the languages in use and find ways to preserve a balanced use of them.
Virach Sornlertlamvanich, the co-director of the Thai Computational Linguistics Laboratory, said the software was a key part of the project and could now identify around 100 languages. In the survey, he said, the team found that only some major languages, including English, Japanese, Chinese and French, were used widely, while the use of domain names in some countries in Africa did not follow world standards.
The team has also integrated the language identifier software with Google Earth to plot areas of language use on the Web, and found that countries in Africa, such as Somalia, did not have any Web pages as yet. Virach said this information was vital as a guideline for further development of the Internet community in each country.
Importantly, it is hoped that the survey will encourage minority groups in each area to develop Web pages and documents in their own language to mark their presence on the Internet. For Thailand alone, it is assumed that there are more than 10 written languages in use. Virach said the team hoped to find partners to help preserve such languages and make them available on the Internet.
Meanwhile, the team has also worked with Japan and Australia to set up collaborative crawler servers to oversee data collection for the survey project. Each crawler will help the others store data for later language identification. Each is assigned a particular category of data to collect, so that they do not duplicate each other's work. This would increase the amount of data collected, so that users searching for information on the Internet would get more and better results.
The next step, Virach added, was to develop archives that allow Internet users to search for documents in any language and retrieve the documents they want from a particular period of time. "This development will be different from other search engines used today as it will allow users to determine both the topic and period of time to trace back to past documents, while the existing search engine covers only the present document results," he said.
The development of the archive for Thai documents is under way and it is hoped that the prototype will be complete in the next few months. "We will start the project with Thai documents and then we will seek collaboration from other countries to develop a similar archive in their language, so that once they're linked together, it will give users the option to search for a document in many languages," Virach said.
Pongpen Sutharoj, The Nation