Archive for November, 2004

US to have 30m newspaper pages online by 2006

November 19, 2004

The US announced this week that it will have 30m newspaper pages online by 2006.

This article mentions that …

The span of the joint project is limited because type faces of printers used before 1836 are too difficult for optical scanners to read, and copyright restrictions are in force on papers published after 1923.

They have developed a prototype at the Library of Congress site – The Stars and Stripes, 1918-1919. It is a pretty basic interface – they are clearly focusing on getting the basics right before developing the front end. If you go to the above link and look at the bottom left of the page, you will see that you can view “the OCR-generated text transcription of this page”. This gives a reasonably accurate OCR version of the page. From a quick look, the PDFs and the OCR accuracy appear to be excellent. However, they don’t seem to have done as well at “segmenting” the page into its constituent elements, though to be fair this is clearly an early version.

Reuters slows NewsML roll-out

November 19, 2004

Overview of the Reuters project

NewsML is the structure used to publish news in any format. It can be used by news providers to combine their pictures, video, text, graphics and audio files in news output available on web sites, mobile phones, high-end desktops, interactive television and any other device.