[BCMA] Nicola Valley Newspaper Digitization Project

Moderated BCMA subscriber listserv. bcma at lists.vvv.com
Wed Jan 19 23:15:13 PST 2011


Greetings from the Nicola Valley Museum and Archives.

 

In response to questions regarding newspaper digitization: we are working on
an in-house newspaper digitization project that began last summer.

 

Working with "off-the-shelf" hardware and open-source or "off-the-shelf"
software, two summer students were able to photograph and digitize
approximately 3,000 pages of the Merritt Herald in the 1910 to 1921 range.

 

Here are the basics.  We are still fine-tuning the procedures, but preparing
photos and banners for Merritt's upcoming 100th anniversary has slowed the
process down.

 

*   We used a modified copy stand and a Canon T2i digital SLR camera
connected to a desktop computer running Windows 7 to produce 18-megapixel
images of the newspaper pages.

*   We used Corel PhotoPaint to clean up the images and improve the
contrast between the century-old black type and yellowing paper.

*   We used the Readiris Pro Optical Character Recognition (OCR) program to
"read" the page images and produce a searchable "image over text" PDF file.

*   We used the Python programming language to write scripts to speed
up the Corel PhotoPaint process and to assist with the page numbering.

*   We loaded the finished pages onto our in-house server: a QNAP
TS-439PRO II with two 2-terabyte drives running in a RAID configuration.

*   We used Copernic Desktop Search, loaded on our individual computers,
to create a running index of the pages as they were produced.
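
As a rough illustration of the kind of page-numbering helper mentioned in the
list above, here is a minimal Python sketch.  It is not the project's actual
script: the function name plan_page_names, the "MerrittHerald" prefix, the
IMG_nnnn.JPG camera file names, and the sample issue date are all assumptions
made up for the example.  It only works out the new names and does not rename
any files.

```python
from datetime import date


def plan_page_names(camera_files, issue_date, first_page=1, prefix="MerrittHerald"):
    """Pair each camera file (in capture order) with a page-numbered name.

    Assumes the camera numbers its files sequentially, so sorting the
    names reproduces the order in which the pages were photographed.
    """
    plan = []
    for offset, old_name in enumerate(sorted(camera_files)):
        page = first_page + offset
        extension = old_name.rsplit(".", 1)[-1].lower()
        new_name = f"{prefix}_{issue_date.isoformat()}_p{page:02d}.{extension}"
        plan.append((old_name, new_name))
    return plan


if __name__ == "__main__":
    # Hypothetical shots of one issue; the date is for illustration only.
    shots = ["IMG_0103.JPG", "IMG_0101.JPG", "IMG_0102.JPG"]
    for old, new in plan_page_names(shots, date(1915, 6, 4)):
        print(old, "->", new)
```

Returning the plan as a list of pairs, rather than renaming on the spot,
lets someone check the numbering against the physical issue before any file
is changed.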

 

The results are encouraging.  The digital camera, photo, and OCR software
produce a PDF image of the newspaper page that is quite readable on screen
and when printed to an 11x17 sheet.  Although the newspaper columns aren't
separated by the OCR program, the image can be "highlighted" and the text
behind it copied and pasted to another program such as a word-processor,
saving some typing, particularly if the original pages were in good shape.  

 

While we didn't expect 100 percent OCR accuracy with the old newspapers, my
educated guess is that we are getting about 80 percent.  The Copernic Desktop
Search software produces a Google-like index of the sheets within minutes
of the sheets being OCR'ed.  Generally, if a name appears at least once on a
page (the more often it appears, the more likely it is that at least one
instance is recognized correctly), a Copernic search will identify the
page(s) and highlight the word within seconds.
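
To put a rough number on "the more often, the more likely", here is a small
Python sketch.  It rests on two assumptions that are not measured facts: that
the 80 percent guess applies to each individual word, and that each occurrence
of a name is recognized or missed independently of the others.  The function
name chance_of_hit is made up for the example.

```python
def chance_of_hit(per_word_accuracy, occurrences):
    """Probability that at least one occurrence of a name is read correctly.

    Assumes every occurrence is recognized independently with the same
    per-word accuracy, which is a simplification for illustration.
    """
    if not 0.0 <= per_word_accuracy <= 1.0:
        raise ValueError("per_word_accuracy must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences cannot be negative")
    # Chance that every occurrence is misread, subtracted from 1.
    return 1.0 - (1.0 - per_word_accuracy) ** occurrences


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} occurrence(s): {chance_of_hit(0.8, n):.3f}")
```

Under those assumptions, a name printed once would be found about 80 percent
of the time, twice about 96 percent, and three times about 99 percent.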

 

The resulting page files are fairly large and don't lend themselves to
transfer over the Internet, but an in-house researcher can quickly find
possible sources, review the material, and print or copy the appropriate
image or text without handling the gradually deteriorating newspapers.

 

On the "to-do" list:

 

*   Experiment with a 21-megapixel Canon 5D Mk II in "RAW" mode and
Canon's Digital Photo Professional software to produce a higher-resolution
image and possibly eliminate the PhotoPaint/Photoshop step.  Also explore
open-source photo software.

*   Explore other OCR options to see if there is a program that will
reliably differentiate the columns in the old newspapers.

*   Fine-tune the set-up and procedures so that our staff, volunteers
and summer students can continue with digitizing the next 90 years' worth of
newspapers.
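
For the open-source photo software item above, the contrast clean-up could
in principle be scripted directly.  The following is only a sketch of one
common technique, a linear contrast stretch on grey levels; it is not what
Corel PhotoPaint does internally and not part of the current workflow.  The
function name stretch_contrast, the percentile defaults, and the sample grey
values are assumptions chosen for illustration.

```python
def stretch_contrast(pixels, low_pct=2, high_pct=98):
    """Linearly rescale 0-255 grey levels to use the full range.

    The grey level at the low percentile is mapped to 0 (black type) and
    the level at the high percentile to 255 (paper); values outside that
    range are clipped.
    """
    if not pixels:
        return []
    ordered = sorted(pixels)
    last = len(ordered) - 1
    low = ordered[last * low_pct // 100]
    high = ordered[last * high_pct // 100]
    if high <= low:
        # A flat image has no contrast to stretch; return it unchanged.
        return list(pixels)
    scale = 255.0 / (high - low)
    return [min(255, max(0, round((p - low) * scale))) for p in pixels]


if __name__ == "__main__":
    # Hypothetical grey levels: faded type around 60, yellowed paper around 200.
    sample = [60, 62, 198, 200, 199, 61]
    print(stretch_contrast(sample, 0, 100))
```

A real page would first have to be loaded into a list or array of grey
levels by an imaging library, which this sketch leaves out.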

 

Additional information sources:

 

*   Canon T2i Digital SLR Camera:
http://www.canon.ca/inetCA/products?m=gp&pid=3529  (+/- $1,000.00)

*   Corel PhotoPaint:  http://www.corel.com/  (+/- $600.00)

*   Readiris Pro 12:
http://www.irislink.com/c2-1684-189/Readiris-12-for-Windows.aspx  (+/-
$130.00)

*   Python Programming Language:  http://python.org/  (Open Source)

*   QNAP TS-439PRO II Turbo NAS server:
http://www.qnap.com/pro_detail_feature.asp?p_id=148  (+/- $1,200.00)

*   Copernic Desktop Search:  http://www.copernic.com/  (+/- $50.00)

 

If anyone else has any other comments or suggestions, we look forward to
reading them.  If there is additional interest in the physical setup and
procedures, we would be happy to provide more details and photographs.

 

Murphy Shewchuk

President, etc.

Nicola Valley Museum & Archives Association

 

 

******************************************* 
Murphy Shewchuk 
Freelance Writer/Photographer 
PO Box 400 
Merritt, BC  V1K 1B8 
Phone: 250 378-5930 
E-Mail: <murphy at sonotek.com> 
Web:  http://www.murphyshewchuk.com
Co-Author of "Okanagan Trips & Trails"
<http://www.fitzhenry.ca/detail.aspx?ID=8104>
Author of "Coquihalla Trips & Trails"
<http://www.fitzhenry.ca/detail.aspx?ID=9049>
and "Cariboo Trips & Trails"
<http://www.fitzhenry.ca/detail.aspx?ID=10179>
Visit my "iStockphoto Portfolio"
<http://www.istockphoto.com/file_search.php?action=file&userID=6022474&refnum=Murphy_Shewchuk>
******************************************* 
