







Author Cleaning up original Burton "Kama Sutra" page scans -- need advice/help
Jon

2005-05-25, 7:14 pm

I'm now working to "clean up" the 182 page images from a recent scan
of a very rare and noteworthy public domain book. The cleaned-up scans
will be released to the public (such as given to the Internet Archive)
for free access. [For those interested, the book is the 1885 second
printing of the second edition of Sir Richard F. Burton's "Kama Sutra
of Vatsyayana".]

The scans were done at 600 dpi (optical) in 256-level greyscale (there's
no color in the book), to capture sufficient fine detail to aid in the
cleanup process. Of course, the book was chopped (the binding was
falling apart anyway) and each page scanned on a flat-bed, so there's
no page distortion caused by trying to scan a bound book. There are no
illustrations -- it's all black and white text.

I've already deskewed, cropped, centered and size-normalized all 182
pages. (For those interested, links to two sample partially-cleaned
pages are given below.)

In the cleanup process, I'd like to convert what I now have into
600-dpi *bitonal* (black and white) with uniform and nicely readable
character density, removal of "pepper", cleanup of larger blotches,
etc. I recognize there will be some handwork required, particularly to
remove larger "pepper" and blotches, and repair a few characters,
etc., but of course want to minimize handwork.

[Note that the purpose of the cleanup is for direct human use of the
scans, and not solely for OCR, which doesn't require the
planned level of cleanup. For example, I plan to produce a DjVu
version for direct reading. For those who will probably ask, the raw
page scans have already been uploaded to Distributed Proofreaders for
conversion to structured digital text.]

Unfortunately, what complicates the clean-up process is that the
original book is in poor and variable condition. The paper is quite
yellowed and darkened, and many pages are quite faded. Were the
original in mint condition with good, uniform ink-to-paper contrast, I
wouldn't be posting this request for advice. But the overall poor
quality and page-to-page variation are taxing my graphics abilities to
produce a clean finished product with reasonably readable and uniform
character density (at 600-dpi bitonal).
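
One common way to cope with page-to-page contrast variation is to pick
the black/white cutoff separately for each page instead of using one
fixed value. The sketch below uses Otsu's method (choose the grey level
that maximizes the between-class variance of "ink" and "paper") in plain
Python. It is only an illustration, not something taken from PSP9 or
Photoshop; the function names and the convention 1 = ink, 0 = paper are
assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level t that best separates ink from paper.

    pixels: iterable of ints in 0..255. Pixels <= t count as ink.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark = 0      # pixels at or below the candidate threshold
    sum_dark = 0    # sum of their grey levels
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        if n_dark == 0:
            continue            # nothing on the dark side yet
        n_light = total - n_dark
        if n_light == 0:
            break               # everything is on the dark side
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance (up to a constant factor).
        var = n_dark * n_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def to_bitonal(page):
    """Convert a greyscale page (list of rows) to 1 = ink, 0 = paper."""
    t = otsu_threshold(p for row in page for p in row)
    return [[1 if p <= t else 0 for p in row] for row in page]


# Hypothetical tiny "pages": a faded one and a darker one.
faded = [[180, 180, 120, 180],
         [180, 120, 120, 180]]
dark = [[110, 40, 110],
        [110, 40, 110]]
faded_bw = to_bitonal(faded)
dark_bw = to_bitonal(dark)
```

Because the cutoff is recomputed per page, a faded page and a darkened
page each get their own threshold. On real scans with uneven yellowing
across a single page, a locally adaptive threshold may work better than
one global value per page.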

Here are two sample pages, each about 4.5 megs in size (2550x3900
greyscale):

http://www.openreader.org/kamasutra/page031.png (good condition)
http://www.openreader.org/kamasutra/page106.png (poor condition)

I would assume that others have had similar needs and have come up
with various processing tricks and even built special tools to aid in
the clean-up process (e.g., how to auto-remove small "pepper", the
black spots one to a few pixels wide on the white background?). I look
forward to your advice and even help if you are interested (I will
upload all the partially-cleaned images somewhere if you want to help
with the actual clean-up process -- the whole set of images totals 680
megs).
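
For the small "pepper" specifically, one generic approach is to find
each connected group of black pixels in the bitonal image and erase the
groups that are too small to be part of a character. The sketch below
shows that idea in plain Python. It is a minimal illustration under
stated assumptions, not a PSP9 or Photoshop feature: the function name
remove_pepper, the max_size parameter, and the convention 1 = black,
0 = white are all made up for this example.

```python
from collections import deque


def remove_pepper(bitmap, max_size=4):
    """Return a copy of bitmap with small black specks erased.

    bitmap: list of rows, 1 = black (ink), 0 = white (paper).
    Any 8-connected group of black pixels containing max_size pixels
    or fewer is treated as pepper and set to white.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    out = [row[:] for row in bitmap]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search to collect one connected group.
            group = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                group.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the group only if it is small enough to be a speck.
            if len(group) <= max_size:
                for cy, cx in group:
                    out[cy][cx] = 0
    return out


# Hypothetical example: two isolated specks and one 6-pixel "stroke".
page = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0, 0],
]
cleaned = remove_pepper(page, max_size=2)
```

The size limit has to stay below the smallest real mark on the page
(periods, dots on the letter i, fine serifs), so it would need tuning
on a few sample pages before running it over the whole set. Pure Python
will be slow on 2550x3900 images; the point here is the logic, which
carries over to any tool that offers connected-component filtering.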

[As a final note, I use Paint Shop Pro 9, but do not have Photoshop.
But since PSP9 is fairly powerful, I assume that many, if not all,
recommended Photoshop processes will map over to PSP9.]

Thanks!

Jon Noring
Lorem Ipsum

2005-05-25, 7:14 pm


"Jon" <jon@noring.name> wrote in message
news:e56991p513k1bqk38i9j4mklnsgiuf4cl1@4ax.com...
> I'm now working to "clean up" the 182 page images from a recent scan
> of a very rare and noteworthy public domain book. The cleaned-up scans
> will be released to the public (such as given to the Internet Archive)
> for free access.


I only looked at the 'poor' example, page 106, which is all text, so I'll
address that one: when we have exactly those cases, we use OCR unless there
is some historical significance to the typeface (aka: font). That way you
get perfect type. For illustrations, well, you would have to show us one.
Your server is pretty slow, so downloading 9 MB was discouraging enough that
I'm moving on.

Rendering scanned text clearly is not a job for an image-processing program.


David Littlewood

2005-05-27, 7:15 pm

In article <11999a6k4uq7a6c@news.supernews.com>, Lorem Ipsum
<Lorem@ipsum.xxx> writes
>
>"Jon" <jon@noring.name> wrote in message
>news:e56991p513k1bqk38i9j4mklnsgiuf4cl1@4ax.com...
>
>I only looked at the 'poor' example, page 106, which is all text, so I'll
>address that one: when we have exactly those cases, we use OCR unless there
>is some historical significance to the type-face (aka: font). That way you
>get perfect type. For illustrations, well you would have to show us one.
>Your server is pretty slow so downloading 9mb was discouraging enough that
>I'm moving on.
>
>Rendering scanned text clearly is not a job for an image-processing program.
>
>

True; however, I did download both, and found that a very simple
increase in contrast in PS (+75% for the "poor" image and +50% for the
"good") gave perfectly readable images. It didn't remove the small
blemishes, but I did not find them obtrusive. Saved as best-quality
JPEGs, they took up only 220-350 KB each. I would imagine that for
distribution a PDF file would be the most suitable. I'm certainly no
Photoshop expert, but it took me about 1 minute each.
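
The general idea of a contrast increase on a greyscale page can be
sketched as a linear stretch around a pivot grey level, with the result
clipped to 0..255. This is only an illustration in plain Python: the
function name, the factor and pivot parameters, and the formula itself
are assumptions for this example, and the factor does not correspond to
Photoshop's "+50%" or "+75%" settings, whose exact mapping is not given
here.

```python
def stretch_contrast(page, factor=2.0, pivot=128):
    """Push grey levels away from pivot by factor, clipped to 0..255.

    page: list of rows of ints in 0..255.
    factor > 1 increases contrast; factor == 1 leaves the page as is.
    """
    def adjust(p):
        value = round(pivot + (p - pivot) * factor)
        return max(0, min(255, value))

    return [[adjust(p) for p in row] for row in page]


# Hypothetical row: darkish ink, mid grey, light paper.
row = [[100, 128, 200]]
stretched = stretch_contrast(row, factor=2.0)
```

On a yellowed page the paper is darker than pure white, so a pivot
somewhere between the ink and paper levels of that page would probably
do better than the default mid-grey.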

The trouble with OCR is that you will have to spend many days
proofreading the output - and even then (if my experience is anything to
go by) you won't catch all the silly errors.

David
--
David Littlewood

