This work was written to help people read using the technique called Proportional Reading. In this approach the eyes never move. You can read up to 700 words per minute and still feel like you are being read aloud to. Text can also be read out loud in a real human voice at normal reading speed as it is displayed one word at a time. To do this type of reading, text must first be in electronic form. The author spent three years developing an understanding of how to scan books easily so that any student could scan course material or other reading material into e-text for Proportional Reading. The material presented here is essentially chapters 7 and 8 of the Instruction Manual for Proportional Reading.
Scanning really involves three parts: 1) making a picture of a page (scanning), 2) using an Optical Character Recognition (OCR) program to convert the picture into typed text, and 3) cleaning up the text after this process. In actual practice, scanning and OCR decisions are made before scanning starts.
A book can be scanned in one of four ways:
1) from the actual book placed on the scanner bed and scanned one or two pages at a time,
2) from separated pages of the book placed on the scanner one or two pages at a time,
3) from actual book pages bulk-loaded into the automatic document feeder of the scanner, or
4) from copies of book pages, which are then either scanned individually or bulk-loaded into the automatic document feeder.
Scanning can be done almost effortlessly if you choose the right approach. This article will help you understand what this approach should be.
Scanning involves a little bit of learning, but once a book is turned into ascii text, it can be read by everybody in a school system without any repeating of these steps. It can be mailed as a diskette or sent by modem, etc.
First, a few words about copyrights. Be sure to get copyright permission first before any wide dispersal. Proportional Reading was designed to help people read who would otherwise not be able to benefit from printed text. Publishers almost universally are very helpful in allowing special treatment of their works for the learning disabled and physically disabled.
Furthermore, Proportional Reading is designed for average readers to use on their own reading material which they already have in their possession. This private, non-profit copying of books is within purchase rights, and it makes reading possible for many and increases purchase of books.
Most importantly, the basic thrust of Proportional Reading as applied to scanning books is to return to the original book for the graphics (charts, illustrations, drawings, graphs, pictures, etc.) and to see the original text layout. To this end Proportional Reading keys to the original page numbers of the original text. As a result, actual use of the basic textbook increases, not decreases. This will be especially true as millions of people become able to read and start to love learning. In all these ways Proportional Reading actually helps publishers.
Finally, the formatted or Proportionalized version of text requires a special program to play. So, the formatted text by itself is of little or no use without both the playing software as well as the original book.
In this article you will learn how to add colored pictures to scanned text. However, this process adds tremendously to file size and is therefore impractical except for short articles or articles saved on CD-ROM or removable cartridge. It is usually much easier to refer to the original book for pictures and other graphics.
For this reason automatic document feeders should not be used with actual book pages unless pages are copied first onto 20 lb paper with only one side of the paper used.
These books can easily be scanned. However, you must first separate the pages. Be happy about this. Scanning individual pages is much less physical work than scanning a book. In scanning individual pages there is no lifting and turning and pressing down on the book. You can sit comfortably in a chair and hardly move as you scan first one side of a page and then the other side of the same page and then the next page. Separate the book chapters into different manila folders.
A separated book has real value after scanning. It is often much easier to read a book this way than trying to keep the pages open. Also, bookbags become much lighter when only the relevant chapters are carried around. The trick is to keep the different chapters in different folders.
For this reason it is best to tape a black piece of paper on the underside of the cover of the scanner and scan the pages one page at a time, or scan from an open book where the pages are automatically backed up. Alternatively you can make one-sided copies of the text pages and run these copies through the document feeder. However, this costs a lot of money and requires a good quality copier. Regardless of how good the copier is, you will lose quality when you make copies and this will cause errors in scanning. When all is said and done it is usually best to scan one page at a time, or from an open book that will lie flat.
The best way to do this is to specially mark the text boxes and captions right after the page is scanned and OCR'd. Here again it is usually best to scan one separated page at a time, or from an open book that will lie flat.
The simple solution to this is to select just the sections of text and captions and text boxes and in the order you want, ignoring the pictures. The way to do this is to insert one page at a time and manually zone each page. This process is much faster than deselecting all the zones you do not want and then reordering the zones you have left from an automatically zoned page.
To re-add a picture in color, you first save the text in ascii format and open it up in your word processor. Then you scan the colored picture using the scanner alone (not the OCR program) and copy and paste the desired picture into the word processor document at the desired point. Choose "screen" resolution so the picture file will not be too big.
Another good trick is to place an open book on the scanner with a weight on top of it and scan two pages at a time. This way you don't have to personally press down on the book binding all the time the scanner is working. Use a gallon of water in a plastic jug for a weight. Build up an area next to the scanner to the same height as the lid, using telephone books or other books. Now you can just drag the water on and off the scanner lid (from the top of the pile). No lifting of the weight is required.
A book can be cut apart this way in about two minutes. If you don't want to reglue the pages, reset them in the cover (still completely intact) and add a rubber band. Frequently it is much easier to read loose pages than bound pages.
Re-gluing pages is very simple. Just add some wood glue to the binding and to the binding edge of the pages and stick the pages in the binding. Let set overnight. The new binding will work just as well as before.
Notes: Some pages are printed right to the center "gutter". This makes manually scanning one or two pages at a time impossible. It is also impossible to copy such pages. These pages have to be cut out to be scanned. Secondly, tiny paperback pages are too small to fit in most document feeders. These pages should be scanned manually, two pages at a time with deferred OCR, or copied first and then inserted into the automatic document feeder.
However, cutting and then re-gluing is not workable for library books.
However, you can quickly process any book this way, especially if you copy two pages at a time. You can easily copy 300 pages an hour, two pages at a time. These pages can be inserted into the document feeder as they come off the copier. Scanning can occur simultaneously. Putting copies of pages in a document feeder is a great solution for scanning borrowed books.
Note: Some small paperbacks are sometimes printed on very poor quality paper with too much ink. As a result, letters are badly formed and scanning even at the best quality level will not be successful. In this situation, the best approach is to get a library edition of the book to scan. Don't just waste your time.
If you are copying two pages at a time, it is important to make sure the scanner differentiates between the left and right page. This can be a problem if the margins and the gutter between the pages get reduced too far, because text from the two pages will then merge. It is also important to cut out all the heavy black areas around the margins and in the gutter. Otherwise, these areas will be read as characters.
One solution for this problem is to manually zone the image before scanning the next page.
If you want to do automatic zoning, there is an easy way around these problems. Mark either side of the copy window half way up its length. Always center the book gutter on this center line each time you set the book down on the scanner bed. Then manually zone the scanner for two zones (one for each page), cutting out the areas of black. Be sure to zone the earlier page first (otherwise, the second page will always come before the first). Now save the zone template and call it up for this book. Pages will be automatically separated in scanning and black areas will be ignored.
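The two-zone template described above can be pictured with a short sketch. This is only an illustration of the geometry, not the format the OCR program uses for its saved templates. The `Zone` type, the function name, the measurements in inches, and the assumption that the earlier page lies on the low side of the center line are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A rectangular zone on the scanner bed, in inches (hypothetical type)."""
    start: float  # near edge, measured along the length of the bed
    end: float    # far edge, measured along the length of the bed
    left: float   # near edge, measured across the bed
    right: float  # far edge, measured across the bed


def two_page_zones(bed_length, bed_width, gutter, margin):
    """Return two zones, earlier page first, split at the bed's center line.

    The book gutter is assumed to be centered on the mark half way up the
    length of the bed. The strip of width `gutter` around that mark and a
    border of width `margin` are left out, so black areas are never zoned.
    """
    center = bed_length / 2
    first_page = Zone(margin, center - gutter / 2, margin, bed_width - margin)
    second_page = Zone(center + gutter / 2, bed_length - margin,
                       margin, bed_width - margin)
    # The earlier page is listed first so it is recognized first.
    return [first_page, second_page]


if __name__ == "__main__":
    for zone in two_page_zones(bed_length=14.0, bed_width=8.5,
                               gutter=1.0, margin=0.5):
        print(zone)
```

Because the book is always set down with its gutter on the same center mark, the same pair of zones can be reused for every spread of the book.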
Alternatively, you can set the scanner to automatically zone both pages with no zones. Then after the scanning is finished and before the text recognition function starts, manually rezone each page. At the same time you can cut out graphics and headers. You can also make the page number of each page the first and top item on that page by selecting it first, even if the page number is on the bottom of the page. The best approach is not to zone the page number and to type it in later at the top of the page, or ignore it completely and delete it later.
Note: When you scan original individual pages (cut out from the book binding) one at a time, either manually or in a document feeder, there is no gutter problem, nor problem with black areas.
If you are scanning one page at a time you may want to zone, OCR and edit each page right after it is scanned. This is fine. However, if you are doing two pages at a time, or if you want to make maximum use of your scanner, and/or if you wish to have the OCR done automatically while you do something else, you should scan all the pages first into separate files which can be finished later.
Later you, or somebody else on another machine, zones the pages manually or has them automatically zoned when OCR is done. Then the pages are OCR'd and then edited. It's usually best to scan all the pages first.
If you are setting the brightness level yourself, be sure to scan and check just one page of text to begin with. It is important to check the scanning as it occurs. It is very important that the letters not have broken or missing parts. Cancel the scanning and move the brightness control towards darken if this is the case. Then rescan the page for a second check.
To do this, make sure the boxes for multiple pages and deferred recognition are not checked. The box for automatically saving a document should also be unchecked.
It is also very important that the letters do not run together. If this is happening, lighten the brightness control. What you are looking for is the point right between these two problems. Too much correction for one problem causes the other problem. Actually, the OCR program does not mind if the letters are very close, but it minds terribly if the letters are not completely formed or parts of letters are broken.
Don't have letters any thicker than necessary. If you do, open sections in letters like "a" and "e" will get blocked out. These letters will subsequently be misread by the character recognition program.
Start off by scanning just a single copy of text (one or two pages on the copy). Look at the little view window as the scan is progressing. Cancel the scan and reset the brightness control and re-scan as often as necessary, until you think you have scanned a single page of text correctly.
Then, when the scanning ends, look at the actual document. Doing this will uncover many setting errors that would otherwise go unnoticed. If you see on your scanned document a number of letters which are only part of the full letters they are supposed to be ("c" instead of "d" for example, "lll" instead of "M"), then you need to darken the brightness control.
Making this kind of check is the best way to save a lot of wasted time. Now is the point to take some extra time. Darken or lighten the brightness control and repeat the process until you have a clean document of text. Now start to scan. When you have this control adjusted correctly, there will be a minimum of spelling errors. All your downstream efforts at Proportionalizing and reading text will be frustrated if you have a lot of unnecessary spelling errors which you will have to correct or accept.
Remember: The easiest way around this whole chore is to use the slowest speeds (best quality) of the scanner. In these modes, brightness level is automatically adjusted. Note: the scanner will be operating as a greyscale scanner.
Also, sometimes a list will have several columns which get read as one unit of text. You may need to rezone the list into two or more columns in proper sequence. A quick look at how the list has been zoned will tell you if you need to make a correction. It is easy to delete the current zones on a page and redo the zones and OCR. It is also easy to delete the current page and re-scan it.
To manually scan one page after another, just press Command+L after you turn each page.
You will need extra hard disk space if you are going to use deferred recognition. You should plan on leaving at least 50 to 100 megs free, depending on how many pages of text you want to scan at a time before doing the text recognition. Forty pages of text can easily use up to 20 megs of hard disk space temporarily as a Caere file. After recognition the resulting text may only be 200k. All the bit maps with their large space requirements will have gone away or are ready for you to delete, depending on which choice you have made.
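The figures above can be turned into a rough planning calculation. The sketch below assumes the "forty pages, 20 megs" figure holds as an average of half a megabyte per scanned page; real pages will vary with size and scan quality, and the function names are made up for illustration.

```python
# Assumed average, taken from "forty pages ... up to 20 megs" above.
MB_PER_SCANNED_PAGE = 20 / 40  # 0.5 megs per page held for deferred recognition


def deferred_space_mb(pages):
    """Estimate the temporary disk space needed to hold `pages` scanned pages."""
    return pages * MB_PER_SCANNED_PAGE


def pages_that_fit(free_mb):
    """Estimate how many scanned pages fit in `free_mb` of free disk space."""
    return int(free_mb / MB_PER_SCANNED_PAGE)


if __name__ == "__main__":
    print("Space for 40 pages:", deferred_space_mb(40), "megs")
    print("Pages in 50 megs free:", pages_that_fit(50))
    print("Pages in 100 megs free:", pages_that_fit(100))
```

On this assumption, the recommended 50 to 100 megs of free space corresponds to roughly 100 to 200 pages scanned before recognition is run.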
For maximum use of the scanner, transfer documents of scanned only pages to another computer where zoning and OCR and spell checking and final editing will take place. If you don't have a network, use a removable cartridge hard drive. Transfer files will be large, but once processed the same cartridge can be reused over and over. This way one scanner can scan many books each day. Individual teachers or students can finish the OCR work on their own computers.
Note: Be sure to remove all deferred files from your hard drive after they have been turned into text. You can choose to do this automatically. Each deferred file is like a group of pictures, and takes up a tremendous amount of space on your hard drive. Left to accumulate, they will quickly eat up all your disk space.
1) The document feeder on the 4C takes pages as small as 5" x 7". The greyscale scanner has a minimum page size which is much larger than that of the 4C. This in turn means that middle-size paperbacks cannot be cut apart and fed automatically on the greyscale scanner. They must be copied first. The reason for all this is that pages feed from the side of the machine and from the side of the paper (longer direction) on the 4C, and from the top of the machine and the top of the paper on the greyscale scanner. A small page which measures too narrow for top loading often still has sufficient size for automatic loading if loaded from the side.
2) Pages are more stable when scanned in the 4C. This is because the paper moves in the greyscale scanner, while the scanner light moves in the 4C.
3) With the 4C, color pictures from original text can be scanned in and added after text is recognized and in WordPerfect. Obviously, a greyscale scanner can't add color.
4) The flatbed on the 4C is much longer than the flatbed on the greyscale scanner. This means that fairly large books can be laid down on the 4C and scanned two pages at a time. You simply cannot do this on the greyscale scanner flatbed.
5) Color adds a great deal to almost all presentations. The 4C allows students to make Proportional Reading articles using their own color pictures or color pictures downloaded from many other sources besides books.
6) The 4C can be used by other departments than just reading. Therefore, it can be better justified than the greyscale scanner, as the expense can be amortized over more people and more departments.
7) The 4C document feeder holds fifty separate pages while the greyscale scanner only holds twenty. Tending the machine to restock the document feeder can be cut way down with the 4C.
There are two places to do editing. The first editing is done in the Caere document right after OCR has taken place. The second editing is done in the saved ascii text which has been reopened in your word processor.
Start by adding the page number. As each page comes up you should add a page number indicator to the top of the page, like "p#" and then the actual page number. Then press return to put the page number info on its own line. If you have scanned two pages at once, mark the second page now. If you did not already cut out headers in the zoning process, cut out the headers now. All this is easy to do because the cursor automatically goes to the top of each page as it comes up.
Adding the page number to the top of the page is important to do for many reasons, one of which is that saved text in ascii format will not be saved as separate pages and it is otherwise very difficult to know where one page ends and the next page starts.
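To show why the marker matters, here is a minimal sketch of how saved ascii text could be split back into pages afterwards. It assumes each marker sits on its own line as "p#" followed directly by the page number (for example "p#12"); the function name and the sample text are made up for illustration and are not part of the Proportional Reading software.

```python
import re

# A marker line: "p#" followed by the page number, alone on its line.
PAGE_MARKER = re.compile(r"^p#\s*(\S+)\s*$")


def split_pages(text):
    """Split saved text into (page_number, page_text) pairs at each p# line.

    Any text that appears before the first marker is dropped, because it
    cannot be tied to a page of the original book.
    """
    pages = []
    current_number = None
    current_lines = []
    for line in text.splitlines():
        match = PAGE_MARKER.match(line)
        if match:
            if current_number is not None:
                pages.append((current_number, "\n".join(current_lines)))
            current_number = match.group(1)
            current_lines = []
        elif current_number is not None:
            current_lines.append(line)
    if current_number is not None:
        pages.append((current_number, "\n".join(current_lines)))
    return pages


if __name__ == "__main__":
    sample = "p#12\nFirst page text.\nMore text.\np#13\nSecond page text."
    for number, body in split_pages(sample):
        print("Page", number, "has", len(body.splitlines()), "line(s)")
```

Without the markers there is nothing in the saved text for a routine like this to find, which is the difficulty described above.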
After marking the page number, scroll down the text looking for any areas of colored text. These are areas the OCR program could not read. They need to be deleted or corrected. Usually they are parts of pictures or misread letters in bold or italicized sections. Delete or correct these colored areas.
Also check any columns to make sure they have been zoned correctly. If not, click back on the zone picture and redo all the zones. To do this press Command+a and then press "return". A window will appear asking you if you really want to remove all the zones from this page. Say "yes". Now click on the zoning tool and rezone the page. Then OCR just this page by typing Command+r. While you are moving your eyes down each page, make sure each paragraph ends as it should. Sometimes blank lines need to be deleted and separated text stitched together.
If text begins with an indent, occasionally the first or last full line of text will be at the beginning of the paragraph, instead of at the end. Look for this and cut and paste any such sections back to their rightful place.
Also, this is a good time to mark titles and subtitles, boxes, captions, and key words if you wish. It is easy to do this now because bolded words show up clearly as bolded and paragraph formatting is like the original. You can use the keyboard and shift key in the regular manner or you can quickly type marking combinations using the triple letter keystrokes and 555 and 554. If you are doing this in WordPerfect you can use the macro keystrokes listed just before the triple letter keystrokes. However, these WordPerfect macro keystrokes won't work in Caere documents. This is why you use the triple letter keystrokes in Caere documents.
for <:# (indicates a chapter title) Type: Option+a or aaa
for <:= (indicates a primary sub-title) Type: Option+s or sss
for <: (indicates a secondary sub-title) Type: Option+d or ddd
for <:- (indicates a tertiary sub-title) Type: Option+f or fff
for <:> (marks a selected name or word) Type: Option+g or ggg
for <:% (marks a new part of a book) Type: Option+h or hhh
for p# (marks a page number) Type: Option+z or zzz
for << (marks beginning of caption or box of text) Type: Option+Comma or 555
for << (marks end of caption or box of text) Type: Option+Period or 554
If you use the triple letters and 555 and 554 you need to run the change code program in WordPerfect which will change these keystrokes into the right code. These triple letter codes and 555 and 554 are usually used on the Caere documents where macro keystrokes won't work. They save a great deal of time. To run the change code program in WordPerfect just type: Control+Option+Command+c.
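The idea behind the change code step can be sketched in a few lines. This is only a stand-in for illustration, not the WordPerfect macro itself, whose workings are not shown here. The table follows the listing above exactly as printed (including the same "<<" code for both 555 and 554), and the function name is hypothetical.

```python
import re

# Typed shorthand -> marking code, copied from the listing above as printed.
CODES = {
    "aaa": "<:#",  # chapter title
    "sss": "<:=",  # primary sub-title
    "ddd": "<:",   # secondary sub-title
    "fff": "<:-",  # tertiary sub-title
    "ggg": "<:>",  # selected name or word
    "hhh": "<:%",  # new part of a book
    "zzz": "p#",   # page number
    "555": "<<",   # beginning of caption or box of text
    "554": "<<",   # end of caption or box of text (as the listing shows it)
}

# One pattern that matches any of the shorthand sequences.
SHORTHAND = re.compile("|".join(re.escape(key) for key in CODES))


def convert_codes(text):
    """Replace every shorthand sequence in `text` with its marking code.

    Caution: this simple version also converts a genuine "555" or "554"
    that happens to occur inside a number in the text.
    """
    return SHORTHAND.sub(lambda match: CODES[match.group(0)], text)


if __name__ == "__main__":
    print(convert_codes("zzz12"))
    print(convert_codes("aaaChapter One"))
```

All replacements are made in a single pass, so text that has already been converted is never scanned again.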
Now save the text as ascii text.
The first time a new name comes up add it to the vocabulary list and the word won't resurface as needing to be spelled. Many of the remaining spelling errors will be matters of adding hyphens between words.
Do not worry about paragraph indents. All these indents (if present) are automatically removed later during Proportionalizing.
Next, select and cut footnotes. Either discard them or paste them next to their reference number in the text, separated by a space or treat them like captions.
Furthermore, scanning usually does a terrible job on subscripts and superscripts as well as fancy math graphics. If you do not want to rework the math, it may be easier to just treat math sections like a graph and have the student refer to the appropriate page in the book. Type in the words "SeePage".
The third and best approach for math equations is to cut them from the text and re-scan them as a line drawing graphic which you copy and paste into the word processor text at the right point.
If you are working with a lot of books which you are not going to use that often, you may want to save them as text files. Then you can Proportionalize a whole book overnight as necessary. This means you can save the average book on just one diskette (1.4 megs.).
Alternatively, about seventy pages of Proportionalized text can be saved on each diskette (1.4 megs.).
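The storage figures in this article can be combined into a rough estimate of how many diskettes a book needs. The sketch below rests on several assumptions: a 1.4 meg diskette is treated as 1400k, plain text is taken at 200k per forty pages (the figure given earlier for text after recognition), and Proportionalized text is taken at seventy pages per diskette. Actual sizes depend on the book.

```python
import math

DISKETTE_KB = 1400                            # 1.4 megs, treated as 1400k
PLAIN_KB_PER_PAGE = 200 / 40                  # about 5k per page of plain text
PROPORTIONAL_KB_PER_PAGE = DISKETTE_KB / 70   # about 20k per Proportionalized page


def diskettes_needed(pages, kb_per_page):
    """Estimate how many diskettes are needed for `pages` pages of text."""
    return math.ceil(pages * kb_per_page / DISKETTE_KB)


if __name__ == "__main__":
    book_pages = 280  # hypothetical book length
    print("As plain text:", diskettes_needed(book_pages, PLAIN_KB_PER_PAGE))
    print("Proportionalized:",
          diskettes_needed(book_pages, PROPORTIONAL_KB_PER_PAGE))
```

On these assumptions, a page of Proportionalized text takes about four times the space of the same page as plain text.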
The best approach for a school is to keep all the books in current use on a file server in Proportional format on locked files. Each student downloads Proportionalized text as needed from the central server onto his or her own computer, or a lab computer, and plays it as he or she wishes, marking the text as desired and saving selections onto personal files. This way text can also be sent via modem over the phone lines to students at home. This process can operate automatically without involving school personnel.
Note: Be sure to choose "within selection" or you will cut out all the hard returns and/or tabs in the piece.