An Interactive System for Off-Line Text-Recognition

Susan Laflin
School of Computer Science
University of Birmingham, England.
September 1993
email : S.Laflin@cs.bham.ac.uk

1. Introduction.

Last year at the Bologna Conference, I discussed the necessary stages in extracting information from manuscripts and other documents and storing it in databases on a computer (Ref 1). The four main stages in this process were identified as:

a) The original document.
b) Copies such as facsimiles, photocopies and computer images.
c) Text, including printed versions, ascii and other machine-readable copies.
d) Entries in a database.

For the Computer Scientist, the most interesting problems lay in the transfer between these stages and for myself, the area of greatest interest was the transfer from the image of a page of manuscript to an ascii version of the same text. I have been working on this aspect of text recognition during the past year and I now report on my progress.

2. General Method.

Much work has been done on text recognition of all kinds, and the existing OCR packages can recognise printed text with varying degrees of success. The recent workshop at Leiden in June 1993 (Ref 2) considered the current position regarding OCR and its usefulness for historical records, but OCR is not suitable for handwritten records. Because most existing systems start from printed documents and then attempt to extend the same approach to handwritten documents, they are best suited to recognising single letters or strokes and then building them up into letters and words. A recent issue of Pattern Recognition Letters (Ref 3) surveys the current position and indicates some success in recognising handwriting with a very limited vocabulary, such as may be found in postal addresses or on money orders.

In spite of many years' work on this problem, the human scholar is still much more successful at reading faded or poor-quality manuscripts than any existing computer system. This suggests that an approach which imitates the method used by a human scholar might be more useful for such cases. Only seldom does the human reader attempt to spell out the words, letter by letter, and this usually occurs when an unfamiliar word is encountered. Usually the whole word or phrase is recognised and accepted when it matches a limited set of possibilities for the present context. As described in the text by Hector (Ref 4):

"if, having read the beginning of a phrase, he knows how it is likely to continue and end, his progress is bound to be easier; his mind will lead his eye."

This implies that a successful method must include much of the expertise of the historian, both information about the likely content and vocabulary of the manuscript and knowledge of the form and appearance of words written in that particular hand. Then the method entails comparing the image of a known word with the various word-images within the manuscript in order to identify matches between them. This is the case, whether the image has been generated explicitly within a computer system and the comparisons are carried out mathematically within a computer or, more traditionally, when the image exists only in the mind of the reader and the comparisons are also carried out within the mind of the reader. The latter process may have become completely unconscious, once the historian has learnt how to read a particular hand.

Such a project is very ambitious and will require many years of work by many people. To achieve this, it is desirable to break it down into smaller components, in which individuals can achieve a noticeable contribution within a reasonable length of time. As a first step, the following subdivisions are suggested:

Generation of Word Images.

This will produce software which will accept the ascii string for any given word and generate the image of this word in the desired hand. It will require a detailed study of the different hands and their description in computer terms, and relies on the assumption that there is an ideal standard for the hand. This is supported by Hector (Ref 4), who states that "the student ..... does come to carry in his eye something like a set of standards" and this suggests that the original writers of the manuscripts also had in their minds a standard form for the letters they were writing. The task is to analyse the letters and describe this standard in terms which can be coded within the computer, and then use it to generate images which can be used for comparison with word-images within the manuscript. Hopefully once one or more standards have been produced for each hand, different writers, or groups of writers, can be identified by their deviations from the standard.

Classification of Documents.

Archivists already classify the documents in their keeping in general terms such as "will", "inventory", "diary" and so on. Each of these categories indicates to the historian an expectation of the language and content of the document, and in some cases an idea of the probable structure of the manuscript. To incorporate this knowledge into a computer system, it will be necessary to define the structure in some way and to prepare lists of likely words and phrases for each type of document.

Image Segmentation.

Since the method depends on comparison of word-images, it will be necessary to segment the document into lines of text and then into words. This can probably be achieved by an extension of the method described by Shapiro et al (Ref 6), with possible on-line editing for the difficult cases.
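As a rough illustration of this step, the sketch below splits a clean binary image first into lines and then into words by looking for empty gaps in the row and column totals. This is a plain projection-profile approach, used here only as a stand-in; it is not the method of Shapiro et al. The function names, the image format (rows of 0/1 values) and the word_gap parameter are all assumptions made for the example.

```python
def runs(profile, min_gap=1):
    """Return (start, end) index pairs of non-empty runs in a projection
    profile, merging runs separated by fewer than min_gap empty entries."""
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    merged = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)   # gap too small: same unit
        else:
            merged.append((s, e))
    return merged


def segment(image, word_gap=3):
    """Split a binary image (list of rows of 0/1) into word boxes.
    Returns one list per text line of (top, bottom, left, right) boxes;
    bottom and right are exclusive. word_gap is an assumed tuning value."""
    row_profile = [sum(row) for row in image]
    lines = []
    for top, bottom in runs(row_profile):
        width = len(image[0])
        col_profile = [sum(image[r][c] for r in range(top, bottom))
                       for c in range(width)]
        lines.append([(top, bottom, left, right)
                      for left, right in runs(col_profile, word_gap)])
    return lines
```

Real manuscripts have skewed lines, touching descenders and uneven spacing, which is where the on-line editing mentioned above would be needed.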

Comparison of Word Images.

The comparison of word-images is a vital part of this project, and is likely to be a major area of research for many years to come. Some of the earlier work on recognising features within the words can be adapted for this purpose, but another important area relates to the use of neural network methods for this problem. In each case, it is essential to determine to what degree of reliability the method being studied can be used to determine whether or not the two images represent the same word.
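To make the idea of a similarity measure concrete, the following sketch computes the overlap of the inked pixels of two binary images of the same size (the Jaccard measure). It is only one simple candidate, chosen for illustration; it is neither a feature-based nor a neural network method, and the function name and image format (rows of 0/1 values) are assumptions.

```python
def similarity(a, b):
    """Jaccard overlap of the ink pixels of two equal-sized binary images.
    Returns a value between 0.0 (no shared ink) and 1.0 (identical)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    both = either = 0
    for ra, rb in zip(a, b):
        for pa, pb in zip(ra, rb):
            if pa and pb:
                both += 1      # ink in both images
            if pa or pb:
                either += 1    # ink in at least one image
    # Two blank images are treated as identical.
    return both / either if either else 1.0
```

A measure of this kind is sensitive to small shifts between the two images, so in practice the images would need to be aligned and scaled before comparison.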

The Prototype System.

Once methods exist for at least one example in each of the areas, it will be possible to construct a prototype system and start testing it with actual users. It is likely that any such system will need to evolve with use, and this will need experience with a number of users in order to design a final system which does meet the needs of many historians at all levels of expertise.

3. A Vision of the Future.

To imagine how such a system might appear in the future, let us consider an archive, such as a Public Record Office, which is in the process of transferring its collection of assorted documents into a database of document-images stored on a computer system. Some of the users' requests will be for documents which have already been digitised and these will be immediately available on the screen of a terminal. Other requests will be for documents which have yet to be processed, and these users will experience some delay while the document is scanned, after which it will be available in the same way. At present all users have to wait while the documents they have ordered are located and fetched, and this delay will not apply to those documents which have already been digitised. This method of working will automatically ensure that those documents most in demand are entered first and will allow the staff to digitise the collection in the course of their normal duties.

Some users will not wish to have anything to do with computers and for these there will be the option of obtaining hard-copy of the document, which may be taken away and studied at leisure. Others will wish to examine the image on the screen before choosing a small section of hard-copy and this facility will be easily available. Yet others will wish to make full use of all the available tools and they will get the maximum benefit from the system. This will allow the user to choose a hand from the data files available and either choose an existing file of words and phrases for the type of document under consideration or type in his/her own list of words for later use. Once these data files have been loaded, the system may proceed.

Within the system, the user has the choice of "automatic" or "interactive" modes and may switch between them at will. In automatic mode, the system generates images for each of the words in the data file in turn and then searches the document for matches. All matches whose similarity is greater than some threshold (which may also be varied by the user) are stored in descending order of similarity and may be viewed at a later stage. Once all the words in the data file have been compared with all the words in the document (or at an earlier stage if the user becomes impatient), the manuscript is displayed, word by word, with the matches displayed below it and the user is invited to scan through it and comment on the progress so far.
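The automatic mode described above might be organised along the following lines. This is a sketch only: the generator of word-images and the similarity measure are passed in as functions, since neither is fixed at this stage, and the function name, argument names and default threshold are assumptions.

```python
def automatic_search(words, document_images, generate, similarity,
                     threshold=0.5):
    """For each word in the data file, generate its image and compare it
    with every word-image in the document. Matches whose similarity is
    greater than the threshold are returned as (score, position, word)
    tuples in descending order of similarity."""
    matches = []
    for word in words:
        template = generate(word)          # image of the word in this hand
        for position, image in enumerate(document_images):
            score = similarity(template, image)
            if score > threshold:
                matches.append((score, position, word))
    # Highest similarity first; ties are ordered by position in the document.
    matches.sort(key=lambda m: (-m[0], m[1]))
    return matches
```

Because the threshold is an ordinary argument, the user can vary it and repeat the search without any other change.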

In interactive mode, the user is invited to scan through the document and indicate particular words and type in their ascii strings. These are taken to be 100% correct (unless the user indicates otherwise) and the description of the hand is modified to make these an exact fit. After each such identification, the user chooses whether to continue in interactive mode or to switch to automatic mode and do a complete search with the new, updated version of the hand. In addition, when scanning through the manuscript after an automatic search, the user may choose one of the matches and a similar update of the hand is carried out. Thus, during such a session, the user may build up a transcription of the manuscript. At any stage, the session may be ended and hard-copy requested. This may be a mixture of ascii and images, with all matches above a certain value substituted, or it may include the best match for each word, however poor that may be, or it may include a list of possible matches with the probabilities included against each. Hard copy of the word-images may be included or not as desired.
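Two of the hard-copy options can be sketched as follows, assuming that each word position in the manuscript carries a list of candidate (ascii, score) pairs. The function name, the data layout and the "[image n]" placeholder for an unsubstituted word-image are all assumptions made for the example.

```python
def transcribe(candidates, threshold=None):
    """candidates holds, for each word position, a list of (ascii, score)
    pairs. With a threshold, a position is replaced by its best match only
    if that score exceeds the threshold; otherwise it is left as an image
    placeholder. With threshold=None the best match is always used,
    however poor it may be."""
    output = []
    for position, matches in enumerate(candidates):
        best = max(matches, key=lambda m: m[1], default=None)
        if best is not None and (threshold is None or best[1] > threshold):
            output.append(best[0])
        else:
            output.append("[image %d]" % position)
    return " ".join(output)
```

For example, with candidates [[("will", 0.9), ("wall", 0.6)], [("of", 0.3)]] and a threshold of 0.5, the first word is substituted and the second is left as an image.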

Some users may wish to return at a later date and continue from the previous position. The number of users who can have storage space on the computer system for this purpose will have to be a decision of the manager of the archive, who will also decide whether the final transcription should be retained for general use or not. These are all possibilities within the system, but other considerations, usually based on available funds, will decide whether they are used or not.

4. Generation of Gothic Text.

Returning to the present, it is necessary to start with one particular example of a hand and experiment with the different parameters which control the image produced. Modern texts on calligraphy (e.g. Ref 5) show how to produce the ancient handwriting using modern equipment. The main difference is the use of modern pen-nibs, which remain constant in shape and size for a large section of the text and, when worn, can be replaced by an identical nib. Under these conditions, the most important factors in controlling the shape of the final image are pen-path, pen-width and pen-angle. These are indicated in figure 1.

The pen-path is taken to be the path of the centre of the nib across the paper. For each position, a line is drawn, of length 2w (w either side of the centre) at an angle A to the horizontal and the final image is produced as the sum of all these lines. Instead of taking the image from the document and thinning it, the ascii string is used to produce the image as though drawn by a pen of thickness 2w and then the images are compared.
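The construction just described can be sketched in a few lines. The code below is a minimal illustration, not the program used for the figures: the image is held as a set of pixel coordinates, the pen-path is a polyline with at least two points, and the function names and the half-pixel sampling interval are assumptions.

```python
import math


def nib_line(cx, cy, w, angle_deg):
    """Pixels covered by the nib when its centre is at (cx, cy): a line of
    length 2w (w either side of the centre) at angle_deg to the horizontal."""
    a = math.radians(angle_deg)
    dx, dy = math.cos(a), math.sin(a)
    steps = max(1, int(math.ceil(2 * w)) * 2)   # about half-pixel spacing
    pixels = set()
    for i in range(steps + 1):
        t = -w + 2 * w * i / steps              # runs from -w to +w
        pixels.add((round(cx + t * dx), round(cy + t * dy)))
    return pixels


def draw_path(path, w, angle_deg):
    """Image produced by moving the nib along a polyline pen-path: the union
    of the nib lines at positions sampled along each straight segment."""
    image = set()
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        n = max(1, int(math.ceil(max(abs(x1 - x0), abs(y1 - y0)))) * 2)
        for i in range(n + 1):
            cx = x0 + (x1 - x0) * i / n
            cy = y0 + (y1 - y0) * i / n
            image |= nib_line(cx, cy, w, angle_deg)
    return image
```

With w set to zero the result is simply the pen-path itself. A vertical stroke drawn with the nib horizontal (angle 0) is 2w wide, whereas the same stroke drawn with the nib vertical (angle 90) is a hairline, which shows how the pen-angle controls the thick and thin strokes.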

Gothic text was chosen as an example for two reasons. The script is very angular, which means that the pen-path may be represented by a succession of straight line segments and so the polyline representation is appropriate. This is quick and easy to implement. Other hands are much more curved and will require other forms of representation - the range of possible forms was discussed in an earlier paper at the Leiden workshop (Ref 7) and will not be repeated here. The other advantage is the fact that Gothic was not a cursive script and so the words may be built up by the juxtaposition of successive letters without having to join them into a single curve. This is again quicker and easier to implement for an initial trial.
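Building a word by juxtaposition then amounts to shifting the polylines of each letter by the accumulated width of the letters before it. In the sketch below, the letter table, its coordinates and its advance widths are invented placeholders, not measured Gothic letterforms, and the names are assumptions.

```python
# Hypothetical letter table: each letter has a list of strokes (polylines in
# letter-local units) and an advance width. Placeholder shapes only.
LETTERS = {
    "i": {"strokes": [[(0, 0), (0, 10)]], "advance": 4},
    "l": {"strokes": [[(0, 0), (0, 16)]], "advance": 4},
    "n": {"strokes": [[(0, 0), (0, 10)], [(0, 10), (6, 10), (6, 0)]],
          "advance": 10},
}


def word_paths(word, letters=LETTERS):
    """Place the letters of a word side by side, with no joining strokes,
    and return the list of pen-paths for the whole word."""
    paths = []
    x_offset = 0
    for ch in word:
        glyph = letters[ch]                 # KeyError for an unknown letter
        for stroke in glyph["strokes"]:
            paths.append([(x + x_offset, y) for x, y in stroke])
        x_offset += glyph["advance"]
    return paths
```

Each of the returned pen-paths can then be drawn with the chosen pen-width and pen-angle to give the word-image. A cursive hand would need extra joining strokes between letters, which this simple scheme does not attempt.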

The other parameters are pen-width and pen-angle. Figure 2 shows an example of the same letter for several different pen-widths, pen-width zero corresponding to the path for this letter. This gives an indication of the effect of varying this parameter, and it may be seen that for these units, a width of 5 gives a good representation.

The other variable parameter is the pen-angle. That is the angle the end of the nib makes with the horizontal on the page. Figure 3 shows the same letter and width for five different angles, although the main difference is seen at the end of each stroke, and to some extent in the thickness of the cross-bar. An angle of 35 degrees is recommended by the modern text, and this will be used in future as the default value.

These representations are derived from modern texts describing the production of lettering using modern equipment. How relevant is this to the production of manuscripts under the conditions of earlier centuries, in particular when each scribe had to trim his own quill pen and the size and shape of the nib varied continually throughout production of the document? This also ignores the variation in width due to differing pressure as the letter was written. With all these possible variables, is there any point in attempting to simulate the process on a computer?

However one frequent comment about such manuscripts relates to the consistency of the lettering. It is not a random process, but is closely guided by the intelligence and experience of the scribe. It does seem likely that these many variables are in fact used to cancel one another out and produce a uniform result. This uniform result can be duplicated on the computer by the above method, even though it is a gross over-simplification of the actual situation. Pen-thickness and pen-angle may be set to that intended by the scribe and then the main effect will be minor variations in pen-path as the individual letters are drawn. This will then allow the generation of word images which will be useful for comparison.

5. Conclusions.

This paper describes the current state of this research and the approaches to be attempted over the next few years. Comments are welcomed, especially from possible collaborators. Later stages will be reported at future conferences.

6. References.

1. Susan Laflin "Processing Historical Information with the Aid of Computers" paper presented at AHC conference at Bologna 1992. Proceedings to be published.

2. "Optical Character Recognition in the Historical Discipline." International Workshop organised by N.H.D.A. and N.I.C.I. Published by Max-Planck-Institut für Geschichte, Band A18. 1993. ISBN 3-928134-97-3.

3. Pattern Recognition Letters Vol 14 No 4 1993. Special edition on Handwriting Recognition.

4. L.C.Hector "The Handwriting of English Documents" second edition. Edward Arnold 1966.

5. D. Hardy-Wilson "The Encyclopedia of Calligraphy Techniques." Headline Books 1990. ISBN 0-7472-7931-4.

6. V.Shapiro et al "Handwritten Document Image Segmentation and Analysis." Patt.Rec.Letters Vol 14 No 1 p71.

7. Susan Laflin. "An Interactive System for the Recognition of Manuscripts" Optical Character Recognition in the Historical Discipline. Band A18 August 1993 p 53-58.