Optical character recognition (OCR) technology is typically used to convert printed hard-copy material into electronic form. Many presentational artifacts, such as end-of-line hyphenation and running headers and footers, are converted literally. These artifacts can hinder proximity and exact-match searching. This thesis develops an algorithm to extract running headers and footers from electronic documents generated by OCR. The method associates each page of the document with its neighboring pages and detects headers and footers by comparing the page with those neighbors. Experiments are also conducted to test the effectiveness of the algorithm.
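The neighbor-comparison idea can be illustrated with a minimal sketch. It is not the thesis's actual algorithm, whose details are not given here. It assumes each page is a list of text lines, that candidates are the first and last few lines of a page, and that a line counts as a running header or footer when a fuzzily similar line appears at the same position on a nearby page. The function names and the `window`, `depth`, and `threshold` parameters are hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    """Collapse whitespace, lowercase, and mask digit runs so that
    lines differing only in page number compare as equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def similar(a, b, threshold):
    """Fuzzy equality that tolerates a few OCR character errors."""
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold


def find_running_lines(pages, window=2, depth=2, threshold=0.85):
    """Return, for each page, the set of line indices flagged as header/footer.

    pages:     list of pages, each a list of text lines
    window:    pages on each side treated as neighbors (assumed value)
    depth:     lines at the top and bottom considered candidates (assumed value)
    threshold: minimum similarity ratio for a match (assumed value)
    """
    flagged = []
    for i, page in enumerate(pages):
        lo, hi = max(0, i - window), min(len(pages), i + window + 1)
        neighbors = [pages[j] for j in range(lo, hi) if j != i]
        hits = set()
        for offset in range(depth):
            for from_top in (True, False):
                idx = offset if from_top else len(page) - 1 - offset
                if not 0 <= idx < len(page):
                    continue
                target = normalize(page[idx])
                # Compare against the line at the same position on each neighbor.
                for other in neighbors:
                    j = offset if from_top else len(other) - 1 - offset
                    if 0 <= j < len(other) and similar(target, normalize(other[j]), threshold):
                        hits.add(idx)
                        break
        flagged.append(hits)
    return flagged


def strip_running_lines(pages, **kwargs):
    """Return copies of the pages with the flagged lines removed."""
    flagged = find_running_lines(pages, **kwargs)
    return [
        [line for n, line in enumerate(page) if n not in hits]
        for page, hits in zip(pages, flagged)
    ]
```

Masking digits lets footers such as "Page 1" and "Page 2" match each other, and the similarity threshold allows for OCR misreads in an otherwise identical header. A real system would likely need further handling, for example alternating left-page and right-page headers, which this sketch covers only to the extent that `window` reaches the page two positions away.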