Segmentation based on character recognition is one of the most popular methods for segmenting mixed Chinese/English documents. However, rejecting outliers has always been the bottleneck of this approach. This paper proposes a new method to alleviate the problem. We assign a language attribute to each segment wherever possible, and then merge or split segments according to that attribute. First, we construct a mixed OCR engine covering Chinese radicals, English characters, and some English character pairs. Second, we train the engine on English negative samples to improve its ability to reject outliers. Finally, we determine the language of each segment based on the mixed OCR engine and a complexity analysis of the local pattern. Test results show encouraging performance.
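
As a rough illustration of the label-then-merge idea, the following Python sketch assigns a language to each segment and merges adjacent Chinese-labelled segments into one character candidate. It is not the paper's implementation: the `Segment` fields, the `accept` and `complexity_cut` thresholds, the rule that falls back to complexity when the OCR scores are both low, and the `max_width` merge limit are all assumptions introduced here, since the abstract does not specify how the OCR output and the complexity analysis are combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    left: int              # left pixel column of the segment
    right: int             # right pixel column of the segment
    chinese_score: float   # hypothetical OCR confidence for the Chinese-radical classes
    english_score: float   # hypothetical OCR confidence for English char / char-pair classes
    complexity: float      # hypothetical local-pattern complexity in [0, 1]
    language: str = "unknown"


def label_language(seg: Segment, accept: float = 0.6,
                   complexity_cut: float = 0.5) -> str:
    """Return "zh" or "en" for one segment (assumed decision rule)."""
    best = max(seg.chinese_score, seg.english_score)
    if best < accept:
        # The OCR engine rejects the segment as an outlier, so fall back
        # on complexity: denser patterns are assumed to be Chinese.
        return "zh" if seg.complexity >= complexity_cut else "en"
    return "zh" if seg.chinese_score >= seg.english_score else "en"


def merge_chinese_runs(segments: List[Segment], max_width: int) -> List[Segment]:
    """Label each segment, then merge adjacent Chinese segments whose
    combined width stays within max_width (assumed character width)."""
    merged: List[Segment] = []
    for seg in segments:
        lang = label_language(seg)
        cur = Segment(seg.left, seg.right, seg.chinese_score,
                      seg.english_score, seg.complexity, lang)
        prev = merged[-1] if merged else None
        if (prev is not None and prev.language == "zh" and lang == "zh"
                and cur.right - prev.left <= max_width):
            # Two radicals that plausibly belong to one Chinese character.
            merged[-1] = Segment(prev.left, cur.right,
                                 max(prev.chinese_score, cur.chinese_score),
                                 max(prev.english_score, cur.english_score),
                                 max(prev.complexity, cur.complexity), "zh")
        else:
            merged.append(cur)
    return merged
```

In this sketch only merging is shown; splitting an over-wide segment would follow the same pattern, using the language label to choose the expected character width.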