Identification of embedded mathematical expressions in scanned documents
Is part of
Proceedings of the 17th International Conference on Pattern Recognition, 2004 (ICPR 2004), 2004, Vol. 1, p. 384-387
Place / Publisher
IEEE
Year of publication
2004
Source
IEEE Xplore
Descriptions/Notes
Efficient extraction of mathematical expressions is considered an important pre-processing step for applying existing OCR systems to convert scientific papers into electronic format. This paper presents a technique for extracting embedded (in-line) expressions. The proposed method first runs an existing OCR system on the input document. Several features are then computed at sentence level to spot sentences that contain expressions. These include word n-grams: a statistical analysis of a corpus of scientific documents reveals that the word-level n-gram profile of sentences containing embedded expressions is quite different from that of sentences without any expression. Expression zones are then pinpointed by exploiting the OCR system's inability to handle expressions and by using common typographical conventions followed in typesetting mathematical expressions. Experimental results on a sizeable dataset show the high efficiency of the proposed technique.
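The sentence-spotting step can be illustrated with a minimal sketch. This record does not specify the paper's actual features, n-gram order, or classifier, so everything below is an assumption made for illustration: word bigrams, add-one smoothing, a log-likelihood-ratio score, the function names, and the toy training sentences are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def word_bigrams(sentence):
    """Return the word-level bigrams of a sentence, with boundary markers."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))

def build_profile(sentences):
    """Count bigram occurrences over a list of sentences."""
    profile = Counter()
    for s in sentences:
        profile.update(word_bigrams(s))
    return profile

def log_likelihood_ratio(sentence, with_expr, without_expr):
    """Sum over bigrams of log P(bigram | with) / P(bigram | without).

    Add-one smoothing is an assumption of this sketch, not the paper's method.
    """
    vocab = len(set(with_expr) | set(without_expr)) + 1  # +1 for unseen bigrams
    total_with = sum(with_expr.values())
    total_without = sum(without_expr.values())
    score = 0.0
    for bg in word_bigrams(sentence):
        p_with = (with_expr[bg] + 1) / (total_with + vocab)
        p_without = (without_expr[bg] + 1) / (total_without + vocab)
        score += math.log(p_with / p_without)
    return score

def contains_expression(sentence, with_expr, without_expr, threshold=0.0):
    """Flag a sentence whose bigram profile looks more like the 'with' class."""
    return log_likelihood_ratio(sentence, with_expr, without_expr) > threshold

# Hypothetical toy corpora; a real system would use OCR output of many papers.
WITH_EXPR = build_profile([
    "let x denote the input",
    "where n is the number of samples",
    "for all i such that the sum is finite",
])
WITHOUT_EXPR = build_profile([
    "the results are shown in the table",
    "we thank the reviewers for their comments",
    "this section reviews related work",
])

if __name__ == "__main__":
    for s in ["where k is the number of classes",
              "the results are shown in the figure"]:
        print(s, "->", contains_expression(s, WITH_EXPR, WITHOUT_EXPR))
```

A sentence flagged this way would only be a candidate; the abstract describes a further step that locates the expression zone inside the sentence using OCR failures and typographical cues, which this sketch does not attempt.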