Dr. Johannes Höhne: "Chargrid - Understanding 2D documents"

As part of the Research Colloquium on Information Systems and Data Science, Dr. Johannes Höhne of SAP Research Berlin will speak on "Chargrid - Understanding 2D documents".


Date and location: 13 June 2019, 12:15, C 14.203


Textual information is often represented through structured documents that have an inherent 2D structure. This is even more the case with the advent of new types of media and communication such as presentations, websites, blogs, and formatted notebooks. In such documents, the layout, positioning, and sizing can be crucial for understanding the semantic content, and they provide strong guidance to human perception. Natural language processing (NLP) addresses the task of processing and understanding plain text. However, it processes text by serializing it, thereby completely ignoring any 2D structure in the text. Computer vision (CV), on the other hand, can be used to process document images. In this way the structure is retained, but the document semantics must be learned all the way from the image pixels. We introduce a new representation for 2D documents, the character grid (chargrid), that retains the original 2D structure while directly encoding the characters in the text. The character grid representation can readily be used with, e.g., deep neural networks. We apply chargrid to the task of information extraction from invoices and show that it captures the best of both worlds, NLP and CV. Chargrid was accepted for presentation at EMNLP 2018 and is deployed in the production system of SAP Concur, where it currently processes tens of thousands of invoices every month.
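
To make the idea of a character grid more concrete, the following is a minimal sketch in Python and not the implementation described in the talk. It assumes that each character arrives with an integer bounding box (for example from OCR), and it fills every grid cell covered by that box with the character's vocabulary index. The names `VOCAB`, `build_chargrid`, and `one_hot`, the choice of index 0 for background and unknown characters, and the sample boxes are all hypothetical and chosen for illustration only.

```python
import string

# Hypothetical vocabulary: lowercase letters and digits get indices 1..36.
# Index 0 is reserved for background and for characters not in the vocabulary.
VOCAB = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase + string.digits)}


def build_chargrid(char_boxes, height, width, vocab=VOCAB):
    """Rasterize characters into a 2D grid of vocabulary indices.

    char_boxes: iterable of (char, x0, y0, x1, y1) with integer cell
    coordinates; x1 and y1 are exclusive. Boxes are clipped to the grid.
    """
    grid = [[0] * width for _ in range(height)]
    for ch, x0, y0, x1, y1 in char_boxes:
        idx = vocab.get(ch.lower(), 0)  # assumption: case-insensitive lookup
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                grid[y][x] = idx
    return grid


def one_hot(grid, num_classes):
    """Expand each index into a one-hot vector, giving height x width x classes."""
    return [
        [[1.0 if k == idx else 0.0 for k in range(num_classes)] for idx in row]
        for row in grid
    ]


# Made-up example: three characters on a grid that is 3 cells high and 8 wide.
boxes = [("T", 0, 0, 2, 3), ("o", 2, 0, 4, 3), ("1", 6, 0, 8, 3)]
grid = build_chargrid(boxes, height=3, width=8)
encoded = one_hot(grid, num_classes=len(VOCAB) + 1)

print(grid[0])  # [20, 20, 15, 15, 0, 0, 28, 28]
```

The resulting structure has the shape of an image whose channels correspond to characters rather than colours, which is why it can be passed to the same kind of network that would otherwise consume pixels. The layout is preserved, and the text does not have to be recovered from pixel values.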