Label-Value Extraction from Documents Using Co-SSL Framework
Journal
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ISSN
0302-9743
Date Issued
2022-01-01
Author(s)
Sara, Sai Abhishek
Singh, Maneet
Pegu, Bhanupriya
Singh, Karamjit
Abstract
Label-value extraction from documents refers to the task of extracting relevant values for corresponding labels/fields. For example, it encompasses extracting the total amount from receipts, the date value from invoices/patents/forms, or the tax amount from receipts/invoices. Automated label-value extraction has widespread applicability in real-world scenarios of document understanding, book-keeping, reconciliation, and content summarization. Recent research has focused on developing label-value extraction models; however, to the best of our knowledge, limited attention has been given to developing a lightweight, compact label-value extraction module that generalizes across different document types. Since a deployed model is often required to process different types of documents for the same label/field type, this research proposes a novel Context-based Semi-supervised (Co-SSL) framework for this purpose. The proposed Co-SSL framework first identifies candidates for each label/field and then generates a context for each candidate based on spatial cues. Further, novel data augmentation strategies are proposed that are specifically applicable to the problem of information extraction from documents. The extracted information (candidate and context) is then provided to a deep-learning-based model trained in a novel semi-supervised setting, making it applicable to real-world scenarios with limited training data. The performance of the Co-SSL framework is demonstrated on three challenging datasets containing different document types (receipts, patents, and forms).
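The first two stages named in the abstract (candidate identification, then context generation from spatial cues) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Token` class, the regular expression used to spot amount-like candidates, and the choice of Euclidean nearest neighbours as the spatial cue are all hypothetical.

```python
import re
from dataclasses import dataclass
from math import hypot


@dataclass
class Token:
    """Hypothetical OCR token: text plus the centre of its bounding box."""
    text: str
    x: float
    y: float


# Assumed pattern for amount-like candidates (e.g. "9.90", "$1,200.00").
AMOUNT_RE = re.compile(r"^\$?\d+(?:,\d{3})*\.\d{2}$")


def find_candidates(tokens):
    """Return the tokens whose text looks like a monetary amount."""
    return [t for t in tokens if AMOUNT_RE.match(t.text)]


def spatial_context(candidate, tokens, k=3):
    """Return the text of the k tokens nearest to the candidate.

    Nearness is plain Euclidean distance between box centres, which is
    only one possible reading of "spatial cues".
    """
    others = [t for t in tokens if t is not candidate]
    others.sort(key=lambda t: hypot(t.x - candidate.x, t.y - candidate.y))
    return [t.text for t in others[:k]]


# Toy receipt: labels in the left column, values in the right column.
tokens = [
    Token("Subtotal", 10, 100),
    Token("9.00", 40, 100),
    Token("Tax", 10, 140),
    Token("0.90", 40, 140),
    Token("Total", 10, 180),
    Token("9.90", 40, 180),
]

candidates = find_candidates(tokens)
pairs = [(c.text, spatial_context(c, tokens, k=2)) for c in candidates]
for text, context in pairs:
    print(text, context)
```

In this toy layout each amount ends up paired with the label on its own row plus one neighbouring value; pairs of this kind (candidate and context) are what the abstract describes being passed to the downstream model.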
Volume
13088 LNAI
Subjects