Text Extraction Tool
Glyphic is a state-of-the-art engine for accurately and efficiently identifying and extracting target content from PDF documents. The name comes from “glyph”, the typographic term for the visual representation of a character or symbol.
Many PDF extraction libraries focus only on the textual semantics of a document. Glyphic uses textual semantics as well, but its rule-based queries can also draw on the document’s structure and formatting to identify content of interest. The library provides an expressive language for writing content queries, the data structures for processing those queries efficiently, and interfaces for invoking them from a larger system. Glyphic integrates with best-in-class OCR for digitizing scanned documents and can be deployed on premises or in the cloud, depending on customer requirements.
Advantages over coordinate-based and text-based solutions:
- Maintains hierarchical relationships among structural components
- Robustly handles routine shifts and movements in content; not bound to exact positioning
- Query language unifies coordinate-based and text-based approaches
- Recognizes that blocks, columns, and rows of text ‘go together’, enabling the creation of powerful and robust extraction rules
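
To make the contrast with coordinate-only extraction concrete, here is a minimal Python sketch. It does not use Glyphic's query language or API, which this overview does not show. The `Word` type and the `extract_by_box` and `extract_right_of` helpers are hypothetical names invented for illustration, and the page layout is made-up sample data. The sketch only shows why a rule anchored on a label and its row can survive a vertical shift that breaks a fixed bounding box.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical positioned word: text plus its origin in page units."""
    text: str
    x: float
    y: float


def extract_by_box(words, x0, y0, x1, y1):
    """Coordinate-only rule: keep words whose origin lies inside a fixed box."""
    hits = [w for w in words if x0 <= w.x <= x1 and y0 <= w.y <= y1]
    return " ".join(w.text for w in sorted(hits, key=lambda w: w.x))


def extract_right_of(words, label, y_tol=2.0):
    """Structure-aware rule: locate the label, then read the rest of its row."""
    anchor = next((w for w in words if w.text == label), None)
    if anchor is None:
        return ""
    row = [w for w in words if abs(w.y - anchor.y) <= y_tol and w.x > anchor.x]
    return " ".join(w.text for w in sorted(row, key=lambda w: w.x))


# Made-up sample page, and the same page with all content moved down 40 units.
original = [
    Word("Invoice", 50, 100),
    Word("Total:", 50, 200),
    Word("$1,250.00", 120, 200),
]
shifted = [Word(w.text, w.x, w.y + 40) for w in original]

# A box tuned to where the amount sits on the original page.
box = (100, 190, 300, 210)

box_original = extract_by_box(original, *box)
box_shifted = extract_by_box(shifted, *box)
row_original = extract_right_of(original, "Total:")
row_shifted = extract_right_of(shifted, "Total:")

print(repr(box_original), repr(box_shifted))
print(repr(row_original), repr(row_shifted))
```

In this sketch the fixed box finds the amount only on the original page, while the label-relative rule finds it on both. A production rule would also need to handle multi-word labels, repeated labels, and rows that wrap, none of which this toy example attempts.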