Extract Figures from Scholarly Documents
Scholarly documents often contain import results and visual aids in tables or figures embedded in the document. Most existing tools, such as the well known pdfimages, cannot extract those graphics if they are composed of vector graphics or contain text components, and cannot pair them with their associated caption. pdffigures is an easy to use command line tool that can match figures and tables to their captions and is robust to the many different ways figures and tables can be formatted.
Identifying and extracting figures and tables along with their captions from scholarly articles is important both as a way of providing tools for article summarization, and as part of larger systems that seek to gain deeper, semantic understanding of these articles. While many "off-the-shelf" tools exist that can extract embedded images from these documents, e.g. PDFBox, Poppler, etc., these tools are unable to extract tables, captions, and figures composed of vector graphics. Our proposed approach analyzes the structure of individual pages of a document by detecting chunks of body text, and locates the areas wherein figures or tables could reside by reasoning about the empty regions within that text. This method can extract a wide variety of figures because it does not make strong assumptions about the format of the figures embedded in the document, as long as they can be differentiated from the main article's text. Our algorithm also demonstrates a caption-to-figure matching component that is effective even in cases where individual captions are adjacent to multiple figures. Our contribution also includes methods for leveraging particular consistency and formatting assumptions to identify titles, body text and captions within each article. We introduce a new dataset of 150 computer science papers along with ground truth labels for the locations of the figures, tables and captions within them. Our algorithm achieves 96% precision at 92% recall when tested against this dataset, surpassing previous state of the art. We release our dataset, code, and evaluation scripts on our project website for enabling future research.
Our implementation can be found on github.
Data to evaluate our extractor was gathered from a variety of computer science conferences and then hand annotated. In total 150 papers were annotated, resulting in 458 figures and 190 tables. Bounding regions for figures, tables, and captions were identified using LabelMe and post processed to get cropped bounding squares.
We have evaluated both our extractor and a baseline extractor originally built targeting paper for high energy physics.
We have run our extractor on a corpus of papers from the ACL Anthology.