PDF metadata extraction quality

Hi! Thanks for checking out JabRef.

I haven’t used Zotero much, but here is what I can tell you from knowing the internals of JabRef:

JabRef has several methods (heuristics) for extracting metadata from PDFs. These include:

  • Finding embedded .bib source code (or a whole .bib file) in the PDF.
  • Finding XMP metadata in the PDF file (XMP is a general format for metadata).
  • If the PDF “came” from Springer or IEEE, JabRef can handle it easily, because the format used by those publishers is “well-defined”. In other words, it can parse files of
    specific styles, though not all of them.
  • JabRef can also send files to Grobid (a well-known system for analyzing documents); however, this currently doesn’t work.

JabRef applies all of these methods and combines the results into a final .bib entry.
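To illustrate the idea of running several heuristics and combining what they find, here is a minimal Python sketch. It is not JabRef’s actual code (JabRef is written in Java), and the names `merge_entries` and `extract_metadata` as well as the "first non-empty value wins" rule are my own assumptions for the example.

```python
from typing import Callable, Dict, List, Optional

# A bibliographic entry is modeled as a plain field -> value mapping.
Entry = Dict[str, str]
# Hypothetical extractor type: takes PDF text, returns an entry or None.
Extractor = Callable[[str], Optional[Entry]]


def merge_entries(candidates: List[Entry]) -> Entry:
    """Merge candidate entries; earlier candidates take priority.

    Assumed rule: for each field, keep the first non-empty value seen.
    """
    merged: Entry = {}
    for candidate in candidates:
        for field, value in candidate.items():
            if value and field not in merged:
                merged[field] = value
    return merged


def extract_metadata(pdf_text: str, extractors: List[Extractor]) -> Entry:
    """Run every extractor and merge whatever each one found."""
    candidates = []
    for extractor in extractors:
        result = extractor(pdf_text)
        if result:  # skip extractors that found nothing
            candidates.append(result)
    return merge_entries(candidates)
```

The order of the extractors matters in this sketch: putting the most trustworthy source first (say, an embedded .bib entry) means its fields are never overwritten by a weaker heuristic.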

Looking at the Zotero docs (retrieve_pdf_metadata [Zotero Documentation]), it seems that Zotero sends the first few pages of the PDF to an external online service. I think it’s probably Grobid as well (or Grobid is one of the services used). It also fetches information from DOIs or ISBNs.

JabRef can also extract metadata from DOIs and ISBNs, but it doesn’t do this automatically.
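For what it’s worth, spotting a DOI in text that has already been extracted from a PDF is not hard. Below is a rough Python sketch, not JabRef’s implementation: the function name `find_doi` is made up, and the regular expression is a commonly used approximation that will not match every DOI ever issued.

```python
import re
from typing import Optional

# Approximate pattern for modern DOIs: "10.", a 4-9 digit registrant
# prefix, a slash, then a suffix of common DOI characters.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string found in text, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop trailing punctuation that usually belongs to the sentence,
    # not to the DOI (assumption: may clip rare DOIs ending in ")").
    return match.group(0).rstrip(".,;)")


# Example with a made-up DOI:
print(find_doi("Available at https://doi.org/10.1000/xyz123."))
```

Once a DOI is found, it could be passed to whatever DOI-based fetcher is available, which is the step that would have to happen automatically.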

To conclude, both JabRef and Zotero rely on external services to retrieve PDF metadata, but:

  1. Currently, JabRef’s Grobid instance is down.
  2. JabRef’s algorithms for handling PDFs could be refined a bit (e.g., to automatically fetch metadata when a DOI is found inside the PDF).

And thus, JabRef might be a little bit behind Zotero in this functionality.
