PDF metadata extraction quality

gittibit · February 5, 2025, 10:01am

Hi all,
I come from Zotero (which is nice but limiting for group use) and my first impression of JabRef is really good!
However, I noticed that the quality of metadata extraction from PDFs in Zotero is superb, while JabRef has quite a few problems. I saw that some improvements have been made in the unreleased 6.0 version.
In general, will the JabRef parser be on par with Zotero? What are the bottlenecks?

Thank you for your comments

Siedlerchr · February 5, 2025, 2:04pm

@InAnYan, you have some more insight.

InAnYan · February 5, 2025, 2:54pm

Hi! Thanks for checking out JabRef.

I haven’t used Zotero much, but here is what I can tell you from knowing the internals of JabRef:

JabRef has several methods (heuristics) for extracting metadata from PDFs. These include:

Finding embedded .bib source code (or a whole .bib file) in the PDF.
Finding XMP metadata in the PDF file (XMP is a general format for metadata).
If the PDF “came” from Springer or IEEE, JabRef can handle it easily, as the format of those publishers is “well-defined”. In other words, it can parse files of specific styles, though not all of them.
JabRef can also send files to Grobid (a well-known service/technology/system for analyzing documents); however, it currently doesn’t work.

JabRef applies all of those methods and forms a final .bib entry.
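
To illustrate the idea of applying several heuristics and combining their results, here is a minimal Python sketch. It is not JabRef's actual code: the names `merge_entries` and `extract_metadata` are hypothetical, and the rule "the first non-empty value wins" is an assumption made for the example.

```python
from typing import Callable, Dict, List

Entry = Dict[str, str]                # field name -> value, like a .bib entry
Extractor = Callable[[bytes], Entry]  # hypothetical: one heuristic per callable


def merge_entries(candidates: List[Entry]) -> Entry:
    """Merge candidate entries; earlier candidates win on conflicts."""
    merged: Entry = {}
    for candidate in candidates:
        for field, value in candidate.items():
            value = value.strip()
            # Keep the first non-empty value seen for each field.
            if value and field not in merged:
                merged[field] = value
    return merged


def extract_metadata(pdf_bytes: bytes, extractors: List[Extractor]) -> Entry:
    """Run every extractor in priority order and merge what they return."""
    candidates: List[Entry] = []
    for extractor in extractors:
        try:
            candidates.append(extractor(pdf_bytes))
        except Exception:
            # A failing heuristic (e.g. an unreachable service) should not
            # abort the whole import; the other results are still used.
            continue
    return merge_entries(candidates)
```

Ordering the extractors from most to least trusted (for example, embedded .bib first, then XMP, then publisher-specific parsing) is one way to decide which value survives when two heuristics disagree.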

By looking at the Zotero docs (retrieve_pdf_metadata [Zotero Documentation]), it seems that Zotero sends a couple of pages from the PDF (several pages at the beginning) to some external online service. I think it’s probably Grobid as well (or one of the services). It also fetches information from DOIs or ISBNs.

JabRef can also extract metadata from DOIs and ISBNs, but it doesn’t do this automatically.

To conclude, both JabRef and Zotero rely on external services to retrieve the metadata of a PDF, but:

Currently, JabRef’s Grobid instance is down.
JabRef’s algorithms for handling PDFs could be refined a bit (e.g., to automatically fetch metadata if a DOI is found inside the PDF).

And thus, JabRef might be a little bit behind Zotero in this functionality.
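
As a rough illustration of the "DOI found inside the PDF" refinement, the following Python sketch looks for a DOI in text that has already been extracted from the first pages. The function name `find_doi` is hypothetical, the pattern is a commonly used approximation for modern DOIs rather than a complete grammar, and this is not how JabRef implements it.

```python
import re
from typing import Optional

# Approximate pattern for modern DOIs: "10.", a 4-9 digit registrant code,
# a slash, then a suffix made of common DOI characters.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)


def find_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string in the text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    # Assumption: this can wrongly trim a DOI that really ends in ")".
    return match.group(0).rstrip(".,;)")
```

Once a DOI is found, the metadata lookup itself can be handed to whatever DOI-based fetcher is available, which is the step JabRef currently leaves to the user.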

InAnYan · February 5, 2025, 2:58pm

Here are some good points of JabRef that you might be interested in, regarding PDF import:

JabRef has strict policies for connecting to Internet services. It never does this (at least it should never do this) without explicit user permission. If needed, you can switch from Internet-based services to ones in a local network (example: you can serve your own Grobid instance locally, and if I remember correctly, you can set up JabRef to use it instead of some service from the Internet).
Even without access to external services, JabRef can still import bits of PDFs.

So, JabRef has some compelling advantages for this task. If you are doing private research (or concerned about privacy), JabRef will be a good reference management system.

InAnYan · February 5, 2025, 3:04pm

And, of course, I’ll repeat what the Zotero docs said (retrieve_pdf_metadata [Zotero Documentation]):

It’s not the best idea to import articles/entries by PDF, as storing metadata in PDFs (especially bibliographic metadata for research/librarian purposes) is not well-defined. Sometimes a PDF has metadata, sometimes not. Some people call \hypersetup (or use pdfx) in a LaTeX document, some don’t.

It’s always better to rely on identifiers like DOI.

gittibit · February 6, 2025, 8:18am

Thank you for the detailed answers!
I guess it really is the use of an external service (like Grobid) that makes Zotero so good at importing the metadata. It’s sad to hear that JabRef has this option but that it currently does not work. Is there a way to work around this without setting up a local instance of such a service?

ThiloteE · February 14, 2025, 5:09pm

Importing bibliographic metadata from PDFs is very much recommended if it has been confirmed that correct metadata was written to them beforehand, for example with the “write XMP metadata to PDFs” feature JabRef offers. This works best as a quick way of exchanging entries when all colleagues follow established rules, but it is not recommended for arbitrary PDFs from the internet.

There, identifiers like DOI are indeed the way to go.

As a workaround for Grobid not being accessible, we can only point to the LLM functionality in JabRef. We cannot guarantee that results will be similar. All parsers, including Grobid, rule-based parsers, and LLMs, have a tendency to confabulate. Manually fact-checking the results is highly recommended. I think a good workflow, if there is no ID, is: (1) Import the PDF. (2) Search for an ID. (3) Compare the metadata with the metadata obtained via the ID. (4) If there is no ID, manually fact-check.
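
Step (3) of that workflow can be sketched in a few lines of Python. This is only an illustration: the function names `normalize` and `compare_entries` are hypothetical, entries are assumed to be plain field-to-value dictionaries, and ignoring case and whitespace is a simplification chosen for the example.

```python
from typing import Dict, Tuple


def normalize(value: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def compare_entries(
    from_pdf: Dict[str, str], from_id: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return the fields whose values differ as (PDF value, ID value) pairs."""
    differences: Dict[str, Tuple[str, str]] = {}
    for field in sorted(set(from_pdf) | set(from_id)):
        pdf_value = from_pdf.get(field, "")
        id_value = from_id.get(field, "")
        if normalize(pdf_value) != normalize(id_value):
            differences[field] = (pdf_value, id_value)
    return differences
```

Anything that shows up in the result is a field worth checking by hand before trusting either source.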

ryan-carpenter · April 9, 2025, 11:17pm

I believe the Grobid service has been restored.

gittibit · April 11, 2025, 5:53am

I have heard, great. Thanks!
