Advice on how to batch import about 10,000 PDF articles from my hard disk into a new library. I currently import about 350 files per run to minimise the freeze at the end of the process, before the updated library is saved.
I also notice that many new entries have an incorrect author extracted from the article text. Is there a way to consistently obtain the correct author for each article?
That comment lists all the techniques used for extracting metadata from PDFs. TL;DR: JabRef uses several layers of metadata extraction, but extracting metadata directly from the PDF rarely gives the best results.
I’ve installed Grobid, but JabRef still produces incorrect metadata in the BibTeX file and does not seem to find the DOI in the PDF. Any suggestions for getting more accurate data into JabRef?
Yes, to a certain degree that is to be expected, as was already explained in the link that Ruslan shared.
Different parsers will lead to different results. It is strongly recommended to fact-check every conversion, regardless of the parser you choose: all of them can confabulate. By comparison, adding entries using an ID (such as a DOI) is much more reliable and accurate.
The different parsers available are described (in a non-technical way) here: https://docs.jabref.org/collect/newentryfromplaintext#parser-explanation. To improve accuracy, you could set up your own Grobid instance and use the larger deep-learning models. JabRef hosts the default model for free; it is small, fast, and fits the machine we host it on, but it is less precise. The larger, better models are more expensive to host, and since JabRef has limited monetary resources, we opted for this default.
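If you do run your own Grobid instance, you can also query it directly to see what it extracts before letting JabRef consume it. A stdlib-only sketch, assuming a local instance on the default port (the `processHeaderDocument` endpoint is Grobid's header-extraction service; the helper names are my own):

```python
import uuid
import urllib.request

GROBID = "http://localhost:8070"  # assumption: a locally running Grobid instance

def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Build a minimal multipart/form-data body without third-party libraries."""
    boundary = uuid.uuid4().hex
    head = (f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
            f"Content-Type: application/pdf\r\n\r\n").encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"

def extract_header(pdf_path: str) -> str:
    """Send one PDF to Grobid's header-extraction service; returns TEI XML."""
    with open(pdf_path, "rb") as fh:
        body, ctype = build_multipart("input", pdf_path, fh.read())
    req = urllib.request.Request(f"{GROBID}/api/processHeaderDocument",
                                 data=body, headers={"Content-Type": ctype})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.read().decode("utf-8")
```

Comparing this raw TEI output against what lands in your BibTeX file helps tell apart Grobid errors from import problems.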
Since you are importing from somewhere (I think it was Zotero, right?), I wonder if you already linked the files to their entries in your previous application. If so, you wouldn’t need to do the import by parsing metadata from the PDFs in an unreliable way; instead, you could export and import a library file whose metadata already contains the correct link to each PDF. JabRef has custom import filters: Custom import filters | JabRef. With your 10,000 entries, it would probably make sense to look into this. Or maybe the app you came from supports writing XMP to the PDFs? Then you can parse the XMP.
You can check what metadata is attached to your PDFs by using ExifTool:
- Open a command prompt in the folder holding the PDF(s).
- `exiftool.exe` has to be in this folder (or on your `PATH`).
- To extract really ALL metadata attached to the file, use the following command: `exiftool -ee3 -U -G3:1 -api requestall=3 -api largefilesupport FILE`
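With 10,000 files you probably don't want to run that command by hand per file. A small wrapper can loop over a folder instead; a sketch, assuming `exiftool` is on your `PATH` (the function names are my own):

```python
import pathlib
import shutil
import subprocess

def exiftool_cmd(pdf: str) -> list[str]:
    """The exact command from the post above, as an argument list."""
    return ["exiftool", "-ee3", "-U", "-G3:1",
            "-api", "requestall=3", "-api", "largefilesupport", pdf]

def dump_metadata(folder: str) -> None:
    """Run ExifTool on every PDF in `folder` and print each file's metadata."""
    if shutil.which("exiftool") is None:
        raise RuntimeError("exiftool not found on PATH")
    for pdf in sorted(pathlib.Path(folder).glob("*.pdf")):
        out = subprocess.run(exiftool_cmd(str(pdf)),
                             capture_output=True, text=True, check=False)
        print(f"==== {pdf.name} ====\n{out.stdout}")
```

Redirecting the output to a text file (`python dump.py > metadata.txt`) makes it easy to search for which PDFs carry usable XMP or DOI fields.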