Advice on how to batch import about 10,000 PDF articles from my hard disk into a new library. I currently import about 350 files per run to minimise the freeze at the end of the process, before the updated library is saved.
I also notice that many new entries have an incorrect author extracted from the article text. Is there a way to consistently obtain the correct author for each article?
That comment lists all the techniques used for extracting metadata from PDFs. TL;DR: JabRef uses several layers of metadata extraction, but extracting metadata directly from the PDF rarely gives the best results.
I’ve installed Grobid, but JabRef still produces incorrect metadata in the BibTeX file and does not seem to find the DOI in the PDF. Any suggestions for getting more accurate data into JabRef?
Yes, to a certain degree that is to be expected, as was already explained in the link that Ruslan shared.
Different parsers will lead to different results. It is strongly recommended to fact-check every conversion, regardless of the parser you choose: all of them can confabulate. By comparison, adding entries using an ID (such as a DOI) is much more reliable and accurate.
The different parsers available are described (in a non-technical way) here: https://docs.jabref.org/collect/newentryfromplaintext#parser-explanation. To improve accuracy, you could set up your own Grobid instance and use the larger deep-learning models. JabRef hosts the default model for free; it is small, fast, and fits the machine we host it on, but it is less precise. The larger, better models are more expensive to host, and since JabRef has limited monetary resources, we opted for this default.
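If you do run your own Grobid instance, you can also query it directly to see what it extracts before letting JabRef consume it. A stdlib-only sketch, assuming a local instance on the default port (the `processHeaderDocument` endpoint is Grobid's header-extraction service; the helper names are my own):

```python
import uuid
import urllib.request

GROBID = "http://localhost:8070"  # assumption: a locally running Grobid instance

def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Build a minimal multipart/form-data body without third-party libraries."""
    boundary = uuid.uuid4().hex
    head = (f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
            f"Content-Type: application/pdf\r\n\r\n").encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"

def extract_header(pdf_path: str) -> str:
    """Send one PDF to Grobid's header-extraction service; returns TEI XML."""
    with open(pdf_path, "rb") as fh:
        body, ctype = build_multipart("input", pdf_path, fh.read())
    req = urllib.request.Request(f"{GROBID}/api/processHeaderDocument",
                                 data=body, headers={"Content-Type": ctype})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.read().decode("utf-8")
```

Comparing this raw TEI output against what lands in your BibTeX file helps tell apart Grobid errors from import problems.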
Since you are importing from somewhere (I think it was Zotero, right?), I wonder if you already linked the files to their entries in your previous application. If so, you wouldn’t need to do the import by parsing metadata from the PDFs in an unreliable way; instead, you could export and import a library file whose metadata already contains the correct link to each PDF. JabRef has custom import filters: Custom import filters | JabRef. With your 10,000 entries, it would probably make sense to look into this. Or maybe the app you came from supports writing XMP to the PDFs? Then you can parse the XMP.
You can check what metadata is attached to your PDFs by using ExifTool:
- Open a command prompt in the folder holding the PDF(s).
- `exiftool.exe` has to be in this folder (or on your `PATH`).
- To extract really ALL metadata attached to the file, use the following command: `exiftool -ee3 -U -G3:1 -api requestall=3 -api largefilesupport FILE`
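With 10,000 files you probably don't want to run that command by hand per file. A small wrapper can loop over a folder instead; a sketch, assuming `exiftool` is on your `PATH` (the function names are my own):

```python
import pathlib
import shutil
import subprocess

def exiftool_cmd(pdf: str) -> list[str]:
    """The exact command from the post above, as an argument list."""
    return ["exiftool", "-ee3", "-U", "-G3:1",
            "-api", "requestall=3", "-api", "largefilesupport", pdf]

def dump_metadata(folder: str) -> None:
    """Run ExifTool on every PDF in `folder` and print each file's metadata."""
    if shutil.which("exiftool") is None:
        raise RuntimeError("exiftool not found on PATH")
    for pdf in sorted(pathlib.Path(folder).glob("*.pdf")):
        out = subprocess.run(exiftool_cmd(str(pdf)),
                             capture_output=True, text=True, check=False)
        print(f"==== {pdf.name} ====\n{out.stdout}")
```

Redirecting the output to a text file (`python dump.py > metadata.txt`) makes it easy to search for which PDFs carry usable XMP or DOI fields.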