Extract information from PDF import

malacology · August 16, 2021, 1:00pm

I don’t know whether JabRef can extract information from PDFs; in my experience, the feature isn’t usable. I dragged some PDF files into JabRef and read some guides on PDF import, but nothing was extracted correctly.

Siedlerchr · August 16, 2021, 5:29pm

Hi,

We have good news for you. This year’s GSoC student @btut is already working on improved PDF importing and extraction (using GROBID under the hood).

btut · August 16, 2021, 6:10pm

Hi!
Indeed, this is a feature I was very much looking forward to myself; that’s why I implemented it.
I am happy to report that most things are done. I am working on some details and waiting for some changes in Grobid to be accepted; then we will have a much more comprehensive PDF import.
You can track progress here, but as it depends on Grobid, which is not updated yet, it cannot be tested (you would need to build your own Grobid server from my Grobid PR and point JabRef to your server). Expect the feature to be in the main branch in the coming weeks and in the next release!

malacology · August 17, 2021, 3:15am

I hope I can use this feature as soon as possible. It will be very helpful, thanks.

btut · August 30, 2021, 12:45pm

Hi @malacology! The new PDF import features are now available in the main branch, but not in the latest release. If you want to try them out already, you can use the builds here.

We now use multiple ways of extracting metadata from PDFs! One of them is Grobid, a deep-learning approach. JabRef now runs a Grobid server for that purpose. The first time you try to import a PDF, you will be prompted to allow JabRef to send your PDFs to that service. Allow it for best results; deny it if you don’t want to transmit your files.

I hope this new feature is helpful to you!

ThiloteE · November 28, 2021, 10:50pm

Sorry for reviving this old thread. @btut, does this mean that when Grobid is activated, it takes precedence over any local XMP metadata attached to the PDF file when I import it into JabRef?

I tried the “write XMP metadata to PDF” feature and the metadata was written, but when I import the PDF into another library in JabRef, it does not show the data I wrote to the PDF, but rather quite different data.

See here: Writing XMP metadata to PDFs skips my linked pdf file · Issue #8278 · JabRef/jabref · GitHub

btut · November 29, 2021, 8:45am

Hi @ThiloteE!
Indeed, if Grobid is enabled the importer uses the following order to obtain metadata:

1. Look for a BibTeX entry on the first page of the PDF
2. Look for an embedded bib file
3. Grobid
4. XMP metadata
5. Attempt to find metadata on the first page (not in BibTeX format)
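
The order above can be sketched as a priority chain. This is only an illustration in Python, not JabRef's actual code: the extractor functions are hypothetical stubs, and the sketch assumes that sources are combined field by field, with earlier sources taking precedence (the thread does not say whether JabRef merges fields or stops at the first source that returns anything).

```python
from typing import Callable, Dict, List, Optional, Tuple

Metadata = Dict[str, str]
Extractor = Callable[[str], Optional[Metadata]]


def merge_by_priority(
    pdf_path: str, extractors: List[Tuple[str, Extractor]]
) -> Tuple[Metadata, Dict[str, str]]:
    """Run extractors in order; the first source to supply a field keeps it."""
    merged: Metadata = {}
    origin: Dict[str, str] = {}  # field name -> source that supplied it
    for name, extract in extractors:
        found = extract(pdf_path) or {}
        for field, value in found.items():
            if value and field not in merged:
                merged[field] = value
                origin[field] = name
    return merged, origin


# Hypothetical stand-ins for the five sources listed above.
def bibtex_on_first_page(path: str) -> Optional[Metadata]:
    return None  # nothing found in this example


def embedded_bib_file(path: str) -> Optional[Metadata]:
    return None  # nothing found in this example


def grobid(path: str) -> Optional[Metadata]:
    return {"title": "Title from Grobid", "author": "Doe, J."}


def xmp_metadata(path: str) -> Optional[Metadata]:
    return {"title": "Title from XMP", "year": "2021"}


def first_page_heuristics(path: str) -> Optional[Metadata]:
    return {"title": "Guessed title"}


IMPORT_ORDER: List[Tuple[str, Extractor]] = [
    ("BibTeX on first page", bibtex_on_first_page),
    ("Embedded bib file", embedded_bib_file),
    ("Grobid", grobid),
    ("XMP metadata", xmp_metadata),
    ("First-page heuristics", first_page_heuristics),
]

merged, origin = merge_by_priority("paper.pdf", IMPORT_ORDER)
print(merged)
print(origin)
```

With these stubs the title comes from Grobid rather than from XMP, which is the kind of result @ThiloteE described: XMP data only fills fields that no earlier source supplied.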

If you want to force an XMP import, you can go to File → Import, choose either the current or a new library, and select “XMP-annotated PDF” (last in the drop-down list) in the bottom right corner.

ThiloteE · November 29, 2021, 12:16pm

Thank you! This explanation was perfect. Following your advice, I managed to import the XMP metadata that I had written to the PDF.

ThiloteE · December 8, 2021, 4:42pm

@btut, what would be the import order if Grobid is not enabled? The same as above, just without Grobid?

ThiloteE · December 8, 2021, 4:47pm

Maybe it would be a good idea to be able to set/change the order (within the preferences).

malacology · December 22, 2021, 3:41pm

@btut sorry, I still don’t know how to use Grobid in the newest version. When I import a PDF, it says it will contact a Grobid server, but can I customize which server is used?

ThiloteE · December 22, 2021, 5:29pm

Hello Malacology,

You can enable the Grobid service under Options > Preferences > Import and export > Remote services.

This is enough, and Grobid should work just fine.

Alternatively, you can link JabRef to a custom or alternative Grobid instance (if you know of one or have one at your disposal) by entering another URL.
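
Before entering a custom URL, it can help to check that the server responds. The following is a minimal Python sketch, not part of JabRef: it assumes the instance exposes Grobid's REST API under `/api/` with the `isalive` health check, and the `http://localhost:8070` address is only a placeholder for a locally started server. The helper names are made up for this example.

```python
import urllib.error
import urllib.request


def grobid_endpoint(base_url: str, service: str) -> str:
    """Build the URL of a Grobid REST service from the server's base URL."""
    return f"{base_url.rstrip('/')}/api/{service}"


def is_alive(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers the `isalive` health check with "true"."""
    try:
        url = grobid_endpoint(base_url, "isalive")
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().strip().lower()
            return response.status == 200 and body == b"true"
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable host, refused connection, timeout, or malformed URL.
        return False


if __name__ == "__main__":
    # Placeholder address; replace it with the URL you plan to enter in JabRef.
    server = "http://localhost:8070"
    print(server, "is reachable" if is_alive(server) else "did not respond")
```

If the check fails, the URL entered in the preferences will most likely not work in JabRef either.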

btut · December 22, 2021, 6:11pm

There should also be an option to read metadata from PDF files that are already linked to an entry. In the entry editor, where the files are listed, there should be a button to the right of the filename that opens a merge dialog showing metadata from all sources, Grobid being one of them.
I am not near a computer for the next few days, so I cannot be more specific. I hope you can find it anyway.

ThiloteE · December 22, 2021, 6:44pm

I think this is the button btut is talking about:

btut · December 22, 2021, 8:21pm

Exactly! Thanks @ThiloteE.
