Extract information from PDF import

I don’t know whether JabRef can extract information from PDFs; as far as I can tell, the feature is unusable. I dragged some PDF files into JabRef and read some of the guides on PDF import, but nothing correct is extracted.

Hi,

We have good news for you. This year’s GSoC student @btut is already working on improved PDF importing and extraction (using GROBID under the hood).

Hi!
Indeed, this is a feature I was very much looking forward to myself, which is why I implemented it :wink:
I am happy to report that most things are done. I am working on some details and waiting for some changes in Grobid to be accepted; then we will have a much more comprehensive PDF import.
You can track progress here, but as it depends on Grobid, which is not updated yet, it cannot be tested (you would need to build your own Grobid server from my Grobid PR and point JabRef to your server). Expect the feature to be in the main branch in the coming weeks and in the next release!

I hope I can use this feature as soon as possible. It’s very helpful, thanks.

Hi @malacology! The new PDF import features are now available in the main branch, but not in the latest release. If you want to try them out already, you can use the builds here.

We use multiple ways of extracting metadata from PDFs now! One of them is Grobid, a deep-learning approach. JabRef now runs a Grobid server for that purpose. You will be prompted to allow JabRef to send your PDFs to that service the first time you try to import a PDF. Allow it for best results; deny it if you don’t want to transmit your files.

I hope this new feature is helpful to you!

Sorry for reviving this old thread. @btut, does this mean that when Grobid is activated, it takes precedence over any local XMP metadata attached to the PDF file when I import it into JabRef?

I tried the “write XMP metadata to PDF” feature and the metadata was written, but when I import the PDF into another library in JabRef, it does not show the data I wrote to the PDF, but rather quite different data.

See here: Writing XMP metadata to PDFs skips my linked pdf file · Issue #8278 · JabRef/jabref · GitHub

Hi @ThiloteE!
Indeed, if Grobid is enabled the importer uses the following order to obtain metadata:

  • Look for a BibTeX entry on the first page of the PDF
  • Look for an embedded .bib file
  • Grobid
  • XMP metadata
  • Attempt to find metadata on the first page (not in BibTeX format)
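
One way to picture this order is as a precedence list, sketched here in Python rather than JabRef’s actual Java code. The thread does not say whether the importer stops at the first source that returns something or combines the sources field by field; this sketch assumes the latter, and every name in it (`SOURCE_ORDER`, `merge_by_precedence`, the source keys) is hypothetical.

```python
# Hypothetical source names, listed in the order given above.
SOURCE_ORDER = [
    "bibtex_on_first_page",
    "embedded_bib_file",
    "grobid",
    "xmp",
    "first_page_text",
]


def merge_by_precedence(results, order=SOURCE_ORDER):
    """Combine per-source metadata; a field keeps the value from the
    earliest source in `order` that provides a non-empty value."""
    merged = {}
    for source in order:
        for field, value in results.get(source, {}).items():
            if value and field not in merged:
                merged[field] = value
    return merged


# Assumed example data: Grobid and XMP disagree on the title.
results = {
    "xmp": {"title": "Title I wrote to the PDF", "year": "2021"},
    "grobid": {"title": "Title guessed by Grobid"},
}
merged = merge_by_precedence(results)
# The title comes from Grobid (earlier in the order), the year from XMP.
```

Under that assumption, a title written via XMP loses to whatever Grobid returns, which would match the behaviour described above. Passing a different `order` shows what a user-configurable order would change.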

If you want to force an XMP import, you can go to File → Import, choose either the current or a new library, and select XMP-annotated PDF (last in the drop-down list) in the bottom-right corner.
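
To check what XMP metadata a file actually carries before importing it, a rough Python sketch (not part of JabRef) can look for the XMP packet in the raw bytes. It assumes the packet is stored uncompressed in the PDF, which is common but not guaranteed, and that the title is stored as a Dublin Core `dc:title`; the function names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def find_xmp_packet(pdf_bytes):
    """Return the first <x:xmpmeta>...</x:xmpmeta> block found in the raw
    bytes, or None if there is none (or it is stored compressed)."""
    match = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", pdf_bytes, re.DOTALL)
    return match.group(0) if match else None


def xmp_title(packet):
    """Return the first non-empty dc:title value in an XMP packet, or None."""
    root = ET.fromstring(packet)
    for title in root.iter(f"{{{DC}}}title"):
        for item in title.iter(f"{{{RDF}}}li"):
            if item.text and item.text.strip():
                return item.text.strip()
    return None


def title_from_pdf(path):
    """Read a PDF from disk and return its XMP title, if one can be found."""
    with open(path, "rb") as handle:
        packet = find_xmp_packet(handle.read())
    return xmp_title(packet) if packet else None
```

If `title_from_pdf` returns the title you wrote, the XMP data is in the file and only the import order is hiding it.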

Thank you! This explanation was perfect. Following your advice, I managed to import the XMP metadata that I had written to the PDF.

@btut, what would the import order be if Grobid is not enabled? The same as above, just without Grobid?

Maybe it would be a good idea to be able to set/change the order (within the preferences).

@btut, sorry, I still don’t know how to use Grobid in the newest version. When I import a PDF, it says it will contact a Grobid server, but can I choose which server it uses?

Hello Malacology,

You can enable the Grobid service under Options → Preferences → Import and export → Remote services.

That is enough; Grobid should then work just fine.

Alternatively, you can link JabRef to another Grobid instance (if you know of one or have one at your disposal) by entering its URL.
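
If you run your own instance, it can help to check outside JabRef that the URL you enter really answers Grobid requests. The Python sketch below only builds such a request. It relies on Grobid’s REST endpoint `/api/processHeaderDocument`, which takes the PDF as a multipart form field named `input`; the base URL `http://localhost:8070` and the helper names are assumptions for illustration, and this is not how JabRef itself talks to the server.

```python
import uuid
from urllib.request import Request


def grobid_endpoint(base_url, service):
    """Join a Grobid base URL and a service name into a full API URL."""
    return base_url.rstrip("/") + "/api/" + service.lstrip("/")


def build_header_request(base_url, pdf_bytes, filename="paper.pdf"):
    """Build (but do not send) a multipart POST asking Grobid to extract
    header metadata from the given PDF bytes."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (f'Content-Disposition: form-data; name="input"; '
         f'filename="{filename}"\r\n').encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return Request(
        grobid_endpoint(base_url, "processHeaderDocument"),
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Assumed local instance; replace with the URL you entered in JabRef.
request = build_header_request("http://localhost:8070/", b"%PDF-1.4 ...")
```

Passing `request` to `urllib.request.urlopen` would send it; a reachable Grobid server answers with the extracted header metadata.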

There should also be an option to read metadata from PDF files that are already linked. In the entry editor, where the files are listed, there should be a button to the right of the filename that opens a merge dialog showing metadata from all sources, Grobid being one of them.
I won’t be near a computer for the next few days, so I cannot be more specific. I hope you can find it anyway.

I think this is the button btut is talking about:

Exactly! Thanks @ThiloteE.