Fixing broken paths to linked files

I’m moving a large .bib file from BibDesk to JabRef. In the process, LOTS of links to PDF files were broken, largely because BibDesk stopped supporting the human-readable file path.
I can do ‘Search for unlinked files’ but there seems to be no direct way to:

  1. Open a PDF in the list to see titles, authors, etc.
  2. Find the appropriate entry in the JabRef main window.
  3. Connect the PDF to the entry’s Files field.

Is there a direct way to do this?

Is there a way to define a group of just unlinked files?

Thanks,
Tom

This would require JabRef to read metadata from the PDF file, but not all PDF files have metadata attached. JabRef can read XMP metadata, and if you have enabled Grobid, it will try to extract bibliographic metadata by analysing the first page(s) of the PDF. Neither method is always accurate, so I would not trust them blindly, especially Grobid. XMP metadata is fine if you know for sure that the correct metadata is attached to the PDFs. Importing metadata by ID (e.g. ISBN or DOI) usually yields far more accurate and complete results.

There are multiple methods that allow you to see entries without a file. I will show you two:

  • Check the entry table:
    image
  • Create a group: go to “free search expression”, enable “Regular Expression”, and enter file != .+ into the “search expression” field.
    image
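
As a cross-check outside JabRef, the entries without a file link can also be counted directly from the .bib file. The following is only a minimal Python sketch using the standard library: it assumes one field per line in the form `file = {...},`, and the names `split_by_file_field` and `SAMPLE` are made up for illustration. It is a rough regex scan, not a full BibTeX parser.

```python
import re

# Matches the start of an entry such as "@Article{smith2001,".
ENTRY_START = re.compile(r"^@(\w+)\s*\{\s*([^,\s]+)\s*,", re.MULTILINE)
# Matches a single-line field such as "file = {:papers/a.pdf:PDF},".
FILE_FIELD = re.compile(
    r"^\s*file\s*=\s*\{(.*?)\}\s*,?\s*$", re.MULTILINE | re.IGNORECASE
)


def split_by_file_field(bib_text):
    """Return (linked_keys, unlinked_keys) for the entries in bib_text."""
    starts = list(ENTRY_START.finditer(bib_text))
    linked, unlinked = [], []
    for i, match in enumerate(starts):
        # Skip non-entry blocks that happen to look like entries.
        if match.group(1).lower() in {"comment", "string", "preamble"}:
            continue
        # The entry body runs up to the start of the next entry.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(bib_text)
        body = bib_text[match.end():end]
        field = FILE_FIELD.search(body)
        # A missing or empty file field counts as unlinked.
        if field and field.group(1).strip():
            linked.append(match.group(2))
        else:
            unlinked.append(match.group(2))
    return linked, unlinked


# Hypothetical sample data, for illustration only.
SAMPLE = """
@Article{smith2001,
  author = {Smith, A.},
  file   = {:papers/smith2001.pdf:PDF},
}

@Book{jones1999,
  title = {Old Book},
}

@Misc{empty2005,
  file = {},
}
"""

linked, unlinked = split_by_file_field(SAMPLE)
print(len(linked), "linked;", len(unlinked), "unlinked:", unlinked)
```

If the counts from such a scan disagree with what the group shows, that points to the search expression or configuration rather than to the .bib file.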

If you have an existing library file with lots of entries, it is usually more efficient to use the feature Quality > Automatically set file links (F7). Here are the docs: Manage associated files - JabRef

It’s best to keep the filenames of the PDFs and the citation key pattern similar, so that the PDFs can be detected easily. To avoid duplicate file names, using the DOI or ISBN as the filename works well.

I also had a little trouble finding a good regular expression to capture my PDF files (see: Help with search expression for 'Automatically set file links' (F7); Attach files to entry), but it’s fine now :smiley:

In general, if you have a big library and lots of unidentified PDFs lying around with messy filenames, you will have to do a lot of work.

Either you import via “Search for unlinked files”, which imports each file and creates a separate entry with potentially incomplete metadata in your library, which you then have to merge with existing entries via the duplicate detection feature. Or you use the “Automatically set file links (F7)” feature, which requires you to rename all of your PDF files to a pattern that JabRef’s regex engine can find.

Thanks for your response.
Unfortunately, I had found clues to using
file!=.+
but it flags everything as unlinked. About 4/5 of the nearly 12,000 files have links, mostly via directory scanning and bibtexparser.py. Flipping the negation to file=.+ returns 0 files.

Some of my PDFs go back over many years before many of the metadata standards were available, and file naming conventions are all over the map.

Most of my PDFs are organized by directory hierarchy. Generally I hope to sort my unlinked bib entries by journal/year and to narrow down the PDFs to check to identify the appropriate link. Attempts to use the automated processes seem to generate a lot of flawed entries.

I’ve tried using the unlinked importer to import in small groups for easier management. I had to implement a group that presented Creationdate to tell which were the new entries. The lack of embedded metadata created a mess of mostly useless entries, though a few worked reasonably well. This might be a useful technique for adding any PDFs that do not yet have a BibTeX entry.

It would be easier to identify which files are unlinked, then use their location in the directory structure and read the file to narrow in on the appropriate bib entry, so that I know the BibTeX file is complete. Tests with the automatic components suggest that some of the extra fields I’ve installed for my projects may be confusing the merging functionality. Unfortunately, the unlinked files dialog does not let you switch to the main window of entries (modal dialog?) while it is up, so it’s difficult to do that search.

That is not what I experience. It must be something in your specific configuration. Which version of JabRef are you using, by the way?

I know what this means, but what exactly do you do?
Could you provide a screenshot of your preferences and an example (the bibtex / biblatex source) of an entry that does not get detected?

Relevant preferences:

  1. Library > Library properties
  2. Options / File > Preferences > Linked files

About bibtexparser.py: I am not sure whether JabRef supports linking files that way.

JabRef version info:
JabRef 5.9--2023-01-08--76253f1a7
Linux 6.4.6-76060406-generic amd64
Java 19.0.1
JavaFX 19+11

Clearly I need to explain better.

My bibtexparser script is my primary converter for mapping field names, etc. from the ‘standards’ used in BibDesk to those used in JabRef: Date-Added becomes Creationdate, Annote becomes Comment, BibDesk’s Local-Url and Bdsk-File-# become JabRef’s File, etc. This conversion seems to have handled about 70% of what I needed to do. Its only connection to JabRef is the .bib file.
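
To illustrate the kind of mapping described here, the following is a minimal Python sketch, not the actual script (which is not shown in this thread). It works on a plain dict of field names to values. The function name `convert_fields` is hypothetical, the `file://` form of Local-Url and the `:path:PDF` form of the File value are assumptions, and Bdsk-File-# fields are passed through undecoded.

```python
from urllib.parse import unquote, urlparse

# Field renames mentioned in the post (BibDesk name -> JabRef name).
RENAMES = {"date-added": "creationdate", "annote": "comment"}


def convert_fields(fields):
    """Return a new dict with BibDesk field names mapped to JabRef names."""
    converted = {}
    for name, value in fields.items():
        lname = name.lower()
        if lname == "local-url":
            # Assumption: Local-Url holds a file:// URL or a plain path.
            if value.startswith("file:"):
                path = unquote(urlparse(value).path)
            else:
                path = value
            # Assumption: a File value of the form ":<path>:PDF".
            converted["file"] = f":{path}:PDF"
        else:
            # Unknown fields are kept; only the name is lower-cased.
            converted[RENAMES.get(lname, lname)] = value
    return converted


# Hypothetical entry, for illustration only.
example = {
    "Date-Added": "2019-03-01",
    "Annote": "Check figure 3",
    "Local-Url": "file://localhost/Users/tom/My%20Paper.pdf",
    "Title": "A Title",
}
print(convert_fields(example))
```

Field values such as dates are copied unchanged in this sketch; a real converter would probably also need to adjust their format.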

The .bib file created by the above process is then loaded into JabRef, where I want to use its entry management. However, I’m trying to find a way to do the remaining 30% of the fixes in JabRef, mostly just fixing broken links to PDF files.

It is not out of the question that this process has placed some entries in an odd state. All my entries with File fields seem to have valid filename entries (there are no entries with empty File values).

My directory of PDF files is not organized by citation, but is very mixed considering the broad range of my sources - journals, preprints, proceedings, PDFs of PowerPoint presentations, etc.

The unlinked file search is useful for finding unlinked PDFs, but it locks you into importing only through its interface. And it seems to be unable to intelligently parse many of my PDFs, which is not surprising considering their provenance.

Ideally, from the unlinked file window, I’d want to go back to the main JabRef window and find the entries corresponding to the journal identified in the directory tree, then open the actual PDF file in a PDF reader to see precisely which entry it matches. Then drag and drop the link into JabRef. At least that way I get them merged and the .bib is at least correct, if not yet up-to-date.
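
One way to get such a list outside the modal dialog is to compare the PDFs on disk with the paths already stored in File fields. This is only a rough Python sketch: it assumes File values of the form `description:path:type`, separated by `;` and with `\` as the escape character, and it assumes relative paths are relative to the PDF root directory. The names `linked_paths` and `unlinked_pdfs` are hypothetical.

```python
from pathlib import Path


def linked_paths(file_field):
    """Extract the path part of each 'description:path:type' link."""
    entries, fields, current = [], [], []
    i = 0
    while i < len(file_field):
        ch = file_field[i]
        if ch == "\\" and i + 1 < len(file_field):
            # An escaped character is taken literally.
            current.append(file_field[i + 1])
            i += 2
            continue
        if ch in ":;":
            fields.append("".join(current))
            current = []
            if ch == ";":
                # A semicolon ends one link and starts the next.
                entries.append(fields)
                fields = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    entries.append(fields)
    return [f[1] for f in entries if len(f) >= 2 and f[1]]


def unlinked_pdfs(pdf_root, file_fields):
    """Return PDFs under pdf_root that no File value refers to, sorted by path."""
    root = Path(pdf_root)
    referenced = set()
    for value in file_fields:
        for raw in linked_paths(value):
            path = Path(raw)
            if not path.is_absolute():
                path = root / path
            referenced.add(path.resolve())
    # Note: "*.pdf" is matched case-sensitively on Linux.
    return sorted(p for p in root.rglob("*.pdf") if p.resolve() not in referenced)


print(linked_paths(r":papers/a.pdf:PDF;Notes:b\:c.pdf:PDF"))
```

Because the result is sorted by path, PDFs from the same journal directory end up next to each other, which fits the workflow of checking them against entries for that journal.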

Whatever PDFs are left unlinked can then hopefully be managed a little better.

Hopefully that’s a little clearer.

Thanks,
Tom

Thanks for the explanation:

About bibtexparser.py: would you mind sharing it? Then maybe we can create an importer for BibDesk directly in the future.

What kind of standards do you have for contributed code?

Just FYI on my converter

  • it is written in Python 3 (working in v3.10)
  • it uses the bibtexparser Python module v2.0 (BibtexParser documentation), to which it was only recently converted.
  • the last BibDesk version it used was from about 2019.
  • there is much that is custom configured for my file layout, particularly in fixing file links.
  • there is also an (abandoned and buggy) class for converting to KBibTeX.
  • there is much that is hacked, including a table for converting unicode to LaTeX.

Others might be able to use it, but not without some serious reworking/customization.

Tom

Does your script already support the conversion of groups, or are you planning to add it?

Do you mean the conversion of groups defined in BibDesk to groups for JabRef?

The answer is probably no.

Currently I rebuild the groups manually in JabRef. BibDesk uses a bunch of XML in a comment at the end of the .bib file that I’d need to parse and then map to the JabRef functionality. I’ve got a lot of groups defined from my old system, but converting them for JabRef manually is presently much faster.

Tom

@astronomertom Can you share an example of such a bib file with groups?

There’s very little information for context. Here are a couple of groups from the original .bib file from BibDesk.

<dict>
	<key>conditions</key>
	<array>
		<dict>
			<key>comparison</key>
			<integer>2</integer>
			<key>key</key>
			<string>Source</string>
			<key>value</key>
			<string>Library of Congress</string>
			<key>version</key>
			<string>1</string>
		</dict>
	</array>
	<key>conjunction</key>
	<integer>0</integer>
	<key>group name</key>
	<string>Available at Library of Congress</string>
</dict>
<dict>
	<key>conditions</key>
	<array>
		<dict>
			<key>comparison</key>
			<integer>2</integer>
			<key>key</key>
			<string>BibTeX Type</string>
			<key>value</key>
			<string>book</string>
			<key>version</key>
			<string>1</string>
		</dict>
		<dict>
			<key>comparison</key>
			<integer>4</integer>
			<key>key</key>
			<string>Read</string>
			<key>value</key>
			<string>0</string>
			<key>version</key>
			<string>1</string>
		</dict>
	</array>
	<key>conjunction</key>
	<integer>0</integer>
	<key>group name</key>
	<string>Books to be Read</string>
</dict>
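
The group definitions above look like property-list XML, so Python’s standard `plistlib` module can probably read them once the fragment is wrapped in a complete plist document. The following is a minimal sketch under that assumption. The helper names `wrap_fragment` and `summarize_groups` are hypothetical, and the numeric `comparison` and `conjunction` codes are left as raw integers because their meaning is not stated in this thread.

```python
import plistlib


def wrap_fragment(fragment):
    """Wrap a sequence of <dict> elements in a complete plist document."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<plist version="1.0">\n<array>\n' + fragment + "\n</array>\n</plist>"
    ).encode("utf-8")


def summarize_groups(plist_bytes):
    """Return (name, conjunction, conditions) for each group definition."""
    groups = plistlib.loads(plist_bytes)
    summary = []
    for group in groups:
        conditions = [
            (c["key"], c["comparison"], c["value"]) for c in group["conditions"]
        ]
        summary.append((group["group name"], group["conjunction"], conditions))
    return summary


# One group copied from the fragment above, shortened for illustration.
FRAGMENT = """<dict>
    <key>conditions</key>
    <array>
        <dict>
            <key>comparison</key><integer>2</integer>
            <key>key</key><string>BibTeX Type</string>
            <key>value</key><string>book</string>
            <key>version</key><string>1</string>
        </dict>
        <dict>
            <key>comparison</key><integer>4</integer>
            <key>key</key><string>Read</string>
            <key>value</key><string>0</string>
            <key>version</key><string>1</string>
        </dict>
    </array>
    <key>conjunction</key><integer>0</integer>
    <key>group name</key><string>Books to be Read</string>
</dict>"""

for name, conjunction, conditions in summarize_groups(wrap_fragment(FRAGMENT)):
    print(name, conjunction, conditions)
```

Mapping each condition to a JabRef group would still require knowing what the comparison codes stand for, which this sketch does not attempt.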

Thanks for the hint. I set up BibDesk myself, and I will prepare an issue for parsing these into our groups.

Edit: Import bib files from BIbDesk and parse the groups + linked files · Issue #10381 · JabRef/jabref · GitHub