Reasons to use UTF-8 in bib-files for BibTeX

mkaut · May 21, 2025, 7:48am

I am maintaining a rather large .bib file for our group at work.
We use bibtex, due to old habits and also because all journals’ LaTeX styles I have seen use bibtex (typically with natbib).
For this reason, I have kept the encoding at ‘windows-1252’, to get a warning if someone pastes, for example, author names with non-ASCII characters.

However, I see that JabRef defaults to UTF-8, even for bibtex databases.
What is the reasoning behind this, given that bibtex is an 8-bit-only program?

Or, to put it differently, what am I missing when using an 8-bit encoding?
The only thing I am aware of is warnings about bad characters when I paste an abstract, typically caused by “wrong” dashes or hyphens. But this could be better addressed by auto-correcting these (a feature I would really appreciate, by the way)…
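
As a rough illustration of the two ideas in this post (warning about characters the file encoding cannot represent, and auto-correcting “wrong” dashes), here is a minimal Python sketch. The function names `unencodable_chars` and `normalize_dashes` and the replacement table are assumptions made up for this example; they are not part of JabRef or BibTeX.

```python
# Hypothetical helpers for illustration only; not part of JabRef or BibTeX.

# Assumed mapping from typographic dashes to LaTeX-friendly ASCII sequences.
DASH_REPLACEMENTS = {
    "\u2010": "-",    # HYPHEN
    "\u2011": "-",    # NON-BREAKING HYPHEN
    "\u2013": "--",   # EN DASH
    "\u2014": "---",  # EM DASH
    "\u2212": "-",    # MINUS SIGN
}


def unencodable_chars(text, encoding="cp1252"):
    """Return the distinct characters of text that `encoding` cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


def normalize_dashes(text):
    """Replace typographic dashes according to DASH_REPLACEMENTS."""
    return text.translate(str.maketrans(DASH_REPLACEMENTS))


if __name__ == "__main__":
    pasted = "Müller \u2010 Wałęsa, pages 10\u201320"
    print(unencodable_chars(pasted))   # characters windows-1252 cannot store
    print(normalize_dashes(pasted))    # same text with ASCII dashes
```

Note that windows-1252 (Python's `cp1252`) does contain the en dash and em dash, so in this sketch only characters such as U+2010, “ł” and “ę” are reported.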

koppor · May 21, 2025, 8:22am

Thank you for reaching out. Encoding and meeting user expectations is a very interesting topic.

Since around 2018, pdflatex defaults to UTF-8: Default to UTF-8 in 8-bit TeX · Issue #24 · latex3/latex2e · GitHub. Like bibtex, pdflatex is also an 8-bit program. - Note that UTF-8 was designed to be compatible with 8-bit character systems so that it causes as few issues as possible. Moreover, UTF-8 also covers alphabets beyond plain ASCII, such as Polish, and even scripts beyond Latin, such as Ukrainian or Chinese. There are scientific papers and books written in these languages. See Extend `PdfContentImporter` to extract information from bibliographical pages in books · Issue #12874 · JabRef/jabref · GitHub for an example. These letters can be represented in UTF-8, but not in ASCII (without LaTeX encoding).
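
To make the compatibility point concrete, here is a small Python sketch (the helper name `utf8_summary` is made up for this example). It shows that ASCII-only text produces identical bytes in ASCII, windows-1252 and UTF-8, while non-ASCII letters become multi-byte sequences whose bytes all have the high bit set, so they never collide with ASCII characters such as `{`, `}` or `\`.

```python
def utf8_summary(text):
    """Return (characters, UTF-8 bytes, bytes with the high bit set)."""
    data = text.encode("utf-8")
    high = sum(1 for b in data if b >= 0x80)
    return len(text), len(data), high


if __name__ == "__main__":
    entry = "author = {Knuth, Donald E.}"
    # ASCII-only text is byte-for-byte identical in all three encodings.
    assert entry.encode("ascii") == entry.encode("utf-8") == entry.encode("cp1252")

    for name in ["Knuth", "Wałęsa", "Київ", "中文"]:
        print(name, utf8_summary(name))
```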

We want to give users the choice between LaTeX encoding and UTF-8. Our analysis of how users use their files shows that, for a few years now, Unicode encoding has been used more than LaTeX encoding. Moreover, there is no world-wide dominant encoding. For instance, windows-1252 is Western European, but cannot be used in Greece. Therefore, we default to UTF-8.

Replacing all Unicode characters with LaTeX equivalents is possible with our cleanup functionality: Cleanup entries | JabRef. In the screenshot there, see the last entry under “field formatters”: “All-text-fields”. You need to choose “Unicode to LaTeX” for it, though.
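
Conceptually, such a conversion is a character-by-character lookup. The following Python sketch only illustrates the idea; the tiny table and the function name `unicode_to_latex` are assumptions for this example and do not reflect JabRef's actual implementation, which covers far more characters.

```python
# Illustrative only: a tiny, assumed mapping, not JabRef's real table.
UNICODE_TO_LATEX = {
    "ä": r'{\"a}',
    "ö": r'{\"o}',
    "ü": r'{\"u}',
    "é": r"{\'e}",
    "ł": r"{\l}",
    "ß": r"{\ss}",
}


def unicode_to_latex(text):
    """Replace each mapped character with its LaTeX command; keep the rest."""
    return "".join(UNICODE_TO_LATEX.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(unicode_to_latex("Müller"))
    print(unicode_to_latex("Groß, Éric"))  # "É" is not in the table and stays as-is
```

The sketch assumes precomposed characters (Unicode NFC); a letter written as a base character plus a combining accent would pass through unchanged.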

This conversion can also be applied automatically on save, which has been possible for about 10 years.

Documentation at Save actions | JabRef - We very much welcome contributions to that page to make it more discoverable and more user-centered. Currently, it is more of a draft.

mkaut · May 21, 2025, 9:08am

Thanks a lot for the quick and detailed answer.
I really have to have a look at the field formatters - should have noticed them a long time ago…

Small problem: in both my .bib files, “Enable field formatters” is currently disabled. When I enable it and click “Apply”, JabRef shows that the file has changed - but when I save it, there is no change in the file and the properties show that “Enable field formatters” is disabled again.
Am I missing something, or is this a bug? (I am on JabRef 5.15 on Windows)

koppor · May 21, 2025, 12:23pm

You are very welcome! - It could be even more detailed, but time ran out. Maybe there could be a blog post. But I should first get my five-year-old draft for another post finished → Add blog post on changes of our .bib serialization by koppor · Pull Request #47 · JabRef/blog.jabref.org · GitHub

We did not advertise them properly. You know, first one wants to get the documentation right. And then other feature requests come in and that gets forgotten…

I just realize that JabRef 5.15 is more than a year old…

I tried it here with a minimal .bib file - and it worked (with both 5.15 and the latest dev version). Would you mind sharing your .bib file with me so that I can investigate what’s happening?

mkaut · May 21, 2025, 12:53pm

It is still the latest version, as far as I can see. I have the version that winget calls 5.15.60000; its ‘About’ window shows:

JabRef 5.15--2024-07-10--1eb3493
Windows 10 10.0 amd64 
Java 21.0.2 
JavaFX 22.0.1+7

Winget also shows version 6.0-alpha.60000, but that sounds scary…

I don’t think the .bib files are to blame. I tried with a fresh .bib file with a single entry taken from Wikipedia and it still does not work…
Maybe I have some esoteric combination of Preferences, or something like that?

mkaut · May 21, 2025, 1:13pm

Possible progress:

When I create a new file, there are 4 formatters by default:

[Screenshot: the four default field formatters]
When I save it like this, then the change is not saved and formatters remain deactivated
When I delete one entry, then it is saved correctly
After that, I can add new entries and it works

So there seems to be something weird about the initial selection…

Side note: is “Replace Unicode ligatures” (which is not mentioned in the documentation, but is included in the default list) covered by “Unicode to LaTeX”?

koppor · May 23, 2025, 2:12pm

Thank you for the feedback. Investigation on our side takes time. I put it on my TODO list and will try to come back to this within this month.

I don’t know about the versions distributed by WinGet. We offer “instant” builds at https://builds.jabref.org/main/. While I always use that version to work with my .bib file, our usual communication is to say: use the stable version.

We have a high number of pull requests - and some need feedback from power users.

For instance at Use git for backup by khola22 · Pull Request #12252 · JabRef/jabref · GitHub, I am not sure if it works as intended. But there are more to try out at Pull requests · JabRef/jabref · GitHub

We are beginning to write a small guide on how to try these PRs out. Current state: Add howto for local clone by koppor · Pull Request #569 · JabRef/user-documentation · GitHub - Maybe you have feedback on that?

koppor · June 2, 2025, 1:10pm

@mkaut I could not manage to work on this.

Please open an issue on GitHub so that we can track it in our normal issue handling. I think your steps are enough to post in the issue. It would be good to also have the error as a screenshot. That way, it is approachable for newcomers and we can label it as “good second issue”.

The difference between “Replace Unicode ligatures” and “Unicode to LaTeX” should be a separate question. Someone needs to go into JabRef’s code and then update the documentation accordingly.
