GSoC 2026: Use PostgreSQL as full GUI data backend for JabRef - Aritra Banerjee

GSoC 2026 Introduction: PostgreSQL as Full GUI Data Backend (Issue #12708)

Hello @koppor, @InAnYan, and the JabRef community,

My name is Aritra Banerjee, and I am preparing my GSoC 2026 proposal for migrating JabRef’s main table data layer to PostgreSQL.

My background involves bridging complex data streams into Java Spring Boot objects, and I have built low-level data structures in C from scratch. This gives me a keen appreciation of the memory constraints currently bottlenecking JabRef’s in-memory model, and of why loading entire libraries into RAM is no longer scalable.

I have reviewed the BibDatabaseContext and the current data architecture, and I am currently drafting my proposal around two core technical pillars:

  1. Hybrid Database Schema: To handle the dynamic, user-defined nature of BibTeX fields without breaking relational structure, I am designing a schema that uses standard columns for core metadata alongside a PostgreSQL JSONB column for robust, indexable storage of custom fields (a rough schema sketch follows after this list).

  2. Keyset Pagination (Cursor-based): To keep the JavaFX UI smooth and memory-efficient, the Java DAO layer will use keyset pagination rather than OFFSET/LIMIT, so that fetching the 10,000th citation is as fast as the first. I also plan to containerize the local PostgreSQL testing environment to keep development reproducible.
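To make the first pillar concrete, here is a rough sketch of what such a hybrid schema could look like. Everything here is a placeholder for the proposal rather than a final decision: the table and column names (entries, dynamic_fields), the choice of core columns, and the local connection details.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Rough sketch of the hybrid schema: fixed columns for core metadata, one JSONB
// column for arbitrary BibTeX fields, and a GIN index on that column.
// All table/column names are placeholders, not final design decisions.
public class HybridSchemaSketch {

    static void createSchema(Connection connection) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            statement.execute("""
                    CREATE TABLE IF NOT EXISTS entries (
                        id             BIGSERIAL PRIMARY KEY,
                        entry_type     TEXT NOT NULL,               -- article, inproceedings, ...
                        citation_key   TEXT,
                        title          TEXT,
                        dynamic_fields JSONB NOT NULL DEFAULT '{}'::jsonb
                    )
                    """);
            // GIN index so queries over user-defined fields stay indexable
            statement.execute("""
                    CREATE INDEX IF NOT EXISTS idx_entries_dynamic_fields
                    ON entries USING GIN (dynamic_fields)
                    """);
        }
    }

    public static void main(String[] args) throws SQLException {
        // Local test connection only; the real integration would go through JabRef's configuration
        try (Connection connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/jabref", "jabref", "jabref")) {
            createSchema(connection);
        }
    }
}
```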

I am finalizing the architectural diagrams and milestones in my proposal document now, and I will share the link in this thread very shortly for your feedback.

I look forward to contributing!

Hi again @koppor, @InAnYan, and the community,

Following up on my initial post, I wanted to share some concrete data to validate the architecture I proposed. To ensure this approach is viable before the GSoC selection period, I spent the last few days building a standalone prototype of the data layer.

To accurately mirror JabRef’s desktop environment, the prototype was built in pure Java using Gradle, with HikariCP for connection pooling and the PostgreSQL JDBC driver.

The Test Setup:

  • I created the Hybrid Schema I discussed: core columns for standard metadata, and a JSONB column with a GIN index for all dynamic BibTeX fields.

  • I seeded the database with 50,000 realistic dummy entries to simulate a massive library.

  • I implemented the keyset pagination (WHERE id > ? ORDER BY id LIMIT 50) to stream the data into Java without overwhelming the heap; a minimal sketch of the fetch loop follows below.
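For reference, the core of that fetch loop looks roughly like this (a simplified sketch, not the exact prototype code; EntryRow and the column names stand in for the prototype’s types, and the full code is in the repository linked further down):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the keyset fetch: each page starts strictly after the last id
// already seen, so there is no OFFSET scan and page N costs the same as page 1.
public class KeysetPageFetcher {

    public record EntryRow(long id, String citationKey, String title) { }

    private static final String PAGE_QUERY = """
            SELECT id, citation_key, title
            FROM entries
            WHERE id > ?
            ORDER BY id
            LIMIT ?
            """;

    public List<EntryRow> fetchPage(Connection connection, long lastSeenId, int pageSize)
            throws SQLException {
        List<EntryRow> page = new ArrayList<>(pageSize);
        try (PreparedStatement statement = connection.prepareStatement(PAGE_QUERY)) {
            statement.setLong(1, lastSeenId);
            statement.setInt(2, pageSize);
            try (ResultSet resultSet = statement.executeQuery()) {
                while (resultSet.next()) {
                    page.add(new EntryRow(
                            resultSet.getLong("id"),
                            resultSet.getString("citation_key"),
                            resultSet.getString("title")));
                }
            }
        }
        // The caller remembers the id of the last row and passes it as lastSeenId next time.
        return page;
    }
}
```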

The Results:

I ran a continuous scrolling simulation across all 50,000 rows while monitoring the JVM with VisualVM.

  1. Speed: Fetch times quickly stabilized at 2-3 milliseconds per 50-row page.

  2. Memory Footprint: Because keyset pagination lets the Garbage Collector reclaim old pages efficiently, the used heap followed a bounded sawtooth pattern: it peaked at ~46 MB and idled around 15 MB, indicating an essentially constant (O(1)) memory footprint regardless of library size.

Visual Proof & Code:

You can view the full prototype repository and reproduce the simulation here:
Code: https://github.com/programmer1128/GSoCJabRef2026

If the mentors have a moment, I would highly value your architectural feedback on this PoC. Specifically, are there any edge-case BibTeX fields or specific JSONB nesting structures you would like me to test against this index next?

Thank you!

JabRef is a volunteer project and we are searching for GSoC students who are able to work on their own. Thus, it is nice to see that a prototype has been delivered.

It is unclear to me how this prototype wires into a JavaFX table with lazy loading.

The prototype uses a very simple BibEntry data type, which is very different from JabRef's actual data type: jabref/jablib/src/main/java/org/jabref/model/entry/BibEntry.java (main branch of JabRef/jabref).

The challenge is:

  • handle arbitrary BibTeX entry types - Customize entry types | JabRef
  • handle Unicode and LaTeX encoding - Düsseldorf vs. Duesseldorf vs. D\"{u}sseldorf - all meaning the same thing.
  • handle authors (this is IMHO the most challenging part with respect to the data structures)

Hi @koppor

Thank you so much for taking the time to review my prototype and for outlining these specific challenges. I completely agree—my initial PoC was a simplified approach just to prove the O(1) memory constraint. Wiring it into JabRef’s actual domain model and UI is where the real engineering challenge lies.

Here is my architectural strategy for addressing the excellent points you raised:

1. Wiring into JavaFX Lazy Loading:

The keyset pagination (LIMIT 50) in the backend is designed to act as the data provider for a virtualized UI component. As the user scrolls the JavaFX TableView and approaches the end of the cached data, a background Task triggers the next keyset fetch and appends the fetched rows to the ObservableList on the JavaFX Application Thread. This ensures the UI thread never blocks and the JVM heap is never flooded by a 50,000-entry file.
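As a rough sketch of that wiring (not final code: EntryDao, EntryRow, and the prefetch threshold are placeholders I would refine against JabRef’s actual table model and task infrastructure):

```java
import javafx.collections.ObservableList;
import javafx.concurrent.Task;

import java.util.List;
import java.util.concurrent.ExecutorService;

// Rough sketch of feeding a virtualized TableView from the keyset-backed DAO.
// EntryDao, EntryRow and the prefetch threshold are placeholders, not JabRef APIs.
public class LazyTableLoader {

    public interface EntryDao {
        List<EntryRow> fetchPage(long lastSeenId, int pageSize) throws Exception;
    }

    public record EntryRow(long id, String citationKey, String title) { }

    private final ObservableList<EntryRow> items; // backing list of the TableView
    private final EntryDao dao;
    private final ExecutorService executor;       // background thread pool
    private volatile long lastSeenId = 0;
    private volatile boolean loading = false;

    public LazyTableLoader(ObservableList<EntryRow> items, EntryDao dao, ExecutorService executor) {
        this.items = items;
        this.dao = dao;
        this.executor = executor;
    }

    // Called whenever the last visible row index changes (e.g. from a scroll listener).
    public void loadNextPageIfNeeded(int lastVisibleIndex) {
        if (loading || lastVisibleIndex < items.size() - 10) {
            return; // still enough cached rows ahead of the viewport
        }
        loading = true;
        Task<List<EntryRow>> fetchTask = new Task<>() {
            @Override
            protected List<EntryRow> call() throws Exception {
                return dao.fetchPage(lastSeenId, 50); // runs off the JavaFX Application Thread
            }
        };
        fetchTask.setOnSucceeded(event -> {           // handler runs on the FX thread
            List<EntryRow> page = fetchTask.getValue();
            if (!page.isEmpty()) {
                lastSeenId = page.get(page.size() - 1).id();
                items.addAll(page);
            }
            loading = false;
        });
        fetchTask.setOnFailed(event -> loading = false);
        executor.submit(fetchTask);
    }
}
```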

2. The Complex BibEntry & Arbitrary Fields:

Looking at BibEntry.java, the use of an internal Map (and ObjectProperty for observability) to handle arbitrary fields is exactly why I selected PostgreSQL’s JSONB over a strict relational schema. We do not need to alter the database schema when users define custom fields: the DAO layer will serialize JabRef’s internal field map directly into the dynamic_fields JSONB column, and the GIN index keeps searching across any of those arbitrary fields fast.
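A sketch of that DAO write path is below. The plain Map<String, String> stands in for BibEntry’s internal field storage, Jackson is just whichever JSON mapper the prototype ends up using, and the entries/dynamic_fields names are carried over from my earlier sketch rather than existing JabRef code.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.postgresql.util.PGobject;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.Map;

// Sketch of serializing an entry's arbitrary fields into the JSONB column.
// The Map<String, String> stands in for BibEntry's internal field map.
public class DynamicFieldWriter {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public void insertEntry(Connection connection,
                            String entryType,
                            String citationKey,
                            Map<String, String> fields) throws Exception {
        PGobject jsonb = new PGobject();
        jsonb.setType("jsonb");
        // e.g. {"journal": "...", "groups": "...", "mycustomfield": "..."}
        jsonb.setValue(MAPPER.writeValueAsString(fields));

        try (PreparedStatement statement = connection.prepareStatement(
                "INSERT INTO entries (entry_type, citation_key, dynamic_fields) VALUES (?, ?, ?)")) {
            statement.setString(1, entryType);
            statement.setString(2, citationKey);
            statement.setObject(3, jsonb);
            statement.executeUpdate();
        }
    }
}
```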

3. Unicode vs. LaTeX Search Equivalence (Düsseldorf vs. D\"{u}sseldorf):

To handle this without mutating the user’s raw data, I propose keeping a normalized search representation alongside the raw value. We store the exact, raw LaTeX string in the JSONB field, but also write a normalized, plain-text “search” column on insertion, using JabRef’s existing LatexToUnicodeAdapter logic; since that adapter is Java code, this normalization happens in the DAO at write time rather than in a PostgreSQL generated column (a functional index over a SQL-side normalization would be an alternative). Search queries execute against the normalized, indexed column, but the UI always receives the raw, accurate string.
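A rough sketch of that write path follows; title_raw and title_search are illustrative column names, and the normalizer is injected rather than hard-wired so the sketch does not pretend to call JabRef internals directly.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.function.UnaryOperator;

// Sketch of the "raw value + normalized search column" idea: the raw LaTeX string is
// stored untouched, and a plain-text copy is written into a separate indexed column.
// In JabRef, the normalizer would be wired to the existing LaTeX-to-Unicode conversion.
public class NormalizedTitleWriter {

    private final UnaryOperator<String> latexToUnicode;

    public NormalizedTitleWriter(UnaryOperator<String> latexToUnicode) {
        this.latexToUnicode = latexToUnicode;
    }

    public void updateTitle(Connection connection, long entryId, String rawLatexTitle)
            throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(
                "UPDATE entries SET title_raw = ?, title_search = ? WHERE id = ?")) {
            statement.setString(1, rawLatexTitle);                       // e.g. D\"{u}sseldorf, exactly as typed
            statement.setString(2, latexToUnicode.apply(rawLatexTitle)); // e.g. Düsseldorf
            statement.setLong(3, entryId);
            statement.executeUpdate();
        }
    }
}
```

Queries would then match against title_search (which can carry a plain B-tree or trigram index), while the UI keeps reading the raw value.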

4. Handling Authors:

I agree that mapping JabRef’s complex AuthorList (First, Last, Von, Jr) onto traditional relational tables would require join tables that quickly become a bottleneck. My strategy relies on parsing at the edge. When JabRef parses a BibTeX string into Java Author objects, the DAO will serialize that structured list directly into a JSON array within the database (e.g., [{"first": "John", "von": "von", "last": "Neumann"}]). To keep these searches fast, we can apply a GIN index with the jsonb_path_ops operator class, which is optimized for containment queries on the JSON. For broader full-text searches (e.g., a user just typing “Albert”), we can add a generated tsvector column over the author JSON, with its own GIN index, so those queries avoid table scans.
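To illustrate the containment-query side of this, here is a sketch; the authors key inside dynamic_fields, the index name, and the jsonb_path_ops choice are all part of the sketch rather than settled design.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Sketch of storing authors as a JSONB array and querying it by containment.
public class AuthorJsonQueries {

    public void createAuthorIndex(Connection connection) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            // jsonb_path_ops GIN indexes only support @> containment, but are smaller and faster for it
            statement.execute("""
                    CREATE INDEX IF NOT EXISTS idx_entries_authors
                    ON entries USING GIN ((dynamic_fields -> 'authors') jsonb_path_ops)
                    """);
        }
    }

    // Finds entries whose author array contains an object with the given last name,
    // e.g. dynamic_fields -> 'authors' @> '[{"last": "Neumann"}]'
    public List<Long> findByAuthorLastName(Connection connection, String lastName) throws SQLException {
        List<Long> entryIds = new ArrayList<>();
        String containmentJson = "[{\"last\": \"" + lastName.replace("\"", "\\\"") + "\"}]";
        try (PreparedStatement statement = connection.prepareStatement(
                "SELECT id FROM entries WHERE (dynamic_fields -> 'authors') @> ?::jsonb")) {
            statement.setString(1, containmentJson); // in real code, build this with the JSON mapper
            try (ResultSet resultSet = statement.executeQuery()) {
                while (resultSet.next()) {
                    entryIds.add(resultSet.getLong("id"));
                }
            }
        }
        return entryIds;
    }
}
```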

Next Steps:

My immediate next step is to update my prototype. I am going to refactor my simple BibEntry class into a Map-backed mock that mirrors your structure, to show that the JSONB serialization and indexing hold up against JabRef’s actual data model.

Thank you again for the guidance! I look forward to more such challenges.