GSoC 2026 project

Hi,

I am applying for the “Improve handling of older documents by OCR and AI” project for GSoC. Before drafting my proposal, I want to make sure my high-level understanding of the project is correct:

1. A Java interface that allows switching between multiple OCR engines
2. A frontend for configuring the OCR settings
3. Adding a searchable text layer to the original PDF
4. Ensuring that Apache Lucene successfully processes it
5. Passing the extracted text to Grobid for automatic metadata extraction
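
As a rough illustration of item 1, a minimal sketch of an engine-switching interface might look like the following. All names here (`OcrEngine`, `OcrEngineRegistry`, `FixedTextEngine`) are hypothetical and not taken from the JabRef codebase, and the stub engine only returns fixed text where a real implementation would wrap an actual OCR library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical abstraction: one implementation per OCR engine.
interface OcrEngine {
    // Identifier used to select the engine, e.g. from a settings dialog.
    String name();

    // Returns the text recognized in a single rendered page image.
    String extractText(byte[] pageImage);
}

// Hypothetical registry that lets callers switch engines by name.
final class OcrEngineRegistry {
    private final Map<String, OcrEngine> engines = new LinkedHashMap<>();

    void register(OcrEngine engine) {
        engines.put(engine.name(), engine);
    }

    Optional<OcrEngine> find(String name) {
        return Optional.ofNullable(engines.get(name));
    }
}

// Stub engine for illustration only; it ignores the image and
// returns a fixed string instead of calling a real OCR library.
final class FixedTextEngine implements OcrEngine {
    private final String name;
    private final String text;

    FixedTextEngine(String name, String text) {
        this.name = name;
        this.text = text;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public String extractText(byte[] pageImage) {
        return text;
    }
}
```

Code that needs OCR would then depend only on `OcrEngine`, so adding or swapping an engine would not touch the calling code.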

It would be really helpful if somebody could tell me whether I have understood this correctly.

My Background & Setup:
I already have JabRef set up locally on my laptop. Previously, I tried to tackle an issue on “pseudonymization”. Although my code had the proper logic and passed all test cases, it was ultimately rejected because it did not meet the architectural standards. I took that feedback seriously and spent the last few months learning about LLD, OOP, multithreading, and concurrency, and writing cleaner, production-ready code.

Now I am ready to demonstrate my improved code quality. Could you point me toward a specific good first issue related to the OCR or PDF manipulation components so I can start contributing?

Thank you.

Sounds like one acceptable approach, although I am sure that the “switching between OCR engines” part might be very tough to code if the engines you are interested in don’t provide Java bindings. Ultimately, it is your baby. It is you who aspires to code and maintain it (together with the community), so the question is also: what do you want JabRef to be able to do? If you have specific workflows in mind, I am sure you will find a way.

Thanks for taking the time to reply.
I have already started drafting the proposal. But to draft a well-planned proposal, I need to see and understand the relevant parts of the codebase. I have already read the README and the contributing guides thoroughly. Can you tell me which specific parts of the codebase, or which important pieces of information, I should understand to help me draft the proposal?

I cannot take all the work off you…

I have already started to familiarize myself with the parts of the codebase relevant to this project. I feel that the AI logic files, the logic/pdf files, and the logic/search and indexing files are the most important parts. The files under ai in particular provide the basic templates for the code. But the main hurdle I am facing is learning the OCR libraries and how they work (this is completely foreign to me). Another thing I want to ask: do I need to learn JabRef-specific JavaFX, or is an overall understanding of JavaFX enough to contribute?

If the OCR libraries and JavaFX are completely foreign to you right before the proposal deadline, and you feel it is too big a challenge to familiarize yourself with them in such a short timeframe, maybe you would have needed to start preparing earlier. The alternative is to come up with a different proposal in a field that you feel more knowledgeable and confident in. As far as I am aware, we are not using a customized version of JavaFX, so if it works with JavaFX, in theory it should also work in JabRef. We do make use of some third-party projects like GemsFX to enhance some UI components and to reduce maintenance load, so feel free to look into that.