More control on the Duplicate Finder


(Dellu) #1

The duplicate finder is one of the most valuable tools in JabRef. It would be great if the user had more control over it, for a thorough search of duplicates, just like in Bookends. Searching by title alone, by author, by year, or by a combination of them would give the user the power to clean up duplicates.

I don’t know which fields the current duplicate finder checks, but I have a number of duplicate entries with the same title, and it is not detecting them. Look at this entry, for example.

https://snag.gy/F8ja4S.jpg

The Titles are exactly the same.


(Tmorrell) #2

JabRef definitely also looks at the author list (we’ve been experiencing problems where two sources treat names slightly differently), but I’m not sure what else it compares. It would be great to get a listing of the specific elements that JabRef compares and have the ability to choose elements that define a duplicate.


(oscar.gustafsson) #3

I can only agree. This is clearly an area that can be improved significantly. I changed it to use a proper edit distance algorithm a few months back, and while that was a clear step forward compared to the previous approach (difference computed from the ASCII codes of the characters…), it still needs tuning and, more likely, a general rewrite.

If anyone is interested, this is the file to look at: https://github.com/JabRef/jabref/blob/master/src/main/java/net/sf/jabref/model/DuplicateCheck.java
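
For readers unfamiliar with the term, here is a minimal sketch of what an edit distance comparison can look like. It is not the code in DuplicateCheck.java; the class name `EditDistanceSketch` and the methods `distance` and `similarity` are hypothetical and only illustrate the general technique (Levenshtein distance turned into a 0 to 1 similarity score).

```java
// Illustrative sketch only; not taken from JabRef's DuplicateCheck.java.
class EditDistanceSketch {

    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, or substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; 1.0 means the strings are identical.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) distance(a, b) / max;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));
        System.out.println(similarity("Duplicate Finder", "Duplicate finder"));
    }
}
```

A duplicate check built on this would compare the similarity of each field against some threshold; where that threshold sits is exactly the kind of tuning mentioned above.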


(Dellu) #4

I wish I knew how to write Java programs. With my programming skill at level 0, I am not going to catch up. But I still wish the duplicate finder could be fine-tuned. Have you guys seen this tool?

With a fine-tuned duplicate finder in JabRef and PDFextract, we could eliminate all the manual labor of collecting references from Google Scholar.


(Tobias Diez) #6

You can help us nonetheless by reporting false positives (i.e. different entries which JabRef marks as duplicates) or duplicates which JabRef does not recognize. Then we can use these cases to fine-tune the algorithm to find duplicates.


(Dellu) #7

Of course, gladly.

So far, there have been no false positives. That is, there is minimal chance that JabRef flags non-duplicates as duplicates. Given the Merge option, false positives are not much of a worry anyway. And the comparison feature at that point is amazing. I totally love that part.

Detecting by title with options like (±case, ±diacritics, ±punctuation) would make it the ultimate tool for duplicate removal.

Looking forward to seeing the progress.
Cheers
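
The title options suggested above (±case, ±diacritics, ±punctuation) could be sketched roughly as follows. This is only an assumption about how such options might work, not an existing JabRef feature; `TitleKeySketch`, `normalize`, and `sameTitle` are hypothetical names. Only standard Java library calls (`java.text.Normalizer`, `String.replaceAll`) are used.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustrative sketch only; not part of JabRef.
class TitleKeySketch {

    // Reduce a title to a comparison key according to the chosen options.
    static String normalize(String title, boolean ignoreCase,
                            boolean ignoreDiacritics, boolean ignorePunctuation) {
        String s = title;
        if (ignoreDiacritics) {
            // Decompose accented letters, then drop the combining marks.
            s = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        }
        if (ignorePunctuation) {
            // ASCII punctuation, including BibTeX braces, becomes a space.
            s = s.replaceAll("\\p{Punct}+", " ");
        }
        if (ignoreCase) {
            s = s.toLowerCase(Locale.ROOT);
        }
        // Whitespace is always trimmed and collapsed.
        return s.trim().replaceAll("\\s+", " ");
    }

    // Two titles count as the same if their comparison keys are equal.
    static boolean sameTitle(String a, String b, boolean ignoreCase,
                             boolean ignoreDiacritics, boolean ignorePunctuation) {
        return normalize(a, ignoreCase, ignoreDiacritics, ignorePunctuation)
                .equals(normalize(b, ignoreCase, ignoreDiacritics, ignorePunctuation));
    }

    public static void main(String[] args) {
        String a = "\u00c9tude: The {DNA} Model!"; // starts with E-acute
        String b = "etude the DNA model";
        System.out.println(sameTitle(a, b, true, true, true));
        System.out.println(sameTitle(a, b, false, false, false));
    }
}
```

With all three options switched on, the two example titles reduce to the same key; with all of them off, they stay distinct.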