More control over the Duplicate Finder

The duplicate finder is one of the most valuable tools in JabRef. It would be great if the user had more control over it for a thorough search for duplicates, as in Bookends. Searching by title, author, or year alone, or by any combination of them, would give the user real power to clean up duplicates.

I don’t know which fields the current duplicate finder checks, but I have a number of duplicate entries with the same title that it does not detect. Look at this entry, for example.

The Titles are exactly the same.

JabRef definitely also looks at the author list (we’ve been experiencing problems where two sources treat names slightly differently), but I’m not sure what else it compares. It would be great to get a listing of the specific elements that JabRef compares and have the ability to choose elements that define a duplicate.


I can only agree. It is clearly an area which can be improved significantly. I changed it to use a proper edit distance algorithm a few months back and while that was a clear step forward compared to the previous approach (difference computed from the ASCII numbering of the characters…), there is still both tuning and more likely a general rewrite required.
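To make the idea of an edit distance concrete, here is a minimal sketch of the classic Levenshtein distance together with a length-normalised similarity score. It is only an illustration and not JabRef's actual code; the function names `levenshtein` and `similarity` are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and substitutions
    needed to turn a into b (standard dynamic-programming formulation)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1.0 for identical strings, 0.0 for completely
    different ones, obtained by scaling the distance by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("kitten", "sitting")` is 3. Unlike a difference computed from character codes, the result depends only on how many edits separate the two strings.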

If anyone is interested, this is the file to look at:


I wish I knew how to write Java programs. With my programming skills at level 0, I am not going to catch up. Still, I wish the duplicate finder could be fine-tuned. Have you guys seen this tool?

With a fine-tuned duplicate finder in JabRef and PDFextract, we could eliminate all the manual labor of collecting references from Google Scholar.

You can help us nonetheless by reporting false positives (i.e. different entries which JabRef marks as duplicates) or duplicates which JabRef does not recognize. We can then use these cases to fine-tune the duplicate detection algorithm.

Of course, gladly.

So far, I have seen no false positives. That is, there is minimal chance that JabRef flags non-duplicates as duplicates. Given the Merge option, false positives are not much of a worry anyway. And the comparison feature at that point is amazing. I totally love that part.

Detecting by title with options like (±case, ±diacritics, ±punctuation) would make it the ultimate tool for duplicate removal.
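As a rough sketch of what such title options could look like, the following normalises a title before comparison, with each of the three switches optional. The names `normalize_title` and `same_title` and the always-on stripping of BibTeX braces are assumptions made for this example, not anything JabRef provides.

```python
import unicodedata


def normalize_title(title, ignore_case=True, ignore_diacritics=True,
                    ignore_punctuation=True):
    """Reduce a title to a canonical form for comparison (hypothetical helper)."""
    # Assumption: BibTeX case-protection braces never matter for matching.
    text = title.replace("{", "").replace("}", "")
    if ignore_diacritics:
        # Decompose accented letters and drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if ignore_punctuation:
        # Replace every Unicode punctuation character with a space.
        text = "".join(" " if unicodedata.category(ch).startswith("P") else ch
                       for ch in text)
    if ignore_case:
        text = text.casefold()
    # Collapse runs of whitespace left over from the steps above.
    return " ".join(text.split())


def same_title(a, b, **options):
    """True if both titles are identical after the selected normalisation."""
    return normalize_title(a, **options) == normalize_title(b, **options)
```

With all options on, `same_title("Éléments of {D}eep-Learning: A Survey.", "elements of deep learning a survey")` is true; with `ignore_case=False`, "Deep Learning" and "deep learning" are treated as different titles.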

Looking forward to seeing the progress.

This feature does not appear to have been addressed in 4 years, and with the increased focus on systematic literature reviews I have seen over that period, I think it is something that needs to be revived and worked on.

From my observation it appears the duplicate detection algorithm tries to be too clever for its own good, using some form of overall similarity measure with a few exceptions or weightings on specific data fields. While this minimizes false positives, it leaves many potential duplicates completely unidentified. During a systematic literature review, false positives are the least of my worries as they are minimal as compared to the number of actual duplicates that occur, so I think the feature would better serve a more complete identification of potential duplicates.

The following are some of the issues I have observed while trying to use the duplicate identification feature while performing a systematic literature review of my own.

Too much weighting is placed on the type of the entry

Some databases create BibTeX entries differently to others, in which case duplicates are completely missed because they are not of the same type, even when most other details match. Springer Link, for example, assigns conference papers as InCollection as opposed to InProceedings, so matching against other databases does not identify the duplicates correctly. About the only information that seems to override this is having an exactly matching DOI, leading to the next problem.

DOI matching is exact match only

It would be nice if everyone output DOI identifiers in exactly the same way. Unfortunately, this is not the case. Some databases output the DOI as a URL (sometimes even in the url field rather than the doi field) instead of the plain identifier. Since there is no quality tool at the moment to normalise the DOIs, this causes problems with matching them. Also, some databases escape LaTeX special characters in DOIs, such as underscores (\_ instead of just _), which also prevents the DOIs from being matched. Ideally, the following should all be considered matching:

  • 10.1007/978-981-15-4474-3_1
  • 10.1007/978-981-15-4474-3\_1
  • (some references may still have the old URL)
  • etc.
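A minimal sketch of the kind of normalisation that would make the variants above compare equal is shown below. The function name `normalize_doi` is hypothetical, and the set of prefixes it strips (the doi.org and dx.doi.org resolver URLs and a leading "doi:") is an assumption about which forms occur in practice.

```python
import re

# Assumed prefixes: resolver URLs (http or https, with or without "dx.")
# and a leading "doi:" label.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(raw: str) -> str:
    """Return a bare, lower-cased DOI for comparison (hypothetical helper)."""
    doi = raw.strip()
    doi = doi.replace("\\_", "_")   # undo LaTeX escaping of underscores
    doi = _DOI_PREFIX.sub("", doi)  # drop a URL or "doi:" prefix if present
    return doi.lower()              # DOI names are case-insensitive
```

Comparing `normalize_doi(a) == normalize_doi(b)` instead of the raw field contents would treat the plain, escaped, and URL forms of the same identifier as a match.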

Comparison of BibTeX entries of very different sizes

Some databases export a bunch of their own metadata along with the core BibTeX fields for each entry. This appears to interfere with the duplicate detection as it increases the degree of dissimilarity, especially if the entry it is compared to is very small. Web of Science, for example, is particularly bad for creating bloated entries with unnecessary details. It is even worse if two databases export info using the same field name but in different ways, so the content is completely different. For example, one using the notes field to include citation details and another using the notes field for a comment like online first.

For the former problem, the extraneous data fields should simply not be incorporated into the matching process. The latter is more difficult and may not be easily solved, but for actual duplicates the dissimilarity of one field should be outweighed by the matching of the others. It might still be worth excluding certain ‘generic’ fields such as notes and comments from the matching, as they can contain arbitrary non-matching content. This is particularly important if I have started adding custom fields to record data about the entries and/or adding comments, and then pull in the results of another search.
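One way this could work, sketched under the assumption that entries are plain dictionaries with lower-case field names: compare only the fields both entries have, and skip a small set of generic fields. The names `IGNORED_FIELDS` and `entry_similarity` are invented for this example, and `difflib.SequenceMatcher` stands in for whatever string similarity the real algorithm uses.

```python
from difflib import SequenceMatcher

# Assumption: free-text fields that should never influence matching.
IGNORED_FIELDS = {"note", "notes", "comment", "comments"}


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity of two field values, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def entry_similarity(entry_a: dict, entry_b: dict, ignored=IGNORED_FIELDS) -> float:
    """Average similarity over the fields present in BOTH entries.

    Fields that only one entry has (database-specific metadata, custom
    fields) and the ignored generic fields do not affect the score.
    """
    shared = (entry_a.keys() & entry_b.keys()) - set(ignored)
    if not shared:
        return 0.0
    return sum(field_similarity(entry_a[f], entry_b[f]) for f in shared) / len(shared)
```

With this scheme, a three-field entry and a bloated export of the same paper score 1.0 as long as their shared core fields agree, no matter how many extra fields the larger entry carries or how their notes differ.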

There is no way to mark entries as not duplicates

For the case where there is a false positive and you do not want to merge or remove one of the entries, there is no option to mark the entries as not duplicates. So if you pull in additional results and run the duplicate detection again, you are presented with the same candidates again. While I found the number of false positives to be quite low, it would still get annoying to recheck them each time to determine that they were indeed false positives: after going through 100 candidates, remembering to go back to the one false positive and add a comment or something as a reminder just wouldn’t happen.
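A possible shape for such a feature, sketched with hypothetical names (`RejectedPairs`, `filter_candidates`) and assuming each entry has a stable citation key: remember the pairs the user has rejected and filter them out of later runs.

```python
class RejectedPairs:
    """Remembers pairs of citation keys the user marked as 'not duplicates'."""

    def __init__(self):
        self._pairs = set()

    def mark_not_duplicate(self, key_a: str, key_b: str) -> None:
        # frozenset makes the pair order-independent: (a, b) equals (b, a).
        self._pairs.add(frozenset((key_a, key_b)))

    def is_rejected(self, key_a: str, key_b: str) -> bool:
        return frozenset((key_a, key_b)) in self._pairs


def filter_candidates(candidates, rejected: RejectedPairs):
    """Drop candidate pairs that were already rejected in an earlier run."""
    return [(a, b) for a, b in candidates if not rejected.is_rejected(a, b)]
```

In a real implementation the rejected pairs would have to be stored with the library so they survive between sessions; this sketch keeps them in memory only.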

Possible Solution

At least in the context of a systematic literature review, it might be worth having the duplicate detection put its emphasis on a few key fields (such as author, title, date, doi, and url) irrespective of their entry type. I think this would maximize candidate detection, which is more important in this context than minimizing false positives. If using general similarity measures on those fields, the thresholds might need to be set appropriately high, but I would assume that, after appropriate normalisation, those fields would be almost exact matches in duplicate entries.
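The proposal could be sketched roughly as follows. Everything here is an assumption for illustration: entries are dictionaries with lower-case field names, `year` stands in for the date, the 0.9 threshold is arbitrary, and `difflib.SequenceMatcher` is used as a generic similarity measure. The entry type is deliberately never consulted.

```python
from difflib import SequenceMatcher

KEY_FIELDS = ("title", "author", "year")  # assumed key fields
THRESHOLD = 0.9                            # assumed, would need tuning


def _norm(value: str) -> str:
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(value.casefold().split())


def is_candidate_duplicate(entry_a: dict, entry_b: dict, threshold=THRESHOLD) -> bool:
    """Flag a pair as a potential duplicate based on key fields only."""
    # An identical DOI (after trivial normalisation) settles it immediately.
    doi_a, doi_b = entry_a.get("doi"), entry_b.get("doi")
    if doi_a and doi_b and _norm(doi_a) == _norm(doi_b):
        return True
    # Without a title on both sides there is too little to go on.
    if not entry_a.get("title") or not entry_b.get("title"):
        return False
    for field in KEY_FIELDS:
        a, b = entry_a.get(field), entry_b.get(field)
        if not a or not b:
            continue  # a field missing on one side is not held against the pair
        if SequenceMatcher(None, _norm(a), _norm(b)).ratio() < threshold:
            return False
    return True
```

With this check, an InCollection entry and an InProceedings entry that share author, title, and year are reported as a candidate pair, which is the Springer Link case described above.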

@mattys101 Thanks for your observations. I suggest you open a GitHub issue for this feature.

JabRef has a DOI parser which can parse and normalize all kinds of DOIs.
Looking at the code, the duplicate detection algorithm just checks the content of the identifier fields, but in the case of the DOI it does not do any parsing beforehand. That should be added and would probably resolve a lot of cases.
You can find the implementation of JabRef’s duplicate detection algorithm here.

Had a quick look at the code and it is good to see that my observations appear to be accurate. I added an ‘improvement suggestion’ on GitHub. Now we just need someone to pick it up and make the changes :wink:
Best regards,