This feature does not appear to have been addressed in 4 years, and with the increased focus on systematic literature reviews I have seen over that period, I think it is something that needs to be revived and worked on.
From my observation, the duplicate detection algorithm tries to be too clever for its own good, using some form of overall similarity measure with a few exceptions or weightings on specific data fields. While this minimizes false positives, it leaves many potential duplicates completely unidentified. During a systematic literature review, false positives are the least of my worries, as they are minimal compared to the number of actual duplicates that occur, so I think the feature would be better served by more complete identification of potential duplicates.
The following are some of the issues I observed while using the duplicate identification feature during a systematic literature review of my own.
Too much weighting is placed on the type of the entry
Some databases create BibTeX entries differently to others, in which case duplicates are completely missed because they are not of the same type, even when most other details match. Springer Link, for example, assigns conference papers as `InCollection` as opposed to `InProceedings`, so matching against other databases does not identify the duplicates correctly. About the only information that seems to override this is having an exactly matching DOI, leading to the next problem.
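To make this concrete, here is a minimal sketch (Python, with invented entry details; the `ENTRYTYPE` key follows bibtexparser's convention and is my assumption) of two exports of the same paper that differ only in type:

```python
# Invented example: the same paper exported by two databases. Springer Link
# labels it incollection, another database inproceedings.
entry_springer = {
    "ENTRYTYPE": "incollection",
    "author": "Smith, Jane",
    "title": "An Example Conference Paper",
    "year": "2020",
}
entry_other = {
    "ENTRYTYPE": "inproceedings",
    "author": "Smith, Jane",
    "title": "An Example Conference Paper",
    "year": "2020",
}

# Every field except the type agrees, so these are clearly duplicates,
# yet a type-sensitive matcher can miss the pair entirely.
core = {k for k in entry_springer if k != "ENTRYTYPE"}
assert all(entry_springer[k] == entry_other.get(k) for k in core)
```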
DOI matching is exact match only
It would be nice if everyone output DOI identifiers in exactly the same way; unfortunately, this is not the case. Some databases output the DOI as the URL (sometimes even in the `url` field rather than the `doi` field) instead of the plain identifier by itself. Since there is no quality tool at the moment to normalise DOIs, this causes problems with matching them (see the sketch after this list). Also, some databases escape LaTeX characters in DOIs, such as underscores (`\_` instead of just `_`), which also prevents the DOIs from being matched. Ideally, the following should all be considered matching:
- `10.1007/978-981-15-4474-3_1`
- `10.1007/978-981-15-4474-3\_1`
- `https://doi.org/10.1007/978-981-15-4474-3_1`
- `https://doi.org/10.1007/978-981-15-4474-3\_1`
- `https://dx.doi.org/10.1007/978-981-15-4474-3_1` (some references may still have the old URL)
- etc.
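A minimal sketch of the kind of normalisation I have in mind (Python; the function name and the exact rules are my own suggestion, not an existing tool):

```python
import re

def normalize_doi(value: str) -> str:
    """Reduce a raw doi/url field value to a bare DOI string."""
    doi = value.strip()
    # Strip any resolver prefix, including the old dx.doi.org host.
    doi = re.sub(r'^https?://(dx\.)?doi\.org/', '', doi, flags=re.IGNORECASE)
    # Undo LaTeX escaping of special characters such as underscores.
    doi = doi.replace(r'\_', '_')
    # DOIs are case-insensitive, so compare in a single canonical case.
    return doi.lower()

variants = [
    "10.1007/978-981-15-4474-3_1",
    r"10.1007/978-981-15-4474-3\_1",
    "https://doi.org/10.1007/978-981-15-4474-3_1",
    r"https://doi.org/10.1007/978-981-15-4474-3\_1",
    "https://dx.doi.org/10.1007/978-981-15-4474-3_1",
]
# All five variants above collapse to the same identifier.
assert len({normalize_doi(v) for v in variants}) == 1
```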
Comparison of BibTeX entries of very different sizes
Some databases export a bunch of their own metadata along with the core BibTeX fields for each entry. This appears to interfere with duplicate detection, as it increases the degree of dissimilarity, especially if the entry it is compared to is very small. Web of Science, for example, is particularly bad for creating bloated entries with unnecessary details. It is even worse if two databases export information under the same field name but in different ways, so the content is completely different: for example, one using the `notes` field to include citation details and another using the `notes` field for a comment like `online first`.
For the former problem, the extraneous data fields should not be incorporated into the matching process. The latter is more difficult and may not be easily solved; for actual duplicates, the dissimilarity of the one field should be outweighed by the matching of the others. However, it might be worth excluding certain 'generic' fields from the matching, such as `notes` and `comments`, as they could potentially contain arbitrary non-matching content (a sketch of such filtering follows). This is particularly important if I have started adding custom fields to record data about the entries, or have been adding comments, and then pull in the results of another search.
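For illustration, here is a sketch of how matching could restrict itself to the fields both entries share, minus the free-form ones (Python; the field list is only my guess at sensible defaults):

```python
# Free-form fields that may legitimately differ between databases or hold
# user annotations; excluding them is my suggestion, not existing behaviour.
IGNORED_FIELDS = {"note", "notes", "comment", "comments", "abstract", "keywords"}

def shared_comparable_fields(entry_a: dict, entry_b: dict) -> set:
    """Fields to feed into the similarity measure: only those present in both
    entries (so a bloated export is not penalised against a minimal one),
    and never the free-form fields above."""
    return (set(entry_a) & set(entry_b)) - IGNORED_FIELDS

# Invented example: a bloated Web of Science-style export vs a minimal entry
# still compares only on author/title/year.
a = {"author": "Smith, Jane", "title": "An Example Paper", "year": "2020",
     "notes": "Times Cited: 12", "keywords": "x; y", "affiliation": "..."}
b = {"author": "Smith, Jane", "title": "An Example Paper", "year": "2020",
     "notes": "online first"}
print(shared_comparable_fields(a, b))  # {'author', 'title', 'year'}
```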
There is no way to mark entries as not duplicates
For the case where there is a false positive and you do not want to merge or remove one of the entries, there is no option to mark the pair as not duplicates. So if you pull in additional results and run the duplicate detection again, you are presented with the same candidates again. While I found the number of false positives to be quite low, it would still get annoying to recheck them each time to confirm they were indeed false positives: after going through 100 candidates, remembering to go back to the one false positive and put in a comment or something as a reminder just wouldn't happen.
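The behaviour I would want could be as simple as persisting dismissed pairs, e.g. (a sketch only; the keying and storage are my assumptions, not how the tool currently works):

```python
# Remember pairs the user has dismissed, keyed by something stable such as
# the entry key or a normalised DOI, so re-running detection skips them.
suppressed: set[frozenset] = set()

def mark_not_duplicates(key_a: str, key_b: str) -> None:
    suppressed.add(frozenset((key_a, key_b)))

def should_offer(key_a: str, key_b: str) -> bool:
    """Only present candidate pairs that were not previously dismissed."""
    return frozenset((key_a, key_b)) not in suppressed

mark_not_duplicates("Smith2020", "Smith2020a")
assert not should_offer("Smith2020", "Smith2020a")   # skipped on the next run
assert should_offer("Smith2020", "Jones2019")        # still offered
```

In a real tool the set would of course need to be saved with the library so it survives between sessions.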
Possible Solution
At least in the context of a systematic literature review, it might be worth having the duplicate detection put its emphasis on a few key fields (such as author, title, date, DOI, and URL) irrespective of the entry type. I think this would maximize candidate detection, which is more important in this context than minimizing false positives. If general similarity measures are used on those fields, the thresholds might need to be set appropriately high, but I would assume that, after appropriate normalisation, those fields would be almost exact matches in duplicate entries.
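Putting the pieces together, here is a sketch of the kind of matcher I am imagining (Python; the field list, threshold, and similarity measure are placeholders for whatever works in practice):

```python
import re
from difflib import SequenceMatcher

KEY_FIELDS = ("author", "title", "year", "doi", "url")  # suggested core set

def normalize_doi(value: str) -> str:
    # Same normalisation as sketched earlier.
    doi = re.sub(r'^https?://(dx\.)?doi\.org/', '', value.strip(),
                 flags=re.IGNORECASE)
    return doi.replace(r'\_', '_').lower()

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def is_candidate(entry_a: dict, entry_b: dict, threshold: float = 0.9) -> bool:
    """Flag a duplicate candidate on key fields only, ignoring entry type.
    A normalised DOI match short-circuits everything else."""
    if "doi" in entry_a and "doi" in entry_b:
        if normalize_doi(entry_a["doi"]) == normalize_doi(entry_b["doi"]):
            return True
    shared = [f for f in KEY_FIELDS
              if f != "doi" and f in entry_a and f in entry_b]
    if not shared:
        return False
    # High threshold: after normalisation, duplicate entries should be
    # near-exact matches on these fields.
    return min(similarity(entry_a[f], entry_b[f]) for f in shared) >= threshold
```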