Parse medical subject headings (MeSH) when importing PubMed text files

References exported from PubMed in the Medline format contain medical subject headings (MeSH) and subheadings as MH - <heading/subheading/subheading>, with asterisks indicating “major” topics. There is another label (OT - ) for “other terms” that are not MeSH terms.

Here is an example where subheading1 (under Heading3), Heading4, and Heading5 are major topics.

MH  - Heading1/subheading1
MH  - Heading2/subheading1/subheading2
MH  - Heading3/*subheading1/subheading3
MH  - *Heading4
MH  - *Heading5/subheading1/subheading2
OT  - some term from the authors
OT  - another term

JabRef imports each MH or OT line as one keyword. To preserve the details, I transform the Medline data before importing. This is how I like to do it:

MH  - Heading1 [mh]
MH  - Heading1/subheading1 [mh]
MH  - Heading2 [mh]
MH  - Heading2/subheading1 [mh]
MH  - Heading2/subheading2 [mh]
MH  - Heading3 [mh]
MH  - Heading3/subheading1* [mh]
MH  - Heading3/subheading3 [mh]
MH  - Heading4* [mh]
MH  - Heading5* [mh]
MH  - Heading5*/subheading1 [mh]
MH  - Heading5*/subheading2 [mh]
OT  - some term from the authors
OT  - another term

This preserves the distinction between MeSH and other terms and identification of major topics, while allowing convenient alphabetisation.
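
For anyone who wants to experiment, the transformation above can be sketched in a few lines of Python. This is only a rough illustration, not JabRef code: the function name expand_mesh_line is made up, and it assumes that a slash only ever separates a heading from its subheadings.

```python
def expand_mesh_line(value: str) -> list[str]:
    """Expand one MH value into heading and heading/subheading keywords."""

    def star_to_end(term: str) -> str:
        # '*Term' becomes 'Term*' so major topics sort beside the plain term.
        return term[1:] + "*" if term.startswith("*") else term

    # Assumption: '/' only ever separates a heading from its subheadings.
    heading, *subheadings = value.strip().split("/")
    heading = star_to_end(heading)

    # The heading on its own comes first, then one keyword per subheading.
    terms = [heading] + [f"{heading}/{star_to_end(sub)}" for sub in subheadings]
    return [f"{term} [mh]" for term in terms]


for keyword in expand_mesh_line("Heading3/*subheading1/subheading3"):
    print(keyword)
# Heading3 [mh]
# Heading3/subheading1* [mh]
# Heading3/subheading3 [mh]
```

The value passed in is the text after the "MH  - " label, so the label itself has to be stripped off and put back by whatever reads the file.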

Commas and asterisks can be a problem for some reference managers, so I have also used this variation (the first two lines are the original terms, the last two are the transformed versions):

MH  - *Comma, has one
MH  - Heading, not major
MH  - Comma - has one [mj]
MH  - Heading - not major [mh]
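
A minimal Python sketch of this variation might look like the following. The function name safe_mesh_keyword is hypothetical, and the sketch assumes that the input is a single heading (or one already-split heading/subheading pair), so that one [mj] or [mh] marker can stand for the whole keyword.

```python
def safe_mesh_keyword(value: str) -> str:
    """Rewrite one term without commas or asterisks, tagging major topics."""
    term = value.strip()
    # Any asterisk marks the keyword as a major topic.
    is_major = "*" in term
    # Only ', ' (comma plus space) is replaced; other comma forms are left alone.
    term = term.replace("*", "").replace(", ", " - ")
    return f"{term} [{'mj' if is_major else 'mh'}]"


print(safe_mesh_keyword("*Comma, has one"))     # Comma - has one [mj]
print(safe_mesh_keyword("Heading, not major"))  # Heading - not major [mh]
```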

I make these changes in a text editor before importing the records to JabRef, but it would be a nice improvement if JabRef performed the cleanup. I imagine that anyone who uses PubMed references would appreciate this.

Note: The behaviour is slightly different for references fetched as XML from PubMed. In this case, I believe that the heading/subheading relationships are imported correctly, though still as generic keywords and without preservation of major topics.



Thanks for your suggestion.

If I understand this right, the structure of the keywords should be changed, so that it takes the delimiters and splits by slash so that we can have a kind of hierarchy with topics?
Could you provide an example of what it should look like in BibTeX at the end?

Currently, the first value encountered is stored as the keyword, and subsequent values are then appended:

the structure of the keywords should be changed, so that it takes the delimiters and splits by slash

Yes. Headings and subheadings have a many-to-many relationship, represented in PubMed format as one-heading to zero-or-more subheadings (one heading per line).

Here is a basic example of terms that I modified before importing.

The original form of the record indexed with all the terms was:

MH  - Ankle Joint/innervation/physiology

Usually, I split lines like this into three. While it is redundant to include the heading as a term of its own, this puts all “Ankle Joint” entries in one group, without using subgroups or searches, regardless of subheadings. I find this convenient for working with the references.

Here is another example, this time including major topics, as indicated by an asterisk.

In the original form each record had one of the following forms:

MH  - Imaging, Three-Dimensional
MH  - *Imaging, Three-Dimensional
MH  - *Imaging, Three-Dimensional/methods
MH  - Imaging, Three-Dimensional/*methods

In the transformation, I

  • retained commas from the original terms, and set JabRef’s delimiter to ;.
  • moved asterisks to the end of the heading and/or subheading to keep like-terms adjacent alphabetically
  • marked each keyword with [mh], because the abbreviated syntax for searching subject headings in PubMed is "Some Heading"[mh], and brackets are not used in MEDLINE subject headings (as far as I know).

Notice that all of the entries in the second example deal with three-dimensional imaging, but if I am studying imaging methods, then I am particularly interested in the last group, where “methods” is a major topic and not merely incidental. Headings and/or subheadings can be a major topic.

More info:

  • The label OT - maps to keywords in JabRef. These terms need no special treatment.
  • The label RN - , for registry number named substances, also maps to keywords in JabRef. These terms often include characters that can be incorrectly interpreted as delimiters. I have not dealt with this recently, and I am not sure if JabRef has any difficulty with this.

Here is an example of a named substance registry number (no delimiter-like characters in this case).

RN  - 0 (Chromatin)

Here is an example of MeSH terms that were imported to JabRef without prior transformation.

The heading and subheadings remain all together in one string, exactly as they appear in the source file. This is an acceptable but inconvenient result because:

  • Subheadings after the first one are difficult to see.
    (try browsing for “Carbamates/adverse effects” or “Carbamazepine/adverse effects”)
  • Finding a specific heading and subheading combination requires regex or boolean expressions.

The keywords in the screenshot above are split by commas, but the entry viewer shows all the MeSH terms strung together as one giant keyword.

Adding to the complication, MeSH terms include commas that should not be interpreted as delimiters, so the defaults in JabRef split MeSH terms in unexpected places. Changing the delimiter in JabRef may prevent this. In any case, the problem can be remedied by changing the delimiter after the fact in the BibTeX file.

Edit:

  • If the delimiter set in File > Preferences > Entries does not match the delimiter used in the keywords field of the BibTeX entries, then the keywords will not be parsed properly in the entry editor (as expected).
  • JabRef always inserts a comma between MeSH terms during import, regardless of the delimiter provided in settings (as shown in the code from @Siedlerchr).
    This is a problem with PubMed files, because MeSH terms include commas that should not be interpreted as delimiters.
  • Commas within MeSH terms are indistinguishable from those added by JabRef between terms, so the input file has to be modified to allow correct parsing of keywords after the import.
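
The ambiguity is easy to reproduce outside JabRef. The short Python sketch below does not use any JabRef code; split_keywords is a hypothetical helper that only mimics splitting a keywords field on a delimiter, and the two terms are taken from the examples earlier in this thread.

```python
def split_keywords(field: str, delimiter: str) -> list[str]:
    """Split a keywords field on a delimiter and trim each piece."""
    return [part.strip() for part in field.split(delimiter) if part.strip()]


terms = ["Imaging, Three-Dimensional/methods", "Ankle Joint/physiology"]

# Joined with commas, the comma inside the first term looks like a delimiter.
comma_joined = ", ".join(terms)
print(split_keywords(comma_joined, ","))
# ['Imaging', 'Three-Dimensional/methods', 'Ankle Joint/physiology']

# Joined with semicolons, both terms survive the round trip.
semicolon_joined = "; ".join(terms)
print(split_keywords(semicolon_joined, ";"))
# ['Imaging, Three-Dimensional/methods', 'Ankle Joint/physiology']
```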

Note: JabRef inserts the user-specified delimiter between MeSH terms when retrieving entries directly from PubMed (via JabRef’s Web Search). Unfortunately, records retrieved this way are missing the major topics (indicated by * in exports from PubMed) and have commas instead of slashes separating headings from subheadings and separating subheadings from each other. Here is an example.

keywords = {
    Breast Neoplasms, drug therapy, genetics, metabolism, pathology
}

Capitalisation is the only way to distinguish the heading from its subheadings, and there is no indication of which is the major topic.

Here is the same example imported from a PubMed file (including my pre-import modifications).

keywords = {
    Breast Neoplasms/drug therapy* [mh];
    Breast Neoplasms/genetics [mh];
    Breast Neoplasms/metabolism [mh];
    Breast Neoplasms/pathology [mh]
}

Now I can see easily from the keywords that drug therapy for breast neoplasms is the main subject of the article, and the article also deals with genetics, metabolism, and pathology of the same disease.
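
As a rough illustration of why the markers are useful, a keywords field in this form can be filtered for major topics with very little code. The Python sketch below assumes the conventions described above (semicolon delimiter, trailing asterisk, [mh] marker); the function name major_topics is made up, and it needs Python 3.9 or later for str.removesuffix.

```python
def major_topics(keywords_field: str, delimiter: str = ";") -> list[str]:
    """Return the keywords that carry an asterisk, without the [mh] marker."""
    majors = []
    for keyword in keywords_field.split(delimiter):
        # Drop surrounding whitespace and the trailing '[mh]' marker, if any.
        term = keyword.strip().removesuffix("[mh]").strip()
        if "*" in term:
            majors.append(term)
    return majors


field = """
    Breast Neoplasms/drug therapy* [mh];
    Breast Neoplasms/genetics [mh];
    Breast Neoplasms/metabolism [mh];
    Breast Neoplasms/pathology [mh]
"""
print(major_topics(field))  # ['Breast Neoplasms/drug therapy*']
```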


Renamed this post to clarify the suggested enhancement.

Demo files

If anyone is interested in experimenting, here is a demo PubMed text file.

Simplified PubMed records:
pubmed-demo-set.txt (13.6 KB)

I removed several fields and truncated the abstracts, so you would not want to use this file as a general example of PubMed data. This collection of only nine entries includes 68 unique strings of MeSH terms. They are listed here, in alphabetical order.

Terms list:
pubmed-demo-mh.txt (1.6 KB)

When parsed as individual headings and combinations of heading+subheading, the list expands to 96 unique strings, increasing in storage size but also in utility.

Parsed terms list:
pubmed-demo-mh-mods.txt (2.1 KB)

Edit: Replaced pubmed-demo-mh.txt to remove an unintended blank line that was breaking one of the records into two.


Thanks so much for your preparation! I created an internal issue so we can talk about it in our next Devcall (on Monday), add some implementation details, and refine it.

Amazing. Thanks. If you implement this, I imagine that you will want to minimise the amount of (arbitrary) data transformation. You could simply parse the heading/subheading pairs without making any other changes. In fact, this is what the importer already does, now that the delimiter issue is fixed – at least, this is true when there is only one subheading.

Additional parsing only matters when the term includes more than one subheading (two or more slashes in one line).
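
To make that concrete, here is a minimal Python sketch of the "parse only, change nothing else" approach. It is an illustration rather than a proposal for the actual importer code, and the function name split_subheadings is hypothetical. Lines with zero or one subheading pass through untouched, and asterisks stay exactly where PubMed put them.

```python
def split_subheadings(value: str) -> list[str]:
    """Split a term only when it has two or more subheadings."""
    heading, *subheadings = value.split("/")
    # Zero or one subheading: keep the term exactly as it was exported.
    if len(subheadings) < 2:
        return [value]
    # Otherwise pair every subheading with its heading, so none is orphaned.
    return [f"{heading}/{sub}" for sub in subheadings]


print(split_subheadings("Ankle Joint/innervation/physiology"))
# ['Ankle Joint/innervation', 'Ankle Joint/physiology']
print(split_subheadings("Imaging, Three-Dimensional/*methods"))
# ['Imaging, Three-Dimensional/*methods']
```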

I have regex replacements to share if you want, but they are designed to act on a whole file rather than one string at a time as you would do with a function.

A few more things you may want to consider about PubMed files:

  • Subheadings need to remain paired with a heading (don’t create orphans by splitting the strings at /)
  • JabRef imports the AID identifier but not PMID (PubMed ID), which is arguably the most important field of all.
  • JabRef does not import the RN field, which is functionally equivalent to a keyword (for named substances). I trick the importer by moving [rn] to the end of the string and changing its label to MH. This way, the substances become keywords, and I can still filter the keywords by substance using my [rn] marker. Edit: RN stands for registry number and this field is included in the JabRef importer.
  • PubMed XML uses a binary attribute inside of tags, instead of asterisks, to mark major topics. The XML records also include much more data than PubMed text files. Unfortunately, XML has been removed from the PubMed user interface and is now only accessible by API.
  • Some applications “import” PubMed records by ingesting PMIDs from the input file and then retrieving the associated data via API. Replacing user-entered data and corrections with the original uncorrected records is not nice. Doing something like this would be totally uncharacteristic for JabRef, obviously.

You may want to check the code: some fields are imported as Other fields, e.g. RN is imported into the Other field registry-number.

You’re right. I should have checked the code, and I see that I was also using the wrong name for RN (registry number; substance name is NM). I did check some imported data, so I must have overlooked the registry numbers. Maybe I accidentally checked an entry that did not have any.

I see that PMID, PMC and PMCR IDs are used to identify the record type, but these fields are not included in the importer’s hashmap. Personally, I find them useful. PMIDs can be used by various research utilities and the PMC IDs let me know that free full text will be available. Also, the PMC ID is used in the NIH citation style.

For reference: MEDLINE/PubMed Data Element (Field) Descriptions


Thanks, I now see the issue with PMID. From what I saw, we only import the field in the XML version.

So I think the first step would be to create a small issue about adding the missing fields for PMID and the other values you think are important. I will take care of creating one tomorrow. This makes it a suitable candidate for newcomers.

https://www.nlm.nih.gov/bsd/mms/medlineelements.html

Someone like me :slight_smile:

If I get to it before another newcomer, I will tackle it myself, but it won’t be immediate. It looks like any of the one-line fields that are not already included just need to be added to the hashmap.
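
To show the general idea of a label-to-field map for one-line fields, here is a small Python sketch. It is not the JabRef importer (which is written in Java), and everything in it is an assumption for illustration: the map ONE_LINE_FIELDS, the field names on the right-hand side, the function name, and the ID values in the sample record are all made up. The only thing taken from the Medline format is the line layout, a label padded to four characters followed by "- ".

```python
# Hypothetical map from Medline labels to field names (names are illustrative).
ONE_LINE_FIELDS = {
    "PMID": "pmid",
    "PMC": "pmcid",
}


def parse_one_line_fields(record: str) -> dict[str, str]:
    """Collect the mapped one-line fields from a Medline-format record."""
    fields = {}
    for line in record.splitlines():
        # Field lines look like 'PMID- 12345678' or 'MH  - Some Heading'.
        if len(line) < 6 or line[4:6] != "- ":
            continue  # continuation lines and blank lines are skipped
        label, value = line[:4].strip(), line[6:].strip()
        if label in ONE_LINE_FIELDS:
            fields[ONE_LINE_FIELDS[label]] = value
    return fields


# Sample record with invented IDs.
record = "PMID- 12345678\nPMC - PMC1234567\nMH  - Heading1/subheading1\n"
print(parse_one_line_fields(record))
# {'pmid': '12345678', 'pmcid': 'PMC1234567'}
```

With this layout, supporting another one-line field only means adding one entry to the map.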

Sure;) Feel free! We have a getting started guide for setting up the workspace
https://devdocs.jabref.org/getting-into-the-code/guidelines-for-setting-up-a-local-workspace/

PMID import and the delimiter preference are now fixed in JabRef 5.15 for the plain text and XML importers. Many thanks to the devs. :raised_hands: