fenominal: Phenomenal text mining for disease and phenotype concepts
fenominal
Fenominal is a Java 17 library for text-mining
Human Phenotype Ontology (HPO) terms from text. Fenominal is
a multimodule project with a core
module with the text-mining logic and a cli
module for use from the
command line. A graphical user interface (GUI) is available in a separate project
called Fenominal-GUI.
Fenominal implements the T-BLAT algorithm, which is inspired by the BLAST algorithm for biosequence alignment. T-BLAT screens texts for potential matches on the basis of matching k-mer counts and scores candidates based on conformance to typical patterns of spelling errors derived from 2.9 million clinical notes. Fenominal also implements exact matching but matches also on multitoken HPO labels or synonyms that are permuted.
Fenominal does not rely on external APIs and can be used in settings in which a firewall does not permit applications to access the internet. Fenominal is intended for use as a software library and the CLI module only contains simple demo applications.
Fenominal’s Matching Algorithms
Fenominal performs both exact and fuzzy (T-BLAT) matching. In both cases, the order of the tokens of each HPO term
can be present in any order and stop words are ignored. For instance,
Mallet finger HP:0030771 will be inferred from both mallet finger
and finger mallet
. For the term Anomalous hepatic venous drainage into the left atrium HP:0032181,
the stopwords the
and into
are not considered.
Exact matching
The exact matching algorithm searches for exact matches to term or synonym labels, ignoriing stop words, whereby the tokens can be present in any order in the text.
T-BLAT (fuzzy) matching
TODO – short description
CLI fenominal application
fenominal is a Java library written in Java 17. fenominal contains a command-line interface (cli) module that demonstrates some of the functionality of the library.
See CLI fenominal app: Setup for instructions on building the application.
See CLI fenominal app: Downloading the hpo file for instructions on how to download the hp.json
file that is needed for the app.
Once you have built the application, you should see this
java -jar fenominal-cli/target/fenominal.jar
Usage: fenominal [-hV] [COMMAND]
phenotype/disease NER
-h, --help Show this help message and exit.
-V, --version Print version information and exit.
Commands:
download, D Download files for fenominal
parse, P Parse text
in the following we will show only fenominal.jar
. Adjust the path accordingly.
Fenominal’s Matching
Fenominal can perform exact matching and fuzzy matching. See Fenominal’s Matching Algorithms for an explanation.
To see the options, run java -jar fenominal.jar parse -h
.
Short option |
Long option |
Explanation |
---|---|---|
-e |
–exact |
Use exact matching algorithm |
-h |
–help |
Show this help message and exit. |
–hp=<path> |
Path to HP json file (default data/hp.json) |
|
-i |
–input=<path> |
Path to HP json file (default data/hp.json) |
-0 |
–output=<path> |
Path to output file |
-V |
–version |
Print version information and exit. |
–verbose |
Show parse results in shell |
Exact matching
For this example, create a file called text-exact.txt
with the following contents
A 28-year-old woman who was diagnosed with Noonan syndrome at age 4 because of growth retardation, cardiomyopathy, and hypertelorism.
Run fenominal as follows.
java -jar fenominal.jar parse --exact -i text-exact.txt --verbose
(...)
Growth delay HP:0001510 growth retardation observed 79 97 A 28-year-old woman who was diagnosed with Noonan syndrome at age 4 because of growth retardation, cardiomyopathy, and hypertelorism.
Cardiomyopathy HP:0001638 cardiomyopathy observed 99 113 A 28-year-old woman who was diagnosed with Noonan syndrome at age 4 because of growth retardation, cardiomyopathy, and hypertelorism.
Hypertelorism HP:0000316 hypertelorism observed 119 132 A 28-year-old woman who was diagnosed with Noonan syndrome at age 4 because of growth retardation, cardiomyopathy, and hypertelorism.
T-BLAT (fuzzy) Matching
For this example, create a file called text-errors.txt
with the following contents.
A 28-year-old woman who was diagnosed with Noonan syndrome at age 4 because of growth retadation, cardiomyopathic, and hypertelorisn.
Run fenominal as follows.
java -jar fenominal.jar parse -i text-errors.txt --verbose
(...)
Agnosia HP:0010524 agnosia observed 28 37 A 28-year-old woman who was diagnosed with Noonan syndrome at age 4 because of growth retadation, cardiomyopathic, and hypertelorisn.
Growth delay HP:0001510 growth retardation observed 79 96 A 28-year-old woman who was diagnosed with Noonan syndrome at age 4 because of growth retadation, cardiomyopathic, and hypertelorisn.
Cardiomyopathy HP:0001638 cardiomyopathy observed 98 113 A 28-year-old woman who was diagnosed with Noonan syndrome at age 4 because of growth retadation, cardiomyopathic, and hypertelorisn.
Hypertelorism HP:0000316 hypertelorism observed 119 132 A 28-year-old woman who was diagnosed with Noonan syndrome at age 4 because of growth retadation, cardiomyopathic, and hypertelorisn.
Fenominal picks up four terms. Agnosia is false postion (from the word diagnosed). The remaining three terms are
inferred correctly despite the presence of spelling errors or variants. Note that T-BLAT is the default
approach, and exact matching is only performed if the --exact
flag is passed.
fenominal library
To use fenominal as a library for Java 17 or higher applications, add the following to the POM file.
<dependency>
<groupId>org.monarchinitiative.fenominal</groupId>
<artifactId>fenominal-core</artifactId>
<version>...</version>
</dependency>
<dependency>
<groupId>org.monarchinitiative.phenol</groupId>
<artifactId>phenol-core</artifactId>
<version>...</version>
</dependency>
<dependency>
<groupId>org.monarchinitiative.phenol</groupId>
<artifactId>phenol-io</artifactId>
<version>...</version>
</dependency>
Using the latest versions of fenominal and phenol.
TODO explain how to use GRAIL.
Imports
Use the following imports
import org.monarchinitiative.fenominal.core.FenominalRunTimeException;
import org.monarchinitiative.fenominal.core.TermMiner;
import org.monarchinitiative.fenominal.model.MinedSentence;
import org.monarchinitiative.fenominal.model.MinedTermWithMetadata;
import org.monarchinitiative.phenol.io.OntologyLoader;
import org.monarchinitiative.phenol.ontology.data.Ontology;
import org.monarchinitiative.phenol.ontology.data.TermId;
Initialize with the path to the hp.json file
Ontology ontology = OntologyLoader.loadOntology(new File(hpoJsonPath));
Decide whether to do exact or T-BLAT (fuzzy) matching
boolean doExactMatching = ....// your code decides
TermMiner miner;
if (exact) {
miner = TermMiner.defaultNonFuzzyMapper(this.ontology);
} else {
miner = TermMiner.defaultFuzzyMapper(this.ontology);
}
You can use fenominal to retrieve three types of objects.
sentences
Retrieve a collection of MinedSentence
objects that represent each of the sentences in the input string in
which at least one HPO term is indentified. Each MinedSentence object has a collection of MinedTermWithMetadata objects.
String inputString = ....// your code provides input String
Collection<MinedSentence> setences = miner.mineSentences(inputString);
MinedTermWithMetadata
Returns a collection of MinedTermWithMetadata objects, each of which provides the following methods
String getMatchingString()
: the matching string in the original textdouble getSimilarity()
the similaroty score for the matchTermId getTermId()
: the HPO TermIdint getTokenCount()
: - the number of matching tokens
Additionally, all of the methods of MinedTerm are provided (see below)
String inputString = ....// your code provides input String
Collection<MinedTermWithMetadata> setences = miner.mineTermsWithMetadata(inputString);
MinedTerm
Returns a collection of MinedTerm objects, each of which provides the following methods
int getBegin()
: zero-based start coordinate of the match in the original textint getEnd()
: zero-based end coordinate (included) of the match in the original textString getTermIdAsString()
: String version of the HPO term id.boolean isPresent()
: true of the HPO term was observed, false if it was excluded according to the original text
String inputString = ....// your code provides input String
Collection<MinedTerm> sentences = miner.mineTerms(inputString);