Otter-Knowledge

Published in IBM Data Science in Practice

10 min read · Jul 5, 2023

Did you know that otters are able to learn from each other? [1] And that otters frequently hold hands while they sleep to keep themselves from drifting apart? [2] This way, they can create a “network of otters” ensuring all of them stay together, and they can keep learning from each other.

This can be extrapolated to several other aspects of life, in which each part of a whole is important and contributes, in some way, to the final goal. And this can also be extrapolated to the creation of knowledge-enhanced foundational models based on Graph Neural Networks (GNN). Nowadays, there is a lot of knowledge stored in Knowledge Graphs (KGs), and GNNs can take advantage of this graph structure of KGs to become powerful foundational models.

Otter-Knowledge: Our approach

We present Otter-Knowledge (https://arxiv.org/abs/2306.12802), a way to extract knowledge from diverse multi-modal KGs for training GNNs. By using different sources and obtaining initial representations for each of the nodes in the KG, we can enhance the final learned representation, which we refer to as a knowledge-enhanced learned representation.

We applied Otter-Knowledge to Drug Discovery, and demonstrated that knowledge-enhanced learned representation enriches protein sequence and SMILES drug databases with a large multi-modal Knowledge Graph fused from different sources. This improves results on TDC drug target binding affinity prediction benchmarks (https://tdcommons.ai/benchmark/dti_dg_group/bindingdb_patent/).

When dealing with several diverse KGs, some of the knowledge might come as text, some as images, and some in more specific formats, like protein sequences or SMILES. We refer to these as multi-modal KGs, as the nodes of the graph can have different modalities (i.e., formats). In Otter-Knowledge, we use different pre-trained models and/or algorithms to handle the different modalities of the KG; we call these handlers. A handler takes as input the nodes of the KG that have a specific modality (e.g., protein sequences) and computes initial embeddings for those nodes. Handlers might be complex pre-trained deep learning models, like MolFormer (https://github.com/IBM/molformer) or ESM (https://github.com/facebookresearch/esm), or simple algorithms like the Morgan fingerprint (https://ibm.biz/Bdye9U).
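The handler idea can be pictured as a registry that maps each modality to a function producing an initial embedding. The sketch below is only an illustration: the registry, the modality names, and the hash-based `toy_embedding` are hypothetical stand-ins, not the Otter-Knowledge code, and no real encoder such as MolFormer or ESM is called.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Embedding = List[float]


def toy_embedding(value: str, dim: int) -> Embedding:
    """Deterministic stand-in for a real encoder: hash bytes scaled to [0, 1]."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()  # 32 bytes
    return [digest[i] / 255.0 for i in range(dim)]


# Hypothetical registry: modality name -> handler. Each handler may produce
# a different embedding size, as different real encoders would.
HANDLERS: Dict[str, Callable[[str], Embedding]] = {
    "protein_sequence": lambda s: toy_embedding(s, 8),
    "smiles": lambda s: toy_embedding(s, 6),
    "text": lambda s: toy_embedding(s, 4),
}


def initial_embeddings(nodes: List[Tuple[str, str, str]]) -> Dict[str, Embedding]:
    """Compute one initial embedding per node; nodes are (id, modality, value)."""
    return {node_id: HANDLERS[modality](value) for node_id, modality, value in nodes}


nodes = [
    ("p1", "protein_sequence", "MTLDVGPEDELPDW"),
    ("d1", "smiles", "CCO"),
    ("t1", "text", "kinase involved in signalling"),
]
embeddings = initial_embeddings(nodes)
print({node_id: len(vector) for node_id, vector in embeddings.items()})
```

Because each modality ends up with its own embedding size, a later step has to bring all nodes to a common dimensionality before message passing.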

Once the initial embeddings for all the different modalities have been computed, it’s time to create and train the GNN. The GNN propagates the initial embeddings through a set of layers that update each node's embedding according to its neighbours. The architecture of the GNN consists of two main blocks: an encoder and a decoder.

For the encoder, we first define a projection layer that consists of a set of linear transformations, one for each node modality. This layer projects the nodes into a common dimensionality. Then, we apply several multi-relational graph convolutional layers (R-GCN), which distinguish between the different types of edges connecting source and target nodes by having a separate set of trainable parameters for each edge type.
For the decoder, we consider a link prediction task, which consists of a scoring function that maps each triple (a source node, a target node, and the edge connecting them) to a scalar in the interval [0, 1].
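To make the encoder description concrete, here is a minimal pure-Python sketch of a modality-specific projection followed by one R-GCN-style update. It is written under assumptions: the mean normalisation over neighbours, the ReLU, the random weights, and all names (`project`, `rgcn_layer`, the relation names) are illustrative choices, not the released implementation.

```python
import random
from collections import defaultdict


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def project(features, modality_of, proj):
    """Apply one linear map per modality so all nodes share one dimensionality."""
    return {n: matvec(proj[modality_of[n]], x) for n, x in features.items()}


def rgcn_layer(hidden, edges, rel_weights, self_weight):
    """Self term plus, per relation, the mean of transformed neighbours; then ReLU."""
    incoming = defaultdict(list)  # (target, relation) -> list of source nodes
    for src, rel, dst in edges:
        incoming[(dst, rel)].append(src)
    updated = {}
    for node, h in hidden.items():
        total = matvec(self_weight, h)
        for rel, weight in rel_weights.items():
            sources = incoming.get((node, rel), [])
            for src in sources:
                message = matvec(weight, hidden[src])
                total = [t + m / len(sources) for t, m in zip(total, message)]
        updated[node] = [max(0.0, t) for t in total]
    return updated


rng = random.Random(0)
dim = 4  # common dimensionality (arbitrary for this sketch)
features = {"p1": [0.1] * 8, "d1": [0.2] * 6}
modality_of = {"p1": "protein", "d1": "drug"}
proj = {"protein": random_matrix(dim, 8, rng), "drug": random_matrix(dim, 6, rng)}
rel_weights = {
    "target_of": random_matrix(dim, dim, rng),
    "interacts": random_matrix(dim, dim, rng),
}
self_weight = random_matrix(dim, dim, rng)
edges = [("p1", "target_of", "d1")]

hidden = project(features, modality_of, proj)
hidden = rgcn_layer(hidden, edges, rel_weights, self_weight)
print({node: len(vector) for node, vector in hidden.items()})
```

Keeping a separate weight matrix per relation is what lets the layer treat a target_of edge differently from an interaction edge.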

For link prediction, we consider three scoring functions that are commonly used in the literature: DistMult, TransE and a Binary Classifier. The score of each triple is then compared against the actual label using a negative log-likelihood loss function.
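As a rough illustration of two of these scoring functions, the sketch below uses the textbook definitions of DistMult and TransE. How the raw scores are squashed into [0, 1] is an assumption made here (a sigmoid for DistMult, exp(-distance) for TransE); the binary classifier variant is omitted because the text does not describe its architecture.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def distmult_score(head, relation, tail):
    """DistMult: trilinear product sum_i h_i * r_i * t_i, squashed by a sigmoid."""
    return sigmoid(sum(h * r * t for h, r, t in zip(head, relation, tail)))


def transe_score(head, relation, tail):
    """TransE: distance ||h + r - t||, mapped to (0, 1] with exp(-distance)."""
    distance = math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))
    return math.exp(-distance)


def nll(probability, label, eps=1e-12):
    """Negative log-likelihood of a binary label (1 = true link, 0 = negative)."""
    p = min(max(probability, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


head, relation, tail = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
print(distmult_score(head, relation, tail))
print(transe_score(head, relation, tail))
print(nll(distmult_score(head, relation, tail), 1))
```

In this toy triple the tail equals head plus relation, so the TransE distance is zero and the triple gets the highest possible TransE score.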

Flow control: One crucial aspect of pre-training the GNN involves addressing the disparity between the data accessible during pre-training and the data accessible during subsequent tasks. Specifically, during pre-training, there are numerous attributes associated with proteins or drugs, whereas during downstream fine-tuning, only amino acid sequences and SMILES are available. Consequently, during pre-training, we explore two scenarios: one which controls the information propagated to the Drug/Protein entities and one without such control. In our experiments, we present results for both cases to provide an insight on the impact of restricting information flow during pre-training on the subsequent tasks.
Noisy Links: An additional significant consideration is the presence of noisy links within the upstream data and how they affect the downstream tasks. To investigate the potential impact on these tasks, we handpick a subset of links from each database that are relevant to drug discovery. We then compare the outcomes when training the GNN using only these restricted links versus using all possible links present in the graphs.
Regression: Certain pre-training datasets, like Uniprot, contain numerical data properties. Hence, we incorporate an extra regression objective aimed at minimizing the mean squared error (MSE) of the predicted numerical data properties. In the learning process, we combine the regression objective and the link prediction objective to create a single objective function.
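The combination of the two objectives can be sketched as a weighted sum. This is an assumption for illustration only: the text states that the objectives are combined into a single function but not how, so the additive form and the `regression_weight` parameter are hypothetical.

```python
import math


def link_nll(probabilities, labels, eps=1e-12):
    """Mean negative log-likelihood over link prediction examples."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def regression_mse(predictions, targets):
    """Mean squared error over numerical data properties."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def combined_loss(probabilities, labels, predictions, targets, regression_weight=1.0):
    """Single objective: link prediction loss plus weighted regression loss."""
    return link_nll(probabilities, labels) + regression_weight * regression_mse(
        predictions, targets
    )


# One positive link scored 0.5, and one numerical property predicted as 2.0
# when the true value is 4.0.
print(combined_loss([0.5], [1], [2.0], [4.0]))
```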

Datasets

We release 4 different datasets: UBC, PrimeKG, DUDe and STITCH.

UBC

UBC is a dataset comprising entities (Proteins/Drugs) from Uniprot (U), BindingDB (B) and ChEMBL (C). It contains 6,207,654 triples.

Uniprot [3] comprises 573,227 proteins from SwissProt, which is the subset of manually curated entries within UniProt, including attributes with different modalities like the sequence (available for 567,483 of them), full name, organism, protein family, description of its function, catalytic activity, pathways and its length. There are 38,665 edges of type target_of from Uniprot ids to both ChEMBL and DrugBank ids, and 196,133 interactant edges between Uniprot protein ids.
BindingDB [4] consists of 2,656,221 data points, involving 1.2 million compounds and 9,000 targets. Instead of utilizing the affinity score, we generate a triple for each combination of drugs and proteins. In order to prevent any data leakage, we eliminate overlapping triples with the TDC DTI dataset. As a result, the dataset concludes with a total of 2,232,392 triples.
ChEMBL [5] comprises drug-like bioactive molecules. We downloaded 10,261 ChEMBL ids with their corresponding SMILES from OpenTargets [9], of which 7,610 have a sameAs link to DrugBank id molecules.
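The BindingDB preparation described above (one triple per drug-protein combination, then removal of anything that overlaps with the benchmark) can be sketched with plain sets. The relation name `binds` and the matching rule (an exact match on the drug-protein pair) are assumptions made for this example.

```python
def build_triples(pairs, relation="binds"):
    """Turn (drug, protein) pairs into (drug, relation, protein) triples."""
    return {(drug, relation, protein) for drug, protein in pairs}


def remove_overlap(triples, benchmark_pairs):
    """Drop triples whose (drug, protein) pair also appears in the benchmark."""
    held_out = set(benchmark_pairs)
    return {t for t in triples if (t[0], t[2]) not in held_out}


# Toy data: duplicated measurements collapse into a single triple.
pairs = [("CCO", "P1"), ("CCO", "P1"), ("CCN", "P2"), ("CCC", "P3")]
benchmark = [("CCN", "P2")]

triples = remove_overlap(build_triples(pairs), benchmark)
print(sorted(triples))
```

Filtering against the benchmark before pre-training is what keeps the downstream evaluation from seeing pairs the GNN was already trained on.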

PrimeKG

PrimeKG [6] (the Precision Medicine Knowledge Graph) integrates 20 biomedical resources and describes 17,080 diseases with 4 million relationships. PrimeKG includes nodes describing Gene/Proteins (29,786) and Drugs (7,957). The multi-modal knowledge graph (MKG) that we built from PrimeKG contains 13 modalities and 12,757,300 edges (154,130 data properties and 12,603,170 object properties), including 642,150 edges describing interactions between proteins, 25,653 edges describing drug-protein interactions, and 2,672,628 edges describing interactions between drugs.

DUDe

DUDe [7] comprises a collection of 22,886 active compounds and their corresponding affinities towards 102 targets. For our study, we utilized a preprocessed version of the DUDe, which includes 1,452,568 instances of drug-target interactions. To prevent any data leakage, we eliminated the negative interactions and the overlapping triples with the TDC DTI dataset. As a result, we were left with a total of 40,216 drug-target interaction pairs.

STITCH

STITCH [8] (Search Tool for Interacting Chemicals) is a database of known and predicted interactions between chemicals, represented by SMILES strings, and proteins, whose sequences are taken from the STRING database. These interactions are obtained from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases. For the MKG curation, we kept only the interactions with the highest confidence, i.e., those with a confidence score higher than 0.9. This resulted in 10,717,791 triples for 17,572 different chemicals and 1,886,496 different proteins. Furthermore, the graph was split into 5 subgraphs of roughly the same size, and the GNN was trained sequentially on each of them, updating the model trained on the previous subgraphs.
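The STITCH curation steps (confidence filter, split into five parts, sequential training) can be outlined as follows. This is a simplified sketch: the round-robin split and the `train_fn` callback are assumptions, and the stand-in "model" here only counts the triples it has seen.

```python
def filter_by_confidence(interactions, threshold=0.9):
    """Keep (chemical, protein) pairs whose score is strictly above the threshold."""
    return [(chem, prot) for chem, prot, score in interactions if score > threshold]


def split_evenly(items, parts=5):
    """Round-robin split into `parts` chunks whose sizes differ by at most one."""
    return [items[i::parts] for i in range(parts)]


def train_sequentially(model_state, subgraphs, train_fn):
    """Continue training the same model on each subgraph in turn."""
    for subgraph in subgraphs:
        model_state = train_fn(model_state, subgraph)
    return model_state


scores = [0.95, 0.5, 0.91, 0.99, 0.3, 0.92, 0.97, 0.1, 0.93, 0.94, 0.96, 0.98]
interactions = [(f"chem{i}", f"prot{i}", s) for i, s in enumerate(scores)]

kept = filter_by_confidence(interactions)
subgraphs = split_evenly(kept, parts=5)

# Stand-in for real training: the "model" just counts triples it was shown.
seen = train_sequentially(0, subgraphs, lambda state, sg: state + len(sg))
print([len(sg) for sg in subgraphs], seen)
```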

Models

We release 12 models, 3 for each dataset, each of them with a different scoring type: DistMult, TransE and a Binary Classifier.

How to use it

Installation

Clone the repo:

git clone https://github.com/IBM/otter-knowledge.git  
cd otter-knowledge

Install the requirements:

pip install -r requirements.txt

Run inference

usage: inference.py [-h] 
    --input_path INPUT_PATH 
    [--sequence_column SEQUENCE_COLUMN] 
    [--input_type INPUT_TYPE] 
    [--model_path MODEL_PATH] 
    --output_path OUTPUT_PATH 
    [--batch_size BATCH_SIZE] 
    [--no_cuda]  
  
Inference  
  
options:  
  -h, --help      
    show this help message and exit  
  --input_path INPUT_PATH                        
    Path to the csv file with the sequence/smiles  
  --sequence_column SEQUENCE_COLUMN
    Name of the column with sequence/smiles information for proteins or molecules  
  --input_type INPUT_TYPE                        
    Type of the sequences. Options: Drug; Protein  
  --model_path MODEL_PATH                        
    Path to the model or name of the model in the HuggingfaceHub  
  --output_path OUTPUT_PATH                        
    Path to the output embedding file.  
  --batch_size BATCH_SIZE                        
    Batch size to use.  
  --no_cuda             
    If set to True, CUDA won't be used even if available.

Run the inference for Proteins:

Replace test_data with the path to a CSV file containing the protein sequences, name_of_the_column with the name of the column of the protein sequence in the CSV and output_path with the filename of the JSON file to be created with the embeddings.

python inference.py --input_path test_data --sequence_column name_of_the_column --model_path ibm/otter_dude_distmult --output_path output_path
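If the sequences are not yet in a CSV file, one way to produce a suitable input is shown below. It is a sketch under assumptions: the helper name, the column name `sequence`, and the single-column layout with a header row are choices made for this example; the script itself only requires that the column named by --sequence_column exists.

```python
import csv
import os
import tempfile


def write_sequence_csv(path, column, sequences):
    """Write one sequence per row under a single header column."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])
        for sequence in sequences:
            writer.writerow([sequence])


# Written to a temporary directory so the example leaves no files behind
# in the working directory.
csv_path = os.path.join(tempfile.mkdtemp(), "proteins.csv")
write_sequence_csv(
    csv_path,
    "sequence",
    [
        "MTLDVGPEDELPDWAAAKEFYQKYDPKDVIGRGVSSVVRRCVHRATGHE",
        "MKTAYIAKQR",
    ],
)

# The file can then be passed to the inference script, for example:
print(
    "python inference.py --input_path", csv_path,
    "--sequence_column sequence",
    "--model_path ibm/otter_dude_distmult --output_path embeddings.json",
)
```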

Run the inference for Drugs:

Replace test_data with the path to a CSV file containing the Drug SMILES, name_of_the_column with the name of the column of the SMILES in the CSV and output_path with the filename of the JSON file to be created with the embeddings.

python inference.py --input_path test_data --sequence_column name_of_the_column --input_type Drug --relation_name smiles --model_path ibm/otter_dude_distmult --output_path output_path

Benchmarks

Training benchmark models

We assume that you have used the inference script to generate embeddings for the training and test proteins/drugs. Note that each inference run only generates embeddings for either drugs or proteins, so you need to combine the drug and protein embeddings into files with the following format before they can be used as input to the benchmark training explained below.

{
  "Drug": {
    "CN(C)CC(=O)NC(COc1cncc(-c2ccc3cnccc3c2)c1)Cc1c[nH]c2ccccc12": [
      -1.2718517780303955, 0.6045345664024353,
      -0.03671235218644142, 0.9915799498558044,
      -0.7146453857421875
    ],
    "Cc1sc2ncnc(N)c2c1-c1ccc(NC(=O)Nc2cc(C(F)(F)F)ccc2F)cc1": [
      -0.6596673130989075, 0.2838267683982849,
      -0.042177166789770126, 0.7447476387023926,
      -0.27911311388015747
    ]  
  },   
  "Target": {
    "MTLDVGPEDELPDWAAAKEFYQKYDPKDVIGRGVSSVVRRCVHRATGHE": [
      -0.46595990657806396, -0.297667533159256,
      -0.048857495188713074
    ]
  }
}
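A small helper can assemble a file in this format from separate drug and protein outputs. Note the assumption: the layout of the JSON written by the inference script is not described here, so this sketch supposes that each output has already been loaded as a mapping from SMILES or sequence to its embedding vector. The helper name and the toy values are hypothetical.

```python
import json


def combine_embeddings(drug_embeddings, protein_embeddings):
    """Build the {"Drug": {...}, "Target": {...}} structure shown above."""
    return {"Drug": dict(drug_embeddings), "Target": dict(protein_embeddings)}


# Toy inputs standing in for already-loaded inference outputs.
drug_embeddings = {"CCO": [0.1, -0.2, 0.3]}
protein_embeddings = {"MTLDVGPEDELPDW": [0.4, 0.5, -0.6]}

combined = combine_embeddings(drug_embeddings, protein_embeddings)
serialized = json.dumps(combined, indent=2)
print(serialized)
```

The resulting string can be written to a file such as train_val_embeddings.json, with one combined file for the training data and one for the test data.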

Training benchmark models can be done with the following example command:

python -m benchmarks.dti.train --train train_val.csv --test test.csv --train_embeddings train_val_embeddings.json --test_embeddings test_embeddings.json

The inputs to the script are:

train_val.csv: the path to the CSV file that holds the training data from the TDC benchmarks
test.csv: the path to the CSV file that holds the test data from the TDC benchmarks
train_val_embeddings.json and test_embeddings.json: the files that hold the computed embeddings of the train/test proteins/drugs, respectively, in the format discussed above

There are other optional hyperparameters you can set, such as the learning rate and the number of training steps, as listed below:

usage: train.py [-h] 
  [--train TRAIN] 
  [--test TEST] 
  [--train_embeddings TRAIN_EMBEDDINGS] 
  [--test_embeddings TEST_EMBEDDINGS] 
  [--lr LR] 
  [--steps STEPS] 
  [--seeds SEEDS] 
  [--batch_size BATCH_SIZE] 
  [--is_initial_embeddings IS_INITIAL_EMBEDDINGS]  
  [--gnn_embedding_dim GNN_EMBEDDING_DIM]  

TDC DG training  
  
optional arguments:  
  -h, --help
    show this help message and exit  
  --train TRAIN         
    Root directory with the training data  
  --test TEST           
    Root directory with the test data  
  --train_embeddings TRAIN_EMBEDDINGS                        
    Root directory with the embeddings of training drugs and proteins.  
  --test_embeddings TEST_EMBEDDINGS                        
    Root directory with the embeddings of test drugs and proteins.  
  --lr LR               
    Learning rate.  
  --steps STEPS         
    Maximum number of training steps  
  --seeds SEEDS         
    Random seeds.  
  --batch_size BATCH_SIZE                        
    Mini batch size.  
  --is_initial_embeddings IS_INITIAL_EMBEDDINGS                        
    Set this value to yes if want to train with initial embeddings without GNN embeddings.  
  --gnn_embedding_dim GNN_EMBEDDING_DIM                        
    Size of the GNN embeddings.

Ensemble learning

The ensemble method combines the predictions of models trained on different GNN embeddings provided by different pretrained models. The following example command runs ensemble learning:

python -m benchmarks.dti.train_ensemble_model --train train_val.csv --test test.csv --train_embeddings train_embeddings.txt --test_embeddings test_embeddings.txt
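One common way to combine predictions is to average them, sketched below. This is an assumption for illustration: the text does not say which combination rule the ensemble script uses, and the function name and toy values are hypothetical.

```python
def ensemble_average(predictions_per_model):
    """Average per-example predictions across models (one list per model)."""
    n_models = len(predictions_per_model)
    return [sum(values) / n_models for values in zip(*predictions_per_model)]


# Toy binding affinity predictions from two models, each trained on
# embeddings from a different pretrained GNN.
model_a = [1.0, 2.0, 3.0]
model_b = [3.0, 4.0, 5.0]

averaged = ensemble_average([model_a, model_b])
print(averaged)
```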