---
language:
- en
pipeline_tag: graph-ml
tags:
- biology
---
# PreMode
This is the repository for our manuscript "PreMode predicts mode-of-action of missense variants by deep graph representation learning of protein sequence and structural context", posted on bioRxiv: https://www.biorxiv.org/content/10.1101/2024.02.20.581321v3

# Data
Unzip the files with this script:
```
bash unzip.files.sh
```
Unfortunately, we are not allowed to share the HGMD data, so we removed all the pathogenic variants from HGMD (49218 in total) from the `data.files/pretrain/training.*` files. This might affect the plots in `analysis/figs/fig.sup.12.pdf` and `analysis/figs/fig.sup.13.pdf` if you re-run the R code in the `analysis/` folder. We share the weights of our models trained with HGMD instead.

# Install Packages
Install the necessary packages using
```
mamba env create -f PreMode.yaml
mamba env create -f r4-base.yaml
```
You can check the installation by running
```
conda activate PreMode
python train.py --conf scripts/TEST.yaml --mode train
```
If no error occurs, the installation was successful.

# New Experiment
## Start from scratch and use our G/LoF datasets
1. Create a folder under `scripts/` and add a file named `pretrain.seed.0.yaml` to it; see `scripts/PreMode/pretrain.seed.0.yaml` for an example.
2. Run pretraining on the pathogenicity task:
   ```
   python train.py --conf scripts/NEW_FOLDER/pretrain.seed.0.yaml
   ```
3. Prepare the transfer learning config files:
   ```
   bash scripts/DMS.prepare.yaml.sh scripts/NEW_FOLDER/
   ```
4. Run transfer learning:
   ```
   bash scripts/DMS.5fold.run.sh scripts/NEW_FOLDER TASK_NAME GPU_ID
   ```
   If you have multiple tasks, separate them with commas in `TASK_NAME`, like "task_1,task_2,task_3".
5. (Optional) To reuse the transfer learning tasks in our paper using 8 GPU cards, run
   ```
   bash transfer.all.sh scripts/NEW_FOLDER
   ```
   If you only have one GPU card, run
   ```
   bash transfer.all.in.one.card.sh scripts/NEW_FOLDER
   ```
6. Save the inference results:
   ```
   bash scripts/DMS.5fold.inference.sh scripts/NEW_FOLDER analysis/NEW_FOLDER TASK_NAME GPU_ID
   ```
7. You'll get a folder `analysis/NEW_FOLDER/TASK_NAME` with 5 `.csv` files; each file has 4 columns `logits.FOLD.[0-3]`. Each column represents the G/LoF prediction from one cross-validation run (closer to 0 means more likely GoF, closer to 1 means more likely LoF). We suggest averaging the predictions across the 4 columns.

## Only transfer learning, user defined mode-of-action datasets
1. Prepare a `.csv` file for training and inference. There are two accepted formats:
   + Format 1 (only for missense variants):

     | uniprotID | aaChg | score | ENST |
     | :-: | :-: | :-: | :-: |
     | P15056 | p.V600E | 1 | ENST00000646891 |
     | P15056 | p.G446V | -1 | ENST00000646891 |

     + `uniprotID`: the UniProt ID of the protein.
     + `aaChg`: the amino acid change induced by the missense variant.
     + `score`: 1 for GoF, -1 for LoF. Not required for inference. For DMS, this could be experimental readouts. If you have multiplexed assays, you can change it to `score.1, score.2, score.3, ..., score.N`.
     + `ENST` (optional): the Ensembl transcript ID that matches the `uniprotID`.
   + Format 2 (single missense variants or multiple variants):

     | uniprotID | ref | alt | pos.orig | score | ENST | wt.orig | sequence.len.orig |
     | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
     | P15056 | V | E | 600 | 1 | ENST00000646891 | ... | 766 |
     | P15056 | G | V | 446 | -1 | ENST00000646891 | ... | 766 |
     | P15056 | G;V | V;F | 446;471 | -1 | ENST00000646891 | ... | 766 |

     + `uniprotID`: the UniProt ID of the protein.
     + `ref`: the reference amino acid; for multiple variants, separated by ";".
     + `alt`: the alternative amino acid; for multiple variants, separated by ";" in the same order as `ref`.
     + `pos.orig`: the position of the amino acid change; for multiple variants, separated by ";" in the same order as `ref`.
     + `score`: same as above.
     + `ENST` (optional): same as above.
     + `wt.orig`: the wild-type protein sequence, in the UniProt format.
     + `sequence.len.orig`: the length of the wild-type protein sequence.
   + If you prepared your input in Format 1, run
     ```
     bash parse.input.table/parse.input.table.sh YOUR_FILE TRANSFORMED_FILE
     ```
     to transform it to Format 2. Note that it will drop lines whose `aaChg` doesn't match the corresponding AlphaFold sequence.
2. Prepare a config file for training the model and running inference.
   ```
   bash scripts/prepare.new.task.yaml.sh PRETRAIN_MODEL_NAME YOUR_TASK_NAME YOUR_TRAINING_FILE YOUR_INFERENCE_FILE TASK_TYPE MODE_OF_ACTION_N
   ```
   + `PRETRAIN_MODEL_NAME` can be one of the following:
     + `scripts/PreMode`: Default PreMode
     + `scripts/PreMode.ptm`: PreMode + PTM as input
     + `scripts/PreMode.noStructure`: PreMode without structure input
     + `scripts/PreMode.noESM`: PreMode with the ESM2 input replaced by one-hot encodings of the 20 AAs
     + `scripts/PreMode.noMSA`: PreMode without MSA input
     + `scripts/ESM.SLP`: ESM embedding + Single Layer Perceptron
   + `YOUR_TASK_NAME` can be any name you like.
   + `YOUR_TRAINING_FILE` is the training file you prepared in step 1.
   + `YOUR_INFERENCE_FILE` is the inference file you prepared in step 1.
   + `TASK_TYPE` can be `DMS` or `GLOF`.
   + `MODE_OF_ACTION_N` is the number of dimensions of mode-of-action. For `GLOF` this is usually 1. For a multiplexed `DMS` dataset, this could be the number of biochemical properties measured. If it is larger than 1, make sure the `score` column from step 1 is replaced by `score.1, score.2, ..., score.N` accordingly.
3. Run your config file:
   ```
   conda activate PreMode
   bash scripts/run.new.task.sh PRETRAIN_MODEL_NAME YOUR_TASK_NAME OUTPUT_FOLDER GPU_ID
   ```
   This should take ~30 min on an NVIDIA A40 GPU, depending on your dataset size.
4. You'll get a file in `OUTPUT_FOLDER` named `YOUR_TASK_NAME.inference.result.csv`.
   + If your `TASK_TYPE` is `GLOF`, the column `logits` holds the inference results. Closer to 0 means GoF, closer to 1 means LoF.
   + If your `TASK_TYPE` is `DMS` and `MODE_OF_ACTION_N` is 1, the column `logits` holds the inference results. If `MODE_OF_ACTION_N` is larger than 1, you will get multiple `logits.*` columns, each representing one predicted DMS measurement.

# Models & Figures in our manuscript
## Pretrained Models
Here is the list of models in our manuscript:

+ `scripts/PreMode/`: PreMode. It takes 250 GB RAM and 4 NVIDIA A40 GPUs to run and finishes in ~50 h.
+ `scripts/ESM.SLR/`: Baseline model, ESM2 (650M) + Single Layer Perceptron.
+ `scripts/PreMode.large.window/`: PreMode with the window size set to 1251 AA.
+ `scripts/PreMode.noESM/`: PreMode with the ESM2 embeddings replaced by one-hot encodings of the 20 AAs.
+ `scripts/PreMode.noMSA/`: PreMode without the MSA input.
+ `scripts/PreMode.noPretrain/`: PreMode without pretraining on ClinVar/HGMD.
+ `scripts/PreMode.noStructure/`: PreMode without the AF2 predicted structure input.
+ `scripts/PreMode.ptm/`: PreMode with one-hot encodings of post-translational modification sites added as input.
+ `scripts/PreMode.mean.var/`: PreMode that outputs both a predicted value (mean) and a confidence (var); used in the adaptive learning tasks.
## Predicted mode-of-action

| gene | file |
| :-: | :-: |
| BRAF | `analysis/5genes.all.mut/PreMode/P15056.logits.csv` |
| RET | `analysis/5genes.all.mut/PreMode/P07949.logits.csv` |
| TP53 | `analysis/5genes.all.mut/PreMode/P04637.logits.csv` |
| KCNJ11 | `analysis/5genes.all.mut/PreMode/Q14654.logits.csv` |
| CACNA1A | `analysis/5genes.all.mut/PreMode/O00555.logits.csv` |
| SCN5A | `analysis/5genes.all.mut/PreMode/Q14524.logits.csv` |
| SCN2A | `analysis/5genes.all.mut/PreMode/Q99250.logits.csv` |
| ABCC8 | `analysis/5genes.all.mut/PreMode/Q09428.logits.csv` |
| PTEN | `analysis/5genes.all.mut/PreMode/P60484.logits.csv` |

In each file, column `logits.0` is the predicted pathogenicity, `logits.1` is the predicted LoF probability, and `logits.2` is the predicted GoF probability. For PTEN, `logits.1` is the predicted stability and `logits.2` is the predicted enzyme activity; for both, 0 means loss and 1 means neutral.

## Figures
Please go to the `analysis/` folder and run the corresponding R scripts.
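
As a rough illustration of how the per-gene prediction files listed under "Predicted mode-of-action" might be consumed, here is a minimal Python sketch using only the standard library. It relies on the `logits.0`, `logits.1` and `logits.2` columns described there. Everything else is an assumption made for illustration: the function name `summarize_predictions`, the `variant` column in the toy data, the added `pathogenicity` and `likely_mode` keys, and the rule of picking whichever of LoF/GoF has the higher probability. None of this is part of PreMode itself.

```python
import csv
import io


def summarize_predictions(csv_file):
    """Label each row of a per-gene logits table as more likely LoF or GoF.

    Assumes the columns `logits.0` (pathogenicity), `logits.1` (LoF
    probability) and `logits.2` (GoF probability); other columns are
    passed through unchanged.
    """
    rows = []
    for row in csv.DictReader(csv_file):
        lof = float(row["logits.1"])
        gof = float(row["logits.2"])
        row["pathogenicity"] = float(row["logits.0"])
        # Hypothetical decision rule (not from the manuscript):
        # report the mode with the higher predicted probability.
        row["likely_mode"] = "LoF" if lof >= gof else "GoF"
        rows.append(row)
    return rows


# Toy input with made-up numbers; the `variant` column name is hypothetical.
example = io.StringIO(
    "variant,logits.0,logits.1,logits.2\n"
    "A,0.9,0.2,0.8\n"
    "B,0.7,0.6,0.4\n"
)
for r in summarize_predictions(example):
    print(r["variant"], r["pathogenicity"], r["likely_mode"])
```

To try it on a real file, pass an open file handle instead of the toy input, for example `open("analysis/5genes.all.mut/PreMode/P15056.logits.csv", newline="")`. The LoF/GoF rule does not apply to the PTEN file, where `logits.1` and `logits.2` are stability and enzyme activity rather than LoF and GoF probabilities.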