PheSeq, A Bayesian Deep Learning Model to Enhance and Interpret the Gene Disease Association Studies

PheSeq runs an evidence-augmented data fusion that supports driver gene discovery by combining interpretable literature evidence with omics data. Case studies cover Alzheimer's disease, lung cancer, and breast cancer.

Pre-computed text and embedding data for 32 cancers in TCGA

To simplify applying PheSeq to additional diseases, pre-processed text and embedding data for the 32 Pan-Cancer types in the TCGA database are provided on this page. The dataset comprises richly annotated supporting sentences and pre-computed embeddings for each gene.

Colorectal Adenocarcinoma sentence file embedding file
Uterine Corpus Endometrioid Carcinoma sentence file embedding file
Glioblastoma Multiforme sentence file embedding file
Pancreatic Ductal Adenocarcinoma sentence file embedding file
Thyroid Papillary Carcinoma sentence file embedding file
Cholangiocarcinoma sentence file embedding file
Sarcoma sentence file embedding file
Hepatocellular Carcinoma sentence file embedding file
Kidney Papillary Cell Carcinoma sentence file embedding file
Uterine Carcinosarcoma sentence file embedding file
Paraganglioma & Pheochromocytoma sentence file embedding file
Lung Adenocarcinoma sentence file embedding file
Esophageal Carcinoma sentence file embedding file
Uveal Melanoma sentence file embedding file
Thymoma sentence file embedding file
Adrenocortical Carcinoma sentence file embedding file
Ovarian Serous Adenocarcinoma sentence file embedding file
Kidney Chromophobe Carcinoma sentence file embedding file
Gastric Adenocarcinoma sentence file embedding file
Prostate Adenocarcinoma sentence file embedding file
Mesothelioma sentence file embedding file
Breast Lobular Carcinoma sentence file embedding file
Lung Squamous Cell Carcinoma sentence file embedding file
Head and Neck Squamous Cell Carcinoma sentence file embedding file
Testicular Germ Cell Cancer sentence file embedding file
Bladder Urothelial Carcinoma sentence file embedding file
Cervical Carcinoma sentence file embedding file
Acute Myeloid Leukemia sentence file embedding file
Kidney Clear Cell Carcinoma sentence file embedding file
Lower Grade Glioma sentence file embedding file
Breast Ductal Carcinoma sentence file embedding file
Skin Cutaneous Melanoma sentence file embedding file

For each of the 32 Pan-Cancer types, a sentence file and a pre-computed embedding file are available for download.

The sentence file contains four columns: sentence number, PMID or PMCID, sentence, and annotation. The annotation column holds the tagging results from four taggers: AGAC, PubTator, OGER++, and PhenoTagger. The tags cover a wide variety of biomedical entities, including gene, disease, GO, HPO, and other molecular/cellular process activities.
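As a rough illustration of the four-column layout described above, the following sketch reads a sentence file into simple records. The on-disk format is an assumption: the page does not state the delimiter or whether a header row is present, so the code assumes tab-separated text without a header and keeps the annotation column as raw text. The names `SentenceRecord` and `read_sentence_records`, and the sample row, are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class SentenceRecord(NamedTuple):
    sentence_number: str
    source_id: str   # PMID or PMCID
    sentence: str
    annotation: str  # raw tagger output, left unparsed here


def read_sentence_records(handle):
    """Read four-column rows from a file-like object.

    Assumption (not confirmed by this page): columns are tab-separated
    and there is no header row. Rows without exactly four fields are skipped.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [SentenceRecord(*row) for row in reader if len(row) == 4]


# Synthetic row for demonstration; a real file would be opened with open(path).
demo = io.StringIO("1\tPMC0000000\tGene X is associated with disease Y.\t{}\n")
records = read_sentence_records(demo)
```

If the real files use a different delimiter or include a header, only the `csv.reader` arguments and a header skip would need to change.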

The embedding file provides pre-computed embedding data for all genes covered by the disease-related literature. This file contains two columns: the ENTREZ ID of each gene and the embedding vector with 1,024 dimensions.
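A minimal sketch of loading the two-column embedding file into a dictionary keyed by Entrez ID might look like this. The file encoding is an assumption: the code supposes one gene per line, a tab between the ID and the vector, and vector components separated by spaces or commas. The function name `parse_embedding_lines` is hypothetical, and the demo line is synthetic.

```python
import io

EMBEDDING_DIM = 1024  # dimensionality stated on this page


def parse_embedding_lines(lines, dim=EMBEDDING_DIM):
    """Parse '<Entrez ID><TAB><vector>' lines into {entrez_id: [float, ...]}.

    Assumed layout (not confirmed by this page): tab-separated columns,
    no header row, vector components separated by spaces or commas.
    """
    embeddings = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        entrez_id, vector_text = line.split("\t", 1)
        vector = [float(x) for x in vector_text.replace(",", " ").split()]
        if len(vector) != dim:
            raise ValueError(
                f"line {line_no}: expected {dim} values, got {len(vector)}"
            )
        embeddings[entrez_id] = vector
    return embeddings


# Synthetic one-gene example; a real file would be opened with open(path).
demo = io.StringIO("7157\t" + " ".join(["0.1"] * EMBEDDING_DIM) + "\n")
demo_embeddings = parse_embedding_lines(demo)
```

Checking the vector length on load catches a wrong delimiter guess early, before the data reaches any downstream model.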

How to use the PheSeq model for more diseases?

First, for a selected disease, collect p-values by running a sequence analysis of your own choosing.

Second, download the embedding file from the links above. Then, in the PheSeq GitHub project, run run_model.py to integrate the p-values with the text embedding data and obtain a prioritized association result.

Third, download the evidence file from the links above. In the PheSeq GitHub project, run Pathological_evidence_network_visualization.py this time. Phew! A literature-augmented pathological network is visualized for your own purposes.
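Before the second step, it can help to check which genes have both a p-value and an embedding. The sketch below is only an illustration of that alignment by Entrez ID; it is not the interface of run_model.py, whose expected inputs should be taken from the GitHub project. The function name `align_by_gene` is hypothetical, the p-values are made up, and the vectors are shortened to two dimensions for readability.

```python
def align_by_gene(p_values, embeddings):
    """Return (gene_ids, p_list, vector_list) for genes present in both inputs.

    p_values:   {entrez_id: p-value from the user's own sequence analysis}
    embeddings: {entrez_id: embedding vector from the downloaded file}
    """
    shared = sorted(set(p_values) & set(embeddings))
    p_list = [p_values[g] for g in shared]
    vector_list = [embeddings[g] for g in shared]
    return shared, p_list, vector_list


# Made-up inputs: three genes with p-values, two of them with embeddings.
p_values = {"7157": 0.001, "672": 0.04, "1956": 0.2}
embeddings = {"7157": [0.1, 0.2], "672": [0.3, 0.4]}

gene_ids, p_list, vectors = align_by_gene(p_values, embeddings)
dropped = sorted(set(p_values) - set(gene_ids))  # genes lacking an embedding
```

Genes listed in `dropped` have no literature embedding for the chosen disease, so they are worth reviewing before any integration step.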

HERE WE GO! (PheSeq GitHub project)
