Pre-computed text and embedding data for 32 cancers in TCGA

To simplify applying PheSeq to additional diseases, this page provides pre-processed text and embedding data for 32 Pan-Cancer types in the TCGA database. For each cancer type, the dataset comprises annotated supporting sentences and a pre-computed embedding for each gene.

Colorectal Adenocarcinoma	sentence file	embedding file
Uterine Corpus Endometrioid Carcinoma	sentence file	embedding file
Glioblastoma Multiforme	sentence file	embedding file
Pancreatic Ductal Adenocarcinoma	sentence file	embedding file
Thyroid Papillary Carcinoma	sentence file	embedding file
Cholangiocarcinoma	sentence file	embedding file
Sarcoma	sentence file	embedding file
Hepatocellular Carcinoma	sentence file	embedding file
Kidney Papillary Cell Carcinoma	sentence file	embedding file
Uterine Carcinosarcoma	sentence file	embedding file
Paraganglioma & Pheochromocytoma	sentence file	embedding file
Lung Adenocarcinoma	sentence file	embedding file
Esophageal Carcinoma	sentence file	embedding file
Uveal Melanoma	sentence file	embedding file
Thymoma	sentence file	embedding file
Adrenocortical Carcinoma	sentence file	embedding file
Ovarian Serous Adenocarcinoma	sentence file	embedding file
Kidney Chromophobe Carcinoma	sentence file	embedding file
Gastric Adenocarcinoma	sentence file	embedding file
Prostate Adenocarcinoma	sentence file	embedding file
Mesothelioma	sentence file	embedding file
Breast Lobular Carcinoma	sentence file	embedding file
Lung Squamous Cell Carcinoma	sentence file	embedding file
Head and Neck Squamous Cell Carcinoma	sentence file	embedding file
Testicular Germ Cell Cancer	sentence file	embedding file
Bladder Urothelial Carcinoma	sentence file	embedding file
Cervical Carcinoma	sentence file	embedding file
Acute Myeloid Leukemia	sentence file	embedding file
Kidney Clear Cell Carcinoma	sentence file	embedding file
Lower Grade Glioma	sentence file	embedding file
Breast Ductal Carcinoma	sentence file	embedding file
Skin Cutaneous Melanoma	sentence file	embedding file

For each of the 32 Pan-Cancer types, a sentence file and a pre-computed embedding file are available for download.

The sentence file contains four columns: sentence number, PMID or PMCID, sentence, and annotation. The annotation column holds tagging results from four taggers: AGAC, PubTator, OGER++, and PhenoTagger. Together, the tags cover a range of biomedical entities, including genes, diseases, GO terms, HPO terms, and other molecular/cellular process activities.
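As a minimal sketch of reading the four columns described above, the following assumes the sentence file is tab-separated with one record per line and no header row. The page does not state the delimiter or the internal format of the annotation column, so both are assumptions here, and the annotation is kept as an opaque string. The names `SentenceRecord` and `read_sentence_records`, and the demo row, are hypothetical.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class SentenceRecord(NamedTuple):
    """One row of a sentence file (column order as listed on this page)."""
    sentence_number: str
    source_id: str   # PMID or PMCID
    sentence: str
    annotation: str  # raw tagger output; its format is not specified here


def read_sentence_records(handle: TextIO) -> Iterator[SentenceRecord]:
    """Yield records from a tab-separated sentence file (assumed layout)."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or malformed lines
        yield SentenceRecord(*row)


# Invented placeholder row, for illustration only (not real data).
demo_file = io.StringIO(
    "1\tPMID_PLACEHOLDER\tGENE_X is altered in DISEASE_Y.\tANNOTATION_PLACEHOLDER\n"
)
records = list(read_sentence_records(demo_file))
print(records[0].source_id, "|", records[0].sentence)
```

For a downloaded file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory demo. If the actual file uses a different delimiter or includes a header row, adjust the reader accordingly.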

The embedding file provides pre-computed embedding data for all genes covered by the disease-related literature. This file contains two columns: the ENTREZ ID of each gene and the embedding vector with 1,024 dimensions.
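The two-column layout could be loaded along the following lines. This is a sketch under stated assumptions: the file is taken to be tab-separated, and the vector is taken to be serialized as numbers separated by spaces or commas, optionally wrapped in square brackets. Neither detail is specified on this page, so check a downloaded file before relying on it. `parse_vector` and `read_gene_embeddings` are hypothetical helper names, and the demo line is synthetic.

```python
import io
from typing import Dict, List, TextIO

EXPECTED_DIM = 1024  # dimensionality stated on this page


def parse_vector(text: str) -> List[float]:
    """Parse a vector written as space- or comma-separated numbers (assumed)."""
    cleaned = text.strip().strip("[]")
    return [float(token) for token in cleaned.replace(",", " ").split()]


def read_gene_embeddings(handle: TextIO,
                         expected_dim: int = EXPECTED_DIM) -> Dict[str, List[float]]:
    """Map each Entrez ID to its embedding, checking the vector length."""
    embeddings: Dict[str, List[float]] = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        entrez_id, _, vector_text = line.partition("\t")
        vector = parse_vector(vector_text)
        if len(vector) != expected_dim:
            raise ValueError(
                f"line {line_no}: expected {expected_dim} values, got {len(vector)}"
            )
        embeddings[entrez_id.strip()] = vector
    return embeddings


# Synthetic line: placeholder ID "0" with an all-zero vector (not real data).
demo_file = io.StringIO("0\t" + " ".join(["0.0"] * EXPECTED_DIM) + "\n")
embeddings = read_gene_embeddings(demo_file)
print(len(embeddings), "gene(s),", len(embeddings["0"]), "dimensions")
```

The length check is there to catch a wrong delimiter or a truncated download early, before the vectors are passed to the model.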

How to use the PheSeq model for more diseases?

First, for a selected disease, collect p-values by running a sequence analysis of your own choosing.

Second, download the embedding file from the links above. Then visit the PheSeq GitHub project and run run_model.py to integrate the p-values with the text embedding data, which yields a prioritized association result.

Third, download the evidence file from the links above. In the PheSeq GitHub project, run Pathological_evidence_network_visualization.py instead. Phew! A literature-augmented pathological network is visualized for your own purposes.
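Before the integration in the second step, the p-values and the embeddings need to refer to the same set of genes. The sketch below shows one way to align them by Entrez ID. It is not part of the PheSeq code base: the input format that run_model.py expects is not described on this page, so consult the GitHub project for that. `align_pvalues_with_embeddings` is a hypothetical helper, and the IDs and values in the demo are invented.

```python
from typing import Dict, List, Sequence, Tuple


def align_pvalues_with_embeddings(
    p_values: Dict[str, float],
    embeddings: Dict[str, Sequence[float]],
) -> Tuple[List[str], List[float], List[Sequence[float]]]:
    """Keep only genes that have both a p-value and an embedding."""
    for gene_id, p in p_values.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"p-value for gene {gene_id} is outside [0, 1]: {p}")
    # Sort the shared IDs so that the output order is reproducible.
    shared_ids = sorted(set(p_values) & set(embeddings))
    aligned_p = [p_values[g] for g in shared_ids]
    aligned_vectors = [embeddings[g] for g in shared_ids]
    return shared_ids, aligned_p, aligned_vectors


# Invented IDs and values, for illustration only.
demo_p = {"A": 0.01, "B": 0.20, "C": 0.50}
demo_emb = {"B": [0.0, 1.0], "C": [1.0, 0.0], "D": [0.5, 0.5]}
gene_ids, ps, vectors = align_pvalues_with_embeddings(demo_p, demo_emb)
print(gene_ids)  # genes present in both inputs
```

Genes that appear in only one of the two inputs are dropped silently here. Logging them instead may be preferable when many genes lack literature coverage.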

HERE WE GO! (PheSeq GitHub project)

PheSeq, A Bayesian Deep Learning Model to Enhance and Interpret the Gene Disease Association Studies
