Pre-computed text and embedding data for 32 cancers in TCGA
To make it easier to apply PheSeq to additional diseases, this page provides pre-processed text and embedding data for 32 Pan-Cancer types in the TCGA database. The dataset comprises richly annotated supporting sentences and pre-computed embeddings for each gene.
For each of the 32 cancer types, a sentence file and a pre-computed embedding file are available for download.
The sentence file contains four columns: sentence number, PMID or PMCID, sentence, and annotation. The annotation column holds the tagging results from four taggers: AGAC, PubTator, OGER++, and PhenoTagger. The tags cover a wide variety of biomedical entities, including gene, disease, GO, HPO, and other molecular/cellular process activities.
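As a rough illustration of the four-column layout described above, the sketch below reads a sentence file into named records. It assumes tab-separated columns with no header row; the delimiter, the `SentenceRecord` type, the `read_sentences` helper, and the sample rows are all hypothetical, so check a downloaded file before relying on it. The annotation column is kept as raw text because its internal format is not described here.

```python
import csv
import io
from typing import Iterator, NamedTuple


class SentenceRecord(NamedTuple):
    sentence_no: str
    source_id: str   # PMID or PMCID
    sentence: str
    annotation: str  # raw tagger output, left unparsed in this sketch


def read_sentences(handle) -> Iterator[SentenceRecord]:
    """Yield one record per row, assuming tab-separated columns and no header."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed rows rather than guess their meaning
        yield SentenceRecord(*row)


# Made-up sample standing in for a downloaded sentence file.
sample = io.StringIO(
    "1\tPMID:0000001\tGene A is mutated in tumour B.\t{}\n"
    "2\tPMC0000002\tGene C regulates process D.\t{}\n"
)
records = list(read_sentences(sample))
print(len(records), "sentences read")
```

For a real file, replace the `io.StringIO` sample with `open(path, encoding="utf-8", newline="")`.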
The embedding file provides pre-computed embedding data for all genes covered by the disease-related literature. This file contains two columns: the Entrez ID of each gene and its 1,024-dimensional embedding vector.
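The two-column embedding layout can be loaded into a dictionary keyed by Entrez ID, as in this minimal sketch. It assumes one gene per line, a tab between the ID and the vector, and vector components separated by spaces or commas; these format details and the `read_embeddings` helper are assumptions, not taken from the PheSeq project. The sample uses 4 dimensions instead of 1,024 to stay readable, and its values are made up.

```python
import io
from typing import Dict, List

EMBEDDING_DIM = 1024  # dimension stated in the text


def read_embeddings(handle, dim: int = EMBEDDING_DIM) -> Dict[str, List[float]]:
    """Map Entrez ID -> vector, assuming 'ID<TAB>v1 v2 ... vN' on each line."""
    embeddings: Dict[str, List[float]] = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        entrez_id, vector_text = line.split("\t", 1)
        # Accept either spaces or commas between components.
        vector = [float(x) for x in vector_text.replace(",", " ").split()]
        if len(vector) != dim:
            raise ValueError(
                f"line {line_no}: expected {dim} values, got {len(vector)}"
            )
        embeddings[entrez_id] = vector
    return embeddings


# Made-up 4-dimensional sample; real files carry 1,024 values per gene.
sample = io.StringIO("7157\t0.1 0.2 0.3 0.4\n672\t0.5,0.6,0.7,0.8\n")
emb = read_embeddings(sample, dim=4)
print(sorted(emb))
```

Checking the vector length on load catches truncated downloads early, before the data reaches the model.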
How to use the PheSeq model for more diseases?
First, for a selected disease, collect p-values by running a sequence analysis of your own choosing.
Second, download the embedding file from the links above. In the PheSeq GitHub project, run run_model.py to integrate the p-values with the text embedding data and obtain a prioritized association result.
Third, download the evidence (sentence) file from the links above. In the PheSeq GitHub project, run Pathological_evidence_network_visualization.py this time. Phew! A literature-augmented pathological network is visualized for your own use.
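Before running the scripts, it can help to confirm that your p-values and the downloaded embeddings refer to the same genes. The sketch below is one way to do that check, assuming both inputs have already been loaded into dictionaries keyed by Entrez ID. The `align_inputs` helper and the sample values are hypothetical; the input format actually expected by run_model.py is not described on this page, so consult the GitHub project for it.

```python
from typing import Dict, List, Tuple


def align_inputs(
    p_values: Dict[str, float],
    embeddings: Dict[str, List[float]],
) -> Tuple[List[str], List[float], List[List[float]]]:
    """Keep only genes present in both inputs, in a fixed (sorted) order."""
    for gene, p in p_values.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"p-value for gene {gene} is outside [0, 1]: {p}")
    shared = sorted(set(p_values) & set(embeddings))
    return (
        shared,
        [p_values[g] for g in shared],
        [embeddings[g] for g in shared],
    )


# Made-up values for illustration only.
p_values = {"7157": 1e-6, "672": 0.03, "1956": 0.2}
embeddings = {"7157": [0.1, 0.2], "672": [0.3, 0.4], "2064": [0.5, 0.6]}

genes, ps, vectors = align_inputs(p_values, embeddings)
dropped = sorted(set(p_values) - set(genes))
print(f"{len(genes)} genes kept; no embedding found for: {dropped}")
```

Genes that have a p-value but no embedding are reported rather than silently discarded, so gaps in literature coverage stay visible.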
HERE WE GO! (PheSeq GitHub project)
