Evidence & Confidence Tiering
For every extracted Gene-Trait Association (GTA), we collect evidence codes from multiple data sources and evaluate each association by the type of evidence that supports it:
- Expert Manual Curation: Oryzabase, RAP-DB
- Automated Sequence Alignment: Ensembl Plants
- Experimental Records: Planteome
- Literature Extraction Count: HZAU-BioNLP Pipeline
Based on these criteria, associations are classified into a three-tier confidence system:
| Confidence Level | Criteria | Primary Use |
|---|---|---|
| High: Tier 1 (Curated) | Supported by expert manual curation (Oryzabase/RAP-DB) or explicit experimental codes (EXP, IDA, IMP). | Ground-truth retrieval and high-fidelity validation. |
| Medium: Tier 2 (Verified) | Requires cross-validation from both NLP literature extraction AND external computational sources, with >10 independent articles. | Discovery of novel or uncatalogued functional associations. |
| Low: Tier 3 (Emerging) | Associations relying on limited text-mining co-occurrences (≤ 10 articles) or lacking cross-domain validation. | Broad knowledge exploration and hypothesis generation. |
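
The tiering rules in the table can be read as a short decision procedure. The sketch below is one possible reading, not the pipeline's actual code: the `Association` class, its field names, and the `assign_tier` function are hypothetical, and the rule order (Tier 1 checked first, Tier 3 as the fallback) is an assumption drawn from the table.

```python
from dataclasses import dataclass, field
from typing import Set

# Sources and evidence codes named in the Tier 1 criteria.
CURATED_SOURCES = {"Oryzabase", "RAP-DB"}
EXPERIMENTAL_CODES = {"EXP", "IDA", "IMP"}


@dataclass
class Association:
    """Hypothetical record for one gene-trait association."""
    sources: Set[str] = field(default_factory=set)
    evidence_codes: Set[str] = field(default_factory=set)
    nlp_article_count: int = 0                # independent articles from text mining
    has_computational_support: bool = False   # external computational source agrees


def assign_tier(a: Association) -> int:
    """Return 1, 2, or 3 following the criteria in the table above."""
    # Tier 1: expert curation or an explicit experimental evidence code.
    if a.sources & CURATED_SOURCES or a.evidence_codes & EXPERIMENTAL_CODES:
        return 1
    # Tier 2: NLP evidence from more than 10 articles AND computational support.
    if a.nlp_article_count > 10 and a.has_computational_support:
        return 2
    # Tier 3: everything else (10 or fewer articles, or no cross-validation).
    return 3
```

Under this reading, an association with exactly 10 supporting articles stays in Tier 3 even when a computational source agrees, because the Tier 2 threshold is strictly greater than 10.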
Data Standardization & Mapping
To ensure structural consistency and cross-database interoperability, the following standardization pipelines were applied:
Phenotypic Trait Standardization
All extracted trait descriptions were mapped to recognized ontologies: the Gene Ontology (GO), Plant Trait Ontology (TO), Plant Ontology (PO), and the Rice Trait Ontology (RTO).
Gene Nomenclature Standardization
Gene entities sourced from external databases, including Oryzabase, RAP-DB, Ensembl Plants, and Planteome, were standardized to the unified RAP ID system. For instance, genes from Oryzabase were anchored via their annotated RAP IDs, and Planteome records were mapped along a "Protein → Gene → RAP ID" path.
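
The "Protein-Gene-RAP ID" mapping can be pictured as two chained lookups. The following is a minimal sketch under stated assumptions: the function `resolve_rap_id`, the two lookup tables, and every identifier in the example are made up for illustration, and the regular expression encodes an assumed RAP locus ID shape (`Os`, a two-digit chromosome number, `g`, seven digits) rather than an official specification.

```python
import re
from typing import Dict, Optional

# Assumed shape of a RAP locus ID, e.g. "Os01g0000001" (illustrative only).
RAP_ID_PATTERN = re.compile(r"^Os\d{2}g\d{7}$")


def resolve_rap_id(
    protein_id: str,
    protein_to_gene: Dict[str, str],
    gene_to_rap: Dict[str, str],
) -> Optional[str]:
    """Follow protein -> gene -> RAP ID; return None if any step fails."""
    gene_id = protein_to_gene.get(protein_id)
    if gene_id is None:
        return None
    rap_id = gene_to_rap.get(gene_id)
    # Reject missing or malformed identifiers instead of passing them on.
    if rap_id is None or not RAP_ID_PATTERN.match(rap_id):
        return None
    return rap_id


# Placeholder lookup tables; none of these identifiers come from the dataset.
protein_to_gene = {"PROT_EXAMPLE_1": "GENE_EXAMPLE_1"}
gene_to_rap = {"GENE_EXAMPLE_1": "Os01g0000001"}

resolved = resolve_rap_id("PROT_EXAMPLE_1", protein_to_gene, gene_to_rap)
```

Returning `None` for an unresolved record keeps unmapped entries visible to the caller, which can then log or drop them rather than silently carrying a non-RAP identifier into the merged database.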
Data Records
The repository is logically divided into four primary layers. All files are available in structured formats. To download the datasets, please visit the Zenodo repository.
| File Name | Description | Format | Action |
|---|---|---|---|
| 1. Text Corpus & Evidence Data | | | |
| keyword_filtered_rice_sentences.jsonl | Raw, keyword-filtered sentence segments from full-text literature. | JSONL | Download |
| rice_context_sentences_compressed.tsv | Compressed contextual sentences providing exact narrative evidence. | TSV | Download |
| 2. NLP-Extracted Association Databases | | | |
| NLP_Rice_GTA_Database.tsv | Core text-mined Gene-Trait Association (GTA) dataset. | TSV | Download |
| NLP_Rice_GVA_TVA_Database.tsv | Mined Gene-Variety (GVA) and Trait-Variety (TVA) associations. | TSV | Download |
| NLP_Rice_GTA_Trend.tsv | Statistical and temporal trend data of mined GTAs over time. | TSV | Download |
| 3. Integration & Standardization Masters | | | |
| Unified_Gene_Master.json | Standardized gene dictionary resolving synonym conflicts. | JSON | Download |
| rice_multi_omics.json | Integrated multi-omics data structures for cross-validation. | JSON | Download |
| Unified_Rice_GTA_Database.tsv | Definitive master database merging all evidence layers. | TSV | Download |
| 4. NLP Terminology & Dictionaries | | | |
| terminology.tar.gz | A compressed archive of curated dictionaries (*_oger_dict.tsv) for NER and normalization. Covers: Genes, Varieties, and Traits. | TAR.GZ | Download |
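
Once downloaded, the TSV and JSONL files can be read with the Python standard library alone. The sketch below is a hedged example: the helper names `read_tsv` and `read_jsonl` are hypothetical, the TSV files are assumed to start with a header row, and the column and key names in the in-memory samples are placeholders, not the datasets' real schema.

```python
import csv
import io
import json
from typing import Dict, Iterator, TextIO


def read_tsv(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per row, keyed by the header line (a header row is assumed)."""
    yield from csv.DictReader(handle, delimiter="\t")


def read_jsonl(handle: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-ins so the sketch runs without the real files;
# "gene", "trait", and "sentence" are placeholder names.
sample_tsv = io.StringIO("gene\ttrait\nGENE_A\tTRAIT_X\n")
sample_jsonl = io.StringIO('{"sentence": "placeholder text"}\n\n')

tsv_rows = list(read_tsv(sample_tsv))
jsonl_records = list(read_jsonl(sample_jsonl))
```

With the real files, the same helpers would be called on a handle such as `open("NLP_Rice_GTA_Database.tsv", newline="", encoding="utf-8")`; UTF-8 is an assumption about the encoding. Both helpers are generators, so large files can be streamed row by row instead of loaded into memory at once.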