Dataset Documentation

Publication Date: April 23, 2026
Version: v1.0.0

Data Access & Archival

The complete structured datasets (.tsv, .json, .jsonl) are permanently archived and openly available via Zenodo.

Evidence & Confidence Tiering

For all extracted Gene-Trait Associations (GTAs), we systematically collect Evidence Codes from diverse data sources and evaluate them based on the nature of their supporting evidence:

  • Expert Manual Curation: Oryzabase, RAP-DB
  • Automated Sequence Alignment: Ensembl Plants
  • Experimental Records: Planteome
  • Literature Extraction Count: HZAU-BioNLP Pipeline

Based on these criteria, associations are classified into a three-tier confidence system:

  • High, Tier 1 (Curated)
      Criteria: Supported by expert manual curation (Oryzabase/RAP-DB) or explicit experimental codes (EXP, IDA, IMP).
      Primary use: Ground-truth retrieval and high-fidelity validation.
  • Medium, Tier 2 (Verified)
      Criteria: Requires cross-validation from both NLP literature extraction AND external computational sources, with >10 independent articles.
      Primary use: Discovery of novel or uncatalogued functional associations.
  • Low, Tier 3 (Emerging)
      Criteria: Associations relying on limited text-mining co-occurrences (≤ 10 articles) or lacking cross-domain validation.
      Primary use: Broad knowledge exploration and hypothesis generation.
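
The tiering rules above can be sketched as a small decision function. This is a minimal illustration and not the pipeline's actual implementation: the `Evidence` record, its field names, and the choice to check Tier 1 before Tier 2 are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Sources and evidence codes named in the Tier 1 criteria.
CURATED_SOURCES = {"Oryzabase", "RAP-DB"}
EXPERIMENTAL_CODES = {"EXP", "IDA", "IMP"}


@dataclass
class Evidence:
    """Hypothetical evidence summary for one gene-trait association."""
    sources: set = field(default_factory=set)         # databases reporting the association
    evidence_codes: set = field(default_factory=set)  # evidence codes attached to it
    nlp_article_count: int = 0                        # independent articles found by text mining
    has_computational_support: bool = False           # external computational source agrees


def assign_tier(ev: Evidence) -> str:
    """Return the confidence tier for an association (assumed precedence: 1, 2, 3)."""
    # Tier 1: expert curation or an explicit experimental evidence code.
    if ev.sources & CURATED_SOURCES or ev.evidence_codes & EXPERIMENTAL_CODES:
        return "Tier 1 (Curated)"
    # Tier 2: NLP extraction cross-validated by a computational source, with >10 articles.
    if ev.has_computational_support and ev.nlp_article_count > 10:
        return "Tier 2 (Verified)"
    # Tier 3: everything else (10 or fewer articles, or no cross-validation).
    return "Tier 3 (Emerging)"
```

For example, an association mined from 11 articles and confirmed by a computational source would fall into Tier 2 under these assumptions, while the same association with 10 articles would remain in Tier 3.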

Data Standardization & Mapping

To ensure structural consistency and cross-database interoperability, the following standardization pipelines were applied:

Phenotypic Trait Standardization

All extracted trait descriptions were mapped to recognized ontologies, including the Gene Ontology (GO), Plant Trait Ontology (TO), Plant Ontology (PO), and the Rice Trait Ontology (RTO).

Gene Nomenclature Standardization

Gene entities sourced from external databases (Oryzabase, RAP-DB, Ensembl Plants, and Planteome) were standardized to the unified RAP ID system. For instance, genes from Oryzabase were anchored via their annotated RAP IDs, and records from Planteome were mapped along a "Protein-Gene-RAP ID" path, that is, from protein identifier to gene identifier to RAP ID.
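
The "Protein-Gene-RAP ID" mapping can be pictured as two chained lookups. The sketch below is illustrative only: the function name, the two lookup tables, and the placeholder identifiers are hypothetical and do not reflect the actual mapping files or ID formats used in the pipeline.

```python
from typing import Optional


def resolve_rap_id(
    protein_id: str,
    protein_to_gene: dict,
    gene_to_rap: dict,
) -> Optional[str]:
    """Follow protein -> gene -> RAP ID; return None if either step is missing."""
    gene_id = protein_to_gene.get(protein_id)
    if gene_id is None:
        return None  # protein has no known gene mapping
    return gene_to_rap.get(gene_id)  # None if the gene has no RAP ID


# Placeholder tables with made-up identifiers, for illustration only.
protein_to_gene = {"PROT_A": "GENE_A", "PROT_B": "GENE_B"}
gene_to_rap = {"GENE_A": "RAP_A"}

mapped = resolve_rap_id("PROT_A", protein_to_gene, gene_to_rap)
unmapped = resolve_rap_id("PROT_B", protein_to_gene, gene_to_rap)
```

Returning `None` for an incomplete chain keeps unmapped records visible to the caller, which is one reasonable design choice when identifiers cannot be resolved; the source does not state how such records are actually handled.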

Data Records

The repository is organized into four layers, listed below. All files are provided in structured formats and can be downloaded from the Zenodo repository.

1. Text Corpus & Evidence Data

  • keyword_filtered_rice_sentences.jsonl (JSONL): Raw, keyword-filtered sentence segments from full-text literature.
  • rice_context_sentences_compressed.tsv (TSV): Compressed contextual sentences providing exact narrative evidence.

2. NLP-Extracted Association Databases

  • NLP_Rice_GTA_Database.tsv (TSV): Core text-mined Gene-Trait Association (GTA) dataset.
  • NLP_Rice_GVA_TVA_Database.tsv (TSV): Mined Gene-Variety (GVA) and Trait-Variety (TVA) associations.
  • NLP_Rice_GTA_Trend.tsv (TSV): Statistical and temporal trend data of mined GTAs over time.

3. Integration & Standardization Masters

  • Unified_Gene_Master.json (JSON): Standardized gene dictionary resolving synonym conflicts.
  • rice_multi_omics.json (JSON): Integrated multi-omics data structures for cross-validation.
  • Unified_Rice_GTA_Database.tsv (TSV): Definitive master database merging all evidence layers.

4. NLP Terminology & Dictionaries

  • terminology.tar.gz (TAR.GZ): A compressed archive of curated dictionaries (*_oger_dict.tsv) for NER and normalization. Covers: Genes, Varieties, and Traits.
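
Once downloaded, the TSV, JSON, and JSONL files can be read with the Python standard library. The helper below is a minimal sketch: the function name `load_records` is hypothetical, and it assumes UTF-8 encoding and that each TSV file starts with a header row, neither of which is stated in this documentation.

```python
import csv
import json
from pathlib import Path


def load_records(path):
    """Load a .tsv, .jsonl, or .json file, dispatching on the file extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".tsv":
        # Assumes the first row is a header; each row becomes a dict.
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle, delimiter="\t"))
    if suffix == ".jsonl":
        # One JSON object per non-empty line.
        with path.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]
    if suffix == ".json":
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)
    raise ValueError(f"Unsupported file type: {path.name}")
```

For instance, `load_records("Unified_Rice_GTA_Database.tsv")` would return a list of row dictionaries keyed by whatever column names the file defines. The terminology archive would first need to be unpacked, for example with the standard `tarfile` module, before its `*_oger_dict.tsv` files can be read this way.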