RiceMind | Dataset Documentation

Evidence & Confidence Tiering

For every extracted Gene-Trait Association (GTA), we collect evidence codes from multiple data sources and grade them by the kind of evidence each source provides:

Expert Manual Curation: Oryzabase, RAP-DB
Automated Sequence Alignment: Ensembl
Experimental Records: Planteome
Literature Extraction Count: HZAU-BioNLP Pipeline

Based on these criteria, associations are classified into a three-tier confidence system:

Confidence Level	Criteria	Primary Use
High: Tier 1 (Curated)	Supported by expert manual curation (Oryzabase/RAP-DB) or explicit experimental evidence codes (EXP, IDA, IMP).	Ground-truth retrieval and high-fidelity validation.
Medium: Tier 2 (Verified)	Cross-validated by both NLP literature extraction AND external computational sources, and supported by more than 10 independent articles.	Discovery of novel or uncatalogued functional associations.
Low: Tier 3 (Emerging)	Relies on limited text-mining co-occurrences (≤ 10 articles) or lacks cross-domain validation.	Broad knowledge exploration and hypothesis generation.
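
The tiering rules in the table can be read as a short decision procedure. The sketch below is illustrative only and is not the pipeline's actual code: the `Association` record and its field names (`sources`, `evidence_codes`, `article_count`, `has_nlp_support`, `has_computational_support`) are hypothetical, and it assumes Tier 1 takes precedence whenever its criteria are met.

```python
from dataclasses import dataclass, field

# Values taken from the tier table above.
CURATED_SOURCES = {"Oryzabase", "RAP-DB"}
EXPERIMENTAL_CODES = {"EXP", "IDA", "IMP"}


@dataclass
class Association:
    """Hypothetical record for one GTA; field names are assumptions."""
    sources: set = field(default_factory=set)
    evidence_codes: set = field(default_factory=set)
    article_count: int = 0
    has_nlp_support: bool = False
    has_computational_support: bool = False


def assign_tier(a: Association) -> int:
    """Return 1, 2, or 3 following the criteria in the tier table."""
    # Tier 1: expert curation or an explicit experimental evidence code.
    if a.sources & CURATED_SOURCES or a.evidence_codes & EXPERIMENTAL_CODES:
        return 1
    # Tier 2: NLP AND computational support, with more than 10 articles.
    if a.has_nlp_support and a.has_computational_support and a.article_count > 10:
        return 2
    # Tier 3: everything else (<= 10 articles or no cross-validation).
    return 3


example = Association(article_count=14, has_nlp_support=True,
                      has_computational_support=True)
print(assign_tier(example))  # prints 2
```

Checking Tier 1 first means a curated association stays in Tier 1 regardless of how many articles mention it.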

Data Standardization & Mapping

To ensure structural consistency and cross-database interoperability, the following standardization pipelines were applied:

Phenotypic Trait Standardization

All extracted trait descriptions were mapped to established ontologies, including the Gene Ontology (GO), Plant Trait Ontology (TO), Plant Ontology (PO), and the Rice Trait Ontology (RTO).
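
As a rough illustration of working with the mapped terms, the sketch below splits an ontology identifier into its prefix and local number. It assumes terms are stored in the common `PREFIX:digits` form (for example `GO:0008150`); the actual column layout is not described here, and the `RTO` prefix in particular is a guess.

```python
# Prefixes for the ontologies named above. "RTO" is an assumed prefix.
KNOWN_PREFIXES = {
    "GO": "Gene Ontology",
    "TO": "Plant Trait Ontology",
    "PO": "Plant Ontology",
    "RTO": "Rice Trait Ontology",
}


def split_term_id(term_id: str) -> tuple:
    """Split 'GO:0008150' into ('GO', '0008150'); reject anything else."""
    prefix, sep, local = term_id.strip().partition(":")
    if not sep or prefix not in KNOWN_PREFIXES or not local.isdigit():
        raise ValueError(f"not a recognized ontology term ID: {term_id!r}")
    return prefix, local
```

A check like this can flag free-text trait descriptions that were never mapped to an ontology term.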

Gene Nomenclature Standardization

Gene entities sourced from external databases, including Oryzabase, RAP-DB, Ensembl Plants, and Planteome, were standardized to the unified RAP ID system. For instance, genes from Oryzabase were anchored via their annotated RAP IDs, and records from Planteome were mapped through a "Protein-Gene-RAP ID" chain: each protein is resolved to its gene, and that gene to its RAP ID.
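
The "Protein-Gene-RAP ID" chain amounts to two successive lookups. The following is a minimal sketch under stated assumptions: the two lookup tables and the identifiers `PROT_A` and `GENE_A` are hypothetical placeholders, and the regular expression assumes RAP locus IDs of the form `Os01g0100100` on chromosomes 01 to 12.

```python
import re
from typing import Optional

# Assumed RAP locus ID shape: "Os", two-digit chromosome, "g", seven digits.
RAP_ID_PATTERN = re.compile(r"^Os(0[1-9]|1[0-2])g\d{7}$")


def resolve_rap_id(protein_id: str,
                   protein_to_gene: dict,
                   gene_to_rap: dict) -> Optional[str]:
    """Follow protein -> gene -> RAP ID; return None if any step fails."""
    gene_id = protein_to_gene.get(protein_id)
    if gene_id is None:
        return None
    rap_id = gene_to_rap.get(gene_id)
    # Discard values that do not look like a RAP locus ID.
    if rap_id is None or not RAP_ID_PATTERN.match(rap_id):
        return None
    return rap_id


# Hypothetical lookup tables, for illustration only.
protein_to_gene = {"PROT_A": "GENE_A"}
gene_to_rap = {"GENE_A": "Os01g0100100"}
print(resolve_rap_id("PROT_A", protein_to_gene, gene_to_rap))  # prints Os01g0100100
```

Returning `None` rather than raising keeps unresolved entries easy to count and review separately.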

Data Records

The repository is organized into four layers, listed in the table below. Files are provided in JSONL, TSV, JSON, or TAR.GZ format. To download the datasets, please visit the Zenodo repository.

File Name	Description	Format
1. Text Corpus & Evidence Data
`keyword_filtered_rice_sentences.jsonl`	Raw, keyword-filtered sentence segments from full-text literature.	JSONL
`rice_context_sentences_compressed.tsv`	Compressed contextual sentences providing exact narrative evidence.	TSV
2. NLP-Extracted Association Databases
`NLP_Rice_GTA_Database.tsv`	Core text-mined Gene-Trait Association (GTA) dataset.	TSV
`NLP_Rice_GVA_TVA_Database.tsv`	Mined Gene-Variety (GVA) and Trait-Variety (TVA) associations.	TSV
`NLP_Rice_GTA_Trend.tsv`	Statistical and temporal trend data of mined GTAs over time.	TSV
3. Integration & Standardization Masters
`Unified_Gene_Master.json`	Standardized gene dictionary resolving synonym conflicts.	JSON
`rice_multi_omics.json`	Integrated multi-omics data structures for cross-validation.	JSON
`Unified_Rice_GTA_Database.tsv`	Definitive master database merging all evidence layers.	TSV
4. NLP Terminology & Dictionaries
`terminology.tar.gz`	A compressed archive of curated dictionaries (`_oger_dict.tsv`) for NER and normalization. Covers: Genes, Varieties, and Traits.	TAR.GZ
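
Once downloaded, the TSV and JSONL files can be read with the Python standard library. The sketch below is a minimal, hedged example: it assumes the TSV files carry a header row and are UTF-8 encoded, and it makes no assumption about column or key names, which are not listed here.

```python
import csv
import json


def read_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters inside sentences as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `read_tsv("NLP_Rice_GTA_Database.tsv")` would return one dictionary per row keyed by whatever header names the file actually contains, and `iter_jsonl("keyword_filtered_rice_sentences.jsonl")` streams records without loading the whole file into memory.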
