Workshop

Exercise: Data, Metadata Minimal Information Models and Data Interoperability (Version 1)

Introduction

Online repositories that share scientific data are vital for the advancement of science. Data sharing improves research transparency, promotes the validation of experimental methods and scientific conclusions, enables data reuse, and facilitates knowledge discovery using new analysis tools. Reusing scientific data depends on machine-readable metadata that describe the experiments completely enough to satisfy the FAIR guiding principles: Findable, Accessible, Interoperable, and Reusable.

Several tools have been created to help make data FAIR. The ISA metadata framework standard outlines a model for capturing experiment metadata using three levels: Investigation, Study, and Assay. A key feature of properly FAIRified data is a high level of data interoperability. From a data producer/user point of view, two levels are important: structural and semantic interoperability.

Structural interoperability concerns the format of the (meta)data, which allows the data to be interpreted by multiple systems. For example, the FASTA format is one of the most widely implemented machine-actionable formats for sequence data and is therefore read directly by many sequence analysis tools.
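
To make the FASTA example concrete, the following minimal sketch shows how little code a tool needs to read the format once its structure is agreed upon: a header line starting with ">" followed by one or more sequence lines. The function name `parse_fasta` and the example records are made up for illustration and are not data from the study.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record identifier: sequence}."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first word after '>'.
            words = line[1:].split()
            current_id = words[0] if words else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the record opened by the last header.
            records[current_id] += line
    return records


# Made-up example records (hypothetical, not from the study).
example = """>seq1 hypothetical 16S fragment
ACGTACGT
ACGT
>seq2
GGCC
"""

sequences = parse_fasta(example.splitlines())
print(sequences)
```

Because every FASTA file follows the same layout, the same few lines work for any such file, which is the practical meaning of structural interoperability.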

Semantic interoperability entails the transformation of ambiguous human-understandable metadata into a standardized, machine-actionable open format, allowing computational support systems to automatically find, access, and reuse data. To ensure that a set of metadata is sufficient to describe the data unambiguously, standardized minimal information models and checklists that detail those requirements have been developed for a wide array of experiment data.
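
As a rough illustration of how a checklist can be applied by a program, the sketch below tests a metadata record against a list of required fields. The field names in `REQUIRED_FIELDS` are hypothetical placeholders and do not reproduce any real checklist; the helper `missing_fields` is likewise an assumption made for this example.

```python
from typing import Dict, List

# Hypothetical checklist: these field names are illustrative placeholders,
# not the content of any real ENA checklist.
REQUIRED_FIELDS = ["collection date", "geographic location", "host"]


def missing_fields(record: Dict[str, str], required: List[str]) -> List[str]:
    """Return the required fields that are absent or empty in the record."""
    return [field for field in required if not record.get(field, "").strip()]


# A small record using some values from the first sample of the study.
record = {
    "collection date": "2017-09-13",
    "host": "Homo sapiens",
    "occupation": "Hair dresser",
}

print(missing_fields(record, REQUIRED_FIELDS))  # ['geographic location']
```

A check like this only works when the record uses exactly the field names the checklist prescribes, which is why agreeing on standardized names matters as much as supplying the values.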

Exercise Context

In this exercise, we study the metadata for ENA project PRJDB10485, available in the ENA Browser (Project PRJDB10485). The project page gives the background of the study, quoted below. In the ISA framework, this overarching information would be placed at the Investigation/Study level.

“The aim of this project is to analyze the dysbiosis of fecal microbiome in HIV-1 infected individuals in Ghana. Gut microbiome dysbiosis has been correlated to the progression of non-AIDS diseases such as cardiovascular and metabolic disorders. Because the microbiome composition is different among races and countries, analyses of the composition in different regions is important to understand the pathogenesis unique to specific regions. In the present study, we examined fecal microbiome compositions in HIV-1 infected individuals in Ghana. In a cross-sectional case-control study, age- and gender-matched HIV-1 infected Ghanaian adults (HIV-1 [+]; n = 55) and seronegative controls (HIV-1 [-]; n = 55) were enrolled.”

Next, we can download the metadata XML files for the 55 HIV-1 infected adults and the 55 seronegative controls, for example via ENA Browser Sample SAMD00244427, and open them in Excel. Alternatively, use the direct link to the associated metadata (ENA Browser Sample Metadata).
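
As an alternative to Excel, the attributes can also be read programmatically. The sketch below parses a trimmed, hand-written stand-in for a sample XML file using Python's standard library. The element names (`SAMPLE_ATTRIBUTE`, `TAG`, `VALUE`) are an assumption about the file layout and should be checked against the file you actually download; the function name `sample_attributes` is made up for this example.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Trimmed, hand-written stand-in for a downloaded sample XML file.
# The element names are an assumption about the real layout.
SAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>collection date</TAG><VALUE>2017-09-13</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>geo loc name</TAG><VALUE>Ghana:Koforidua</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>female</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def sample_attributes(xml_text: str) -> Dict[str, str]:
    """Collect the TAG/VALUE pairs of a single-sample XML document."""
    root = ET.fromstring(xml_text)
    attributes: Dict[str, str] = {}
    for attribute in root.iter("SAMPLE_ATTRIBUTE"):
        tag = attribute.findtext("TAG", default="").strip()
        value = attribute.findtext("VALUE", default="").strip()
        if tag:  # skip attributes without a name
            attributes[tag] = value
    return attributes


for tag, value in sample_attributes(SAMPLE_XML).items():
    print(f"{tag}\t{value}")
```

The result is a plain dictionary of attribute names and values, the same Type/Value pairs that appear when the file is opened in a spreadsheet.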

Ontologies and Minimal Information Models

Ontologies and minimal information models are both essential for managing and representing this information, but they serve distinct purposes: ontologies focus on capturing domain knowledge and semantics, whereas minimal information models focus on standardizing data reporting for a specific type of research.

In an ontology, an attribute refers to a property or characteristic that is associated with a particular entity or concept within a domain. For example, an ontology describing dogs would include properties such as “breed,” “shoulder height,” and “weight.” These attributes help to distinguish dogs based on their specific characteristics and are typically represented as key-value pairs (e.g., weight: 30 kg), where the key denotes the name or label of the attribute, and the value specifies the corresponding property or characteristic associated with the entity.

So the questions are: which Attributes are associated with the samples obtained in this study? Are these Attributes understandable by a computer, that is, are they interoperable? And how do these Attributes align with the proposed minimal information model for this type of study (here human gut: ERC000015)?

In the table below, we have collected the Attributes (metadata types) associated with the 110 samples obtained in this study. (The values vary depending on the origin of the sample; shown are the values linked to the first sample.)

Sample Attributes Table

Type	Interoperable? (Yes/No/Alternative)	Value
ART_drugs_current	No	TDF/3TC/EFV
ART_duration_at baseline_months		29
ART_start_date		2015-04-08
ART_status_at_baseline		ART
CD4_count(cells/ul)		473
CD8_count(cells/ul)		1155
Co-trimoxazole duration (mths)		24
External Id		SAMD00244427
HIV_risk_exposure	No	Heterosexual
INSDC center name		AIDS Research Center, National Institute of Infectious Diseases
INSDC first public		2021-03-26T00:00:00Z
INSDC last update		2024-01-14T05:42:57.197Z
INSDC secondary accession		DRS176868
INSDC status		live
Marital_status		Single
NCBI submission model		MIMARKS.survey.human-gut
NCBI submission package		MIMARKS.survey.human-gut.6.0
SRA accession		DRS176868
Viral_load (copies/ml)		15746
age		host age: 23
collection date		2017-09-13
description		Keywords: GSC:MIxS;MIMARKS:6.0
education		Secondary school
env_broad_scale		human gut
env_local_scale		human gut environment
env_medium		fecal material
geo loc name		Ghana:Koforidua
host		Homo sapiens
host_disease_stat		HIV-1 positive
log VL		4.197170247
occupation		Hair dresser
organism		human gut metagenome
project name		Dysbiotic fecal microbiome in HIV-1 infected individuals in Ghana
sample name		5P
sex		female
title		16s rDNA sequence from fecal sample of HIV-1 infected female from Koforidua, Ghana, sample ID HG-P-006-KO
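
Several attribute names in the table above, such as CD4_count(cells/ul), carry a unit inside the name itself. A program comparing names against a checklist would then have to separate the two first. The sketch below shows one possible way to do that with a regular expression; the helper `split_unit` and the choice of pattern are assumptions made for illustration, not part of any ENA tooling.

```python
import re
from typing import Optional, Tuple

# Matches a trailing parenthesised unit, e.g. "CD4_count(cells/ul)".
_UNIT_SUFFIX = re.compile(r"^(?P<name>.*?)\s*\((?P<unit>[^()]+)\)\s*$")


def split_unit(attribute: str) -> Tuple[str, Optional[str]]:
    """Split 'name(unit)' into (name, unit); unit is None if absent."""
    match = _UNIT_SUFFIX.match(attribute)
    if match:
        return match.group("name").strip(), match.group("unit").strip()
    return attribute.strip(), None


# Attribute names taken from the table above.
for name in [
    "CD4_count(cells/ul)",
    "Viral_load (copies/ml)",
    "Co-trimoxazole duration (mths)",
    "sex",
]:
    print(split_unit(name))
```

Note that splitting the name does not by itself make an attribute interoperable: the remaining name and the unit still need to match the terms defined by the chosen checklist.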

Questions

Go to ENA Checklists. How many checklists are currently available?
Using the ENA checklists, select the appropriate minimal information model (human gut: ERC000015). Check the interoperability of the currently used Attributes (Column 1) and replace with the interoperable version when available.
FAIR should go hand in hand with privacy regulations. Which of these Attributes, alone or in combination, could compromise the privacy of the study participants?
Building on existing domain-specific principles and workflows on the one hand, and maximizing cross-domain interoperability on the other, are competing goals. Which of the Attributes are domain-specific? Hint: cross-check with the ENA default checklist, ERC000011 (ENA Browser ERC000011).
Which Attributes are too specific to be important for data reuse?
Which Attributes would be essential for data reuse?