Workshop#

Exercise: Data, Metadata Minimal Information Models and Data Interoperability (Version 1)#

Introduction#

Online repositories sharing scientific data are vital for the advancement of science. Data sharing improves research transparency, promotes the validation of experimental methods and scientific conclusions, enables data reuse, and facilitates knowledge discovery using new analysis tools. Reusing scientific data requires machine-readable metadata about the experiments conducted, complete enough to reflect the FAIR guiding principles: Findable, Accessible, Interoperable, Reusable.

Several tools have been created to help make data FAIR. The ISA metadata framework standard outlines a model for capturing experiment metadata at three levels: Investigation, Study, and Assay. A key feature of properly FAIRified data is a high level of data interoperability. From a data producer/user point of view, two levels are important: structural and semantic interoperability.

Structural interoperability concerns the format of the (meta)data, which allows the data to be interpreted by multiple systems. For example, FASTA is the most widely implemented machine-actionable standard for sequence data and is therefore directly understood by many sequence analysis tools.
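To make the idea of a shared format concrete, here is a minimal sketch of a FASTA reader in Python. The function name `parse_fasta` and the two example records are made up for illustration and are not taken from the project data; real tools handle more edge cases.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first word after ">"
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them
            records[current_id] += line
    return records


# Made-up example records (not taken from PRJDB10485)
example = """>seq1 hypothetical example
ACGTAC
GTTA
>seq2
GGCC
"""
sequences = parse_fasta(example.splitlines())
for seq_id, seq in sequences.items():
    print(seq_id, len(seq))
```

Because the layout is fixed (a ">" header followed by sequence lines), any program can read the file without knowing who produced it, which is what structural interoperability amounts to.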

Semantic interoperability entails the transformation of ambiguous human-understandable metadata into a standardized, machine-actionable open format, allowing computational support systems to automatically find, access, and reuse data. To ensure that the set of metadata is sufficient to describe the data unambiguously, standardized minimal information models and checklists detailing those requirements have been developed for a wide array of experiment data.

Exercise Context#

In this exercise, we study the metadata of ENA project PRJDB10485, available in the ENA Browser (Project PRJDB10485). The project page describes the background of the study. In the ISA framework, this overarching information would be placed at the Investigation/Study level.

“The aim of this project is to analyze the dysbiosis of fecal microbiome in HIV-1 infected individuals in Ghana. Gut microbiome dysbiosis has been correlated to the progression of non-AIDS diseases such as cardiovascular and metabolic disorders. Because the microbiome composition is different among races and countries, analyses of the composition in different regions is important to understand the pathogenesis unique to specific regions. In the present study, we examined fecal microbiome compositions in HIV-1 infected individuals in Ghana. In a cross-sectional case-control study, age- and gender-matched HIV-1 infected Ghanaian adults (HIV-1 [+]; n = 55) and seronegative controls (HIV-1 [-]; n = 55) were enrolled.”

Next, we can download the metadata XML files of the 55 HIV-1 infected adults and 55 seronegative controls and open them in Excel, starting from ENA Browser Sample SAMD00244427, or use the direct link to the associated metadata (ENA Browser Sample Metadata).
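As an alternative to Excel, the attributes can be pulled out of the XML with a few lines of Python. The sketch below assumes the usual ENA sample layout (`SAMPLE_SET` > `SAMPLE` > `SAMPLE_ATTRIBUTES` > `SAMPLE_ATTRIBUTE` with `TAG` and `VALUE`). The XML string is a hand-written, abbreviated stand-in for a downloaded file, using the sample identifier as the `accession` for simplicity, and the function name `sample_attributes` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hand-written, abbreviated stand-in for a downloaded ENA sample XML file.
# Only three attributes of the first sample are reproduced here.
SAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE accession="SAMD00244427">
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>collection date</TAG><VALUE>2017-09-13</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>geo loc name</TAG><VALUE>Ghana:Koforidua</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>female</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def sample_attributes(xml_text: str) -> Dict[str, Dict[str, str]]:
    """Map each sample accession to its {TAG: VALUE} attributes."""
    root = ET.fromstring(xml_text)
    table: Dict[str, Dict[str, str]] = {}
    for sample in root.iter("SAMPLE"):
        attrs: Dict[str, str] = {}
        for attr in sample.iter("SAMPLE_ATTRIBUTE"):
            tag = attr.findtext("TAG", default="").strip()
            value = attr.findtext("VALUE", default="").strip()
            attrs[tag] = value
        table[sample.get("accession", "")] = attrs
    return table


attributes = sample_attributes(SAMPLE_XML)
for accession, attrs in attributes.items():
    for tag, value in attrs.items():
        print(accession, tag, value, sep="\t")
```

The tab-separated output can be saved to a file and opened in a spreadsheet if a tabular view is preferred.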

Ontologies and Minimal Information Models#

Ontologies and minimal information models are both essential for managing and representing this information, but they serve distinct purposes: ontologies focus on capturing domain knowledge and semantics, whereas minimal information models focus on standardizing data reporting for a specific type of research.

In an ontology, an attribute refers to a property or characteristic that is associated with a particular entity or concept within a domain. For example, an ontology describing dogs would include properties such as “breed,” “shoulder height,” and “weight.” These attributes help to distinguish dogs based on their specific characteristics and are typically represented as key-value pairs (e.g., weight: 30 kg), where the key denotes the name or label of the attribute, and the value specifies the corresponding property or characteristic associated with the entity.
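The key-value idea can be sketched in Python as follows. The `Attribute` class and `parse_attribute` function are hypothetical names, and apart from `weight: 30 kg` the example values are made up. The sketch also shows why keeping a unit separate from the value is convenient for machines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribute:
    key: str                    # name or label of the attribute
    value: str                  # the property value
    unit: Optional[str] = None  # unit, when the value is numeric


def parse_attribute(text: str) -> Attribute:
    """Parse 'key: value' or 'key: number unit' into an Attribute."""
    key, _, rest = text.partition(":")
    rest = rest.strip()
    number, _, unit = rest.partition(" ")
    try:
        float(number)
    except ValueError:
        # Not numeric: keep the whole text as the value
        return Attribute(key.strip(), rest)
    return Attribute(key.strip(), number, unit.strip() or None)


weight = parse_attribute("weight: 30 kg")
breed = parse_attribute("breed: border collie")  # made-up value
print(weight)
print(breed)
```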

So the questions are: What are the Attributes associated with the samples obtained in this study? Are these Attributes understandable by a computer (are they interoperable)? And how do these Attributes align with the proposed minimal information model for this type of study (here human gut: ERC000015)?

In the table below we have collected the Attributes (metadata types) associated with the 110 samples obtained in this study. The values vary depending on the origin of the sample; shown are the values linked to the first sample.

Sample Attributes Table#

| Type | Interoperable? (Yes/No/Alternative) | Value |
| --- | --- | --- |
| ART_drugs_current | No | TDF/3TC/EFV |
| ART_duration_at baseline_months | | 29 |
| ART_start_date | | 2015-04-08 |
| ART_status_at_baseline | | ART |
| CD4_count(cells/ul) | | 473 |
| CD8_count(cells/ul) | | 1155 |
| Co-trimoxazole duration (mths) | | 24 |
| External Id | | SAMD00244427 |
| HIV_risk_exposure | No | Heterosexual |
| INSDC center name | | AIDS Research Center, National Institute of Infectious Diseases |
| INSDC first public | | 2021-03-26T00:00:00Z |
| INSDC last update | | 2024-01-14T05:42:57.197Z |
| INSDC secondary accession | | DRS176868 |
| INSDC status | | live |
| Marital_status | | Single |
| NCBI submission model | | MIMARKS.survey.human-gut |
| NCBI submission package | | MIMARKS.survey.human-gut.6.0 |
| SRA accession | | DRS176868 |
| Viral_load (copies/ml) | | 15746 |
| age | | host age: 23 |
| collection date | | 2017-09-13 |
| description | | Keywords: GSC:MIxS;MIMARKS:6.0 |
| education | | Secondary school |
| env_broad_scale | | human gut |
| env_local_scale | | human gut environment |
| env_medium | | fecal material |
| geo loc name | | Ghana:Koforidua |
| host | | Homo sapiens |
| host_disease_stat | | HIV-1 positive |
| log VL | | 4.197170247 |
| occupation | | Hair dresser |
| organism | | human gut metagenome |
| project name | | Dysbiotic fecal microbiome in HIV-1 infected individuals in Ghana |
| sample name | | 5P |
| sex | | female |
| title | | 16s rDNA sequence from fecal sample of HIV-1 infected female from Koforidua, Ghana, sample ID HG-P-006-KO |
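Filling in the "Interoperable?" column can be partly automated once the checklist field names are at hand. The sketch below is one possible approach, not the workshop's method. The three checklist field names and the `alternatives` mapping are assumptions for illustration and must be verified against the actual ERC000015 checklist. The function names are hypothetical.

```python
from typing import Dict, Iterable


def normalise(name: str) -> str:
    """Lower-case and treat underscores as spaces so naming styles compare equal."""
    return " ".join(name.replace("_", " ").lower().split())


def classify(
    attributes: Iterable[str],
    checklist_fields: Iterable[str],
    alternatives: Dict[str, str],
) -> Dict[str, str]:
    """Label each attribute Yes, 'Alternative: <field>', or No."""
    fields = {normalise(f) for f in checklist_fields}
    result: Dict[str, str] = {}
    for attr in attributes:
        key = normalise(attr)
        if key in fields:
            result[attr] = "Yes"
        elif key in alternatives:
            result[attr] = "Alternative: " + alternatives[key]
        else:
            result[attr] = "No"
    return result


# ASSUMED subset of checklist fields; replace with the real ERC000015 list.
checklist = ["collection date", "host age", "host sex"]
# ASSUMED hand-curated mapping from submitted names to checklist names.
alternatives = {"age": "host age", "sex": "host sex"}

submitted = ["collection date", "age", "sex", "Marital_status", "ART_drugs_current"]
labels = classify(submitted, checklist, alternatives)
for name, label in labels.items():
    print(f"{name}: {label}")
```

Exact matching after normalisation is deliberately strict: anything that needs human judgment ends up in the explicit `alternatives` mapping, where it can be reviewed.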

Questions#

  1. Go to ENA Checklists. How many checklists are currently available?

  2. Using the ENA checklists, select the appropriate minimal information model (human gut: ERC000015). Check the interoperability of the currently used Attributes (column 1) and replace them with the interoperable version when available.

  3. FAIR should go hand in hand with privacy regulations. Which of these Attributes, alone or in combination, could compromise the privacy of the tested persons?

  4. Building on existing domain-specific principles and workflows on the one hand, and reaching a maximum level of cross-domain interoperability on the other, are competing goals. Which of the Attributes are domain specific? Hint: cross-check with the ENA default checklist -> ERC000011 (ENA Browser ERC000011).

  5. Which attributes are too specific – not important for data reuse?

  6. Which attributes would be essential for data reuse?
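One angle on question 5 is to check whether an attribute can be recomputed from other attributes. The sketch below uses the first-sample values from the table and rests on two assumptions that the source does not state: that `log VL` is the base-10 logarithm of `Viral_load (copies/ml)`, and that `ART_duration_at baseline_months` counts whole months from `ART_start_date` to `collection date`. The helper name `whole_months_between` is hypothetical.

```python
import math
from datetime import date


def whole_months_between(start: date, end: date) -> int:
    """Number of completed calendar months from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the last month is not yet complete
    return months


# Values of the first sample, copied from the attribute table
viral_load = 15746
log_vl_reported = 4.197170247
art_start = date.fromisoformat("2015-04-08")
collected = date.fromisoformat("2017-09-13")
art_months_reported = 29

# ASSUMPTION: log VL is log10 of the viral load
log_vl_derivable = math.isclose(
    math.log10(viral_load), log_vl_reported, abs_tol=1e-6
)
# ASSUMPTION: baseline is the collection date
art_months_derivable = (
    whole_months_between(art_start, collected) == art_months_reported
)

print("log VL derivable from viral load:", log_vl_derivable)
print("ART duration derivable from dates:", art_months_derivable)
```

An attribute that can be recomputed this way carries no independent information. Whether that makes it dispensable for reuse is the judgment the question asks for.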