Validation#

In this tutorial we show you how to use the FAIR Data Station, which in turn allows you to work according to the FAIR By Design principles.

We will work with a publicly available dataset, PRJDB10485. The study behind it analyzed dysbiosis of the fecal microbiome in HIV-1 infected individuals in Ghana.

You will start with a prefilled Excel metadata workbook. Its sheets were generated with the Metadata configurator of the free and open FAIR Data Station application at fairds.fairbydesign.nl.

Obtain the data#

The metadata file can be obtained here.

Excel metadata file

As you can see there are different sheets corresponding to the different levels of information.

| Level | Description |
|---|---|
| Investigation | General research questions within the specified project, and user access |
| Study | A series of observation units to answer a particular biological question |
| ObservationUnit | Objects that are subject to instances of observation and measurement (bioreactors, patients, fields) |
| Sample | Material taken from an observation unit that can be processed further to acquire data |
| Assay | The measurement (for example a sequencing run) that was performed on a sample |
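The levels nest inside each other: an investigation contains studies, a study contains observation units, and so on down to assays. The following is a minimal sketch of that nesting, not the data model used by the FAIR Data Station itself; the class attributes (`identifier`, `studies`, `samples`, ...) and the helper `count_assays` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Assay:
    identifier: str  # for example one sequencing run


@dataclass
class Sample:
    identifier: str
    assays: list[Assay] = field(default_factory=list)


@dataclass
class ObservationUnit:
    identifier: str  # for example one patient
    samples: list[Sample] = field(default_factory=list)


@dataclass
class Study:
    identifier: str
    observation_units: list[ObservationUnit] = field(default_factory=list)


@dataclass
class Investigation:
    identifier: str
    studies: list[Study] = field(default_factory=list)


def count_assays(investigation: Investigation) -> int:
    """Walk the hierarchy top-down and count every assay below it."""
    return sum(
        len(sample.assays)
        for study in investigation.studies
        for unit in study.observation_units
        for sample in unit.samples
    )
```

Each sheet in the workbook corresponds to one of these levels, and each row to one object at that level.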

Exercise 1: Validating the metadata#

While performing research, the Excel sheet can be continuously populated. You can imagine that this is done in the field while doing experiments, in the lab, or by machines that generate tabular information whenever a measurement or sample is taken.

During this registration process small mistakes are easily made. The validator, in combination with the metadata schema in the backend, ensures that the predefined fields conform to a given standard.

To validate this Excel file, go to fairds.fairbydesign.nl and click on the Validate Metadata button. You can now drag the Excel file into the box at the top.

It will now start the validation…

As you can see, it complains about a field:

Evaluation message

The value “5” of “biosafety level” in the “Sample” sheet which is obligatory does not match the pattern of (1|2|3|4|unknown) regex (1|2|3|4|unknown) such as in example “2”

Now look at the “biosafety level” column of the “Sample” sheet in the Excel file:

Mistaken biosafety level

| biosafety level |
|---|
| 3 |
| 3 |
| 3 |
| 3 |
| 3 |
| 3 |
| 3 |
| 4 |
| 5 |

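It is very likely that Excel magic happened here: when cells are filled down, Excel can turn a repeated 3 into the series 3, 4, 5. The error message lists the only allowed values, (1|2|3|4|unknown). Note that the validator flags only the 5, because 4 is an allowed value, yet in this case all values should be level 3. Correct the values and evaluate again.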
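The pattern check can be reproduced locally before uploading. The sketch below is an illustration, not the validator's own code: it assumes the validator requires the whole cell value to match the pattern, and the helper name `invalid_values` is hypothetical. The pattern and the column values are taken from the message and the sheet above.

```python
import re

# Pattern copied from the evaluation message.
BIOSAFETY_PATTERN = re.compile(r"(1|2|3|4|unknown)")

# Values as they appear in the "biosafety level" column.
column = ["3", "3", "3", "3", "3", "3", "3", "4", "5"]


def invalid_values(values, pattern):
    """Return (position, value) pairs whose value does not fully match the pattern.

    Positions are 0-based indexes into the list, not Excel row numbers.
    """
    return [(i, v) for i, v in enumerate(values) if pattern.fullmatch(v) is None]


print(invalid_values(column, BIOSAFETY_PATTERN))  # [(8, '5')]
```

The value 4 passes the check even though it is wrong for this dataset, which shows that a pattern can catch impossible values but not plausible ones.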

The evaluation message should now show:

Evaluation message

Analysing investigation information
Analysing study information
Analyzing observation unit sheet
Finished processing Sample sheet
Processing Assay - Amplicon demultiplexed sheet
Finished parsing Assay - Amplicon demultiplexed sheet
No lat long values to convert to GeoSPARQL format.
Validating RDF file
RDF file passed validation
Validation successful, user not logged in.
Result file not uploaded to the data storage facility

The validation is now successful.

The output is a database file (the RDF file mentioned in the log) that internal systems can use to query and process the data further. This is currently beyond the scope of this tutorial.

Exercise 2: Transforming the metadata#

As you can see from the sheet names, three sheets contain the actual experimental metadata: ObservationUnit, Sample and Assay. For the ObservationUnit and Assay sheets we have specified a package, which the validator in turn uses.

These packages can have different requirements. For example, for the Sample sheet the air package has an obligatory field geographic location (altitude), which is not applicable to a human gut sample.

To make the Sample sheet more specific change the sheet name Sample to Sample - human gut.

Validate your dataset again.

As you can see, a new error pops up. Changing the package name can change the requirements. In this case the message is:

Evaluation message

in sheet “Sample - human gut” row 1 does not contain column “geographic location (country and/or sea)” which is obligatory
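The effect of the sheet name on the requirements can be sketched as follows. This is an illustration under stated assumptions, not the validator's implementation: it assumes the package is whatever follows " - " in the sheet name, and the table `OBLIGATORY_FIELDS` holds only the two obligatory fields named in this tutorial, whereas the real packages define more. The function names are hypothetical.

```python
# Illustrative subset only: the obligatory fields mentioned in this tutorial.
OBLIGATORY_FIELDS = {
    "air": ["geographic location (altitude)"],
    "human gut": ["geographic location (country and/or sea)"],
}


def split_sheet_name(sheet_name):
    """Split 'Sample - human gut' into ('Sample', 'human gut').

    A sheet name without ' - ' carries no explicit package, returned as None.
    """
    sheet_type, separator, package = sheet_name.partition(" - ")
    return sheet_type.strip(), (package.strip() if separator else None)


def missing_columns(sheet_name, header):
    """List obligatory fields of the sheet's package that are absent from the header."""
    _, package = split_sheet_name(sheet_name)
    required = OBLIGATORY_FIELDS.get(package, [])
    return [name for name in required if name not in header]


# Two column names that this tutorial mentions for the Sample sheet.
header = ["biosafety level", "geo_loc_name"]

print(missing_columns("Sample", header))
# []
print(missing_columns("Sample - human gut", header))
# ['geographic location (country and/or sea)']
```

The same header passes under one sheet name and fails under another, which is exactly what happens after renaming the sheet.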

If you look at column Q in the Sample - human gut sheet, you see that there is some form of geolocation information under the column name geo_loc_name.

Now create a new column next to column Q and name it geographic location (country and/or sea). As this column only accepts a country or sea name, make sure you fill in only Ghana in that column.

Validate again; the validation should now be successful.

Tip

What about geo_loc_name?

Maybe you can find a term that fits the description better? Hint: look at https://fairds.fairbydesign.nl/terms and search for geographic in the field name column.

“geographic location (region and locality)” can be added as an extra field, since the city is also available in the “geo_loc_name” field.

And could geo_loc_name be transformed into geographic location (latitude) and geographic location (longitude)?
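If the workbook has many rows, the country and locality columns could be filled from geo_loc_name with a small helper. The sketch below assumes the cell values follow a colon-separated "country:locality" convention; check your own workbook before relying on it. The function name `split_geo_loc_name` and the example value are hypothetical and not copied from the tutorial file.

```python
def split_geo_loc_name(value):
    """Split 'country:locality' into (country, locality).

    Assumes the colon-separated convention. A value without a colon is
    treated as a country name only, with locality None.
    """
    country, _, locality = value.partition(":")
    return country.strip(), (locality.strip() or None)


# Hypothetical cell value for illustration.
country, locality = split_geo_loc_name("Ghana: Example City")
print(country)   # Ghana
print(locality)  # Example City
```

The first element would go into geographic location (country and/or sea) and the second into geographic location (region and locality).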

Exercise 3: Fixing Observation Unit#

When reviewing your metadata, you may find that you have missed a predefined (optional) column because, in hindsight, it has a different name in the schema. In this case we will look at the ObservationUnit sheet. Please download from here a slightly revised workbook in which the Observation Unit model is now changed to person.

If we just run validation on the workbook, it seems everything is fine. Or is it?

Warning

The last column, gender, is not in our standardised metadata schema and is therefore not properly evaluated. At the moment it is a free-form field. But how can we FAIRify this field?

Go to the terms overview at FAIR Data Station/Terms

As you can see there are many terms available for different sheets. In our case we are going to search for fields in the ObservationUnit sheet.

  • Filter for ObservationUnit in the sheet column

As you can see, the person package name appears in the second column, and the field column contains a specific field that fits the gender column in our Excel sheet.

  • Rename the gender column to the term that replaces it: sex

  • Validate the Excel sheet again

Maybe you have noticed the typo in the now modified column? The message indicates what went wrong.

Evaluation message

The value “malee” of “sex” in the “ObservationUnit” sheet does not match the pattern of (female|hermaphrodite|male|neuter|not applicable|not collected|not provided|other|restricted access) regex (female|hermaphrodite|male|neuter|not applicable|not collected|not provided|other|restricted access) such as in example “female”

Go back to the Excel file, change malee to male and validate again.
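Typos like this one can also be spotted before uploading by comparing each value with the allowed terms. The following is a minimal sketch using Python's standard `difflib` module; it is not part of the FAIR Data Station, and the helper name `suggest` is hypothetical. The allowed values are copied from the pattern in the evaluation message.

```python
import difflib

# Allowed values copied from the pattern in the evaluation message.
ALLOWED_SEX_VALUES = [
    "female", "hermaphrodite", "male", "neuter", "not applicable",
    "not collected", "not provided", "other", "restricted access",
]


def suggest(value, allowed=ALLOWED_SEX_VALUES):
    """Return the closest allowed value, or None when nothing is similar enough."""
    matches = difflib.get_close_matches(value, allowed, n=1, cutoff=0.6)
    return matches[0] if matches else None


print(suggest("malee"))  # male
```

A suggestion is only a hint: "malee" is also fairly close to "female", so a person should confirm the correction rather than apply it blindly.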