Validation#
In this tutorial we will show you how to use the FAIR Data Station, which in turn allows you to work according to the FAIR By Design principles.
We will be working with a dataset that is publicly available under PRJDB10485. The study analyzed the dysbiosis of the fecal microbiome in HIV-1 infected individuals in Ghana.
You will start with a prefilled metadata Excel workbook. The sheets were generated using the Metadata Configurator of the open and free-to-use FAIR Data Station application (fairds.fairbydesign.nl).
Obtain the data#
The metadata file can be obtained here.
As you can see there are different sheets corresponding to the different levels of information.
| Level | Description |
|---|---|
| Investigation | General research questions within the specified project & User access |
| Study | A series of observation units to answer a particular biological question |
| ObservationUnit | Objects that are subject to instances of observation and measurement (Bioreactor, Patients, fields) |
| Sample | Taken from an Observation Unit that can potentially be processed further to acquire data from |
| Assay | The data (for example a sequencing run) that was performed on a sample |
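The levels nest inside one another, which can be pictured as a small data model. The following is a minimal Python sketch of that nesting; the class and field names (`identifier`, `observation_units`, and so on) and the placeholder identifiers are assumptions made for illustration, not the FAIR Data Station's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    """Data (for example a sequencing run) generated from a sample."""
    identifier: str


@dataclass
class Sample:
    """Material taken from an observation unit."""
    identifier: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class ObservationUnit:
    """Object under observation, such as a patient or a bioreactor."""
    identifier: str
    samples: List[Sample] = field(default_factory=list)


@dataclass
class Study:
    """A series of observation units answering one biological question."""
    identifier: str
    observation_units: List[ObservationUnit] = field(default_factory=list)


@dataclass
class Investigation:
    """Top level: the project and its general research questions."""
    identifier: str
    studies: List[Study] = field(default_factory=list)


def count_assays(investigation: Investigation) -> int:
    """Walk the hierarchy and count every assay below the investigation."""
    return sum(
        len(sample.assays)
        for study in investigation.studies
        for unit in study.observation_units
        for sample in unit.samples
    )


# Placeholder identifiers, purely illustrative.
demo = Investigation("inv-1", [
    Study("study-1", [
        ObservationUnit("patient-1", [
            Sample("sample-1", [Assay("run-1"), Assay("run-2")]),
        ]),
    ]),
])
print(count_assays(demo))  # 2
```

Each sheet in the workbook corresponds to one of these levels, and every row in a sheet refers to its parent one level up.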
Exercise 1: Validating the metadata#
While performing research, the Excel workbook can be populated continuously. You can imagine that this is done in the field while doing experiments, in the lab, or by machines generating tabular information whenever a measurement or sample is taken.
During this registration process small mistakes are easily made. The validator, in combination with the metadata schema in the backend, ensures that the predefined fields conform to a certain standard.
To validate this Excel file, go to fairds.fairbydesign.nl and click on the Validate Metadata button. You can now drag the Excel file into the box at the top.
It will now start the validation…
As you can see, it will start complaining about a field:
Evaluation message
The value “5” of “biosafety level” in the “Sample” sheet which is obligatory does not match the pattern of (1|2|3|4|unknown) regex (1|2|3|4|unknown) such as in example “2”
In the Excel file, the “Sample” sheet contains the following values under “biosafety level”:
biosafety level |
---|
3 |
3 |
4 |
5 |
It is very likely that some Excel magic happened here, for example auto-fill incrementing the value while a cell was dragged down. As the error message states, the only allowed values are 1, 2, 3, 4 or unknown; in this case all values should be level 3. Correct the values and validate again.
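The pattern check behind this message can be reproduced in a few lines. This is a minimal sketch that assumes the validator requires the whole cell value to match the pattern; the function name `invalid_rows` is hypothetical, and only the pattern itself is taken from the evaluation message.

```python
import re

# Pattern copied from the evaluation message above.
BIOSAFETY_PATTERN = re.compile(r"(1|2|3|4|unknown)")


def invalid_rows(values):
    """Return (row_number, value) pairs whose value does not fully match."""
    return [
        (row, value)
        for row, value in enumerate(values, start=1)
        if BIOSAFETY_PATTERN.fullmatch(str(value)) is None
    ]


column = ["3", "3", "4", "5"]  # the values shown in the Sample sheet
print(invalid_rows(column))  # [(4, '5')]
```

Note that the value 4 passes the pattern even though it should be 3 in this dataset. A pattern check only catches values outside the allowed set, so the remaining values still need a manual look.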
The evaluation message should now show:
Evaluation message
Analysing investigation information
Analysing study information
Analyzing observation unit sheet
Finished processing Sample sheet
Processing Assay - Amplicon demultiplexed sheet
Finished parsing Assay - Amplicon demultiplexed sheet
Validating RDF file: ./fairds_storage//validation/ValidationDemo.ttl
Validation successful, user not logged in. Result file not uploaded to the data storage facility
The validation was successful.
The output is an RDF database file (ValidationDemo.ttl in the message above) that internal systems can use to query and process the data further; this is currently beyond the scope of this tutorial.
Exercise 2: Transforming the metadata#
As you can see from the sheet names, there are 3 sheets that contain actual experimental metadata: ObservationUnit, Sample and Assay. For the observation unit and the assay we have specified a package, which in turn is used by the validator.
These packages can have different requirements. For example, for the Sample sheet the air package has an obligatory field geographic location (altitude), which is not applicable to a human gut sample.
To make the Sample sheet more specific change the sheet name Sample to Sample - human gut.
Validate your dataset again.
As you can see, a new error pops up. Changing the package name can change the requirements. In this case the message is:
Evaluation message
in sheet “Sample - human gut” row 1 does not contain column “geographic location (country and/or sea)” which is obligatory
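The effect of the sheet name on the requirements can be sketched as follows. The requirement table below is a hypothetical and heavily abbreviated stand-in that contains only the two obligatory fields mentioned in this tutorial; the function names, and the choice to treat a sheet without a package suffix as having no extra requirements, are assumptions made for illustration.

```python
# Hypothetical, heavily abbreviated requirement table. The real packages
# define many more fields; only these two are named in the tutorial.
REQUIRED_COLUMNS = {
    "air": ["geographic location (altitude)"],
    "human gut": ["geographic location (country and/or sea)"],
}


def package_from_sheet_name(sheet_name):
    """Split 'Sample - human gut' into ('Sample', 'human gut')."""
    level, separator, package = sheet_name.partition(" - ")
    return level, (package if separator else None)


def missing_columns(sheet_name, header):
    """List obligatory columns of the sheet's package that the header lacks."""
    _, package = package_from_sheet_name(sheet_name)
    required = REQUIRED_COLUMNS.get(package, [])  # assumption: no package, no extras
    return [column for column in required if column not in header]


header = ["biosafety level", "geo_loc_name"]
print(missing_columns("Sample - human gut", header))
# ['geographic location (country and/or sea)']
```

Renaming the sheet switches the set of obligatory columns, which is why a workbook that passed before can fail after the rename.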
If you look at column AB in the Sample - human gut sheet, you see that there is some form of geolocation information under the column name geo_loc_name.
You can rename geo_loc_name to geographic location (country and/or sea) and run the validation again.
As you can see, the validation still does not pass. This is because this field is more restricted. To avoid losing any information, we should create a new column next to this one that holds the old information and keep only the country in the original field.
This should create:
| geographic location (country and/or sea) | geo_loc_name |
|---|---|
| Ghana | Koforidua |
| Ghana | Koforidua |
| Ghana | Koforidua |
| Ghana | Koforidua |
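For larger sheets this split can be scripted. The sketch below assumes the original cells use the `country:locality` convention (for example `Ghana:Koforidua`); the tutorial does not show the original cell contents, so this input format and the function name `split_geo_loc` are assumptions.

```python
def split_geo_loc(value):
    """Split 'country:locality' into (country, locality).

    Assumes a colon separates the country from the rest; a value without
    a colon is treated as a country with an empty locality.
    """
    country, _, locality = value.partition(":")
    return country.strip(), locality.strip()


# Hypothetical original cell value, following the assumed convention.
print(split_geo_loc("Ghana:Koforidua"))  # ('Ghana', 'Koforidua')
```

The first element goes into geographic location (country and/or sea) and the second stays in geo_loc_name, so no information is lost.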
The validation should now be successful.
Tip
What about geo_loc_name?
Maybe you can find a term that fits the description better? Hint: look at https://fairds.fairbydesign.nl/terms and search for geographic in the field name.
“geographic location (region and locality)” can be added as an extra field, since the city is also available in the “geo_loc_name” field.
And could geo_loc_name be transformed to geographic location (latitude) and geographic location (longitude)?
Exercise 3: Fixing Observation Unit#
When reviewing your metadata, you may find that you have used your own column name where a predefined (optional) column with a different name already exists. In this case we will look at the ObservationUnit sheet. Please download from here a slightly revised workbook in which the Observation Unit model is now changed to person.
If we just run the validation on the workbook, it seems everything is fine. Or is it?
Warning
The last column, gender, is not in our standardised metadata schema and therefore it is not properly evaluated. At the moment it is a free-form field. But how can we FAIRify this field?
Go to the terms overview at FAIR Data Station/Terms
As you can see there are many terms available for different sheets. In our case we are going to search for fields in the ObservationUnit sheet.
Filter for ObservationUnit in the sheet column
As you can see, the person package appears in the second column, and the field column contains a specific field that fits the gender column in our Excel sheet.
Rename the gender column to the term that can replace it: sex
Validate the Excel sheet again.
Maybe you have noticed the typo in the now modified column? The message indicates what went wrong.
Evaluation message
The value “femalee” of “sex” in the “ObservationUnit” sheet does not match the pattern of (female|hermaphrodite|male|neuter|not applicable|not collected|not provided|other|restricted access) regex (female|hermaphrodite|male|neuter|not applicable|not collected|not provided|other|restricted access) such as in example “female”
Go back to the Excel file, change femalee to female, and validate again.
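Typos like this one can also be caught before uploading. The sketch below checks a value against the allowed terms from the evaluation message and, as an extra that is not part of the FAIR Data Station, proposes the closest allowed term using Python's standard `difflib` module. The function name `check_sex` is hypothetical.

```python
import difflib
import re

# Allowed values copied from the evaluation message above.
ALLOWED_SEX = [
    "female", "hermaphrodite", "male", "neuter", "not applicable",
    "not collected", "not provided", "other", "restricted access",
]
SEX_PATTERN = re.compile("|".join(re.escape(term) for term in ALLOWED_SEX))


def check_sex(value):
    """Return (is_valid, suggestion); suggestion is None when nothing is close."""
    if SEX_PATTERN.fullmatch(value):
        return True, None
    close = difflib.get_close_matches(value, ALLOWED_SEX, n=1)
    return False, (close[0] if close else None)


print(check_sex("femalee"))  # (False, 'female')
print(check_sex("female"))   # (True, None)
```

A suggestion is only a hint; the corrected value should still be confirmed against the actual observation before it is written back to the sheet.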