Tutorial#
This tutorial shows you how to use the FAIR Data Station, which helps you work according to the FAIR By Design principles.
We will work with a dataset that is publicly available under accession PRJDB10485. The study analyzed the dysbiosis of the fecal microbiome in HIV-1 infected individuals in Ghana.
You will start with a prefilled metadata Excel file. The sheets were generated using the Metadata configurator of the open and free-to-use FAIR Data Station application at fairds.fairbydesign.nl.
1. Obtain the data#
The metadata file can be obtained here.
As you can see there are different sheets corresponding to the different levels of information.
Level | Description |
---|---|
Project | Project / Funding information |
Investigation | General research questions within the specified project & User access |
Study | A series of observation units to answer a particular biological question |
ObservationUnit | Objects that are subject to instances of observation and measurement (Bioreactor, Patients, fields) |
Sample | Taken from an Observation Unit that can potentially be processed further to acquire data from |
Assay | The data (for example a sequencing run) that was performed on a sample |
Exercise 1: Validate!#
While performing research, the Excel file can be populated continuously. This can be done in the field during experiments, in the lab, or by machines that generate tabular information whenever a measurement or sample is taken.
Small mistakes are easily made during this registration process. The validator, together with the metadata schema in the backend, checks that the predefined fields conform to the expected standard.
To validate this Excel file, go to fairds.fairbydesign.nl and click on the Validate Metadata button. You can now drag the Excel file into the box at the top.
The validation will now start…
As you can see, it complains about a field:
The value “5” of “biosafety level” in the “Sample” sheet which is obligatory does not match the pattern of (1|2|3|4|unknown) regex (1|2|3|4|unknown) such as in example “2”
Now look in the Excel file, in the “Sample” sheet under “biosafety level”:
biosafety level |
---|
3 |
3 |
3 |
3 |
3 |
3 |
3 |
4 |
5 |
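It is very likely that some Excel magic happened here, for example auto-fill incrementing a dragged value. The only values allowed, as the error message states, are 1, 2, 3, 4 or unknown, and in this case every value should be 3. Note that the 4 matches the pattern and is therefore not reported, but it is wrong as well. Correct the values and validate again.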
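The kind of check reported above can be sketched in a few lines of Python. This is an illustration only, not the FAIR Data Station's actual implementation: the function name `find_invalid` is made up for this example, and the list holds the nine cells shown in the table. Only the pattern `(1|2|3|4|unknown)` is taken from the error message.

```python
import re

# Pattern copied from the validator's error message.
BIOSAFETY_PATTERN = re.compile(r"(1|2|3|4|unknown)")


def find_invalid(values, pattern=BIOSAFETY_PATTERN):
    """Return (row_index, value) pairs whose whole value fails the pattern."""
    return [
        (index, value)
        for index, value in enumerate(values, start=1)
        if pattern.fullmatch(value) is None
    ]


# The nine cells of the "biosafety level" column shown above.
column = ["3", "3", "3", "3", "3", "3", "3", "4", "5"]
print(find_invalid(column))  # -> [(9, '5')]
```

Only the 5 is reported. The 4 is a legal value according to the pattern, so a pattern check alone cannot tell that it should have been a 3.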
The evaluation message should now show:
Observation unit creation
Study creation
Investigation creation
Project creation
Total number of objects 21
All identifiers are accounted for saving the results
Validation successful, user not logged in.
Result file not uploaded to the data storage facility
Validation appeared to be successful.
The output is a database file that internal systems can use to query and process the data further. This is currently beyond the scope of this tutorial.
Exercise 2#
As the sheet names show, three sheets contain the actual experimental metadata: ObservationUnit, Sample and Assay. For the observation unit and the assay we have specified a package, which the validator uses to decide which fields are required.
Packages can have different requirements. For samples, for example, the air package has an obligatory field geographic location (altitude), which does not apply to a human gut sample.
To make the Sample sheet more specific change the sheet name Sample to Sample - human gut.
Validate your dataset again
As you can see, a new error pops up: changing the package name can change the requirements. In this case the message is:
in sheet “Sample - human gut” row 1 does not contain column “geographic location (country and/or sea)” which is obligatory
If you look at column Q in the Sample - human gut sheet, you see that there is some form of geolocation information under the column name geo_loc_name.
Now create a new column next to column Q and name it geographic location (country and/or sea). As this column only accepts a country or sea name, fill in only Ghana in that column.
After validation it should now appear to be successful.
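How a package name in the sheet title can change the obligatory columns is sketched below. This is a simplified illustration under stated assumptions, not the validator's real logic: `REQUIRED_COLUMNS` contains only the two obligatory fields named in this exercise, the function names are invented, and the header row is a shortened, made-up example.

```python
# Hypothetical, heavily reduced rule set: only the two obligatory fields
# mentioned in this exercise. The real metadata schema defines many more.
REQUIRED_COLUMNS = {
    ("Sample", "air"): {"geographic location (altitude)"},
    ("Sample", "human gut"): {"geographic location (country and/or sea)"},
}


def parse_sheet_name(sheet_name):
    """Split 'Sample - human gut' into ('Sample', 'human gut')."""
    level, _, package = sheet_name.partition(" - ")
    return level.strip(), (package.strip() or None)


def missing_columns(sheet_name, header):
    """Return the obligatory columns that are absent from the header row."""
    required = REQUIRED_COLUMNS.get(parse_sheet_name(sheet_name), set())
    return sorted(required - set(header))


header = ["sample identifier", "geo_loc_name"]  # made-up, shortened header
print(missing_columns("Sample", header))
print(missing_columns("Sample - human gut", header))
```

With the plain `Sample` sheet name no package rule applies in this sketch, so nothing is reported. After renaming the sheet, the missing country column is listed until it is added to the header.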
Exercise 3#
When reviewing your metadata, you may find a column that matches a predefined (optional) field but was given a different name. In this case we will look at the last column of the ObservationUnit - person sheet. This column is not in our standardised metadata schema and is therefore not properly evaluated; at the moment it is a free-form field.
Go to the terms overview at FAIR Data Station/Terms
As you can see there are many terms available for different sheets. In our case we are going to search for fields in the ObservationUnit sheet.
Filter for ObservationUnit in the sheet column
As you can see there is the person package name in the second column and a specific field in the field column that fits the sex column in our excel sheet.
Change the sex column to the name that could replace it (gender)
Validate the excel sheet again
Maybe you have noticed the typo in the now gender column? The message shows you what went wrong.
The value “malee” of “gender” in the “ObservationUnit” sheet does not match the pattern of (female|hermaphrodite|male|neuter|not applicable|not collected|not provided|other|restricted access) regex (female|hermaphrodite|male|neuter|not applicable|not collected|not provided|other|restricted access) such as in example “female”
Go back to the excel file and change malee to male
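Typos like this one can also be caught with a "did you mean" suggestion. The sketch below uses Python's standard `difflib` module to propose the closest allowed term. The validator itself only reports the mismatch; the `suggest` helper is a hypothetical addition for illustration, and the list of allowed values is copied from the error message above.

```python
import difflib

# Allowed values copied from the validator's error message.
ALLOWED_GENDER = [
    "female", "hermaphrodite", "male", "neuter", "not applicable",
    "not collected", "not provided", "other", "restricted access",
]


def suggest(value, allowed=ALLOWED_GENDER):
    """Return the value if it is allowed, else the closest allowed term or None."""
    if value in allowed:
        return value
    matches = difflib.get_close_matches(value, allowed, n=1)
    return matches[0] if matches else None


print(suggest("malee"))  # -> male
```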
BONUS#
Try to run the application#
This application is written in Java. If you have Java installed on your computer, you should be able to start the program on your local machine.
To obtain the program, go to http://download.systemsbiology.nl/unlock and download fairds.jar to your computer.
To start the program, open a command line interface (Prompt, Shell or Terminal, depending on your operating system).
For mac: Open the terminal app
For ubuntu: Open the terminal app
For windows: Open the Command Prompt app
Type: java -jar fairds.jar (When you are in the same folder as the jar file)
When you see the following line, the application has started… Tomcat started on port(s): 8083 ….
Now you can access the application using your browser at http://localhost:8083.
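If the page does not load, it can help to check whether anything is listening on the port at all. The following is a small, generic sketch using only the Python standard library; it is not part of the FAIR Data Station, and the helper name `is_port_open` is made up. The default port 8083 is the one from the start-up message above.

```python
import socket


def is_port_open(host="localhost", port=8083, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("Application reachable on port 8083:", is_port_open())
```

A result of `False` usually means the program has not finished starting yet or is running on a different port.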
RDF#
The metadata is converted to an RDF data file and can be queried using the SPARQL query language.
Loading the data#
We will load the dataset into GraphDB. To install GraphDB, go to https://www.ontotext.com/products/graphdb/graphdb-free/ and register for an installation (registration is free and required).
After successful installation you should be able to access the application at http://localhost:7200. To load the data into a database do the following:
Click on repositories
Create new repository
GraphDB Repository
Give it a name in the Repository ID*
Leave everything else as default
Top right choose repository change to the one you just created
Import
Import the RDF file
Click import
Leave everything at default and Click import again
The data should be loaded within a second
When you go back to the home screen (click GraphDB) you should see that your local active repository has a total of 1,143 statements.
Enable autocomplete#
To make life easier we will enable autocomplete for the SPARQL queries:
Click on the repository you just created
Click on settings
Click on autocomplete
Click on enable autocomplete
Explore the data#
Now that the data is loaded we can start exploring it, using the Explore option first:
Click on the repository you just created
Click on explore
Visual graph
In the Easy Graph bar type “Project” and select the “http://jermontology.org/ontology/JERMOntology#Project” URL.
Click on the “prj_HIV-Ghana” node
Follow the “hasPart” links to other nodes
Do you see the connections between the nodes and the excel sheet?
Click on one of the Observation Units (e.g., obs_XDRS176892)
A sidebar should appear
What properties are used and do you see different namespaces?
Query the data#
Now that we have explored the data we can start querying it with the SPARQL query language:
Click on the repository you just created
Click on SPARQL
In the query box type the following query:
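To obtain all observation units
PREFIX jerm: <http://jermontology.org/ontology/JERMOntology#>
PREFIX ppeo: <http://purl.org/ppeo/PPEO.owl#>
SELECT *
WHERE {
    ?ou a ppeo:observation_unit .
}
To obtain all observation units that are female
PREFIX jerm: <http://jermontology.org/ontology/JERMOntology#>
PREFIX ppeo: <http://purl.org/ppeo/PPEO.owl#>
SELECT *
WHERE {
    ?ou a ppeo:observation_unit .
    ?ou mixs:0000811 'female' .
}
To obtain all observation units that are female and a trader
PREFIX jerm: <http://jermontology.org/ontology/JERMOntology#>
PREFIX ppeo: <http://purl.org/ppeo/PPEO.owl#>
SELECT *
WHERE {
    ?ou a ppeo:observation_unit .
    ?ou mixs:0000811 'female' .
    ?ou fair:occupation 'Trader' .
}
The last two queries also use the mixs: and fair: prefixes, which need a PREFIX declaration as well. Their namespace IRIs are not listed in this tutorial; if GraphDB does not add the PREFIX lines for you, you should be able to look them up under Setup > Namespaces.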
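The queries above can also be sent from a script instead of the query box. The sketch below uses only the Python standard library and assumes that GraphDB exposes the repository as a standard SPARQL endpoint at `/repositories/<Repository ID>` on the address used in this tutorial. The repository ID `fairds-demo` and the function names are placeholders invented for this example.

```python
import json
import urllib.parse
import urllib.request

GRAPHDB_URL = "http://localhost:7200"  # address used in this tutorial
REPOSITORY_ID = "fairds-demo"          # hypothetical; use your own Repository ID

QUERY = """
PREFIX ppeo: <http://purl.org/ppeo/PPEO.owl#>
SELECT * WHERE { ?ou a ppeo:observation_unit . }
"""


def build_request(query, repository_id=REPOSITORY_ID, base_url=GRAPHDB_URL):
    """Build a SPARQL-protocol POST request for one repository."""
    url = f"{base_url}/repositories/{urllib.parse.quote(repository_id)}"
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def run_query(query):
    """Send the query and return the list of result bindings."""
    with urllib.request.urlopen(build_request(query)) as response:
        return json.load(response)["results"]["bindings"]
```

With GraphDB running, `for row in run_query(QUERY): print(row["ou"]["value"])` would print one observation unit IRI per line.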
Note that string literals in SPARQL are compared case-sensitively, which makes proper standardisation crucial. For example, there are no observation units with the occupation ‘trader’ in the dataset, but there are with ‘Trader’.
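One simple way to spot such inconsistencies before the metadata reaches the database is to group values that differ only in letter case. The sketch below is a generic illustration; the function name and the list of occupations are made up and are not taken from the dataset.

```python
from collections import defaultdict


def find_case_variants(values):
    """Group values that are equal when letter case is ignored."""
    groups = defaultdict(set)
    for value in values:
        groups[value.casefold()].add(value)
    # Keep only the groups that contain more than one spelling.
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


occupations = ["Trader", "trader", "Farmer", "Trader"]  # made-up example values
print(find_case_variants(occupations))  # -> {'trader': ['Trader', 'trader']}
```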
Shex visualization#
It is also possible to visualize the content using shape expressions. This is however beyond the scope of this tutorial.
A visual representation of the demo dataset