Tutorial#

This tutorial shows you how to use the FAIR Data Station, which helps you work according to the FAIR By Design principles.

We will work with a dataset that is publicly available under accession PRJDB10485. The study analyzed the dysbiosis of the fecal microbiome in HIV-1 infected individuals in Ghana.

You will start with a prefilled metadata Excel file. Its sheets were generated with the Metadata configurator of the free and open FAIR Data Station application at fairds.fairbydesign.nl.

1. Obtain the data#

The metadata file can be obtained here.

Excel metadata file

As you can see there are different sheets corresponding to the different levels of information.

Level | Description
Project | Project / Funding information
Investigation | General research questions within the specified project & User access
Study | A series of observation units to answer a particular biological question
ObservationUnit | Objects that are subject to instances of observation and measurement (Bioreactor, Patients, fields)
Sample | Taken from an Observation Unit that can potentially be processed further to acquire data from
Assay | The data (for example a sequencing run) that was performed on a sample
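One way to picture how these levels nest is as a small tree. The sketch below is only an illustration: the class and field names are assumptions and not the FAIR Data Station data model, and apart from prj_HIV-Ghana and obs_XDRS176892 the identifiers are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One metadata object, i.e. one row in one of the sheets."""
    level: str        # e.g. "Project", "Study", "Sample"
    identifier: str
    parts: list["Node"] = field(default_factory=list)


def count_objects(node: Node) -> int:
    """Count a node together with everything nested below it."""
    return 1 + sum(count_objects(part) for part in node.parts)


# Build one branch of the hierarchy from the bottom up.
assay = Node("Assay", "assay_example")              # hypothetical identifier
sample = Node("Sample", "sample_example", [assay])  # hypothetical identifier
unit = Node("ObservationUnit", "obs_XDRS176892", [sample])
study = Node("Study", "study_example", [unit])      # hypothetical identifier
investigation = Node("Investigation", "inv_example", [study])
project = Node("Project", "prj_HIV-Ghana", [investigation])

print(count_objects(project))  # 6
```

Each sheet in the Excel file corresponds to one level of such a tree, and every row adds one node at that level.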

Exercise 1: Validate!#

While performing research the Excel sheet can be populated continuously: in the field while doing experiments, in the lab, or by machines that generate tabular information whenever a measurement or sample is taken.

Small mistakes are easily made during this registration process. The validator, together with the metadata schema in the backend, ensures that the predefined fields conform to a given standard.

To validate this excel file go to fairds.fairbydesign.nl and click on the Validate Metadata button. You can now drag the excel file into the box at the top.

It will now start the validation…

As you can see, it reports a problem with one field:

The value “5” of “biosafety level” in the “Sample” sheet which is obligatory does not match the pattern of (1|2|3|4|unknown) regex (1|2|3|4|unknown) such as in example “2”

Now look at the “biosafety level” column of the “Sample” sheet in the Excel file:

biosafety level

3

3

3

3

3

3

3

4

5

It is very likely that Excel's auto-fill changed the last values here. As the error message states, the only values allowed are 1, 2, 3, 4 and unknown; in this dataset all values should be level 3. Correct the values and validate again.
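To see what a pattern check like this does, here is a minimal Python sketch. The pattern is copied from the error message; the function name and the way the column is represented are assumptions for illustration and not the validator's actual implementation.

```python
import re

# Pattern copied from the validator's error message.
BIOSAFETY_PATTERN = re.compile(r"(1|2|3|4|unknown)")


def invalid_values(values, pattern=BIOSAFETY_PATTERN):
    """Return (row_number, value) pairs whose value does not fully match."""
    return [
        (row, value)
        for row, value in enumerate(values, start=1)
        if pattern.fullmatch(str(value)) is None
    ]


# The "biosafety level" column as it appears in the demo sheet.
column = [3, 3, 3, 3, 3, 3, 3, 4, 5]
print(invalid_values(column))  # [(9, 5)]
```

Note that the value 4 passes the pattern even though it is wrong for this dataset. A pattern check only catches values outside the allowed set, so a review of the sheet remains necessary.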

The evaluation message should now show:

Observation unit creation
Study creation
Investigation creation
Project creation
Total number of objects 21
All identifiers are accounted for saving the results
Validation successful, user not logged in. 
Result file not uploaded to the data storage facility

Validation appeared to be successful.

The output is a database file that internal systems can query and process further; this is beyond the scope of this tutorial.

Exercise 2#

As the sheet names show, three sheets contain the actual experimental metadata: ObservationUnit, Sample and Assay. For the observation unit and the assay we have already specified a package, which the validator then uses.

Packages can have different requirements. For samples, for example, the air package has an obligatory field geographic location (altitude), which does not apply to a human gut sample.

To make the Sample sheet more specific change the sheet name Sample to Sample - human gut.

Validate your dataset again

As you can see, a new error appears. Changing the package name can change the requirements. In this case the message is:

in sheet “Sample - human gut” row 1 does not contain column “geographic location (country and/or sea)” which is obligatory

If you look at column Q in the Sample - human gut sheet, you see that there is some form of geolocation information under the column name geo_loc_name.

Now create a new column next to the Q column and name this geographic location (country and/or sea). As this column only accepts a country or sea name make sure you only fill in Ghana in that particular column.

The validation should now be successful.
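The obligatory-column check behind this message can be sketched in a few lines of Python. This is an illustration under assumptions: the function name is made up, the header row is a hypothetical excerpt (only geo_loc_name and biosafety level are known from the sheet), and the list of required columns contains only the one named in the error message.

```python
def missing_columns(header, required):
    """Return the required column names that are absent from a header row."""
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Hypothetical excerpt of the "Sample - human gut" header row.
header = ["sample identifier", "biosafety level", "geo_loc_name"]
# The one obligatory column known from the error message.
required = ["geographic location (country and/or sea)"]

print(missing_columns(header, required))
# ['geographic location (country and/or sea)']

# After adding the new column next to geo_loc_name the check passes.
header.append("geographic location (country and/or sea)")
print(missing_columns(header, required))  # []
```

Because the set of required columns depends on the package, renaming a sheet can turn a valid file into an invalid one, as happened here.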

Exercise 3#

When reviewing your metadata you may find that one of your columns already exists as a predefined (optional) field under a different name. Here we look at the last column of the ObservationUnit - person sheet. This column is not in our standardised metadata schema and is therefore not properly validated. At the moment it is a free-form field.

Go to the terms overview at FAIR Data Station/Terms

As you can see there are many terms available for different sheets. In our case we are going to search for fields in the ObservationUnit sheet.

  • Filter for ObservationUnit in the sheet column

As you can see there is the person package name in the second column and a specific field in the field column that fits the sex column in our excel sheet.

  • Change the sex column to the name that could replace it (gender)

  • Validate the excel sheet again

Maybe you have noticed the typo in the now gender column? The message shows you what went wrong.

The value “malee” of “gender” in the “ObservationUnit” sheet does not match the pattern of (female|hermaphrodite|male|neuter|not applicable|not collected|not provided|other|restricted access) regex (female|hermaphrodite|male|neuter|not applicable|not collected|not provided|other|restricted access) such as in example “female”

Go back to the excel file and change malee to male
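For larger sheets it can help to suggest the closest allowed value for a typo like this one. The sketch below uses Python's standard difflib module; the list of options is copied from the error message, while the function itself is an assumption for illustration and not part of the FAIR Data Station.

```python
import difflib

# Allowed values copied from the validator's error message.
GENDER_OPTIONS = [
    "female", "hermaphrodite", "male", "neuter", "not applicable",
    "not collected", "not provided", "other", "restricted access",
]


def suggest(value, options=GENDER_OPTIONS):
    """Return the value itself if allowed, else the closest option or None."""
    if value in options:
        return value
    matches = difflib.get_close_matches(value, options, n=1)
    return matches[0] if matches else None


print(suggest("malee"))  # male
```

A suggestion is only a hint: someone should confirm it before the sheet is changed.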

BONUS#

Try to run the application#

This application is written in Java. If you have Java installed on your computer, you should be able to start the program on your local machine.

  • To obtain the program go to http://download.systemsbiology.nl/unlock and download the fairds.jar to your computer

  • To start the program start a command line interface (Prompt, Shell, Terminal depending on your operating system).

    • For mac: Open the terminal app

    • For ubuntu: Open the terminal app

    • For windows: Open the Command Prompt app

  • Type: java -jar fairds.jar (When you are in the same folder as the jar file)

  • When you see the following line, the application has started… Tomcat started on port(s): 8083 ….

  • Now you can access the application using your browser at http://localhost:8083.
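If the browser cannot reach the page, a quick check of whether anything is listening on the port can help. This is a minimal sketch that assumes the default port 8083 from the log line above; the function name is made up for illustration.

```python
import socket


def is_listening(host="localhost", port=8083, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable.
        return False


if __name__ == "__main__":
    print("FAIR Data Station reachable:", is_listening())
```

A result of False right after starting the jar usually just means the application is still starting; wait for the Tomcat line and try again.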

RDF#

The metadata is converted to an RDF data file and can be queried using the SPARQL query language.

Loading the data#

We will load the dataset into GraphDB. To install GraphDB, go to https://www.ontotext.com/products/graphdb/graphdb-free/ and register for an installation (registration is free and required).

After successful installation you should be able to access the application at http://localhost:7200. To load the data into a database do the following:

  • Click on repositories

  • Create new repository

    • GraphDB Repository

  • Give it a name in the Repository ID*

    • Leave everything else as default

  • Top right choose repository change to the one you just created

  • Import

  • Import the RDF file

  • Click import

  • Leave everything at default and Click import again

  • The data should be loaded within a second

When you go back to the home screen (click GraphDB) you should see that your local active repository has a total of 1,143 statements.

Enable autocomplete#

To make life easier we will enable autocomplete for the SPARQL queries:

  • Click on the repository you just created

  • Click on settings

  • Click on autocomplete

  • Click on enable autocomplete

Explore the data#

Now that the data is loaded we can start exploring it, beginning with the Explore option:

  • Click on the repository you just created

  • Click on explore

  • Visual graph

  • In the Easy Graph bar type “Project” and select the “http://jermontology.org/ontology/JERMOntology#Project” URL.

  • Click on the “prj_HIV-Ghana” node

  • Follow the “hasPart” links to other nodes

Do you see the connections between the nodes and the excel sheet?

  • Click on one of the Observation Units (e.g., obs_XDRS176892)

  • A sidebar should appear

  • What properties are used and do you see different namespaces?

Query the data#

Now that we have explored the data we can start querying it with the SPARQL query language:

  • Click on the repository you just created

  • Click on SPARQL

  • In the query box type the following query:

To obtain all observation units

PREFIX jerm: <http://jermontology.org/ontology/JERMOntology#>
PREFIX ppeo: <http://purl.org/ppeo/PPEO.owl#>
SELECT *
WHERE {
  ?ou a ppeo:observation_unit .
}
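The same query can also be sent from a script, because GraphDB exposes each repository as a SPARQL endpoint at /repositories/<repository ID>. The sketch below uses only the Python standard library. The repository ID fairds-tutorial and the example IRI in the sample result are hypothetical; replace the ID with the one you chose when creating the repository.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

QUERY = """
PREFIX ppeo: <http://purl.org/ppeo/PPEO.owl#>
SELECT ?ou
WHERE {
  ?ou a ppeo:observation_unit .
}
"""


def build_request(repository_id, query, base="http://localhost:7200"):
    """Build a SPARQL Protocol POST request for a GraphDB repository."""
    return Request(
        f"{base}/repositories/{repository_id}",
        data=urlencode({"query": query}).encode("utf-8"),
        headers={"Accept": "application/sparql-results+json"},
    )


def values(results, variable):
    """Extract one variable's values from a SPARQL JSON result document."""
    bindings = results["results"]["bindings"]
    return [row[variable]["value"] for row in bindings if variable in row]


def run_query(repository_id, query):
    """Send the query to a running GraphDB instance and parse the JSON."""
    with urlopen(build_request(repository_id, query)) as response:
        return json.load(response)


# Shape of a SPARQL JSON result, with a hypothetical IRI:
sample = {
    "head": {"vars": ["ou"]},
    "results": {"bindings": [
        {"ou": {"type": "uri", "value": "http://example.org/obs_XDRS176892"}},
    ]},
}
print(values(sample, "ou"))
```

With GraphDB running, values(run_query("fairds-tutorial", QUERY), "ou") would return the IRIs of all observation units.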

To obtain all observation units that are female

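PREFIX jerm: <http://jermontology.org/ontology/JERMOntology#>
PREFIX ppeo: <http://purl.org/ppeo/PPEO.owl#>
# mixs: also needs a PREFIX line; use the namespace IRI listed for it in your repository.
SELECT *
WHERE {
  ?ou a ppeo:observation_unit .
  ?ou mixs:0000811 'female' .
}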

To obtain all observation units that are female and a trader

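PREFIX jerm: <http://jermontology.org/ontology/JERMOntology#>
PREFIX ppeo: <http://purl.org/ppeo/PPEO.owl#>
# mixs: and fair: also need PREFIX lines; use the namespace IRIs listed for them in your repository.
SELECT *
WHERE {
  ?ou a ppeo:observation_unit .
  ?ou mixs:0000811 'female' .
  ?ou fair:occupation 'Trader' .
}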

As you can see, string values in SPARQL are case sensitive, which makes proper standardisation crucial. For example, a query for ‘trader’ finds nothing in the dataset, whereas a query for ‘Trader’ does.
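The effect of case sensitivity is easy to reproduce outside the database. The Python sketch below compares exact matching with case-insensitive matching on a hypothetical list of occupation values; only the value Trader is taken from the dataset. In SPARQL a comparable case-insensitive match can be written with a filter such as FILTER(LCASE(STR(?occupation)) = "trader").

```python
# Hypothetical occupation values; only "Trader" is known from the dataset.
occupations = ["Trader", "Farmer", "Trader", "Teacher"]


def exact(values, wanted):
    """Match the way a plain SPARQL literal does: character for character."""
    return [value for value in values if value == wanted]


def ignoring_case(values, wanted):
    """Match after normalising both sides to the same case."""
    return [value for value in values if value.casefold() == wanted.casefold()]


print(len(exact(occupations, "trader")))          # 0
print(len(ignoring_case(occupations, "trader")))  # 2
```

Normalising at query time works, but validating values when the metadata is entered keeps every later query simple.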

Shex visualization#

It is also possible to visualize the content using shape expressions. This is however beyond the scope of this tutorial.

A visual representation of the demo dataset