The Cancer Genomics Linkage Application, funded by ANDS, enables the
re-use and integration of data available from public repositories, such as the
ICGC variant database or the DrugBank drug and drug target database, by
leveraging the Genomics Virtual Lab capability on the research cloud. Researchers such as Professor Andrew Biankin
and colleagues from the Garvan Institute of Medical Research are now able to
access genomic datasets of international importance and to integrate them with
their own clinical and genomic datasets in order to explore, discover and
validate key genomic abnormalities that cause cancer, using user-friendly
computational workflows. The project also provides the mechanism for such
researchers to publish their analyses and make them available for re-use by the
community.
The solutions developed for this project consist of
- Collection Manager
- Third Party Solution BioMAJ
- Data Galaxy Servlet
- Galaxy Data Link Tool
- Galaxy Server
- Workflow Galaxy Servlet
- Automatic generation of collections descriptions and their submission to RDA
- OAI-PMH Server
as shown in Figure 1 ("Overall Overview") and described in detail below.
Figure 1: Overall Overview
1. Collection Manager:
The Collection Manager is a web interface
accessing a MySQL database that allows Galaxy administrators and users to edit
and to curate collection (data), service (Galaxy instance) and workflow descriptions. Metadata of the
collection and workflow descriptions (e.g. title, list of associated sites,
collection rights, ANZSRC Codes) can be modified. Furthermore the Collection
Manager allows them to publish workflow descriptions to RDA. A detailed user guide can be found here.
Figure 2: Welcome to the Collection Manager Interface
2. Integration of Third Party Solution BioMAJ for Data
Synchronisation:
The download scheduler BioMAJ is used to mirror reference datasets such
as ICGC, DrugBank, etc. from public repositories. BioMAJ Watcher is its web interface. A shell script has been
developed to automatically send a POST request to the Data Galaxy Servlet whenever
BioMAJ downloads or updates a data library. The Data Galaxy Servlet is
described in the following section.
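The notification step can be illustrated with a short sketch. The project's script is a shell script and is not shown in this post; the Python below is only a stand-in for the same idea. The servlet URL, the form field names and the example data path are all hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: the real servlet URL is not given in this post.
SERVLET_URL = "http://localhost:8080/DataGalaxyServlet"


def build_notification(bank_name, release, data_path, url=SERVLET_URL):
    """Build (but do not send) a form-encoded POST describing an updated bank."""
    # Field names are illustrative assumptions, not the servlet's real parameters.
    fields = {"bank": bank_name, "release": release, "path": data_path}
    body = urlencode(fields).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Example values are made up for illustration.
request = build_notification("icgc", "2013-01", "/data/biomaj/icgc/current")
# Passing `request` to urllib.request.urlopen() would actually send it.
```

Building the request separately from sending it makes it easy to check what a BioMAJ post-process step would transmit before pointing it at a live servlet.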
Figure 3: BioMAJ Watcher web Interface - general functionalities
3. Data Galaxy Servlet:
The Data Galaxy Servlet is a Java servlet
that creates a new record accessible by the Collection Manager containing all the
information and metadata related to the mirrored reference dataset (Figure 4). An email
informs the owner that the reference dataset description is ready to be modified and published to RDA.
Figure 4: Collection Manager - Data Library Overview
4. Galaxy Data Link Tool:
The Galaxy Data Link Tool is a script written in Python that links the reference
datasets downloaded by BioMAJ with Galaxy (Figure 5). This tool uploads the specified data
files as a Galaxy Data Library resource. The files are linked to the specified
data path. If the specified path consists of a directory of files, this
directory structure will be automatically mirrored in Galaxy. This tool can be
used as a post-process step in conjunction with the data synchronisation tool
BioMAJ.
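The directory mirroring described above can be sketched as follows. This is not the Galaxy Data Link Tool itself: the function name is hypothetical, and the sketch only works out which library folder each file would be linked into. The step that registers the files with Galaxy is left out.

```python
import os


def plan_library_links(data_path):
    """Return (library_folder, absolute_file_path) pairs for files under data_path.

    library_folder is the folder path relative to data_path, using "/" as the
    separator; it is "" for files that sit directly in data_path.
    """
    data_path = os.path.abspath(data_path)
    if os.path.isfile(data_path):
        # A single file is linked at the top level of the library.
        return [("", data_path)]
    links = []
    for root, dirs, files in os.walk(data_path):
        dirs.sort()  # visit sub-directories in a stable order
        rel = os.path.relpath(root, data_path)
        folder = "" if rel == "." else rel.replace(os.sep, "/")
        for name in sorted(files):
            links.append((folder, os.path.join(root, name)))
    return links
```

Each pair says where a file would appear in the data library and which path on disk it points to, so the files are referenced in place rather than copied.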
Figure 5: Data Libraries in Galaxy linked by Galaxy Data Link Tool
5. Galaxy Server:
To allow automated feeds of workflow RIF-CS records
from Galaxy to RDA, an extra button has been implemented in Galaxy, as shown in Figure 6. To implement
this button, the galaxy-dist code hosted on BitBucket was forked and modified. The button initiates a POST request to the Workflow Galaxy Servlet;
the request contains all the
information and metadata related to the published workflow. The Workflow
Galaxy Servlet is described in the following section.
Figure 6: New Galaxy feature: Publish workflow to RDA
6. Workflow Galaxy Servlet:
The Workflow Galaxy Servlet is a Java servlet that manages requests
initiated through the Galaxy server described above. The servlet creates a new record, accessible via the Collection
Manager (Figure 7), that contains all the information and metadata related to the Galaxy
workflow (Figure 8). An email informs the owner that the workflow
description is ready to be modified and published to RDA.
Figure 7: Workflow description entry with the Collection Manager
Figure 8: Galaxy Workflow
7. Automatic generation of collections descriptions and their submission to
RDA:
Metadata stored in the MySQL database are aggregated to form a
collection description and written to a compliant RIF-CS XML file using the ANDS-supplied
RIF-CS Java library. Persistent record identifiers are assigned and
the RIF-CS files are made accessible to an RDA harvest data source.
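To give a sense of what such a record looks like, here is a simplified sketch that builds a minimal RIF-CS collection description. The project uses the ANDS-supplied RIF-CS Java library; this sketch uses Python's standard XML module instead. It includes only a handful of elements and is not validated against the schema. The key, group and originating source values in the usage line are made up.

```python
import xml.etree.ElementTree as ET

# RIF-CS registryObjects namespace (as used by ANDS, to the best of my knowledge).
RIFCS_NS = "http://ands.org.au/standards/rif-cs/registryObjects"


def build_collection_record(key, group, source, title, description):
    """Return a minimal RIF-CS-style collection record as an XML string."""
    ET.register_namespace("", RIFCS_NS)

    def tag(name):
        # Qualify an element name with the RIF-CS namespace.
        return "{%s}%s" % (RIFCS_NS, name)

    root = ET.Element(tag("registryObjects"))
    obj = ET.SubElement(root, tag("registryObject"), {"group": group})
    ET.SubElement(obj, tag("key")).text = key  # persistent record identifier
    ET.SubElement(obj, tag("originatingSource")).text = source
    coll = ET.SubElement(obj, tag("collection"), {"type": "dataset"})
    name = ET.SubElement(coll, tag("name"), {"type": "primary"})
    ET.SubElement(name, tag("namePart")).text = title
    desc = ET.SubElement(coll, tag("description"), {"type": "brief"})
    desc.text = description
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only.
record_xml = build_collection_record(
    "example.org/collection/icgc",
    "Example Group",
    "http://example.org/oai",
    "International Cancer Genome Consortium",
    "Mirrored reference dataset.",
)
```

A real record would carry the other metadata held in the Collection Manager, such as collection rights and ANZSRC codes, which this sketch omits.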
8. OAI-PMH Server (Records indexing):
The OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) server
allows RIF-CS XML files generated by the Collection Manager to be exposed
as items in an OAI data repository and made available for harvesting by RDA. Once the collection descriptions are harvested, they are available through RDA, as shown in Figure 9 below.
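A harvester asks an OAI-PMH server for records with a plain HTTP request that carries a `verb` parameter. The sketch below builds such a `ListRecords` request URL. The base URL is hypothetical, and the `rif` metadata prefix is an assumption about how this server is configured.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="rif", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for the given repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting by set.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical base URL, for illustration only.
harvest_url = list_records_url("http://localhost:8080/oai")
```

Fetching this URL from a running OAI-PMH server returns an XML response whose records contain the exposed metadata, which in this project would be the RIF-CS files.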
Figure 9: Research Data Australia - Published Data Library "International Cancer Genome Consortium"
The Cancer Genomics Linkage Application has been developed and tested on
a development server from the Genomics Virtual Laboratory Project, while the production
servers are being deployed, tuned and configured on the Research Cloud. Some of
the data sources and tools are currently available at the Garvan Institute and
will be deployed to the other GVL nodes as they become available during
2013. The variant detection workflow developed for the Garvan Institute
will be made publicly available and published on RDA as soon as the tools
developed by Professor Sean Grimmond's group are published in the research
literature.
Licensing