The Cancer Genomics Linkage Application will enable the integration and re-use of the cancer genomics data available from public repositories such as the International Cancer Genome Consortium (ICGC). This will be accomplished through the capability being developed by the “Early Activity” of the Genomics Virtual Laboratory (GVL-EA). It will enable researchers, such as Professors Andrew Biankin, John Mattick (Garvan Institute for Medical Research) or Sean Grimmond (Queensland Centre for Medical Genomics), to access genomic datasets of international importance and to integrate them with their own clinical and genomic datasets in order to explore, discover and validate key genomic abnormalities that cause cancer. The product will further provide the mechanism for such researchers to publish their analyses and make them available for re-use by the community.

The product aims to let biologists and clinicians easily integrate their own research data with datasets from multiple data sources. Integrating the datasets into a common location, and enabling access and mining with best-practice workflow tools, will allow Australian cancer researchers to accelerate their discovery processes and to be internationally competitive. Although this project has a particular focus on pancreatic cancer research as carried out by the Australian Pancreatic Cancer Genome Initiative (APGI), the application can also support the wider cancer research community.

Download the application from here.

Friday, 15 March 2013

Final Product



The Cancer Genomics Linkage Application, funded by ANDS, enables the re-use and integration of data available from public repositories such as the ICGC variant database or the DrugBank drug and drug target database by leveraging the Genomics Virtual Lab capability on the research cloud. Researchers, such as Professor Andrew Biankin and colleagues from the Garvan Institute for Medical Research, are now able to access genomic datasets of international importance and to integrate them with their own clinical and genomic datasets in order to explore, discover and validate key genomic abnormalities that cause cancer, using user-friendly computational workflows. The project further provides the mechanism for such researchers to publish their analyses and make them available for re-use by the community.

The solution developed for this project consists of
  1. Collection Manager
  2. Third Party Solution BioMAJ
  3. Data Galaxy Servlet
  4. Galaxy Data Link Tool
  5. Galaxy Server
  6. Workflow Galaxy Servlet
  7. Automatic generation of collections descriptions and their submission to RDA
  8. OAI-PMH Server
as shown in Figure 1 ("Overall Overview") and described in detail in the following sections.
Figure 1: Overall Overview



1. Collection Manager

The Collection Manager is a web interface to a MySQL database that allows Galaxy administrators and users to edit and curate collection (data), service (Galaxy instance) and workflow descriptions. Metadata of the collection and workflow descriptions (e.g. title, list of associated sites, collection rights, ANZSRC codes) can be modified. Furthermore, the Collection Manager allows them to publish workflow descriptions to Research Data Australia (RDA). A detailed user guide can be found here.

Figure 2: Welcome to the Collection Manager Interface
 

2. Integration of Third Party Solution BioMAJ for Data Synchronisation


The download scheduler BioMAJ is used to mirror reference datasets such as ICGC and DrugBank from public repositories. BioMAJ Watcher is its web interface. A shell script has been developed that automatically sends a POST request to the Data Galaxy Servlet whenever BioMAJ downloads or updates a data library. The Data Galaxy Servlet is described in the following section.
Figure 3: BioMAJ Watcher web interface - general functionalities
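
The notification step can be illustrated with a short sketch. The project itself uses a shell script; the Python version below is only an illustration, and the servlet URL and the form field names (bank, release, path) are assumptions, since the actual endpoint and parameters are not given in this post. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: the real Data Galaxy Servlet URL is not given here.
SERVLET_URL = "http://localhost:8080/data-galaxy-servlet"


def build_notification(bank_name, release, data_path, url=SERVLET_URL):
    """Build (but do not send) a form-encoded POST describing a mirrored dataset."""
    # Field names are illustrative assumptions, not the servlet's real parameters.
    fields = {"bank": bank_name, "release": release, "path": data_path}
    body = urlencode(fields).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_notification("icgc", "2013-03", "/data/biomaj/icgc/current")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the returned object to urllib.request.urlopen would actually send it, for example from a BioMAJ post-process hook.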

3. Data Galaxy Servlet

The Data Galaxy Servlet is a Java servlet that creates a new record accessible by the Collection Manager containing all the information and metadata related to the mirrored reference dataset (Figure 4). An email informs the owner that the reference dataset description is ready to be modified and published to RDA.

Figure 4: Collection Manager - Data Library Overview

4. Galaxy Data Link Tool

The Galaxy Data Link Tool is a script written in Python that links the reference datasets downloaded by BioMAJ with Galaxy (Figure 5). This tool uploads the specified data files as a Galaxy Data Library resource. The files are linked to the specified data path rather than copied. If the specified path is a directory, its directory structure is automatically mirrored in Galaxy. The tool can be used as a post-processing step in conjunction with the data synchronisation tool BioMAJ.

Figure 5: Data Libraries in Galaxy linked by Galaxy Data Link Tool
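
To make the directory mirroring concrete, here is a minimal sketch of how a data path could be turned into a folder layout for a data library. It is not the tool's actual code, and it makes no Galaxy calls: the function name plan_library_layout is hypothetical, and it only computes which library folder each file would be linked under.

```python
import os
import tempfile


def plan_library_layout(data_path):
    """Map each library folder (relative to data_path) to the files linked in it.

    A single file yields one entry under the root folder "".
    """
    if os.path.isfile(data_path):
        return {"": [os.path.abspath(data_path)]}
    layout = {}
    for dirpath, dirnames, filenames in os.walk(data_path):
        dirnames.sort()  # deterministic traversal order
        rel = os.path.relpath(dirpath, data_path)
        folder = "" if rel == "." else rel.replace(os.sep, "/")
        layout[folder] = [os.path.join(dirpath, name) for name in sorted(filenames)]
    return layout


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "icgc", "pancreas"))
    open(os.path.join(root, "icgc", "pancreas", "variants.tsv"), "w").close()
    open(os.path.join(root, "README.txt"), "w").close()
    for folder, files in plan_library_layout(root).items():
        print(repr(folder), [os.path.basename(f) for f in files])
```

A real tool would then create each folder in the Galaxy data library and register the files by path instead of copying them.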




5. Galaxy Server

To allow automated feeds of workflow RIF-CS records from Galaxy to RDA, an extra button has been implemented in Galaxy, as shown in Figure 6. To implement this button, the galaxy-dist code hosted on Bitbucket was forked and modified. The button sends a POST request, containing all the information and metadata related to the published workflow, to the Workflow Galaxy Servlet. The Workflow Galaxy Servlet is described in the following section.

Figure 6: New Galaxy feature: Publish workflow to RDA


6. Workflow Galaxy Servlet


The Workflow Galaxy Servlet is a Java servlet that handles the requests initiated from the Galaxy server described above. The servlet creates a new record, accessible via the Collection Manager (Figure 7), that contains all the information and metadata related to the Galaxy workflow (Figure 8). An email informs the owner that the workflow description is ready to be modified and published to RDA.

Figure 7: Workflow description entry with the Collection Manager
Figure 8: Galaxy Workflow
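
The request handling described above (create a draft record, then notify the owner) can be sketched as follows. The real component is a Java servlet backed by MySQL; this Python version is only an illustration, and the field names (title, owner), the in-memory record list and the email wording are all assumptions. The email is composed but not sent.

```python
from email.message import EmailMessage
from urllib.parse import parse_qs


def handle_publish_request(body, records):
    """Store a draft record from a form-encoded POST body and draft the owner email."""
    # parse_qs returns lists of values; keep the first value of each field.
    form = {key: values[0] for key, values in parse_qs(body).items()}
    record = {
        "id": len(records) + 1,
        "type": "workflow",
        "title": form.get("title", ""),
        "owner": form.get("owner", ""),
        "status": "draft",  # stays a draft until the owner reviews and publishes it
    }
    records.append(record)

    message = EmailMessage()
    message["To"] = record["owner"]
    message["Subject"] = "Workflow description ready for review"
    message.set_content(
        f"The description for '{record['title']}' is ready to be modified "
        "and published to RDA."
    )
    return record, message


records = []
record, message = handle_publish_request(
    "title=Variant+detection&owner=researcher%40example.org", records
)
print(record["status"], message["To"])
```

Keeping the record in a draft state until the owner has reviewed it matches the flow described here, where publication to RDA is a separate, deliberate step.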

7. Automatic generation of collection descriptions and their submission to RDA

Metadata stored in the MySQL database are aggregated to form a collection description and written to a RIF-CS-compliant XML file using the ANDS-supplied RIF-CS Java library. Persistent record identifiers are assigned and the RIF-CS files are made accessible to an RDA harvest data source.

8. OAI-PMH Server (Records indexing)

The OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) server exposes the RIF-CS XML files generated by the Collection Manager as items in an OAI data repository and makes them available for harvesting by RDA. Once the collection descriptions have been harvested, they are available through RDA, as shown in Figure 9 below.

Figure 9: Research Data Australia - Published Data Library "International Cancer Genome Consortium"
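
For readers unfamiliar with OAI-PMH, harvesting boils down to plain HTTP requests with a verb parameter. The sketch below builds a ListRecords request URL; the base URL is hypothetical, and the metadata prefix "rif" is an assumption about how the RIF-CS format is registered on the server.

```python
from urllib.parse import urlencode

# Hypothetical base URL: the project's OAI-PMH endpoint is not given here.
BASE_URL = "http://localhost:8080/oai"


def list_records_url(base_url=BASE_URL, metadata_prefix="rif", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: it replaces
        # metadataPrefix when continuing a partially delivered list.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


print(list_records_url())
print(list_records_url(resumption_token="page-2"))
```

A harvester repeats the request with each resumption token it receives until the server returns an empty token.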
  
The Cancer Genomics Linkage Application has been developed and tested on a development server from the Genomics Virtual Laboratory Project, while the production servers are being deployed, tuned and configured on the Research Cloud. Some of the data sources and tools are currently available at the Garvan Institute and will be deployed to the other GVL nodes as those nodes become available during 2013. The variant detection workflow developed for the Garvan Institute will be made publicly available and published on RDA as soon as the tools developed by Professor Sean Grimmond’s group are published in the research literature.

Licensing

All documents and source code are made available under the GPLv3 licence via Google Code - Project AP27 Cancer Genomics Linkage Application.
 

Thursday, 14 March 2013

User Acceptance Testing

In the final phase of this project, the application was assessed by users representing three user groups. The following describes the test users, the tasks each user group performed with the application, and the feedback and outcomes from the user acceptance test.

Test user (groups)

User 1 is a postdoc from the Department of Computing and Information Systems at the University of Melbourne. This user represents the general bioinformatician, who has both basic biological knowledge and strong programming skills.

User 2 is a postdoc from the University of Queensland, representing biologists and clinicians who want to easily integrate their own research data with datasets from multiple data sources and analyse the data. Members of this user group are generally researchers with very limited or no programming skills.

User 3 is a system administrator from the University of Queensland. This user represents the GVL Galaxy site administrator who works primarily on the server side, managing the Galaxy workflow system, environment and data libraries.

Tests performed by users

User 1 - the main task for this user group is to design and generate scientific workflows/protocols using the Galaxy workflow system, and then to publish these workflows to the research community so that other researchers can apply them to their own data.

User 2 - the main task for this user group is to reuse existing workflows for their own research, modify existing workflows and create new workflows. Additional important features for this user group include the:
  • ability to publish and cite workflows.
  • ability to assign a Digital Object Identifier (DOI) to workflows.
  • ability to rerun shared workflows and cite them using the DOI in their publications.
  • ability to generate important datasets for the community and share them through Galaxy.

User 3 - the main task for this user group is to manage and update the data libraries shared with the community via the Galaxy workflow system. This user works primarily on the server side on tasks such as configuring user, tool or data repository access.

Testing outcomes

The Collection Manager has been introduced to User 1 via video conference and in person to Users 2 and 3. All functionalities have been presented to all users.

User testing identified some issues with the Collection Manager. The main issues were:
  • concurrent access to data records in the Collection Manager resulted in database errors when saving the record
  • email notification functionality
All issues have now been fixed.

Individual User Experiences

User 1 found the Collection Manager interface very user-friendly and easy to use. User 1 commented that the application will be another means for them to promote their Galaxy workflows in the near future, which will further increase awareness of their workflows in the research community. An account has been created for this user on the production server and they have published two workflows to RDA using the Collection Manager.

User 2 found the Collection Manager impressive. In their opinion it is a very useful tool to easily share and promote workflows with the research community. An account has been created for them on the production server and this user will spend a couple of days testing the Collection Manager to see how it could be merged with their current working style.

User 2 expressed further interest in the DOI minting option provided by the Collection Manager. This feature, which is not provided by the Galaxy workflow system, was identified as a major advantage of using the application. Further discussion with User 2 is needed to evaluate at which point in the publishing process a workflow should be assigned a DOI.

User 3, who has a background in teaching web interface design, found the Collection Manager interface very easy to understand and use. They also found value in using this interface to help the public identify which Galaxy instances are available on the web and what data libraries and workflows they host. Many public users do not know where to go to use specific tools with specific data. Having a general repository indexed by Google will ease this process. In addition, User 3 raised a few points about the web interface, including providing more help text for fields such as ANZSRC and Rights when users are creating a Data record. These changes are currently being addressed.

Wednesday, 13 March 2013

What does the application do? The bigger picture


Although cancers may look the same under the microscope, they behave quite differently at the genomic level. By identifying these genetic differences, researchers can start to understand which treatments work for particular patients and administer those first. Understanding which cancers don’t respond to treatment is also vital, as this allows researchers to focus on the molecular mechanisms that are responsible and to design new drugs that attack the particular mechanisms that cause the cancer to be lethal [1].


In this context, the primary goal of the International Cancer Genome Consortium (ICGC) is to generate comprehensive catalogues of genomic abnormalities in tumours from 50 different cancer types and/or subtypes which are of clinical and societal importance across the globe. The Australian component of that Consortium, the Australian Pancreatic Cancer Genome Initiative (APGI), is delivering the genomic data associated with pancreatic tumour samples. For this, the APGI uses the best clinical material available, with well-characterised and accurately annotated clinico-pathological, treatment and outcome data acquired prospectively. However, discovering variations and similarities within the genomes of a given cancer collection, and comparing these to other datasets, remains a significant challenge for biologist and clinician researchers. The effective re-use of datasets of international importance is limited by the ability of research biologists and clinicians to access and use computational and data infrastructure. A researcher performing such an analysis is currently expected to be conversant with, at a minimum, programming, command line scripting, data management, high performance computing, network-based communications and visualisation. Additionally, they must ensure that the steps in the analysis are recorded in sufficient detail for the results to be reproduced at least in-house, and ideally they would document the analysis in such a way as to allow publication of the method(s).

The Cancer Genomics Linkage Application, funded by ANDS, enables the re-use and integration of data available from public repositories such as the ICGC variant database or the DrugBank database of drugs and drug targets by leveraging the Genomics Virtual Lab capability on the research cloud. Researchers, such as Professor Andrew Biankin and colleagues from the Garvan Institute for Medical Research, are now able to access genomic datasets of international importance and to integrate them with their own clinical and genomic datasets in order to explore, discover and validate key genomic abnormalities that cause cancer, using user-friendly computational workflows. The project further provides the mechanism for such researchers to publish their analyses and make them available for re-use by the community.

The project enables the in-depth interrogation of cancer genomic datasets and allows their comparison with other genomic datasets by giving research biologists and clinicians direct access to them through the Genomics Virtual Lab (GVL). The key benefits to the Australian genomics community gained from this project are:
  • Integration of analysis tools, public and private datasets, and visualisation platforms: streamlining research and reducing time from experiment to publication 
  • Enhanced collaboration between researchers and across the community through shared datasets, workflows and customised toolsets 
  • Reproducibility and research provenance: workflow engines record all aspects of an experiment, allowing for confidence in repeatability and for the publication of workflows along with the resulting data. The application enables researchers to mint a Digital Object Identifier (DOI) for their workflows.

References

[1] http://www.youtube.com/watch?v=ctohpNzTFvg