Data Platforms
This repository gathers components that take part in the Commons Alliance. These components describe interfaces and features that can be assembled into “data platforms” for storing data and metadata and for performing reproducible analyses on them.
This is a living document.
Note: if you are viewing this on GitHub, the images may be cached. For the current version, please visit:
https://databiosphere.github.io/data-platforms/
For more background, read the Data Biosphere post.
Visit the DataBiosphere GitHub organization.
Metadata Serialization
Communication between data platforms requires that metadata are serialized in a useful and predictable manner. This document describes approaches and case studies in use by some components.
View the Metadata Serialization document.
Identifier Interoperability
When the same metadata are present in multiple locations, it is critical to provide guarantees of identity that are useful and portable. This document describes approaches to presenting and using interfaces that allow identifiers to be usefully exchanged.
View the Identifier Interoperability document.
Prototype
The various components coordinate to create a platform useful for data analysis.
Digital Object Catalog
Provides clients and services access to resources available in object stores. Digital objects can be files and the catalog itself maintains a registry of locations to find the files, as well as minimal metadata.
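As a rough illustration of the registry idea described above, the following sketch keeps an in-memory mapping from an object identifier to its known locations and a small amount of metadata. The class names, field names, and URLs are hypothetical and are not taken from any Commons Alliance implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DigitalObject:
    """Hypothetical catalog record: an identifier, locations, minimal metadata."""
    object_id: str
    urls: List[str]
    size: int
    checksum: str


class DigitalObjectCatalog:
    """Hypothetical in-memory registry of where each digital object can be found."""

    def __init__(self):
        self._records = {}

    def register(self, obj):
        # Store (or overwrite) the record under its identifier.
        self._records[obj.object_id] = obj

    def add_location(self, object_id, url):
        # Record an additional copy of an already registered object.
        record = self._records[object_id]
        if url not in record.urls:
            record.urls.append(url)

    def locations(self, object_id):
        # Return a copy so callers cannot mutate the registry by accident.
        return list(self._records[object_id].urls)
```

A real catalog would persist its records and expose them over a service interface; the sketch only shows the shape of the data.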
GUID Resolver
Allows globally unique identifiers to be “resolved” to digital objects. For more information please refer to Identifier Interoperability.
Namespace Service
Identifiers can be given different namespaces or “prefixes”. The namespace service allows commons members to easily manage GUIDs across projects and domains. For more information please refer to Identifier Interoperability.
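One way to picture how a namespace service and a GUID resolver could cooperate is sketched below. It assumes identifiers of the form `prefix:local-id` and a table mapping each prefix to a resolver base URL; both the format and the names used here are illustrative assumptions, not the interface of indexd or any other listed component.

```python
class NamespaceService:
    """Hypothetical registry mapping identifier prefixes to resolver base URLs."""

    def __init__(self):
        self._prefixes = {}

    def register_prefix(self, prefix, resolver_url):
        # Normalise the base URL so joining never produces a double slash.
        self._prefixes[prefix] = resolver_url.rstrip("/")

    def resolve(self, guid):
        # Split "prefix:local-id" at the first colon only.
        prefix, sep, local_id = guid.partition(":")
        if not sep or not local_id:
            raise ValueError("expected an identifier of the form 'prefix:local-id'")
        if prefix not in self._prefixes:
            raise KeyError("unknown namespace prefix: " + prefix)
        return self._prefixes[prefix] + "/" + local_id
```

For example, after registering the hypothetical prefix `dg.EX` against `https://resolver.example.org/objects`, resolving `dg.EX:1234` yields the URL at which that resolver would be asked for the digital object.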
Data Access
Once data have been discovered they must be localized, which requires interacting with object stores and performing authentication, authorization, downloading, and transfer as necessary.
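The localization step can be sketched as a small function that checks authorization and then tries each known location until it finds one it can fetch. The function name, the `is_authorized` callable, and the per-scheme `fetchers` mapping are all hypothetical; a real implementation would call an object store client and an access control service instead.

```python
from urllib.parse import urlparse


def localize(urls, user, is_authorized, fetchers):
    """Hypothetical sketch: authorize, then fetch from the first usable location.

    urls          -- candidate locations for one digital object
    is_authorized -- callable taking a user and returning True or False
    fetchers      -- mapping from URL scheme (e.g. "s3") to a download callable
    """
    if not is_authorized(user):
        raise PermissionError("user is not authorized to access this object")
    for url in urls:
        fetch = fetchers.get(urlparse(url).scheme)
        if fetch is not None:
            # The fetcher is expected to return the local path of the copy.
            return fetch(url)
    raise LookupError("no fetcher is available for any known location")
```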
Access Control
To guarantee authority and authenticity of requests, some access control services are provided. These services will at least be able to identify a client and delegate authority to the access control system of choice.
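The two responsibilities named above, identifying a client and delegating the decision, can be sketched as a thin wrapper around two pluggable callables. Everything here (the class name, the token lookup, the backend signature) is an assumption made for illustration and does not describe sam, bond, or fence.

```python
class AccessControlService:
    """Hypothetical front door: identifies a client, then delegates the decision."""

    def __init__(self, identify, backend):
        self._identify = identify  # callable: token -> client id, or None
        self._backend = backend    # callable: (client_id, action, resource) -> bool

    def check(self, token, action, resource):
        client_id = self._identify(token)
        if client_id is None:
            # Unidentified clients are never passed on to the backend.
            return False
        return bool(self._backend(client_id, action, resource))
```

Keeping the backend behind a plain callable is what lets a platform swap in the access control system of its choice.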
Analytical Engine
Software which can orchestrate and execute computational tasks in heterogeneous computing environments.
Tool Repository
A resource which contains templates of reusable computational tasks that can be directed at new data, and then executed by the Analytical Engine.
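To make the relationship between the Tool Repository and the Analytical Engine concrete, the sketch below stores command templates and fills them in for new inputs, then hands the resulting commands to an executor. The class, the `$placeholder` template syntax (Python's `string.Template`), and the `executor` callable are illustrative assumptions; the listed implementations use workflow languages and container images rather than plain command strings.

```python
from string import Template


class ToolRepository:
    """Hypothetical store of reusable command templates."""

    def __init__(self):
        self._templates = {}

    def publish(self, name, command_template):
        self._templates[name] = Template(command_template)

    def instantiate(self, name, **inputs):
        # substitute() raises KeyError if a placeholder is left unfilled.
        return self._templates[name].substitute(**inputs)


def run_tasks(commands, executor):
    """Hypothetical engine loop: pass each command to an executor callable."""
    return [executor(command) for command in commands]
```

The same published template can be pointed at any number of new input files, which is the reuse the Tool Repository is meant to provide.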
Workspaces
Clients accessing a commons infrastructure should be able to manage data for secondary and tertiary data analysis.
Indexing and Search
Data in commons infrastructure should be findable using search mechanisms. Indexing makes data available for search.
Ontology
A controlled vocabulary informs indexers and/or querying applications to make metadata usable.
Metadata Indexer
Metadata made available by a platform is indexed into a store. Indexers allow data to be made findable using a structured document scheme.
Metadata Querying
Once metadata have been indexed into a platform, these indices are made available by services that allow queries to be formed against the metadata.
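The indexing and querying steps can be illustrated with a very small inverted index: each (field, value) pair points to the documents that carry it, and a query intersects the sets for the requested pairs. This is a minimal sketch under the assumption of flat key/value metadata and exact, case-insensitive matching; the class and field names are hypothetical and it is not how the listed indexers are implemented.

```python
from collections import defaultdict


class MetadataIndex:
    """Hypothetical inverted index over flat key/value metadata documents."""

    def __init__(self):
        self._index = defaultdict(set)

    def index(self, doc_id, metadata):
        # Each (field, normalised value) pair points at the documents having it.
        for field, value in metadata.items():
            self._index[(field, str(value).lower())].add(doc_id)

    def query(self, **criteria):
        # Intersect the document sets of every requested (field, value) pair.
        matches = None
        for field, value in criteria.items():
            ids = self._index.get((field, str(value).lower()), set())
            matches = ids if matches is None else matches & ids
        return sorted(matches or [])
```

For instance, after indexing two documents that share an assay but differ in species, a query on the assay alone returns both identifiers, while adding the species narrows the result to one.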
Portal
Commons infrastructure should provide interfaces to make data easily findable. Once data have been found in a portal, they can be added to a workspace.
Application
Applications combine a variety of Commons components to carry out specific tasks.
Commons Alliance Components
Source Code Repository Table
Links to source code repositories for implementations are provided below:
| Component | Broad | UChicago CDIS | UCSC CGP |
|---|---|---|---|
| Digital Object Catalog | | | |
| GUID Resolver | | indexd* | dos-azul-lambda* |
| Namespace Service | | indexd* | |
| Data Access | | fence | cgp-data-store |
| Access Control | | | |
| Authorization | sam, bond | fence | |
| Authentication | sam, bond | fence | |
| Analytical Engine | Cromwell, Leonardo | | toil |
| Tool Repository | Agora* | | Dockstore* |
| Workspaces | Rawls | jupyterhub | |
| Indexing and Search | Orchestration | | |
| Ontology | | datadictionary | |
| Metadata Indexer | Orchestration | sheepdog | azul-indexer |
| Metadata Querying | Orchestration | peregrine | azul-webservice |
| Portal | Firecloud | windmill | boardwalk |
| Application | | | xena |
Applications marked with a * implement a standard interface being developed with the GA4GH.
Clients can interact with these applications using an open protocol:
- indexd, dos-azul-lambda and Cromwell implement the Data Object Service.
- Dockstore and Agora implement the Tool Registry Service.
- Cromwell implements the Workflow Execution Service.
UChicago CDIS
The University of Chicago CDIS group presents software for easily managing the submission and access control of bioinformatics and medical informatics data in cloud environments.
UC Santa Cruz Computational Genomics Platform
Broad Institute
This section is in progress
Development
This document is under active development. If you feel misrepresented or something has been miscommunicated, please open an issue or make a Pull Request!
Editing diagrams
The program used to edit the “dia” files is dia.
GitHub caches images when it displays READMEs, so be sure to check the actual file if an image seems out of date!