ClinVar Ingest
Batch ETL pipeline to mirror ClinVar releases into the Jade Data Repository.
How this ingest process works
Ingest orchestration is driven by Argo. A stand-alone
WorkflowTemplate defines the steps:
- Generate a temporary local volume
- Download the latest ClinVar release from the ClinVar FTP server to the volume
- In parallel:
  - Upload the raw release to GCS using gsutil
  - Extract the release into JSON-list:
    - Generate another local volume to store extracted data
    - Run the compressed XML release through our XML -> JSON tool, storing the outputs on the 2nd volume
    - Upload the JSON-list files to GCS
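The shape of the steps above can be sketched in plain Python. This is only an illustration of the control flow, not the Argo WorkflowTemplate or the actual XML -> JSON tool: every function name is hypothetical, the "download" writes a tiny stand-in file instead of contacting the FTP server, the "upload" copies into a local directory instead of calling gsutil, and the `Record` element name is an assumption rather than the real ClinVar schema.

```python
import gzip
import json
import shutil
import tempfile
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def download_release(volume: Path) -> Path:
    """Hypothetical stand-in for the FTP download: writes a tiny gzipped XML file."""
    release = volume / "release.xml.gz"
    xml = b"<ReleaseSet><Record id='1'/><Record id='2'/></ReleaseSet>"
    with gzip.open(release, "wb") as f:
        f.write(xml)
    return release


def upload(src: Path, dest_dir: Path) -> Path:
    """Hypothetical stand-in for a gsutil upload: copies into a local directory."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(src, dest_dir / src.name))


def extract_to_json_list(release: Path, volume: Path) -> Path:
    """Stream the compressed XML and write one JSON object per line."""
    out = volume / "records.json"
    with gzip.open(release, "rb") as xml_in, open(out, "w") as json_out:
        for _, elem in ET.iterparse(xml_in, events=("end",)):
            if elem.tag == "Record":  # assumed element name, for illustration
                json_out.write(json.dumps(dict(elem.attrib)) + "\n")
                elem.clear()  # drop the element's contents once it is written
    return out


def extract_and_upload(release: Path, bucket: Path) -> Path:
    """Extraction branch: second volume, XML -> JSON-list, then upload."""
    with tempfile.TemporaryDirectory() as second_volume:
        extracted = extract_to_json_list(release, Path(second_volume))
        return upload(extracted, bucket / "extracted")


def run_ingest(bucket: Path) -> tuple[Path, Path]:
    """Download once, then run the raw upload and the extraction in parallel."""
    with tempfile.TemporaryDirectory() as first_volume:
        release = download_release(Path(first_volume))
        with ThreadPoolExecutor(max_workers=2) as pool:
            raw = pool.submit(upload, release, bucket / "raw")
            extracted = pool.submit(extract_and_upload, release, bucket)
            return raw.result(), extracted.result()
```

Calling `run_ingest(Path("some-local-dir"))` returns the paths of the copied raw archive and the JSON-list file. Both temporary directories are removed when their branch finishes, which mirrors the temporary volumes in the workflow.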
We plan to add further steps to the workflow in the near future:
- Dataflow to transform the raw extracted JSON into the desired output schema
- BigQuery jobs to diff new data against the rows already in dev
- Jade API calls to ingest the raw archive and transformed JSON rows