# Identifier Interoperability in the DSS

This case study demonstrates how Data Objects offered by the DSS can be assigned globally unique identifiers, providing them with persistence and the ability to be found across platforms. It is meant to support the development of a common approach to [Identifier Interoperability](https://github.com/DataBiosphere/identifier-interoperability) and refers to Use Cases described there.

To enable this demonstration, a development instance of the [DSS](https://github.com/HumanCellAtlas/data-store) has been prepared. The DSS is an Open Source cloud storage solution that enables replication across cloud environments. The data loaded are for demonstration only and are not accessible without proper credentials.

The URLs used and identifiers issued in this case study should be considered ephemeral and for demonstration only.

## Accessing Data Objects from the DSS

* [Use Case 1.1 Get a Data Object by Data Object Identifier](https://github.com/DataBiosphere/identifier-interoperability#1.1)

The DSS provides indices to data replicated across cloud storage environments. To ease interoperability, a subset of its features is provided as a Data Object Service (DOS), which enables basic listing and getting of items. We'll use the `ga4gh.dos.client` Python client to first get some data from the DSS.

This DOS instance is backed by [dos-azul-lambda](https://github.com/DataBiosphere/dos-azul-lambda), which presents DSS data in a file-based index.

In [19]:
dos_azul_url = "https://5ybh0f5iai.execute-api.us-west-2.amazonaws.com/api"

from ga4gh.dos.client import Client

client = Client(dos_azul_url)
c = client.client
models = client.models

Now that we have instantiated a client, we can make requests against the service to find a Data Object.

In [62]:
ListDataObjectsRequest = models.get_model('ListDataObjectsRequest')
list_request = ListDataObjectsRequest(page_size=1)
list_response = c.ListDataObjects(body=list_request).result()
data_object = list_response.data_objects[0]
print(data_object.id)
print(data_object.name)

46c8a5f1-15ab-48fa-8d1c-63099422e3c7
NWD259170.recab.cram.crai
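
The request above asks for a single item. To walk a whole listing, a client would typically follow page tokens. The sketch below shows that loop under stated assumptions: `list_page` is a hypothetical callable (not part of the DOS client) that takes a page size and a page token and returns a response with `data_objects` and `next_page_token`, and `fake_list_page` is a stand-in so the example runs without a network. With the client above, `list_page` might wrap `c.ListDataObjects(...)`, assuming the service honors `page_token`.

```python
from collections import namedtuple

# Stand-in for a list response; a real response is assumed to expose
# the same two attributes.
Page = namedtuple('Page', ['data_objects', 'next_page_token'])


def iter_data_objects(list_page, page_size=10):
    """Yield every item by following next_page_token until it is empty.

    list_page is a hypothetical callable: (page_size, page_token) -> response.
    """
    page_token = None
    while True:
        response = list_page(page_size, page_token)
        for data_object in response.data_objects:
            yield data_object
        page_token = getattr(response, 'next_page_token', None)
        if not page_token:
            break


def fake_list_page(page_size, page_token):
    """Fake service holding five items; the token is just the next offset."""
    items = ['a', 'b', 'c', 'd', 'e']
    start = int(page_token or 0)
    end = start + page_size
    next_token = str(end) if end < len(items) else None
    return Page(items[start:end], next_token)


all_items = list(iter_data_objects(fake_list_page, page_size=2))
print(all_items)
```

Keeping the fetch step behind a callable means the loop can be exercised offline and reused unchanged once it is pointed at a live service.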


## Generating a GUID

We will use the [minid](https://github.com/fair-research/minid) service to register identifiers with a third party. Generating a minid yields an identifier that is resolvable as an [Archival Resource Key](https://en.wikipedia.org/wiki/Archival_Resource_Key) (ARK) through services like [n2t.net](https://n2t.net).

The minid client can be installed using `pip install minid`.

In [63]:
!minid --help

usage: minid [-h] [--register] [--batch-register] [--update] [--test] [--json]
             [--server SERVER] [--title TITLE]
             [--locations LOCATIONS [LOCATIONS ...]] [--status STATUS]
             [--obsoleted_by OBSOLETED_BY] [--content_key CONTENT_KEY]
             [--config CONFIG] [--register_user] [--email EMAIL] [--name NAME]
             [--orcid ORCID] [--code CODE]
             [--globus_auth_token GLOBUS_AUTH_TOKEN] [--quiet] [--version]
             [filename]

BD2K minid tool for assigning an identifier to data

positional arguments:
  filename              file or identifier to retrieve information about or
                        register

optional arguments:
  -h, --help            show this help message and exit
  --register            Register the file
  --batch-register      Register multiple files listed in a JSON manifest
  --update              Update a minid
  --test                Run a test of this registration using the test min

### Registering an Account

If you would like to use the default minid server, you must first register an account with your name, your email address, and, if you have one, your ORCID (a persistent identifier for researchers). Follow the directions here: https://github.com/fair-research/minid

### Gathering necessary metadata

First, we'll gather as much metadata from the item as we can usefully send to the minid service.

We'll need a `title`, one or more `locations`, and a manifest of the Data Object.

In [64]:
filename = data_object.name
print(filename)

NWD259170.recab.cram.crai


So that anyone resolving the identifier can get more information about the Data Object, we provide its DOS URL as the location.

In [65]:
dos_url = '{}/ga4gh/dos/v1/dataobjects/{}'.format(dos_azul_url, data_object.id)
print(dos_url)

https://5ybh0f5iai.execute-api.us-west-2.amazonaws.com/api/ga4gh/dos/v1/dataobjects/46c8a5f1-15ab-48fa-8d1c-63099422e3c7


#### Generating a manifest

Manifests to the minid service can take this form:

```
[
    {
        "length":321,
        "filename":"file1.json",
        "md5":"9faccdb6f9a47a10d9a00bd2b13f7ab3",
        "sha256":"eb42cbc9682e953a03fe83c5297093d95eec045e814517a4e891437b9b993139"
    }
]
```

So we'll build a JSON document with the same structure from our Data Object and write it to disk. This manifest is what the minid service uses to generate the ARK.
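
In this case study the length and checksum come from the Data Object's metadata. For a file that sits on local disk instead, an entry of the same shape could be computed directly. This is a minimal sketch using only the standard library; `file_to_minid_item` is a hypothetical helper, and the temporary file exists only so the example runs on its own.

```python
import hashlib
import os
import tempfile


def file_to_minid_item(path, chunk_size=1024 * 1024):
    """Build a manifest entry (length, filename, md5, sha256) for a local file."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, 'rb') as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            md5.update(chunk)
            sha256.update(chunk)
    return {
        'length': os.path.getsize(path),
        'filename': os.path.basename(path),
        'md5': md5.hexdigest(),
        'sha256': sha256.hexdigest(),
    }


# Demonstrate on a small throwaway file.
with tempfile.NamedTemporaryFile(suffix='.json', delete=False) as tmp:
    tmp.write(b'{"hello": "world"}')
item = file_to_minid_item(tmp.name)
os.remove(tmp.name)
print(item)
```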

In [66]:
length = data_object.size
checksum_key = data_object.checksums[0].type
checksum = data_object.checksums[0].checksum
print(length)
print(filename)
print(checksum_key)
print(checksum)

1230140
NWD259170.recab.cram.crai
md5
be947abb597d1a21f2da9d97d96f58e7ca07a214
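
Note that the value printed above is 40 hexadecimal characters long, while an MD5 digest is 32 (40 is the length of a SHA-1 digest). A quick length check like the one sketched below can flag such a mismatch between the declared checksum type and its value before anything is registered. `checksum_length_matches` is a hypothetical helper, not part of the DOS or minid clients, and it only compares lengths; it cannot tell which algorithm actually produced the value.

```python
import hashlib

# Expected hex digest lengths, taken from hashlib rather than hard-coded.
HEX_LENGTHS = {
    'md5': hashlib.md5().digest_size * 2,
    'sha1': hashlib.sha1().digest_size * 2,
    'sha256': hashlib.sha256().digest_size * 2,
}


def checksum_length_matches(checksum_type, checksum):
    """Return True/False for known types, or None if the type is unknown."""
    expected = HEX_LENGTHS.get(checksum_type.lower())
    if expected is None:
        return None
    return len(checksum) == expected


value = 'be947abb597d1a21f2da9d97d96f58e7ca07a214'
print(checksum_length_matches('md5', value))
print(checksum_length_matches('sha1', value))
```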


In [67]:
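def data_object_to_minid_item(data_object):
    # Read the checksum from the object itself instead of module-level
    # variables; its type (for example md5) becomes the key.
    first_checksum = data_object.checksums[0]
    return {
        'length': data_object.size,
        'filename': data_object.name,
        first_checksum.type: first_checksum.checksum,
        # dos_url was built in the cell above for this same Data Object.
        'url': dos_url}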

In [68]:
minid_manifest = [data_object_to_minid_item(data_object)]
print(minid_manifest)

[{'url': 'https://5ybh0f5iai.execute-api.us-west-2.amazonaws.com/api/ga4gh/dos/v1/dataobjects/46c8a5f1-15ab-48fa-8d1c-63099422e3c7', 'length': 1230140L, u'md5': u'be947abb597d1a21f2da9d97d96f58e7ca07a214', 'filename': u'NWD259170.recab.cram.crai'}]


In [69]:
import json
with open('minid.json', 'w') as outfile:
    json.dump(minid_manifest, outfile)
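
Before handing the file to the command line tool, it can help to confirm that the manifest has the expected shape. The sketch below is an assumption-laden check: the required keys (`length`, `filename`, and at least one of `md5` or `sha256`) are inferred only from the example manifest shown earlier, not from the minid documentation, and `manifest_problems` is a hypothetical helper.

```python
import json

# Assumed from the example manifest in this document, not from the minid docs.
REQUIRED_KEYS = ('length', 'filename')
CHECKSUM_KEYS = ('md5', 'sha256')


def manifest_problems(entries):
    """Return a list of human-readable problems; an empty list means none found."""
    if not isinstance(entries, list):
        return ['manifest must be a JSON array']
    problems = []
    for index, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append('entry {}: missing {}'.format(index, key))
        if not any(key in entry for key in CHECKSUM_KEYS):
            problems.append('entry {}: no checksum key'.format(index))
    return problems


# Round-trip through JSON to mimic reading the manifest back from disk.
good = json.loads(json.dumps([{
    'length': 321,
    'filename': 'file1.json',
    'md5': '9faccdb6f9a47a10d9a00bd2b13f7ab3',
}]))
bad = [{'filename': 'file1.json'}]

print(manifest_problems(good))
print(manifest_problems(bad))
```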

### Generating a test minid

Before we issue a minid, we can test to make sure our settings are correct by issuing a test request.

In [37]:
!minid --batch-register --test minid.json

2018-04-26 16:37:58,125 - INFO - Checking if the TEST entity be947abb597d1a21f2da9d97d96f58e7ca07a214 already exists on the server: http://minid.bd2k.org/minid
2018-04-26 16:37:58,314 - INFO - Creating new identifier
2018-04-26 16:37:59,965 - INFO - Created/updated minid: ark:/99999/fk42j7n71s
[
  {
    "url": "ark:/99999/fk42j7n71s", 
    "length": 1230140, 
    "md5": "be947abb597d1a21f2da9d97d96f58e7ca07a214", 
    "filename": "NWD259170.recab.cram.crai"
  }
]


### Generating a minid

* Use Case 1.2 [Register the Data Object URL at an Identifier Service](https://github.com/DataBiosphere/identifier-interoperability#1.2)

Since the test seemed to work, we can run it again, this time without the test flag to get an identifier we'd like to reuse.

In [38]:
!minid --batch-register minid.json

2018-04-26 16:39:22,705 - INFO - Checking if the entity be947abb597d1a21f2da9d97d96f58e7ca07a214 already exists on the server: http://minid.bd2k.org/minid
2018-04-26 16:39:23,026 - INFO - Creating new identifier
2018-04-26 16:39:24,651 - INFO - Created/updated minid: ark:/57799/b9t991
[
  {
    "url": "ark:/57799/b9t991", 
    "length": 1230140, 
    "md5": "be947abb597d1a21f2da9d97d96f58e7ca07a214", 
    "filename": "NWD259170.recab.cram.crai"
  }
]
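
To reuse the new identifier in later cells without copying it by hand, it could be pulled out of the tool's log output. This is a rough sketch that relies only on the `Created/updated minid:` line visible in the two outputs above; `extract_minid` and `looks_like_test_minid` are hypothetical helpers, and treating the `ark:/99999/` prefix as a marker for test identifiers is an assumption drawn from those outputs.

```python
import re

# Matches the log line format seen in the outputs above.
MINID_LINE = re.compile(r'Created/updated minid: (ark:/\d+/\S+)')


def extract_minid(cli_output):
    """Return the first ARK reported in the log text, or None if absent."""
    match = MINID_LINE.search(cli_output)
    return match.group(1) if match else None


def looks_like_test_minid(ark):
    """Assumption: identifiers issued with --test start with ark:/99999/."""
    return ark.startswith('ark:/99999/')


test_output = (
    '2018-04-26 16:37:58,314 - INFO - Creating new identifier\n'
    '2018-04-26 16:37:59,965 - INFO - Created/updated minid: ark:/99999/fk42j7n71s\n'
)
real_output = (
    '2018-04-26 16:39:23,026 - INFO - Creating new identifier\n'
    '2018-04-26 16:39:24,651 - INFO - Created/updated minid: ark:/57799/b9t991\n'
)

test_ark = extract_minid(test_output)
real_ark = extract_minid(real_output)
print(test_ark, looks_like_test_minid(test_ark))
print(real_ark, looks_like_test_minid(real_ark))
```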


### Resolving the new minid!

The minid that resolves to the Archival Resource Key `ark:/57799/b9t991` now points to our Data Object! Now, if given this identifier, a client will be able to resolve our Data Object.

A landing page exists at the URL: http://minid.bd2k.org/minid/landingpage/ark:/57799/b9t991 and resolves from n2t.net as well: http://n2t.net/ark:/57799/b9t991 .
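
Both URLs above follow a simple pattern: a resolver prefix followed by the ARK. A small helper can build them for any minid. This is only a sketch; `resolver_urls` is a hypothetical name, and the two prefixes are copied from the URLs in this document, which should themselves be treated as ephemeral.

```python
# Prefixes copied from the URLs shown in this case study.
MINID_LANDING_PAGE = 'http://minid.bd2k.org/minid/landingpage/'
N2T_RESOLVER = 'http://n2t.net/'


def resolver_urls(ark):
    """Return the landing page and n2t.net URLs for an ARK string."""
    if not ark.startswith('ark:/'):
        raise ValueError('expected an identifier starting with ark:/')
    return {
        'landing_page': MINID_LANDING_PAGE + ark,
        'n2t': N2T_RESOLVER + ark,
    }


urls = resolver_urls('ark:/57799/b9t991')
print(urls['landing_page'])
print(urls['n2t'])
```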

## Updating the Data Object

* [Use Case 1.3 Update the Data Object Metadata](https://github.com/DataBiosphere/identifier-interoperability#1.3)

Now that we have a minid, we can link back to the item in the DSS by making an `UpdateDataObjectRequest`. This request accepts a Data Object to update in its payload and the `dos-azul-lambda` has been configured to accept `minid` as a modifiable key.

Since this is an authorized request, we set a token in the header.

In [70]:
access_token = "f4ce9d3d23f4ac9dfdc3c825608dc660"
data_object['aliases'].append("minid:ark:/57799/b9t991")

In [74]:
data_object.updated = None
data_object.size = str(data_object.size)
update_response = c.UpdateDataObject(
    data_object_id=data_object.id, body={'data_object': data_object},
    _request_options={'headers': {'access_token': access_token}}).result()
print(update_response.data_object_id)

46c8a5f1-15ab-48fa-8d1c-63099422e3c7
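
Re-running the cell that appends the alias would add the same minid a second time. One way to guard against that is to add the alias only when it is missing, as sketched here on a plain list. `add_minid_alias` is a hypothetical helper; the `minid:` prefix matches the alias format used above.

```python
def add_minid_alias(aliases, ark):
    """Append 'minid:<ark>' to aliases unless it is already present."""
    alias = 'minid:' + ark
    if alias not in aliases:
        aliases.append(alias)
    return aliases


aliases = ['NWD259170.recab.cram.crai']
add_minid_alias(aliases, 'ark:/57799/b9t991')
add_minid_alias(aliases, 'ark:/57799/b9t991')  # second call is a no-op
print(aliases)
```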


### Verifying the update

Now that we have updated the item to include the GUID, we can check using a GetDataObjectRequest for the same Data Object.

In [76]:
updated_object = c.GetDataObject(data_object_id=data_object.id).result().data_object
print(updated_object.id)
print("minid:ark:/57799/b9t991" in updated_object.aliases)

46c8a5f1-15ab-48fa-8d1c-63099422e3c7
True


### Listing via minid

We can also use the `ListDataObjectsRequest` to return all Data Objects in a platform that match an alias, in this case, the minid.

In [78]:
list_request = ListDataObjectsRequest(alias="minid:ark:/57799/b9t991")
list_response = c.ListDataObjects(body=list_request).result()
print(list_response.data_objects[0].id)
print(len(list_response.data_objects))

46c8a5f1-15ab-48fa-8d1c-63099422e3c7
1


As we hoped, only a single item matching our minid was returned. For this Data Object, we now have links that allow it to be resolved using GUIDs both by a third-party service and within the platform!
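
Going the other way, a client holding a Data Object may want to recover the minids stored among its aliases. The sketch below filters a plain list of alias strings by the `minid:` prefix used in this case study; `minids_from_aliases` is a hypothetical helper and the sample aliases are illustrative.

```python
def minids_from_aliases(aliases, prefix='minid:'):
    """Return the ARK part of every alias that carries the minid prefix."""
    return [alias[len(prefix):] for alias in aliases if alias.startswith(prefix)]


sample_aliases = ['NWD259170.recab.cram.crai', 'minid:ark:/57799/b9t991']
found = minids_from_aliases(sample_aliases)
print(found)
```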

## Future Directions

The [Data Storage System (DSS)](https://github.com/HumanCellAtlas/data-store), [dos-azul-lambda](https://github.com/DataBiosphere/dos-azul-lambda), and [dss-azul-indexer](https://github.com/DataBiosphere/dss-azul-indexer) are all available as Open Source software. Please check out their respective repositories for more information on current issues and places of active development.

Many thanks to the UCSC CGP Team for making this demonstration possible.

This case study was prepared to support [Identifier Interoperability](https://github.com/DataBiosphere/identifier-interoperability). Please head to that document and its issues to help make it easy for data platforms to interoperate and assist scientific discovery!