Using National Cancer Institute GDC API – Back From Across The Pond

A couple of weeks ago I wrote a blog post on downloading demographic and diagnosis content for IDH1 mutant gliomas, also known as Low Grade Glioma, from the NCI Genomic Data Commons (GDC) website portal. The portal typically supports downloading this data in JSON or TSV format. I wanted to understand the portal interface before attending a web conference a week later that gave an overview of the GDC API. I joined the web conference and found it a great introduction and interactive demonstration of consuming the GDC API via Jupyter Notebooks.

While the presented Python code was not shared, the GDC Application Programming Interface documentation on the portal website is quite comprehensive. I was interested in trying to replicate the query I had done via the portal. I started out using the cases endpoint, but I could only filter successfully on primary_site, disease_type and project_id. The fourth part of my filter, which I still needed to get working, was on the somatic gene (specifically IDH1). The cases.follow_ups.molecular_tests.gene_symbol field returned a smaller than expected result set, because it applies to collected clinical data rather than supplied bioinformatic data. So I needed to investigate other endpoints. You can view the field mappings for the cases endpoint as JSON or in the Data Dictionary Viewer.
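
As a minimal sketch of inspecting the field mappings from Python: each GDC endpoint exposes a `_mapping` resource (for example https://api.gdc.cancer.gov/cases/_mapping) whose JSON includes a "fields" list. The helper names `find_fields` and `fetch_mapping` are invented for this illustration, and the standard library's urllib is used only to keep the sketch dependency-free.

```python
import json
import urllib.request


def find_fields(mapping, keyword):
    """Return the sorted field names in a _mapping document that contain keyword."""
    keyword = keyword.lower()
    return sorted(f for f in mapping.get("fields", []) if keyword in f.lower())


def fetch_mapping(endpoint="cases"):
    """Download the _mapping document for a GDC endpoint (needs network access)."""
    url = f"https://api.gdc.cancer.gov/{endpoint}/_mapping"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


# Example (not run here):
# mapping = fetch_mapping("cases")
# print(find_fields(mapping, "gene_symbol"))
```

Searching the mapping this way is a quick check on whether a field such as gene_symbol lives under clinical follow-up data or under the mutation data before building a filter around it.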

I then tried the genes endpoint and the ssms endpoint, but I could not get either to return the desired case details for the specified criteria. I got close with both, but the result sets were a bit wonky. So I posted a support request describing what I was trying to accomplish, including a Python code snippet and a screenshot of the expected data from the GDC exploration portal. I got a quick response, and the short answer was that I should have been using the ssm_occurrences endpoint. So I plugged it into my Jupyter Notebook, called NIH_LGG_GDC_API.ipynb, explored the endpoint fields and made a few tweaks to cleanse the data so that it would run with the analysis plots from my previous blog post. The query code, with its filter and returned fields, is below.

# Filtered Query via GDC using "ssm_occurrences" endpoint
import json

import requests

# Build query criteria to match
filters = {
    "op": "and",
    "content":[
        {
        "op": "in",
        "content":{
            "field": "case.project.project_id",
            "value": ["CPTAC-3", "TCGA-GBM", "TCGA-LGG"]
            }
        },
        {
        "op": "=",
        "content":{
            "field": "ssm.consequence.transcript.gene.symbol",
            "value": ["IDH1"]
            }
        },
        {
        "op": "=",
        "content":{
            "field": "case.primary_site",
            "value": ["Brain"]
            }
        },
        {
        "op": "=",
        "content":{
            "field": "case.disease_type",
            "value": ["Gliomas"]
            }
        }
    ]
}

# Specify the fields GDC should return
fields = [
    "case.case_id",
    "case.demographic.gender",
    "case.demographic.vital_status",
    "case.demographic.year_of_birth",
    "case.demographic.year_of_death",
    "case.diagnoses.age_at_diagnosis",
    "case.diagnoses.days_to_recurrence",
    "case.diagnoses.primary_diagnosis", # Astrocytoma, Glioblastoma...
    "case.diagnoses.site_of_resection_or_biopsy", # Brain, Frontal lobe...
    "case.diagnoses.tumor_grade", # Not seeing much data
    "case.project.project_id", # CPTAC-3, TCGA-GBM, TCGA-LGG...
    "ssm.gene_aa_change",
    "ssm.genomic_dna_change"
    ]

fields = ",".join(fields)

# Build and execute GET request
endpoint = "https://api.gdc.cancer.gov/ssm_occurrences"

# With a GET request, the filters parameter needs to be converted
# from a dictionary to JSON-formatted string

params = {
    "filters": json.dumps(filters),
    "fields": fields,
    "format": "CSV",
    "size": "500",
    "pretty": True
    }

response = requests.get(endpoint, params = params)

# Write to a file (the "input" directory must already exist)
with open("input/ssm_occurrences_query.CSV", "w") as file:
    file.write(response.text)

# View output
print(response.url)
#print(response.text)
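
The query above asks for a single page of up to 500 rows. If a filter matches more records than that, the API's `from` and `size` parameters can be used to page through the results. The sketch below assumes the JSON response format, where `data.pagination.total` reports the number of matches. `fetch_all_hits` and `get_page` are names invented for this illustration, and the page fetcher is passed in as an argument so the loop can be tried without network access.

```python
def fetch_all_hits(get_page, size=500):
    """Collect every hit by requesting successive pages.

    get_page(start, size) must return the parsed JSON of one GDC response,
    i.e. a dict shaped like
    {"data": {"hits": [...], "pagination": {"total": N}}}.
    """
    hits = []
    start = 0
    while True:
        data = get_page(start, size)["data"]
        page_hits = data["hits"]
        hits.extend(page_hits)
        start += len(page_hits)
        # Stop when everything has been read or the server returns an empty page
        if not page_hits or start >= data["pagination"]["total"]:
            return hits


# Possible wiring to the request above (assumption, not run here):
# def get_page(start, size):
#     params = {"filters": json.dumps(filters), "fields": fields,
#               "format": "JSON", "size": size, "from": start}
#     return requests.get(endpoint, params=params).json()
#
# hits = fetch_all_hits(get_page)
```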

When analyzing the fields returned for each project_id (CPTAC-3, TCGA-GBM, TCGA-LGG), I noticed that not all of the case fields had data, and some of the data supplied was generic in nature. Good examples of this were site_of_resection_or_biopsy and tumor_grade. For the site field, some records specified the lobe (Temporal, Frontal…), while others were populated with Cerebrum or Brain, NOS. None specified whether the surgical site was in the left or right hemisphere. I tried a few other fields to see whether they had data, but much of it was empty or inconsistent.
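
One way to put numbers on this kind of gap is to measure, per project, how often a field is actually populated. The following is only a sketch using the standard csv module. `fill_rate_by_group` is a hypothetical helper, and the column names in the commented usage are assumptions, so check the header row of the downloaded file for the exact names.

```python
import csv
import io
from collections import defaultdict


def fill_rate_by_group(csv_text, group_col, value_col):
    """Return {group: fraction of rows whose value_col is non-empty}."""
    filled = defaultdict(int)
    total = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        group = row[group_col]
        total[group] += 1
        # Treat missing columns and whitespace-only cells as empty
        if (row.get(value_col) or "").strip():
            filled[group] += 1
    return {group: filled[group] / total[group] for group in total}


# Possible usage with the file written earlier (column names are assumptions):
# with open("input/ssm_occurrences_query.CSV") as f:
#     rates = fill_rate_by_group(f.read(),
#                                "case.project.project_id",
#                                "case.diagnoses.0.tumor_grade")
```

Note that this only catches empty cells. Generic values such as Brain, NOS count as filled and would need a separate check.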

I noticed the ssm_occurrences endpoint has a large array of bioinformatic fields. I did not explore them, but they are worth querying. The ssm_occurrences fields that can be expanded are as follows:

  "expand": [
    "case",
    "case.demographic",
    "case.diagnoses",
    "case.diagnoses.pathology_details",
    "case.diagnoses.treatments",
    "case.exposures",
    "case.family_histories",
    "case.observation",
    "case.observation.input_bam_file",
    "case.observation.normal_genotype",
    "case.observation.read_depth",
    "case.observation.sample",
    "case.observation.tumor_genotype",
    "case.observation.validation",
    "case.observation.variant_calling",
    "case.project",
    "case.project.program",
    "case.samples",
    "case.tissue_source_site",
    "ssm",
    "ssm.clinical_annotations",
    "ssm.clinical_annotations.civic",
    "ssm.consequence",
    "ssm.consequence.transcript",
    "ssm.consequence.transcript.annotation",
    "ssm.consequence.transcript.gene",
    "ssm.consequence.transcript.gene.external_db_ids"
  ],
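
These group names can be passed to the API's `expand` parameter to request every field in a group without listing each one. A minimal sketch follows, in which `build_expand_params` is a hypothetical helper and the two groups chosen are arbitrary picks from the list above. The request itself is left commented out and has not been verified against the live service.

```python
import json


def build_expand_params(filters, expand_groups, size=10):
    """Build query parameters that request whole field groups via `expand`."""
    return {
        "filters": json.dumps(filters),      # GET requests need a JSON string
        "expand": ",".join(expand_groups),   # comma-separated group names
        "format": "JSON",
        "size": str(size),
    }


params = build_expand_params(
    {"op": "=", "content": {"field": "case.project.project_id", "value": ["TCGA-LGG"]}},
    ["case.diagnoses", "ssm.consequence.transcript.gene"],
)
# response = requests.get("https://api.gdc.cancer.gov/ssm_occurrences", params=params)
```

A small `size` keeps the response manageable while exploring, since expanded groups can return many nested fields per record.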

While ssm.consequence.* looks very interesting, there seem to be many more fields available under case.diagnoses.* and case.observation.* than are returned via the exploration portal. Whether these have consistently usable data is worth exploring. I’m not confident, given my experience with the cases endpoint. That exploration and analysis will have to be part of another blog post.

In the meantime, many thanks to Charlotte Sackett who quickly responded to my GDC support question, and to Bill Wysocki and Himanso Sahni who presented the GDC API web conference.

Guy Lipof