
Search by Archives

Overview of User Guide


  1. Search by Archives: Getting Started
  2. Performing a Data Archives Search
    1. The Basics
    2. Selecting Multiple Query Terms within a Search Parameter
    3. Retrieving Query Results
  3. Data Availability within the Portal


  1. Search by Archives: Getting Started

    The Cancer Genome Atlas Data Portal "Search by Archive" feature allows users to search for and download complete data archives as submitted by the TCGA Research Network.

    Data can be retrieved using five search parameters:

    1. Cancer Type
      The TCGA Pilot Project is studying three cancers:
      • brain cancer, listed as glioblastoma multiforme (GBM),
      • lung squamous adenocarcinoma, abbreviated as squamous carcinoma (LG), and
      • ovarian serous cystadenocarcinoma, abbreviated as serous cystadenocarcinoma (OV).
    2. Center
      Ten TCGA research centers are depositing data into the Data Portal:
        1. Baylor College of Medicine, Human Genome Sequencing Center, denoted as Baylor College of Medicine
        2. The Eli and Edythe L. Broad Institute of the Massachusetts Institute of Technology (MIT) and Harvard University and the Dana Farber Cancer Institute, denoted as Broad Institute of MIT and Harvard
        3. Brigham and Women's Hospital of Harvard Medical School and Dana Farber Cancer Institute, denoted as Harvard Medical School
        4. International Genomics Consortium, denoted as IGC Biospecimen Core Resource
        5. Johns Hopkins University and University of Southern California joint group, denoted as Johns Hopkins/University of Southern California
        6. Lawrence Berkeley National Laboratory
        7. Memorial Sloan-Kettering Cancer Center
        8. University of North Carolina, Lineberger Comprehensive Cancer Center, denoted as University of North Carolina
        9. Stanford University School of Medicine, denoted as Stanford University
        10. Washington University School of Medicine, Genome Sequencing Center, denoted as Washington University School of Medicine
    3. Platform
      In the TCGA Pilot Project, data is being generated by the genomic characterization and sequencing platforms described below:

      | Platform | Description | Associated Data Type |
      | --- | --- | --- |
      | Affymetrix HT Human Genome U133 Array Plate Set | High-throughput expression profile of approximately 40,000 transcripts and variants | Expression-Genes |
      | Affymetrix Human Exon 1.0 ST Array | Contains approximately 1 million predicted and confirmed exon transcripts | Expression-Exon |
      | Affymetrix Genome-Wide Human SNP Array 6.0 | Allows detection of copy number variation with more than 906,600 single nucleotide polymorphisms (SNPs) and over 946,000 probes | Copy Number Results, SNP, SNP Frequencies |
      | Agilent 8 x 15K Human miRNA-specific Microarray | Highly specific and sensitive microRNA expression profiling system | Expression-miRNA |
      | Agilent Human Genome CGH Microarray 244A | Allows analysis of ~240,000 coding and non-coding sequences for chromosomal DNA alterations | Copy Number Results |
      | Agilent Whole Human Genome Microarray Kit, 4 x 44K | High-density profiling analysis tool that covers over 41,000 unique human genes and transcripts | Copy Number Results, Expression-Genes |
      | Illumina DNA Methylation OMA002 Cancer Panel I | First array-based high-throughput, high multiplexing, single-site CpG resolution platform; contains 1,505 CpG loci selected from 807 genes | Methylation |
      | Illumina 550K Infinium® HumanHap550 SNP Chip | Covers 550,000 SNP loci across the genome | SNP |
      | Biospecimen Metadata – Complete Set | Detailed clinical phenotype and outcome data | Complete Clinical Set |
      | Biospecimen Metadata – Minimal Set | Clinical Diagnosis, Histologic Diagnosis, Pathologic Status, Tissue Anatomic Site | Minimal Clinical Set |
      | Applied Biosystems Sequence Data | High-throughput Sanger/di-deoxy technology sequencing | Sequence Data |

    4. Data Type
      Users can also search the Data Portal by the type of data generated by TCGA research centers. The clinical and genomic data types available within the database are outlined below:

      | Data Type | Content | Associated Platforms |
      | --- | --- | --- |
      | All | All data types employed by TCGA centers | All |
      | Complete Clinical Set | Detailed clinical phenotype and outcome data | N/A |
      | Minimal Clinical Set | Clinical Diagnosis, Histologic Diagnosis, Pathologic Status, Tissue Anatomic Site | N/A |
      | Expression-Exon | Exon Expression Profiling | Affymetrix Human Exon 1.0 ST Array |
      | Expression-miRNA | miRNA Expression Profiling | Agilent 8 x 15K Human miRNA-specific Microarray |
      | Methylation | DNA methylation patterns within the genome | Illumina DNA Methylation OMA002 Cancer Panel I |
      | SNP | Single Nucleotide Polymorphisms, raw genotype calls, computed genotype frequencies | Illumina 550K Infinium® HumanHap550 SNP Chip, Affymetrix Genome-Wide Human SNP Array 6.0 |
      | Copy Number Results | Interpreted copy number and loss of heterozygosity data | Affymetrix Genome-Wide Human SNP Array 6.0, Illumina 550K Infinium® HumanHap550 SNP Chip, Agilent Whole Human Genome Microarray Kit 4 x 44K, Agilent Human Genome CGH Microarray 244A |
      | Somatic Mutations | Somatic variants compiled from Genomic Sequencing Centers | Applied Biosystems Sequence Data |
      | Trace-Gene-Sample Relationship | Trace files with NCBI-required annotation linked to gene-sample identifier | Applied Biosystems Sequence Data |
      | Sequencing Trace | Trace files with NCBI-required annotations | Applied Biosystems Sequence Data |
      | SNP Frequencies | Frequencies of SNPs over many samples and/or patients | Illumina 550K Infinium® HumanHap550 SNP Chip, Affymetrix Genome-Wide Human SNP Array 6.0 |
      | Expression-Genes | Transcription profiling based upon genes and some unidentified transcripts | Affymetrix HT Human Genome U133 Array Plate Set, Agilent Whole Human Genome Microarray Kit 4 x 44K |

    5. Submission Date
      The "Submission Date" search parameter allows users to search and retrieve data based upon an "On or After" date and a "Before" date. Dates should be entered in mm/dd/yy format; for example, 07/04/07. After a user enters a beginning ("On or After") date and an end ("Before") date, the query returns all data that was deposited into the Portal during that interval.
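The date window described above can be sketched in Python. This is only an illustration, not the Portal's implementation: the function names are hypothetical, and treating "On or After" as inclusive and "Before" as exclusive is an assumption drawn from the field labels.

```python
from datetime import date, datetime

DATE_FORMAT = "%m/%d/%y"  # mm/dd/yy, as in the guide's example 07/04/07


def parse_portal_date(text: str) -> date:
    """Parse a date typed in the mm/dd/yy form described in the guide."""
    return datetime.strptime(text, DATE_FORMAT).date()


def in_submission_window(added_on: date, on_or_after: str, before: str) -> bool:
    """True when added_on lies in [on_or_after, before).

    Assumption: the start date is inclusive and the end date is exclusive.
    """
    return parse_portal_date(on_or_after) <= added_on < parse_portal_date(before)
```

For instance, `in_submission_window(date(2007, 7, 4), "07/04/07", "08/01/07")` would accept a file added on July 4, 2007, while a file added on August 1, 2007 would fall outside the window under the stated assumption.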



  2. Performing a Data Archives Search

    1. The Basics

    Users can query on any one or any combination of the five search parameters (cancer type, center, platform, data type, and submission date). When more than one search parameter is selected, the query returns the intersection of the selected parameters, i.e., only data that match all of them.

    For example, when a user selects Cancer Type: Glioblastoma multiforme (GBM), the query will return all data for glioblastoma multiforme.

    When a user selects the following search parameters:
    Cancer Type: Glioblastoma multiforme (GBM)
    Center: Lawrence Berkeley National Laboratory
    Data Type: Expression-Exon

    the query will return exon-based expression data generated by Lawrence Berkeley National Laboratory on glioblastoma multiforme samples.
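The intersection behavior in the two examples above can be sketched as a small filter. The rows, field names, and archive names below are hypothetical placeholders, and the code is an illustration rather than the Portal's actual query logic.

```python
from typing import Optional

# Hypothetical rows; field names and archive names are illustrative only.
ARCHIVES = [
    {"archive": "example_archive_1", "cancer_type": "GBM",
     "center": "Lawrence Berkeley National Laboratory",
     "data_type": "Expression-Exon"},
    {"archive": "example_archive_2", "cancer_type": "GBM",
     "center": "Broad Institute of MIT and Harvard",
     "data_type": "SNP"},
    {"archive": "example_archive_3", "cancer_type": "OV",
     "center": "Lawrence Berkeley National Laboratory",
     "data_type": "Expression-Exon"},
]


def search(rows, **criteria: Optional[str]):
    """Keep rows matching every supplied parameter; None means 'not selected'."""
    active = {key: value for key, value in criteria.items() if value is not None}
    return [row for row in rows
            if all(row.get(key) == value for key, value in active.items())]


# One parameter selected: every GBM archive is returned.
gbm_only = search(ARCHIVES, cancer_type="GBM")

# Three parameters selected: only rows matching all three remain.
narrowed = search(ARCHIVES, cancer_type="GBM",
                  center="Lawrence Berkeley National Laboratory",
                  data_type="Expression-Exon")
```

Each added parameter can only shrink the result set, which is what "intersection" means here.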


    2. Selecting Multiple Query Terms within a Search Parameter
    Within each of the five search parameters, users can select more than one option by holding down the Control key while clicking each option of their choice.
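One plausible reading of multiple selection is sketched below: a row matches if it equals any of the options chosen within a parameter, while the different parameters are still combined by intersection. The guide does not spell out this "any of" behavior, so it is an assumption, and the field names are hypothetical.

```python
def matches(row: dict, selections: dict) -> bool:
    """AND across parameters, OR among the options chosen within one parameter.

    Assumption: a parameter with no options selected places no constraint.
    """
    return all(row.get(param) in options
               for param, options in selections.items() if options)


rows = [
    {"cancer_type": "GBM", "data_type": "SNP"},
    {"cancer_type": "OV", "data_type": "Methylation"},
    {"cancer_type": "GBM", "data_type": "Methylation"},
]

# Two data types selected with the Control key, one cancer type.
selections = {"cancer_type": {"GBM"}, "data_type": {"SNP", "Methylation"}}
hits = [row for row in rows if matches(row, selections)]
```

Under these assumptions, `hits` contains the two GBM rows and excludes the OV row.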

    An example is illustrated below:


    3. Retrieving Query Results
    After selecting the search options, click “Find” to retrieve query results.
    Query results are returned in a tabular format, with each row describing a separate data file. An example is shown below:


    For each data file returned, nine parameters are listed. Results can be sorted by any of the nine fields by clicking the parameter title at the top of the column; e.g., click "Added On" to sort the files by submission date, or click "Center" to sort the files by research center.

    The parameters are described in the table below:

    Parameter

    Description

    Archive

    Name of the data file

    Added On

    Date the data file was added to the Portal

    Center

    Name of the institution that provided the data file; e.g., “Broad Institute of MIT and Harvard”

    Version

    Version of the data file; e.g., 2.1.0
    The first number (from left to right) is a serially increasing index that represents the number of files transferred to the Portal.
    The second number is the revision, which starts at zero. If a center makes changes (adds files, corrects files, etc.) to a previously deposited file, it increases the revision.
    The third number is the series, which also starts at zero. If a file is very large, a center can split it into several parts (series) for transfer; users then download all of the series to reassemble the entire file. In this case, all parts share the same first two numbers, and the third number (the series) increases with each part.

    Cancer Type

    Cancer type from which the data was generated; e.g., “Glioblastoma multiforme (GBM)”

    Platform

    Technology platform from which the data file was generated

    Data Type

    Specific type of genomic, clinical, or genetic characterization data contained in data file

    Status

    Availability status of the file for download.

    • Available: The file has been submitted, has passed quality control, and is available for users to download.
    • In review: The data file has been submitted but is not yet ready to download. Text for files “In review” is yellow.
    • Open-Access: The file can be downloaded by all users.
    • Controlled-Access: To download the file, users must agree to the TCGA Data Use Certification and become authorized users through the Data Access Request process [1]. Text for Controlled-Access data is red.

    Download

    Users can choose to download the file, download the associated MD5 file [2], or view the file.
    Note:

    • If the “Download” and “MD5” text is red, the file contains controlled-access data.
    • If the “Download” and “MD5” text is yellow, the file is “In Review” and is not yet available for download.
    • If the “Download” and “MD5” text is black, the file is open-access and available to download.
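The three-part numbering described under Version can be read programmatically. The sketch below is illustrative only; the class and function names are hypothetical and not part of the Portal.

```python
from typing import NamedTuple


class ArchiveVersion(NamedTuple):
    serial: int    # serially increasing index
    revision: int  # starts at zero; raised when a deposited file is changed
    series: int    # starts at zero; one value per part of a split file


def parse_version(text: str) -> ArchiveVersion:
    """Split a version string such as '2.1.0' into its three numbers."""
    serial, revision, series = (int(part) for part in text.split("."))
    return ArchiveVersion(serial, revision, series)


def same_file(a: str, b: str) -> bool:
    """Parts of one split file share the first two numbers."""
    return parse_version(a)[:2] == parse_version(b)[:2]
```

With this reading, `same_file("2.1.0", "2.1.3")` is true because only the series differs, whereas `same_file("2.1.0", "2.2.0")` is false because the revision changed.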

    [1] More information on data access can be found at: http://cancergenome.nih.gov/dataportal/data/access

    [2] MD5 files are used to verify the integrity of a transferred file. To do so, download both the MD5 file and the corresponding file of interest, compute an MD5 checksum for the downloaded file, and compare it with the value in the downloaded MD5 file. If the two match, the file has maintained its integrity.
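A minimal sketch of that integrity check, using Python's standard `hashlib` module, follows. The function names are hypothetical, and the layout of the MD5 file (a text file whose first whitespace-separated token is the hexadecimal digest) is an assumption, since the guide does not describe it.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path: str, md5_path: str) -> bool:
    """Compare a file's digest with the value stored in its MD5 file.

    Assumption: the first whitespace-separated token of the MD5 file
    is the expected hexadecimal digest.
    """
    with open(md5_path, "r", encoding="utf-8") as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(data_path) == expected
```

Reading in chunks keeps memory use flat for large archives. A `False` result suggests the transfer was corrupted and the file should be downloaded again.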


  3. Data Availability within the Portal

    Note: Data is being continually added to this database. Please check back frequently for updates and additional data submissions.