Search by Archives
Overview of User Guide
- Search by Archives: Getting Started
- Performing a Data Archives Search
- Data Availability within the Portal
Search by Archives: Getting Started
- Cancer Type
The TCGA Pilot Project is studying three cancers:
- brain cancer, listed as glioblastoma multiforme (GBM),
- ovarian serous cystadenocarcinoma, abbreviated as serous cystadenocarcinoma (OV), and
- lung squamous adenocarcinoma, abbreviated as squamous carcinoma (LG).
- Center
There are 10 TCGA research centers depositing data into the Data Portal:
- Baylor College of Medicine, Human Genome Sequencing Center, denoted as Baylor College of Medicine
- The Eli and Edythe L. Broad Institute of the Massachusetts Institute of Technology (MIT) and Harvard University and the Dana Farber Cancer Institute, denoted as Broad Institute of MIT and Harvard
- Brigham and Women's Hospital of Harvard Medical School and Dana Farber Cancer Institute, denoted as Harvard Medical School
- International Genomics Consortium, denoted as IGC Biospecimen Core Resource
- Johns Hopkins University and University of Southern California joint group, denoted as Johns Hopkins/University of Southern California
- Lawrence Berkeley National Laboratory
- Memorial Sloan-Kettering Cancer Center
- University of North Carolina, Lineberger Comprehensive Cancer Center, denoted as University of North Carolina
- Stanford University School of Medicine, denoted as Stanford University
- Washington University School of Medicine, Genome Sequencing Center, denoted as Washington University School of Medicine.
- Platform
- Data Type
Users can also search the Data Portal by the types of data generated by TCGA research centers. The clinical and genomic data available within the database are outlined below.
- Submission Date
The "Submission Date" search parameter allows users to search and retrieve data based upon an "On or After" date and a "Before" date. Users should enter dates in the format mm/dd/yy; for example, 07/04/07. After a user inputs a beginning ("On or After") date and an end ("Before") date, the query will return all data that was deposited into the Portal during that time interval.
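The date handling described above can be sketched as follows. This is only an illustration, not the Portal's implementation: the function names are hypothetical, and it assumes the "Before" date is exclusive, which the guide does not state explicitly.

```python
from datetime import date, datetime


def parse_portal_date(text: str) -> date:
    """Parse a date written as mm/dd/yy, e.g. '07/04/07' is 4 July 2007."""
    return datetime.strptime(text, "%m/%d/%y").date()


def in_submission_window(submitted: date, on_or_after: str, before: str) -> bool:
    """Return True if `submitted` lies in the interval [on_or_after, before).

    Assumption: "Before" excludes the end date itself.
    """
    start = parse_portal_date(on_or_after)
    end = parse_portal_date(before)
    return start <= submitted < end


print(in_submission_window(date(2007, 7, 4), "07/04/07", "08/01/07"))  # True
print(in_submission_window(date(2007, 8, 1), "07/04/07", "08/01/07"))  # False
```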
Performing a Search
- Available: The file was submitted, has passed quality control and is available for users to download.
- In review: The data file has been submitted, but is not ready to download. The text for files “In review” will be yellow.
- Open-Access: Files can be downloaded by all users.
- Controlled-Access: To download these files, users must agree to the TCGA Data Use Certification and become authorized users through the Data Access Request process (see footnote 1).
- Text in red denotes Controlled-Access data.
- If “Download” and “MD5” text is red, the file contains controlled-access data.
- If “Download” and “MD5” text is yellow, the file is “In review” and is not yet available for download.
- If “Download” and “MD5” text is black, the file is open-access and available to download.
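The colour rules above can be summarized in a small lookup. This is a sketch under assumptions: the function name is hypothetical, and the guide does not say which colour applies to a controlled-access file that is still in review, so the code assumes "In review" takes precedence.

```python
def link_color(status: str, access: str) -> str:
    """Return the colour of the "Download" and "MD5" text for a file.

    status: "Available" or "In review"
    access: "Open-Access" or "Controlled-Access"
    """
    if status == "In review":
        # Assumption: files in review are shown in yellow regardless of access level.
        return "yellow"
    if access == "Controlled-Access":
        return "red"
    return "black"


print(link_color("Available", "Open-Access"))        # black
print(link_color("Available", "Controlled-Access"))  # red
print(link_color("In review", "Open-Access"))        # yellow
```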
- Data Availability within the Portal
The Cancer Genome Atlas Data Portal "Search by Archive" feature allows users to search for and download complete data archives as submitted by the TCGA Research Network.
Five parameters can be used to retrieve data: cancer type, center, platform, data type, and submission date.
In the TCGA Pilot Project, data is being generated by genomic characterization and sequencing platforms, described below:
| Platform | Description | Associated Data Type |
|---|---|---|
| Affymetrix HT Human Genome U133 Array Plate Set | High-throughput expression profile of approximately 40,000 transcripts and variants | Expression-Genes |
| Affymetrix Human Exon 1.0 ST Array | Contains approximately 1 million predicted and confirmed exon transcripts | Expression-Exon |
| Affymetrix Genome-Wide Human SNP Array 6.0 | Allows detection of copy number variation with more than 906,600 single nucleotide polymorphisms (SNPs) and over 946,000 probes | Copy Number Results, SNP |
| Agilent 8 x 15K Human miRNA-specific Microarray | Highly specific and sensitive microRNA expression profiling system | Expression-miRNA |
| Agilent Human Genome CGH Microarray 244A | Allows analysis of ~240,000 coding and non-coding sequences for chromosomal DNA alterations | Copy Number Results |
| Agilent Whole Human Genome Microarray Kit 4 x 44K | High-density profiling analysis tool that covers over 41,000 unique human genes and transcripts | Copy Number Results, Expression-Genes |
| Illumina DNA Methylation OMA002 Cancer Panel I | First array-based high-throughput, high multiplexing, single-site CpG resolution platform; contains 1,505 CpG loci selected from 807 genes | Methylation |
| Illumina 550K Infinium® HumanHap550 SNP Chip | Covers 550,000 SNP loci across the genome | SNP |
| Biospecimen Metadata – Complete Set | Detailed clinical phenotype and outcome data | Complete Clinical Set |
| Biospecimen Metadata – Minimal Set | Clinical Diagnosis, Histologic Diagnosis, Pathologic Status, Tissue Anatomic Site | Minimal Clinical Set |
| Applied Biosystems Sequence Data | High-throughput Sanger/di-deoxy technology sequencing | Sequence Data |
The data types available within the database, and the platforms associated with each, are:
| Data Type | Content | Associated Platforms |
|---|---|---|
| All | All data types employed by TCGA centers | All |
| Complete Clinical Set | Detailed clinical phenotype and outcome data | N/A |
| Minimal Clinical Set | Clinical Diagnosis, Histologic Diagnosis, Pathologic Status, Tissue Anatomic Site | N/A |
| Expression-Exon | Exon Expression Profiling | Affymetrix Human Exon 1.0 ST Array |
| Expression-miRNA | miRNA Expression Profiling | Agilent 8 x 15K Human miRNA-specific Microarray |
| Methylation | DNA methylation patterns within the genome | Illumina DNA Methylation OMA002 Cancer Panel I |
| SNP | Single Nucleotide Polymorphisms, raw genotype calls, computed genotype frequencies | Illumina 550K Infinium® HumanHap550 SNP Chip, Affymetrix Genome-Wide Human SNP Array 6.0 |
| Copy Number Results | Interpreted copy number and loss of heterozygosity data | Affymetrix Genome-Wide Human SNP Array 6.0, Illumina 550K Infinium® HumanHap550 SNP Chip, Agilent Whole Human Genome Microarray Kit 4 x 44K, Agilent Human Genome CGH Microarray 244A |
| Somatic Mutations | Somatic variants compiled from Genomic Sequencing Centers | Applied Biosystems Sequence Data |
| Trace-Gene-Sample Relationship | Trace files with NCBI-required annotation linked to gene-sample identifier | Applied Biosystems Sequence Data |
| Sequencing Trace | Trace files with NCBI-required annotations | Applied Biosystems Sequence Data |
| SNP Frequencies | Frequencies of SNPs over many samples and/or patients | Illumina 550K Infinium® HumanHap550 SNP Chip, Affymetrix Genome-Wide Human SNP Array 6.0 |
| Expression-Genes | Transcription profiling based upon genes and some unidentified transcripts | Affymetrix HT Human Genome U133 Array Plate Set, Agilent Whole Human Genome Microarray Kit 4 x 44K |
1. The Basics
Users can query any or all of the five search parameters (cancer type, center, platform, data type, and submission date). When more than one search parameter is selected, the query returns the intersection of the selected search parameters.
For example, when a user selects Cancer Type: Glioblastoma multiforme (GBM), the query will return all data for glioblastoma multiforme.

When a user selects the following search parameters:
Cancer Type: Glioblastoma multiforme (GBM)
Center: Lawrence Berkeley National Laboratory
Data Type: Expression-Exon

the query will return exon-based expression data generated by Lawrence Berkeley National Laboratory on glioblastoma multiforme samples.
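The intersection behavior described above can be illustrated with a short sketch. The `Archive` record, the `search` function, and the archive names are hypothetical and are not part of the Portal; only the idea of combining selected parameters with a logical AND comes from the guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Archive:
    name: str
    cancer_type: str
    center: str
    data_type: str


def search(archives, **selected):
    """Keep only archives that match every selected parameter (logical AND)."""
    return [
        a for a in archives
        if all(getattr(a, field) == value for field, value in selected.items())
    ]


# Made-up records, for illustration only.
ARCHIVES = [
    Archive("archive-a", "GBM", "Lawrence Berkeley National Laboratory", "Expression-Exon"),
    Archive("archive-b", "GBM", "Broad Institute of MIT and Harvard", "SNP"),
    Archive("archive-c", "OV", "Lawrence Berkeley National Laboratory", "Expression-Exon"),
]

gbm_only = search(ARCHIVES, cancer_type="GBM")
narrowed = search(
    ARCHIVES,
    cancer_type="GBM",
    center="Lawrence Berkeley National Laboratory",
    data_type="Expression-Exon",
)
print([a.name for a in gbm_only])  # ['archive-a', 'archive-b']
print([a.name for a in narrowed])  # ['archive-a']
```

Each additional parameter can only narrow the result set, which is why selecting a single parameter returns the broadest results.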

2. Selecting Multiple Query Terms within a Search Parameter
Within each of the five search parameters, users can select more than one option by holding down the Control key while clicking each option of their choice.
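One plausible reading of multiple selection is sketched here: options chosen within a single parameter are alternatives (OR), while different parameters are still combined by intersection (AND). The guide does not spell out the within-parameter behavior, so treat that as an assumption; the `matches` function and the records are hypothetical.

```python
def matches(record, selections):
    """Return True if the record satisfies every parameter with a selection.

    `selections` maps a parameter name to the set of options chosen for it.
    Assumption: a record matches a parameter if it equals ANY chosen option;
    parameters with an empty selection are ignored.
    """
    return all(
        record[param] in chosen
        for param, chosen in selections.items()
        if chosen
    )


# Made-up records, for illustration only.
records = [
    {"name": "archive-a", "cancer_type": "GBM", "data_type": "Expression-Exon"},
    {"name": "archive-b", "cancer_type": "OV", "data_type": "SNP"},
    {"name": "archive-c", "cancer_type": "LG", "data_type": "Expression-Exon"},
]

selections = {
    "cancer_type": {"GBM", "OV"},       # two options selected with Control-click
    "data_type": {"Expression-Exon"},   # one option selected
}

hits = [r["name"] for r in records if matches(r, selections)]
print(hits)  # ['archive-a']
```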

3. Retrieving Query Results
After selecting the search options, click “Find” to retrieve query results.
Query results are returned in a tabular format, with each row describing a separate data file.

For each data file returned, nine parameters are shown. Results can be sorted by any of the nine fields by clicking the parameter title at the top of the column; e.g., click "Added On" to sort the files by submission date, or click "Center" to sort the files by research center.
The parameters are described in the table below:
| Parameter | Description |
|---|---|
| Archive | Name of the data file |
| Added On | Date the data file was added to the Portal |
| Center | Name of the institution that provided the data file; e.g., “Broad Institute of MIT and Harvard” |
| Version | Version of the data file; e.g., 2.1.0 |
| Cancer Type | Cancer type from which the data was generated; e.g., “Glioblastoma multiforme (GBM)” |
| Platform | Technology platform from which the data file was generated |
| Data Type | Specific type of genomic, clinical, or genetic characterization data contained in the data file |
| Status | Availability status of the file for download |
| Download | Users can choose to download the file, download the associated MD5 file (see footnote 2), or view the file |
1 More information on data access can be found at: http://cancergenome.nih.gov/dataportal/data/access
2 MD5 files are used to verify the integrity of a transferred file. To do so, download both the file of interest and its corresponding MD5 file, compute an MD5 checksum of the downloaded file, and compare it with the checksum in the MD5 file. If the two are the same, the file has maintained its integrity.
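The integrity check in footnote 2 could be performed along these lines, using Python's standard `hashlib` module. This is a minimal sketch: the function names are hypothetical, and it assumes the downloaded MD5 file holds the hexadecimal checksum as its first whitespace-separated token (the layout produced by the common `md5sum` tool), which the guide does not specify.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(data_path, md5_path):
    """Compare a file's checksum with the one stored in its MD5 file.

    Assumption: the checksum is the first whitespace-separated token
    in the MD5 file.
    """
    with open(md5_path, "r") as fh:
        expected = fh.read().split()[0].lower()
    return md5_of_file(data_path) == expected
```

A call such as `verify_download("archive.tar.gz", "archive.tar.gz.md5")` (file names chosen for illustration) returns `True` when the checksums match and `False` otherwise.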
Note: Data is being continually added to this database. Please check back frequently for updates and additional data submissions.







