
Download TCGA Data Using R

Tutorial: Protocol To Download TCGA Data From GDC


Now that TCGA has moved under the Genomic Data Commons (GDC), almost all previous users are struggling to retrieve the same information. This tutorial shows how to download TCGA data from the GDC.

Step 1. Obtaining a Manifest File for Data Download (the manifest specifies which data to download)

                        https://gdc-portal.nci.nih.gov/legacy-archive/search/f                                              

Step 2. Install the download software: GDC Data Transfer Tool (Linux, Windows, macOS)

                        https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool                                              

Step 3.1 Downloading Data Using a Manifest File (gdc_manifest.lungCancer.txt)

                        gdc-client download -m gdc_manifest.lungCancer.txt                                              

Step 3.2 Downloading a Single File Using a UUID (the UUID can be found in the manifest file)

                        gdc-client download 22a29915-6712-4f7a-8dba-985ae9a1f005                                              

Step 3.3 Downloading Controlled Data (a user authentication token is required)

                        gdc-client download -m gdc_manifest_controled.txt -t gdc-user-passwdcode.txt                                              
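Before running the controlled-data download, it can help to confirm that the token file is actually present and readable only by you. The following is a minimal sketch, not part of the GDC tooling: `prepare_token` is a hypothetical helper name, and the file names in the usage comment are the ones from Step 3.3.

```shell
# Hypothetical helper: sanity-check a GDC token file before passing it to -t.
prepare_token() {
    token_file="$1"
    # The file must exist and be non-empty.
    if [ ! -s "$token_file" ]; then
        echo "token file missing or empty: $token_file" >&2
        return 1
    fi
    # Restrict access to the owner, since the token grants controlled-data access.
    chmod 600 "$token_file"
}

# Example use (file names taken from Step 3.3):
# prepare_token gdc-user-passwdcode.txt &&
#     gdc-client download -m gdc_manifest_controled.txt -t gdc-user-passwdcode.txt
```

If the helper returns non-zero, the download is simply not started, which avoids a confusing authentication failure halfway through a large manifest.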

FAQ:

                        1, ./gdc-client: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /tmp/_MEI5oSpPi/libz.so.1)                                              

Answer: glibc 2.12 is the latest version available for CentOS 6. That means the prebuilt gdc-client cannot be used on CentOS 6 to download the data (UCSD, TSCC).

2, How do I download controlled data from GDC? (See Step 3.3: a user authentication token is required.)

3, Eventually, I asked the TSCC manager to help me install fastq-dump on TSCC

4, Downloads sometimes fail because of network problems, but don't worry, just try again
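The "just try again" advice can be automated with a small retry wrapper. This is a generic sketch, not a gdc-client feature: `retry` is a hypothetical function, and the attempt count and delay in the usage comment are arbitrary values chosen for illustration.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-run COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    while true; do
        # Run the command; stop as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example use (5 attempts, 30 seconds apart):
# retry 5 30 gdc-client download -m gdc_manifest.lungCancer.txt
```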


If you are looking for a flexible programmatic approach, you might take a look at the GenomicDataCommons Bioconductor package: https://bioconductor.org/packages/GenomicDataCommons

Find data

The following code builds a manifest that can be used to guide the download of raw data. Here, filtering finds gene expression files quantified as raw counts using HTSeq from ovarian cancer patients.

                    library(GenomicDataCommons)
                    library(magrittr)
                    ge_manifest = files() %>%
                        filter( ~ cases.project.project_id == 'TCGA-OV' &
                                    type == 'gene_expression' &
                                    analysis.workflow_type == 'HTSeq - Counts') %>%
                        manifest()

Download data

The next code block downloads the 379 gene expression files specified in the query above. Using multiple processes to do the download very significantly speeds up the transfer in many cases. On a standard 1Gb connection, the following completes in about 30 seconds.

                    destdir = tempdir()
                    fnames = lapply(ge_manifest$id, gdcdata,
                                    destination_dir = destdir, overwrite = TRUE,
                                    progress = FALSE)

If the download had included controlled-access data, the download above would have needed to include a token.

On CentOS, you need to download the gdc-client source code and compile it yourself.

An issue on the gdc-client GitHub repository reports this problem: glibc 2.12 is the latest version available for CentOS 6.
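To find out whether a machine is affected before downloading the binary, you can compare the installed glibc version against the GLIBC_2.14 requirement from the error message. The sketch below is an assumption-laden illustration: `glibc_version`, `version_ge` and `check_glibc` are hypothetical helper names, the version is read with `getconf GNU_LIBC_VERSION` (which only works on glibc systems), and only major.minor versions are compared.

```shell
# Print the installed glibc version (e.g. "2.12"); prints nothing on non-glibc systems.
glibc_version() {
    getconf GNU_LIBC_VERSION 2>/dev/null | awk '{ print $2 }'
}

# Succeed if version $1 >= version $2 (major.minor only).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        # Different major versions decide the comparison on their own.
        if (x[1] + 0 != y[1] + 0) exit !(x[1] + 0 > y[1] + 0)
        exit !(x[2] + 0 >= y[2] + 0)
    }'
}

# Report whether the installed glibc meets the required version (default 2.14).
check_glibc() {
    required="${1:-2.14}"
    found="$(glibc_version)"
    if [ -z "$found" ]; then
        echo "could not determine glibc version" >&2
        return 2
    fi
    if version_ge "$found" "$required"; then
        echo "glibc $found is new enough (need $required)"
    else
        echo "glibc $found is too old (need $required); consider building from source"
        return 1
    fi
}

# Example use:
# check_glibc 2.14
```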

If your system is CentOS release 6.6, I think you should download the gdc-client source code and compile it yourself. gdc-client is based on Python 2.

  1. git clone https://github.com/NCI-GDC/gdc-client
  2. python setup.py install

You may run into one of the following errors:

The 'lxml==3.5.0b1' distribution was not found and is required by gdc-client

or

ImportError: /usr/lib64/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by lxml/etree.so)

You need to install libxslt and libxml2 under your home directory, and add xml2-config and xslt-config to your PATH:

  export PATH="/prog_path/libxslt-1.1.29/bin:/prog_path/libxml2-2.9.4/bin:$PATH"
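After changing PATH, it is worth confirming that both config scripts actually resolve before rebuilding lxml. A minimal sketch, where `require_tools` is a hypothetical helper and not part of any of the packages mentioned:

```shell
# Check that every named command can be found on PATH.
# Returns 0 if all are present, 1 if any is missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found $tool at $(command -v "$tool")"
        else
            echo "missing $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use:
# require_tools xml2-config xslt-config
```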

Then

  1. pip uninstall lxml
  2. pip install lxml==3.5.0b1 --install-option="--auto-rpath"

Finally, compile the gdc-client source code.

  1. python setup.py install

It worked.

Take bladder cancer as an example:

1, Go to the following link (the legacy archive at GDC):

https://gdc-portal.nci.nih.gov/legacy-archive/search/f?filters=%7B%22op%22:%22and%22,%22content%22:%5B%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22cases.project.program.name%22,%22value%22:%5B%22TCGA%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22cases.project.project_id%22,%22value%22:%5B%22TCGA-BLCA%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.platform%22,%22value%22:%5B%22Illumina%20Human%20Methylation%20450%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.data_format%22,%22value%22:%5B%22TXT%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.data_category%22,%22value%22:%5B%22DNA%20methylation%22%5D%7D%7D%5D%7D

2, Add all 440 files to the cart and download the manifest file

3, The first and second columns of the manifest file are the UUID and the sample ID
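Once you have the manifest, the UUID column can be pulled out with a one-line awk filter, for example to check the file count or to download a single file by UUID. This sketch assumes the manifest is tab-separated with a single header row and the UUID in the first column; `manifest_ids` and `manifest_count` are hypothetical helper names.

```shell
# Print the UUID (column 1) of every entry in a manifest file.
# Assumes a tab-separated file whose first line is a header.
manifest_ids() {
    awk -F '\t' 'NR > 1 && $1 != "" { print $1 }' "$1"
}

# Print the number of entries in a manifest file.
manifest_count() {
    manifest_ids "$1" | wc -l | tr -d ' '
}

# Example use (manifest name taken from Step 3.1):
# manifest_count gdc_manifest.lungCancer.txt
# manifest_ids gdc_manifest.lungCancer.txt | head -n 1
```

Comparing the count against the number of files added to the cart is a quick way to confirm that the manifest is complete.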



Source: https://www.biostars.org/p/204092/

Posted by: blekroom.blogspot.com