Download TCGA Data Using R

Question

Answer 1

If you are looking for a flexible programmatic approach, you might take a look at the GenomicDataCommons Bioconductor package: https://bioconductor.org/packages/GenomicDataCommons

Find data

The following code builds a manifest that can be used to guide the download of raw data. Here, filtering finds gene expression files quantified as raw counts using HTSeq from ovarian cancer patients.

    library(GenomicDataCommons)
    library(magrittr)

    ge_manifest = files() %>%
        filter( ~ cases.project.project_id == 'TCGA-OV' &
                    type == 'gene_expression' &
                    analysis.workflow_type == 'HTSeq - Counts') %>%
        manifest()

Download data

The next code block downloads the 379 gene expression files specified in the query above. Using multiple processes for the download can speed up the transfer considerably in many cases (the lapply call below fetches the files one at a time). On a standard 1 Gb connection, the following completes in about 30 seconds.

    destdir = tempdir()
    fnames = lapply(ge_manifest$id, gdcdata,
                    destination_dir = destdir,
                    overwrite = TRUE,
                    progress = FALSE)

If the query had included controlled-access data, the download above would also have needed a token.

Answer 2

On CentOS, you need to download the gdc-client source code and compile it yourself.

An issue on the gdc-client GitHub repository reports the underlying problem: glibc 2.12 is the latest version available for CentOS 6.

If your system is CentOS release 6.6, I think you should download the gdc-client source code and compile it yourself. Note that gdc-client is based on Python 2.

git clone https://github.com/NCI-GDC/gdc-client
cd gdc-client
python setup.py install

You may run into one of the following errors:

The 'lxml==3.5.0b1' distribution was not found and is required by gdc-client

or

ImportError: /usr/lib64/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by lxml/etree.so)

You need to install libxslt and libxml2 under your home directory, then add xml2-config and xslt-config to your PATH:

export PATH="/prog_path/libxslt-1.1.29/bin:/prog_path/libxml2-2.9.4/bin:$PATH"
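
Before reinstalling lxml, it can help to confirm that both config scripts actually resolve from the updated PATH. The following is a minimal sketch; check_tool is a hypothetical helper name introduced here for illustration, not part of gdc-client, lxml, or libxml2.

```shell
# Report where a tool resolves from PATH, or say that it is missing.
# check_tool is a hypothetical helper, not part of any package mentioned here.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: %s\n' "$1" "$(command -v "$1")"
    else
        printf '%s: not found on PATH\n' "$1"
        return 1
    fi
}

# Both scripts should resolve to the home-directory builds added to PATH.
check_tool xml2-config || true
check_tool xslt-config || true
```

If either script is reported as missing, or resolves to the system copy under /usr/bin, the export line probably did not take effect in the current shell.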

Then reinstall lxml:

pip uninstall lxml
pip install lxml==3.5.0b1 --install-option="--auto-rpath"

Finally, compile the gdc-client source code:

python setup.py install

It worked.

Answer 3

Take bladder cancer as an example:

1, Go to the following link (legacy archive at GDC):

https://gdc-portal.nci.nih.gov/legacy-archive/search/f?filters=%7B%22op%22:%22and%22,%22content%22:%5B%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22cases.project.program.name%22,%22value%22:%5B%22TCGA%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22cases.project.project_id%22,%22value%22:%5B%22TCGA-BLCA%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.platform%22,%22value%22:%5B%22Illumina%20Human%20Methylation%20450%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.data_format%22,%22value%22:%5B%22TXT%22%5D%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.data_category%22,%22value%22:%5B%22DNA%20methylation%22%5D%7D%7D%5D%7D

2, Add all 440 files to the cart and download the Manifest file

3, You will see that the first and second columns of the Manifest file are the UUID and the Sample ID
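
To pull those two columns out of the downloaded Manifest for later lookup, something like the sketch below may work. It assumes the Manifest is tab-separated with a single header row; manifest_first_two and the file names in the usage comment are hypothetical names chosen for illustration.

```shell
# Print the first two tab-separated columns of a manifest, skipping the header row.
# manifest_first_two is a hypothetical helper name.
# Assumption: the file is tab-separated and has exactly one header line.
manifest_first_two() {
    awk -F'\t' 'NR > 1 { print $1 "\t" $2 }' "$1"
}

# Usage (MANIFEST.txt and uuid_to_sample.tsv are assumed file names):
# manifest_first_two MANIFEST.txt > uuid_to_sample.tsv
```

The resulting two-column file can then be read into R to match downloaded files to samples.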
