Initialize UK Biobank directory
This section walks through how to initialize the UK Biobank directory. By the end of it, you should have the master UK Biobank directory set up.
Before you start
Make sure you have the following
- nextflow
  Note
  You can install nextflow with
    curl -s https://get.nextflow.io | bash
  Java 8 or later is required
- git, to download the current repository
- An md5 key file for one UK Biobank application
  - This is required so that we can download the genetic files.
- Internet connection.
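The checklist above can be verified before you start. The following is a minimal sketch, not part of the repository: `check_prerequisites` is a hypothetical helper name, and it only tests that the tools are on your `PATH` and that the key file is readable. It does not check the Java version or the network connection.

```shell
#!/usr/bin/env bash
# Sketch: report which prerequisites from the list above are missing.
# check_prerequisites is a hypothetical helper, not part of ukb-administration.

check_prerequisites() {
    local key_file="$1" tool missing=0
    # nextflow runs on Java, so all three tools are checked.
    for tool in nextflow java git; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool"
            missing=1
        fi
    done
    # The key file must exist and be readable by the current user.
    if [ ! -r "$key_file" ]; then
        echo "missing key file: $key_file"
        missing=1
    fi
    return "$missing"
}
```

For example, `check_prerequisites <your key file> && echo "ready"` prints `ready` only when nothing is reported as missing.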
Step by step guide
- First, clone the current repository to where you want to put your UK Biobank directory
    git clone https://gitlab.com/choishingwan/ukb-administration.git
  This should generate the ukb-administration folder. You can rename it with
    mv ukb-administration <desired name>
- Move into the ukb-administration folder (or the name you gave it)
    cd ukb-administration
  Tips
  The docs folder contains the code that generates this website. If you do not need it, you can simply delete it.
- Run the following command
    nextflow run scripts/administration/init_structure.nf \
        --key <your key file> \
        --id <application id> \
        -c scripts/download.config
  Tips
  You can change the download.config file to suit the settings of your server. For example, you can change the executor to lsf, slurm, etc. For more information, please visit the nextflow documentation.
- Once the pipeline completes successfully, you can remove the intermediate folder work. If you want to keep the per-chromosome plink files, copy them out first (see the Important note under Final structure).
    rm -rf work
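The steps above can be chained into a single function. This is only a sketch: `init_ukb` and its argument order are hypothetical, while the `git clone` and `nextflow run` commands are the ones shown in this guide. It intentionally leaves the `work` folder in place so that you can decide when to delete it.

```shell
#!/usr/bin/env bash
# Sketch: clone the repository and run the initialization pipeline.
# init_ukb is a hypothetical wrapper, not part of ukb-administration.

init_ukb() {
    if [ "$#" -ne 3 ]; then
        echo "usage: init_ukb <key file> <application id> <target directory>" >&2
        return 2
    fi
    local key_file="$1" id="$2" target="$3"
    if [ ! -r "$key_file" ]; then
        echo "key file not readable: $key_file" >&2
        return 1
    fi
    # Resolve the key file to an absolute path before changing directory.
    key_file="$(cd "$(dirname "$key_file")" && pwd)/$(basename "$key_file")"

    # Clone straight into the desired folder name instead of renaming later.
    git clone https://gitlab.com/choishingwan/ukb-administration.git "$target" || return 1
    (
        cd "$target" || exit 1
        nextflow run scripts/administration/init_structure.nf \
            --key "$key_file" \
            --id "$id" \
            -c scripts/download.config || exit 1
        # Deliberately does not delete work; that stays a manual step.
        echo "pipeline finished; remove $target/work once you no longer need it"
    )
}
```

A call would look like `init_ukb <your key file> <application id> <desired name>`. The function stops at the first failing step, so the reminder is printed only after a successful run.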
Final structure
After running the script, your directory should look like the following
<root>
|
|- .genotype
|   |
|   |-- genotyped       # Contains the whole-genome bed and bim files
|   |
|   |-- imputed         # Contains the per-chromosome bgen and bgi files
|
|- .exome
|   |
|   |-- PLINK           # Contains the plink-format exome sequencing data
|
|- references           # Contains UK Biobank related references
|
|- software             # Contains UK Biobank related software and some utility tools
|   |
|   |-- bin             # Where the software executables are located
|
|- scripts              # Contains all scripts related to UK Biobank management
    |
    |-- administration
    |
    |-- QC
    |
    |-- examples
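To compare your directory with the layout above, a short check such as the following may help. It is a sketch under the assumption that every folder named in the tree should exist once the pipeline has finished; `check_structure` is a hypothetical helper, not part of the repository.

```shell
#!/usr/bin/env bash
# Sketch: confirm the folders from the tree above exist under a given root.
# check_structure is a hypothetical helper, not part of ukb-administration.

check_structure() {
    local root="$1" dir missing=0
    for dir in \
        .genotype/genotyped \
        .genotype/imputed \
        .exome/PLINK \
        references \
        software/bin \
        scripts/administration \
        scripts/QC \
        scripts/examples
    do
        # Report every missing folder rather than stopping at the first one.
        if [ ! -d "$root/$dir" ]; then
            echo "missing: $dir"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_structure <root>` prints one line per missing folder and returns a non-zero status if any are absent.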
Important
Once all data are downloaded, remove the work folder, which contains all intermediate files.
Note that these intermediate files include the per-chromosome plink genotyped data. If you want to keep a copy of the per-chromosome plink files, copy them out before deleting work. You can find the bed and bim files with
    ls <root>/work/*/*/*{*bed,*bim}
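If you do want to keep those files, they can be copied out before `work` is deleted. The sketch below assumes the files sit two levels below `work`, as in the `ls` pattern above; `keep_per_chromosome` and the destination folder are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: copy the per-chromosome bed and bim files out of work.
# keep_per_chromosome is a hypothetical helper, not part of ukb-administration.

keep_per_chromosome() {
    local root="$1" dest="$2" f copied=0
    mkdir -p "$dest" || return 1
    for f in "$root"/work/*/*/*bed "$root"/work/*/*/*bim; do
        # An unmatched pattern stays literal, so skip anything that is not a file.
        [ -f "$f" ] || continue
        # Files that share a name would overwrite one another in dest.
        cp "$f" "$dest"/ || return 1
        copied=$((copied + 1))
    done
    echo "copied $copied file(s) to $dest"
}
```

After `keep_per_chromosome <root> <destination folder>` reports the expected number of files, the `work` folder can be removed as described above.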