Phenotype extraction - SQL

When to use the SQL approach?

There are many advantages of using the SQL database instead of the plain text phenotype files. The most important advantage is that with the SQL database phenotypes can be extracted using scripts that refer to the unique phenotype field IDs, improving reproducibility. In addition, data inside the SQL database is indexed, which allows quicker access to the data.

We recommended using the SQL approach when multiple phenotypes/field need to be extracted, or when complex ascertainment schemes are used.

The main disadvantage of using the SQL database is that it requires some knowledge of SQL language, which can be difficult for new comers. To overcome this issue, this document gives and overview of the basic structure of the SQL database and provides example SQL scripts to get well-acquainted with the SQL database.

Note

To access the database, you must have sqlite3 installed. Click here for a description and tutorial on SQLite.

Database Structure

To use the SQL database we must first understand the basic structure of a relational database, where each table is a file on its own. The relationships between tables/files are established through 'keys'. Depending on their role, keys are defined as primary or foreign. In brief, primary keys are a set of attributes in a table (e.g. one or more columns) that uniquely identifies rows in a given table, whereas foreign keys are a set of attributes that refers to the primary key of another table. The foreign key links the two tables.

In our UK Biobank SQL database, each phenotype is contained in one table with name fxxxx (where xxxx is the field ID of the phenotype). The fxxxx tables are all linked to a 'participant' table through the sample_id field, which contains information about participant withdrawal (Figure below, panel A).

As discussed in the previous section, a large proportion of the data fields within the UK Biobank repository are categorical and use data-coding to map the actual data and the values used to represent it within the database. To include the data-coding matching in the SQL scripts for phenotype extraction, the data-coding tables need to be considered in the SQL script. The structure of these tables are illustrated in Figure 1, panel B.

Figure 1. Structure of the SQL database for UK Biobank data.

Detailed descriptions of the different data tables within our SQL database can be found in the following table:

Participant
sample_id	ID of UKB Samples	int	primary key in this table
withdrawn	Date of which the samples withdrawn their consent (date of data-base update). 0 if sample has not withdrawn consent	text
fxxxx (where xxxx is the field ID of the phenotype)
sample_id	ID of UKB Samples	int	foreign (Participant:sample_id)
pheno	The phenotype	text
instance	0 for baseline measurement, 1 for first follow up, so and so forth	int
array	0 for first reported item, 1 for the second, so and so forth. This field is only presented for phenotypes that allow multiple input	int
code
code_id	Data-coding ID	int
value	The Data-coding	int
meaning	Meaning of the Data-coding	text
data_meta
category	Category ID	int
field_id	Unique Field ID	int	primary key in this table
field	Description of the field	TEXT
participants	Number of participant with this phenotype	int
items	Number of items in this phenotype	int
stability	Indicate if this phenotype is table	text
value_type	Indicate the type of phenotype	text
units	The units of which the phenotype is measured in (if any)	text
item_type	Type of items, e.g. data, bulk, etc.	text
strata	e.g. Primary, auxiliary	text
sexed	Inidicate if this phenotype is only measured in male, female, or is unisex	text
instances	Indicate the number of repeated measure for this phenotype	int
array	For phenotype that allow multiple responses (e.g. ICD10), this indicate the maximum number of response.	int
code_id	Data-coding used for this phenotype	int	foreign (code:code_id)
included	If this application has permission to access this phenotype. 1 = "yes"	int

Using the SQL database

Basic understand of the SQL language is required to efficiently use our SQL database. We recommend to write down the SQL commands in a .sql file and use the SQL database as follows:

sqlite3 ukb<ID>.db < command.sql

where command.sql is the sql command file containing the following header:

.header on
.mode csv
.output <name>.csv

Note

Useful basic SQL commands used in the SQL scripts:

Command FROM and JOIN --> indicate the tables that we are going to use
Command SELECT --> indicates the columns that we select
Command WHERE --> indicates the row filtering criteria

Below we provide users with the same three examples that we used in the Phenotype extraction - Plain text section, but here we extract the phenotypes using the SQL database:

Example 1: Basic usage

In the Basic usage section, we present an example on how to extract the first instance of the phenotype 'Height' (f.50.0.0) from UK Biobank.

.header on
.mode csv
.output Height.csv              -- Output to file named Height.csv

SELECT  s.sample_id AS FID,     -- select sample_id from table 's' and call it FID 
s.sample_id AS IID,             -- select sample_id from table 's' and call it IID
age.pheno AS Age,               -- select pheno from table 'age' and call it Age
sex.pheno AS Sex,               -- select pheno from table 'sex' and call it Sex
bmi.pheno AS BMI,               -- select pheno from table 'bmi' and call it BMI
centre.pheno AS Centre          -- select pheno from table 'centre' and call it Centre

FROM    Participant s                            -- using table 'participant', now named as 's'
            JOIN    f21001 bmi ON 
                    s.sample_id=bmi.sample_id    -- join the BMI table by sample ID
                    AND bmi.instance = 0         -- only getting the baseline phenotype
            JOIN    f31 sex ON
                    s.sample_id=sex.sample_id    -- join the Sex table by sample ID
                    AND sex.instance = 0         -- only getting the baseline phenotype
            JOIN    f21003 age ON 
                    s.sample_id=age.sample_id    -- join the Age table by sample ID
                    AND age.instance = 0         -- only getting the baseline phenotype
            JOIN    f54 centre ON 
                    s.sample_id=centre.sample_id -- join the UKB assessment centre table by sample ID
                    AND centre.instance = 0      -- only getting the baseline phenotype
            WHERE   s.withdrawn = 0;             -- Exclude any samples who withdrawn their consent
.quit

Example 2: Phenotypes with data-coding

In this section we present an example on how to extract the first instance of a categorical field, and replace the coded values/categories with their actual meaning. The example consists on extracting the question "How many periods did you have in your life lasting two or more weeks where you felt like this?" (f.20442.0.0).

An illustration of what the code does and the tables that are used and created is included in Figure 2.

Figure 2. Diagram of example 2. The code can be divided in two steps, STEP 1, where a temporary table called 'pheno_code' is created, and STEP 2, where Participant and f.20442 tables are joined and values are replaced by their meaning. Please see code below for details.

.header on
.mode csv    
.output NumDepress.csv      -- Output to file named NumDepress.csv 

-- ### STEP 1 ### --
--------------------
-- The following code creates a temporary table named `pheno_code` containing
-- code.value and code.meaning for all the entries with 
-- data_meta.field_id=20442:

CREATE  TEMP TABLE pheno_code
AS
SELECT  cm.value AS value,                      -- select value from table 'cm' and call it value
        cm.meaning AS meaning                   -- select meaning from table 'cm' and call it meaning
FROM    code cm                                 -- using table 'code', now named as cm
JOIN    data_meta dm ON dm.code_id=cm.code_id   -- join dm and cm tables by code_id
WHERE   dm.field_id=20442;                      -- select rows with field_id 20442


-- ### STEP 2 ### --
---------------------
-- The following code creates a dataset that joins
-- table participant (called s), and table f20442 where withdrawn=0 
-- For each row in the new dataset:
--  FID=s.sample_id
--  IID=s.sample_id
--  COALESCE returns the first non-null value in a list
--         if pheno_code.meaning is null, Pheno = depress.pheno,
--         else Pheno = pheno_code.meaning

SELECT      s.sample_id AS FID,
            s.sample_id AS IID,
            COALESCE(
               pheno_code.meaning, 
               depress.pheno) AS Pheno          
FROM        f20442 depress
JOIN        Participant s

-- Join pheno_code into 'depress' so that meaning will be an empty string if 
-- pheno is not present as a value in pheno_code.value

LEFT JOIN   pheno_code ON pheno_code.value=depress.pheno 
WHERE       depress.instance=0 AND
            s.sample_id=depress.sample_id AND
            s.withdrawn=0;
.quit

Example 3: Phenotypes from Health Records Linkage

In the Phenotypes from Health Records Linkage section, we present an example on how to extract information from Health Records, using the ICD-10 coding

.header on
.mode csv    
.output scz.csv  -- Output to file named scz.csv 

SELECT  s.sample_id AS FID, 
        s.sample_id AS IID,
        (CASE WHEN s.sample_id IN
           (
                SELECT  icd10.sample_id
                FROM    f41270 as icd10
                WHERE   pheno LIKE '"F20_"' -- with this you dont need to creat a SCZ table
                        AND instance=0
                        AND s.sample_id=icd10.sample_id
            )
            THEN 1          -- samples found in scz table are cases (1)
            ELSE 0
        END) AS Pheno

FROM    Participant s 
WHERE   s.withdrawn= 0;     -- Exclude any samples who withdrawn their consent
.quit

Note

There are two wildcards used in conjunction with the LIKE operator −

The percent sign (%): 0, 1 or multiple numbers of characters
The underscore (_) : single number of character

For ICD10, remember to include the double quotes: '"XXXX"'