Phenotype extraction - SQL
When to use the SQL approach?
There are many advantages of using the SQL database instead of the plain text phenotype files. The most important advantage is that with the SQL database phenotypes can be extracted using scripts that refer to the unique phenotype field IDs, improving reproducibility. In addition, data inside the SQL database is indexed, which allows quicker access to the data.
We recommended using the SQL approach when multiple phenotypes/field need to be extracted, or when complex ascertainment schemes are used.
The main disadvantage of using the SQL database is that it requires some knowledge of SQL language, which can be difficult for new comers. To overcome this issue, this document gives and overview of the basic structure of the SQL database and provides example SQL scripts to get well-acquainted with the SQL database.
Note
To access the database, you must have sqlite3 installed. Click here for a description and tutorial on SQLite.
Database Structure
To use the SQL database we must first understand the basic structure of a relational database, where each table is a file on its own. The relationships between tables/files are established through 'keys'. Depending on their role, keys are defined as primary or foreign. In brief, primary keys are a set of attributes in a table (e.g. one or more columns) that uniquely identifies rows in a given table, whereas foreign keys are a set of attributes that refers to the primary key of another table. The foreign key links the two tables.
In our UK Biobank SQL database, each phenotype is contained in one table with name fxxxx
(where xxxx
is the field ID of the phenotype). The fxxxx
tables are all linked to a 'participant' table through the sample_id field, which contains information about participant withdrawal (Figure below, panel A).
As discussed in the previous section, a large proportion of the data fields within the UK Biobank repository are categorical and use data-coding to map the actual data and the values used to represent it within the database. To include the data-coding matching in the SQL scripts for phenotype extraction, the data-coding tables need to be considered in the SQL script. The structure of these tables are illustrated in Figure 1, panel B.
Figure 1. Structure of the SQL database for UK Biobank data.
Detailed descriptions of the different data tables within our SQL database can be found in the following table:
Participant | |||
---|---|---|---|
sample_id | ID of UKB Samples | int | primary key in this table |
withdrawn | Date of which the samples withdrawn their consent (date of data-base update). 0 if sample has not withdrawn consent | text | fxxxx (where xxxx is the field ID of the phenotype) |
sample_id | ID of UKB Samples | int | foreign (Participant:sample_id) |
pheno | The phenotype | text | |
instance | 0 for baseline measurement, 1 for first follow up, so and so forth | int | |
array | 0 for first reported item, 1 for the second, so and so forth. This field is only presented for phenotypes that allow multiple input | int | code |
code_id | Data-coding ID | int | |
value | The Data-coding | int | |
meaning | Meaning of the Data-coding | text | data_meta |
category | Category ID | int | |
field_id | Unique Field ID | int | primary key in this table |
field | Description of the field | TEXT | |
participants | Number of participant with this phenotype | int | |
items | Number of items in this phenotype | int | |
stability | Indicate if this phenotype is table | text | |
value_type | Indicate the type of phenotype | text | |
units | The units of which the phenotype is measured in (if any) | text | |
item_type | Type of items, e.g. data, bulk, etc. | text | |
strata | e.g. Primary, auxiliary | text | |
sexed | Inidicate if this phenotype is only measured in male, female, or is unisex | text | |
instances | Indicate the number of repeated measure for this phenotype | int | |
array | For phenotype that allow multiple responses (e.g. ICD10), this indicate the maximum number of response. | int | |
code_id | Data-coding used for this phenotype | int | foreign (code:code_id) |
included | If this application has permission to access this phenotype. 1 = "yes" | int |
Using the SQL database
Basic understand of the SQL language is required to efficiently use our SQL database.
We recommend to write down the SQL commands in a .sql
file and use the SQL database as follows:
sqlite3 ukb<ID>.db < command.sql
where command.sql
is the sql command file containing the following header:
.header on
.mode csv
.output <name>.csv
Note
Useful basic SQL commands used in the SQL scripts:
-
Command FROM and JOIN --> indicate the tables that we are going to use
-
Command SELECT --> indicates the columns that we select
-
Command WHERE --> indicates the row filtering criteria
Below we provide users with the same three examples that we used in the Phenotype extraction - Plain text section, but here we extract the phenotypes using the SQL database:
Example 1: Basic usage
In the Basic usage section, we present an example on how to extract the first instance of the phenotype 'Height' (f.50.0.0) from UK Biobank.
.header on
.mode csv
.output Height.csv -- Output to file named Height.csv
SELECT s.sample_id AS FID, -- select sample_id from table 's' and call it FID
s.sample_id AS IID, -- select sample_id from table 's' and call it IID
age.pheno AS Age, -- select pheno from table 'age' and call it Age
sex.pheno AS Sex, -- select pheno from table 'sex' and call it Sex
bmi.pheno AS BMI, -- select pheno from table 'bmi' and call it BMI
centre.pheno AS Centre -- select pheno from table 'centre' and call it Centre
FROM Participant s -- using table 'participant', now named as 's'
JOIN f21001 bmi ON
s.sample_id=bmi.sample_id -- join the BMI table by sample ID
AND bmi.instance = 0 -- only getting the baseline phenotype
JOIN f31 sex ON
s.sample_id=sex.sample_id -- join the Sex table by sample ID
AND sex.instance = 0 -- only getting the baseline phenotype
JOIN f21003 age ON
s.sample_id=age.sample_id -- join the Age table by sample ID
AND age.instance = 0 -- only getting the baseline phenotype
JOIN f54 centre ON
s.sample_id=centre.sample_id -- join the UKB assessment centre table by sample ID
AND centre.instance = 0 -- only getting the baseline phenotype
WHERE s.withdrawn = 0; -- Exclude any samples who withdrawn their consent
.quit
Example 2: Phenotypes with data-coding
In this section we present an example on how to extract the first instance of a categorical field, and replace the coded values/categories with their actual meaning. The example consists on extracting the question "How many periods did you have in your life lasting two or more weeks where you felt like this?" (f.20442.0.0).
An illustration of what the code does and the tables that are used and created is included in Figure 2.
Figure 2. Diagram of example 2. The code can be divided in two steps, STEP 1, where a temporary table called 'pheno_code' is created, and STEP 2, where Participant and f.20442 tables are joined and values are replaced by their meaning. Please see code below for details.
.header on
.mode csv
.output NumDepress.csv -- Output to file named NumDepress.csv
-- ### STEP 1 ### --
--------------------
-- The following code creates a temporary table named `pheno_code` containing
-- code.value and code.meaning for all the entries with
-- data_meta.field_id=20442:
CREATE TEMP TABLE pheno_code
AS
SELECT cm.value AS value, -- select value from table 'cm' and call it value
cm.meaning AS meaning -- select meaning from table 'cm' and call it meaning
FROM code cm -- using table 'code', now named as cm
JOIN data_meta dm ON dm.code_id=cm.code_id -- join dm and cm tables by code_id
WHERE dm.field_id=20442; -- select rows with field_id 20442
-- ### STEP 2 ### --
---------------------
-- The following code creates a dataset that joins
-- table participant (called s), and table f20442 where withdrawn=0
-- For each row in the new dataset:
-- FID=s.sample_id
-- IID=s.sample_id
-- COALESCE returns the first non-null value in a list
-- if pheno_code.meaning is null, Pheno = depress.pheno,
-- else Pheno = pheno_code.meaning
SELECT s.sample_id AS FID,
s.sample_id AS IID,
COALESCE(
pheno_code.meaning,
depress.pheno) AS Pheno
FROM f20442 depress
JOIN Participant s
-- Join pheno_code into 'depress' so that meaning will be an empty string if
-- pheno is not present as a value in pheno_code.value
LEFT JOIN pheno_code ON pheno_code.value=depress.pheno
WHERE depress.instance=0 AND
s.sample_id=depress.sample_id AND
s.withdrawn=0;
.quit
Example 3: Phenotypes from Health Records Linkage
In the Phenotypes from Health Records Linkage section, we present an example on how to extract information from Health Records, using the ICD-10 coding
.header on
.mode csv
.output scz.csv -- Output to file named scz.csv
SELECT s.sample_id AS FID,
s.sample_id AS IID,
(CASE WHEN s.sample_id IN
(
SELECT icd10.sample_id
FROM f41270 as icd10
WHERE pheno LIKE '"F20_"' -- with this you dont need to creat a SCZ table
AND instance=0
AND s.sample_id=icd10.sample_id
)
THEN 1 -- samples found in scz table are cases (1)
ELSE 0
END) AS Pheno
FROM Participant s
WHERE s.withdrawn= 0; -- Exclude any samples who withdrawn their consent
.quit
Note
There are two wildcards used in conjunction with the LIKE
operator −
- The percent sign (
%
): 0, 1 or multiple numbers of characters - The underscore (
_
) : single number of character
For ICD10, remember to include the double quotes: '"XXXX"'