Frequently asked questions


This website has been developed to share our research findings with the public, researchers, healthcare professionals and anyone working in public health and policy. This section provides background information about the project and how its findings can be used. As researchers may use this study as a starting point for future research, more detailed, technical information can also be found in the ‘in-depth’ drop downs below.

You can read a non-technical summary of the research here, and the scientific paper here. If your questions are not answered, please let us know.

A: The study

This study is based on data collected from participants in a study called UK Biobank. You can read more about UK Biobank here. Briefly, 498,849 participants were enrolled between 2006 and 2010 at 21 centres across England, Wales and Scotland with the aim of improving the prevention, diagnosis and treatment of a wide range of serious and life-threatening illnesses, including cancer, heart disease and stroke. Participants visited assessment centres to answer detailed questionnaires and provide physical measurements and biological samples (variables), and agreed to have their health monitored. These variables are available to approved researchers who apply to use the data for health research.

We excluded participants with more than 80% missing variables (N=746), resulting in 498,103 (54% women) participants included in the main analyses. In the secondary analysis, which considered only healthy individuals, we excluded any UK Biobank participants who had any major disease or disorder before inclusion in the study (i.e. a Charlson comorbidity index of more than 0). This 'disease-free' subcohort included 355,043 (55% women) participants.

The study was approved by the North West Multi-centre Research Ethics Committee (MREC), and all participants provided written informed consent. The UK Biobank protocol is described in detail on the web and previously (Palmer LJ. UK Biobank: bank on it. Lancet 2007; 369(9578): 1980-2 and Sudlow C. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine 2015; 31;12(3)).

In our analyses, we used all relevant UK Biobank information that was available on 10th April 2014. This included 655 measurements and answers about demographic, health and lifestyle factors (variables). These variables were grouped into 10 categories:

  • Blood assays - Measurements in blood, such as counts of different cell types (e.g. white blood cells).
  • Cognitive function - Cognitive measurements, such as memory and speed of information processing. For example, reaction time test.
  • Early life factors - Factors such as birth weight and country of birth.
  • Family history - Family history of disease and health.
  • Health and medical history - Personal medical history of disease and medication.
  • Lifestyle and environment - Factors such as physical activity and diet.
  • Physical measures - Physical measures, such as height, fat mass, and blood pressure.
  • Psychosocial factors - Psychosocial factors, such as mood and risk-taking behaviour.
  • Sex-specific factors - Factors specific to males or females.
  • Sociodemographics - Factors including occupation, ethnic group, and housing conditions.

You can see the full list of variables by clicking on the dropdown categories in the Association Explorer.

We excluded variables that were missing in more than 80% of the participants and all cardio-respiratory fitness test measurements, since summary data were not available. Sixty-seven per cent of the variables had less than 5% missing participants, and 73% of the variables had less than 20% missing participants. Measures of eye function were collected in only approximately 120,000 participants and hence were missing for the largest proportion of individuals.

All continuous variables were categorized into quintiles and constrained to have at least 20 deaths per category. If this was not possible, the categories were collapsed until this constraint was satisfied. The Charlson comorbidity index was calculated using self-reported diseases, obtained through a verbal interview by a trained nurse.
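The binning rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the study's code: the function name `quintile_bins`, the rule of merging a sparse category with its right-hand neighbour (or the left-hand one for the last category), and the toy inputs are all hypothetical. The source states only that categories were collapsed until each contained at least 20 deaths.

```python
from statistics import quantiles

def quintile_bins(values, died, min_deaths=20):
    """Split values into quintiles, then merge adjacent bins until every
    bin holds at least `min_deaths` deaths (the merge rule is an assumption).
    Returns a list of bins, each a list of participant indices."""
    cuts = quantiles(values, n=5)              # four cut points -> five bins
    bins = [[] for _ in range(5)]
    for i, v in enumerate(values):
        k = sum(v > c for c in cuts)           # number of cut points below v
        bins[k].append(i)

    def deaths(b):
        return sum(died[i] for i in b)

    while len(bins) > 1:
        sparse = [k for k, b in enumerate(bins) if deaths(b) < min_deaths]
        if not sparse:
            break
        k = sparse[0]
        j = k + 1 if k + 1 < len(bins) else k - 1   # neighbour to merge with
        lo, hi = min(k, j), max(k, j)
        bins[lo:hi + 1] = [bins[lo] + bins[hi]]
    return bins
```

If no split satisfies the constraint, this sketch ends with a single category containing everyone.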

All participants were followed up from the date they were recruited by UK Biobank until 17th February 2014. All participants who died of any cause before 17th February 2014 were categorised as ‘All-cause mortality’. Information about causes of death was obtained from Health & Social Care Information Centre (formerly known as NHS Information Centre) for participants from England and Wales, and from NHS Central Register, Scotland for participants from Scotland. Detailed information about the linkage procedure is available here.

We also divided those who died before 17th February 2014 into six categories of causes of death: cancers, cardiovascular diseases, diseases of the respiratory system, diseases of the digestive system, external causes of mortality and morbidity, and other diseases.

We used the International Classification of Diseases, edition 10 (ICD-10) classification as follows:

  • Neoplasms, C00-D48; three most common death causes in males were lung cancer (C34), prostate cancer (C61) and oesophageal cancer (C15); three most common death causes in females were breast cancer (C50), lung cancer (C34), ovarian cancer (C56)
  • Diseases of the circulatory system, I05-I89; three most common death causes in males were chronic ischaemic heart disease (I25), acute myocardial infarction (I21), aortic aneurysm and dissection (I71); three most common death causes in females were acute myocardial infarction (I21), chronic ischaemic heart disease (I25), subarachnoid haemorrhage (I60)
  • Diseases of the respiratory system, J09-J99; three most common death causes in males were other chronic obstructive pulmonary disease (J44), other interstitial pulmonary diseases (J84), pneumonia, organism unspecified (J18); three most common death causes in females were other chronic obstructive pulmonary disease (J44), pneumonia, organism unspecified (J18), other interstitial pulmonary diseases (J84)
  • Diseases of the digestive system, K20-K93; three most common death causes in males were alcoholic liver disease (K70), vascular disorders of intestine (K55), fibrosis and cirrhosis of liver (K74); three most common death causes in females were vascular disorders of intestine (K55), alcoholic liver disease (K70), fibrosis and cirrhosis of liver (K74)
  • External causes of mortality and morbidity, V01-Y84; e.g. suicide
  • Other diseases, all remaining ICD-10 codes; three most common death causes in males were spinal muscular atrophy and related syndromes (G12), other ill-defined and unspecified causes of mortality (R99), unspecified dementia (F03); three most common death causes in females were spinal muscular atrophy and related syndromes (G12), other ill-defined and unspecified causes of mortality (R99), other septicaemia (A41)
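The grouping above can be expressed as a small lookup. This is a minimal sketch that assumes each cause of death arrives as a well-formed ICD-10 code (a letter followed by two digits, optionally with a decimal suffix); the function name `cause_category` is hypothetical.

```python
def cause_category(icd10_code):
    """Map an ICD-10 code such as 'C34' or 'I21.9' to one of the six
    cause-of-death groups listed above."""
    key = icd10_code.strip().upper()[:3]       # letter + two digits, e.g. 'C34'
    ranges = [
        ("C00", "D48", "Neoplasms"),
        ("I05", "I89", "Diseases of the circulatory system"),
        ("J09", "J99", "Diseases of the respiratory system"),
        ("K20", "K93", "Diseases of the digestive system"),
        ("V01", "Y84", "External causes of mortality and morbidity"),
    ]
    for low, high, label in ranges:
        # Three-character codes compare correctly as plain strings.
        if low <= key <= high:
            return label
    return "Other diseases"
```

For example, `cause_category("C34")` falls in the neoplasms group, while `cause_category("G12")` falls through to "Other diseases".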

We used the UK Biobank data in two ways:

  1. To investigate how well particular variables can predict death within five years. You can explore the results using the Association Explorer. See the FAQs for more information.
  2. To predict the risk of dying within five years for 40-70 year old men and women living in the UK. You can get an estimate of your individual risk and calculate your ‘Ubble age’ with our Risk Calculator. See the FAQs for more information.

For detailed information on how we analysed the data, please see our paper.

We studied the association between each variable and death within five years by analysing survival data (using a statistical model known as a Cox model). When the association differed between age categories (which was assessed using a statistical metric called Schoenfeld residuals), we did an age-stratified analysis, where we examined each age group separately. For each variable, we calculated a measure called the C-index to determine how accurately the variable could predict death within five years.

The predictions of ‘five-year risk of death’ and ‘Ubble age’ were obtained using a multivariable survival model. This means that the model analysed the answers to the Risk Calculator questionnaire in combination rather than each answer separately: 13 questions (variables) for men and 11 for women (see more info about how questions were selected here). To avoid over-estimating the accuracy of prediction, we calculated the score using only a portion of the UK Biobank participants. We used the remaining participants (those enrolled in the Scottish centres) to validate the score, i.e. to determine how well it performed at predicting mortality using the C-index.

Statistical analyses
Imputation
We imputed missing data separately for men and women, using the 'multiple imputation by chained equations' approach, with five imputed datasets and ten iterations. For each variable, we specified a predictive mean matching model, including the ten most correlated predictors of the variable or of the missing status, the Nelson-Aalen estimate of cumulative hazard, the event indicator and self-reported health. All the analysis results were aggregated using Rubin’s rule after appropriate transformation. We checked whether the imputations were acceptable by comparing plots of the distribution of recorded and imputed values for all measurements.
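The pooling step can be illustrated with Rubin's rule for a single coefficient. This is a sketch only: the function name `pool_rubin` and the numbers in the usage note are hypothetical, and the estimates are assumed to be already on a suitable scale (for example log hazard ratios), which is one reading of "after appropriate transformation".

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine one coefficient estimated in each of m imputed datasets.
    Returns the pooled estimate and its total variance."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)                   # average within-imputation variance
    between = variance(estimates)              # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total
```

With five imputed datasets, `pool_rubin([0.10, 0.12, 0.11, 0.09, 0.13], [0.0004] * 5)` averages the five estimates and inflates the variance to reflect the disagreement between imputations.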

Univariable analysis
We studied the sex-specific association of each variable with all-cause and cause-specific mortality using a Cox proportional hazard model or a cause-specific proportional hazard model with age as time-scale. The most common category within each variable was used as reference. The hazard proportionality assumption was considered violated if the test based on Schoenfeld residuals had a P-value < 0.00001. To model the age-dependent effect, we used an extended Cox model with three unit step functions for individuals younger than 53, between 53 and 62, and older than 62 years, respectively. These thresholds represent the tertiles of the age distribution in the population. Hence, hazard ratios were obtained for each age category.

Prediction
The prediction model was developed in the entire dataset excluding participants enrolled at the Scottish centres (N=35,810), who were used for validation (see below). First, we conducted a sex-specific univariate analysis, using time-in-study as time scale. Age was added as a covariate in the model, and an interaction with age was included if the hazard proportionality assumption was violated, i.e. if a test based on Schoenfeld residuals had a P-value < 0.0001. Discrimination was assessed based on ten-fold cross-validated Harrell’s C-index accounting for competing risk. All C-indices reported include the effect of age in addition to the examined covariates. Second, we chose the 20 variables with the highest C-index for each cause-specific mortality category. We excluded variables that were not self-reported, and hence unsuitable for inclusion in an online questionnaire. Third, we used a backward stepwise variable selection approach with the Akaike information criterion (AIC) as criterion to select the variables to include in the final prediction model. The score was geographically validated on participants enrolled at the Scottish centres.
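The third step, backward stepwise selection by AIC, can be sketched generically. The code below is an illustration, not the study's implementation: `aic_of` is a hypothetical callable that fits a model on the given variables and returns its AIC (in the study this would be a Cox model), and the stopping rule shown is the common "stop when no single removal lowers the AIC" convention, which is an assumption here.

```python
def backward_stepwise(candidates, aic_of):
    """Start from all candidate variables and repeatedly drop the one
    whose removal lowers the AIC most; stop when no removal helps.
    `aic_of(variables)` is a hypothetical model-fitting callable."""
    selected = list(candidates)
    best = aic_of(selected)
    while len(selected) > 1:
        # AIC of every model with exactly one variable removed.
        trials = [
            (aic_of([v for v in selected if v != drop]), drop)
            for drop in selected
        ]
        trial_aic, drop = min(trials)
        if trial_aic >= best:
            break                              # no removal improves the fit
        best = trial_aic
        selected.remove(drop)
    return selected, best
```

Passing the fitting routine in as a function keeps the sketch independent of any particular survival-analysis library.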

Calibration
Calibration was assessed using calibration plots and Hosmer-Lemeshow tests based on risk deciles. To obtain a five-year mortality risk representative of the UK population, we reweighted the baseline hazard using life-tables from England and Wales from the years 2009-2011. We further used census information from the year 2011.
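A calibration plot based on risk deciles compares, within each tenth of the predicted-risk distribution, the mean predicted risk with the observed proportion of deaths. The sketch below builds that table. It is a simplification under stated assumptions: every participant is treated as having complete five-year follow-up (no censoring), and the function name `calibration_by_decile` is hypothetical.

```python
def calibration_by_decile(predicted, died, groups=10):
    """Return one (mean predicted risk, observed death proportion) pair
    per risk group, ordered from lowest to highest predicted risk."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    table = []
    for g in range(groups):
        start = g * len(order) // groups
        stop = (g + 1) * len(order) // groups
        idx = order[start:stop]
        if not idx:
            continue                           # fewer participants than groups
        mean_predicted = sum(predicted[i] for i in idx) / len(idx)
        observed = sum(died[i] for i in idx) / len(idx)
        table.append((mean_predicted, observed))
    return table
```

A well-calibrated score gives pairs that lie close to the diagonal when plotted against each other.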

For a more detailed description, please refer to the Lancet paper.

This project has been carried out with funding from the Knut and Alice Wallenberg Foundation and the Swedish Research Council. The authors do not have any conflicts of interest to declare.

This research has been conducted using the UK Biobank Resource and we would like to thank the UK Biobank participants and investigators for making this study possible.

We would also like to thank Sense About Science for providing invaluable input regarding the content of this website.

As the present study was based on data from the UK Biobank, we fully adhere to the UK Biobank ethics and governance framework set up by the UK Biobank. Since the beginning of the project, the funders of the UK Biobank have been committed to an ethically sound approach to the collection, storage and use of samples, and to extensive public consultation to identify public concerns and priorities. They have also sought to involve other key stakeholder communities, such as public health professionals, who would be instrumental to the project's success. These consultations and ethical enquiries have been providing valuable information that helped shape the policies and practice of the UK Biobank. You can find more information at the following webpages:

http://www.ukbiobank.ac.uk/ethics/
http://egcukbiobank.org.uk/
http://www.ukbiobank.ac.uk/wp-content/uploads/2011/05/EGF20082.pdf
http://www.wellcome.ac.uk/about-us/publications/reports/biomedical-ethics/wtd003284.htm

B: Association Explorer

The Association Explorer is an interactive graph where you can explore how closely 655 variables from the UK Biobank study are associated with different causes of death. It also shows how accurately a variable can predict death within five years. The Association Explorer only investigated the associations of variables with death within five years, and does not claim that any variables cause death. When trying to address whether a variable could be a cause of death, researchers would need to make adjustments to rule out other factors (confounders) that might be influencing its relationship with death. This has not been done in this study.

The C-index is a measurement of risk that calculates how well each variable can predict death within five years. It does this by evaluating how well each variable can discriminate between those who will die within five years and those who will not. We used the cause of death information and the 655 UK Biobank measurements to calculate the C-index. The higher the C-index, the more accurate its prediction ability. For example, in men, the measure of 'Usual walking pace' is a more accurate predictor of death within five years (C-index = 0.72), than the measure of 'Number of days per week of moderate physical activity' (C-index = 0.68).

In general, a C-index of:

  • 50-60% is considered poor
  • 60-70% is considered moderate
  • 70-80% is considered good
  • 80-90% is considered very good
  • >90% is considered excellent

R2 measures how closely each variable is associated with age. The higher the R2 value, the more strongly associated the variable is with age. For example, for men, 'Weekly usage of mobile phone in last 3 months' is more strongly associated with age (R2 = 0.2), than 'Salt added to food' (R2 = 0.001).

The C-index (the measure on the vertical axis of the graph) was chosen as this is a commonly used measure of how well a risk prediction discriminates between those who will die and those who will not. Association with age (the measure on the horizontal axis of the graph) was chosen because age is the strongest predictor of death and because many other variables vary significantly with age.

The ability of a variable to discriminate between those who will die within five years and those who will not is evaluated with a measure called the C-index. This measure represents the probability that, given two participants, one alive at five years and one who died, the surviving participant has a lower predicted risk of dying than the one who actually died. The higher the C-index, the more accurate a variable is at discriminating between the two.
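The pairwise definition above translates directly into code. The sketch below is a simplified illustration, not Harrell's C-index as used in the study: it ignores censoring, competing risks and the timing of deaths, and counting tied risks as half-concordant is an assumption. The function name `c_index` is hypothetical.

```python
def c_index(risk, died):
    """Proportion of (died, survived) pairs in which the participant who
    died had the higher predicted risk; ties count as one half."""
    concordant = 0.0
    pairs = 0
    for risk_dead, d in zip(risk, died):
        if not d:
            continue
        for risk_alive, other_d in zip(risk, died):
            if other_d:
                continue
            pairs += 1
            if risk_dead > risk_alive:
                concordant += 1.0
            elif risk_dead == risk_alive:
                concordant += 0.5
    if pairs == 0:
        raise ValueError("need at least one death and one survivor")
    return concordant / pairs
```

A value of 0.5 corresponds to chance and 1.0 to perfect discrimination, which matches the 50% to 100% scale used in the list of C-index ranges.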

In general, a C-index of:

  • 50-60% is considered poor
  • 60-70% is considered moderate
  • 70-80% is considered good
  • 80-90% is considered very good
  • >90% is considered excellent

The best predictor of mortality in male UK Biobank participants was 'Self-reported health' (i.e. asking people to rate their overall health) (C-index = 0.74). For female participants, the best predictor of mortality was 'Previous cancer diagnosis' (C-index = 0.73).

This study also did a secondary analysis which only looked at healthy individuals; this excluded UK Biobank participants who had any major disease or disorder before becoming involved in the study. For the healthy participants, the best predictor of mortality was 'Past tobacco smoking' (C-index = 0.71 in males and 0.69 in females).

Men and women are likely to differ in terms of which measurements (variables) increase or decrease their risk of dying. These variations could arise because men and women die from different causes or respond differently to measurements and questionnaires, for social and demographic reasons, or simply because men and women are biologically different.

In general, variables that can simply be reported by an individual through a questionnaire (without physical examination) were the strongest predictors of death from all causes. For example, asking people to rate their overall health ('Self-reported health') and to describe their usual walking pace were two of the strongest predictors in both genders and across different causes of deaths. This also explains why answers from questionnaires can provide relatively effective predictors of mortality, and how the Risk Calculator can be based on simple questions only.

UbbLE has been developed as a resource for the public, researchers and anyone working with public health advice or social policy to improve understanding of factors that might increase or reduce life expectancy in the UK. UbbLE can be used by:

  • Individuals to increase awareness of their health and to provide incentives for lifestyle changes.
  • Researchers as a starting point for future research.
  • Governmental and health organisations to inform public health advice and social policy.

The Association Explorer is developed from investigating the association between 655 measurements (variables) in the UK Biobank and death within five years, both from all causes of death and death from six specific causes. Therefore, by design, it cannot tell us anything about other measurements that were not examined, about predictions of death beyond five years, or about predictions of death in countries other than the UK.

Importantly, the Association Explorer only shows associations (the relationship) of each variable with death within five years, but does not provide any information about whether the associations are causal or not; i.e. it does not claim that any variables cause death. When trying to address whether a variable could be a cause of death, researchers would need to make adjustments to rule out other factors (confounders) that might be influencing its relationship with death. This has not been done in this study.

C: Risk Calculator

The Risk Calculator estimates an individual’s risk of dying within five years (‘five-year risk’). It is based on a prediction algorithm and calculates this risk by using answers to the Risk Calculator questionnaire (self-reported information) and information from the Association Explorer. It then uses the estimated individual risk to calculate an individual’s ‘Ubble age’ (see below).

See our disclaimer for more information.

Five-year risk of dying means the predicted absolute risk of dying within the next five years. The same absolute risk can be expressed in different ways. For example, if you have a 1 in 10 risk of dying in the next five years, this can also be described as a 10 % risk (using percentages), or a 0.1 risk (using decimals).

This risk calculator gives an estimate of how many people with similar answers will live and die within the next five years. However, it does not predict the future for any one individual; it cannot identify who will live and who will die.

See our disclaimer for more information.

‘Five-year risk of dying’ is estimated using the responses from a questionnaire, which asks women 11 questions and men 13 questions, as well as information from the Association Explorer. The risk has been reweighted to be generalizable to the entire UK population. The questions include, for example, 'How many cars or vans are owned, or available for use, by you or members of your household?' and 'Do you smoke tobacco now?'

To do this, the Risk Calculator uses a prediction algorithm.

The ‘five-year risk of dying’ for individual i can be written mathematically as $R_i(5) = 1 - S_0(5)^{\exp(f(x_i, M))}$, where $f(x_i, M) = \beta_1 (x_{i,1} - M_1) + \dots + \beta_p (x_{i,p} - M_p)$. Here $\beta_1, \dots, \beta_p$ are the regression coefficients for each measurement obtained from the prediction algorithm (Cox model) for overall mortality, and $M_1, \dots, M_p$ are the means of each measurement in the UK Biobank population. For categorical risk factors, $M$ is the frequency of the category. $S_0(5)$ is the baseline survival at the mean values of the risk factors.
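The formula above can be evaluated in a few lines. This is a minimal sketch: the function name `five_year_risk` is hypothetical, and the coefficient, mean and baseline survival in the usage note are made-up values for illustration, not the study's estimates.

```python
import math

def five_year_risk(x, beta, means, baseline_survival):
    """Five-year risk from a Cox model: 1 - S0(5) ** exp(f), where
    f = sum of beta_j * (x_j - M_j) over the risk factors."""
    linear_predictor = sum(b * (xj - m) for b, xj, m in zip(beta, x, means))
    return 1.0 - baseline_survival ** math.exp(linear_predictor)
```

A person whose answers equal the population means has a linear predictor of zero, so their risk is simply 1 minus the baseline survival. With a hypothetical baseline survival of 0.98 and a single risk factor that doubles the hazard, `five_year_risk([1.0], [math.log(2)], [0.0], 0.98)` is about 0.0396, roughly twice the baseline risk of 0.02.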

The Risk Calculator uses an individual’s absolute risk to calculate their ‘Ubble age’.

The risk score is estimated based on the responses to 11 or 13 questions in women or men, respectively. Examples of questions included in the score are 'How many cars or vans are owned, or available for use, by you or members of your household?' and 'Do you smoke tobacco now?'

Ubble age is the age at which the average person in the UK has a risk of dying in the next five years most similar to the estimated absolute risk of the individual entering information in the Risk Calculator. If your Ubble age is higher than your actual age, you have a higher five-year mortality risk than the average person of your age in the UK. Conversely, if your Ubble age is lower than your actual age, you have a lower five-year mortality risk than the average person of your age in the UK.

The Risk Calculator calculates Ubble age by comparing the individual ‘five-year risk of dying’ (calculated from the questionnaire responses and weighted to be generalizable to the entire UK population) to UK life tables, and selects the age at which the risk of dying is most similar. UK life tables report the probability of dying in the next five years for an average person from the UK, given a certain age and gender.

For example, if you are a woman of any age between 40 and 70 and your estimated risk of dying within five years is 2.4% (calculated from the questionnaire responses), the most similar risk in the UK life tables is the average risk for a 56-year old woman. Hence, your Ubble age is 56 years.

We first define $R_i(5)$ as the five-year absolute risk for an individual i. A more formal definition is given in the answer to this question. This quantity is recalibrated to be representative of the UK population as described in the answer to this question. The probability of surviving five years for individual i is then $S_i(5) = 1 - R_i(5)$.
Similarly, we define $S_a^{UK}(5)$ as the five-year UK survival probability for a range of ages a. This quantity is obtained from the UK life tables from the years 2009-2011 and described in our paper. We define the Ubble age of an individual i as the age whose UK survival is closest to the individual's estimated survival. In mathematical terms:

$\mathrm{UBBLEage}_i = \arg\min_{18 < a < 90} \left| S_i(5) - S_a^{UK}(5) \right|$
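The definition above amounts to a nearest-value lookup in a life table. The sketch below assumes the life table is available as a mapping from age to five-year survival probability for the relevant sex; the function name `ubble_age` and the table values in the usage note are hypothetical.

```python
def ubble_age(individual_risk, uk_survival_by_age):
    """Return the age a (18 < a < 90) whose five-year UK survival is
    closest to the individual's estimated five-year survival."""
    survival = 1.0 - individual_risk
    ages = [a for a in uk_survival_by_age if 18 < a < 90]
    return min(ages, key=lambda a: abs(survival - uk_survival_by_age[a]))
```

For instance, with a made-up table `{55: 0.980, 56: 0.976, 57: 0.972}`, an estimated risk of 2.4% corresponds to a survival of 0.976, so `ubble_age(0.024, table)` selects age 56.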

In our analyses, we found that the measures that most accurately predicted death from all causes within five years did not need to be measured by physical examination, but could be reported by the individual in response to a questionnaire. For example, asking people to rate their overall health (self-reported health) and to describe their usual walking pace were two of the strongest predictors in both men and women for different causes of death. This is why using questionnaires can provide relatively effective predictors of death, and why the Risk Calculator could be based on simple questions alone.

The questions used in the Risk Calculator were selected from the 655 measurements in UK Biobank and asked in the same way. To select questions, we used a computer-based approach to automatically select the combination of questions that gave the most accurate prediction of death within five years. The most accurate combination was found to be 13 questions for men, and 11 for women.

First, we chose the 20 variables with highest C-index for each cause-specific mortality category after excluding variables that were not self-reported, and hence unsuitable for inclusion in an online questionnaire. Self-reported measurements include all those variables that are directly obtained by asking the participants and not assessed through a medical specialist or a medical device. Second, we used a backward stepwise variable selection approach with Akaike information criterion (AIC) as criterion to select independent variables to include in the final prediction model. Finally, the score was geographically validated in participants enrolled at the only two Scottish centres (N=35,810) that were part of the UK Biobank.

Some predictors of death are gender-specific. Therefore, to maximise the accuracy of the Risk Calculator, the automatic selection process was applied separately for men and women.

The variable 'sex' in the UK Biobank did not include options other than male or female, which is why the Risk Calculator does not include transgender options.

The ability of a ‘five-year risk’ score to distinguish between those who will die within five years and those who will not is evaluated with a risk measure called C-index.

A C-index of:

  • 50-60% is considered poor
  • 60-70% is considered moderate
  • 70-80% is considered good
  • 80-90% is considered very good
  • >90% is considered excellent

When tested, the Risk Calculator’s ‘five-year risk’ score had a C-index of 80% for men and 79% for women. Therefore, it is considered to provide good to very good discrimination between those who will die and those who will not. This means the questionnaire-based Risk Calculator gives a reasonably good prediction of the chance of dying within five years.

See our disclaimer for more information.

Yes, although it has only been developed using UK Biobank participants, the Risk Calculator works well under the assumption that the associations observed in the UK Biobank can be generalised to the entire UK population. This is very likely to be true in most instances. In addition, we have used a strategy that uses UK life tables and UK census information to make the risk score generalisable to the whole UK population.

The definition of the ‘five-year risk of dying’ $R_i(5)$ is given in the answer to this question. Briefly, $R_i(5) = 1 - S_0(5)^{\exp(f(x_i, M))}$.

To fully recalibrate the prediction score in the UK population, three pieces of information are needed:

  1. The β regression coefficients calculated using the entire UK population;
  2. M obtained from the entire UK population;
  3. S0(5) obtained from the entire UK population.

The first quantity cannot be obtained, since individual-level data for each risk factor in the entire UK population are not available. Therefore, we assume that the β coefficients obtained in the UK Biobank are generalizable to the entire UK population.
The second quantity is also difficult to obtain. However, some of the variables included in the prediction score have also been collected in the 2011 UK census. Specifically, self-reported overall health and number of cars or vans owned are both variables available from the census and included in the prediction score. For the other risk factors, we need to assume that the observed average in the UK Biobank is similar to that in the UK population.
Finally, the third quantity can be obtained from the UK life tables.

A detailed description on the weights used for calibrating the score is given in the Supplementary Appendix of our paper.

It is not known how accurate the Risk Calculator will be at predicting death within five years in people from other countries. If the variables have similar associations with death, the prediction will be accurate in other settings too. Therefore, it is likely that the prediction works fairly well in countries that are similar to the UK in terms of the distribution of demographic and socioeconomic factors, provision of healthcare, and lifestyle and risk factor distribution. However, there is no way of knowing this for sure without carrying out a similar study in each country.

The name brings to mind the 'Hubble Space Telescope', which also aims at exploring a large unknown space (and if you are Italian like Andrea, Ubble and Hubble also tend to sound very similar). UbbLE can also be used as an acronym for ‘UK Longevity Explorer’.

The Risk Calculator can be used to estimate an individual’s risk of dying within the next five years. It then uses the estimated risk to calculate an individual’s ‘Ubble age’.

However, a risk calculator can never predict the future for any specific individual in a deterministic sense; it cannot identify who will live and who will die. Instead, its results should be interpreted at a population level. For example, a 2% risk of dying within five years should be interpreted as: of 100 people of the same age, sex and risk profile, 2 are expected to die and 98 to survive over the next five years.

No, Ubble age is based on the risk of dying within the next five years. Therefore, this estimate cannot be extended to life-long predictions of your life expectancy. Further, the estimation of Ubble age might change once more data become available from UK Biobank and if the study is extended to investigate death within ten years, for example.

In addition to the Risk Calculator being used to predict Ubble age, we hope our research can be used by individuals to improve awareness of their own health, providing incentives for lifestyle changes. Doctors and public health experts may also use this information to identify and target high-risk patients with specific interventions, and policy makers may use it to consider social factors that impact our health.

Although most of the questions in the Risk Calculator measure factors that cannot be altered by individuals making changes to their lifestyles, several can. Repeated studies have shown that increasing physical activity, stopping smoking and eating a healthy diet can improve lifespan and reduce the risk of developing major diseases.

If you have any worries or questions about your health and lifestyle, you can get more information at NHS Choices Live Well. Questions about your results can be directed to Sense About Science.