
Data, Datasets, Databases

Contemporary population research often involves applying sophisticated analytic models to large, complex datasets. DuPRI researchers are leaders in (1) the development of key datasets that hold great potential for contributions to demographic knowledge and (2) the analysis of several well-known datasets developed in other projects or at other centers or institutes. Researchers work extensively with various files and datasets; selected examples are listed below. For datasets not mentioned here, please contact DuPRI Administrative Director Heather Tipaldos (heather.tipaldos@duke.edu).

These datasets have been developed, in part or fully, by DuPRI researchers:

Developed, in part or fully, by DuPRI researchers

Chinese Longitudinal Healthy Longevity Survey (CLHLS)

The Chinese Longitudinal Healthy Longevity Survey (CLHLS) is the world's largest survey of centenarians, nonagenarians, and octogenarians, with a comparative group of younger elderly, and one of the world's largest surveys concerning health and aging. The CLHLS interviewed 8,959, 11,161, 16,057, and 14,923 elderly in 1998, 2000, 2002, and 2005, respectively. Across the four waves, 10,879, 13,985, 16,505, and 9,731 face-to-face interviews were conducted with centenarians, nonagenarians, octogenarians, and elders aged 65-79, respectively. At each wave, the longitudinal survivors were re-interviewed, and interviewees who had died or were lost to follow-up were replaced by additional participants. Data on mortality and health status before dying for the 12,136 elders aged 65-112 who died between the 1998, 2000, 2002, and 2005 waves were collected in interviews with a close family member of the deceased. The CLHLS 2008 wave, with an expected sample size of about 21,000 interviewees, is now in the field. Questions cover many areas, such as disability, health, cognitive ability, siblings, smoking/drinking habits, exercise, living arrangements, children, nutrition, housing, and other socioeconomic issues. The study's extensive data are used to investigate which determinants, out of a large set of social, behavioral, biological, and environmental risk factors, are important for healthy longevity. CLHLS datasets are available as cross-sectional data or longitudinal data through the programs at Duke University and Peking University. So far, there are 236 registered CLHLS data users (excluding their students/associates).
According to incomplete records up to June 2008, the longitudinal healthy longevity study datasets have resulted in five books, 27 papers published in international peer-reviewed journals, 103 papers published in Chinese peer-reviewed journals, 93 English-language papers presented at international conferences, 6 Ph.D. dissertations, and 16 MA theses. The study's PI is Duke faculty member Zeng Yi. (NIA funded much of this work.)
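The two breakdowns of CLHLS interviews given above (by wave and by age group) should describe the same set of interviews. The short Python sketch below is only an illustrative consistency check on the figures quoted in the text; the variable names are ours and are not part of any CLHLS release.

```python
# Interview counts as quoted in the CLHLS description above.
interviews_by_wave = {1998: 8_959, 2000: 11_161, 2002: 16_057, 2005: 14_923}
interviews_by_age_group = {
    "centenarians": 10_879,
    "nonagenarians": 13_985,
    "octogenarians": 16_505,
    "aged 65-79": 9_731,
}

# Sum each breakdown; both should count the same interviews.
total_by_wave = sum(interviews_by_wave.values())
total_by_age_group = sum(interviews_by_age_group.values())

print(total_by_wave, total_by_age_group)
```

Both sums come to 51,100 interviews, so the per-wave and per-age-group figures agree.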
VIEW WEBSITE

Mexican Family Life Survey Database

The Mexican Family Life Survey Database is currently being developed with funding until 2013 from NICHD. A related database will result from a new Longitudinal Study of Older Adults and Their Families in Mexico, funded by NIA as of August 2008. Duncan Thomas (DuPRI), PI, and Elizabeth Frankenberg (DuPRI) lead the design and implementation of both of these NIH-funded studies in Mexico. The Mexican Family Life Survey (MxFLS) is a multi-dimensional, longitudinal survey of individuals, households, families, and communities in Mexico. The baseline, conducted in 2002, interviewed over 35,000 respondents and is representative of the Mexican population. The first follow-up in 2005 re-interviewed over 90% of the respondents. Follow-ups are scheduled for 2009 and 2011. The study collects a broad array of socioeconomic, demographic, and health indicators from each respondent. Older adults are interviewed, along with all other household members, and the modules are designed to be as comparable with HRS as feasible. Two features of the study design are important for DuPRI work. First, MxFLS collects an extensive battery of biomarker information, including anthropometry, blood pressure, cholesterol, inflammation, HbA1c (a marker for diabetes), and hemoglobin levels. Given the very high rates of obesity in Mexico, which exceed those in the United States, this information opens up important new avenues for research into the social context of obesity while also speaking to the emerging global epidemic of obesity. Second, the survey follows respondents who move in order to minimize the impact of attrition on analyses. This includes following respondents who move to the United States. In the first follow-up, over 90% of the respondents who moved to the United States were interviewed there.
These data provide unparalleled opportunities to test hypotheses about the Hispanic paradox: the observation that Hispanics living in the United States are poorer than native-born whites but in better health. The study received the First Latin American Regional Award for Innovation in Statistics from the World Bank. (NICHD and NIA are funding these studies.) Duncan Thomas is PI of MxFLS.
VIEW WEBSITE

1905-COHORT, LSADT, and DOS databases

The 1905-COHORT, LSADT, and DOS databases on the old and the oldest-old are from studies carried out in Denmark by Kaare Christensen (DuPRI) and James Vaupel (DuPRI). The 1905-COHORT study (a 4-wave study) follows all Danes born in 1905 and living in Denmark. In the study's first year, 1998, 2,262 people participated, and in its last year, 2005, 256 participated. Data for the 1905-COHORT database are drawn from assessments and physical tests given to the participants as well as from record linkages to Denmark’s national population and health registries. The LSADT (Longitudinal Study of Aging Danish Twins, a 6-wave study) and DOS (Danish Oldest-Old Siblings, a study with a follow-up/longitudinal and family perspective) databases are similar to the 1905-COHORT database: all three studies used almost identical medical, physical, cognitive, and psychological assessments, and each participant’s data records have been linked to Denmark’s national registries for more information. LSADT involves data for 4,700 people, and DOS (still underway) is aiming for 2,500 people. (NIA funded this work.) Study PIs are Duke faculty member James W. Vaupel and Duke Senior Research Scientist Kaare Christensen.

BABASE (the Baboon database)

BABASE (the Baboon database) is a long-term, still-growing database covering nearly 35 years and updated twice yearly. It was developed by researchers Susan Alberts (DuPRI) and Jeanne Altmann (Princeton) to record field data collected on the Amboseli baboons (approximately 300 individuals each year). The database includes a wide range of information, from demographic data such as age, fertility, paternity, health status, migration status, and time/cause of death to behavioral, other reproductive (including hormonal), locational, ecological, and meteorological data. (NSF funded much of this work.) Duke PI is biology professor Susan C. Alberts.

National Long Term Care Survey

The National Long Term Care Survey (NLTCS) database, developed with direction from Duke faculty including K. Manton, Yashin (DuPRI), and Stallard (DuPRI), provides information on Medicare recipients aged 65 years or older, with emphasis on those with functional impairment. The Duke-managed 2004 NLTCS sample (the 6th wave of the study) had a longitudinal component of about 13,300 people who had responded to at least one of the previous five surveys, an “aged-in” component of 5,600 people who had turned 65 since the previous (1999) survey, and an additional component of 1,000 people aged 95 or older. With a total sample size for one year of close to 20,000, with multiple waves of similar size, and with face-to-face assessments plus Medicare record linkages, the database is an unparalleled resource on aging and disability in the US. The NLTCS multi-wave database provides data on the prevalence and patterns of both physical and cognitive limitations as well as detailed information for each respondent on health care services used, demographic and economic characteristics, and health costs and expenditures. (NIA funded much of this work.) Contact is Duke Research Professor Anatoli I. Yashin.
VIEW WEBSITE

Study of the Tsunami Aftermath and Recovery (STAR)

The Study of the Tsunami Aftermath and Recovery (STAR) database comes from a project, led by PI and faculty member Elizabeth Frankenberg (DuPRI), that continues to field and analyze a multi-wave longitudinal survey, with biomarker measures, of some 40,000 individuals in areas of Sumatra affected by the December 26, 2004 tsunami and in nearby comparison areas. Baseline data were collected in February 2004, prior to the tsunami. The first re-survey took place between May 2005 and May 2006. Survival status was ascertained for 96 percent of the respondents to the 2004 baseline, and interviews were conducted with 94 percent of known survivors. The second re-survey began in July 2006 and concluded in June 2007. Three additional annual follow-up surveys that focus on the evolution of well-being during the reconstruction phase will be conducted. The database provides information on mortality and the well-being of older adults across a range of health and economic outcomes. It can be used to assess the extent to which older individuals were particularly vulnerable to the tsunami's consequences. (NIA funded and is funding this work.) Duke PI is Elizabeth Frankenberg.
VIEW WEBSITE

Duke Social Security/Medicare Dataset

The Duke Social Security/Medicare Dataset consists of Master Beneficiary Records from the Supplementary Medical Insurance program (Medicare Part B) merged by social security number to records from the Numerical Identification Files (NUMIDENT) of the Social Security Administration. The data are serviceably complete from 1976 to 2001. There are over 70 million records in the dataset, covering 95% of the population aged 65 years and older. Because enrollment requires proof of age, age validity is high compared with other data sources for the US elderly population. In addition to race, sex, and age, information includes entitlement status (primary versus auxiliary beneficiary), zip code of the place of residence, and place of birth (including foreign countries). An advantage of using this dataset is that death and population counts are based on the same data source. Under the agreement between Duke University and SSA, the data have been compiled by and can be used for 2 grants (James Vaupel, PI), on oldest-old mortality (P01 AG08761) and mortality surface analysis (R01 AG18444), until June 30, 2009. Vaupel intends to request permission to continue to use the data as part of the research activity of DuPRI. Access to the individual records is currently restricted to Duke University “personnel, their fellows, or collaborators”. (NIA funded and is funding this work.) Duke contact is Magdalena M. Muszynska.
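To illustrate what "merged by social security number" means in practice, the Python sketch below links two small sets of records on a shared identifier. Everything in it is hypothetical: the records are synthetic, the field names (`ssn`, `zip_code`, `birth_year`, `place_of_birth`) are our own, and the choice of an inner join (keeping only records found in both sources) is an assumption; the text does not describe the actual file layouts or linkage procedure.

```python
# Synthetic example records; field names are illustrative assumptions,
# not the real layout of the Master Beneficiary or NUMIDENT files.
mbr_records = [
    {"ssn": "000-00-0001", "sex": "F", "zip_code": "27708"},
    {"ssn": "000-00-0002", "sex": "M", "zip_code": "27701"},
]
numident_records = [
    {"ssn": "000-00-0001", "birth_year": 1920, "place_of_birth": "NC"},
    {"ssn": "000-00-0003", "birth_year": 1915, "place_of_birth": "Italy"},
]


def merge_by_ssn(left, right):
    """Inner-join two lists of dict records on the 'ssn' key."""
    # Index the right-hand records by identifier for constant-time lookup.
    right_index = {rec["ssn"]: rec for rec in right}
    merged = []
    for rec in left:
        match = right_index.get(rec["ssn"])
        if match is not None:
            # Combine the fields from both sources into one linked record.
            merged.append({**rec, **match})
    return merged


linked = merge_by_ssn(mbr_records, numident_records)
print(linked)
```

Only the identifier present in both synthetic sources yields a linked record; unmatched records are dropped under the inner-join assumption.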


Purchased or otherwise acquired for the pursuit of research at Duke

Framingham Heart Study (FHS) Dataset

The FHS includes 14,428 participants and is a unique source for almost all major types of data currently measured in large-scale longitudinal population studies. (1) The FHS Original Cohort was launched at Exam 1 in 1948 (9/1948 - 4/1953) and has continued with biennial examinations to the present. The FHS Original Cohort consists of 5,209 respondents (55% females) aged 28–62 years residing in Framingham, Massachusetts, between 1948 and 1951. Nearly all subjects were Caucasians. Examination included an interview, physical examination, and laboratory tests. (2) The Offspring Cohort (FHSO) was launched at Exam 1 in 1971 (8/1971 - 9/1975) and has on average been examined every 3 to 4 years since enrollment. The FHSO dataset consists of a sample of 3,514 biological descendants of the Original Cohort, 1,576 of their spouses, and 34 adopted offspring, for a total sample of 5,124 subjects (52% females). Contact is Duke Research Professor Anatoli I. Yashin.

The FHSO (Offspring Cohort) subjects were enrolled in 1971-1975 using research protocols similar to those of the FHS so that results from the FHSO and the FHS could be compared. Beginning in 2002, 4,095 adults (53% females) having at least one parent in the Offspring Cohort enrolled in the Third Generation Cohort. The objective of the new recruitment was to complement phenotypic and genotypic information obtained from prior generations, with priority assigned to larger families. Across all three generations, 99.7% of participants self-reported their ethnicity as white, reflecting the population of Framingham in 1948.
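The cohort sizes quoted in this section can be cross-checked against the stated totals. The Python sketch below is a simple illustrative check using only numbers from the text; the variable names are ours.

```python
# Offspring Cohort components as quoted above.
fhso_components = {
    "biological descendants": 3_514,
    "spouses": 1_576,
    "adopted offspring": 34,
}
fhso_total = sum(fhso_components.values())

# The three FHS cohorts as quoted above.
cohort_sizes = {
    "Original Cohort": 5_209,
    "Offspring Cohort": fhso_total,
    "Third Generation Cohort": 4_095,
}
fhs_total = sum(cohort_sizes.values())

print(fhso_total, fhs_total)
```

The Offspring Cohort components sum to 5,124, and the three cohorts together sum to 14,428, matching the totals given in the text.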

Phenotypic traits collected in the FHS cohorts over 59 years of follow-up that are relevant to Center research include: physiological indices, disease risk factors and biomarkers (e.g., blood pressure, pulse pressure, pulse rate, blood glucose, cholesterol, height, weight, hematocrit), ages at disease onsets, behavioral and life history characteristics, selected markers of aging and stress resistance (e.g., age at natural menopause, vital capacity, cognitive tests assessing memory, motor speed, heart rate variability, bone density), and longevity. The FHS cohort has also been followed for the occurrence of cardiovascular disease (CVD), cancer, diabetes mellitus, and death through surveillance of hospital admissions, death registries, and other available sources. All FHS participants remain under continuous surveillance, and all deaths are included in this study. The date and cause of death (classified as due to coronary heart disease (CHD), stroke, cancer, other CVD causes, or unknown cause) have also been recorded.
VIEW WEBSITE

FHS & FHSO Genetic Data

The FHS & FHSO collected information about genetic polymorphisms in individuals of different ages. (The full list of candidate genes is available at the FHS website.) Contact is Duke Research Professor Anatoli I. Yashin.

FHS SNPs Data

Individuals from all three generations have been included in the 550K SNPs genotyping project. The genotyping of about 9,300 FHS participants and two generations of their offspring was conducted using the Affymetrix 500K mapping array, with approximately 550,000 SNPs representing a significant part of human genome variability. The individual information on the genome-wide SNP genotyping and hundreds of phenotypic traits collected in the FHS is available through the Framingham SHARe web site upon request for controlled access to individual-level data. The genotyping data are being used by DuPRI members to assess the genetic contribution to a wide range of aging-related phenotypes and their combinations. Since the subset consists of families, broader analytic approaches will be used. Information on survival and incidence of diseases in FHS participants from the non-genetic part of the study is being used for statistical purposes, e.g., to better characterize connections of intermediate phenotypes with L-traits and to improve the significance of observed genetic associations using our advanced statistical approaches. For joint analyses, genotyping data from the 100K SNPs project recently performed on a subset of the FHS participants (N=1,345) are being used. These data share 27,000 SNPs with the 550K genotyping array. Contact is Duke Research Professor Anatoli I. Yashin.
VIEW WEBSITE

Honolulu Heart Study (HHS)

The Honolulu Heart Study was initiated in 1965 as a prospective study of environmental and biological causes of CVD among 8,006 Japanese American men (representing a response rate of 72%) living in Hawaii, identified from a roster of 23,000 men on the 1940-42 selective service registry for Hawaii, who were born 1900-1919. Four main examinations were conducted. The first examination was in 1965-68, the second in 1967-70, the third in 1971-74, and the fourth in 1991-93. Deaths and morbid events were tracked to 1994. The four exams collected data on anthropometric variables (incl. height and weight), physical measures (incl. ventricular rate, blood pressure, vital capacity), blood chemistries (incl. hematocrit, serum cholesterol, serum triglyceride, uric acid, and serum glucose), and indicators of CHD, stroke, and diabetes. Exams 1 and 4 collected exercise and lifestyle measures (incl. diet, tobacco, and alcohol use). Exam 4 added measures of physical activity, activity limitations, performance-based physical function, ADL, IADL, cognitive function, and prescription and OTC medications. Three additional examinations were conducted during 1970-82 on a 30% sub-sample of those completing Exam 2 plus all individuals in the highest decile of serum triglyceride or serum cholesterol in Exam 1. These examinations were conducted in 1970-72, 1975-78, and 1980-82, and their content focused on lipoprotein chemistries. Although the content of the various examinations was comparable to that of the Framingham Heart Study, the final ages are not quite as high (maximum age at Exam 4 is 93 years), and the study included only men (but over 3 times as many). The average age at Exam 1 was 54; at Exam 4, 78; and at the 1994 surveillance, 80. The Honolulu Heart Study data have been collated and made available for research use by the NHLBI. Contact is Duke Research Professor Anatoli I. Yashin.

Cardiovascular Health Study (CHS)

The Cardiovascular Health Study (CHS) is a population-based, longitudinal study of risk factors for the development and progression of CHD and stroke in Medicare-eligible adults aged 65+ years at enrollment. The main objective of the CHS is to identify factors related to the onset and course of these diseases. The CHS is specifically designed to determine the importance of conventional CVD risk factors in older adults and to identify new risk factors in this age group, especially those that may be protective and modifiable. An initial 5,201 study participants (2,962 women and 2,239 men) were examined annually from 1989 through 1999. An additional sample of 687 blacks (256 men and 431 women) was examined from 1992 to 1999. Extensive physical and laboratory evaluations were performed at baseline to identify the presence and severity of CVD risk factors such as hypertension, hypercholesterolemia, and glucose intolerance; subclinical disease such as carotid artery atherosclerosis, left ventricular enlargement, and transient ischemia; and clinically overt CVD. These examinations permit evaluation of CVD risk factors in older adults, particularly in groups previously under-represented in epidemiologic studies, such as women and the very old. Follow-up interviews for events were conducted semiannually. Over the past decade, examination components have included physical function (ADL/IADL, Upper Extremity Score, grip strength), medical history (e.g., vision and hearing problems, heart problems in siblings, hypertension, diabetes), neurological history, behaviors (e.g., cigarette and alcohol use), physical exercise, cognition (MMSE score), depression (CES-D score), prescription medication use, electrocardiograms, physiological markers (blood pressure, cholesterol), over-the-counter medications (annually from 1993-94), Benton Visual Retention (beginning in 1993-94), cerebral magnetic resonance imaging, spirometry, and retinal photographs.
Up to 10 ICD-9-CM hospital discharge codes and up to 10 procedure codes were recorded for all hospitalizations among CHS participants. The CHS was recently renewed for continued morbidity and mortality follow-up as well as collection of genetic information. Contact is Duke Research Professor Anatoli I. Yashin.
VIEW WEBSITE

Surveillance, Epidemiology and End Results (SEER)

The Surveillance, Epidemiology and End Results (SEER) database of the National Cancer Institute is an authoritative source of information on cancer incidence and survival in the United States. SEER began in 1973 and captures approximately 14% of the US population; the expansion registries increase the coverage to approximately 26%. The information collected about each incident cancer diagnosis includes the patient’s demographic characteristics, date of diagnosis, data about up to 10 diagnosed cancer cases (e.g., histology, stage, and grade), type of surgical treatment and radiation therapy recommended or provided within 4 months of diagnosis, follow-up of vital status, and cause of death, if applicable.

The SEER (Surveillance, Epidemiology and End Results)-Medicare data file includes over 2.4 million Medicare-eligible persons appearing in the SEER data (from the National Cancer Institute) through 2002 and their Medicare claims through 2005. It combines clinical information from population-based SEER cancer registries with claims information from Medicare. As a complement to the SEER-Medicare data, there are Medicare files for a random sample of 5% of Medicare beneficiaries residing in the SEER areas who do not have cancer. The Medicare files available for this control group are identical to those for the cancer cases. Contact is Duke Research Professor Anatoli I. Yashin.
VIEW WEBSITE

Multiple Cause of Death (MCD) data file

The Multiple Cause of Death (MCD) data file contains death certificate information for individual decedents for all deaths occurring in the U.S. and U.S. territories during 1959–2005. In total, the file includes more than 96 million individual death certificate records. Each record in the MCD file includes data on the underlying cause and multiple causes of death. The data include date of death, geographic location (region, state, county, division) of death, residence of the deceased (region, state, county, city, and population size), sex, race, age, education, marital status, state of birth, and origin of descent. The multiple cause of death fields were coded using ICD-7 for 1959-1967, ICD-8 for 1968-1978, ICD-9 for 1979-1998, and ICD-10 for 1999 and later. The data for this database are collected from death certificates filed in the vital statistics offices of each state and the District of Columbia. Contact is Duke Research Professor Anatoli I. Yashin.
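Because the cause-of-death coding scheme changes over the file's time span, analyses that pool years need to know which ICD revision applies to each record. The Python sketch below encodes the year ranges stated above as a lookup; the function name `icd_revision` and the table name are hypothetical, and the sketch assumes the 1959–2005 coverage described in the text.

```python
# Year ranges and ICD revisions as stated in the MCD description above.
ICD_REVISIONS = [
    (1959, 1967, "ICD-7"),
    (1968, 1978, "ICD-8"),
    (1979, 1998, "ICD-9"),
    (1999, 2005, "ICD-10"),  # the file described here ends in 2005
]


def icd_revision(year_of_death: int) -> str:
    """Return the ICD revision used to code deaths in the given year."""
    for first_year, last_year, revision in ICD_REVISIONS:
        if first_year <= year_of_death <= last_year:
            return revision
    raise ValueError(f"year {year_of_death} is outside the file's 1959-2005 coverage")


print(icd_revision(1975))
```

For example, a death in 1975 falls in the 1968-1978 range and is therefore coded under ICD-8; comparing cause-specific counts across a revision boundary generally requires mapping codes between revisions first.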

Medicare Claims Data

Medicare is the primary health insurer for 97% of the US population aged 65+ years. All Medicare beneficiaries receive Part A benefits, which cover inpatient care in short- and long-stay hospitals, skilled nursing facilities, home health, and hospice care. About 95% of beneficiaries also subscribe to Medicare Part B to obtain benefits that cover physician services, outpatient care, durable medical equipment, and home health in some cases. The Medicare claims data records contain information on dates and costs of service, types of providers, ICD-9-CM (International Classification of Diseases, 9th Revision, Clinical Modification) diagnoses responsible for services, and auxiliary diagnostic codes and procedure codes. Contact is Duke Research Professor Anatoli I. Yashin.
VIEW WEBSITE
