| Title: | Data Sets for Copula Additive Distributional Regression Using R |
|---|---|
| Description: | Data sets used in the book Marra and Radice (2025, ISBN:9781032973111), "Copula Additive Distributional Regression Using R", to illustrate the fitting of various joint (and univariate) regression models, with several types of covariate effects, in the presence of association between the equations' errors. |
| Authors: | Giampiero Marra [aut, cre] |
| Maintainer: | Giampiero Marra <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 0.1-1 |
| Built: | 2026-05-26 09:23:30 UTC |
| Source: | https://github.com/cran/GJRM.data |
Real dataset of bivariate interval-censored and right-censored data with 628 subjects
and three covariates. The dataset is a reshaped version of the AREDS data from the CopulaCenR package. The dataset
was selected from the Age-related Eye Disease Study (AREDS Group, 1999). The two events are the
progression times (in years) to late-AMD in the left and right eyes.
data(areds)
areds is a 628 row data frame with the following columns:
left and right bounds of the intervals for the left eye. If t12 = NA then the observation is right-censored.
left and right bounds of the intervals for the right eye. If t22 = NA then the observation is right-censored.
baseline AMD severity scores for left and right eyes, respectively. Possible values are: 4, 5, 6, 7, 8.
age at baseline.
a genetic variant covariate highly associated with late-AMD progression. Possible values are: 0, 1, 2.
type of censoring for left and right eyes.
joint censoring indicator for left and right eyes.
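The NA convention for right-censoring can be turned into an explicit indicator. The following is a minimal sketch on a toy data frame, not on areds itself: the column names t11 and t21 for the left bounds are assumed here (only t12 and t22 are named above), the derived columns are hypothetical, and all values are made up.

```r
# Toy data in the layout described above (values are made up).
# Assumption: t11/t21 hold the left bounds, t12/t22 the right bounds.
toy <- data.frame(
  t11 = c(2.0, 4.1, 6.3),
  t12 = c(3.5, NA, 8.0),
  t21 = c(1.2, 5.0, 7.4),
  t22 = c(NA, 6.1, NA)
)

# An eye is right-censored when its right bound is missing.
toy$rc.left  <- is.na(toy$t12)
toy$rc.right <- is.na(toy$t22)

# The interval width is defined only for interval-censored observations.
toy$width.left <- ifelse(toy$rc.left, NA, toy$t12 - toy$t11)

print(toy)
```

The same logical test applied to the real data would separate interval-censored from right-censored eyes before any modelling.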
Data are from:
AREDS Group (1999), The Age-Related Eye Disease Study (AREDS): design implications. AREDS report no. 1. Control Clinical Trials, 20, 573-600.
Blood pressure data in 11 year old children. The dataset is a subsample from Solomon-Moore et al. (2020).
data(bpc)
bpc is a 1052 row data frame with the following columns:
Systolic Blood Pressure (mmHg).
Diastolic Blood Pressure (mmHg).
1 = Male, 2 = Female.
Body Mass Index.
Average minutes of moderate to vigorous physical activity per day.
Average sedentary minutes per day.
Data are from Solomon-Moore E, Salway R, Emm-Collison L, Thompson JL, Sebire SJ, Lawlor DA, Jago R (PI), 2020.
Fictitious data designed to closely replicate the characteristics and patterns observed in the Africa Centre Demographic Information System (ACDIS).
data(cd4)
cd4 is a 2645 row data frame with the following columns:
CD4 count measurements.
Binary variable indicating whether an individual is HIV positive (hiv = 1) or not (hiv = 0).
Age in years.
Three levels: PER, RUR, URB.
Six levels: Married, Polygamous, Divorced/Separated/Widowed, Engaged, Never Married, Under Legal Age.
If present or not.
Four levels: None, Primary, Junior Secondary, Upper Secondary.
Km to nearest primary school.
Km to nearest secondary school.
The data have been produced as described in:
Tanser F. et al. (2007), Cohort Profile: Africa Centre Demographic Information System (ACDIS) and population-based HIV survey. International Journal of Epidemiology, 37(5), 956-962.
Simulated data with two endogenous variables and binary outcome.
data(dataDE)
dataDE is a 2000 row data frame with the following columns:
y1: First endogenous variable.
y2: Second endogenous variable.
y3: Binary outcome.
x1, x2: Covariates.
x3: Covariate influencing only y1.
x4: Covariate influencing only y2.
    # Data have been simulated as shown below
    n <- 2000
    x1 <- round(runif(n))
    x2 <- runif(n)
    x3 <- runif(n)
    x4 <- rnorm(n)
    u <- rnorm(n)
    y1 <- ifelse(-1.55 + x1 - x2 + x3 + u + rnorm(n) > 0, 1, 0)
    y2 <- ifelse(-0.25 - 0.5*x1 + x2 + x4 + u + rnorm(n) > 0, 1, 0)
    y3 <- ifelse(-0.75 + 0.5*y1 - y2 + x1 + x2 + u + rnorm(n) > 0, 1, 0)
    dataDE <- data.frame(y1, y2, y3, x1, x2, x3, x4)
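In the simulation, every latent equation has an error of the form u + rnorm(n), so the shared term u is what links the equations. Assuming u and the equation-specific noise are independent standard normals, as in the code, the correlation between any two such errors is 1 / (1 + 1) = 0.5. A small sketch that checks this numerically (the seed, the sample size and the names e1, e2 and rho.hat are illustrative choices, not part of the package):

```r
# Each latent error is u + e_j, with u and e_j independent N(0, 1).
# Two errors share u, so Cov = Var(u) = 1 and each variance is 2.
set.seed(1)            # arbitrary seed, for reproducibility only
n  <- 100000           # larger than the data set, to reduce Monte Carlo noise
u  <- rnorm(n)
e1 <- u + rnorm(n)     # error of the first equation
e2 <- u + rnorm(n)     # error of the second equation

rho.hat <- cor(e1, e2)
print(round(rho.hat, 2))
```

The estimate should be close to 0.5, which is the association between the equations' errors that a joint model is meant to capture.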
Simulated data with double sample selection and binary outcome.
data(dataDSS)
dataDSS is a 10000 row data frame with the following columns:
y1: First selection.
y2: Second selection.
y3: Binary outcome.
x1, x2: Covariates.
x3: Covariate influencing only y1.
x4: Covariate influencing only y2.
y3.o: Original outcome, without missingness.
    # Data have been simulated as shown below
    n <- 10000
    x1 <- round(runif(n))
    x2 <- runif(n)
    x3 <- runif(n)
    x4 <- rnorm(n)
    u <- rnorm(n)
    y1 <- ifelse(-1.55 + x1 - x2 + x3 + u + rnorm(n) > 0, 1, 0)
    y2 <- ifelse(-0.25 - 0.5*x1 + x2 + x4 + u + rnorm(n) > 0, 1, 0)
    y3 <- y3.o <- ifelse(-0.75 + x1 + x2 + u + rnorm(n) > 0, 1, 0)
    y2 <- y2*y1
    y3 <- y3*y2
    y3 <- ifelse(y2 == 0, NA, y3)
    dataDSS <- data.frame(y1, y2, y3, x1, x2, x3, x4, y3.o)
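The double selection mechanism in the simulation implies three properties: the second selection can only equal 1 when the first does, the outcome is missing exactly when the second selection equals 0, and the observed outcome agrees with y3.o wherever it is observed. The sketch below re-runs the same recipe with an arbitrary seed (so it does not reproduce the shipped dataDSS) and checks these properties; the names sel.nested, obs.rule and same.obs are illustrative.

```r
set.seed(1)  # arbitrary seed, not the one used to create dataDSS
n  <- 10000
x1 <- round(runif(n)); x2 <- runif(n); x3 <- runif(n); x4 <- rnorm(n)
u  <- rnorm(n)
y1 <- ifelse(-1.55 + x1 - x2 + x3 + u + rnorm(n) > 0, 1, 0)
y2 <- ifelse(-0.25 - 0.5*x1 + x2 + x4 + u + rnorm(n) > 0, 1, 0)
y3 <- y3.o <- ifelse(-0.75 + x1 + x2 + u + rnorm(n) > 0, 1, 0)
y2 <- y2*y1                         # second selection requires the first
y3 <- ifelse(y2 == 0, NA, y3*y2)    # outcome unobserved when not selected

# The second selection can only occur after the first one.
sel.nested <- all(y2[y1 == 0] == 0)
# The outcome is observed exactly when the second selection equals 1.
obs.rule <- all(is.na(y3) == (y2 == 0))
# Where observed, y3 coincides with the original outcome.
same.obs <- all(y3[y2 == 1] == y3.o[y2 == 1])
print(c(sel.nested, obs.rule, same.obs))

# Compare the observed-sample mean with the full-sample mean.
print(c(observed = mean(y3, na.rm = TRUE), full = mean(y3.o)))
```

Because the selection and outcome equations share u, the observed-sample mean of y3 will generally differ from the mean of y3.o, which is why y3.o is kept in the data as a benchmark.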
Data from the 2019 World Happiness Report, an annual publication of the United Nations Sustainable Development Solutions Network.
data(happy)
happy is a 155 row data frame with the following columns:
Country.
Gross domestic product per capita.
Indicator of social support (or having someone to count on in times of trouble) calculated at national level.
Indicator of healthy life expectancies at birth.
Freedom to make life choices is the national average of responses to the question: Are you satisfied or dissatisfied with your freedom to choose what you do with your life?
Generosity is the residual of regressing national average of response to the question: Have you donated money to a charity in the past month? on GDP per capita.
Corruption Perception: The measure is the national average of the survey responses to two questions: Is corruption widespread throughout the government or not? and Is corruption widespread within businesses or not? The overall perception is just the average of the two 0-or-1 responses.
Subjective well-being. 1 low, 2 medium low, 3 medium, 4 high.
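The generosity variable is defined above as a regression residual. A minimal sketch of that construction on made-up national averages (the data frame toy and the column names gdp and donated are hypothetical, and the values are invented for illustration):

```r
# Made-up national averages; gdp and donated are hypothetical names.
toy <- data.frame(
  gdp     = c(0.3, 0.6, 0.9, 1.2, 1.5),
  donated = c(0.20, 0.35, 0.30, 0.50, 0.45)
)

# Regress the donation average on GDP per capita and keep the residuals.
fit <- lm(donated ~ gdp, data = toy)
toy$generosity <- resid(fit)

print(toy)
```

By construction the residuals average to zero and are uncorrelated with GDP per capita, so a positive value indicates more giving than GDP alone would predict.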
Full description available at the web link below.
data(hie)
hie is a 7734 row data frame with the following columns:
Equal to 1 if the individual is in the HIE group and agreed to participate, and 0 if the individual is assigned to the control group or refuses to participate.
Random allocation variable equal to 1 if the individual/employer was assigned to the hiring incentive experiment group and 0 to the control group. This is the IV.
Weekly benefit amount + dependents' allowance.
Weeks of benefits.
Equal to 1 if unemp.dur < 26 and 0 otherwise.
Age of claimant.
1 = male and 0 = female.
1 = black and 0 otherwise.
Claimant's pre-claim earnings.
https://www.upjohn.org/data-tools/employment-research-data-center/illinois-unemployment-incentive-experiments
HIV Zambian data by region, together with polygons describing the regions' shapes.
data(hiv)
data(hiv.polys)
hiv is a 6416 row data frame with the following columns:
binary variable indicating consent to test for HIV.
binary variable indicating whether an individual is HIV positive (status = 1) or not (status = 0).
age in years.
years of education.
wealth index.
code identifying region, and matching names(hiv.polys). It can take nine possible values: 1 central, 2 copperbelt, 3 eastern,
4 luapula, 5 lusaka, 6 northwestern, 7 northern, 8 southern, 9 western.
never married, currently married, formerly married.
had a sexually transmitted disease.
had high risk sex.
number of partners.
used condom during last intercourse.
equal to 1 if would care for an HIV-infected relative.
equal to 1 if know someone who died of HIV.
equal to 1 if previously tested for HIV.
smoker or not.
bemba, lunda (luapula), lala, ushi, lamba, tonga, luvale, lunda (northwestern), mbunda, kaonde, lozi, chewa, nsenga, ngoni, mambwe, namwanga, tumbuka, other.
English, Bemba, Lozi, Nyanja, Tonga, other.
interviewer identifier.
age at which the individual had sex.
four categories.
survey weights.
hiv.polys contains the polygons defining the areas in the format described below.
The data frame hiv relates to the regions whose boundaries are coded in hiv.polys.
hiv.polys[[i]] is a 2 column matrix, containing the vertices of the polygons defining the boundary of the ith
region. names(hiv.polys) matches hiv$region (order unimportant).
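The polygon format can be illustrated with a toy stand-in: a named list of 2 column vertex matrices whose names match the region codes in the data. The objects toy.polys and toy.data below are hypothetical and much simpler than hiv.polys; the sketch only shows the kind of consistency checks that the description implies.

```r
# Toy stand-in for the polygon list: one 2-column vertex matrix per region.
toy.polys <- list(
  "1" = cbind(c(0, 1, 1, 0), c(0, 0, 1, 1)),
  "2" = cbind(c(1, 2, 2, 1), c(0, 0, 1, 1))
)
toy.data <- data.frame(region = c(2, 1, 1, 2, 2))

# Every region code in the data should have a polygon, regardless of order.
codes.ok <- all(as.character(unique(toy.data$region)) %in% names(toy.polys))

# Every element should be a matrix with two columns (x and y vertices).
shape.ok <- all(vapply(toy.polys,
                       function(p) is.matrix(p) && ncol(p) == 2,
                       logical(1)))

print(c(codes.ok = codes.ok, shape.ok = shape.ok))
```

Each matrix can then be passed to the base graphics function polygon(), column 1 as x and column 2 as y, to draw the region boundaries.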
The data have been produced as described in:
McGovern M.E., Barnighausen T., Marra G. and Radice R. (2015), On the Assumption of Joint Normality in Selection Models: A Copula Approach Applied to Estimating HIV Prevalence. Epidemiology, 26(2), 229-237.
Marra G., Radice R., Barnighausen T., Wood S.N. and McGovern M.E. (2017), A Simultaneous Equation Approach to Estimating HIV Prevalence with Non-Ignorable Missing Responses. Journal of the American Statistical Association, 112(518), 484-496.
Data on 978 randomly selected patients admitted between January and September 2014 to an over-500-bed medical center (Lewis Gale Medical Center) in the state of Virginia.
data(hospital)
hospital is a 978 row data frame with the following columns:
Patient length of hospital stay (in days).
In-hospital mortality. 1 dead, 0 alive.
Age of the patient.
Either male or female.
Body mass index.
Subjective assessment of severity level of patient. Value between 1 and 4, with 1 representing the lowest severity level.
Subjective assessment of risk of dying. Value between 1 and 4, with 1 representing the lowest level.
Oxygen saturation level.
Systolic blood pressure.
Diastolic blood pressure.
Pulse rate.
Respiratory rate.
AVPU score (A: alert, V: responding to voice, P: responding to painful stimuli, U: unresponsive).
Temperature.
Azadeh-Fard N, Ghaffarzadegan N, Camelio JA (2016), Can a Patient's In-Hospital Length of Stay and Mortality Be Explained by Early-Risk Assessments?, PLoS ONE 11(9): e0162976.
Individual-level infant mortality data on 20000 randomly selected births of female babies in the U.S. state of North Carolina, in 2008, together with polygons describing the county shapes.
data(infants)
data(NC.polys)
infants is a 20000 row data frame with the following columns:
Number code identifying North Carolina county in which birth occurred, and matching names(NC.polys). It can take 100 possible values.
Age of mother.
Completed weeks of gestation.
Equal to 1 if married, and 0 otherwise.
Infant's birth weight.
Equal to 1 if infant's birth weight < 2500 grams, and 0 otherwise.
Four categories of ethnicity: White, Hispanic, Black, Other.
Education of mother: Primary, Secondary, Tertiary.
Equal to 1 if smoker, and 0 otherwise.
Equal to 1 if it was the mother's first birth, and 0 otherwise.
Equal to 1 if completed weeks of gestation < 37.
NC.polys contains the polygons defining the areas in the format described below.
The data frame infants relates to the counties whose boundaries are coded in NC.polys.
NC.polys[[i]] is a 2 column matrix, containing the vertices of the polygons defining the boundary of the ith
county. names(NC.polys) matches infants$county (order unimportant).
The data were compiled by the North Carolina State Center for Health Statistics (https://schs.dph.ncdhhs.gov/).
Subsample of the 2012 MEPS data, collected and published by the U.S. Agency for Healthcare Research and Quality.
data(meps)
meps is a 10638 row data frame with the following columns:
General health: 1 excellent, 2 very good, 3 good, 4 fair, 5 poor.
Mental health (as above).
Body mass index.
Income.
Age.
Male 1, Female 0.
1 white, 2 black, 3 native american, 4 others.
Education in years.
1 Northeast, 2 Midwest, 3 South, 4 West.
Equal to 1 if hypertension present and 0 otherwise.
Equal to 1 if hyperlipidemia present and 0 otherwise.
Number of doctor (physicians) visits.
Number of non doctor visits (non-physician providers).
Expenditure on doctor visits.
Expenditure on non doctor visits.
https://meps.ahrq.gov
Civil war data from Fearon and Laitin (2003).
data(war)
war is a 6326 row data frame with the following columns:
equal to 1 for all country-years in which a civil war started.
equal to 1 if unstable government.
equal to 1 for oil exporter country.
equal to 1 if the country had a distinct civil war ongoing in the previous year.
GDP per capita (measured as thousands of 1985 U.S. dollars) lagged one year.
equal to 1 for non-contiguous state.
equal to 1 for new state.
log(population size).
log(mountainous).
measure of ethnic fractionalization (calculated as the probability that two randomly drawn individuals from a country are not from the same ethnicity).
measure of religious fractionalization.
measure of political democracy (ranges from -10 to 10) lagged one year.
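The probability definition of fractionalization given above corresponds, when the two individuals are drawn independently from group shares p_i, to the index 1 - sum(p_i^2). A short sketch of that calculation (the function name frac.index and the example shares are illustrative, not taken from the data):

```r
# Fractionalization index: probability that two independent draws
# fall in different groups, i.e. 1 - sum(p_i^2) for group shares p_i.
frac.index <- function(shares) {
  stopifnot(all(shares >= 0), abs(sum(shares) - 1) < 1e-8)
  1 - sum(shares^2)
}

homogeneous <- frac.index(1)              # a single group
two.equal   <- frac.index(c(0.5, 0.5))    # two equally sized groups
four.equal  <- frac.index(rep(0.25, 4))   # four equally sized groups

print(c(homogeneous, two.equal, four.equal))
```

The index is 0 for a fully homogeneous country and rises towards 1 as the population splits into more, smaller groups (0.5 and 0.75 in the two examples with equal shares).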
Data are from:
Fearon J.D., Laitin D.D. (2003), Ethnicity, Insurgency, and Civil War. The American Political Science Review, 97, 75-90.