Baldur’s Gate Characters

  • baldur.csv: Data scraped from the Baldur’s Gate Fandom Wiki. This contains the ability scores, alignments, races, and classes for the playable companions in the Baldur’s Gate I and II video games. Only scores of the first appearances of the companions were recorded. Variables include:
    • companion: Name of the companion.
    • ethics: Law versus chaos alignment of the companion. Possible values are "Lawful", "Neutral", and "Chaotic".
    • morals: Good versus evil alignment of the companion. Possible values are "Good", "Neutral", and "Evil".
    • race: The race of the companion. There are 13 possible values.
    • class: The class of the companion. Some individuals have more than one class, and some individuals changed class. Both of these scenarios are delimited with /.
    • str: The strength of the companion.
    • dex: The dexterity of the companion.
    • con: The constitution of the companion.
    • int: The intelligence of the companion.
    • wis: The wisdom of the companion.
    • cha: The charisma of the companion.
    • warrior: An indicator for if the companion belongs to the “warrior” class group.
    • rogue: An indicator for if the companion belongs to the “rogue” class group.
    • priest: An indicator for if the companion belongs to the “priest” class group.
    • wizard: An indicator for if the companion belongs to the “wizard” class group.

Body Fat

  • body.csv: Dataset from Section 7.1 of KNNL. Data were collected from 20 healthy females, 25–34 years old, to study the relationship between body fat and three predictors. Variables include
    • triceps: Triceps skinfold thickness.
    • thigh: Thigh circumference.
    • midarm: Midarm circumference
    • fat: Body fat.

Brand Preference

  • brand.csv: Dataset from problem 6.5 of KNNL. An experiment was performed on various characteristics of a brand. Variables include
    • like: The degree of brand liking, on a 0-100 scale.
    • moisture: The moisture content (higher is more moisture).
    • sweetness: The sweetness content (higher is sweeter).

Car Purchase

  • carpurch.csv: Dataset from problem 14.13 of KNNL. A marketing research firm was engaged by an automobile manufacturer to conduct a pilot study to examine the feasibility of using logistic regression for ascertaining the likelihood that a family will purchase a new car during the next year. A random sample of 33 suburban families was selected. Variables include
    • purchased: An indicator on whether the family actually purchased a new car 12 months later (1) or not (0).
    • income: The annual family income in thousands of dollars.
    • age: The age of the oldest family automobile in years.

Copiers

  • copiers.csv: Dataset from exercise 1.20 of KNNL. A copier repair company wants to study the the amount of time it takes per copier on a service call. Variables include:
    • copiers: The number of copiers serviced during a single call.
    • minutes: The total number of minutes spend by the service person.
    • model: The model of the copiers. Either small ("S") or large ("L").

County Demographic Information

  • cdi.csv: Dataset C.2 from KNNL. Demographic information on counties in 1990 and 1992. Variables include:
    • id: Identification number of the county.
    • county: County name.
    • state: Two-letter state abbreviation.
    • area: Land area (square miles)
    • pop: Estimated 1990 population.
    • percent_18_34: Percent of 1990 population aged 18-34
    • percent_65: Percent of 1990 population aged 65 or older.
    • physicians: Number of professionally active non-federal physicians during 1990.
    • beds: Total number of hospital beds, cribs, and bassinets during 1990.
    • crimes: Total number of serious crimes in 1990, including murder, rape, robbery, aggravated assault, burglary, larceny-theft, and motor vehicle theft, as reported by law enforcement agencies.
    • high_school: Percent of adult population (25 and older) who completed 12 or more years of school.
    • bachelors: Percent of adult population (25 and older) with bachelor’s degree.
    • poverty: Percent of 1990 population with income below poverty level.
    • unemployment: Percent of 1990 labor force that is unemployed.
    • capita_income: Per capita income (dollars).
    • total_income: Total person income (millions of dollars).
    • region: Geographic region, Northeast ("NE"), North-central ("NC"), South ("S"), or West ("W").

DC COVID Tests

  • dccovid.csv: Test data taken (and cleaned) from the COVID-19 surveillance website run by the DC government. These data were downloaded on 2021-07-21. Variables include:
    • day: The day of the measurements.
    • cleared: Number of individuals cleared from isolation.
    • lost: Total number of COVID deaths.
    • dctested: Total number of DC residents tested.
    • tested: Total overall number of tests.
    • positives: Total number of positive test results.

Disease Outbreak

  • disease.csv: Dataset C.10 from KNNL. Data were collected on a probability sample of 196 individuals during a disease outbreak. Variables include:
    • id: Identification number.
    • age: Age of the individual.
    • socioeconomic: Socioeconomic status of the individual. 1 = upper, 2 = middle, 3 = lower.
    • sector: Sector of the city sampled. Either "s1" or "s2".
    • disease: Disease status indicator. 1 = with disease, 0 = without disease.
    • savings: Savings account status indicator. 1 = has savings account, 0 = does not have savings account.

Earnings Data

  • earnings.csv: Data from ROS exploring the relationship between demographic variables and earnings. Data were originally from the “Work, Family, and Well-being in the United States, 1990” survey. Variables include:
    • height: Height of the individual (inches).
    • weight: Weight of the individual (lbs).
    • sex: Sex of individual (either "male" or "female").
    • earn: Personal income (in dollars)
    • earnk: Personal income (in thousands of dollars).
    • ethnicity: Ethnicity of the individual. Either "Black", "Hispanic", "White", or "Other".
    • education: Years of education completed by the individual. 17 means “some graduate school” and 18 means “graduate or professional degree”.
    • mother_education: Years of education completed by the mother. 17 means “some graduate school” and 18 means “graduate or professional degree”.
    • father_education: Years of education completed by the father. 17 means “some graduate school” and 18 means “graduate or professional degree”.
    • walk: How often does the respondent take a walk? (Includes walking to work/train station etc.) (1 = “Never”, 2 = “Once a month or less”, 3 = “About twice a month”, 4 = “About once a week”, 5 = “Twice a week”, 6 = “Three times a week”, 7 = “More than 3 times a week”, 8 = “Every day”).
    • exercise: How often does the respondent do strenuous exercise such as running, basketball, aerobics, tennis, swimming, biking, and so on? (1 = “Never”, 2 = “Once a month or less”, 3 = “About twice a month”, 4 = “About once a week”, 5 = “Twice a week”, 6 = “Three times a week”, 7 = “More than 3 times a week”).
    • smokenow: Does the respondent currently smoke 7 or more cigarettes a week? (either "yes" or "no").
    • tense: On how many of the past 7 days has the respondent felt tense or anxious?
    • angry: On how many of the past 7 days has the respondent felt angry?
    • age: Age of the individual (years).

Generic Ballot Data

  • ballot.csv: These data were extracted from the scatterplot from a FiveThirtyEight artcle. They were exploring the association between their generic ballot average versus the election results. Variables include
    • poll: Generic ballot average on election day.
    • vote: House popular vote margin.

Heating Equipment

  • heating.csv: Dataset C.8 from KNNL. A heating company wants to forecast the volume of monthely orders of some equipment. Data were collected for 43 consecutive months. Variables include:
    • id: Month identification number.
    • orders: Number of heating equipment orders duing the month.
    • interest: Prime rate in effect during the month.
    • homes: Number of new homes completed and for sale in sales region during the month.
    • discount: Percent discount offered to distributers during the month. Usually 0.
    • inventories: Distributor inventories in warehouses during the month.
    • sold: Number of units sold by distributer to contractors in previous month.
    • tempdev: Difference between average temperature for the month and 30-year average for that month.
    • date: YYYY-MM-01, indicating year and month of data.

Hibb’s Bread and Peace

  • hibbs.csv: Data from Chapter 7 of ROS based on the “bread and peace” model of Hibbs (2000) which predicts incumbent party’s share of the two-party vote as a function of income growth. That is, this model predicts that higher income growth yields a higher vote share for the incumbent party.
    • year: Year of the election.
    • growth: Inflation-adjusted growth in average personal income.
    • vote: Incumbant party’s vote share of the two-party vote (excludes third parties).
    • inc_party_candidate: The candidate of the incumbant party.
    • other_candidate: The candidate of the other party.

Insurance Firm

  • firm.csv: Data from Table 8.2 of KNNL. An innovation in the insurance industry was introduced, and a researcher wanted to study what factors affect how quickly different insurance firms adopted this new innovation. Variables include
    • months: How long, in months, it took the firm to adopt the new innovation.
    • size: The amount of total assets of the insurance firm, in millions of dollars.
    • type: The type of firm. Either a mutual company ("mutual") or a stock company ("stock").

IPO

  • ipo.csv: Dataset C.11 from KNNL. Data were collected on 482 companies that underwent an initial public offering (IPO). Variables include:
    • id: Identification number.
    • vc: Presence or absence of venture capital funding. Either "yes" or "no".
    • value: Estimated face value of company from prospectus (in dollars).
    • shares: Total number of shares offered.
    • buyout: Prsence or absence of leveraged buyout. Either "yes" or "no".

Ischemic Heart Disease

  • ischemic.csv: Dataset C.9 from KNNL. Data were collected on 788 insurance subscribers who made claims resulting from ischemic (coronary) heart disease. The goal is to find what factors contribute to cost. Varibles include:
    • id: Identification number.
    • cost: Total cost of claims by subscriber (dollars).
    • age: Age of subscriber (years).
    • gender: "male" or "other".
    • interventions: Total number of interventions or procedures carried out.
    • drugs: Number of tracked drugs prescribed.
    • er: Number of emergency room visits.
    • complications: Number of other complications that arose during heart disease treatment.
    • comorbidities: Number of other diseases that the subscriber had during period.
    • duration: Number of days of duration of treatment condition.

Job Profficiency

  • job.csv: Data from exercise 9.10 of KNNL. A personnel officer in a governmental agency administered four newly developed aptitude tests to each of 25 applicants for entry-level clerical positions in the agency. For purpose of the study, all 25 applicants were accepted for positions irrespective of their test scores. After a probationary period, each applicant was rated for proficiency on the job. Variables include
    • proficiency: Job proficiency score.
    • test1: Score on the first aptitude test.
    • test2: Score on the second aptitude test.
    • test3: Score on the third aptitude test.
    • test4: Score on the fourth aptitude test.

Kleiber’s law

  • kleiber.csv: Kleiber’s law states that animals’ metabolic rates follow a power law to their masses. Data are from Savage et al. (2004), where they collected data on mammals to study this power law. Variables include
    • order: The order of the species.
    • family: The family of the species.
    • species: The genus and species names.
    • mass: The average mass (in grams) of the species.
    • bmr: The average basal metabolic rate (in watts) of the species.

Market Share

  • market.csv: Dataset C.3 from KNNL. Market data for a packaged food product were collected for each month for 36 consecutive months. The goal is to determine what factors influence market share. Variables include:
    • num: Month number (1–36)
    • share: Average monthly market share for product (percent)
    • price: Average monthly price of product (dollars)
    • nielsen: An index of the amount of advertising exposure that the product recieved (higher is larger exposure).
    • discount: Was there a discount during this month? ("yes" or "no").
    • promotion: Was there a promotion during this month? ("yes" or "no").
    • month: Month.
    • year: Year.

Mile Times

  • mile.csv: World record mile time progressions from 1913 to 1999. Data are from Figure A.1 of ROS. Variables include
    • year: The date (in years) of the new world record.
    • seconds: The new world record (in seconds).

Muscle Mass

  • muscle.csv: A person’s muscle mass is expected to decrease with age. To explore this relationship in women, a nutritionist randomly selected 15 women from each 10-year age group, beginning with age 40 and ending with age 79. This is from KNNL. Variables include
    • mass: A measure of muscle mass (g).
    • age: Age (years)

Nocturia

  • nocturia.csv: Data from Chung et al. (2022). Nocturia is the condition that causes some people to wake up at night to urinate. Many individuals with sleep apnea also have nocturia. This study recruited patients with sleep apnea, measured if they also had nocturia, and used a questionnaire to obtain various predictors that the authors thought might be associated with nocturia. Variables include
    • nocturia: An indicator for whether the individual has nocturia (1) or not (0).
    • age: Age of the individual (years)
    • ahi: “Apnea-hypopnea index”, average number of times per hour an individual either has an apnea event or a hypopnea event.
    • alcohol: Alcohol intake in grams per week.
    • apnea: “Apnea index”, average number of times per hour an individual stopped breather for 10 seconds or longer.
    • arousal: “Arousal index”, average number of awakenings per hour.
    • bed_time: Total amount of time in bed in minutes.
    • bmi: Body mass index (kg/m\(^2\)).
    • caffeine: Caffeine intake (cups/day).
    • cardiovascular_disease: Indicator for whether the individual has cardiovascular disease (1) or not (0).
    • diabetes: Indicator for whether the individual has diabetes mellitus (1) or not (0).
    • diuretics: Indicator for whether the individual takes diuretics (1) or not (0).
    • dyslipidemia: Indicator for whether the individual has dyslipidemia (1) or not (0) .
    • ess: Epworth sleep scale. Questionnaire index of “situational sleep propensity”. Values are between 0 and 32, with higher values indicating greater subjective sleepiness.
    • hypertension: Indicator for whether the individual has hypertention (high blood pressure) (1) or not (0).
    • hypopnea: “Hypopnea index”, average number of times per hour an individual has a partial loss of breath for 10 seconds or longer.
    • id: Patient ID.
    • isi: Korean insomnia severity index, ranging from 0 to 28, with higher scores indicating worse insomnia.
    • kbdiii: Korean Beck depression inventory-II. Questionnaire index of severity of depression. Scores range from 0 to 63, with higher scores indicating more severe depressive symptoms.
    • kidney_disease: Indicator for whether the individual has chronic kidney disease (1) or not (0).
    • low_sa02: Lowest percent of oxygen saturation detected during sleep.
    • N1: Percent of time in “N1 sleep”, non-rapid eye movement sleep stage 1. See sleep cycle
    • N2: Percent of time in “N2 sleep”, non-rapid eye movement sleep stage 2. See sleep cycle
    • N3: Percent of time in “N3 sleep”, non-rapid eye movement sleep stage 3. See sleep cycle
    • odi: Oxygen desaturation index (the number of 3% oxygen desaturations per hour of estimated sleep time).
    • odi90: 90% oxygen desaturation index (the number of 3% oxygen desaturations below 90% per hour of estimated sleep time).
    • parkinsons: Indicator for whether the individual has Parkinson’s disease (1) or not (0).
    • psqi: Korean version of the Pittsburgh sleep quality index. A composite score between 0 and 21 based on a questionaire where smaller values indicate better self-reported sleep.
    • rem: : Percent of time in “REM sleep”. See sleep cycle
    • rem_latency: REM latency (time it takes to reach REM sleep) in minutes.
    • sex: Biological sex of the individual. Female (F) or male (M).
    • sleep_efficiency: Percent of time spent asleep while in bed.
    • sleep_latency: Sleep latency (time it takes for a person to fall asleep) in minutes.
    • sleep_time: Total sleep time in minutes.
    • smoking: Smoking frequency in pack-years (packs smoked per day times the number of years as a smoker).
    • stroke: Indicator for whether the individual has had a stroke (1) or not (0).
    • waso: Wakefulness after sleep onset.

Prostate Cancer

  • prostate.csv: Dataset C.5 from KNNL. Researchers were interested in the association between prostate-specific antigen (PSA) and a few prognostic clincical measurements in men with prostate cancer. Data were collected on 97 men with prostate cancer. The variables are:
    • id: Identification number of the patient.
    • psa: Serum prostate-specific antigen level (mg/ml)
    • volume: Estimate of prostate cancer volume (cubic centimeters)
    • weight: Prostate weight (grams)
    • age: Age of patient (years)
    • benign: Amount of benign prostatic hyperplasia (square centimeters)
    • seminal: Presence or absence of seminal vesicle invation ("yes" or "no").
    • capsular: Degree of cpsular penetration (cm)
    • gleason: Pathologically determined grade of disease using total score of two patters (summed scores were either 6, 7, or 8, with higher scores indicating worse prognosis).

Real Estate Sales

  • estate.csv: Dataset C.7 from KNNL. Data on 522 home sales in a Midwestern city during the year 2002. The goal was to predict residential home sales prices from the other variables. The 13 variables are
    • price: Sales price of residence (in dollars)
    • area: Finished area of residence (in square feet)
    • bed: Total number of bedrooms in residence
    • bath: Total number of bathrooms in residence
    • ac: "yes" = presence of air conditioning, "no" = absence of air conditioning
    • garage: Number of cars that a garage will hold
    • pool: "yes" = presence of a pool, "no" = absence of a pool
    • year: Year property was originally constructed
    • quality: Index for quality of construction. high, medium, or low.
    • style: Categorical variable indicating architectural style
    • lot: Lot size (in square feet)
    • highway: "yes" = highway adjacent, "no" = highway not adjacent.

Roman Public Spaces

  • rome2.csv: Data from Hanson et al. (2019), a study on urban design in ancient Rome. The authors were interested in the relationship between the population size of a settlement and measures of the settlement’s infrastructure. The authors collected data on 125 settlements with the following variables:
    • id: The identification number of the settlement.
    • name: The name of the settlement.
    • area: Area of the settlement in hectares (1hc = 10000m\(^2\)).
    • pop: The population of the settlement.
    • fa_area: Area of the settlement devoted to public space (i.e. fora and agora) in square meters.
    • st_area: Area of the settlement devoted to public streets in square meters.
    • st_length: Total length of public streets in meters.
    • st_width: Average street width in meters.
    • bl_area: Average area of city blocks in square meters.

Roman Wheat Prices

  • rome.csv: Data from Figure 1 of Temin (2019). The researcher was interested in the association between the price of wheat and the distance from Rome. Data were collected during various time points between the late Roman republic and early Roman empire. Variables include.
    • distance: Distance from Rome (in km).
    • discount: Difference in wheat price between Rome and the location, in units of sestertius per modius. Think about this as cost per unit of wheat. Negative numbers mean that wheat was more expensive in Rome.
    • location: The name of the location where the price was measured.

Soap Production Line

  • soap.csv: Table 8.5 of KNNL. Researchers were studying the relationship between line speed and the amount of scrap for two production lines in a soap production company. The variables include
    • scrap: Amount of scrap (coded).
    • speed: Line production speed (coded).
    • line: Production line. Either "line1" or "line2".

Steroid Level

  • steroid.csv: From problem 8.6 of KNNL. A researcher was interested in the relationship between the level of a steroid and the age in healthy female subjects between the ages of 8 and 25. Variables include
    • steroid: Level of the steroid.
    • age: Age in years.

Study on the Efficacy of Nosocomial Infection Control (SENIC)

  • senic.csv: Dataset C.1 from KNNL. The goal was to study if surveillance and control programs reduced the number of hospital-acquired infections. The observational units are 113 hospitals surveyed. The variables are
    • id: Hospital identification number.
    • length: Average length of stay of all patients in hospital (days).
    • age: Average age of patients (years).
    • risk: Average estimated probability of acquiring infection in hospital (percent).
    • culturing_ratio: Ratio of number of cultures performed to number of patients without signs or symptoms of hospital-acquired infection, times 100.
    • xray_ratio: Ratio of number of X-rays perfromed to number of patients without signs or symptoms of pneumonia, times 100.
    • beds: Average number of beds in hospital during study period.
    • med_school: Whether a hospital was affiliated with a medical school (yes or no).
    • region: Geographic region, Northeast ("NE"), North-central ("NC"), South ("S"), or West ("W").
    • patients: Average number of patients in hospital per day during study period.
    • nurses: Average number of full-time equivalent registered and licensed practical nurses during study period (full time plus half part-time).
    • facilities: Percent of 35 potential facilities and services that are provided by the hospital.

Taxes in Tudor England

  • tax.csv: These data were extracted from the scatterplot of Figure 1 of Cesaretti et al. (2020). Economists think modern cities are more efficient engines of economic growth, producing more output per individual. This is because of various factors like easier access to skilled labor, logistical cost savings because folks are closer together, etc. Cesaretti et al. (2020) wanted to test if this was also true in old England. They collected data from the 1524–1525 tax years from 93 towns. Variables include:
    • taxpayers: Number of taxpayers in the town.
    • tax: The total tax collected by from the town, in British pounds £.

Textile Data

  • textile.csv: These data from Shadid and Rahman (2010) measures the strength to resist breakage of knited fabric across conditions of length and yarn count. These data were downloaded from Larry Winner’s data page. Variables include:
    • count: Yarn count (thickness of the yarn) (g/km).
    • length: Stitch length (mm).
    • strength: Bursting strength of the fabric against a multidirectional flow of pressure (100kpa).

University Admissions

  • university.csv: Dataset C.4 from KNNL. Academic data were collected on 705 students. The goal is to determine if GPA could be predicted by entrance test scores and high school class rank. Variables include
    • id: The student identification number.
    • gpa: Grade-point average following freshman year.
    • rank: High school class rank as percentile. Lower percentiles imply higher class ranks.
    • act: ACT entrance examination score.
    • year: Calendar year that the freshman entered university.

Website Developer

  • website.csv: Dataset C.6 from KNNL. A company was interested on factors affecting the number of websites completed and delivered. Data were collected for 13 teams over 8 quarters.
    • id: Identification number for row.
    • number: Number of websites completed and delivered to customers during the quarter.
    • backlog: Number of website orders in backlog at the close of the quarter.
    • team: Team number (1–13)
    • experience: Number of months the team has been together.
    • change: A change in the website development process occurred during the second quarter of 2002. This is an indicator variable adjusting for this change. 1 if quarter 2 or 3 of 2002, and 0 otherwise.
    • year: Year, either 2001 or 2002.
    • quarter: 1, 2, 3, or 4.

References

Cesaretti, Rudolf, José Lobo, Luis M. A. Bettencourt, and Michael E. Smith. 2020. “Increasing Returns to Scale in the Towns of Early Tudor England.” Historical Methods: A Journal of Quantitative and Interdisciplinary History 53 (3): 147–65. https://doi.org/10.1080/01615440.2020.1722775.
Chung, Yeon Hak, Jae Rim Kim, Su Jung Choi, and Eun Yeon Joo. 2022. “Prevalence and Predictive Factors of Nocturia in Patients with Obstructive Sleep Apnea Syndrome: A Retrospective Cross-Sectional Study.” PLOS ONE 17 (4): 1–12. https://doi.org/10.1371/journal.pone.0267441.
Hanson, John W, Scott G Ortman, Luı́s MA Bettencourt, and Liam C Mazur. 2019. “Urban Form, Infrastructure and Spatial Organisation in the Roman Empire.” Antiquity 93 (369): 702–18. https://doi.org/10.15184/aqy.2018.192.
Hibbs, Douglas A. 2000. “Bread and Peace Voting in US Presidential Elections.” Public Choice 104 (1): 149–80. https://doi.org/10.1023/A:1005292312412.
Savage, V. M., J. F. Gillooly, W. H. Woodruff, G. B. West, A. P. Allen, B. J. Enquist, and J. H. Brown. 2004. “The Predominance of Quarter-Power Scaling in Biology.” Functional Ecology 18 (2): 257–82. https://doi.org/10.1111/j.0269-8463.2004.00856.x.
Shadid, S A, and M F Rahman. 2010. “Study on the Effect of Stitch Length on Bursting Strength of Knit Fabric.” Annals of the University of Oradea, Fascicle of Textiles, Leatherwork 1: 142–46. http://textile.webhost.uoradea.ro/Annals/VolumeXI-no%202-2010.pdf.
Temin, Peter. 2019. “Words and Numbers: A New Approach to Writing Ancient History.” Journal of Interdisciplinary History 50 (1): 31–58. https://doi.org/https://doi.org/10.1162/jinh_a_01375.