Here are some datasets that I have either curated or stolen borrowed for educational purposes that I intend to use in Stat 234.

Descriptions

big_mac.csv

Since 1986, The Economist has been publishing the “Big Mac Index” as a light-hearted way to compare the price of goods between countries (there is a more complex economic theory behind this goal that you can read about here). The big_mac dataset contains the price of Big Macs across 58 countries from 2000 to 2017. The Economist sometimes collected data bianually, but I thinned it down to include only data from the summer. I obtained the data here. The data consists of a data-frame of three columns:

  • country: A character-vector where each element is one of 58 possible countries.
  • dollar: The price of a Big Mac in U.S. dollars. I did not adjust for inflation.
  • year: The year the data were collected.

bob.csv

Bob Ross was a painter who hosted a popular television series on PBS. Each episode would consist of him completing an entire painting. His paintings almost always consisted heavily of elements from nature: trees, clouds, mountains, lakes, etc. These data consist of indicators for what elements are in each episode. This dataset was taken from the excellent crew at fivethirtyeight. See here for their article. The data were obtained using this code. The varaibles are:

  • Episode: The season and episode number of the episode.
  • Title: The title of the painting.
  • Every other variable is an indicator for whether the episode contains the element described by the variable name. For example, BARN is 0 if the episode does not have a barn in the painting and 1 if the episode does have a barn in the painting.

movie.csv

The data in movie.csv consist of various ratings of the movies that were in theaters in 2015 and had at least 30 user ratings on Fandango. The data is entirely taken from fivethirtyeight. See here for their original article. I cleaned the data here. The variables in this dataset are:

  • FILM: The name of the film.
  • RottenTomatoes: The raw Rotten Tomatoes score of the film (an aggregate of critics ratings). On a scale from 0 to 100.
  • RottenTomatoes_User: The raw Rotten Tomatoes user score of the film (an aggregate of website-user ratings). On a scale from 0 to 100.
  • Metacritic: The raw Metacritic score (an aggregate of critics ratings). On a scale from 0 to 100.
  • Metacritic_User: The raw Metacritic user score (an aggregate of website-user ratings). On a scale from 0 to 10.
  • IMDB: The raw IMDB score of the film.
  • Fandango_Star: The raw Fandango score. On a scale from 1 to 5 but only takes values every half point.
  • Fandango_Ratingvalue: The less-rounded (to every tenth instead of to every half) ratings on Fandango. On a scale from 1 to 5.
  • RT_norm: The Rotten Tomatoes score normalized to a 5-point scale.
  • RT_user_norm: The Rotten Tomatoes user score normalized to a 5-point scale.
  • Metacritic_norm: The Metacritic score normalized to a 5-point scale.
  • Metacritic_user_norm: The Metacritic user score normalized to a 5-point scale.
  • IMDB_norm: The IMDB score normalized to a 5-point scale.
  • RT_norm_round: Same as RT_norm but rounded to every half point.
  • RT_user_norm_round: Same as RT_user_norm but rounded to every half point.
  • Metacritic_norm_round: Same as Metacritic_norm but rounded to every half point.
  • Metacritic_user_norm_round: Same as Metacritic_user_norm but rounded to every half point.
  • IMDB_norm_round: Same as IMDB_norm but rounded to every half point.
  • Metacritic_user_vote_count: The number of Metacritic users who voted for the film.
  • IMDB_user_vote_count: The number of IMDB users who voted for the film.
  • Fandango_votes: The numuber of Fandango reviews a film obtained.
  • Fandango_Difference: Fandango_Stars - Fandango_Ratingvalue.

NBA Player Statistics

Player statistics for the 2016-2017 season of the National Basketball League (NBA). I obtained these data here. The variables are:

  • player: The name of the player
  • pos: position
  • age: age of the player (in years) at the start of February 1st of 2017.
  • tm: team
  • g: games played
  • gs: games started
  • mp: minutes played
  • fg: field goals made
  • fga: field goal attempts
  • fgp: field goal percentage
  • three_p: 3 point field goals made
  • three_pa: 3 point field goal attempts
  • three_pp: 3 point field goal percentage (three_p / three_pa)
  • two_p: two point field goals made
  • two_pa: two point field goal attempts
  • two_pp: two point field goal percentage (two_p / two_pa)
  • efg: effective field goal percentage. This adjusts for the fact that three point field goals are worth more but are more difficult to make. efg = (fg + 0.5 * three_p) / fga
  • ft: free throws made
  • fta: free thow attempts
  • ftp: free throw percentage
  • orb: offensive rebounds
  • drb: defensive rebounds
  • trb: total rebounds
  • ast: assists
  • stl: steals
  • blk: blocks
  • tov: turnovers
  • pf: personal fouls
  • pts: points

trump.csv

Donald Trump loves Twitter. These data consist of some characteristics of his tweets during the election from 2015-12-14 to 2016-08-08. These include whether a tweet has a word that is associated with one of two sentiments (positive/negative) or one of eight emotions (anger, fear, anticipation, trust, surpirse, sadness, joy, and disgust). The annotations for the sentiments were obtained from the NRC Emotion Lexicon. I obtained the data from this blog and cleaned the data here. The variables in this dataset are:

  • id: A unique label for each tweet.
  • source: Whether the tweet came from an Android or an iPhone.
  • text: The text of the tweet.
  • created: The date the tweet was created in “year-month-day” format.
  • hour: The hour of the day the tweet was created, from 0 to 23.
  • quote: Whether the tweet starts with a quotation mark.
  • picture: Whether the tweet contains a picture.
  • very: Whether the tweet has the word “very” in it.
  • length: The length of the tweet.
  • anger: Whether the tweet has a word in it that evokes anger.
  • anticipation: Whether the tweet has a word in it that evokes anticipation.
  • disgust: Whether the tweet has a word in it that evokes disgust.
  • fear: Whether the tweet has a word in it that evokes fear.
  • joy: Whether the tweet has a word in it that evokes joy.
  • negative: Whether the tweet has a word in it that evokes negative sentiment.
  • positive: Whether the tweet has a word in it that evokes positive sentiment.
  • sadness: Whether the tweet has a word in it that evokes sadness.
  • surprise: Whether the tweet has a word in it that evokes surprise.
  • trust: Whether the tweet has a word in it that evokes trust.

trek.csv

The data in trek.csv consist of word counts and proportion of words spoken by the main characters from the excellent television series Star Trek: The Next Generation. I obtained the original dataset here and cleaned it here. The variables in this dataset are

  • episode: The name of the episode.
  • index: The index of the episode. “1” for the first episode, “2” for the second, etc…
  • released: The date the episode was released in “year-month-day” format.
  • episode_number: The index of the episode within each season. “1” for the first episode of a season, “2” for the second episode of a season, etc…
  • season: The season number of the episode.
  • picard, riker, geordi, tasha, worf, beverly, pulaski, troi, data, wesley: The raw counts of words spoken by each main character in each episode.
  • tot_words: The total number of words spoken by the main characters in each episode.
  • picard_prop, riker_prop, geordi_prop, tasha_prop, worf_prop, beverly_prop, pulaski_prop, troi_prop, data_prop, wesley_prop: The proportion of words spoken by each main character in each episode.

This R Markdown site was created with workflowr