class: center, middle, inverse, title-slide

.title[
# PPOL 670-03: Introduction to Data Science
]
.subtitle[
## Week 03 Data Wrangling
]
.author[
### Alexander Podkul, PhD
]
.date[
### Spring 2023
]

---

## Tonight's Outline

.pull-left[
- Working on Loading Data and Projects
- Data Wrangling: An Introduction
- `dplyr`
- Piping
- Transforming Variables
- Subsetting
- Joining (also: merge)
- Collapsing and Aggregating
- `tidyr`
- Reshaping
- __BREAK__
- Introduction to R Markdown and Wrangling Examples
- Problem Set Assignment
]

.pull-right[
<img src="sticker.png" width="80%" style="display: block; margin: auto;" />
]

---

## Picking Up From Last Week

Most of the time, we do not want to hand-code our data sets but rather load an existing data set (e.g. a .csv, .dta, or .rds file).

In R, we can load data sets directly from the Internet (as we did last week) or from our local files. To load local files, we need to point R to our "working directory," which is simply the default file path that R uses.

--

In addition to looking at file paths or the Files tab in RStudio, we can also find the working directory using:

```r
getwd()
```

--

If we want to redirect R to an alternative file path, we can use:

```r
setwd('file/path/listed/here')
```

... let's be a bit more hands on now.

---
class: inverse, center, middle

## Data Wrangling: An Introduction

---

## Data Wrangling

.pull-left[
__Data wrangling__ (or data _munging_) generally refers to the process of preparing raw data (often complicated and unformatted) for analysis and exploration. This could include things such as:

1. merging multiple data sources
2. isolating relevant data points
3. changing the shape of the data set (e.g. country to country-year)
4. putting data in the proper _type_

In one survey, data scientists reported spending 45% of their time on data preparation tasks (Anaconda 2020).
]

.pull-right[
<img src="wrangle.png" width="65%" style="display: block; margin: auto;" />
]

--

We'll only be scratching the surface tonight. Tonight is meant to be the basic toolkit that we'll build on later in the semester.

---

## Data Wrangling: Example

Let's imagine we were trying to answer a research question about whether university Covid vaccine mandates help lower community rates of transmission.

--

Without even discussing our statistical or machine learning model, we first need to:

- collect information about university Covid vaccine mandates _and_ information about community transmission
- summarize both data sources in a machine-readable format (and possibly do work to make sure our measures are stored in the same format across communities)
- join these two data sources together in some way to begin exploring the relationship between them

---
class: inverse, center, middle

## dplyr, tibbles, and piping

---

## 🚨`dplyr`🚨

.pull-left[
`dplyr` is one of the packages of the __tidyverse__, a suite of packages developed by Hadley Wickham et al. for data science use. `dplyr` is specifically used for data manipulation, including adjusting the ordering and selection of cases and attributes within a particular data set.

Tonight we'll cover the following functions:

- `mutate()` for transforming variables
- `rename()` for changing variable names
- `select()` for selecting columns
- `filter()` for filtering rows
- `summarize()` for summarizing a dataset or group
- `group_by()` for grouping

For more on the expanse of `dplyr`, check out its vignettes and help pages.
]

.pull-right[
<img src="dplyr.png" width="65%" style="display: block; margin: auto;" />
]

---

## Tibbles

Last week we spoke about a number of different ways to organize and store data. The most commonly used object is the tabular, mixed-type object called the __data frame__.
The tidyverse (including `dplyr`) uses a slight variation of the data frame called a "tibble." It is similarly a two-dimensional, mixed-type object with a few differences:

1. the output of a printed tibble shows only the first few rows and reports the data type associated with each column
2. working with tibbles will always return another tibble (but fortunately the pipe operator -- next slide -- makes this easy to work with)

The differences between tibbles and data frames are often so negligible that the two terms are used interchangeably.

---

## %>%

The pipe operator -- or `%>%` -- is part of the `magrittr` package in R (and is automatically made available when you load `dplyr`). It helps users pass the result of one function to the next via the _first argument_. Using the pipe operator makes reading code a lot easier.

--

Conceptually (using pseudocode)

--

.pull-left[
Without the pipe

```r
display it using a bar chart(
  get the mean mpg(
    car_data[isolate the red cars]
  )
)
```
]

--

.pull-right[
With the pipe

```r
car_data %>%
  isolate the red cars %>%
  get the mean mpg %>%
  display it using a bar chart
```
]

--

Reading the code, we can think about `%>%` as saying "then."

---

## %>% Examples

```r
# base R
summary(df)
# the same call, written with the pipe
df %>% summary()
```

--

```r
# base R: mean of `value` for category A
mean(df$value[df$category == 'A'])
# the same calculation with the pipe, returned as a one-row data frame
df %>%
  filter(category == 'A') %>%
  summarize(mean(value))
```

---

## %>% Examples

It's important to note here that using the pipe operator does __not__ overwrite the object being used at the beginning of the workflow. If we want to overwrite that object, we need to explicitly assign it.

--

Not overwriting

```r
df %>% summary()
```

Overwriting

```r
df <- df %>% summary()
```

---

## A note about %>%

.pull-left[There are a few other types of pipes in R. The most common alternative to `%>%` is available in base R (version 4.1 and later) and written as `|>`.

To stick with the tidyverse, we're going to continue using `%>%`, but there are a few minor differences between the two (in case you stumble upon `|>` in the wild).
Differences include needing parentheses on every function call (even one without any arguments) and a lack of support for the dot (`.`) placeholder that `%>%` offers for harder-to-use functions.

(No need to memorize this; just noting it in case you stumble upon these issues.)
]

.pull-right[
<img src="logo.png" width="65%" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

## Transforming Variables

---

## Transforming Variables

There are a variety of reasons why researchers might want to create _new_ variables. In addition to appending new columns or merging in new data, we can also create new columns that either incorporate new information or build on existing columns.

In this case, we'll use the `mutate()` function from the `dplyr` package. Using the pipe notation, we can create these new variables such that:

--

```r
data %>%
  mutate(NEWNAME = EXPRESSION)
```

---

## Transforming Variables

```r
data
```

--

<table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Var1 </th> <th style="text-align:left;"> Var2 </th> <th style="text-align:right;"> Var3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Blue </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Green </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Blue </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Yellow </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> Red </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table>

---

## Transforming Variables

```r
data %>%
mutate(Var_Sum = Var1 + Var3) ``` -- <table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Var1 </th> <th style="text-align:left;"> Var2 </th> <th style="text-align:right;"> Var3 </th> <th style="text-align:right;"> Var_Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Blue </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Green </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Blue </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Yellow </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> Red </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 13 </td> </tr> </tbody> </table> --- ## Transforming Variables ```r data %>% mutate(blue_flag = Var2 == 'Blue') ``` -- <table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Var1 </th> <th style="text-align:left;"> Var2 </th> <th style="text-align:right;"> Var3 </th> <th style="text-align:left;"> blue_flag </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Blue </td> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> TRUE </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td 
style="text-align:left;"> Green </td> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Blue </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> TRUE </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Yellow </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> Red </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> FALSE </td> </tr> </tbody> </table> --- ## Renaming Another dplyr function that performs a similar task to `mutate()` is the `rename()` function which allows you to rename existing column names within your dataframe. Renaming can be especially useful in the data wrangling stage to keep column names following a particular convention and can also be helpful when trying to present results. 
```r
data %>%
  rename(NEWNAME = OLDNAME)
```

---

## Renaming: Example

```r
data %>%
  rename(color = Var2)
```

--

<table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Var1 </th> <th style="text-align:left;"> color </th> <th style="text-align:right;"> Var3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Blue </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Green </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Blue </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Yellow </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> Red </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table>

---

## Renaming: Conventions

R is quite flexible when it comes to creating custom variable names. The rules for valid variable names are:

- a name must start with a letter (or a period not followed by a digit) and can include letters, digits, periods, and/or underscores (no spaces)
- a name _cannot_ start with a number or underscore
- variable names are case-sensitive
- reserved words in R cannot be used (e.g. you cannot use `TRUE` as a variable name)

--

Standard multi-word conventions used (even beyond R) include:

- Snake case - `variable_name1`
- Pascal case - `VariableName1`
- Camel case - `variableName1`

---

## Renaming: Conventions

However, despite those rules, R also allows you to set variable names with a back quote (backtick) when using a tibble. As long as we wrap the name in back quotes, we can violate the rules we just reviewed!
--

```r
# old_var and old_var2 stand in for existing column names
data %>%
  rename(`2 This 1$ An 3xampl3` = old_var,
         `TRUE` = old_var2)
```

... just remember to use the back quote at the beginning and end of the variable name every time you use it!

---
class: inverse, center, middle

## Subsetting

---

## Subsetting Data

To __subset__ a data frame means to isolate part of the data it contains. We can subset data by isolating _rows_ or _columns_.

--

In tidy-speak, subsetting data by _rows_ is referred to as __Filtering__ data (used in the same sense as Microsoft Excel's "filtering" function). We can filter data randomly, by index, or by some logic.

--

Subsetting data by _columns_ is referred to as __Selecting__ data. This is often done to isolate a particular grouping of columns for analysis.

---

## Filtering Data (Subsetting Rows)

Imagine some data frame:

<img src="Original.png" width="65%" style="display: block; margin: auto;" />

--

Filtering rows would return a data frame like the following:

<img src="Filter.png" width="65%" style="display: block; margin: auto;" />

---

## Selecting Data (Subsetting Cols)

Imagine some data frame:

<img src="Original.png" width="65%" style="display: block; margin: auto;" />

--

Selecting columns would return a data frame like the following:

<img src="Select.png" width="65%" style="display: block; margin: auto;" />

---

## Filtering Data

```r
dx
```

<table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> x </th> <th style="text-align:right;"> y </th> <th style="text-align:left;"> z </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> C </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> A </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4
</td> <td style="text-align:left;"> B </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> B </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> A </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> B </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> C </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> A </td> </tr> </tbody> </table> --- ## Filtering Data ```r dx %>% filter(z == 'A') ``` <table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> x </th> <th style="text-align:right;"> y </th> <th style="text-align:left;"> z </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> A </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> A </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> A </td> </tr> </tbody> </table> --- ## Selecting Data ```r dx %>% select(x, z) ``` <table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> x </th> <th style="text-align:left;"> z </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> C </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> A 
</td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> B </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> B </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> A </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> B </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> C </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> A </td> </tr> </tbody> </table>

---

## Selecting Data: Helpers

We'll cover the full list later, but the `select()` function can also make use of a wide array of __selection helpers__, which are small functions in `dplyr` that pick out certain variable names without our needing to type each one explicitly.

--

Some available helpers include:

- `starts_with('pre')` selects any variable that starts with the prefix 'pre'
- `ends_with('suf')` selects any variable that ends with the suffix 'suf'
- `contains('here')` selects any variable that contains the substring 'here'
- `last_col()` selects the last variable (in whatever order variables are stored)

--

E.g.

```r
# 'starts' stands in for a real column-name prefix
dx %>%
  select(starts_with('starts'))
```

---
class: inverse, center, middle

## Joining

---

## Joining Datasets

__Joins__ refer to the combination of various datasets where there is some common key (or set of keys) connecting them.

--

Assume we have two datasets named _x_ and _y_

--

There are different _types_ of joins (also referred to as merges):

- __1:1__ - when one record in dataset _x_ refers to exactly one record in dataset _y_
- __1:Many__ - when one record in dataset _x_ refers to many records in dataset _y_ (can also be the inverse, i.e. Many:1)
- __Many:Many__ - when many records in dataset _x_ refer to many records in dataset _y_
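
--

A minimal sketch of the first two join types, assuming three made-up tables named `schools`, `budgets`, and `students` (hypothetical names and values, not course data) and using `dplyr`'s `left_join()`:

```r
library(dplyr)

# Hypothetical tables, made up for illustration
schools  <- tibble(school_id = c(1, 2), name = c("Hoyas", "Friars"))
budgets  <- tibble(school_id = c(1, 2), budget = c(100, 80))
students <- tibble(student = c("A", "B", "C"), school_id = c(1, 1, 2))

# 1:1 -- each school matches exactly one budget row
one_to_one <- schools %>%
  left_join(budgets, by = "school_id")

# 1:Many -- each school matches one or more student rows,
# so the school columns repeat once per matching student
one_to_many <- schools %>%
  left_join(students, by = "school_id")

nrow(one_to_one)   # same number of rows as schools
nrow(one_to_many)  # one row per matching student
```

Comparing row counts before and after a join is a quick way to check which type of join actually took place.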
--

<img src="merge_types.png" width="65%" style="display: block; margin: auto;" />

---

## Joining Datasets

.pull-left[
Additionally, there are also different ways to specify which records to keep. Among others, the main types of joins in our workflow will be:

- __Inner join__ - keeps only records that overlap between both _x_ and _y_ (intersection)
- __Full join__ - keeps all records in _x_ and _y_ (including those that overlap) (union)
- __Left join__ - keeps all records in _x_ and the records from _y_ that overlap
- __Right join__ - keeps all records in _y_ and the records from _x_ that overlap
]

--

.pull-right[
<div class="figure" style="text-align: center">
<img src="sql-joins.png" alt="Join Types" width="75%" />
<p class="caption">Join Types</p>
</div>
]

---

## Joining Datasets: Examples

.pull-left[
X:

<table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> id </th> <th style="text-align:left;"> name </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Hoyas </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Hoyas </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Wildcats </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Friars </td> </tr> </tbody> </table>
]

.pull-right[
Y:

<table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:right;"> id2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Hoyas </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> Mets </td> <td style="text-align:right;"> 20
</td> </tr> </tbody> </table> ] --- ## Joining Datasets: Examples ```r x %>% left_join(y, by = c('name' = 'name')) ``` -- <table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> id </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> id2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Hoyas </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Hoyas </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Wildcats </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Friars </td> <td style="text-align:right;"> NA </td> </tr> </tbody> </table> --- class: inverse, center, middle ## Collapsing and Aggregating --- ## Collapsing Data __Collapsing__ or aggregating data often refers to finding summary data points for a data set or for particular groups within a data set. Using `dplyr` we can leverage the `summarize()` command paired with the function for the datapoint we'd like to explore. -- Let's say we have a dataset of car models and we want to measure summary statistics for miles per gallon.
--- ## Collapsing Data ```r mtcars %>% summarize(mean = mean(mpg), sd = sd(mpg)) ``` -- <table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> sd </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 20.09062 </td> <td style="text-align:right;"> 6.026948 </td> </tr> </tbody> </table> --- ## Collapsing by Group Perhaps more importantly, we can uncover summaries by different groups within our data. Using the `group_by()` command within a piped workflow, we can identify groups and then summarize our data within those groups. -- Using the example from the previous slide, we can group by number of engine cylinders: ```r mtcars %>% group_by(cyl) %>% summarize(mean = mean(mpg), sd = sd(mpg)) ``` -- <table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> cyl </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> sd </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 26.66364 </td> <td style="text-align:right;"> 4.509828 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 19.74286 </td> <td style="text-align:right;"> 1.453567 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 15.10000 </td> <td style="text-align:right;"> 2.560048 </td> </tr> </tbody> </table> --- class: inverse, center, middle ## Tidy Data, tidyr, and Reshaping --- ## Tidy Data The same data can be presented in many different formats. 
For example: -- .pull-left[ __Long Data__ <table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> GOP </th> <th style="text-align:right;"> DEM </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2004 </td> <td style="text-align:right;"> 0.507 </td> <td style="text-align:right;"> 0.483 </td> </tr> <tr> <td style="text-align:right;"> 2008 </td> <td style="text-align:right;"> 0.457 </td> <td style="text-align:right;"> 0.529 </td> </tr> <tr> <td style="text-align:right;"> 2012 </td> <td style="text-align:right;"> 0.472 </td> <td style="text-align:right;"> 0.511 </td> </tr> <tr> <td style="text-align:right;"> 2016 </td> <td style="text-align:right;"> 0.461 </td> <td style="text-align:right;"> 0.482 </td> </tr> <tr> <td style="text-align:right;"> 2020 </td> <td style="text-align:right;"> 0.469 </td> <td style="text-align:right;"> 0.513 </td> </tr> </tbody> </table> ] -- .pull-right[ __Wide Data__ <table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> Party </th> <th style="text-align:right;"> 2004 </th> <th style="text-align:right;"> 2008 </th> <th style="text-align:right;"> 2012 </th> <th style="text-align:right;"> 2016 </th> <th style="text-align:right;"> 2020 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> GOP </td> <td style="text-align:right;"> 0.507 </td> <td style="text-align:right;"> 0.457 </td> <td style="text-align:right;"> 0.472 </td> <td style="text-align:right;"> 0.461 </td> <td style="text-align:right;"> 0.469 </td> </tr> <tr> <td style="text-align:left;"> DEM </td> <td style="text-align:right;"> 0.483 </td> <td style="text-align:right;"> 0.529 </td> <td 
style="text-align:right;"> 0.511 </td> <td style="text-align:right;"> 0.482 </td> <td style="text-align:right;"> 0.513 </td> </tr> </tbody> </table> ] --- ## Tidy Data .pull-left[ Tidy data reformats a dataset such that (Wickham 2010): 1. Each variable forms a column 2. Each observation forms a row 3. Each type of observational unit forms a table ] -- .pull-right[ For example: <table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Year </th> <th style="text-align:left;"> Party </th> <th style="text-align:right;"> Share </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2004 </td> <td style="text-align:left;"> GOP </td> <td style="text-align:right;"> 0.507 </td> </tr> <tr> <td style="text-align:right;"> 2004 </td> <td style="text-align:left;"> DEM </td> <td style="text-align:right;"> 0.483 </td> </tr> <tr> <td style="text-align:right;"> 2008 </td> <td style="text-align:left;"> GOP </td> <td style="text-align:right;"> 0.457 </td> </tr> <tr> <td style="text-align:right;"> 2008 </td> <td style="text-align:left;"> DEM </td> <td style="text-align:right;"> 0.529 </td> </tr> <tr> <td style="text-align:right;"> 2012 </td> <td style="text-align:left;"> GOP </td> <td style="text-align:right;"> 0.472 </td> </tr> <tr> <td style="text-align:right;"> 2012 </td> <td style="text-align:left;"> DEM </td> <td style="text-align:right;"> 0.511 </td> </tr> <tr> <td style="text-align:right;"> 2016 </td> <td style="text-align:left;"> GOP </td> <td style="text-align:right;"> 0.461 </td> </tr> <tr> <td style="text-align:right;"> 2016 </td> <td style="text-align:left;"> DEM </td> <td style="text-align:right;"> 0.482 </td> </tr> <tr> <td style="text-align:right;"> 2020 </td> <td style="text-align:left;"> GOP </td> <td style="text-align:right;"> 0.469 </td> </tr> <tr> <td style="text-align:right;"> 2020 </td> <td 
style="text-align:left;"> DEM </td> <td style="text-align:right;"> 0.513 </td> </tr> </tbody> </table>
]

---

## 🚨`tidyr`🚨

.pull-left[
`tidyr` is another __tidyverse__ package, which is used to help create tidy data by pivoting datasets.

Tonight we'll cover the following functions:

- `pivot_longer()`
- `pivot_wider()`

For more on `tidyr`, check out its vignettes and help pages.
]

.pull-right[
<img src="tidyr.png" width="65%" style="display: block; margin: auto;" />
]

---

## Reshaping: Long to Tidy

```r
library(tidyr)
elec
```

--

<table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> GOP </th> <th style="text-align:right;"> DEM </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2004 </td> <td style="text-align:right;"> 0.507 </td> <td style="text-align:right;"> 0.483 </td> </tr> <tr> <td style="text-align:right;"> 2008 </td> <td style="text-align:right;"> 0.457 </td> <td style="text-align:right;"> 0.529 </td> </tr> <tr> <td style="text-align:right;"> 2012 </td> <td style="text-align:right;"> 0.472 </td> <td style="text-align:right;"> 0.511 </td> </tr> <tr> <td style="text-align:right;"> 2016 </td> <td style="text-align:right;"> 0.461 </td> <td style="text-align:right;"> 0.482 </td> </tr> <tr> <td style="text-align:right;"> 2020 </td> <td style="text-align:right;"> 0.469 </td> <td style="text-align:right;"> 0.513 </td> </tr> </tbody> </table>

---

## Reshaping: Long to Tidy

```r
library(tidyr)
elec %>%
  pivot_longer(cols = GOP:DEM, names_to = "Party", values_to = "Share")
```

--

<table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Year </th> <th style="text-align:left;"> Party
</th> <th style="text-align:right;"> Share </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2004 </td> <td style="text-align:left;"> GOP </td> <td style="text-align:right;"> 0.507 </td> </tr> <tr> <td style="text-align:right;"> 2004 </td> <td style="text-align:left;"> DEM </td> <td style="text-align:right;"> 0.483 </td> </tr> <tr> <td style="text-align:right;"> 2008 </td> <td style="text-align:left;"> GOP </td> <td style="text-align:right;"> 0.457 </td> </tr> <tr> <td style="text-align:right;"> 2008 </td> <td style="text-align:left;"> DEM </td> <td style="text-align:right;"> 0.529 </td> </tr> <tr> <td style="text-align:right;"> 2012 </td> <td style="text-align:left;"> GOP </td> <td style="text-align:right;"> 0.472 </td> </tr> <tr> <td style="text-align:right;"> 2012 </td> <td style="text-align:left;"> DEM </td> <td style="text-align:right;"> 0.511 </td> </tr> <tr> <td style="text-align:right;"> 2016 </td> <td style="text-align:left;"> GOP </td> <td style="text-align:right;"> 0.461 </td> </tr> <tr> <td style="text-align:right;"> 2016 </td> <td style="text-align:left;"> DEM </td> <td style="text-align:right;"> 0.482 </td> </tr> <tr> <td style="text-align:right;"> 2020 </td> <td style="text-align:left;"> GOP </td> <td style="text-align:right;"> 0.469 </td> </tr> <tr> <td style="text-align:right;"> 2020 </td> <td style="text-align:left;"> DEM </td> <td style="text-align:right;"> 0.513 </td> </tr> </tbody> </table> --- ## Helpful Resources(!) 1. __Help pages__ - useful documentation in a regular format, which explains the arguments, outputs, and function details (often with examples) `?help` 2. __Vignettes__ - longer form guides (often include tutorials, lengthy examples) `vignette('tibble')` 3. 
__Cheat sheets__ - a collection of functions or routines associated with R packages (see the Resources tab on the course website) --- ## Next Week's Readings __February 08:__ Data Visualization - W & G: Chapter 3 - Healy, Data Visualization, Two Chapters (on Canvas) - Optional: Grammar of Graphics and Claus Wilke text (selected topics) Also: Problem Set #1 Due