There’s so much to discover and learn when working with spreadsheets. As a summer intern at RStudio my project has focused on creating a resource that lies at the intersection between spreadsheets and R, aimed at users who either exclusively use one or the other or whose work lives in the intersection of the two. This project has focused on a plethora of topics from comparing numerous R package functionalities that work with spreadsheets, to writing about tips and tricks on best practices for working in both a spreadsheet software and in R.
In this post I’d like to share a part of this work, specifically focused on the R versions of common spreadsheet workflows. To do that we’ll assume the role of a data analyst in two hypothetical scenarios and walk through the tidyverse equivalent of common analyses and tasks one is likely to come across when working with data stored in spreadsheets.
For our first story, it’s day one as a new data analyst and our boss has asked us to take a look at the company’s finances. They send you a spreadsheet, available in an Excel workbook, and ask you to analyze spending patterns over the current year and give your assessment into where spending is going overboard.
We’ve decided to run our analysis in R to take advantage of data wrangling and visualization functions from the tidyverse as well as to create a reproducible workflow in case we want to revisit this analysis later. First, let’s load in our data and take a look at it in R
If you would like to follow along, you can download the Excel file called spending.xlsx
here by clicking on the Download icon (arrow pointing down) on the top right corner.
Note that in the code chunk below we are reading this file from a folder called data
so we recommend you create such a folder in your working directory and place the file you downloaded there.
You can also follow along with the analysis on RStudio Cloud
here.
library(tidyverse)
library(readxl)
spending <- read_excel("data/spending.xlsx")
spending
## # A tibble: 48 x 5
## month expense amount num_of_expenses budget
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Jan lawsuits 9916 17 6422
## 2 Feb lawsuits 9575 16 3125
## 3 Mar lawsuits 5760 33 3913
## 4 Apr lawsuits 3828 36 4062
## 5 May lawsuits 5322 25 6347
## 6 Jun lawsuits 6809 32 643
## 7 Jul lawsuits 8568 37 5985
## 8 Aug lawsuits 7562 34 10709
## 9 Sep lawsuits 5682 26 9719
## 10 Oct lawsuits 3689 16 7887
## # … with 38 more rows
We take note of the month column as well as amount spent, budget, and, expense type.
We’ll hold off on looking at the column num_of_expenses
, it may prove useful down the line but not pressing for a quick look into the data.
IF statements with nested logic
One advantage of spreadsheets is that we can use formulas to dictate what should happen in a specific cell and, in a broader application of this, apply them to an entire column. Back to our spending sheet, while data for all 12 months is useful for our summary we want to take a look at spending by quarter, something not provided in the initial data set, so we’ll need to make our own quarter variable. We can see how this process works in spreadsheets, manipulating our data in an Excel workbook.
It’s a bit difficult to see the formula bar in the image above, so the formula to create used for creating the quarter column is provided below.
=IF(OR(A2="Jan",A2="Feb",A2="Mar"),"Q1",IF(OR(A2="Apr",A2="May",A2="Jun"),"Q2",IF(OR(A2="Jul",A2="Aug",A2="Sep"),"Q3",IF(OR(A2="Oct",A2="Nov",A2="Dec"),"Q4","NA"))))
For our tidyverse comparison, When creating a column that requires multiple or nested logical conditions, we can take advantage of the case_when()
function from dplyr.
Similar to a CASE statement in SQL, we can set parameters for what a given value should be based on conditions from other rows or columns in our data.
To do this we can create a new column using the mutate()
function and apply case_when()
within that call.
spending_quarter <- spending %>%
mutate(quarter = case_when(
month %in% c("Jan", "Feb", "Mar") ~ "Q1",
month %in% c("Apr", "May", "Jun") ~ "Q2",
month %in% c("Jul", "Aug", "Sep") ~ "Q3",
month %in% c("Oct", "Nov", "Dec") ~ "Q4",
TRUE ~ "NA"
))
spending_quarter
## # A tibble: 48 x 6
## month expense amount num_of_expenses budget quarter
## <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 Jan lawsuits 9916 17 6422 Q1
## 2 Feb lawsuits 9575 16 3125 Q1
## 3 Mar lawsuits 5760 33 3913 Q1
## 4 Apr lawsuits 3828 36 4062 Q2
## 5 May lawsuits 5322 25 6347 Q2
## 6 Jun lawsuits 6809 32 643 Q2
## 7 Jul lawsuits 8568 37 5985 Q3
## 8 Aug lawsuits 7562 34 10709 Q3
## 9 Sep lawsuits 5682 26 9719 Q3
## 10 Oct lawsuits 3689 16 7887 Q4
## # … with 38 more rows
Excellent! We’ve got our new quarter column set up.
Here we specified which months make up each quarter and as in our spreadsheet formula for good measure added an NA condition in the event we’ve missed something.
We can do a quick sanity check for our case_when()
statement by examining our data and seeing if any NA values show up in the new column we generated, which would indicate if a month is spelled differently than what’s listed in our conditions or if there’s an error in our code.
Pivot tables
Pivot tables are a powerful tool spreadsheets offer, allowing us to generate summaries of large quantities of data.
Let’s take a look at how we create our basic Pivot table in spreadsheets and then see how we can create summary tables using functions from the tidyverse
First we’ll create a separate sheet for our pivot table, specifying the range of the data we want it to be based on.
And then, it’s as simple as dragging and dropping our variables of interest into specific order with expense
and quarter
in rows and budget
and amount
in values, specifying a sum calculation.
For our tidyverse comparison, we can create summary tables as well. Let’s take a look total expenses and budget by expense type and quarter.
spending_table <- spending_quarter %>%
group_by(expense, quarter) %>%
summarise(
total_expenses = sum(num_of_expenses),
total_expense_amt = sum(amount),
total_budget_amt = sum(budget),
.groups = "drop"
)
spending_table
## # A tibble: 16 x 5
## expense quarter total_expenses total_expense_amt total_budget_amt
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 consulting Q1 70 16266 17949
## 2 consulting Q2 33 11329 22345
## 3 consulting Q3 93 17997 17460
## 4 consulting Q4 82 16622 17358
## 5 lawsuits Q1 66 25251 13460
## 6 lawsuits Q2 93 15959 11052
## 7 lawsuits Q3 97 21812 26413
## 8 lawsuits Q4 56 14829 17939
## 9 office happy hour Q1 32 17749 18688
## 10 office happy hour Q2 56 7879 17658
## 11 office happy hour Q3 39 16876 12384
## 12 office happy hour Q4 38 14752 18058
## 13 office supplies Q1 83 13460 25251
## 14 office supplies Q2 45 12517 19140
## 15 office supplies Q3 67 7527 21819
## 16 office supplies Q4 91 9401 14342
New variables
Sometimes we make a table and realize we can use the summary data to create new variables that can better answer our question. Coming back to our finance example, let’s say we want to examine how much money is left over from each quarter based off of budget and amount spent. A quick work around could be to manually add in a formula in a cell adjacent to our table and drag it down to each respective row, a process depicted below.
Alternatively, using the mutate()
function from dplyr, we can create a new column in our summary table and save that as a new data frame.
left_over <- spending_table %>%
mutate(left_over = total_budget_amt - total_expense_amt) %>%
arrange(left_over)
left_over
## # A tibble: 16 x 6
## expense quarter total_expenses total_expense_a… total_budget_amt left_over
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 lawsuits Q1 66 25251 13460 -11791
## 2 lawsuits Q2 93 15959 11052 -4907
## 3 office ha… Q3 39 16876 12384 -4492
## 4 consulting Q3 93 17997 17460 -537
## 5 consulting Q4 82 16622 17358 736
## 6 office ha… Q1 32 17749 18688 939
## 7 consulting Q1 70 16266 17949 1683
## 8 lawsuits Q4 56 14829 17939 3110
## 9 office ha… Q4 38 14752 18058 3306
## 10 lawsuits Q3 97 21812 26413 4601
## 11 office su… Q4 91 9401 14342 4941
## 12 office su… Q2 45 12517 19140 6623
## 13 office ha… Q2 56 7879 17658 9779
## 14 consulting Q2 33 11329 22345 11016
## 15 office su… Q1 83 13460 25251 11791
## 16 office su… Q3 67 7527 21819 14292
Great! Using the arrange()
function we can also bring the largest instances of overspending to the top of our table to focus our attention.
In Excel, this is something you’d normally do via Data > Sort.
We can also filter our table if we’re interested in a specific expense type. In Excel, you could to this via Data > Filter. Let’s hone in on spending for office happy hours.
left_over %>%
filter(expense == "office happy hour")
## # A tibble: 4 x 6
## expense quarter total_expenses total_expense_a… total_budget_amt left_over
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 office hap… Q3 39 16876 12384 -4492
## 2 office hap… Q1 32 17749 18688 939
## 3 office hap… Q4 38 14752 18058 3306
## 4 office hap… Q2 56 7879 17658 9779
Hmm, a lot of overspending in Quarter 3, which might make sense as these are the summer months. Alternatively, we can take advantage of Sort and Filter in spreadsheets to accomplish the same thing.
VLOOKUPs and joins
We don’t always have all of the information we need in a single data file and many times we need to bring in data from other sources to supplement and augment our existing data in the course of an analysis.
For our second scenario, we’ll leave the finance world and move on to something that is perhaps a little more exciting: superheroes!
Below is a data set containing information on various comic book characters with our first sheet providing some basic information on each character.
And a second sheet containing information on each character’s alter ego.
Over time this data set has been expanded on with multiple sheets including new information. Before we start exploring, we’ll need to join the relevant data from the two sheets so we can conduct some fun analyses.
Let’s first walk through the spreadsheet approach and utilize a VLOOKUP to bring together these data.
In each sheet, we notice there is a hero
column that allows us to match data from one sheet to the other.
The example below shows the process of pulling the alter_ego
column from our alter egos sheet into our initial heroes data set.
Starting our tidyverse comparison, we’ll first load our data. Our data set contains two sheets so we’ll load them in separately as two distinct data frames.
If you would like to follow along, you can download the Excel file called superheroes.xlsx
here and remember that you can also follow along with the analysis on RStudio Cloud
here.
superheroes <- read_excel("data/superheroes.xlsx", sheet = "heroes")
identities <- read_excel("data/superheroes.xlsx", sheet = "alter_egos")
Let’s take a look at the sheet with data on superheroes.
superheroes
## # A tibble: 26 x 4
## hero universe has_powers first_appearance
## <chr> <chr> <chr> <dttm>
## 1 Superman DC yes 1938-04-18 00:00:00
## 2 Batman DC no 1939-03-30 00:00:00
## 3 Spider-Man Marvel yes 1962-08-01 00:00:00
## 4 Green Lantern DC yes 1971-12-01 00:00:00
## 5 Daredevil Marvel no 1964-04-01 00:00:00
## 6 Wonder Woman DC yes 1941-10-01 00:00:00
## 7 Black Widow Marvel no 1964-04-01 00:00:00
## 8 Captain America Marvel yes 1941-03-01 00:00:00
## 9 Hulk Marvel yes 1962-05-01 00:00:00
## 10 Thor Marvel yes 1962-08-01 00:00:00
## # … with 16 more rows
And now let’s take a look at the sheet listing the alter ego of each character.
identities
## # A tibble: 26 x 3
## hero alter_ego is_secret
## <chr> <chr> <chr>
## 1 Superman Clark Kent yes
## 2 Batman Bruce Wayne yes
## 3 Spider-Man Peter Parker yes
## 4 Green Lantern John Stewart yes
## 5 Daredevil Matt Murdock yes
## 6 Wonder Woman Diana Prince yes
## 7 Black Widow Natasha Romanova no
## 8 Captain America Steve Rogers no
## 9 Hulk Bruce Banner no
## 10 Thor Donald Blake yes
## # … with 16 more rows
Using left_join()
from the dplyr package, we can bring in the data from the identities data frame into our superheroes data frame.
To do this we’ll call left_join()
on superheroes, putting the identities data set as our first argument and specify our common variable with the by
argument.
superhero_identities <- superheroes %>%
left_join(identities, by = "hero")
superhero_identities
## # A tibble: 26 x 6
## hero universe has_powers first_appearance alter_ego is_secret
## <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 Superman DC yes 1938-04-18 00:00:00 Clark Kent yes
## 2 Batman DC no 1939-03-30 00:00:00 Bruce Wayne yes
## 3 Spider-Man Marvel yes 1962-08-01 00:00:00 Peter Parker yes
## 4 Green Lantern DC yes 1971-12-01 00:00:00 John Stewart yes
## 5 Daredevil Marvel no 1964-04-01 00:00:00 Matt Murdock yes
## 6 Wonder Woman DC yes 1941-10-01 00:00:00 Diana Prince yes
## 7 Black Widow Marvel no 1964-04-01 00:00:00 Natasha Roma… no
## 8 Captain Amer… Marvel yes 1941-03-01 00:00:00 Steve Rogers no
## 9 Hulk Marvel yes 1962-05-01 00:00:00 Bruce Banner no
## 10 Thor Marvel yes 1962-08-01 00:00:00 Donald Blake yes
## # … with 16 more rows
Great, we’ve joined the data frames with no problem!
There are, of course, instances in which we’ll need to pull in data from multiple sheets, and the superheroes data set is no exception. We’ve been informed there are an additional two sheets that have recently been added. One containing information on the main power or skill of each character,
powers <- read_excel("data/superheroes.xlsx", sheet = "powers")
powers
## # A tibble: 26 x 2
## hero main_power
## <chr> <chr>
## 1 Superman flight
## 2 Batman inexhaustable wealth
## 3 Spider-Man wall-crawling
## 4 Green Lantern power ring
## 5 Daredevil echo-location
## 6 Wonder Woman super strength
## 7 Black Widow training
## 8 Captain America peak physical fitness
## 9 Hulk super strength
## 10 Thor magic hammer
## # … with 16 more rows
and the other containing the occupation of the alter ego of each superhero.
occupation <- read_excel("data/superheroes.xlsx", sheet = "occupation")
occupation
## # A tibble: 26 x 2
## name occupation
## <chr> <chr>
## 1 Clark Kent Reporter
## 2 Bruce Wayne CEO
## 3 Peter Parker Photogragpher
## 4 John Stewart U.S. Marine
## 5 Matt Murdock Lawyer
## 6 Diana Prince Intelligence Officer
## 7 Natasha Romanova Spy/Assassin
## 8 Steve Rogers Solider
## 9 Bruce Banner Scientist
## 10 Donald Blake Doctor
## # … with 16 more rows
We see that the powers sheet has a hero
column allowing for another simple left join.
However, the occupation sheet doesn’t have a common column by that specific name, but we do recognize that the information in the name
column is the same for the alter_ego
column from our identities spreadsheet.
We can take advantage of this relationship to bring these data frames together as shown in the code below.
heroes_full <- superhero_identities %>%
left_join(powers, by = "hero") %>%
left_join(occupation, by = c("alter_ego" = "name"))
Here we conduct a simple left join bringing in the data from the powers data set, and then another left join to bring in the data from the occupation spreadsheet.
In the by
argument we specify that the column alter_ego
should be matched to the name
column from the occupation data frame.
We’ve expanded our dataset quite a bit.
Let’s take an alternative approach to view it using the glimpse()
function from dplyr.
glimpse(heroes_full)
## Rows: 26
## Columns: 8
## $ hero <chr> "Superman", "Batman", "Spider-Man", "Green Lantern",…
## $ universe <chr> "DC", "DC", "Marvel", "DC", "Marvel", "DC", "Marvel"…
## $ has_powers <chr> "yes", "no", "yes", "yes", "no", "yes", "no", "yes",…
## $ first_appearance <dttm> 1938-04-18, 1939-03-30, 1962-08-01, 1971-12-01, 196…
## $ alter_ego <chr> "Clark Kent", "Bruce Wayne", "Peter Parker", "John S…
## $ is_secret <chr> "yes", "yes", "yes", "yes", "yes", "yes", "no", "no"…
## $ main_power <chr> "flight", "inexhaustable wealth", "wall-crawling", "…
## $ occupation <chr> "Reporter", "CEO", "Photogragpher", "U.S. Marine", "…
For doing this same task in Excel, we can continue to utilize VLOOKUPS for each additional column of data we want to pull into our main sheet. To bring in our occupation data we simply adjust the initial reference point in our formula from column A2 which houses hero name to E2 which houses alter egos which was also added via a VLOOKUP.
There we have it, a fully combined data set ready for analysis! Hmmm, now that we think about it, let’s hope we aren’t sharing this data with any super villains. We may be an ace at working with spreadsheet data but we’re also an ethical data analyst!
I hope you’ve found this blog post useful and enjoyed reading through each example. It’s been such a pleasure working with Jenny Bryan and Mine Çetinkaya-Rundel on this project and we’re excited to share more of the Spreadsheets Using R resource with readers in the future. Until then, no matter what tools you’re using to work with your data, happy analyzing!