Gender and Race in Venture Capital

Richard Kerby wrote about diversity within venture capital. Interesting stuff. Even better, Kerby made his data public. Sadly, the data is a fairly non-tidy Excel file. The purpose of this post is to process it into something a little nicer.

library(tidyverse)

raw <- read_csv("https://www.davidkane.info/files/blog_files/kerby.csv",
              
              # Column names are a mess. So, after running the simple read_csv()
              # command on the raw file, I use spec(x) to get the default column
              # types and then use the col_types argument to set them by hand.
              
              col_types = cols(
                URL = col_character(),
                Names = col_character(),
                Firm = col_character(),
                Title = col_character(),
                Male = col_integer(),
                Female = col_integer(),
                White = col_integer(),
                Black = col_integer(),
                Latinx = col_integer(),
                Asian = col_integer(),
                `White & Male` = col_integer(),
                `White & Female` = col_integer(),
                `Black & Male` = col_integer(),
                `Black & Female` = col_integer(),
                `Latino & Male` = col_integer(),
                `Latino & Female` = col_integer(),
                `Asian & Male` = col_integer(),
                `Asian & Female` = col_integer(),
                X19 = col_character(),
                `Engineering Degree` = col_integer(),
                Operator = col_integer(),
                Count = col_integer(),
                X23 = col_character(),
                `LinkedIn Profile` = col_character(),
                `Undergraduate School` = col_character(),
                `Graduate School` = col_character(),
                `Graduate School_1` = col_character(),
                `Stanford or Harvard` = col_integer(),
                White_1 = col_integer(),
                Black_1 = col_integer(),
                Latinx_1 = col_integer(),
                Asian_1 = col_integer()
                ), 
              
              # Bottom of the file performs summations. So, only read in the
              # first 1487 rows.
              
              n_max = 1487)
## Warning: Missing column names filled in: 'X19' [19], 'X23' [23]
## Warning: Duplicated column names deduplicated: 'Graduate School' => 'Graduate
## School_1' [27], 'White' => 'White_1' [29], 'Black' => 'Black_1' [30], 'Latinx' =>
## 'Latinx_1' [31], 'Asian' => 'Asian_1' [32]
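
The workflow described in the comments above (read once with defaults, inspect the guessed types with spec(), then set them by hand through col_types) can be sketched on a tiny stand-in file. The file contents and the three column names below are made up for illustration; only read_csv(), spec(), cols() and the col_*() helpers come from readr.

```r
library(readr)

# Hypothetical three-column file standing in for the real data.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Names,Firm,Male",
             "A Person,Some Firm,1",
             "B Person,Other Firm,"), tmp)

# Step 1: read with the defaults and look at what readr guessed.
first_pass <- read_csv(tmp)
spec(first_pass)

# Step 2: copy the printed spec, adjust it, and pass it back in.
second_pass <- read_csv(tmp,
                        col_types = cols(
                          Names = col_character(),
                          Firm = col_character(),
                          Male = col_integer()
                        ))
```

Setting the types by hand means that a later change in the file (say, a stray character in a numeric column) shows up as a parsing problem instead of silently changing a column's type.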

These warning messages are unsightly, but I will leave them here for educational purposes. Let’s drop some variables and rename others.

x <- raw %>% 
  
  # Lots of these columns are, obviously, redundant. If I had more energy, I
  # would do some data integrity checks, like: Does a 1 in `White & Male`
  # correspond to all those, and only those, entries with a 1 for White and for
  # Male? Ignore that for now.
  
  select(-URL, -(`White & Male`:X19), 
         -(`Graduate School_1`:Asian_1),
         -(Count:`LinkedIn Profile`)) %>% 
  
  # Clean up variable names. Lower case is better. I am not sure what some of
  # these (like operator) mean. Note that the original data included two
  # graduate schools, but I am ignoring that for now.
  
  rename(name = Names,
         firm = Firm,
         title = Title,
         operator = Operator,
         eng_degree = `Engineering Degree`,
         undergrad = `Undergraduate School`,
         grad = `Graduate School`)
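
The integrity check mentioned in the comments above could look something like the following sketch. It runs on a made-up three-row tibble, not on raw, and the helper name check_combo() is hypothetical.

```r
library(dplyr)

# Toy data with the same layout as three of the raw columns.
toy <- tibble(
  White = c(1L, 1L, NA),
  Male = c(1L, NA, 1L),
  `White & Male` = c(1L, NA, NA)
)

# TRUE when the combined flag is set for exactly those rows where both
# component flags are set. NA is treated as "not flagged".
check_combo <- function(both, a, b) {
  flagged <- function(v) !is.na(v) & v == 1
  all(flagged(both) == (flagged(a) & flagged(b)))
}

check_combo(toy$`White & Male`, toy$White, toy$Male)
```

On the real data, the same call would have to be made against raw, before the redundant columns are dropped.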

I believe that this data assigns everyone exactly one gender and one race. Let’s confirm.

(sum(x$Male, na.rm = TRUE) + sum(x$Female, na.rm = TRUE)) / length(x$Male)
## [1] 0.9993275

That should add to 1, but it doesn’t! Let’s find the problem.

x %>% 
  filter(is.na(Male), is.na(Female))
## # A tibble: 1 x 13
##   name   firm   title  Male Female White Black Latinx Asian eng_degree operator undergrad
##   <chr>  <chr>  <chr> <int>  <int> <int> <int>  <int> <int>      <int>    <int> <chr>    
## 1 Brend… Openv… Asso…    NA     NA    NA    NA     NA    NA          0        1 Harvard  
## # ... with 1 more variable: grad <chr>

We have a similar problem in the race data.

x %>% 
  filter((Asian == 1 & White == 1) |
          (is.na(White) & is.na(Black) & is.na(Asian) & is.na(Latinx))) %>% 
  select(name, firm, White, Asian, Black, Latinx)
## # A tibble: 4 x 6
##   name              firm                      White Asian Black Latinx
##   <chr>             <chr>                     <int> <int> <int>  <int>
## 1 Veronica Orellana Insight Venture Partners     NA    NA    NA     NA
## 2 Steven Hong       Kleiner Perkins               1     1    NA     NA
## 3 Brendan Rempel    Openview Venture Partners    NA    NA    NA     NA
## 4 Jay Zaveri        Social Capital                1     1    NA     NA

Judging from his LinkedIn profile, Brendan Rempel is male and white, Steven Hong is (East) Asian and Jay Zaveri is (South) Asian. Orellana is a Hispanic name. Let’s fix our data.

# Maybe there are better ways of doing this? Note that I am ignoring the
# possibility of duplicate names.

x <- x %>% 
  mutate(Male  = ifelse(name == "Brendan Rempel", 1, Male),
         White = ifelse(name == "Brendan Rempel", 1, White),
         White = ifelse(name %in% c("Steven Hong", "Jay Zaveri"), NA, White),
         Latinx = ifelse(name %in% c("Veronica Orellana"), 1, Latinx))
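
One possible alternative, shown here on a made-up tibble with placeholder names, is to gather all corrections to a column into a single case_when() call. This is only a sketch of the idea, not a claim that it is better for this data.

```r
library(dplyr)

toy <- tibble(
  name = c("Person A", "Person B", "Person C"),
  White = c(NA, 1L, 1L),
  Asian = c(NA, 1L, NA)
)

fixed <- toy %>%
  mutate(White = case_when(
    name == "Person A" ~ 1L,           # fill in a missing value
    name == "Person B" ~ NA_integer_,  # clear a double classification
    TRUE ~ White                       # leave everyone else unchanged
  ))
```

Using 1L and NA_integer_ keeps the column an integer, whereas ifelse() with a bare 1 can coerce it to double. Like the original, this sketch ignores the possibility of duplicate names.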

Now we can use stopifnot() to confirm that the data is good.

stopifnot(sum(x$Male, na.rm = TRUE) + 
             sum(x$Female, na.rm = TRUE) == length(x$Male))
stopifnot(sum(x$White, na.rm = TRUE) + 
             sum(x$Black, na.rm = TRUE) + 
             sum(x$Latinx, na.rm = TRUE) + 
             sum(x$Asian, na.rm = TRUE) == length(x$White))
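
Comparing column totals to the row count would not catch a case where one person has two flags set and another has none, since the two errors cancel out. A stricter row-by-row check, sketched here on invented data, might look like this:

```r
library(dplyr)

toy <- tibble(
  Male = c(1L, NA, 1L),
  Female = c(NA, 1L, NA),
  White = c(1L, NA, NA),
  Black = rep(NA_integer_, 3),
  Latinx = c(NA, 1L, NA),
  Asian = c(NA, NA, 1L)
)

# Count the flags set in each row, treating NA as zero.
gender_flags <- rowSums(select(toy, Male, Female), na.rm = TRUE)
race_flags <- rowSums(select(toy, White, Black, Latinx, Asian), na.rm = TRUE)

# Every row should have exactly one gender flag and one race flag.
stopifnot(all(gender_flags == 1), all(race_flags == 1))
```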

Let’s make the data tidy by creating gender and race columns.

x <- x %>% 
  mutate(gender = case_when(Male == 1 ~ "Male", 
                            TRUE  ~ "Female"),
         race = case_when(White == 1 ~ "White",
                          Asian == 1 ~ "Asian",
                          Latinx == 1 ~ "Latinx",
                          TRUE ~ "Black")) %>% 
  select(-(Male:Asian)) %>% 
  
  # I delete the Associates. One advantage of looking closely at the data is
  # that you notice things. For example, Veronica Orellana graduated from high
  # school in 2015. She is currently an undergraduate! To the extent we want to
  # look at who holds the power in venture capital, we need to look at just the
  # Partners and Principals (the only other two titles in our data).
  
  filter(title != "Associate")
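
case_when() is one way to collapse the indicator columns into gender and race. Another option, sketched below on toy data with placeholder names, is pivot_longer() from tidyr, which turns the column names themselves into values.

```r
library(dplyr)
library(tidyr)

toy <- tibble(
  name = c("Person A", "Person B"),
  White = c(1L, NA),
  Black = rep(NA_integer_, 2),
  Latinx = rep(NA_integer_, 2),
  Asian = c(NA, 1L)
)

tidy_race <- toy %>%
  pivot_longer(cols = c(White, Black, Latinx, Asian),
               names_to = "race",
               values_to = "flag") %>%
  filter(flag == 1) %>%   # keep only the category that is set; NA rows drop out
  select(-flag)
```

Unlike a case_when() with a TRUE fallback, this version drops anyone with no flag set instead of assigning them a default category, so it only gives the same answer once the data has been cleaned.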

Let’s take a look at the biggest firms by number of partners/principals:

x %>% 
  group_by(firm) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))
## # A tibble: 193 x 2
##    firm                     count
##    <chr>                    <int>
##  1 Intel Capital               34
##  2 NEA                         30
##  3 Andreessen Horowitz         22
##  4 General Catalyst            22
##  5 Insight Venture Partners    21
##  6 Norwest                     21
##  7 Revolution                  21
##  8 Accel                       17
##  9 Bessemer                    17
## 10 TCV                         17
## # ... with 183 more rows
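
The group_by(), summarize(), arrange() sequence can also be written with dplyr's count(). A small sketch, using invented firm names:

```r
library(dplyr)

toy <- tibble(
  firm = c("Firm A", "Firm B", "Firm A", "Firm C", "Firm A", "Firm B")
)

# sort = TRUE orders by descending count; name sets the count column's name.
firm_sizes <- toy %>%
  count(firm, sort = TRUE, name = "count")
```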

I don’t know the venture capital world that well, but this seems like a plausible list. I am more suspicious of the small firms in the list.

x %>% 
  group_by(firm) %>% 
  summarize(count = n()) %>% 
  arrange(count) %>% 
  filter(count == 1) %>% 
  print(n = 50)
## # A tibble: 10 x 2
##    firm                        count
##    <chr>                       <int>
##  1 Baseline Ventures               1
##  2 Brooklyn Bridge Ventures        1
##  3 Cloud Apps Capital Partners     1
##  4 Harrison Metal                  1
##  5 Haystack                        1
##  6 K9 Ventures                     1
##  7 Precursor Ventures              1
##  8 TenOneTen                       1
##  9 Unusual Ventures                1
## 10 Upside Partnership              1

It is hardly surprising if I (and you?) have never heard of any of these firms. After all, they are tiny! But how can we be sure that there are not lots (scores?) of other small firms that Kerby did not know to include in his sample? That is not his fault, of course, but we always need to be aware of sampling issues.

That is enough for today. I will look more closely at this data later this week.
