In collaboration with Juthi Dewan, Sam Ding, and Vichy Meas, we designed this project for our Bayesian Statistics course taught by Dr. Alicia Johnson. We’d like to extend our thanks to Alicia for guiding us through Bayes and the capstone experience!
A reproducible version of this blog post with all code can be found here.
We were initially interested in characterizing New York City’s internal racial dynamics using demography, geographic mobility, community health, and economic outcomes. As this project developed, we found ourselves thinking about the relationships between transportation (in)access and housing inequity. In our project, there are two major sections: Subway Accessibility and Transportation and Structural Inequity. In Subway Accessibility we explore transportation deserts and what are the major determinants of Subway Inaccess in New York City using two Bayesian classification models. While in Transportation and Structural Inequity, we extend our discussion of transportation access to study its relationship to both rental prices and evictions using both simple and hierarchical Bayesian multivariate regression. In the extended document, we also fit non-hierarchical spatial models to control for the underlying spatial relationships between neighborhoods, but omit major discussion in this blog post as Bayesian spatial regression was beyond the scope of this course.
First, however, let’s do a data introduction:
# Load packages
library(tidyverse)
library(janitor)
library(here)
library(rstan)
library(rstanarm)
library(bayesrules)
library(tidyverse)
library(rstanarm)
library(broom.mixed)
library(tidybayes)
library(bayesplot)
library(bayesrules)
library(tidyverse)
library(rstanarm)
library(broom.mixed)
library(tidybayes)
library(forcats)
library(bayesplot)
library(sf)
library(nycgeo)
library(tidycensus)
library(extrafont)
library(extrafontdb)
# themes
theme_set(theme_minimal())
brown_green <- c("#E9DBC2","#7D9B8A","#4D6F5C","#D29B5B","#744410","#1C432D")
color_scheme_set(brown_green)
nyc_join <- merge(nta_sf,nta_acs_data)
nyc_join <- nyc_join %>%
st_transform(., crs=4269)
county_list <- nyc_join %>% pull(county_name) %>% unique()
census_api_key("0cc07f06386e317f312adef5e0892b0d002b7254")
census_data <- get_acs(state = "NY",
county = c(county_list),
geography = "tract",
variables = c(gini_inequality ="B19083_001"),
year = 2019,
output = "wide",
survey = "acs5",
geometry = TRUE) %>%
dplyr::select(-c(NAME, ends_with("M"))) %>%
rename_at(vars(ends_with("E")), .funs = list(~str_sub(., end = -2))) %>%
st_transform(., 4269) %>%
dplyr::select(-GEOID)
vari_names <- read_csv(here("clean_data", "nyc_names_vichy_complied2.csv"))
nyc_clean <- st_read(here("clean_data", "nyc_data_vichy_complied2.shp"))
colnames(nyc_clean) <- colnames(vari_names)
gini_neighborhood <- st_join(nyc_clean, census_data, left = TRUE) %>%
group_by(nta_id) %>%
summarize(gini_neighborhood=median(gini_inequality, na.rm=T)) %>%
as_tibble() %>%
dplyr::select(nta_id, gini_neighborhood)
nyc_clean <- nyc_clean %>%
as_tibble() %>%
left_join(., gini_neighborhood, by="nta_id")%>%
unique() %>%
st_as_sf()
nyc_compiled <- nyc_clean %>%
mutate(asian_perc = asian_count / total_pop) %>%
mutate(white_perc = white_count / total_pop) %>%
mutate(black_perc = black_count / total_pop) %>%
mutate(latinx_perc = latinx_count / total_pop) %>%
mutate(native_perc = native_count / total_pop) %>%
mutate(noncitizen_perc = noncitizen_count / total_pop) %>%
mutate(evictions_perc = eviction_count / total_pop) %>%
mutate(uninsured_perc = uninsured_count / total_pop) %>%
mutate(unemployment_perc = unemployment_count / total_pop) %>%
mutate(below_poverty_line_perc = below_poverty_line_count / total_pop) %>%
mutate(transportation_desert_3cat =
factor(transportation_desert_3cat, levels=c("Poor", "Typical", "Excellent"))) %>%
mutate(nta_type = as.factor(nta_type)) %>%
dplyr::select(-borough.y) %>%
dplyr::rename(borough = borough.x)
All data used in this project are from two major sources: the Tidycensus package and NYC Open Data.
Tidycensus is an R package interface, developed by Kyle Walker and Matt Herman, that enables easy access to the US Census Bureau’s data APIs and returns Tidyverse-ready data frames from various major US Census Bureau datasets. Our demographic and socioeconomic data are drawn from the 2019 American Community Survey results found in Tidycensus package. A summary of our ACS data variables is below:
borough:
NYC Boroughtotal_pop
: Total Population Count by Census Tractmean_income
: Mean Income by Census Tractbelow_poverty_line_count
: Number of People Living Below the 100% Poverty Line by Census Tractmean_rent
: Mean Rent by Censusunemployment_count
: Number of People on Unemployment by Census Tractlatinx_count
: Number of Latinx People by Census Tractwhite_count
: Number of White People by Census Tractblack_count
: Number of Black People by Census Tractnative_count
: Number of Native People by Census Tractasian_count
: Number of Asian People by Census Tractnaturalized_citizen_count
: Number of Naturalized Citizens by Census Tractnoncitizen_count
: Number of Foreign Born People by Census Tractuninsured_count
: Number of Uninsured Citizens of any Age by Census Tractgini_neighborhood
: Income inequality measured by the Gini Index per Census TractNote that these predictors are all measured at the census level. To aggregate these estimates at the neighborhood level, we performed two transformations:
Then, for count based demographic predictors, we divide by the total population in each neighborhood to define scaled demographic metrics. They are as follows:
asian_perc
: Percentage of Asian Peoplewhite_perc
: Percentage of White Peopleblack_perc
: Percentage of Black Peoplelatinx_perc
: Percentage of Latinx Peoplenative_perc
: Percentage of Native Peoplenoncitizen_perc
: Percentage of Foreign Born Peopleuninsured_perc
: Percentage of Uninsured Citizens of any Ageunemployment_perc
: Percentage of People on Unemploymentbelow_poverty_line_perc
: Percentage of people living below the 100% poverty line.We used NYC’s Open Data portal and the Baruch College GIS Lab to collect information on the remaining predictors. In particular, we used geotagged locations of Subway Stops, Bus Stops, Grocery Stores, Schools, and Evictions from the Departments of Transportation, Health, Education, and Housing, respectively, to calculate the following variables:
school_count
: Public school countseviction_count
: Eviction countsstore_count
: Grocery store and food vendor countssub_count
: Subway station countsbus_count
: Bus station countsperc_covered_by_transit
: Percent of Neighborhood Within Walking Distance (.5 miles) of Any Subway Stop.transportation_desert_3cat
: Subway Accessibility (Poor, Typical, Excellent)The process involved grouping geotagged locations by the defined neighborhood boundary regions in R’s SF package and ArcGIS. We will detail the process of identifying subway deserts in the “Subway Deserts” section.
We present a numeric summary on our data with 224 observations of 38 variables. However, note that we use percent equivalents for most demographic count variables.
#summary(modeling_data)
library(table1)
table_print <- table1(~
total_pop + mean_income + mean_rent +
gini_neighborhood +
below_poverty_line_perc +
eviction_count +
transportation_desert_3cat +
noncitizen_perc + white_perc + black_perc +
latinx_perc + asian_perc
| borough, data = nyc_compiled %>% as_tibble()) %>% as_tibble()
colnames(table_print) <- c("Variable", "Bronx (N=44)", "Brooklyn (N=64)","Manhattan (N=39)", "Queens (N=77)", "Overall (N=224)")
library(kableExtra)
table_print%>%
filter(Variable!="") %>% kable() %>% kable_styling() %>% scroll_box(height ="400px", width="100%")
Variable | Bronx (N=44) | Brooklyn (N=64) | Manhattan (N=39) | Queens (N=77) | Overall (N=224) |
---|---|---|---|---|---|
total_pop | |||||
Mean (SD) | 30100 (18500) | 37800 (24600) | 37900 (24300) | 25900 (20900) | 32300 (22700) |
Median [Min, Max] | 29800 [0, 69200] | 38300 [0, 97800] | 34700 [0, 95300] | 25000 [0, 87700] | 31300 [0, 97800] |
mean_income | |||||
Mean (SD) | 43400 (17600) | 69600 (28700) | 101000 (49900) | 74500 (14400) | 71700 (33800) |
Median [Min, Max] | 39200 [23100, 94200] | 61200 [27400, 148000] | 102000 [33300, 212000] | 72600 [37500, 104000] | 66700 [23100, 212000] |
Missing | 8 (17.8%) | 9 (14.1%) | 6 (15.0%) | 19 (24.7%) | 42 (18.6%) |
mean_rent | |||||
Mean (SD) | 1230 (194) | 1580 (465) | 1940 (674) | 1650 (198) | 1600 (464) |
Median [Min, Max] | 1250 [833, 1620] | 1450 [792, 3280] | 2000 [884, 3270] | 1630 [1140, 2250] | 1510 [792, 3280] |
Missing | 8 (17.8%) | 9 (14.1%) | 6 (15.0%) | 19 (24.7%) | 42 (18.6%) |
gini_neighborhood | |||||
Mean (SD) | 0.467 (0.0313) | 0.462 (0.0293) | 0.514 (0.0377) | 0.423 (0.0256) | 0.459 (0.0435) |
Median [Min, Max] | 0.469 [0.407, 0.519] | 0.467 [0.382, 0.515] | 0.523 [0.436, 0.582] | 0.421 [0.358, 0.484] | 0.457 [0.358, 0.582] |
below_poverty_line_perc | |||||
Mean (SD) | 0.251 (0.121) | 0.197 (0.143) | 0.170 (0.127) | 0.119 (0.0832) | 0.179 (0.128) |
Median [Min, Max] | 0.246 [0, 0.444] | 0.179 [0, 1.00] | 0.133 [0, 0.576] | 0.103 [0, 0.615] | 0.153 [0, 1.00] |
Missing | 3 (6.7%) | 6 (9.4%) | 3 (7.5%) | 15 (19.5%) | 27 (11.9%) |
eviction_count | |||||
Mean (SD) | 434 (337) | 245 (234) | 224 (230) | 126 (131) | 238 (254) |
Median [Min, Max] | 388 [1.00, 1130] | 163 [1.00, 829] | 155 [1.00, 1120] | 93.0 [1.00, 521] | 150 [1.00, 1130] |
transportation_desert_3cat | |||||
Poor | 7 (15.6%) | 9 (14.1%) | 2 (5.0%) | 44 (57.1%) | 62 (27.4%) |
Typical | 13 (28.9%) | 15 (23.4%) | 2 (5.0%) | 18 (23.4%) | 48 (21.2%) |
Excellent | 25 (55.6%) | 40 (62.5%) | 36 (90.0%) | 15 (19.5%) | 116 (51.3%) |
noncitizen_perc | |||||
Mean (SD) | 0.174 (0.0925) | 0.150 (0.132) | 0.141 (0.0505) | 0.174 (0.101) | 0.161 (0.103) |
Median [Min, Max] | 0.173 [0, 0.581] | 0.127 [0, 1.00] | 0.133 [0, 0.242] | 0.143 [0, 0.471] | 0.143 [0, 1.00] |
Missing | 3 (6.7%) | 6 (9.4%) | 3 (7.5%) | 15 (19.5%) | 27 (11.9%) |
white_perc | |||||
Mean (SD) | 0.115 (0.161) | 0.381 (0.277) | 0.446 (0.268) | 0.287 (0.247) | 0.308 (0.269) |
Median [Min, Max] | 0.0333 [0, 0.606] | 0.385 [0, 1.00] | 0.483 [0, 0.829] | 0.247 [0, 1.00] | 0.222 [0, 1.00] |
Missing | 3 (6.7%) | 6 (9.4%) | 3 (7.5%) | 15 (19.5%) | 27 (11.9%) |
black_perc | |||||
Mean (SD) | 0.297 (0.185) | 0.292 (0.308) | 0.146 (0.177) | 0.178 (0.273) | 0.230 (0.259) |
Median [Min, Max] | 0.271 [0.0631, 0.833] | 0.152 [0, 1.00] | 0.0612 [0.00873, 0.591] | 0.0406 [0, 0.899] | 0.121 [0, 1.00] |
Missing | 3 (6.7%) | 6 (9.4%) | 3 (7.5%) | 15 (19.5%) | 27 (11.9%) |
latinx_perc | |||||
Mean (SD) | 0.533 (0.196) | 0.190 (0.163) | 0.257 (0.213) | 0.255 (0.178) | 0.295 (0.223) |
Median [Min, Max] | 0.623 [0, 0.784] | 0.143 [0, 0.808] | 0.172 [0.0608, 0.724] | 0.217 [0, 0.856] | 0.204 [0, 0.856] |
Missing | 3 (6.7%) | 6 (9.4%) | 3 (7.5%) | 15 (19.5%) | 27 (11.9%) |
asian_perc | |||||
Mean (SD) | 0.0371 (0.0490) | 0.108 (0.121) | 0.124 (0.115) | 0.241 (0.190) | 0.137 (0.155) |
Median [Min, Max] | 0.0180 [0, 0.217] | 0.0733 [0, 0.472] | 0.113 [0, 0.661] | 0.205 [0, 0.757] | 0.0767 [0, 0.757] |
Missing | 3 (6.7%) | 6 (9.4%) | 3 (7.5%) | 15 (19.5%) | 27 (11.9%) |
We found that Manhattan had the largest population counts, highest mean rental prices, highest mean income, highest income inequality (e.g. gini value), most number of neighborhoods with excellent subway access, and the greatest proportion of white citizens.
Conversely, we observed that the Bronx has the lowest mean income and highest eviction counts, while having the highest proportions of people living below the poverty line, and the highest number of people with limited subway access. Importantly, the Bronx also had the highest densities of both Latinx and Black residents of any other borough in New York, meaning that the Black and Latinx residents in New York are experiencing the burden of New York’s structural inequities.
We will touch on the connections between demographics, transportation access, and housing more explicitly in the following sections.
New York City is the most populous city in the US with more than 8.8 million people. To support the daily commutes of its residents, NYC also built the New York City Subway, the oldest, longest, and currently busiest subway system in the US, averaging approximately 5.6 million daily rides on weekdays and a combined 5.7 million rides each weekend.
Compared to other US cities where automobiles are the most popular mode of transportation (ahem, Minneapolis), only 32% of NYC’s population chooses to commute by cars. NYC’s far-reaching transit system is then unique, given that more than 70% of the population commute by cars in other metropolitan areas.
Despite having the most extensive transit network in the entire US, NYC is still lacking in terms of transit accessibility for some neighborhoods. The general consensus in academia is that residents who walk more than 0.5 miles to get to reliable transit are considered lacking transportation access, or residing in a transportation desert. For our research, we adopted this concept to study these gaps in transportation access. Specifically, we attempt to identify and study “Subway Deserts”.
Extending the USDA’s definition of a food desert, we define subway deserts as the percentage of a neighborhood— or any arbitrary geographic area— that is within walking distance of any subway stop. Citing the U.S. Federal Highway Administration, we defined walking distance as a 0.5 mile radius and computed these regions in ArcGIS. We chose subway stations because of the subway’s reliable frequency, high connectivity between boroughs, and high ridership per vehicle. Our argument against including the number of bus stops in our calculations of transportation access is that the quantity of bus stops does not accurately imply public transport accessibility due to the variability in bus efficiency, punctuality, and use. A major limitation of our work was the omission of Staten Island because it is not connected to any other borough by subway. Rather, Staten Island users typically drive or train into the city. Further, we felt that the inclusion of Staten Island would mischaracterize the relationship between lacking access and not needing access since Staten Island is an overwhelmingly white, wealthy, borough that has high levels of car ownership.
We first geocoded subway stop locations in NYC from the NYC Department of Transportation. Then, using ArcGIS we created a 0.5-mile-radius buffer for each station and calculated what percent of each neighborhood was covered by a buffer region. We display an example below.
In the graph, buffer zones are in light pink with overlapping boundaries dissolved between stations, while the dark pink dots indicate the exact geographic locations of the stations. Each neighborhood, then, had a percentage score that defined it’s subway accessibility score.
Upon observation, we categorized the areas served by the subway network into four ordinal categories: Poor, Typical, and Excellent. These categories are defined at 0-25%, 25-75%, and 75-100% of area covered by transit, respectively. We defined these cutoffs using the distribution of subway coverage percentages and our own judgment on what constitutes a desert. As such, these cutoffs are specific to New York City and may not be perfectly reproducible. The following plot details the spatial locations of these transportation categories.
nyc_compiled %>%
ggplot() +
geom_sf(aes(fill = transportation_desert_3cat), color = "#8f98aa") +
scale_fill_manual(values=c("#996A37", "#EBDDC7", "#426350"),
guide = guide_legend(title = "Accessibility:"), na.value="#D6D6D6") +
theme_minimal() +
ggtitle("Subway Accessibility by Neighborhood")+
theme_minimal() +
ggtitle("Accesibility \nby Neighborhood")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 28, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_legend(title = "Acessibility", title.position="top", title.hjust = 0.5))
In the following two subsections, we describe how we understood and classified transportation deserts using two models.
Naive Bayes Model is one of the most popular models for classifying a response variable with 2 or more categories.
We implemented a Naive Bayes classifier on subway access because it is both computationally efficient and applicable to Bayesian classification settings where outcomes may have 2+ categories. Specifically, we fit transportation access by taking mean income, percentage below the poverty line and the number of grocery stores. Because we are predicting 3 levels of transportation access, we initially fit this model using the e1071
package to classify subway transit level.
The goal of using Naive Bayes model is to see how this algorithm make prediction about the data. The result from Naive Bayes model can be used to compare with the Ordinal Model, which we will explore in the next section,in showing how various algorithm differ fundamentally and consequentially from one another.
library(e1071)
nyc_naive <- read_csv(here("clean_data", "nyc_names_vichy_complied2.csv"))%>%
mutate(transportation_desert_3cat = factor(transportation_desert_3cat, levels=c("Poor", "Typical", "Excellent"))) %>%
mutate(transportation_desert_3num = factor(as.numeric(transportation_desert_3cat))) %>%
mutate(asian_perc = (asian_count / total_pop) * 100) %>%
mutate(white_perc = (white_count / total_pop)* 100) %>%
mutate(black_perc = (black_count / total_pop)* 100) %>%
mutate(latinx_perc = (latinx_count / total_pop)* 100) %>%
mutate(native_perc = (native_count / total_pop)* 100) %>%
mutate(below_poverty_perc = (below_poverty_line_count / total_pop) * 100) %>%
mutate(noncitizen_perc = (noncitizen_count / total_pop)* 100) %>%
mutate(evictions_perc = (eviction_count / total_pop)* 100) %>%
mutate(uninsured_perc = (uninsured_count / total_pop)* 100) %>%
mutate(unemployment_perc = (unemployment_count / total_pop)* 100) %>%
filter(nta_type == 0) %>%
mutate(nta_type = as.factor(nta_type)) %>%
dplyr::select(-borough.y) %>%
dplyr::rename(borough = borough.x)
set.seed(454)
naive_model <- naiveBayes(transportation_desert_3cat ~
mean_income + borough +
below_poverty_perc +
store_count,
data = nyc_naive)
# saveRDS(nyc_naive, 'shiny_app_naive_model.rds')
naive2_prediction <- naive_classification_summary_cv(naive_model,
nyc_naive,
y="transportation_desert_3cat", k=10)$cv
naive2_prediction %>%
kable(align = "c", caption = "Naive Model - Summary") %>%
kable_styling()
transportation_desert_3cat | Poor | Typical | Excellent |
---|---|---|---|
Poor | 84.21% (32) | 5.26% (2) | 10.53% (4) |
Typical | 28.95% (11) | 21.05% (8) | 50.00% (19) |
Excellent | 7.55% (8) | 15.09% (16) | 77.36% (82) |
Under 10-fold cross validation, our Naive Bayes model had an overall cross-validated accuracy of 51.11%. However, our predictions were most accurate when predicting Poor transportation access (78.12%) and Excellent transportation access (72.62%). The following plot describes the cross-validated accuracy breakdown by each observed transportation access category.
prediction <- as.data.frame(naive2_prediction) %>%
pivot_longer(
cols= Poor:Excellent,
names_to = 'Predictions',
values_to = 'Probability'
) %>%
mutate(Probability = as.numeric(str_extract(Probability,'\\d+\\.\\d+'))/100) %>%
mutate(transportation_desert_3cat = factor(transportation_desert_3cat, levels=c("Poor", "Typical", "Excellent"))) %>%
mutate(Predictions = factor(Predictions, levels=c("Poor", "Typical", "Excellent")))
prediction %>%
ggplot(aes(x=transportation_desert_3cat, y=Probability, fill=Predictions)) +
geom_bar(position="fill", stat="identity") +
scale_y_continuous(labels = seq(0, 100, by = 25)) +
labs(title="Subway Accessibility Predictions by Observed Category", y="Proportion", x="")+
theme(panel.grid.major.x = element_line("transparent"),
# axis.text.y.left = element_blank(),
axis.text.x.bottom = element_text(size = 12, face = "bold"),
plot.title = element_text(size = 20, hjust=.5, face = "bold", family="DIN Condensed"),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 20, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 15, face="bold", family="DIN Condensed"))+
scale_fill_manual(values=c("#996A37", "#EBDDC7", "#426350"),
guide = guide_legend(title = "Accessibility:"), na.value="#D6D6D6")
From the plot, it is clear that our naive Bayes model is sufficient when predicting the extrema of subway (in)access given the overwhelming proportion of true-poor and true-excellent classifications. However, it remains imperfect when considering the inaccuracy for both the limited and satisfactory transportation categories, our data’s distributions, and its interpretability.
Importantly, naive Bayes assumes that all quantitative predictors are normally distributed within each Y category and further it assumes (i.e. that predictors are independent within each Y category). Below we verify whether this is an appropriate assumption for our data.
plot_distribution <- function(data, category, title){
return(
data %>%
ggplot(aes(x={{category}}, color= transportation_desert_3cat, fill=transportation_desert_3cat))+
geom_density(alpha=.3) +
labs(y= "Density Distribution",
title= title) +
facet_wrap(~transportation_desert_3cat)+
theme(legend.position = "none",
plot.title = element_text(family="DIN Condensed", size = 25,hjust=.5, face = "bold"),
strip.background.x = element_rect(fill="Black", color="Black"),
axis.title.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.x = element_blank(),
strip.text.x = element_text(family="DIN Condensed", size = 20, color = "White", face = "bold")) +
scale_color_manual(values=c("#996A37", "#EBDDC7", "#426350")) +
scale_fill_manual(values=c("#996A37","#E9DBC2","#395645"))
)
}
library(egg)
ggarrange(
plot_distribution(nyc_naive, mean_income, 'Mean Income'),
plot_distribution(nyc_naive, store_count, 'Store Count'),
plot_distribution(nyc_naive, below_poverty_perc, 'Percent Below Poverty Line'),
ncol=1, nrow=3
)
Naive Bayes’s assumption of normality within categories does not hold, unfortunately. Meaning that although this is a “good” classifier, it is not an appropriate one. Further, naive Bayes is a black box classifier meaning that although our classifications might be accurate, we will never understand where these predictions come from or how subway access is related to these predictors. Our next section details anothe Bayesian classification model that can serve as an alternative to Naive Bayes.
An ordinal regression model, or ordered logistic regression model, predicts the outcome of an ordinal variable, which is a categorical variable whose classes exist on an arbitrary scale where only the relative ordering between different values is significant. In this case, our subway desert category is an ordinal variable with categories ranging from the least covered to the most covered (e.g. $[1,2,3]$
) by NYC’s subway stops. Once again, we defined these categories by splitting the percentage covered using 2 cut-points, $\zeta_1$
and $\zeta_2$
, to create 3 ordered categories— Poor, Typical, and Satisfactory. Our $\zeta$
s are listed below.
\[ \begin{align*} \zeta_1 = 0.25 \\ \zeta_2 = 0.75 \\ \end{align*} \]
Next, we introduce a latent variable \(y^*\), which would be a linear combination of the \(k\) predictor variables. Here, we selected mean neighborhood income ($X_1$
), percentage below the poverty line in a neighborhood ($X_2$
), the number of grocery stores and food vendors in a neighborhood ($X_3$
), and 3 dummy variables for the borough of our neighborhood ($X_4, X_5, X_6$
) as our predictors.
Then we predict the transportation desert status $Y$
for the $i$
th neighborhood with the following model:
\[ \begin{align*} Y_i| \zeta_1, \zeta_2,\zeta_3,\beta_1,\dots,\beta_6 &= \begin{cases} 1 & y^{*} < \zeta_1 \\ 2 & \zeta_1 \leq y^{*} < \zeta_2 \\ 3 & \zeta_2 \leq y^{*} \\ \end{cases} \\ \\ y_i^{*} & = \sum_{k=1}^6 \beta_kX_i \\ \end{align*} \\ \]
\(\text{Where}\)
\[\begin{align*} \zeta_1 = 0.25 \\ \zeta_2 = 0.75 \\ \end{align*}\]
\[
\beta_k \sim N(m_{k}, s_{k}^{2}) \\
\] It is important to note that there is no intercept in \(y_i^*\). This is an omission by the construction of stan_polr
’s model and the multi-class outcomes for this ordinal model.
Since we do not have prior information about the relationship between observed transportation access and these specific priors, we will be using the default prior in stan_polr
. Specifically, we establish a prior for the location of the proportion of the variance in outcome that is attributable to the predictors. This proportion is more commonly as the \(R^2\) metric used in frequentist linear regressions and could be located at any value between (0,1). Since we are using a uniform prior on the location of $R^2), we say the ratio of our \(R^2\)’s proportion is around 0.5 or that a perfect explanation of variance and non-explanation of variance in our outcome categories is equally plausible. For a slightly more technical overview, visit the \(R^2\) section in rstanarm
’s documentation here.
It is also worth noting that we could specify uniform Dirichlet count priors (e.g. \(\text{Dirichlet}(1,1,1)\) on the categoies of transportation access in stan_polr
.
One technical limitationn of our ordinal model is attributable to the paucity on cross-validated error metrics for the Bayesian ordinal regression model in stan_polr
. To test the model’s fitness on new data, we used a manual train-test split. Our training data would include 70% of the original observations, while our test data would include the remaining 30%.
library(rsample)
set.seed(454)
data_split <- initial_split(nyc_naive, prop = .7)
data_train <- training(data_split)
data_test <- testing(data_split)
ordinal_model <- stan_polr(transportation_desert_3num ~ mean_income +
below_poverty_perc + store_count + borough,
data =data_train, prior_counts = dirichlet(1),
prior=R2(0.5),
chains = 4, iter = 1000*2, seed = 84735, refresh = 0)
# saveRDS(ordinal_model, 'shiny_app_ordinal_model.rds')
tidy(ordinal_model, effects = "fixed", conf.int = TRUE, conf.level = 0.8) %>%
mutate(term = case_when(
term == "1|2" ~ "Poor | Typical",
term == "2|3" ~ "Typical | Excellent",
TRUE ~ term
))%>%
kable(align = "c", caption = "Ordinal Model - Summary") %>%
kable_styling()
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
mean_income | 0.0000193 | 0.0000138 | 0.0000024 | 0.0000366 |
below_poverty_perc | 0.0962982 | 0.0425849 | 0.0436633 | 0.1533395 |
store_count | 0.0220945 | 0.0066117 | 0.0140408 | 0.0313861 |
boroughBrooklyn | 0.1218249 | 0.6175306 | -0.6857892 | 0.8890578 |
boroughManhattan | 2.6072263 | 1.1478840 | 1.2758714 | 4.2042272 |
boroughQueens | -0.4073554 | 0.6087717 | -1.2454476 | 0.3704208 |
Poor | Typical | 2.7512963 | 1.7302249 | 0.5351128 | 5.0207404 |
Typical | Excellent | 4.3697139 | 1.8114385 | 2.0579639 | 6.7107747 |
Here is an interpretation of the coefficients in the tidy table:
mean_income
: When controlling all other factors, for each additional dollar a given neighborhood’s mean income has, the model estimates that the latent variable \(y^*\) increases by 0.0000197. However, there is an 80% chance that the increase in the latent variable \(y^*\) may be any value between (0.0000010, 0.0000383), indicating that there is almost certainly a positive relationship between mean income and \(y^*\), but its magnitude may vary.
below_poverty_perc
: When controlling all other factors, for each additional % people below poverty a given neighborhood has, the model estimates that the latent variable \(y^*\) increases by 0.0977415. However, there is an 80% chance that the increase in the latent variable \(y^*\) may be any value between (0.0422522, 0.1576466), indicating that there is almost certainly a positive relationship between below poverty percentage and \(y^*\), but its magnitude may vary.
store_count
: When controlling all other factors, for each additional store a given neighborhood has, the model estimates that the latent variable \(y^*\) increases by 0.0224638. However, there is an 80% chance that the increase in the latent variable \(y^*\) may be any value between (0.0138214, 0.0316698), indicating that there is almost certainly a positive relationship between number of stores and \(y^*\), but its magnitude may vary.
boroughBrooklyn
: When controlling all other factors, if a neighborhood is in Brooklyn, the model estimates that the latent variable \(y^*\) increases by 0.1084268 compared to that neighborhood in Bronx. However, there is an 80% chance that the increase in the latent variable \(y^*\) may be any value between (-0.6968746, 0.8669477), indicating that we are not certain with the relationship between between a neighborhood being in Brooklyn and \(y^*\).
boroughManhattan
: When controlling all other factors, if a neighborhood is in Manhattan, the model estimates that the latent variable \(y^*\) increases by 2.5965602 compared to that neighborhood in Bronx. However, there is an 80% chance that the increase in the latent variable \(y^*\) may be any value between (1.2840073, 4.3013850), indicating that we are not certain with the relationship between between a neighborhood being in Manhattan and \(y^*\), but its magnitude may vary.
boroughQueens
: When controlling all other factors, if a neighborhood is in Queens, the model estimates that the latent variable \(y^*\) increases by -0.4148264 compared to that neighborhood in Bronx. However, there is an 80% chance that the increase in the latent variable \(y^*\) may be any value between (-1.1315662, 0.3349786), indicating that we are not certain with the relationship between between a neighborhood being in Queens and \(y^*\).
Then, extending a function written by Connie Zhang into a tidy format, we compute the accuracy of the ordinal model when predicting transportation desert categories below.
tidy_ordinal_accuracy <- function(testing_data, predictions){
# write mode function since it does not exist in R
mode <- function(x) {
t <- table(x)
names(t)[which.max(t)]
}
#write mode function since it does not exist in R
mydata <- testing_data %>%
dplyr::select(`transportation_desert_3num`) %>%
rownames_to_column("observation")
# extract predictions
# transpose so each row is a case from the training data,
# then compute rowwise modes of classifications to find the most common prediction
raw_table <- predictions%>%
t() %>%
as_tibble()%>%
mutate_if(is.character, as.numeric) %>%
rownames_to_column("observation") %>%
rowwise(id="observation") %>%
mutate_if(is.numeric, as.factor) %>%
summarize(modal_prediction = mode(c_across(where(is.factor)))) %>%
right_join(mydata, ., by = "observation")
# compute transportation category accuracies
group_specific <- raw_table %>%
mutate(Accurate = ifelse(modal_prediction == transportation_desert_3num, 1,0))%>%
group_by(transportation_desert_3num) %>%
summarise(Accuracy = mean(Accurate)) %>%
mutate(transportation_desert_3num = case_when(
transportation_desert_3num == 1 ~ "Poor",
transportation_desert_3num == 2 ~ "Typical",
TRUE ~ "Excellent"))
# compute overall accuracy
overall <- raw_table %>%
mutate(Accurate = ifelse(modal_prediction ==transportation_desert_3num, 1,0))%>%
summarise(Accuracy = mean(Accurate)) %>%
mutate(transportation_desert_3num = "Overall", .before=1)
# bind into table
rbind(group_specific, overall) %>%
dplyr::rename("Access Level" = transportation_desert_3num) %>%
kable(align = "c", caption = "Non-Citizen Model - Summary") %>%
kable_styling()
}
set.seed(86437)
my_prediction2 <- posterior_predict(
ordinal_model,
newdata = data_test)
tidy_ordinal_accuracy(data_test, my_prediction2)
Access Level | Accuracy |
---|---|
Poor | 0.8571429 |
Typical | 0.0909091 |
Excellent | 0.9000000 |
Overall | 0.7272727 |
It seems that our model is incredibly accurate when classifying neighborhoods as having excellent and poor subway access (90%; 71.4%), while almost uniformly incorrect when classifying a neighborhood as having typical subway access (9.09%). Our model is more likely to categorize a neighborhood with typical subway access as having excellent or poor access. Note that in both the ordinal and the naive model, we could not accurately predict this “Typical” category.
In addition to accuracy within each group, we also consider the compositions of the actual vs. predictions. Below we visualize the same error metrics for the ordinal model.
mode <- function(x) {
t <- table(x)
names(t)[which.max(t)]
}
prediction_long <- my_prediction2 %>%
t() %>%
as_tibble() %>%
mutate_if(is.character, as.numeric) %>%
rownames_to_column("observation") %>%
rowwise(id="observation") %>%
mutate_if(is.numeric, as.factor) %>%
summarize(modal_prediction = mode(c_across(where(is.factor)))) %>%
ungroup() %>%
rename(predicted_desert = modal_prediction)%>%
mutate(predicted_desert = case_when(
predicted_desert==1 ~ "Poor",
predicted_desert ==2 ~ "Typical",
TRUE ~ "Excellent")) %>%
mutate(predicted_desert = factor(predicted_desert, levels=c("Poor", "Typical", "Excellent")))
data_test %>%
dplyr::select(nta_id, borough, transportation_desert_3cat) %>%
rownames_to_column("observation") %>%
left_join(., prediction_long, by="observation") %>%
ggplot(aes(x=transportation_desert_3cat, fill=predicted_desert)) +
geom_bar(position="fill")+
scale_y_continuous(labels = seq(0, 100, by = 25)) +
labs(title="Mean Accessibility Predictions by Observed Category", y="Proportion", x="")+
theme(panel.grid.major.x = element_line("transparent"),
# axis.text.y.left = element_blank(),
axis.text.x.bottom = element_text(size = 12, face = "bold"),
plot.title = element_text(size = 20, hjust=.5, face = "bold", family="DIN Condensed"),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 20, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 15, face="bold", family="DIN Condensed"))+
scale_fill_manual(values=c("#996A37", "#EBDDC7", "#426350"),
guide = guide_legend(title = "Accessibility:"), na.value="#D6D6D6")
Here is a graph of detailed breakdown of proportions within each observed category. Once again, observe that for the “Typical” category our predictions are always incorrect. Likely, this must be a byproduct of our interval criteria for each respective level of transportation inaccessibility— that is, these categories may not be sufficiently different to find differences.
We investigated this issue with the density plot and the empirical CDF of the interval cutoffs.
nyc_naive %>%
ggplot(aes(x=perc_covered_by_transit, fill=transportation_desert_3cat)) +
geom_histogram(bins = 15) +
labs(title="Subway Accesibilty Percent Distribution", y="Count",
x="% of Neighborhood with Walking Distance of a Subway Stop")+
geom_vline(xintercept = 25) +
geom_vline(xintercept = 75) +
theme(panel.grid.major.x = element_line("transparent"),
# axis.text.y.left = element_blank(),
axis.text.x.bottom = element_text(size = 12, face = "bold"),
plot.title = element_text(size = 20, hjust=.5, face = "bold", family="DIN Condensed"),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 20, face="bold", family="DIN Condensed"),
legend.position = "top" ,
legend.text = element_text(size = 15, face="bold", family="DIN Condensed"))+
scale_fill_manual(values=c("#996A37", "#EBDDC7", "#426350"),
guide = guide_legend(title = ""), na.value="#D6D6D6")
We found that the original data demonstrates a bimodal distribution with observations heavily concentrated in 0-25% and 75-100%. This is partially due to the over-saturated coverage in Manhattan and the lack of subway access in the suburban neighborhoods of Queens.
In the next section, we aimed to see how transportation inaccessibility affected real-world urban housing issues.
Transportation access is a pervasive structural issue. However, previous research on transportation access has demonstrated that many of these gaps in access also deepen the chasms of racial and class-based inequity. In this analysis, it seemed that low-income and racialized— typically Black and Latinx— communities were most likely to confront transportation inequities. Additionally, we have observed that predominantly Black and Latinx neighborhoods in the Bronx faced some of the highest rates of eviction, likely indicating that these inequities may overlap.
This next section aims to connect transportation access to housing-inequities that we know also have racial and class dimensions. In particular, we wanted to assess transportation access’s relationship with immigrant community size, rental prices, and eviction counts by neighborhood.
desert_map <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot() +
geom_sf(aes(fill = transportation_desert_3cat), color = "#8f98aa") +
scale_fill_manual(values=c("#996A37","#E9DBC2","#395645"), na.value="#D6D6D6") +
theme_minimal() +
ggtitle("Subway Accessibility \nby Neighborhood")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 27, face = "bold", hjust=.5),
legend.key.width = unit(.45, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 12, face="bold", family="DIN Condensed"))+
guides(fill = guide_legend(title = "Acessibility", title.position="top", title.hjust = 0.5))
rent_map <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot() +
geom_sf(aes(fill = mean_rent), color = "#8f98aa") +
scale_fill_gradient(low = "#FCF5EE", high = "#30969C") +
theme_minimal() +
ggtitle("Mean Rent \nby Neighborhood")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 27, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Dollars", title.position="top", title.hjust = 0.5))
evict <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot() +
geom_sf(aes(fill = eviction_count), color = "#8f98aa")+
scale_fill_gradient(low = "#FCF5EE", high = "#C47572") +
theme_minimal() +
ggtitle("Eviction Counts \nby Neighborhood")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 27, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Number of Evictions", title.position="top", title.hjust = 0.5))
noncit <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot() +
geom_sf(aes(fill = noncitizen_perc), color = "#8f98aa") +
scale_fill_gradient(low = "#FCF5EE", high = "#5E76AD") +
theme_minimal() +
ggtitle("Immigrant Density \nby Neighborhood")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 27, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Percent Non-Citizen", title.position="top", title.hjust = 0.5))
ggarrange(desert_map, rent_map,evict, noncit,
ncol=2, nrow=2, padding = unit(3,
"line"))
Transportation access is typically worst in areas with the highest densities of non-citizen residents, while it is the best in neighborhoods with the highest mean rental prices. Further, observe that eviction counts are highly concentrated in north and south NYC, where some of these neighborhoods have mixed-accessibility to transit.
Unfortunately, rent, transportation access, and eviction counts may also be associated with respective density of nonwhite communities.
white <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot() +
geom_sf(aes(fill = white_perc), color = "#8f98aa") +
scale_fill_gradientn(colors = c("#FCF5EE","#919BB6", "#5A6687", "#3D465C")) +
theme_minimal() +
ggtitle("White Population \nby Neighborhood")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 27, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Percent White", title.position="top", title.hjust = 0.5))
black <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot() +
geom_sf(aes(fill = black_perc), color = "#8f98aa") +
scale_fill_gradientn(colors = c("#FCF5EE","#F8ABA6", "#F58581", "#DB3F37")) +
theme_minimal() +
ggtitle("Black Population \nby Neighborhood")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 27, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Percent Black", title.position="top", title.hjust = 0.5))
asian <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot() +
geom_sf(aes(fill = asian_perc), color = "#8f98aa") +
scale_fill_gradientn(colors = c("#FCF5EE","#C1C5EC", "#A1A8E2", "#4451C5"),
guide = guide_colourbar(title = "Percent Asian")) +
theme_minimal() +
ggtitle("Asian Population \nby Neighborhood")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 27, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Percent Asian", title.position="top", title.hjust = 0.5))
latinx <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot() +
geom_sf(aes(fill = latinx_perc), color = "#8f98aa")+
scale_fill_gradientn(colors = c("#FCF5EE","#F4E2B8", "#E9C572", "#E77E2F")) +
theme_minimal() +
ggtitle("Latinx Population \nby Neighborhood")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 27, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Percent Latinx", title.position="top", title.hjust = 0.5))
ggarrange(desert_map, white, black, latinx, asian,
ncol=3, nrow=2, padding = unit(3,
"line"))
From the plots, neighborhoods with the highest densities of Black and Asian community members also have the poorest scores of subway access, while the converse holds true for neighborhoods with the highest proportion of White residents.
Further, observe that eviction counts are most common in neighborhoods with the highest densities of Black and Latinx community members. The below faceted visualizations detail the specific relationships between the proportion of Black, Latinx, Asian, and White residents in a neighborhood with eviction counts.
black_eviction <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot(aes(x=black_perc,
color=transportation_desert_3cat,
y=eviction_count)) +
geom_point(size=2)+
geom_smooth(aes(group=transportation_desert_3cat),span = 1, size=3, se=F)+
labs(title="Eviction Counts by Black Density \nper Transportation Category", y="", x="Black (%)")+
theme_linedraw()+
theme(legend.position = "none",
plot.title = element_text(family="DIN Condensed", size = 30 ,hjust=.5, face = "bold"),
strip.background.x = element_rect(fill="Black", color="Black"),
strip.text.x = element_text(family="DIN Condensed", size = 20, color = "White", face = "bold")) +
scale_color_manual(values=c("#996A37","#E9DBC2","#395645"))+
facet_wrap(.~transportation_desert_3cat, scales = "free", ncol=2, nrow=2)
asian_eviction <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot(aes(x=asian_perc,
color=transportation_desert_3cat,
y=eviction_count)) +
geom_point(size=2)+
geom_smooth(aes(group=transportation_desert_3cat),span = 1, size=3, se=F)+
labs(title="Eviction Counts by Asian Density \nper Transportation Category", y="", x="Asian (%)")+
theme_linedraw()+
theme(legend.position = "none",
plot.title = element_text(family="DIN Condensed", size = 30 ,hjust=.5, face = "bold"),
strip.background.x = element_rect(fill="Black", color="Black"),
strip.text.x = element_text(family="DIN Condensed", size = 20, color = "White", face = "bold")) +
scale_color_manual(values=c("#996A37","#E9DBC2","#395645"))+
facet_wrap(.~transportation_desert_3cat, scales = "free", ncol=2, nrow=2)
latinx_eviction <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot(aes(x=latinx_perc,
color=transportation_desert_3cat,
y=eviction_count)) +
geom_point(size=2)+
geom_smooth(aes(group=transportation_desert_3cat),span = 1, size=3, se=F)+
labs(title="Eviction Counts by Latinx Density \nper Transportation Category", y="", x="Latinx (%)")+
theme_linedraw()+
theme(legend.position = "none",
plot.title = element_text(family="DIN Condensed", size = 30 ,hjust=.5, face = "bold"),
strip.background.x = element_rect(fill="Black", color="Black"),
strip.text.x = element_text(family="DIN Condensed", size = 20, color = "White", face = "bold")) +
scale_color_manual(values=c("#996A37","#E9DBC2","#395645"))+
facet_wrap(.~transportation_desert_3cat, scales = "free", ncol=2, nrow=2)
white_eviction <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot(aes(x=white_perc,
color=transportation_desert_3cat,
y=eviction_count)) +
geom_point(size=2)+
geom_smooth(aes(group=transportation_desert_3cat),span = 1, size=3, se=F)+
labs(title="Eviction Counts by White Density \nper Transportation Category", y="", x="White (%)")+
theme_linedraw()+
theme(legend.position = "none",
plot.title = element_text(family="DIN Condensed", size = 30 ,hjust=.5, face = "bold"),
strip.background.x = element_rect(fill="Black", color="Black"),
strip.text.x = element_text(family="DIN Condensed", size = 20, color = "White", face = "bold")) +
scale_color_manual(values=c("#996A37","#E9DBC2","#395645"))+
facet_wrap(.~transportation_desert_3cat, scales = "free", ncol=2, nrow=2)
ggarrange(black_eviction, latinx_eviction, asian_eviction, white_eviction, ncol=2, nrow=2)
Neighborhood Black and Latinx residential proportions by seem to be associated with increased eviction counts. However, we must specify that Black resident proportions are uniformly associated with increases in eviction counts across transportation levels, while the relationship between eviction counts and Latinx resident proportion is not. In contrast, both White and Asian proportions are uniformly associated with decreases in eviction counts.
Lastly, let’s consider mean neighborhood rental prices. The following visualizations detail the relationships between the proportion of Black, Latinx, Asian, and White residents in a neighborhood with that neighborhood mean rental price.
black_rent <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot(aes(x=black_perc,
color=transportation_desert_3cat,
y=mean_rent)) +
geom_point(size=2)+
geom_smooth(aes(group=transportation_desert_3cat),span = 1, size=3, se=F)+
labs(title="Neighborhood Rental Price Averages \nby Black Density per Transportation Category", y="", x="Black (%)")+
theme_linedraw()+
theme(legend.position = "none",
plot.title = element_text(family="DIN Condensed", size = 30 ,hjust=.5, face = "bold"),
strip.background.x = element_rect(fill="Black", color="Black"),
strip.text.x = element_text(family="DIN Condensed", size = 20, color = "White", face = "bold")) +
scale_color_manual(values=c("#996A37","#E9DBC2","#395645"))+
facet_wrap(.~transportation_desert_3cat, scales = "free", ncol=2, nrow=2)
asian_rent <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot(aes(x=asian_perc,
color=transportation_desert_3cat,
y=mean_rent)) +
geom_point(size=2)+
geom_smooth(aes(group=transportation_desert_3cat),span = 1, size=3, se=F)+
labs(title="Neighborhood Rental Price Averages \nby Asian Density per Transportation Category", y="", x="Asian (%)")+
theme_linedraw()+
theme(legend.position = "none",
plot.title = element_text(family="DIN Condensed", size = 30 ,hjust=.5, face = "bold"),
strip.background.x = element_rect(fill="Black", color="Black"),
strip.text.x = element_text(family="DIN Condensed", size = 20, color = "White", face = "bold")) +
scale_color_manual(values=c("#996A37","#E9DBC2","#395645"))+
facet_wrap(.~transportation_desert_3cat, scales = "free", ncol=2, nrow=2)
latinx_rent <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot(aes(x=latinx_perc,
color=transportation_desert_3cat,
y=mean_rent)) +
geom_point(size=2)+
geom_smooth(aes(group=transportation_desert_3cat),span = 1, size=3, se=F)+
labs(title="Neighborhood Rental Price Averages \nby Latinx Density per Transportation Category", y="", x="Latinx (%)")+
theme_linedraw()+
theme(legend.position = "none",
plot.title = element_text(family="DIN Condensed", size = 30 ,hjust=.5, face = "bold"),
strip.background.x = element_rect(fill="Black", color="Black"),
strip.text.x = element_text(family="DIN Condensed", size = 20, color = "White", face = "bold")) +
scale_color_manual(values=c("#996A37","#E9DBC2","#395645"))+
facet_wrap(.~transportation_desert_3cat, scales = "free", ncol=2, nrow=2)
white_rent <- nyc_compiled %>%
filter(nta_type == 0) %>%
ggplot(aes(x=white_perc,
color=transportation_desert_3cat,
y=mean_rent)) +
geom_point(size=2)+
geom_smooth(aes(group=transportation_desert_3cat),span = 1, size=3, se=F)+
labs(title="Neighborhood Rental Price Averages \nby White Density per Transportation Category", y="", x="White (%)")+
theme_linedraw()+
theme(legend.position = "none",
plot.title = element_text(family="DIN Condensed", size = 30 ,hjust=.5, face = "bold"),
strip.background.x = element_rect(fill="Black", color="Black"),
strip.text.x = element_text(family="DIN Condensed", size = 20, color = "White", face = "bold")) +
scale_color_manual(values=c("#996A37","#E9DBC2","#395645"))+
facet_wrap(.~transportation_desert_3cat, scales = "free", ncol=2, nrow=2)
ggarrange(black_rent, latinx_rent, asian_rent, white_rent, ncol=2, nrow=2)
Increases in White-resident proportions were uniformly associated with increases in average neighborhood rental prices across all transportation access categories. The converse is true for the relationship between Black and Latinx neighborhood densities and rental prices. It then seems that despite living in cheaper neighborhoods, both NYC’s Black and Latinx communities are carrying the largest burden of eviction.
Our next section details the statistical models we used to better characterize the clear relationships between housing inequities, urban racism, and transportation access.
In order to understand the respective distributions of immigrant population size, evictions, and mean rental prices in NYC, we fit 3 non-hierarchical Bayesian models with each variable as an outcome. In the following section, we fit hierarchical regression models with similar likelihood and prior structure. However, in those models neighborhood values are grouped by borough.
We list our non-hierarchical models and their predictors below:
transportation_desert_3cat
, borough
, total_pop
, gini_neighborhood
, mean_income
, mean_rent
, unemployment_perc
, black_perc
, latinx_perc
, asian_perc
$x_1, x_2, ..., x_{14}$
transportation_desert_3cat
, borough
, gini_neighborhood
, mean_income
, black_perc
, latinx_perc
, asian_perc
, below_poverty_line_perc
,school_count
,store_count
$x_1, x_2, ..., x_{14}$
transportation_desert_3cat
, borough
, total_pop
, below_poverty_line_perc
, gini_neighborhood
, mean_income
, mean_rent
, black_perc
, latinx_perc
, asian_perc
$x_1, x_2, ..., x_{14}$
Because there are four levels of both transportation_desert_3cat
and borough
, stan_glm
defines $x_1, \dots, x_3$
and $x_5, \dots, x_7$
as dummy variables of each respective predictor’s categories, with one common reference category for our intercept.
Across all models we specified weakly-informative normal priors for the $\beta_{k}$
s associated with each predictor \(X_{k}\). However, there are differences in terms of model specifications that we outline below:
For 1 and 3, we used weakly informative normal priors on all predictors and the intercept, then allowed stan_glm
to estimate initial priors. This decision was ultimately due to our unfamiliarity with NYC’s history of evictions and non-citizen population and what their relationships to our predictors may be.
We fit parallels of 1 and 3 that both had a Poisson likelihood, as opposed to a Negative Binomial likelihood in previous iterations of this report. However, we observed an inconsistent spread of variance and increased 0 counts in both the eviction and immigrant count data. Because stan_glm
does not have a zero-inflated poisson distribution, we ultimately performed two Negative-Binomial regressions. Further discussions of our Poisson regressions have been removed from this project.
For 2, we also used weakly informative normal priors on all predictors. However, we specified a normal prior with $\mu = 600$
and $\sigma = 20$
for the scaled intercept of mean rental prices.
Our model specifications are detailed in the following subsections
modeling_data <- nyc_compiled %>%
mutate(black_perc = black_perc * 10) %>%
mutate(white_perc = white_perc * 10) %>%
mutate(latinx_perc = latinx_perc * 10) %>%
mutate(asian_perc = asian_perc * 10) %>%
mutate(native_perc = native_perc * 10) %>%
mutate(unemployment_perc = unemployment_perc * 5) %>%
mutate(uninsured_perc = uninsured_perc * 5) %>%
mutate(mean_income = mean_income / 100) %>%
mutate(mean_rent = mean_rent / 100) %>%
filter(nta_type==0) %>%
mutate(nta_type = as.factor(nta_type)) %>%
mutate(borough = as.factor(borough))
$$ \begin{split} \text{Non-Citizen Count}_{i} \mid \beta_{0c}, \beta_1, ..., \beta_k, r & \sim \text{NegBin}(\mu_i, r) \; \; \; \text{where } \log(\mu_i) = \beta_{0c} + \sum^{14}_{k=1} \beta_{k}X_{ik} \\ \beta_{0c} &\sim N(0,2.5^2)\\ r &\sim \text{Exp}(1)\\ \beta_{k} &\sim N(0,s_k^2) \end{split} $$
$\text{and where}$
$$ \begin{align} s_k \in \{ &5.5004, 7.9328, 4.9972, 5.4697, 6.443, 5.3347, .0001, \\ &56.9002, 0.0074, 0.5699, 39.9937, 0.993, 1.142, 1.6003 \} \end{align} $$
We fit our model below using stan_glm
.
noncit_model <- stan_glm(
noncitizen_count ~
transportation_desert_3cat +
borough + total_pop + gini_neighborhood +mean_income + mean_rent + unemployment_perc +
black_perc + latinx_perc + asian_perc,
data = modeling_data,
family = neg_binomial_2,
chains = 4, iter = 1000*2, seed = 84735, refresh = 0
)
Next, we assess its distributional fit to our data.
pp_check(noncit_model)+
xlab("Non-Citizen Count") +
labs(title = "Negative Binomial Model of \nImmigrant/Non-Citizen Count")+
theme(plot.title = element_text(family="DIN Condensed", face="bold", size=25, hjust=.5))
It seems that our simulations (light green) of non-citizen count distributions were largely consistent across our iterations. However, our simulations were strongly biased from the observed non-citizen counts where we would typically predict smaller non-citizen counts more frequently. Additionally, it seemed that variance between simulations increased as non-citizen counts increased. Acknowledging these trends, our negative-binomial model seems to be a pretty good distributional fit, but could certainly be improved upon!
$$ \begin{split} \text{Mean Rental Price}_{i} \mid \beta_{0c}, \beta_1, ..., \beta_k, r & \sim N(\mu_i, \sigma^2) \; \; \; \; \text{where } \mu_i = \beta_{0c} + \sum^{14}_{k=1} \beta_{k}X_{ik} \\ \beta_{0c} &\sim N(1600,20^2)\\ \sigma &\sim \text{Exp}(0.23)\\ \beta_{k} &\sim N(0,s_k^2) \end{split} $$
$\text{and where}$
$$ \begin{align} s_k \in \{ &24.1301 34.8009 21.9225 23.9953 28.2651 23.4030 249.6185 \\ & 0.0325 4.3562 5.0098 7.0204 110.1265 1.7160 0.2803 \} \end{align} $$
Once again note that we are specifying the parameters of our scaled intercept prior, $\beta_{0c}$
, so that the the typical mean neighborhood rental price $\sim $N(1600,20^2)$
, while our priors for the \(\beta_k\) are weakly-informed negative priors. We chose our prior intercept specifications of mean rental price ($\beta_{0c}$
) using Juthi’s experience renting in NYC and a group conversation about typical rental prices we would elect to pay in NYC, Los Angeles, and other major cities we have lived in or around. However, we decided to continue using weakly-informative normal priors for the predictors because we were unsure about their relationship— if any— to rental prices.
We fit our model below.
rent_model <- stan_glm(
mean_rent ~
transportation_desert_3cat + borough + gini_neighborhood +
mean_income + black_perc + latinx_perc + asian_perc+
below_poverty_line_perc + school_count + store_count,
data = modeling_data,
family = gaussian,
prior_intercept = normal(1600 , 20),
chains = 4, iter = 1000*2, seed = 84735, refresh = 0
)
Next, we assess its distributional fit to our data.
pp_check(rent_model)+
xlab("Mean Rental Price") +
labs(title = "Normal Model of \nMonthly Mean Rental Price")+
theme(plot.title = element_text(family="DIN Condensed", face="bold", size=25, hjust=.5))
Our simulations are relatively consistent. However, these simulated distributions are much more variable than the simulated distributions we saw in our non-citizen rent model. Importantly, it also seems our simulated normal posterior distributions had higher variance than what was observed and that is likely because the mean rental prices themselves are not perfectly normally distributed. Theoretically, we could change our likelihood model to adjust for the skew, but for now this is good!
$$ \begin{split} \text{Eviction Count}_{i} \mid \beta_{0c}, \beta_1, ..., \beta_k, r & \sim \text{NegBin}(\mu_i, r) \; \; \; \text{where } \log(\mu_i) = \beta_{0c} + \sum^{14}_{k=1} \beta_{k}X_{ik} \\ \beta_{0c} &\sim N(0,2.5^2)\\ r &\sim \text{Exp}(1)\\ \beta_{k} &\sim N(0,s_k^2) \end{split} $$
$\text{and where}$
$$ \begin{align} s_k \in \{ &5.5004, 7.9328, 4.9972, 5.4697, 6.443, 5.3347, .0001, \\ &25.1032, 56.9002, 0.0074, 0.5699, 0.993, 1.142, 1.6003 \} \end{align} $$
We fit our model below.
eviction_model <- stan_glm(
eviction_count ~
transportation_desert_3cat + borough + total_pop +
below_poverty_line_perc + gini_neighborhood + mean_income + mean_rent+
black_perc + latinx_perc + asian_perc,
data = modeling_data,
family = neg_binomial_2,
chains = 4, iter = 1000*2, seed = 84735, refresh = 0
)
Next, we assess its distributional fit to our data.
pp_check(eviction_model)+
xlab("Eviction Count") +
xlim(0,2000)+
labs(title = "Negative Binomial Model of \nEviction Count")+
theme(plot.title = element_text(family="DIN Condensed", face="bold", size=25, hjust=.5))
Our simulations of eviction count distributions are dramatically consistent across the iterations. Further, our simulations were relatively consistent with the observed eviction counts. So, our negative-binomial model seems to be a pretty good distributional fit, but could certainly be improved upon!
Given the potential differences in demographic characteristics, housing trends, and transportation by borough and the clear hierarchy from borough to neighborhood, we fit 3 hierarchical parallels of the non-hierarchical models above. Now, however, we let borough be a grouping variable in our data. Importantly, our priors for the \(\beta_k\)s have stayed the same.
We list our hierarchical models and their predictors below:
transportation_desert_3cat
, total_pop
, gini_neighborhood
, mean_income
, mean_rent
, unemployment_perc
, black_perc
, latinx_perc
, asian_perc
borough
transportation_desert_3cat
, gini_neighborhood
, mean_income
, black_perc
, latinx_perc
, asian_perc
, bus_count
,school_count
,store_count
, noncitizen_perc
borough
transportation_desert_3cat
, total_pop
, gini_neighborhood
, mean_income
, mean_rent
, black_perc
, latinx_perc
, asian_perc
, bus_count
,store_count
,unemployment_perc
, uninsured_perc
borough
Again for 4 and 6, we used weakly informative priors and allowed stan_glm
to estimate initial priors. While for 5, we specified the same prior intercept as a normal distribution with mean 1600 and standard deviation at 20.
$$ \begin{split} \text{Non-Citizen Count}_{ij} \mid \beta_{0}, \beta_1, ..., \beta_k, r & \sim \text{NegBin}(\mu_{ij}, r) \; \; \; \text{where } \log(\mu_{ij}) = \beta_{0j} + \sum^{11}_{k=1} \beta_{k}X_{ijk} \\ \beta_{0j}\mid \beta_{0}, \sigma_0 & \stackrel{ind}{\sim} N(\beta_0, \sigma_0^2)\\ \beta_{0c} &\sim N(0, 2.5^2) \\ r &\sim \text{Exp}(1)\\ \sigma_0 &\sim \text{Exp}(1)\\ \beta_{k} &\sim N(0,s_k^2) \end{split} $$
$\text{and where}$
$$ \begin{align} s_k \in \{ &5.5004, 7.9328, 4.9972, .0004, 56.9002,\\ & 0.0074, 0.5699, 39.9937, 0.993, 1.142, 1.6003\} \end{align} $$
We fit our model using stan_glmer
below.
hi_noncit_model <- stan_glmer(
noncitizen_count ~
transportation_desert_3cat +
total_pop + gini_neighborhood +
mean_income + mean_rent +
unemployment_perc +
black_perc + latinx_perc + asian_perc + (1|borough),
data = modeling_data,
family = neg_binomial_2,
chains = 4, iter = 1000*2, seed = 84735, refresh = 0
)
Next, we check it’s distributional fit to our data.
pp_check(hi_noncit_model)+
xlab("Non-Citizen Count") +
labs(title = "Hierarchical Negative Binomial Model of \nImmigrant/Non-Citizen Count")+
theme(plot.title = element_text(family="DIN Condensed", face="bold", size=25, hjust=.5))
It seems that our simulations of non-citizen count distributions in the hierarchical model were largely consistent across the iterations. Similar to the non-hierarchical model, our simulations are biased away from the observed non-citizen counts as our model predicts a higher number of non-citizens more frequently.
$$ \begin{split} \text{Mean Rental Price}_{ij} \mid \beta_{0j}, \beta_1, \dots, \beta_k,\sigma_y & \sim N(\mu_{ij}, \sigma_y^2) \; \; \; \text{where } \mu_{ij} = \beta_{0j} + \sum^{11}_{k=1} \beta_{k}X_{ijk} \\ \beta_{0j} & \stackrel{ind}{\sim} N(\beta_0, \sigma_0^2)\\ \beta_{0c} &\sim N(1600, 20^2) \\ \sigma_y &\sim \text{Exp}(1)\\ \sigma_0 &\sim \text{Exp}(1)\\ \beta_{k} &\sim N(0,s_k^2) \end{split} $$
$\text{and where}$
$$ \begin{align} s_k \in \{ &24.1301 34.8009 21.9225 249.6185 0.0325 4.3562 \\ & 5.0098 7.0204 110.1265 1.7160 0.2803 \} \end{align} $$
Once again, we chose our prior specifications of the mean rental price intercept $\beta_{0c}$
using Juthi’s experience renting in NYC and a group conversation about typical rental prices we would elect to pay in NYC, Los Angeles, and other major cities we have lived in or around.
We fit our model using stan_glmer
below.
hi_rent_model <- stan_glmer(
mean_rent ~
transportation_desert_3cat + gini_neighborhood +
mean_income + black_perc + latinx_perc + asian_perc+
below_poverty_line_perc + school_count + store_count + (1 | borough),
data = modeling_data,
family = gaussian,
prior_intercept = normal(1600, 20, autoscale = TRUE),
prior = normal(0, 2.5, autoscale = TRUE),
chains = 4, iter = 1000*2, seed = 84735, refresh = 0
)
Next, we check it’s distributional fit to our data.
pp_check(hi_rent_model)+
xlab("Mean Rental Price") +
labs(title = "Hierarchical Normal Model of \nMonthly Mean Rental Price")+
theme(plot.title = element_text(family="DIN Condensed", face="bold", size=25, hjust=.5))
Our simulations are relatively consistent. However, these simulated distributions are much more variable than the simulated distributions we saw in our non-citizen rent model. Importantly, it also seems our simulated normal posterior distributions had higher variance than what was observed and that is likely because the mean rental prices themselves are not perfectly normally distributed. Theoretically, we could change our likelihood model to adjust for the skew, but for now this is good!
$$ \begin{split} \text{Eviction Count}_{ij} \mid \beta_{0}, \beta_1, ..., \beta_k, r & \sim \text{NegBin}(\mu_{ij}, r) \; \; \; \text{where } \log(\mu_{ij}) = \beta_{0j} + \sum^{11}_{k=1} \beta_{k}X_{ijk} \\ \beta_{0j}\mid \beta_{0}, \sigma_0 & \stackrel{ind}{\sim} N(\beta_0, \sigma_0^2)\\ \beta_{0c} &\sim N(0, 2.5^2) \\ r &\sim \text{Exp}(1)\\ \sigma_0 &\sim \text{Exp}(1)\\ \beta_{k} &\sim N(0,s_k^2) \end{split} $$
$\text{and where}$
$$ \begin{align} s_k \in \{ &5.5004, 7.9328, 4.9972, 0.0001, 25.1032, 56.9002, \\ & 0.0074, 0.5699, 0.9930, 1.1420, 1.6003 \} \end{align} $$
We fit our model using stan_glmer
below.
hi_eviction_model <- stan_glmer(
eviction_count ~
transportation_desert_3cat +
total_pop + below_poverty_line_perc+
gini_neighborhood + mean_income + mean_rent +
black_perc + latinx_perc + asian_perc + (1|borough),
data = modeling_data,
family = neg_binomial_2,
chains = 4, iter = 1000*2, seed = 84735, refresh = 0
)
Next, we check it’s distributional fit to our data.
pp_check(eviction_model)+
xlab("Eviction Count") +
xlim(0,2000)+
labs(title = "Negative Binomial Model of \nEviction Count")+
theme(plot.title = element_text(family="DIN Condensed", face="bold", size=25, hjust=.5))
Our simulations of eviction count distributions are dramatically consistent across the iterations. Further, our simulations were relatively consistent with the observed eviction counts. Our negative-binomial model seems to be a pretty good distributional fit, but could certainly be improved upon!
In the following sections, we go through each particular model’s outcome and what they tell us about the relationships between transportation and housing.
table1 <- prediction_summary(model=noncit_model, data=modeling_data) %>%
mutate(Model = "Non-Hierarchical")
table2 <- prediction_summary(model=hi_noncit_model, data=modeling_data) %>%
mutate(Model = "Hierarchical")
rbind(table1, table2) %>%
dplyr::mutate(model = Model, .before=1) %>%
dplyr::select(-Model)%>% kable() %>% kable_styling()
model | mae | mae_scaled | within_50 | within_95 |
---|---|---|---|---|
Non-Hierarchical | 1100.467 | 0.5437981 | 0.5659341 | 0.967033 |
Hierarchical | 1106.751 | 0.5584631 | 0.5604396 | 0.967033 |
Using in-sample scaled MAE, it’s evident that the differences between our two negative binomial models of non-citizen counts were minute In other settings, we would typically recommend to use cross-validated error metrics to truly compare these models. However, because we are comparing a hierarchical model to a non-hierarchical model, cross-validation works slightly differently. In the former case, cross-validation via the prediction_summary_cv
function in bayesrules
makes data permutations to represent a “new borough”, rather than a set of arbitrary models. To circumvent these differences, we use the expected-log predictive density (ELPD) of each respective model to compare the performance of each model. The details of this approach can be found here.
mod1_elpd <- loo(noncit_model)
hi_mod1_elpd <- loo(hi_noncit_model)
# Compare the ELPD for our 2 models
loo_compare(mod1_elpd, hi_mod1_elpd) %>%
as.data.frame() %>%
rownames_to_column("model") %>%
kable(caption = "Non-Citizen Count EPLD Metrics", align="c") %>%
kable_styling()
model | elpd_diff | se_diff | elpd_loo | se_elpd_loo | p_loo | se_p_loo | looic | se_looic |
---|---|---|---|---|---|---|---|---|
hi_noncit_model | 0.0000000 | 0.0000000 | -1621.890 | 11.50595 | 12.62095 | 2.089444 | 3243.780 | 23.01189 |
noncit_model | -0.9374189 | 0.6633795 | -1622.827 | 11.57690 | 13.78129 | 2.164817 | 3245.655 | 23.15380 |
From the above ELPD rankings, it seems that the non-hierarchical model of non-citizen counts performed better than its hierarchical parallel. Explicitly, there was a decrease of .536 standard deviations when comparing the non-hiearchical to its hiearchical parallel.
modeling_data %>%
mutate(residuals = noncit_model$residuals) %>%
mutate(hi_residuals = hi_noncit_model$residuals) %>%
ggplot() +
geom_hline(yintercept=0)+
geom_point(aes(x=noncitizen_count, y=residuals), color="Black", alpha=.5)+
geom_smooth(aes(x=noncitizen_count, y=residuals), color="Black", alpha=.5, se=F, size=2)+
geom_point(aes(x=noncitizen_count, y=hi_residuals) ,color="#5E76AD", alpha=.5)+
geom_smooth(aes(x=noncitizen_count, y=hi_residuals), color="#5E76AD", alpha=.5, se=F, size=2)+
xlab("Non-Citizen Count") +
ylab("Residual Error")+
theme_linedraw()+
theme(plot.title = element_text(family="DIN Condensed", size = 30 ,hjust=.5, face = "bold"),
strip.background.x = element_rect(fill="Black", color="Black"),
strip.text.x = element_text(family="DIN Condensed", size = 20, color = "White", face = "bold"),
axis.title.y = element_text(size = 20, face="bold", family="DIN Condensed"),
axis.title.x = element_text(size = 20, face="bold", family="DIN Condensed")) +
scale_color_manual(values=c("#996A37","#E9DBC2","#395645"))
Our residuals don’t look perfect, given the slight heteroskedasticity as noncitizen count increased, but it seems that the distributions were almost perfectly comparable between the two models!
noncit_resid <- modeling_data %>%
mutate(resid = noncit_model$residuals) %>%
ggplot() +
geom_sf(aes(fill = resid), color = "#8f98aa")+
scale_fill_gradient(low = "#FCF5EE", high = "#5E76AD") +
theme_minimal() +
ggtitle("Non-Hierarchical Model:\nNon-Citizen Count Residuals ")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Residual Error", title.position="top", title.hjust = 0.5))
hi_noncit_resid <- modeling_data %>%
mutate(resid = hi_noncit_model$residuals) %>%
ggplot() +
geom_sf(aes(fill = resid), color = "#8f98aa")+
scale_fill_gradient(low = "#FCF5EE", high = "#5E76AD") +
theme_minimal() +
ggtitle("Hierarchical Model:\nNon-Citizen Count Residuals ")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Residual Error", title.position="top", title.hjust = 0.5))
ggarrange(noncit_resid, hi_noncit_resid, ncol=2, nrow=1)
The residuals on our hierarchical model are not perfectly randomly distributed across all neighborhoods in NYC, but the scale of these residuals is much smaller than the non-hierarchical model. One could also make the case that there are slight over predictions (darker blue) of non-citizen counts in the Brooklyn (bottom left) for the non-hierarchical model. However, these differences are mostly negligible.
waic_tab1 <- waic(noncit_model)$estimates%>%
as.data.frame() %>%
rownames_to_column() %>%
as_tibble()%>%
filter(rowname=="waic") %>%
dplyr::select(-c(rowname)) %>%
mutate(Model = "Non-Hierarchical", .before=1) %>%
dplyr::rename(WAIC = Estimate)
waic_tab2 <-waic(hi_noncit_model)$estimates%>%
as.data.frame() %>%
rownames_to_column() %>%
as_tibble()%>%
filter(rowname=="waic") %>%
dplyr::select(-c(rowname)) %>%
mutate(Model = "Hierarchical", .before=1) %>%
dplyr::rename(WAIC = Estimate)
rbind(waic_tab1, waic_tab2) %>%
kable() %>% kable_styling()
Model | WAIC | SE |
---|---|---|
Non-Hierarchical | 3245.342 | 23.11924 |
Hierarchical | 3243.470 | 22.93626 |
Our last error metric of non-citizen counts is the Watanabe–Akaike information criterion (WAIC). When comparing two models with the WAIC, the better predicting model would be the model with the smaller WAIC value. Here, it seems that non-hierarchical model of non-citizen counts has the lowest WAIC, but the difference is largely minimal. Altogether, we choose the hierarchical model of non-citizen counts given its lower ELPD, WAIC, and spatial residual patterning.
table1 <- prediction_summary(model=rent_model, data=modeling_data) %>%
mutate(Model = "Non-Hierarchical")
table2 <- prediction_summary(model=hi_rent_model, data=modeling_data) %>%
mutate(Model = "Hierarchical")
rbind(table1, table2) %>%
dplyr::mutate(model = Model, .before=1) %>%
dplyr::select(-Model)%>% kable() %>% kable_styling()
model | mae | mae_scaled | within_50 | within_95 |
---|---|---|---|---|
Non-Hierarchical | 0.8205323 | 0.5218767 | 0.6208791 | 0.9450549 |
Hierarchical | 0.7812764 | 0.4993776 | 0.5989011 | 0.9450549 |
Once again when using in-sample scaled MAE, the differences between our two models of mean rental prices were largely minimal. Our predictions of rental price in both models were about 80 dollars away from their observed rental prices. However, the scaled mean absolute error metrics in the hierarchical model indicate that across simulations the observed value is about 0.3 standard deviations away from their the medians of the posterior predictive distributions of rental prices. This is a nontrivial improvement from the .5 standard deviation in the case of the non-hierarchical distribution.
mod1_elpd <- loo(rent_model)
hi_mod1_elpd <- loo(hi_rent_model)
# Compare the ELPD for our 2 models
loo_compare(mod1_elpd, hi_mod1_elpd) %>%
as.data.frame() %>%
rownames_to_column("model") %>%
kable(caption = "Mean Rental Price EPLD Metrics", align="c") %>%
kable_styling()
model | elpd_diff | se_diff | elpd_loo | se_elpd_loo | p_loo | se_p_loo | looic | se_looic |
---|---|---|---|---|---|---|---|---|
hi_rent_model | 0.00000 | 0.0000000 | -342.1906 | 15.14736 | 16.02401 | 3.650284 | 684.3811 | 30.29472 |
rent_model | -1.95974 | 0.6898585 | -344.1503 | 15.07427 | 17.46025 | 3.773468 | 688.3006 | 30.14854 |
From the above ELPD rankings, it seems that the hierarchical model of mean neighborhood rental prices performed better than its non-hierarchical parallel.
modeling_data %>%
mutate(residuals = rent_model$residuals) %>%
mutate(hi_residuals = hi_rent_model$residuals) %>%
ggplot() +
geom_hline(yintercept=0)+
geom_point(aes(x=mean_rent, y=residuals), color="Black", alpha=.5)+
geom_smooth(aes(x=mean_rent, y=residuals), color="Black", alpha=.5, se=F, size=2)+
geom_point(aes(x=mean_rent, y=hi_residuals) ,color="#30969C", alpha=.5)+
geom_smooth(aes(x=mean_rent, y=hi_residuals), color="#30969C", alpha=.5, se=F, size=2)+
xlab("Mean Rental Price") +
ylab("Residual Error")+
theme_linedraw()+
theme(plot.title = element_text(family="DIN Condensed", size = 30 ,hjust=.5, face = "bold"),
strip.background.x = element_rect(fill="Black", color="Black"),
strip.text.x = element_text(family="DIN Condensed", size = 20, color = "White", face = "bold"),
axis.title.y = element_text(size = 20, face="bold", family="DIN Condensed"),
axis.title.x = element_text(size = 20, face="bold", family="DIN Condensed")) +
scale_color_manual(values=c("#996A37","#E9DBC2","#395645"))
Our residuals look pretty good, but there is some increasing trend that we are underpredicting the rental prices as they increase. However, it seems that the residual distributions are almost perfectly comparable between the two models!
rent_resid <- modeling_data %>%
mutate(resid = rent_model$residuals) %>%
ggplot() +
geom_sf(aes(fill = resid), color = "#8f98aa")+
scale_fill_gradient(low = "#FCF5EE", high = "#30969C") +
theme_minimal() +
ggtitle("Non-Hierarchical Model:\nMean Rental Price Residuals ")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Residual Error", title.position="top", title.hjust = 0.5))
hi_rent_resid <- modeling_data %>%
mutate(resid = hi_rent_model$residuals) %>%
ggplot() +
geom_sf(aes(fill = resid), color = "#8f98aa")+
scale_fill_gradient(low = "#FCF5EE", high = "#30969C") +
theme_minimal() +
ggtitle("Hierarchical Model:\nMean Rental Price Residuals ")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Residual Error", title.position="top", title.hjust = 0.5))
ggarrange(rent_resid, hi_rent_resid, ncol=2, nrow=1)
Regarding the non-hierarchical model, the distribution of our residuals indicate that our model’s mispredictions are randomly distributed. Further, our residuals in the non-hierarchical model are relatively small when compared to the hierarchical model. In the hierarchical model there is a slight indication of systematic mispredictions that appear distributed by borough. Crucially, it seems that we are systematically overpredicting rental prices in both Queens (bottom right) and Brooklyn (bottom left), but with greater overprediction occurring for Brooklyn.
waic_tab1 <- waic(rent_model)$estimates%>%
as.data.frame() %>%
rownames_to_column() %>%
as_tibble()%>%
filter(rowname=="waic") %>%
dplyr::select(-c(rowname)) %>%
mutate(Model = "Non-Hierarchical", .before=1) %>%
dplyr::rename(WAIC = Estimate)
waic_tab2 <-waic(hi_rent_model)$estimates%>%
as.data.frame() %>%
rownames_to_column() %>%
as_tibble()%>%
filter(rowname=="waic") %>%
dplyr::select(-c(rowname)) %>%
mutate(Model = "Hierarchical", .before=1) %>%
dplyr::rename(WAIC = Estimate)
rbind(waic_tab1, waic_tab2) %>%
kable() %>% kable_styling()
Model | WAIC | SE |
---|---|---|
Non-Hierarchical | 688.1365 | 30.23193 |
Hierarchical | 684.2327 | 30.35261 |
It seems that hierarchical model of rental prices has the lowest WAIC, but the difference is largely minimal. The hierarchical model’s residual patterning, ELPD metric, in-sample residual errors, and WAIC all indicate that the hierarchical version may be more appropriate.
table1 <- prediction_summary(model=eviction_model, data=modeling_data) %>%
mutate(Model = "Non-Hierarchical")
table2 <- prediction_summary(model=eviction_model, data=modeling_data) %>%
mutate(Model = "Hierarchical")
rbind(table1, table2) %>%
dplyr::mutate(model = Model, .before=1) %>%
dplyr::select(-Model)%>% kable() %>% kable_styling()
model | mae | mae_scaled | within_50 | within_95 |
---|---|---|---|---|
Non-Hierarchical | 43.07388 | 0.5273335 | 0.6043956 | 0.9615385 |
Hierarchical | 41.74725 | 0.5230725 | 0.5989011 | 0.9615385 |
Once again when using in-sample scaled MAE, the differences between our two negative-binomial models of evictions count were largely minimal. Our predictions of eviction counts in both models were about 40 cases away from their observed counts and with very similar scaled MAE metrics. Typically, we’d suggest the use of the non-hierarchical model given that the performance is so similar and the computational burden of a hierarchical model is no longer warranted.
mod1_elpd <- loo(eviction_model)
hi_mod1_elpd <- loo(hi_eviction_model)
# Compare the ELPD for our 2 models
loo_compare(mod1_elpd, hi_mod1_elpd) %>%
as.data.frame() %>%
rownames_to_column("model") %>%
kable(caption = "Eviction Count EPLD Metrics", align="c") %>%
kable_styling()
model | elpd_diff | se_diff | elpd_loo | se_elpd_loo | p_loo | se_p_loo | looic | se_looic |
---|---|---|---|---|---|---|---|---|
eviction_model | 0.0000000 | 0.0000000 | -1063.813 | 13.54569 | 15.78985 | 2.475974 | 2127.627 | 27.09138 |
hi_eviction_model | -0.1157322 | 0.4176623 | -1063.929 | 13.53727 | 15.54575 | 2.474341 | 2127.858 | 27.07454 |
From the above ELPD rankings, it also seems that the hierarchical model of eviction counts performed better than its non-hierarchical parallel. Next, we will compare how these 2 models’ residuals are spatially distributed.
modeling_data %>%
mutate(residuals = eviction_model$residuals) %>%
mutate(hi_residuals = hi_eviction_model$residuals) %>%
ggplot() +
geom_hline(yintercept=0)+
geom_point(aes(x=eviction_count, y=residuals), color="Black", alpha=.5)+
geom_smooth(aes(x=eviction_count, y=residuals), color="Black", alpha=.5, se=F, size=2)+
geom_point(aes(x=eviction_count, y=hi_residuals) ,color="#C47572", alpha=.5)+
geom_smooth(aes(x=eviction_count, y=hi_residuals), color="#C47572", alpha=.5, se=F, size=2)+
xlab("Eviction Count") +
ylab("Residual Error")+
theme_linedraw()+
theme(plot.title = element_text(family="DIN Condensed", size = 30 ,hjust=.5, face = "bold"),
strip.background.x = element_rect(fill="Black", color="Black"),
strip.text.x = element_text(family="DIN Condensed", size = 20, color = "White", face = "bold"),
axis.title.y = element_text(size = 20, face="bold", family="DIN Condensed"),
axis.title.x = element_text(size = 20, face="bold", family="DIN Condensed")) +
scale_color_manual(values=c("#996A37","#E9DBC2","#395645"))
Our residuals look fantastic! There is a slight heteroskedastic trend in our predictions, but for the most part it seems our residuals on eviction count predictions for both models are normally distributed around 0. Once again, it seems that the residual distributions are largely comparable, however in our hierarchical model (red), we are tending to underpredict eviction counts slightly more in the extreme cases. This is largely expected given that the hierarchical model will naturally shrink predictions of extreme cases.
evict_resid <- modeling_data %>%
mutate(resid = eviction_model$residuals) %>%
ggplot() +
geom_sf(aes(fill = resid), color = "#8f98aa")+
scale_fill_gradient(low = "#FCF5EE", high = "#C47572") +
theme_minimal() +
ggtitle("Non-Hierarchical Model:\nEviction Count Residuals ")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Residual Error", title.position="top", title.hjust = 0.5))
hi_evict_resid <- modeling_data %>%
mutate(resid = hi_eviction_model$residuals) %>%
ggplot() +
geom_sf(aes(fill = resid), color = "#8f98aa")+
scale_fill_gradient(low = "#FCF5EE", high = "#C47572",
guide = guide_legend(title = "Residual")) +
theme_minimal() +
ggtitle("Hierarchical Model:\nEviction Count Residuals ")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold", hjust=.5),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 25, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 20, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Residual Error", title.position="top", title.hjust = 0.5))
ggarrange(evict_resid, hi_evict_resid, ncol=2, nrow=1)
Our residuals are randomly distributed across all neighborhoods indicating that our model’s bias is consistent across observations— this is really good! One could make the case that there are slight under predictions (darker red) of eviction counts in the Bronx (top) for both models. However, these patterns are relatively minute.
waic_tab1 <- waic(eviction_model)$estimates%>%
as.data.frame() %>%
rownames_to_column() %>%
as_tibble()%>%
filter(rowname=="waic") %>%
dplyr::select(-c(rowname)) %>%
mutate(Model = "Non-Hierarchical", .before=1) %>%
dplyr::rename(WAIC = Estimate)
waic_tab2 <-waic(hi_eviction_model)$estimates%>%
as.data.frame() %>%
rownames_to_column() %>%
as_tibble()%>%
filter(rowname=="waic") %>%
dplyr::select(-c(rowname)) %>%
mutate(Model = "Hierarchical", .before=1) %>%
dplyr::rename(WAIC = Estimate)
rbind(waic_tab1, waic_tab2) %>%
kable() %>% kable_styling()
Model | WAIC | SE |
---|---|---|
Non-Hierarchical | 2127.202 | 27.03891 |
Hierarchical | 2127.321 | 26.98266 |
It seems that hierarchical model of rental prices has the lowest WAIC, but only by .03 units. But given the hierarchical model’s residual patterning, ELPD metric, in-sample residual patterns, and WAIC we select the hierarchical model of eviction counts.
Our previous section consistently demonstrated that our hierarchical models performed better than our non-hierarchical models, with respect to absolute error metrics, residual distributions, expected-log predictive densities, and WAICs. In this section, we detail our findings using the hierarchical models.
tidy(hi_noncit_model, effects = "fixed", conf.int = TRUE, conf.level = 0.8)%>%
mutate(estimate= ifelse(term == "(Intercept)", exp(estimate), (exp(estimate)-1)*100),
conf.low= ifelse(term == "(Intercept)", exp(conf.low), (exp(conf.low)-1)*100),
conf.high = ifelse(term == "(Intercept)", exp(conf.high), (exp(conf.high)-1)*100))%>%
filter(conf.low > 0 & conf.high > 0 | conf.low < 0 & conf.high < 0) %>%
kable(align = "c", caption = "Hierarchical Non-Citizen Model - Summary") %>%
kable_styling()
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | 691.4630941 | 0.4739567 | 382.8732077 | 1295.1156715 |
transportation_desert_3catTypical | 27.4691415 | 0.0853898 | 13.9113180 | 42.5308249 |
transportation_desert_3catExcellent | 30.7028001 | 0.0901200 | 16.3100200 | 46.3674633 |
total_pop | 0.0025604 | 0.0000016 | 0.0023538 | 0.0027631 |
mean_income | -0.0780131 | 0.0002353 | -0.1089523 | -0.0481492 |
mean_rent | 6.2914876 | 0.0176260 | 3.9675710 | 8.7733996 |
black_perc | 4.8255601 | 0.0156236 | 2.8201967 | 6.9399553 |
latinx_perc | 12.3782756 | 0.0205425 | 9.5372092 | 15.3433431 |
asian_perc | 18.7476508 | 0.0240002 | 15.1645494 | 22.3457018 |
Limited Subway Access: When controlling for all other predictors, a neighborhood with limited transit access is expected to have approximately 21.82% more non-citizens than a neighborhood with poor transit access. There’s an 80% probability that this increase could lie anywhere between (7.48%, 38.27%) non citizen residents, indicating that neighborhoods with limited transit access almost certainly have more non citizen residents than neighborhoods with poor access.
Satisfactory Subway Access: When controlling for all other predictors, a neighborhood with satisfactory transit access is expected to have approximately 30.22% more non-citizens than a neighborhood with poor transit access. There’s an 80% probability that this increase could lie anywhere between (13.99%, 49.09%) non citizen residents, indicating that neighborhoods with satisfactory transit access almost certainly have more non citizen residents than neighborhoods with poor access.
Excellent Subway Access: When controlling for all other predictors, a neighborhood with excellent transit access is expected to have approximately 29.84% more non-citizens than a neighborhood with poor transit access. There’s an 80% probability that this increase could lie anywhere between (13.34%, 48.51%) non citizen residents, indicating that neighborhoods with excellent transit access almost certainly have more non citizen residents than neighborhoods with poor access.
Brooklyn: When controlling for all other predictors, a neighborhood in Brooklyn is expected to have approximately 16.61% more non-citizens than a neighborhood in the Bronx. There’s an 80% probability that this increase could lie anywhere between (1.68%, 34.07) non citizen residents, indicating that neighborhoods in Brooklyn almost certainly have more non citizen residents than the Bronx, but the magnitude of this increase may vary.
Manhattan: When controlling for all other predictors, a neighborhood in Manhattan is expected to have approximately 20.58% more non-citizens than in the Bronx. There’s an 80% probability that this increase could lie anywhere between (2.268, 42.85) non citizen residents, indicating that neighborhoods in Manhattan almost certainly have more non citizen residents than the Bronx.
Mean Income: When controlling for all other predictors, a 100 dollar increase in mean neighborhood income is associated with approximately a 0.0793% decrease in non citizen count. However, there is a 80% chance that the decrease in non citizen count may be any value between (0.1102, 0.0476), indicating that there is almost certainly a negative relationship between mean income and non citizen count, but its magnitude may vary.
Mean Rent: When controlling for all other predictors, a 100 dollar increase in mean neighborhood rent is associated with approximately a 6.29% increase in non citizen counts. However, there is an 80% chance that the increase in non citizen count may be any value between (3.96%, 8.68%), indicating that there is almost certainly a positive relationship between mean rent and non citizen count, but its magnitude may vary.
Black Percentage: When controlling for all other predictors, a 10% increase in the Black population in a neighborhood is associated with approximately a 5.20% increase in non citizen counts. However, there is an 80% chance that the increase in non citizen count may be any value between (2.99%, 7.45%), indicating that there is almost certainly a positive relationship between Black resident percentage and non citizen count, but its magnitude may vary.
Latinx Percentage: When controlling for all other predictors, a 10% increase in the Latinx population in a neighborhood is associated with approximately a 13.46% increase in non citizen count. However, there is an 80% chance that the increase in non citizen count may be any value between (10.08%, 17.07%), indicating that there is almost certainly a positive relationship between Latinx resident percentage and non citizen count, but its magnitude may vary.
Asian Percentage: When controlling for all other predictors, a 10% increase in the Asian population in a neighborhood is associated with approximately a 19.63% increase in non citizen count. However, there is an 80% chance that the increase in non citizen count may be any value between (15.75%, 23.69%), indicating that there is almost certainly a positive relationship between Asian resident percentage and non citizen count, but its magnitude may vary.
tidy(hi_rent_model, effects = "fixed", conf.int = TRUE, conf.level = 0.8)%>%
filter(conf.low > 0 & conf.high > 0 | conf.low < 0 & conf.high < 0) %>%
kable(align = "c", caption = "Hierarchical Rent Model - Summary") %>%
kable_styling()
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | 7.9347285 | 1.9346936 | 5.3634139 | 10.4383862 |
transportation_desert_3catExcellent | 0.7241333 | 0.4048545 | 0.1802043 | 1.2573967 |
mean_income | 0.0122648 | 0.0007671 | 0.0112630 | 0.0132159 |
black_perc | -0.0944055 | 0.0712818 | -0.1839394 | -0.0061517 |
asian_perc | 0.2479826 | 0.1057468 | 0.1134148 | 0.3827827 |
school_count | -0.0344252 | 0.0263589 | -0.0677776 | -0.0006767 |
For rent model:
Mean Income: When controlling for all other predictors, a 100 dollar increase in mean neighborhood income is associated with approximately a $.012 increase in average rental prices. However, there is a 80% chance that the increase associated with rental price increases may be any value between (1.126, 1.326) dollars, indicating that there is almost certainly a positive relationship between mean income and mean rental prices, but its magnitude may vary slightly.
Asian Percentage: When controlling for all other predictors, a 10% increase in the Asian population in a neighborhood is associated with approximately a $0.241 increase in average rental prices. However, there is an 80% chance that the increase associated with rent may be any value between (0.0976, 0.385), indicating that there is almost certainly a positive relationship between Asian resident percentage and rent, but its magnitude may vary.
School Count: When controlling for all other predictors, for every school in a neighborhood there is an associated $0.0398 decrease in average rental prices. However, there is an 80% chance that the increase associated with rent may be any value between (-0.00564, -0.0739), indicating that there is almost certainly a positive relationship between Asian resident percentage and rent, but its magnitude may vary slightly.
tidy(hi_eviction_model, effects = "fixed", conf.int = TRUE, conf.level = 0.8)%>%
mutate(estimate= ifelse(term == "(Intercept)", exp(estimate), (exp(estimate)-1)*100),
conf.low= ifelse(term == "(Intercept)", exp(conf.low), (exp(conf.low)-1)*100),
conf.high = ifelse(term == "(Intercept)", exp(conf.high), (exp(conf.high)-1)*100))%>%
filter(conf.low > 0 & conf.high > 0 | conf.low < 0 & conf.high < 0) %>%
kable(align = "c", caption = "Eviction Model - Summary") %>%
kable_styling()
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | 15.3289515 | 0.6613517 | 6.4566177 | 34.4416917 |
transportation_desert_3catTypical | 40.0869000 | 0.1050663 | 22.4822044 | 59.6660090 |
transportation_desert_3catExcellent | 37.0457658 | 0.1087596 | 19.5199573 | 57.5519356 |
total_pop | 0.0022105 | 0.0000018 | 0.0019730 | 0.0024513 |
gini_neighborhood | 2005.3098216 | 1.3241009 | 293.3969751 | 11274.2855625 |
mean_income | -0.1274083 | 0.0003238 | -0.1676593 | -0.0875388 |
mean_rent | 4.2729947 | 0.0197823 | 1.7104916 | 6.9679720 |
black_perc | 17.2498484 | 0.0186932 | 14.5688023 | 20.0616449 |
latinx_perc | 7.8825964 | 0.0298685 | 3.8192753 | 12.1054867 |
asian_perc | -5.6363304 | 0.0299594 | -9.2490291 | -2.0270201 |
After removing predictors whose 80% credible intervals included the possibility of non-effect when controlling for other covariates, we found that there were 10 remaining predictors of an arbitrary neighborhood’s eviction counts.
Next, we interpret each predictor:
We’ve confirmed that even after adjusting for income, rental prices, income inequality, and borough differences, neighborhoods in NYC with a substantial proportion of Black and Latinx residents are experiencing dramatic increased risks of eviction. Interestingly, when accounting for economic and demographic predictors, the relationships with transportation we expected to see were reversed. That is, increased transportation access was associated with more evictions. This could be related to the increased population sizes in areas with the best access to transportation (e.g. Manhattan) or it could be because the disparities that transportation inaccess reflects (e.g. racial and class-based tensions) have been accounted for.
Importantly, we found that there was statistically significant spatial clustering of both eviction counts and mean rental prices using Moran’s I from the spdep
package.
library(spdep)
modeling_data <- modeling_data %>%
st_as_sf()
col_sp <- as(modeling_data, "Spatial")
col_nb <- poly2nb(col_sp) # queen neighborhood
col_listw <- nb2listw(col_nb, style = "B") # listw version of the neighborhood
W <- nb2mat(col_nb, style = "B") # binary structure
evict_moran <- tidy(moran.mc(col_sp$eviction_count, listw = col_listw, nsim = 999, alternative = "greater")) %>%
mutate(Term = "Eviction", .before=1)
rent_moran <- tidy(moran.mc(col_sp$mean_rent, listw = col_listw, nsim = 999, alternative = "greater")) %>%
mutate(Term = "Mean Rent", .before=1)
rbind(rent_moran, evict_moran) %>%
kable() %>%
kable_styling()
Term | statistic | p.value | parameter | method | alternative |
---|---|---|---|---|---|
Mean Rent | 0.6509345 | 0.001 | 1000 | Monte-Carlo simulation of Moran I | greater |
Eviction | 0.5667908 | 0.001 | 1000 | Monte-Carlo simulation of Moran I | greater |
Our results are then likely biased by the spatial relationships between neighborhoods. As such, we could extend this work to the spatial domain using methods from the CARBayes
or INLA
packages. Following work by Katie Jolly and Raven McKnight, we have outlined a spatial workflow for both the eviction and rental price models using CARBayes
in the appendix. However, because spatial models were beyond the scope of this project and class, we’d like to emphasize that in a more detailed analysis the models we outline in the appendix would likely be adjusted in STAN
or CARBayes
to use different likelihoods, spatial effect priors (e.g. BYM, Intrinsic CAR, etc), and different priors for the predictors’ coefficients. Further, in a more detailed analysis, we would explicitly describe the mathematical construction of the models.
Although a lot of work went into designing models with causal blocking and confounding in mind, our models remain imperfect. Some major limitations involved the encoding of transportation deserts and the presence of unmeasured confounders.Because we may not have not accounted for all structural predictors in housing equity such as a neighborhood’s rent control policies nor have we adjusted for the reasons behind eviction, we cannot be entirely confident about our models’ conclusions.
Across our models, it becomes clear that many of the structural housing and demographic issues present in NYC need to be more rigorously addressed by both policy-makers and its citizens, regardless of these particular models’ performance. Health begins at home. And if NYC’s Black and Latinx residents are being crushed under the fist of inequity and consequently experiencing increased risks of eviction or tenuous rental prices, then it becomes a health imperative to critically and revolutionarily address NYC’s housing system.
Knowing that there is spatial clustering mean rental prices , we fit a normal model with weakly informative priors for the \(\beta_k\), our scaled prior intercept \(N(1600, 20^2)\), and Besag prior for a random spatial effect via the CARBayes
package’s S.CARleroux function. If there was a non-constant spatial effect on rental prices in a neighborhood, we would edit the rho
parameter below to approximate this new spatial effect prior called a Leroux.
library(CARBayes)
equation <- mean_rent ~ total_pop +
transportation_desert_3cat + borough + gini_neighborhood +
mean_income + black_perc + latinx_perc + asian_perc+
below_poverty_line_perc + school_count + store_count
carbayes_rent <- S.CARleroux(equation,
data = modeling_data,
W = W, family = "gaussian",
prior.mean.beta = c(1600, 0,0,0,0,0,0,0,0,0,0,0,0,0,0),
prior.var.delta = c(20, 0,0,0,0,0,0,0,0,0,0,0,0,0,0),
burnin = 20000, n.sample = 100000, rho=1,
thin = 10, verbose=F)
tidy(moran.mc(x = as.vector(carbayes_rent$residuals$response), listw = col_listw, nsim = 9999, alternative = "greater")) %>%
mutate(Term = "Residual", .before=1) %>%
kable(caption ="Residual Clustering - Rent") %>%
kable_styling()
Term | statistic | p.value | parameter | method | alternative |
---|---|---|---|---|---|
Residual | -0.0197601 | 0.6123 | 3877 | Monte-Carlo simulation of Moran I | greater |
Non-significant Moran I clustering of residuals indicates that our residuals of rental prices are uniformly distributed across neighborhoods. Therefore, this spatial model is sufficiently built to study these rental price trends.
t(carbayes_rent$modelfit)%>%
as_tibble() %>%
kable(align = "c", caption = "CARBayes Spatial Rent Model - DIC Metrics") %>%
kable_styling()
DIC | p.d | WAIC | p.w | LMPL | loglikelihood |
---|---|---|---|---|---|
658.2204 | 41.5531 | 670.6122 | 46.29972 | -339.568 | -287.5571 |
Our WAIC metrics for this spatial model is slightly better than the best of our non-spatial models (Difference = 9.4086). Once again, in a more detailed analysis we could edit our spatial effect prior.
round(carbayes_rent$summary.results[, 1:7], 4) %>%
as.data.frame() %>%
rownames_to_column() %>%
rename(term = rowname) %>%
as_tibble() %>%
mutate(`Median`= ifelse(term == "(Intercept)", exp(`Median`), (exp(`Median`)-1)*100),
`2.5%`= ifelse(term == "(Intercept)", exp(`2.5%`), (exp(`2.5%`)-1)*100),
`97.5%` = ifelse(term == "(Intercept)", exp(`97.5%`), (exp(`97.5%`)-1)*100))%>%
filter(`2.5%` > 0 & `97.5%` > 0 | `2.5%` < 0 & `97.5%` < 0) %>%
mutate_if(is.numeric, ~round(., 4)) %>%
kable(align = "c", caption = "CARBayes Spatial Rent Model - Summary") %>%
kable_styling()
term | Median | 2.5% | 97.5% | n.sample | % accept | n.effective | Geweke.diag |
---|---|---|---|---|---|---|---|
(Intercept) | 10954.4385 | 40.0248 | 3726574.4812 | 8000 | 100 | 1718.3 | -2.3 |
transportation_desert_3catExcellent | 155.9469 | 9.4065 | 508.4229 | 8000 | 100 | 3940.6 | -2.6 |
mean_income | 1.1668 | 0.9848 | 1.3389 | 8000 | 100 | 972.4 | 1.9 |
nu2 | 469.0511 | 188.9259 | 1148.8397 | 8000 | 100 | 127.8 | 1.5 |
tau2 | 130.1590 | 0.5314 | 2002.0539 | 8000 | 100 | 127.1 | -1.0 |
rho | 171.8282 | 171.8282 | 171.8282 | NA | NA | NA | NA |
Here, the only significant predictor of is mean income which is positively associated with increased rental prices.
modeling_data %>%
mutate(resid = carbayes_rent$residuals$response) %>%
ggplot() +
geom_sf(aes(fill = resid), color = "#8f98aa")+
scale_fill_gradient(low = "#FCF5EE", high = "#30969C",
guide = guide_legend(title = "Residuals")) +
theme_minimal() +
ggtitle("CARBayes Besag:\nMean Rental Price Residuals ")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 25, face = "bold"),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 20, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 15, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Residual Error", title.position="top", title.hjust = 0.5))
Our Besag model residuals look very evenly distributed!
We fit a Besag model using a Poisson likelihood to study eviction counts across New York.
library(CARBayes)
equation <- eviction_count ~ offset(log(total_pop)) +
transportation_desert_3cat + borough +
below_poverty_line_perc + gini_neighborhood + mean_income + mean_rent+
black_perc + latinx_perc + asian_perc
carbayes_evictions <- S.CARleroux(equation,
data = modeling_data,
W = W, family = "poisson",
burnin = 200000, n.sample = 1000000, rho=1,
thin = 10, verbose=F)
tidy(moran.mc(x = as.vector(carbayes_evictions$residuals$response), listw = col_listw, nsim = 9999, alternative = "greater")) %>%
mutate(Term = "Residual", .before=1) %>%
kable(caption ="Residual Clustering - Evictions") %>%
kable_styling()
Term | statistic | p.value | parameter | method | alternative |
---|---|---|---|---|---|
Residual | -0.2036027 | 0.9999 | 1 | Monte-Carlo simulation of Moran I | greater |
We found no residual clustering! The plot below shows them spatially.
modeling_data %>%
mutate(resid = carbayes_evictions$residuals$response) %>%
ggplot() +
geom_sf(aes(fill = resid), color = "#8f98aa")+
scale_fill_gradient(low = "#FCF5EE", high = "#C47572",
guide = guide_legend(title = "Number of Evictions")) +
theme_minimal() +
ggtitle("CARBayes Besag:\nEviction Count Residuals ")+
theme(axis.text = element_blank(),
panel.grid.major = element_line("transparent"),
plot.title = element_text(family="DIN Condensed", size = 2* 25, face = "bold"),
legend.key.width = unit(.5, "in"),
legend.title = element_text(size = 20, face="bold", family="DIN Condensed"),
legend.position = "bottom" ,
legend.text = element_text(size = 15, face="bold", family="DIN Condensed"))+
guides(fill = guide_colourbar(title = "Residual Error", title.position="top", title.hjust = 0.5))
Our residuals look fantastic! Woohoo!
Our WAIC metrics for this spatial model on eviction counts is dramatically better than the best of our non-spatial models (Difference = 501.104), indicating that this a much better fit to our data than our previous models.
t(carbayes_evictions$modelfit)%>%
as_tibble() %>%
kable(align = "c", caption = "CARBayes Spatial Eviction Model - DIC Metrics") %>%
kable_styling()
DIC | p.d | WAIC | p.w | LMPL | loglikelihood |
---|---|---|---|---|---|
1659.196 | 170.6545 | 1620.611 | 95.08595 | -920.1938 | -658.9436 |
round(carbayes_evictions$summary.results[, 1:7], 4) %>%
as.data.frame() %>%
rownames_to_column() %>%
rename(term = rowname) %>%
as_tibble() %>%
mutate(`Median`= ifelse(term == "(Intercept)", exp(`Median`), (exp(`Median`)-1)*100),
`2.5%`= ifelse(term == "(Intercept)", exp(`2.5%`), (exp(`2.5%`)-1)*100),
`97.5%` = ifelse(term == "(Intercept)", exp(`97.5%`), (exp(`97.5%`)-1)*100))%>%
filter(`2.5%` > 0 & `97.5%` > 0 | `2.5%` < 0 & `97.5%` < 0) %>%
mutate_if(is.numeric, ~round(., 4)) %>%
kable(align = "c", caption = "CARBayes Spatial Eviction Model - Summary") %>%
kable_styling()
term | Median | 2.5% | 97.5% | n.sample | % accept | n.effective | Geweke.diag |
---|---|---|---|---|---|---|---|
(Intercept) | 0.0005 | 0.0001 | 0.0021 | 80000 | 43.5 | 271.5 | -2.1 |
below_poverty_line_perc | -77.6758 | -93.1970 | -26.5966 | 80000 | 43.5 | 317.7 | 0.3 |
gini_neighborhood | 6498.9788 | 283.3967 | 93339.5690 | 80000 | 43.5 | 294.9 | 1.1 |
mean_income | -0.1299 | -0.1898 | -0.0800 | 80000 | 43.5 | 147.7 | 0.6 |
mean_rent | 5.6224 | 2.2653 | 9.2753 | 80000 | 43.5 | 237.5 | -0.4 |
black_perc | 20.9612 | 16.1021 | 25.9733 | 80000 | 43.5 | 376.9 | 0.0 |
latinx_perc | 14.2478 | 8.9915 | 20.5507 | 80000 | 43.5 | 303.7 | 0.9 |
tau2 | 42.2904 | 32.3527 | 56.7842 | 80000 | 100.0 | 7488.7 | 0.7 |
rho | 171.8282 | 171.8282 | 171.8282 | NA | NA | NA | NA |
Although this model has performed phenomenally, we could likely improve this model by choosing a different family for our regression. In NYC, neighborhood eviction counts are zero-inflated, so a count regression model that allows for increased variance or zero-inflation— like Negative-Binomial (NB) or Zero-Inflated Poisson (ZIP) regression— could improve this analysis. Unfortunately, CARBayes
does not support such regressions. INLA
is a good computational alternative. INLA
is a computationally efficient modeling alternative to MCMC based methods— like those in CARBayes
or STAN
— that use numerical integration by way of Laplace approximations to estimate the posterior distributions. The key difference is that INLA
can approximate what MCMC does, but without any sampling or simulation. Beyond its computational efficiency, it’s important to note that INLA
supports spatial models using both ZIP and NB data distributions.