# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)
# Set your Census API key
# Choose your state for analysis - assign it to a variable called my_state
my_state <- "Illinois"

Lab 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Assignment Overview
Scenario
You are a data analyst for the Illinois Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.
Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.
Learning Objectives
- Apply dplyr functions to real census data for policy analysis
- Evaluate data quality using margins of error
- Connect technical analysis to algorithmic decision-making
- Identify potential equity implications of data reliability issues
- Create professional documentation for policy stakeholders
Submission Instructions
Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/
Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.
Part 1: Portfolio Integration
Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:
- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"
If the text contains a special character such as a colon, wrap it in double quotes so that Quarto reads it as text.
Setup
State Selection: I have chosen Illinois for this analysis because: I am from Chicago and am interested in learning more about general Census data information for my home state! Illinois has some very different regions – Chicago (Cook County) and surrounding counties are quite different from the rest of the state, so there should be a good variety of trends.
Part 2: County-Level Resource Assessment
2.1 Data Retrieval
Your Task: Use get_acs() to retrieve county-level data for your chosen state.
Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide
Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.
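As a rough illustration of the naming hint (a sketch, not part of the required solution): with wide output, each name you supply becomes a column prefix, with `E` appended for the estimate and `M` for the margin of error. That is where columns such as `med_incomeE` and `med_incomeM` come from. The names `acs_vars` and `wide_cols` are hypothetical and used only here.

```r
# Named character vector: names become column prefixes, values are ACS variable codes
acs_vars <- c(med_income = "B19013_001", tot_pop = "B01003_001")

# Column names to expect in wide output: <name>E for estimates, <name>M for margins of error
wide_cols <- as.vector(outer(names(acs_vars), c("E", "M"), paste0))
print(wide_cols)
```

Choosing short, descriptive names at this step makes every later `mutate()` and `select()` call easier to read.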
# Write your get_acs() code here
county_data <- get_acs(
  geography = "county",
  variables = c(
    med_income = "B19013_001",
    tot_pop = "B01003_001"
  ),
  state = my_state,
  year = 2022,
  survey = "acs5",
  output = "wide"
)
# Clean the county names to remove the state name
# (the " County" suffix is kept because later filters match on it, e.g. "Cook County")
# Hint: use mutate() with str_remove()
county_data <- county_data %>%
  mutate(county_name = str_remove(NAME, paste0(", ", my_state)))
# Display the first few rows
glimpse(county_data)
Rows: 102
Columns: 7
$ GEOID <chr> "17001", "17003", "17005", "17007", "17009", "17011", "170…
$ NAME <chr> "Adams County, Illinois", "Alexander County, Illinois", "B…
$ med_incomeE <dbl> 63767, 40365, 58617, 80502, 64760, 64165, 88059, 61539, 64…
$ med_incomeM <dbl> 2375, 8008, 5412, 3759, 7747, 3293, 7593, 3437, 5585, 1625…
$ tot_popE <dbl> 65583, 5261, 16750, 53459, 6334, 33203, 4472, 15594, 12955…
$ tot_popM <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ county_name <chr> "Adams County", "Alexander County", "Bond County", "Boone …
2.2 Data Quality Assessment
Your Task: Calculate margin of error percentages and create reliability categories.
Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)
Hint: Use mutate() with case_when() for the categories.
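Before applying the categories to real data, it can help to check how the cutoffs treat boundary and missing values. The sketch below uses a hypothetical vector `moe_pct` with a few illustrative percentages, including exactly 5, exactly 10, and `NA`; it assumes the same three-way split described in the requirements.

```r
library(dplyr)

# Illustrative MOE percentages, including boundary values (5, 10) and a missing value
moe_pct <- c(3.72, 5.21, 10.84, 5, 10, NA)

reliability <- case_when(
  moe_pct < 5 ~ "High Confidence",
  moe_pct <= 10 ~ "Moderate Confidence",
  TRUE ~ "Low Confidence"   # catches values above 10 and also NA
)

data.frame(moe_pct, reliability)
```

Under this scheme a missing MOE falls through to "Low Confidence". If missing values should stay separate, an explicit `is.na(moe_pct) ~ NA_character_` branch could be placed first.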
# Calculate MOE percentage and reliability categories using mutate()
county_data <- county_data %>%
  mutate(
    MOE_percent = (med_incomeM / med_incomeE) * 100,
    reliability = case_when(
      MOE_percent < 5 ~ "High Confidence",
      MOE_percent <= 10 ~ "Moderate Confidence",
      TRUE ~ "Low Confidence"
    ),
    # Flag unreliable estimates (MOE > 10%)
    unreliable = MOE_percent > 10
  )
# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
reliability_summary <- county_data %>%
  count(reliability, name = "counties") %>%
  mutate(percent = counties / sum(counties) * 100)

2.3 High Uncertainty Counties
Your Task: Identify the 5 counties with the highest MOE percentages.
Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()
Hint: Use arrange(), slice(), and select() functions.
# Create table of top 5 counties by MOE percentage
top5MOE <- county_data %>%
  arrange(desc(MOE_percent)) %>%
  slice(1:5) %>%
  select(county_name, med_incomeE, med_incomeM, MOE_percent, reliability)
# Format as table with kable() - include appropriate column names and caption
kable(
  top5MOE,
  col.names = c("County", "Median Household Income", "Margin of Error", "MOE %", "Reliability"),
  caption = "Five IL Counties with the Highest MOE"
)

| County | Median Household Income | Margin of Error | MOE % | Reliability |
|---|---|---|---|---|
| Hardin County | 53026 | 11006 | 20.75586 | Low Confidence |
| Pope County | 57582 | 11811 | 20.51162 | Low Confidence |
| Alexander County | 40365 | 8008 | 19.83897 | Low Confidence |
| Moultrie County | 72833 | 9402 | 12.90898 | Low Confidence |
| Lawrence County | 55811 | 6777 | 12.14277 | Low Confidence |
Data Quality Commentary:
[Write 2-3 sentences explaining what these results mean for algorithmic decision-making. Consider: Which counties might be poorly served by algorithms that rely on this income data? What factors might contribute to higher uncertainty?]
For counties with such unreliable estimates, an algorithm that relies on this income data may direct too many or too few resources, outreach efforts, or program slots to them. For instance, Pope and Moultrie Counties have estimated median incomes about $15,000 apart, but with MOEs of roughly $12,000 and $9,000 their confidence intervals overlap, so their actual median incomes could be considerably closer than the estimates suggest. The high MOEs are probably driven by small populations (and thus small sample sizes) or by large variability in incomes within the county.
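The Pope and Moultrie comparison can be checked with a few lines of arithmetic. This is a minimal sketch using the estimates and margins of error from the table above; it assumes the MOEs are at the 90% confidence level (the ACS default), and the helper `ci()` is a hypothetical function written for this example.

```r
# Estimates and 90% margins of error from the top-5 table
pope     <- c(estimate = 57582, moe = 11811)
moultrie <- c(estimate = 72833, moe = 9402)

# Confidence interval: estimate minus/plus the margin of error
ci <- function(x) {
  c(lower = x[["estimate"]] - x[["moe"]], upper = x[["estimate"]] + x[["moe"]])
}
pope_ci     <- ci(pope)      # 45771 to 69393
moultrie_ci <- ci(moultrie)  # 63431 to 82235

# Two intervals overlap when each lower bound is below the other's upper bound
intervals_overlap <- pope_ci[["lower"]] <= moultrie_ci[["upper"]] &&
  moultrie_ci[["lower"]] <= pope_ci[["upper"]]

# MOE of the difference between two estimates (root sum of squares),
# the approximation commonly used for formal ACS comparisons
diff_est <- moultrie[["estimate"]] - pope[["estimate"]]
diff_moe <- sqrt(pope[["moe"]]^2 + moultrie[["moe"]]^2)

print(intervals_overlap)
print(c(difference = diff_est, moe_of_difference = diff_moe))
```

Interval overlap is only a rough screen. Comparing `diff_est` against `diff_moe` gives a more formal sense of whether the gap between two counties is distinguishable from sampling noise.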
Part 3: Neighborhood-Level Analysis
3.1 Focus Area Selection
Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.
Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.
# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- county_data %>%
  filter(county_name %in% c("Cook County", "Pulaski County", "Effingham County")) %>%
  select(county_name, med_incomeE, MOE_percent, reliability)
# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
kable(
  selected_counties,
  col.names = c("County", "Median Household Income", "MOE %", "Reliability"),
  caption = "Three IL Counties with Varying Reliability"
)

| County | Median Household Income | MOE % | Reliability |
|---|---|---|---|
| Cook County | 78304 | 0.7292092 | High Confidence |
| Effingham County | 73181 | 5.2076359 | Moderate Confidence |
| Pulaski County | 41038 | 10.8436084 | Low Confidence |
Comment on the output: Cook County, the most populous county in the state, has an MOE of less than 1% of its estimate, which makes sense because its sample is far larger than those of less populous counties. Pulaski County has a far lower estimated median household income than Cook and Effingham Counties, and also a much worse reliability score. I wonder whether counties with lower income estimates generally have worse reliability, since larger cities and towns often have higher median incomes.
3.2 Tract-Level Demographics
Your Task: Get demographic data for census tracts in your selected counties.
Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
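One way to read the GEOID pattern mentioned in the challenge is sketched below. County GEOIDs combine a 2-digit state FIPS code with a 3-digit county FIPS code, so the county code is simply characters 3 through 5. The vector `county_geoids` is a hypothetical name holding the GEOIDs of the three counties selected in this lab.

```r
# County GEOIDs are 5 characters: 2-digit state FIPS + 3-digit county FIPS
county_geoids <- c(Cook = "17031", Effingham = "17049", Pulaski = "17153")

state_fips  <- substr(county_geoids, 1, 2)  # "17" is Illinois
county_fips <- substr(county_geoids, 3, 5)  # county code within the state
print(county_fips)
```

Tract GEOIDs extend the same pattern with a 6-digit tract code, so an 11-character tract GEOID can be split the same way.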
# Define your race/ethnicity variables with descriptive names
tract_variables <- c(
  white = "B03002_003",
  black = "B03002_004",
  hispanic = "B03002_012",
  total_pop = "B03002_001"
)
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
tract_data <- get_acs(
  geography = "tract",
  variables = tract_variables,
  state = my_state,
  county = c("Cook", "Effingham", "Pulaski"),
  year = 2022,
  output = "wide"
)
# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_data <- tract_data %>%
  mutate(
    white_per = whiteE / total_popE * 100,
    black_per = blackE / total_popE * 100,
    hispanic_per = hispanicE / total_popE * 100
  )
# Add readable tract and county name columns using str_extract() or similar
# I looked up the GEOID structure (state FIPS, county FIPS, tract code) and used str_sub()
# to pull the county and tract digits out of the 11-digit GEOID.
tract_data <- tract_data %>%
  mutate(
    il_county = str_sub(GEOID, 3, 5),     # characters 3-5: county FIPS code
    tract_number = str_sub(GEOID, 6, 11)  # characters 6-11: tract code
  )
# Recode the county codes (031, 049, 153) to their names (Cook, Effingham, Pulaski)
tract_data <- tract_data %>%
  mutate(
    county_name = recode(
      il_county,
      "031" = "Cook",
      "049" = "Effingham",
      "153" = "Pulaski"
    )
  )

3.3 Demographic Analysis
Your Task: Analyze the demographic patterns in your selected areas.
# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
highest_hisp <- tract_data %>%
  arrange(desc(hispanic_per)) %>%
  slice(1) %>%
  select(county_name, tract_number, hispanic_per, black_per, white_per, total_popE)
# The tract with the highest percentage of Hispanic/Latino residents is tract 301308 in Cook County (Chicago), at 98.83%.
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
demo_by_county <- tract_data %>%
  group_by(county_name) %>%
  summarize(
    n_tracts = n(),
    avg_white = mean(white_per, na.rm = TRUE),
    avg_black = mean(black_per, na.rm = TRUE),
    avg_hisp = mean(hispanic_per, na.rm = TRUE)
  )
# Create a nicely formatted table of your results using kable()
kable(
  demo_by_county,
  col.names = c("County", "Number of Tracts", "Average % White", "Average % Black", "Average % Hispanic"),
  caption = "Three IL Counties and their Racial Demographics"
)

| County | Number of Tracts | Average % White | Average % Black | Average % Hispanic |
|---|---|---|---|---|
| Cook | 1332 | 38.32627 | 27.0677340 | 24.721704 |
| Effingham | 8 | 94.99167 | 0.5021559 | 2.363741 |
| Pulaski | 2 | 62.57444 | 31.5587510 | 1.901316 |
Part 4: Comprehensive Data Quality Evaluation
4.1 MOE Analysis for Demographic Variables
Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.
Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics
# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
tract_data <- tract_data %>%
  mutate(
    MOE_pct_white = (whiteM / whiteE) * 100,
    MOE_pct_black = (blackM / blackE) * 100,
    MOE_pct_hispanic = (hispanicM / hispanicE) * 100
  )
# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
# Note: this analysis flags at MOE > 30%, not the 15% listed in the requirements
tract_data <- tract_data %>%
  mutate(higherror = ifelse(
    MOE_pct_white > 30 |
      MOE_pct_black > 30 |
      MOE_pct_hispanic > 30,
    1, 0
  ))
# Create summary statistics showing how many tracts have data quality issues
tract_data %>%
  summarize(
    total_tracts = n(),
    n_flagged = sum(higherror),
    # higherror is 0 or 1, so its mean * 100 is the percent flagged
    per_flagged = mean(higherror) * 100
  )
# A tibble: 1 × 3
total_tracts n_flagged per_flagged
<int> <dbl> <dbl>
1 1342 1330 99.1
4.2 Pattern Analysis
Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.
# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison
quality_dist <- tract_data %>%
  group_by(higherror) %>%
  summarize(
    avg_pop = mean(total_popE, na.rm = TRUE),
    avg_pctwhite = mean(white_per, na.rm = TRUE),
    avg_pctblack = mean(black_per, na.rm = TRUE),
    avg_pcthisp = mean(hispanic_per, na.rm = TRUE)
  )
# Create a professional table showing the patterns
kable(
  quality_dist,
  col.names = c("MOE Percent > 30%", "Average Population", "Average % White", "Average % Black", "Average % Hispanic"),
  caption = "High and Low MOE Tracts and their Demographics"
)

| MOE Percent > 30% | Average Population | Average % White | Average % Black | Average % Hispanic |
|---|---|---|---|---|
| 0 | 5627.667 | 33.61835 | 27.18073 | 29.8925 |
| 1 | 3907.974 | 38.74732 | 26.91321 | 24.5056 |
Pattern Analysis: Tracts with any MOE greater than 30% have, on average, lower populations, a higher proportion of White residents, and a lower proportion of Hispanic residents. It must be noted that about 99% of tracts (1,330 of 1,342) had at least one demographic category with an MOE greater than 30%, so the unflagged group contains only 12 tracts and its averages should be read with caution. That said, it makes sense for tracts with lower MOEs to have larger average populations, since larger samples generally produce smaller MOEs.
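One plausible reason so many tracts are flagged is that the MOE percentage is a ratio, so it grows quickly when the estimate for a group is small. The sketch below uses made-up numbers (the data frame `tract_example` and its values are purely illustrative, not drawn from the ACS) with the same formula as in 4.1 and the 30% threshold applied in this lab.

```r
# Hypothetical tract-level estimates with margins of error (illustrative values only)
tract_example <- data.frame(
  group    = c("large group", "small group", "group with zero estimate"),
  estimate = c(2500, 40, 0),
  moe      = c(300, 35, 13)
)

# MOE as a percentage of the estimate, as computed in 4.1
tract_example$moe_pct <- tract_example$moe / tract_example$estimate * 100

# Flag using the 30% threshold applied in this lab
tract_example$flagged <- tract_example$moe_pct > 30
print(tract_example)
```

Because a tract is flagged when any one of its three groups exceeds the threshold, a single small group is enough to flag an otherwise well-measured tract. An estimate of zero produces an infinite ratio in R, which also counts as exceeding the threshold.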
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Your Task: Write an executive summary that integrates findings from all four analyses.
Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?
Executive Summary:
1. Across the analyses in this lab, one key pattern is that counties with smaller populations tend to have lower incomes and higher margins of error. At the county level, population, income, and other demographic analyses carry relatively little risk of large error, but at the census tract level errors often become too large to support confident conclusions or policy decisions. For some measures, such as total population or median income (depending on variability within the sampled population), tract-level errors may be low enough to inform decisions. Demographic data such as race, which divides the population into smaller groups and therefore increases MOEs, is much less reliable.
2. In my findings, counties with smaller populations and fewer census tracts face a greater risk of algorithmic bias because their errors for median income and other social and demographic measures are far larger than those of populous counties such as Cook. Other at-risk groups include residents of highly racially diverse census tracts, since each subgroup of the tract’s population is smaller and has a larger error due to its small sample size. Large margins of error may lead policies to direct benefits toward populations that need them less than others.
3. The underlying factors that increase risk and bias are small sample sizes and high variability within those samples. For instance, if a county’s residents have widely ranging incomes, a median income based on a sample is more likely to be inaccurate for the whole population. Another risk is relying on small samples (such as census tracts) to inform policy: if resources are rationed based on tract-level income, race, or other data, errors are likely to create mismatches with reality.
4. To address these systematic issues, algorithms that influence policy decisions should avoid small-sample datasets where possible and supplement their data with 5-year ACS estimates, Decennial Census data, and other available sources. Depending on policy goals, sources beyond conventional government data collection may also be used. For instance, if a state government is deciding which counties should receive more funding for transportation and infrastructure, it should draw on infrastructure quality assessments, traffic surveys, and travel demand modeling in addition to whatever census or ACS data is used.
5.2 Specific Recommendations
Your Task: Create a decision framework for algorithm implementation.
# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
county_decision <- county_data %>%
  select(county_name, med_incomeE, MOE_percent, reliability)
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
# Recommendations key off the reliability category so both columns share the same cutoffs
county_decision <- county_decision %>%
  mutate(algo_rec = case_when(
    reliability == "High Confidence" ~ "Safe for algorithmic decisions",
    reliability == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
    TRUE ~ "Requires manual review or additional data"
  ))
# Format as a professional table with kable()
kable(
  county_decision,
  col.names = c("IL County", "Median Income", "Margin of Error percent", "Reliability", "Algorithmic Use Recommendation"),
  caption = "Illinois Counties Median Incomes and Data Reliability"
)

| IL County | Median Income | Margin of Error percent | Reliability | Algorithmic Use Recommendation |
|---|---|---|---|---|
| Adams County | 63767 | 3.7244970 | High Confidence | Safe for algorithmic decisions |
| Alexander County | 40365 | 19.8389694 | Low Confidence | Requires manual review or additional data |
| Bond County | 58617 | 9.2328164 | Moderate Confidence | Use with caution - monitor outcomes |
| Boone County | 80502 | 4.6694492 | High Confidence | Safe for algorithmic decisions |
| Brown County | 64760 | 11.9626313 | Low Confidence | Requires manual review or additional data |
| Bureau County | 64165 | 5.1320814 | Moderate Confidence | Use with caution - monitor outcomes |
| Calhoun County | 88059 | 8.6226280 | Moderate Confidence | Use with caution - monitor outcomes |
| Carroll County | 61539 | 5.5850761 | Moderate Confidence | Use with caution - monitor outcomes |
| Cass County | 64826 | 8.6153704 | Moderate Confidence | Use with caution - monitor outcomes |
| Champaign County | 61090 | 2.6600098 | High Confidence | Safe for algorithmic decisions |
| Christian County | 56933 | 5.1955808 | Moderate Confidence | Use with caution - monitor outcomes |
| Clark County | 65874 | 5.6061572 | Moderate Confidence | Use with caution - monitor outcomes |
| Clay County | 58028 | 9.5695182 | Moderate Confidence | Use with caution - monitor outcomes |
| Clinton County | 78054 | 4.5942553 | High Confidence | Safe for algorithmic decisions |
| Coles County | 53732 | 7.5857962 | Moderate Confidence | Use with caution - monitor outcomes |
| Cook County | 78304 | 0.7292092 | High Confidence | Safe for algorithmic decisions |
| Crawford County | 64163 | 9.3153375 | Moderate Confidence | Use with caution - monitor outcomes |
| Cumberland County | 71274 | 10.5859079 | Low Confidence | Requires manual review or additional data |
| DeKalb County | 68617 | 3.4029468 | High Confidence | Safe for algorithmic decisions |
| De Witt County | 61823 | 9.5223461 | Moderate Confidence | Use with caution - monitor outcomes |
| Douglas County | 67177 | 8.0072048 | Moderate Confidence | Use with caution - monitor outcomes |
| DuPage County | 107035 | 1.1753165 | High Confidence | Safe for algorithmic decisions |
| Edgar County | 56687 | 11.2459647 | Low Confidence | Requires manual review or additional data |
| Edwards County | 60784 | 8.1172677 | Moderate Confidence | Use with caution - monitor outcomes |
| Effingham County | 73181 | 5.2076359 | Moderate Confidence | Use with caution - monitor outcomes |
| Fayette County | 51962 | 6.9108194 | Moderate Confidence | Use with caution - monitor outcomes |
| Ford County | 58930 | 7.1423723 | Moderate Confidence | Use with caution - monitor outcomes |
| Franklin County | 51031 | 5.6259920 | Moderate Confidence | Use with caution - monitor outcomes |
| Fulton County | 57223 | 4.5890638 | High Confidence | Safe for algorithmic decisions |
| Gallatin County | 51868 | 10.4920182 | Low Confidence | Requires manual review or additional data |
| Greene County | 58900 | 4.3463497 | High Confidence | Safe for algorithmic decisions |
| Grundy County | 89993 | 3.4513796 | High Confidence | Safe for algorithmic decisions |
| Hamilton County | 60574 | 10.6118136 | Low Confidence | Requires manual review or additional data |
| Hancock County | 61026 | 6.1236194 | Moderate Confidence | Use with caution - monitor outcomes |
| Hardin County | 53026 | 20.7558556 | Low Confidence | Requires manual review or additional data |
| Henderson County | 64946 | 7.8988698 | Moderate Confidence | Use with caution - monitor outcomes |
| Henry County | 66313 | 5.3217318 | Moderate Confidence | Use with caution - monitor outcomes |
| Iroquois County | 62866 | 5.1426844 | Moderate Confidence | Use with caution - monitor outcomes |
| Jackson County | 44847 | 6.5712311 | Moderate Confidence | Use with caution - monitor outcomes |
| Jasper County | 67429 | 8.4563022 | Moderate Confidence | Use with caution - monitor outcomes |
| Jefferson County | 58384 | 3.7424637 | High Confidence | Safe for algorithmic decisions |
| Jersey County | 77607 | 9.5828984 | Moderate Confidence | Use with caution - monitor outcomes |
| Jo Daviess County | 67729 | 6.2011841 | Moderate Confidence | Use with caution - monitor outcomes |
| Johnson County | 63295 | 5.4696264 | Moderate Confidence | Use with caution - monitor outcomes |
| Kane County | 96400 | 1.8060166 | High Confidence | Safe for algorithmic decisions |
| Kankakee County | 65489 | 4.2785811 | High Confidence | Safe for algorithmic decisions |
| Kendall County | 106358 | 3.0641795 | High Confidence | Safe for algorithmic decisions |
| Knox County | 50263 | 6.6410680 | Moderate Confidence | Use with caution - monitor outcomes |
| Lake County | 104553 | 1.3954645 | High Confidence | Safe for algorithmic decisions |
| LaSalle County | 67942 | 3.1571046 | High Confidence | Safe for algorithmic decisions |
| Lawrence County | 55811 | 12.1427676 | Low Confidence | Requires manual review or additional data |
| Lee County | 64588 | 4.5503809 | High Confidence | Safe for algorithmic decisions |
| Livingston County | 68175 | 4.5544554 | High Confidence | Safe for algorithmic decisions |
| Logan County | 62547 | 4.7564232 | High Confidence | Safe for algorithmic decisions |
| McDonough County | 48904 | 6.7356453 | Moderate Confidence | Use with caution - monitor outcomes |
| McHenry County | 100101 | 2.3036733 | High Confidence | Safe for algorithmic decisions |
| McLean County | 75356 | 3.4383460 | High Confidence | Safe for algorithmic decisions |
| Macon County | 59622 | 3.6513368 | High Confidence | Safe for algorithmic decisions |
| Macoupin County | 64706 | 4.3705375 | High Confidence | Safe for algorithmic decisions |
| Madison County | 71759 | 2.1056592 | High Confidence | Safe for algorithmic decisions |
| Marion County | 59099 | 3.9086956 | High Confidence | Safe for algorithmic decisions |
| Marshall County | 64940 | 5.7083462 | Moderate Confidence | Use with caution - monitor outcomes |
| Mason County | 58479 | 5.2993382 | Moderate Confidence | Use with caution - monitor outcomes |
| Massac County | 57365 | 9.4395537 | Moderate Confidence | Use with caution - monitor outcomes |
| Menard County | 84846 | 6.4599392 | Moderate Confidence | Use with caution - monitor outcomes |
| Mercer County | 67028 | 4.7248911 | High Confidence | Safe for algorithmic decisions |
| Monroe County | 100685 | 4.7385410 | High Confidence | Safe for algorithmic decisions |
| Montgomery County | 61796 | 6.0780633 | Moderate Confidence | Use with caution - monitor outcomes |
| Morgan County | 61188 | 5.7560306 | Moderate Confidence | Use with caution - monitor outcomes |
| Moultrie County | 72833 | 12.9089836 | Low Confidence | Requires manual review or additional data |
| Ogle County | 75782 | 4.2279169 | High Confidence | Safe for algorithmic decisions |
| Peoria County | 63409 | 2.3387847 | High Confidence | Safe for algorithmic decisions |
| Perry County | 56338 | 6.0722780 | Moderate Confidence | Use with caution - monitor outcomes |
| Piatt County | 81151 | 10.7145938 | Low Confidence | Requires manual review or additional data |
| Pike County | 55514 | 6.6721908 | Moderate Confidence | Use with caution - monitor outcomes |
| Pope County | 57582 | 20.5116182 | Low Confidence | Requires manual review or additional data |
| Pulaski County | 41038 | 10.8436084 | Low Confidence | Requires manual review or additional data |
| Putnam County | 75726 | 9.9543090 | Moderate Confidence | Use with caution - monitor outcomes |
| Randolph County | 63860 | 5.1926088 | Moderate Confidence | Use with caution - monitor outcomes |
| Richland County | 61607 | 9.7943416 | Moderate Confidence | Use with caution - monitor outcomes |
| Rock Island County | 64435 | 3.0682083 | High Confidence | Safe for algorithmic decisions |
| St. Clair County | 68915 | 2.6583472 | High Confidence | Safe for algorithmic decisions |
| Saline County | 51710 | 5.0222394 | Moderate Confidence | Use with caution - monitor outcomes |
| Sangamon County | 71653 | 2.5749096 | High Confidence | Safe for algorithmic decisions |
| Schuyler County | 63737 | 10.5825502 | Low Confidence | Requires manual review or additional data |
| Scott County | 70500 | 9.7588652 | Moderate Confidence | Use with caution - monitor outcomes |
| Shelby County | 65585 | 4.7968285 | High Confidence | Safe for algorithmic decisions |
| Stark County | 58125 | 9.2920430 | Moderate Confidence | Use with caution - monitor outcomes |
| Stephenson County | 57527 | 4.1180663 | High Confidence | Safe for algorithmic decisions |
| Tazewell County | 74606 | 2.5333083 | High Confidence | Safe for algorithmic decisions |
| Union County | 54090 | 9.9242004 | Moderate Confidence | Use with caution - monitor outcomes |
| Vermilion County | 52787 | 3.4459242 | High Confidence | Safe for algorithmic decisions |
| Wabash County | 54074 | 10.8776861 | Low Confidence | Requires manual review or additional data |
| Warren County | 62700 | 11.4003190 | Low Confidence | Requires manual review or additional data |
| Washington County | 75111 | 6.4504533 | Moderate Confidence | Use with caution - monitor outcomes |
| Wayne County | 53522 | 7.3016703 | Moderate Confidence | Use with caution - monitor outcomes |
| White County | 54605 | 9.8177823 | Moderate Confidence | Use with caution - monitor outcomes |
| Whiteside County | 62828 | 6.3124721 | Moderate Confidence | Use with caution - monitor outcomes |
| Will County | 103678 | 1.3908447 | High Confidence | Safe for algorithmic decisions |
| Williamson County | 60325 | 5.2598425 | Moderate Confidence | Use with caution - monitor outcomes |
| Winnebago County | 61738 | 2.1639833 | High Confidence | Safe for algorithmic decisions |
| Woodford County | 80093 | 5.0241594 | Moderate Confidence | Use with caution - monitor outcomes |
Key Recommendations:
Your Task: Use your analysis results to provide specific guidance to the department.
- Counties suitable for immediate algorithmic implementation:
There are 38 counties in the High Confidence category. I will not list them all here; refer to the table above, listed in alphabetical order, to find a particular county. They are appropriate for algorithmic use because their margins of error are low, so there is little risk of over- or underestimating median income. If the algorithms are well designed, they should inform accurate policy decisions.
- Counties requiring additional oversight:
There are 49 counties in the Moderate Confidence category. These counties have somewhat reliable median income estimates, but if they are used for algorithmic decision-making, outcomes should be monitored and, where possible, proxies or other data sources should be used to corroborate or challenge the estimates.
- Counties needing alternative approaches:
There are 15 counties in the Low Confidence category. These counties have unreliable data and likely provide inaccurate estimates of median income. Additional data would be required before making any decisions for these counties, so as not to risk misinformed outcomes. Proxies, other data sources, or non-algorithmic approaches may be necessary.
Questions for Further Investigation
[List 2-3 questions that your analysis raised that you’d like to explore further in future assignments. Consider questions about spatial patterns, time trends, or other demographic factors.]
1: Is it possible to bring in other data to lower margins of error by corroborating estimates? If so, how do you know whether the additional data confirms or contradicts an estimate?
2: If a large dataset like the ACS covers a whole city, but the data within each census tract has high errors, at what scale does the data become acceptable to use? For instance, if a group of census tracts all have high errors on their racial demographics, at what geographic scale is it acceptable to use them for decisions, and how do you know?
3: How frequently does unreliable data get used in policy decisions? Is there a way to measure its effects?
Technical Notes
Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on [date]
Reproducibility:
- All analysis conducted in R version [your version]
- Census API key required for replication
- Complete code and documentation available at: [your portfolio URL]
Methodology Notes: [Describe any decisions you made about data processing, county selection, or analytical choices that might affect reproducibility]
Limitations: [Note any limitations in your analysis - sample size issues, geographic scope, temporal factors, etc.]
Submission Checklist
Before submitting your portfolio link on Canvas:
Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html