Lab 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Wil Rantala

Published

March 17, 2026

Assignment Overview

Scenario

You are a data analyst for the Illinois Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

  • Apply dplyr functions to real census data for policy analysis
  • Evaluate data quality using margins of error
  • Connect technical analysis to algorithmic decision-making
  • Identify potential equity implications of data reliability issues
  • Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/

Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"

If the text contains a special character such as a colon, wrap it in double quotes so that Quarto reads it as text.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)
# Set your Census API key

# Choose your state for analysis - assign it to a variable called my_state
my_state <- "Illinois"
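
The setup chunk does not show a key being set. A minimal sketch of one way to supply it without hard-coding it in the document, assuming the key is stored in the CENSUS_API_KEY environment variable (the variable tidycensus reads); the names api_key and key_is_set and the message text are illustrative:

```r
# Read the key from the environment instead of typing it into the .qmd file.
# Assumption: the key was saved earlier, for example with
#   tidycensus::census_api_key("YOUR_KEY_HERE", install = TRUE)
# which stores it as CENSUS_API_KEY in ~/.Renviron.
api_key <- Sys.getenv("CENSUS_API_KEY")

# nzchar() is TRUE when the string is non-empty, i.e. a key is available
key_is_set <- nzchar(api_key)
if (!key_is_set) {
  message("No Census API key found; get_acs() calls are expected to fail until one is set.")
}
```

Keeping the key out of the source file also keeps it out of the public portfolio repository.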

State Selection: I have chosen Illinois for this analysis because: I am from Chicago and am interested in learning more about general Census data information for my home state! Illinois has some very different regions – Chicago (Cook County) and surrounding counties are quite different from the rest of the state, so there should be a good variety of trends.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:

  • Geography: county level
  • Variables: median household income (B19013_001) and total population (B01003_001)
  • Year: 2022
  • Survey: acs5
  • Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here

county_data <- get_acs(
  geography = "county",
  variables = c(
    med_income = "B19013_001",
    tot_pop = "B01003_001"
  ),
  state = my_state,
  year = 2022,
  survey = "acs5",  # 5-year ACS, as the requirements specify
  output = "wide"
)

# Clean the county names by removing the state name
# (the "County" suffix is kept, so names read "Adams County")
# Hint: use mutate() with str_remove()

county_data <- county_data %>%
  mutate(county_name = str_remove(NAME, paste0(", ", my_state)))

# Display the first few rows
glimpse(county_data)
Rows: 102
Columns: 7
$ GEOID       <chr> "17001", "17003", "17005", "17007", "17009", "17011", "170…
$ NAME        <chr> "Adams County, Illinois", "Alexander County, Illinois", "B…
$ med_incomeE <dbl> 63767, 40365, 58617, 80502, 64760, 64165, 88059, 61539, 64…
$ med_incomeM <dbl> 2375, 8008, 5412, 3759, 7747, 3293, 7593, 3437, 5585, 1625…
$ tot_popE    <dbl> 65583, 5261, 16750, 53459, 6334, 33203, 4472, 15594, 12955…
$ tot_popM    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ county_name <chr> "Adams County", "Alexander County", "Bond County", "Boone …
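
The cleaning step above removes the state name but keeps the word "County". If shorter labels are wanted, both parts can be stripped with one regular expression. A minimal sketch in base R on two names taken from the output above; the variable names are illustrative, and the stringr equivalent would be str_remove() with the same pattern:

```r
# County names as they appear in the NAME column
county_names <- c("Adams County, Illinois", "Alexander County, Illinois")
state_name <- "Illinois"

# Remove the trailing " County, <state>" in a single pass;
# the "$" anchors the pattern to the end of the string
short_names <- sub(paste0(" County, ", state_name, "$"), "", county_names)
print(short_names)
```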

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:

  • Calculate MOE percentage: (margin of error / estimate) * 100
  • Create reliability categories:
      • High Confidence: MOE < 5%
      • Moderate Confidence: MOE 5-10%
      • Low Confidence: MOE > 10%
  • Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
county_data <- county_data %>%
  mutate(MOE_percent = (med_incomeM / med_incomeE) * 100)

county_data <- county_data %>%
  mutate(reliability = case_when(MOE_percent < 5 ~ 'High Confidence',
                                 MOE_percent <= 10 ~ 'Moderate Confidence',
                                 TRUE ~ 'Low Confidence'))

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
reliability_summary <- county_data %>%
  count(reliability, name = "counties") %>%
  mutate(percent = counties / sum(counties) * 100)
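
As a check on the arithmetic, the MOE percentage and category can be worked by hand for three counties. This is a small sketch in base R using the estimates and margins of error shown in the section 2.1 output; it also adds the unreliable-estimate flag from the requirements, under the illustrative name unreliable:

```r
# Estimates and margins of error for Adams, Alexander, and Bond Counties,
# copied from the glimpse() output in section 2.1
est <- c(Adams = 63767, Alexander = 40365, Bond = 58617)
moe <- c(Adams = 2375, Alexander = 8008, Bond = 5412)

# MOE as a percentage of the estimate
moe_pct <- moe / est * 100

# Same cutoffs as the case_when() above: < 5 high, 5-10 moderate, > 10 low
reliability <- ifelse(moe_pct < 5, "High Confidence",
               ifelse(moe_pct <= 10, "Moderate Confidence", "Low Confidence"))

# Logical flag for unreliable estimates (MOE > 10%)
unreliable <- moe_pct > 10

print(data.frame(moe_pct = round(moe_pct, 2), reliability, unreliable))
```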

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:

  • Sort by MOE percentage (highest first)
  • Select the top 5 counties
  • Display: county name, median income, margin of error, MOE percentage, reliability category
  • Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
top5MOE <- county_data %>%
  arrange(desc(MOE_percent)) %>%
  slice(1:5) %>%
  select(county_name, med_incomeE, med_incomeM, MOE_percent, reliability)

# Format as table with kable() - include appropriate column names and caption
kable(top5MOE, col.names = c(
  "County", "Median Household Income", "Margin of Error", "MOE %", "Reliability"),
  caption = "Five IL Counties with the Highest MOE"
)
Five IL Counties with the Highest MOE

| County | Median Household Income | Margin of Error | MOE % | Reliability |
|---|---|---|---|---|
| Hardin County | 53026 | 11006 | 20.75586 | Low Confidence |
| Pope County | 57582 | 11811 | 20.51162 | Low Confidence |
| Alexander County | 40365 | 8008 | 19.83897 | Low Confidence |
| Moultrie County | 72833 | 9402 | 12.90898 | Low Confidence |
| Lawrence County | 55811 | 6777 | 12.14277 | Low Confidence |

Data Quality Commentary:

[Write 2-3 sentences explaining what these results mean for algorithmic decision-making. Consider: Which counties might be poorly served by algorithms that rely on this income data? What factors might contribute to higher uncertainty?]

For counties with such unreliable estimates, an algorithm that relies on this income data may over- or under-allocate resources, attention, and programs. For instance, Pope and Moultrie Counties have estimated median incomes about $15,000 apart, but their margins of error (roughly $12,000 and $9,000) are large enough that the plausible ranges overlap, so their actual median incomes could be much closer than the estimates suggest. The high MOEs could stem from low populations (and thus small sample sizes) or from large variability in the data.
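
The Pope and Moultrie comparison can be checked directly by building each county's range (estimate plus or minus MOE) and testing whether the ranges overlap. A minimal sketch in base R using the values from the table above; ACS margins of error are published at the 90% confidence level, and an overlap check like this is a rough screen rather than a formal significance test:

```r
# Estimates and margins of error from the top-5 table
pope     <- c(est = 57582, moe = 11811)
moultrie <- c(est = 72833, moe = 9402)

# Plausible range for each county: estimate minus MOE to estimate plus MOE
pope_range     <- pope[["est"]] + c(-1, 1) * pope[["moe"]]
moultrie_range <- moultrie[["est"]] + c(-1, 1) * moultrie[["moe"]]

# Two ranges overlap when each one starts before the other one ends
ranges_overlap <- pope_range[1] <= moultrie_range[2] &&
  moultrie_range[1] <= pope_range[2]

print(pope_range)
print(moultrie_range)
print(ranges_overlap)
```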

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- county_data %>%
  filter(county_name == "Cook County" | county_name == "Pulaski County" | county_name == "Effingham County") %>%
  select(county_name, med_incomeE, MOE_percent, reliability)
# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category

kable(selected_counties, col.names = c(
  "County", "Median Household Income", "MOE %", "Reliability"),
  caption = "Three IL Counties with Varying Reliability")
Three IL Counties with Varying Reliability

| County | Median Household Income | MOE % | Reliability |
|---|---|---|---|
| Cook County | 78304 | 0.7292092 | High Confidence |
| Effingham County | 73181 | 5.2076359 | Moderate Confidence |
| Pulaski County | 41038 | 10.8436084 | Low Confidence |

Comment on the output: Cook County, the most populous county in the state, has an MOE of less than 1% of its estimate, which makes sense because its sample is massive compared with those of less populous counties. Pulaski County has a far lower estimated median household income than Cook and Effingham Counties, and also a much worse reliability score. I wonder whether counties with lower income estimates generally have worse reliability, since larger cities and towns often have higher median incomes.
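
When several counties are selected by name, matching against a vector with %in% is easier to extend than a chain of == comparisons joined by |. A small sketch in base R on a four-county sample with MOE percentages taken from this lab; the object names are illustrative, and the dplyr form would be filter(county_name %in% focus):

```r
# Small sample of the county table (values from this lab)
county_sample <- data.frame(
  county_name = c("Adams County", "Cook County", "Effingham County", "Pulaski County"),
  MOE_percent = c(3.7244970, 0.7292092, 5.2076359, 10.8436084)
)

# Counties chosen for the tract-level study
focus <- c("Cook County", "Effingham County", "Pulaski County")

# Keep only the rows whose name appears in the focus vector
selected <- county_sample[county_sample$county_name %in% focus, ]
print(selected)
```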

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:

  • Geography: tract level
  • Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
  • Use the same state and year as before
  • Output format: wide
  • Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.

# Define your race/ethnicity variables with descriptive names
tract_variables <- c(white = "B03002_003",
                     black = "B03002_004",
                     hispanic = "B03002_012",
                     total_pop = "B03002_001")
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
tract_data <- get_acs(
  geography = "tract",
  variables = tract_variables,
  state = my_state,
  county = c("Cook", "Effingham", "Pulaski"),
  year = 2022,
  output = "wide"
)
# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_data <- tract_data %>%
  mutate(white_per = whiteE / total_popE * 100, 
         black_per = blackE / total_popE * 100,
         hispanic_per = hispanicE / total_popE * 100)
# Add readable tract and county name columns using str_extract() or similar
### I looked up the GEOID FIPS and tract codes to use the GEOID to extract the county and tract names for each tract. Then I selected for certain digits within the 11 digit GEOID using str_sub.
tract_data <- tract_data %>%
  mutate(
    il_county = str_sub(GEOID, 3, 5), #select the 3-5 digits for county
    tract_number = str_sub(GEOID, 6, 11) #select the 6-11 digits for tract
  )

### now I have to change the county codes (031, 049, 153) to their associated names (Cook, Effingham, Pulaski)

tract_data <- tract_data %>%
  mutate(
    county_name = recode(il_county,
                         "031" = "Cook",
                         "049" = "Effingham",
                         "153" = "Pulaski")
    )

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract

highest_hisp <- tract_data %>%
  arrange(desc(hispanic_per)) %>%
  slice(1) %>%
  select(county_name, tract_number, hispanic_per, black_per, white_per, total_popE)

# The tract with the highest percentage of Hispanic/Latino residents is tract 301308 in Cook County (Chicago), at 98.83%.

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group

demo_by_county <- tract_data %>%
  group_by(county_name) %>%
  summarize(
    n_tracts = n(),
    avg_white = mean(white_per, na.rm = TRUE),
    avg_black = mean(black_per, na.rm = TRUE),
    avg_hisp = mean(hispanic_per, na.rm = TRUE)
  )

# Create a nicely formatted table of your results using kable()

kable(demo_by_county, col.names = c(
  "County", "Number of Tracts", "Average % White", "Average % Black", "Average % Hispanic"),
  caption = "Three IL Counties and their Racial Demographics")
Three IL Counties and their Racial Demographics

| County | Number of Tracts | Average % White | Average % Black | Average % Hispanic |
|---|---|---|---|---|
| Cook | 1332 | 38.32627 | 27.0677340 | 24.721704 |
| Effingham | 8 | 94.99167 | 0.5021559 | 2.363741 |
| Pulaski | 2 | 62.57444 | 31.5587510 | 1.901316 |
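
Note that these figures are unweighted averages of tract percentages, so a small tract counts as much as a large one, and the result can differ from the group's share of the county's total population. A sketch in base R with a hypothetical two-tract county (all numbers invented for illustration):

```r
# Hypothetical county with two tracts of very different size
tracts <- data.frame(
  total_pop = c(1000, 4000),
  hispanic  = c(500, 400)
)
tracts$hispanic_per <- tracts$hispanic / tracts$total_pop * 100

# Unweighted mean of tract percentages (what mean() in summarize() computes)
unweighted_avg <- mean(tracts$hispanic_per)

# County-wide share: total group population over total population
county_share <- sum(tracts$hispanic) / sum(tracts$total_pop) * 100

print(c(unweighted_avg = unweighted_avg, county_share = county_share))
```

Either number can be appropriate; the choice depends on whether the question is about the typical tract or about the county's residents as a whole.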

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:

  • Calculate MOE percentages for each demographic variable
  • Flag tracts where any demographic variable has MOE > 15%
  • Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)

tract_data <- tract_data %>%
  mutate(MOE_pct_white = (whiteM / whiteE) * 100,
         MOE_pct_black = (blackM / blackE) * 100,
         MOE_pct_hispanic = (hispanicM / hispanicE) * 100)

# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
# Note: the requirements suggest a 15% cutoff; this analysis flags MOE > 30%

tract_data <- tract_data %>%
  mutate(higherror = ifelse(
    MOE_pct_white > 30 |
    MOE_pct_black > 30 |
    MOE_pct_hispanic > 30,
    1, 0
  ))

# Create summary statistics showing how many tracts have data quality issues

tract_data %>%
  summarize(
    total_tracts = n(),
    n_flagged = sum(higherror),
    per_flagged = mean(higherror) * 100 
    #since higherror is either 1 or 0, the mean * 100 works as the percent flagged
  )
# A tibble: 1 × 3
  total_tracts n_flagged per_flagged
         <int>     <dbl>       <dbl>
1         1342      1330        99.1
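
One detail worth checking behind a flag rate this high is how tracts with an estimate of zero are handled: dividing a margin of error by zero gives Inf in R, and Inf is greater than any cutoff. This may account for part of the flagged share, although that was not verified here. A minimal sketch in base R with hypothetical values:

```r
# Hypothetical tract estimates: the second tract has an estimate of zero
est <- c(250, 0, 40)
moe <- c(60, 12, 35)

# Unguarded calculation: the zero estimate produces Inf, which is flagged
raw_pct  <- moe / est * 100
flag_raw <- raw_pct > 30

# Guarded version: leave the percentage undefined when the estimate is zero,
# so those tracts can be counted separately instead of being flagged
safe_pct  <- ifelse(est > 0, moe / est * 100, NA_real_)
flag_safe <- safe_pct > 30

print(data.frame(est, moe, raw_pct, flag_raw, safe_pct, flag_safe))
```

With the guarded version, sum() and mean() on the flag need na.rm = TRUE.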

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages

quality_dist <- tract_data %>%
  group_by(higherror) %>%
  summarize(
    avg_pop = mean(total_popE, na.rm = TRUE),
    avg_pctwhite = mean(white_per, na.rm = TRUE),
    avg_pctblack = mean(black_per, na.rm = TRUE),
    avg_pcthisp = mean(hispanic_per, na.rm = TRUE)
  )

# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns

kable(quality_dist, col.names = c(
  "MOE Percent > 30%", "Average Population", "Average % White", "Average % Black", "Average % Hispanic"),
  caption = "High and Low MOE Tracts and their Demographics")
High and Low MOE Tracts and their Demographics

| MOE Percent > 30% | Average Population | Average % White | Average % Black | Average % Hispanic |
|---|---|---|---|---|
| 0 | 5627.667 | 33.61835 | 27.18073 | 29.8925 |
| 1 | 3907.974 | 38.74732 | 26.91321 | 24.5056 |

Pattern Analysis: Tracts with any MOE greater than 30% have, on average, lower populations, a higher proportion of White residents, and a lower proportion of Hispanic residents. It must be noted that, while not shown in this table, around 99% of tracts had at least one demographic category with an MOE greater than 30%, so the unflagged group contains only 12 tracts and its averages rest on a very small sample. That being said, it makes sense for tracts with lower MOEs to have greater average populations, as larger samples tend to produce lower MOEs.
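
The percentages compared here carry their own uncertainty, which is different from the MOE of the underlying counts. A sketch of the Census Bureau's approximation for the margin of error of a proportion, written as a base R function with the hypothetical name moe_of_proportion and invented inputs; tidycensus ships a packaged version as moe_prop():

```r
# Census Bureau approximation for the MOE of a proportion p = x / y:
#   MOE_p = sqrt(MOE_x^2 - p^2 * MOE_y^2) / y
# If the term under the root is negative, the ratio form
# (plus sign instead of minus) is used instead.
moe_of_proportion <- function(x, y, moe_x, moe_y) {
  p <- x / y
  radicand <- moe_x^2 - p^2 * moe_y^2
  radicand <- ifelse(radicand < 0, moe_x^2 + p^2 * moe_y^2, radicand)
  sqrt(radicand) / y
}

# Hypothetical tract: 400 Hispanic residents (+/- 120) out of 4000 (+/- 300)
moe_p <- moe_of_proportion(x = 400, y = 4000, moe_x = 120, moe_y = 300)
cat(sprintf("share = %.1f%%, MOE = %.1f percentage points\n",
            400 / 4000 * 100, moe_p * 100))
```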

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:

  1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
  2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
  3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
  4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

  1. Across the analyses in this lab, one key pattern is that lower-population counties tend to have lower incomes and higher margins of error. At the county level, population, income, and other demographic analyses are all possible with less risk of large error, but at the census tract level, errors become too large to confidently make assumptions or inform policy. For some data categories, such as population or median income (depending on variability in the sample population), errors at the tract level are potentially low enough to inform decisions, but demographic data such as race, which divides the population into smaller groups and thereby increases MOEs, is much less reliable.

  2. In my findings, counties with smaller populations and fewer census tracts face greater risk of algorithmic bias, as their errors for median income and other social/demographic factors are far greater than those of counties with larger populations (such as Cook). Other at-risk groups include populations in highly racially diverse census tracts, as each subset of the tract’s population is smaller and faces greater error due to a small sample size. Large margins of error may lead policies to afford greater privileges to populations that need them less than others.

  3. The underlying factors increasing risk and bias are small sample sizes and great variability within those samples. For instance, if a county’s residents have widely ranging incomes, the median income based on a sample is more likely to be inaccurate for the whole population. Another risk is using small data samples (such as census tracts) to inform policies: if resources are rationed based on census tract income, race, or other data, then errors are likely to create misalignments with reality.

  4. To address systematic issues, algorithms that influence policy decisions should avoid small-sample datasets where possible, and supplement their data with 5-year ACS estimates, Decennial Census data, and other available sources. Means beyond conventional government data collection may be deployed depending on policy goals. For instance, if a state government is determining which counties should receive more funding for transportation and infrastructure, then there should be infrastructure quality assessments, traffic surveys, and travel demand modeling in addition to whatever census or ACS data may be used.
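
One way to act on the small-sample concern is to combine neighbouring tracts before using the data, since the relative error of a sum is usually smaller than that of its parts. This lab does not do this; the sketch below only illustrates the idea in base R with hypothetical counts, using the Census Bureau approximation for the MOE of a sum (tidycensus offers it as moe_sum()):

```r
# Hypothetical counts for one group in three neighbouring tracts
est <- c(120, 90, 150)
moe <- c(80, 70, 95)

# Relative error of each tract on its own
tract_moe_pct <- moe / est * 100

# Approximate MOE of a sum of estimates:
# the square root of the sum of squared MOEs
combined_est <- sum(est)
combined_moe <- sqrt(sum(moe^2))
combined_moe_pct <- combined_moe / combined_est * 100

print(round(tract_moe_pct, 1))
print(round(combined_moe_pct, 1))
```

The trade-off is geographic detail: the combined estimate is more stable but no longer describes any single tract.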

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category

county_decision <- county_data %>%
  select(county_name, med_incomeE, MOE_percent, reliability)

# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"

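# Map each reliability category to a recommendation so the two columns
# always agree (the cutoffs are defined once, in section 2.2)
county_decision <- county_decision %>%
  mutate(algo_rec = case_when(
    reliability == 'High Confidence' ~ 'Safe for algorithmic decisions',
    reliability == 'Moderate Confidence' ~ 'Use with caution - monitor outcomes',
    TRUE ~ 'Requires manual review or additional data'))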

# Format as a professional table with kable()

kable(county_decision, col.names = c(
        "IL County", "Median Income", "Margin of Error percent", "Reliability", "Algorithmic Use Recommendation"),
      caption = "Illinois Counties Median Incomes and Data Reliability")
Illinois Counties Median Incomes and Data Reliability

| IL County | Median Income | Margin of Error percent | Reliability | Algorithmic Use Recommendation |
|---|---|---|---|---|
| Adams County | 63767 | 3.7244970 | High Confidence | Safe for algorithmic decisions |
| Alexander County | 40365 | 19.8389694 | Low Confidence | Requires manual review or additional data |
| Bond County | 58617 | 9.2328164 | Moderate Confidence | Use with caution - monitor outcomes |
| Boone County | 80502 | 4.6694492 | High Confidence | Safe for algorithmic decisions |
| Brown County | 64760 | 11.9626313 | Low Confidence | Requires manual review or additional data |
| Bureau County | 64165 | 5.1320814 | Moderate Confidence | Use with caution - monitor outcomes |
| Calhoun County | 88059 | 8.6226280 | Moderate Confidence | Use with caution - monitor outcomes |
| Carroll County | 61539 | 5.5850761 | Moderate Confidence | Use with caution - monitor outcomes |
| Cass County | 64826 | 8.6153704 | Moderate Confidence | Use with caution - monitor outcomes |
| Champaign County | 61090 | 2.6600098 | High Confidence | Safe for algorithmic decisions |
| Christian County | 56933 | 5.1955808 | Moderate Confidence | Use with caution - monitor outcomes |
| Clark County | 65874 | 5.6061572 | Moderate Confidence | Use with caution - monitor outcomes |
| Clay County | 58028 | 9.5695182 | Moderate Confidence | Use with caution - monitor outcomes |
| Clinton County | 78054 | 4.5942553 | High Confidence | Safe for algorithmic decisions |
| Coles County | 53732 | 7.5857962 | Moderate Confidence | Use with caution - monitor outcomes |
| Cook County | 78304 | 0.7292092 | High Confidence | Safe for algorithmic decisions |
| Crawford County | 64163 | 9.3153375 | Moderate Confidence | Use with caution - monitor outcomes |
| Cumberland County | 71274 | 10.5859079 | Low Confidence | Requires manual review or additional data |
| DeKalb County | 68617 | 3.4029468 | High Confidence | Safe for algorithmic decisions |
| De Witt County | 61823 | 9.5223461 | Moderate Confidence | Use with caution - monitor outcomes |
| Douglas County | 67177 | 8.0072048 | Moderate Confidence | Use with caution - monitor outcomes |
| DuPage County | 107035 | 1.1753165 | High Confidence | Safe for algorithmic decisions |
| Edgar County | 56687 | 11.2459647 | Low Confidence | Requires manual review or additional data |
| Edwards County | 60784 | 8.1172677 | Moderate Confidence | Use with caution - monitor outcomes |
| Effingham County | 73181 | 5.2076359 | Moderate Confidence | Use with caution - monitor outcomes |
| Fayette County | 51962 | 6.9108194 | Moderate Confidence | Use with caution - monitor outcomes |
| Ford County | 58930 | 7.1423723 | Moderate Confidence | Use with caution - monitor outcomes |
| Franklin County | 51031 | 5.6259920 | Moderate Confidence | Use with caution - monitor outcomes |
| Fulton County | 57223 | 4.5890638 | High Confidence | Safe for algorithmic decisions |
| Gallatin County | 51868 | 10.4920182 | Low Confidence | Requires manual review or additional data |
| Greene County | 58900 | 4.3463497 | High Confidence | Safe for algorithmic decisions |
| Grundy County | 89993 | 3.4513796 | High Confidence | Safe for algorithmic decisions |
| Hamilton County | 60574 | 10.6118136 | Low Confidence | Requires manual review or additional data |
| Hancock County | 61026 | 6.1236194 | Moderate Confidence | Use with caution - monitor outcomes |
| Hardin County | 53026 | 20.7558556 | Low Confidence | Requires manual review or additional data |
| Henderson County | 64946 | 7.8988698 | Moderate Confidence | Use with caution - monitor outcomes |
| Henry County | 66313 | 5.3217318 | Moderate Confidence | Use with caution - monitor outcomes |
| Iroquois County | 62866 | 5.1426844 | Moderate Confidence | Use with caution - monitor outcomes |
| Jackson County | 44847 | 6.5712311 | Moderate Confidence | Use with caution - monitor outcomes |
| Jasper County | 67429 | 8.4563022 | Moderate Confidence | Use with caution - monitor outcomes |
| Jefferson County | 58384 | 3.7424637 | High Confidence | Safe for algorithmic decisions |
| Jersey County | 77607 | 9.5828984 | Moderate Confidence | Use with caution - monitor outcomes |
| Jo Daviess County | 67729 | 6.2011841 | Moderate Confidence | Use with caution - monitor outcomes |
| Johnson County | 63295 | 5.4696264 | Moderate Confidence | Use with caution - monitor outcomes |
| Kane County | 96400 | 1.8060166 | High Confidence | Safe for algorithmic decisions |
| Kankakee County | 65489 | 4.2785811 | High Confidence | Safe for algorithmic decisions |
| Kendall County | 106358 | 3.0641795 | High Confidence | Safe for algorithmic decisions |
| Knox County | 50263 | 6.6410680 | Moderate Confidence | Use with caution - monitor outcomes |
| Lake County | 104553 | 1.3954645 | High Confidence | Safe for algorithmic decisions |
| LaSalle County | 67942 | 3.1571046 | High Confidence | Safe for algorithmic decisions |
| Lawrence County | 55811 | 12.1427676 | Low Confidence | Requires manual review or additional data |
| Lee County | 64588 | 4.5503809 | High Confidence | Safe for algorithmic decisions |
| Livingston County | 68175 | 4.5544554 | High Confidence | Safe for algorithmic decisions |
| Logan County | 62547 | 4.7564232 | High Confidence | Safe for algorithmic decisions |
| McDonough County | 48904 | 6.7356453 | Moderate Confidence | Use with caution - monitor outcomes |
| McHenry County | 100101 | 2.3036733 | High Confidence | Safe for algorithmic decisions |
| McLean County | 75356 | 3.4383460 | High Confidence | Safe for algorithmic decisions |
| Macon County | 59622 | 3.6513368 | High Confidence | Safe for algorithmic decisions |
| Macoupin County | 64706 | 4.3705375 | High Confidence | Safe for algorithmic decisions |
| Madison County | 71759 | 2.1056592 | High Confidence | Safe for algorithmic decisions |
| Marion County | 59099 | 3.9086956 | High Confidence | Safe for algorithmic decisions |
| Marshall County | 64940 | 5.7083462 | Moderate Confidence | Use with caution - monitor outcomes |
| Mason County | 58479 | 5.2993382 | Moderate Confidence | Use with caution - monitor outcomes |
| Massac County | 57365 | 9.4395537 | Moderate Confidence | Use with caution - monitor outcomes |
| Menard County | 84846 | 6.4599392 | Moderate Confidence | Use with caution - monitor outcomes |
| Mercer County | 67028 | 4.7248911 | High Confidence | Safe for algorithmic decisions |
| Monroe County | 100685 | 4.7385410 | High Confidence | Safe for algorithmic decisions |
| Montgomery County | 61796 | 6.0780633 | Moderate Confidence | Use with caution - monitor outcomes |
| Morgan County | 61188 | 5.7560306 | Moderate Confidence | Use with caution - monitor outcomes |
| Moultrie County | 72833 | 12.9089836 | Low Confidence | Requires manual review or additional data |
| Ogle County | 75782 | 4.2279169 | High Confidence | Safe for algorithmic decisions |
| Peoria County | 63409 | 2.3387847 | High Confidence | Safe for algorithmic decisions |
| Perry County | 56338 | 6.0722780 | Moderate Confidence | Use with caution - monitor outcomes |
| Piatt County | 81151 | 10.7145938 | Low Confidence | Requires manual review or additional data |
| Pike County | 55514 | 6.6721908 | Moderate Confidence | Use with caution - monitor outcomes |
| Pope County | 57582 | 20.5116182 | Low Confidence | Requires manual review or additional data |
| Pulaski County | 41038 | 10.8436084 | Low Confidence | Requires manual review or additional data |
| Putnam County | 75726 | 9.9543090 | Moderate Confidence | Use with caution - monitor outcomes |
| Randolph County | 63860 | 5.1926088 | Moderate Confidence | Use with caution - monitor outcomes |
| Richland County | 61607 | 9.7943416 | Moderate Confidence | Use with caution - monitor outcomes |
| Rock Island County | 64435 | 3.0682083 | High Confidence | Safe for algorithmic decisions |
| St. Clair County | 68915 | 2.6583472 | High Confidence | Safe for algorithmic decisions |
| Saline County | 51710 | 5.0222394 | Moderate Confidence | Use with caution - monitor outcomes |
| Sangamon County | 71653 | 2.5749096 | High Confidence | Safe for algorithmic decisions |
| Schuyler County | 63737 | 10.5825502 | Low Confidence | Requires manual review or additional data |
| Scott County | 70500 | 9.7588652 | Moderate Confidence | Use with caution - monitor outcomes |
| Shelby County | 65585 | 4.7968285 | High Confidence | Safe for algorithmic decisions |
| Stark County | 58125 | 9.2920430 | Moderate Confidence | Use with caution - monitor outcomes |
| Stephenson County | 57527 | 4.1180663 | High Confidence | Safe for algorithmic decisions |
| Tazewell County | 74606 | 2.5333083 | High Confidence | Safe for algorithmic decisions |
| Union County | 54090 | 9.9242004 | Moderate Confidence | Use with caution - monitor outcomes |
| Vermilion County | 52787 | 3.4459242 | High Confidence | Safe for algorithmic decisions |
| Wabash County | 54074 | 10.8776861 | Low Confidence | Requires manual review or additional data |
| Warren County | 62700 | 11.4003190 | Low Confidence | Requires manual review or additional data |
| Washington County | 75111 | 6.4504533 | Moderate Confidence | Use with caution - monitor outcomes |
| Wayne County | 53522 | 7.3016703 | Moderate Confidence | Use with caution - monitor outcomes |
| White County | 54605 | 9.8177823 | Moderate Confidence | Use with caution - monitor outcomes |
| Whiteside County | 62828 | 6.3124721 | Moderate Confidence | Use with caution - monitor outcomes |
| Will County | 103678 | 1.3908447 | High Confidence | Safe for algorithmic decisions |
| Williamson County | 60325 | 5.2598425 | Moderate Confidence | Use with caution - monitor outcomes |
| Winnebago County | 61738 | 2.1639833 | High Confidence | Safe for algorithmic decisions |
| Woodford County | 80093 | 5.0241594 | Moderate Confidence | Use with caution - monitor outcomes |

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

  1. Counties suitable for immediate algorithmic implementation:

There are 38 counties in the High Confidence category. I will not list them all here; refer to the alphabetically ordered table above to find a specific county. They are appropriate for algorithmic use because their margins of error are low, leaving little risk of over- or underestimating median income, so a well-designed algorithm can inform accurate policy decisions.

  2. Counties requiring additional oversight:

There are 49 counties in the Moderate Confidence category. These counties have somewhat reliable median income estimates, but if they are used for algorithmic decision-making, outcomes should be monitored and, if possible, proxies or other data sources should be used to corroborate or challenge the estimates.

  3. Counties needing alternative approaches:

There are 15 counties in the Low Confidence category. These counties have unreliable data, and their median income estimates may well be inaccurate. For these counties, additional data would be required before making any decisions, so as not to risk misinformed decisions. Proxies, other data sources, or non-algorithmic approaches may be necessary.

Questions for Further Investigation

[List 2-3 questions that your analysis raised that you’d like to explore further in future assignments. Consider questions about spatial patterns, time trends, or other demographic factors.]

1: Is it possible to accumulate other data to lower margins of error by confirming estimates – if so, how do you know if the additional data successfully confirms or denies an estimate?

2: If there is a large dataset like the ACS for a whole city, but the data within each census tract has high errors, at what scale does the data become acceptable to use? For instance, if I know that a group of census tracts all have high errors on their racial demographics, at what geographic scale is it acceptable to use them for decisions, and how do you know?

3: How frequently does unreliable data get used in policy decisions? Is there a way to measure its effects?

Technical Notes

Data Sources:

  • U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
  • Retrieved via tidycensus R package on [date]

Reproducibility:

  • All analysis conducted in R version [your version]
  • Census API key required for replication
  • Complete code and documentation available at: [your portfolio URL]

Methodology Notes: [Describe any decisions you made about data processing, county selection, or analytical choices that might affect reproducibility]

Limitations: [Note any limitations in your analysis - sample size issues, geographic scope, temporal factors, etc.]


Submission Checklist

Before submitting your portfolio link on Canvas:

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html