Lab 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Wil Rantala

Published

March 17, 2026

Assignment Overview

Scenario

You are a data analyst for the Illinois Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

Apply dplyr functions to real census data for policy analysis
Evaluate data quality using margins of error
Connect technical analysis to algorithmic decision-making
Identify potential equity implications of data reliability issues
Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/

Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"

If the text contains a special character such as a colon, wrap it in double quotes so that Quarto reads it as a string.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)
# Set your Census API key

# Choose your state for analysis - assign it to a variable called my_state
my_state <- "Illinois"

State Selection: I have chosen Illinois for this analysis because: I am from Chicago and am interested in learning more about general Census data information for my home state! Illinois has some very different regions – Chicago (Cook County) and surrounding counties are quite different from the rest of the state, so there should be a good variety of trends.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here

county_data <- get_acs(
  geography = "county",
  variables = c(
    med_income = "B19013_001",
    tot_pop = "B01003_001"
  ),
  state = my_state,
  year = 2022,
  survey = "acs5",  # stated explicitly to match the requirements
  output = "wide"
)

# Clean the county names by removing the state name
# (the "County" suffix is kept; later filters match names such as "Cook County")
# Hint: use mutate() with str_remove()

county_data <- county_data %>%
  mutate(county_name = str_remove(NAME, paste0(", ", my_state)))

# Display the first few rows
glimpse(county_data)

Rows: 102
Columns: 7
$ GEOID       <chr> "17001", "17003", "17005", "17007", "17009", "17011", "170…
$ NAME        <chr> "Adams County, Illinois", "Alexander County, Illinois", "B…
$ med_incomeE <dbl> 63767, 40365, 58617, 80502, 64760, 64165, 88059, 61539, 64…
$ med_incomeM <dbl> 2375, 8008, 5412, 3759, 7747, 3293, 7593, 3437, 5585, 1625…
$ tot_popE    <dbl> 65583, 5261, 16750, 53459, 6334, 33203, 4472, 15594, 12955…
$ tot_popM    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ county_name <chr> "Adams County", "Alexander County", "Bond County", "Boone …
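
As a quick check of the name-cleaning step that needs no API call, the sketch below applies the same idea to three NAME values copied from the output above. It uses base R's sub() rather than str_remove(), and the variable names (raw_names, no_state, short_names) are illustrative only. The second step, which also strips the " County" suffix, is optional and is not what this lab does.

```r
# NAME values copied from the glimpse() output above
raw_names <- c("Adams County, Illinois",
               "Alexander County, Illinois",
               "Bond County, Illinois")
state_name <- "Illinois"

# Step 1: drop the ", <state>" suffix (what this lab does)
no_state <- sub(paste0(", ", state_name, "$"), "", raw_names)

# Step 2 (optional, not used in this lab): also drop the trailing " County"
short_names <- sub(" County$", "", no_state)

print(no_state)
print(short_names)
```

Anchoring both patterns with `$` means only a suffix is removed, so a county whose name happens to contain the state name elsewhere would be left alone.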

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
county_data <- county_data %>%
  mutate(
    MOE_percent = (med_incomeM / med_incomeE) * 100,
    reliability = case_when(
      MOE_percent < 5 ~ "High Confidence",
      MOE_percent <= 10 ~ "Moderate Confidence",
      TRUE ~ "Low Confidence"
    ),
    # Flag unreliable estimates (MOE above 10%), as the requirements ask
    unreliable = MOE_percent > 10
  )

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
reliability_summary <- county_data %>%
  count(reliability, name = "counties") %>%
    mutate(percent = counties / sum(counties) * 100)
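
To make the formula concrete, here is a minimal worked example using the first three counties from the glimpse() output above. It is a base R sketch of the same arithmetic, not a replacement for the dplyr pipeline, and the data frame name `counties` is illustrative.

```r
# Estimates and margins of error copied from the glimpse() output above
counties <- data.frame(
  county   = c("Adams County", "Alexander County", "Bond County"),
  estimate = c(63767, 40365, 58617),
  moe      = c(2375, 8008, 5412)
)

# MOE as a percentage of the estimate
counties$moe_pct <- counties$moe / counties$estimate * 100

# Same thresholds as the lab: below 5 is high, 5 to 10 is moderate, above 10 is low
counties$reliability <- ifelse(
  counties$moe_pct < 5, "High Confidence",
  ifelse(counties$moe_pct <= 10, "Moderate Confidence", "Low Confidence")
)

print(counties)
```

The three rows land in three different categories (about 3.7%, 19.8%, and 9.2%), which matches the values reported for these counties in the full table later in the lab.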

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
top5MOE <- county_data %>%
  arrange(desc(MOE_percent)) %>%
    slice(1:5) %>%
      select(county_name, med_incomeE, med_incomeM, MOE_percent, reliability)

# Format as table with kable() - include appropriate column names and caption
kable(top5MOE, col.names = c(
  "County", "Median Household Income", "Margin of Error", "MOE %", "Reliability"),
  caption = "Five IL Counties with the Highest MOE"
)

Five IL Counties with the Highest MOE
County	Median Household Income	Margin of Error	MOE %	Reliability
Hardin County	53026	11006	20.75586	Low Confidence
Pope County	57582	11811	20.51162	Low Confidence
Alexander County	40365	8008	19.83897	Low Confidence
Moultrie County	72833	9402	12.90898	Low Confidence
Lawrence County	55811	6777	12.14277	Low Confidence

Data Quality Commentary:

[Write 2-3 sentences explaining what these results mean for algorithmic decision-making. Consider: Which counties might be poorly served by algorithms that rely on this income data? What factors might contribute to higher uncertainty?]

For counties with such poor data reliability, an algorithm that relies on this income data may over- or under-allocate resources, attention, or programs. For instance, the estimated median incomes of Pope and Moultrie Counties are about $15,000 apart, but their margins of error (roughly $11,800 and $9,400) are large enough that the two ranges overlap, so the actual medians could feasibly be quite similar. The high MOEs could be caused by low populations (and thus small sample sizes) or by large variability in incomes within each county.
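
The comparison of Pope and Moultrie Counties can be checked directly from the table above. The sketch below builds the range implied by estimate plus or minus MOE for each county (ACS margins of error are published at the 90% confidence level) and tests whether the ranges overlap. The helper `interval()` is a hypothetical name, and an overlap check is only a rough screen, not a formal significance test.

```r
# Estimates and MOEs copied from the top-5 table above
pope     <- c(estimate = 57582, moe = 11811)
moultrie <- c(estimate = 72833, moe = 9402)

# Range implied by estimate +/- MOE (hypothetical helper)
interval <- function(x) {
  c(lower = x[["estimate"]] - x[["moe"]],
    upper = x[["estimate"]] + x[["moe"]])
}

pope_ci     <- interval(pope)
moultrie_ci <- interval(moultrie)

# Two ranges overlap when each one starts before the other ends
ranges_overlap <- pope_ci[["lower"]] <= moultrie_ci[["upper"]] &&
  moultrie_ci[["lower"]] <= pope_ci[["upper"]]

print(pope_ci)
print(moultrie_ci)
print(ranges_overlap)
```

Pope County's range runs from 45,771 to 69,393 and Moultrie County's from 63,431 to 82,235, so the two share the stretch between 63,431 and 69,393.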

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- county_data %>%
  filter(county_name %in% c("Cook County", "Effingham County", "Pulaski County")) %>%
  select(county_name, med_incomeE, MOE_percent, reliability)
# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category

kable(selected_counties, col.names = c(
  "County", "Median Household Income", "MOE %", "Reliability"),
  caption = "Three IL Counties with Varying Reliability")

Three IL Counties with Varying Reliability
County	Median Household Income	MOE %	Reliability
Cook County	78304	0.7292092	High Confidence
Effingham County	73181	5.2076359	Moderate Confidence
Pulaski County	41038	10.8436084	Low Confidence

Comment on the output: Cook County, the most populous county in the state, has an MOE of less than 1% of its estimate, which makes sense because its sample is massive compared to those of less populous counties. Pulaski County has a far lower estimated median household income than Cook and Effingham Counties, and also a much worse reliability score. I wonder whether counties with lower income estimates generally have worse reliability, since larger cities and towns often have higher median incomes.

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.

# Define your race/ethnicity variables with descriptive names
tract_variables <- c(white = "B03002_003",
                     black = "B03002_004",
                     hispanic = "B03002_012",
                     total_pop = "B03002_001")
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
tract_data <- get_acs(
  geography = "tract",
  variables = tract_variables,
  state = my_state,
  county = c("Cook", "Effingham", "Pulaski"),
  year = 2022,
  output = "wide"
)
# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_data <- tract_data %>%
  mutate(white_per = whiteE / total_popE * 100, 
         black_per = blackE / total_popE * 100,
         hispanic_per = hispanicE / total_popE * 100)
# Add readable tract and county name columns using str_extract() or similar
### I looked up the GEOID FIPS and tract codes to use the GEOID to extract the county and tract names for each tract. Then I selected for certain digits within the 11 digit GEOID using str_sub.
tract_data <- tract_data %>%
  mutate(
    il_county = str_sub(GEOID, 3, 5), #select the 3-5 digits for county
    tract_number = str_sub(GEOID, 6, 11) #select the 6-11 digits for tract
  )

### now I have to change the county codes (031, 049, 153) to their associated names (Cook, Effingham, Pulaski)

tract_data <- tract_data %>%
  mutate(
    county_name = recode(il_county,
                         "031" = "Cook",
                         "049" = "Effingham",
                         "153" = "Pulaski")
    )
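
The GEOID slicing above relies on the fixed layout of a tract GEOID: a 2-digit state code, a 3-digit county code, and a 6-digit tract code. As a minimal sketch, the example below splits a single GEOID with base R's substr() instead of str_sub(). The GEOID string is assembled from pieces that appear in this lab (state 17, county 031, tract 301308), and the names `county_lookup` and `county_label` are illustrative.

```r
# An 11-digit tract GEOID: 2-digit state + 3-digit county + 6-digit tract
geoid <- "17031301308"

state_fips  <- substr(geoid, 1, 2)
county_fips <- substr(geoid, 3, 5)
tract_code  <- substr(geoid, 6, 11)

# Lookup for the three counties used in this lab
county_lookup <- c("031" = "Cook", "049" = "Effingham", "153" = "Pulaski")
county_label  <- unname(county_lookup[county_fips])

print(c(state = state_fips, county = county_fips,
        tract = tract_code, name = county_label))
```

Keeping the codes as character strings matters here, because converting "031" to a number would drop the leading zero and break the lookup.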

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract

highest_hisp <- tract_data %>%
  arrange(desc(hispanic_per)) %>%
    slice(1) %>%
      select(county_name, tract_number, hispanic_per, black_per, white_per, total_popE)

# The tract with the highest percentage of Hispanic/Latino residents is tract 301308 in Cook County (Chicago), at 98.83%.

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group

demo_by_county <- tract_data %>%
  group_by(county_name) %>%
  summarize(
    n_tracts = n(),
    avg_white = mean(white_per, na.rm = TRUE),
    avg_black = mean(black_per, na.rm = TRUE),
    avg_hisp = mean(hispanic_per, na.rm = TRUE)
  )

# Create a nicely formatted table of your results using kable()

kable(demo_by_county, col.names = c(
  "County", "Number of Tracts", "Average % White", "Average % Black", "Average % Hispanic"),
  caption = "Three IL Counties and their Racial Demographics")

Three IL Counties and their Racial Demographics
County	Number of Tracts	Average % White	Average % Black	Average % Hispanic
Cook	1332	38.32627	27.0677340	24.721704
Effingham	8	94.99167	0.5021559	2.363741
Pulaski	2	62.57444	31.5587510	1.901316

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)

tract_data <- tract_data %>%
  mutate(MOE_pct_white = (whiteM / whiteE) * 100,
         MOE_pct_black = (blackM / blackE) * 100,
         MOE_pct_hispanic = (hispanicM / hispanicE) * 100)

# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
# Note: this flag uses a 30% threshold, not the 15% named in the requirements

tract_data <- tract_data %>%
  mutate(higherror = ifelse(
    MOE_pct_white > 30 |
    MOE_pct_black > 30 |
    MOE_pct_hispanic > 30,
    1, 0
  ))

# Create summary statistics showing how many tracts have data quality issues

tract_data %>%
  summarize(
    total_tracts = n(),
    n_flagged = sum(higherror),
    per_flagged = mean(higherror) * 100 
    #since higherror is either 1 or 0, the mean * 100 works as the percent flagged
  )

# A tibble: 1 × 3
  total_tracts n_flagged per_flagged
         <int>     <dbl>       <dbl>
1         1342      1330        99.1
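
Two details of this flag are easy to miss, so here is a small sketch with hypothetical tract values (the numbers in `est` and `moe` are made up for illustration). First, a group with an estimate of zero produces an infinite MOE percentage in R, which exceeds any threshold and so flags the tract. I have not verified how many tracts in this lab fall into that case, but it may be one reason the flagged share is so high. Second, the mean of a 0/1 flag is the share of flagged rows, which reproduces the 99.1% shown above.

```r
# Hypothetical tract values, chosen only to show the edge case
est <- c(1200, 40, 0)   # estimated residents in one group
moe <- c(150, 35, 13)   # matching margins of error

moe_pct <- moe / est * 100           # a zero estimate gives Inf
flag    <- ifelse(moe_pct > 30, 1, 0)

print(moe_pct)
print(flag)

# The mean of a 0/1 flag is the share of rows flagged
share_flagged <- mean(flag) * 100

# Check against the summary above: 1330 of 1342 tracts
reported_share <- 1330 / 1342 * 100
print(round(reported_share, 1))
```

If zero-estimate groups should not count against a tract, one option is to compute the percentage only where the estimate is positive, though that is a design choice this lab does not make.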

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages

quality_dist <- tract_data %>%
  group_by(higherror) %>%
  summarize(
    avg_pop = mean(total_popE, na.rm = TRUE),
    avg_pctwhite = mean(white_per, na.rm = TRUE),
    avg_pctblack = mean(black_per, na.rm = TRUE),
    avg_pcthisp = mean(hispanic_per, na.rm = TRUE)
  )

# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns

kable(quality_dist, col.names = c(
  "MOE Percent > 30%", "Average Population", "Average % White", "Average % Black", "Average % Hispanic"),
  caption = "High and Low MOE Tracts and their Demographics")

High and Low MOE Tracts and their Demographics
MOE Percent > 30%	Average Population	Average % White	Average % Black	Average % Hispanic
0	5627.667	33.61835	27.18073	29.8925
1	3907.974	38.74732	26.91321	24.5056

Pattern Analysis: Tracts with any MOE greater than 30% have, on average, lower populations, a higher proportion of White residents, and a lower proportion of Hispanic residents. Note that about 99% of tracts (1,330 of 1,342) had at least one demographic category with an MOE greater than 30%, so the unflagged group contains only 12 tracts and its averages should be read with caution. That said, it makes sense for tracts with lower MOEs to have greater average populations, as larger samples generally produce lower MOEs.
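
The link between sample size and margin of error can be illustrated with the textbook formula for a proportion under simple random sampling. This is only a rough guide: the ACS uses a more complex sample design, and the share `p`, the sample sizes `n`, and the helper `moe_prop()` are all hypothetical.

```r
# 90% margin of error for a proportion under simple random sampling.
# Assumption: a simplification of the ACS design, for intuition only.
moe_prop <- function(p, n, z = 1.645) z * sqrt(p * (1 - p) / n)

p <- 0.30               # hypothetical group share
n <- c(100, 400, 1600)  # hypothetical sample sizes

moe <- moe_prop(p, n)
print(round(moe * 100, 2))  # margins of error in percentage points
```

Under this formula, quadrupling the sample size halves the margin of error, which is why small tracts and small subgroups within tracts tend to carry the largest relative errors.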

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

1. Across the analyses in this lab, one key pattern is that counties with smaller populations tend to have lower incomes and higher margins of error. At the county level, analyses of population, income, and other demographics carry less risk of large error, but at the census tract level errors become too large to confidently draw conclusions or inform policy. For some variables, such as total population or median income (depending on variability within the sample), tract-level errors may be low enough to inform decisions. Demographic data such as race, which divides the population into smaller groups and so increases MOEs, is much less reliable.

2. In my findings, counties with smaller populations and fewer census tracts face a greater risk of algorithmic bias, because their errors for median income and other social and demographic factors are far greater than those of counties with larger populations (such as Cook). Other at-risk groups include populations in highly racially diverse census tracts, where each subset of the tract’s population is smaller and faces greater error due to its small sample size. Large margins of error may lead policies to direct more benefits to populations that need them less than others do.

3. The underlying factors that increase risk and bias are small sample sizes and high variability within those samples. For instance, if a county’s residents have widely ranging incomes, a median income based on a sample is more likely to be inaccurate for the whole population. Another risk is using small data samples (such as census tracts) to inform policy: if resources are rationed based on tract-level income, race, or other data, errors are likely to create misalignments with reality.

4. To address these systematic issues, algorithms that influence policy decisions should avoid small-sample datasets where possible and supplement their inputs with 5-year ACS estimates, Decennial Census data, and other available sources. Means beyond conventional government data collection may also be used, depending on policy goals. For instance, if a state government is deciding which counties should receive more funding for transportation and infrastructure, it should draw on infrastructure quality assessments, traffic surveys, and travel demand modeling in addition to whatever census or ACS data is used.

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category

county_decision <- county_data %>%
  select(county_name, med_incomeE, MOE_percent, reliability)

# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"

county_decision <- county_decision %>%
  mutate(algo_rec = case_when(
    MOE_percent < 5 ~ 'Safe for algorithmic decisions',
    MOE_percent <= 10 ~ 'Use with caution - monitor outcomes',
    TRUE ~ 'Requires manual review or additional data'))

# Format as a professional table with kable()

kable(county_decision, col.names = c(
        "IL County", "Median Income", "Margin of Error percent", "Reliability", "Algorithmic Use Recommendation"),
      caption = "Illinois Counties Median Incomes and Data Reliability")

Illinois Counties Median Incomes and Data Reliability
IL County	Median Income	Margin of Error percent	Reliability	Algorithmic Use Recommendation
Adams County	63767	3.7244970	High Confidence	Safe for algorithmic decisions
Alexander County	40365	19.8389694	Low Confidence	Requires manual review or additional data
Bond County	58617	9.2328164	Moderate Confidence	Use with caution - monitor outcomes
Boone County	80502	4.6694492	High Confidence	Safe for algorithmic decisions
Brown County	64760	11.9626313	Low Confidence	Requires manual review or additional data
Bureau County	64165	5.1320814	Moderate Confidence	Use with caution - monitor outcomes
Calhoun County	88059	8.6226280	Moderate Confidence	Use with caution - monitor outcomes
Carroll County	61539	5.5850761	Moderate Confidence	Use with caution - monitor outcomes
Cass County	64826	8.6153704	Moderate Confidence	Use with caution - monitor outcomes
Champaign County	61090	2.6600098	High Confidence	Safe for algorithmic decisions
Christian County	56933	5.1955808	Moderate Confidence	Use with caution - monitor outcomes
Clark County	65874	5.6061572	Moderate Confidence	Use with caution - monitor outcomes
Clay County	58028	9.5695182	Moderate Confidence	Use with caution - monitor outcomes
Clinton County	78054	4.5942553	High Confidence	Safe for algorithmic decisions
Coles County	53732	7.5857962	Moderate Confidence	Use with caution - monitor outcomes
Cook County	78304	0.7292092	High Confidence	Safe for algorithmic decisions
Crawford County	64163	9.3153375	Moderate Confidence	Use with caution - monitor outcomes
Cumberland County	71274	10.5859079	Low Confidence	Requires manual review or additional data
DeKalb County	68617	3.4029468	High Confidence	Safe for algorithmic decisions
De Witt County	61823	9.5223461	Moderate Confidence	Use with caution - monitor outcomes
Douglas County	67177	8.0072048	Moderate Confidence	Use with caution - monitor outcomes
DuPage County	107035	1.1753165	High Confidence	Safe for algorithmic decisions
Edgar County	56687	11.2459647	Low Confidence	Requires manual review or additional data
Edwards County	60784	8.1172677	Moderate Confidence	Use with caution - monitor outcomes
Effingham County	73181	5.2076359	Moderate Confidence	Use with caution - monitor outcomes
Fayette County	51962	6.9108194	Moderate Confidence	Use with caution - monitor outcomes
Ford County	58930	7.1423723	Moderate Confidence	Use with caution - monitor outcomes
Franklin County	51031	5.6259920	Moderate Confidence	Use with caution - monitor outcomes
Fulton County	57223	4.5890638	High Confidence	Safe for algorithmic decisions
Gallatin County	51868	10.4920182	Low Confidence	Requires manual review or additional data
Greene County	58900	4.3463497	High Confidence	Safe for algorithmic decisions
Grundy County	89993	3.4513796	High Confidence	Safe for algorithmic decisions
Hamilton County	60574	10.6118136	Low Confidence	Requires manual review or additional data
Hancock County	61026	6.1236194	Moderate Confidence	Use with caution - monitor outcomes
Hardin County	53026	20.7558556	Low Confidence	Requires manual review or additional data
Henderson County	64946	7.8988698	Moderate Confidence	Use with caution - monitor outcomes
Henry County	66313	5.3217318	Moderate Confidence	Use with caution - monitor outcomes
Iroquois County	62866	5.1426844	Moderate Confidence	Use with caution - monitor outcomes
Jackson County	44847	6.5712311	Moderate Confidence	Use with caution - monitor outcomes
Jasper County	67429	8.4563022	Moderate Confidence	Use with caution - monitor outcomes
Jefferson County	58384	3.7424637	High Confidence	Safe for algorithmic decisions
Jersey County	77607	9.5828984	Moderate Confidence	Use with caution - monitor outcomes
Jo Daviess County	67729	6.2011841	Moderate Confidence	Use with caution - monitor outcomes
Johnson County	63295	5.4696264	Moderate Confidence	Use with caution - monitor outcomes
Kane County	96400	1.8060166	High Confidence	Safe for algorithmic decisions
Kankakee County	65489	4.2785811	High Confidence	Safe for algorithmic decisions
Kendall County	106358	3.0641795	High Confidence	Safe for algorithmic decisions
Knox County	50263	6.6410680	Moderate Confidence	Use with caution - monitor outcomes
Lake County	104553	1.3954645	High Confidence	Safe for algorithmic decisions
LaSalle County	67942	3.1571046	High Confidence	Safe for algorithmic decisions
Lawrence County	55811	12.1427676	Low Confidence	Requires manual review or additional data
Lee County	64588	4.5503809	High Confidence	Safe for algorithmic decisions
Livingston County	68175	4.5544554	High Confidence	Safe for algorithmic decisions
Logan County	62547	4.7564232	High Confidence	Safe for algorithmic decisions
McDonough County	48904	6.7356453	Moderate Confidence	Use with caution - monitor outcomes
McHenry County	100101	2.3036733	High Confidence	Safe for algorithmic decisions
McLean County	75356	3.4383460	High Confidence	Safe for algorithmic decisions
Macon County	59622	3.6513368	High Confidence	Safe for algorithmic decisions
Macoupin County	64706	4.3705375	High Confidence	Safe for algorithmic decisions
Madison County	71759	2.1056592	High Confidence	Safe for algorithmic decisions
Marion County	59099	3.9086956	High Confidence	Safe for algorithmic decisions
Marshall County	64940	5.7083462	Moderate Confidence	Use with caution - monitor outcomes
Mason County	58479	5.2993382	Moderate Confidence	Use with caution - monitor outcomes
Massac County	57365	9.4395537	Moderate Confidence	Use with caution - monitor outcomes
Menard County	84846	6.4599392	Moderate Confidence	Use with caution - monitor outcomes
Mercer County	67028	4.7248911	High Confidence	Safe for algorithmic decisions
Monroe County	100685	4.7385410	High Confidence	Safe for algorithmic decisions
Montgomery County	61796	6.0780633	Moderate Confidence	Use with caution - monitor outcomes
Morgan County	61188	5.7560306	Moderate Confidence	Use with caution - monitor outcomes
Moultrie County	72833	12.9089836	Low Confidence	Requires manual review or additional data
Ogle County	75782	4.2279169	High Confidence	Safe for algorithmic decisions
Peoria County	63409	2.3387847	High Confidence	Safe for algorithmic decisions
Perry County	56338	6.0722780	Moderate Confidence	Use with caution - monitor outcomes
Piatt County	81151	10.7145938	Low Confidence	Requires manual review or additional data
Pike County	55514	6.6721908	Moderate Confidence	Use with caution - monitor outcomes
Pope County	57582	20.5116182	Low Confidence	Requires manual review or additional data
Pulaski County	41038	10.8436084	Low Confidence	Requires manual review or additional data
Putnam County	75726	9.9543090	Moderate Confidence	Use with caution - monitor outcomes
Randolph County	63860	5.1926088	Moderate Confidence	Use with caution - monitor outcomes
Richland County	61607	9.7943416	Moderate Confidence	Use with caution - monitor outcomes
Rock Island County	64435	3.0682083	High Confidence	Safe for algorithmic decisions
St. Clair County	68915	2.6583472	High Confidence	Safe for algorithmic decisions
Saline County	51710	5.0222394	Moderate Confidence	Use with caution - monitor outcomes
Sangamon County	71653	2.5749096	High Confidence	Safe for algorithmic decisions
Schuyler County	63737	10.5825502	Low Confidence	Requires manual review or additional data
Scott County	70500	9.7588652	Moderate Confidence	Use with caution - monitor outcomes
Shelby County	65585	4.7968285	High Confidence	Safe for algorithmic decisions
Stark County	58125	9.2920430	Moderate Confidence	Use with caution - monitor outcomes
Stephenson County	57527	4.1180663	High Confidence	Safe for algorithmic decisions
Tazewell County	74606	2.5333083	High Confidence	Safe for algorithmic decisions
Union County	54090	9.9242004	Moderate Confidence	Use with caution - monitor outcomes
Vermilion County	52787	3.4459242	High Confidence	Safe for algorithmic decisions
Wabash County	54074	10.8776861	Low Confidence	Requires manual review or additional data
Warren County	62700	11.4003190	Low Confidence	Requires manual review or additional data
Washington County	75111	6.4504533	Moderate Confidence	Use with caution - monitor outcomes
Wayne County	53522	7.3016703	Moderate Confidence	Use with caution - monitor outcomes
White County	54605	9.8177823	Moderate Confidence	Use with caution - monitor outcomes
Whiteside County	62828	6.3124721	Moderate Confidence	Use with caution - monitor outcomes
Will County	103678	1.3908447	High Confidence	Safe for algorithmic decisions
Williamson County	60325	5.2598425	Moderate Confidence	Use with caution - monitor outcomes
Winnebago County	61738	2.1639833	High Confidence	Safe for algorithmic decisions
Woodford County	80093	5.0241594	Moderate Confidence	Use with caution - monitor outcomes
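
Because each recommendation follows directly from a reliability category, one option is to derive the recommendation from the category itself rather than repeating the numeric thresholds, so the two columns cannot disagree at a boundary such as exactly 5%. The sketch below shows this with a named lookup vector in base R. The names `rec_lookup` and `decision` are illustrative, and the last row is a hypothetical boundary case, not a real county.

```r
# One recommendation per reliability category (text from the lab instructions)
rec_lookup <- c(
  "High Confidence"     = "Safe for algorithmic decisions",
  "Moderate Confidence" = "Use with caution - monitor outcomes",
  "Low Confidence"      = "Requires manual review or additional data"
)

# Three rows copied from the table above, plus one hypothetical boundary case
decision <- data.frame(
  county  = c("Cook County", "Effingham County", "Pulaski County", "Hypothetical"),
  moe_pct = c(0.7292092, 5.2076359, 10.8436084, 5),
  stringsAsFactors = FALSE
)

# Categories use the lab's thresholds: below 5, 5 to 10, above 10
decision$reliability <- ifelse(
  decision$moe_pct < 5, "High Confidence",
  ifelse(decision$moe_pct <= 10, "Moderate Confidence", "Low Confidence")
)

# Derive the recommendation from the category, not from the raw percentage
decision$algo_rec <- unname(rec_lookup[decision$reliability])

print(decision)
```

With this approach, the thresholds live in one place, and a row at exactly 5% is treated as Moderate Confidence in both columns.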

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

Counties suitable for immediate algorithmic implementation:

There are 38 counties that fall within the High Confidence category. I will not list them all here; refer to the table above, which is in alphabetical order, to find a specific county. These counties are appropriate for algorithmic use because their margins of error are low, so there is little risk of over- or underestimating median income. If the algorithms are well designed, they should support accurate policy decisions.

Counties requiring additional oversight:

There are 49 counties that fall within the Moderate Confidence category. These counties have somewhat reliable median income estimates, but if they are used for algorithmic decision-making, outcomes should be monitored and, where possible, proxies or other data sources should be used to corroborate or challenge the estimates.

Counties needing alternative approaches:

There are 15 counties that fall within the Low Confidence category. These counties have unreliable data and likely provide inaccurate estimates of median income. For these counties, additional data would be required before making any decisions, so as not to risk misinformed outcomes. Proxies, other data sources, or non-algorithmic approaches may be necessary.
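
As a quick consistency check on the three counts above, the sketch below confirms that they add up to all 102 counties in the table and converts them to shares. It uses only the counts reported in this section, and the variable names are illustrative.

```r
# Category counts as reported in the recommendations above
counts <- c("High Confidence"     = 38,
            "Moderate Confidence" = 49,
            "Low Confidence"      = 15)

total <- sum(counts)                     # should equal the 102 counties in the table
share <- round(counts / total * 100, 1)  # percent of counties in each category

print(total)
print(share)
```

Roughly 37% of counties fall in the High Confidence group, 48% in Moderate, and 15% in Low, so close to two thirds of counties need at least some monitoring under this framework.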

Questions for Further Investigation

[List 2-3 questions that your analysis raised that you’d like to explore further in future assignments. Consider questions about spatial patterns, time trends, or other demographic factors.]

1: Is it possible to gather additional data to lower margins of error by confirming estimates? If so, how do you know whether the additional data confirms or contradicts an estimate?

2: If a large data set like the ACS covers a whole city, but the data within each census tract has high errors, at what scale does the data become acceptable to use? For instance, if I know that a group of census tracts all have high errors on their racial demographics, at what geographic scale is it acceptable to use them for decisions, and how do you know?

3: How frequently does unreliable data get used in policy decisions? Is there a way to measure its effects?
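
As a starting point for question 2, the Census Bureau publishes an approximation for the margin of error of a sum of estimates: the square root of the sum of the squared MOEs. The sketch below applies it to four hypothetical tracts (the values in `est` and `moe` are made up) to show how the relative error shrinks when tracts are combined. tidycensus offers a moe_sum() helper along these lines, but the example uses base R so that it runs on its own.

```r
# Hypothetical counts of one group in four neighboring tracts
est <- c(200, 180, 220, 200)
moe <- c(90, 85, 95, 90)

# Relative MOE of each tract on its own
tract_pct <- moe / est * 100

# Approximate MOE for the sum of the four estimates
combined_est <- sum(est)
combined_moe <- sqrt(sum(moe^2))
combined_pct <- combined_moe / combined_est * 100

print(round(tract_pct, 1))
print(round(combined_pct, 1))
```

In this made-up case each tract has a relative MOE above 40%, while the combined area is near 22.5%. That suggests one way to approach the question: aggregate tracts until the combined MOE percentage falls under whatever threshold the decision requires.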

Technical Notes

Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on [date]

Reproducibility:
- All analysis conducted in R version [your version]
- Census API key required for replication
- Complete code and documentation available at: [your portfolio URL]

Methodology Notes: [Describe any decisions you made about data processing, county selection, or analytical choices that might affect reproducibility]

Limitations: [Note any limitations in your analysis - sample size issues, geographic scope, temporal factors, etc.]

Submission Checklist

Before submitting your portfolio link on Canvas:

All code chunks run without errors
All “[Fill this in]” prompts have been completed
Tables are properly formatted and readable
Executive summary addresses all four required components
Portfolio navigation includes this assignment
Census API key is properly set
Document renders correctly to HTML

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html