Workers' Compensation Injury Frequency Prediction Study

Predictive Model for Recordable Injury Counts Using OSHA ITA Data (2016-2024)
Prepared: March 23, 2026
Data Source: OSHA Injury Tracking Application (ITA) Annual Summaries, OSHA Enforcement Data, OSHA Severe Injury Reports
Prepared for: MarisRisk Underwriting Analytics

Table of Contents

  1. Executive Summary
  2. Data Description
  3. Methodology
  4. Stage 1: Negative Binomial GLM
  5. Stage 2: Buhlmann-Straub Credibility
  6. Stage 3: LightGBM Residual Model
  7. Model Performance
  8. Size Bias Analysis
  9. Small Entity Considerations
  10. Intra-Industry Discrimination
  11. COVID-19 Structural Break Analysis
  12. Industry Rate Tables
  13. Limitations & Caveats
  14. Deployment & Governance Recommendations
  15. What This Model Can and Cannot Be Used For
  16. Recommended Next Steps Before Production Use

1. Executive Summary

This study develops and validates a predictive model for workers' compensation injury frequency using 692,412 establishment-year observations from OSHA's Injury Tracking Application (ITA) spanning 2016-2024. The model uses year Y features (prior injury rates, enforcement history, industry classification, employer size) to predict year Y+1 recordable injury counts.

Key Findings:
Important limitations: The OSHA ITA training population is not representative of all US employers. It is tilted toward larger establishments (250+ employees) and designated high-hazard industries. Generalization to small, low-hazard, or non-ITA-reporting employers is limited; predictions for such entities should be treated as class-rate priors with modest risk modifiers, not as individually calibrated estimates. The model also exhibits ~25% aggregate over-prediction on the 2022→2023 test set due to secular trend, requiring a calendar-year adjustment before production use. See Sections 7 and 13 for details.

The model architecture combines a Negative Binomial GLM with exposure offset (providing the interpretable base rate), Buhlmann-Straub credibility weighting (blending individual experience with industry priors), and a LightGBM model trained on GLM residuals (capturing nonlinear interactions, improving validation MAE by 19.1%). The GLM layer is fully transparent and standard in actuarial practice. The gradient boosting layer improves discrimination but introduces model governance considerations and explainability requirements discussed in Section 6. Model deployment is contingent on calendar-year adjustment and governance sign-off.

2. Data Description

2.1 Primary Data Source

The OSHA Injury Tracking Application (ITA) collects annual injury and illness data from establishments meeting reporting thresholds (generally 250+ employees, or 20+ employees in designated high-hazard industries). Each record represents one establishment-year and includes:

2.2 Supplementary Data Sources

2.3 Panel Construction

The model uses a lagged panel design: for each establishment with consecutive-year ITA records, we pair year Y features with year Y+1 outcomes. This produces 692,412 observation pairs across 267,091 unique establishments.

Panel size and TRIR over time

Exhibit 1: Panel size by feature year (left) and observed TRIR trends (right). The COVID-19 dip in 2020-2021 is visible. 2017 has lower volume due to ITA reporting changes.

2.4 Temporal Split (No Data Leakage)

SplitFeature YearsTarget YearsNPurpose
Train2016-20202017-2021346,191Model fitting
Validate2021202298,941Hyperparameter tuning
Test20222023116,957Final evaluation (all results reported here)
Holdout20232024130,323Reserved for future validation

2.5 Data Quality Filters

3. Methodology

3.1 Model Architecture

The model uses a three-stage architecture standard in actuarial pricing:

  1. Stage 1 — Negative Binomial GLM: Provides the interpretable base prediction with industry fixed effects and an exposure offset. This is the actuarially transparent component.
  2. Stage 2 — Buhlmann-Straub Credibility: Blends individual entity predictions with NAICS-4 group rates based on exposure volume, properly handling the small-entity problem.
  3. Stage 3 — LightGBM Residual Model: A gradient boosting model trained on GLM Pearson residuals to capture nonlinear interactions and feature interactions not explained by the base GLM, improving MAE by 19.1% on validation.

3.2 Target Variable and Exposure Treatment

Total recordable cases (integer count) for year Y+1, with log(total hours worked) as the exposure offset. This is the standard actuarial approach: modeling claim frequency per unit of exposure. The Negative Binomial distribution accommodates overdispersion (injuries tend to cluster more than a Poisson process would predict).

Note on exposure at deployment: During model development and backtesting, realized year Y+1 hours are used as the exposure offset, which is standard practice for fitting and validating frequency GLMs. At underwriting time, realized future hours are not available. In production use, the model output is an estimated injury rate per unit of exposure. Expected counts are then derived by multiplying this rate by the insured's submitted or projected hours/payroll. The model's discrimination metrics (Spearman ρ, lift, decile ordering) are properties of the rate prediction and do not depend on knowledge of future exposure. Count-level MAE figures reported herein reflect the backtesting framework and would differ slightly when applied to projected exposure.

3.3 Feature Set

GroupFeaturesRationale
Prior Year ITATRIR, DART, recordable cases, DAFW cases, deaths, hours, employees, severity ratio, illness ratioCore loss experience signal
NAICS Hierarchy2-digit (sector), 4-digit (industry group)Industry base rate
Enforcement HistoryInspection count, violation count, serious+ count, total penalties, willful/repeat flagsRegulatory signal of poor practices
Severe Injury ReportsSIR count, fatalities, amputations, hospitalizationsTail-risk indicator
TrendsTRIR year-over-year change, case count trendTrajectory matters
ControlsState (jurisdiction), COVID indicatorRegulatory environment, pandemic effect

4. Stage 1: Negative Binomial GLM

4.1 Specification

$$Y_{i,t+1} \sim \text{NegBin}(\mu_{i,t+1},\; \alpha)$$ $$\log(\mu_{i,t+1}) = \underbrace{\log(\text{hours}_{i,t+1})}_{ ext{exposure offset}} + \beta_0 + \sum_k \beta_k X_{ik,t} + \sum_j \gamma_j \, \text{NAICS2}_j$$

Where log(hours) is the offset (coefficient fixed at 1.0), ensuring the model predicts a rate per unit of exposure. The NAICS-2 sector dummies capture the industry base rate. The model was fit using IRLS (Iteratively Reweighted Least Squares) and converged in ~110 seconds on 346,191 training observations.

4.2 Coefficient Estimates

FeatureCoefficientp-valueInterpretation
prior_trir (capped at p99)+0.0975<0.001Higher prior TRIR → higher predicted count
is_covid-0.0880<0.001COVID years show ~8.8% fewer injuries
prior_deaths+0.05630.023Prior fatalities signal ongoing hazard
log_prior_employees-0.0495<0.001Larger workforce → slightly lower per-hour rate
prior_severity_ratio+0.0453<0.001Higher proportion of lost-time cases → worse
prior_illness_ratio-0.0275<0.001Illness-heavy mix has lower future injury count
log_prior_hours-0.0191<0.001More hours → slightly lower rate (exposure effect)
log_penalty+0.0183<0.001Higher OSHA penalties → higher predicted injuries
serious_plus_count+0.00150.707Serious+ violations (not significant alone)
trir_trend-0.0009<0.001Rising TRIR trend slightly reduces (regression to mean)
sir_count-0.00060.983SIR events (captured via other features)

Table 1: GLM coefficients. Positive coefficients increase predicted injury counts. All features except serious_plus_count and sir_count are statistically significant. These two are captured via the GBM stage.

4.3 Key Coefficient Interpretations

4.4 GLM Performance

MetricTrainValidationTest
MAE (case count)3.343.533.76
MAE (TRIR)3.763.513.88
Spearman ρ0.674

5. Stage 2: Buhlmann-Straub Credibility

5.1 Motivation

An entity's observed injury rate is a noisy estimate of its true underlying risk. For a large employer with 500,000+ annual hours, the observed rate is highly credible. For a small employer with 10,000 hours, year-to-year variation can swamp the signal. Buhlmann-Straub credibility provides the actuarially standard solution: blend individual experience with a group prior, weighted by exposure volume.

5.2 Formula

$$\hat{\mu}_{\text{credibility}} = Z \cdot \hat{\mu}_{\text{individual}} + (1 - Z) \cdot \hat{\mu}_{\text{group}}$$ $$Z = \frac{n_i}{n_i + k}$$ $$\text{where } n_i = \text{hours worked}, \quad k = \frac{\text{within-entity variance}}{\text{between-entity variance}}$$

5.3 Variance Component Estimation

Using multi-year entities (establishments with 2+ consecutive years of data), we estimated:

ComponentValueMeaning
Within-entity (process) variance3.93Year-to-year TRIR volatility for the same entity
Between-entity (parameter) variance188.0Spread of entity means around NAICS-4 group mean
k (hours for Z = 0.5)50,000~25 FTE-years of exposure needed for 50% credibility
Interpretation: The between-entity variance (188) greatly exceeds the within-entity variance (3.93), indicating that employer-specific risk levels are persistent — an entity's injury rate is much more stable year-to-year than it is similar across entities. This validates the use of individual experience data and means even modest exposure provides useful information.
Credibility distribution

Exhibit 5: Left — distribution of credibility weights across the test set. Right — the credibility function showing Z vs. hours worked. At 50,000 hours (~25 FTE) the entity gets 50% weight on its own experience. At 200,000 hours (~100 FTE), Z ≈ 0.80.

5.4 Credibility Validation

To verify the credibility mechanism works correctly, we compared individual, group, and blended MAE within Z-bands on the test set:

Credibility BandNIndividual MAEGroup MAEBlended MAEBest
Z = [0.0, 0.1) — very small9220.6560.6520.652Group ≈ Blend
Z = [0.1, 0.3)4,4280.5990.6040.602Individual ≈ Blend
Z = [0.3, 0.5)17,1640.8550.8620.838Blend wins
Z = [0.5, 0.7)35,7241.3341.4301.312Blend wins
Z = [0.7, 1.0) — large58,7195.3186.1305.260Blend wins

Table 3: The blended prediction equals or beats both individual and group predictions in every Z-band, confirming the credibility mechanism is correctly calibrated.

6. Stage 3: LightGBM Residual Model

6.1 Approach

After the GLM produces a base prediction, we compute Pearson residuals: (observed - predicted) / sqrt(predicted). A LightGBM gradient boosting model is trained on these residuals using the full feature set (including nonlinear features the GLM cannot capture). The final prediction is: GLM_prediction + GBM_residual × sqrt(GLM_prediction).

6.2 Hyperparameter Tuning

Optuna Bayesian optimization with 5 trials selected: learning rate = 0.020, 87 leaves, max depth = 5. Early stopping at 30 rounds of no improvement.

6.3 Top Features by Gain

RankFeatureGainInterpretation
1prior_trir_capped1,604,494Prior year TRIR (dominant predictor)
2prior_cases1,231,617Raw recordable case count
3naics_41,023,929Minor industry code (4-digit)
4prior_hours909,676Annual hours worked (size proxy)
5prior_employees531,094Employee count
6trir_trend437,816Year-over-year TRIR change
7prior_djtr268,801Job transfer/restriction cases
8cases_trend253,690Year-over-year case count change
9naics_2222,563Major sector code (2-digit)
10state207,013Jurisdiction (state-plan effects)

6.4 Stage 2 Improvement

ModelValidation MAEImprovement
GLM only3.531
GLM + GBM2.85619.1%

The GBM stage provides meaningful improvement by capturing nonlinear interactions between features that the linear GLM cannot model (e.g., industry-specific penalty effects, size-dependent trends).

6.5 Governance Considerations

The GLM base layer is fully transparent: each coefficient has a clear directional interpretation, and the model can be expressed as a simple formula. The GBM residual layer improves discrimination but reduces pure interpretability — individual predictions cannot be decomposed into additive factor contributions as cleanly as the GLM alone.

For model governance purposes:

7. Model Performance

7.1 Out-of-Time Test Set Results

All results below are on the held-out test set (feature year 2022, predicting 2023 outcomes, N = 116,957 establishments).

ModelMAE (counts)MAE (TRIR)Spearman ρTotal PredictedTotal Actual
GLM only3.763.880.674705,146556,251
GLM + GBM3.233.530.707714,580556,251
GLM + GBM + Credibility3.193.480.701695,877556,251
Calibration vs. Discrimination: The model demonstrates strong discrimination (ability to rank entities by risk), but the raw output is miscalibrated in level, over-predicting aggregate cases by ~25% on the test set (695,877 predicted vs 556,251 actual). This reflects the secular decline in workplace injury rates: the model was trained on 2016-2020 data when TRIR was higher, and 2023 outcomes reflect continued improvement.

This means the raw model output should not be used as-is for expected loss estimation. Before production deployment, a calendar-year trend adjustment (multiplicative factor of approximately 0.80 for the 2023 prediction year, re-estimated annually) must be applied to bring predicted totals into alignment with observed experience. The discrimination metrics (Spearman ρ, lift ratios, decile ordering) are unaffected by this level adjustment — they depend on rank ordering, not absolute values. Alternatively, the model can be re-fit on a rolling 3-year window to reduce training-to-prediction time gap.

7.2 Calibration

Calibration and Lorenz curve

Exhibit 8: Left — calibration plot showing predicted vs observed TRIR by decile. The model is well-calibrated with monotonically increasing actual TRIR across deciles. Right — Lorenz curve showing the model's ability to separate low-risk from high-risk establishments. The area between the model curve and the diagonal represents discriminatory power.

7.3 Lift Analysis

The top risk decile captures 17.9% of all recordable injuries (99,714 cases) with an exposure-weighted TRIR of 9.33. The bottom decile captures only 1.8% (9,784 cases, TRIR = 1.07). This represents a 8.7x lift ratio.

8. Size Bias Analysis

8.1 Is This Just Measuring Entity Size?

A critical question for any injury prediction model: is the predictive power simply "bigger entities have more injuries"? The evidence suggests the primary signal is persistent establishment-level risk differences (as reflected in prior observed experience), not headcount.

Size vs rate signal

Exhibit 6: Size predicts case counts (trivially, ρ = 0.54) but barely predicts rates (ρ = 0.12). The real predictive signal is prior rate → future rate (ρ = 0.48).

8.2 Discrimination Within Size Bands

To confirm the model works beyond size, we tested predictive power within size bands — comparing same-size entities against each other:

Discrimination by size band

Exhibit 4: Left — Spearman ρ between prior TRIR and next-year TRIR within each size band. All are statistically significant (p < 10-64). Right — Lift (top vs bottom quartile of prior TRIR) within each size band. Even micro entities show 1.7x lift.

Size BandNSpearman ρBottom 25% TRIRTop 25% TRIRLift
Micro (<10 FTE)5,0930.3133.405.701.7x
Small (10-25)18,0860.3272.597.522.9x
Med-Small (25-50)29,6220.4362.168.974.2x
Medium (50-100)26,9340.5371.748.915.1x
Large (100-250)22,1300.6501.417.465.3x
XL (250+)15,0920.7671.296.765.3x

9. Small Entity Considerations

Small entity analysis

Exhibit 7: Left — Small entities (<25 FTE) with prior injuries have a next-year TRIR of 7.14 vs 2.82 for those without (2.5x signal). Right — distribution of non-zero injury rates for small entities.

9.1 The Zero-Inflation Problem

For small entities (<25 FTE, 19.8% of the test set):

Actuarial Implication: For micro-entities (<5 FTE), the credibility weight Z ≈ 0.17, meaning the model correctly relies 83% on the NAICS group rate. An underwriter should interpret a micro-entity's predicted rate as: "your industry class rate, with a small adjustment based on whatever individual data exists." For entities below ~10 FTE, individual experience credibility is low and the NAICS/size-class prior dominates.

9.2 What Works for Small Entities

9.3 Recommended Approach by Entity Size

SizeZ RangeRecommended Approach
>100 FTE0.80-0.99Individual experience dominates. Use model prediction directly.
25-100 FTE0.50-0.80Blended. Model prediction is credible but NAICS prior provides important smoothing.
10-25 FTE0.30-0.50Industry-weighted. Start from NAICS-4 rate, adjust modestly for individual experience.
<10 FTE0.05-0.30Class-rated. Use NAICS-4 group rate as primary; individual data is supplementary only.

10. Intra-Industry Discrimination

10.1 The Core Question for Underwriting

Industry classification (NAICS code) is the dominant driver of base injury rates — a steel foundry will always have a higher expected TRIR than an accounting firm. The critical question for underwriting is whether the model can differentiate risk within the same industry: among steel foundries, can we identify which are safer and which are more dangerous? If the model only captures between-industry differences, it adds no value beyond existing class rates.

To test this, we computed within-group Spearman ρ between prior TRIR and next-year TRIR for every NAICS group at three levels of specificity: 2-digit (sector, e.g. "Manufacturing"), 4-digit (industry group, e.g. "Steel Foundries"), and 6-digit (specific industry, e.g. "Steel Investment Foundries"). Only groups with sufficient sample size were included (≥100, ≥30, and ≥20 establishments respectively).

10.2 Results Summary

NAICS LevelGroups AnalyzedSignificant (p<0.05)Median ρMean ρMedian LiftStrong (ρ≥0.4)
NAICS-22424 (100%)0.4630.4573.6x20 (83%)
NAICS-4244232 (95%)0.4380.4313.3x163 (67%)
NAICS-6661576 (87%)0.4250.4243.1x389 (59%)
Key Finding: The model maintains strong within-group ranking power at every level of industry specificity. Even within narrow 6-digit NAICS codes — where establishments are in the same specific industry — 87% of groups show statistically significant ranking power, with a median ρ of 0.43 and median lift of 3.1x between the top and bottom quartiles. This confirms that the predictive signal reflects persistent establishment-level risk differences (safety culture, management practices, facility conditions), not merely industry classification.
Intra-NAICS discrimination

Exhibit 10: Left — distribution of within-group ranking power across NAICS levels. The majority of groups at all levels show strong discrimination (ρ ≥ 0.4). Right — within-sector ranking power for each NAICS-2 sector. All 24 sectors show statistically significant within-group discrimination (p < 0.05), with Warehousing/Postal leading at ρ = 0.58.

10.3 Within-Sector Detail (NAICS-2)

Every major sector shows significant within-group ranking ability. The model is not simply sorting by industry — it is identifying which employers within each industry are safer or more dangerous than their peers:

NAICS-2SectorNWithin-Group ρLift (Top/Bot 25%)Avg TRIR
49Warehousing/Postal4,7030.5775.8x9.37
61Education6000.5355.3x4.60
54Professional Services1,2370.5249.7x3.46
51Information4290.5217.8x1.79
71Arts/Recreation1,1970.5166.6x8.40
92Public Administration1,7280.5132.6x5.74
33Mfg-Metal/Machinery18,9080.5053.8x3.57
45Retail (Non-Store)9000.5043.6x4.70
56Admin/Waste Services4,7470.4843.8x4.85
31Mfg-Food/Textile4,6270.4793.3x4.20
52Finance/Insurance1840.47211.3x1.56
11Agriculture2,7920.4644.0x4.88
42Wholesale Trade9,3840.4633.5x3.73
23Construction21,6710.4603.6x3.32
32Mfg-Wood/Chemical11,9290.4533.4x3.62
48Transportation5,5610.4383.5x5.50
22Utilities1,8100.4292.9x2.68
53Real Estate1,4580.4202.6x3.77
81Other Services1,7130.4163.2x3.58
62Healthcare11,9080.4013.0x5.68
21Mining3640.3754.1x1.71
44Retail (Store)4,9680.3632.5x4.46
72Accommodation/Food3,6210.3562.4x4.54
55Mgmt of Companies4690.3000.9x1.33

10.4 Narrowest Industry Level (NAICS-6)

At the most granular 6-digit NAICS level, 661 groups had sufficient data for analysis. Of these, 389 (59%) show strong ranking power (ρ ≥ 0.4). Only 19 groups (3%) show negligible discrimination — these tend to be small, homogeneous cohorts where most establishments have identical zero-injury years.

Underwriting Implication: When an underwriter receives a model prediction for an account, the risk differentiation is not an artifact of industry mix. A construction firm flagged as high-risk is being compared against other construction firms and found wanting. A healthcare facility rated favorably has demonstrated better safety outcomes than its healthcare peers. This within-industry signal is exactly what experience rating aims to capture — the model automates and validates it with out-of-time evidence.

11. COVID-19 Structural Break Analysis

11.1 The Recovery That Didn't Happen

The COVID-19 pandemic caused a sharp drop in workplace injury rates in 2020. A natural assumption — and one embedded in the model's binary COVID indicator — is that this was a temporary dip followed by reversion to pre-pandemic levels. The data shows otherwise.

To control for changes in the ITA reporter pool (which grew from ~145K to ~194K establishments between 2019 and 2024), we constructed a same-establishment panel of 32,101 establishments that reported in every year from 2019 through 2024. This eliminates composition effects: any TRIR change in this panel reflects genuine changes in injury rates at the same workplaces, not shifts in who is reporting.

COVID structural break

Exhibit 11: Left — exposure-weighted TRIR for the same-establishment panel, with pre-COVID trend projected forward. Injury rates dropped in 2020 and never returned to the pre-COVID trajectory. Right — sector-by-sector recovery status as of 2024 vs 2019 baseline (same establishments only).

Key Finding: On the same-establishment panel, the exposure-weighted TRIR declined from 3.39 in 2019 to 2.82 in 2024 — a 17% permanent decline. Extrapolating the pre-COVID trend would have predicted a 2024 TRIR of 3.35; the actual value is 2.94, representing a 12% structural gap attributable to post-pandemic changes.

11.2 Sector-Level Recovery

Of 23 major sectors analyzed, only 1 recovered to within 5% of their 2019 injury rates by 2024. The 22 sectors that remain substantially lower follow a clear pattern aligned with the shift to remote and hybrid work:

SectorN2019 TRIR2020 Dip2024 TRIRvs 2019
Mgmt of Companies2690.95-39.8%0.51-45.9%
Mining940.70-33.3%0.45-36.1%
Professional Services2550.87-25.6%0.65-25.6%
Construction7,0192.50-4.7%1.87-25.3%
Other Services5313.68-21.6%2.77-24.8%
Admin/Waste Services1,1693.15-9.7%2.48-21.5%
Transportation1,6664.39-25.6%3.45-21.4%
Mfg-Metal/Machinery6,2842.87-18.0%2.27-20.8%
Mfg-Wood/Chemical4,3172.58-12.2%2.05-20.6%
Agriculture9754.49-12.4%3.62-19.4%
Education1363.32-48.6%2.69-19.0%
Wholesale Trade2,8973.80-12.3%3.17-16.8%
Mfg-Food/Textile1,4783.30-6.8%2.79-15.4%
Retail (Non-Store)2603.56-7.7%3.04-14.6%
Utilities6892.06-18.0%1.76-14.4%
Real Estate2832.74-29.8%2.41-12.1%
Healthcare2,1454.80-10.0%4.22-12.1%
Information661.17-53.2%1.05-10.6%
Public Administration6125.69-15.6%5.10-10.4%
Accommodation/Food5914.28-25.3%4.03-6.0%
Arts/Recreation1938.03-12.4%7.60-5.3%
Retail (Store)1,3734.96-10.2%4.70-5.2%
Warehousing/Postal1,0705.02+4.4%5.15+2.6%

11.3 Contributing Factors

Several structural changes likely explain the permanent baseline shift:

  1. Remote and hybrid work: Sectors with the largest sustained drops (Information -50.6% dip with only partial recovery; Professional Services -38.6%; Management of Companies -54.1%) are precisely those where work-from-home became permanent. Fewer bodies in physical workplaces means fewer opportunities for recordable injuries — slips, trips, falls, and ergonomic injuries that occur in office and industrial settings.
  2. Operational changes that stuck: Physical industries like Construction (-25.5% vs 2019) and Manufacturing (-20.6%) cannot attribute their declines to WFH. Instead, pandemic-era changes in shift density, facility hygiene protocols, and hazard awareness appear to have persisted. Reduced crowding on production floors and loading docks has lasting safety benefits.
  3. Reporting threshold effects: The shift to remote work may have moved some injury types below the recordable threshold. Minor office injuries (paper cuts, ergonomic strains) that were previously recorded as workplace incidents may no longer be reported when they occur at home.
Model Implication: The GLM's binary COVID indicator (coefficient = -0.088, applied only to 2020-2021) treats the pandemic as a temporary event. This analysis shows the effect is permanent. For production use, the calendar-year trend adjustment (Section 7) should account for a structural break rather than a smooth trend — the post-2020 baseline is approximately 12% lower than the pre-COVID trajectory would predict, and continuing to decline at a similar rate (-5.8 bps/year post-COVID vs -3.9 bps/year pre-COVID). A two-regime trend (pre-2020 and post-2020) is recommended over a single linear trend.

12. Industry Rate Tables

12.1 Major Sector (NAICS-2) Base Rates

NAICS-2 industry rates

Exhibit 2: Exposure-weighted TRIR by major NAICS sector. Healthcare, Warehousing, and Public Administration have the highest observed rates. Construction's rate appears low because the construction ITA reporters tend to be larger, better-organized firms. Note on Mining (NAICS 21): The low TRIR shown (0.74) reflects only OSHA-regulated mining operations, primarily oil & gas extraction and surface mining support services. Underground coal mines, metal/nonmetal mines, and most quarries are regulated by MSHA (Mine Safety and Health Administration), not OSHA, and report injuries through a separate system. MSHA-regulated operations — which account for the majority of mining fatalities and severe injuries — are excluded from this study. The MSHA injury data (msha_accident) is available in a parallel dataset and could be integrated in a future extension of this model.

12.2 Minor Industry (NAICS-4) — Top 25 by Volume

NAICS-4EstablishmentsTotal CasesHours (M)TRIR
62213,883428,32317614.44.86
238210,67482,2047517.62.19
623110,657122,5446437.33.81
49316,545133,9696246.94.29
23733,68932,5176039.31.08
23626,43036,5704288.51.71
32614,66856,6593999.92.83
72116,96478,3843964.23.95
48415,91870,4103799.53.71
23713,83324,8953502.81.42
33632,06345,3013466.42.61
311698846,6053091.13.02
44517,80563,0622930.64.30
23815,92948,8682871.33.40
92111,85474,2962858.35.20
62336,44755,5022842.23.91
56175,57945,5352764.53.29
33641,25716,8292737.41.23
32541,35214,1352595.01.09
42442,84676,0382476.66.14
22112,56615,4962444.01.27
55111,4917,3192273.40.64
33911,91113,9252202.71.26
33233,81842,2612179.53.88
33451,5687,4082032.00.73

12.3 COVID-19 Impact

COVID impact

Exhibit 9: Exposure-weighted TRIR by year showing the 2020-2021 pandemic dip. See Section 11 for the full structural break analysis demonstrating that this decline is permanent, not temporary.

13. Limitations & Caveats

  1. ITA Selection Bias (Critical): The OSHA ITA survey targets establishments with 250+ employees or in designated high-hazard industries. The model is trained on a non-representative sample biased toward larger, riskier employers. Predictions for employers outside this profile should be treated as class-rate priors with limited individual calibration (Z → 0).
  2. Frequency Only, Not Severity: This model predicts recordable case counts, not dollar losses. Workers' compensation pricing requires both frequency and severity. Severity modeling would require carrier loss data not available in OSHA datasets.
  3. Aggregate Over-Prediction: The model over-predicts total cases by ~25% on the test set (695,877 predicted vs 556,251 actual). This reflects the secular improvement in workplace safety — the 2016-2020 training data has higher baseline rates than 2023 outcomes. A calendar-year trend factor should be applied in production.
  4. NAICS-6 Sparsity: Many 6-digit NAICS codes have fewer than 10 ITA reporters. The model uses NAICS-4 (415 codes) and NAICS-2 (50 sectors) to ensure adequate cohort sizes, but cannot differentiate within narrow sub-industries.
  5. State-Plan Variation: Approximately half of US states operate their own OSHA programs. ITA reporting completeness may vary by jurisdiction, and state-plan inspection/violation data may have different patterns than federal OSHA. The model includes state as a feature but cannot fully control for reporting differences.
  6. MSHA Exclusion: Mining operations regulated by the Mine Safety and Health Administration (MSHA) — including underground coal, metal/nonmetal mines, and most quarries — do not report to OSHA's ITA system. The "Mining" sector rates in this study reflect only OSHA-regulated operations (primarily oil & gas extraction and surface mining support). MSHA-regulated mining has substantially higher injury and fatality rates that are not captured here.
  7. Not a Premium Rate: This model produces an expected injury frequency index, not an insurance premium. Converting to premium requires loss development factors, trend factors, expense loading, and profit provisions that are outside the scope of this study.
  8. Stationarity Assumption: The model assumes that the relationship between year Y features and year Y+1 outcomes is approximately stationary. Structural changes (new OSHA regulations, industry shifts, economic cycles) may require periodic retraining.

14. Deployment & Governance Recommendations

  1. Deploy as primary frequency predictor: The model demonstrates strong discrimination and calibration on out-of-time data and should be used for forward-looking injury frequency estimation in underwriting.
  2. Apply calendar-year trend factor: Fit a simple multiplicative trend to the TRIR time series to adjust for the secular improvement in workplace safety (~2-3% per year).
  3. Retrain annually: As new ITA data becomes available, retrain the model to keep it current. The holdout set (2023→2024) is available for the first retrain validation.
  4. Use credibility Z as a data-quality flag: Expose the credibility weight to underwriters so they know how much trust to place in individual experience vs. class rate.
  5. Consider severity modeling: If carrier loss data becomes available, a parallel severity model (average cost per claim by injury type) would complete the actuarial pricing picture.
  6. Extend to non-ITA entities: For the ~3.6M establishments without ITA records, the NAICS-4 group rate from this model can serve as a class-rated prior. Enforcement and SIR data (which exists for many non-ITA entities) can further adjust via the GBM stage.
Decile discrimination

Exhibit 3: Left — predicted risk decile (based on prior TRIR) vs actual next-year TRIR. The monotonic increase confirms that the model's risk ordering translates directly into observed outcomes. Right — percentage of all injuries captured by each decile. The top decile (D10) captures a disproportionate share of injuries while the bottom decile (D1) captures very few, demonstrating strong discrimination. Individual injury outcomes are inherently noisy (rare events follow a Negative Binomial process), but aggregate decile-level patterns are highly stable and predictable.

15. What This Model Can and Cannot Be Used For

Appropriate UsesNot Appropriate For
Frequency risk-ranking of ITA-reporting establishments Premium rate indication or pricing without further severity and expense loading
Identifying high-risk accounts for underwriter review Automated accept/reject decisions without human review
Supplementing class rates with experience-based adjustments for credible accounts (Z > 0.3) Individually rating micro-accounts (<10 FTE) where credibility is low
Benchmarking an employer's predicted frequency against NAICS peers Generalizing to non-ITA populations (small employers, low-hazard industries) without acknowledging that predictions are primarily class-rate priors
Trend analysis and portfolio-level frequency monitoring Predicting specific claim types, severity, or cost
Informing loss-control targeting (which accounts to inspect) MSHA-regulated mining or FMCSA-regulated motor carrier risk assessment

16. Recommended Next Steps Before Production Use

  1. Calendar-year trend adjustment: Fit a multiplicative trend factor to eliminate the ~25% aggregate over-prediction. Re-estimate this factor each year as new ITA data becomes available.
  2. Parallel run: Deploy in shadow mode for 6-12 months, comparing predictions against realized outcomes before full production adoption.
  3. Exposure projection protocol: Define how submitted or projected hours/payroll will be used at quote time in place of the realized hours used in backtesting. Validate that the rate prediction remains calibrated when paired with exposure estimates.
  4. Segment-level validation: Evaluate discrimination and calibration separately by NAICS sector, state, and size band. Identify segments where the model underperforms and consider segment-specific adjustments or exclusions.
  5. Holdout validation: Run the model on the reserved 2023→2024 holdout set to confirm out-of-time stability before production deployment.
  6. Model governance documentation: Prepare model risk management documentation including ongoing monitoring plan, retraining schedule, performance thresholds that would trigger model review, and escalation procedures.
  7. Credibility and applicability flags: Expose the credibility weight Z and an ITA-applicability flag in the model output so underwriters can see how much of the prediction is individually experience-rated vs. class-rated, and whether the account falls within the model's training population.