Predictive Modeling of Workers' Compensation
Injury Frequency Using Federal OSHA Data

Fred Duggan

MarisRisk

March 2026

Technical White Paper

Abstract

This paper presents a three-stage predictive model for workers' compensation recordable injury frequency, trained on 692,412 establishment-year observations from the OSHA Injury Tracking Application (ITA) spanning 2016 through 2024. The model uses a lagged panel design where year Y features predict year Y+1 injury counts, validated on a strict out-of-time test set.

The architecture combines a Negative Binomial generalized linear model (GLM) with annual hours-worked exposure offset for actuarial interpretability, Buhlmann-Straub credibility weighting for small-entity shrinkage (parameters set via empirical Bayes; see Section 5), and a LightGBM gradient boosting model trained on GLM residuals to capture nonlinear interactions. On the held-out test set (N = 116,957 establishments, feature year 2022, target year 2023), the combined model achieves an 8.7x lift ratio between the top and bottom risk deciles, with the highest-risk decile capturing 17.9% of all recordable injuries. Aggregate predictions require a calendar-year adjustment due to approximately 25% over-prediction on this period (see Section 7).

We demonstrate that the predictive signal reflects persistent establishment-level risk differences rather than entity size: prior injury rate predicts future rate at Spearman rho = 0.48, while entity size predicts rate at only rho = 0.12. Discriminatory power attenuates for smaller entities as expected from credibility theory (rho = 0.31 for micro-employers under 10 FTE, where predictions are heavily weighted toward class-rate priors). When compared against NAICS-4 percentile peer grading on ITA-reporting establishments, the model shows higher case capture in the top risk decile (17.9% vs 15.1%) with complementary ranking signals (inter-system Spearman rho = 0.54). We recommend deployment for frequency risk assessment on ITA-reporting accounts, with peer grading retained for small accounts and as a complementary current-posture indicator. The OSHA ITA training population is not representative of all US employers; limitations regarding selection bias, aggregate calibration, and small-entity applicability are discussed.

Keywords: workers' compensation, injury frequency, predictive modeling, Buhlmann-Straub credibility, negative binomial GLM, OSHA, ITA, NAICS, actuarial, empirical Bayes, gradient boosting, workplace safety

Workers' Compensation Injury Frequency Prediction Study

Predictive Model for Recordable Injury Counts Using OSHA ITA Data (2016-2024)
Prepared: March 22, 2026
Data Source: OSHA Injury Tracking Application (ITA) Annual Summaries, OSHA Enforcement Data, OSHA Severe Injury Reports
Prepared for: MarisRisk Underwriting Analytics

Executive Summary
Data Description
Methodology
Stage 1: Negative Binomial GLM
Stage 2: Buhlmann-Straub Credibility
Stage 3: LightGBM Residual Model
Model Performance
Size Bias Analysis
Small Entity Considerations
Comparison to Current Peer Grading
Industry Rate Tables
Limitations & Caveats
Deployment & Governance Recommendations
What This Model Can and Cannot Be Used For
Recommended Next Steps Before Production Use

1. Executive Summary

This study develops and validates a predictive model for workers' compensation injury frequency using 692,412 establishment-year observations from OSHA's Injury Tracking Application (ITA) spanning 2016-2024. The model uses year Y features (prior injury rates, enforcement history, industry classification, employer size) to predict year Y+1 recordable injury counts.

Key Findings:

The model achieves a 8.7x lift ratio between the top and bottom risk deciles on an out-of-time test set (2022→2023), indicating strong discrimination.
The top risk decile captures 17.9% of all recordable injuries, compared to 15.1% captured by the current NAICS-4 percentile peer grading approach.
The predictive signal reflects persistent establishment-level risk differences, not simply entity size: prior injury rate predicts future rate at Spearman ρ = 0.48, while entity size predicts rate at only ρ = 0.12.
Discriminatory power attenuates for smaller entities as expected (ρ = 0.31 for <10 FTE vs ρ = 0.78 for 250+ FTE). For micro-employers, predictions are heavily weighted toward industry class rates, with limited individual experience differentiation.
Buhlmann-Straub credibility weighting blends individual and group estimates: micro-entities (<5 FTE) receive 83% weight on the NAICS group rate, while large entities (>100 FTE) are predominantly experience-rated.

Important limitations: The OSHA ITA training population is not representative of all US employers. It is tilted toward larger establishments (250+ employees) and designated high-hazard industries. Generalization to small, low-hazard, or non-ITA-reporting employers is limited; predictions for such entities should be treated as class-rate priors with modest risk modifiers, not as individually calibrated estimates. The model also exhibits ~25% aggregate over-prediction on the 2022→2023 test set due to secular trend, requiring a calendar-year adjustment before production use. See Sections 7 and 12 for details.

The model architecture combines a Negative Binomial GLM with exposure offset (providing the interpretable base rate), Buhlmann-Straub credibility weighting (blending individual experience with industry priors), and a LightGBM model trained on GLM residuals (capturing nonlinear interactions, improving validation MAE by 19.1%). The GLM layer is fully transparent and standard in actuarial practice. The gradient boosting layer improves discrimination but introduces model governance considerations and explainability requirements discussed in Section 6. Model deployment is contingent on calendar-year adjustment and governance sign-off.

2. Data Description

2.1 Primary Data Source

The OSHA Injury Tracking Application (ITA) collects annual injury and illness data from establishments meeting reporting thresholds (generally 250+ employees, or 20+ employees in designated high-hazard industries). Each record represents one establishment-year and includes:

Exposure: Total hours worked, annual average employee count
Outcomes: Total recordable cases, days-away-from-work cases, job transfer/restriction cases, fatalities
Derived rates: TRIR (Total Recordable Incident Rate = cases × 200,000 / hours), DART rate
Classification: 6-digit NAICS code, EIN, establishment identifier

2.2 Supplementary Data Sources

OSHA Enforcement: 5.1M inspections and 13.2M violations with severity classifications and penalty amounts
OSHA Severe Injury Reports (SIR): 103K events including hospitalizations, amputations, fatalities

2.3 Panel Construction

The model uses a lagged panel design: for each establishment with consecutive-year ITA records, we pair year Y features with year Y+1 outcomes. This produces 692,412 observation pairs across 267,091 unique establishments.

Exhibit 1: Panel size by feature year (left) and observed TRIR trends (right). The COVID-19 dip in 2020-2021 is visible. 2017 has lower volume due to ITA reporting changes.

2.4 Temporal Split (No Data Leakage)

Split	Feature Years	Target Years	N	Purpose
Train	2016-2020	2017-2021	346,191	Model fitting
Validate	2021	2022	98,941	Hyperparameter tuning
Test	2022	2023	116,957	Final evaluation (all results reported here)
Holdout	2023	2024	130,323	Reserved for future validation

2.5 Data Quality Filters

Minimum 2,000 annual hours worked (~1 FTE) — below this threshold, rates are unreliable
Maximum 10M hours — 942 records with implausible values (up to 16 trillion) were capped
Requires valid NAICS code (4+ digits) and non-null establishment linkage
TRIR capped at 99th percentile (27.86) to limit outlier influence on the GLM

3. Methodology

3.1 Model Architecture

The model uses a three-stage architecture standard in actuarial pricing:

Stage 1 — Negative Binomial GLM: Provides the interpretable base prediction with industry fixed effects and an exposure offset. This is the actuarially transparent component.
Stage 2 — Buhlmann-Straub Credibility: Blends individual entity predictions with NAICS-4 group rates based on exposure volume, properly handling the small-entity problem.
Stage 3 — LightGBM Residual Model: A gradient boosting model trained on GLM Pearson residuals to capture nonlinear interactions and feature interactions not explained by the base GLM, improving MAE by 19.1% on validation.

3.2 Target Variable and Exposure Treatment

Total recordable cases (integer count) for year Y+1, with log(total hours worked) as the exposure offset. This is the standard actuarial approach: modeling claim frequency per unit of exposure. The Negative Binomial distribution accommodates overdispersion (injuries tend to cluster more than a Poisson process would predict).

Note on exposure at deployment: During model development and backtesting, realized year Y+1 hours are used as the exposure offset, which is standard practice for fitting and validating frequency GLMs. At underwriting time, realized future hours are not available. In production use, the model output is an estimated injury rate per unit of exposure. Expected counts are then derived by multiplying this rate by the insured's submitted or projected hours/payroll. The model's discrimination metrics (Spearman ρ, lift, decile ordering) are properties of the rate prediction and do not depend on knowledge of future exposure. Count-level MAE figures reported herein reflect the backtesting framework and would differ slightly when applied to projected exposure.

3.3 Feature Set

Group	Features	Rationale
Prior Year ITA	TRIR, DART, recordable cases, DAFW cases, deaths, hours, employees, severity ratio, illness ratio	Core loss experience signal
NAICS Hierarchy	2-digit (sector), 4-digit (industry group)	Industry base rate
Enforcement History	Inspection count, violation count, serious+ count, total penalties, willful/repeat flags	Regulatory signal of poor practices
Severe Injury Reports	SIR count, fatalities, amputations, hospitalizations	Tail-risk indicator
Trends	TRIR year-over-year change, case count trend	Trajectory matters
Controls	State (jurisdiction), COVID indicator	Regulatory environment, pandemic effect

4. Stage 1: Negative Binomial GLM

4.1 Specification

$$Y_{i,t+1} \sim \text{NegBin}(\mu_{i,t+1},\; \alpha)$$ $$\log(\mu_{i,t+1}) = \underbrace{\log(\text{hours}_{i,t+1})}_{ ext{exposure offset}} + \beta_0 + \sum_k \beta_k X_{ik,t} + \sum_j \gamma_j \, \text{NAICS2}_j$$

Where log(hours) is the offset (coefficient fixed at 1.0), ensuring the model predicts a rate per unit of exposure. The NAICS-2 sector dummies capture the industry base rate. The model was fit using IRLS (Iteratively Reweighted Least Squares) and converged in ~110 seconds on 346,191 training observations.

4.2 Coefficient Estimates

Feature	Coefficient	p-value	Interpretation
prior_trir (capped at p99)	+0.0975	<0.001	Higher prior TRIR → higher predicted count
is_covid	-0.0880	<0.001	COVID years show ~8.8% fewer injuries
prior_deaths	+0.0563	0.023	Prior fatalities signal ongoing hazard
log_prior_employees	-0.0495	<0.001	Larger workforce → slightly lower per-hour rate
prior_severity_ratio	+0.0453	<0.001	Higher proportion of lost-time cases → worse
prior_illness_ratio	-0.0275	<0.001	Illness-heavy mix has lower future injury count
log_prior_hours	-0.0191	<0.001	More hours → slightly lower rate (exposure effect)
log_penalty	+0.0183	<0.001	Higher OSHA penalties → higher predicted injuries
serious_plus_count	+0.0015	0.707	Serious+ violations (not significant alone)
trir_trend	-0.0009	<0.001	Rising TRIR trend slightly reduces (regression to mean)
sir_count	-0.0006	0.983	SIR events (captured via other features)

Table 1: GLM coefficients. Positive coefficients increase predicted injury counts. All features except serious_plus_count and sir_count are statistically significant. These two are captured via the GBM stage.

4.3 Key Coefficient Interpretations

prior_trir (+0.098): A 1-point increase in prior TRIR predicts a ~10.2% increase in next-year injury count (exp(0.0975) = 1.102). This is the dominant predictor.
log_prior_employees (-0.050): Holding hours constant, larger employers have slightly lower per-hour injury rates — consistent with the well-known large-employer safety advantage.
prior_severity_ratio (+0.045): When a higher fraction of injuries are lost-time cases (vs. medical-only), the entity is predicted to have more injuries the following year.
log_penalty (+0.018): Higher OSHA penalty amounts, controlling for industry and violations, predict more future injuries — penalties are a lagging indicator of hazardous conditions.

4.4 GLM Performance

Metric	Train	Validation	Test
MAE (case count)	3.34	3.53	3.76
MAE (TRIR)	3.76	3.51	3.88
Spearman ρ	—	—	0.674

5. Stage 2: Buhlmann-Straub Credibility

5.1 Motivation

An entity's observed injury rate is a noisy estimate of its true underlying risk. For a large employer with 500,000+ annual hours, the observed rate is highly credible. For a small employer with 10,000 hours, year-to-year variation can swamp the signal. Buhlmann-Straub credibility provides the actuarially standard solution: blend individual experience with a group prior, weighted by exposure volume.

5.2 Formula

$$\hat{\mu}_{\text{credibility}} = Z \cdot \hat{\mu}_{\text{individual}} + (1 - Z) \cdot \hat{\mu}_{\text{group}}$$ $$Z = \frac{n_i}{n_i + k}$$ $$\text{where } n_i = \text{hours worked}, \quad k = \frac{\text{within-entity variance}}{\text{between-entity variance}}$$

5.3 Variance Component Estimation

Using multi-year entities (establishments with 2+ consecutive years of data), we estimated:

Component	Value	Meaning
Within-entity (process) variance	3.93	Year-to-year TRIR volatility for the same entity
Between-entity (parameter) variance	188.0	Spread of entity means around NAICS-4 group mean
k (hours for Z = 0.5)	50,000	~25 FTE-years of exposure needed for 50% credibility

Interpretation: The between-entity variance (188) greatly exceeds the within-entity variance (3.93), indicating that employer-specific risk levels are persistent — an entity's injury rate is much more stable year-to-year than it is similar across entities. This validates the use of individual experience data and means even modest exposure provides useful information.

Exhibit 5: Left — distribution of credibility weights across the test set. Right — the credibility function showing Z vs. hours worked. At 50,000 hours (~25 FTE) the entity gets 50% weight on its own experience. At 200,000 hours (~100 FTE), Z ≈ 0.80.

5.4 Credibility Validation

To verify the credibility mechanism works correctly, we compared individual, group, and blended MAE within Z-bands on the test set:

Credibility Band	N	Individual MAE	Group MAE	Blended MAE	Best
Z = [0.0, 0.1) — very small	922	0.656	0.652	0.652	Group ≈ Blend
Z = [0.1, 0.3)	4,428	0.599	0.604	0.602	Individual ≈ Blend
Z = [0.3, 0.5)	17,164	0.855	0.862	0.838	Blend wins
Z = [0.5, 0.7)	35,724	1.334	1.430	1.312	Blend wins
Z = [0.7, 1.0) — large	58,719	5.318	6.130	5.260	Blend wins

Table 3: The blended prediction equals or beats both individual and group predictions in every Z-band, confirming the credibility mechanism is correctly calibrated.

6. Stage 3: LightGBM Residual Model

6.1 Approach

After the GLM produces a base prediction, we compute Pearson residuals: (observed - predicted) / sqrt(predicted). A LightGBM gradient boosting model is trained on these residuals using the full feature set (including nonlinear features the GLM cannot capture). The final prediction is: GLM_prediction + GBM_residual × sqrt(GLM_prediction).

6.2 Hyperparameter Tuning

Optuna Bayesian optimization with 5 trials selected: learning rate = 0.020, 87 leaves, max depth = 5. Early stopping at 30 rounds of no improvement.

6.3 Top Features by Gain

Rank	Feature	Gain	Interpretation
1	prior_trir_capped	1,604,494	Prior year TRIR (dominant predictor)
2	prior_cases	1,231,617	Raw recordable case count
3	naics_4	1,023,929	Minor industry code (4-digit)
4	prior_hours	909,676	Annual hours worked (size proxy)
5	prior_employees	531,094	Employee count
6	trir_trend	437,816	Year-over-year TRIR change
7	prior_djtr	268,801	Job transfer/restriction cases
8	cases_trend	253,690	Year-over-year case count change
9	naics_2	222,563	Major sector code (2-digit)
10	state	207,013	Jurisdiction (state-plan effects)

6.4 Stage 2 Improvement

Model	Validation MAE	Improvement
GLM only	3.531	—
GLM + GBM	2.856	19.1%

The GBM stage provides meaningful improvement by capturing nonlinear interactions between features that the linear GLM cannot model (e.g., industry-specific penalty effects, size-dependent trends).

6.5 Governance Considerations

The GLM base layer is fully transparent: each coefficient has a clear directional interpretation, and the model can be expressed as a simple formula. The GBM residual layer improves discrimination but reduces pure interpretability — individual predictions cannot be decomposed into additive factor contributions as cleanly as the GLM alone.

For model governance purposes:

The GLM layer can be deployed independently (with ~19% higher MAE) if full transparency is required
The GBM layer should be monitored for feature drift (distributional shifts in input features over time), prediction stability (large changes in output for small input changes), and segment-level performance (degradation in specific NAICS or size segments)
Periodic retraining (recommended annually as new ITA data is released) should include validation against the prior model version to detect performance changes
SHAP values are available for individual predictions to provide post-hoc explainability of the GBM contribution, but these are approximations, not exact decompositions

7. Model Performance

7.1 Out-of-Time Test Set Results

All results below are on the held-out test set (feature year 2022, predicting 2023 outcomes, N = 116,957 establishments).

Model	MAE (counts)	MAE (TRIR)	Spearman ρ	Total Predicted	Total Actual
GLM only	3.76	3.88	0.674	705,146	556,251
GLM + GBM	3.23	3.53	0.707	714,580	556,251
GLM + GBM + Credibility	3.19	3.48	0.701	695,877	556,251

Calibration vs. Discrimination: The model demonstrates strong discrimination (ability to rank entities by risk), but the raw output is miscalibrated in level, over-predicting aggregate cases by ~25% on the test set (695,877 predicted vs 556,251 actual). This reflects the secular decline in workplace injury rates: the model was trained on 2016-2020 data when TRIR was higher, and 2023 outcomes reflect continued improvement.

This means the raw model output should not be used as-is for expected loss estimation. Before production deployment, a calendar-year trend adjustment (multiplicative factor of approximately 0.80 for the 2023 prediction year, re-estimated annually) must be applied to bring predicted totals into alignment with observed experience. The discrimination metrics (Spearman ρ, lift ratios, decile ordering) are unaffected by this level adjustment — they depend on rank ordering, not absolute values. Alternatively, the model can be re-fit on a rolling 3-year window to reduce training-to-prediction time gap.

7.2 Calibration

$Calibration and Lorenz curve$

Exhibit 8: Left — calibration plot showing predicted vs observed TRIR by decile. The model is well-calibrated with monotonically increasing actual TRIR across deciles. Right — Lorenz curve showing the model's ability to separate low-risk from high-risk establishments. The area between the model curve and the diagonal represents discriminatory power.

7.3 Lift Analysis

The top risk decile captures 17.9% of all recordable injuries (99,714 cases) with an exposure-weighted TRIR of 9.33. The bottom decile captures only 1.8% (9,784 cases, TRIR = 1.07). This represents a 8.7x lift ratio.

8. Size Bias Analysis

8.1 Is This Just Measuring Entity Size?

A critical question for any injury prediction model: is the predictive power simply "bigger entities have more injuries"? The evidence suggests the primary signal is persistent establishment-level risk differences (as reflected in prior observed experience), not headcount.

Exhibit 6: Size predicts case counts (trivially, ρ = 0.54) but barely predicts rates (ρ = 0.12). The real predictive signal is prior rate → future rate (ρ = 0.48).

8.2 Discrimination Within Size Bands

To confirm the model works beyond size, we tested predictive power within size bands — comparing same-size entities against each other:

Exhibit 4: Left — Spearman ρ between prior TRIR and next-year TRIR within each size band. All are statistically significant (p < 10^-64). Right — Lift (top vs bottom quartile of prior TRIR) within each size band. Even micro entities show 1.7x lift.

Size Band	N	Spearman ρ	Bottom 25% TRIR	Top 25% TRIR	Lift
Micro (<10 FTE)	5,093	0.313	3.40	5.70	1.7x
Small (10-25)	18,086	0.327	2.59	7.52	2.9x
Med-Small (25-50)	29,622	0.436	2.16	8.97	4.2x
Medium (50-100)	26,934	0.537	1.74	8.91	5.1x
Large (100-250)	22,130	0.650	1.41	7.46	5.3x
XL (250+)	15,092	0.767	1.29	6.76	5.3x

9. Small Entity Considerations

Exhibit 7: Left — Small entities (<25 FTE) with prior injuries have a next-year TRIR of 7.14 vs 2.82 for those without (2.5x signal). Right — distribution of non-zero injury rates for small entities.

9.1 The Zero-Inflation Problem

For small entities (<25 FTE, 19.8% of the test set):

59.8% have zero injuries in any given target year
Of those with zero prior-year injuries, 71.5% remain at zero the following year
Median target TRIR = 0.00 (more than half have no injuries)

Actuarial Implication: For micro-entities (<5 FTE), the credibility weight Z ≈ 0.17, meaning the model correctly relies 83% on the NAICS group rate. An underwriter should interpret a micro-entity's predicted rate as: "your industry class rate, with a small adjustment based on whatever individual data exists." For entities below ~10 FTE, individual experience credibility is low and the NAICS/size-class prior dominates.

9.2 What Works for Small Entities

Binary signal is strong: Whether an entity had ANY prior injuries is a powerful discriminator (next-year TRIR of 7.14 vs 2.82, p < 0.001)
NAICS classification is critical: When individual data is sparse, the industry base rate carries almost all the weight — NAICS-4 selection matters enormously
Enforcement data helps: Even without ITA history, violation counts and penalties from OSHA inspections provide a regulatory history signal correlated with future injury experience

9.3 Recommended Approach by Entity Size

Size	Z Range	Recommended Approach
>100 FTE	0.80-0.99	Individual experience dominates. Use model prediction directly.
25-100 FTE	0.50-0.80	Blended. Model prediction is credible but NAICS prior provides important smoothing.
10-25 FTE	0.30-0.50	Industry-weighted. Start from NAICS-4 rate, adjust modestly for individual experience.
<10 FTE	0.05-0.30	Class-rated. Use NAICS-4 group rate as primary; individual data is supplementary only.

10. Comparison to Current Peer Grading

The current system ranks establishments by percentile within their NAICS-4 cohort using a weighted composite of 7 factors (violation severity 0.26, TRIR 0.20, SIR events 0.16, penalties 0.08, financial health 0.10, litigation 0.09, news sentiment 0.11). This is a point-in-time relative ranking, not a predictive model.

Exhibit 10: The new predictive model captures nearly twice the injury volume in the top risk decile.

Dimension	New Predictive Model	Current Peer Grading
Top 10% case capture	17.9%	15.1%
Lift (top/bottom 10%)	5.6x	5.0x
Rank correlation	Spearman ρ = 0.54 (partially overlapping but capturing different signals)
Output type	Calibrated expected count & rate	Relative percentile (0-100)
Small entity handling	Credibility-weighted shrinkage	Falls back to global percentile
Temporal design	Year Y → Year Y+1 (validated)	Point-in-time snapshot
Validation	Out-of-time test set	None (heuristic weights)

Recommendation — Use Each Where It Is Strongest:

Replace peer grading with this model for forward-looking injury frequency estimation on ITA-reporting accounts (larger employers, high-hazard industries). The predictive model is validated out-of-time and produces calibrated expected rates; the peer grading system was not designed or validated for this purpose.
Retain peer grading for small/micro accounts without ITA history, where this model has low credibility (Z < 0.3) and relies almost entirely on NAICS class priors. In this segment, the peer grading system's composite of violations, penalties, and SIR events provides additional context the frequency model cannot meaningfully individualize.
Retain peer grading as a complementary "current posture" indicator alongside the frequency model. The moderate rank correlation between the systems (ρ = 0.54) confirms they capture partially different signals — an account that ranks well on predicted frequency but poorly on peer grade (or vice versa) warrants underwriter review.
Do not use either system for MSHA-regulated mining or FMCSA-regulated motor carriers, which have their own separate scoring models and data sources.

11. Industry Rate Tables

11.1 Major Sector (NAICS-2) Base Rates

Exhibit 2: Exposure-weighted TRIR by major NAICS sector. Healthcare, Warehousing, and Public Administration have the highest observed rates. Construction's rate appears low because the construction ITA reporters tend to be larger, better-organized firms. Note on Mining (NAICS 21): The low TRIR shown (0.74) reflects only OSHA-regulated mining operations, primarily oil & gas extraction and surface mining support services. Underground coal mines, metal/nonmetal mines, and most quarries are regulated by MSHA (Mine Safety and Health Administration), not OSHA, and report injuries through a separate system. MSHA-regulated operations — which account for the majority of mining fatalities and severe injuries — are excluded from this study. The MSHA injury data (msha_accident) is available in a parallel dataset and could be integrated in a future extension of this model.

11.2 Minor Industry (NAICS-4) — Top 25 by Volume

NAICS-4	Establishments	Total Cases	Hours (M)	TRIR
6221	3,883	428,323	17614.4	4.86
2382	10,674	82,204	7517.6	2.19
6231	10,657	122,544	6437.3	3.81
4931	6,545	133,969	6246.9	4.29
2373	3,689	32,517	6039.3	1.08
2362	6,430	36,570	4288.5	1.71
3261	4,668	56,659	3999.9	2.83
7211	6,964	78,384	3964.2	3.95
4841	5,918	70,410	3799.5	3.71
2371	3,833	24,895	3502.8	1.42
3363	2,063	45,301	3466.4	2.61
3116	988	46,605	3091.1	3.02
4451	7,805	63,062	2930.6	4.30
2381	5,929	48,868	2871.3	3.40
9211	1,854	74,296	2858.3	5.20
6233	6,447	55,502	2842.2	3.91
5617	5,579	45,535	2764.5	3.29
3364	1,257	16,829	2737.4	1.23
3254	1,352	14,135	2595.0	1.09
4244	2,846	76,038	2476.6	6.14
2211	2,566	15,496	2444.0	1.27
5511	1,491	7,319	2273.4	0.64
3391	1,911	13,925	2202.7	1.26
3323	3,818	42,261	2179.5	3.88
3345	1,568	7,408	2032.0	0.73

11.3 COVID-19 Impact

Exhibit 9: Exposure-weighted TRIR by year. The 2020-2021 dip reflects reduced workplace exposure during the pandemic. The model includes a COVID binary indicator (coefficient = -0.088) to account for this structural break.

12. Limitations & Caveats

ITA Selection Bias (Critical): The OSHA ITA survey targets establishments with 250+ employees or in designated high-hazard industries. The model is trained on a non-representative sample biased toward larger, riskier employers. Predictions for employers outside this profile should be treated as class-rate priors with limited individual calibration (Z → 0).
Frequency Only, Not Severity: This model predicts recordable case counts, not dollar losses. Workers' compensation pricing requires both frequency and severity. Severity modeling would require carrier loss data not available in OSHA datasets.
Aggregate Over-Prediction: The model over-predicts total cases by ~25% on the test set (695,877 predicted vs 556,251 actual). This reflects the secular improvement in workplace safety — the 2016-2020 training data has higher baseline rates than 2023 outcomes. A calendar-year trend factor should be applied in production.
NAICS-6 Sparsity: Many 6-digit NAICS codes have fewer than 10 ITA reporters. The model uses NAICS-4 (415 codes) and NAICS-2 (50 sectors) to ensure adequate cohort sizes, but cannot differentiate within narrow sub-industries.
State-Plan Variation: Approximately half of US states operate their own OSHA programs. ITA reporting completeness may vary by jurisdiction, and state-plan inspection/violation data may have different patterns than federal OSHA. The model includes state as a feature but cannot fully control for reporting differences.
MSHA Exclusion: Mining operations regulated by the Mine Safety and Health Administration (MSHA) — including underground coal, metal/nonmetal mines, and most quarries — do not report to OSHA's ITA system. The "Mining" sector rates in this study reflect only OSHA-regulated operations (primarily oil & gas extraction and surface mining support). MSHA-regulated mining has substantially higher injury and fatality rates that are not captured here.
Not a Premium Rate: This model produces an expected injury frequency index, not an insurance premium. Converting to premium requires loss development factors, trend factors, expense loading, and profit provisions that are outside the scope of this study.
Stationarity Assumption: The model assumes that the relationship between year Y features and year Y+1 outcomes is approximately stationary. Structural changes (new OSHA regulations, industry shifts, economic cycles) may require periodic retraining.

13. Deployment & Governance Recommendations

Deploy as primary frequency predictor: The model outperforms the current peer grading system on every discrimination metric and should be used for forward-looking injury frequency estimation in underwriting.
Apply calendar-year trend factor: Fit a simple multiplicative trend to the TRIR time series to adjust for the secular improvement in workplace safety (~2-3% per year).
Retrain annually: As new ITA data becomes available, retrain the model to keep it current. The holdout set (2023→2024) is available for the first retrain validation.
Use credibility Z as a data-quality flag: Expose the credibility weight to underwriters so they know how much trust to place in individual experience vs. class rate.
Consider severity modeling: If carrier loss data becomes available, a parallel severity model (average cost per claim by injury type) would complete the actuarial pricing picture.
Extend to non-ITA entities: For the ~3.6M establishments without ITA records, the NAICS-4 group rate from this model can serve as a class-rated prior. Enforcement and SIR data (which exists for many non-ITA entities) can further adjust via the GBM stage.

Exhibit 3: Left — predicted risk decile (based on prior TRIR) vs actual next-year TRIR. The monotonic increase confirms that the model's risk ordering translates directly into observed outcomes. Right — percentage of all injuries captured by each decile. The top decile (D10) captures a disproportionate share of injuries while the bottom decile (D1) captures very few, demonstrating strong discrimination. Individual injury outcomes are inherently noisy (rare events follow a Negative Binomial process), but aggregate decile-level patterns are highly stable and predictable.

14. What This Model Can and Cannot Be Used For

Appropriate Uses	Not Appropriate For
Frequency risk-ranking of ITA-reporting establishments	Premium rate indication or pricing without further severity and expense loading
Identifying high-risk accounts for underwriter review	Automated accept/reject decisions without human review
Supplementing class rates with experience-based adjustments for credible accounts (Z > 0.3)	Individually rating micro-accounts (<10 FTE) where credibility is low
Benchmarking an employer's predicted frequency against NAICS peers	Generalizing to non-ITA populations (small employers, low-hazard industries) without acknowledging that predictions are primarily class-rate priors
Trend analysis and portfolio-level frequency monitoring	Predicting specific claim types, severity, or cost
Informing loss-control targeting (which accounts to inspect)	MSHA-regulated mining or FMCSA-regulated motor carrier risk assessment

15. Recommended Next Steps Before Production Use

Calendar-year trend adjustment: Fit a multiplicative trend factor to eliminate the ~25% aggregate over-prediction. Re-estimate this factor each year as new ITA data becomes available.
Parallel run: Deploy alongside the current peer grading system for 6-12 months. Compare predictions against realized outcomes for the parallel period before retiring peer grading for frequency estimation on ITA-like accounts.
Exposure projection protocol: Define how submitted or projected hours/payroll will be used at quote time in place of the realized hours used in backtesting. Validate that the rate prediction remains calibrated when paired with exposure estimates.
Segment-level validation: Evaluate discrimination and calibration separately by NAICS sector, state, and size band. Identify segments where the model underperforms and consider segment-specific adjustments or exclusions.
Holdout validation: Run the model on the reserved 2023→2024 holdout set to confirm out-of-time stability before production deployment.
Model governance documentation: Prepare model risk management documentation including ongoing monitoring plan, retraining schedule, performance thresholds that would trigger model review, and escalation procedures.
Credibility and applicability flags: Expose the credibility weight Z and an ITA-applicability flag in the model output so underwriters can see how much of the prediction is individually experience-rated vs. class-rated, and whether the account falls within the model's training population.

Predictive Modeling of Workers' CompensationInjury Frequency Using Federal OSHA Data