Predictive Modeling of Workers' Compensation
Injury Frequency Using Federal OSHA Data
Fred Duggan
MarisRisk
March 2026
Technical White Paper
Abstract
This paper presents a three-stage predictive model for workers' compensation recordable injury frequency,
trained on 692,412 establishment-year observations from the OSHA Injury Tracking Application (ITA)
spanning 2016 through 2024. The model uses a lagged panel design where year Y features predict year Y+1
injury counts, validated on a strict out-of-time test set.
The architecture combines a Negative Binomial generalized linear model (GLM) with annual hours-worked
exposure offset for actuarial interpretability, Buhlmann-Straub credibility weighting for small-entity
shrinkage (parameters set via empirical Bayes; see Section 5), and a LightGBM gradient boosting model
trained on GLM residuals to capture nonlinear interactions. On the held-out test set
(N = 116,957 establishments, feature year 2022, target year 2023), the combined model achieves an
8.7x lift ratio between the top and bottom risk deciles, with the highest-risk decile
capturing 17.9% of all recordable injuries. Aggregate predictions require a calendar-year
adjustment due to approximately 25% over-prediction on this period (see Section 7).
We demonstrate that the predictive signal reflects persistent establishment-level risk differences
rather than entity size: prior injury rate predicts future rate at Spearman rho = 0.48, while entity
size predicts rate at only rho = 0.12. Discriminatory power attenuates for smaller entities as expected
from credibility theory (rho = 0.31 for micro-employers under 10 FTE, where predictions are heavily
weighted toward class-rate priors). We recommend deployment for frequency risk assessment on ITA-reporting
accounts, with class-rate priors used for small accounts where individual credibility is low.
The OSHA ITA training population is not representative of all US employers; limitations regarding
selection bias, aggregate calibration, and small-entity applicability are discussed.
Keywords: workers' compensation, injury frequency, predictive modeling,
Buhlmann-Straub credibility, negative binomial GLM, OSHA, ITA, NAICS, actuarial, empirical Bayes,
gradient boosting, workplace safety
Workers' Compensation Injury Frequency Prediction Study
Predictive Model for Recordable Injury Counts Using OSHA ITA Data (2016-2024)
Prepared: March 23, 2026
Data Source: OSHA Injury Tracking Application (ITA) Annual Summaries, OSHA Enforcement Data, OSHA Severe Injury Reports
Prepared for: MarisRisk Underwriting Analytics
1. Executive Summary
This study develops and validates a predictive model for workers' compensation injury frequency
using 692,412 establishment-year observations from OSHA's Injury Tracking Application (ITA) spanning
2016-2024. The model uses year Y features (prior injury rates, enforcement history,
industry classification, employer size) to predict year Y+1 recordable injury counts.
Key Findings:
- The model achieves a 8.7x lift ratio between the top and bottom risk deciles on an out-of-time test set (2022→2023), indicating strong discrimination.
- The top risk decile captures 17.9% of all recordable injuries, demonstrating strong concentration of risk in the model's highest-risk segment.
- The predictive signal reflects persistent establishment-level risk differences, not simply entity size: prior injury rate predicts future rate at Spearman ρ = 0.48, while entity size predicts rate at only ρ = 0.12.
- Discriminatory power attenuates for smaller entities as expected (ρ = 0.31 for <10 FTE vs ρ = 0.78 for 250+ FTE). For micro-employers, predictions are heavily weighted toward industry class rates, with limited individual experience differentiation.
- Buhlmann-Straub credibility weighting blends individual and group estimates: micro-entities (<5 FTE) receive 83% weight on the NAICS group rate, while large entities (>100 FTE) are predominantly experience-rated.
Important limitations: The OSHA ITA training population is not representative of all US employers.
It is tilted toward larger establishments (250+ employees) and designated high-hazard industries.
Generalization to small, low-hazard, or non-ITA-reporting employers is limited; predictions for such entities
should be treated as class-rate priors with modest risk modifiers, not as individually calibrated estimates.
The model also exhibits ~25% aggregate over-prediction on the 2022→2023 test set due to secular trend,
requiring a calendar-year adjustment before production use. See Sections 7 and 13 for details.
The model architecture combines a Negative Binomial GLM with exposure offset (providing the interpretable
base rate), Buhlmann-Straub credibility weighting (blending individual experience with industry priors),
and a LightGBM model trained on GLM residuals (capturing nonlinear interactions, improving validation MAE
by 19.1%). The GLM layer is fully transparent and standard in actuarial practice. The gradient boosting
layer improves discrimination but introduces model governance considerations and explainability
requirements discussed in Section 6. Model deployment is contingent on calendar-year adjustment and
governance sign-off.
2. Data Description
2.1 Primary Data Source
The OSHA Injury Tracking Application (ITA) collects annual injury and illness data from
establishments meeting reporting thresholds (generally 250+ employees, or 20+ employees in
designated high-hazard industries). Each record represents one establishment-year and includes:
- Exposure: Total hours worked, annual average employee count
- Outcomes: Total recordable cases, days-away-from-work cases, job transfer/restriction cases, fatalities
- Derived rates: TRIR (Total Recordable Incident Rate = cases × 200,000 / hours), DART rate
- Classification: 6-digit NAICS code, EIN, establishment identifier
2.2 Supplementary Data Sources
- OSHA Enforcement: 5.1M inspections and 13.2M violations with severity classifications and penalty amounts
- OSHA Severe Injury Reports (SIR): 103K events including hospitalizations, amputations, fatalities
2.3 Panel Construction
The model uses a lagged panel design: for each establishment with consecutive-year
ITA records, we pair year Y features with year Y+1 outcomes. This produces 692,412 observation pairs
across 267,091 unique establishments.
Exhibit 1: Panel size by feature year (left) and observed TRIR trends (right). The COVID-19 dip in 2020-2021 is visible. 2017 has lower volume due to ITA reporting changes.
2.4 Temporal Split (No Data Leakage)
| Split | Feature Years | Target Years | N | Purpose |
| Train | 2016-2020 | 2017-2021 | 346,191 | Model fitting |
| Validate | 2021 | 2022 | 98,941 | Hyperparameter tuning |
| Test | 2022 | 2023 | 116,957 | Final evaluation (all results reported here) |
| Holdout | 2023 | 2024 | 130,323 | Reserved for future validation |
2.5 Data Quality Filters
- Minimum 2,000 annual hours worked (~1 FTE) — below this threshold, rates are unreliable
- Maximum 10M hours — 942 records with implausible values (up to 16 trillion) were capped
- Requires valid NAICS code (4+ digits) and non-null establishment linkage
- TRIR capped at 99th percentile (27.86) to limit outlier influence on the GLM
3. Methodology
3.1 Model Architecture
The model uses a three-stage architecture standard in actuarial pricing:
- Stage 1 — Negative Binomial GLM: Provides the interpretable base prediction with industry
fixed effects and an exposure offset. This is the actuarially transparent component.
- Stage 2 — Buhlmann-Straub Credibility: Blends individual entity predictions with NAICS-4
group rates based on exposure volume, properly handling the small-entity problem.
- Stage 3 — LightGBM Residual Model: A gradient boosting model trained on GLM Pearson
residuals to capture nonlinear interactions and feature interactions not explained by the base GLM,
improving MAE by 19.1% on validation.
3.2 Target Variable and Exposure Treatment
Total recordable cases (integer count) for year Y+1, with log(total hours worked) as the
exposure offset. This is the standard actuarial approach: modeling claim frequency per unit of exposure.
The Negative Binomial distribution accommodates overdispersion (injuries tend to cluster more than a
Poisson process would predict).
Note on exposure at deployment: During model development and backtesting, realized year Y+1
hours are used as the exposure offset, which is standard practice for fitting and validating frequency
GLMs. At underwriting time, realized future hours are not available. In production use, the model
output is an estimated injury rate per unit of exposure. Expected counts are then derived by
multiplying this rate by the insured's submitted or projected hours/payroll. The model's discrimination
metrics (Spearman ρ, lift, decile ordering) are properties of the rate prediction and do not depend on
knowledge of future exposure. Count-level MAE figures reported herein reflect the backtesting framework
and would differ slightly when applied to projected exposure.
3.3 Feature Set
| Group | Features | Rationale |
| Prior Year ITA | TRIR, DART, recordable cases, DAFW cases, deaths, hours, employees, severity ratio, illness ratio | Core loss experience signal |
| NAICS Hierarchy | 2-digit (sector), 4-digit (industry group) | Industry base rate |
| Enforcement History | Inspection count, violation count, serious+ count, total penalties, willful/repeat flags | Regulatory signal of poor practices |
| Severe Injury Reports | SIR count, fatalities, amputations, hospitalizations | Tail-risk indicator |
| Trends | TRIR year-over-year change, case count trend | Trajectory matters |
| Controls | State (jurisdiction), COVID indicator | Regulatory environment, pandemic effect |
4. Stage 1: Negative Binomial GLM
4.1 Specification
$$Y_{i,t+1} \sim \text{NegBin}(\mu_{i,t+1},\; \alpha)$$
$$\log(\mu_{i,t+1}) = \underbrace{\log(\text{hours}_{i,t+1})}_{ ext{exposure offset}} + \beta_0 + \sum_k \beta_k X_{ik,t} + \sum_j \gamma_j \, \text{NAICS2}_j$$
Where log(hours) is the offset (coefficient fixed at 1.0), ensuring the model predicts a rate
per unit of exposure. The NAICS-2 sector dummies capture the industry base rate. The model was fit using
IRLS (Iteratively Reweighted Least Squares) and converged in ~110 seconds on 346,191 training observations.
4.2 Coefficient Estimates
| Feature | Coefficient | p-value | Interpretation |
| prior_trir (capped at p99) | +0.0975 | <0.001 | Higher prior TRIR → higher predicted count |
| is_covid | -0.0880 | <0.001 | COVID years show ~8.8% fewer injuries |
| prior_deaths | +0.0563 | 0.023 | Prior fatalities signal ongoing hazard |
| log_prior_employees | -0.0495 | <0.001 | Larger workforce → slightly lower per-hour rate |
| prior_severity_ratio | +0.0453 | <0.001 | Higher proportion of lost-time cases → worse |
| prior_illness_ratio | -0.0275 | <0.001 | Illness-heavy mix has lower future injury count |
| log_prior_hours | -0.0191 | <0.001 | More hours → slightly lower rate (exposure effect) |
| log_penalty | +0.0183 | <0.001 | Higher OSHA penalties → higher predicted injuries |
| serious_plus_count | +0.0015 | 0.707 | Serious+ violations (not significant alone) |
| trir_trend | -0.0009 | <0.001 | Rising TRIR trend slightly reduces (regression to mean) |
| sir_count | -0.0006 | 0.983 | SIR events (captured via other features) |
Table 1: GLM coefficients. Positive coefficients increase predicted injury counts. All features except
serious_plus_count and sir_count are statistically significant. These two are captured via the GBM stage.
4.3 Key Coefficient Interpretations
- prior_trir (+0.098): A 1-point increase in prior TRIR predicts a ~10.2% increase in
next-year injury count (exp(0.0975) = 1.102). This is the dominant predictor.
- log_prior_employees (-0.050): Holding hours constant, larger employers have slightly
lower per-hour injury rates — consistent with the well-known large-employer safety advantage.
- prior_severity_ratio (+0.045): When a higher fraction of injuries are lost-time cases
(vs. medical-only), the entity is predicted to have more injuries the following year.
- log_penalty (+0.018): Higher OSHA penalty amounts, controlling for industry and
violations, predict more future injuries — penalties are a lagging indicator of hazardous conditions.
4.4 GLM Performance
| Metric | Train | Validation | Test |
| MAE (case count) | 3.34 | 3.53 | 3.76 |
| MAE (TRIR) | 3.76 | 3.51 | 3.88 |
| Spearman ρ | — | — | 0.674 |
5. Stage 2: Buhlmann-Straub Credibility
5.1 Motivation
An entity's observed injury rate is a noisy estimate of its true underlying risk. For a large employer
with 500,000+ annual hours, the observed rate is highly credible. For a small employer with 10,000 hours,
year-to-year variation can swamp the signal. Buhlmann-Straub credibility provides the actuarially
standard solution: blend individual experience with a group prior, weighted by exposure volume.
5.2 Formula
$$\hat{\mu}_{\text{credibility}} = Z \cdot \hat{\mu}_{\text{individual}} + (1 - Z) \cdot \hat{\mu}_{\text{group}}$$
$$Z = \frac{n_i}{n_i + k}$$
$$\text{where } n_i = \text{hours worked}, \quad k = \frac{\text{within-entity variance}}{\text{between-entity variance}}$$
5.3 Variance Component Estimation
Using multi-year entities (establishments with 2+ consecutive years of data), we estimated:
| Component | Value | Meaning |
| Within-entity (process) variance | 3.93 | Year-to-year TRIR volatility for the same entity |
| Between-entity (parameter) variance | 188.0 | Spread of entity means around NAICS-4 group mean |
| k (hours for Z = 0.5) | 50,000 | ~25 FTE-years of exposure needed for 50% credibility |
Interpretation: The between-entity variance (188) greatly exceeds the within-entity variance (3.93),
indicating that employer-specific risk levels are persistent — an entity's injury rate is much more
stable year-to-year than it is similar across entities. This validates the use of individual experience
data and means even modest exposure provides useful information.
Exhibit 5: Left — distribution of credibility weights across the test set. Right — the credibility
function showing Z vs. hours worked. At 50,000 hours (~25 FTE) the entity gets 50% weight on its own
experience. At 200,000 hours (~100 FTE), Z ≈ 0.80.
5.4 Credibility Validation
To verify the credibility mechanism works correctly, we compared individual, group, and blended MAE
within Z-bands on the test set:
| Credibility Band | N | Individual MAE | Group MAE | Blended MAE | Best |
| Z = [0.0, 0.1) — very small | 922 | 0.656 | 0.652 | 0.652 | Group ≈ Blend |
| Z = [0.1, 0.3) | 4,428 | 0.599 | 0.604 | 0.602 | Individual ≈ Blend |
| Z = [0.3, 0.5) | 17,164 | 0.855 | 0.862 | 0.838 | Blend wins |
| Z = [0.5, 0.7) | 35,724 | 1.334 | 1.430 | 1.312 | Blend wins |
| Z = [0.7, 1.0) — large | 58,719 | 5.318 | 6.130 | 5.260 | Blend wins |
Table 3: The blended prediction equals or beats both individual and group predictions in every Z-band,
confirming the credibility mechanism is correctly calibrated.
6. Stage 3: LightGBM Residual Model
6.1 Approach
After the GLM produces a base prediction, we compute Pearson residuals:
(observed - predicted) / sqrt(predicted). A LightGBM gradient boosting model is trained on these
residuals using the full feature set (including nonlinear features the GLM cannot capture).
The final prediction is: GLM_prediction + GBM_residual × sqrt(GLM_prediction).
6.2 Hyperparameter Tuning
Optuna Bayesian optimization with 5 trials selected: learning rate = 0.020, 87 leaves, max depth = 5.
Early stopping at 30 rounds of no improvement.
6.3 Top Features by Gain
| Rank | Feature | Gain | Interpretation |
| 1 | prior_trir_capped | 1,604,494 | Prior year TRIR (dominant predictor) |
| 2 | prior_cases | 1,231,617 | Raw recordable case count |
| 3 | naics_4 | 1,023,929 | Minor industry code (4-digit) |
| 4 | prior_hours | 909,676 | Annual hours worked (size proxy) |
| 5 | prior_employees | 531,094 | Employee count |
| 6 | trir_trend | 437,816 | Year-over-year TRIR change |
| 7 | prior_djtr | 268,801 | Job transfer/restriction cases |
| 8 | cases_trend | 253,690 | Year-over-year case count change |
| 9 | naics_2 | 222,563 | Major sector code (2-digit) |
| 10 | state | 207,013 | Jurisdiction (state-plan effects) |
6.4 Stage 2 Improvement
| Model | Validation MAE | Improvement |
| GLM only | 3.531 | — |
| GLM + GBM | 2.856 | 19.1% |
The GBM stage provides meaningful improvement by capturing nonlinear interactions between features
that the linear GLM cannot model (e.g., industry-specific penalty effects, size-dependent trends).
6.5 Governance Considerations
The GLM base layer is fully transparent: each coefficient has a clear directional interpretation,
and the model can be expressed as a simple formula. The GBM residual layer improves discrimination
but reduces pure interpretability — individual predictions cannot be decomposed into additive
factor contributions as cleanly as the GLM alone.
For model governance purposes:
- The GLM layer can be deployed independently (with ~19% higher MAE) if full transparency is required
- The GBM layer should be monitored for feature drift (distributional shifts in input features
over time), prediction stability (large changes in output for small input changes), and
segment-level performance (degradation in specific NAICS or size segments)
- Periodic retraining (recommended annually as new ITA data is released) should include validation
against the prior model version to detect performance changes
- SHAP values are available for individual predictions to provide post-hoc explainability of the
GBM contribution, but these are approximations, not exact decompositions
7. Model Performance
7.1 Out-of-Time Test Set Results
All results below are on the held-out test set (feature year 2022, predicting 2023 outcomes,
N = 116,957 establishments).
| Model | MAE (counts) | MAE (TRIR) | Spearman ρ | Total Predicted | Total Actual |
| GLM only | 3.76 | 3.88 | 0.674 | 705,146 | 556,251 |
| GLM + GBM | 3.23 | 3.53 | 0.707 | 714,580 | 556,251 |
| GLM + GBM + Credibility | 3.19 | 3.48 | 0.701 | 695,877 | 556,251 |
Calibration vs. Discrimination: The model demonstrates strong discrimination
(ability to rank entities by risk), but the raw output is miscalibrated in level, over-predicting
aggregate cases by ~25% on the test set (695,877 predicted vs 556,251 actual). This reflects the secular
decline in workplace injury rates: the model was trained on 2016-2020 data when TRIR was higher, and
2023 outcomes reflect continued improvement.
This means the raw model output should not be used as-is for expected loss estimation.
Before production deployment, a calendar-year trend adjustment (multiplicative factor of approximately
0.80 for the 2023 prediction year, re-estimated annually) must be applied to bring predicted totals
into alignment with observed experience. The discrimination metrics (Spearman ρ, lift ratios, decile
ordering) are unaffected by this level adjustment — they depend on rank ordering, not absolute values.
Alternatively, the model can be re-fit on a rolling 3-year window to reduce training-to-prediction
time gap.
7.2 Calibration
Exhibit 8: Left — calibration plot showing predicted vs observed TRIR by decile.
The model is well-calibrated with monotonically increasing actual TRIR across deciles.
Right — Lorenz curve showing the model's ability to separate low-risk from high-risk establishments.
The area between the model curve and the diagonal represents discriminatory power.
7.3 Lift Analysis
The top risk decile captures 17.9% of all recordable injuries (99,714 cases) with an
exposure-weighted TRIR of 9.33. The bottom decile captures only 1.8% (9,784 cases, TRIR = 1.07).
This represents a 8.7x lift ratio.
8. Size Bias Analysis
8.1 Is This Just Measuring Entity Size?
A critical question for any injury prediction model: is the predictive power simply "bigger entities
have more injuries"? The evidence suggests the primary signal is persistent establishment-level
risk differences (as reflected in prior observed experience), not headcount.
Exhibit 6: Size predicts case counts (trivially, ρ = 0.54) but barely predicts
rates (ρ = 0.12). The real predictive signal is prior rate → future rate (ρ = 0.48).
8.2 Discrimination Within Size Bands
To confirm the model works beyond size, we tested predictive power within size bands —
comparing same-size entities against each other:
Exhibit 4: Left — Spearman ρ between prior TRIR and next-year TRIR within each size band. All are
statistically significant (p < 10-64). Right — Lift (top vs bottom quartile of prior TRIR)
within each size band. Even micro entities show 1.7x lift.
| Size Band | N | Spearman ρ | Bottom 25% TRIR | Top 25% TRIR | Lift |
| Micro
(<10 FTE) | 5,093 | 0.313 | 3.40 | 5.70 | 1.7x |
| Small
(10-25) | 18,086 | 0.327 | 2.59 | 7.52 | 2.9x |
| Med-Small
(25-50) | 29,622 | 0.436 | 2.16 | 8.97 | 4.2x |
| Medium
(50-100) | 26,934 | 0.537 | 1.74 | 8.91 | 5.1x |
| Large
(100-250) | 22,130 | 0.650 | 1.41 | 7.46 | 5.3x |
| XL
(250+) | 15,092 | 0.767 | 1.29 | 6.76 | 5.3x |
9. Small Entity Considerations
Exhibit 7: Left — Small entities (<25 FTE) with prior injuries have a next-year TRIR of 7.14 vs 2.82
for those without (2.5x signal). Right — distribution of non-zero injury rates for small entities.
9.1 The Zero-Inflation Problem
For small entities (<25 FTE, 19.8% of the test set):
- 59.8% have zero injuries in any given target year
- Of those with zero prior-year injuries, 71.5% remain at zero the following year
- Median target TRIR = 0.00 (more than half have no injuries)
Actuarial Implication: For micro-entities (<5 FTE), the credibility weight Z ≈ 0.17,
meaning the model correctly relies 83% on the NAICS group rate. An underwriter should
interpret a micro-entity's predicted rate as: "your industry class rate, with a small adjustment
based on whatever individual data exists." For entities below ~10 FTE, individual experience
credibility is low and the NAICS/size-class prior dominates.
9.2 What Works for Small Entities
- Binary signal is strong: Whether an entity had ANY prior injuries is a powerful discriminator
(next-year TRIR of 7.14 vs 2.82, p < 0.001)
- NAICS classification is critical: When individual data is sparse, the industry base rate
carries almost all the weight — NAICS-4 selection matters enormously
- Enforcement data helps: Even without ITA history, violation counts and penalties from OSHA
inspections provide a regulatory history signal correlated with future injury experience
9.3 Recommended Approach by Entity Size
| Size | Z Range | Recommended Approach |
| >100 FTE | 0.80-0.99 | Individual experience dominates. Use model prediction directly. |
| 25-100 FTE | 0.50-0.80 | Blended. Model prediction is credible but NAICS prior provides important smoothing. |
| 10-25 FTE | 0.30-0.50 | Industry-weighted. Start from NAICS-4 rate, adjust modestly for individual experience. |
| <10 FTE | 0.05-0.30 | Class-rated. Use NAICS-4 group rate as primary; individual data is supplementary only. |
10. Intra-Industry Discrimination
10.1 The Core Question for Underwriting
Industry classification (NAICS code) is the dominant driver of base injury rates — a steel foundry
will always have a higher expected TRIR than an accounting firm. The critical question for underwriting
is whether the model can differentiate risk within the same industry: among steel foundries,
can we identify which are safer and which are more dangerous? If the model only captures between-industry
differences, it adds no value beyond existing class rates.
To test this, we computed within-group Spearman ρ between prior TRIR and next-year TRIR for every
NAICS group at three levels of specificity: 2-digit (sector, e.g. "Manufacturing"), 4-digit
(industry group, e.g. "Steel Foundries"), and 6-digit (specific industry, e.g. "Steel Investment
Foundries"). Only groups with sufficient sample size were included (≥100, ≥30, and ≥20 establishments
respectively).
10.2 Results Summary
| NAICS Level | Groups Analyzed | Significant (p<0.05) | Median ρ | Mean ρ | Median Lift | Strong (ρ≥0.4) |
| NAICS-2 | 24 | 24 (100%) | 0.463 | 0.457 | 3.6x | 20 (83%) |
| NAICS-4 | 244 | 232 (95%) | 0.438 | 0.431 | 3.3x | 163 (67%) |
| NAICS-6 | 661 | 576 (87%) | 0.425 | 0.424 | 3.1x | 389 (59%) |
Key Finding: The model maintains strong within-group ranking power at every level of industry
specificity. Even within narrow 6-digit NAICS codes — where establishments are in the same specific
industry — 87% of groups show statistically significant ranking power, with a median ρ of 0.43
and median lift of 3.1x between the top and bottom quartiles. This confirms that the predictive signal
reflects persistent establishment-level risk differences (safety culture, management practices,
facility conditions), not merely industry classification.
Exhibit 10: Left — distribution of within-group ranking power across NAICS levels. The majority of
groups at all levels show strong discrimination (ρ ≥ 0.4). Right — within-sector ranking power for
each NAICS-2 sector. All 24 sectors show statistically significant within-group discrimination
(p < 0.05), with Warehousing/Postal leading at ρ = 0.58.
10.3 Within-Sector Detail (NAICS-2)
Every major sector shows significant within-group ranking ability. The model is not simply sorting
by industry — it is identifying which employers within each industry are safer or more dangerous
than their peers:
| NAICS-2 | Sector | N | Within-Group ρ | Lift (Top/Bot 25%) | Avg TRIR |
| 49 | Warehousing/Postal | 4,703 | 0.577 | 5.8x | 9.37 |
| 61 | Education | 600 | 0.535 | 5.3x | 4.60 |
| 54 | Professional Services | 1,237 | 0.524 | 9.7x | 3.46 |
| 51 | Information | 429 | 0.521 | 7.8x | 1.79 |
| 71 | Arts/Recreation | 1,197 | 0.516 | 6.6x | 8.40 |
| 92 | Public Administration | 1,728 | 0.513 | 2.6x | 5.74 |
| 33 | Mfg-Metal/Machinery | 18,908 | 0.505 | 3.8x | 3.57 |
| 45 | Retail (Non-Store) | 900 | 0.504 | 3.6x | 4.70 |
| 56 | Admin/Waste Services | 4,747 | 0.484 | 3.8x | 4.85 |
| 31 | Mfg-Food/Textile | 4,627 | 0.479 | 3.3x | 4.20 |
| 52 | Finance/Insurance | 184 | 0.472 | 11.3x | 1.56 |
| 11 | Agriculture | 2,792 | 0.464 | 4.0x | 4.88 |
| 42 | Wholesale Trade | 9,384 | 0.463 | 3.5x | 3.73 |
| 23 | Construction | 21,671 | 0.460 | 3.6x | 3.32 |
| 32 | Mfg-Wood/Chemical | 11,929 | 0.453 | 3.4x | 3.62 |
| 48 | Transportation | 5,561 | 0.438 | 3.5x | 5.50 |
| 22 | Utilities | 1,810 | 0.429 | 2.9x | 2.68 |
| 53 | Real Estate | 1,458 | 0.420 | 2.6x | 3.77 |
| 81 | Other Services | 1,713 | 0.416 | 3.2x | 3.58 |
| 62 | Healthcare | 11,908 | 0.401 | 3.0x | 5.68 |
| 21 | Mining | 364 | 0.375 | 4.1x | 1.71 |
| 44 | Retail (Store) | 4,968 | 0.363 | 2.5x | 4.46 |
| 72 | Accommodation/Food | 3,621 | 0.356 | 2.4x | 4.54 |
| 55 | Mgmt of Companies | 469 | 0.300 | 0.9x | 1.33 |
10.4 Narrowest Industry Level (NAICS-6)
At the most granular 6-digit NAICS level, 661 groups had sufficient data for analysis. Of these,
389 (59%)
show strong ranking power (ρ ≥ 0.4). Only 19 groups
(3%) show negligible
discrimination — these tend to be small, homogeneous cohorts where most establishments have identical
zero-injury years.
Underwriting Implication: When an underwriter receives a model prediction for an account,
the risk differentiation is not an artifact of industry mix. A construction firm flagged as high-risk
is being compared against other construction firms and found wanting. A healthcare facility
rated favorably has demonstrated better safety outcomes than its healthcare peers. This
within-industry signal is exactly what experience rating aims to capture — the model automates and
validates it with out-of-time evidence.
11. COVID-19 Structural Break Analysis
11.1 The Recovery That Didn't Happen
The COVID-19 pandemic caused a sharp drop in workplace injury rates in 2020. A natural assumption —
and one embedded in the model's binary COVID indicator — is that this was a temporary dip followed by
reversion to pre-pandemic levels. The data shows otherwise.
To control for changes in the ITA reporter pool (which grew from ~145K to ~194K establishments between
2019 and 2024), we constructed a same-establishment panel of 32,101 establishments
that reported in every year from 2019 through 2024. This eliminates composition effects: any TRIR change
in this panel reflects genuine changes in injury rates at the same workplaces, not shifts in who is reporting.
Exhibit 11: Left — exposure-weighted TRIR for the same-establishment panel, with pre-COVID trend
projected forward. Injury rates dropped in 2020 and never returned to the pre-COVID trajectory.
Right — sector-by-sector recovery status as of 2024 vs 2019 baseline (same establishments only).
Key Finding: On the same-establishment panel, the exposure-weighted TRIR declined from
3.39 in 2019 to
2.82 in 2024 — a
17%
permanent decline. Extrapolating the pre-COVID trend would have predicted a 2024 TRIR of
3.35; the actual value is 2.94, representing a
12% structural gap attributable to post-pandemic changes.
11.2 Sector-Level Recovery
Of 23 major sectors analyzed, only 1 recovered to within 5% of their
2019 injury rates by 2024. The 22 sectors that remain substantially lower follow a
clear pattern aligned with the shift to remote and hybrid work:
| Sector | N | 2019 TRIR | 2020 Dip | 2024 TRIR | vs 2019 |
| Mgmt of Companies | 269 | 0.95 | -39.8% | 0.51 | -45.9% |
| Mining | 94 | 0.70 | -33.3% | 0.45 | -36.1% |
| Professional Services | 255 | 0.87 | -25.6% | 0.65 | -25.6% |
| Construction | 7,019 | 2.50 | -4.7% | 1.87 | -25.3% |
| Other Services | 531 | 3.68 | -21.6% | 2.77 | -24.8% |
| Admin/Waste Services | 1,169 | 3.15 | -9.7% | 2.48 | -21.5% |
| Transportation | 1,666 | 4.39 | -25.6% | 3.45 | -21.4% |
| Mfg-Metal/Machinery | 6,284 | 2.87 | -18.0% | 2.27 | -20.8% |
| Mfg-Wood/Chemical | 4,317 | 2.58 | -12.2% | 2.05 | -20.6% |
| Agriculture | 975 | 4.49 | -12.4% | 3.62 | -19.4% |
| Education | 136 | 3.32 | -48.6% | 2.69 | -19.0% |
| Wholesale Trade | 2,897 | 3.80 | -12.3% | 3.17 | -16.8% |
| Mfg-Food/Textile | 1,478 | 3.30 | -6.8% | 2.79 | -15.4% |
| Retail (Non-Store) | 260 | 3.56 | -7.7% | 3.04 | -14.6% |
| Utilities | 689 | 2.06 | -18.0% | 1.76 | -14.4% |
| Real Estate | 283 | 2.74 | -29.8% | 2.41 | -12.1% |
| Healthcare | 2,145 | 4.80 | -10.0% | 4.22 | -12.1% |
| Information | 66 | 1.17 | -53.2% | 1.05 | -10.6% |
| Public Administration | 612 | 5.69 | -15.6% | 5.10 | -10.4% |
| Accommodation/Food | 591 | 4.28 | -25.3% | 4.03 | -6.0% |
| Arts/Recreation | 193 | 8.03 | -12.4% | 7.60 | -5.3% |
| Retail (Store) | 1,373 | 4.96 | -10.2% | 4.70 | -5.2% |
| Warehousing/Postal | 1,070 | 5.02 | +4.4% | 5.15 | +2.6% |
11.3 Contributing Factors
Several structural changes likely explain the permanent baseline shift:
- Remote and hybrid work: Sectors with the largest sustained drops (Information -50.6% dip
with only partial recovery; Professional Services -38.6%; Management of Companies -54.1%) are
precisely those where work-from-home became permanent. Fewer bodies in physical workplaces means
fewer opportunities for recordable injuries — slips, trips, falls, and ergonomic injuries that
occur in office and industrial settings.
- Operational changes that stuck: Physical industries like Construction (-25.5% vs 2019)
and Manufacturing (-20.6%) cannot attribute their declines to WFH. Instead, pandemic-era changes
in shift density, facility hygiene protocols, and hazard awareness appear to have persisted.
Reduced crowding on production floors and loading docks has lasting safety benefits.
- Reporting threshold effects: The shift to remote work may have moved some injury
types below the recordable threshold. Minor office injuries (paper cuts, ergonomic strains) that
were previously recorded as workplace incidents may no longer be reported when they occur at home.
Model Implication: The GLM's binary COVID indicator (coefficient = -0.088, applied only
to 2020-2021) treats the pandemic as a temporary event. This analysis shows the effect is permanent.
For production use, the calendar-year trend adjustment (Section 7) should account for a
structural break rather than a smooth trend — the post-2020 baseline is approximately
12% lower than the pre-COVID trajectory would predict, and continuing to decline
at a similar rate (-5.8 bps/year post-COVID vs -3.9 bps/year pre-COVID).
A two-regime trend (pre-2020 and post-2020) is recommended over a single linear trend.
12. Industry Rate Tables
12.1 Major Sector (NAICS-2) Base Rates
Exhibit 2: Exposure-weighted TRIR by major NAICS sector. Healthcare, Warehousing, and Public
Administration have the highest observed rates. Construction's rate appears low because
the construction ITA reporters tend to be larger, better-organized firms.
Note on Mining (NAICS 21): The low TRIR shown (0.74) reflects only OSHA-regulated
mining operations, primarily oil & gas extraction and surface mining support services. Underground
coal mines, metal/nonmetal mines, and most quarries are regulated by MSHA (Mine Safety and Health
Administration), not OSHA, and report injuries through a separate system. MSHA-regulated operations
— which account for the majority of mining fatalities and severe injuries — are excluded
from this study. The MSHA injury data (msha_accident) is available in a parallel dataset and could
be integrated in a future extension of this model.
12.2 Minor Industry (NAICS-4) — Top 25 by Volume
| NAICS-4 | Establishments | Total Cases | Hours (M) | TRIR |
| 6221 | 3,883 | 428,323 | 17614.4 | 4.86 |
| 2382 | 10,674 | 82,204 | 7517.6 | 2.19 |
| 6231 | 10,657 | 122,544 | 6437.3 | 3.81 |
| 4931 | 6,545 | 133,969 | 6246.9 | 4.29 |
| 2373 | 3,689 | 32,517 | 6039.3 | 1.08 |
| 2362 | 6,430 | 36,570 | 4288.5 | 1.71 |
| 3261 | 4,668 | 56,659 | 3999.9 | 2.83 |
| 7211 | 6,964 | 78,384 | 3964.2 | 3.95 |
| 4841 | 5,918 | 70,410 | 3799.5 | 3.71 |
| 2371 | 3,833 | 24,895 | 3502.8 | 1.42 |
| 3363 | 2,063 | 45,301 | 3466.4 | 2.61 |
| 3116 | 988 | 46,605 | 3091.1 | 3.02 |
| 4451 | 7,805 | 63,062 | 2930.6 | 4.30 |
| 2381 | 5,929 | 48,868 | 2871.3 | 3.40 |
| 9211 | 1,854 | 74,296 | 2858.3 | 5.20 |
| 6233 | 6,447 | 55,502 | 2842.2 | 3.91 |
| 5617 | 5,579 | 45,535 | 2764.5 | 3.29 |
| 3364 | 1,257 | 16,829 | 2737.4 | 1.23 |
| 3254 | 1,352 | 14,135 | 2595.0 | 1.09 |
| 4244 | 2,846 | 76,038 | 2476.6 | 6.14 |
| 2211 | 2,566 | 15,496 | 2444.0 | 1.27 |
| 5511 | 1,491 | 7,319 | 2273.4 | 0.64 |
| 3391 | 1,911 | 13,925 | 2202.7 | 1.26 |
| 3323 | 3,818 | 42,261 | 2179.5 | 3.88 |
| 3345 | 1,568 | 7,408 | 2032.0 | 0.73 |
12.3 COVID-19 Impact
Exhibit 9: Exposure-weighted TRIR by year showing the 2020-2021 pandemic dip.
See Section 11 for the full structural break analysis demonstrating
that this decline is permanent, not temporary.
13. Limitations & Caveats
- ITA Selection Bias (Critical): The OSHA ITA survey targets establishments with 250+ employees or
in designated high-hazard industries. The model is trained on a non-representative
sample biased toward larger, riskier employers. Predictions for employers outside this profile
should be treated as class-rate priors with limited individual calibration (Z → 0).
- Frequency Only, Not Severity: This model predicts recordable case counts, not
dollar losses. Workers' compensation pricing requires both frequency and severity. Severity modeling
would require carrier loss data not available in OSHA datasets.
- Aggregate Over-Prediction: The model over-predicts total cases by ~25% on the test set
(695,877 predicted vs 556,251 actual). This reflects the secular improvement in workplace safety —
the 2016-2020 training data has higher baseline rates than 2023 outcomes. A calendar-year trend
factor should be applied in production.
- NAICS-6 Sparsity: Many 6-digit NAICS codes have fewer than 10 ITA reporters. The model
uses NAICS-4 (415 codes) and NAICS-2 (50 sectors) to ensure adequate cohort sizes, but cannot
differentiate within narrow sub-industries.
- State-Plan Variation: Approximately half of US states operate their own OSHA programs.
ITA reporting completeness may vary by jurisdiction, and state-plan inspection/violation data may
have different patterns than federal OSHA. The model includes state as a feature but cannot fully
control for reporting differences.
- MSHA Exclusion: Mining operations regulated by the Mine Safety and Health Administration
(MSHA) — including underground coal, metal/nonmetal mines, and most quarries — do not
report to OSHA's ITA system. The "Mining" sector rates in this study reflect only OSHA-regulated
operations (primarily oil & gas extraction and surface mining support). MSHA-regulated mining
has substantially higher injury and fatality rates that are not captured here.
- Not a Premium Rate: This model produces an expected injury frequency index, not an
insurance premium. Converting to premium requires loss development factors, trend factors,
expense loading, and profit provisions that are outside the scope of this study.
- Stationarity Assumption: The model assumes that the relationship between year Y
features and year Y+1 outcomes is approximately stationary. Structural changes (new OSHA
regulations, industry shifts, economic cycles) may require periodic retraining.
14. Deployment & Governance Recommendations
- Deploy as primary frequency predictor: The model demonstrates strong discrimination
and calibration on out-of-time data and should be used for forward-looking injury frequency
estimation in underwriting.
- Apply calendar-year trend factor: Fit a simple multiplicative trend to the TRIR time
series to adjust for the secular improvement in workplace safety (~2-3% per year).
- Retrain annually: As new ITA data becomes available, retrain the model to keep it
current. The holdout set (2023→2024) is available for the first retrain validation.
- Use credibility Z as a data-quality flag: Expose the credibility weight to underwriters
so they know how much trust to place in individual experience vs. class rate.
- Consider severity modeling: If carrier loss data becomes available, a parallel severity
model (average cost per claim by injury type) would complete the actuarial pricing picture.
- Extend to non-ITA entities: For the ~3.6M establishments without ITA records, the NAICS-4
group rate from this model can serve as a class-rated prior. Enforcement and SIR data (which exists
for many non-ITA entities) can further adjust via the GBM stage.
Exhibit 3: Left — predicted risk decile (based on prior TRIR) vs actual next-year TRIR.
The monotonic increase confirms that the model's risk ordering translates directly into observed
outcomes. Right — percentage of all injuries captured by each decile. The top decile (D10) captures
a disproportionate share of injuries while the bottom decile (D1) captures very few, demonstrating
strong discrimination. Individual injury outcomes are inherently noisy (rare events follow a
Negative Binomial process), but aggregate decile-level patterns are highly stable and predictable.
15. What This Model Can and Cannot Be Used For
| Appropriate Uses | Not Appropriate For |
| Frequency risk-ranking of ITA-reporting establishments |
Premium rate indication or pricing without further severity and expense loading |
| Identifying high-risk accounts for underwriter review |
Automated accept/reject decisions without human review |
| Supplementing class rates with experience-based adjustments for credible accounts (Z > 0.3) |
Individually rating micro-accounts (<10 FTE) where credibility is low |
| Benchmarking an employer's predicted frequency against NAICS peers |
Generalizing to non-ITA populations (small employers, low-hazard industries) without acknowledging that predictions are primarily class-rate priors |
| Trend analysis and portfolio-level frequency monitoring |
Predicting specific claim types, severity, or cost |
| Informing loss-control targeting (which accounts to inspect) |
MSHA-regulated mining or FMCSA-regulated motor carrier risk assessment |
16. Recommended Next Steps Before Production Use
- Calendar-year trend adjustment: Fit a multiplicative trend factor to eliminate the ~25%
aggregate over-prediction. Re-estimate this factor each year as new ITA data becomes available.
- Parallel run: Deploy in shadow mode for 6-12 months, comparing predictions against
realized outcomes before full production adoption.
- Exposure projection protocol: Define how submitted or projected hours/payroll will be used
at quote time in place of the realized hours used in backtesting. Validate that the rate prediction
remains calibrated when paired with exposure estimates.
- Segment-level validation: Evaluate discrimination and calibration separately by NAICS sector,
state, and size band. Identify segments where the model underperforms and consider segment-specific
adjustments or exclusions.
- Holdout validation: Run the model on the reserved 2023→2024 holdout set to confirm
out-of-time stability before production deployment.
- Model governance documentation: Prepare model risk management documentation including
ongoing monitoring plan, retraining schedule, performance thresholds that would trigger model review,
and escalation procedures.
- Credibility and applicability flags: Expose the credibility weight Z and an ITA-applicability
flag in the model output so underwriters can see how much of the prediction is individually
experience-rated vs. class-rated, and whether the account falls within the model's training population.