The legal question the Finance Bill 2025 asks is, in form, an evidential one: given the information reasonably available, what did the contractor know, and did they act in proportion to it?
This note sets out how we approach that question at Tax Radar. It explains why a transparent, multi-source scoring framework is a better fit for the legal standard than either a checklist or a black-box model, what categories of signal we treat as informative, how those signals are calibrated, and what is deliberately out of scope.
1. Why checklists are insufficient
Standard CIS due diligence relies on verifying a small set of attributes at onboarding: registration status, identity, basic credentials. The structure of this approach has three weaknesses.
- It treats subcontractors as point observations. Each one is evaluated in isolation and at a single moment in time. The window between a subcontractor’s last verification and HMRC’s first enquiry is precisely where the “should have known” test bites, and a checklist sees nothing in that window.
- It assumes signals are reliable on their face. A subcontractor engaged in fraud has every incentive to pass an obvious check. The dangerous cases tend to be the ones that look ordinary at the point of verification.
- It conflates absence of a flag with evidence of compliance. Absence of a flag is evidence of nothing in particular, unless the system has actively looked for, and failed to find, specific patterns.
In every detection domain, whether card fraud, AML, or sanctions screening, these three weaknesses produce the same failure mode: the system catches the obvious cases and misses the costly ones.
2. The legal standard, as a system requirement
The “knew or should have known” standard, expressed operationally, requires a contractor to demonstrate two things:
- They had access to a reasonable set of risk signals.
- They acted on those signals in proportion to what the signals suggested.
Three implications follow, and they shape every design decision in our engine.
- The audit trail is the deliverable, not the score. HMRC and the courts assess the contractor’s response to the information they had. The score is an input to that response, not a substitute for it. Every assessment we generate is paired with a structured, time-stamped record of the factors that drove it, retained for seven years.
- Explainability is not optional. A score without an attributable rationale is not a defence. Under UK GDPR Article 22, decisions based solely on automated processing that produce legal or similarly significant effects for a data subject must be accompanied by meaningful information about the logic involved. A black-box model fails that test by construction.
- Proportionate scrutiny in the presence of moderate evidence is reasonable care. The system must therefore distinguish gradations rather than merely “pass” and “fail”, and tie each gradation to a defined action.
These three constraints are why we built a calibrated, transparent rule engine rather than a learned classifier. A learned model could in principle catch more subtle patterns, but at the current stage we judge the explainability cost to outweigh the marginal gain. The engine is a calibrated rule framework by deliberate choice, not by default.
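As an illustration of gradations tied to defined actions, a graded standard maps score ranges to responses rather than collapsing to pass/fail. The bands and actions below are invented for illustration, not the engine's actual configuration:

```python
def required_action(score: float) -> str:
    """Map a risk score in [0, 100] to a defined response.

    Bands and actions are illustrative placeholders only.
    """
    if score < 20:
        return "proceed: standard onboarding checks"
    if score < 50:
        return "proceed with enhanced documentation"
    if score < 80:
        return "escalate: commercial justification required"
    return "hold: senior review before any engagement"
```

The point is structural: each gradation carries a named action, so the audit trail can show not only what the score was but what response it called for.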
3. What we look at, and why
Our engine combines signals from a small number of authoritative sources. Each one is independently maintained, periodically refreshed, and carries known evidential weight in HMRC and tribunal proceedings.
Identity, status, and registration
- HMRC CIS verification confirms whether a subcontractor is registered and which deduction rate applies (Gross, Standard, or Higher). The tax treatment is informational rather than scored, but its absence, when registration would be expected, is itself a flag.
- HMRC VAT verification (Notice 726 compliance) is performed on every subcontractor with a VAT number. De-registration, mismatched names, and invalid formats are weighted distinctly from “check unavailable”, because the evidential value of each is different.
- Companies House provides company status, filing currency, officer history, PSC structure, and insolvency status.
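The distinction between an adverse VAT result and "check unavailable" can be sketched as an outcome-to-weight mapping. The weights below are invented for illustration; the engine's real constants are not published:

```python
# Illustrative weights only; real calibration constants are not published.
VAT_OUTCOME_WEIGHTS = {
    "verified": 0.0,           # positive confirmation contributes nothing
    "check_unavailable": 0.1,  # absence of evidence: mild uncertainty
    "format_invalid": 0.4,     # likely data error, warrants follow-up
    "name_mismatch": 0.6,      # possible hijacked or misused number
    "deregistered": 0.8,       # strong adverse signal
}

def vat_contribution(outcome: str) -> float:
    # Unknown outcomes are treated like an unavailable check, not ignored.
    return VAT_OUTCOME_WEIGHTS.get(outcome, VAT_OUTCOME_WEIGHTS["check_unavailable"])
```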
Director-level signals
- OFSI sanctions screening runs against the consolidated list with calibrated fuzzy matching, tuned conservatively against known true positives and negatives. A match is the highest single contribution to the score.
- HMRC Deliberate Tax Defaulters list is screened with the same fuzzy-matching approach. False negatives on this list carry the highest legal cost, so the threshold is set conservatively.
- Companies House Disqualified Directors register is screened on every director. Engaging a disqualified director is a criminal offence under the Company Directors Disqualification Act 1986, which is reflected in the weight assigned to a match.
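A minimal sketch of fuzzy name screening, using normalisation plus a similarity ratio from the standard library's `difflib`. This is not the engine's actual matcher; the real thresholds are calibrated per list, and for lists where false negatives are costlier (such as the deliberate defaulters list) the threshold is set lower, not higher:

```python
import difflib
import re

def normalise(name: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def is_possible_match(candidate: str, listed: str, threshold: float = 0.9) -> bool:
    """Flag near-matches for review. The threshold is illustrative:
    a higher value suppresses false positives, a lower one suppresses
    false negatives, and the right setting depends on which error
    is costlier for the list in question."""
    ratio = difflib.SequenceMatcher(
        None, normalise(candidate), normalise(listed)
    ).ratio()
    return ratio >= threshold
```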
Phoenix indicators
Phoenix activity rarely shows itself in any single field. We score it from a combination of:
- dissolved company counts in a director’s recent history
- an unusually high number of concurrent or recent directorships
- rapid sequential incorporation patterns over a defined recent window
- company age, with sharper weighting for very recently incorporated entities
- officer turnover and rapid officer churn within a defined recent window
No single one of these is sufficient. The combination is what carries signal, and the combination is what the engine is designed to surface.
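The "combination carries the signal" idea can be sketched as capped sub-scores feeding an aggregate, so no single field dominates and only the combination crosses a reporting threshold. All field names, weights, and caps below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class DirectorProfile:
    # Illustrative fields; the real engine's inputs and windows differ.
    dissolved_companies_recent: int   # dissolutions in the look-back window
    concurrent_directorships: int
    incorporations_recent: int        # new companies in the look-back window
    company_age_months: int
    officer_changes_recent: int

def phoenix_score(p: DirectorProfile) -> float:
    """Each indicator is capped so no single field dominates;
    the aggregate is what carries the signal."""
    s = 0.0
    s += min(p.dissolved_companies_recent, 3) * 0.15
    s += 0.2 if p.concurrent_directorships > 5 else 0.0
    s += min(p.incorporations_recent, 3) * 0.1
    s += 0.25 if p.company_age_months < 6 else (0.1 if p.company_age_months < 18 else 0.0)
    s += 0.15 if p.officer_changes_recent > 2 else 0.0
    return min(s, 1.0)
```

A director with one long-standing company scores near zero; a profile combining recent dissolutions, rapid incorporation, and a very young current entity scores high even though each field alone is unremarkable.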
Commercial reality (three-tier benchmarking)
For every subcontractor whose pricing data we hold, we benchmark against three tiers.
Tier 1, the compliance floor. Statutory minima: National Living Wage by age band, employer NI on earnings above the Secondary Threshold, auto-enrolment pension contributions, and statutory holiday pay. These define the absolute floor below which a quoted rate is mathematically incompatible with lawful PAYE employment.
Tier 2, the statutory cost stack. What it would cost a compliant employer to deliver the same labour after on-costs. This is the line below which legitimate competition is implausible without offsetting efficiencies.
Tier 3, market reality. ONS Annual Survey of Hours and Earnings (ASHE) hourly pay distributions by region and percentile, with a calibrated self-employed uplift to reconcile ASHE’s PAYE basis with CIS subcontractor economics. Cross-referenced with Hudson Contract’s published self-employed pay trends where available.
Variance against the Tier 3 distribution drives a graded score. Below-market pricing is weighted more heavily than above-market pricing, because below-market pricing is the diagnostic signature of Mini Umbrella Company, missing-trader, and labour-only fraud schemes. Pricing thresholds were chosen to balance recall against false-positive rate after empirical review.
An unusually high labour-to-materials ratio is scored separately as a structural indicator of labour-only supply, which HMRC treats as elevated-risk territory.
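The Tier 1 floor is straightforward arithmetic. A sketch using illustrative 2024/25-era figures (NLW £11.44/hour for 21+, employer NI at 13.8% above a weekly Secondary Threshold of £175, a 3% employer pension contribution simplified to apply to all pay rather than qualifying earnings, and a 12.07% holiday uplift). These rates change each April; the engine itself uses maintained statutory tables:

```python
def tier1_floor_hourly(
    nlw: float = 11.44,             # illustrative National Living Wage (21+)
    employer_ni_rate: float = 0.138,
    ni_secondary_threshold_weekly: float = 175.0,
    pension_rate: float = 0.03,     # auto-enrolment employer minimum, simplified
    holiday_uplift: float = 0.1207, # statutory holiday accrual uplift
    hours_per_week: float = 40.0,
) -> float:
    """Minimum lawful cost of PAYE employment, per hour worked."""
    weekly_pay = nlw * hours_per_week
    ni = max(0.0, weekly_pay - ni_secondary_threshold_weekly) * employer_ni_rate
    pension = weekly_pay * pension_rate
    holiday = weekly_pay * holiday_uplift
    return (weekly_pay + ni + pension + holiday) / hours_per_week
```

On these illustrative inputs the floor lands meaningfully above the headline wage, which is the point: a quoted rate at or below the bare NLW is already mathematically incompatible with lawful PAYE delivery once on-costs are included.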
Supply chain depth (HMRC GfC12)
HMRC’s Guidelines for Compliance 12 (GfC12) explicitly identify long subcontracting chains as a fraud indicator. We mirror that guidance with a tier-weighted score:
| Total chain depth | Risk level |
| --- | --- |
| Tier 1 (direct) | Standard |
| Tier 2 | Elevated |
| Tier 3 | High |
| Tier 4 | Very high |
| Tier 5+ | Critical |
Score contributions rise non-linearly with depth, reflecting GfC12’s framing of long chains as a major fraud indicator. From Tier 3 upwards, commercial justification is required and reviewed. Absence of a credible justification adds further weight; presence of one reduces it. Labour-only intermediaries that themselves engage further subcontractors are scored as a distinct factor, because that pattern is the operational signature of umbrella and MSC arrangements.
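The non-linear rise can be sketched as a tier-indexed contribution that roughly doubles per tier, with a credible commercial justification partially offsetting it from Tier 3 upwards. The constants are invented for illustration:

```python
def chain_depth_contribution(tier: int, justified: bool = False) -> float:
    """Score contribution rising non-linearly with chain depth.

    Tier 1 = direct engagement. Constants are illustrative only.
    """
    if tier <= 1:
        return 0.0
    base = 0.05 * (2 ** (tier - 2))   # 0.05, 0.10, 0.20, 0.40, ...
    contribution = min(base, 0.6)     # deep chains capped at "critical"
    if tier >= 3 and justified:
        contribution *= 0.5           # credible justification reduces weight
    return contribution
```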
Trading address
Trading from a residential address is scored as a moderate indicator. It is not damning in itself. Many legitimate sole traders do. But in combination with phoenix or pricing signals, it is one of the patterns that distinguishes a shell from a going concern.
4. Calibration, not just rules
Calling the engine “rule-based” is technically accurate but understates what is actually engineered. The non-trivial work is not in writing rules. It is in:
- Choosing thresholds that are stable across contexts. Every threshold in the engine, whether for name matching, pricing variance, recency, or ratio analysis, is the output of a calibration exercise against known cases rather than a number chosen by intuition. We periodically revisit them as the dataset and the threat landscape evolve.
- Choosing weights that survive adversarial pressure. A subcontractor designing themselves to evade a single check should not collapse the score. Weights are distributed so that the high-cost failure modes (sanctions, defaulters, disqualification, deep chains) cannot be offset by improvements elsewhere.
- Choosing data sources that are independently maintained. A signal whose ground truth we control is not a signal. It is an opinion. Every authoritative input above is sourced from an external register that updates on its own cadence.
- Continuous re-verification with material-change detection. Every subject is re-screened on a recurring schedule. The system distinguishes cosmetic changes from material ones (risk-band transitions, new red flags, CIS status changes, VAT de-registrations, officer movements) and only the material ones generate alerts. This is the part of the system that addresses the “should have known” gap between verifications.
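Material-change detection reduces, in essence, to diffing two snapshots and alerting only on fields designated material. A sketch, with invented field names and an invented materiality set:

```python
# Which snapshot fields count as material is a policy choice;
# this set is illustrative.
MATERIAL_FIELDS = {"risk_band", "cis_status", "vat_registered", "red_flags"}

def material_changes(previous: dict, current: dict) -> dict:
    """Return only changed fields that warrant an alert; cosmetic
    changes (e.g. a reformatted trading name) are ignored."""
    return {
        field: (previous.get(field), current.get(field))
        for field in MATERIAL_FIELDS
        if previous.get(field) != current.get(field)
    }
```

A risk-band transition generates an alert; a trading name reformatted from "A Ltd" to "A Limited" does not, even though both snapshots differ.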
5. What we deliberately do not do
A note on what is out of scope, by design.
- No automated final determinations. Every assessment is flagged for human review at the schema level. The system’s job is to compress a large evidence base into a small number of cases worth attention. The judgement remains with the contractor and their advisers.
- No learned classifiers in the scoring path. A calibrated rule framework gives a stronger explainability and audit position than a learned model, which we consider decisive in the current regulatory environment.
- No cross-tenant inference. Subcontractors are evaluated against external reference data, not against the population of other tenants’ records. This is a privacy and competition decision, deliberately taken.
- No publication of weights and thresholds. The categories of signal and the methodology are public; they are above. The specific weights, thresholds, and calibration constants are commercial information. There is a tension here between Kerckhoffs’s principle and adversarial detection. For static cryptographic primitives, openness is correct. For adaptive detection systems facing intelligent adversaries, it is not. Stripe, Visa, and bank AML teams treat detection logic as commercially sensitive for the same reason. We publish what is needed for accountability, namely the signal categories, the methodology, the audit trail, and the explanation accompanying every score, and no more.
6. What this means for contractors
The practical content of “reasonable care”, under this framing, comes down to three requirements.
- Continuous monitoring. Without it, the contractor cannot evidence what they should have known between verification dates.
- A structured, time-stamped audit trail. Narrative records do not survive scrutiny. Reproducible records, tied to the specific data sources and thresholds applied, do.
- Risk signals tied to identifiable patterns, not opaque scores. A score with a documented rationale, applied consistently, is defensible. A score without one is not.
A contractor does not need to understand the underlying mathematics. They need a system that produces evidence that the underlying mathematics is doing its job, and that, when challenged, can show its workings.
About the author
Josian Quintana Arroyo is a mathematician and data scientist with a background in anomaly detection. He holds an MSc in Mathematics and a BSc in Engineering, and advises Tax Radar on its risk-scoring methodology.