UNSPSC Classification Accuracy: What 90–95% Really Means

Procurement · 28 April 2026 · 6 min read

The number you keep seeing

If you have researched automated UNSPSC classification, you have probably come across a claim of roughly 90–95% automatic classification. Pearstop publishes this figure. So do several competitors.

But what does it actually mean? How is it measured? And what happens to the remaining 5–10%?

These are the questions procurement teams ask before committing to an automated classification system. This article answers them plainly.

How accuracy is defined

"90–95% automatic classification" means that 90–95% of the procurement line items in a given dataset are assigned a UNSPSC commodity code by the automated engine, without human intervention.

The remaining 5–10% are items where the engine's confidence falls below a defined threshold. These items are flagged for human review: a buyer or category manager looks at each item and confirms or corrects the suggested code.
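As a rough illustration of the threshold mechanism described above, the sketch below splits predictions into an auto-classified set and a review queue, then computes the auto-classification rate. The threshold value, the `Prediction` type, the function names, and the sample codes and scores are all assumptions made for the example; the article does not state the values or data structures Pearstop actually uses.

```python
from dataclasses import dataclass

# Hypothetical threshold: the article does not say which value is used in practice.
CONFIDENCE_THRESHOLD = 0.85


@dataclass
class Prediction:
    description: str
    code: str          # proposed 8-digit UNSPSC commodity code
    confidence: float  # engine confidence between 0.0 and 1.0


def route(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Split predictions into auto-classified items and a review queue."""
    auto, review = [], []
    for p in predictions:
        (auto if p.confidence >= threshold else review).append(p)
    return auto, review


def auto_classification_rate(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Share of line items that receive a code without human review."""
    if not predictions:
        return 0.0
    auto, _ = route(predictions, threshold)
    return len(auto) / len(predictions)


# Illustrative data only: codes and confidence scores are placeholders.
sample = [
    Prediction("Office paper A4 80g", "14111507", 0.97),
    Prediction("WK elec H3 Q2-26", "72151500", 0.41),
    Prediction("Safety gloves size 9", "46181504", 0.92),
    Prediction("Cleaning services March", "76111501", 0.88),
]
auto, review = route(sample)
rate = auto_classification_rate(sample)
print(f"auto: {len(auto)}, review: {len(review)}, rate: {rate:.0%}")
```

Raising the threshold sends more items to review and lowers the headline rate, which is one reason two vendors quoting the same percentage may not be describing the same thing.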

Accuracy is typically measured at the commodity level, the most specific level of the UNSPSC hierarchy (an 8-digit code). Measuring at segment or family level would produce higher figures, but a far less useful classification.
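To make the hierarchy concrete: an 8-digit UNSPSC code is read in pairs of digits, with the first two giving the segment, the first four the family, the first six the class, and all eight the commodity. The helper below is a minimal sketch of that split; the function name `unspsc_levels` and the sample code are chosen for illustration only.

```python
def unspsc_levels(code: str) -> dict:
    """Return the code truncated to each UNSPSC level, zero-padded to 8 digits."""
    if len(code) != 8 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected an 8-digit UNSPSC code, got {code!r}")
    return {
        "segment": code[:2] + "000000",
        "family": code[:4] + "0000",
        "class": code[:6] + "00",
        "commodity": code,
    }


# Sample code used purely to show the digit positions.
levels = unspsc_levels("72101507")
for name, value in levels.items():
    print(f"{name:<10}{value}")
```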

What makes an item hard to classify automatically?

The items that end up in the review queue tend to share a few characteristics:

Heavily abbreviated descriptions. "WK elec H3 Q2-26" is meaningful to the engineer who wrote it, but tells an algorithm very little. Without context from the supplier, cost element and location, a reliable classification is not possible.

Brand names and part numbers without descriptions. "Wago 221-412" is a specific terminal block connector, but without

Genuinely ambiguous spend. Some procurement lines sit at the boundary between two UNSPSC categories. A maintenance visit that includes both labour and materials might belong in Segment 72 (Building and Facility Construction and Maintenance Services) or in the goods segment that covers the materials supplied, depending on how the work was invoiced.

First-time suppliers. The ML layer learns from patterns across your supplier base. A brand-new supplier with no purchase history produces lower confidence scores until the engine has seen enough examples to establish patterns.

Accuracy at different stages

A good automated classification system does not stay at the same accuracy level over time. It improves.

Stage	Auto-classification rate
Initial baseline (first run)	80–90%
After 3 months of operation	90–95%
After 12 months of operation	95%+

The improvement comes from the feedback loop. Every item that a human reviewer classifies is fed back into the ML model. The next time a similar description appears — from the same or a different supplier — the engine classifies it automatically with high confidence.

This is why the review queue matters. It is not a failure of the system; it is the system learning.
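The feedback loop can be pictured with a deliberately simplified stand-in. The sketch below does not retrain a model; it only remembers reviewer decisions under a normalised description, so that a later near-identical description is resolved without review. The class name `FeedbackMemory`, the normalisation rule, and the placeholder code are assumptions for illustration, not a description of how Pearstop's ML layer works.

```python
import re


class FeedbackMemory:
    """Toy stand-in for the feedback loop: remembers reviewer decisions."""

    def __init__(self):
        self._known = {}

    @staticmethod
    def _normalise(description: str) -> str:
        # Lowercase and keep only alphanumeric tokens, so that
        # "WAGO 221/412" and "Wago 221-412" map to the same key.
        return " ".join(re.findall(r"[a-z0-9]+", description.lower()))

    def record_review(self, description: str, code: str) -> None:
        """Store the code a human reviewer confirmed for this description."""
        self._known[self._normalise(description)] = code

    def classify(self, description: str):
        """Return a remembered code, or None if the item still needs review."""
        return self._known.get(self._normalise(description))


memory = FeedbackMemory()
first_pass = memory.classify("Wago 221-412")      # unseen, goes to review
memory.record_review("Wago 221-412", "39121400")  # placeholder code from a reviewer
second_pass = memory.classify("WAGO 221/412")     # similar description, now resolved
print(first_pass, second_pass)
```

A learned model generalises much further than an exact-match memory, but the direction of travel is the same: each reviewed item shrinks the set of descriptions that need a human the next time.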

How Pearstop's four-layer engine achieves this

Layer 1: Rules Engine. User-defined rules and automatically loaded patterns handle high-confidence classifications immediately. If you have told the system that all purchases from Supplier X under GL account 6400 are Segment 72 Class 721010, every matching line item is classified by the rule alone, without being passed to the ML or LLM layers.

Layer 2 — Machine Learning. A proprietary ML layer is trained on your historical spend data and on a broad corpus of procurement transactions across industries. It replicates the classification logic your most experienced category managers would apply — handling the common cases automatically.

Layer 3 — LLM Layer. Ambiguous or unusual line items are processed by a large language model that brings broad product and industry knowledge. The LLM handles descriptions that the ML layer has never seen before, including foreign-language descriptions and highly technical terminology.

Layer 4 — Human Review. Items below the confidence threshold are surfaced to your team in a review interface. Each decision feeds back into layers 1–3. Over time, the rules and ML layers expand to cover items that initially required human input.
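The four layers can be read as a cascade in which each automated layer either returns a confident answer or passes the item on. The following is a minimal sketch under that reading: the layer functions are stubs, the 0.85 threshold is hypothetical, and every name and code other than the Supplier X rule (taken from the Layer 1 example) is invented for the example. How Pearstop actually orders or combines the layers is not specified in this article.

```python
def rules_layer(item):
    # Rule mirroring the Layer 1 example: Supplier X under GL account 6400.
    if item.get("supplier") == "Supplier X" and item.get("gl_account") == "6400":
        return ("721010", 1.0)  # class code as quoted in the article
    return None


def ml_layer(item):
    # Stub: a trained model would score the description; here one known phrase.
    if "cleaning" in item.get("description", "").lower():
        return ("76111501", 0.93)  # placeholder code and confidence
    return None


def llm_layer(item):
    # Stub standing in for an LLM call; returns a low-confidence guess.
    return ("72151500", 0.55)  # placeholder code and confidence


LAYERS = [("rules", rules_layer), ("ml", ml_layer), ("llm", llm_layer)]


def classify(item, layers=LAYERS, threshold=0.85):
    """Try each layer in order; fall through to human review if none is confident."""
    best = None
    for name, layer in layers:
        result = layer(item)
        if result is None:
            continue
        code, confidence = result
        if confidence >= threshold:
            return {"code": code, "source": name, "needs_review": False}
        if best is None or confidence > best[1]:
            best = (code, confidence)
    # Layer 4: surface the best suggestion (if any) to a human reviewer.
    suggested = best[0] if best else None
    return {"code": suggested, "source": "human_review", "needs_review": True}


by_rule = classify({"supplier": "Supplier X", "gl_account": "6400", "description": "Site visit"})
by_ml = classify({"supplier": "Supplier Y", "gl_account": "6100", "description": "Office cleaning March"})
to_review = classify({"supplier": "Supplier Z", "gl_account": "6100", "description": "WK elec H3 Q2-26"})
print(by_rule["source"], by_ml["source"], to_review["source"])
```

Ordering the cheap, deterministic layer first means the more expensive layers only see items the rules could not settle.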

What happens to unclassified items

Some procurement teams worry about the 5–10% that goes to review. The realistic picture:

In the first month, a team might review 500–1,000 items from a dataset of 10,000 lines
By month three, the same volume of new invoices produces 200–300 items for review
By month twelve, it is often fewer than 100

The review interface is designed for speed — a category manager can process 100 items in 20–30 minutes with clear suggested codes and confidence indicators. It is a fundamentally different workload from manual classification of the full dataset.
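As a back-of-the-envelope check, the review pace quoted above (100 items in 20–30 minutes) can be applied to the queue sizes listed for each stage. The function name and the choice of queue sizes are illustrative; the only inputs taken from the article are the pace and the item counts.

```python
def review_minutes(items, minutes_per_100=(20, 30)):
    """Estimate the low and high review time, in minutes, for a queue of items."""
    low, high = minutes_per_100
    return items * low / 100, items * high / 100


# Queue sizes taken from the month 1, month 3 and month 12 figures above.
estimates = {items: review_minutes(items) for items in (1000, 500, 300, 200, 100)}
for items, (low, high) in estimates.items():
    print(f"{items:>5} items: {low:.0f}-{high:.0f} minutes")
```

On these numbers a first-month queue of 1,000 items is a few hours of review, and a queue of 100 items is under half an hour.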

A note on how competitors measure accuracy

Not all accuracy claims are equivalent. Watch for:

Accuracy at segment level vs. commodity level. Segment-level classification (the first 2 digits of the UNSPSC code) is much easier than commodity-level (all 8 digits). A tool that achieves 95% at segment level might achieve only 60% at commodity level.
Accuracy on clean data vs. real-world data. Some vendors test against datasets where descriptions are already standardised. Real procurement data from SAP or Oracle is messier, and accuracy figures should reflect that.
Review queue vs. unclassified. A system that flags items for review is different from one that leaves them unclassified. Flagged items get classified — eventually. Unclassified items stay unclassified.

Pearstop's figures are measured at commodity level on real client datasets including Dutch infrastructure and FM spend, where descriptions are in mixed Dutch and English and vary significantly across sites.
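The gap between segment-level and commodity-level accuracy is easy to see when the same predictions are scored at each level of the hierarchy. The sketch below compares only the leading digits relevant to each level. The function name and the five prediction pairs are made up to show the effect and are not real benchmark data.

```python
LEVEL_DIGITS = {"segment": 2, "family": 4, "class": 6, "commodity": 8}


def accuracy_at_level(pairs, level):
    """Fraction of (predicted, actual) code pairs that agree down to the given level."""
    n = LEVEL_DIGITS[level]
    if not pairs:
        return 0.0
    hits = sum(1 for predicted, actual in pairs if predicted[:n] == actual[:n])
    return hits / len(pairs)


# Made-up pairs: each one agrees with the true code to a different depth.
pairs = [
    ("72101507", "72101507"),  # exact commodity match
    ("72101507", "72101509"),  # same class, different commodity
    ("72101507", "72102900"),  # same family, different class
    ("72101507", "72154000"),  # same segment, different family
    ("72101507", "78101800"),  # different segment
]
scores = {level: accuracy_at_level(pairs, level) for level in LEVEL_DIGITS}
for level, score in scores.items():
    print(f"{level:<10}{score:.0%}")
```

The same set of predictions scores 80% at segment level and 20% at commodity level here, which is why an accuracy claim is hard to interpret without knowing the level it was measured at.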

What accuracy actually buys you

A 90–95% auto-classification rate means that a procurement team handling 10,000 invoice lines per month reduces their manual classification effort from 10,000 decisions to 500–1,000 — a reduction of 90–95%.

That reduction does not just save time. It makes spend analysis possible in the first place. A manually classified dataset where one person is working through 10,000 lines is always weeks or months behind. An automated system produces a classified dataset within days of the invoice data arriving.

For category management, supplier benchmarking, and margin analysis, timeliness matters as much as accuracy. A spend baseline that is three months old is far less useful than one that reflects last month's purchases.

Pearstop Team

Pearstop

Pearstop helps procurement and operations teams in hard services, FM, construction, and manufacturing turn messy data into a reliable foundation for decisions, AI, and category management.
