I'm cleaning commercial product data, and even though I get high accuracy on SKUs, I've reached the limit of regex madness. A data string can contain one or multiple products (with an arbitrary separator, which can also occur in normal text), sometimes along with their description, model, dimensions, SKU, and other info to be ignored.
I decided to keep regexes for the easy matches and give AI a shot for the complex ones, starting with "fastino/gliner2-multi-v1", a zero-shot NER model (it also does classification, structured extraction and relation extraction). The multi variant groks Portuguese too, which is nice (the data is from Brazil). Promising, but it produces too much fluff. I tried GPT, which was the opposite: good precision, but low recall.
To simplify, I tried splitting multi-product strings first. GLiNER2 was not useful; GPT was pretty good with commas, which covers most of the cases, but it seems overkill for such a simple task. I would like to run this locally, or even on a VPS, so I should probably train a model, but the many options are overwhelming, so it's high time I asked for advice:
1. Which models are recommended to train locally (6 GB VRAM) or at an accessible price, and would extract product name, model, description/dimensions and SKU with good accuracy from a mess like this?
I'm still leaning towards GLiNER2 over the xBERTs, never mind spaCy (I've got a bunch of regexes already, but is the data too messy for it?), but I'm flying blind.
2. In a single pass, or should I split multiple products first? With a different model?
3. Can I get training data quickly with an LLM, using something like Unsloth Studio?
Example single product data:
445-905/00 Broca para patela, Ø 6,3 mm (encaixe Hudson (B))
72402718N - EVOS 2,7mm X 18mm parafuso cortical T8 auto-atarraxante
71366448 PROVA DE SUPERFICIE ACETABULAR R3 20 GRAUS, DIAMETRO INTERNO 28MM MODELO: 71366448, LOTE: 24BAP0050, QTDE: 1
Multiple products:
CHIBA: CH18080-00, CH18100-00, CH18120-00, CH18160-00, CH18180-00
Placas Nível – 2: 6190035 – 35mm, 6190037 – 37,5mm, 6190040 – 40mm
Embalagem com 100 (cem) tubos. (13 x 75)mm: 1,8mL, 2mL, 2,5mL e 3mL. Código de cor AZUL
Dissilicatos de lítio Mega se apresentam sob a forma de blocos ou pastilhas, nos modelos HT,LT,MT,MO,HO,AMBER e UZ Direct HT/LT nas Cores: A1,A2,A3,A3.5
NQ1000 (NQ1001, NQ1002, NQ1003): NQ1001 - (NE331R, NP580R, NP582R)
PARAFUSO PEDICULAR ESPONDILOLISTESE POLIAXIAL:MF.PD.4520 - Parafuso Pedicular Ø 4,5 x 20mm, MF.PD.4525 - Parafuso Pedicular Ø 4,5 x 25mm
1104-3012 Haste de Quadril Matrix com revestimento poroso 11104-3212 Haste de Quadril Matrix com revestimento poroso
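A minimal sketch of what a regex first pass over strings like the ones above might look like, assuming SKUs are uppercase alphanumeric codes optionally joined by ".", "-" or "/". The pattern, the `sku_candidates` name and the `min_len` threshold are illustrative assumptions, not the actual regexes in use:

```python
import re

# Assumed SKU shape (illustrative only): uppercase letters/digits, optionally
# joined by '.', '-' or '/', e.g. CH18080-00, MF.PD.4520, 445-905/00.
TOKEN_RE = re.compile(r"\b[A-Z0-9]+(?:[.\-/][A-Z0-9]+)*\b")


def sku_candidates(text: str, min_len: int = 5) -> list[str]:
    """Return code-like tokens that contain a digit and have at least min_len characters."""
    found = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        # Drop short tokens (sizes, colours such as A3.5) and pure words (CHIBA).
        if len(token) >= min_len and any(ch.isdigit() for ch in token):
            found.append(token)
    return found


if __name__ == "__main__":
    print(sku_candidates("CHIBA: CH18080-00, CH18100-00, CH18120-00"))
```

This handles the comma-separated lists, but it cannot tell a SKU from a lot number (24BAP0050) and it glues together codes that lack a delimiter, which is where a model would have to take over.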
Sample queries:
# GLiNER2
# model: an already loaded GLiNER2 instance
entities = {
    "product_line": "The commercial or family brand name of the medical device or product line",
    "variant": "Specific models, sizes, dimensions, or variations of the product",
    "sku": "Stock keeping units, part numbers, catalog references, or alphanumeric commercial codes"
}
# Entity extraction:
schema = model.create_schema().entities(entities)
# Or structured:
schema = (model.create_schema()
    .structure("product_model")
    .field("product_line", dtype="str")
    .field("variant", dtype="str")
    .field("sku", dtype="str")
)
# LLM
prompt = f"""
You are an accurate medical product model data extractor. Analyze the following text and extract the response hierarchy, which you will return in strict JSON with the following format:
{{
"line": "...",
"variant": "...",
"skus": ["sku1", "sku2"]
}}
The text comes from Brazil's ANVISA database, so it is mostly in Portuguese.
The text may not contain the product line or variant. If you do not find an entity, return null.
The delimiter is variable, and sometimes some are missing.
Only extract text that literally exists in the input, verbatim; do not change or add anything.
Input: "Stent coronário SUPRAFLEX STARFGTT200008,FGTT200012"
Output: {{"line": "SUPRAFLEX", "variant": "STAR", "skus": ["FGTT200008", "FGTT200012"]}}
Input: "Cateter Balão CRUZ MOD: FGTZ400016 - 4 X 16 MM"
Output: {{"line": "CRUZ", "variant": "4 X 16 MM", "skus": ["FGTZ400016"]}}
Text:
{input}
"""
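The "verbatim only" rule in the prompt can be checked after the fact instead of trusted. The sketch below assumes the LLM reply is a JSON string in the `line` / `variant` / `skus` shape shown in the prompt; the `verbatim_spans` name and the `(start, end, label)` tuple format are hypothetical, and using the first occurrence found by `str.find` is a simplification that would need refining when a value repeats in the text:

```python
import json


def verbatim_spans(text: str, llm_json: str):
    """Map each extracted value to (start, end, label); collect values not found verbatim."""
    record = json.loads(llm_json)
    pairs = [("line", record.get("line")), ("variant", record.get("variant"))]
    pairs += [("sku", sku) for sku in (record.get("skus") or [])]

    spans, rejected = [], []
    for label, value in pairs:
        if value is None:
            continue  # the prompt allows null for missing entities
        start = text.find(value)  # assumption: first occurrence is the right one
        if start == -1:
            rejected.append(value)  # not literally in the input: likely invented
        else:
            spans.append((start, start + len(value), label))
    return spans, rejected


if __name__ == "__main__":
    text = "Cateter Balão CRUZ MOD: FGTZ400016 - 4 X 16 MM"
    reply = '{"line": "CRUZ", "variant": "4 X 16 MM", "skus": ["FGTZ400016"]}'
    print(verbatim_spans(text, reply))
```

The accepted spans are character offsets, so they could be converted into whatever annotation format a fine-tuning tool expects, while the rejected list gives a cheap measure of how often the LLM strays from the input.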