AI Stack Exchange
2026-06-23 23:04 UTC
Score 28.0
AI-110-20260623-social-media-17065327
I'm cleaning commercial product data, and even though I get high accuracy on SKUs, I've reached the limit of regex madness. A data string can contain one or multiple products (with an arbitrary separator, which can also occur in normal text), and sometimes their description, model, dimensions, SKU, and other info that should be ignored.

I decided to keep regexes for the easy matches and give AI a shot for the complex ones, starting with "fastino/gliner2-multi-v1", a zero-shot NER model (it also does classification, structured extraction, and relation extraction). The multi variant groks Portuguese too, which is nice (the data is from Brazil). It's promising, but produces too much fluff. I tried GPT, which was the opposite: good precision, but low recall.

To simplify, I tried splitting multi-product strings first. GLiNER2 was not useful for that; GPT was pretty good with commas, which cover most of the cases, but it seems overkill for such a simple task. I would like to run this locally, or even on a VPS, so I should probably train a model, but the many options are overwhelming, so it's high time I asked for advice:

1. Which models are recommended to train locally (6 GB VRAM), or at an accessible price, that would extract product name, model, description/dimensions, and SKU with good accuracy from a mess like this? I'm still leaning towards GLiNER2 over the xBERTs, never mind spaCy (I've got a bunch of regexes already, but is the data too messy for it?), but I'm flying blind.
2. In a single pass, or should I split multiple products first? With a different model?…
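
For reference, the hybrid setup described above (split first, regexes for the easy matches, a model only for what they miss) could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the SKU pattern, the separator rule, the sample string, and the `fallback` callable (a stand-in for GLiNER2, GPT, or a fine-tuned model) are all assumptions made up for the example.

```python
import re
from typing import Callable, Optional

# Hypothetical SKU shape (an assumption, not taken from the real data):
# 2-4 uppercase letters, a hyphen, then 3-6 digits, e.g. "AB-1234".
SKU_RE = re.compile(r"\b[A-Z]{2,4}-\d{3,6}\b")

# Split on semicolons and on commas, except a comma sitting between two
# digits, which is treated as a decimal comma (as in "2,5 mm").
SPLIT_RE = re.compile(r"(?<!\d),|,(?!\d)|;")


def split_products(raw: str) -> list[str]:
    """Naive splitter: break on separators and drop empty chunks."""
    return [part.strip() for part in SPLIT_RE.split(raw) if part.strip()]


def extract(raw: str,
            fallback: Optional[Callable[[str], dict]] = None) -> list[dict]:
    """Use the regex where it matches; hand everything else to `fallback`."""
    records = []
    for chunk in split_products(raw):
        match = SKU_RE.search(chunk)
        if match:
            # Whatever surrounds the SKU is kept as free text for later steps.
            rest = (chunk[:match.start()] + chunk[match.end():]).strip(" -:")
            records.append({"text": chunk, "sku": match.group(),
                            "rest": rest, "source": "regex"})
        elif fallback is not None:
            # `fallback` is a placeholder for a model call returning a dict.
            records.append({"text": chunk, **fallback(chunk), "source": "model"})
        else:
            records.append({"text": chunk, "source": "unresolved"})
    return records


if __name__ == "__main__":
    sample = "Parafuso sextavado 2,5 mm AB-1234, Mesa de escritório; Cadeira CD-5678"
    for record in extract(sample, fallback=lambda chunk: {"name": chunk}):
        print(record)
```

Keeping the `source` field on every record makes it easy to measure how much of the data the regexes already cover and to evaluate the model only on the remainder. The splitter is deliberately simple: a separator that also appears inside normal text (the hard case mentioned above) would still need a smarter rule or a learned splitter.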