I am working on a graph store of entities and relationships extracted from a factual test document of around 500 words. The first pass extracts named entities (NER) and the second extracts relationships (RE). A given person can be referred to in several ways in the text: Maria, Maria Gotthard, Dr. Maria Gotthard, or a pronoun such as 'she', as in 'she was rewarded by the company'.
The goal is to merge all these references into one entity so that the relationship graph is not fragmented into different contexts.
I have seen a few posts on different forums saying this is a very difficult problem, but hopefully someone out there has some insights or experience to share.
To make things interesting, references to the same entity can occur in different chunks of text, making it impossible for the LLM (currently Ollama/Mistral) to process the cross-chunk context in one call. To address this, I have added a pass across all extracted entities that combines exact text matching with a Levenshtein similarity check, but this does not handle first name vs. full name and comes with a host of other issues. It also has a high risk of over-merging: for example, if a set of entities consists of incrementally numbered items, they will all be merged into one entity.
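To make the two failure modes concrete, here is a minimal sketch of this kind of pass. It is not the actual pipeline code: the function names and the 0.8 threshold are assumptions chosen purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (classic dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means identical after lowercasing."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def should_merge(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical merge rule: exact match or similarity above the threshold."""
    return similarity(a, b) >= threshold


# Over-merging: numbered items differ by one character only.
print(should_merge("Section 1", "Section 2"))    # True  (similarity ~0.89)
# Under-merging: first name vs. full name differ by many characters.
print(should_merge("Maria", "Maria Gotthard"))   # False (similarity ~0.36)
```

Character-level distance rewards strings that share most of their characters, which is the opposite of what is needed here: the single differing character in "Section 1" vs. "Section 2" is the part that matters, while the long shared-prefix relationship between "Maria" and "Maria Gotthard" is ignored.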
I am wondering if there is a particular architecture for this problem, for example pre-processing a document to link related entities before extraction. It doesn't have to be LLM-based; heuristics and algorithms sometimes do the trick as well.
Any ideas or feedback are welcome!