If you’ve worked with or read up on the common Master Data Management tools on the market, you will find that most of them provide functionality for matching records from various source systems and blending or merging them into a single, trusted data set, a practice also dubbed “Golden Record Management” (GRM). Many of these products are expensive, some prohibitively so.
So, as someone who deep down inside believes no problem should be too big for a Python solution, I thought there had to be a way to do something similar. In the oil and gas industry, the trouble with matching well records from various databases starts with the well name, which may occur in various forms. A well could be called the “Peter Smith No.1”, “Smith, P.-1H” or “P Smith, #1”, so matching these across systems requires a degree of fuzziness. Enter Python’s fuzzywuzzy module. It is available on GitHub, or you can install it using pip.
For a real-world example, go to the Texas Railroad Commission for some sample data. I searched for “Smith, P” and found:
Smith, Patricia Unit 1, Burleson County, API 4205130712
Smith, Pattie L, Shackelford County, API 4241731960
These are different wells, and they get pretty low scores when comparing their names using fuzzywuzzy:
>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")
>>> fuzz.partial_ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")
>>> fuzz.token_set_ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")
Compare that to potentially different spellings of the same well:
>>> fuzz.token_set_ratio("SMITH, pattie", "Smith, Pattie L")
>>> fuzz.token_set_ratio("SMITH, pattie", "Pattie Smith L")
>>> fuzz.token_set_ratio("SMITH, pat", "P Smith L")
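To get a feel for why token-based scoring copes with reordered names and stray punctuation, here is a rough sketch that uses only the standard library. It is not fuzzywuzzy's implementation, just an illustration of the general idea: lowercase the name, drop punctuation, sort the tokens, and then compare the resulting strings with difflib. The function names normalize_tokens and token_sort_similarity are made up for this example.

```python
import re
from difflib import SequenceMatcher


def normalize_tokens(name):
    """Lowercase the name, treat punctuation as spaces, return sorted unique tokens."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", name.lower())
    return sorted(set(cleaned.split()))


def token_sort_similarity(a, b):
    """Return a 0-100 score comparing the sorted-token forms of two names.

    Illustrative only: this approximates the idea of token-based matching
    with difflib and will not reproduce fuzzywuzzy's exact scores.
    """
    left = " ".join(normalize_tokens(a))
    right = " ".join(normalize_tokens(b))
    return round(100 * SequenceMatcher(None, left, right).ratio())


# Word order, case and the comma no longer matter after normalization.
same_well = token_sort_similarity("SMITH, pattie", "Pattie Smith")
different_wells = token_sort_similarity("SMITH, PATRICIA UNIT", "Smith, Pattie L")
print(same_well, different_wells)
```

The point of the design is that all the cleanup happens before the comparison, so the similarity measure only has to deal with genuine differences in the tokens themselves.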
Anyway, I plan to play with it some more. It’s promising. But of course this only addresses fuzzy matching for strings. It doesn’t help me match on dates, for example treating “Sept 30, 1955” and “October 1, 1955” as a near match. But maybe Python has another module for that, too!
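For the date case, one possible starting point is the standard library's datetime module: parse both values and check how many days separate them. This is only a sketch under assumptions of my own. The helper names, the three-day tolerance and the fully spelled-out date format are all hypothetical, and abbreviations such as “Sept” would need to be normalized before parsing.

```python
from datetime import datetime


def days_apart(a, b, fmt="%B %d, %Y"):
    """Return the absolute number of days between two date strings.

    Assumes both strings use the same format, e.g. "October 1, 1955".
    """
    first = datetime.strptime(a, fmt)
    second = datetime.strptime(b, fmt)
    return abs((second - first).days)


def is_near_date(a, b, tolerance_days=3):
    """Treat two dates as a near match if they fall within the tolerance."""
    return days_apart(a, b) <= tolerance_days


print(days_apart("September 30, 1955", "October 1, 1955"))
print(is_near_date("September 30, 1955", "October 1, 1955"))
```

How wide the tolerance should be depends on the data: a spud date that differs by a day between two systems is probably the same event, while a difference of a month probably is not.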