Tag Archives: fuzzy matching

Fuzzy Date Matching – Python

This won’t cover every aspect of matching dates across similar records stored in different databases, but it could be the start of another Python solution. Picking up where I left off in my last post about fuzzy well name matching in oil/gas: matching dates for different events in the life of an oil/gas well (permitting, drilling, completing) is another challenge.

Matching “09/30/2014” to “30 Sept 2014” is one thing. But what if the dates are approximate and you’d consider Sept 30 and Oct 2 a match because they’re close?

### https://pypi.python.org/pypi/fuzzyparsers
### https://docs.python.org/2/library/datetime.html
import datetime
from fuzzyparsers import parse_date
date1 = "September 30, 1985"
date2 = "10/02/1985"

transformed_date1 = parse_date(date1)
transformed_date2 = parse_date(date2)

timediff = transformed_date2-transformed_date1

print(timediff.days)

>>> 
2
>>> type(timediff)
<type 'datetime.timedelta'>

So parse_date from fuzzyparsers cleans up your dates, and the datetime.timedelta type lets you set a threshold for what you consider a match.
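
To make the threshold idea concrete, here is a minimal sketch. It is not the fuzzyparsers approach: it uses only the standard library, the list of formats and the names to_date and dates_match are made up for illustration, and the 3-day tolerance is an arbitrary assumption. It also assumes an English locale for month names.

```python
from datetime import datetime, timedelta

# Hypothetical list of formats; extend it for whatever your source systems emit.
KNOWN_FORMATS = ("%m/%d/%Y", "%B %d, %Y", "%d %b %Y", "%Y-%m-%d")


def to_date(text):
    """Try each known format in turn and return a datetime.date."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognized date format: %r" % text)


def dates_match(text1, text2, tolerance=timedelta(days=3)):
    """Return True when the two dates are no more than `tolerance` apart."""
    # abs() makes the comparison independent of which date comes first.
    return abs(to_date(text1) - to_date(text2)) <= tolerance


print(dates_match("September 30, 1985", "10/02/1985"))  # 2 days apart
print(dates_match("September 30, 1985", "10/15/1985"))  # 15 days apart
```

The right tolerance probably depends on the event: a permit date and a spud date may deserve different windows, so it is passed in as a parameter rather than hard-coded.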


Filed under Uncategorized

Fuzzy Matching for Oil/Gas MDM using Python

If you’ve worked with or read up on the common Master Data Management tools on the market, you will find that most of them provide functionality for matching records from various source systems and blending or merging them into a single, trusted data set, a practice also dubbed “Golden Record Management” (GRM). Many of these products are expensive, some prohibitively so.

So, as someone who deep down inside believes no problem should be too big for a Python solution, I thought there had to be a way to do something similar. In the oil/gas industry, when matching well data records from various databases, the trouble starts with the name of a well, which may occur in various forms. A well could be called the “Peter Smith No.1”, “Smith, P.-1H” or “P Smith, #1”. Matching these across systems requires a degree of fuzziness. Enter Python’s fuzzywuzzy module. It is available on GitHub, or you can install it using pip.

For a real-world example, go to the Texas Railroad Commission for some sample data. I searched for “Smith, P” and found:

Smith, Patricia Unit 1, Burleson County, API 4205130712

Smith, Pattie L, Shackelford County, API 4241731960

These are different wells, and they get pretty low scores when comparing their names using fuzzywuzzy:

from fuzzywuzzy import fuzz

>>> fuzz.ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")
29
>>> fuzz.partial_ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")
26

>>> fuzz.token_set_ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")
61

Compare that to potentially different spellings of the same well:


>>> fuzz.token_set_ratio("SMITH, pattie", "Smith, Pattie L")
100
>>> fuzz.token_set_ratio("SMITH, pattie", "Pattie Smith L")
100
>>> fuzz.token_set_ratio("SMITH, pat", "P Smith L")
78
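
To turn scores like these into a match/no-match decision, you need a cutoff. Below is a rough sketch of that idea. Note that it does not call fuzzywuzzy: it uses the standard library's difflib as a stand-in, with a simple sort-the-tokens normalization, so its scores will not equal the fuzzywuzzy numbers above. The names normalize, name_score and names_match, and the cutoff of 80, are assumptions for illustration only.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and sort the tokens alphabetically."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))


def name_score(name1, name2):
    """Similarity of two normalized names on a 0-100 scale."""
    ratio = SequenceMatcher(None, normalize(name1), normalize(name2)).ratio()
    return int(round(100 * ratio))


def names_match(name1, name2, threshold=80):
    """Treat two well names as a match when the score reaches the cutoff."""
    return name_score(name1, name2) >= threshold


# Word order, case and punctuation no longer matter after normalization.
print(names_match("SMITH, pattie", "Pattie Smith"))
# Score for the two different wells from the Railroad Commission search.
print(name_score("SMITH, PATRICIA UNIT", "Smith, Pattie L"))
```

Whatever scorer you use, the cutoff is something to tune against pairs you already know are the same well or different wells.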

Anyway, I plan to play with it some more. It’s promising. But of course this only addresses fuzzy matching for strings. It doesn’t help me treat dates like “Sept 30, 1955” and “October 1, 1955” as a near match. But maybe Python has another module for that, too!
Additional useful links:

http://chairnerd.seatgeek.com/

and

http://marcobonzanini.com/


Filed under Python