Monthly Archives: June 2015

Python & Big Data Analytics (some good links)

I am reading a lot about the use of Python in the data management arena, and while I am not currently working with Big Data, I thought this article here – Using Python for Big Data Analytics – had some great information on things to avoid in Python.

Here is an article that lists Python among the best languages for crunching data (read: “Big Data”). It also mentions a number of other languages I am not familiar with at all – Kafka? I must be living on the dark side of the moon.

Finally, to complete the triad of links for sharing, there is a page with some pandas how-to. Pandas is another framework I need to take a look at. It seems to pop up in data analytics everywhere. In fact, O’Reilly has a number of titles on Python in that sphere and this one touches on pandas, e.g. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. I should pick up a copy quick… Data science sounds like job security.



Filed under Uncategorized

Fuzzy Date Matching – Python

So this won’t cover all aspects of matching dates across similar records stored in different databases, but it could be the start of another Python solution. Picking up where I left off with my last post about fuzzy well name matching in oil/gas, matching dates for different events in the life of an oil/gas well (permitting, drilling, completing) is another challenge.

Matching “09/30/2014” to “30 Sept 2014” is one thing. But what if the dates are approximate and you’d consider Sept 30 and Oct 2 a match because they’re close?

import datetime
from fuzzyparsers import parse_date
date1 = "September 30, 1985"
date2 = "10/02/1985"

transformed_date1 = parse_date(date1)
transformed_date2 = parse_date(date2)

timediff = transformed_date2-transformed_date1

print(timediff.days)

>>> type(timediff)
<class 'datetime.timedelta'>

So parse_date from fuzzyparsers cleans up your dates, and the datetime.timedelta data type lets you set a threshold for what you consider a match.
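Here is a minimal sketch of what that threshold check could look like. To keep it self-contained it uses only the standard library, so datetime.strptime with a short list of formats stands in for parse_date. The names KNOWN_FORMATS, to_date, and dates_match and the 3-day default tolerance are my own assumptions for illustration, not part of fuzzyparsers.

```python
from datetime import datetime, timedelta

# Hypothetical list of formats; extend it to cover your own source systems.
KNOWN_FORMATS = ("%m/%d/%Y", "%B %d, %Y")


def to_date(text):
    """Try each known format in turn; raise ValueError if none fits."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognized date format: %r" % text)


def dates_match(text1, text2, tolerance_days=3):
    """True if the two dates are no more than tolerance_days apart."""
    diff = to_date(text1) - to_date(text2)  # a datetime.timedelta
    return abs(diff) <= timedelta(days=tolerance_days)


print(dates_match("September 30, 1985", "10/02/1985"))  # True: 2 days apart
print(dates_match("September 30, 1985", "11/02/1985"))  # False: 33 days apart
```

The tolerance is the knob to tune: a permit date and a spud date may need a much wider window than two copies of the same completion date.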


Filed under Uncategorized

Fuzzy Matching for Oil/Gas MDM using Python

If you’ve worked with or read up on the common Master Data Management tools on the market, you will find that most of them provide functionality for matching records from various source systems and blending or merging them into a single, trusted data set, a practice also dubbed “Golden Record Management” (GRM). Many of these products are expensive, some prohibitively so.

So, as someone who deep down inside believes no problem should be too big for a Python solution, I thought there had to be a way to do something similar. In the oil/gas industry, for matching well data records from various databases, the trouble starts with the name of a well, which may occur in various forms. A well could be called the “Peter Smith No.1”, “Smith, P.-1H” or “P Smith, #1”. So matching these across systems requires a degree of fuzziness. Enter Python’s fuzzywuzzy module. You can find it on GitHub, or install it using pip.

For a real-world example, go to the Texas Railroad Commission for some sample data. I searched for “Smith, P” and found:

Smith, Patricia Unit 1, Burleson County, API 4205130712

Smith, Pattie L, Shackelford County, API 4241731960

These are different wells, and they get pretty low scores when comparing their names using fuzzywuzzy:

from fuzzywuzzy import fuzz

>>> fuzz.ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")
>>> fuzz.partial_ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")

>>> fuzz.token_set_ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")

Compare that to potential different spellings of the same well:

>>> fuzz.token_set_ratio("SMITH, pattie", "Smith, Pattie L")
>>> fuzz.token_set_ratio("SMITH, pattie", "Pattie Smith L")
>>> fuzz.token_set_ratio("SMITH, pat", "P Smith L")
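To get a feel for why token-based comparisons cope with reordered names, here is a rough standard-library sketch in the same spirit. It is not fuzzywuzzy's implementation: it just lowercases the name, strips punctuation, sorts the tokens, and scores the result with difflib.SequenceMatcher. The function names normalize and name_score are made up for this example.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and sort the tokens alphabetically."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))


def name_score(name1, name2):
    """Similarity of the two normalized names on a 0-100 scale."""
    ratio = SequenceMatcher(None, normalize(name1), normalize(name2)).ratio()
    return int(round(100 * ratio))


# Case, punctuation, and word order no longer matter:
print(name_score("SMITH, pattie", "Pattie Smith"))  # 100

# Different wells still differ after normalization:
print(name_score("SMITH, PATRICIA UNIT", "Smith, Pattie L"))
```

Sorting the tokens is what makes “SMITH, pattie” and “Pattie Smith” identical before any fuzzy scoring happens. Abbreviations such as “P” for “Pattie” still need the fuzzier treatment.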

Anyway, I plan to play with it some more. It’s promising. But of course this only addresses fuzzy matching for strings. It doesn’t help me treat dates like “Sept 30, 1955” and “October 1, 1955” as a near match. But maybe Python has another module for that, too!

Filed under Python