Category Archives: Python

Fuzzy Matching for Oil/Gas MDM using Python

If you’ve worked with or read up on the common Master Data Management tools on the market, you will find that most of them provide functionality for matching records from various source systems and blending or merging them into a single, trusted data set, a practice also dubbed “Golden Record Management” (GRM). Many of these products are expensive, some prohibitively so.

So, as someone who deep down believes no problem should be too big for a Python solution, I thought there had to be a way to do something similar. In the oil/gas industry, matching well data records from various databases runs into trouble right away with the name of the well, which may occur in various forms. A well could be called the “Peter Smith No.1”, “Smith, P.-1H” or “P Smith, #1”. So matching these across systems requires a degree of fuzziness. Enter Python’s fuzzywuzzy module. You can find it on GitHub or install it with pip (pip install fuzzywuzzy); the project page has installation instructions.

For a real-world example, go to the Texas Railroad Commission for some sample data. I searched for “Smith, P” and found:

Smith, Patricia Unit 1, Burleson County, API 4205130712

Smith, Pattie L, Shackelford County, API 4241731960

These are different wells, and they get pretty low scores when their names are compared using fuzzywuzzy:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")
29
>>> fuzz.partial_ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")
26
>>> fuzz.token_set_ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")
61

Compare that to potential different spellings of the same well:

>>> fuzz.token_set_ratio("SMITH, pattie", "Smith, Pattie L")
100
>>> fuzz.token_set_ratio("SMITH, pattie", "Pattie Smith L")
100
>>> fuzz.token_set_ratio("SMITH, pat", "P Smith L")
78
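
To get a feel for why case and word order matter so much in these scores, here is a minimal sketch built on the standard library's difflib.SequenceMatcher rather than on fuzzywuzzy. The helper names normalize and sorted_token_score are made up for this illustration, this is not how the library implements token_set_ratio, and the scores will not line up with the fuzzywuzzy numbers above.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase a well name, drop punctuation, and sort its tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))

def sorted_token_score(a, b):
    """Return a 0-100 similarity score computed on the normalized names."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return int(round(100 * ratio))

# Same tokens in a different case and order normalize to the same string.
print(sorted_token_score("SMITH, pattie", "Pattie Smith"))  # 100

# Two genuinely different wells stay below a perfect score.
print(sorted_token_score("SMITH, PATRICIA UNIT", "Smith, Pattie L"))
```

Normalizing before comparing is the cheap part of the job; picking the cutoff score that separates "same well" from "different well" still has to be tuned against real data.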

Anyway, I plan to play with this some more. It’s promising. But of course this only addresses fuzzy matching for strings. It doesn’t help me treat dates such as “Sept 30, 1955” and “October 1, 1955” as a near match. But maybe Python has another module for that, too!
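
For the date case, one possible approach is to parse both values with the standard datetime module and accept them as a near match when they fall within a tolerance. The sketch below assumes English month names, only the two formats listed in FORMATS, and a tolerance of one day; parse_date and dates_near are hypothetical helper names, not part of any library.

```python
from datetime import datetime

# Assumed input formats: "October 1, 1955" and "Sep 30, 1955".
FORMATS = ("%B %d, %Y", "%b %d, %Y")

def parse_date(text):
    """Parse a date string using the assumed formats above."""
    # strptime's %b expects "Sep", so map the common "Sept" spelling first.
    cleaned = text.strip().replace("Sept ", "Sep ")
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognized date: %r" % text)

def dates_near(a, b, max_days=1):
    """True if the two dates are at most max_days apart."""
    return abs((parse_date(a) - parse_date(b)).days) <= max_days

print(dates_near("Sept 30, 1955", "October 1, 1955"))   # True
print(dates_near("Sept 30, 1955", "October 15, 1955"))  # False
```
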
Additional useful links:

http://chairnerd.seatgeek.com/

and

http://marcobonzanini.com/

Filed under Python

Monitoring if a file has changed in Python

The reason I recently looked at creating Windows Services in Python was because I’m interested in monitoring log files. So naturally, I’m also interested when these files are changed or updated. Thought I’d capture some of the basics of watching files with Python.
Check if a path (directory or file) exists:

import os
if os.path.exists("yourpath here"):
    print("Yep - found it")

Check if a path is a file (if file doesn’t exist, returns ‘False’):

import os
if os.path.isfile("your path to file"):
    print("that's a file alright")
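
To see how the two checks differ, here is a small sketch that builds a throwaway directory and file with tempfile and then classifies three paths. The helper name describe_path and the file name example.log are made up for this illustration.

```python
import os
import tempfile

def describe_path(path):
    """Classify a path as 'missing', 'file', 'directory', or 'other'."""
    if not os.path.exists(path):
        return "missing"
    if os.path.isfile(path):
        return "file"
    if os.path.isdir(path):
        return "directory"
    return "other"

# Build a scratch directory containing one file, then probe three paths.
scratch_dir = tempfile.mkdtemp()
log_path = os.path.join(scratch_dir, "example.log")
with open(log_path, "w") as handle:
    handle.write("first line\n")

print(describe_path(scratch_dir))                            # directory
print(describe_path(log_path))                               # file
print(describe_path(os.path.join(scratch_dir, "nope.log")))  # missing
```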

How do you compare two (2) files? You could open them and (if they’re text files) read and compare them line by line. But that would be laborious.
Instead, Python offers the filecmp module.

import filecmp
# filea and fileb are the paths of the two files to compare
filecmp.cmp(filea, fileb)
### returns True or False

(see Python docs – https://docs.python.org/2/library/filecmp.html)
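
One detail worth knowing: by default filecmp.cmp does a shallow comparison, which treats two files as equal when their os.stat() signatures (file type, size, modification time) match, without reading the contents. Passing shallow=False forces a comparison of the contents. A minimal sketch on three throwaway files (the file names and contents are made up):

```python
import filecmp
import os
import tempfile

scratch_dir = tempfile.mkdtemp()
file_a = os.path.join(scratch_dir, "a.txt")
file_b = os.path.join(scratch_dir, "b.txt")
file_c = os.path.join(scratch_dir, "c.txt")

# a and b have identical contents; c has the same size but different text.
samples = ((file_a, "well 1\n"), (file_b, "well 1\n"), (file_c, "well 2\n"))
for path, text in samples:
    with open(path, "w") as handle:
        handle.write(text)

# shallow=False makes filecmp compare the actual file contents.
same_ab = filecmp.cmp(file_a, file_b, shallow=False)
same_ac = filecmp.cmp(file_a, file_c, shallow=False)
print(same_ab, same_ac)  # True False
```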

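Now, if you’re comparing more than two (2) files, or you would like to compare a file with a prior version of itself, you could generate a checksum (or hash, hash sum) of each file. For that, you can use the md5 module (see Python docs – https://docs.python.org/2/library/md5.html?highlight=md5#md5 ). Note that md5 has been deprecated since Python 2.5 in favor of hashlib, which the snippet below uses, and that you need to feed the hash the file’s contents, not its name.

import hashlib

def checksum(path):
    # hash the file's contents (read as bytes), not the path string
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

hash1 = checksum("file1") # this generates the checksum
hash2 = checksum("file2")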

Then you can compare the two (2) checksums.
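
For the "compare a file with a prior version of itself" case, a sketch along these lines might do: take a checksum, let the file change, take another, and compare. It reads the file in chunks so a large log does not have to fit in memory. The function name file_checksum, the 64 KB chunk size, and the file example.log are assumptions made for this example.

```python
import hashlib
import os
import tempfile

def file_checksum(path, chunk_size=65536):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical log file created just for this demonstration.
scratch_dir = tempfile.mkdtemp()
log_path = os.path.join(scratch_dir, "example.log")
with open(log_path, "w") as handle:
    handle.write("first line\n")
before = file_checksum(log_path)

# Simulate the log being appended to, then hash it again.
with open(log_path, "a") as handle:
    handle.write("second line\n")
after = file_checksum(log_path)

print("changed" if before != after else "unchanged")  # changed
```
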
Finally, there is os.stat, which allows you to take a peek at a file’s attributes, including the date/time the file was last modified. So if you wanted to just poll a directory or file for that information, you could do:

import os, time
# os.stat returns 10 attributes; index 8 (next to last) is the last-modified time, also available as .st_mtime
moddate = os.stat("filepath")[8]

If you need a readable date/time, try:

print(time.ctime(moddate))
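
Polling can then be as simple as re-reading the modification time in a loop until it differs from the last value seen. The sketch below is one way to do it, with a timeout so it cannot spin forever; wait_for_change, the half-second interval, and the five-second timeout are arbitrary choices for illustration, and os.utime stands in for another process writing to the file.

```python
import os
import tempfile
import time

def wait_for_change(path, last_mtime, interval=0.5, timeout=5.0):
    """Poll path until its modification time differs from last_mtime.

    Returns the new modification time, or None if timeout seconds pass first.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        current = os.stat(path).st_mtime
        if current != last_mtime:
            return current
        time.sleep(interval)
    return None

# Demonstration on a throwaway file.
scratch_dir = tempfile.mkdtemp()
log_path = os.path.join(scratch_dir, "example.log")
with open(log_path, "w") as handle:
    handle.write("first line\n")

baseline = os.stat(log_path).st_mtime
# Pretend the file was modified a minute later.
os.utime(log_path, (baseline + 60, baseline + 60))

new_mtime = wait_for_change(log_path, baseline)
print(time.ctime(new_mtime))
```

Modification times only tell you that something touched the file, so pairing this with a checksum is a reasonable way to confirm the contents actually changed.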


Creating Windows Services in Python

If you’ve ever been tasked with automating processes in MS Windows that run in the background or at night, you’ve likely worked with the Windows Task Scheduler (not to be confused with the Task Manager) and then some form of batch file, PowerShell script, or similar. If you were lucky enough to have Python installed on the server (assuming you are working on a server), you’ve probably run something like “python.exe mypythonscript.py” either directly or through a *.bat or *.cmd file. There are lots of tutorials online for this kind of thing.

But you might find that you want more control over the sequence of actions that make up your task: you might like some conditional controls, or have one action generate some output before another action starts. In short, you’ll enjoy being able to code all this in Python. Soon, you’ll start wondering if you can replace all your scheduling with a Python service that runs all the time and only performs certain actions, such as checking for the existence of a file, when a certain trigger (time of day) goes off. So how do you write your Python service?
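
Before getting into the Windows-specific plumbing, the core of such a service can be sketched in plain Python: a loop that wakes up periodically and runs an action once per day after a trigger time has passed. Everything here is an assumption for illustration (the names is_due, check_for_file and run_once, the 02:00 trigger, the file nightly_export.csv), and the "loop" is simulated with three fixed timestamps instead of running forever with time.sleep().

```python
import os
from datetime import datetime, time as clock_time

def is_due(now, trigger, last_run_date):
    """True if the trigger time has passed today and the task has not run today."""
    return now.time() >= trigger and last_run_date != now.date()

def check_for_file(path):
    """Example action: report whether an expected file has arrived."""
    return os.path.exists(path)

def run_once(now, trigger, last_run_date, path):
    """One pass of the service loop; returns the updated last-run date."""
    if is_due(now, trigger, last_run_date):
        found = check_for_file(path)
        print("%s: %s %s" % (now, path, "found" if found else "missing"))
        return now.date()
    return last_run_date

# Simulate three passes of the loop around an arbitrary 02:00 trigger.
trigger = clock_time(2, 0)
last_run = None
passes = [
    datetime(2015, 3, 1, 1, 59),  # too early, nothing happens
    datetime(2015, 3, 1, 2, 0),   # trigger fires, action runs
    datetime(2015, 3, 1, 2, 1),   # already ran today, skipped
]
for stamp in passes:
    last_run = run_once(stamp, trigger, last_run, "nightly_export.csv")
```

Keeping the "is it time yet?" decision in its own small function makes it easy to test without waiting for the clock, whatever ends up hosting the loop.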

Two good posts explaining how to create a Windows service written in Python are this one (ryrobes.com) and this one (chrisumbel.com). Either way, you will be using Mark Hammond’s Win32 extensions (pywin32), now available through SourceForge. If you’re like me, you might’ve picked up a used copy of “Python – Programming on Win32” (Mark Hammond & Andy Robinson). Granted, it’s quite dated (Jan 2000), but it’s a good reference for all things Python on Windows and has a whole chapter on “Windows NT services” that is the basis for what the links above illustrate.

There is a post on debugging Python Windows services here. If you use or are familiar with ActivePython, then there are alternatives such as PythonCom, an example of which you can find here on Stack Overflow. Finally, here is a recipe on how to do this using Windows Server 2003 Resource Kit Tools (RKT). If you’re like me and you’re jumping back and forth between different versions of Windows Server, that may be useful.

Ok, this was a collection of useful links. Next post will have some code again.
