PyCharm with Tortoise SVN

If you’re like me, you’ve been coding in Python for a few years but none of your work has involved team efforts where source/version control would have been key. So much of the code is cobbled together, reproduced, recycled, sometimes documented through comments, often not… the list goes on.

Well, I’m breaking with all these bad habits. I’ve heard and read so much about PyCharm that I’m giving it a try, and I’m also starting with source/version control. Getting JetBrains’ PyCharm up and running is easy. You get it here (https://www.jetbrains.com/pycharm/). It bills itself as the “Best Python IDE”.

Then I installed Tortoise SVN, a client for Apache Subversion (SVN), available here (http://tortoisesvn.net/). I went with version 1.8.11.

Finally, once both are installed, go to File > Settings > Version Control in PyCharm, select the directory you’d like to put under version control, and then pick your version control software (SVN). Done!

Filed under Uncategorized

Discovering Pandas (Python)

So I’ve added another book to my Python library: “Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython”.

So far, it promises to be a good introduction to data science using Python tools. The first thing you need to do to follow along with the exercises in the book is get pandas installed, which is easy. The link is right here.

Pandas leverages NumPy, which I was familiar with only in that I had heard of it. But I didn’t realize the significance of NumPy in terms of user community or the size of the package. In fact, I assumed it might ship as part of the standard library. Wrong!

But you don’t have to download and install Pandas or NumPy by hand. I’ve just discovered pip. I likely missed it because I spend most of my Python time in the standard library. Once you have pip running with your Python installation of preference, getting new software is as easy as running “sudo apt-get install” in Linux.

You just type “pip install pandas” and let it download the necessary packages from the web and install them locally. If you’ve already downloaded a Python wheel (WHL), you can also point pip at that and install from the local file. For more about Wheels, which are replacing Eggs, go here.

Trying to install pandas though, I was at first getting an error related to “Windows C++ 10.0” (link). That turned out to be due to my trying to install 32-bit Numpy. So I did end up downloading a 64-bit version for my 64-bit Python 3.3 and then used pip to install from that WHL. The whole installation took no more than 10 minutes. Now, I’m ready to play with pandas.
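
Once pip reports success, a quick way to confirm that the packages are visible to the interpreter you actually run is to ask Python itself. This is only a minimal sketch: it assumes Python 3.4 or newer (for importlib.util.find_spec), and the helper name is_installed is made up for illustration.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# Report on the two packages discussed in this post.
for name in ("numpy", "pandas"):
    print(name, "installed" if is_installed(name) else "missing")
```

If a package shows up as missing right after installing it, pip probably belongs to a different Python installation than the one running the script.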

Filed under Uncategorized

Python & Big Data Analytics (some good links)

I am reading a lot about the use of Python in the data management arena, and while I am not currently working with Big Data, I thought this article here – Using Python for Big Data Analytics – had some great information on things to avoid in Python.

Here is an article that lists Python among the best languages for crunching data (read: “Big Data”). It also mentions a number of other languages I am not familiar with at all – Kafka? I must be living on the dark side of the moon.

Finally, to complete the triad of links for sharing, there is a page with some pandas how-to. Pandas is another framework I need to take a look at. It seems to pop up in data analytics everywhere. In fact, O’Reilly has a number of titles on Python in that sphere and this one touches on pandas, e.g. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. I should pick up a copy quick… Data science sounds like job security.

Filed under Uncategorized

Fuzzy Date Matching – Python

So this won’t cover all aspects of matching dates based on similar records stored in different databases, but it could be the start for another Python solution. Picking up where I left off with my last post about fuzzy well name matching in oil/gas, matching dates for different events in the life of an oil/gas well (permitting, drilling, completing) is another challenge.

Matching “09/30/2014” to “30 Sept 2014” is one thing. But what if the dates are approximate and you’d consider Sept 30 and Oct 2 a match because they’re close?

# https://pypi.python.org/pypi/fuzzyparsers
# https://docs.python.org/2/library/datetime.html
import datetime
from fuzzyparsers import parse_date

date1 = "September 30, 1985"
date2 = "10/02/1985"

# parse_date turns each free-form string into a date value
transformed_date1 = parse_date(date1)
transformed_date2 = parse_date(date2)

# subtracting the two parsed dates yields a datetime.timedelta
timediff = transformed_date2 - transformed_date1

print(timediff.days)

>>> 
2
>>> type(timediff)
<type 'datetime.timedelta'>

So parse_date from fuzzyparsers cleans up your dates, and the datetime.timedelta data type lets you set a threshold for what you consider a match.
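
Here is a rough sketch of what such a threshold could look like. To keep it self-contained, it swaps fuzzyparsers for the standard library’s datetime.strptime with a short list of known formats. The helper names (parse_known, dates_match), the format list and the default of three days are all assumptions for illustration, and the month names assume an English locale.

```python
from datetime import datetime

def parse_known(text, formats=("%m/%d/%Y", "%B %d, %Y", "%d %b %Y")):
    """Try each known format in turn; a simple stand-in for a fuzzy parser."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognized date: %r" % text)

def dates_match(a, b, max_days=3):
    """Treat two date strings as a match if they are at most max_days apart."""
    return abs((parse_known(a) - parse_known(b)).days) <= max_days

print(dates_match("September 30, 1985", "10/02/1985"))  # True, 2 days apart
print(dates_match("September 30, 1985", "12/25/1985"))  # False
```

Taking the absolute value of the difference means the order of the two dates doesn’t matter.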

Filed under Uncategorized

Fuzzy Matching for Oil/Gas MDM using Python

If you’ve worked with or read up on the common Master Data Management tools on the market, you will find that most of them provide functionality for matching records from various source systems and blending or merging them into a single, trusted data set, a practice also dubbed “Golden Record Management” (GRM). Many of these products are expensive, some prohibitively so.

So, as someone who deep down inside believes no problem should be too big for a Python solution, I thought there had to be a way to do something similar. In the oil/gas industry, for matching well data records from various databases, the trouble starts with the name of a well, which may occur in various forms. A well could be called the “Peter Smith No.1”, “Smith, P.-1H” or “P Smith, #1”. So matching these across systems requires a degree of fuzziness. Enter Python’s fuzzywuzzy module. Find it on GitHub here, or get it using pip. Find instructions here.

For a real-world example, go to the Texas Railroad Commission for some sample data. I searched for “Smith, P” and found:

Smith, Patricia Unit 1 , Burleson County, API 4205130712

Smith, Pattie L, Shackelford County, API 4241731960

These are different wells, and they get pretty low scores when comparing their names using fuzzywuzzy:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")
29
>>> fuzz.partial_ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")
26
>>> fuzz.token_set_ratio("SMITH, PATRICIA UNIT", "Smith, Pattie L")
61

Compare that to potential different spellings of the same well:


>>> fuzz.token_set_ratio("SMITH, pattie", "Smith, Pattie L")
100
>>> fuzz.token_set_ratio("SMITH, pattie", "Pattie Smith L")
100
>>> fuzz.token_set_ratio("SMITH, pat", "P Smith L")
78
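
If you don’t have fuzzywuzzy installed, the general idea behind the token-based scores above (ignore case, punctuation and word order before comparing) can be roughly approximated with the standard library’s difflib. This is not fuzzywuzzy’s algorithm and its scores will differ. The function names normalize and name_similarity are made up for this sketch.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop punctuation, and sort the words alphabetically."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))

def name_similarity(a, b):
    """Return a 0-100 similarity score computed on the normalized names."""
    matcher = SequenceMatcher(None, normalize(a), normalize(b))
    return round(100 * matcher.ratio())

# Same words in a different order and case: a perfect score.
print(name_similarity("SMITH, pattie", "Pattie Smith"))
# Two different wells: the score drops below 100.
print(name_similarity("SMITH, PATRICIA UNIT", "Smith, Pattie L"))
```

Sorting the tokens is what makes “SMITH, pattie” and “Pattie Smith” identical after normalization.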

Anyway, I plan to play with it some more. It’s promising. But of course this only addresses fuzzy matching for strings. It doesn’t help me match on dates, with “Sept 30, 1955” and “October 1, 1955” being a near match. But maybe Python has another module for that, too!

Additional useful links:

http://chairnerd.seatgeek.com/

and

http://marcobonzanini.com/

Filed under Python

Monitoring if a file has changed in Python

I recently looked at creating Windows Services in Python because I’m interested in monitoring log files. So naturally, I’m also interested in when these files are changed or updated. Thought I’d capture some of the basics of watching files with Python.

Check if a path (directory or file) exists:

import os
if os.path.exists("yourpath here"):
    print("Yep - found it")

Check if a path is a file (if file doesn’t exist, returns ‘False’):

import os
if os.path.isfile("your path to file"):
    print("that's a file alright")

How to compare two (2) files? – You could open them and (if they’re text files) read and compare them line by line. But that would be laborious.
Instead, Python offers the filecmp module.

import filecmp
# filea and fileb are the paths of the two files to compare
filecmp.cmp(filea, fileb)
# returns True or False

(see Python docs – https://docs.python.org/2/library/filecmp.html)
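
One detail worth knowing: by default filecmp.cmp does a “shallow” comparison, so two files whose os.stat() signatures (file type, size, modification time) are identical are treated as equal without their contents being read. The sketch below forces a byte-by-byte comparison with shallow=False. The helper name same_contents and the file names in the demo are made up for illustration.

```python
import filecmp
import os
import tempfile

def same_contents(path_a, path_b):
    """Compare two files by content rather than by os.stat() signature."""
    return filecmp.cmp(path_a, path_b, shallow=False)

# Demo with two throwaway files (hypothetical names) of identical size.
with tempfile.TemporaryDirectory() as tmp:
    file_a = os.path.join(tmp, "a.log")
    file_b = os.path.join(tmp, "b.log")
    for path, text in ((file_a, "line 1\n"), (file_b, "line 2\n")):
        with open(path, "w") as handle:
            handle.write(text)
    print(same_contents(file_a, file_a))  # True
    print(same_contents(file_a, file_b))  # False
```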

Now, if you’re comparing more than two (2) files, or you would like to compare a file with a prior version of itself, you could generate a checksum (or hash, hash sum) of each file. For that, you can use the hashlib module, which replaces the older md5 module (deprecated since Python 2.5 and removed in Python 3). (See Python docs – https://docs.python.org/2/library/hashlib.html )

import hashlib

def file_checksum(path):
    # hash the file's contents (not its name), reading it as bytes
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

hash1 = file_checksum("file1")  # this generates the checksum
hash2 = file_checksum("file2")  # this generates the checksum

Then you can compare the two (2) checksums.

Finally, there is os.stat, which allows you to take a peek at file attributes, including the date/time the file was last modified. So if you wanted to just poll a directory or file for that information, you could do:

import os, time
moddate = os.stat("filepath").st_mtime  # last-modified time; the same attribute as index 8 of the 10 values this call returns

If you need a readable date/time, try:

print(time.ctime(moddate))
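
Putting the os.stat idea to work, a simple polling loop might look like the sketch below. It compares the file’s size and modification time between checks. The function names snapshot and watch, the one-second interval and the fixed number of checks are all arbitrary choices for illustration, not a recommendation.

```python
import os
import time

def snapshot(path):
    """Return (size, modification time) for a file via os.stat()."""
    info = os.stat(path)
    return info.st_size, info.st_mtime

def watch(path, interval=1.0, checks=5):
    """Poll a file a fixed number of times and count the observed changes."""
    last = snapshot(path)
    changes = 0
    for _ in range(checks):
        time.sleep(interval)
        current = snapshot(path)
        if current != last:
            changes += 1
            print("changed at", time.ctime(current[1]))
            last = current
    return changes
```

For a log file, calling watch("your path to file") would print a line each time the file grows or is rewritten between two polls. Changes that happen and revert within one interval go unnoticed.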

Filed under Python

Creating Windows Services in Python

If you’ve ever been tasked with automating processes in MS Windows that run in the background or at night, you’ve likely worked with the Windows Task Scheduler (not to be confused with the Task Manager) and then some form of batch file, PowerShell script, or similar. If you were lucky enough to have Python installed on the server (assuming you are working on a server), you’ve probably run something like “python.exe mypythonscript.py” either directly or through a *.bat or *.cmd file. There are lots of tutorials online for this kind of thing.

But you might find that you want more control over the sequence of actions that make up your task. You might like some conditional controls, or want one action to generate some output before another action starts. In short, you’ll enjoy being able to code all this in Python. Soon, you’ll start to wonder if you can replace all your scheduling with a Python service that runs all the time and only performs certain actions, such as checking for the existence of a file, when a certain trigger (time of day) goes off. So how do you write your Python service?
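
Before getting into the Windows-specific parts, the trigger logic itself can be sketched in plain, portable Python. This is only an illustration of the “run once a day after a given time” idea, not a Windows service. The function names is_due and run_once are made up for this sketch.

```python
import datetime
import os

def is_due(now, trigger, last_run):
    """True the first time `now` reaches the trigger time on a given day."""
    return now.time() >= trigger and last_run != now.date()

def run_once(now, trigger, last_run, path):
    """Check for the file if the trigger is due; return the new last-run date."""
    if is_due(now, trigger, last_run):
        print("file present:", os.path.isfile(path))
        return now.date()
    return last_run

# A long-running service loop would call this repeatedly, for example:
#   last_run = run_once(datetime.datetime.now(), datetime.time(2, 0),
#                       last_run, "your path to file")
# and then sleep for a while before checking again.
```

Keeping track of the last run date is what stops the action from firing on every pass through the loop once the trigger time has passed.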

Two good posts explaining how to create a Windows service written in Python are this one (ryrobes.com) and this one (chrisumbel.com). Either way, you will be using Mark Hammond’s Win32 extension, now available through SourceForge. If you’re like me, you might’ve picked up a used copy of “Python – Programming on Win32” (Mark Hammond & Andy Robinson). Granted, it’s quite dated (Jan 2000), but it’s a good reference for all things Python on Windows and has a whole chapter on “Windows NT services” that is the basis for what the links above illustrate.

There is a post on debugging Python Windows services here. If you use or are familiar with ActivePython, then there are alternatives such as PythonCom, an example of which you can find here on Stack Overflow. Finally, here is a recipe on how to do this using Windows Server 2003 Resource Kit Tools (RKT). If you’re like me and you’re jumping back and forth between different versions of Windows Server, that may be useful.

Ok, this was a collection of useful links. Next post will have some code again.

Filed under Python