Pythonic Parsing Programs
Creed of Python Developers
Pythonistas are eager to extol the lovely virtues of our language. Most beginning Python programmers are invited to run import this from the interpreter right after the canonical hello world. One of the favorite quips from running that command is:
There should be one-- and preferably only one --obvious way to do it.
But the path to Python enlightenment is often covered in rocky terrain, or thorns hidden under leaves.
A Dilemma
On that note, I recently had to use some code that parsed a file. A problem arose because the API had been optimized around the assumption that what I wanted to parse would be found in the filesystem of a POSIX-compliant system. The implementation was a staticmethod on a class, called from_filepath.
Well, in 2013, we tend to ignore files and leave those lightweight chisels of the 70s behind in favor of shiny new super-powered jackhammers called NoSQL.
It so happened that I found myself with a string (pulled out of a NoSQL database) containing the contents of a file I wanted to parse. There was no file, no filename, only the data. But my API only supported access through the filename.
Perhaps the pragmatic solution would be to simply throw the contents into a temporary file and be done with it:
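import tempfile
data = get_string_data()  # fancy call out to NoSQL
# Text mode so a str can be written; the default mode is binary ('w+b').
with tempfile.NamedTemporaryFile(mode="w+") as fp:
    fp.write(data)
    fp.flush()  # push the data to disk before the file is reopened by name
    obj = Foo.from_filepath(fp.name)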
But I spent a bit of time thinking about the root of the problem and wanted to see how others solved it. Having a parsing interface that just supports parsing a string is probably a premature optimization on the other end of the spectrum.
A Little Light Reading
My first thought was to look to the source of all truth: the Python Standard Library. Surely it would enlighten me by illuminating all 19 tenets of “The Zen of Python”. I asked myself which standard library modules I use for parsing and came up with the following list:
json
pickle
xml.etree.ElementTree
xml.dom.minidom
ConfigParser
csv
(Note: all of the above are module names. The nested namespace of xml.* violates Zen tenet #5, “Flat is better than nested”, and ConfigParser violates PEP 8 naming conventions. This is not news to long-time Python programmers, but to newbies it is a dose of reality. The standard library is not perfect and has its quirks, even in Python 3. And this is only the tip of the iceberg.)
A Clear Picture
I went through the documentation and source code for these modules to determine the single best, most Pythonic, beautiful, explicit, simple, readable, practical, non-ambiguous, and easy to explain solution to parsing. Specifically, should I parse a filename, a file-like object, or a string? Here is the resulting table I came up with:
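Module | String Data | File | Filename |
---|---|---|---|
json | loads | load | |
pickle | loads | load | |
xml.etree.ElementTree | fromstring | parse | parse |
xml.dom.minidom | parseString | parse | parse |
ConfigParser | | readfp | read |
csv | | reader, DictReader | |
pyyaml | load, safe_load | load, safe_load | |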
(Note: pyyaml is a third-party library, but there has been much hubbub recently over the naming of load, which is unsafe (but probably the method most will use unless they really pore through the docs), and safe_load, which is safe (and hidden away in the docs).)
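As a rough illustration of the String Data and File columns, the sketch below runs a few of the standard library parsers under Python 3 (an assumption; the ConfigParser spelling above is from Python 2). The sample data is made up, and io.StringIO is used to turn a string into a file-like object.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# json: loads() takes a string, load() takes a file-like object.
json_text = '{"name": "matt", "uid": 1000}'
from_string = json.loads(json_text)
from_file = json.load(io.StringIO(json_text))

# csv: reader() accepts any iterable of lines, including a file-like object.
csv_text = "name,uid\nmatt,1000\n"
rows = list(csv.reader(io.StringIO(csv_text)))

# ElementTree: fromstring() returns the root element directly, while
# parse() takes a filename or file object and returns an ElementTree.
xml_text = "<user><name>matt</name></user>"
root_from_string = ET.fromstring(xml_text)
root_from_file = ET.parse(io.StringIO(xml_text)).getroot()

print(from_string == from_file)
print(rows)
print(root_from_string.find("name").text, root_from_file.find("name").text)
```

In each case the file-based call also handles string data once the string is wrapped in io.StringIO, whereas a filename-only interface would need a temporary file.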
The trick to this table is to spin around three times, really squint your eyes, and pick something from the File column.
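One way to act on that advice, sketched under assumptions: the Foo class below is hypothetical (the internals of the original API are not shown here), and the key=value format it parses is invented purely for illustration. The file-like interface carries the real implementation, while from_filepath and from_string are thin wrappers around it.

```python
import io


class Foo:
    """Hypothetical parser for 'key = value' lines (illustrative format)."""

    def __init__(self, items):
        self.items = items

    @classmethod
    def from_file(cls, fp):
        # Core implementation: works with any iterable of text lines.
        items = {}
        for line in fp:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            key, _, value = line.partition("=")
            items[key.strip()] = value.strip()
        return cls(items)

    @classmethod
    def from_filepath(cls, path):
        # Convenience wrapper: open the file and delegate.
        with open(path) as fp:
            return cls.from_file(fp)

    @classmethod
    def from_string(cls, data):
        # Convenience wrapper: wrap the string in a file-like object.
        return cls.from_file(io.StringIO(data))


obj = Foo.from_string("user = matt\n# a comment\nshell = /bin/bash\n")
print(obj.items)
```

The sketch uses classmethod rather than the staticmethod mentioned earlier so that subclasses get instances of their own type; that is a design choice of this example, not something taken from the original API.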
Matt Harrison (@__mharrison__) is a Python developer at Fusion-io where he helps build data analysis tools.
Comments
Not work
Hi, I am trying it and get an error at every line. Please review it.
What about error handling?
Great article - my only issue is the lack of any attention to errors. Errors do occur and while in this particular example (the passwd file) one may assume it is valid (because the system is using it?), this is not the case in general. All too often, parsing failures are ignored resulting in corrupted data or otherwise we get some out-of-context error message 'bad format' or something similar.
Since in the pattern presented there is no way to know from where the data came, at minimum it would be nice to count the lines so the error message could be useful (a column may be even better).
Then the return from the parse function should be (pw, error) so the caller can always get the error (I know it is reminiscent of Go, but it is useful).
/d
'bad format' or something
"'bad format' or something similar." Would you please explain this in more detail?
When parsing files, you
When parsing files, you invariably hit upon cases where the input does not adhere to the specified grammar (e.g. when parsing a line of the passwd file, the user ID is specified in hex rather than in decimal). In such a case, if one does not check, one may end up working on wrong data and the result could be anything from benign to tragic. If one does check (as one should), one needs a way to communicate to the caller why the parsing failed and where. That was the essence of my comment (along with the hint on how it can be approached).
/d