Using python to transform and filter data in bash pipes

Posted on Thu 30 April 2020 in Articles

I've long been a fan of bash pipes and the unix philosophy of composability. The text stream interface is so simple to extend and build upon that once you create a simple command line tool that works over stdin and stdout you suddenly have interoperability with a tremendous number of tools and workflows.

I'm also a fan of python generators. Ever since watching David Beazley's talks on generators about 8 years ago, I have used generators extensively in my python code as a way to keep memory usage low and actions composable, using both the explicit yield syntax and the more compact generator expression syntax (which looks like a list comprehension, but with parentheses, and is evaluated lazily). Thinking about operations as a series of transforms feels natural and lends itself to fairly high re-usability, especially for data processing workloads (cf. Apache Spark's DataFrame transformations).
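As a rough illustration of that style, here is a small sketch that chains a yield-based generator with a generator expression. The function names (parse_ints, even_squares) and the sample data are made up for this example.

```python
def parse_ints(lines):
    """Yield an int for each non-blank line (explicit yield syntax)."""
    for line in lines:
        line = line.strip()
        if line:
            yield int(line)


def even_squares(numbers):
    """Return a generator expression; nothing is computed until it is consumed."""
    return (n * n for n in numbers if n % 2 == 0)


lines = ["1", "2", "", "3", "4"]

# Each stage pulls one item at a time from the stage before it,
# so the full input never has to sit in memory at once.
pipeline = even_squares(parse_ints(lines))
result = list(pipeline)
print(result)  # [4, 16]
```

Because every stage is lazy, the same two functions work unchanged whether the input is a five-element list or a file handle streaming millions of lines.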

While working with some complex JSON data recently, I realized that the tools I had available for filtering and transforming that data were awkward. I wanted to stay in bash (vs an ipython shell or a standalone script just for this processing) because of all the other tools available in bash, but I wasn't very excited about parsing data with sed, awk, and xargs.

  • In the past I wrote a tool to replace complex sed expressions, mostly to avoid all the escaping that sed requires.
  • I have written awk programs that are tens of lines long, but now I tend to just jump over to python when I want to do more complex processing.
  • xargs is pretty awesome, but the syntax has a lot of gotchas once you start wanting to compose more complex expressions from a line of input.

Inspired in part by ammonite (scala) and xon.sh (python), I wanted to be able to use a batteries-included programming language alongside bash to get things done. What I put together started out as ~50 lines of python and has since grown a bit to add more features (esp. multi-expression python and multiprocessing for parallel computation), but it is still small enough to live as a single-file gist.
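To make the general idea concrete, here is a minimal sketch of what a tool in this spirit could look like. It is not the actual pype source: the function name apply_expr, the variable name line, the optional filter argument, and the command-line shape are all assumptions made for illustration, and it leaves out the multi-expression and multiprocessing features mentioned above.

```python
import json
import sys


def apply_expr(expr, lines, keep_if=None):
    """Lazily evaluate `expr` once per input line, with the text bound to `line`.

    If `keep_if` is given, lines for which it evaluates falsy are skipped.
    (Hypothetical helper for illustration; not pype's real interface.)
    """
    transform = compile(expr, "<expr>", "eval")
    predicate = compile(keep_if, "<keep_if>", "eval") if keep_if else None
    for raw in lines:
        # A fresh namespace per line; json is exposed because the
        # motivating use case is JSON data.
        scope = {"json": json, "line": raw.rstrip("\n")}
        if predicate is not None and not eval(predicate, scope):
            continue
        yield eval(transform, scope)


def main(argv):
    # Assumed usage: some_command | python sketch.py 'EXPR' ['FILTER_EXPR']
    expr = argv[1]
    keep_if = argv[2] if len(argv) > 2 else None
    for out in apply_expr(expr, sys.stdin, keep_if):
        print(out)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

With this sketch saved as sketch.py (a hypothetical filename), something like `cat data.jsonl | python sketch.py 'json.loads(line)["name"]'` would print one field per input record. Note that eval runs arbitrary code, so an approach like this only makes sense for expressions you type yourself.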

The tool is called pype (for python pipe). The name is, unsurprisingly, already used by a few projects, none of which are terribly active:

  • python-pype, similar to this project (bash + python)
  • pype, a pipe-like constructor for python operations
  • PyPE, an editor
  • More on github: https://github.com/search?q=python-pype

The source code and docs are included below. I'll be using this and likely adding to it over time. If it becomes part of my workflow I'll move it from a gist to a normal github repo, and add some tests and some packaging.

Let me know what you think by commenting below or reaching out on Twitter!