Using python to transform and filter data in bash pipes

Posted on Thu 30 April 2020 in Articles

I've long been a fan of bash pipes and the unix philosophy of composability. The text stream interface is so simple to extend and build upon that once you create a simple command line tool that works over stdin and stdout you suddenly have interoperability with a tremendous number of tools and workflows.

I'm also a fan of python generators. Ever since watching David Beazley's talks on generators about 8 years ago, I have used generators extensively in my python code as a way to keep memory usage low and actions composable, using both the explicit yield syntax and the more compact generator expression syntax (which looks like a list comprehension but is evaluated lazily). Thinking about operations as a series of transforms feels natural and lends itself to fairly high re-usability, especially for data processing workloads (cf. Apache Spark's DataFrame transformations).
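To make the "series of transforms" idea concrete, here is a minimal sketch of a generator pipeline. The function names (read_lines, parse_json, large_only), the "size" field, and the sample data are all made up for illustration; the point is only that each stage pulls one item at a time from the stage before it, so nothing is held in memory as a whole.

```python
import io
import json


def read_lines(stream):
    """Yield lines from a text stream, without the trailing newline."""
    for line in stream:
        yield line.rstrip("\n")


def parse_json(lines):
    """Yield one parsed object per non-empty line (one JSON document per line)."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def large_only(records, min_size):
    """Keep records whose (hypothetical) "size" field is at least min_size."""
    # A generator expression: lazy, unlike a list comprehension.
    return (r for r in records if r["size"] >= min_size)


# io.StringIO stands in for a file or sys.stdin in this sketch.
sample = io.StringIO('{"name": "a", "size": 5}\n\n{"name": "b", "size": 50}\n')

# Stages compose like a shell pipeline; work happens only when iterated.
names = [r["name"] for r in large_only(parse_json(read_lines(sample)), min_size=10)]
print(names)  # ['b']
```

Because every stage takes an iterable and returns one, stages can be reordered or reused much like commands in a bash pipe.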

While working with some complex JSON data recently, I realized that the tools I had available for filtering and transforming that data were awkward. I wanted to stay in bash (vs an ipython shell or a standalone script just for this processing) because of all the other tools available in bash, but I wasn't very excited about parsing data with sed, awk, and xargs.

  • I've already written a tool to avoid complex sed expressions in the past, mostly to avoid all the escaping necessary with sed.
  • I have written awk programs that are tens of lines long, but now I tend to just jump over to python when I want to do more complex processing.
  • xargs is pretty awesome, but the syntax has a lot of gotchas once you start wanting to compose more complex expressions from a line of input.

Inspired in part by ammonite (scala) and (python), I wanted to be able to use a batteries-included programming language alongside bash to get things done. What I put together started out as ~50 lines of python and has since grown a bit to add more features (esp. multi-expression python and multiprocessing for parallel computation), but it is still small enough to live as a single file gist.
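As a rough illustration of the general idea (and not pype's actual code or interface), the core of such a tool can be sketched as a generator that evaluates one python expression per line of input. Everything here is an assumption made for the sketch: the function name transform, the convention that the current line is bound to the name line, and the rule that a None result drops the line.

```python
import io
import json


def transform(expr, lines, env=None):
    """Evaluate `expr` once per input line and yield each non-None result.

    Hypothetical conventions for this sketch: the current line is visible
    to the expression as `line`, and returning None filters the line out.
    """
    code = compile(expr, "<expr>", "eval")  # compile once, evaluate many times
    namespace = {"json": json, **(env or {})}
    for raw in lines:
        # eval runs arbitrary code: only use it on expressions you wrote yourself.
        result = eval(code, namespace, {"line": raw.rstrip("\n")})
        if result is not None:
            yield result


# io.StringIO stands in for sys.stdin so the sketch runs on its own.
stdin_stand_in = io.StringIO('{"id": 1, "ok": true}\n{"id": 2, "ok": false}\n')
expr = 'json.loads(line)["id"] if json.loads(line)["ok"] else None'

results = list(transform(expr, stdin_stand_in))
print(results)  # [1]
```

Passing sys.stdin in place of the StringIO object and printing each result would turn this into a filter usable in a pipe; a single expression handles both transforming and filtering, which is what makes the approach attractive next to sed and awk.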

The tool is called pype (for python pipe). The name is, unsurprisingly, already used by a few projects, none of which are terribly active:

  • python-pype, similar to this project (bash + python)
  • pype, a pipe-like constructor for python operations
  • PyPE, an editor
  • More on github:

The source code and docs are included below. I'll be using this and likely adding to it over time. If it becomes part of my workflow I'll move it from a gist to a normal github repo, and add some tests and some packaging.

Let me know what you think by commenting below or reaching out on Twitter!