Conda for the pipeline person
Reading time: 15 minutes
2024/02 EDIT: I still believe in the value of conda pack, but these days my approach with third party tools that I don't own is a lot more conservative.
Who am I?
- A Pipeline TD trying to write, test and ship code that requires third-party dependencies

For those who are not from the VFX industry, a Pipeline TD is a poor fellow who was too technical to be a VFX artist but didn't feel like pursuing the good, cozy, warm life of a software developer. They're basically devs who like movies more than startups, and should seriously reconsider their life choices.
Who am I not?
- A Pipeline TD who is fine with just vendorizing such third-party dependencies manually for each package I need to ship

NB: I'm not saying vendorizing is evil, but I feel that as devs we should strive to organize things in a way that's more efficient and feels less like "I need a solution that works now, I'm just gonna dump all of my dependencies here and call it a day". There's a few interesting insights here: http://bitprophet.org/blog/2012/06/07/on-vendorizing so I definitely recommend having a read. While vendorizing can sometimes be the way to go, it can make maintaining your dependencies quite hard, because it all boils down to a manual process. It's also inherently inefficient, because each single software package bundles everything without sharing its dependencies.
What do I want?
- A way to follow and build upon the vfx-reference-platform and distribute my Python libraries/packages in an easy way. My Python code might rely on some third-party dependencies (like `numpy`), because I might need to build more complex tools.

What the heck is Conda? And why do y'all use snake names?
My surname is literally the name of a snake, so I'm not the best person to advise here. But what's not to like about Pythons and Anacondas?

Anyway, Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Think of it as a more advanced rpm that you can use to create and install software packages.
The main idea of a conda environment is that it creates an isolated space where you have precise control over the software dependencies installed. The control comes from declaring a recipe, a list of software ingredients required to bake a specific software package.
Quoting from conda's website:
Users can create virtual environments using one of several tools such as Pipenv or Poetry, or a conda virtual environment. Pipenv and Poetry are based around Python's built-in venv library, whereas conda has its own notion of virtual environments that is lower-level (Python itself is a dependency provided in conda environments).
The beauty of conda is that Python is considered a dependency itself. With pip, instead, it's the other way around: pip can be seen as a dependency of Python.
Talk is cheap, show us the code
Ok ok, Linus. Here ya go. This is how a bare-minimum recipe for a conda env looks. Since "recipe" is actually the name of other conda-specific stuff, let's just start calling this by its true name: `environment.yml`:

```yaml
name: test-env
channels:
  - conda-forge
  - defaults
dependencies:
  - ca-certificates=2021.5.30=h033912b_0
  - libcxx=11.1.0=habf9029_0
  - libffi=3.3=h046ec9c_2
  - ncurses=6.2=h2e338ed_4
  - openssl=1.1.1k=h0d85af4_0
  - python=3.9.4=h9133fd0_0_cpython
  - readline=8.1=h05e3726_0
  - sqlite=3.35.5=h44b9ce1_0
  - tk=8.6.10=h0419947_1
  - tzdata=2021a=he74cb21_0
  - xz=5.2.5=haf1e3a3_1
  - zlib=1.2.11=h7795811_1010
prefix: /Users/vvzen/miniconda3/envs/test-env
```
Let's analyze that:
`name` is just the arbitrary name I gave to this environment.

`channels` indicates where the packages for this environment can be downloaded from.
- `defaults` is the main, official conda channel, handled by Continuum Analytics.
- `conda-forge` is an awesome bunch of peeps creating packages and relying on a fully automated CI/CD (running mostly on Azure) to build these packages for a number of platforms. Basically, conda-forge is the home of all community-contributed packages. If something is not on `defaults`, somebody has probably packaged it on conda-forge. If not, join the cult and do it yourself! (I'll make a blog post about that too, but it's quite easy and doesn't require a lot of time!)
You can also have a local channel, and that's where conda might come in handy for Pipeline TDs. If your code lives in a content network where you don't have access to the internet, having a local channel is the way to go. All it takes is a web server that serves a local directory, and referring to your packages with `file:///` URLs. I'll try to write a bit more about that, too.
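To give you an idea, here's a rough sketch of a local channel setup. Everything here is hypothetical (the paths, port and package name are made up), and note that `conda index` ships with the `conda-build` package:

```bash
# Lay out the channel on disk, then generate the repodata
# (after copying your built packages into the platform subdirectories)
mkdir -p /srv/conda-channel/osx-64 /srv/conda-channel/noarch
conda index /srv/conda-channel

# Either point conda straight at the directory...
conda install mypackage --channel file:///srv/conda-channel

# ...or serve it with any web server and use its URL instead
python -m http.server 8080 --directory /srv/conda-channel
conda install mypackage --channel http://localhost:8080
```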
`dependencies` is the real fun: a list of all dependency names, versions and build strings (those checksum-looking suffixes). Right now, what you see is the result of just running a simple `conda create --name test-env python=3` to create a base env with Python 3. I haven't specified the minor version of Python 3, so conda picked the latest available (3.9.4). (The full round trip is sketched right after this breakdown.)
`prefix` is where this environment lives. All of the binaries and libraries installed via conda will live here, under the respective `bin`, `lib`, etc. directories. When asked to (`conda init`), conda will hook into your shell; activating the environment then prepends this dir to your `$PATH` and updates other relevant environment variables. There are many ways to tailor this in order to have as little impact as possible on your existing environment.
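Putting the breakdown together, the typical round trip looks like this (the `which python` output assumes the prefix shown above):

```bash
# Create a base env with the latest available python 3
conda create --name test-env python=3

# Activate it: this is when the prefix gets prepended to $PATH
conda activate test-env

# The interpreter now resolves to the one inside the env...
which python    # /Users/vvzen/miniconda3/envs/test-env/bin/python

# ...and python itself shows up as just another package
conda list python
```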
Now that we've got the hang of a simple environment, let's spice things up a bit. I'm a Pipeline TD working on a Python library to deal with editorial formats, and I know that there's a nice library for parsing AAF files, so I don't have to reinvent the wheel myself. Let's install it:
```bash
conda install pyaaf2 --channel conda-forge
```
Note how I specified the channel via the CLI args. I know that `pyaaf2` is a niche thing, and it's not on the `defaults` channel, but somebody (in this case, myself!) has luckily made it available on conda-forge.
Conda will then tell me that
```
The following NEW packages will be INSTALLED:

  pyaaf2             conda-forge/noarch::pyaaf2-1.4.0-pyhd8ed1ab_0
```
You can see the name of the dependency, the channel, the version and something that resembles a checksum: that last bit is the build string.
The `noarch` label just means that it's a pure Python package, so it doesn't need to be compiled against a specific architecture.
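As a quick sanity check that the install worked, keep in mind that the import name differs from the package name (the module is called `aaf2`):

```bash
# Ask conda what got installed
conda list pyaaf2

# The library is importable as aaf2, not pyaaf2
python -c "import aaf2; print(aaf2.__file__)"
```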
If we run `conda env export` now (the command that generates that `environment.yml` automatically, based on the currently active env), we'll just see that a new line has popped up:
```
dependencies:
  [...]
  - pyaaf2=1.4.0=pyhd8ed1ab_0
  - python=3.9.4=h9133fd0_0_cpython
  [...]
```
That makes sense: pyaaf2 is a pure Python package with no additional dependencies, so installing it brought in just `pyaaf2` itself. All good.
Let's now say that I also need to do some data processing, and I want to use the full power of pandas because I need to be able to run complex joins in a simple way.
Let's go:
```bash
conda install pandas
```

will tell me that
```
The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    libblas-3.9.0              |       9_openblas          11 KB  conda-forge
    libcblas-3.9.0             |       9_openblas          11 KB  conda-forge
    libgfortran-5.0.0          | 9_3_0_h6c81a4c_22         19 KB  conda-forge
    libgfortran5-9.3.0         |      h6c81a4c_22         1.7 MB  conda-forge
    liblapack-3.9.0            |       9_openblas          11 KB  conda-forge
    libopenblas-0.3.15         | openmp_h5e1b9a4_1         8.7 MB  conda-forge
    llvm-openmp-11.1.0         |        hda6cdc1_1        268 KB  conda-forge
    numpy-1.20.3               |   py39h7eed0ac_1         5.5 MB  conda-forge
    pandas-1.2.4               |   py39h4d6be9b_0        10.8 MB  conda-forge
    ------------------------------------------------------------
                                           Total:        27.0 MB

The following NEW packages will be INSTALLED:

  libblas            conda-forge/osx-64::libblas-3.9.0-9_openblas
  libcblas           conda-forge/osx-64::libcblas-3.9.0-9_openblas
  libgfortran        conda-forge/osx-64::libgfortran-5.0.0-9_3_0_h6c81a4c_22
  libgfortran5       conda-forge/osx-64::libgfortran5-9.3.0-h6c81a4c_22
  liblapack          conda-forge/osx-64::liblapack-3.9.0-9_openblas
  libopenblas        conda-forge/osx-64::libopenblas-0.3.15-openmp_h5e1b9a4_1
  llvm-openmp        conda-forge/osx-64::llvm-openmp-11.1.0-hda6cdc1_1
  numpy              conda-forge/osx-64::numpy-1.20.3-py39h7eed0ac_1
  pandas             conda-forge/osx-64::pandas-1.2.4-py39h4d6be9b_0
  python-dateutil    conda-forge/noarch::python-dateutil-2.8.1-py_0
  python_abi         conda-forge/osx-64::python_abi-3.9-1_cp39
  pytz               conda-forge/noarch::pytz-2021.1-pyhd8ed1ab_0
  six                conda-forge/noarch::six-1.16.0-pyh6c4a22f_0
```
Wow.
pandas pulls in a lot of dependencies, the heaviest and most notable one being numpy. conda has fetched all of them and is promptly telling me how installing this new package is gonna affect my environment. Note that there's some really low-level stuff here, like libgfortran, llvm-openmp (for multiprocessing), et cetera.
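By the way, if you ever want to see this report without actually touching your environment, conda supports a dry run:

```bash
# Show what would be downloaded and installed, then stop
conda install pandas --dry-run
```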
I don't see any error so I'm happy to go on and install all of them.
Now my `environment.yml` contains all of these new dependencies.
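To be precise, the file doesn't update itself: you regenerate it from the currently active env (and ideally keep it under version control):

```bash
# Regenerate environment.yml from the currently active environment
conda env export > environment.yml
```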
But ouch.. wait. I've installed numpy 1.20.3, but the vfx-reference-platform for 2021 clearly mentions that numpy should stay at 1.19.x. I'm a good dev, and the main point of the vfx-reference-platform is to make things consistent, in order to avoid future headaches. So lemme just stay on par with that.
```bash
conda install numpy=1.19
```
Conda will then tell me that
```
The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    numpy-1.19.5               |   py39he588a01_1         5.1 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         5.1 MB

The following packages will be DOWNGRADED:

  numpy    1.20.3-py39h7eed0ac_1 --> 1.19.5-py39he588a01_1
```
Once again, it seems all good. Downgrading numpy didn't cause any issues with other dependencies, otherwise conda would have warned me. That makes sense, since right now pandas is the only package in my env that requires numpy.
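If I want to make sure a future update doesn't silently bump numpy past what the vfx-reference-platform allows, conda also supports pinning packages through a `pinned` file inside the environment. A minimal sketch:

```bash
# Keep numpy on the 1.19 series for this environment from now on
echo "numpy 1.19.*" >> "$CONDA_PREFIX/conda-meta/pinned"
```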
Coming soon: try to break the system by installing conflicting dependencies!
Where conda can shine!
I'm in the process of gathering a few interesting use cases of when conda can save us by refusing to install two dependencies that require slightly different, incompatible versions of another dependency!
Conda pack: making self-contained tarballs of environments
Amazing, now I've got a way to handle environments in a consistent way. So what - does that mean that I have to use conda everywhere? Not just at installation time, but also at runtime? What if I don't want to, because my pipeline is old and I can't just reboot it and use conda environments everywhere?
Enter `conda-pack`!

`conda-pack` is a neat way to package and distribute conda environments to be run on machines that don't have conda installed.
NB: I personally see it as an alternative to building Python wheels, since I still haven't fully investigated wheels. There's also another command, `conda build`, that allows you to build proper conda packages too: I just didn't have the time to play around with it long enough.
This means that you can use conda simply to handle the dependency resolution mechanism at "build time". Then, you can leave the conda nest: pack your environment, deploy it, untar it, and just `source` the content of the tar in your shell, without the need for conda or virtualenv to be installed. In my case, I can set up a nice GitLab CI/CD that revolves around a curated `environment.yml`, and automatically run `conda-pack` to generate the tarball whenever somebody makes a new PR.
You can see a really nice demo here: https://conda.github.io/conda-pack/
The gist is:

```bash
# Install conda-pack, either from defaults or from conda-forge
conda install conda-pack
conda install -c conda-forge conda-pack

# Pack a whole environment into a single relocatable tarball
conda-pack --name pipeline-third-party-packages -o pipeline-third-party-packages.tar.gz
```

where `pipeline-third-party-packages` is the name of the environment you want to pack.
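On the target machine (which, again, doesn't need conda at all), the workflow from the conda-pack docs boils down to:

```bash
# Unpack the tarball into a directory of your choice
mkdir -p pipeline-third-party-packages
tar -xzf pipeline-third-party-packages.tar.gz -C pipeline-third-party-packages

# Activate the environment by sourcing the bundled script
source pipeline-third-party-packages/bin/activate

# Fix up the prefix paths hardcoded inside the env, and you're done
conda-unpack
```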
Speeding things up with mamba
Dependency resolution is not an easy task, and conda can sometimes take a long time when trying to resolve dependencies and find a compatible subset. Picking your dependencies wisely is crucial to help conda do its job, and there are a few tips you can learn to generally help the conda algorithm do its thing: see https://www.anaconda.com/blog/understanding-and-improving-condas-performance.

Another good strategy is to pick a slightly faster conda client, like mamba: https://github.com/mamba-org/mamba. While conda itself is written in Python, mamba is written in C++, and exposes a few nice options in its CLI to query who depends on a certain package, and what a certain package depends on.
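For example (a sketch: `mamba repoquery` has a few subcommands, so check `mamba repoquery --help` on your version for the details):

```bash
# mamba is a drop-in replacement for the usual conda subcommands
mamba install pandas --channel conda-forge

# Which packages in my env depend on numpy?
mamba repoquery whoneeds numpy

# What does pandas itself depend on?
mamba repoquery depends pandas
```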