{"id":17591,"date":"2025-11-06T16:00:47","date_gmt":"2025-11-06T16:00:47","guid":{"rendered":"https:\/\/slack.engineering\/?p=17591"},"modified":"2025-10-23T18:20:33","modified_gmt":"2025-10-23T18:20:33","slug":"build-better-software-to-build-software-better","status":"publish","type":"post","link":"https:\/\/slack.engineering\/build-better-software-to-build-software-better\/","title":{"rendered":"Build better software to build software better"},"content":{"rendered":"<p>We manage the build pipeline that delivers Quip and Slack Canvas\u2019s backend. A year ago, we were chasing exciting ideas to help engineers ship better code, faster. But we had one huge problem: <b>builds took 60 minutes<\/b>. With a build that slow, the whole pipeline gets less agile, and feedback doesn\u2019t come to engineers until far too late.<\/p>\n<p>We fixed this problem by combining modern, high-performance build tooling (Bazel) with classic software engineering principles. Here\u2019s how we did it.<\/p>\n<h2>Thinking About Build (and Code) Performance<\/h2>\n<p>Imagine a simple application. It has a backend server that provides an API and data storage, and a frontend that presents the user interface. Like many modern applications, the frontend and backend are decoupled; they can be developed and delivered independently.<\/p>\n<p>The graph of this build looks like this:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1642\" height=\"430\" class=\"alignnone size-full wp-image-17605\" src=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png\" alt=\"An example service build graph. 
It shows a set of Python files building into a backend artifact and a set of TypeScript files building into a frontend artifact.\" srcset=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png 1642w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=640,168 640w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=768,201 768w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=1280,335 1280w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=1536,402 1536w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=380,100 380w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=800,210 800w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=1160,304 1160w\" sizes=\"auto, (max-width: 1642px) 100vw, 1642px\" \/><\/p>\n<p>We represent the dependencies between the elements of the build, like source files and deployable artifacts, with arrows, forming a <i>directed acyclic graph<\/i>. Here, our backend depends on a collection of Python files, meaning that whenever a Python file changes, we need to rebuild the backend. Likewise, we need to rebuild our frontend whenever a TypeScript file changes \u2014\u00a0but not when a Python file does.<\/p>\n<p>Modeling our build as a graph of clearly-defined units of work lets us apply the same kind of performance optimizations we might use to speed up application code:<i><\/i><\/p>\n<ul>\n<li><i>Do less work<\/i>. 
Store the results of expensive work you do so that you only have to do it once, trading off memory for time.<\/li>\n<li><i>Share the load<\/i>. Spread the work you\u2019re doing across more compute resources in parallel, so that it completes more quickly, trading off compute for time.<\/li>\n<\/ul>\n<h3>Caching and Parallelization<\/h3>\n<p>These techniques will look familiar to most engineers, and thinking about them through a code lens helps solidify how they apply to a build system. Take this Python example:<\/p>\n<pre><code class=\"language-python\">def factorial(n):\n    return n * factorial(n-1) if n else 1\n<\/code><\/pre>\n<p>Calculating a factorial can be very expensive. If we need to do it often, we\u2019re going to run quite slowly. But intuitively, we know that the factorial of a number doesn\u2019t change. This means we can apply the first strategy (<i>do less work<\/i>) to <i>cache<\/i> the results of this function. Then it only needs to run once for any given input.<\/p>\n<pre><code class=\"language-python\">import functools\n\n@functools.cache\ndef factorial(n):\n    return n * factorial(n-1) if n else 1\n<\/code><\/pre>\n<p>The cache stores our inputs (<code>n<\/code>) and maps them to the corresponding outputs (the return value of <code>factorial()<\/code>). <code>n<\/code> is the <i>cache key<\/i>: it functions like a source file in the build graph, with the output of the function corresponding to the built artifact.<\/p>\n<p>Refining our intuition, we can say that this function needs to have a few specific attributes for caching to work. It needs to be <i>hermetic<\/i>: it only uses the inputs it is explicitly given to produce a specific output value. And it needs to be <i>idempotent<\/i>: the output for any given set of inputs is always the same. 
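<\/p>
<p>To make that concrete, here\u2019s a small illustration (the <code>tax_rate<\/code> global is hypothetical): a cached function that reads state outside its arguments violates hermeticity, and the cache happily serves stale answers after that state changes.<\/p>

```python
import functools

# Hypothetical mutable state living OUTSIDE the function's arguments.
tax_rate = 0.10

@functools.cache
def price_with_tax(price):
    # Not hermetic: the result depends on tax_rate, which is not part
    # of the cache key, so the cache cannot see when it changes.
    return round(price * (1 + tax_rate), 2)

price_with_tax(100)   # 110.0 -- computed once, then cached
tax_rate = 0.20       # the hidden input changes...
price_with_tax(100)   # still 110.0 -- a stale result from the cache
```

<p>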
Otherwise, caching is unsound and will have rather surprising effects.<\/p>\n<p>Once we have a cache in place, we want to maximize our <i>hit rate<\/i>: the fraction of times we call <code>factorial()<\/code> that are answered from the cache, rather than from doing the calculation. The way we define our units of work can help us keep the hit rate high.<\/p>\n<p>To illustrate that idea, let\u2019s look at a more complex piece of code:<\/p>\n<pre><code class=\"language-python\">@functools.cache\ndef process_images(\n  images: list[Image],\n  transforms: list[Transform]\n) -&gt; list[Image]: ..<\/code><\/pre>\n<p>This function has a cache attached to it, but the cache isn\u2019t very effective. The function takes two collections of inputs: a list of images and a list of transforms to apply to each image. If the caller changes <i>any<\/i> of those inputs, say by adding one more <code>Image<\/code> to the list, they also change the cache key. That means we won\u2019t find a result in the cache, and will have to do all of the work from the beginning.<\/p>\n<p>In other words, this cache is not very <i>granular<\/i>, and as a result has a poor hit rate. More effective code might look like this:<\/p>\n<pre><code class=\"language-python\">def process_images(\n  images: list[Image],\n  transforms: list[Transform]\n) -&gt; list[Image]:\n  new_images = []\n  for image in images:\n    new_image = image\n    for transform in transforms:\n       new_image = process_image(new_image, transform)\n    new_images.append(new_image)\n\n  return new_images\n\n@functools.cache\ndef process_image(image: Image, transform: Transform) -&gt; Image:\n  ...\n<\/code><\/pre>\n<p>We move the caching to a smaller unit of work \u2014 the application of a single transform to a single image \u2014 and with a correspondingly smaller cache key. We can still retain the higher-level API; we just don\u2019t cache there. 
When a caller invokes the higher-level API with a set of inputs, we can answer that request by only processing the combinations of <code>Image<\/code> and <code>Transform<\/code> that we haven\u2019t seen before. Everything else comes from the cache. Both our cache hit rate and our performance should improve.<\/p>\n<p>Caching helps us do less work. Now let&#8217;s look at sharing the load. What if we fanned out image processing across multiple CPU cores with threads?<\/p>\n<pre><code class=\"language-python\">from concurrent.futures import ThreadPoolExecutor, as_completed\n\ndef process_images_threaded(\n  images: list[Image],\n  transforms: list[Transform]\n) -&gt; list[Image]:\n  with ThreadPoolExecutor() as executor:\n    futures = []\n    for image in images:\n      futures.append(executor.submit(process_images, [image], transforms))\n\n    # Returns images in any order!\n    return [image for future in as_completed(futures)\n            for image in future.result()]<\/code><\/pre>\n<p>Let\u2019s suss out the qualities we need to be able to run our work in parallel, while also noting some key caveats.<\/p>\n<ul>\n<li>Like with caching, we need to rigorously and completely define the inputs and outputs for the work we want to parallelize.<\/li>\n<li>We need to be able to move those inputs and outputs across some type of boundary. Here, it\u2019s a thread boundary, but it might as well be a process boundary or a network boundary to some other compute node.<\/li>\n<li>Units of work that we run in parallel might complete, or fail, in any order. Our code has to manage those outcomes, and clearly document what guarantees it offers. Here, we\u2019re noting the caveat that images may not return in the same order. Whether or not that\u2019s acceptable depends on the API.<\/li>\n<\/ul>\n<p>The granularity of our units of work also plays a role in how effectively we can parallelize, although there\u2019s a lot of nuance involved. 
If our units of work are few in number but large in size, we won\u2019t be able to spread them over as many compute resources. The right trade-off between task size and count varies wildly between problems, but it\u2019s something we need to consider as we design our APIs.<\/p>\n<h3>Transitioning to Build Performance<\/h3>\n<p>If you\u2019re a software engineer, you\u2019ve probably applied those techniques many times to improve your code\u2019s performance. Let\u2019s apply some of those principles and intuitions to build systems.<\/p>\n<p>In a Bazel build, you define <i>targets <\/i>that form a <i>directed acyclic graph<\/i>. Each target has three critical elements:<\/p>\n<ol>\n<li>The files that are <i>dependencies<\/i>, or <i>inputs<\/i>, for this build step.<\/li>\n<li>The files that are <i>outputs<\/i> for this build step.<\/li>\n<li>The commands that transform the inputs into the outputs.<\/li>\n<\/ol>\n<p>(It\u2019s a little more complicated, but that\u2019s enough to get the gist for now).<\/p>\n<p>Let\u2019s go back to that original sample application.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1642\" height=\"430\" class=\"alignnone size-full wp-image-17605\" src=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png\" alt=\"An example service build graph. 
It shows a set of Python files building into a backend artifact and a set of TypeScript files building into a frontend artifact.\" srcset=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png 1642w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=640,168 640w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=768,201 768w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=1280,335 1280w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=1536,402 1536w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=380,100 380w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=800,210 800w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png?resize=1160,304 1160w\" sizes=\"auto, (max-width: 1642px) 100vw, 1642px\" \/><\/p>\n<p>The Bazel target definitions might look something like this pseudocode:<\/p>\n<pre><code class=\"language-python\">python_build(\n    name = &quot;backend&quot;,\n    srcs = [&quot;core\/http.py&quot;, &quot;lib\/options.py&quot;, &quot;data\/access.py&quot;],\n    outs = [&quot;backend.tgz&quot;],\n    cmd = &quot;python build.py&quot;, \n)\nts_build(\n    name = &quot;frontend&quot;,\n    srcs = [&quot;cms\/cms.ts&quot;, &quot;collab\/bridge.ts&quot;, &quot;editing\/find.ts&quot;],\n    outs = [&quot;frontend.tgz&quot;],\n    cmd = &quot;npm build&quot;\n)\n<\/code><\/pre>\n<p>You can think about a build target like defining (but not invoking!) a function. 
Notice some themes carrying over from the Python examples above? We exhaustively define the inputs (<code>srcs<\/code>) and outputs (<code>outs<\/code>) for each build step. We can think of the <code>srcs<\/code> of a target (and the transitive <code>srcs<\/code> of any targets that build <i>them<\/i>, which we\u2019ll see more later) as the inputs to a function we wish to cache.<\/p>\n<p>Because Bazel builds in a sandbox, the <code>cmds<\/code> that transform inputs to outputs only get to touch the input files we declare. And because our inputs and outputs are simply on-disk files, we solve the problem of moving them across a boundary: they just get copied! Lastly, we have to make Bazel a promise: our build steps\u2019 <code>cmds<\/code> are in fact idempotent and hermetic.<\/p>\n<p>When we do those things, we get some extraordinary capabilities for free.<\/p>\n<ul>\n<li>Bazel automatically caches the outcomes of our build actions. When the inputs to a target don\u2019t change, the cached output is used \u2014 no build cost!<\/li>\n<li>Bazel distributes build actions across as many CPU cores as we allow, or even across multiple machines in a build cluster.<\/li>\n<li>Bazel only executes those actions we need for the output artifacts we want. To put it another way, we always run the least amount of work needed for the output we want.<\/li>\n<\/ul>\n<p>Just like in our code examples, we\u2019ll get the most out of Bazel\u2019s magic when we have a well-defined dependency graph, composed of build units that are idempotent, hermetic, and granular. Those qualities allow Bazel\u2019s native caching and parallelization to deliver great speed boosts.<\/p>\n<p>That\u2019s plenty of theory. Let\u2019s drill down on the actual problem.<\/p>\n<h2>Why Quip and Canvas are Harder<\/h2>\n<p>Here\u2019s the trick: Quip and Canvas are <i>much, much more complex<\/i> than the simple example we looked at before. Here\u2019s a real diagram we drew to understand how our build worked. 
Don\u2019t worry about reading all the details \u2014 we\u2019ll use a schematic version as we dig into the problem and our solutions.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"2862\" height=\"4314\" class=\"alignnone size-full wp-image-17606\" src=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/Artifact-Pipeline-Artifact-Detail1-e1759869667574.png\" alt=\"A very complex build graph. The details are not intelligible.\" \/><\/p>\n<p>When we analyzed the graph, we discovered some critical flaws that meant we did not have the characteristics we needed to get a speed boost from Bazel:<\/p>\n<ul>\n<li>A directed acyclic dependency graph<br \/>\n<i>\u2192The graph wasn\u2019t well defined, and in fact contained cycles!<\/i><\/li>\n<li>Idempotent, hermetic, well-sized units of work.<br \/>\n\u2192 <i>Build execution units were <\/i>huge<i>, not all were idempotent, and hermeticity was a challenge because<\/i> <i>many build steps mutated the working directory.<\/i><\/li>\n<li>Granular cache keys to keep cache hit rate high.<br \/>\n\u2192 <i>Our build was so interconnected that our cache hit rate was zero. Imagine that every cached \u201cfunction\u201d we tried to call had 100 parameters, 2-3 of which always changed.<\/i><\/li>\n<\/ul>\n<p>If we had started by throwing Bazel at the build, it would have been ineffective. With a cache hit rate of zero, Bazel\u2019s advanced cache management wouldn\u2019t have helped, and Bazel parallelization would have added little to nothing over the ad-hoc parallelization already present in the build code. To get the Bazel magic, we needed to do some engineering work first.<\/p>\n<h3>Separating Concerns<\/h3>\n<p>Our backend code and our build code were deeply intertwined. Without a Bazel-like build system, we\u2019d defaulted to using in-application frameworks to orchestrate the various steps within our build graph. 
We managed parallelization using Python\u2019s <code>multiprocessing<\/code> strategy and the async routines built into our core codebase. Python business logic pulled together the Protobuf compilation and built Python and Cython artifacts. Lastly, tools like <code>tsc<\/code> and <code>webpack<\/code>, orchestrated by more Python scripts, transformed TypeScript and Less into the unique frontend bundle format used across Slack Canvas and Quip\u2019s desktop and web applications.<\/p>\n<p>As we began to unpick that Gordian knot, we zeroed in on how the union of backend and build code distorted the graph of our <i>frontend<\/i> build. Here\u2019s an easier-to-digest representation of that graph.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-17595 size-full\" src=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation-e1759865828860.png\" alt=\"A graph of a software build. It shows a group of Python files at the top, feeding into a Python application build. 
The Python build and a set of TypeScript files are then the inputs to a TypeScript build, producing a set of frontend bundles.\" width=\"1475\" height=\"906\" srcset=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation-e1759865828860.png 1475w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation-e1759865828860.png?resize=640,393 640w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation-e1759865828860.png?resize=768,472 768w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation-e1759865828860.png?resize=1280,786 1280w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation-e1759865828860.png?resize=380,233 380w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation-e1759865828860.png?resize=800,491 800w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation-e1759865828860.png?resize=1160,713 1160w\" sizes=\"auto, (max-width: 1475px) 100vw, 1475px\" \/><\/p>\n<p>Notice how large the dependency tree is \u201cabove\u201d our frontend bundles. It includes not just their TypeScript source and build process, but also the <i>entire<\/i> built Python backend! Those Python sources and artifacts are <i>transitive<\/i> sources for each and every frontend bundle. 
That means that not just a TypeScript change, but also <i>every single Python change<\/i>, alters the cache key for the frontend (one of those hundred parameters we alluded to earlier), requiring an expensive full rebuild.<\/p>\n<p>The key to this build problem is that dependency edge between the Python application and the TypeScript build.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"554\" height=\"525\" class=\"alignnone size-full wp-image-17600\" src=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation-1-e1759866943466.png\" alt=\"A zoomed-in view of the build graph from above. It focuses on the edges between the Python build and application and the TypeScript build.\" srcset=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation-1-e1759866943466.png 554w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation-1-e1759866943466.png?resize=380,360 380w\" sizes=\"auto, (max-width: 554px) 100vw, 554px\" \/><\/p>\n<p>That edge was costing us an average of <b>35 minutes per build<\/b> \u2014 more than half the total cost! \u2014 because every change was causing a full backend <i>and<\/i> frontend rebuild, and the frontend rebuild was especially expensive.<\/p>\n<p>As we worked through this problem, we realized that the performance cost was not solely a build issue. It was a symptom of our pervasive failure to <b>separate our concerns<\/b> across the whole application: backend, frontend, and build code. We had couplings:<\/p>\n<ul>\n<li>between our backend and our frontend<\/li>\n<li>between our Python and TypeScript infrastructures and toolchains<\/li>\n<li>between our build system and our application code<\/li>\n<\/ul>\n<p>Beyond making the build performance quantitatively worse, those couplings had qualitative impacts across our software development lifecycle. 
Engineers couldn\u2019t reason about the blast radius of changes they made to the backend, because it might break the frontend too.\u00a0Or the build system. Or both. And because our build took an hour, we couldn\u2019t give them early warnings by running builds at the Pull Request level. The lack of health signals, and the difficulty of reasoning through potential consequences, meant that we\u2019d break our build and our main branch far too often, and through no fault of our engineers.<\/p>\n<p>Once we understood the problem through the lens of separation of concerns, it became clear we couldn\u2019t succeed with changes to the build system alone. We had to cleanly sever the dependencies between our frontend and backend, our Python and TypeScript, our application and our build. And that meant we had to invest a lot more time than we\u2019d originally planned.<\/p>\n<p>Over several months, we painstakingly unraveled the <i>actual<\/i> requirements of each of our build steps. We rewrote Python build orchestration code in Starlark, the language Bazel uses for build definitions. Starlark is a deliberately constrained language whose limitations aim at ensuring builds meet all the requirements for Bazel to be effective. Building in Starlark helped us enforce a full separation from application code. Where we needed to retain Python scripts, we rewrote them to remove all dependencies save the Python standard library: no links to our backend code, and no additional build dependencies. We left out all of the parallelization code, because Bazel handles that for us. We\u2019ll revisit that paradigm below in thinking about <i>layering<\/i>.<\/p>\n<p>The complexity of the original build code made it challenging to define \u201ccorrect\u201d behavior. Our build code mostly did not have tests. The only criterion for what was correct was what the existing build system produced under a specific configuration. 
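<\/p>
<p>The essence of such a check is small, even though the real comparison has far more to handle; a simplified sketch diffs the file lists and content hashes of two archives:<\/p>

```python
import hashlib
import io
import tarfile

def compare_artifacts(old_bytes, new_bytes):
    # Map each archive to {path: content hash}, then diff the maps.
    def summarize(data):
        with tarfile.open(fileobj=io.BytesIO(data)) as tar:
            return {m.name: hashlib.sha256(tar.extractfile(m).read()).hexdigest()
                    for m in tar if m.isfile()}
    old, new = summarize(old_bytes), summarize(new_bytes)
    return {
        'missing': sorted(old.keys() - new.keys()),  # dropped by the new build
        'extra': sorted(new.keys() - old.keys()),    # added by the new build
        'changed': sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }

def make_tar(files):
    # Build a small in-memory archive for demonstration purposes.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w:gz') as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()

old = make_tar({'app.py': b'v1', 'conf.ini': b'x=1'})
new = make_tar({'app.py': b'v2', 'conf.ini': b'x=1'})
diff = compare_artifacts(old, new)
# diff['changed'] == ['app.py']; nothing missing or extra
```

<p>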
To reassure ourselves, and to instill confidence in our engineers, we built a tool in Rust to compare an artifact produced by the existing process with one produced by our new code. We used the differences to guide us to points where our new logic wasn\u2019t quite right, and iterated, and iterated more.<\/p>\n<p>That work paid off when we could finally draw a new build graph:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1843\" height=\"947\" class=\"alignnone size-full wp-image-17596\" src=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation1-e1759865982755.png\" alt=\"A software build graph. A set of Python source files feed into a Python application build. This build is shown separately from a frontend build, where a set of TypeScript files is shown to be built into frontend bundles.\" srcset=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation1-e1759865982755.png 1843w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation1-e1759865982755.png?resize=640,329 640w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation1-e1759865982755.png?resize=768,395 768w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation1-e1759865982755.png?resize=1280,658 1280w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation1-e1759865982755.png?resize=1536,789 1536w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation1-e1759865982755.png?resize=380,195 380w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation1-e1759865982755.png?resize=800,411 800w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation1-e1759865982755.png?resize=1160,596 1160w\" sizes=\"auto, (max-width: 1843px) 
100vw, 1843px\" \/><\/p>\n<p>We\u2019d severed all three of those key couplings, and taken off the table any concern that a Python change might break the build or alter the frontend output. Our build logic was colocated with the units it built in <code>BUILD.bazel<\/code> files, with well-defined Starlark APIs and a clean separation between build code and application code. Our cache hit rate went way up, because Python changes no longer formed part of the cache key for TypeScript builds.<\/p>\n<p>The outcome of this work was a substantial reduction in build time: if the frontend was cached, we could build the whole application in as little as 25 minutes. That\u2019s a big improvement, but still not enough!<\/p>\n<h3>Designing for Composition with Layering<\/h3>\n<p>Once we severed the backend \u2190\u2192 frontend coupling, we took a closer look at the frontend build. Why did it take 35 minutes, anyway? As we unraveled the build scripts, we found more separation-of-concerns challenges.<\/p>\n<p>Our frontend builder was trying hard to be performant. It took in a big set of inputs (TypeScript source, LESS and CSS source, and a variety of knobs and switches controlled by environment variables and command-line options),\u00a0calculated the build activity it needed to do, parallelized that activity across a collection of worker processes, and marshalled the output into deployable JavaScript bundles and CSS resources. Here\u2019s a sketch of this part of the build graph.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1482\" height=\"746\" class=\"alignnone size-full wp-image-17602\" src=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation4-e1759867199624.png\" alt=\"A build graph shows TypeScript and LESS source files feeding into a single TypeScript and CSS Build node, along with environment variables and switches. The output of the build node is a set of frontend bundles. 
The build process interacts with a set of worker processes.\" srcset=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation4-e1759867199624.png 1482w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation4-e1759867199624.png?resize=640,322 640w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation4-e1759867199624.png?resize=768,387 768w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation4-e1759867199624.png?resize=1280,644 1280w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation4-e1759867199624.png?resize=380,191 380w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation4-e1759867199624.png?resize=800,403 800w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation4-e1759867199624.png?resize=1160,584 1160w\" sizes=\"auto, (max-width: 1482px) 100vw, 1482px\" \/><\/p>\n<p>Like much of what we discovered during this project, this strategy represented a set of reasonable trade-offs when it was written: it was a pragmatic attempt to speed up one piece of the build in the absence of a larger build framework. And it <i>was<\/i> faster than not parallelizing the same work!<\/p>\n<p>There are two key challenges with this implementation.<\/p>\n<p>Like our <code>process_image()<\/code> example above, our cacheable units were too large. We took in <i>all<\/i> the sources and produced <i>all <\/i>the bundles. What if just one input file changed? That altered the cache key, and we had to rebuild everything. Or what if we wanted to build exactly one bundle, to satisfy a requirement elsewhere in our build process? 
We were out of luck; we had to depend on the whole shebang.<\/p>\n<p>We\u2019re parallelizing the work across processes, which is good \u2014 unless we could be parallelizing across <i>machines<\/i>, machines with lots more CPU cores. If we have those resources available, we cannot use them here. And we\u2019re actually making Bazel less effective at its core function of parallelizing independent build steps: Bazel and the script\u2019s worker processes are contending for the same set of resources. The script might even be parallelizing work that Bazel already knows it does not need!<\/p>\n<p>In both these respects, it was tough for us to <i>compose<\/i> the build functionality in new ways: either a different form of orchestrating multiple bundle builds, or a different parallelization strategy. This was a <i>layering violation<\/i>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"618\" height=\"602\" class=\"alignnone size-full wp-image-17599\" src=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation3-e1759866526521.png\" alt=\"A diagram of layered functionality. At the base is the OS, followed by Parallelization and Language Runtime, then Orchestration and App Core, and finally Logic. A red outline cuts out a unit of logic plus the orchestration and parallelization layers.\" srcset=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation3-e1759866526521.png 618w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation3-e1759866526521.png?resize=380,370 380w\" sizes=\"auto, (max-width: 618px) 100vw, 618px\" \/><\/p>\n<p>If we draw, as here, layers of capability from the operating system up to the application layer, we can see that the boundary of our builder is cutting across layers. It combines business logic, significant parts of a task orchestration framework, and parallelization. 
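<\/p>
<p>In code, the shape of the problem looks something like this hypothetical sketch, where a single function owns the bundling logic, decides what to build, <i>and<\/i> manages its own worker pool:<\/p>

```python
from concurrent.futures import ThreadPoolExecutor

def build_all_bundles(bundles):
    # Logic, orchestration, and parallelization all live in one place.
    # A caller who wants just one bundle, or who wants a different
    # executor (say, a remote build cluster), has no seam to use.
    def build_one(name, srcs):
        # Stand-in for the real bundle compilation step (hypothetical).
        return name + ':' + '+'.join(sorted(srcs))
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = {name: executor.submit(build_one, name, srcs)
                   for name, srcs in bundles.items()}
        return {name: f.result() for name, f in futures.items()}

bundles = {'editor': ['find.ts', 'cms.ts'], 'collab': ['bridge.ts']}
built = build_all_bundles(bundles)
```

<p>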
We really just wanted the top layer \u2014 the logic \u2014 so that we could re-compose it in a new orchestration context. We ended up with a work orchestrator (the builder) inside a work orchestrator (Bazel), and the two layers contending for slices of the same resource pool.<\/p>\n<p>To be more effective, we really just needed to do less. We deleted a lot of code. The new version of the frontend builder was much, much simpler. It didn\u2019t parallelize. It had a much smaller \u201cAPI\u201d interface. It took in one set of source files, and built one output bundle, with TypeScript and CSS processed independently.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"714\" height=\"784\" class=\"alignnone size-full wp-image-17603\" src=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation5-e1759867349405.png\" alt=\"A build graph shows parallel tracks, with TypeScript source flowing through a build to produce JavaScript source, and Less and CSS source flowing through a CSS build to form CSS output. The outputs are then combined into a frontend bundle.\" srcset=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation5-e1759867349405.png 714w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation5-e1759867349405.png?resize=640,703 640w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation5-e1759867349405.png?resize=380,417 380w\" sizes=\"auto, (max-width: 714px) 100vw, 714px\" \/><\/p>\n<p>This new builder is highly cacheable and highly parallelizable. Each output artifact can be cached independently, keyed only on its direct inputs. A bundle\u2019s TypeScript build and CSS build can run in parallel, both against one another and against the builds of other bundles. And our Bazel logic can make decisions about scope (one bundle? two? 
all?), rather than trying to manage a build of all of the bundles.<\/p>\n<p>See the resonance, again, with our example <code>process_images()<\/code> API? We\u2019ve created granular, composable units of work, and that\u2019s dramatically improved our ability to parallelize and cache. We\u2019ve also separated the concerns of our business logic and its orchestration, making it possible for us to re-compose the logic within our new Bazel build.<\/p>\n<p>This change gave us some really nice outcomes:<\/p>\n<ul>\n<li>Because bundle builds and TypeScript and CSS builds can be cached independently of one another, our cache hit rate went up.<\/li>\n<li>Given enough resources, Bazel could parallelize <i>all<\/i> of the bundle builds and CSS compilation steps simultaneously. That bought us a nice reduction in full-rebuild time.<\/li>\n<\/ul>\n<p>As an added benefit, we\u2019re no longer running our own parallelization code. We\u2019ve delegated that responsibility to Bazel. Our build script has a single concern: the business logic around compiling a frontend bundle. That\u2019s a win for maintainability.<\/p>\n<h2>Outcomes and Takeaways<\/h2>\n<p>We made our build a <b>lot<\/b> faster. Faster builds lead to tighter cycle times for engineers, quicker incident resolutions, and more frequent releases. After applying these principles across our whole build graph, we came away with a build that\u2019s as much as six times faster than when we started.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1133\" height=\"717\" class=\"alignnone size-full wp-image-17597\" src=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation2-e1759866173684.png\" alt=\"A diagram shows a &quot;before&quot; build time of 60 minutes for all cases. 
The &quot;after&quot; time shows three cases: 10 minutes in the best case (cached and parallelized); 12 minutes in the average case (mostly cached and parallelized); and 30 minutes in the worst case (a cache miss).\" srcset=\"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation2-e1759866173684.png 1133w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation2-e1759866173684.png?resize=640,405 640w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation2-e1759866173684.png?resize=768,486 768w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation2-e1759866173684.png?resize=380,240 380w, https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation2-e1759866173684.png?resize=800,506 800w\" sizes=\"auto, (max-width: 1133px) 100vw, 1133px\" \/><\/p>\n<p>From a qualitative angle, our key takeaway is that <b>software engineering principles apply to the <\/b><b><i>whole<\/i><\/b><b> system<\/b>. And the whole system is more than our application code. It\u2019s also our build code, our release pipeline, the setup strategies for our developer and production environments, and the interrelations between those components.<\/p>\n<p>So here\u2019s our pitch to you, whether you\u2019re writing application code, build code, release code, or all of the above: separate concerns. Think about the whole system. Design for composability. When you do, every facet of your application gets stronger \u2014 and, as a happy side effect, your build will run a lot faster too.<\/p>\n","protected":false},"excerpt":{"rendered":"We manage the build pipeline that delivers Quip and Slack Canvas\u2019s backend. A year ago, we were chasing exciting ideas to help engineers ship better code, faster. But we had one huge problem: builds took 60 minutes. 
With a build that slow, the whole pipeline gets less agile, and feedback doesn\u2019t come to engineers until&hellip;","protected":false},"author":546,"featured_media":17605,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[3],"tags":[518,519,2472,2478,545,614],"class_list":{"0":"post-17591","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-uncategorized","8":"tag-build-performance","9":"tag-caching","10":"tag-ci-cd","11":"tag-developer-experience","12":"tag-devops","13":"tag-python","14":"ts-entry"},"acf":{"subtitle":"","excerpt":"","has_toc":false,"author_group":{"configure_author":"wordpress","authors":[{"ID":17592,"post_author":"546","post_date":"2025-10-07 19:17:04","post_date_gmt":"2025-10-07 19:17:04","post_content":"","post_title":"David Reed","post_excerpt":"","post_status":"publish","comment_status":"closed","ping_status":"closed","post_password":"","post_name":"david-reed","to_ping":"","pinged":"","post_modified":"2025-10-07 19:17:04","post_modified_gmt":"2025-10-07 
19:17:04","post_content_filtered":"","post_parent":0,"guid":"https:\/\/slack.engineering\/?post_type=author&#038;p=17592","menu_order":0,"post_type":"author","post_mime_type":"","comment_count":"0","filter":"raw"}],"custom_author":""},"series":false,"tags":[518,519,2472,2478,545,614]},"jetpack_featured_media_url":"https:\/\/slack.engineering\/wp-content\/uploads\/sites\/7\/2025\/10\/DevXP-Council-Presentation6-e1759869480617.png","_links":{"self":[{"href":"https:\/\/slack.engineering\/wp-json\/wp\/v2\/posts\/17591","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/slack.engineering\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/slack.engineering\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/slack.engineering\/wp-json\/wp\/v2\/users\/546"}],"replies":[{"embeddable":true,"href":"https:\/\/slack.engineering\/wp-json\/wp\/v2\/comments?post=17591"}],"version-history":[{"count":6,"href":"https:\/\/slack.engineering\/wp-json\/wp\/v2\/posts\/17591\/revisions"}],"predecessor-version":[{"id":17608,"href":"https:\/\/slack.engineering\/wp-json\/wp\/v2\/posts\/17591\/revisions\/17608"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/slack.engineering\/wp-json\/wp\/v2\/media\/17605"}],"wp:attachment":[{"href":"https:\/\/slack.engineering\/wp-json\/wp\/v2\/media?parent=17591"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/slack.engineering\/wp-json\/wp\/v2\/categories?post=17591"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/slack.engineering\/wp-json\/wp\/v2\/tags?post=17591"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}