How to Reduce Rust Binary Size from 40MB to 400KB

Rust optimization: how dependency choices, compilation flags, and feature pruning cut a simple CLI from ~40MB to ~400KB, dramatically improving cold-starts.

Why a Rust binary’s size mattered for deployment

When a compact command-line tool written in Rust produced a 40MB release binary, the consequence showed up not in local development but during Docker deployment: the container image expanded to about 180MB and container start time grew from roughly 2 seconds to 8 seconds. Six extra seconds of startup latency was unacceptable in a microservices environment where fast cold starts matter. This article follows that optimization journey, showing how targeted dependency surgery, compilation settings, and feature control reduced the binary by roughly 99%, cut runtime memory from 28.4MB to 2.1MB, and brought in-process cold-start latency down to tens of milliseconds.

The deceptive cost of “lightweight” crates

The project began with a conventional Rust stack: serde for JSON, reqwest for HTTP, and tokio as the async runtime. Each crate is well-regarded, but a cargo bloat analysis revealed the true impact: reqwest accounted for 11.2MB of .text, tokio for 7.7MB, openssl-sys for 3.7MB, and hyper for 2.6MB. Those four crates alone contributed about 25.2MB and dominated the executable size. More generally, dependency count and transitive features correlated directly with both binary size and startup overhead: each included crate introduced code paths and initialization costs that the application ultimately paid for at runtime.

Measuring before changing: the data that drove decisions

Optimization began with metrics. Comparing builds with different dependency sets showed a clear pattern: more crates meant larger binaries and slower cold starts. That observation framed every subsequent tradeoff. Concrete production results after optimization were striking: container start time returned from 8s to 2s, runtime memory dropped from 28.4MB to 2.1MB (a reduction of about 92.6%), cold-start latency improved from 847ms to 23ms (about a 97% improvement), and the per-deployment storage footprint moved from roughly 40.2MB to 0.4MB. Those numbers guided which changes justified their engineering cost.
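
The percentages quoted above can be checked with a few lines of arithmetic. The sketch below uses only the before/after figures from the text; the function names are illustrative and not part of the original project.

```rust
/// Percentage reduction when a metric goes from `before` to `after`.
fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// The before/after pairs quoted in the text: (label, before, after).
fn reported_metrics() -> [(&'static str, f64, f64); 4] {
    [
        ("container start (s)", 8.0, 2.0),
        ("runtime memory (MB)", 28.4, 2.1),
        ("cold-start latency (ms)", 847.0, 23.0),
        ("storage per deployment (MB)", 40.2, 0.4),
    ]
}

/// One formatted line per metric, rounded to one decimal place.
fn report() -> Vec<String> {
    reported_metrics()
        .iter()
        .map(|(label, before, after)| {
            format!("{label}: {:.1}% lower", percent_reduction(*before, *after))
        })
        .collect()
}
```

Run over the four pairs, this gives 75.0% for container start, roughly 92.6% for memory, roughly 97.3% for cold-start latency, and roughly 99.0% for storage.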

Surgical dependency replacement: trade convenience for minimality

Rather than accept the full feature sets of heavy crates, the author replaced large libraries with bespoke, narrowly scoped implementations for the specific needs of the tool.

HTTP client: reqwest provided a full-featured HTTP stack but the tool only needed to POST JSON to a single endpoint. Replacing reqwest with a tiny TCP-based client eliminated the 11.2MB contribution for HTTP functionality. The tradeoffs were explicit: the minimalist client removed automatic HTTPS handling, connection pooling, and extensive error semantics—features the application did not require for its constrained workload.
JSON parsing: serde is powerful for broad serialization needs, but the tool only needed to extract a few predictable fields. Switching to a targeted parser that searches for field keys in the input JSON reduced dependency surface and cut parsing-related size by a substantial fraction. The guiding principle was to align the parser’s capabilities precisely with the input shape rather than carrying a full serialization framework.
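
The two replacements described above might look roughly like the following. This is a minimal sketch using only the standard library, not the author's actual code: the function names are hypothetical, the client speaks plain HTTP/1.1 over TCP with no TLS, timeouts, or chunked decoding, and the field extractor assumes flat JSON with unescaped string values and keys that do not also appear inside values.

```rust
use std::io::{self, Read, Write};
use std::net::TcpStream;

/// Build a minimal HTTP/1.1 POST request carrying a JSON body.
fn build_post_request(host: &str, path: &str, body: &str) -> String {
    format!(
        "POST {path} HTTP/1.1\r\n\
         Host: {host}\r\n\
         Content-Type: application/json\r\n\
         Content-Length: {}\r\n\
         Connection: close\r\n\
         \r\n\
         {body}",
        body.len()
    )
}

/// Send the request over plain TCP and return the raw response text
/// (status line, headers, and body). No TLS, redirects, or retries.
fn post_json(host: &str, port: u16, path: &str, body: &str) -> io::Result<String> {
    let mut stream = TcpStream::connect((host, port))?;
    stream.write_all(build_post_request(host, path, body).as_bytes())?;
    let mut response = String::new();
    // `Connection: close` makes the server end the stream after replying.
    stream.read_to_string(&mut response)?;
    Ok(response)
}

/// Return the value of a string field such as `"status": "ok"`.
/// Looks only at the first occurrence of the quoted key and returns None
/// if the value is not a string. Escaped quotes are not handled.
fn extract_string_field<'a>(json: &'a str, key: &str) -> Option<&'a str> {
    let needle = format!("\"{key}\"");
    let after_key = &json[json.find(&needle)? + needle.len()..];
    let after_colon = after_key.trim_start().strip_prefix(':')?.trim_start();
    let value = after_colon.strip_prefix('"')?;
    let end = value.find('"')?;
    Some(&value[..end])
}
```

For example, extract_string_field applied to {"id": 7, "status": "ok"} with the key status returns Some("ok"), while the key id returns None because its value is not a string. The narrowness is the point: everything a general-purpose library would handle is left out on purpose, which is only safe when the input shape is fully known.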

The consistent lesson: match tooling to the exact, known requirements; avoid pulling in broad ecosystems “just in case.”

Compilation flags that make a real difference

After reducing dependencies, compiler-level options produced additional, predictable size wins. Key release profile settings included enabling link-time optimization (lto = true), consolidating code generation units (codegen-units = 1), switching panic behavior to abort (panic = "abort"), stripping symbols (strip = true), and choosing an optimization level focused on size (opt-level = "z"). In this case opt-level = "z" alone reduced binary size by about 23%. With LTO and a single codegen unit, the compiler could inline and eliminate dead code across crate boundaries more aggressively, which translated to smaller output and faster startup.
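
Collected in one place, the settings named above correspond to a release profile along these lines in Cargo.toml. The keys and values are the ones listed in the text; the author's complete manifest is not shown in the article, so treat this as a sketch.

```toml
[profile.release]
opt-level = "z"     # optimize for size rather than speed
lto = true          # link-time optimization across crate boundaries
codegen-units = 1   # one codegen unit: slower builds, better optimization
panic = "abort"     # no unwinding machinery in the binary
strip = true        # remove symbols from the final executable
```

Note that strip = true in the profile requires a reasonably recent toolchain (Rust 1.59 or later), and that opt-level = "s" is a less aggressive size setting worth measuring against "z".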

Feature-flag surgery: stop paying for unused functionality

Many crates enable broad default features intended to help developers get started quickly. Explicitly disabling default features (default-features = false) and enabling only the needed ones yielded consistent reductions, typically on the order of 20–30% for the author’s dependencies. Examples used in the project included disabling default features for tokio and serde while opting into just the minimal runtime or derive support required. This manual pruning requires upfront analysis of which sub-features the application actually relies on, but the size dividends proved substantial.
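
In Cargo.toml, this kind of pruning might look like the fragment below. The versions and the specific feature lists are assumptions for illustration; the article does not give the author's exact manifest.

```toml
[dependencies]
# Illustrative versions and features, not the author's exact manifest.
# tokio 1.x has an empty default feature set, so the saving there comes
# from listing individual features instead of enabling "full".
tokio = { version = "1", default-features = false, features = ["rt", "net"] }
# Without its default "std" feature serde builds in no_std mode; add "std"
# or "alloc" back if the code serializes String or Vec.
serde = { version = "1", default-features = false, features = ["derive"] }
```

Running cargo tree -e features afterwards is one way to confirm which features are actually enabled, since another dependency can switch a feature back on.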

The static linking trade

The author evaluated dynamic versus static linking and chose a static approach for distribution simplicity. For cryptographic needs the openssl crate was configured with its vendored feature, so that OpenSSL was compiled from source and linked into the binary. That decision added about 2.1MB but removed runtime library dependencies and avoided version conflicts in target environments. For single-binary deployments the extra bytes were a reasonable trade for simpler operations and predictable runtime behavior.
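
The vendored configuration mentioned above is a single line in Cargo.toml. The version number here is an assumption; only the use of the vendored feature comes from the text.

```toml
[dependencies]
# "vendored" builds OpenSSL from source and links it statically,
# so the target machine needs no system OpenSSL at runtime.
openssl = { version = "0.10", features = ["vendored"] }
```

The cost is a longer build, since it needs a C toolchain on the build machine, in addition to the roughly 2.1MB of extra binary size reported above.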

When to optimize aggressively and when not to

The author developed a decision framework based on deployment context:

Optimize aggressively when image size affects startup time (container deployments), when bandwidth and storage are constrained (edge computing, embedded systems), and when cold starts are critical (serverless/Lambda and high-frequency deployments).
Accept larger binaries in development builds where compile time and debug info matter, when complex feature sets require richer libraries, or in environments where shared libraries and dynamic linking provide operational benefits.
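
The two lists above can be read as a small decision function. The sketch below is one possible encoding: the type and field names are hypothetical, and the MeasureFirst outcome for mixed or absent signals is an assumption added here, not something the author states.

```rust
/// Deployment traits named in the text; field names are illustrative.
#[derive(Debug, Clone, Copy, Default)]
struct DeploymentContext {
    image_size_affects_startup: bool,       // container deployments
    bandwidth_or_storage_constrained: bool, // edge computing, embedded systems
    cold_start_critical: bool,              // serverless, high-frequency deploys
    development_build: bool,                // compile time and debug info matter
    needs_rich_feature_set: bool,           // complex features need full libraries
    benefits_from_dynamic_linking: bool,    // shared libraries help operations
}

#[derive(Debug, PartialEq)]
enum Recommendation {
    OptimizeAggressively,
    AcceptLargerBinary,
    /// Assumption: with mixed or no signals, gather measurements first.
    MeasureFirst,
}

fn recommend(ctx: DeploymentContext) -> Recommendation {
    let size_pressure = ctx.image_size_affects_startup
        || ctx.bandwidth_or_storage_constrained
        || ctx.cold_start_critical;
    let size_tolerance = ctx.development_build
        || ctx.needs_rich_feature_set
        || ctx.benefits_from_dynamic_linking;

    match (size_pressure, size_tolerance) {
        (true, false) => Recommendation::OptimizeAggressively,
        (false, true) => Recommendation::AcceptLargerBinary,
        _ => Recommendation::MeasureFirst,
    }
}
```

A context with only cold_start_critical set yields OptimizeAggressively, and one with only development_build set yields AcceptLargerBinary.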

Choosing whether to pursue aggressive size reduction depends not on abstract ideals but on concrete production constraints.

Production impact quantified

Putting the combined strategies together yielded measurable production improvements:

Container start time returned to 2 seconds from 8 seconds.
Runtime memory demand fell from 28.4MB to 2.1MB, a decline of about 92.6%.
Cold-start latency dropped from 847ms to 23ms, representing a 97% improvement.
The binary size and storage footprint shifted from roughly 40.2MB per deployment to about 0.4MB.

These outcomes reinforce that binary-size optimization affects the full deployment lifecycle: build, transfer, startup, and runtime.

Developer tradeoffs and team implications

The process required deliberate choices that trade developer ergonomics for production efficiency. Replacing robust libraries with minimal implementations increases maintenance overhead, reduces automatic coverage for edge-case errors, and may require more careful testing. Explicit feature disabling and compiler tweaks can complicate local development workflows and debuggability (for example, panic = "abort" removes unwinding behavior and strip = true removes symbols). Teams must weigh these costs against operational gains. For applications where cold-start latency, container image size, or edge constraints are not critical, default dependency choices and full-featured crates may be the better option.

A framework for repeating this optimization

The practical approach used here is repeatable:

Measure baseline: run size and startup analyses (for example, cargo bloat --release --crates for per-crate size, plus timed cold starts).
Identify dominant contributors: focus on the handful of crates that consume most of the binary.
Question necessity: for each heavyweight dependency, ask whether the full feature set is required.
Apply surgical replacements where feasible: implement minimal functionality that satisfies production needs.
Use compiler and feature flags to further reduce size: LTO, single codegen unit, panic behavior, opt-level for size, and feature toggles.
Re-measure and validate performance and correctness in production-like conditions.
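
The re-measure step can be partly automated so that a size regression fails fast. The sketch below is an assumed helper, not something from the original project: it compares a built artifact's size on disk with a byte budget, using only the standard library.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Outcome of comparing a file's size with a budget.
#[derive(Debug, PartialEq)]
enum BudgetCheck {
    Within { bytes: u64 },
    Exceeded { bytes: u64, over_by: u64 },
}

/// Compare the size of the file at `path` with `max_bytes`.
/// Returns an I/O error if the file cannot be inspected.
fn check_size_budget(path: &Path, max_bytes: u64) -> io::Result<BudgetCheck> {
    let bytes = fs::metadata(path)?.len();
    Ok(if bytes <= max_bytes {
        BudgetCheck::Within { bytes }
    } else {
        BudgetCheck::Exceeded { bytes, over_by: bytes - max_bytes }
    })
}
```

A build script or test could call check_size_budget on the release binary with a budget such as 512 * 1024 bytes and fail when the result is Exceeded. The budget value is a project-specific choice, and file size is only a proxy: startup time still needs its own measurement.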

Following this loop turns intuition into measurable improvements without blind premature optimization.

Broader implications for systems and deployment practices

This optimization story highlights structural tradeoffs in modern systems engineering. Language ecosystems that favor ergonomics and rapid development can lead to surprisingly large runtime artifacts that matter in distributed deployments. As microservices, serverless functions, and edge workloads proliferate, teams must adopt tooling and processes that surface the deployment costs of developer-facing conveniences. The work also underscores the importance of measurement-driven decisions: without profiling and end-to-end metrics, dependency cost is invisible until it hits production.

For developer tools and CI pipelines, there is an opportunity to bake size and startup checks into continuous validation. For platform teams managing container registries or edge rollouts, size-aware policies and caching strategies can reduce the operational impact of large artifacts. And for architects choosing runtimes and libraries, the story is a reminder that “zero-cost” abstractions can still carry real operational expenses when their transitive ecosystems are large.

The experience also has implications for library authors: consider exposing smaller, opt-in feature sets and documenting size impacts so downstream consumers can make informed tradeoffs.

The work applies directly to Rust and related tooling, but the pattern—measure, isolate, and minimize—applies across languages and stacks where dependency ecosystems have grown large.

Continued improvements will likely come from a mix of library-level modularization, more granular feature defaults, and platform-level tooling that makes size and startup costs visible earlier in development. For teams prioritizing minimal deployment footprints, combining targeted dependency choices with release-profile tuning yields immediate, practical wins that improve both cost and user experience.