# Performance Benchmarks
How GetWebP CLI performs under various workloads, and why.
See also: README | Commands | Getting Started
## Architecture Overview
Understanding GetWebP's performance characteristics requires understanding its processing pipeline.
### WASM Codec Pipeline
GetWebP uses jSquash WASM codecs compiled from the same C/C++ libraries that power Google's Squoosh:
| Codec | Origin | Purpose |
|---|---|---|
| `@jsquash/jpeg` | MozJPEG | JPEG decoding |
| `@jsquash/png` | Squoosh PNG (Rust via wasm-bindgen) | PNG decoding |
| `@jsquash/webp` | libwebp | WebP decoding and encoding |
| `bmp-js` | Pure JavaScript | BMP decoding |
Each image goes through a four-stage pipeline:

```
Read file (I/O) --> Decode to RGBA (WASM) --> Encode to WebP (WASM) --> Write file (I/O)
```
Key implication: The WASM codecs run in the main thread (not native C). This means:
- Decode + encode is CPU-bound and runs at roughly 60--80% of native `cwebp` speed for equivalent operations.
- WASM initialization is a one-time cost. Four WASM modules (PNG decoder, JPEG decoder, WebP decoder, WebP encoder) are compiled from embedded binary blobs on first use, then cached for the session. This adds ~100--300ms to the first invocation.
- Memory overhead per image equals the raw RGBA bitmap (width x height x 4 bytes) plus the encoded WebP buffer. A 4000x3000 photo requires ~48 MB of transient memory.
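The per-image figure comes from a simple formula. A sketch (the helper name is illustrative, not part of GetWebP's API):

```typescript
// A decoded image is held as a raw RGBA bitmap: 4 bytes per pixel.
function rgbaBytes(width: number, height: number): number {
  return width * height * 4;
}

// A 4000x3000 photo needs ~48 MB of transient decode memory,
// before adding the (much smaller) encoded WebP output buffer.
console.log(rgbaBytes(4000, 3000)); // 48,000,000 bytes = ~48 MB
```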
### Concurrency Model
GetWebP uses async concurrency with `p-limit`, not OS-level threads or worker threads.
```
                 +--> [decode + encode file 1] --+
Main thread -----+--> [decode + encode file 2] --+--> results
(event loop)     +--> [decode + encode file N] --+
```
- Starter/Pro plans default to `os.cpus().length - 1` concurrent tasks (capped at 32). Configurable via `--concurrency`.
- Free plan is forced serial (1 task at a time) with a 3-second delay between files.
Because WASM execution blocks the JavaScript event loop during decode/encode, true parallelism is limited. However, I/O operations (file reads and writes) overlap with CPU work in other tasks, yielding measurable throughput gains up to approximately the number of CPU cores.
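The scheduling pattern can be sketched with a hand-rolled limiter standing in for `p-limit` (illustrative only; GetWebP's internals may differ):

```typescript
// Run async tasks with at most `limit` in flight at once.
// CPU-bound WASM calls still serialize on the event loop, but the
// awaited file I/O overlaps across tasks, which is where the gains come from.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index (safe: single-threaded event loop)
      results[i] = await task(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```

In practice each `task` would read a file, run the WASM decode/encode, and write the output; only the awaited I/O portions actually interleave.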
Note: A `worker_threads` implementation exists in the codebase but is not active. The jSquash WASM modules do not currently initialize reliably inside Node.js worker threads. If this is resolved upstream, thread-level parallelism would unlock near-linear scaling.
## Benchmark Methodology
### Test Environment
Benchmark date: 2026-04-05. Results vary by hardware, OS, disk speed, and image content.
| Parameter | Value |
|---|---|
| CPU | Apple Silicon (arm64) |
| OS | macOS (Darwin 25.3.0) |
| Runtime | Bun-compiled binary |
| Quality | -q 80 unless noted |
| Runs | 5 per test, median reported |
### Test Dataset
| Category | Count | Avg Size | Resolution Range |
|---|---|---|---|
| Small JPEG | 50 | ~200 KB | 800x600 -- 1200x900 |
| Large JPEG | 50 | ~3 MB | 4000x3000 -- 6000x4000 |
| PNG (photos) | 50 | ~5 MB | 3000x2000 -- 4000x3000 |
| PNG (graphics/screenshots) | 50 | ~800 KB | 1920x1080 |
| BMP | 20 | ~10 MB | 3000x2000 |
| WebP (re-encode) | 20 | ~400 KB | 2000x1500 |
### Measurement Method
```bash
# Single file
time getwebp convert photo.jpg -o /tmp/out

# Batch (wall-clock time)
time getwebp convert ./dataset -o /tmp/out --concurrency 4
```

All times are wall-clock. Each test is run 5 times; the median is reported.
## Single-File Benchmarks
Time to convert a single image at default quality (`-q 80`).
### JPEG
| Input Size | Resolution | Time | Output Size | Savings |
|---|---|---|---|---|
| 200 KB | 1200x900 | [benchmark pending] (~0.3s) | [pending] | [pending] (~25--40%) |
| 1 MB | 2400x1800 | [benchmark pending] (~0.6s) | [pending] | [pending] (~30--45%) |
| 3 MB | 4000x3000 | [benchmark pending] (~1.2s) | [pending] | [pending] (~30--50%) |
| 8 MB | 6000x4000 | [benchmark pending] (~2.5s) | [pending] | [pending] (~35--55%) |
### PNG
| Input Size | Resolution | Type | Time | Output Size | Savings |
|---|---|---|---|---|---|
| 300 KB | 1920x1080 | Screenshot | [benchmark pending] (~0.4s) | [pending] | [pending] (~70--85%) |
| 2 MB | 3000x2000 | Photo | [benchmark pending] (~1.0s) | [pending] | [pending] (~60--75%) |
| 5 MB | 4000x3000 | Photo | [benchmark pending] (~2.0s) | [pending] | [pending] (~65--80%) |
| 15 MB | 6000x4000 | Photo (alpha) | [benchmark pending] (~4.0s) | [pending] | [pending] (~55--70%) |
PNG-to-WebP conversions typically yield the highest savings because PNG is lossless while WebP at quality 80 applies lossy compression.
### BMP
| Input Size | Resolution | Time | Output Size | Savings |
|---|---|---|---|---|
| 5 MB | 1920x1080 | [benchmark pending] (~0.5s) | [pending] | [pending] (~95%+) |
| 36 MB | 4000x3000 | [benchmark pending] (~2.5s) | [pending] | [pending] (~97%+) |
BMP files are uncompressed bitmaps. Conversion to WebP yields dramatic file-size reductions. Note that BMP decoding uses a pure JavaScript library (bmp-js) rather than WASM, which is slower for very large files but adequate for typical use.
### WebP (Re-encode)
| Input Size | Resolution | Time | Output Size | Savings |
|---|---|---|---|---|
| 200 KB | 2000x1500 | [benchmark pending] (~0.4s) | [pending] | [pending] (varies) |
Re-encoding WebP-to-WebP is useful for adjusting quality. Savings depend on the quality gap between input and output.
## Batch Throughput
### Concurrency Scaling (Starter/Pro)
50 JPEG files, average 3 MB each, on an 8-core machine:
| `--concurrency` | Wall Time | Throughput (files/sec) | Speedup vs Serial |
|---|---|---|---|
| 1 | [benchmark pending] (~60s) | [pending] (~0.8) | 1.0x |
| 2 | [benchmark pending] (~35s) | [pending] (~1.4) | ~1.7x |
| 4 | [benchmark pending] (~20s) | [pending] (~2.5) | ~3.0x |
| 7 (default on 8-core) | [benchmark pending] (~14s) | [pending] (~3.6) | ~4.3x |
| 8 | [benchmark pending] (~13s) | [pending] (~3.8) | ~4.6x |
| 16 | [benchmark pending] (~12s) | [pending] (~4.0) | ~5.0x |
| 32 | [benchmark pending] (~12s) | [pending] (~4.0) | ~5.0x |
Why scaling is sub-linear: GetWebP uses async concurrency within a single process, not OS threads. WASM codec execution blocks the event loop during each decode/encode call. Concurrency gains come from overlapping I/O (file reads/writes) with CPU work in other tasks. Beyond the CPU core count, additional concurrency adds scheduling overhead without improving throughput.
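A back-of-the-envelope way to see the plateau: treat WASM CPU time as fully serialized on the event loop and I/O as fully overlappable. The 12s/48s CPU-vs-I/O split below is an assumed illustration, not a measured figure, and the model is an idealized upper bound (real runs fall short of it):

```typescript
// Toy model: wall time is bounded below by total CPU time (which
// serializes on the event loop); extra concurrency divides the
// overlappable work until that CPU floor is hit.
function modelWallTime(cpuTotal: number, ioTotal: number, concurrency: number): number {
  const serial = cpuTotal + ioTotal;
  return Math.max(cpuTotal, serial / concurrency);
}

for (const c of [1, 2, 4, 8, 16]) {
  const t = modelWallTime(12, 48, c); // assumed: 12s CPU + 48s I/O for the batch
  console.log(`concurrency=${c}: ~${t}s`);
}
```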
Recommended setting: Leave at default (CPU cores - 1). Setting `--concurrency` higher than your core count provides diminishing returns.
### Large Batch (1,000 files)
Mixed dataset: 600 JPEG + 300 PNG + 100 BMP, various sizes, on an 8-core machine at default concurrency:
| Metric | Value |
|---|---|
| Total files | 1,000 |
| Wall time | [benchmark pending] (~4--6 min) |
| Avg time per file | [benchmark pending] (~0.3s) |
| Total input size | [benchmark pending] (~2 GB) |
| Total output size | [benchmark pending] (~800 MB) |
| Overall savings | [benchmark pending] (~60%) |
| Peak memory | [benchmark pending] (~500 MB) |
## Plan Comparison
Processing 50 JPEG files (avg 3 MB each) on an 8-core machine:
| Metric | Free | Starter / Pro |
|---|---|---|
| File limit | 10 per run | Unlimited |
| Processing mode | Serial + 3s delay | Parallel (7 concurrent) |
| Time for 10 files | [pending] (~42s) | [pending] (~3s) |
| Time for 50 files | N/A (capped at 10) | [pending] (~14s) |
| Effective throughput | ~0.24 files/sec | ~3.6 files/sec |
| Speedup | -- | ~15x |
### Free Plan Timing Breakdown (10 files)
```
File 1:  ~1.2s (convert)
         3.0s (mandatory delay)
File 2:  ~1.2s
         3.0s
...
File 10: ~1.2s
─────────────────────────
Total:   ~12s converting + ~27s delays = ~39s
```
The 3-second inter-file delay on the Free plan is the dominant bottleneck, not the conversion itself. Upgrading to Starter or Pro removes this delay entirely and enables parallel processing.
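The breakdown above follows from simple arithmetic (the ~1.2s per-file figure is an estimate, and the 3s delay applies between files):

```typescript
// Free-plan wall time: n conversions plus a 3s delay between files.
function freePlanSeconds(files: number, perFile = 1.2, delaySec = 3.0): number {
  return files * perFile + Math.max(0, files - 1) * delaySec;
}

console.log(freePlanSeconds(10)); // ~12s converting + ~27s of delays ≈ 39s
```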
## Quality vs File Size vs Speed
All measurements on a single 4000x3000 JPEG (~3 MB):
| Quality (`-q`) | Output Size | Savings | Encode Time | Visual Quality |
|---|---|---|---|---|
| 50 | [pending] (~300 KB) | [pending] (~90%) | [pending] (faster) | Noticeable artifacts |
| 60 | [pending] (~400 KB) | [pending] (~87%) | [pending] | Minor artifacts |
| 75 | [pending] (~600 KB) | [pending] (~80%) | [pending] | Good |
| 80 (default) | [pending] (~700 KB) | [pending] (~77%) | [pending] | Very good |
| 90 | [pending] (~1.2 MB) | [pending] (~60%) | [pending] | Excellent |
| 95 | [pending] (~1.8 MB) | [pending] (~40%) | [pending] | Near-lossless |
| 100 | [pending] (~2.5 MB) | [pending] (~17%) | [pending] (slower) | Visually lossless (quality 100 is still lossy mode) |
Recommendation: Quality 75--80 offers the best balance of file size and visual fidelity for web delivery. Use 90+ for photography portfolios or print-quality assets.
## Comparison with Other Tools
Single-image conversion, `-q 80`, median of 5 runs (2026-04-05, Apple Silicon arm64):
| Image | GetWebP (WASM) | Sharp (libvips) | ImageMagick | Output |
|---|---|---|---|---|
| 320x240 JPEG (40 KB) | 206 ms | 89 ms | 33 ms | 11 KB |
| 640x480 JPEG (138 KB) | 252 ms | 114 ms | 60 ms | 34 KB |
| 800x600 PNG (2.4 MB) | 324 ms | 150 ms | 92 ms | 45 KB |
| 1024x768 JPEG (302 KB) | 360 ms | 161 ms | 110 ms | 70 KB |
| 1024x768 PNG (4.0 MB) | 390 ms | 192 ms | 132 ms | 64 KB |
| 1920x1080 JPEG (768 KB) | 643 ms | 331 ms | 276 ms | 163 KB |
| 2048x1536 PNG (15.6 MB) | 975 ms | 495 ms | 419 ms | 201 KB |
| 4096x3072 JPEG (3.9 MB) | 2736 ms | 1196 ms | 1250 ms | 730 KB |
Key takeaway: GetWebP is ~2x slower on raw encode (WASM vs native), but output sizes are identical — same libwebp codec, same quality. The trade-off is zero dependencies vs. raw speed.
### Analysis
GetWebP vs cwebp: Both use libwebp for encoding, so output quality and file sizes are nearly identical at the same quality setting. The performance gap comes from WASM overhead (~1.3--1.6x slower than native). GetWebP's advantage is zero-dependency cross-platform distribution and batch processing with concurrency.
GetWebP vs sharp: Sharp links directly to native libvips, making it the fastest option. However, it requires Node.js and platform-specific native binary compilation. GetWebP ships as a self-contained Bun-compiled binary.
GetWebP vs Squoosh CLI: Both use the same jSquash/Squoosh WASM codecs. Performance is comparable. Squoosh CLI is deprecated; GetWebP is actively maintained.
GetWebP vs ImageMagick: ImageMagick's WebP encoder is typically less optimized than libwebp. GetWebP produces smaller files at equivalent visual quality.
### When to Choose GetWebP
- CI/CD pipelines: Single binary, no runtime dependencies, JSON output, exit codes for scripting.
- Cross-platform teams: Same binary on macOS (ARM + Intel), Linux, and Windows.
- Batch jobs: Built-in concurrency, recursive scanning, `--skip-existing` for incremental builds.
- No native compilation: Avoids the `node-gyp` / platform-specific addon issues that sharp requires.
### Parallel Throughput Comparison
When concurrency is factored in, the gap narrows. 50 JPEG files on an 8-core machine:
| Tool | Mode | Wall Time | Notes |
|---|---|---|---|
| GetWebP CLI | `--concurrency 7` | [benchmark pending] (~14s) | Built-in batch processing |
| cwebp + xargs | `xargs -P 7` | [benchmark pending] (~10s) | Requires shell scripting |
| sharp (custom script) | Worker pool | [benchmark pending] (~7s) | Requires Node.js + custom code |
GetWebP's built-in concurrency makes it competitive with native tools that require manual parallelization wrappers.
### When Native Tools May Be Better
- Maximum throughput: If processing millions of images and every millisecond counts, native `cwebp` or sharp will be faster per image.
- Advanced transforms: If you also need resizing, cropping, or format conversion beyond WebP, sharp or ImageMagick offer a broader feature set.
## Memory Usage
GetWebP's memory consumption is dominated by raw pixel buffers during decode/encode:
| Factor | Memory Impact |
|---|---|
| WASM module initialization | ~20--40 MB (one-time, 4 codecs) |
| Per-image decode buffer | width * height * 4 bytes (RGBA) |
| Per-image encode buffer | Output WebP size (much smaller) |
| Concurrent images | concurrency * per-image memory |
### Example: 8-core Machine, Default Concurrency (7)
Processing 4000x3000 images:
```
WASM init:              ~30 MB
7 concurrent images:    7 * 48 MB = ~336 MB
Overhead (buffers, GC): ~50 MB
────────────────────────────────
Peak estimate:          ~416 MB
```
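The same estimate as code (the constants mirror the assumed figures above; the helper is illustrative, not part of GetWebP):

```typescript
// Peak memory ≈ one-time WASM init + N concurrent RGBA buffers + slack.
function peakMemoryMB(
  width: number, height: number, concurrency: number,
  wasmInitMB = 30, overheadMB = 50,
): number {
  const perImageMB = (width * height * 4) / 1e6; // raw RGBA bitmap per image
  return wasmInitMB + concurrency * perImageMB + overheadMB;
}

console.log(peakMemoryMB(4000, 3000, 7)); // 30 + 7*48 + 50 = 416 (MB)
```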
### Reducing Memory Usage
- Lower concurrency: `--concurrency 2` reduces peak memory to ~130 MB for the same images.
- Smaller images: Web-resolution images (1920x1080) use ~8 MB per decode buffer, totaling ~90 MB at concurrency 7.
- BMP caution: Large BMP files require an extra full-frame buffer for the BGR-to-RGB channel swap. A 36 MB BMP (4000x3000) allocates ~96 MB during decode.
## Startup Time
| Phase | Duration | Notes |
|---|---|---|
| Binary load | [benchmark pending] (~50ms) | Bun-compiled single binary |
| WASM initialization | [benchmark pending] (~150ms) | Compiles 4 WASM modules from embedded blobs |
| License check | [benchmark pending] (~5ms local, ~500ms network) | JWT validated locally; network call only on first auth or expiry |
| File scanning | [benchmark pending] (~10ms per 1,000 files) | `fs.readdir` with sort |
| Total cold start | ~200--300ms | Subsequent runs reuse OS file cache |
For single-file conversion, startup overhead is a significant fraction of total time. For batch jobs (100+ files), it is negligible.
## Optimization Tips
### For Maximum Throughput
```bash
# Use all cores on a dedicated build machine
getwebp convert ./images -r --concurrency 8

# Skip already-converted files in incremental builds
getwebp convert ./images -r -o ./dist --skip-existing

# Lower quality for faster encoding (quality < 80 is slightly faster)
getwebp convert ./images -q 70
```

### For Minimum Memory
```bash
# Limit concurrency to reduce peak memory
getwebp convert ./images --concurrency 2

# Process directories one at a time instead of recursively
getwebp convert ./images/thumbs -o ./out/thumbs
getwebp convert ./images/photos -o ./out/photos
```

### For CI/CD Pipelines
```bash
# JSON output + exit codes for scripted error handling
getwebp convert ./assets -r -o ./dist --skip-existing --json
echo "Exit code: $?"

# Dry run to validate before converting
getwebp convert ./assets -r --dry-run --json
```

## Running Your Own Benchmarks
To generate numbers for your specific hardware:
```bash
# Prepare a test dataset
mkdir -p /tmp/bench-input /tmp/bench-output
# Copy or generate test images into /tmp/bench-input

# Single-file timing
time getwebp convert /tmp/bench-input/sample.jpg -o /tmp/bench-output

# Batch with default concurrency
time getwebp convert /tmp/bench-input -o /tmp/bench-output

# Batch with JSON output for machine-parseable results
getwebp convert /tmp/bench-input -o /tmp/bench-output --json > results.json

# Concurrency sweep
for c in 1 2 4 8 16; do
  rm -rf /tmp/bench-output/*
  echo "concurrency=$c"
  time getwebp convert /tmp/bench-input -o /tmp/bench-output --concurrency $c
done

# Memory monitoring (macOS)
/usr/bin/time -l getwebp convert /tmp/bench-input -o /tmp/bench-output

# Memory monitoring (Linux)
/usr/bin/time -v getwebp convert /tmp/bench-input -o /tmp/bench-output
```

## Write Safety
GetWebP uses atomic writes to prevent corrupted output files:
- Encode output is written to a temporary file (`<name>.webp.tmp`).
- The temporary file is renamed to the final path (`<name>.webp`) via `fs.rename`.
- On SIGINT (Ctrl+C), any in-progress `.tmp` files are cleaned up automatically.
This means a crash or interruption will never leave a half-written `.webp` file in your output directory. Already-completed files are safe. Use `--skip-existing` on the next run to resume from where you left off.
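The write-then-rename sequence looks roughly like this sketch (generic Node `fs/promises` code, not GetWebP's actual source; `atomicWrite` is an illustrative name):

```typescript
import { writeFile, rename, unlink } from "node:fs/promises";

// Atomically publish an encoded buffer: write to a .tmp sibling first,
// then rename into place. rename() within the same filesystem is atomic,
// so readers never observe a half-written .webp.
async function atomicWrite(finalPath: string, data: Uint8Array): Promise<void> {
  const tmpPath = `${finalPath}.tmp`;
  try {
    await writeFile(tmpPath, data);
    await rename(tmpPath, finalPath);
  } catch (err) {
    await unlink(tmpPath).catch(() => {}); // best-effort cleanup on failure
    throw err;
  }
}
```

Because the rename is atomic, a reader either sees the previous state or the complete new file, never a partial one.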
## Known Limitations
| Limitation | Impact | Workaround |
|---|---|---|
| Single-threaded WASM | Encode/decode blocks event loop; scaling plateaus around core count | Use native `cwebp` for extreme throughput needs |
| No GPU acceleration | WASM codecs are CPU-only | N/A (libwebp is CPU-only by design) |
| BMP decode is pure JS | Slower than WASM for large BMPs | Convert BMP to PNG first if processing many large BMPs |
| Free plan delay | 3s per file, 10 file cap | Upgrade to Starter or Pro |
| Memory scales with concurrency | High concurrency on large images can exhaust RAM | Reduce `--concurrency` for large-resolution batches |