
# Performance Benchmarks

How GetWebP CLI performs under various workloads, and why.


See also: README | Commands | Getting Started


## Architecture Overview

Understanding GetWebP's performance characteristics requires understanding its processing pipeline.

### WASM Codec Pipeline

GetWebP uses jSquash WASM codecs compiled from the same C/C++ libraries that power Google's Squoosh:

| Codec | Origin | Purpose |
|---|---|---|
| @jsquash/jpeg | MozJPEG | JPEG decoding |
| @jsquash/png | Squoosh PNG (Rust via wasm-bindgen) | PNG decoding |
| @jsquash/webp | libwebp | WebP decoding and encoding |
| bmp-js | Pure JavaScript | BMP decoding |

Each image goes through a four-stage pipeline:

```
Read file (I/O) --> Decode to RGBA (WASM) --> Encode to WebP (WASM) --> Write file (I/O)
```

Key implication: The WASM codecs run in the main thread (not native C). This means:

  • Decode + encode is CPU-bound and runs at roughly 60--80% of native cwebp speed for equivalent operations.
  • WASM initialization is a one-time cost. Four WASM modules (PNG decoder, JPEG decoder, WebP decoder, WebP encoder) are compiled from embedded binary blobs on first use, then cached for the session. This adds ~100--300ms to the first invocation.
  • Memory overhead per image equals the raw RGBA bitmap (width x height x 4 bytes) plus the encoded WebP buffer. A 4000x3000 photo requires ~48 MB of transient memory.
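The per-image figure in the last bullet is simple arithmetic; a quick sketch (the helper name is illustrative, not part of GetWebP's API):

```typescript
// Transient memory needed for one decoded image: the raw RGBA bitmap is
// always width * height * 4 bytes, regardless of the input format.
function rgbaBytes(width: number, height: number): number {
  return width * height * 4;
}

// A 4000x3000 photo needs a 48 MB RGBA buffer before encoding even starts.
console.log(rgbaBytes(4000, 3000) / 1e6); // 48
```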

### Concurrency Model

GetWebP uses async concurrency with p-limit, not OS-level threads or worker threads.

```
                  +--> [decode + encode file 1] --+
Main thread ------+--> [decode + encode file 2] --+--> results
  (event loop)    +--> [decode + encode file N] --+
```

  • Starter/Pro plans default to os.cpus().length - 1 concurrent tasks (capped at 32). Configurable via --concurrency.
  • Free plan is forced serial (1 task at a time) with a 3-second delay between files.

Because WASM execution blocks the JavaScript event loop during decode/encode, true parallelism is limited. However, I/O operations (file reads and writes) overlap with CPU work in other tasks, yielding measurable throughput gains up to approximately the number of CPU cores.
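The pattern is easiest to see in a stripped-down sketch. This is a minimal reimplementation of the p-limit idea using only promises, not GetWebP's actual source:

```typescript
// Minimal sketch of the p-limit pattern: at most `limit` tasks are in
// flight at once on the single event loop; the rest wait in a FIFO queue.
// Illustrative only -- the real CLI uses the p-limit package.
function pLimit(limit: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  const next = (): void => {
    active--;
    queue.shift()?.(); // start the oldest queued task, if any
  };
  return function run<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const start = () => {
        active++;
        task().then(resolve, reject).finally(next);
      };
      if (active < limit) start();
      else queue.push(start);
    });
  };
}

// Usage: process files with at most 4 "conversions" in flight at a time.
const limit = pLimit(4);
const files = ["a.jpg", "b.png", "c.bmp"];
const results = await Promise.all(
  files.map((f) => limit(async () => `${f} -> webp`)),
);
```

Because each task's WASM portion still blocks the event loop, this limits in-flight work rather than providing true parallelism.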

Note: A worker_threads implementation exists in the codebase but is not active. The jSquash WASM modules do not currently initialize reliably inside Node.js worker threads. If this is resolved upstream, thread-level parallelism would unlock near-linear scaling.


## Benchmark Methodology

### Test Environment

Benchmark date: 2026-04-05. Results vary by hardware, OS, disk speed, and image content.

| Parameter | Value |
|---|---|
| CPU | Apple Silicon (arm64) |
| OS | macOS (Darwin 25.3.0) |
| Runtime | Bun-compiled binary |
| Quality | -q 80 unless noted |
| Runs | 5 per test, median reported |

### Test Dataset

| Category | Count | Avg Size | Resolution Range |
|---|---|---|---|
| Small JPEG | 50 | ~200 KB | 800x600 -- 1200x900 |
| Large JPEG | 50 | ~3 MB | 4000x3000 -- 6000x4000 |
| PNG (photos) | 50 | ~5 MB | 3000x2000 -- 4000x3000 |
| PNG (graphics/screenshots) | 50 | ~800 KB | 1920x1080 |
| BMP | 20 | ~10 MB | 3000x2000 |
| WebP (re-encode) | 20 | ~400 KB | 2000x1500 |

### Measurement Method

```sh
# Single file
time getwebp convert photo.jpg -o /tmp/out

# Batch (wall-clock time)
time getwebp convert ./dataset -o /tmp/out --concurrency 4
```

All times are wall-clock. Each test is run 5 times; the median is reported.


## Single-File Benchmarks

Time to convert a single image at default quality (-q 80).

### JPEG

| Input Size | Resolution | Time | Output Size | Savings |
|---|---|---|---|---|
| 200 KB | 1200x900 | [benchmark pending] (~0.3s) | [pending] | [pending] (~25--40%) |
| 1 MB | 2400x1800 | [benchmark pending] (~0.6s) | [pending] | [pending] (~30--45%) |
| 3 MB | 4000x3000 | [benchmark pending] (~1.2s) | [pending] | [pending] (~30--50%) |
| 8 MB | 6000x4000 | [benchmark pending] (~2.5s) | [pending] | [pending] (~35--55%) |

### PNG

| Input Size | Resolution | Type | Time | Output Size | Savings |
|---|---|---|---|---|---|
| 300 KB | 1920x1080 | Screenshot | [benchmark pending] (~0.4s) | [pending] | [pending] (~70--85%) |
| 2 MB | 3000x2000 | Photo | [benchmark pending] (~1.0s) | [pending] | [pending] (~60--75%) |
| 5 MB | 4000x3000 | Photo | [benchmark pending] (~2.0s) | [pending] | [pending] (~65--80%) |
| 15 MB | 6000x4000 | Photo (alpha) | [benchmark pending] (~4.0s) | [pending] | [pending] (~55--70%) |

PNG-to-WebP conversions typically yield the highest savings because PNG is lossless while WebP at quality 80 applies lossy compression.

### BMP

| Input Size | Resolution | Time | Output Size | Savings |
|---|---|---|---|---|
| 5 MB | 1920x1080 | [benchmark pending] (~0.5s) | [pending] | [pending] (~95%+) |
| 36 MB | 4000x3000 | [benchmark pending] (~2.5s) | [pending] | [pending] (~97%+) |

BMP files are uncompressed bitmaps. Conversion to WebP yields dramatic file-size reductions. Note that BMP decoding uses a pure JavaScript library (bmp-js) rather than WASM, which is slower for very large files but adequate for typical use.

### WebP (Re-encode)

| Input Size | Resolution | Time | Output Size | Savings |
|---|---|---|---|---|
| 200 KB | 2000x1500 | [benchmark pending] (~0.4s) | [pending] | [pending] (varies) |

Re-encoding WebP-to-WebP is useful for adjusting quality. Savings depend on the quality gap between input and output.


## Batch Throughput

### Concurrency Scaling (Starter/Pro)

50 JPEG files, average 3 MB each, on an 8-core machine:

| --concurrency | Wall Time | Throughput (files/sec) | Speedup vs Serial |
|---|---|---|---|
| 1 | [benchmark pending] (~60s) | [pending] (~0.8) | 1.0x |
| 2 | [benchmark pending] (~35s) | [pending] (~1.4) | ~1.7x |
| 4 | [benchmark pending] (~20s) | [pending] (~2.5) | ~3.0x |
| 7 (default on 8-core) | [benchmark pending] (~14s) | [pending] (~3.6) | ~4.3x |
| 8 | [benchmark pending] (~13s) | [pending] (~3.8) | ~4.6x |
| 16 | [benchmark pending] (~12s) | [pending] (~4.0) | ~5.0x |
| 32 | [benchmark pending] (~12s) | [pending] (~4.0) | ~5.0x |

Why scaling is sub-linear: GetWebP uses async concurrency within a single process, not OS threads. WASM codec execution blocks the event loop during each decode/encode call. Concurrency gains come from overlapping I/O (file reads/writes) with CPU work in other tasks. Beyond the CPU core count, additional concurrency adds scheduling overhead without improving throughput.
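One way to see the plateau is an Amdahl-style back-of-envelope model. The 20% serial fraction below is a made-up illustrative number, not a measurement:

```typescript
// Toy model: treat the non-overlappable share of each file's wall time as
// a serial fraction and the overlappable I/O share as divisible across
// concurrent tasks. Amdahl's law then caps speedup at 1 / serialFraction
// no matter how high --concurrency goes.
function estimatedSpeedup(serialFraction: number, concurrency: number): number {
  return 1 / (serialFraction + (1 - serialFraction) / concurrency);
}

// With an assumed 20% serial fraction, the curve flattens near 5x:
for (const c of [1, 2, 4, 8, 16, 32]) {
  console.log(c, estimatedSpeedup(0.2, c).toFixed(2));
}
```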

Recommended setting: Leave at default (CPU cores - 1). Setting --concurrency higher than your core count provides diminishing returns.

### Large Batch (1,000 files)

Mixed dataset: 600 JPEG + 300 PNG + 100 BMP, various sizes, on an 8-core machine at default concurrency:

| Metric | Value |
|---|---|
| Total files | 1,000 |
| Wall time | [benchmark pending] (~4--6 min) |
| Avg time per file | [benchmark pending] (~0.3s) |
| Total input size | [benchmark pending] (~2 GB) |
| Total output size | [benchmark pending] (~800 MB) |
| Overall savings | [benchmark pending] (~60%) |
| Peak memory | [benchmark pending] (~500 MB) |

### Plan Comparison

Processing 50 JPEG files (avg 3 MB each) on an 8-core machine:

| Metric | Free | Starter / Pro |
|---|---|---|
| File limit | 10 per run | Unlimited |
| Processing mode | Serial + 3s delay | Parallel (7 concurrent) |
| Time for 10 files | [pending] (~39s) | [pending] (~3s) |
| Time for 50 files | N/A (capped at 10) | [pending] (~14s) |
| Effective throughput | ~0.26 files/sec | ~3.6 files/sec |
| Speedup | -- | ~14x |

### Free Plan Timing Breakdown (10 files)

```
File  1: ~1.2s  (convert)
         3.0s   (mandatory delay)
File  2: ~1.2s
         3.0s
...
File 10: ~1.2s
─────────────────────────
Total:   ~12s converting + ~27s delays = ~39s
```

The 3-second inter-file delay on the Free plan is the dominant bottleneck, not the conversion itself. Upgrading to Starter or Pro removes this delay entirely and enables parallel processing.
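The Free-plan wall time follows a simple formula: n conversions plus (n - 1) mandatory inter-file delays. The helper name is illustrative:

```typescript
// Serial wall time on the Free plan: every file is converted, and a fixed
// delay is inserted between consecutive files (so n files incur n - 1 delays).
function freePlanSeconds(files: number, convertSec: number, delaySec: number): number {
  return files * convertSec + (files - 1) * delaySec;
}

// 10 files at ~1.2s each with the 3-second delay:
console.log(freePlanSeconds(10, 1.2, 3)); // 39
```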


## Quality vs File Size vs Speed

All measurements on a single 4000x3000 JPEG (~3 MB):

| Quality (-q) | Output Size | Savings | Encode Time | Visual Quality |
|---|---|---|---|---|
| 50 | [pending] (~300 KB) | [pending] (~90%) | [pending] (faster) | Noticeable artifacts |
| 60 | [pending] (~400 KB) | [pending] (~87%) | [pending] | Minor artifacts |
| 75 | [pending] (~600 KB) | [pending] (~80%) | [pending] | Good |
| 80 (default) | [pending] (~700 KB) | [pending] (~77%) | [pending] | Very good |
| 90 | [pending] (~1.2 MB) | [pending] (~60%) | [pending] | Excellent |
| 95 | [pending] (~1.8 MB) | [pending] (~40%) | [pending] | Near-lossless |
| 100 | [pending] (~2.5 MB) | [pending] (~17%) | [pending] (slower) | Visually lossless (still lossy encoding) |

Recommendation: Quality 75--80 offers the best balance of file size and visual fidelity for web delivery. Use 90+ for photography portfolios or print-quality assets.


## Comparison with Other Tools

Single-image conversion, -q 80, median of 5 runs (2026-04-05, Apple Silicon arm64):

| Image | GetWebP (WASM) | Sharp (libvips) | ImageMagick | Output Size |
|---|---|---|---|---|
| 320x240 JPEG (40 KB) | 206 ms | 89 ms | 33 ms | 11 KB |
| 640x480 JPEG (138 KB) | 252 ms | 114 ms | 60 ms | 34 KB |
| 800x600 PNG (2.4 MB) | 324 ms | 150 ms | 92 ms | 45 KB |
| 1024x768 JPEG (302 KB) | 360 ms | 161 ms | 110 ms | 70 KB |
| 1024x768 PNG (4.0 MB) | 390 ms | 192 ms | 132 ms | 64 KB |
| 1920x1080 JPEG (768 KB) | 643 ms | 331 ms | 276 ms | 163 KB |
| 2048x1536 PNG (15.6 MB) | 975 ms | 495 ms | 419 ms | 201 KB |
| 4096x3072 JPEG (3.9 MB) | 2736 ms | 1196 ms | 1250 ms | 730 KB |

Key takeaway: GetWebP is roughly 2x slower on raw encode (WASM vs. native), but output sizes are identical because both use the same libwebp codec at the same quality setting. The trade-off is zero dependencies vs. raw speed.

### Analysis

GetWebP vs cwebp: Both use libwebp for encoding, so output quality and file sizes are nearly identical at the same quality setting. The performance gap comes from WASM overhead (~1.3--1.6x slower than native). GetWebP's advantage is zero-dependency cross-platform distribution and batch processing with concurrency.

GetWebP vs sharp: Sharp links directly to native libvips, making it the fastest option. However, it requires Node.js and platform-specific native binary compilation. GetWebP ships as a self-contained Bun-compiled binary.

GetWebP vs Squoosh CLI: Both use the same jSquash/Squoosh WASM codecs. Performance is comparable. Squoosh CLI is deprecated; GetWebP is actively maintained.

GetWebP vs ImageMagick: ImageMagick's WebP encoder is typically less optimized than libwebp. GetWebP produces smaller files at equivalent visual quality.

### When to Choose GetWebP

  • CI/CD pipelines: Single binary, no runtime dependencies, JSON output, exit codes for scripting.
  • Cross-platform teams: Same binary on macOS (ARM + Intel), Linux, and Windows.
  • Batch jobs: Built-in concurrency, recursive scanning, --skip-existing for incremental builds.
  • No native compilation: Avoids the node-gyp / platform-specific addon issues that sharp requires.

### Parallel Throughput Comparison

When concurrency is factored in, the gap narrows. 50 JPEG files on an 8-core machine:

| Tool | Mode | Wall Time | Notes |
|---|---|---|---|
| GetWebP CLI | --concurrency 7 | [benchmark pending] (~14s) | Built-in batch processing |
| cwebp + xargs | xargs -P 7 | [benchmark pending] (~10s) | Requires shell scripting |
| sharp (custom script) | Worker pool | [benchmark pending] (~7s) | Requires Node.js + custom code |

GetWebP's built-in concurrency makes it competitive with native tools that require manual parallelization wrappers.

### When Native Tools May Be Better

  • Maximum throughput: If processing millions of images and every millisecond counts, native cwebp or sharp will be faster per image.
  • Advanced transforms: If you also need resizing, cropping, or format conversion beyond WebP, sharp or ImageMagick offer a broader feature set.

## Memory Usage

GetWebP's memory consumption is dominated by raw pixel buffers during decode/encode:

| Factor | Memory Impact |
|---|---|
| WASM module initialization | ~20--40 MB (one-time, 4 codecs) |
| Per-image decode buffer | width * height * 4 bytes (RGBA) |
| Per-image encode buffer | Output WebP size (much smaller) |
| Concurrent images | concurrency * per-image memory |

### Example: 8-core Machine, Default Concurrency (7)

Processing 4000x3000 images:

```
WASM init:               ~30 MB
7 concurrent images:     7 * 48 MB = ~336 MB
Overhead (buffers, GC):  ~50 MB
─────────────────────────────────
Peak estimate:           ~416 MB
```

### Reducing Memory Usage

  • Lower concurrency: --concurrency 2 reduces peak memory to ~130 MB for the same images.
  • Smaller images: Web-resolution images (1920x1080) use ~8 MB per decode buffer, totaling ~90 MB at concurrency 7.
  • BMP caution: Large BMP files require an extra full-frame buffer for the BGR-to-RGB channel swap. A 36 MB BMP (4000x3000) allocates ~96 MB during decode.
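The BMP caution comes down to pixel-order conversion. A sketch of why a second full-frame buffer appears (illustrative of the simplest approach, not bmp-js's actual internals):

```typescript
// BMP decoders typically produce pixels in BGRA order, while the WebP
// encoder expects RGBA. The simplest conversion allocates a second
// full-frame buffer and swaps the red and blue channels.
function bgraToRgba(src: Uint8Array): Uint8Array {
  const dst = new Uint8Array(src.length); // second full-frame allocation
  for (let i = 0; i < src.length; i += 4) {
    dst[i] = src[i + 2];     // R <- src's B slot holds... actually src[i+2] is R
    dst[i + 1] = src[i + 1]; // G unchanged
    dst[i + 2] = src[i];     // B <- src[i] (B in BGRA) moves to slot 2
    dst[i + 3] = src[i + 3]; // alpha unchanged
  }
  return dst;
}

// A pure-blue BGRA pixel (B=255, G=0, R=0, A=255) becomes RGBA [0, 0, 255, 255].
console.log(bgraToRgba(new Uint8Array([255, 0, 0, 255])));
```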

## Startup Time

| Phase | Duration | Notes |
|---|---|---|
| Binary load | [benchmark pending] (~50ms) | Bun-compiled single binary |
| WASM initialization | [benchmark pending] (~150ms) | Compiles 4 WASM modules from embedded blobs |
| License check | [benchmark pending] (~5ms local, ~500ms network) | JWT validated locally; network call only on first auth or expiry |
| File scanning | [benchmark pending] (~10ms per 1,000 files) | fs.readdir with sort |
| Total cold start | ~200--300ms | Subsequent runs reuse OS file cache |

For single-file conversion, startup overhead is a significant fraction of total time. For batch jobs (100+ files), it is negligible.


## Optimization Tips

### For Maximum Throughput

```sh
# Use all cores on a dedicated build machine
getwebp convert ./images -r --concurrency 8

# Skip already-converted files in incremental builds
getwebp convert ./images -r -o ./dist --skip-existing

# Lower quality for faster encoding (quality < 80 is slightly faster)
getwebp convert ./images -q 70
```

### For Minimum Memory

```sh
# Limit concurrency to reduce peak memory
getwebp convert ./images --concurrency 2

# Process directories one at a time instead of recursively
getwebp convert ./images/thumbs -o ./out/thumbs
getwebp convert ./images/photos -o ./out/photos
```

### For CI/CD Pipelines

```sh
# JSON output + exit codes for scripted error handling
getwebp convert ./assets -r -o ./dist --skip-existing --json
echo "Exit code: $?"

# Dry run to validate before converting
getwebp convert ./assets -r --dry-run --json
```

### Running Your Own Benchmarks

To generate numbers for your specific hardware:

```sh
# Prepare a test dataset
mkdir -p /tmp/bench-input /tmp/bench-output
# Copy or generate test images into /tmp/bench-input

# Single-file timing
time getwebp convert /tmp/bench-input/sample.jpg -o /tmp/bench-output

# Batch with default concurrency
time getwebp convert /tmp/bench-input -o /tmp/bench-output

# Batch with JSON output for machine-parseable results
getwebp convert /tmp/bench-input -o /tmp/bench-output --json > results.json

# Concurrency sweep
for c in 1 2 4 8 16; do
  rm -rf /tmp/bench-output/*
  echo "concurrency=$c"
  time getwebp convert /tmp/bench-input -o /tmp/bench-output --concurrency $c
done

# Memory monitoring (macOS)
/usr/bin/time -l getwebp convert /tmp/bench-input -o /tmp/bench-output

# Memory monitoring (Linux)
/usr/bin/time -v getwebp convert /tmp/bench-input -o /tmp/bench-output
```

## Write Safety

GetWebP uses atomic writes to prevent corrupted output files:

  1. Encode output is written to a temporary file (<name>.webp.tmp).
  2. The temporary file is renamed to the final path (<name>.webp) via fs.rename.
  3. On SIGINT (Ctrl+C), any in-progress .tmp files are cleaned up automatically.

This means a crash or interruption will never leave a half-written .webp file in your output directory. Already-completed files are safe. Use --skip-existing on the next run to resume from where you left off.
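The temp-file-then-rename steps can be sketched with node:fs primitives. The function name and error handling are illustrative, not GetWebP's actual implementation:

```typescript
import { promises as fs } from "node:fs";

// Atomic write sketch: bytes land in a .tmp sibling first, and the final
// path only ever appears (via rename) once the data is fully on disk.
async function atomicWrite(finalPath: string, data: Uint8Array): Promise<void> {
  const tmpPath = `${finalPath}.tmp`;
  try {
    await fs.writeFile(tmpPath, data);   // partial writes can only hit the .tmp file
    await fs.rename(tmpPath, finalPath); // atomic on the same filesystem
  } catch (err) {
    await fs.rm(tmpPath, { force: true }); // never leave a stray .tmp behind
    throw err;
  }
}
```

Note that fs.rename is only atomic when the temporary and final paths are on the same filesystem; a cross-device rename fails with EXDEV, which is why the temp file lives next to the output rather than in a system temp directory.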


## Known Limitations

| Limitation | Impact | Workaround |
|---|---|---|
| Single-threaded WASM | Encode/decode blocks event loop; scaling plateaus around core count | Use native cwebp for extreme throughput needs |
| No GPU acceleration | WASM codecs are CPU-only | N/A (libwebp is CPU-only by design) |
| BMP decode is pure JS | Slower than WASM for large BMPs | Convert BMP to PNG first if processing many large BMPs |
| Free plan delay | 3s per file, 10 file cap | Upgrade to Starter or Pro |
| Memory scales with concurrency | High concurrency on large images can exhaust RAM | Reduce --concurrency for large-resolution batches |