Root cause
When a Kubernetes pod completes a GPU training run with tqdm progress bars, the pod log contains multi-byte UTF-8 block glyphs (█▉▊▋▌▍▎▏, each 3 bytes: E2 96 8x). Two distinct mechanisms produce torn byte sequences in the raw log bytes that the Kubernetes API returns:
-
Concurrent writer interleaving — 64 worker processes share a single stderr fd. Python's TextIOWrapper can split one logical write() call across multiple write() syscalls. When two processes' writes interleave at byte granularity, the receiver sees an orphaned continuation byte (invalid continuation byte).
-
Response truncation at a chunk boundary — the K8s API response is split at a size boundary (~5.1 MB observed), cutting a 3-byte glyph after its first byte (unexpected end of data at 0xe2).
The Kubernetes Python client reads the response with r.data.decode('utf8') (strict, no error handling). Both variants throw UnicodeDecodeError before the run has any real failure. The exception bubbles through the _retry wrapper in internal_process_one_running_execution and, after 5 retries, marks the execution SYSTEM_ERROR — causing downstream Upload steps to be skipped on an otherwise successful training run.
This is responsible for 3,960 Bugsnag occurrences in the past 7 days.
Neither the Linux kernel nor the Kubernetes client is at fault. The kernel's PIPE_BUF atomic write guarantee applies only to writes ≤ 4096 bytes, and Python's TextIOWrapper may issue multiple syscalls for one logical write. The corruption is baked into the bytes before the K8s client sees them, so the client cannot fix it in a lossless way.
Solutions compared
Three approaches were kept as separate PRs for team comparison:
PR #277 — Defensive decode (_preload_content=False + errors="replace")
Bypass the K8s client's strict decode by using _preload_content=False, then decode ourselves with errors="replace" so U+FFFD is substituted for any malformed byte sequence. A _logger.warning fires when substitution occurs.
- Pro: handles both error classes; logs remain complete (only the torn glyph is replaced)
- Con: lossy — the torn bytes (partial tqdm glyph) are replaced rather than recovered
PR #278 — Suppress tqdm output at source
Inject TQDM_DISABLE=1 (and related env vars) into the pod's environment so block glyphs are never written to stderr in the first place.
- Pro: eliminates the root cause entirely for tqdm-based progress bars
- Con: only covers tqdm; any other multi-byte output from training code or libraries would still tear
PR #279 / #280 — Swallow log-acquisition errors + Observe search link
Catch all exceptions from log reading, return an empty string / None, and embed a pre-built Observe log search URL in the placeholder message so the user can still find the pod logs.
- Pro: execution is never incorrectly marked
SYSTEM_ERROR; user has a direct link to raw logs
- Con: the execution log stored in GCS is lost; relies on Observe retention
Fix landed
PR #277 — _preload_content=False + errors="replace" was selected. It addresses both UnicodeDecodeError variants, keeps the execution log intact (minus the torn glyph bytes), and does not require changes to the training environment.
Fixes #277
Root cause
When a Kubernetes pod completes a GPU training run with tqdm progress bars, the pod log contains multi-byte UTF-8 block glyphs (
█▉▊▋▌▍▎▏, each 3 bytes:E2 96 8x). Two distinct mechanisms produce torn byte sequences in the raw log bytes that the Kubernetes API returns:Concurrent writer interleaving — 64 worker processes share a single stderr fd. Python's
TextIOWrappercan split one logicalwrite()call across multiplewrite()syscalls. When two processes' writes interleave at byte granularity, the receiver sees an orphaned continuation byte (invalid continuation byte).Response truncation at a chunk boundary — the K8s API response is split at a size boundary (~5.1 MB observed), cutting a 3-byte glyph after its first byte (
unexpected end of dataat0xe2).The Kubernetes Python client reads the response with
r.data.decode('utf8')(strict, no error handling). Both variants throwUnicodeDecodeErrorbefore the run has any real failure. The exception bubbles through the_retrywrapper ininternal_process_one_running_executionand, after 5 retries, marks the executionSYSTEM_ERROR— causing downstream Upload steps to be skipped on an otherwise successful training run.This is responsible for 3,960 Bugsnag occurrences in the past 7 days.
Neither the Linux kernel nor the Kubernetes client is at fault. The kernel's
PIPE_BUFatomic write guarantee applies only to writes ≤ 4096 bytes, and Python'sTextIOWrappermay issue multiple syscalls for one logical write. The corruption is baked into the bytes before the K8s client sees them, so the client cannot fix it in a lossless way.Solutions compared
Three approaches were kept as separate PRs for team comparison:
PR #277 — Defensive decode (
_preload_content=False+errors="replace")Bypass the K8s client's strict decode by using
_preload_content=False, then decode ourselves witherrors="replace"soU+FFFDis substituted for any malformed byte sequence. A_logger.warningfires when substitution occurs.PR #278 — Suppress tqdm output at source
Inject
TQDM_DISABLE=1(and related env vars) into the pod's environment so block glyphs are never written to stderr in the first place.PR #279 / #280 — Swallow log-acquisition errors + Observe search link
Catch all exceptions from log reading, return an empty string /
None, and embed a pre-built Observe log search URL in the placeholder message so the user can still find the pod logs.SYSTEM_ERROR; user has a direct link to raw logsFix landed
PR #277 —
_preload_content=False+errors="replace"was selected. It addresses both UnicodeDecodeError variants, keeps the execution log intact (minus the torn glyph bytes), and does not require changes to the training environment.Fixes #277