Debugging phantom symbol insertions in a 4-FSK audio bootloader on STM32. Root cause analysis across 9 sessions, 15+ eliminated hypotheses and a 3-line fix. A deep dive into Goertzel filters, clock drift, DMA timing and the kind of bugs that only appear at scale.
Hunting down a phantom
The setup
KARON is an A/B bootloader for STM32F030RC that receives firmware updates through audio. A bootloader is a small program that runs before the main firmware. Its job is to decide whether to start the existing firmware or accept a new one. KARON does this over audio: encode firmware as a 4-FSK (Frequency-Shift Keying -- a method of encoding data as different audio tones) modulated WAV file, play it into the module's Analog-to-Digital Converter (ADC), demodulate on-chip, write the new image to flash memory. No UART (Universal Asynchronous Receiver-Transmitter -- a serial connection), no USB, no programmer. Just audio through a 3.5 mm cable.
The system runs on a Cortex-M0 with 256 KB flash and 32 KB RAM. No Floating-Point Unit (FPU) -- every calculation uses integer math. No hardware Vector Table Offset Register (VTOR) -- the chip cannot relocate its interrupt table, so KARON copies it into RAM manually. No external crystal -- the internal oscillator (HSI) runs at 8 MHz +/-1%, multiplied via a Phase-Locked Loop (PLL) to 48 MHz. That +/-1% tolerance means the chip's idea of "48,000 samples per second" can differ from the PC's actual sample rate by several hundred samples per second. This mismatch turns out to be the entire story.
Four tones (2400, 3200, 4000, 4800 Hz) carry two bits each. At 200 symbols per second the raw data rate is 400 bits per second -- slow enough that a full firmware image takes a few minutes. Each packet is protected by a Cyclic Redundancy Check (CRC16 -- a checksum that detects corrupted data). The WAV file starts with a preamble, followed by a calibration tone, then the actual data packets separated by short gaps. For the full protocol, see the KARON lab notes.
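The dibit-to-tone mapping above can be sketched in a few lines of Python. This is an illustrative model only -- the MSB-first dibit order and all names here are assumptions, not KARON's actual modulator:

```python
import math

TONES_HZ = [2400, 3200, 4000, 4800]   # f0..f3, one tone per dibit value
FS = 48000                            # sample rate
SYMBOL_SAMPLES = 240                  # 48000 / 200 symbols per second

def modulate(data: bytes) -> list:
    """Encode each byte as four dibits, each dibit as one 5 ms tone burst."""
    samples = []
    for byte in data:
        for shift in (6, 4, 2, 0):               # MSB-first dibits (assumed order)
            freq = TONES_HZ[(byte >> shift) & 0b11]
            for n in range(SYMBOL_SAMPLES):
                samples.append(math.sin(2 * math.pi * freq * n / FS))
    return samples

wave = modulate(b"\xE4")   # dibits 3,2,1,0 -> one burst each of f3, f2, f1, f0
```

One byte becomes four symbol bursts of 240 samples each; at 200 symbols per second that is exactly the 400 bits per second quoted above.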
Everything worked -- until it did not.
The symptom: sporadic CRC failures
During testing, the bootloader would frequently fail to receive all packets. CRC errors appeared seemingly at random. Sometimes an update would succeed, sometimes not. Even short test images (a few hundred bytes, producing WAV files of roughly 30 seconds) had problems. Failure rate was unpredictable -- the same WAV file could succeed on one attempt and fail on the next.
Think of it like a fax machine that sometimes garbles page 3 of a 5-page document, but works fine if you send it again. Same document, same machine, different result.
Initial assumption was noise. Various band-aids were applied: packet repetitions (send each DATA packet twice so the receiver can pick whichever copy arrives intact), inter-packet gaps (10 silence symbols between packets), data whitening (an LFSR scrambler -- a Linear Feedback Shift Register that XORs the data with a pseudo-random sequence to break up repetitive byte patterns) and a confidence filter in the tone detector. None solved it. The CRC failures persisted.
sync[9]->sync[10] = 3120 dibits = 3 x 1040 -- three entire packets consumed by one phantom.
The fix -- 10 silence symbols between packets (50 ms) -- absorbs the over-read so the next sync word stays intact.
Phase 1: confidence gating -- the fix that made everything worse
KARON detects tones using a Goertzel filter -- an algorithm that measures how much energy a signal contains at one specific frequency. Run it four times (once per tone) and pick the strongest. The result is one of four symbols, each carrying two bits.
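The detector is the textbook Goertzel recurrence. A minimal floating-point Python sketch (the firmware's version is integer-only, but the structure is the same):

```python
import math

def goertzel_power(samples, freq, fs):
    """Signal energy at one target frequency via the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / fs)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1      # the two-tap state update
    return s1 * s1 + s2 * s2 - coeff * s1 * s2  # power at the target bin

FS, N = 48000, 240
tone = [math.sin(2 * math.pi * 3200 * n / FS) for n in range(N)]
powers = {f: goertzel_power(tone, f, FS) for f in (2400, 3200, 4000, 4800)}
best = max(powers, key=powers.get)   # argmax over the four bins -> 3200
```

Run four times per analysis window, take the argmax, emit two bits.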
The first fix seemed obvious: if the Goertzel cannot clearly distinguish which tone is
present, reject the symbol as noise rather than guessing. After computing all four bins,
compare the best power to the second-best. If the ratio is less than 2x, return
NO_TONE.
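As a sketch, the gating logic amounts to the following (names are illustrative; the firmware returns a dedicated NO_TONE code rather than None):

```python
NO_TONE = None  # sentinel for "reject as noise"

def classify(powers, ratio=2.0):
    """Pick the strongest of the four Goertzel bins, or reject as NO_TONE
    when the best bin is not at least `ratio` times the runner-up."""
    ranked = sorted(range(len(powers)), key=powers.__getitem__, reverse=True)
    best, second = ranked[0], ranked[1]
    if powers[best] < ratio * powers[second]:
        return NO_TONE          # ambiguous: two bins carry similar energy
    return best

# classify([100, 40, 5, 5]) -> 0 ; classify([100, 60, 5, 5]) -> NO_TONE
```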
Result with GOERTZEL_CONFIDENCE_RATIO=2: zero DATA packets received.
Complete failure.
The post-mortem buffer told the story. In the DATA region, 15-70% of symbols were being
rejected as NO_TONE. A single DATA packet requires 520 consecutive valid
symbols to assemble correctly. The probability of getting 520 in a row at a 20% rejection
rate: 0.80^520 ~ 10^-50 -- a decimal point followed by 49 zeros before the first
nonzero digit. Statistically impossible.
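The arithmetic behind that number is a one-liner:

```python
# Chance that 520 consecutive symbols all survive a 20% per-symbol rejection rate
p_packet = 0.80 ** 520   # ~4e-51 -- the gate can essentially never pass a full packet
```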
The symbols that looked "uncertain" were at tone boundaries where the analysis window straddled two different tones. The Goertzel was doing its job -- reporting that two frequencies had similar energy -- because the window genuinely contained both. Rejecting them did not help. It destroyed everything. Confidence gating was permanently shelved.
Phase 2: the debugging tool was the problem
SEGGER J-Link RTT (Real-Time Transfer) seemed like the right tool -- it prints debug messages over the chip's debug wire (Serial Wire Debug, or SWD) without needing a serial port. Minimal overhead, supposedly. Except RTT streaming requires constant traffic on the SWD bus, which shares the chip's internal data highway (the AHB -- Advanced High-performance Bus) with the Direct Memory Access (DMA) controller. DMA is the hardware that shuttles audio samples from the ADC into RAM without bothering the CPU. On the STM32F030, the SWD debug interface has hardwired priority over all other bus masters including DMA.
Imagine a two-lane road where the debug probe has a permanent blue-light siren. Every time it wants to read or write, DMA has to pull over and wait. The audio samples still arrive at the ADC -- they just do not make it into RAM on time.
RTT captures showed a 20.9% noise rate -- but the noise organized itself into 15 distinct
bursts of roughly 10 consecutive NO_TONE symbols, perfectly correlated with
2048-byte RTT buffer fill events. Between bursts: 0.46% noise. The DMA priority register
was irrelevant -- DMA_CCR priority bits only arbitrate between DMA channels,
not between DMA and the debug port. SWD always wins.
Solution: a RAM-based post-mortem debug buffer (roughly 16.5 KB, enabled via a
compile-time flag POST_MORTEM_DEBUG). Record everything during reception --
symbols, Goertzel powers, diagnostic events -- then read it all out via SWD
once after the transfer completes. Zero bus traffic during audio. The host
finds the buffer by scanning RAM for a magic number (0xDB600001), the same technique
J-Link uses to locate its own RTT control block.
~ 16.5 KB total -- half of the 32 KB RAM. All hooks compile to no-ops without POST_MORTEM_DEBUG.
A Python script does the heavy lifting: symbol histograms, inter-sync interval analysis, power ratio plots, automated WAV comparison.
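The magic-number scan the host performs can be sketched like this (illustrative, not the actual tooling; assumes a raw little-endian RAM dump and 4-byte alignment):

```python
import struct

MAGIC = 0xDB600001   # post-mortem buffer signature

def find_buffer(ram: bytes) -> int:
    """Offset of the magic word in a raw RAM dump (4-byte aligned scan),
    or -1 if absent. Cortex-M0 is little-endian, hence '<I'."""
    needle = struct.pack("<I", MAGIC)
    for off in range(0, len(ram) - 3, 4):
        if ram[off:off + 4] == needle:
            return off
    return -1

# Toy dump: 256 zero bytes with the control block planted at offset 128.
ram = bytearray(256)
ram[128:132] = struct.pack("<I", MAGIC)
offset = find_buffer(bytes(ram))   # -> 128
```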
Phase 3: the 1050/1051 pattern
With clean post-mortem data, a clear pattern emerged. The receiver counts how many symbols arrive between two consecutive sync words (a known bit pattern that marks the start of each packet). That count was either exactly 1050 or exactly 1051:
| Interval | Symbols | CRC | Note |
|---|---|---|---|
| sync[8] -> [9] | 1050 | CRC_OK | |
| sync[9] -> [10] | 1051 | CRC_FAIL | <- phantom |
| sync[10] -> [11] | 1050 | CRC_OK | |
| sync[11] -> [12] | 1051 | CRC_FAIL | <- phantom |
1050 is correct. Here's the breakdown:
8 (sync) + 16 (header) + 1008 (252-byte payload, 4 symbols per byte) + 8 (CRC16) + 10 (gap) = 1050 symbols per DATA interval (sync to sync).
Every +1 phantom meant one extra symbol inserted somewhere in the stream. Each symbol carries two bits. The receiver assembles four consecutive symbols (8 bits) into one byte. If an extra symbol appears in the middle of a packet, every byte after that point reads two bits from the correct byte plus two bits from the next one -- like reading a book where someone inserted a single extra letter on page 5. Every word after that point is shifted and unreadable. Even a single phantom anywhere in 1050 symbols guarantees a CRC failure.
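A toy model makes the corruption mechanism concrete (MSB-first dibit packing is an assumption here, but any fixed order shows the same effect):

```python
def assemble(dibits):
    """Pack dibits into bytes, four per byte, MSB-first (illustrative order)."""
    out = []
    for i in range(0, len(dibits) - len(dibits) % 4, 4):
        b = 0
        for d in dibits[i:i + 4]:
            b = (b << 2) | d
        out.append(b)
    return out

clean = [3, 0, 1, 2] * 6               # 24 dibits -> six identical 0xC6 bytes
phantom = clean[:9] + [0] + clean[9:]  # one phantom dibit inserted mid-stream
good, bad = assemble(clean), assemble(phantom)
# Bytes before the insertion point still match; every byte after it is shifted.
```

One inserted dibit, and every downstream byte straddles two source bytes -- exactly the guaranteed-CRC-failure scenario described above.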
The phantom was not noise. The symbol buffer showed noise=0 for every DATA
interval -- every demodulated symbol was a valid tone (f0-f3). Something was inserting a
real-looking symbol that did not exist in the WAV file.
The alternating pattern
With dual packet repetitions, the pattern was perfectly alternating:
| Copy | Seq | Symbols | CRC | Note |
|---|---|---|---|---|
| copy 1 | seq=0 | 1050 | CRC_OK | |
| copy 2 | seq=0 | 1051 | CRC_FAIL | <- phantom |
| copy 1 | seq=1 | 1050 | CRC_OK | |
| copy 2 | seq=1 | 1051 | CRC_FAIL | <- phantom |
Every copy 1 had 1050 symbols. Every copy 2 had 1051. Over 14 consecutive packets, the probability of this being random at a 1/3900 phantom rate: approximately 10^-45. Not random.
Phase 4: systematic elimination
Flash-write timing
The alternating pattern suggested timing. Each DATA packet is sent twice. When copy 1 arrives with a valid CRC, the firmware processes it and queues a flash write. When copy 2 arrives (CRC failed, phantom inside), nothing happens. Flash programming stalls the CPU -- during a write, the CPU cannot fetch its own instructions from flash because the flash controller is busy. DMA keeps filling the audio buffer in the background, so when the CPU resumes, it finds roughly one extra symbol's worth of unprocessed samples waiting.
Test: disabled flash writing entirely. Result: much worse -- only 1/14 packets received instead of 12/14. With flash writing disabled, a safety guard rejects every subsequent DATA packet after the first because the previous data was never written out. But even the packets before the guard kicked in still had phantoms. Flash writing was not the cause.
DMA race conditions
Four dedicated diagnostic counters checked hardware registers on every audio processing call. Each counter targets a specific failure mode: did the ADC produce samples faster than DMA could transfer them (overrun)? Did DMA encounter a bus error (transfer error)? Did the software accidentally process the same buffer region twice (half-buffer guard)? How many unprocessed samples piled up at peak load?
Zero hardware anomalies. Peak lag of 625 samples means 1423 samples of margin -- nowhere near a wrap-around or torn-read scenario. All four hardware hypotheses eliminated at once.
Deferred ProcessPacket -- breaking the alternation
The perfectly alternating pattern was the strongest clue. What was different between processing a packet with valid CRC and one with failed CRC?
On CRC OK: the firmware runs the full packet handler -- copies 252 bytes to a flash buffer, updates the state machine, increments counters. Estimated time: 100-200 us (microseconds). On CRC FAIL: increments an error counter and does nothing else. Time: roughly 1 us.
That 200 us gap matters. During those 200 us the audio processing loop does not run. DMA keeps filling the buffer with fresh samples. When the loop resumes, it finds a slightly different alignment between the analysis window and the incoming signal.
Fix: move the packet handler out of the CRC check path entirely. On CRC OK, just copy the packet into a holding buffer and set a flag. The actual processing happens later in the main loop, outside the time-critical audio path. Now both code paths take the same time inside the sample processing loop.
Strict period-2 alternation
Alternation broken, rate unchanged
The perfect alternation broke -- proving that ProcessPacket timing had determined which copy got the phantom. But the phantom rate stayed the same: 4/6 packets had phantoms (vs. 3/6 before). ProcessPacket timing was a correlate, not a cause.
The flash-EMI hypothesis
This one gets its own section because it was the most convincing hypothesis and took the most work to kill.
The theory: when the chip writes data to its internal flash memory, it needs a high voltage. A small charge pump circuit inside the chip generates that voltage, drawing sharp current spikes from the power supply. Those spikes create Electromagnetic Interference (EMI) -- electrical noise that can leak into the analog input. If the noise hits during a Goertzel analysis window, the detector might see tone-like energy where there is only silence, producing a phantom symbol.
The physics checked out. The STM32F030's flash programming draws 10-50 mA current spikes from the internal supply rail. The STM32F030 does have a separate analog supply pin (VDDA), but on the test hardware it was tied directly to VDD without dedicated filtering -- the board was never designed for precision analog work. A 50 mA charge pump transient on VDD would couple straight through to the ADC reference, causing 10-20 mV of supply bounce. On a 12-bit ADC with a 3.3V reference that translates to 1-2 LSB (Least Significant Bits -- the smallest unit the ADC can resolve) of error injected into every conversion during the flash write.
Diagnostic: gap-gated flash writing -- only program flash when the receiver is in an inter-packet gap (3 or more consecutive silence symbols), never during active tone reception. Plus: log the global symbol counter at flash write start to correlate phantom positions with flash write timing.
| Interval | Symbols | CRC | Flash? | Note |
|---|---|---|---|---|
| sync[8] -> [9] | 1050 | CRC_OK | FLUSH @3277 | |
| sync[9] -> [10] | 1051 | CRC_FAIL | -- | <- phantom, no flash |
| sync[10] -> [11] | 1051 | CRC_FAIL | -- | <- phantom, no flash |
| sync[11] -> [12] | 1050 | CRC_OK | FLUSH @6428 | |
| sync[12] -> [13] | 1051 | CRC_FAIL | -- | <- phantom, no flash |
| sync[13] -> [14] | 1051 | CRC_FAIL | -- | <- phantom, no flash |
Insert/skip analysis showed no correlation (18 and 19 inserts at both 1050 and 1051 intervals). Bresenham simulation across all 200 remainder values could not produce the observed pattern. Python Goertzel demodulation of the WAV confirmed all six DATA intervals had exactly 1050 symbols, copies bitwise identical.
Gap-gated flash writing was kept anyway -- there is no reason to program flash during active tone reception when there is a 50 ms gap sitting right there.
Phase 5: root cause -- Goertzel window phase drift
The breakthrough came from a different angle: simulating the firmware's packet assembly logic in Python using the real post-mortem data. A script loaded the raw symbol stream from the debug dump and fed each two-bit symbol through an exact replica of the firmware's state machine -- same sync detection, same byte assembly, same state transitions.
The simulation reproduced the exact phantom pattern. That meant the symbols in the debug dump already contained the phantoms -- the problem was in the Goertzel detector itself, not in the packet assembly logic.
Two phantom mechanisms
Detailed analysis of the bit-level trace at packet boundaries revealed two distinct mechanisms:
Mechanism 1 -- length-byte corruption: Each packet contains a byte that tells the receiver how long the payload is. At symbol 4330, the Goertzel misdetected one two-bit symbol in the length field, turning 0xFC (decimal 252) into 0xFF (decimal 255). The firmware then read 3 extra bytes (12 symbols) from the gap region before concluding the packet -- consuming part of the silence that was supposed to separate it from the next packet.
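The corruption is a single-dibit event, which is easy to check:

```python
# 0xFC (252) and 0xFF (255) differ in exactly one dibit: the final 00 vs 11,
# i.e. a single f0 -> f3 misdetection in the length field.
good_len, bad_len = 0xFC, 0xFF

def dibits(b):
    """Split a byte into four dibits, MSB-first."""
    return [(b >> s) & 0b11 for s in (6, 4, 2, 0)]

flipped = sum(a != b for a, b in zip(dibits(good_len), dibits(bad_len)))
extra_bytes = bad_len - good_len      # firmware reads 3 bytes too many...
extra_symbols = extra_bytes * 4       # ...which is 12 symbols eaten from the gap
```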
Mechanism 2 -- extra symbol at tone transition: At the end of a DATA packet, the audio changes from scrambled data tones back to f0 (silence/gap). If the Goertzel analysis window straddles this transition -- half of it seeing the last data tone, half seeing f0 -- the mixed signal can produce an extra f0 detection. A symbol that does not exist in the WAV file.
The shared root cause
Both mechanisms trace to the same underlying problem: Goertzel window phase drift.
The STM32's internal oscillator and the PC's audio clock are never perfectly aligned. The chip thinks it is sampling at exactly 48,000 Hz, but the real rate might be 48,028 Hz or 47,970 Hz. The difference is small -- a few hundred parts per million -- but it accumulates. Over a 1050-symbol DATA packet, the analysis window drifts by roughly 18 samples relative to the true symbol boundaries.
Think of two people reading the same sheet music at very slightly different tempos. At first they are in sync. After a few hundred bars, one is a fraction of a beat ahead of the other. The Bresenham drift compensation tries to correct this with occasional skip/insert adjustments (like one player occasionally holding a note slightly longer), but it cannot prevent the analysis window from gradually sliding. At tone transitions -- where one tone ends and another begins -- the window ends up straddling both, producing a mixed reading.
Phase 6: the fix
plen clamp (Mechanism 1)
Clamp the payload length to MAX_PAYLOAD_SIZE. Even if the Goertzel misdetects the
length byte (252 -> 255), the firmware will not over-read into the gap. Four lines, zero overhead.
The N=200 disaster -- spectral leakage
First attempt at guard bands: shrink the analysis window from 240 samples to 200, leaving 20 guard samples on each side (just barely above the 18-sample drift).
Total failure. Every single DATA packet failed CRC. Power ratios dropped from the normal 10-100x range to 1.0-1.2x. The Goertzel was guessing randomly.
Root cause: fractional bin alignment. The Goertzel algorithm is tuned to a specific frequency by choosing a bin index k = freq x N / fs. For clean detection, k must be an exact integer -- otherwise the algorithm's energy "leaks" across all bins, like trying to tune a radio to a frequency between two stations and hearing both at once. This is called spectral leakage.
| Tone | Freq | N=240 (original) | N=200 (attempted) |
|---|---|---|---|
| f0 | 2400 Hz | k = 12.0 OK | k = 10.0 OK |
| f1 | 3200 Hz | k = 16.0 OK | k = 13.33 X |
| f2 | 4000 Hz | k = 20.0 OK | k = 16.67 X |
| f3 | 4800 Hz | k = 24.0 OK | k = 20.0 OK |
f1 and f2 land between integer bins. Energy leaks into all four bins roughly equally, argmax picks at random. Post-mortem confirmed it: power dumps showed f0 and f3 with nearly equal power (ratio 1.0x).
N=180 -- exact bin alignment
The constraint: N must yield exact integer k-values for all four frequencies. k = freq x N / fs is integer for all four tones when N is a multiple of fs / GCD(f0, f1, f2, f3). GCD(2400, 3200, 4000, 4800) = 800. fs / 800 = 48000 / 800 = 60. N must be a multiple of 60. Valid choices: 240 (original), 180, 120, 60.
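That divisibility argument can be checked mechanically:

```python
from math import gcd
from functools import reduce

FS = 48000
TONES = [2400, 3200, 4000, 4800]   # f0..f3

def bins(n):
    """Goertzel bin index k = freq x N / fs for each tone at window length n."""
    return [f * n / FS for f in TONES]

g = reduce(gcd, TONES)      # 800 Hz
step = FS // g              # N must be a multiple of 60
valid = [n for n in range(step, 241, step) if all(k.is_integer() for k in bins(n))]
# valid -> [60, 120, 180, 240]; bins(200) contains 13.33 and 16.67
```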
| Tone | Freq | N=180 | Q7 coeff |
|---|---|---|---|
| f0 | 2400 Hz | k = 9 OK | 243 |
| f1 | 3200 Hz | k = 12 OK | 234 |
| f2 | 4000 Hz | k = 15 OK | 222 |
| f3 | 4800 Hz | k = 18 OK | 207 |
The Goertzel coefficients stay identical -- 2 x cos(2 x pi x freq / fs) depends on freq/fs,
not N, when k/N is constant (9/180 = 12/240 = 1/20). Q7 values: [243, 234, 222, 207].
Trade-off: ~1.2 dB less SNR from the shorter window. With typical on-frequency power ratios of 10-100x, this is negligible.
Implementation
Three files, three changes: GOERTZEL_WINDOW=180 and GOERTZEL_GUARD=30
in goertzel.h, the Goertzel loop bound changed from SYMBOL_SAMPLES to
GOERTZEL_WINDOW in goertzel.c and the Goertzel input pointer offset by
GOERTZEL_GUARD samples in transport_audio.c.
Sidebar: the Q7 detour
The Goertzel algorithm needs to multiply coefficients by running state values on every sample. The original code used Q14 fixed-point coefficients -- numbers stored as integers but treated as fractions with 14 bits after the decimal point. High precision, but on the Cortex-M0 (a chip with no hardware support for large multiplies) every multiplication became a call to a library function that does 64-bit math in software. Roughly 35 clock cycles per call, roughly 200 bytes of code, and it runs in the hot path: 4 tones x 180 samples = 720 multiplies per symbol.
Switching to Q7 (7 bits after the decimal point, coefficients scaled by 128 instead of 16384) keeps every multiplication within the range of a single 32-bit integer. One hardware MUL instruction, 1 cycle. The trade-off is precision -- but the error compared to Q14 is less than 0.2%, far below what matters when typical power ratios are 10-100x. This was not directly related to the phantom bug, but the headroom it freed up made the guard band approach practical. At 5.8 ms per symbol the time budget was already blown; at 0.5 ms there is room to breathe.
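The Q7 values quoted above and the sub-0.2% error claim can be reproduced in a few lines:

```python
import math

FS = 48000
TONES = [2400, 3200, 4000, 4800]

def coeff(freq):
    """Exact Goertzel coefficient 2*cos(2*pi*freq/fs) -- depends only on freq/fs."""
    return 2.0 * math.cos(2.0 * math.pi * freq / FS)

q7 = [round(coeff(f) * 128) for f in TONES]    # scale by 2**7 and round
rel_err = max(abs(c / 128 - coeff(f)) / coeff(f) for c, f in zip(q7, TONES))
# q7 -> [243, 234, 222, 207]; worst relative error just under 0.2% (at f0)
```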
Sidebar: the calibration saga
Getting clock drift calibration right was its own multi-session adventure. The WAV contains 200 consecutive f3 symbols after the preamble. The firmware measures the exact sample count between f0->f3 and f3->f0 transitions using a cross-buffer boundary sweep.
Attempt 1 -- f1 instead of f3: Using f1 (3200 Hz) for calibration. The f0<->f1 spacing is only 800 Hz = 1 bin at the sweep window's resolution. Could not distinguish f0 from f1 -> random boundary positions -> 74.8% match rate. Fix: switch to f3 (4800 Hz), where f0<->f3 = 2400 Hz = 6 bins.
Attempt 2 -- single-buffer dead zone:
If the tone transition falls within 60 samples of a buffer edge, the sweep cannot find it--
start_x256=0 in ~50% of phase alignments.
Fix: cross-buffer sweep operating on two adjacent 240-sample buffers (480 virtual samples).
Attempt 3 -- wrong sign:
drift = expected - measured instead of measured - expected.
Negative drift -> insert instead of skip -> made things worse.
At adaptive drift=-11, only 2/6 DATA packets succeeded; at drift=0, 19/22 succeeded.
Attempt 4 -- phase offset ignored:
Calibration measured cal_start_x256 = 9340 (36.5 samples offset) but never applied
a phase correction. The Goertzel window was permanently 15% off from true symbol boundaries.
The mystery of why a hardcoded drift=+36 worked perfectly: the frequent skips
(every 7 symbols) accidentally compensated the 36.5-sample phase offset. It was not drift
correction -- it was unintentional phase correction.
Attempt 5 -- near-miss early-exit:
start_x256=-30647, only 73 x256 units (0.29 samples) above the threshold.
Power interpolation barely missed the boundary. Fix: threshold with 4-sample margin.
Attempt 6 -- integer division truncation:
drift_total / CAL_EXPECTED loses the remainder.
With drift_total=2702 and CAL_EXPECTED=200: quotient=13, remainder=102
-> 0.51 x256/symbol error -> 12 samples at stream end.
Fix: Bresenham remainder tracking -- error now <=0.7 samples over the entire stream.
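With the numbers above, the remainder-tracking fix looks roughly like this (illustrative sketch, not the firmware code):

```python
# Numbers from the attempt above: drift_total = 2702 x256 units over 200 cal symbols.
# Plain integer division keeps only 13 and silently drops the 102/200 remainder.
drift_total, cal_expected = 2702, 200

def per_symbol_adjustments(n):
    """Yield an x256 adjustment per symbol, Bresenham-style: carry the remainder
    in an error accumulator and emit base+1 whenever it overflows."""
    base, rem = divmod(drift_total, cal_expected)   # 13, remainder 102
    err = 0
    for _ in range(n):
        err += rem
        if err >= cal_expected:
            err -= cal_expected
            yield base + 1
        else:
            yield base

adjust = list(per_symbol_adjustments(cal_expected))
total = sum(adjust)   # 13*98 + 14*102 = 2702 -- nothing truncated away
```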
Results
With N=180 guard bands active, short test images (the same ones that previously failed unpredictably) transfer reliably. Longer WAV files of 2-3 minutes also succeed. Phantom symbols are eliminated. Every packet interval shows the correct 1050-symbol count.
The guard band makes the system tolerant of clock drift up to +/-30 samples -- well beyond the typical 18 samples accumulated over a full packet.
Eliminated hypotheses
15+ distinct hypotheses were tested and eliminated over 9 sessions before the root cause was found: flash EMI, DMA race conditions, ProcessPacket timing, torn reads, WAV file errors, sync state machine bugs, clock drift miscalculation and more.
Takeaways
Observation bias is real. The J-Link RTT streaming introduced 20.9% noise through AHB bus contention. The measurement tool was the disease.
Statistical impossibility is a signal. When a pattern has a 10^-45 probability of occurring randomly, it is not random. The alternating 1050/1051 pattern pointed directly at something deterministic in software.
The obvious fix can backfire. Confidence gating sounds like good engineering -- reject uncertain measurements. But when the "uncertain" measurements are correct readings of boundary-straddling windows, rejecting them destroys reception entirely.
Spectral constraints are unforgiving. N=200 seemed reasonable (20-sample guard) but puts two of four frequencies on fractional Goertzel bins. The resulting spectral leakage turned a working detector into a random number generator. Integer constraints on N are non-negotiable.
Work from the data, not the hypothesis. Over the course of the investigation, 15+ hypotheses were tested and eliminated: flash EMI, DMA race conditions, ProcessPacket timing, torn reads, WAV file errors, sync state machine bugs, clock drift miscalculation. Each hypothesis had a plausible mechanism. The data eliminated all of them until only the true root cause remained.
© Oscaria Audio. Berlin, Germany.