Error handling branches -- timeouts, malformed responses, rate limits, partial failures -- are unreachable from valid inputs alone. The connector code that handles a 503 or a half-written JSON response never fires when the mock server is behaving. And that's where the bugs hide.
This is part of a series on formally verifying identity connectors. The coverage-guided verification reaches most branches through normal input variation. Fault injection reaches the rest.
Ten Fault Types
The system defines ten categories of injectable failure:
const (
    FaultTimeout            = "timeout"             // Block until context cancels
    FaultServerError        = "server_error"        // HTTP 500
    FaultBadGateway         = "bad_gateway"         // HTTP 502
    FaultServiceUnavailable = "service_unavailable" // HTTP 503
    FaultMalformedJSON      = "malformed_json"      // Invalid JSON body
    FaultEmptyBody          = "empty_body"          // 200 OK, no content
    FaultRateLimit          = "rate_limit"          // HTTP 429 + Retry-After
    FaultConnectionReset    = "connection_reset"    // TCP RST
    FaultSlowResponse       = "slow_response"       // Delayed response
    FaultPartialResponse    = "partial_response"    // Truncated JSON
)
These aren't arbitrary. Each corresponds to a failure mode that real APIs produce and that connectors must handle. A 429 from Okta's rate limiter. A 503 from AWS during a service disruption. A connection reset from a load balancer timeout. Truncated JSON from a proxy that closed the connection early.
Configuration
Each fault is configured with targeting and scheduling:
type FaultConfig struct {
    Type          FaultType
    Probability   float64  // 0.0-1.0
    AfterRequests int      // Trigger after N successes
    ForRequests   int      // Apply for N requests, then stop
    Endpoints     []string // Filter by URL path pattern
    DelayMs       int      // For slow_response
    CustomStatus  int      // Override HTTP status code
    CustomBody    string   // Override response body
}
AfterRequests is key. A connector might handle the first page of results correctly but fail on the third. Setting AfterRequests: 2 injects the fault after two successful requests, testing pagination error recovery specifically. ForRequests limits the fault duration -- the system can verify that a connector recovers after a transient failure.
Endpoints targets faults at specific API paths. A connector talks to multiple endpoints -- users, groups, roles, memberships. Injecting a fault on the groups endpoint while users works normally tests whether the connector handles partial API availability.
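The scheduling rules described above can be sketched as a single decision function. This is an illustration under stated assumptions, not the framework's implementation: `shouldInject` is a hypothetical name, endpoint patterns are treated as path prefixes, the request counter covers only requests that matched `Endpoints`, and a zero `Probability` is read as "always".

```go
package main

import (
	"fmt"
	"strings"
)

// FaultConfig repeats only the scheduling fields from the struct above.
type FaultConfig struct {
	Probability   float64
	AfterRequests int
	ForRequests   int
	Endpoints     []string
}

// shouldInject is a hypothetical decision function. seen is the number of
// matching requests handled before this one; roll is a random number in
// [0, 1) supplied by the caller so the logic itself stays deterministic.
func shouldInject(cfg FaultConfig, path string, seen int, roll float64) bool {
	if len(cfg.Endpoints) > 0 {
		matched := false
		for _, p := range cfg.Endpoints {
			// Assumption: a pattern matches as a path prefix.
			if strings.HasPrefix(path, p) {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	if seen < cfg.AfterRequests {
		return false // still inside the initial run of successes
	}
	if cfg.ForRequests > 0 && seen >= cfg.AfterRequests+cfg.ForRequests {
		return false // fault window has closed, so the API "recovers"
	}
	// Assumption: a zero Probability means "always", so an unset field
	// does not silently disable the fault.
	return cfg.Probability == 0 || roll < cfg.Probability
}

func main() {
	cfg := FaultConfig{AfterRequests: 2, ForRequests: 1, Endpoints: []string{"/groups"}}
	for seen := 0; seen < 4; seen++ {
		fmt.Println(seen, shouldInject(cfg, "/groups?page=1", seen, 0.5))
	}
}
```

With `AfterRequests: 2` and `ForRequests: 1`, only the third matching request is faulted; the fourth succeeds again, which is the transient-failure shape a recovery test needs.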
Breaking the Transport Layer
The simple faults are straightforward: return an error status code, return empty content, delay the response. The interesting ones break the transport layer itself.
Connection reset hijacks the TCP connection and forces a RST instead of a clean FIN:
case FaultConnectionReset:
    if hijacker, ok := w.(http.Hijacker); ok {
        conn, _, err := hijacker.Hijack()
        if err != nil {
            break // Hijack failed; there is no connection to reset
        }
        if tcpConn, ok := conn.(*net.TCPConn); ok {
            tcpConn.SetLinger(0) // Linger 0: Close sends a TCP RST
        }
        conn.Close()
    }
SetLinger(0) changes what Close does: with the linger timeout set to zero, the kernel discards any unsent data and sends a RST segment instead of starting the normal FIN handshake. The connector sees an abrupt connection drop, not a graceful close. This is the kind of failure a connector sees when a load balancer times out a connection or a network partition heals ungracefully.
Partial response writes the beginning of a valid JSON response, then kills the connection:
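case FaultPartialResponse:
    w.Header().Set("Content-Length", "1000") // Claim more data coming
    w.Write([]byte(`{"data": [`))            // Start valid JSON
    if flusher, ok := w.(http.Flusher); ok {
        flusher.Flush() // Push the fragment onto the wire before hijacking
    }
    if hijacker, ok := w.(http.Hijacker); ok {
        if conn, _, err := hijacker.Hijack(); err == nil {
            conn.Close() // Abrupt close
        }
    }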
The Content-Length header says 1000 bytes are coming. Only 10 arrive. The connector has to detect the truncation -- by checking Content-Length against bytes received, by handling the read error, or by detecting invalid JSON. Each connector handles this differently. The fault injection tests whether it handles it at all.
Field-Level Mutations
Beyond transport-level faults, the system supports semantic mutations -- modifying the content of otherwise valid responses:
type FaultKind struct {
    Name            string            `yaml:"name"`
    Status          int               `yaml:"status"`
    Body            string            `yaml:"body"`
    Headers         map[string]string `yaml:"headers"`
    DelayMs         int               `yaml:"delay_ms"`
    OmitFields      []string          `yaml:"omit_fields"`
    OverrideFields  map[string]any    `yaml:"override_fields"`
    InjectFields    map[string]any    `yaml:"inject_fields"`
    TruncateResults int               `yaml:"truncate_results"`
}
OmitFields removes JSON fields from the response. What happens when the API returns a user without an email field? Does the connector crash, skip the user, or sync it with a blank email?
OverrideFields replaces values. What if status comes back as an unexpected string? What if role is null instead of a string?
TruncateResults limits array sizes. A paginated response that normally returns 100 items returns 3. Does the connector still follow the pagination link?
InjectFields adds unexpected fields. APIs evolve. A new field appearing in the response shouldn't break a connector that doesn't expect it.
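A minimal sketch of what applying these four mutations could look like is shown below. It assumes the response body is a JSON array of objects, that overrides touch only fields that already exist, and that injections add only fields that do not; the method names `mutate` and `mutateRecord` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FaultKind repeats only the mutation fields from the struct above.
type FaultKind struct {
	OmitFields      []string
	OverrideFields  map[string]any
	InjectFields    map[string]any
	TruncateResults int
}

// mutateRecord applies the field-level mutations to one decoded JSON object.
func (k FaultKind) mutateRecord(rec map[string]any) {
	if rec == nil {
		return // a JSON null in the array decodes to a nil map
	}
	for _, f := range k.OmitFields {
		delete(rec, f)
	}
	for f, v := range k.OverrideFields {
		if _, present := rec[f]; present { // override only existing fields
			rec[f] = v
		}
	}
	for f, v := range k.InjectFields {
		if _, present := rec[f]; !present { // inject only new fields
			rec[f] = v
		}
	}
}

// mutate assumes the body is a JSON array of objects; anything else is
// returned unchanged.
func (k FaultKind) mutate(body []byte) []byte {
	var items []map[string]any
	if err := json.Unmarshal(body, &items); err != nil {
		return body
	}
	if k.TruncateResults > 0 && len(items) > k.TruncateResults {
		items = items[:k.TruncateResults]
	}
	for _, rec := range items {
		k.mutateRecord(rec)
	}
	out, err := json.Marshal(items)
	if err != nil {
		return body
	}
	return out
}

func main() {
	kind := FaultKind{
		OmitFields:      []string{"email"},
		OverrideFields:  map[string]any{"role": nil},
		InjectFields:    map[string]any{"tenant": "t-1"},
		TruncateResults: 1,
	}
	in := []byte(`[{"id":1,"email":"a@example.com","role":"admin"},{"id":2}]`)
	fmt.Println(string(kind.mutate(in)))
	// prints [{"id":1,"role":null,"tenant":"t-1"}]
}
```

A real API usually wraps its results in an envelope object, so a production version would need to know where the result array lives; that lookup is left out here.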
These mutations apply via a capture-modify-forward middleware:
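func (fi *FaultInjector) FaultKindMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Let the mock server write into a buffer instead of the real writer.
        capture := &responseCapture{ResponseWriter: w}
        next.ServeHTTP(capture, r)
        // kind is the FaultKind being applied; how it is selected is not shown here.
        body, status := kind.ApplyToResponse(
            capture.body.Bytes(), capture.statusCode)
        // Forward the mutated response to the connector.
        w.WriteHeader(status)
        w.Write(body)
    })
}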
The mock server generates a correct response. The middleware captures it, mutates the JSON, and forwards the mutated version. The connector sees a response that's structurally valid but semantically wrong in a controlled way.
Combined Coverage
Normal input variation exercises the happy path and its boundary conditions. Fault injection exercises error handling, retry logic, and degraded-mode behavior. Between the two, the framework reaches branches that neither covers alone.
For connectors where the framework has source access, the combination gets branch coverage to 100%. Every if err != nil, every pagination check, every rate limit handler, every timeout path -- exercised by some combination of input configuration and fault scenario.
The coverage predictor tracks which faults exercise which error-handling branches, using the same DFA-based prediction that guides normal input exploration. The system walks through fault scenarios the same way it walks through the input space: one fault type at a time and one endpoint at a time, in Gray code order over the fault configuration space so that consecutive scenarios differ in a single setting, and bisecting to find which fault configurations trigger new branches.