Plan: richer error reporting in anonymous usage collection
Goal
Improve the error information captured by the usage-reporting subsystem so that:
- We capture the full error chain, not just the leaf-most (innermost) error type.
- The report includes gRPC status codes and HTTP status codes (from Kubernetes API errors) when the failure carries one.
All of this must stay within the package's hard privacy rule: no error message text is ever sent — only tool-intrinsic identifiers (Go types, errcat category, status codes).
Current behavior
The only error-report site is pkg/usg/cobra.go:83, inside the RunE wrapper installed by
AttachToRoot:
if err != nil {
	r.Add("error", formatError(err))
}
formatError (cobra.go:114) walks to the innermost unwrapped error and renders a single
string "{errcat-category}:{innermost-go-type}", e.g. user:*errors.errorString. Two
limitations:
- It deliberately throws away every wrapper frame, so a connect failure that was
categorized -> fmt.wrapError -> StructuredError(Unavailable) collapses to just the leaf type, losing both the wrapping structure and the gRPC code.
- It has no notion of gRPC or HTTP status codes.
The traffic-manager side does not report command errors at all. That was out of scope for the original plan; it is covered by the "Follow-up: traffic-manager side" section below.
Key facts discovered
- errcat category is already a whole-chain property: errcat.GetCategory unwraps until it finds a Categorized. A new Category.String() already exists on this branch.
- gRPC codes, server side / direct calls: standard status.FromError(err) (grpc v1.81.0, which unwraps via errors.As) finds a *status.Status anywhere in the chain.
- gRPC codes, CLI side: errors returned from the user daemon are reconstructed by pkg/grpc.FromGRPC into *grpc.StructuredError. That type carries the code via GetCode() grpcCodes.Code and the category via GetCategory(), but it does not implement GRPCStatus() and does not Unwrap(). So status.FromError will not see it; we must also probe for a GetCode()-bearing error.
- HTTP codes: the relevant source is Kubernetes API errors (k8s.io/apimachinery/pkg/api/errors). These implement the apierrors.APIStatus interface; apiStatus.Status().Code is the HTTP status (int32) and .Reason is a stable enum string. errors.As(err, &apiStatus) finds it anywhere in the chain. Raw net/http responses are not surfaced as errors anywhere relevant to tracked commands.
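The CLI-side point above can be illustrated with the standard library alone. This is a minimal sketch, not the package's code: `code`, `structuredError`, and `hasGRPCStatus` are hypothetical stand-ins for `codes.Code`, `*grpc.StructuredError`, and the `GRPCStatus()`-based lookup that `status.FromError` performs, so the example compiles without the gRPC module.

```go
package main

import (
	"errors"
	"fmt"
)

// code is a stand-in for google.golang.org/grpc/codes.Code (assumption:
// only the numeric value matters for this illustration).
type code uint32

const unavailable code = 14 // same numeric value as codes.Unavailable

// structuredError is a hypothetical stand-in for *grpc.StructuredError:
// it exposes GetCode() but neither GRPCStatus() nor Unwrap().
type structuredError struct{ c code }

func (e *structuredError) Error() string { return "structured error" }
func (e *structuredError) GetCode() code { return e.c }

// hasGRPCStatus is a simplified imitation of a GRPCStatus()-based probe.
// The real method returns *status.Status; `any` is used here only to
// avoid the external dependency.
func hasGRPCStatus(err error) bool {
	var s interface{ GRPCStatus() any }
	return errors.As(err, &s)
}

// codeOf probes the whole chain for any error exposing GetCode().
func codeOf(err error) (code, bool) {
	var coder interface{ GetCode() code }
	if errors.As(err, &coder) {
		return coder.GetCode(), true
	}
	return 0, false
}

// sample builds a wrapped chain: fmt wrapper -> structuredError.
func sample() error {
	return fmt.Errorf("connect: %w", &structuredError{c: unavailable})
}

func main() {
	err := sample()
	fmt.Println(hasGRPCStatus(err)) // false
	c, ok := codeOf(err)
	fmt.Println(c, ok) // 14 true
}
```

The second probe succeeds where the first does not, which is why the analyzer needs both.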
Design
Replace the single formatError with a small analyzer that produces a structured result,
and emit it as multiple report entries instead of one packed string. Multiple keys are
easier for a collector to aggregate/group than a single delimited blob, and we are well
under MaxEntryCount.
Entries produced (only when applicable)
| Key | Example value | When |
|---|---|---|
| error.cat | user | always, on failure |
| error.chain | *url.Error>*net.OpError>*os.SyscallError | always, on failure |
| error.grpc | Unavailable | a gRPC status code is present |
| error.http | 404 | a k8s/HTTP status code is present |
Notes:
- error.cat replaces the category half of the old error value. (This branch is unreleased, so there is no backward-compatibility constraint on the old error key; we drop it in favor of the split keys.)
- error.grpc uses the codes.Code.String() name (Unavailable, NotFound, …) rather than the numeric code, since the name is stable and human-readable. OK/Unknown are treated as "no meaningful code" and omitted.
- error.http uses the numeric status (404, 409, …). We may optionally also record the k8s Reason as error.k8s (e.g. NotFound, Conflict); flagged as a decision below.
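To make the "only when applicable" rule concrete, here is a minimal sketch of how the structured result might be turned into entries. The field names are assumptions, and a plain `map[string]string` stands in for the package's Report type; only the entry keys are taken from the table above.

```go
package main

import "fmt"

// errInfo is a hypothetical shape for the analyzer result. The field
// names are assumptions; only the entry keys come from the plan.
type errInfo struct {
	cat   string // errcat category, e.g. "user"
	chain string // e.g. "*url.Error>*net.OpError"
	grpc  string // codes.Code name; empty when absent, OK, or Unknown
	http  string // numeric HTTP status as text; empty when absent
}

// addTo writes the entries into a plain map, which stands in for the
// package's Report type. Optional keys are skipped when empty.
func (e errInfo) addTo(entries map[string]string) {
	entries["error.cat"] = e.cat
	entries["error.chain"] = e.chain
	if e.grpc != "" {
		entries["error.grpc"] = e.grpc
	}
	if e.http != "" {
		entries["error.http"] = e.http
	}
}

func main() {
	entries := map[string]string{}
	info := errInfo{cat: "user", chain: "*url.Error>*net.OpError", grpc: "Unavailable"}
	info.addTo(entries)
	fmt.Println(len(entries), entries["error.grpc"]) // 3 Unavailable
}
```

Keeping absent codes out of the map, rather than sending empty strings, lets a collector count "has a gRPC code" by key presence alone.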
Chain representation
Walk the chain outermost -> innermost via errors.Unwrap. For each frame, emit a token:
- Skip pure structural wrappers that carry no diagnostic signal: *fmt.wrapError, *fmt.wrapErrors, and errcat's *errcat.categorized. (Matched by %T string; these are the only wrappers we introduce ourselves.)
- Sentinels get a friendly stable name instead of their type: context.Canceled, context.DeadlineExceeded, io.EOF, os.ErrNotExist.
- Everything else contributes its %T.
- Collapse consecutive duplicate tokens.
- Join with >.
Edge case — multi-wrap (Unwrap() []error, e.g. errors.Join): recurse into each branch
and join branches with |. Kept minimal; rare in tracked command paths.
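The walk described above could look roughly like the following. This is a sketch under stated assumptions, not the planned implementation: the names `token`, `chain`, `samplePathErr`, and `sampleJoin` are hypothetical, and keeping the multi-wrap node itself (for example `*errors.joinError`) as a frame before its branches is a choice made here, since the plan does not specify it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os"
	"strings"
)

// sentinels maps well-known sentinel errors to stable names.
var sentinels = []struct {
	err  error
	name string
}{
	{context.Canceled, "context.Canceled"},
	{context.DeadlineExceeded, "context.DeadlineExceeded"},
	{io.EOF, "io.EOF"},
	{os.ErrNotExist, "os.ErrNotExist"},
}

// skipped lists structural wrappers that carry no signal, matched by %T.
var skipped = map[string]bool{
	"*fmt.wrapError":      true,
	"*fmt.wrapErrors":     true,
	"*errcat.categorized": true,
}

// token returns the chain token for one frame, or "" if it is skipped.
func token(err error) string {
	for _, s := range sentinels {
		if err == s.err {
			return s.name
		}
	}
	if t := fmt.Sprintf("%T", err); !skipped[t] {
		return t
	}
	return ""
}

// chain renders err outermost -> innermost, joining frames with ">" and
// multi-wrap branches with "|". Consecutive duplicate tokens collapse.
func chain(err error) string {
	var toks []string
	for err != nil {
		if t := token(err); t != "" && (len(toks) == 0 || toks[len(toks)-1] != t) {
			toks = append(toks, t)
		}
		if m, ok := err.(interface{ Unwrap() []error }); ok {
			var branches []string
			for _, e := range m.Unwrap() {
				if b := chain(e); b != "" {
					branches = append(branches, b)
				}
			}
			if len(branches) > 0 {
				toks = append(toks, strings.Join(branches, "|"))
			}
			break // a multi-wrap node ends the linear walk
		}
		err = errors.Unwrap(err)
	}
	return strings.Join(toks, ">")
}

// samplePathErr wraps a path error with a plain fmt wrapper.
func samplePathErr() error {
	inner := &os.PathError{Op: "open", Path: "cfg", Err: os.ErrNotExist}
	return fmt.Errorf("load config: %w", inner)
}

// sampleJoin builds a two-branch multi-wrap error.
func sampleJoin() error { return errors.Join(io.EOF, context.Canceled) }

func main() {
	fmt.Println(chain(samplePathErr())) // *fs.PathError>os.ErrNotExist
	fmt.Println(chain(sampleJoin()))    // *errors.joinError>io.EOF|context.Canceled
}
```

Note that `os.PathError` is an alias of `fs.PathError`, so `%T` reports the `fs` name; tokens follow whatever `%T` prints.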
Code extraction is independent of the chain walk (codes can sit on any frame):
// gRPC: standard status first, then the telepresence StructuredError shape.
if s, ok := status.FromError(err); ok && s.Code() != codes.OK && s.Code() != codes.Unknown {
	grpc = s.Code().String()
} else {
	var coder interface{ GetCode() codes.Code } // matches *grpc.StructuredError
	if errors.As(err, &coder) {
		if c := coder.GetCode(); c != codes.OK && c != codes.Unknown {
			grpc = c.String()
		}
	}
}
// HTTP via k8s API errors.
var apiStatus apierrors.APIStatus
if errors.As(err, &apiStatus) {
	http = strconv.Itoa(int(apiStatus.Status().Code))
}
Using a locally-declared interface{ GetCode() codes.Code } (rather than importing
pkg/grpc) keeps pkg/usg from depending on pkg/grpc. Decision flagged below.
Files to change
- pkg/usg/errinfo.go (new): errInfo struct + analyzeError(err) errInfo and an (errInfo).addTo(*Report) method. Move/replace the logic currently in formatError.
- pkg/usg/cobra.go: replace the r.Add("error", formatError(err)) call with analyzeError(err).addTo(r); delete formatError; drop the now-unused fmt import if applicable; update the AttachToRoot doc comment (lines 37-38) describing what is recorded on failure.
- pkg/usg/errinfo_test.go (new): table tests mirroring usg_test.go style, covering: plain error, single/multiple %w wrapping, errcat-categorized, a status.Error, a synthetic GetCode() error (StructuredError shape), a k8s NewNotFound, sentinels, and an errors.Join branch.
- pkg/usg/report.go: no code change; if we add per-error keys, confirm MaxEntryCount comfortably accommodates them alongside the existing flags/flag.* entries (it does).
No proto/RPC changes: entries already travel as the free-form map<string,string> entries
field, so the collector schema is unaffected.
Out of scope (call out, do not implement unless requested)
- Sending any error message text or stack traces.
- Changing the collector / proto schema.
Follow-up: traffic-manager side (implemented)
The manager had no per-operation error reporting (only manager.boot). Its analog to the
client's per-command failure is a failed gRPC call, so reporting hangs off the gRPC server's
error interceptor — the single chokepoint that still sees the raw error chain before
jsonError flattens it for the wire.
- pkg/usg/server.go (new): ReportServerError(ctx, method, err) plus isBenignCode. Sends a report with topic manager.rpc, a method entry (the code-defined info.FullMethod, safe), and the analyzeError entries. No-op unless a producer is installed and its Source == manager; the user daemon shares the gRPC server code, and its failures are already reported per-command by the CLI, so gating on source prevents double-counting. InstallManager is only called by the traffic-manager, confirming the gate is sufficient.
- pkg/grpc/server/server.go: the unary and stream error interceptors call usg.ReportServerError with the raw error before jsonError. No import cycle: pkg/usg's dependency tree does not include pkg/grpc/server.
- Benign-code filter (confirmed with user): skip OK, Canceled, DeadlineExceeded, NotFound, AlreadyExists, Unimplemented. status.Code maps nil→OK and the context errors to Canceled/DeadlineExceeded, so raw client disconnects are filtered too.
- pkg/usg/server_test.go (new): manager-reports / benign-skipped / client-no-report / nil-noop / no-producer-noop, plus an isBenignCode table.
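The two gates described in this section (source and benign code) can be sketched as follows. This is an illustration only: `code` and its constants are local stand-ins for `codes.Code` with the same numeric values as the gRPC status codes, and `shouldReport` with its string `source` parameter is a hypothetical helper, not the signature of ReportServerError.

```go
package main

import "fmt"

// code is a stand-in for google.golang.org/grpc/codes.Code; the numeric
// values below match the gRPC status code numbering.
type code uint32

const (
	codeOK               code = 0
	codeCanceled         code = 1
	codeDeadlineExceeded code = 4
	codeNotFound         code = 5
	codeAlreadyExists    code = 6
	codeUnimplemented    code = 12
	codeUnavailable      code = 14
)

// isBenignCode reports whether a failed call with this code should be
// skipped, using the list from the plan.
func isBenignCode(c code) bool {
	switch c {
	case codeOK, codeCanceled, codeDeadlineExceeded,
		codeNotFound, codeAlreadyExists, codeUnimplemented:
		return true
	}
	return false
}

// shouldReport combines the source gate with the benign-code filter.
// The "manager" literal and the string parameter are assumptions about
// how the gate is expressed.
func shouldReport(source string, c code) bool {
	return source == "manager" && !isBenignCode(c)
}

func main() {
	fmt.Println(shouldReport("manager", codeUnavailable)) // true
	fmt.Println(shouldReport("manager", codeNotFound))    // false
	fmt.Println(shouldReport("client", codeUnavailable))  // false
}
```

A switch keeps the benign list in one place, which makes the isBenignCode table test a direct mirror of the plan's list.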
Decisions (confirmed)
- Split keys: error.cat / error.chain / error.grpc / error.http. The old single error key is dropped (branch unreleased).
- k8s (HTTP code + Reason): emit both error.http (numeric) and error.k8s (apiStatus.Status().Reason, e.g. NotFound).
- Coupling: use a local interface{ GetCode() codes.Code } to read the telepresence gRPC code; no pkg/usg -> pkg/grpc import.
- Chain (signal-only): drop our structural wrappers (fmt.wrapError/wrapErrors, errcat.categorized); keep informative frames and sentinel names.