RFD 035 idea

Performance Counter Integrity Protection

AUTHOR RAND Corporation CREATED 2024-11-27
verification, hardware, side-channel

The Idea

Protect on-chip performance counters (used for compute metering) against tampering, reset, and manipulation. GPUs already contain extensive performance monitoring infrastructure for debugging and profiling; these counters become security-critical when used for governance.

Attacks include: fault injection to reset counters, cutting signal wires to prevent incrementing, manipulating aggregation logic, and exploiting the counter read path. Defense requires distributed redundancy, sanity checking, shielding, and tamper detection.

Why It Matters

Offline licensing (RFD 032) and FLOP counting (RFD 024) depend on accurate, tamper-resistant metering. If an attacker can reset or freeze performance counters, they can use unlimited compute while appearing to consume none. Counter integrity is the foundation of hardware-based compute governance.

Existing Counter Infrastructure

NVIDIA GPUs already track:

Counter Type           Granularity            Examples
---------------------  ---------------------  -------------------------------------------
Instruction counters   Per-SM                 Instructions issued, instructions retired
FLOP counters          Per-SM                 FP16/FP32/FP64 operations, tensor core ops
Memory counters        Per-memory controller  Bytes read/written, cache hits/misses
Interconnect counters  Per-link               NVLink bytes transferred, PCIe transactions
Power/thermal          Chip-wide              Instantaneous power, temperature

These are currently used for profiling (Nsight, DCGM) but are not designed to be tamper-resistant.

Attack Vectors

Attack                            Mechanism                            Difficulty
--------------------------------  -----------------------------------  -----------------------------
Fault injection (EM pulse)        Randomize counter values             Medium (requires proximity)
Fault injection (voltage glitch)  Reset counters to zero               Medium
Laser fault injection             Target specific counter bits         High (requires decapping)
Wire cutting (FIB)                Sever increment signal path          High (requires FIB equipment)
Firmware manipulation             Report false values via driver       Low (if firmware unsigned)
Aggregation attack                Manipulate when counters are summed  Medium

Defense: Distributed Redundancy

┌─────────────────────────────────────────────────────────┐
│                         GPU                             │
│                                                         │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐   │
│  │   SM0   │  │   SM1   │  │   SM2   │  │  SM143  │   │
│  │         │  │         │  │         │  │         │   │
│  │ Counter │  │ Counter │  │ Counter │  │ Counter │   │
│  │   A0    │  │   A1    │  │   A2    │  │  A143   │   │
│  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘   │
│       │            │            │            │         │
│       └────────────┴─────┬──────┴────────────┘         │
│                          │                              │
│              ┌───────────▼───────────┐                 │
│              │   AGGREGATION BLOCK   │                 │
│              │                       │                 │
│              │  • Sum all SM counters│                 │
│              │  • Compare with shadow│                 │
│              │  • Detect anomalies   │                 │
│              └───────────┬───────────┘                 │
│                          │                              │
│              ┌───────────▼───────────┐                 │
│              │   SHADOW COUNTERS     │                 │
│              │   (redundant path)    │                 │
│              └───────────────────────┘                 │
└─────────────────────────────────────────────────────────┘

An attacker must compromise counters across many SMs simultaneously: a single SM counter that is reset or frozen is detected when the aggregate is compared against the shadow counters.
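The comparison step in the diagram can be sketched in software as follows. This is a minimal illustration rather than the hardware design: the function name `find_suspect_sms`, the list-based counter representation, the `tolerance` parameter, and the assumption that every SM has its own shadow counter are all introduced here for the example.

```python
from typing import List


def find_suspect_sms(primary: List[int], shadow: List[int],
                     tolerance: int = 0) -> List[int]:
    """Return indices of SMs whose primary and shadow counters disagree.

    Assumption: each SM has one primary counter and one shadow counter fed
    by a redundant path, so the two should agree within `tolerance`.
    """
    if len(primary) != len(shadow):
        raise ValueError("primary and shadow must cover the same SMs")
    return [
        sm
        for sm, (p, s) in enumerate(zip(primary, shadow))
        if abs(p - s) > tolerance
    ]


# Example: 144 SMs, with SM 2's primary counter reset to zero by an attacker.
shadow = [1_000_000] * 144
primary = list(shadow)
primary[2] = 0

suspects = find_suspect_sms(primary, shadow)
print(suspects)                      # [2]
print(sum(primary) == sum(shadow))   # False: the aggregate totals diverge too
```

To stay undetected in this model, the attacker would need to corrupt both the primary and the shadow path for the same SM consistently.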

Defense: Sanity Checking

Cross-validate counters against each other:

Check                                          Detects
---------------------------------------------  ------------------------------------
Instructions issued ≥ instructions retired     Counter underflow attacks
Memory accesses ≥ cache misses                 Impossible cache behavior
FLOP count consistent with instruction count   Selective counter manipulation
Power draw correlates with activity counters   Counters frozen while compute active
Time elapsed × max throughput ≥ counter value  Counter overflow attacks

Sanity check failures trigger alerts or automatic throttling.
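A few of the checks above could be expressed as follows. This is a hedged sketch: the `CounterSnapshot` fields, the `sanity_check` function, and the two limit parameters are hypothetical names chosen for illustration, and "consistent with instruction count" is approximated here as a simple upper bound on FLOPs per retired instruction. The power-correlation check is omitted because it needs a calibrated model of the specific chip.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CounterSnapshot:
    """Hypothetical counter readings accumulated over one sampling interval."""
    instructions_issued: int
    instructions_retired: int
    memory_accesses: int
    cache_misses: int
    flops: int
    elapsed_seconds: float


def sanity_check(s: CounterSnapshot, max_flops_per_second: float,
                 max_flops_per_instruction: int) -> List[str]:
    """Return descriptions of failed checks; an empty list means plausible."""
    failures = []
    if s.instructions_issued < s.instructions_retired:
        failures.append("retired > issued")
    if s.memory_accesses < s.cache_misses:
        failures.append("cache misses > memory accesses")
    # Assumed bound: one instruction can account for at most this many FLOPs.
    if s.flops > s.instructions_retired * max_flops_per_instruction:
        failures.append("FLOP count inconsistent with instruction count")
    if s.flops > s.elapsed_seconds * max_flops_per_second:
        failures.append("FLOP count exceeds physical maximum")
    return failures


ok = CounterSnapshot(1000, 990, 500, 40, 64_000, 1.0)
bad = CounterSnapshot(1000, 1200, 500, 40, 64_000, 1.0)

print(sanity_check(ok, 1e15, 1024))    # []
print(sanity_check(bad, 1e15, 1024))   # ['retired > issued']
```

A non-empty result would be the point at which the alerting or throttling response is triggered.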

Defense: Physical Shielding

Protection              Mechanism
----------------------  --------------------------------------
EM shielding            Metal layers over counter circuits
Active mesh             Conductor grid that detects probing
Backside protection     Shield against through-silicon attacks
Voltage monitoring      Detect glitch attempts
Temperature monitoring  Detect thermal attacks

Defense: Distributed Aggregation

Don’t aggregate all counters in one place:

Instead of:
  All SM counters → Single aggregator → Output

Use:
  SM 0-35 → Aggregator A ─┐
  SM 36-71 → Aggregator B ─┼→ Cross-check → Output
  SM 72-107 → Aggregator C ─┤
  SM 108-143 → Aggregator D ┘

Compromising a single aggregator only affects 1/4 of the count; cross-checking detects the discrepancy.
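One way the cross-check could work is sketched below. The source does not specify what the subtotals are checked against; this example assumes an independently accumulated chip-wide total (for instance from a shadow path) is available. The function names `partition_subtotals` and `cross_check` are hypothetical.

```python
from typing import List, Sequence


def partition_subtotals(sm_counters: Sequence[int],
                        num_aggregators: int = 4) -> List[int]:
    """Sum SM counters in contiguous, equally sized partitions
    (e.g. SM 0-35, 36-71, 72-107, 108-143 for 144 SMs)."""
    if len(sm_counters) % num_aggregators != 0:
        raise ValueError("SM count must divide evenly among aggregators")
    size = len(sm_counters) // num_aggregators
    return [sum(sm_counters[i * size:(i + 1) * size])
            for i in range(num_aggregators)]


def cross_check(reported_subtotals: Sequence[int],
                independent_total: int) -> bool:
    """True if the aggregators' subtotals add up to the independent total."""
    return sum(reported_subtotals) == independent_total


sm_counters = [1_000] * 144
independent_total = sum(sm_counters)   # assumed to come from a redundant path

honest = partition_subtotals(sm_counters)
tampered = list(honest)
tampered[1] = 0                        # Aggregator B compromised

print(honest)                                    # [36000, 36000, 36000, 36000]
print(cross_check(honest, independent_total))    # True
print(cross_check(tampered, independent_total))  # False
```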

Counter Monotonicity

For governance purposes, counters should only increase:

COUNTER UPDATE LOGIC:

new_value = current_value + increment

ASSERT increment ≥ 0
ASSERT new_value ≥ current_value  // overflow check

IF assertion fails:
  - Log anomaly
  - Trigger tamper response
  - Optionally: lock chip until inspection

Hardware enforcement prevents any decrement, even via legitimate-looking commands.
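The update logic above can be modeled in software as a small class. This is an illustrative sketch only: `MonotonicCounter` and `TamperDetected` are hypothetical names, and because Python integers do not wrap, the fixed-width overflow check is emulated with an explicit maximum value.

```python
class TamperDetected(Exception):
    """Raised when a counter update would violate monotonicity."""


class MonotonicCounter:
    """Software model of a fixed-width counter that can only increase."""

    def __init__(self, width_bits: int = 64) -> None:
        self._max = (1 << width_bits) - 1
        self._value = 0

    @property
    def value(self) -> int:
        return self._value

    def add(self, increment: int) -> int:
        # ASSERT increment >= 0
        if increment < 0:
            raise TamperDetected("negative increment")
        new_value = self._value + increment
        # Overflow check: a fixed-width register would wrap to a smaller
        # value here, which is what new_value >= current_value guards against.
        if new_value > self._max:
            raise TamperDetected("counter overflow")
        self._value = new_value
        return self._value


counter = MonotonicCounter(width_bits=8)   # tiny width to show overflow
counter.add(200)

for bad_increment in (-1, 100):
    try:
        counter.add(bad_increment)
    except TamperDetected as err:
        print("tamper response:", err)

print(counter.value)   # 200: rejected updates leave the value unchanged
```

In this model the exception stands in for the tamper response; a hardware design would log the anomaly or lock the chip instead.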

Secure Readout Path

The path from counters to external reporting must also be protected:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   COUNTERS   │────►│  AGGREGATION │────►│   SIGNING    │
│  (protected) │     │  (protected) │     │   ENGINE     │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                                          Signed attestation
                                          of counter values
                                                 │
                                                 ▼
                                          External verifier

Counter values are signed before leaving the secure domain, preventing manipulation by compromised drivers or firmware.
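The sign-then-verify flow can be illustrated with Python's standard library. Note the substitutions: this sketch uses an HMAC (a shared-key construction) as a stand-in for whatever signature scheme the signing engine would actually use, which would more plausibly be asymmetric so that the verifier holds no secret. The nonce for replay protection, the JSON encoding, and the function names are also assumptions added for the example.

```python
import hashlib
import hmac
import json


def sign_counters(counters: dict, nonce: str, key: bytes) -> dict:
    """Produce an attestation over counter values inside the secure domain."""
    payload = json.dumps({"counters": counters, "nonce": nonce},
                         sort_keys=True).encode()
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"payload": payload.decode(), "tag": tag}


def verify_attestation(attestation: dict, expected_nonce: str,
                       key: bytes) -> bool:
    """Check the tag and the nonce; any modified payload fails verification."""
    payload = attestation["payload"].encode()
    expected_tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected_tag, attestation["tag"]):
        return False
    return json.loads(payload)["nonce"] == expected_nonce


key = b"demo-key-not-for-real-use"   # hypothetical device key
attestation = sign_counters({"flops": 123456}, nonce="abc", key=key)
print(verify_attestation(attestation, "abc", key))   # True

# A compromised driver rewrites the reported value after signing.
forged = dict(attestation)
forged["payload"] = forged["payload"].replace("123456", "1")
print(verify_attestation(forged, "abc", key))        # False
```

Because the tag is computed before the values leave the protected domain, software that later handles the attestation can drop it but cannot alter it undetected.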

Open Questions

  • What’s the overhead (area, power) of redundant counter infrastructure?
  • Can existing GPU counter hardware be hardened, or is a redesign needed?
  • How to handle legitimate counter resets (e.g., after chip RMA/repair)?
  • What’s the false positive rate for sanity checks under normal operation?
  • Can formal verification ensure counter logic has no hidden reset paths?

References

  • RAND WR-A3056-1, Chapter 5: Meter Protection
  • NVIDIA Nsight Compute counter documentation
  • Academic literature on fault injection countermeasures