Sparse Linear Probing for Efficient Detection
Vision models like SAM have 1024 backbone channels; Faster R-CNN's ResNet has 2048. But how many of these features actually matter for a specific task? Sparse linear probing answers this question by training L1-regularized classifiers that naturally select minimal feature subsets. The results have practical implications for both interpretability and efficient deployment.
The Sparse Probing Setup
Linear probing trains a simple classifier on frozen features. We extract backbone activations, flatten them to per-pixel feature vectors, and train logistic regression to predict pot vs. background. The key: L1 regularization drives most coefficients to exactly zero, leaving only the channels that genuinely contribute to the decision.
import numpy as np
from sklearn.linear_model import LogisticRegression

# L1-regularized logistic regression
# C controls regularization: smaller C = more sparsity
model = LogisticRegression(
    penalty='l1',
    C=0.1,                    # Inverse regularization strength
    solver='liblinear',       # liblinear supports the L1 penalty
    class_weight='balanced'   # Handle class imbalance
)
model.fit(X_train_scaled, y_train)

# Count non-zero coefficients
n_active = np.sum(np.abs(model.coef_[0]) > 1e-5)
n_total = X_train_scaled.shape[1]
print(f"Active features: {n_active} / {n_total} ({n_active / n_total * 100:.1f}%)")
The regularization strength C controls the sparsity-accuracy tradeoff. Smaller C means stronger regularization and more zero coefficients. We sweep across C values to find the sweet spot.
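A minimal sketch of such a sweep, assuming a held-out split in X_test_scaled / y_test (placeholder names alongside the training arrays above):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
import numpy as np

# Trace the sparsity-accuracy curve across regularization strengths
for C in [0.001, 0.01, 0.1, 1.0, 10.0]:
    probe = LogisticRegression(penalty='l1', C=C, solver='liblinear',
                               class_weight='balanced')
    probe.fit(X_train_scaled, y_train)
    n_active = np.sum(np.abs(probe.coef_[0]) > 1e-5)
    f1 = f1_score(y_test, probe.predict(X_test_scaled))
    print(f"C={C}: {n_active} active features, F1={f1:.2f}")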
Data Preparation
Training a sparse probe requires aligned features and labels. We use SAM's segmentation masks as ground truth—downsampled to match feature resolution. Balanced sampling ensures roughly equal positive (pot) and negative (background) pixels despite extreme class imbalance in the raw data.
import numpy as np
from scipy.ndimage import zoom

# Downsample mask to feature resolution (order=0 keeps labels binary)
mask_downsampled = zoom(full_mask, 1 / DOWNSAMPLE_FACTOR, order=0)

# Reshape features: (C, H, W) -> (H*W, C)
X = features.reshape(C, -1).T
y = mask_downsampled.ravel()

# Balanced sampling (50/50 despite ~5% pot coverage)
pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]
n_sample = min(len(pos_idx), len(neg_idx), 250000)
sample_idx = np.concatenate([
    np.random.choice(pos_idx, n_sample, replace=False),
    np.random.choice(neg_idx, n_sample, replace=False)
])
Feature standardization is critical for L1 regularization to work correctly. Without zero mean and unit variance, the penalty affects channels unequally.
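A minimal sketch of that standardization step, reusing the balanced sample indices from above; the scaler's per-channel statistics must also be applied to any features seen at inference time:
from sklearn.preprocessing import StandardScaler

# Zero-mean, unit-variance per channel so the L1 penalty treats channels equally
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X[sample_idx])
y_train = y[sample_idx]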
Sparsity-Accuracy Tradeoff
Sweeping across regularization strengths reveals a striking pattern: a small subset of channels achieves nearly full performance. In our experiments with SAM's 1024-channel backbone:
C=0.001: 12 active features, F1=0.82
C=0.01: 47 active features, F1=0.89
C=0.1: 142 active features, F1=0.93
C=1.0: 384 active features, F1=0.94
C=10.0: 621 active features, F1=0.94
The first ~50 features capture 95% of the performance. Adding hundreds more provides diminishing returns. This suggests the model encodes pot detection in a sparse, interpretable subspace rather than distributing it uniformly across all channels.
We measure performance using F1 score, which balances precision and recall—critical when both false positives and false negatives have real costs.
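For reference, a compact sketch of how F1 relates to precision and recall, computed directly from raw counts (y_test and y_pred are placeholders for held-out labels and probe predictions):
import numpy as np

# Precision, recall, and their harmonic mean (F1) from the confusion counts
tp = np.sum((y_pred == 1) & (y_test == 1))
fp = np.sum((y_pred == 1) & (y_test == 0))
fn = np.sum((y_pred == 0) & (y_test == 1))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)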
Identifying Critical Channels
The coefficient magnitudes rank channel importance. Positive coefficients indicate pot-selective channels (higher activation = more likely pot); negative coefficients indicate background-selective channels.
import numpy as np
import pandas as pd

# Rank channels by absolute coefficient magnitude
coef_abs = np.abs(model.coef_[0])
feature_importance = pd.DataFrame({
    'channel': np.arange(len(coef_abs)),
    'coefficient': model.coef_[0],
    'abs_coefficient': coef_abs,
    'rank': np.argsort(np.argsort(-coef_abs)) + 1
})

# Top 10 most important channels
print(feature_importance.nlargest(10, 'abs_coefficient'))
Visualizing the top channels reveals what they encode. In SAM's ViT backbone, top channels often show clear circular activation patterns corresponding to pot locations. Background-selective channels activate on soil, shadows, and vegetation.
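A quick way to eyeball this is to plot a top-ranked channel's spatial activation map next to the downsampled mask; a sketch assuming `features` is the (C, H, W) activation tensor from earlier:
import matplotlib.pyplot as plt

# Visualize the highest-ranked channel's activation map against the mask
top_channel = int(feature_importance.nlargest(1, 'abs_coefficient')['channel'].iloc[0])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].imshow(features[top_channel], cmap='viridis')
axes[0].set_title(f"Channel {top_channel} activation")
axes[1].imshow(mask_downsampled, cmap='gray')
axes[1].set_title("Pot mask (downsampled)")
plt.tight_layout()
plt.show()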
Comparing Sparse vs. Full Models
To validate the sparse selection, we train L2-regularized probes on different channel subsets: top 5, 10, 20, 50, 100, and all channels. The comparison confirms that most performance comes from the top channels.
Feature subset comparison (test set):
Top 5: Precision=0.78 Recall=0.81 F1=0.79
Top 10: Precision=0.84 Recall=0.86 F1=0.85
Top 20: Precision=0.89 Recall=0.90 F1=0.89
Top 50: Precision=0.92 Recall=0.93 F1=0.92
Top 100: Precision=0.93 Recall=0.94 F1=0.93
All 1024: Precision=0.94 Recall=0.95 F1=0.94
Using just 50 channels (5% of total) achieves 98% of full-model F1. The accuracy gap is small; the computational savings are substantial.
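A sketch of how this subset comparison might be run, training an L2-regularized probe on each top-k channel set ranked by the sparse probe's coefficients (test arrays are placeholder names):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

# Channels ordered from most to least important
ranked = feature_importance.sort_values('abs_coefficient', ascending=False)['channel'].values

for k in [5, 10, 20, 50, 100, len(ranked)]:
    cols = ranked[:k]
    probe = LogisticRegression(penalty='l2', C=1.0, class_weight='balanced', max_iter=1000)
    probe.fit(X_train_scaled[:, cols], y_train)
    y_pred = probe.predict(X_test_scaled[:, cols])
    print(f"Top {k}: Precision={precision_score(y_test, y_pred):.2f} "
          f"Recall={recall_score(y_test, y_pred):.2f} F1={f1_score(y_test, y_pred):.2f}")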
Deployment Implications
Sparse probing has practical deployment value. If only 50 channels matter, we can:
- Reduce memory: Store and process only critical channels (20x reduction)
- Speed inference: Fewer channels = faster downstream processing
- Simplify models: A 50-feature linear classifier is trivially deployable
- Enable edge deployment: Lightweight models fit on resource-constrained devices
The sparse probe itself can serve as a fast detector. Extract the 50 critical channels, apply the learned linear weights, and threshold the output. This runs orders of magnitude faster than full model inference.
# Efficient sparse inference (features must be standardized with the
# same per-channel statistics used during training)
sparse_channels = feature_importance.nlargest(50, 'abs_coefficient')['channel'].values
sparse_features = features[sparse_channels]    # (50, H, W)

# Apply learned weights and bias, then threshold at the decision boundary
weights = model.coef_[0][sparse_channels]
scores = np.tensordot(weights, sparse_features, axes=1) + model.intercept_[0]
predictions = (scores > 0).astype(int)         # (H, W) binary pot map
Interpretability Value
Beyond deployment, sparse probing advances interpretability. The selected channels represent the model's "pot detection circuit"—the minimal computation required for the task. We can visualize these channels, study their activation patterns, and understand what visual features drive detection.
The coefficient signs reveal semantic roles. Channels with positive coefficients detect pot-like features (dark circles, regular shapes, specific textures). Channels with negative coefficients detect background features that rule out pots (irregular edges, vegetation patterns, shadow gradients).
Cross-Model Comparison
Running sparse probing on both SAM and Faster R-CNN reveals architectural differences. SAM's ViT backbone concentrates pot detection in fewer channels (~50 for 95% performance) compared to ResNet (~80 channels). The transformer's global attention may create more efficient representations.
The selected channels differ between architectures. Some concepts (edges, shapes) appear in both; others are architecture-specific. This suggests multiple valid solutions to the same detection problem—different ways to encode "pot-ness" in feature space.
Limitations and Extensions
Sparse probing assumes features are already computed. The probe only measures discriminative value for a fixed representation—it doesn't account for feature computation cost in the backbone. A channel might be cheap in linear probing but expensive to compute in the network.
The method also assumes linear separability. Non-linear relationships between channels may matter for detection but won't be captured by a linear probe. Extensions like polynomial features or shallow MLPs can probe for these interactions.
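One such extension, sketched below as a small MLP probe on the same frozen, standardized features to check whether non-linear channel interactions buy additional accuracy (the hidden-layer size and other hyperparameters are illustrative, not tuned):
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

# Shallow non-linear probe on the same frozen features
mlp_probe = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
mlp_probe.fit(X_train_scaled, y_train)
mlp_f1 = f1_score(y_test, mlp_probe.predict(X_test_scaled))
print(f"Shallow MLP probe F1: {mlp_f1:.2f}  (compare against the linear probe)")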
Future work: combine sparse probing with circuit extraction to understand not just which channels matter but how they interact to produce detections.