Feature: ETag Strategy Alignment with Web Best Practices #101

@melvincarvalho

Summary

This issue proposes a comprehensive ETag strategy for JavaScriptSolidServer aligned with HTTP caching best practices. ETags are critical for efficient caching, bandwidth reduction, and preventing mid-air collisions during concurrent edits.

Difficulty: 35/100
Estimated Effort: 2-3 days
Dependencies: None


Current State Analysis

ETag Generation

File: src/storage/filesystem.js line 32

etag: `"${crypto.createHash('md5').update(stats.mtime.toISOString() + stats.size).digest('hex')}"`

Current approach: MD5 hash of mtime + size (metadata-based)

| Aspect | Current | Issue |
|---|---|---|
| Algorithm | MD5 | Cryptographically weak (acceptable for ETags, but not ideal) |
| Input | mtime + size | Not content-based, can miss changes if mtime preserved |
| Type | Strong (no W/ prefix) | Claims byte-identical but isn't truly content-based |
| Caching | Synchronous crypto | Blocks event loop on every stat() call |

Conditional Request Handling

File: src/utils/conditional.js (154 lines)

Well implemented:

  • If-Match header for safe updates (412 on mismatch)
  • If-None-Match for GET/HEAD (304 Not Modified)
  • If-None-Match for PUT/POST (create-only with *)
  • Wildcard (*) support
  • Proper normalization (strips W/ prefix, quotes)
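
The behaviour listed above can be sketched as one function. This is an illustration only and not the contents of src/utils/conditional.js: the names normalizeEtag, parseEtagList and evaluateConditional, and the convention of returning a status code or null, are assumptions.

```javascript
// Hypothetical sketch; not the actual src/utils/conditional.js.

// Strip an optional W/ prefix and the surrounding quotes.
function normalizeEtag(value) {
  return value.trim().replace(/^W\//, '').replace(/^"(.*)"$/, '$1');
}

// Split a comma-separated header value into normalized tags.
function parseEtagList(header) {
  return header.split(',').map(normalizeEtag).filter((tag) => tag !== '');
}

// True when the header value matches the current ETag ("*" matches any
// existing resource). `current` is null when the resource does not exist.
function headerMatches(header, current) {
  if (current === null) return false;
  return header.trim() === '*' || parseEtagList(header).includes(current);
}

// Returns the status code to short-circuit with, or null to continue.
function evaluateConditional(method, headers, currentEtag) {
  const current = currentEtag === null ? null : normalizeEtag(currentEtag);
  const ifMatch = headers['if-match'];
  const ifNoneMatch = headers['if-none-match'];

  if (ifMatch !== undefined && !headerMatches(ifMatch, current)) {
    return 412; // Precondition Failed
  }
  if (ifNoneMatch !== undefined && headerMatches(ifNoneMatch, current)) {
    // Safe methods get 304 Not Modified; writes are rejected with 412.
    return method === 'GET' || method === 'HEAD' ? 304 : 412;
  }
  return null;
}
```

With this sketch, a PUT carrying `If-None-Match: *` proceeds only when the resource does not exist yet, which is the create-only case.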

Cache-Control Headers

Current usage:

| Location | Value | Purpose |
|---|---|---|
| resource.js:225 | no-store | Mashlib HTML responses |
| resource.js:303 | no-store | Mashlib HTML responses |
| idp/index.js:202 | public, max-age=3600 | JWKS endpoint |
| idp/index.js:209 | public, max-age=3600 | OpenID configuration |
| idp/credentials.js:147 | no-store | Credentials endpoint |

Missing: No Cache-Control on regular resource responses.

Last-Modified Header

Status: ❌ Not implemented

mtime is available from stat() but not exposed as Last-Modified header.


Issues Identified

1. Strong ETag Mismatch

Severity: MEDIUM

Current ETags are formatted as strong ("abc123") but are generated from metadata, not content. Per RFC 7232:

A strong validator is representation metadata that changes value whenever a change occurs to the representation data that would be observable in the payload body of a 200 (OK) response to GET.

Problem: If a file's content changes while its size and mtime stay the same (for example, the mtime is restored afterwards with touch -r or touch -d), the ETag won't change even though the content did.
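
The failure mode can be reproduced in a temporary directory. The sketch below reuses the formula quoted from src/storage/filesystem.js; the helper names metadataEtag and demonstrateMissedChange, and the file contents, are made up for illustration.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Same formula as the current implementation: MD5 over mtime + size.
function metadataEtag(stats) {
  const input = stats.mtime.toISOString() + stats.size;
  return `"${crypto.createHash('md5').update(input).digest('hex')}"`;
}

// Write a file, overwrite it with different content of equal length,
// then restore the original timestamps. Returns the ETag before and after.
function demonstrateMissedChange() {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'etag-demo-'));
  const file = path.join(dir, 'note.txt');
  try {
    fs.writeFileSync(file, 'balance: 100');
    const before = fs.statSync(file);

    fs.writeFileSync(file, 'balance: 900'); // same size, new content
    fs.utimesSync(file, before.atime, before.mtime); // mtime restored

    const after = fs.statSync(file);
    return { before: metadataEtag(before), after: metadataEtag(after) };
  } finally {
    fs.rmSync(dir, { recursive: true, force: true });
  }
}
```

Both returned ETags are equal even though the bytes on disk differ, so a client revalidating with If-None-Match would be told its stale copy is still current.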

2. Missing Content-Based ETags for Dynamic Content

Severity: MEDIUM

Content negotiation transforms stored content:

  • Turtle → JSON-LD conversion
  • JSON-LD → Turtle conversion
  • HTML data island extraction

These transformations produce different byte streams, but they may all be served with the source file's ETag.

3. No Last-Modified Header

Severity: LOW

Some clients prefer Last-Modified over ETags. Both should be provided per best practices.

4. Container Listing ETags

Severity: MEDIUM

Container listings are dynamically generated. Current implementation may use directory mtime, but this doesn't reflect:

  • Changes to existing child files (a directory's mtime only updates when entries are added, removed, or renamed)
  • Nested container changes
  • ACL changes affecting visibility

5. Synchronous Hash Calculation

Severity: LOW (Performance)

The hash is computed synchronously on every stat() call. For high-traffic servers, this could become a bottleneck.

6. Cache-Control Strategy Missing

Severity: MEDIUM

No systematic Cache-Control headers on resource responses. This means:

  • Browsers may apply heuristic freshness and reuse a stale copy without revalidating
  • Or revalidate on every request (no caching benefit)
  • CDNs can't optimize caching

Web Best Practices

RFC 7232 - Conditional Requests

  • Strong ETags: Byte-for-byte identical representations
  • Weak ETags: Semantically equivalent (use W/ prefix)
  • If-Match: For safe mutations (optimistic concurrency)
  • If-None-Match: For caching (GET) or create-only (PUT)
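
The difference between the two validator types shows up in how they are compared. RFC 7232 (section 2.3.2) defines a strong and a weak comparison function; the following is a small sketch of both, with function names chosen here for illustration.

```javascript
// Sketch of the two comparison functions from RFC 7232, section 2.3.2.
// Inputs are full entity-tags including quotes, e.g. '"1"' or 'W/"1"'.

function isWeak(etag) {
  return etag.startsWith('W/');
}

function opaqueTag(etag) {
  return isWeak(etag) ? etag.slice(2) : etag;
}

// Strong comparison: both validators must be strong and identical.
function strongCompare(a, b) {
  return !isWeak(a) && !isWeak(b) && a === b;
}

// Weak comparison: the opaque tags must match; weakness is ignored.
function weakCompare(a, b) {
  return opaqueTag(a) === opaqueTag(b);
}
```

The RFC requires strong comparison for If-Match and weak comparison for If-None-Match, so a weak ETag can still drive 304 responses but can never satisfy If-Match.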

RFC 7234 - HTTP Caching

  • Cache-Control: Primary caching directive
  • ETag + Cache-Control: Work together for efficient revalidation
  • Last-Modified: Fallback for clients not supporting ETags

Industry Recommendations

| Source | Recommendation |
|---|---|
| MDN | Use both ETag and Last-Modified; combine with Cache-Control |
| Cloudflare | Strong ETags for byte-identical; weak for semantic equivalence |
| Fastly | Content hash for strong ETags; metadata for weak |
| Google | Set explicit Cache-Control; don't rely on heuristics |

Proposed Strategy

1. ETag Generation Tiers

Tier 1: Strong ETag (content-based)

  • Use for: Static files where content hash is feasible
  • Algorithm: SHA-256, base64url-encoded and truncated to 27 characters (162 bits)
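
A minimal sketch of a Tier 1 generator, assuming that "27 chars" means the base64url digest is truncated to its first 27 characters. The function name strongEtag is hypothetical.

```javascript
const crypto = require('node:crypto');

// Strong ETag: SHA-256 over the exact bytes served, base64url-encoded,
// truncated to 27 characters (27 * 6 = 162 bits), wrapped in quotes.
function strongEtag(content) {
  const digest = crypto.createHash('sha256').update(content).digest('base64url');
  return `"${digest.slice(0, 27)}"`;
}
```

Because the input is the content itself, the value changes whenever a single byte changes, which is what a strong validator has to guarantee.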

Tier 2: Weak ETag (metadata-based)

  • Use for: Large files, dynamic content, containers
  • Format: W/"hash" with mtime + size + extras
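
A possible shape for the Tier 2 generator. The function name weakEtag and the extras parameter (a list of extra strings such as a content type) are assumptions; the proposal does not say what the extras are.

```javascript
const crypto = require('node:crypto');

// Weak ETag from file metadata. `extras` is a hypothetical list of
// additional strings to mix into the hash.
function weakEtag({ mtimeMs, size }, extras = []) {
  const input = [mtimeMs, size, ...extras].join('\n');
  const digest = crypto.createHash('sha256').update(input).digest('base64url');
  return `W/"${digest.slice(0, 27)}"`;
}
```

The W/ prefix signals that the value only promises semantic equivalence, so it stays honest even when the hash is derived from metadata.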

Tier 3: Version ETag (for transformed content)

  • Use for: Content negotiation results
  • Format: W/"hash" derived from source ETag + transformation type
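
One way to derive a Tier 3 value is to hash the source ETag together with a label for the transformation. The function name variantEtag and the labels used below (such as 'turtle->jsonld') are illustrative assumptions.

```javascript
const crypto = require('node:crypto');

// Weak ETag for a negotiated representation: depends on the source
// resource's ETag and on which transformation produced the response.
function variantEtag(sourceEtag, transform) {
  const digest = crypto
    .createHash('sha256')
    .update(`${sourceEtag}\n${transform}`)
    .digest('base64url');
  return `W/"${digest.slice(0, 27)}"`;
}
```

Two representations of the same source, for example variantEtag(etag, 'turtle->jsonld') and variantEtag(etag, 'jsonld->turtle'), get different validators, and both change when the source changes.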

2. ETag Strategy by Resource Type

| Resource Type | ETag Strategy | Rationale |
|---|---|---|
| Small files (<1MB) | Strong (content hash) | Accurate, worth the compute |
| Large files (≥1MB) | Weak (metadata) | Too expensive to hash |
| Containers | Weak (mtime + child count) | Dynamic, changes frequently |
| Conneg results | Weak (source + transform) | Derived content |
| Mashlib/UI | Weak (version) | Static but frequently updated |
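
The table can be read as a small decision function. In the sketch below the resource shape ({ kind, size, transformed }) and the returned labels are hypothetical; the threshold is 1048576 bytes, and treating a file of exactly that size as large is an assumption.

```javascript
// Assumed threshold in bytes for hashing file content.
const STRONG_THRESHOLD = 1048576;

// Pick an ETag strategy for a resource descriptor. The descriptor shape
// and the strategy labels are illustrative, not an existing API.
function etagStrategyFor(resource, strongThreshold = STRONG_THRESHOLD) {
  if (resource.kind === 'container') return 'weak:mtime+child-count';
  if (resource.kind === 'ui') return 'weak:version';
  if (resource.transformed) return 'weak:source+transform';
  return resource.size < strongThreshold
    ? 'strong:content-hash'
    : 'weak:metadata';
}
```

Checking for a transformed response before checking size keeps small negotiated responses from reusing the strong hash of their source file.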

3. Cache-Control Strategy

| Profile | Cache-Control Value | Use Case |
|---|---|---|
| resource | private, no-cache, must-revalidate | User-generated content |
| container | private, no-cache, must-revalidate | Container listings |
| static | public, max-age=3600, stale-while-revalidate=86400 | Mashlib, schemas |
| immutable | public, max-age=31536000, immutable | Versioned assets |
| sensitive | private, no-store | Credentials, tokens |
| discovery | public, max-age=3600 | Well-known endpoints |
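
The profiles could live in a lookup table such as the one planned for src/utils/caching.js. This is a sketch under that assumption; the names CACHE_PROFILES and cacheControlFor are hypothetical.

```javascript
// Cache-Control values keyed by profile name, copied from the table above.
const CACHE_PROFILES = Object.freeze({
  resource: 'private, no-cache, must-revalidate',
  container: 'private, no-cache, must-revalidate',
  static: 'public, max-age=3600, stale-while-revalidate=86400',
  immutable: 'public, max-age=31536000, immutable',
  sensitive: 'private, no-store',
  discovery: 'public, max-age=3600',
});

// Look up the header value for a profile; unknown names are rejected so a
// typo cannot silently produce a response without Cache-Control.
function cacheControlFor(profile) {
  if (!Object.prototype.hasOwnProperty.call(CACHE_PROFILES, profile)) {
    throw new Error(`Unknown cache profile: ${profile}`);
  }
  return CACHE_PROFILES[profile];
}
```

A handler would then call something like `res.setHeader('Cache-Control', cacheControlFor('resource'))` before sending the body.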

4. Last-Modified Header

Add Last-Modified to all resource responses using stats.mtime.toUTCString().
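
A short sketch of producing the header and of honouring If-Modified-Since on later requests. The helper names are hypothetical; the one-second truncation is needed because HTTP dates carry no milliseconds while mtime does.

```javascript
// Format a Date as an HTTP date, e.g. "Mon, 15 Jan 2024 10:30:45 GMT".
function lastModifiedHeader(mtime) {
  return mtime.toUTCString();
}

// True when the resource has not changed since the date the client sent.
function notModifiedSince(ifModifiedSince, mtime) {
  const since = Date.parse(ifModifiedSince);
  if (Number.isNaN(since)) return false; // an unparseable header is ignored
  // Drop milliseconds from mtime before comparing with the header value.
  const mtimeSeconds = Math.floor(mtime.getTime() / 1000) * 1000;
  return mtimeSeconds <= since;
}
```

Without the truncation, a file modified at 10:30:45.123 would never compare as unmodified against the header value 10:30:45 that the server itself sent.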

5. Vary Header for Content Negotiation

When content negotiation is enabled, add:

Vary: Accept, Accept-Language

This tells caches that the response depends on the request's Accept and Accept-Language headers, so a copy stored for one value must not be reused for another.
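
Other middleware (CORS, for example) may already have set Vary, so the header is best merged rather than overwritten. A minimal sketch with a hypothetical helper name:

```javascript
// Merge field names into an existing Vary value without duplicates.
// Field names are compared case-insensitively; "*" already covers everything.
function appendVary(existing, fields) {
  const current = existing
    ? existing.split(',').map((field) => field.trim()).filter(Boolean)
    : [];
  if (current.includes('*')) return '*';

  const seen = new Set(current.map((field) => field.toLowerCase()));
  for (const field of fields) {
    const key = field.toLowerCase();
    if (!seen.has(key)) {
      seen.add(key);
      current.push(field);
    }
  }
  return current.join(', ');
}
```

For a response that already carries `Vary: Origin`, appendVary('Origin', ['Accept', 'Accept-Language']) yields a single combined header value.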


Implementation Plan

Phase 1: Foundation

  • Create src/utils/etag.js with tiered generation functions
  • Create src/utils/caching.js with cache profiles
  • Replace MD5 with SHA-256 in etag generation
  • Switch large files to weak ETags

Phase 2: Headers

  • Add Last-Modified header to all resource responses
  • Implement Cache-Control profiles by resource type
  • Add Vary header for conneg responses

Phase 3: Content-Based ETags

  • Implement content hashing for small files (<1MB threshold)
  • Cache computed ETags to avoid repeated hashing
  • Add async ETag computation option
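
Caching computed ETags could look like the sketch below: a small least-recently-used map keyed by path, mtime and size, so an entry is recomputed once the file's metadata changes. The name createEtagCache and its interface are assumptions; the default size mirrors the cacheMaxSize value of 10000.

```javascript
// Bounded ETag cache. A Map iterates in insertion order, so re-inserting an
// entry on every hit keeps the least recently used key at the front.
function createEtagCache(maxSize = 10000) {
  const entries = new Map();

  return {
    // `compute` is only called on a cache miss and must return the ETag.
    getOrCompute(filePath, stats, compute) {
      const key = `${filePath}\n${stats.mtimeMs}\n${stats.size}`;
      if (entries.has(key)) {
        const cached = entries.get(key);
        entries.delete(key);
        entries.set(key, cached); // mark as most recently used
        return cached;
      }
      const etag = compute();
      entries.set(key, etag);
      if (entries.size > maxSize) {
        entries.delete(entries.keys().next().value); // evict the oldest
      }
      return etag;
    },
    get size() {
      return entries.size;
    },
  };
}
```

One caveat of this design: because the key is built from metadata, an edit that preserves both mtime and size would still be served the cached hash until the entry is evicted.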

Phase 4: Container ETags

  • Improve container ETag calculation (include child count, newest mtime)
  • Consider membership hash for accurate container ETags
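
A membership hash can be sketched as follows. The input shape, an array of { name, mtimeMs } objects for the direct children, and the function name containerEtag are assumptions for illustration.

```javascript
const crypto = require('node:crypto');

// Weak ETag for a container listing, derived from the child count and from
// each direct child's name and mtime. Entries are sorted so the result does
// not depend on directory read order.
function containerEtag(children) {
  const membership = children
    .map((child) => `${child.name}\t${child.mtimeMs}`)
    .sort()
    .join('\n');
  const digest = crypto
    .createHash('sha256')
    .update(`${children.length}\n${membership}`)
    .digest('base64url');
  return `W/"${digest.slice(0, 27)}"`;
}
```

This covers additions, deletions, renames and edits of direct children. Nested container changes and ACL-dependent visibility would need extra inputs and are not handled here.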

Phase 5: Conneg ETags

  • Generate distinct ETags for transformed content
  • Include transformation type in ETag calculation

Configuration Options

{
  "etag": {
    "algorithm": "sha256",
    "strongThreshold": 1048576,
    "cacheEtags": true,
    "cacheMaxSize": 10000
  },
  "caching": {
    "defaultProfile": "resource",
    "staticMaxAge": 3600,
    "immutableAssets": false
  }
}

Comparison Matrix

Current vs Proposed

| Aspect | Current | Proposed |
|---|---|---|
| ETag algorithm | MD5 | SHA-256 |
| ETag basis | Metadata only | Content (small) / Metadata (large) |
| ETag type | Always strong | Strong or weak based on accuracy |
| Last-Modified | ❌ Missing | ✅ Always included |
| Cache-Control | ❌ Inconsistent | ✅ Profile-based |
| Vary header | ❌ Missing | ✅ For conneg |
| Container ETags | Basic mtime | Enhanced (children, membership) |
| Conneg ETags | Source ETag | Distinct per transformation |

Solid Ecosystem Comparison

| Server | ETag Strategy |
|---|---|
| Node Solid Server | Content hash (MD5) |
| Community Solid Server | Content hash + representation metadata |
| ESS (Inrupt) | Proprietary, content-based |
| JSS (current) | Metadata-based MD5 |
| JSS (proposed) | Tiered: content/metadata with proper typing |

Testing Plan

Unit Tests

  • Strong ETag format validation
  • Weak ETag format validation
  • Different ETags for conneg transforms
  • 304 responses for matching ETags
  • 412 responses for If-Match mismatch
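
For the two format checks, the entity-tag grammar from RFC 7232 (section 2.3) translates directly into regular expressions. The helper names below are hypothetical and the sketch is independent of any test framework.

```javascript
// entity-tag = [ weak ] opaque-tag
// weak       = %x57.2F            ; "W/", case-sensitive
// opaque-tag = DQUOTE *etagc DQUOTE
// etagc      = %x21 / %x23-7E / obs-text   ; any visible char except DQUOTE
const STRONG_ETAG = /^"[\x21\x23-\x7E\x80-\xFF]*"$/;
const WEAK_ETAG = /^W\/"[\x21\x23-\x7E\x80-\xFF]*"$/;

function isStrongEtag(value) {
  return STRONG_ETAG.test(value);
}

function isWeakEtag(value) {
  return WEAK_ETAG.test(value);
}
```

A unit test can then assert, for instance, that the ETag of a small file satisfies isStrongEtag and that the ETag of a container listing satisfies isWeakEtag.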

Integration Tests

  • CDN compatibility (Cloudflare, Fastly)
  • Browser caching behavior
  • Concurrent edit scenarios (If-Match)
  • Solid app compatibility (SolidOS, Penny, etc.)

Security Considerations

  1. ETag as fingerprint: ETags can be used to track users across requests. Using private in Cache-Control keeps per-user validators out of shared caches, which limits the exposure but does not remove it.

  2. Change disclosure: Content-based ETags reveal whether content changed to anyone who can read the response headers. This is inherent to caching and generally acceptable.

  3. ETag collision: SHA-256 truncated to 27 base64url characters keeps 162 bits, which remains collision-resistant. MD5 collisions are feasible but unlikely to be exploited via ETags.


Labels: caching, enhancement