# PIIvacy

> Serverless PII scrubber for LLM calls. Scrub PII out of prompts before sending to OpenAI/Anthropic/Google/etc, then put it back when the response returns. Two functions, three substitution modes (token, realistic, pass-through), 35+ regex patterns, BYO-LLM helpers, and a 267k-name fake-name table built from US public-domain data. Zero runtime dependencies. MIT licensed.

The package is `piivacy` on npm. Source: https://github.com/callieschneider/piivacy

## Why it exists

When you paste a customer's email, SSN, API key, or address into an LLM prompt, the data goes to a third party and may be retained for training, logged, or otherwise processed beyond your control. PIIvacy scrubs the input before the call and restores the original values in the response, so you can use real-world text in real-world LLM features without leaking PII.

## Core API

- `scrub(text, session?, opts?)` — async. Walks regex patterns + literal-secret pre-pass. Returns `{ text, session }` where `text` has PII replaced and `session` carries the bidirectional mapping.
- `restore(text, session)` — sync. Three-pass longest-match-first restore: tokens → realistic fakes → reference forms ("Marcus", "Marcus's", "Mr. Chen"). LLM-invented or truncated tokens pass through unchanged.
- `createSession(opts?)` — JSON-serializable plain object with sliding TTL (30 min default).
- `listRedactions(session)` — inventory for audit trails / downstream feature calls.
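
To make the scrub/restore round trip concrete, here is a minimal self-contained sketch of token-mode substitution and longest-match-first restoration. It is an illustrative stand-in, not the package's implementation: `createSketchSession`, `sketchScrub`, `sketchRestore`, the session shape, and the single simplified email regex are all assumptions made for the example.

```typescript
// Illustrative stand-in only. None of these names are piivacy exports.
interface SketchSession {
  tokenToOriginal: Record<string, string>; // "[[EMAIL_1]]" -> original value
  originalToToken: Record<string, string>; // original value -> token (reuse on repeats)
  counters: Record<string, number>;        // per-label running counter
}

function createSketchSession(): SketchSession {
  return { tokenToOriginal: {}, originalToToken: {}, counters: {} };
}

// Deliberately simple email pattern, for the demo only.
const SKETCH_EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function sketchScrub(text: string, session: SketchSession): string {
  return text.replace(SKETCH_EMAIL_RE, (match) => {
    const existing = session.originalToToken[match];
    if (existing !== undefined) return existing; // same value, same token
    const n = (session.counters["EMAIL"] ?? 0) + 1;
    session.counters["EMAIL"] = n;
    const token = `[[EMAIL_${n}]]`;
    session.originalToToken[match] = token;
    session.tokenToOriginal[token] = match;
    return token;
  });
}

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function sketchRestore(text: string, session: SketchSession): string {
  // Sort keys longest first: regex alternation takes the first alternative
  // that matches, so longer keys win over their own prefixes.
  const keys = Object.keys(session.tokenToOriginal).sort((a, b) => b.length - a.length);
  if (keys.length === 0) return text;
  const pattern = new RegExp(keys.map(escapeRegExp).join("|"), "g");
  // Single pass, so restored values are never re-scanned. Unknown tokens
  // (for example ones the LLM invented) are not in the pattern and pass through.
  return text.replace(pattern, (m) => session.tokenToOriginal[m] ?? m);
}

const demoSession = createSketchSession();
console.log(sketchScrub("Contact alice@example.com or bob@example.org.", demoSession));
console.log(sketchRestore("Replied to [[EMAIL_2]] and [[EMAIL_9]].", demoSession));
```

The sketch session is a plain object, so it survives `JSON.stringify`, which mirrors the JSON-serializable session described above. TTL handling and the realistic and reference-form passes are left out.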

## Three substitution modes

| Mode | Output | Restore | LLM fluency | Use when |
|------|--------|---------|-------------|----------|
| `token` (default) | `[[EMAIL_1]]` | bulletproof | poor | high-stakes data, audit trails |
| `realistic` | `redacted1@example.com` | longest-match (good) | excellent | conversational AI, content gen |
| `pass-through` | original value | n/a | excellent | LLM legitimately needs the value (addresses for local search, dates for scheduling) |

Modes can be set per call, per category, or per label. A hard-coded safety override applies: the `secrets`, `financial`, and `identifiers` categories can NEVER be pass-through, even if explicitly requested.
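
The way a mode request and the safety override could interact is sketched below. This is an assumption-laden illustration, not the package's code: `resolveMode`, the `ModeConfig` shape, the label names, the precedence order (label over category over per-call default), and the choice to fall back to `token` when a request is refused are all hypothetical.

```typescript
// Hypothetical sketch. The real option names and precedence may differ.
type Mode = "token" | "realistic" | "pass-through";
type Category = "secrets" | "contact" | "financial" | "identifiers" | "location" | "network";

interface ModeConfig {
  mode?: Mode;                                   // per-call default
  categories?: Partial<Record<Category, Mode>>;  // per-category overrides
  labels?: Record<string, Mode>;                 // per-label overrides
}

// Categories the document says can never be pass-through.
const NEVER_PASS_THROUGH: ReadonlySet<Category> = new Set<Category>([
  "secrets",
  "financial",
  "identifiers",
]);

function resolveMode(category: Category, label: string, config: ModeConfig): Mode {
  // Assumed precedence: most specific setting wins, "token" is the default.
  const requested: Mode =
    config.labels?.[label] ?? config.categories?.[category] ?? config.mode ?? "token";

  // The pattern list marks secrets as token-only, so ignore any other request.
  if (category === "secrets") return "token";

  // Safety override: refuse pass-through for protected categories.
  // Falling back to "token" here is an assumption of this sketch.
  if (requested === "pass-through" && NEVER_PASS_THROUGH.has(category)) return "token";

  return requested;
}

console.log(resolveMode("location", "address", { categories: { location: "pass-through" } }));
console.log(resolveMode("financial", "credit_card", { mode: "pass-through" }));
```

Keeping the override inside the resolver, rather than in each caller, means a misconfigured call site cannot bypass it.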

## What it catches

35+ regex patterns across 6 categories with format validators:

- **secrets** (token-only): OpenAI, Anthropic, GitHub, AWS, Google, Stripe, Slack, Twilio, SendGrid, Mailgun, Notion, Figma, Linear, Vercel, Supabase, Cloudflare, DigitalOcean, npm, HuggingFace, Replicate, Groq, Cohere, ElevenLabs, Datadog, Sentry, Discord/Slack webhooks, GitLab, Heroku, JWTs, PEM/SSH/PGP private keys, URL-embedded credentials (postgres, mongodb, redis, etc.)
- **contact**: emails, US phones (NANP), E.164 international phones
- **financial**: IBANs (mod-97 validated), credit cards (Luhn validated), Bitcoin and Ethereum wallets
- **identifiers**: SSNs (range-validated), CA SINs (Luhn), US passports, VINs (check-digit validated), MAC addresses
- **location**: US/UK/Canadian addresses + postcodes, lat/long
- **network**: IPv4/IPv6, dates of birth (calendar-validated)
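
Format validators are what keep a regex hit such as "any 16 digits" from being treated as a credit card. The two checks named above, Luhn and IBAN mod-97, are standard algorithms; the sketch below shows one way to write them. The function names `luhnValid` and `ibanValid` are hypothetical and the code is not taken from the package.

```typescript
// Standard Luhn checksum, used for credit cards and Canadian SINs.
function luhnValid(input: string): boolean {
  const digits = input.replace(/[\s-]/g, "");
  if (!/^\d{2,}$/.test(digits)) return false;
  let sum = 0;
  let doubleIt = false; // double every second digit, starting from the right
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9; // same as summing the two digits of the product
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return sum % 10 === 0;
}

// IBAN check: move the first four characters to the end, map letters to
// numbers (A=10 ... Z=35), and require the resulting integer mod 97 to be 1.
function ibanValid(input: string): boolean {
  const iban = input.replace(/\s/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    const value = code >= 65 ? code - 55 : code - 48;
    // A letter contributes two decimal digits, a digit contributes one.
    // Reducing at each step avoids building a number too large for a double.
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97;
  }
  return remainder === 1;
}

console.log(luhnValid("4111 1111 1111 1111"));         // true (common test card number)
console.log(ibanValid("GB82 WEST 1234 5698 7654 32")); // true (widely published example IBAN)
```

A structural check like the length and character regex in `ibanValid` is only a first gate; country-specific IBAN lengths are not enforced in this sketch.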

## What it doesn't catch (use the LLM second-pass)

Names, companies, project codenames, free-form sensitive prose, Unicode email local-parts, and non-Latin-script identifiers. The package ships `buildPiiCheckPrompt`, `parsePiiCheckResponse`, and `applyPiiCheckIssues` so you can wire any chat LLM (including an in-browser model via WebLLM) into a second-pass detection loop.
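
The general shape of a build-prompt, parse-response, apply-issues loop is sketched below. The signatures of the shipped helpers are not shown in this document, so the sketch deliberately uses its own hypothetical stand-ins (`sketchBuildPrompt`, `sketchParseResponse`, `sketchApplyIssues`, `sketchSecondPass`), an assumed JSON reply format, and a caller-supplied `chat` function. Treat it as an outline of the technique rather than the package's API.

```typescript
// Hypothetical stand-ins. These are NOT the piivacy helpers.
interface SketchIssue { text: string; label: string; }
type ChatFn = (prompt: string) => Promise<string>; // any chat LLM, supplied by the caller

function sketchBuildPrompt(scrubbed: string): string {
  return [
    "List any remaining personal data in the text below.",
    'Reply with a JSON array of {"text": string, "label": string} and nothing else.',
    "---",
    scrubbed,
  ].join("\n");
}

function sketchParseResponse(raw: string): SketchIssue[] {
  // Models often wrap JSON in prose, so take the outermost [...] span.
  const start = raw.indexOf("[");
  const end = raw.lastIndexOf("]");
  if (start === -1 || end <= start) return [];
  try {
    const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
    if (!Array.isArray(parsed)) return [];
    // Keep only well-formed entries; drop anything else silently.
    return parsed.filter(
      (x): x is SketchIssue =>
        typeof x === "object" && x !== null &&
        typeof (x as SketchIssue).text === "string" &&
        typeof (x as SketchIssue).label === "string" &&
        (x as SketchIssue).text.length > 0,
    );
  } catch {
    return []; // unparseable reply: treat as "no issues found"
  }
}

function sketchApplyIssues(
  text: string,
  issues: SketchIssue[],
  mapping: Record<string, string>, // token -> original, filled in as a side effect
): string {
  let out = text;
  const counters: Record<string, number> = {};
  for (const issue of issues) {
    if (out.indexOf(issue.text) === -1) continue; // skip spans the model made up
    const n = (counters[issue.label] ?? 0) + 1;
    counters[issue.label] = n;
    const token = `[[${issue.label}_${n}]]`;
    mapping[token] = issue.text;
    out = out.split(issue.text).join(token);
  }
  return out;
}

async function sketchSecondPass(
  text: string,
  chat: ChatFn,
  mapping: Record<string, string>,
): Promise<string> {
  const raw = await chat(sketchBuildPrompt(text));
  return sketchApplyIssues(text, sketchParseResponse(raw), mapping);
}
```

Because the detector's output is untrusted, the parser fails closed to an empty list and the apply step ignores any span that does not literally occur in the text.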

## Documentation

- [Live demo + interactive playground](https://piivacy.dev/)
- [Quick start](https://piivacy.dev/#docs)
- [Pattern catalog](https://piivacy.dev/#patterns)
- [LLM helper workflows](https://piivacy.dev/#llm)
- [Names data + bucketing algorithm](https://piivacy.dev/#names)
- [Full README](https://github.com/callieschneider/piivacy#readme)

## Optional

- [GitHub repository](https://github.com/callieschneider/piivacy)
- [Issue tracker](https://github.com/callieschneider/piivacy/issues)
- [Public-domain data sources](https://piivacy.dev/#names)

## Disclaimer

PIIvacy reduces the surface area for accidental PII leakage. It does not provide cryptographic guarantees, replace your security review process, or constitute compliance with GDPR, CCPA, HIPAA, or similar regulations. It is defense in depth: regex is the fast filter, the LLM second-pass is the accurate filter, and your security review is the human in the loop.