The Hidden Cost of Accidental Data Exposure
Every day, sensitive information leaks through channels we don’t think twice about: a stack trace pasted into a Slack message, a customer email forwarded to a vendor, a debug log shared in a GitHub issue. These aren’t malicious breaches—they’re ordinary workflows that happen to contain data that shouldn’t be shared.
The problem isn’t carelessness. It’s that PII (Personally Identifiable Information) is often invisible until you know to look for it.
What Counts as PII?
PII is any data that can identify a specific individual, either directly or when combined with other information. The definition varies by regulation, but generally includes:
Direct identifiers — Data that points to a specific person on its own:
Full names
Email addresses
Phone numbers
Social Security Numbers
Passport and driver’s license numbers
Biometric data
Indirect identifiers — Data that can identify someone when combined:
IP addresses
Device IDs
Location data
Dates of birth
Employment information
Financial data — Often regulated separately but equally sensitive:
Credit card numbers
Bank account and routing numbers
IBAN codes
Authentication secrets — Not traditionally “PII” but equally dangerous:
API keys and tokens
Passwords
Private keys
Session tokens
The last category is often overlooked. An exposed AWS key isn't personal information, but it is a key to the PII vault: it can grant access to systems containing millions of personal records. The blast radius of a leaked credential often exceeds that of a leaked SSN, which is why effective scanning must detect both PII and secrets.
Where PII Hides
The obvious places—databases, CRM systems, HR files—usually have controls. The risk is in the unstructured data that flows through daily work:
Support tickets: A customer reports a bug and includes their full account details. The ticket gets escalated, exported to a spreadsheet, shared with engineering. Each hop increases exposure.
Log files: Application logs capture request parameters, user IDs, IP addresses, sometimes full payloads. Developers copy these into debugging sessions, paste them into chat, attach them to tickets.
Code repositories: Test files contain sample data. Configuration files contain connection strings. Comments contain “temporary” credentials. README files contain example API calls with real tokens.
AI prompts: Users paste customer conversations, error messages, and database queries into ChatGPT or Claude for help. Depending on the provider's policy, these prompts may be used for model training unless the user has explicitly opted out.
Email threads: A message gets forwarded, then forwarded again. By the fifth hop, nobody remembers that the original contained a customer’s SSN in the signature block.
Screenshots: A developer shares a screenshot of a bug. The browser’s address bar shows a URL with a session token. The page content shows a user’s profile.
The Regulatory Landscape
Data protection regulations have teeth. Under GDPR, fines can reach €20 million or 4% of global annual revenue—whichever is higher. CCPA allows statutory damages of $100–$750 per consumer, per incident. A single leak of 1,000 customer emails could theoretically result in a $750,000 liability.
But the real cost is often operational:
Breach notification requirements: GDPR requires notification within 72 hours. This means incident response, legal review, customer communication—all on a tight timeline.
Right to erasure: If you can’t track where data has been copied, you can’t guarantee deletion.
Audit requirements: Demonstrating compliance requires knowing what data you have and where it lives.
Most breaches don’t make headlines. They’re discovered during audits, reported by customers, or found by security researchers. The exposure may have existed for months before detection.
Detection Is Harder Than It Looks
Why doesn’t everyone just scan for PII before sharing? Because detection is genuinely difficult:
Format variation: Phone numbers appear as (555) 123-4567, 555-123-4567, +1 555 123 4567, 5551234567. Email addresses get obfuscated as john[at]example[dot]com. Credit cards have spaces, dashes, or neither. (A normalization sketch follows this list.)
False positives: A 9-digit number might be an SSN or a random ID. A 16-digit number might be a credit card or a tracking number. Without validation, scanners either miss things or flag everything.
Context matters: “John Smith” in a novel isn’t PII. “John Smith, Account #12345” in a support ticket is. Simple pattern matching can’t distinguish.
Secrets are diverse: AWS keys start with AKIA. GitHub tokens start with ghp_. Stripe keys start with sk_live_. Generic API keys follow no pattern at all. Each requires specific detection logic.
Encoding layers: Data gets base64 encoded, embedded in JSON, nested in XML. A scanner that only checks surface text misses encoded content.
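To make the variation problem concrete, here is a small Python sketch. Everything in it is illustrative (the helper names are ours and the patterns are simplified): it de-obfuscates bracketed emails and matches phone numbers by their digit stream rather than their surface formatting.

```python
import re

def deobfuscate(text: str) -> str:
    """Rewrite john[at]example[dot]com style obfuscation into plain form."""
    text = re.sub(r"\s*[\[(]at[\])]\s*", "@", text, flags=re.IGNORECASE)
    return re.sub(r"\s*[\[(]dot[\])]\s*", ".", text, flags=re.IGNORECASE)

def find_phones(text: str):
    """Match loosely, then validate the digit stream, not the formatting."""
    for m in re.finditer(r"\+?1?[\s.(-]*\d{3}[\s.)-]*\d{3}[\s.-]*\d{4}\b", text):
        digits = re.sub(r"\D", "", m.group())
        if len(digits) in (10, 11):  # US number, with or without country code
            yield m.group().strip()

sample = "Call (555) 123-4567 or mail john[at]example[dot]com"
print(list(find_phones(sample)))
print(deobfuscate(sample))
```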
Building a Detection Approach
Effective PII detection combines multiple techniques:
Pattern matching handles well-formatted data. SSNs follow XXX-XX-XXXX. Credit cards carry issuer-specific prefixes (4 for Visa, 51–55 for Mastercard). Email addresses have predictable structure.
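A minimal sketch of that pattern layer in Python; the expressions are simplified for illustration, not production rules:

```python
import re

PATTERNS = {
    # 4xxx = Visa, 51xx-55xx = Mastercard; optional spaces or dashes between groups
    "credit_card": re.compile(r"\b(?:4\d{3}|5[1-5]\d{2})(?:[ -]?\d{4}){3}\b"),
    "ssn":         re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email":       re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def scan(text: str):
    """Yield (kind, match, span) for every pattern hit in the text."""
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            yield label, m.group(), m.span()

for hit in scan("Card 4111 1111 1111 1111, SSN 123-45-6789"):
    print(hit)
```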
Checksum validation reduces false positives. Credit card numbers include a check digit validated by the Luhn algorithm. IBANs have country-specific formats with built-in verification. A random 16-digit number fails these checks.
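The Luhn check itself is only a few lines. This is the standard algorithm; the function name and length threshold are our choices:

```python
def luhn_valid(number: str) -> bool:
    """Double every second digit from the right, sum everything, check mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:  # too short to be a card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:        # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9        # same as summing the two digits of d
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True: a well-known test number
print(luhn_valid("4111 1111 1111 1112"))  # False: check digit is wrong
```

A random 16-digit tracking number has only a one-in-ten chance of passing, which is exactly the false-positive reduction described above.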
Prefix detection catches credentials. Cloud provider keys use identifiable prefixes: AKIA for AWS access keys, ghp_ for GitHub tokens, sk_live_ for Stripe, AIza for Google APIs. These prefixes exist specifically to enable detection.
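Because the prefixes are stable and documented, detection can be a plain lookup. A sketch using the prefixes named above:

```python
KNOWN_PREFIXES = {
    "AKIA":     "AWS access key ID",
    "ghp_":     "GitHub personal access token",
    "sk_live_": "Stripe live secret key",
    "AIza":     "Google API key",
}

def classify_token(token: str) -> str | None:
    """Return the credential type if the token starts with a known prefix."""
    for prefix, label in KNOWN_PREFIXES.items():
        if token.startswith(prefix):
            return label
    return None

print(classify_token("AKIAIOSFODNN7EXAMPLE"))  # AWS's documented example key ID
```

A real scanner would also check length and character set so that prose merely mentioning a prefix doesn't trigger a hit.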
Confidence scoring acknowledges uncertainty. A pattern match against an SSN format with proper separators is high confidence. A 9-digit number without context is low confidence. Surfacing the score lets users prioritize review.
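One way to express that, with made-up weights (real scanners tune these against labeled data):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    kind: str
    value: str
    confidence: float  # 0.0 to 1.0

def score_ssn(candidate: str, context: str) -> Finding:
    """Illustrative scoring: separators and nearby keywords raise confidence."""
    confidence = 0.3                            # bare 9 digits: weak evidence
    if "-" in candidate:                        # XXX-XX-XXXX formatting
        confidence += 0.4
    if any(k in context.lower() for k in ("ssn", "social security")):
        confidence += 0.2
    return Finding("ssn", candidate, min(confidence, 1.0))

print(score_ssn("123-45-6789", "Customer SSN: 123-45-6789"))
```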
A Tool to Help
We built Privacy Scanner to make PII detection accessible. It’s a free, browser-based tool that identifies sensitive data in text and files.
The scanner detects:
Email addresses (including obfuscated formats)
Phone numbers (US and international)
Social Security Numbers
Credit cards (with Luhn validation)
Physical addresses (US, UK, EU formats)
Bank account numbers and IBANs
Cloud credentials (AWS, GitHub, Stripe, Google, Azure, Slack)
JWT tokens and private key headers
Passwords in plaintext
Each detection includes a confidence score and contributes to an overall risk assessment. The tool generates a redacted preview you can copy directly.
For sensitive use cases, there's a browser-only mode where your text never leaves the browser: the backend returns only match coordinates, and masking happens locally.
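The local-masking step is simple enough to sketch. Shown here in Python for readability (a browser tool would do this client-side in JavaScript): given only (start, end) coordinates, the redacted output can be assembled without the server ever producing it.

```python
def apply_mask(text: str, spans: list[tuple[int, int]], mask: str = "█") -> str:
    """Replace each detected span with mask characters, entirely client-side."""
    out, cursor = [], 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])
        out.append(mask * (end - start))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

# Suppose the scanner reported one span covering an email address:
print(apply_mask("Contact john@example.com today", [(8, 24)]))
# Contact ████████████████ today
```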
No signup required. No data stored.
Building Habits
Tools help, but habits matter more. Some practices that reduce accidental exposure:
Assume logs contain PII. Before sharing any log output, scan it. Better yet, configure your logging framework to redact sensitive fields at the source.
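With Python's standard logging module, for example, a filter can scrub records before any handler sees them. The regex is deliberately minimal; treat this as a starting point, not a complete redactor:

```python
import logging
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class RedactingFilter(logging.Filter):
    """Scrub obvious PII from log records before they are emitted."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL.sub("[EMAIL REDACTED]", str(record.msg))
        return True  # keep the record, just cleaned

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app")
logger.addFilter(RedactingFilter())

logger.info("password reset sent to jane@example.com")
# INFO:app:password reset sent to [EMAIL REDACTED]
```

Messages built with lazy %-style arguments would need record.args scrubbed as well.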
Sanitize before escalation. When forwarding a customer issue, take 30 seconds to remove identifying details that aren’t necessary for resolution.
Use separate test data. Maintain a library of fake but realistic test data. Never copy production data into development environments without anonymization.
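One option is the Faker library, which generates realistic-looking but entirely fabricated values:

```python
from faker import Faker  # pip install faker

fake = Faker()
test_user = {
    "name":  fake.name(),
    "email": fake.email(),
    "phone": fake.phone_number(),
}
print(test_user)  # plausible, but belongs to no real person
```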
Review before commit. Add PII scanning to your pre-commit hooks. Catch credentials and test data before they enter version control.
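A hook doesn't need to be elaborate; even grepping the staged diff for the key prefixes mentioned earlier catches the worst cases. A minimal sketch (the prefix list is illustrative; save as .git/hooks/pre-commit and mark it executable):

```python
#!/usr/bin/env python3
"""Block commits whose staged changes contain known credential prefixes."""
import subprocess
import sys

SECRET_PREFIXES = ("AKIA", "ghp_", "sk_live_", "AIza")

# Only the lines actually being committed
staged = subprocess.run(
    ["git", "diff", "--cached", "--unified=0"],
    capture_output=True, text=True, check=True,
).stdout

for line in staged.splitlines():
    if line.startswith("+") and any(p in line for p in SECRET_PREFIXES):
        sys.exit(f"Refusing to commit: possible credential in:\n  {line[:100]}")
```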
Question AI prompts. Before pasting into an LLM, ask: does this contain customer data? Could this identify someone? Is there a way to get the same help with anonymized input?
Privacy incidents rarely involve sophisticated attacks. They happen when ordinary people do ordinary things without realizing what’s embedded in the data they’re handling. The fix isn’t perfect security—it’s awareness and accessible tools.
Questions or feedback? Post a comment below; we're improving the scanner based on what we hear.