TeamPrompt Research · Q2 2026 (1st edition)
State of Prompt Data Leakage — Q2 2026
What sensitive data actually leaks into AI prompts, by category, role, and tool. A synthesis of public research plus aggregate signals from TeamPrompt's free Prompt PII Scanner.
CC-BY 4.0 license. Free to cite with attribution; no email gate.
Headline finding
11.3% of prompts sent to consumer AI tools contain confidential data, based on combined Cyberhaven (2023) and TeamPrompt PII Scanner aggregate (Q2 2026).
Key findings
Five takeaways
1. 11.3% of prompts to consumer AI tools contain confidential data, synthesized across Cyberhaven (2023) and TeamPrompt's PII Scanner aggregate (Q2 2026). The number has not moved meaningfully despite three years of AI security investment, because the controls protecting the channel have largely not changed.
2. Engineering and customer-support teams have the highest per-prompt leak rates (14.2% and 12.8% respectively). The pattern: high prompt volume, high data-sensitivity per prompt, low organizational visibility into either.
3. The long tail of AI tools (Poe, Character.AI, the dozen Perplexity clones) has a leak rate (18.4%) nearly 2x that of major providers. This is the strongest argument for DNS-level allowlisting as the first control rather than the last.
4. Only 21% of organizations with a written AI usage policy enforce it technically. The other 79% rely on employee discipline alone, which fails predictably under deadline pressure.
5. Source code is the most-leaked single category at 24% of all flagged prompts. Consumer-tier ChatGPT, Claude, and Gemini may use submitted code for model improvement absent a contract that says otherwise.
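To make the allowlisting idea in takeaway 3 concrete, here is a minimal sketch of the decision a DNS gateway might make for a hostname already classified as an AI tool. The domain list and the `is_allowed` helper are illustrative assumptions, not a recommended allowlist or any vendor's API.

```python
# Hypothetical allowlist of sanctioned AI tool domains (illustrative only).
ALLOWED_AI_DOMAINS = frozenset({"chatgpt.com", "claude.ai", "gemini.google.com"})


def is_allowed(hostname: str, allowlist: frozenset = ALLOWED_AI_DOMAINS) -> bool:
    """Return True if hostname is an allowlisted domain or a subdomain of one."""
    # Normalize case and drop a trailing root dot ("claude.ai." -> "claude.ai").
    host = hostname.strip().lower().rstrip(".")
    # Match on a full label boundary so "notclaude.ai" does not pass as "claude.ai".
    return any(host == d or host.endswith("." + d) for d in allowlist)
```

Under this default-deny design, a long-tail tool that nobody has reviewed is blocked until someone adds it, which is the property that makes allowlisting useful as a first control.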
By the numbers
Eight headline statistics
11.3% of prompts contain confidential data
Combined Cyberhaven (11.0%) and TeamPrompt scanner aggregate (11.6%), across 50k+ prompts from non-enterprise tiers.
Source: Cyberhaven Labs (2023)
of knowledge workers use AI tools weekly
But only 38% of organizations have a written AI usage policy. The gap is shadow AI.
Source: Microsoft Work Trend Index (2025)
average cost per data breach
AI-related incidents carry a 9.6% premium versus baseline due to compounded discovery and notification costs.
Source: IBM Cost of a Data Breach Report (2024)
of prompts contain credentials or API keys
Mostly developer prompts: GitHub PATs, Stripe keys, OpenAI/Anthropic keys, AWS access keys, JWTs. Highest-severity category.
Source: TeamPrompt Prompt PII Scanner aggregate (2026)
of healthcare prompts contain PHI
Includes diagnosis codes, MRN labels, dates of birth, patient names. Each is a per-prompt HIPAA exposure absent a BAA.
Source: BSI AI Threat Landscape 2025 (2025)
of engineering prompts include proprietary source code
Consumer-tier ChatGPT may use this data for model training. Enterprise-tier providers commit otherwise by contract.
Source: TeamPrompt Prompt PII Scanner aggregate (2026)
12+ AI tools active per 100-person organization
Median count from DNS-gateway log audits. CISOs typically know about 3-5. The gap is shadow AI again.
Source: Cloud Security Alliance 2026 State of AI in Enterprise (2026)
38% of organizations have a written AI acceptable use policy
Of those, only 21% have technical enforcement of the policy, which leaves the other 79% on policy-by-honor-system.
Source: ISACA State of AI Governance 2025 (2025)
Category breakdown
What's actually in the leaked 11.3%
Source code (24% of flagged prompts): engineering teams. Most consumer-tier providers may use it for training.
Names, emails, phone numbers in CS / sales workflows.
Highest severity. AWS keys, Stripe, GitHub PATs, JWTs, PEM blocks.
Strategy decks, hiring docs, contract drafts.
ICD-10 codes, MRN labels, DOB, patient details. HIPAA exposure.
Card numbers, account IDs, internal financial reports.
Non-medical, non-financial PII. Often in HR-adjacent workflows.
Database passwords, SSO tokens, internal URLs, OAuth client secrets.
Percentages reflect share of flagged prompts containing each category. Single prompts often contain multiple categories, so percentages sum to more than 100%.
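A small sketch of why multi-label shares can exceed 100% in total. The prompts and category names below are made-up examples, not scanner data, and `category_shares` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical flagged prompts; each set lists the categories found in one prompt.
flagged_prompts = [
    {"source_code", "credentials"},
    {"source_code"},
    {"customer_pii"},
    {"credentials", "internal_secrets"},
]


def category_shares(prompts):
    """Share of flagged prompts containing each category (a prompt can count toward several)."""
    counts = Counter(cat for cats in prompts for cat in cats)
    total = len(prompts)
    return {cat: n / total for cat, n in counts.items()}


shares = category_shares(flagged_prompts)
for cat, share in sorted(shares.items()):
    print(f"{cat}: {share:.0%}")
print(f"sum of shares: {sum(shares.values()):.0%}")
```

Because the denominator is the number of flagged prompts rather than the number of detections, two of the four sample prompts each count toward two categories and the shares total more than 100%.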
By role
Leak rate by job function
| Role | Typical data exposed | Leak rate |
|---|---|---|
| Engineers / Developers | Source code, API keys | 14.2% |
| Customer Support | Customer PII | 12.8% |
| Finance / Accounting | Financial records | 11.6% |
| Healthcare clinical staff | PHI | 9.4% |
| Legal / Compliance | Contracts, internal docs | 7.9% |
| Marketing | Customer lists, campaign data | 6.1% |
| HR / People Ops | Employee PII | 5.3% |
| Sales | Customer records, deal data | 4.8% |
By tool
Leak rate by AI provider
| Tool | Notes | Leak rate |
|---|---|---|
| ChatGPT (consumer) | Highest volume. Consumer tier may train on prompts. | 12.1% |
| Claude.ai (consumer) | Claude commits to no training on consumer tier as of 2024. | 9.8% |
| Gemini (consumer) | Workspace tier excluded; consumer Gemini may use for product improvement. | 10.5% |
| Copilot (Microsoft) | M365 Copilot Business and above commits to no training. | 7.2% |
| Perplexity | Mid-tier coverage; growing usage in research workflows. | 8.9% |
| Long-tail (Poe, Character.AI, etc.) | Highest leak rate. Often used for unsanctioned workflows. | 18.4% |
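The "nearly 2x" comparison between long-tail tools and major providers can be sanity-checked against the table. The sketch below assumes the five named tools count as "major providers" and takes an unweighted mean, since per-tool prompt volume is not given here; a volume-weighted figure could differ.

```python
# Leak rates (percent) copied from the table above.
major = {
    "ChatGPT": 12.1,
    "Claude.ai": 9.8,
    "Gemini": 10.5,
    "Copilot": 7.2,
    "Perplexity": 8.9,
}
long_tail = 18.4

# Assumption: unweighted mean across the five named tools.
major_mean = sum(major.values()) / len(major)
ratio = long_tail / major_mean

print(f"major-provider mean: {major_mean:.1f}%")
print(f"long-tail ratio: {ratio:.2f}x")
```

Under these assumptions the major-provider mean comes out at about 9.7% and the ratio at about 1.9, consistent with "nearly 2x" but short of a full doubling.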
Methodology
How we got these numbers
This report combines three data sources:
(1) Public studies. We synthesize numbers from Cyberhaven's 2023 paste-rate study, IBM's Cost of a Data Breach Report 2024, the BSI AI Threat Landscape 2025, ISACA's State of AI Governance 2025, the Cloud Security Alliance's 2026 State of AI in Enterprise, and OWASP's LLM Top 10 2025 edition. Every figure carries a citation. Where studies disagree we use the most recent. Where they agree we cite both.
(2) TeamPrompt Prompt PII Scanner aggregate signals. The free tool at teamprompt.app/tools/prompt-pii-scanner runs entirely client-side — no prompt content is sent to our servers. We do anonymously log structural metadata: category counts per scan, risk severity bands, and approximate character lengths. These signals are aggregated weekly and contribute to the rates above with explicit attribution.
(3) Category and role attributions. The category breakdown (credentials, source code, PHI, etc.) is derived from scanner-flagged categories normalized against published taxonomies. The role attribution combines TeamPrompt usage signals where employer/role is voluntarily disclosed by operators with public role-distribution data from Verizon DBIR 2024.
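To illustrate what "structural metadata only" can look like, here is a minimal sketch in Python. It is not the scanner's implementation: the two detectors, the severity ranking, the band names, and the 100-character rounding are all assumptions chosen for illustration, and the real tool runs in the browser.

```python
import re

# Illustrative detectors only; the scanner's actual rules are not published in this report.
PATTERNS = {
    # AWS access key IDs (AKIA + 16 chars) and classic GitHub PATs (ghp_ + 36 chars).
    "credentials": re.compile(r"\bAKIA[0-9A-Z]{16}\b|\bghp_[A-Za-z0-9]{36}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}
SEVERITY = {"credentials": 3, "email": 1}  # assumed ranking, higher is worse
BANDS = {0: "none", 1: "low", 2: "medium", 3: "high"}


def scan_metadata(prompt: str) -> dict:
    """Return structural metadata only; the prompt text itself is never included."""
    counts = {name: len(rx.findall(prompt)) for name, rx in PATTERNS.items()}
    # The band is driven by the most severe category that matched at least once.
    worst = max((SEVERITY[name] for name, c in counts.items() if c), default=0)
    return {
        "category_counts": counts,
        "severity_band": BANDS[worst],
        # Rounded to the nearest 100 characters so the exact size is not recorded.
        "approx_length": round(len(prompt), -2),
    }
```

The point of the design is that the returned dictionary contains counts, a band, and a rounded length, so nothing in it can be used to reconstruct the prompt.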
What this report is not. It is not a longitudinal study with rigorous N and control groups. The aggregate scanner signal is biased toward users who self-select to test their own prompts; the public studies are bounded by their original methodologies. We treat the numbers as directionally accurate, useful for prioritization and budgeting, and not as causal-inference-grade evidence.
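As a rough check on how the headline figures fit together, the sketch below recomputes two of them from inputs quoted in this report. It assumes the 11.3% is an unweighted mean of the two source rates (the weighting is not stated here) and that the 21% enforcement figure applies to the 38% of organizations with a written policy.

```python
# Inputs are figures quoted in this report.
cyberhaven_rate = 11.0  # percent, Cyberhaven (2023)
scanner_rate = 11.6     # percent, TeamPrompt scanner aggregate (Q2 2026)

# Assumption: unweighted mean of the two rates.
combined = (cyberhaven_rate + scanner_rate) / 2

policy_share = 0.38    # organizations with a written AI usage policy
enforced_share = 0.21  # of those, share with technical enforcement

# Share of all organizations with a technically enforced policy.
enforced_overall = policy_share * enforced_share

print(f"combined leak rate: {combined:.1f}%")
print(f"enforced policy across all organizations: {enforced_overall:.1%}")
```

Read this way, the unweighted mean reproduces the 11.3% headline, and roughly 8% of all organizations have a policy that is enforced technically.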
We will release a Q3 2026 edition with broader denominator coverage and a per-industry breakdown. Methodology updates will be appended to this same page with a changelog.
Press-ready summary
The one-pager
- 11.3% of prompts sent to consumer AI tools contain confidential data — and that rate has held flat for three years.
- Engineering and customer support teams have the highest per-prompt leak rates (14.2% and 12.8%).
- Long-tail AI tools (Poe, Character.AI, model routers) have a leak rate of 18.4%, nearly 2x that of major providers.
- Only 21% of orgs with an AI policy enforce it technically. The other 79% rely on employee discipline.
- Source code is the most-leaked category at 24% of flagged prompts. Consumer ChatGPT may train on it.
- 12+ AI tools are active per 100-person org. Most CISOs know about 3-5.
Citation: TeamPrompt, "State of Prompt Data Leakage — Q2 2026," published May 20, 2026. CC-BY 4.0.
Your team has prompts going somewhere right now.
See what's in them — for free, in your browser. The same engine that produces these aggregate signals also runs on the public scanner.