TeamPrompt Research · Q2 2026 (1st edition)

State of Prompt Data Leakage — Q2 2026

Name: State of Prompt Data Leakage — Q2 2026
Creator: TeamPrompt
Published: 2026-05-20
License: https://creativecommons.org/licenses/by/4.0/

What sensitive data actually leaks into AI prompts, by category, role, and tool. A synthesis of public research plus aggregate signals from TeamPrompt's free Prompt PII Scanner.

By Eric Campton, Founder, TeamPrompt
Published May 20, 2026

CC-BY 4.0 license. Free to cite with attribution; no email gate.

Headline finding

11.3%

of prompts sent to consumer AI tools contain confidential data — based on combined Cyberhaven (2023) and TeamPrompt PII Scanner aggregate (Q2 2026).

Key findings

Five takeaways

1. 11.3% of prompts to consumer AI tools contain confidential data, a figure synthesized across Cyberhaven (2023) and TeamPrompt's PII Scanner aggregate (Q2 2026). The number has not moved meaningfully despite three years of AI security investment, because the controls protecting the channel have largely not changed.
2. Engineering and customer-support teams have the highest per-prompt leak rates (14.2% and 12.8% respectively). The pattern: high prompt volume, high data sensitivity per prompt, low organizational visibility into either.
3. The long tail of AI tools (Poe, Character.AI, the dozen Perplexity clones) has a leak rate (18.4%) nearly 2x that of major providers. This is the strongest argument for DNS-level allowlisting as the first control rather than the last.
4. Only 21% of organizations with a written AI usage policy enforce it technically. The other 79% rely on employee discipline alone, which fails predictably under deadline pressure.
5. Source code is the most-leaked single category at 24% of all flagged prompts. Consumer-tier ChatGPT, Claude, and Gemini may use submitted code for model improvement absent a contract that says otherwise.
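
The DNS-level allowlisting mentioned in takeaway 3 can be illustrated as a suffix match on the queried hostname. This is a minimal sketch rather than a resolver configuration: `ALLOWED_AI_DOMAINS` and `is_allowed` are hypothetical names, and the listed domains are placeholders for whatever tools an organization has actually sanctioned.

```python
# Hypothetical allowlist of sanctioned AI tool domains (an assumption for
# illustration; substitute the domains your organization has approved).
ALLOWED_AI_DOMAINS = {"chatgpt.com", "claude.ai"}


def is_allowed(hostname: str, allowlist: set[str] = ALLOWED_AI_DOMAINS) -> bool:
    """Return True if hostname is an allowlisted domain or a subdomain of one."""
    # Normalize case and drop the trailing dot of a fully qualified name.
    host = hostname.lower().rstrip(".")
    # Require an exact match or a match on a label boundary, so that
    # "notclaude.ai" is not treated as a subdomain of "claude.ai".
    return any(host == d or host.endswith("." + d) for d in allowlist)
```

Under this sketch, a query for an AI-categorized domain that is not on the list (for example a long-tail tool) would be refused by default, which is what makes the allowlist a first control rather than a cleanup step.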

By the numbers

Eight headline statistics

11.3%

of prompts contain confidential data

Combined Cyberhaven (11.0%) and TeamPrompt scanner aggregate (11.6%) — across 50k+ prompts from non-enterprise tiers.

Source: Cyberhaven Labs (2023); TeamPrompt Prompt PII Scanner aggregate (2026)
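
The 11.3% headline is consistent with a simple average of the two quoted rates. The sketch below assumes equal weighting, which the report does not state; a sample-size-weighted average would need per-source prompt counts that are not published here. The `combine_rates` helper is hypothetical.

```python
def combine_rates(rates, weights=None):
    """Weighted mean of percentage rates; equal weights if none are given."""
    if weights is None:
        weights = [1.0] * len(rates)
    return sum(r * w for r, w in zip(rates, weights)) / sum(weights)


# Rates quoted in the report, in percent.
cyberhaven_rate = 11.0
teamprompt_rate = 11.6

# Assumption: equal weighting of the two sources.
combined = combine_rates([cyberhaven_rate, teamprompt_rate])
print(f"{combined:.1f}%")
```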

73%

of knowledge workers use AI tools weekly

But only 38% of organizations have a written AI usage policy. The gap is shadow AI.

Source: Microsoft Work Trend Index (2025)

4.88M USD

average cost per data breach

AI-related incidents carry a 9.6% premium over baseline because discovery and notification costs compound.

Source: IBM Cost of a Data Breach Report (2024)

1.8%

of prompts contain credentials or API keys

Mostly developer prompts: GitHub PATs, Stripe keys, OpenAI/Anthropic keys, AWS access keys, JWTs. Highest-severity category.

Source: TeamPrompt Prompt PII Scanner aggregate (2026)
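
A rough idea of how credential detection of this kind can work is a set of regular expressions for well-known token formats. The patterns below are illustrative assumptions based on commonly documented prefixes; they are not the rules used by TeamPrompt's scanner, and `CREDENTIAL_PATTERNS` and `find_credentials` are hypothetical names.

```python
import re

# Illustrative patterns for well-known public token formats. These are
# assumptions for this sketch, not the rules of any particular scanner.
CREDENTIAL_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "stripe_secret_key": re.compile(r"\bsk_live_[A-Za-z0-9]{24,}\b"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
}


def find_credentials(prompt: str) -> list[str]:
    """Return the names of the credential patterns that match the prompt."""
    return [
        name
        for name, pattern in CREDENTIAL_PATTERNS.items()
        if pattern.search(prompt)
    ]
```

Prefix-based patterns like these favor precision over recall: they miss unprefixed secrets such as database passwords, which would need entropy checks or context keywords instead.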

6.4%

of healthcare prompts contain PHI

Includes diagnosis codes, MRN labels, dates of birth, patient names. Absent a business associate agreement (BAA), each such prompt is a potential HIPAA exposure.

Source: BSI AI Threat Landscape 2025 (2025)

8.7%

of engineering prompts include proprietary source code

Consumer-tier ChatGPT may use this data for model training. Enterprise-tier providers commit otherwise by contract.

Source: TeamPrompt Prompt PII Scanner aggregate (2026)

12+

AI tools active per 100-person organization

Median count from DNS-gateway log audits. CISOs typically know about 3-5. The gap is shadow AI again.

Source: Cloud Security Alliance 2026 State of AI in Enterprise (2026)

38%

of organizations have a written AI acceptable use policy

Of those, only 21% have technical enforcement of the policy — making the other 79% policy-by-honor-system.

Source: ISACA State of AI Governance 2025 (2025)
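
Chaining the two percentages gives the share of all organizations in each state. The arithmetic below assumes both figures describe the same survey population, which the report does not confirm.

```python
# Figures quoted in the report, as fractions.
policy_share = 0.38            # organizations with a written policy
enforced_given_policy = 0.21   # of those, share with technical enforcement

# Assumption: both figures refer to the same population of organizations.
enforced_overall = policy_share * enforced_given_policy
honor_system_overall = policy_share * (1 - enforced_given_policy)
no_policy = 1 - policy_share

print(f"policy with enforcement:    {enforced_overall:.1%}")
print(f"policy without enforcement: {honor_system_overall:.1%}")
print(f"no written policy:          {no_policy:.1%}")
```

Under that assumption, roughly 8% of all organizations have a technically enforced policy, about 30% have a policy on the honor system, and 62% have no written policy.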

Category breakdown

What's actually in the leaked 11.3%

Source code

24%

Engineering teams. Most consumer-tier providers may use it for training.

Customer records

19%

Names, emails, phone numbers in CS / sales workflows.

Credentials / API keys

17%

Highest severity. AWS keys, Stripe, GitHub PATs, JWTs, PEM blocks.

Internal documents

12%

Strategy decks, hiring docs, contract drafts.

Protected Health Info

ICD-10 codes, MRN labels, DOB, patient details. HIPAA exposure.

Financial / PCI data

Card numbers, account IDs, internal financial reports.

General PII (SSN, address, DOB)

Non-medical, non-financial PII. Often in HR-adjacent workflows.

Other secrets

Database passwords, SSO tokens, internal URLs, OAuth client secrets.

Percentages reflect share of flagged prompts containing each category. Single prompts often contain multiple categories, so percentages sum to more than 100%.
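
The reason the shares can exceed 100% in total is that each flagged prompt counts once toward every category it contains. A small sketch with made-up data (the prompts and the `category_shares` helper are hypothetical):

```python
from collections import Counter

# Made-up flagged prompts, each represented by the set of categories found.
flagged_prompts = [
    {"source_code", "credentials"},
    {"customer_records"},
    {"source_code"},
    {"credentials", "internal_documents"},
]


def category_shares(prompts):
    """Percent of flagged prompts that contain each category."""
    counts = Counter(cat for labels in prompts for cat in labels)
    total = len(prompts)
    return {cat: 100 * n / total for cat, n in counts.items()}


shares = category_shares(flagged_prompts)
print(shares)
print(sum(shares.values()))  # exceeds 100 because prompts are multi-label
```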

By role

Leak rate by job function

Role	Typical leaked data	Leak rate
Engineers / Developers	Source code, API keys	14.2%
Customer Support	Customer PII	12.8%
Finance / Accounting	Financial records	11.6%
Healthcare clinical staff	PHI	9.4%
Legal / Compliance	Contracts, internal docs	7.9%
Marketing	Customer lists, campaign data	6.1%
HR / People Ops	Employee PII	5.3%
Sales	Customer records, deal data	4.8%

By tool

Leak rate by AI provider

Tool	Notes	Leak rate
ChatGPT (consumer)	Highest volume. Consumer tier may train on prompts.	12.1%
Claude.ai (consumer)	Claude commits to no training on consumer tier as of 2024.	9.8%
Gemini (consumer)	Workspace tier excluded; consumer Gemini may use prompts for product improvement.	10.5%
Copilot (Microsoft)	M365 Copilot Business and above commit to no training.	7.2%
Perplexity	Mid-tier coverage; growing usage in research workflows.	8.9%
Long-tail (Poe, Character.AI, etc.)	Highest leak rate. Often used for unsanctioned workflows.	18.4%
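
The "nearly 2x" comparison in the key findings can be checked against this table. The sketch assumes an unweighted mean across the five named providers; the report does not say whether the comparison is volume-weighted or which tools count as major.

```python
from statistics import mean

# Leak rates from the table above, in percent.
major_provider_rates = {
    "ChatGPT (consumer)": 12.1,
    "Claude.ai (consumer)": 9.8,
    "Gemini (consumer)": 10.5,
    "Copilot (Microsoft)": 7.2,
    "Perplexity": 8.9,
}
long_tail_rate = 18.4

# Assumption: every named provider gets equal weight.
major_mean = mean(list(major_provider_rates.values()))
ratio = long_tail_rate / major_mean
print(f"major-provider mean: {major_mean:.1f}%, ratio: {ratio:.1f}x")
```

With equal weights the major-provider mean is 9.7% and the ratio is about 1.9x.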

Methodology

How we got these numbers

This report combines three data sources:

(1) Public studies. We synthesize numbers from Cyberhaven's 2023 paste-rate study, IBM's Cost of a Data Breach Report 2024, the BSI AI Threat Landscape 2025, ISACA's State of AI Governance 2025, the Cloud Security Alliance's 2026 State of AI in Enterprise, and OWASP's LLM Top 10 2025 edition. Every figure carries a citation. Where studies disagree we use the most recent. Where they agree we cite both.

(2) TeamPrompt Prompt PII Scanner aggregate signals. The free tool at teamprompt.app/tools/prompt-pii-scanner runs entirely client-side — no prompt content is sent to our servers. We do anonymously log structural metadata: category counts per scan, risk severity bands, and approximate character lengths. These signals are aggregated weekly and contribute to the rates above with explicit attribution.

(3) Category and role attributions. The category breakdown (credentials, source code, PHI, etc.) is derived from scanner-flagged categories normalized against published taxonomies. The role attribution combines TeamPrompt usage signals where employer/role is voluntarily disclosed by operators with public role-distribution data from Verizon DBIR 2024.
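
The structural metadata described in source (2) could take a shape like the following. This is a sketch under stated assumptions: the `ScanMetadata` fields, the severity labels, and the `aggregate_weekly` helper are hypothetical and are not taken from TeamPrompt's implementation. It shows that a weekly aggregate can be built from counts alone, with no prompt text stored.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class ScanMetadata:
    """Hypothetical per-scan record; it holds no prompt content."""
    scan_date: date
    category_counts: dict[str, int]  # e.g. {"credentials": 2}
    severity_band: str               # assumed labels: "low", "medium", "high"
    approx_length: int               # character length, coarsely rounded


def aggregate_weekly(records):
    """Sum category counts per ISO (year, week) key."""
    weekly = defaultdict(Counter)
    for r in records:
        iso = r.scan_date.isocalendar()
        weekly[(iso[0], iso[1])].update(r.category_counts)
    return dict(weekly)
```

Keying on the ISO year and week keeps scans from the same Monday-to-Sunday window together, including across year boundaries.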

What this report is not. It is not a longitudinal study with rigorous N and control groups. The aggregate scanner signal is biased toward users who self-select to test their own prompts; the public studies are bounded by their original methodologies. We treat the numbers as directionally accurate, useful for prioritization and budgeting, and not as causal-inference-grade evidence.

We will release a Q3 2026 edition with broader denominator coverage and a per-industry breakdown. Methodology updates will be appended to this same page with a changelog.

Press-ready summary

The one-pager

11.3% of prompts sent to consumer AI tools contain confidential data — and that rate has held flat for three years.
Engineering and customer support teams have the highest per-prompt leak rates (14.2% and 12.8%).
Long-tail AI tools (Poe, Character.AI, model routers) have nearly 2x the leak rate of major providers.
Only 21% of orgs with an AI policy enforce it technically. The other 79% rely on employee discipline.
Source code is the most-leaked category at 24% of flagged prompts. Consumer ChatGPT may train on it.
12+ AI tools are active per 100-person org. Most CISOs know about 3-5.

Citation: TeamPrompt, "State of Prompt Data Leakage — Q2 2026," published May 20, 2026. CC-BY 4.0.

Your team has prompts going somewhere right now.

See what's in them — for free, in your browser. The same engine that produces these aggregate signals also runs on the public scanner.
