AI Usage Playbooks PortalBack to top

AI Playbook - Business Analysis

Here is some practical guidance on how to set up your own AI Agent, whom you can talk to about the code and database in your project. After successful configuration, you will be able to ask questions (and get reliable answers!) regarding how the software works, what DB schema is, how services are interconnected, drawing UML sequence diagrams etc.

Core Principles: All recommendations in this playbook align with Xebia's official Core Principles for Working with AI. Refer to that document for the foundational rules that govern every AI interaction at Xebia.


How this Playbook is organized

In this document, you will find step-by-step manuals on how to:

  1. Create your own AI Agent in VS Code
  2. Add additional security layer to the code (so you do not commit any changes by accident)
  3. Configure GIT to download a code repository to your local machine
  4. Configure SQL extension to be able to interact with the data on your local machine
  5. Anonymize prompts
  6. Refine prompts
  7. Generate dummy data
  8. Transform data

1. BA.agent.md

Here are the instructions to set up a BA agent for your codebase in VS Code. You can set up an agent and - if needed for your project - add guardrails to be doubly sure that none of the code is changed.

General rule no.1: If something doesn't work for you, just ask agent to fix it 🙂 General rule no.2: Use Ask or Plan instead of Agent mode in Github Copilot chat window.

VS Code agent.md configuration

  1. Create and place the file (BA.agent.md) in your workspace root in .github\agents folder within a VS Code workspace
  2. Reference it (choose from the list) when you want the BA agent to analyze features, generate user stories, or provide gap analysis
  3. Contents of the BA.agent.md file:
# Role: AI Business Analyst & System Architect

You are a specialized Business Analyst (BA) agent. Your primary objective is to bridge the gap between business requirements and the existing technical implementation. 

## 🚫 CRITICAL RESTRICTION: READ-ONLY MODE
1. **NO CODE CHANGES:** You are strictly forbidden from writing, modifying, or deleting any code files.
2. **NO COMMITS:** You must never attempt to stage, commit, or push changes to the repository.
3. **NO REFACTORING:** Even if requested, do not provide "code fixes." Instead, describe the logic required in plain English or pseudocode within the chat.

## 🎯 Core Mission
Your purpose is to provide high-level insights for stakeholders. You act as a "Code-to-Business Translator."

### 1. Code Explanation
- When asked about a feature, trace the logic through the files and explain **how** it works in business terms.
- Identify dependencies and business rules hidden in the logic.
- Use flowcharts (Mermaid syntax) to visualize complex logic when helpful.

### 2. Requirement Engineering & User Stories
When a user requests a new feature or change, your task is to:
- Analyze the existing code to see where this feature would fit.
- Generate structured **User Stories** using the following template:
    - **Title:** [Brief Feature Name]
    - **User Story:** As a [Persona], I want to [Action], so that [Value].
    - **Acceptance Criteria:** (List of specific conditions that must be met).
    - **Technical Impact:** (High-level list of existing modules/files that will be affected).

### 3. Gap Analysis
- Identify if a requested feature conflicts with existing system logic.
- Flag potential "edge cases" that the user might have missed in their request.

## 🛠 Operational Guidelines
- **Context First:** Always start by scanning the relevant directory structure to understand the scope before answering.
- **Language:** Maintain a professional, analytical, and concise tone. Avoid developer jargon unless explaining a specific technical constraint to a stakeholder.
- **Clarity:** If a request is ambiguous, ask clarifying "Discovery Questions" before proposing a solution.

Now you are ready to clone GIT repository (and fork it if needed) and/or connect DB to VS Code.

It is also beneficial to install mermaid.js extension for VS Code - you will be able to see diagrams in VS Code - quite handy!


2. VS Code safety guardrails

Talk to your Tech Lead about additional safety guardrails. If needed feel free to use one or all below approaches.

The "Hard Stop": Git Pre-Commit Hook

Even if the AI manages to write code to a file, you can prevent that code from ever being committed to the repository. This is the most effective technical guardrail.

Create a file named .git/hooks/pre-commit (or use a tool like Husky) with this script:

Bash

#!/bin/sh
# Block all commits to ensure a Read-Only Analyst environment
echo "❌ ERROR: This is a Read-Only Analyst environment. Commits are disabled."
exit 1
  • Result: Any attempt to git commit (by the AI or the human) will fail instantly with that message.

VS Code Workspace Settings (.vscode/settings.json)

You can disable the AI's "intrusive" features so it doesn't even try to suggest code inline while the BA is reading. Add this to your project's .vscode/settings.json:

JSON

{
  // Disable Ghost Text/Inline completions to keep the UI clean for analysis
  "github.copilot.editor.enableAutoCompletions": false,
  
  // Prevent the AI from automatically triggering actions
  "github.copilot.enable": {
    "*": true,
    "plaintext": true,
    "markdown": true
  },

  // Ensure the BA doesn't accidentally edit files
  "files.readonlyInclude": {
    "**/*": true
  },

  // Visual cue: Change the title bar color so the BA knows they are in "Analyst Mode"
  "workbench.colorCustomizations": {
    "titleBar.activeBackground": "#2d3436",
    "titleBar.activeForeground": "#fab1a0"
  }
}

Note: files.readonlyInclude makes the files read-only within the VS Code editor itself, which provides a UI warning if anyone tries to type in the code.

Repository-Level "Protection" (If using GitHub)

If this repository is hosted on GitHub/GitLab, use Branch Protection Rules:

  • Restrict Pushes: Lock the main or develop branches so that no one (including the AI agent) can push directly.

  • Require Pull Request: Force all changes through a PR that requires an approval. Since the AI won't have "Approval" authority, it can't merge its own logic.


3. GIT configuration

Here you can find a step-by-step plan from total scratch (Windows) to cloning and working with GitHub repos

Create your GitHub account

  • Go to https://github.com/signup
  • Verify email and enable 2FA (recommended, often required for org repos).
  • Set your profile name and username.

Install Git on Windows

  • Download from https://git-scm.com/download/win
  • Run installer with defaults (safe for beginners).
  • Important options to keep:
    • Git from the command line and also from 3rd-party software
    • Use bundled OpenSSH
    • Use Git Credential Manager
  • Restart terminal after install.

Verify Git installation Open PowerShell and run:

git --version

You should see something like git version 2.x.x.

Set your Git identity (one-time)

git config --global user.name "Your Name"
git config --global user.email "your-email@example.com"

Check it:

git config --global --list

Pick authentication method (recommended: HTTPS + browser sign-in) For most beginners on Windows, easiest path is HTTPS with Git Credential Manager.

  • First clone/push will open browser auth automatically.
  • No manual token handling needed in most cases.

Clone your repository From the repo page on GitHub:

  • Click Code → choose HTTPS → copy URL (e.g. https://github.com/org/repo.git)

In PowerShell:

cd C:\Users\YourUser\Documents
git clone https://github.com/org/repo.git
cd repo

4. SQL Connection

SQL and MySQL Workbench
There is no official GitHub Copilot plugin for SQL Workbench. There is also no official native Copilot integration inside MySQL Workbench.

What works well instead:

  1. Install DB Client on your computer (e.g. SQL Server Management Studio) and configure DB
  2. Use VS Code with GitHub Copilot + a SQL extension (e.g. SQL Server).
  3. Connect VS Code to your database.

Good prompt examples for Copilot:

  1. “Write a query to find top 10 customers by revenue in the last 90 days.”
  2. “Optimize this query and explain index recommendations.”
  3. “Convert this MySQL query to PostgreSQL syntax.”
  4. "Draw diagram of DB in mermaid.js"

5. Prompt anonymization

Why this matters

BA and PO artifacts - user stories, interview notes, requirements docs, backlog tickets - routinely contain client names, internal system names, contract values, personal data from personas and stakeholders. These artifacts get pasted into AI tools far more often than source code does, and anonymization is usually an afterthought.

The AI Use Policy tells you what you must not share. This page tells you how to sanitize a prompt before you send it, in a way that still gets you a useful answer back.

What to anonymize - BA-specific checklist

Before pasting anything into an AI tool, scan for:

  • Client / customer names - including codenames, internal nicknames, and easily-identifiable descriptors ("our largest APAC insurer")
  • Project and product names - especially unreleased ones
  • Internal system names - if they reveal architecture, vendor choice, or competitive setup
  • Financial data - contract values, revenue figures, pricing models, discount structures
  • Personal data in personas and quotes - real names, job titles tied to identifiable individuals, direct quotes from interviews
  • Third-party vendor names under NDA - including tools, consultancies, and data providers
  • Regulatory / audit context - references to specific findings, regulator names, case numbers
  • Credentials, endpoints, IDs - API keys, URLs with tokens, database names, ticket IDs that link to internal systems
  • Attachments and screenshots - metadata and visible UI often leak more than the prompt text itself

💡 Rule of thumb: if the sentence, read aloud at a conference, would make your security officer uncomfortable, it needs anonymization.

Classification - light tiering

The AI Use Policy is the source of truth. Use this only as a working mental model when the policy doesn't give you a direct answer for a specific artifact:

Tier Example content Handling
Low sensitivity Generic how-to questions, framework advice, public documentation Paste as-is
Medium sensitivity Internal process descriptions, anonymized user stories, generic data shapes Anonymize, then use approved cloud AI tool
High sensitivity Real client data, PII, financials, anything explicitly restricted by policy Anonymize and process locally (Ollama or equivalent), or don't use AI at all

If a piece of content could plausibly belong to two tiers, treat it as the higher one.

Three approaches - pick by context

Approach When to use Effort
Manual placeholders Short, ad-hoc prompts Low
Structured replacement + reverse mapping Longer docs where output must come back with real names Medium
Local-model pre-processing (Ollama) Repeated workflows, bulk volume, high-sensitivity content High setup, low per-use
1. Manual placeholders

Replace sensitive terms inline before pasting. Use bracketed, ALL_CAPS placeholders that won't be confused with real text:

  • [CLIENT_A], [CLIENT_B] - different clients
  • [PROJECT_X] - project name
  • [SYSTEM_LEGACY], [SYSTEM_TARGET] - systems in a migration
  • [AMOUNT], [DATE] - financials and dates where precise values don't matter
  • [PERSON_1], [ROLE_1] - individuals

Keep the placeholder consistent within a single prompt so the AI can still reason about relationships ("CLIENT_A integrates with SYSTEM_LEGACY, CLIENT_B does not"). Use semantic suffixes, not just numbers, when the distinction matters to the answer.

2. Structured replacement with reverse mapping

When you want the AI's output to come back with real names filled in - e.g. a user story that goes straight into Jira, or a refactored requirements doc - keep a mapping file locally and substitute back after the AI responds.

Minimal workflow:

  1. Build a mapping: mapping.json
    {
      "CLIENT_A": "Actual Client Name",
      "PROJECT_X": "Actual Project Name",
      "SYSTEM_LEGACY": "Actual Legacy System"
    }
    
  2. Anonymize the prompt using the mapping (find-and-replace, or a small script)
  3. Send to AI, get response
  4. De-anonymize the response with the inverse mapping
  5. Delete the mapping when the session ends

A 15-line Python script does this well; a VS Code snippet or a sed one-liner is enough for simple cases. The point is that the mapping never leaves your machine.

Important: the mapping file itself is now a sensitive artifact. It links placeholders to real entities. Treat it accordingly:

  • Store locally only - never in a cloud-synced folder, repo, or shared drive
  • Don't commit it (add to .gitignore)
  • Delete it at end of day or end of session - don't accumulate a week of mappings
  • For high-sensitivity work, keep it encrypted (e.g. on an encrypted volume) and only decrypt during active use
3. Local-model pre-processing

For repeated or bulk work, run a local LLM (Ollama with Llama 3.1, Mistral, or similar) as a pre-filter. You give it the raw text and a prompt like "replace all client names, project names, financial figures, and personal names with bracketed placeholders; return the anonymized text and a mapping table". Then you send the anonymized output to the cloud AI.

Same model works for the reverse direction too - passing cloud AI output through a local review step to catch anything that accidentally came back with real names or leaked context.

For truly high-sensitivity content, a local model can also handle the full task end-to-end - no cloud call needed. Slower and less capable than frontier models, but for well-scoped BA tasks (drafting acceptance criteria from a requirement, summarizing notes) often good enough.

For generating clean substitute values (fake names, addresses, companies), Python's faker library is a quick helper - useful when you need replacements that look realistic rather than obviously-placeholder.

Before / after examples

Example 1 - user story with embedded PII

Before:

As Anna Kowalska, head of underwriting at Vistara Insurance, I want to approve quotes above 250,000 PLN manually through the Vistara QuoteEngine system, so that we stay compliant with our internal delegation matrix.

After:

As [ROLE_1], [ROLE_DEPARTMENT] lead at [CLIENT_A], I want to approve quotes above [AMOUNT_THRESHOLD] manually through [SYSTEM_QUOTE], so that we stay compliant with our internal delegation matrix.

Notice: the business logic (threshold-based manual approval, delegation matrix compliance) is fully preserved. The AI can still draft acceptance criteria, identify edge cases, or spot conflicts - it doesn't need to know the real client.

Example 2 - interview notes

Before:

During the workshop on Tuesday, Marek from the pricing team said the current Confluence-based process breaks down when the Singapore branch submits their quarterly revisions because the approval chain still points to the old org structure from before the reorganization.

After:

During a workshop, [PERSON_1] from the [DEPARTMENT] team said the current [TOOL] process breaks down when [REGIONAL_BRANCH] submits their quarterly revisions because the approval chain still points to an outdated org structure.

Specific day removed (not relevant to the logic). Person anonymized. The substantive issue - broken approval routing post-reorg - is intact.

Example 3 - data shape in a requirements doc

Before:

The export contains columns: client_id, client_name, pesel, broker_commission_pct, annual_premium_pln, renewal_date. We need to transform this into the new partner_feed.json format used by the DataLake Bronze layer.

After:

The export contains columns: id, name, national_id, commission_pct, annual_amount, renewal_date. We need to transform this into a target JSON format for ingestion into a data lake bronze layer.

Column names kept in spirit (enough for the AI to understand the schema), specific identifiers and internal layer names generalized.

Common pitfalls

  • Context leakage. You remove the client name but describe a process so specific - "the only Polish bancassurance player with a direct integration to the KNF reporting gateway" - that it's still identifiable. Anonymize entities and sanitize uniquely identifying descriptions.
  • Fake anonymization. Replacing "Orbia" with "Company O" or "Vistara Insurance" with "V-Insurance" doesn't help anyone. Use abstract placeholders.
  • Screenshots and pasted tables. A screenshot of a Jira board can contain more sensitive data than the surrounding prompt. Check the image, not just the text.
  • Reverse mapping stored badly. The mapping file is a direct link from placeholder to real entity. If it ends up in OneDrive, a Git repo, or an email draft, you've made things worse, not better.
  • Over-anonymization. If you strip every detail, the AI gives you a generic, useless answer. Keep the logic and structure; anonymize identity and specificity.
  • Accidentally realistic dummies. "Let me just use Jan Kowalski and 11111111116" - fine for Polish names (very common), risky for IDs and addresses that might coincidentally match a real person. Use faker or clearly-synthetic placeholders.
  • Conversation memory. Some AI tools retain context across turns in a session. If you anonymized turn 1 but pasted raw data in turn 5, the whole conversation is now sensitive.

Quick reference - before you hit Enter

  1. Scan the prompt for names, amounts, IDs, system names, quotes
  2. Check for attachments and screenshots - open and review each one
  3. Replace sensitive terms with consistent placeholders
  4. Keep the business logic and relationships intact
  5. Confirm tier vs. tooling: right sensitivity level → right tool (cloud / local / none)
  6. If using reverse mapping: mapping file stays local, gets deleted after the session

6. Prompt rewriting

Prompt Rewriting / Refinement

Why this matters

BA inputs are fuzzy by nature: meeting notes, someone else's half-finished requirements, vague stakeholder questions. Fuzzy input produces fuzzy output. You can either regenerate the AI's answer five times hoping it improves, or spend 30 seconds fixing the prompt and get it right the first time.

This is a separate concern from anonymization. Anonymization removes what shouldn't leave your machine. Rewriting shapes what remains so the AI actually understands what you want.

When to use / when to skip

Worth rewriting: complex requirements with multiple constraints, structured outputs you'll paste somewhere (user stories, AC, gap analyses), repeatable tasks worth templating, high-stakes outputs where a bad answer costs someone else's time.

Skip it: quick lookups, exploratory "help me think" sessions, simple Q&A where you'll iterate in chat anyway.

⚠ Security note: asking AI to rewrite your prompt means the rewriter sees your data. If the content is sensitive, anonymize first (see Prompt Anonymization), then ask for the rewrite.

Two modes

Self-rewrite (meta-prompting). You paste your raw prompt and ask the AI to improve it before answering. Good when you don't know what's missing - the AI surfaces ambiguities you didn't see.

Framework-driven. You structure the prompt yourself using a known template. The AI just executes. Predictable, repeatable, lower variance - good when you know what you want and will run similar prompts many times.

Frameworks worth knowing

CO-STAR

Six parts. Most comprehensive. Use for complex work or anything going to stakeholders.

  • Context - background the AI needs to understand the situation
  • Objective - what exactly you want the AI to produce
  • Style - how the output should be written (formal, technical, conversational)
  • Tone - emotional register (neutral, supportive, urgent, plain-spoken)
  • Audience - who will read the output
  • Response - format (markdown, JSON, table, bulleted list, user story)

Example (uses Vistara for readability - in real use, anonymize first):

# Context
Vistara Insurance is migrating quote approvals to a new platform. Senior
underwriters can manually approve quotes above 250,000 PLN per the
delegation matrix.

# Objective
Draft 3 user stories covering the manual approval flow, including happy
path and main edge cases.

# Style
Concise, standard Agile user story format.

# Tone
Neutral, professional.

# Audience
Development team and QA - for estimation and test design.

# Response
Markdown, one user story per block. Each block: Title, user story
(As a... I want... so that...), 3-5 Given/When/Then acceptance criteria.
RTF

Three parts. Minimal. Use for quick structured tasks.

  • Role - who the AI should be ("you are a senior BA with insurance domain experience")
  • Task - what to do
  • Format - how to structure the output

You are a senior BA with insurance domain experience. Draft acceptance criteria for a quote approval workflow where underwriters above a seniority threshold can manually approve quotes over a defined limit. Output as Given/When/Then, 5 criteria maximum.

Which one, when
Situation Framework
Single quick task, you know what you want RTF
Complex task, stakeholder-facing output CO-STAR
You want multiple alternative approaches CRISPE (adds "Experiment" - asks for variations)
You don't know what's missing Meta-prompt pattern (below)

The meta-prompt pattern

A ready-to-use meta-prompt when you're not sure your prompt is good enough:

Before answering, improve the following prompt. Identify:
- Ambiguities or missing context
- Unclear success criteria
- Missing constraints or format requirements

Then return:
1. A short list of what was missing or unclear
2. A rewritten version of the prompt
3. Any clarifying questions you need answered before proceeding

My prompt: [paste your prompt here]

BA template library

Templates use Vistara for readability. In real use, anonymize first.

Template 1 - User story generation (CO-STAR)
# Context
[Business situation, system being changed, why it matters. 2-4 sentences.]

# Objective
Draft [N] user stories covering the new flow, including happy path and
main edge cases.

# Style
Standard Agile user story format, concise.

# Tone
Neutral, professional.

# Audience
Development team and QA - for estimation and test design.

# Response
For each story:
- Title
- User story: As a [role], I want to [action], so that [value]
- 3-5 acceptance criteria in Given/When/Then
- Short list of technical impact points (modules, systems touched)
Template 2 - Acceptance criteria drafting (CO-STAR)
# Context
[Paste user story or feature description. Include business rules.]

# Objective
Generate complete acceptance criteria covering:
- Happy path
- Validation and error states
- Permissions and role boundaries
- Edge cases (limits, timeouts, concurrency)

# Style
Given/When/Then, one criterion per block.

# Tone
Neutral, testable - each criterion verifiable by QA without extra context.

# Audience
QA engineers and developers.

# Response
Numbered Given/When/Then blocks. Flag any criterion where context is
insufficient as "Needs clarification" at the end.
Template 3 - Gap analysis between two documents (CO-STAR)
# Context
Document A is [e.g. current-state process]. Document B is [e.g. target
requirements from the vendor]. They should align but there are likely gaps.

[Paste Document A]
[Paste Document B]

# Objective
Identify gaps in three categories:
1. Requirements in B not covered by A
2. Elements in A that conflict with B
3. Areas where both documents are ambiguous

# Style
Analytical, structured.

# Tone
Neutral, no speculation beyond what the documents say.

# Audience
BA / PO team and project sponsor.

# Response
Markdown with three sections. Each gap: title, location (document +
section), description, suggested clarifying question for the next meeting.
Template 4 - Meeting notes to action items (RTF)
You are a senior BA summarizing meeting notes. Turn the notes below into:
- A 3-bullet summary of key decisions
- A table of action items (owner, action, due date if mentioned)
- A list of open questions raised but not resolved

Notes:
[paste raw notes]
Template 5 - Stakeholder communication rewrite (RTF)
You are a senior BA writing to a non-technical executive sponsor. Rewrite
the text below to remove jargon, lead with the business impact, and keep
it under 150 words. Preserve all specific numbers and dates.

Text:
[paste draft]

Common pitfalls

  • Over-engineering. A 500-word CO-STAR prompt for a task a one-liner would have solved. If the answer is predictable, skip the ceremony.
  • Thin Context, thin output. If Context is "we are migrating a system", the output will be equally generic. Specificity in Context drives specificity in output.
  • Meta-prompt loops. Asking the AI to rewrite the rewrite of the rewrite. One pass is usually enough. If the second pass doesn't help, the problem is missing context, not prompt structure.
  • Forgetting anonymization during rewrite. The rewriter sees everything. If the original prompt had sensitive content, the rewritten one does too - plus any extra detail the AI added "for clarity".
  • Template rot. Templates go stale when underlying processes change. Re-read anything older than a few months before reusing.

7. Dummy data generation

Why this matters

BA work constantly needs example data: UAT scenarios, concrete examples inside requirements docs, mock API responses in specs, demo datasets for stakeholder workshops. Without good dummy data, you either use production data (security issue) or write abstract specifications that stakeholders can't evaluate.

Generating dummy data is also one of the least risky places to use AI - as long as the schema you describe isn't itself sensitive.

Three grades of dummy data

The grade drives the tool and the effort. Pick one first, then choose the tool.

Grade What it has to do Typical use
Schema-grade Correct shape, valid types, parseable Inline examples in requirements, explaining a data structure
Test-grade Covers happy path + edge cases (nulls, boundaries, invalid inputs) UAT scenarios, QA test data, acceptance criteria backing
Demo-grade Looks realistic to a non-technical stakeholder Stakeholder demos, prototype populations, screenshots for decks

A single requirements doc might need all three in different sections. Don't over-invest: if you only need schema-grade, don't spend an hour making it demo-grade.

Tool selection

No single tool covers everything. Most BA workflows end up hybrid.

Tool Best for Weakness
AI (Claude / ChatGPT) Small sets (5-50 records), scenario-rich data with context Volume, referential integrity, reproducibility
Python Faker Bulk (100s-1000s), deterministic, localization, atomic fields No narrative, no domain semantics
Local LLM (Ollama) Sensitive schema you shouldn't paste into cloud AI Slower, less capable than frontier models
AI-generated generator BA doesn't know Faker API but wants a reusable script Needs one-time review before trusting
AI directly

Best for small, narrative-rich sets. You can say "generate 10 policies where at least 3 are renewals with partial payments, 2 are lapsed, 1 is pending underwriter approval" and the AI will produce something plausible. Don't use it for volume - quality degrades past ~50 records, and the output isn't reproducible.

⚠ Use this with your head on. Even though the output is dummy, the input isn't. Don't paste your full client data model, real field names from production, actual business rules with specific thresholds, or anything you wouldn't show to someone outside the project. Describe the shape abstractly: "a policy record with client info, premium amount in a realistic range, and status" is enough. If the schema itself is sensitive, use the Local LLM option or the AI-generated generator approach and run it offline.

Python Faker

The workhorse for anything over a few dozen records. Install with pip install faker.

from faker import Faker

fake = Faker('pl_PL')
Faker.seed(42)  # reproducibility: same seed = same data every run

fake.name()            # "Anna Kowalska"
fake.company()         # "Vistara Sp. z o.o."
fake.address()         # Polish-formatted address
fake.pesel()           # Valid PESEL with correct checksum
fake.nip()             # Valid NIP with checksum
fake.phone_number()    # +48 format
fake.email()
fake.date_between(start_date='-2y', end_date='today')

A realistic BA-sized batch:

from faker import Faker
import csv

fake = Faker('pl_PL')
Faker.seed(42)

TYPES = ['life', 'property', 'vehicle']
STATUSES = ['active', 'expired', 'cancelled']

with open('vistara_policies_TEST.csv', 'w', newline='') as f:
    w = csv.DictWriter(f, fieldnames=[
        'id', 'client_name', 'pesel', 'policy_type',
        'annual_premium_pln', 'start_date', 'status', 'is_test'
    ])
    w.writeheader()
    for _ in range(200):
        w.writerow({
            'id': fake.uuid4(),
            'client_name': fake.name(),
            'pesel': fake.pesel(),
            'policy_type': fake.random_element(TYPES),
            'annual_premium_pln': fake.random_int(min=500, max=25000),
            'start_date': fake.date_between(start_date='-2y', end_date='today'),
            'status': fake.random_element(STATUSES),
            'is_test': True,
        })

The is_test: True column is cheap insurance against dummy data accidentally being treated as real later.

Local LLMs (Ollama) for sensitive schemas

If the schema itself is confidential (e.g. internal system field names, proprietary scoring buckets, regulator-specific classification codes), don't paste it into cloud AI to generate data. Either use Faker locally (no prompt sent anywhere), or run a local model via Ollama:

ollama pull llama3.1
ollama run llama3.1 "Generate 20 JSON records matching this schema: ..."

Slower and less fluent than frontier models, but the schema never leaves your machine. For scenario-rich outputs with confidential schemas, this is often the only acceptable path.

AI-generated generator (the hybrid approach)

The most practical option for BAs who don't want to learn the Faker API by heart: describe the data you need and ask the AI to write the generator script for you. You then save and run the script locally.

Example prompt:

Write a Python script using Faker (locale pl_PL) that generates 500 policy
records for testing. Schema:
- id: UUID
- client_name: Polish full name
- pesel: valid PESEL
- policy_type: one of [life, property, vehicle]
- annual_premium_pln: integer between 500 and 25000
- start_date: within the last 2 years
- status: active|expired|cancelled with 70/20/10 distribution
- is_test: always true

Use seed=42 for reproducibility. Output to CSV named vistara_policies_TEST.csv.
Include 10 edge-case records at the end: nulls in optional fields,
boundary values on premiums, very old start_dates.

Benefits: no Faker API knowledge needed, script is reproducible and reviewable, runs locally (no cloud re-runs for each regeneration), easy to iterate by adjusting the prompt.

⚠ Review the generated script before running. AI occasionally invents Faker methods that don't exist or produces code with subtle bugs.

What to specify in an AI prompt

When generating data with AI directly (not a script), be explicit about:

  • Volume - "10 records" not "a few"
  • Fields and types - give a schema, not a description
  • Edge cases explicitly - AI defaults to a bland bell curve, so spell out nulls, boundaries, invalid states
  • Format - CSV with headers, JSON array, Markdown table
  • Localization - "Polish names and addresses", not assumed
  • Reproducibility hint - ask for a seeded script instead if you'll need to regenerate
  • Synthetic marker - e.g. is_test: true, or prefix IDs with TEST_

Polish-specific cautions

Polish identifiers have checksums. AI that hallucinates digits produces invalid values - fine for schema-grade, broken for anything that validates.

  • PESEL (11 digits) - has a modulo-based checksum. Use fake.pesel() from Faker pl_PL or a dedicated generator. Don't let AI invent PESELs inline.
  • NIP (10 digits) - weighted checksum. Same logic: use fake.nip().
  • REGON (9 or 14 digits) - yet another checksum. Use fake.regon().
  • IBAN - use fake.iban() which produces a structurally valid Polish IBAN.
  • Addresses - Faker may produce a fictional street in a real city. Usually fine, but if the data will be shown to stakeholders who might Google addresses, add a clearly-synthetic city like "Testowo" or flag the dataset as synthetic up front.
  • Names - common Polish names (Jan Kowalski, Anna Nowak) are so common that collision with real people is guaranteed. This is fine if the dataset is clearly marked as test data - it's a problem if it isn't.

Examples

Example 1 - bad prompt vs good prompt

Bad:

Generate some customer data for my insurance project.

Output: 5 generic records with American-sounding names, no PESEL, no edge cases, inconsistent types.

Good:

Generate 15 insurance customer records for Vistara Insurance in Polish context. Schema: id (UUID), full_name (Polish), pesel (valid 11-digit with checksum), email, phone (+48 format), segment (retail|SME|corporate). Include 3 edge cases: one with null email, one SME with unusually long company name, one with a birthdate making them 17 (under-age flag should apply). Mark all records is_test: true. Output as JSON array.

Output: usable test data with intentional edge cases.

Example 2 - UAT scenario

For an acceptance criterion like "senior underwriters can manually approve quotes above 250,000 PLN":

Generate 5 quote records for UAT. Schema: quote_id, client_id, underwriter_role (junior|senior), amount_pln, requires_manual_approval (boolean). Design them to cover: (1) junior underwriter under threshold - auto-approve expected, (2) junior above threshold - should be blocked, (3) senior under threshold - auto-approve, (4) senior exactly at 250,000 - boundary case, (5) senior above threshold - manual approval path. Output as Markdown table.

Example 3 - mock API response

For an API spec in a requirements doc:

Generate a JSON example response for GET /policies/{id}. Schema: id (UUID), client (nested: name, pesel, email), policy (nested: type, annual_premium_pln, start_date, status), broker_commission_pct (decimal, 2 places). Use Polish values. Make it a single realistic-looking record suitable for inclusion in API documentation. Mark as synthetic with "_example": true.

Common pitfalls

  • Accidentally real data. AI generates "Jan Kowalski, ul. Marszałkowska 1, Warszawa" - the person may not exist but the address does. Low-risk for clearly-marked test data, real problem if the dataset ends up in a public demo without the test marker.
  • No edge cases. AI defaults to averages. If you don't explicitly ask for nulls, boundaries, invalid states, you won't get them - and your UAT will miss the bugs that matter.
  • Broken referential integrity. Across multiple tables, AI doesn't maintain FK relationships without being told. For linked entities, generate in one pass or script it in Faker.
  • Non-reproducibility. AI gives you different records every run. Fine for one-off demos, bad for UAT that someone else needs to reproduce. If reproducibility matters, get a script (Faker with seed) instead of raw data.
  • Dummy data leaks into production. Without a clear marker (is_test: true, TEST_ prefix, synthetic city name), test datasets have a habit of ending up in production exports, dashboards, or analytics.
  • Volume creep. "Generate 10,000 records" via AI in chat - you'll get 50 records, a "[... continues for 9,950 more rows ...]", and waste tokens. For anything over ~50 records, script it.
  • Pasting real data as a template. "Here's a sample of our real customer data, generate 20 more like this" - this is a leak, not a generation prompt. If you need to show shape, describe the schema abstractly instead.

8. Data transformation

Data Transformation (CSV ↔ JSON)

Why this matters

BAs constantly translate between formats. Stakeholder sends a CSV that needs to land in an API spec as JSON. An API response needs to go into Excel for the business. A legacy export needs to be reshaped into the import format of a new system. AI is genuinely good at this - it understands instructions like "group by client_id and nest the policies as an array" without you writing a single line of code.

Security first

The default mode for this work should be: anonymize the prompt, then transform locally.

Three patterns, in order of preference:

  1. AI writes the script, you run it locally. Describe the transformation, paste a small synthetic sample (3-5 rows you make up, matching the real schema), get a Python or jq script back. Run it on your machine against the real file. The real data never leaves your laptop. This is the recommended default for anything with sensitive content.

  2. Local LLM (Ollama) does the transformation. When you genuinely need AI to do the reshaping itself (e.g. complex semantic mapping the script can't easily express), run a local model. Slower, less fluent, but the data stays on your machine.

  3. Cloud AI on anonymized data. Only when the data is non-sensitive or fully anonymized. Use placeholder values for any field that could identify clients, individuals, or internal systems (see Prompt Anonymization).

⚠ The wrong default is "paste the CSV into ChatGPT and ask for JSON." That sends the whole dataset to a cloud provider.

When AI, when traditional tools

Tool Best for
AI directly (in chat) One-off ad-hoc, small/medium volumes, anonymized or non-sensitive data, complex semantic reshaping
AI-generated script (run locally) Default for anything sensitive. Repeatable transformations. Real data stays local.
Python (pandas, json) Bulk, production, full control
jq CSV/JSON wrangling on the command line, especially nested JSON
Excel / Power Query When the final destination is Excel anyway
Local LLM (Ollama) Sensitive data + transformation logic too complex for a script
Quick jq examples

jq is a command-line JSON processor that ships on most systems and is easy to install (brew install jq, apt install jq). For BA-sized transformations it's often faster than Python.

# Pretty-print a JSON file
jq . input.json

# Extract a single field across all records
jq '.[].client_name' input.json

# Filter records
jq '.[] | select(.status == "active")' input.json

# CSV to JSON array (using a one-liner with python or csvkit's csvjson)
csvjson input.csv > output.json

# JSON to CSV
jq -r '.[] | [.id, .name, .premium] | @csv' input.json > output.csv

For BAs not familiar with jq: ask AI to write the jq expression you need. It's good at this.

Common transformation patterns

Pattern 1 - flat CSV → JSON array

The simplest case. One row becomes one JSON object.

CSV:
id,client_name,policy_type,annual_premium_pln
P001,[CLIENT_A],life,5000
P002,[CLIENT_B],property,12000

JSON:
[
  {"id": "P001", "client_name": "[CLIENT_A]", "policy_type": "life", "annual_premium_pln": 5000},
  {"id": "P002", "client_name": "[CLIENT_B]", "policy_type": "property", "annual_premium_pln": 12000}
]
Pattern 2 - flat CSV → nested JSON (grouping)

Where AI shines. Reshaping based on a key that needs to become a parent.

CSV:
client_id,policy_id,policy_type,premium
C001,P001,life,5000
C001,P002,property,12000
C002,P003,vehicle,3500

JSON (grouped by client):
[
  {
    "client_id": "C001",
    "policies": [
      {"policy_id": "P001", "type": "life", "premium": 5000},
      {"policy_id": "P002", "type": "property", "premium": 12000}
    ]
  },
  {
    "client_id": "C002",
    "policies": [
      {"policy_id": "P003", "type": "vehicle", "premium": 3500}
    ]
  }
]
Pattern 3 - nested JSON → flat CSV (denormalization)

Inverse of pattern 2. Each child row repeats the parent fields. Useful when stakeholders want Excel.

Pattern 4 - schema mapping (system A → system B)

The classic BA migration task. Source system has one schema, target system has another, and you need a deterministic mapping.

Best handled as: prepare a synthetic sample matching the source schema, ask AI to generate a mapping script that transforms A-shape into B-shape, run locally on the full real dataset.

A useful prompt structure:

I need to transform records from System A format into System B format.

Source schema (System A) - synthetic sample:
{paste 3 made-up records matching the real schema}

Target schema (System B) - synthetic example:
{paste 1 made-up record showing target shape}

Transformation rules:
- A.client_full_name → B.firstName + B.lastName (split on first space)
- A.premium_pln → B.amount (also rename currency field)
- A.start_dt (format: YYYY-MM-DD) → B.startDate (format: ISO 8601 with timezone)
- Drop A.legacy_internal_id entirely
- B.source_system = "SYSTEM_A" (constant)

Generate a Python script that reads input.csv, applies the transformation,
and writes output.json. Handle missing optional fields gracefully.

You get a script. You review it. You run it locally. The real data never leaves your laptop.

The structured prompt pattern

When you do use AI directly for a transformation (small, anonymized data), be explicit about:

  • Sample input - 3-5 rows is enough. Never paste real data. Either make up synthetic rows that match the schema, or paste real headers + replace all values with placeholders ([CLIENT_A], 12345, etc.). The AI needs the shape, not the actual contents.
  • Target shape - show one full example of the output you want (also synthetic).
  • Field-level rules - rename, transform, derive, drop. Be exhaustive.
  • Edge case handling - what to do with nulls, missing fields, malformed values.
  • Output format - "valid JSON only, no markdown fences", or "CSV with headers, comma-separated".

Bad: "Convert this CSV to JSON." Good: "Convert this CSV to a JSON array. Each row becomes an object. Keep all field names as-is. Convert empty strings to null. Output JSON only, no commentary."

Validation - trust but verify

This is the most-skipped step and the most expensive one to skip. AI in transformation tasks fails quietly: it drops rows you didn't notice, invents a field that wasn't there, silently changes a date format, or mis-handles one specific edge case. The output looks plausible.

Always check, in this order:

  1. Row count. Input had 247 rows. Output has 247 records. If not, something dropped or duplicated.
  2. Field set. Every record has exactly the expected keys. No extras, no missing.
  3. Spot-check. First record, last record, two random middle records. Compare manually against input.
  4. Boundary cases. Find the row with the longest name, the null value, the special character, the unusual date - check these specifically.
  5. For production transformations: write a validation script (assert row count, assert schema, assert no nulls in required fields). Don't trust eyeballing for anything that goes downstream into another system.

A 10-line Python validation script catches issues that 15 minutes of eyeballing misses.

Context window limits

Large files don't fit in the prompt. Trying to paste a 50,000-row CSV will fail or get silently truncated. The rule:

  • < 100 rows, non-sensitive: paste directly.
  • Anything sensitive, any size: synthetic sample + script run locally.
  • > 1000 rows: always script-based regardless. Don't try to paste.

If the schema is the only thing AI needs to understand, paste the headers and 3-5 made-up rows matching the schema - that's enough context to generate a working transformation script for any volume.

Examples (using Vistara, anonymize for real use)

Example 1 - CSV of policies to JSON grouped by client

Synthetic sample (made up to match the real schema, not real data):

client_id,client_name,policy_id,policy_type,annual_premium_pln,status
C001,[CLIENT_A],P001,life,5000,active
C001,[CLIENT_A],P002,property,12000,active
C002,[CLIENT_B],P003,vehicle,3500,expired

Prompt:

Transform this CSV into a JSON array, grouped by client_id. Each client object should have: client_id, client_name, and a "policies" array containing policy_id, policy_type, annual_premium_pln, and status. Output valid JSON only.

Example 2 - API response JSON to Excel-friendly CSV

I have a JSON response with nested policy data. The business team wants it in Excel. Flatten it so each policy becomes one row. Repeat client fields across rows belonging to the same client. Use semicolon as separator (Excel-PL friendly). Include headers. Sample input: [paste 2-3 synthetic records matching the real shape].

Example 3 - schema mapping for a migration

Generate a Python script that reads input.csv (legacy format) and writes output.json (target format). Mapping rules: ... [list rules]. Handle nulls. Mark every output record with source_system: "LEGACY".

Run the script locally. The legacy CSV never leaves your laptop.