AI Agile Framework

AI-Adapted Agile / Scrum

Standard Scrum adapted for teams building AI-integrated products. AI uncertainty, prompt iteration, model evaluation, and data governance are first-class artifacts — not afterthoughts squeezed into a regular sprint.

2-week sprints · Prompt stories · Eval ceremonies · AI Definition of Done · Risk register

Roles & Team Composition

Product Owner

Business
  • Owns AI acceptance criteria — must understand probabilistic outputs
  • Defines success metrics (not just user stories, but eval thresholds)
  • Prioritises the prompt backlog alongside the feature backlog
  • Signs off on eval results before AI features ship to production

AI / ML Engineer

AI Core
  • Owns prompt engineering, version control, and eval design
  • Selects and benchmarks models for each use case
  • Designs the RAG pipeline, fine-tuning strategy, or agent architecture
  • Investigates unexpected model behaviour in production

Software Engineer

Integration
  • Integrates the Higain.ai SDK into the application layer
  • Builds streaming UI, context management, and fallback handling
  • Implements feature flags for AI rollout
  • Maintains CI/CD pipeline including eval gates

Data Steward

Data
  • Owns data pipeline, RAG corpus, and labeling quality
  • Maintains data governance docs (provenance, consent, PII audit)
  • Coordinates human evaluators for eval and feedback review
  • Tracks data versioning (DVC or equivalent)

QA / Eval Engineer

Quality
  • Owns automated eval harness — runs on every PR
  • Designs and runs human eval protocols
  • Red-teams AI features before each release
  • Maintains eval dataset quality and prevents label leakage

Scrum Master

Process
  • Facilitates AI-specific ceremonies (Prompt Lab, Eval Review)
  • Ensures AI uncertainty is made explicit in planning estimates
  • Shields team from scope creep driven by model hype
  • Tracks AI story velocity separately from standard engineering velocity

Sprint Structure (2-week cycle)

Day 1

Sprint Planning

3–4 hours
  • Review product backlog — AI stories and standard feature stories together
  • Assign token cost estimates to AI stories (Low < 1K, Med 1K–10K, High > 10K tokens/request)
  • Write measurable eval criteria for every AI story before it enters the sprint
  • Identify AI risk cards: hallucination risk, data quality risk, cost overrun risk
  • Separate AI workstream tasks: Prompt, Data, Integration, Eval
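The token-tier thresholds above can be expressed as a small planning helper. A minimal sketch in Python; the function name `token_cost_tier` is an assumption for illustration, not part of any tooling named here:

```python
def token_cost_tier(est_tokens_per_request: int) -> str:
    """Map an estimated tokens-per-request figure to a planning tier.

    Thresholds follow the sprint-planning convention:
    Low < 1K, Med 1K-10K, High > 10K tokens per request.
    """
    if est_tokens_per_request < 1_000:
        return "Low"
    if est_tokens_per_request < 10_000:
        return "Med"
    return "High"
```

Keeping the thresholds in one place makes it easy to recompute tiers when real usage data replaces the planning estimate.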

Daily

Standup

15 min
  • Standard: What did I do yesterday? What will I do today? Any blockers?
  • AI-specific: Did any prompt or model behaviour change unexpectedly yesterday?
  • Flag immediately: unexpected refusals, cost spikes, schema failures, or latency regressions

Day 5–6

Prompt Lab Session

1–2 hours
  • Mid-sprint timebox dedicated to prompt iteration and experimentation
  • AI Engineer + QA run variants against the eval set in a shared notebook
  • Winning prompt committed to version control before session ends
  • Output: updated eval scores and a CHANGELOG entry
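The Prompt Lab loop could be scripted as: score every variant against the shared eval set, then surface the winner to commit. A sketch, assuming a caller-supplied `score(prompt, case)` function that stands in for a real model call plus metric:

```python
from typing import Callable

def run_prompt_lab(variants: dict[str, str],
                   eval_cases: list[dict],
                   score: Callable[[str, dict], float]) -> tuple[str, float]:
    """Score each prompt variant against the shared eval set and
    return the winning variant's name with its mean score.

    `score(prompt, case)` is a stand-in for a real model call
    followed by a metric computation.
    """
    results: dict[str, float] = {}
    for name, prompt in variants.items():
        case_scores = [score(prompt, case) for case in eval_cases]
        results[name] = sum(case_scores) / len(case_scores)
    winner = max(results, key=results.get)
    return winner, results[winner]
```

The returned mean score is what goes into the session's CHANGELOG entry alongside the committed prompt version.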

Day 10

Sprint Review

2 hours
  • Demo includes eval dashboard, not just feature UI — stakeholders see quality metrics
  • Present: eval score before vs. after, token cost trend, any regressions and their resolution
  • PO signs off using the acceptance criteria rubric defined at sprint planning
  • Incomplete AI stories: assess whether the eval bar was too aggressive or the feature needs more work

Day 10

AI Retrospective

1.5 hours
  • Standard retro format plus four AI-specific lenses:
      → Were our eval criteria realistic and well-defined?
      → Did any model behaviour surprise us? What did we learn?
      → Did we spend too much / too little time on prompt engineering vs. integration?
      → What data quality issues slowed us down? How do we prevent them next sprint?

Story Types

The standard user story format doesn't translate cleanly to AI work. Use these five story types in your backlog — each has a different estimation approach and different definition of done.

Feature Story

Feature
Format

As a [user], I want [functionality] so that [business value].

Estimation

Story points (standard Fibonacci)

Definition of Done

Code merged, tests passing, eval criteria met, feature flag set.

Example

As a user, I want the AI to summarise my document in 3 bullet points so that I can decide whether to read it.

Prompt Story

Prompt
Format

As an AI Engineer, I need [prompt change] so that [eval metric] improves by [threshold].

Estimation

Prompt complexity: Low (quick tweak) / Medium (restructure) / High (new strategy) / Research (unknown)

Definition of Done

Prompt version committed and tagged. Eval suite shows improvement. No regression on other metrics.

Example

As an AI Engineer, I need to add output length constraints to the summariser prompt so that ROUGE-L improves from 0.82 to ≥ 0.87.

Data Story

Data
Format

As a [role], I need [data asset] so that [AI capability or eval goal] is achievable.

Estimation

Story points based on pipeline complexity, not volume

Definition of Done

Data ingested, versioned, PII-audited, quality gate passing. Eval set updated if applicable.

Example

As a Data Steward, I need 50 labelled support ticket examples so that we can establish a baseline eval for the summariser.

Eval Story

Eval
Format

As a QA Engineer, I need [evaluation capability] so that [quality signal] is measurable.

Estimation

Story points (typically 2–5 pts)

Definition of Done

Eval runs in CI. Results logged to eval database. Dashboard updated.

Example

As a QA Engineer, I need an LLM-as-judge scorer for faithfulness so that RAG outputs are evaluated automatically on every PR.
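An LLM-as-judge faithfulness scorer like the one in this example typically pairs a grading prompt with strict parsing of the judge's reply. A hypothetical sketch; the prompt wording and the `parse_judge_score` helper are illustrative, not part of any Higain.ai API:

```python
JUDGE_PROMPT = """You are grading a RAG answer for faithfulness.

Context:
{context}

Answer:
{answer}

Reply with a single integer from 1 to 5, where 5 means every claim
in the answer is supported by the context."""

def build_judge_prompt(context: str, answer: str) -> str:
    """Fill the grading template with the retrieved context and answer."""
    return JUDGE_PROMPT.format(context=context, answer=answer)

def parse_judge_score(raw_reply: str) -> int:
    """Extract the 1-5 integer from the judge model's reply.

    Raises ValueError on a malformed or out-of-range reply, so a
    flaky judge fails the pipeline loudly instead of silently.
    """
    score = int(raw_reply.strip().split()[0])
    if not 1 <= score <= 5:
        raise ValueError(f"judge score out of range: {score}")
    return score
```

Logging both the parsed score and the raw judge reply makes judge drift auditable later.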

Spike

Spike
Format

Investigate [question] to inform [decision]. Time-box: [N] days.

Estimation

Not pointed — time-boxed. Maximum 1 sprint.

Definition of Done

Written findings doc shared with team. Go/no-go recommendation made. Spike branch deleted.

Example

Investigate whether DeepSeek-R1 outperforms Llama 3.1 70B on Nepali-language reasoning tasks. Time-box: 3 days.

Definition of Done for AI Features

An AI feature is not "done" when the code is merged. It is done when these gates pass. Every item is mandatory unless explicitly waived by the PO with written justification.

Code & Integration

  • Feature code merged to main and all CI checks passing.
  • Unit tests cover application logic with LLM responses mocked.
  • Integration test calls real API against golden test cases.
  • Feature flag configured — off by default, switchable without deployment.
  • Fallback behaviour tested: what happens when the model returns an error or empty response?
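Unit-testing application logic with the LLM mocked might look like the sketch below, using Python's `unittest.mock`. The `client.complete(prompt=...)` interface and the `summarise` function are hypothetical stand-ins, not a documented SDK surface:

```python
from unittest.mock import Mock

def summarise(client, document: str) -> list[str]:
    """Application logic under test: calls the LLM client and
    post-processes its reply into at most three bullet strings."""
    raw = client.complete(prompt=f"Summarise in 3 bullets:\n{document}")
    bullets = [line.strip("- ").strip()
               for line in raw.splitlines() if line.strip()]
    return bullets[:3]

def test_summarise_with_mocked_llm():
    # The mock replaces the real model call, so the test is fast,
    # deterministic, and free.
    client = Mock()
    client.complete.return_value = "- one\n- two\n- three\n- four"
    assert summarise(client, "some document") == ["one", "two", "three"]
    client.complete.assert_called_once()
```

The mocked test exercises the parsing and truncation logic; the golden-case integration test (previous bullet) is what exercises the real API.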

Prompt & Model Quality

  • System prompt committed to version control with semantic version tag (e.g. v2.3.0).
  • Eval suite passes with no metric below the agreed minimum threshold.
  • No regression vs. baseline on any dimension by more than 5%.
  • Human spot-check: at least 10 samples reviewed, with a mean score ≥ 3.8.
  • Red team checklist completed: prompt injection, jailbreak attempts, edge inputs.
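The two quantitative gates above (no metric below its agreed minimum; no regression beyond 5% vs. baseline) can be checked mechanically in CI. A sketch, assuming metrics arrive as name-to-score dicts:

```python
def eval_gate(current: dict[str, float],
              baseline: dict[str, float],
              minimums: dict[str, float],
              max_regression: float = 0.05) -> list[str]:
    """Return a list of failure reasons; an empty list means the gate passes.

    Checks every metric against its agreed minimum threshold and
    against the baseline (no more than `max_regression` relative drop).
    """
    failures = []
    for metric, value in current.items():
        floor = minimums.get(metric)
        if floor is not None and value < floor:
            failures.append(f"{metric}={value:.3f} below minimum {floor:.3f}")
        base = baseline.get(metric)
        if base is not None and value < base * (1 - max_regression):
            failures.append(
                f"{metric}={value:.3f} regressed beyond "
                f"{max_regression:.0%} vs baseline {base:.3f}")
    return failures
```

Wiring this into the PR pipeline and failing the build on a non-empty list is what makes the eval gate enforceable rather than advisory.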

Cost & Performance

  • Token cost per request measured and within ±20% of the estimate from sprint planning.
  • p95 latency measured and within agreed SLA.
  • Load test run at 2× expected peak traffic without degradation.
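A minimal way to verify the cost and latency gates. The nearest-rank p95 and the ±20% tolerance check below are sketches under those assumptions, not a prescribed implementation:

```python
import math

def p95(latencies_ms: list[float]) -> float:
    """p95 latency via the nearest-rank method on a sorted sample."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered)) - 1  # zero-based index
    return ordered[rank]

def cost_within_estimate(measured_tokens: float,
                         estimated_tokens: float,
                         tolerance: float = 0.20) -> bool:
    """True when measured token usage is within the planning
    estimate's tolerance band (default ±20%)."""
    return abs(measured_tokens - estimated_tokens) <= tolerance * estimated_tokens
```

Both checks run on measurements from the load test at 2× expected peak, not on single-request samples.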

Observability & Compliance

  • Production alert configured for: error rate spike, latency breach, cost overrun, refusal rate anomaly.
  • Logging in place: every request logs model, tokens used, latency, and outcome (not the full prompt content unless required).
  • PII audit: confirm no personal data is sent to the model or logged in cleartext.
  • Data residency confirmed: all inference runs on Nepal-hosted Higain.ai infrastructure.
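The logging requirement might be met with one structured JSON line per request. A sketch; the field names are illustrative assumptions, and note that raw prompt content is deliberately excluded:

```python
import json
import time

def log_inference(model: str, tokens_used: int,
                  latency_ms: float, outcome: str) -> str:
    """Build one structured log line per request: model, token usage,
    latency, and outcome -- never the raw prompt content."""
    record = {
        "ts": time.time(),
        "model": model,
        "tokens_used": tokens_used,
        "latency_ms": latency_ms,
        "outcome": outcome,  # e.g. "ok", "error", "refusal", "timeout"
    }
    return json.dumps(record)
```

Structured lines like this are what the alerting rules (error rate, latency, cost, refusal anomalies) aggregate over.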

AI Risk Management

Maintain a dedicated AI Risk Register alongside the standard project risk log. Review at every Sprint Review. The following risk categories apply to all AI features.

Hallucination
  What it means: Model confidently outputs false information.
  Early warning signals: Eval faithfulness score drops; user complaints about wrong facts.
  Mitigation: RAG grounding; output citations; human review for high-stakes outputs.

Bias
  What it means: Systematically unfair outputs toward a group.
  Early warning signals: Disparate error rates across demographic slices in eval.
  Mitigation: Bias audit on eval set; diverse annotators; bias-specific red teaming.

Data Quality
  What it means: Garbage in, garbage out — bad examples corrupt the model's behaviour.
  Early warning signals: Eval score variance; inconsistent output quality.
  Mitigation: Data quality gates; inter-annotator agreement checks; regular audits.

Cost Overrun
  What it means: Token usage far exceeds budget, making the feature uneconomical.
  Early warning signals: Token cost per request trending up; context window creeping up.
  Mitigation: Token budgets in requirements; hard limits in code; cost alerts.

Prompt Injection
  What it means: Malicious user input overrides system instructions.
  Early warning signals: Unexpected instruction-following; system prompt leakage reports.
  Mitigation: Input sanitisation; delimiter isolation; output validation; rate limiting.

Regulatory / Privacy
  What it means: PII in prompts/logs; data leaving Nepal contrary to policy.
  Early warning signals: Accidental PII in logs; developer error routing to a foreign API.
  Mitigation: PII masking pipeline; data residency audit; all inference via Higain.ai.
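A register entry could be modelled in code so it can live next to the eval dashboard rather than in a slide deck. A hypothetical sketch using a Python dataclass; the field names and statuses are assumptions, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class RiskEntry:
    """One row of the AI Risk Register, reviewed at every Sprint Review."""
    category: str            # e.g. "Hallucination", "Cost Overrun"
    description: str
    warning_signals: list[str] = field(default_factory=list)
    mitigations: list[str] = field(default_factory=list)
    status: str = "open"     # open / mitigated / accepted

def open_risks(register: list[RiskEntry]) -> list[RiskEntry]:
    """Filter for the entries that still need review."""
    return [r for r in register if r.status == "open"]
```

Printing `open_risks(register)` at the top of each Sprint Review keeps the review from becoming a formality.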

Backlog Hygiene for AI Teams

  • Maintain separate swim lanes: Features · Prompts · Data · Evals · Ops. Do not mix them in a single flat backlog.
  • Prompt stories expire after 2 sprints if not picked up — models, context, and user expectations all shift too fast.
  • Spike results must produce a written decision doc. Never close a spike with verbal-only findings.
  • Track prompt engineering velocity separately. It is non-linear: 80% of gains come from the first 20% of iterations.
  • Data stories carry hidden dependencies — always tag them against the governance checklist from Phase 3.

AI Agile Cheatsheet

  • Sprint length: 2 weeks
  • Prompt Lab: Day 5–6, 1–2 hrs
  • Story types: Feature, Prompt, Data, Eval, Spike
  • Spike max: 1 sprint, no points
  • Eval gate in CI: every PR, block on regression
  • Human spot-check: ≥ 10 samples per story
  • Rollout start: Shadow → 1% → 10% → 100%
  • Monitoring: daily + weekly human review
  • Risk review: every Sprint Review