AI Agile Framework

AI-Adapted Agile / Scrum

Standard Scrum adapted for teams building AI-integrated products. AI uncertainty, prompt iteration, model evaluation, and data governance are first-class artifacts — not afterthoughts squeezed into a regular sprint.

2-week sprints · Prompt stories · Eval ceremonies · AI Definition of Done · Risk register

Roles & Team Composition

Product Owner

Business
  • Owns AI acceptance criteria — must understand probabilistic outputs
  • Defines success metrics (not just user stories, but eval thresholds)
  • Prioritises the prompt backlog alongside the feature backlog
  • Signs off on eval results before AI features ship to production

AI / ML Engineer

AI Core
  • Owns prompt engineering, version control, and eval design
  • Selects and benchmarks models for each use case
  • Designs the RAG pipeline, fine-tuning strategy, or agent architecture
  • Investigates unexpected model behaviour in production

Software Engineer

Integration
  • Integrates the Higain.ai SDK into the application layer
  • Builds streaming UI, context management, and fallback handling
  • Implements feature flags for AI rollout
  • Maintains CI/CD pipeline including eval gates

Data Steward

Data
  • Owns data pipeline, RAG corpus, and labeling quality
  • Maintains data governance docs (provenance, consent, PII audit)
  • Coordinates human evaluators for eval and feedback review
  • Tracks data versioning (DVC or equivalent)

QA / Eval Engineer

Quality
  • Owns automated eval harness — runs on every PR
  • Designs and runs human eval protocols
  • Red-teams AI features before each release
  • Maintains eval dataset quality and prevents label leakage

Scrum Master

Process
  • Facilitates AI-specific ceremonies (Prompt Lab, Eval Review)
  • Ensures AI uncertainty is made explicit in planning estimates
  • Shields team from scope creep driven by model hype
  • Tracks AI story velocity separately from standard engineering velocity

Sprint Structure (2-week cycle)

Day 1

Sprint Planning

3–4 hours
  • Review product backlog — AI stories and standard feature stories together
  • Assign token cost estimates to AI stories (Low < 1K, Med 1K–10K, High > 10K tokens/request)
  • Write measurable eval criteria for every AI story before it enters the sprint
  • Identify AI risk cards: hallucination risk, data quality risk, cost overrun risk
  • Separate AI workstream tasks: Prompt, Data, Integration, Eval
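The token-tier thresholds above can be expressed as a small planning helper. A minimal sketch in Python; the function name `token_cost_tier` is an assumption for illustration, not part of any tooling named here:

```python
def token_cost_tier(est_tokens_per_request: int) -> str:
    """Map an estimated tokens-per-request figure to a planning tier.

    Thresholds follow the sprint-planning convention:
    Low < 1K, Med 1K-10K, High > 10K tokens per request.
    """
    if est_tokens_per_request < 1_000:
        return "Low"
    if est_tokens_per_request < 10_000:
        return "Med"
    return "High"
```

Keeping the thresholds in one place makes it easy to recompute tiers when real usage data replaces the planning estimate.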

Daily

Standup

15 min
  • Standard: What did I do yesterday? What will I do today? Any blockers?
  • AI-specific: Did any prompt or model behaviour change unexpectedly yesterday?
  • Flag immediately: unexpected refusals, cost spikes, schema failures, or latency regressions

Day 5–6

Prompt Lab Session

1–2 hours
  • Mid-sprint timebox dedicated to prompt iteration and experimentation
  • AI Engineer + QA run variants against the eval set in a shared notebook
  • Winning prompt committed to version control before session ends
  • Output: updated eval scores and a CHANGELOG entry
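The Prompt Lab loop could be scripted as: score every variant against the shared eval set, then surface the winner to commit. A sketch, assuming a caller-supplied `score(prompt, case)` function that stands in for a real model call plus metric:

```python
from typing import Callable

def run_prompt_lab(variants: dict[str, str],
                   eval_cases: list[dict],
                   score: Callable[[str, dict], float]) -> tuple[str, float]:
    """Score each prompt variant against the shared eval set and
    return the winning variant's name with its mean score.

    `score(prompt, case)` is a stand-in for a real model call
    followed by a metric computation.
    """
    results: dict[str, float] = {}
    for name, prompt in variants.items():
        case_scores = [score(prompt, case) for case in eval_cases]
        results[name] = sum(case_scores) / len(case_scores)
    winner = max(results, key=results.get)
    return winner, results[winner]
```

The returned mean score is what goes into the session's CHANGELOG entry alongside the committed prompt version.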

Day 10

Sprint Review

2 hours
  • Demo includes eval dashboard, not just feature UI — stakeholders see quality metrics
  • Present: eval score before vs. after, token cost trend, any regressions and their resolution
  • PO signs off using the acceptance criteria rubric defined at sprint planning
  • Incomplete AI stories: assess whether the eval bar was too aggressive or the feature needs more work

Day 10

AI Retrospective

1.5 hours
  • Standard retro format plus four AI-specific lenses:
      → Were our eval criteria realistic and well-defined?
      → Did any model behaviour surprise us? What did we learn?
      → Did we spend too much / too little time on prompt engineering vs. integration?
      → What data quality issues slowed us down? How do we prevent them next sprint?

Story Types

The standard user story format doesn't translate cleanly to AI work. Use these five story types in your backlog — each has a different estimation approach and different definition of done.

Feature Story

Feature
Format

As a [user], I want [functionality] so that [business value].

Estimation

Story points (standard Fibonacci)

Definition of Done

Code merged, tests passing, eval criteria met, feature flag set.

Example

As a user, I want the AI to summarise my document in 3 bullet points so that I can decide whether to read it.

Prompt Story

Prompt
Format

As an AI Engineer, I need [prompt change] so that [eval metric] improves by [threshold].

Estimation

Prompt complexity: Low (quick tweak) / Medium (restructure) / High (new strategy) / Research (unknown)

Definition of Done

Prompt version committed and tagged. Eval suite shows improvement. No regression on other metrics.

Example

As an AI Engineer, I need to add output length constraints to the summariser prompt so that ROUGE-L improves from 0.82 to ≥ 0.87.

Data Story

Data
Format

As a [role], I need [data asset] so that [AI capability or eval goal] is achievable.

Estimation

Story points based on pipeline complexity, not volume

Definition of Done

Data ingested, versioned, PII-audited, quality gate passing. Eval set updated if applicable.

Example

As a Data Steward, I need 50 labelled support ticket examples so that we can establish a baseline eval for the summariser.

Eval Story

Eval
Format

As a QA Engineer, I need [evaluation capability] so that [quality signal] is measurable.

Estimation

Story points (typically 2–5 pts)

Definition of Done

Eval runs in CI. Results logged to eval database. Dashboard updated.

Example

As a QA Engineer, I need an LLM-as-judge scorer for faithfulness so that RAG outputs are evaluated automatically on every PR.
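An LLM-as-judge faithfulness scorer like the one in this example typically pairs a grading prompt with strict parsing of the judge's reply. A hypothetical sketch; the prompt wording and the `parse_judge_score` helper are illustrative, not part of any Higain.ai API:

```python
JUDGE_PROMPT = """You are grading a RAG answer for faithfulness.

Context:
{context}

Answer:
{answer}

Reply with a single integer from 1 to 5, where 5 means every claim
in the answer is supported by the context."""

def build_judge_prompt(context: str, answer: str) -> str:
    """Fill the grading template with the retrieved context and answer."""
    return JUDGE_PROMPT.format(context=context, answer=answer)

def parse_judge_score(raw_reply: str) -> int:
    """Extract the 1-5 integer from the judge model's reply.

    Raises ValueError on a malformed or out-of-range reply, so a
    flaky judge fails the pipeline loudly instead of silently.
    """
    score = int(raw_reply.strip().split()[0])
    if not 1 <= score <= 5:
        raise ValueError(f"judge score out of range: {score}")
    return score
```

Logging both the parsed score and the raw judge reply makes judge drift auditable later.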

Spike

Spike
Format

Investigate [question] to inform [decision]. Time-box: [N] days.

Estimation

Not pointed — time-boxed. Maximum 1 sprint.

Definition of Done

Written findings doc shared with team. Go/no-go recommendation made. Spike branch deleted.

Example

Investigate whether DeepSeek-R1 outperforms Llama 3.1 70B on Nepali-language reasoning tasks. Time-box: 3 days.

Definition of Done for AI Features

An AI feature is not "done" when the code is merged. It is done when these gates pass. Every item is mandatory unless explicitly waived by the PO with written justification.

Code & Integration

  • Feature code merged to main and all CI checks passing.
  • Unit tests cover application logic with LLM responses mocked.
  • Integration test calls real API against golden test cases.
  • Feature flag configured — off by default, switchable without deployment.
  • Fallback behaviour tested: what happens when the model returns an error or empty response?
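Unit-testing application logic with the LLM mocked might look like the sketch below, using Python's `unittest.mock`. The `client.complete(prompt=...)` interface and the `summarise` function are hypothetical stand-ins, not a documented SDK surface:

```python
from unittest.mock import Mock

def summarise(client, document: str) -> list[str]:
    """Application logic under test: calls the LLM client and
    post-processes its reply into at most three bullet strings."""
    raw = client.complete(prompt=f"Summarise in 3 bullets:\n{document}")
    bullets = [line.strip("- ").strip()
               for line in raw.splitlines() if line.strip()]
    return bullets[:3]

def test_summarise_with_mocked_llm():
    # The mock replaces the real model call, so the test is fast,
    # deterministic, and free.
    client = Mock()
    client.complete.return_value = "- one\n- two\n- three\n- four"
    assert summarise(client, "some document") == ["one", "two", "three"]
    client.complete.assert_called_once()
```

The mocked test exercises the parsing and truncation logic; the golden-case integration test (previous bullet) is what exercises the real API.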

Prompt & Model Quality

  • System prompt committed to version control with semantic version tag (e.g. v2.3.0).
  • Eval suite passes with no metric below the agreed minimum threshold.
  • No regression vs. baseline on any dimension by more than 5%.
  • Human spot-check: at least 10 samples reviewed, with a mean score ≥ 3.8.
  • Red team checklist completed: prompt injection, jailbreak attempts, edge inputs.
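The two quantitative gates above (no metric below its agreed minimum; no regression beyond 5% vs. baseline) can be checked mechanically in CI. A sketch, assuming metrics arrive as name-to-score dicts:

```python
def eval_gate(current: dict[str, float],
              baseline: dict[str, float],
              minimums: dict[str, float],
              max_regression: float = 0.05) -> list[str]:
    """Return a list of failure reasons; an empty list means the gate passes.

    Checks every metric against its agreed minimum threshold and
    against the baseline (no more than `max_regression` relative drop).
    """
    failures = []
    for metric, value in current.items():
        floor = minimums.get(metric)
        if floor is not None and value < floor:
            failures.append(f"{metric}={value:.3f} below minimum {floor:.3f}")
        base = baseline.get(metric)
        if base is not None and value < base * (1 - max_regression):
            failures.append(
                f"{metric}={value:.3f} regressed beyond "
                f"{max_regression:.0%} vs baseline {base:.3f}")
    return failures
```

Wiring this into the PR pipeline and failing the build on a non-empty list is what makes the eval gate enforceable rather than advisory.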

Cost & Performance

  • Token cost per request measured and within ±20% of the estimate from sprint planning.
  • p95 latency measured and within agreed SLA.
  • Load test run at 2× expected peak traffic without degradation.
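A minimal way to verify the cost and latency gates. The nearest-rank p95 and the ±20% tolerance check below are sketches under those assumptions, not a prescribed implementation:

```python
import math

def p95(latencies_ms: list[float]) -> float:
    """p95 latency via the nearest-rank method on a sorted sample."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered)) - 1  # zero-based index
    return ordered[rank]

def cost_within_estimate(measured_tokens: float,
                         estimated_tokens: float,
                         tolerance: float = 0.20) -> bool:
    """True when measured token usage is within the planning
    estimate's tolerance band (default ±20%)."""
    return abs(measured_tokens - estimated_tokens) <= tolerance * estimated_tokens
```

Both checks run on measurements from the load test at 2× expected peak, not on single-request samples.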

Observability & Compliance

  • Production alert configured for: error rate spike, latency breach, cost overrun, refusal rate anomaly.
  • Logging in place: every request logs model, tokens used, latency, and outcome (not the full prompt content unless required).
  • PII audit: confirm no personal data is sent to the model or logged in cleartext.
  • Data residency confirmed: all inference runs on Nepal-hosted Higain.ai infrastructure.
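The logging requirement might be met with one structured JSON line per request. A sketch; the field names are illustrative assumptions, and note that raw prompt content is deliberately excluded:

```python
import json
import time

def log_inference(model: str, tokens_used: int,
                  latency_ms: float, outcome: str) -> str:
    """Build one structured log line per request: model, token usage,
    latency, and outcome -- never the raw prompt content."""
    record = {
        "ts": time.time(),
        "model": model,
        "tokens_used": tokens_used,
        "latency_ms": latency_ms,
        "outcome": outcome,  # e.g. "ok", "error", "refusal", "timeout"
    }
    return json.dumps(record)
```

Structured lines like this are what the alerting rules (error rate, latency, cost, refusal anomalies) aggregate over.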

AI Risk Management

Maintain a dedicated AI Risk Register alongside the standard project risk log. Review at every Sprint Review. The following risk categories apply to all AI features.

Hallucination
  What it means: Model confidently outputs false information.
  Early warning signals: Eval faithfulness score drops; user complaints about wrong facts.
  Mitigation: RAG grounding; output citations; human review for high-stakes outputs.

Bias
  What it means: Systematically unfair outputs toward a group.
  Early warning signals: Disparate error rates across demographic slices in eval.
  Mitigation: Bias audit on eval set; diverse annotators; bias-specific red teaming.

Data Quality
  What it means: Garbage in, garbage out — bad examples corrupt the model's behaviour.
  Early warning signals: Eval score variance; inconsistent output quality.
  Mitigation: Data quality gates; inter-annotator agreement checks; regular audits.

Cost Overrun
  What it means: Token usage far exceeds budget, making the feature uneconomical.
  Early warning signals: Token cost per request trending up; context window creeping up.
  Mitigation: Token budgets in requirements; hard limits in code; cost alerts.

Prompt Injection
  What it means: Malicious user input overrides system instructions.
  Early warning signals: Unexpected instruction-following; system prompt leakage reports.
  Mitigation: Input sanitisation; delimiter isolation; output validation; rate limiting.

Regulatory / Privacy
  What it means: PII in prompts/logs; data leaving Nepal contrary to policy.
  Early warning signals: Accidental PII in logs; developer error routing to a foreign API.
  Mitigation: PII masking pipeline; data residency audit; all inference via Higain.ai.
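A register entry could be modelled in code so it can live next to the eval dashboard rather than in a slide deck. A hypothetical sketch using a Python dataclass; the field names and statuses are assumptions, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class RiskEntry:
    """One row of the AI Risk Register, reviewed at every Sprint Review."""
    category: str            # e.g. "Hallucination", "Cost Overrun"
    description: str
    warning_signals: list[str] = field(default_factory=list)
    mitigations: list[str] = field(default_factory=list)
    status: str = "open"     # open / mitigated / accepted

def open_risks(register: list[RiskEntry]) -> list[RiskEntry]:
    """Filter for the entries that still need review."""
    return [r for r in register if r.status == "open"]
```

Printing `open_risks(register)` at the top of each Sprint Review keeps the review from becoming a formality.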

Backlog Hygiene for AI Teams

  • Maintain separate swim lanes: Features · Prompts · Data · Evals · Ops. Do not mix them in a single flat backlog.
  • Prompt stories expire after 2 sprints if not picked up — models, context, and user expectations all shift too fast.
  • Spike results must produce a written decision doc. Never close a spike with verbal-only findings.
  • Track prompt engineering velocity separately. It is non-linear: 80% of gains come from the first 20% of iterations.
  • Data stories carry hidden dependencies — always tag them against the governance checklist from Phase 3.

AI Agile Cheatsheet

  • Sprint length: 2 weeks
  • Prompt Lab: Day 5–6, 1–2 hrs
  • Story types: Feature, Prompt, Data, Eval, Spike
  • Spike max: 1 sprint, no points
  • Eval gate in CI: every PR, block on regression
  • Human spot-check: ≥ 10 samples per story
  • Rollout start: Shadow → 1% → 10% → 100%
  • Monitoring: daily + weekly human review
  • Risk review: every Sprint Review