AI SDLC Methodology

AI-Integrated Development Lifecycle

A complete, opinionated framework for shipping AI-integrated software products — from identifying whether AI is the right solution to keeping it healthy in production. Designed to work alongside Agile/Scrum with AI-specific additions at every stage.

8 PhasesAI-Aware AgilePrompt EngineeringData GovernanceLLM EvaluationNepal Context
01

AI Discovery & Problem Framing

Is AI the right solution here?

Before writing a single line of code, validate that AI will actually solve the problem. Most projects add AI where rule-based logic, a database query, or a simple algorithm would work better — and be far cheaper.

Decision matrix: should you use AI?

SignalUse AIDon't use AI
Output typeNatural language, images, structured extractionA number, a boolean, a DB lookup
Rules complexityThousands of edge cases, hard to enumerate< 50 deterministic rules
Data availabilityLarge corpus of examples or public knowledge< 100 labeled examples
ToleranceOccasional errors acceptable with human reviewZero error tolerance, safety-critical
Latency budget> 200ms acceptable< 50ms hard requirement

Key activities

  • Stakeholder interviews: who uses this feature, what outcome matters, what does failure look like?
  • AI literacy assessment: does the team understand probabilistic output and hallucination risk?
  • Data availability audit: does training/prompt data exist and is it legally usable?
  • Ethical pre-screen: bias, fairness, and consent implications of using AI here.
  • Nepal-specific constraints: connectivity, language (Nepali/English mix), regulatory requirements.
  • Build vs. fine-tune vs. prompt-only decision — documented and signed off.

For most Nepal-context applications, prompt-only integration with Higain.ai hosted models covers 80% of use cases without any training data or fine-tuning budget.

Deliverables

AI Feasibility Report
Problem Statement Doc
Build vs. Prompt Decision
Stakeholder Sign-off
02

AI-Aware Requirements Engineering

What should the system do — and how good is good enough?

AI requirements differ fundamentally from traditional ones. Outputs are probabilistic, not deterministic. Acceptance criteria must be ranges and rubrics — not binary pass/fail.

AI-specific functional requirements

  • Define "good enough" quantitatively: "Summarisation must score ≥ 0.82 ROUGE-L on our test set."
  • Specify confidence thresholds: when should the system abstain or escalate to a human?
  • Document fallback behaviour: what happens when the model returns an unusable response?
  • Identify human-in-the-loop checkpoints: which decisions require human approval before acting?
  • Define output schema: if the AI must return structured data (JSON, table), specify the schema.

Non-functional requirements

Token cost budget
≤ NPR 0.05 per user request
Latency target
p95 < 3s for chat; p95 < 500ms for classification
Hallucination tolerance
< 2% on factual Q&A eval set
Language support
Must handle Nepali–English code-switching
Data sovereignty
All inference on Nepal-hosted infrastructure
Uptime SLA
99.5% monthly, degraded mode if model unavailable

Writing AI user stories

# Standard user story
As a customer support agent,
I want the AI to summarise the customer's complaint in 2 sentences,
So that I can triage tickets 3× faster.
# Acceptance criteria (AI-specific)
- Summary length: 40–80 words
- Must include: root cause + customer sentiment
- Evaluated on 50 labelled tickets — human score ≥ 4/5 on 80%
- Must not include PII in output

Deliverables

AI Requirements Spec
Acceptance Criteria Rubrics
Token Cost Model
Non-functional Requirements Doc
03

Data Strategy & Governance

The fuel for your AI — collect, clean, govern.

Even when using pre-trained open source models with prompt-only integration, you still need a data strategy — for few-shot examples, RAG documents, and evaluation sets.

Data types you need

Prompt examples
Few-shot demonstrations in your system prompt. 5–20 high-quality examples.
RAG corpus
Documents the model retrieves from. Must be clean, chunked, and indexed.
Evaluation set
Gold-standard input/output pairs for testing. Min 50–200 examples.
Fine-tuning data
Only if prompting isn't enough. Needs 500+ labelled examples per task.
Feedback logs
Production thumbs-up/down from users. Invaluable for iteration.
Synthetic data
Use the model itself to generate training data for edge cases.

Nepali language considerations

  • Most models tokenise Devanagari less efficiently than Latin script — expect 1.5–2× more tokens per word.
  • Code-switching (Nepali + English in the same message) is common; test explicitly for this.
  • Qwen 2.5 14B and Llama 3.1 70B have the best Devanagari coverage among models on Higain.ai.
  • If building for Nepal government or education, consider a Nepali-language fine-tune (reach out to Higain Labs).

Governance checklist

  • Document data sources, licenses, and consent status for every dataset.
  • PII audit: mask names, phone numbers, and addresses before indexing into RAG.
  • Data residency: all storage and processing on Nepal-hosted infrastructure.
  • Version your datasets with DVC or equivalent — model behaviour is reproducible only if data is versioned.
  • Define data retention and deletion policy before launch.

For RAG pipelines, chunk documents at 300–500 tokens with 50-token overlap. Use Higain.ai's embeddings endpoint to generate vectors. Store in pgvector (PostgreSQL) or Chroma for lightweight local deployments.

Deliverables

Data Inventory & Lineage Doc
Evaluation Dataset (labelled)
Data Pipeline Design
Privacy & Consent Audit
04

AI System Design & Architecture

Blueprint for intelligence — structure before you build.

Choose the right integration pattern before writing code. The wrong architecture — fine-tuning when prompting would do, or direct API calls when you need RAG — is expensive to reverse mid-sprint.

Integration pattern selection

Prompt-onlyStart here
When: General tasks, Q&A, summarisation, translation, classification with ≤ 20 examples.
Trade-off: Fastest to build. Limited customisation. Context window is your only constraint.
RAG (Retrieval-Augmented Generation)Most common
When: Questions over your own documents, knowledge base, or frequently-updated data.
Trade-off: More infra (vector DB, chunking pipeline). Much better factual grounding.
Fine-tuningUse sparingly
When: Domain-specific style, very consistent output format, 500+ labelled examples available.
Trade-off: Expensive to train, maintain, and re-train. Only use if prompting clearly fails.
Agentic / Tool useAdvanced
When: Multi-step tasks, real-world actions (web search, API calls, code execution).
Trade-off: Complex to test and debug. Build agent evals before shipping to production.

Prompt architecture

# Recommended system prompt structure
[ROLE] You are a {persona} with expertise in {domain}.
[CONTEXT] You are serving users in Nepal. Respond in {language}.
[RULES] - Never fabricate specific numbers or citations.
- If unsure, say "I don't know" rather than guessing.
[OUTPUT FORMAT] Respond as JSON: {"answer": string, "confidence": 0–1}
[EXAMPLES] (few-shot examples go here)

Security: prompt injection mitigation

  • Treat user input as untrusted data — never concatenate it directly into the system prompt without sanitisation.
  • Use delimiters (XML tags or triple-quotes) to clearly separate system instructions from user content.
  • Apply output validation: if you expect JSON, parse it and reject anything that doesn't match the schema.
  • Rate-limit the API endpoint to prevent prompt flooding and token exhaustion attacks.

Deliverables

Architecture Decision Record (ADR)
System Prompt v1 (versioned)
RAG Schema (if applicable)
Token Budget Model
05

AI-Integrated Development Sprints

Build with models in the loop.

AI features span multiple disciplines simultaneously: prompt engineering, data plumbing, API integration, and conventional software engineering. Structure your sprints to reflect this — not as waterfall phases, but as parallel workstreams.

Sprint workstreams

Prompt Engineering
  • Draft & version system prompts
  • Curate few-shot examples
  • Test prompt variants against eval set
  • Commit winning prompt to git
Data Plumbing
  • Ingest & chunk documents (RAG)
  • Generate embeddings via Higain.ai API
  • Build vector search endpoint
  • Write data quality tests
Application Code
  • Integrate Higain.ai SDK
  • Implement streaming UI
  • Handle errors & fallbacks
  • Build feature flags for AI rollout
Evaluation & Tests
  • Run eval suite on every PR
  • Add regression test for each bug
  • Human spot-check on sample outputs
  • Track eval metrics in CI dashboard

Prompt version control

# prompts/summariser/v3.txt
You are a concise summariser for Nepali customer support tickets...
# prompts/summariser/CHANGELOG.md
v3 — Added Nepali language instruction. ROUGE-L: 0.84 → 0.89
v2 — Added output length constraint. Reduced verbosity.
v1 — Initial version.

Testing AI features

  • Unit test: mock the LLM response and test your application logic independently of the model.
  • Integration test: call the real API with a fixed seed (or deterministic model) against golden outputs.
  • Semantic assertion: don't test exact string equality — test meaning. Use embedding cosine similarity ≥ 0.92.
  • Regression gate: block merges if any eval metric drops below baseline by more than 5%.

Use Higain.ai's temperature=0 and seed=42 for deterministic outputs in CI tests.

Deliverables

Working AI feature (PR merged)
Versioned prompt files in git
Eval suite passing in CI
Feature flag config
06

AI Evaluation & Quality Assurance

How good is good enough? Measure before you ship.

LLM evaluation is different from traditional QA. You cannot write a test that checks whether an answer is "correct" with assertEquals. You need a layered evaluation strategy.

Evaluation dimensions

Faithfulness
Does the answer only use information from the source context? (critical for RAG)
Relevance
Does the response actually answer what was asked?
Accuracy
Is factual content verifiably correct against ground truth?
Safety
Does the output contain harmful, toxic, or biased content?
Fluency
Is the output grammatically correct and well-formed in the target language?
Format
Does the output match the expected structure (JSON schema, length, tone)?

Automated evaluation methods

  • LLM-as-judge: use a second Higain.ai model call to score outputs 1–5 on each dimension. Works well at scale.
  • Embedding similarity: compute cosine similarity between generated output and reference answer. Threshold ≥ 0.88.
  • ROUGE / BLEU: useful for summarisation and translation tasks where a reference output exists.
  • JSON schema validation: for structured outputs, parse and validate on every test run.
  • Regex / keyword checks: blunt but fast — ensure key terms appear or don't appear in output.

Human evaluation protocol

# Evaluation rubric (1–5 scale per dimension)
5 — Excellent: Fully correct, appropriately formatted, no issues
4 — Good: Minor issues not affecting usability
3 — Acceptable: Usable but noticeable problems
2 — Poor: Significant errors; would mislead or frustrate user
1 — Fail: Wrong, harmful, or completely off-topic
# Minimum bar for shipping: mean score ≥ 3.8, no dimension below 3.0

Red teaming

  • Jailbreak attempts: try to make the model ignore its system prompt ("ignore all previous instructions").
  • Prompt injection via user content: embed instructions inside user-supplied documents or input fields.
  • Edge inputs: extremely long inputs, empty inputs, non-UTF-8 characters, Devanagari mixed with emojis.
  • Adversarial personas: test as a malicious user trying to extract system prompt content.

Deliverables

Eval Report (pre-launch)
Automated Eval Suite
Human Eval Results
Red Team Log
07

AI Deployment & Release Strategy

Ship carefully — AI failures are visible and surprising.

AI features deserve more cautious rollout than conventional features because failures are often silent, hard to detect, and erode user trust quickly.

Rollout strategy

Shadow modeRecommended start

Run AI feature in parallel with existing system. Log outputs but don't show to users. Compare for 1 week before exposing.

Canary (1–5%)Week 1

Expose to a small percentage of real users. Monitor error rate, output quality signals, and user feedback intensely.

Progressive rollout (10% → 50% → 100%)Weeks 2–4

Increase exposure only if key metrics stay within defined thresholds. Each gate requires explicit sign-off.

Full rolloutPost-validation

100% traffic. Monitoring stays active. Rollback procedure documented and tested.

Model pinning

  • Always pin to a specific model version in production — never use a floating alias that updates automatically.
  • Test every model update in staging with your full eval suite before changing pinned version in prod.
  • Keep the previous model version available for 2 weeks as a rollback target.

CI/CD for AI applications

# .github/workflows/ai-quality.yml (example)
on: [pull_request]
jobs:
eval:
steps:
- run: python run_evals.py --model llama-3.1-70b
- run: python check_thresholds.py --min-score 3.8
- run: python validate_schemas.py
# Fails PR if any eval score drops > 5% from baseline

Deliverables

Deployment runbook
Rollback procedure
Model version lock file
CI/CD eval pipeline
08

AI Monitoring, Observability & Iteration

AI systems drift — watch them continuously.

Unlike traditional software, AI systems degrade silently. User behaviour changes, data distributions shift, and what worked in March may fail by September. Monitoring is not optional — it is a core feature.

Metrics to track in production

Latency (p50 / p95)
Performance
p95 < 3s
Token usage
Cost
Within budget ± 10%
API error rate
Reliability
< 0.5%
Refusal rate
Quality
< 2% (investigate if ↑)
User thumbs-down rate
Quality
< 5%
Output schema failures
Correctness
< 0.1%

Feedback loop design

  • Embed thumbs-up / thumbs-down on every AI response in your UI. Log with full context (prompt, output, model, timestamp).
  • Sample 2% of production traffic daily for async human review. Log scores back to your eval database.
  • Surface flagged outputs to the team every Monday. Triage into: prompt fix, data fix, model switch, or expected edge case.
  • Use production feedback as your next iteration's eval set — it represents real user intent better than synthetic data.

Drift detection

  • Input drift: track the distribution of prompt topics using embedding clustering. Alert if new cluster emerges with > 5% share.
  • Output drift: track average output length, refusal rate, and schema compliance weekly — sudden changes signal a problem.
  • Model-level drift: if Higain.ai updates the underlying model weights, re-run your full eval suite before trusting existing behaviour.

Continuous improvement cadence

DailyReview error rate and latency dashboards. Investigate any spike immediately.
WeeklyHuman review sample. Update eval set with new failures. Triage feedback backlog.
MonthlyFull eval run. Prompt improvement sprint. Token cost review and plan adjustment.
QuarterlyModel upgrade evaluation. Architecture review. Revisit requirements against user growth.

Deliverables

Monitoring dashboard (live)
Alert runbook
Weekly eval reports
Continuous improvement backlog