AI SDLC Methodology

AI-Integrated Development Lifecycle

A complete, opinionated framework for shipping AI-integrated software products — from identifying whether AI is the right solution to keeping it healthy in production. Designed to work alongside Agile/Scrum with AI-specific additions at every stage.

8 PhasesAI-Aware AgilePrompt EngineeringData GovernanceLLM EvaluationNepal Context

Discovery

AI Discovery & Problem Framing

Is AI the right solution here?

Before writing a single line of code, validate that AI will actually solve the problem. Most projects add AI where rule-based logic, a database query, or a simple algorithm would work better — and be far cheaper.

Decision matrix: should you use AI?

Signal	Use AI	Don't use AI
Output type	Natural language, images, structured extraction	A number, a boolean, a DB lookup
Rules complexity	Thousands of edge cases, hard to enumerate	< 50 deterministic rules
Data availability	Large corpus of examples or public knowledge	< 100 labeled examples
Tolerance	Occasional errors acceptable with human review	Zero error tolerance, safety-critical
Latency budget	> 200ms acceptable	< 50ms hard requirement

Key activities

Stakeholder interviews: who uses this feature, what outcome matters, what does failure look like?
AI literacy assessment: does the team understand probabilistic output and hallucination risk?
Data availability audit: does training/prompt data exist and is it legally usable?
Ethical pre-screen: bias, fairness, and consent implications of using AI here.
Nepal-specific constraints: connectivity, language (Nepali/English mix), regulatory requirements.
Build vs. fine-tune vs. prompt-only decision — documented and signed off.

→

For most Nepal-context applications, prompt-only integration with Higain.ai hosted models covers 80% of use cases without any training data or fine-tuning budget.

Deliverables

AI Feasibility Report

Problem Statement Doc

Build vs. Prompt Decision

Stakeholder Sign-off

AI-Aware Requirements Engineering

What should the system do — and how good is good enough?

AI requirements differ fundamentally from traditional ones. Outputs are probabilistic, not deterministic. Acceptance criteria must be ranges and rubrics — not binary pass/fail.

AI-specific functional requirements

Define "good enough" quantitatively: "Summarisation must score ≥ 0.82 ROUGE-L on our test set."
Specify confidence thresholds: when should the system abstain or escalate to a human?
Document fallback behaviour: what happens when the model returns an unusable response?
Identify human-in-the-loop checkpoints: which decisions require human approval before acting?
Define output schema: if the AI must return structured data (JSON, table), specify the schema.

Non-functional requirements

Token cost budget

≤ NPR 0.05 per user request

Latency target

p95 < 3s for chat; p95 < 500ms for classification

Hallucination tolerance

< 2% on factual Q&A eval set

Language support

Must handle Nepali–English code-switching

Data sovereignty

All inference on Nepal-hosted infrastructure

Uptime SLA

99.5% monthly, degraded mode if model unavailable

Writing AI user stories

# Standard user story

As a customer support agent,

I want the AI to summarise the customer's complaint in 2 sentences,

So that I can triage tickets 3× faster.

# Acceptance criteria (AI-specific)

- Summary length: 40–80 words

- Must include: root cause + customer sentiment

- Evaluated on 50 labelled tickets — human score ≥ 4/5 on 80%

- Must not include PII in output

Deliverables

AI Requirements Spec

Acceptance Criteria Rubrics

Token Cost Model

Non-functional Requirements Doc

Data Strategy & Governance

The fuel for your AI — collect, clean, govern.

Even when using pre-trained open source models with prompt-only integration, you still need a data strategy — for few-shot examples, RAG documents, and evaluation sets.

Data types you need

Prompt examples

Few-shot demonstrations in your system prompt. 5–20 high-quality examples.

RAG corpus

Documents the model retrieves from. Must be clean, chunked, and indexed.

Evaluation set

Gold-standard input/output pairs for testing. Min 50–200 examples.

Fine-tuning data

Only if prompting isn't enough. Needs 500+ labelled examples per task.

Feedback logs

Production thumbs-up/down from users. Invaluable for iteration.

Synthetic data

Use the model itself to generate training data for edge cases.

Nepali language considerations

Most models tokenise Devanagari less efficiently than Latin script — expect 1.5–2× more tokens per word.
Code-switching (Nepali + English in the same message) is common; test explicitly for this.
Qwen 2.5 14B and Llama 3.1 70B have the best Devanagari coverage among models on Higain.ai.
If building for Nepal government or education, consider a Nepali-language fine-tune (reach out to Higain Labs).

Governance checklist

Document data sources, licenses, and consent status for every dataset.
PII audit: mask names, phone numbers, and addresses before indexing into RAG.
Data residency: all storage and processing on Nepal-hosted infrastructure.
Version your datasets with DVC or equivalent — model behaviour is reproducible only if data is versioned.
Define data retention and deletion policy before launch.

→

For RAG pipelines, chunk documents at 300–500 tokens with 50-token overlap. Use Higain.ai's embeddings endpoint to generate vectors. Store in pgvector (PostgreSQL) or Chroma for lightweight local deployments.

Deliverables

Data Inventory & Lineage Doc

Evaluation Dataset (labelled)

Data Pipeline Design

Privacy & Consent Audit

AI System Design & Architecture

Blueprint for intelligence — structure before you build.

Choose the right integration pattern before writing code. The wrong architecture — fine-tuning when prompting would do, or direct API calls when you need RAG — is expensive to reverse mid-sprint.

Integration pattern selection

Prompt-onlyStart here

When: General tasks, Q&A, summarisation, translation, classification with ≤ 20 examples.

Trade-off: Fastest to build. Limited customisation. Context window is your only constraint.

RAG (Retrieval-Augmented Generation)Most common

When: Questions over your own documents, knowledge base, or frequently-updated data.

Trade-off: More infra (vector DB, chunking pipeline). Much better factual grounding.

Fine-tuningUse sparingly

When: Domain-specific style, very consistent output format, 500+ labelled examples available.

Trade-off: Expensive to train, maintain, and re-train. Only use if prompting clearly fails.

Agentic / Tool useAdvanced

When: Multi-step tasks, real-world actions (web search, API calls, code execution).

Trade-off: Complex to test and debug. Build agent evals before shipping to production.

Prompt architecture

# Recommended system prompt structure

[ROLE] You are a {persona} with expertise in {domain}.

[CONTEXT] You are serving users in Nepal. Respond in {language}.

[RULES] - Never fabricate specific numbers or citations.

- If unsure, say "I don't know" rather than guessing.

[OUTPUT FORMAT] Respond as JSON: {"answer": string, "confidence": 0–1}

[EXAMPLES] (few-shot examples go here)

Security: prompt injection mitigation

Treat user input as untrusted data — never concatenate it directly into the system prompt without sanitisation.
Use delimiters (XML tags or triple-quotes) to clearly separate system instructions from user content.
Apply output validation: if you expect JSON, parse it and reject anything that doesn't match the schema.
Rate-limit the API endpoint to prevent prompt flooding and token exhaustion attacks.

Deliverables

Architecture Decision Record (ADR)

System Prompt v1 (versioned)

RAG Schema (if applicable)

Token Budget Model

AI-Integrated Development Sprints

Build with models in the loop.

AI features span multiple disciplines simultaneously: prompt engineering, data plumbing, API integration, and conventional software engineering. Structure your sprints to reflect this — not as waterfall phases, but as parallel workstreams.

Sprint workstreams

Prompt Engineering

Draft & version system prompts
Curate few-shot examples
Test prompt variants against eval set
Commit winning prompt to git

Data Plumbing

Ingest & chunk documents (RAG)
Generate embeddings via Higain.ai API
Build vector search endpoint
Write data quality tests

Application Code

Integrate Higain.ai SDK
Implement streaming UI
Handle errors & fallbacks
Build feature flags for AI rollout

Evaluation & Tests

Run eval suite on every PR
Add regression test for each bug
Human spot-check on sample outputs
Track eval metrics in CI dashboard

Prompt version control

# prompts/summariser/v3.txt

You are a concise summariser for Nepali customer support tickets...

# prompts/summariser/CHANGELOG.md

v3 — Added Nepali language instruction. ROUGE-L: 0.84 → 0.89

v2 — Added output length constraint. Reduced verbosity.

v1 — Initial version.

Testing AI features

Unit test: mock the LLM response and test your application logic independently of the model.
Integration test: call the real API with a fixed seed (or deterministic model) against golden outputs.
Semantic assertion: don't test exact string equality — test meaning. Use embedding cosine similarity ≥ 0.92.
Regression gate: block merges if any eval metric drops below baseline by more than 5%.

→

Use Higain.ai's temperature=0 and seed=42 for deterministic outputs in CI tests.

Deliverables

Working AI feature (PR merged)

Versioned prompt files in git

Eval suite passing in CI

Feature flag config

AI Evaluation & Quality Assurance

How good is good enough? Measure before you ship.

LLM evaluation is different from traditional QA. You cannot write a test that checks whether an answer is "correct" with assertEquals. You need a layered evaluation strategy.

Evaluation dimensions

Faithfulness

Does the answer only use information from the source context? (critical for RAG)

Relevance

Does the response actually answer what was asked?

Accuracy

Is factual content verifiably correct against ground truth?

Safety

Does the output contain harmful, toxic, or biased content?

Fluency

Is the output grammatically correct and well-formed in the target language?

Format

Does the output match the expected structure (JSON schema, length, tone)?

Automated evaluation methods

LLM-as-judge: use a second Higain.ai model call to score outputs 1–5 on each dimension. Works well at scale.
Embedding similarity: compute cosine similarity between generated output and reference answer. Threshold ≥ 0.88.
ROUGE / BLEU: useful for summarisation and translation tasks where a reference output exists.
JSON schema validation: for structured outputs, parse and validate on every test run.
Regex / keyword checks: blunt but fast — ensure key terms appear or don't appear in output.

Human evaluation protocol

# Evaluation rubric (1–5 scale per dimension)

5 — Excellent: Fully correct, appropriately formatted, no issues

4 — Good: Minor issues not affecting usability

3 — Acceptable: Usable but noticeable problems

2 — Poor: Significant errors; would mislead or frustrate user

1 — Fail: Wrong, harmful, or completely off-topic

# Minimum bar for shipping: mean score ≥ 3.8, no dimension below 3.0

Red teaming

Jailbreak attempts: try to make the model ignore its system prompt ("ignore all previous instructions").
Prompt injection via user content: embed instructions inside user-supplied documents or input fields.
Edge inputs: extremely long inputs, empty inputs, non-UTF-8 characters, Devanagari mixed with emojis.
Adversarial personas: test as a malicious user trying to extract system prompt content.

Deliverables

Eval Report (pre-launch)

Automated Eval Suite

Human Eval Results

Red Team Log

AI Deployment & Release Strategy

Ship carefully — AI failures are visible and surprising.

AI features deserve more cautious rollout than conventional features because failures are often silent, hard to detect, and erode user trust quickly.

Rollout strategy

Shadow modeRecommended start

Run AI feature in parallel with existing system. Log outputs but don't show to users. Compare for 1 week before exposing.

Canary (1–5%)Week 1

Expose to a small percentage of real users. Monitor error rate, output quality signals, and user feedback intensely.

Progressive rollout (10% → 50% → 100%)Weeks 2–4

Increase exposure only if key metrics stay within defined thresholds. Each gate requires explicit sign-off.

Full rolloutPost-validation

100% traffic. Monitoring stays active. Rollback procedure documented and tested.

Model pinning

Always pin to a specific model version in production — never use a floating alias that updates automatically.
Test every model update in staging with your full eval suite before changing pinned version in prod.
Keep the previous model version available for 2 weeks as a rollback target.

CI/CD for AI applications

# .github/workflows/ai-quality.yml (example)

on: [pull_request]

jobs:

eval:

steps:

- run: python run_evals.py --model llama-3.1-70b

- run: python check_thresholds.py --min-score 3.8

- run: python validate_schemas.py

# Fails PR if any eval score drops > 5% from baseline

Deliverables

Deployment runbook

Rollback procedure

Model version lock file

CI/CD eval pipeline

AI Monitoring, Observability & Iteration

AI systems drift — watch them continuously.

Unlike traditional software, AI systems degrade silently. User behaviour changes, data distributions shift, and what worked in March may fail by September. Monitoring is not optional — it is a core feature.

Metrics to track in production

Latency (p50 / p95)

Performance

p95 < 3s

Token usage

Cost

Within budget ± 10%

API error rate

Reliability

< 0.5%

Refusal rate

Quality

< 2% (investigate if ↑)

User thumbs-down rate

Quality

< 5%

Output schema failures

Correctness

< 0.1%

Feedback loop design

Embed thumbs-up / thumbs-down on every AI response in your UI. Log with full context (prompt, output, model, timestamp).
Sample 2% of production traffic daily for async human review. Log scores back to your eval database.
Surface flagged outputs to the team every Monday. Triage into: prompt fix, data fix, model switch, or expected edge case.
Use production feedback as your next iteration's eval set — it represents real user intent better than synthetic data.

Drift detection

Input drift: track the distribution of prompt topics using embedding clustering. Alert if new cluster emerges with > 5% share.
Output drift: track average output length, refusal rate, and schema compliance weekly — sudden changes signal a problem.
Model-level drift: if Higain.ai updates the underlying model weights, re-run your full eval suite before trusting existing behaviour.

Continuous improvement cadence

DailyReview error rate and latency dashboards. Investigate any spike immediately.

WeeklyHuman review sample. Update eval set with new failures. Triage feedback backlog.

MonthlyFull eval run. Prompt improvement sprint. Token cost review and plan adjustment.

QuarterlyModel upgrade evaluation. Architecture review. Revisit requirements against user growth.

Deliverables

Monitoring dashboard (live)

Alert runbook

Weekly eval reports

Continuous improvement backlog

← API Reference AI Agile Framework