Kartik Makkar — Senior AI/ML Engineer & Software Architect

Applied ML Research

Domain-Specialized SLM Fine-Tuning — Gemma 12B (bf16 LoRA)
for Grounded RAG That Knows When to Refuse

Independent Project · 2025–2026 · bf16 LoRA on 8×H200 single-node DDP (Modal) · benchmarked head-to-head vs GPT-5.4 / GPT-5.4-mini

Fine-tuned Google Gemma 12B into a domain Q&A specialist to test whether a self-hosted model can hold its own against a frontier API on specialized grounded QA at a fraction of inference cost — building the retrieval, RAFT data-generation, training, and evaluation pipeline end-to-end. The honest result: it ties GPT-5.4-mini, trails full GPT-5.4 on answering, and decisively wins calibration — knowing when to refuse.

79.5% · Balanced answer/refuse accuracy (best of 3)
74% · Refuses unanswerable questions (frontier: 16%)
0.925 · RAGAS faithfulness (most grounded)

See the full evaluation — interactive charts, the answer/refuse confusion matrix, the bias-balanced judge panel & RAGAS. Open dashboard →

Calibration — the deployable edge. Over all 1,880 test cases, the 12B leads balanced answer/refuse accuracy 79.5% vs 67.4% / 57.3% and correctly refuses 74% of unanswerable questions vs the full frontier's 16% (which hallucinates an answer to 84% of them) — the no-hallucination property a high-stakes RAG copilot needs.
Judged fairly. A bias-balanced 3-model-family LLM-judge panel (independent majority, answer-order-swapped) at full coverage (1,572 grounded rows × both frontier tiers, via the Anthropic Batch API) cancels judge self-preference; the 12B ties GPT-5.4-mini and trails full GPT-5.4 — after exposing overlap metrics (ROUGE/BERTScore) as teacher-mimicry, not correctness.
Retriever-first, retriever-aware training. Froze a hybrid retriever (dense + BM25 + RRF + cross-encoder rerank, hit@5 ~0.78) before the generator, then built every RAFT example from the retriever's real top-5 output — its actual near-misses and genuinely unanswerable cases — so the training distribution matched deployment.
Training architecture. Unquantized bf16 LoRA (frozen base weights) on 8×H200 single-node DDP (Modal); val token-accuracy 0.934; leakage-safe holdout split at the source-entity level so near-duplicates can't span splits.
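A minimal sketch of how a balanced answer/refuse accuracy like the one reported above could be computed. It assumes the metric is the unweighted mean of the two per-class decision rates (answering when the context supports an answer, refusing when it does not), which is the standard balanced-accuracy definition; the project's exact formula is not stated here, and the `Case` record and toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    answerable: bool  # gold label: the retrieved context supports an answer
    refused: bool     # model behaviour: it declined to answer


def balanced_answer_refuse_accuracy(cases):
    """Mean of answer-rate on answerable cases and refuse-rate on unanswerable ones."""
    answerable = [c for c in cases if c.answerable]
    unanswerable = [c for c in cases if not c.answerable]
    if not answerable or not unanswerable:
        raise ValueError("need at least one case of each class")
    answer_rate = sum(not c.refused for c in answerable) / len(answerable)
    refuse_rate = sum(c.refused for c in unanswerable) / len(unanswerable)
    return (answer_rate + refuse_rate) / 2


# Toy data only (not project results): 3 of 4 answerable cases answered,
# 1 of 2 unanswerable cases refused.
toy_cases = [
    Case(answerable=True, refused=False),
    Case(answerable=True, refused=False),
    Case(answerable=True, refused=False),
    Case(answerable=True, refused=True),
    Case(answerable=False, refused=True),
    Case(answerable=False, refused=False),
]
toy_score = balanced_answer_refuse_accuracy(toy_cases)
```

Averaging the two rates keeps a model that always answers (or always refuses) from scoring well just because one class dominates the test set.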


MAKBench — Agent evaluation harness
Hermetic, cost-metered, reliability-first

Independent Project · 2026 · Docker sandboxes · LiteLLM gateway metering · split-phase grading · generated leaderboard · any CLI agent × any LLM provider

Open-source benchmark infrastructure: evaluate any CLI agent against any LLM on versioned packs of complex, multi-step enterprise and SWE tasks — full workflows over mock tool stacks and planted-defect repos, graded on exact outcome artifacts — and get numbers you can defend. Every attempt is hermetic — fresh container, fresh workspace, per-attempt budget-capped API key, gateway-only network — so dollars and tokens are metered at the network edge, never self-reported, and grading runs in a separate trusted container the agent can never touch. Repeated independent executions per cell turn "can it?" into "does it reliably?" — the consistency and flakiness metrics leaderboards can't compute from one run. Every published number recomputes from raw attempt records.
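The difference between "can it?" and "does it reliably?" can be illustrated with a short sketch that aggregates repeated attempts per (agent, task) cell. This is an illustration under assumptions, not MAKBench's actual code: the record shape `(agent, task, passed)` and the metric names are hypothetical.

```python
from collections import defaultdict


def cell_metrics(attempts):
    """Aggregate repeated attempts into per-cell reliability metrics.

    attempts: iterable of (agent, task, passed) tuples (hypothetical record shape).
    """
    cells = defaultdict(list)
    for agent, task, passed in attempts:
        cells[(agent, task)].append(bool(passed))

    metrics = {}
    for key, runs in cells.items():
        metrics[key] = {
            "attempts": len(runs),
            "pass_rate": sum(runs) / len(runs),
            "solved_once": any(runs),          # "can it?"
            "solved_always": all(runs),        # "does it reliably?"
            "flaky": any(runs) and not all(runs),  # mixed outcomes across runs
        }
    return metrics
```

A single run per cell can only ever populate `solved_once`; the other fields need at least two independent attempts to carry information.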

Results & method — charts, task prompts, grading, and the Nemotron follow-up. Open page →

Controlled attempts. Fresh workspace, same fixture and git baseline, digest-pinned agent images, fixed CPU/memory and wall-clock/step budgets, gateway-only egress.
Gateway metering. Per-attempt budget-capped virtual keys; provider secrets stay on the host; spend and tokens come from the call log (per-token vs flat-rate not mixed in rankings).
Split-phase grading. Agent sandbox and verifier container are separate; workspace is mounted read-only for checks. Tasks fail on a pristine fixture (no-op scores zero).
Complex tasks by design. Task packs are full workflows, not quiz items: multi-step enterprise operations over mock tool stacks (CRM, mail, calendar, ticketing, policy engines) and planted-defect SWE repos. Outcome-graded against hidden weighted checks, with an anti-triviality gate — a no-op agent scores exactly zero. A live demo board and its findings (including why the same model can be reliable on one scaffold and useless on another) are on the results page.
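Weighted outcome checks with an anti-triviality gate can be sketched roughly as follows. This is a hedged illustration, not the harness's implementation: the workspace is modelled as a plain dict, and `score`, `validate_task`, and the example checks are all hypothetical names.

```python
def score(workspace, checks):
    """Weighted fraction of passing checks; checks is a list of (weight, predicate)."""
    total = sum(weight for weight, _ in checks)
    if total <= 0:
        raise ValueError("checks must carry positive total weight")
    earned = sum(weight for weight, predicate in checks if predicate(workspace))
    return earned / total


def validate_task(pristine_workspace, checks):
    """Anti-triviality gate: the untouched fixture (a no-op agent) must score zero."""
    baseline = score(pristine_workspace, checks)
    if baseline != 0.0:
        raise ValueError(f"task is trivially satisfiable: no-op scores {baseline}")


# Hypothetical ticketing task: close the ticket and send a reply.
checks = [
    (2.0, lambda ws: ws.get("ticket_status") == "closed"),
    (1.0, lambda ws: ws.get("reply_sent") is True),
]
pristine = {"ticket_status": "open"}
validate_task(pristine, checks)  # passes silently: the no-op scores 0.0
partial = score({"ticket_status": "closed"}, checks)  # only the first check passes
```

Running the gate when a task is authored, rather than at grading time, means a check that passes on the pristine fixture is rejected before any agent is scored against it.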


Experience

Cisco Systems

Nov 2018 – Present · San Francisco Bay Area, CA · CX Healthcare · AppDynamics · CX Collaboration

Software Architect — Enterprise Healthcare AI Nov 2024 – Present

Lead architect for enterprise AI in a regulated healthcare setting — production systems that turn enterprise documents and application data into governed knowledge, search, chat, and multi-agent analysis. I design domain-agnostic agent platforms so new capabilities ship as declarative skills rather than hand-rolled agent loops.

Agentic systems. Domain-agnostic agent SDKs on LangGraph — declarative skills, schema-validated outputs, parallel orchestration, and bounded retries — plus supervisor / multi-agent patterns that ship new capabilities fast and safely.
Retrieval & RAG. Hybrid retrieval (vector + BM25 + Reciprocal Rank Fusion + re-ranking), grounded generation with deterministic citation, and natural-language-to-database querying.
Safe LLM automation. Governed code generation behind static-analysis allowlists, human approval, and sandboxed execution — so nothing ungoverned reaches production.
AI observability & evaluation. Self-hosted, HIPAA-aware tracing, token & cost accounting, and LLM-as-judge scoring for agentic systems.
Developer tooling & security. Multi-IDE AI developer tooling, an encrypted secret vault, and security / compliance (CSDL, OWASP) rigor at enterprise scale.
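For readers unfamiliar with Reciprocal Rank Fusion as used in the hybrid retrieval above, a generic sketch follows. It shows the textbook formula (each document scores the sum of 1 / (k + rank) over the rankings it appears in, with k = 60 as the commonly used constant), not the production system; the document IDs and the two input rankings are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    rankings: list of lists, each ordered best-first (e.g. one from vector
    search, one from BM25). Ties are broken by document ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from two retrievers over the same corpus.
dense_ranking = ["d1", "d2", "d3"]
bm25_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```

Because the fusion uses ranks only, it needs no calibration between vector-similarity scores and BM25 scores; a re-ranker would then typically be applied to the top of the fused list.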

Software Consulting Engineer IV — CX Healthcare Sep 2022 – Nov 2024

Built an enterprise assessment-automation platform from the ground up with a small team — compressing weeks of expert work into repeatable, automated workflows for healthcare and life-sciences customers.

Platform build. Architected and shipped end-to-end across Java 21 / Spring Boot / Spring Cloud microservices, a Python AI service, and an Angular standalone UI; containerized delivery on a hardened JRE base image.
Delivery. Led parallel delivery of new product capabilities, then rationalized them into reusable platform components and a shared scoring model.
Security hardening. Centralized secrets management; drove static-analysis and compliance — Spring / security upgrades, OWASP ZAP, OAuth + CSP hardening.

Software Consulting Engineer IV — AppDynamics, Customer Engineering Feb 2022 – Sep 2022

Customer-facing engineering-consulting role on observability strategy for large enterprise customers across telecom, distribution, and travel.

Technical SME. Embedded with enterprise AppDynamics customers as an architecture-level SME. Codified KPIs with engineering leadership, engaged the C-suite to capture the observability vision, and translated it into engineering requirements through delivery. Earned a top-quartile bonus and a formal recommendation from the engineering lead.

Software Consulting Engineer III → IV — CX Collaboration Nov 2018 – Feb 2022

Engineering lead on an enterprise collaboration-provisioning platform serving a Big Four U.S. bank and a global technology company. Promoted SCE III → IV in Apr 2021.

Architecture lead. Owned the data-extraction and data-aggregation services; drove cluster consolidation, data-migration services, and major enterprise customer rollouts.
Real-time metrics dashboard. Designed and shipped a full-stack, live operations dashboard end-to-end (Angular + Spring Cloud + MongoDB).

Infosys Limited

Aug 2011 – Nov 2018 · Bangalore → SF Bay Area

Backend / Systems Engineer — Global-bank provisioning platform 2011 – 2017 · Bangalore

Built Java + jBPM workflow scripts and Spring REST + SOAP-WS APIs for a major global bank's enterprise provisioning platform. Processed 500K+ requests across a 5-year engagement. Promoted twice; relocated to the US in 2017.

Full-Stack Engineer — Custom Application Development & Integrations (CADI) 2017 – 2018 · SF Bay Area

Continued the major-bank engagement from the US delivery center; owned 100% of usability improvements; delivered a complete UI redesign and the platform's log-management system.

Technical Skills

AI / ML Skills

Fine-tuning SLMs / LLMs · Agentic SDK development · LangGraph · RAG / Hybrid Retrieval · LoRA / QLoRA · Synthetic data generation · LLM-as-judge · Pairwise evaluation · NL→MongoDB / NL→SQL · GraphRAG · BM25 / TF-IDF · Reciprocal Rank Fusion · BGE re-ranker · Matryoshka embeddings · Multi-agent orchestration · MCP · AI guardrails · HIPAA-aware AI observability · LLM codegen (AST-gated) · Prompt-injection defense

AI / ML Tools

PyTorch · HuggingFace Transformers · TRL · PEFT · MLflow · Modal · Databricks · LangChain · LangSmith · RAGAS · pandas · NumPy · scikit-learn · XGBoost · TensorFlow · AWS Bedrock · AWS SageMaker · GPT-4o · Anthropic Claude · Gemini · Llama · Azure OpenAI · Ollama

Backend Engineering & Databases

Python (FastAPI, Flask) · Java 21 · Spring Boot 3 / Spring Cloud · Maven · REST / gRPC / SSE · Microservices · Distributed systems · Event-driven architecture · MongoDB Atlas · PostgreSQL · DuckDB

Frontend & Security

Angular 18 · TypeScript · HIPAA-aware design · CSDL compliance · OWASP ZAP · Trivy · Semgrep · OAuth · JWT · AWS Secrets Manager · Age-encrypted vault

Infrastructure & Tools

Docker · Kubernetes · OpenShift · Terraform · CI/CD · Jenkins · AWS · GCP · Azure · Cursor · Claude Code · GitHub Copilot · Jira · Splunk

Education

M.S. Computer Science — Artificial Intelligence

Georgia Institute of Technology · Atlanta, GA (Online)

Dec 2024

B.Tech. Electronics & Communications Engineering

Punjabi University, Patiala · Punjab, India

Jan 2011