Aisulu Tustykbayeva | Senior TPM, Infrastructure

The problems I'm hired to solve

Nobody owns delivery

Deadlines slip. Dependencies go dark. Escalations replace planning. I map the gaps, design the operating model, and deploy it across cross-functional teams without formal authority. The system runs. I'm not in the critical path.

Production system replacement with no downtime

Live revenue base. No margin for data loss. Engineers who disagree on sequencing. I design the dual-write window, the validation gate, and the rollback trigger before engineering starts. Not after the first incident.

Technical depth is required

I earn engineers' trust by showing up technically prepared: reading the specs, surfacing the design gap before it becomes a P0, running the test that finds what the architecture missed. That's the entry condition. Everything else follows.

Hard external deadline attached to a dollar amount

Certification, contract, or competitor launch. I run the evidence framework, surface amber flags before they become findings, and close programs on schedule when missing the date has a public cost.

Technical surface, organizational root

Two teams, no one accountable. Neither moves first. I find the root cause of the resistance, redesign the sequence so each team has a reason to go, and drive the decision to close.

Program portfolio

The company

PS Cloud Services: a multi-tenant IaaS platform processing 1.2M requests/second at peak across 7 data centers in Kazakhstan and Uzbekistan. Enterprise customers include fintech and oil production operators with contractual SLA and compliance requirements. Engineering organized across 8 service teams. None reported to me. I ran programs across all 8.

What I built there

Built the company's first TPM practice from zero. Defined the operating model, established program governance, and grew the function over 4.5 years. Two direct reports promoted within 18 months.

Program	Outcome	The hard constraint	Teams spanned	Skills applied
CRM migration via GraphQL federation	Zero downtime, $20M revenue base protected. 12 months to 4. Query p99: 3–5s → <500ms.	Live $20M revenue base. No margin for data loss. 5 platforms in disagreement on sequencing.	Platform, DBaaS, K8s, LBaaS, SRE	Hard: GraphQL federation · Dual-write migration · Data validation pipelines · PostgreSQL / WHMCS; Soft: Decision documentation · Root cause reframing · Schedule pressure resistance
Enterprise churn prediction & early warning	Churn 15%→9%. Support alert action rate 87%. 11% enterprise net retention lift. $11.7M revenue base protected.	No team, no budget. Data across 5 disconnected systems with no shared identifier.	CRO, Sales, Support, Infrastructure, Data Analytics	Hard: ML model design (logistic regression, gradient boosting) · SHAP feature attribution · Data unification (5 systems) · Ceph storage telemetry; Soft: Program charter under constraint · Kill criteria definition · Support team co-design · Handoff discipline
PCI DSS 4.0.1 certification, 7 DCs	Certified 3 weeks early. $3.2M ARR converted. Deploy frequency 40+/week maintained.	Competitor on the same deadline. 8 service teams, no direct authority.	LBaaS, DBaaS, K8s, Object Storage, Networking, Security, SRE, DevOps	Hard: PCI DSS 4.0.1 · RAG evidence dashboards · Sidecar-emitted compliance signals · API schema governance; Soft: QSA negotiation · Concession strategy · Amber rule governance · Influence without authority
DC expansion + 100 Gbps optical ring	4 to 7 sites, 99.95% SLA maintained. $8M in deals unblocked. Sub-50ms failover.	6 vendors, 2 countries. Design gap surfaced mid-program in synthetic load test.	Networking, Platform, SRE, Security + 6 external vendors	Hard: DWDM optical networking · G.8032 ERPS · BFD + BGP · Multi-DC architecture · etcd quorum topology; Soft: Vendor negotiation · Physical site verification · Financial modeling for technical decisions
K8s provisioning automation (Terraform + GitOps)	96% provisioning time reduction (4–6 hrs → 10–15 min). Error rate down 10×.	8 skeptics. One never converted. System works regardless.	DevOps, K8s Platform, DBaaS, Networking, Security	Hard: Kubernetes / Helm · Terraform state management · GitOps CI gates · TOCTOU risk mitigation; Soft: Change management · Adoption without mandate · Rollout sequencing by trust
Kubernetes cost visibility (Kubecost)	$400K deals protected. License $15K→$11K. 620 eng-hours preserved. Adopted by 2 additional product teams.	VP decision already announced. Critical-path engineers named for the build.	Engineering (VP), Infrastructure, Sales, K8s Platform, DBaaS, LBaaS	Hard: Kubecost · OPA admission controller · Namespace label taxonomy · Build vs. buy analysis; Soft: Executive decision reversal · Cost modeling · Vendor contract negotiation · Decision documentation
Firewall SDN migration (FortiGate-VM + NSX)	Latency 3× AWS to competitive. Zero customer impact across 7 DCs.	PCI-scoped migration. One incident ends the program permanently.	Networking, Security, K8s Platform, SRE, DBaaS	Hard: FortiGate-VM / NSX · SDN migration · Network segmentation · LBaaS traffic routing; Soft: Zero-incident governance · Rollback planning · Risk register management
Infrastructure capacity planning & demand modeling	Forecast variance ±35%→±21% (40% improvement). Bottlenecks identified ahead of each DC expansion phase.	No demand telemetry. Forecasts based on sales pipeline estimates only.	Infrastructure, Finance, Sales, Engineering	Hard: What-if simulation design · Cost modeling · Traffic forecasting · Demand telemetry; Soft: Cross-functional data gathering · Forecast credibility building · Executive communication
Cross-team observability & deadlock resolution	6-month deadlock resolved in 8 weeks. $2M+ pricing decision enabled. $500K decommissioning savings. 40% capacity forecast improvement.	No owner. Previous mandate produced silent resistance. Neither team would move first.	Product, Infrastructure (LBaaS, K8s, Compute, Object Storage), Data	Hard: Prometheus / Grafana · Metrics taxonomy design · PoC design & execution · Schema freeze governance; Soft: Resistance root-cause mapping · Rollout sequencing by trust · Ambiguity ownership · Program durability under leadership change
Delivery operating system	On-time delivery: 40% to 85%, company-wide.	No mandate. Adoption had to be earned, not assigned.	All 8 service teams + Product, Legal, Finance	Hard: OKR / KPI framework design · RAID logging · Dependency tracking · Operating model design; Soft: Influence without authority · Earned adoption · Company-wide change management

96%

K8s Provisioning Reduction

4–6 hours to 10–15 min. Error rate down 10×.

75%

New-Service Integration Faster

4–6 weeks to 1 week via GraphQL federation across 5 legacy systems.

85%

On-Time Delivery

Company-wide. Up from 40%.

Skills

AI & GPU Infrastructure

GPU fleet lifecycle management · Compute stack delivery · Node-level fault detection · Slurm / job scheduling · GPU cost attribution · ML observability

Infrastructure & Systems

DC build program execution · Multi-DC architecture · Bare-metal & on-prem infrastructure · Optical networking (DWDM, G.8032) · BFD + BGP / traffic engineering · L4/L7 load balancing · Kubernetes / Helm / Terraform / GitOps · Prometheus / Grafana / Kubecost

Program Delivery

Portfolio management · Technical Architecture Review · Quarterly strategic review · Cross-org dependency management · Agile / hybrid delivery · Wave-based migration planning · RAID / risk management · Operating model design · Hardware procurement & vendor management · Facilitation · Consensus building · Strategic execution

Compliance & Data

PCI DSS 4.0.1 · QSA audit management · Zero-downtime migration · Data integrity validation · Compliance evidence governance

I build to stay close to the systems my programs run on

Each project starts with a real operational problem, named constraints, and artifacts that belong in a program tracker.

GPU fleet reliability classifier

Python · Slurm · Prometheus · Grafana

Slurm job failure classifier for a GPU cluster. Catches node-level faults (dead GPUs, driver failures, thermal throttling) before they cascade into downstream job failures. Stops scheduling to affected nodes. Outputs P1-to-P4 SLA-tiered alerts, runbooks per failure class, and cost-attributed reports per node.

Why it matters: A dead node that keeps accepting jobs burns cluster hours and delays training runs. The right intervention is at the scheduler, not the postmortem. Same framing a fleet ops team uses to make decommission decisions.
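
A minimal sketch of the classify-then-drain idea, not the project's actual code. The log signatures, the severity mapping, and the rule that only P1/P2 faults drain a node are all assumptions made for illustration; `scontrol update NodeName=... State=DRAIN Reason=...` is the standard Slurm way to stop scheduling to a node, and the sketch only builds that command string without running it.

```python
import re
from dataclasses import dataclass

# Hypothetical log signatures per failure class; a real deployment would
# derive these from its own driver and scheduler logs.
FAULT_PATTERNS = {
    "dead_gpu": re.compile(r"fallen off the bus|GPU is lost", re.IGNORECASE),
    "driver_failure": re.compile(r"driver.*(mismatch|failed)", re.IGNORECASE),
    "thermal_throttling": re.compile(r"thermal|slowdown", re.IGNORECASE),
}

# Assumed severity mapping (P1 most severe); not the project's actual tiers.
SEVERITY = {"dead_gpu": "P1", "driver_failure": "P2", "thermal_throttling": "P3"}


@dataclass
class Verdict:
    node: str
    fault: str      # failure class, or "unclassified"
    severity: str   # P1..P4
    drain: bool     # whether to stop scheduling to this node


def classify(node: str, log_line: str) -> Verdict:
    """Map one failure log line to a failure class and alert tier."""
    for fault, pattern in FAULT_PATTERNS.items():
        if pattern.search(log_line):
            severity = SEVERITY[fault]
            # Assumption: only P1/P2 faults take the node out of rotation.
            return Verdict(node, fault, severity, drain=severity in ("P1", "P2"))
    return Verdict(node, "unclassified", "P4", drain=False)


def drain_command(verdict: Verdict) -> str:
    """Build (but do not run) the scontrol command that drains the node."""
    return f'scontrol update NodeName={verdict.node} State=DRAIN Reason="{verdict.fault}"'
```

Keeping the drain decision separate from the classification means the threshold for pulling a node can change without touching the fault signatures.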

Stack: Python, Slurm, Prometheus, Grafana · NVIDIA NCA-AIIO certification in progress

Code →

DSPy Prompt Optimizer

Python · Streamlit · DSPy · LLM Eval

Interactive tool for structured LLM prompt optimization. Engineers routinely cargo-cult DSPy settings without understanding what GEPA/BayesOpt is actually optimizing or why the labeled dataset size matters. Built a 5-step walkthrough that enforces data minimums, parses plain English into typed Signatures, runs optimization with live progress, and shows before/after metric comparison.

Design constraint: GEPA requires a minimum labeled dataset before optimization produces meaningful results. Running without that gate produces confidently wrong output, the same failure class as building migration resolvers before completing the identity map. The gate is not optional; it's surfaced at step 2, not buried in docs.
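
The gate described above can be sketched in plain Python. This is an illustration under stated assumptions, not the tool's code: the threshold of 50, the `input`/`label` record keys, and the `optimize` callable are hypothetical placeholders, and no DSPy API is called here.

```python
from dataclasses import dataclass

# Hypothetical threshold for illustration; the tool's actual minimum is not
# stated here and would depend on the optimizer and task.
MIN_LABELED_EXAMPLES = 50


@dataclass
class GateResult:
    passed: bool
    reason: str


def check_dataset_gate(examples: list[dict], minimum: int = MIN_LABELED_EXAMPLES) -> GateResult:
    """Entry condition: enough examples that carry both an input and a label."""
    labeled = [
        ex for ex in examples
        if ex.get("input") is not None and ex.get("label") is not None
    ]
    if len(labeled) < minimum:
        return GateResult(False, f"{len(labeled)} labeled examples, need {minimum}")
    return GateResult(True, f"{len(labeled)} labeled examples")


def run_optimization(examples: list[dict], optimize) -> object:
    """Refuse to call the optimizer until the gate passes."""
    gate = check_dataset_gate(examples)
    if not gate.passed:
        raise ValueError(f"Optimization blocked: {gate.reason}")
    return optimize(examples)
```

Raising instead of warning is the point of the design: an undersized dataset never reaches the optimizer, so there is no confidently wrong output to explain later.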

TPM framing: Mirrors wave-based migration planning. Each phase has a defined entry condition, a transformation, and an exit criterion. Same gate-and-validate structure, different domain.

Demo →

Also built

Safe AI Tutor for Kids 3–8

Agentic AI · 2B-Param On-Device · Vector DB · Content Governance

General-purpose LLMs lack developmental scaffolding and safe routing for sensitive disclosures. Built a 2B-parameter on-device model with three hard constraints: a safety constitution (100% harmful-content interception in evaluation), a developmental scaffolding matrix limiting vocabulary by age, and a pre-vetted knowledge base that eliminates hallucination at the cost of topical flexibility.

Tradeoffs: The hard-coded safety constitution sacrifices adaptability for predictability. Omitting persistent cross-session memory prioritizes privacy over personalization.

TPM artifact: Safety constitution used as deterministic QA acceptance criteria. Curriculum matrix drafted as the functional spec.

Dependency Risk Classifier

Claude API · Web App · Netlify

Tracking dependencies across complex program graphs is manual and error-prone. This tool ingests freeform dependency descriptions, classifies each as CONFIRMED, ASSUMED, or AT-RISK, and exports the result directly into RAID-log format.

TPM framing: Encodes a senior infra TPM's three-tier dependency mental model in software. Removes the gap between how TPMs think about risk and how it gets tracked.
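
A rough sketch of the three-tier classification and RAID export. The tool itself is tagged as using the Claude API; the keyword rules below are a deliberately simple stand-in for that model call, and the cue phrases and the CSV column layout are assumptions for illustration only.

```python
import csv
import io

# Hypothetical cue phrases; a stand-in for the model-based classification.
AT_RISK_CUES = ("blocked", "slipping", "no owner", "unstaffed", "delayed")
CONFIRMED_CUES = ("signed off", "committed", "delivered", "confirmed")


def classify_dependency(description: str) -> str:
    """Return CONFIRMED, ASSUMED, or AT-RISK for one freeform description."""
    text = description.lower()
    # Risk cues win over confirmation cues: a committed-but-slipping item is at risk.
    if any(cue in text for cue in AT_RISK_CUES):
        return "AT-RISK"
    if any(cue in text for cue in CONFIRMED_CUES):
        return "CONFIRMED"
    # No explicit evidence either way means the dependency is only assumed.
    return "ASSUMED"


def to_raid_rows(descriptions: list[str]) -> str:
    """Render classified dependencies as CSV rows for a RAID log."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["type", "description", "status"])
    for description in descriptions:
        writer.writerow(["Dependency", description, classify_dependency(description)])
    return buffer.getvalue()
```

Defaulting to ASSUMED when no evidence is found mirrors the mental model: a dependency nobody has explicitly confirmed is not yet safe to plan around. Substring matching is crude (it would misread "unconfirmed", for example), which is one reason a language model is the better fit for the real tool.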

About

My first job was in sales. I noticed the enrollment process was broken, told the CEO, and ran the fix. I've been doing that ever since. Built the TPM function from scratch, which meant owning whatever the company needed: networking, compliance, migrations, cost governance. Often at once.

What I'm good at: finding the unlabeled problem that costs money, getting deep enough to act on it, and not becoming the engineer in the process.

Currently

Consulting as TPM at Algebras AI since May 2026. In parallel, building depth in GPU and AI infrastructure: a GPU fleet failure classifier (Python, Slurm, Prometheus), and a prompt optimization tool for ML engineers (Streamlit, DSPy). Both linked in Projects.

Recognition

During the COVID-19 response in Astana, organized delivery of 4,500+ hot meals to frontline medical workers across multiple hospitals during lockdown, from idea to execution in 72 hours. Received a formal letter of gratitude from the Minister of Health of Kazakhstan.

Coverage: Forbes Kazakhstan · Caravan · New Times · Sputnik · New Times People of Year · Ministry of Health · Charity concert · Nazarbayev Foundation video

Credentials

M.Sc. Computer Science

PMP Certified

NVIDIA NCA-AIIO (in progress)

8+ years of experience

San Francisco, CA · Green Card holder

aisulu.tustykbayeva@gmail.com · linkedin.com/in/aisulu-t · Resume (PDF)

Beyond the program

Scientific research funding

Funded peer-reviewed air quality research in Almaty in 2021. Findings published in Atmosphere (MDPI): Seasonal and Spatial Variation of Volatile Organic Compounds in Ambient Air of Almaty City, Kazakhstan.

Early entrepreneurship

Founded Anti.Smog in 2019: air pollution masks tied to a research fundraising campaign. Created the first Kazakhstani brand of weighted blankets in 2021.

Coverage: The Astana Times · The Village Kazakhstan · New Times

Testimonials

Nikita Krasulin · CEO, PS Cloud Services "The system that Aisulu implemented reduced escalations by 3x."

Miras Sagnayev · CEO, MyChina "Aisulu is a true professional, she helped us get the company to a new level by building operational systems we didn't know we needed."

Marina Yambayeva · Head of Product, PS Cloud Services "Aisulu's technical acumen has always helped us get on the same page with Engineering and ship faster."

Diana Safina · CBO, Algebras AI "I see a lot of progress in the way we run things since Aisulu has started, I'm impressed."

I run infrastructure programs that ship.