Aisulu Tustykbayeva | Sr Technical Program Manager, Cloud/AI Infra
My job is done when the system runs without me. I design to that standard from day one.
$10M+ revenue enabled · $4M+ largest program budget
Deadlines slip. Dependencies go dark. Escalations replace planning. I map the gaps, design the operating model, and deploy it across XFN teams without formal authority. The system runs. I'm not in the critical path.
Live revenue base. No margin for data loss. Engineers who disagree on sequencing. I design the dual-write window, the validation gate, and the rollback trigger before engineering starts. Not after the first incident.
I earn engineers' trust by showing up technically prepared: reading the specs, surfacing the design gap before it becomes a P0, running the test that finds what the architecture missed. That's the entry condition. Everything else follows.
Certification, contract, or competitor launch. I run the evidence framework, surface amber flags before they become findings, and close programs on schedule when missing the date has a public cost.
Two teams, no one accountable. Neither moves first. I find the root cause of the resistance, redesign the sequence so each team has a reason to go, and drive the decision to close.
PS Cloud Services: a multi-tenant IaaS platform processing 1.2M requests/second at peak across 7 data centers in Kazakhstan and Uzbekistan. Enterprise customers include fintech and oil production operators with contractual SLA and compliance requirements. Engineering organized across 8 service teams. None reported to me. I ran programs across all 8.
Built the company's first TPM practice from zero. Defined the operating model, established program governance, and grew the function over 4.5 years. Two direct reports promoted within 18 months.
| Program | Outcome | The hard constraint | Teams spanned | Skills applied |
|---|---|---|---|---|
| CRM migration via GraphQL federation | Zero downtime, $20M revenue base protected. 12 months to 4. Query p99: 3–5s → <500ms. | Live $20M revenue base. No margin for data loss. 5 platforms in disagreement on sequencing. | Platform, DBaaS, K8s, LBaaS, SRE | Hard: GraphQL federation; Dual-write migration; Data validation pipelines; PostgreSQL / WHMCS. Soft: Decision documentation; Root cause reframing; Schedule pressure resistance. |
| Enterprise churn prediction & early warning | Churn 15%→9%. Support alert action rate 87%. 11% enterprise net retention lift. $11.7M revenue base protected. | No team, no budget. Data across 5 disconnected systems with no shared identifier. | CRO, Sales, Support, Infrastructure, Data Analytics | Hard: ML model design (logistic regression, gradient boosting); SHAP feature attribution; Data unification (5 systems); Ceph storage telemetry. Soft: Program charter under constraint; Kill criteria definition; Support team co-design; Handoff discipline. |
| PCI DSS 4.0.1 certification, 7 DCs | Certified 3 weeks early. $3.2M ARR converted. Deploy frequency 40+/week maintained. | Competitor on the same deadline. 8 service teams, no direct authority. | LBaaS, DBaaS, K8s, Object Storage, Networking, Security, SRE, DevOps | Hard: PCI DSS 4.0.1; RAG evidence dashboards; Sidecar-emitted compliance signals; API schema governance. Soft: QSA negotiation; Concession strategy; Amber rule governance; Influence without authority. |
| DC expansion + 100 Gbps optical ring | 4 to 7 sites, 99.95% SLA maintained. $8M in deals unblocked. Sub-50ms failover. | 6 vendors, 2 countries. Design gap surfaced mid-program in synthetic load test. | Networking, Platform, SRE, Security + 6 external vendors | Hard: DWDM optical networking; G.8032 ERPS; BFD + BGP; Multi-DC architecture; etcd quorum topology. Soft: Vendor negotiation; Physical site verification; Financial modeling for technical decisions. |
| K8s provisioning automation (Terraform + GitOps) | 96% provisioning time reduction (4–6 hrs → 10–15 min). Error rate down 10×. | 8 skeptics. One never converted. System works regardless. | DevOps, K8s Platform, DBaaS, Networking, Security | Hard: Kubernetes / Helm; Terraform state management; GitOps CI gates; TOCTOU risk mitigation. Soft: Change management; Adoption without mandate; Rollout sequencing by trust. |
| Kubernetes cost visibility (Kubecost) | $400K in deals protected. License $15K→$11K. 620 eng-hours preserved. Adopted by 2 additional product teams. | VP decision already announced. Critical-path engineers named for the build. | Engineering (VP), Infrastructure, Sales, K8s Platform, DBaaS, LBaaS | Hard: Kubecost; OPA admission controller; Namespace label taxonomy; Build vs. buy analysis. Soft: Executive decision reversal; Cost modeling; Vendor contract negotiation; Decision documentation. |
| Firewall SDN migration (FortiGate-VM + NSX) | Latency: from 3× AWS to competitive. Zero customer impact across 7 DCs. | PCI-scoped migration. One incident ends the program permanently. | Networking, Security, K8s Platform, SRE, DBaaS | Hard: FortiGate-VM / NSX; SDN migration; Network segmentation; LBaaS traffic routing. Soft: Zero-incident governance; Rollback planning; Risk register management. |
| Infrastructure capacity planning & demand modeling | Forecast variance ±35%→±21% (40% improvement). Bottlenecks identified ahead of each DC expansion phase. | No demand telemetry. Forecasts based on sales pipeline estimates only. | Infrastructure, Finance, Sales, Engineering | Hard: What-if simulation design; Cost modeling; Traffic forecasting; Demand telemetry. Soft: Cross-functional data gathering; Forecast credibility building; Executive communication. |
| Cross-team observability & deadlock resolution | 6-month deadlock resolved in 8 weeks. $2M+ pricing decision enabled. $500K decommissioning savings. 40% capacity forecast improvement. | No owner. Previous mandate produced silent resistance. Neither team would move first. | Product, Infrastructure (LBaaS, K8s, Compute, Object Storage), Data | Hard: Prometheus / Grafana; Metrics taxonomy design; PoC design & execution; Schema freeze governance. Soft: Resistance root-cause mapping; Rollout sequencing by trust; Ambiguity ownership; Program durability under leadership change. |
| Delivery operating system | On-time delivery: 40% to 85%, company-wide. | No mandate. Adoption had to be earned, not assigned. | All 8 service teams + Product, Legal, Finance | Hard: OKR / KPI framework design; RAID logging; Dependency tracking; Operating model design. Soft: Influence without authority; Earned adoption; Company-wide change management. |
K8s provisioning time cut 96%: 4–6 hours to 10–15 min. Error rate down 10×.
4–6 weeks to 1 week via GraphQL federation across 5 legacy systems.
On-time delivery at 85%, company-wide. Up from 40%.
Slurm job failure classifier for a GPU cluster. Catches node-level faults (dead GPUs, driver failures, thermal throttling) before they cascade into downstream job failures. Stops scheduling to affected nodes. Outputs P1-to-P4 SLA-tiered alerts, runbooks per failure class, and cost-attributed reports per node.
Why it matters: A dead node that keeps accepting jobs burns cluster hours and delays training runs. The right intervention is at the scheduler, not the postmortem. Same framing a fleet ops team uses to make decommission decisions.
Stack: Python, Slurm, Prometheus, Grafana · NVIDIA NCA-AIIO certification in progress
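The classify-then-stop-scheduling flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `NodeHealth` fields, the 85 °C threshold, and the fault-to-tier mapping are all hypothetical, and the sketch only builds the `scontrol` drain command as a string rather than running it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NodeHealth:
    """Hypothetical per-node snapshot, e.g. assembled from Prometheus metrics."""
    node: str
    gpus_expected: int
    gpus_visible: int
    driver_ok: bool
    gpu_temp_c: float


# Assumed threshold for illustration; a real value depends on the hardware.
THERMAL_LIMIT_C = 85.0


def classify(h: NodeHealth) -> Optional[Tuple[str, str]]:
    """Return (fault_class, sla_tier) for a faulty node, or None if healthy.

    The tier mapping here is an assumption, not the project's actual policy.
    """
    if h.gpus_visible < h.gpus_expected:
        return ("dead_gpu", "P1")
    if not h.driver_ok:
        return ("driver_failure", "P2")
    if h.gpu_temp_c >= THERMAL_LIMIT_C:
        return ("thermal_throttling", "P3")
    return None


def drain_command(h: NodeHealth) -> Optional[str]:
    """Build the Slurm command that stops new jobs landing on a faulty node."""
    result = classify(h)
    if result is None:
        return None
    fault, tier = result
    return f'scontrol update NodeName={h.node} State=DRAIN Reason="{tier}:{fault}"'


if __name__ == "__main__":
    print(drain_command(NodeHealth("gpu-01", 8, 7, True, 60.0)))
```

Putting the tier and fault class in the drain reason keeps the alert, the runbook lookup, and the scheduler state tied to the same label.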
Interactive tool for structured LLM prompt optimization. Engineers routinely cargo-cult DSPy settings without understanding what GEPA/BayesOpt is actually optimizing or why the labeled dataset size matters. Built a 5-step walkthrough that enforces data minimums, parses plain English into typed Signatures, runs optimization with live progress, and shows before/after metric comparison.
Design constraint: GEPA requires a minimum labeled dataset before optimization produces meaningful results. Running without that gate produces confidently wrong output, the same failure class as building migration resolvers before completing the identity map. The gate is not optional; it's surfaced at step 2, not buried in docs.
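A data-minimum gate of this kind might look like the sketch below. It does not call DSPy or GEPA; the `MIN_LABELED` value of 50, the `label` key, and the `DataGateError` name are assumptions chosen for illustration, not the tool's real settings.

```python
from typing import Dict, List, Optional

# Assumed minimum for illustration only; the real tool's threshold is not stated.
MIN_LABELED = 50


class DataGateError(ValueError):
    """Raised when the labeled dataset is too small to optimize against."""


def check_data_gate(
    examples: List[Dict[str, Optional[str]]],
    min_labeled: int = MIN_LABELED,
) -> List[Dict[str, Optional[str]]]:
    """Return only the labeled examples, or refuse to proceed.

    An example counts as labeled when its 'label' value is a non-empty string.
    """
    labeled = [ex for ex in examples if ex.get("label")]
    if len(labeled) < min_labeled:
        raise DataGateError(
            f"{len(labeled)} labeled examples found; "
            f"at least {min_labeled} are required before optimization."
        )
    return labeled


if __name__ == "__main__":
    data = [{"input": f"q{i}", "label": "yes"} for i in range(10)]
    try:
        check_data_gate(data)
    except DataGateError as err:
        print(f"Gate closed: {err}")
```

Raising instead of warning is the point of the design: the optimization step cannot start until the gate returns.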
TPM framing: Mirrors wave-based migration planning. Each phase has a defined entry condition, a transformation, and an exit criterion. Same gate-and-validate structure, different domain.
General-purpose LLMs lack developmental scaffolding and safe routing for sensitive disclosures. Built a 2B-parameter on-device model with three hard constraints: a safety constitution (100% harmful-content interception in evaluation), a developmental scaffolding matrix limiting vocabulary by age, and a pre-vetted knowledge base that eliminates hallucination at the cost of topical flexibility.
Tradeoffs: Hard-coded safety constitution sacrifices adaptability for predictability. No persistent cross-session memory prioritizes privacy over personalization.
TPM artifact: Safety constitution used as deterministic QA acceptance criteria. Curriculum matrix drafted as the functional spec.
Tracking dependencies across complex program graphs is manual and error-prone. Built a tool that ingests freeform dependency descriptions, classifies each as CONFIRMED, ASSUMED, or AT-RISK, and exports directly into RAID-log format.
TPM framing: Encodes a senior infra TPM's three-tier dependency mental model directly into automated software. Removes the gap between how TPMs think about risk and how it gets tracked.
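The three-tier classification can be illustrated with a simple keyword heuristic. This is a sketch under stated assumptions, not the tool's actual logic: the keyword lists, the default of ASSUMED for anything unverified, and the RAID column names are all hypothetical.

```python
import csv
import io
from typing import Dict, List

# Hypothetical keyword lists; a real classifier would need richer signals.
AT_RISK_WORDS = ("blocked", "slipping", "no owner", "unstaffed", "at risk")
ASSUMED_WORDS = ("assume", "should", "expected", "probably", "tbd")
CONFIRMED_WORDS = ("confirmed", "signed off", "committed", "delivered")


def classify_dependency(text: str) -> str:
    """Classify one freeform description. Risk signals win over everything else."""
    lowered = text.lower()
    if any(word in lowered for word in AT_RISK_WORDS):
        return "AT-RISK"
    if any(word in lowered for word in ASSUMED_WORDS):
        return "ASSUMED"
    if any(word in lowered for word in CONFIRMED_WORDS):
        return "CONFIRMED"
    # Assumption: nothing is CONFIRMED without explicit evidence in the text.
    return "ASSUMED"


def to_raid_rows(descriptions: List[str]) -> List[Dict[str, str]]:
    """Map descriptions to RAID-log rows (assumed column names)."""
    return [
        {"Type": "Dependency", "Description": d, "Status": classify_dependency(d)}
        for d in descriptions
    ]


def to_raid_csv(descriptions: List[str]) -> str:
    """Export the rows as CSV text for pasting into a RAID log."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Type", "Description", "Status"])
    writer.writeheader()
    writer.writerows(to_raid_rows(descriptions))
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_raid_csv([
        "DBaaS team confirmed the schema freeze",
        "Networking should deliver the VLAN config next sprint",
        "Vendor shipment is blocked at customs",
    ]))
```

Checking risk keywords first means a dependency described as both confirmed and blocked lands in AT-RISK, which errs toward surfacing problems rather than hiding them.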
My first job was in sales. I noticed the enrollment process was broken, told the CEO, and ran the fix. I've been doing that ever since. Built the TPM function from scratch, which meant owning whatever the company needed: networking, compliance, migrations, cost governance. Often at once.
What I'm good at: finding the unlabeled problem that costs money, getting deep enough to act on it, and not becoming the engineer in the process.
Consulting as TPM at Algebras AI since May 2026. In parallel, building depth in GPU and AI infrastructure: a GPU fleet failure classifier (Python, Slurm, Prometheus), and a prompt optimization tool for ML engineers (Streamlit, DSPy). Both linked in Projects.
During the COVID-19 response in Astana, organized delivery of 4,500+ hot meals to frontline medical workers across multiple hospitals during lockdown, from idea to execution in 72 hours. Received a formal letter of gratitude from the Minister of Health of Kazakhstan.
Coverage: Forbes Kazakhstan · Caravan · New Times · Sputnik · New Times People of Year · Ministry of Health · Charity concert · Nazarbayev Foundation video
San Francisco, CA · Green Card holder
Funded peer-reviewed air quality research in Almaty in 2021. Findings published in Atmosphere (MDPI): Seasonal and Spatial Variation of Volatile Organic Compounds in Ambient Air of Almaty City, Kazakhstan.
Founded Anti.Smog in 2019: air pollution masks tied to a research fundraising campaign. Created the first Kazakhstani brand of weighted blankets in 2021.
Coverage: The Astana Times · The Village Kazakhstan · New Times
Nikita Krasulin · CEO, PS Cloud Services "The system that Aisulu implemented reduced escalations by 3x."
Miras Sagnayev · CEO, MyChina "Aisulu is a true professional, she helped us get the company to a new level by building operational systems we didn't know we needed."
Marina Yambayeva · Head of Product, PS Cloud Services "Aisulu's technical acumen has always helped us get on the same page with Engineering and ship faster."
Diana Safina · CBO, Algebras AI "I see a lot of progress in the way we run things since Aisulu has started, I'm impressed."