Full-Stack AI Architect | Thomson Reuters & Bank of New York | Chemistry · Pharma · Biotech · Manufacturing

Beyond Correlation: Understanding Causality

15+ years building enterprise platforms that bridge research and production. Specializing in Causal AI + Simulation to create systems that don't just predict—they explain, simulate, and guide decision-making.

Executive Summary

The portfolio at a glance — key capabilities, approach, and value proposition

Who I Am

  • Full-Stack AI Solution Architect with 15+ years enterprise experience
  • Expert in Chemistry/Pharma, Biotech, and Advanced Manufacturing domains
  • Bridge between academic research and production-ready solutions

What I Bring

  • Causal AI & Simulation: Understanding 'why' and enabling 'what-if' scenarios
  • Enterprise Architecture: Scalable platforms serving 10,000+ users
  • Domain Translation: Converting complex science into operational systems
  • AI Education: Leading study groups bridging academia and industry

How I Work

  • Mechanistic Understanding: Physics-informed AI that explains decisions
  • Virtual Experimentation: Reduce physical trials by 70-90% through simulation
  • Production-Ready: From research prototype to enterprise deployment

Value I Create

  • Cost Reduction: Dramatically fewer physical experiments while maintaining rigor
  • Faster Innovation: Parallel virtual testing accelerates R&D cycles
  • Robust Decisions: Uncertainty quantification for risk-aware planning
  • Scalable Knowledge: Platforms that capture and amplify expert insights

Key Differentiators

Causal + Simulation
Not just prediction—mechanistic understanding
Research ↔ Production
Academic rigor meets enterprise scale
Domain Expertise
Deep understanding of scientific workflows

Core Expertise

Three pillars of value creation

Enterprise AI Architecture

15+ years building platforms that scale from research to production

  • Thomson Reuters: Multi-tenant SaaS serving 10,000+ users
  • Bank of New York: Real-time data pipelines processing billions of records
  • Microservices, event-driven architectures, and cloud-native design

Causal AI & Simulation

Going beyond 'what' to 'why' and 'what if'

  • Causal inference for understanding mechanisms, not just patterns
  • Physics-informed simulation for virtual experimentation
  • Uncertainty quantification and robust decision-making

Research ↔ Industry Translation

Making cutting-edge methods production-ready

  • Translating academic research into industrial applications
  • Teaching AI/ML foundations to industry practitioners
  • Building communities that bridge academia and business

Enterprise Case Studies

Real-world platform architecture from financial services and legal tech

Multi-Client AI Platform with Enhanced Data Security and Real-Time Performance

Process millions of legal documents daily for thousands of enterprise clients with real-time analysis, risk identification, and compliance checking.

Key Technical Challenges

Data Quality & Format Heterogeneity

Documents ranged from pristine PDFs to low-quality scanned images with handwritten annotations. No standardization across jurisdictions or document types.

Multi-Tenant Data Isolation at Scale

3,000+ clients with strict confidentiality requirements. Could not afford separate infrastructure per client due to cost, but regulatory compliance demanded zero data leakage.

Real-Time Performance with Complex AI

Legal professionals needed instant results, but document analysis required NLP models with 10+ seconds inference time. Traditional batch processing was unacceptable.

Key Architectural Solutions

Unified Data Ingestion
Solves: Data Quality & Format Heterogeneity

Standardized metadata across diverse document formats (PDF, scanned images, Excel). Built OCR pipeline with automatic error classification and human review queue.

Key Insight

Three-tier quality classification: auto-process high-confidence documents, flag medium-confidence for review, route low-quality to specialized OCR models. Feedback loop continuously improved classification thresholds.
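The three-tier routing described above can be sketched as follows. This is a minimal illustration, not the production pipeline: the threshold values, the `ParsedDocument` type, and the queue names are all assumptions made for the example, and in the real system the thresholds were tuned by the feedback loop rather than hard-coded.

```python
from dataclasses import dataclass

# Hypothetical starting thresholds; assumed values, not the ones used in production.
HIGH_CONFIDENCE = 0.95
LOW_CONFIDENCE = 0.70


@dataclass
class ParsedDocument:
    doc_id: str
    ocr_confidence: float  # parser's own confidence score, 0.0 to 1.0


def route(doc: ParsedDocument,
          high: float = HIGH_CONFIDENCE,
          low: float = LOW_CONFIDENCE) -> str:
    """Return the name of the queue a document should go to."""
    if not 0.0 <= doc.ocr_confidence <= 1.0:
        raise ValueError(f"confidence out of range for {doc.doc_id}")
    if doc.ocr_confidence >= high:
        return "auto_process"      # tier 1: no human involvement
    if doc.ocr_confidence >= low:
        return "human_review"      # tier 2: flagged for a reviewer
    return "specialized_ocr"       # tier 3: re-run with a stronger OCR model
```

Keeping the thresholds as parameters is what lets a feedback loop adjust them without touching the routing logic.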

Processed 5M+ documents daily with 99.2% parsing accuracy

Multi-Tenant Architecture
Solves: Multi-Tenant Data Isolation at Scale

Three-layer storage design: raw data isolation, processed data with schema separation, and anonymized feature store for cross-client AI training using differential privacy.

Key Insight

Physical data separation at raw layer (S3 bucket per client), logical separation at processed layer (database schemas), and privacy-preserving aggregation at feature store. Cost amortized through shared compute infrastructure.
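One way to picture the three layers is a single function that resolves where a tenant's data lives. The sketch below is illustrative only: the naming scheme (`raw-<tenant>`, `tenant_<tenant>`, `shared_anonymized_features`) and the tenant ID format are assumptions, not the actual conventions used on the platform.

```python
import re
from dataclasses import dataclass

# Assumed tenant ID format; strict validation keeps IDs safe to embed in names.
_TENANT_ID = re.compile(r"^[a-z0-9]{3,32}$")


@dataclass(frozen=True)
class TenantStorage:
    raw_bucket: str         # physical separation: one bucket per client
    processed_schema: str   # logical separation: one database schema per client
    feature_store: str      # shared, anonymized features only


def storage_for(tenant_id: str) -> TenantStorage:
    """Resolve all three storage layers for one tenant (hypothetical naming)."""
    if not _TENANT_ID.fullmatch(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return TenantStorage(
        raw_bucket=f"raw-{tenant_id}",
        processed_schema=f"tenant_{tenant_id}",
        feature_store="shared_anonymized_features",
    )
```

Routing every storage lookup through one validated function means no caller ever builds a bucket or schema name by hand, which is where cross-tenant mistakes tend to creep in.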

Served 3,000+ clients with zero data leakage incidents

MLOps at Scale
Solves: Real-Time Performance with Complex AI

Built Feature Store as central platform component. Automated model training pipeline with A/B testing, canary deployments, and real-time performance monitoring.

Key Insight

Pre-computed features for 80% of common queries, streaming inference for complex cases. Implemented model distillation: large models train small, fast models for production. Result caching with intelligent invalidation.
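The caching part of this design can be sketched in a few lines. The example assumes one particular invalidation rule, namely that a cached result is valid only for the model version that produced it; the source does not spell out the actual rule, and the `InferenceCache` class and its method names are hypothetical.

```python
import hashlib
from typing import Callable, Dict, Tuple


class InferenceCache:
    """Cache model outputs keyed by (model version, content hash)."""

    def __init__(self, slow_model: Callable[[str], str], model_version: str):
        self._slow_model = slow_model
        self._model_version = model_version
        self._results: Dict[Tuple[str, str], str] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, text: str) -> Tuple[str, str]:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return (self._model_version, digest)

    def analyze(self, text: str) -> str:
        key = self._key(text)
        if key in self._results:
            self.hits += 1            # fast path: answer already computed
            return self._results[key]
        self.misses += 1              # slow path: run the expensive model once
        result = self._slow_model(text)
        self._results[key] = result
        return result

    def deploy(self, slow_model: Callable[[str], str], model_version: str) -> None:
        """Swap in a new model and drop results produced by older versions."""
        self._slow_model = slow_model
        self._model_version = model_version
        self._results = {k: v for k, v in self._results.items()
                         if k[0] == model_version}
```

Because the version is part of the key, a new deployment can never serve an answer computed by its predecessor, even if the explicit cleanup step were skipped.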

Reduced model development cycle from weeks to days

Relevance to:

Pharmaceutical and chemical CROs face identical architectural challenges when scaling R&D data platforms.

Direct Architectural Parallels:
Multi-Source Experimental Data

Like legal documents, experimental data comes from diverse sources: HPLC chromatograms, NMR spectra, mass spec results, ELN entries, and analytical reports. Each instrument has different formats, quality levels, and metadata standards requiring unified ingestion pipelines.

Client Data Isolation & IP Protection

CROs serve competing pharmaceutical companies simultaneously. Client A's synthesis route for a cancer drug must be completely isolated from Client B's similar research. The multi-tenant architecture with three-layer isolation (raw data per client, processed data with logical separation, anonymized features for shared AI models) directly applies.

Real-Time AI for Decision Support

Chemists need instant prediction of reaction outcomes, toxicity, or synthesis routes while planning experiments. Pre-computed molecular features, model distillation for fast inference, and intelligent caching enable real-time AI without compromising model sophistication.

Value Proposition

Build a unified platform where multiple pharma clients can leverage shared AI capabilities (retrosynthesis prediction, property prediction, formulation optimization) while maintaining absolute data confidentiality and enabling real-time decision support for R&D teams.

Causal AI + Simulation: Beyond Predictive Models

Using causal inference and simulation to understand the 'why' beyond correlation, which is crucial for high-value, low-data domains like drug discovery

Traditional Predictive AI
What It Does

Finds patterns in historical data and predicts outcomes. 'If conditions match pattern X, expect outcome Y'

Limitations
  • Cannot explain why predictions work
  • Struggles with novel scenarios outside training data
  • Cannot answer 'what-if' questions about interventions
  • Correlation ≠ Causation: Prone to mistaking spurious associations for causal effects
  • No mechanistic understanding, cannot extrapolate

Causal AI + Simulation
What It Does

Models causal mechanisms, then simulates interventions before experiments. 'Temperature affects yield via kinetics and side reactions—simulate 1000 scenarios to find optimum'
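To make the temperature example concrete, here is a toy mechanistic model: a desired reaction and a side reaction run in parallel, both following Arrhenius kinetics, and 1,000 temperatures are scanned for the best yield. Every number (pre-exponential factors, activation energies, reaction time, temperature window) is an assumption chosen for illustration, not data from a real process.

```python
import math

R = 8.314  # gas constant, J/(mol*K)


def rate_constant(pre_exp: float, activation_energy: float, temp_k: float) -> float:
    """Arrhenius equation: k = A * exp(-Ea / (R * T))."""
    return pre_exp * math.exp(-activation_energy / (R * temp_k))


def product_yield(temp_k: float, time_s: float = 3600.0) -> float:
    """Yield of the desired product for two parallel first-order reactions."""
    k_main = rate_constant(1.0e6, 60_000.0, temp_k)    # hypothetical desired path
    k_side = rate_constant(1.0e10, 90_000.0, temp_k)   # hypothetical side reaction
    k_total = k_main + k_side
    conversion = 1.0 - math.exp(-k_total * time_s)     # fraction of reactant consumed
    selectivity = k_main / k_total                     # share going to desired product
    return conversion * selectivity


# Simulate 1,000 scenarios across an assumed 300-400 K operating window.
temperatures = [300.0 + 100.0 * i / 999 for i in range(1000)]
best_temp = max(temperatures, key=product_yield)
```

Because the side reaction has the higher activation energy, heating speeds up conversion but erodes selectivity, so the best temperature falls inside the window rather than at either end. That trade-off comes from the mechanism itself, which is what lets the model say something about conditions it has never seen.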

Advantages
  • Explains mechanisms: Why X causes Y through physical/biological pathways
  • Handles interventions: Predicts outcomes of actions never seen in training data
  • Virtual experiments: Test thousands of scenarios in simulation at near-zero marginal cost
  • Transfers knowledge: Mechanisms generalize across molecules, scales, domains
  • Physics-informed constraints ensure predictions obey natural laws

Case Study: Reaction Scale-Up via Causal Simulation

Challenge

Scale Suzuki coupling from 5L lab reactor to 500L pilot plant. Traditional approach: 20+ pilot batches at $50k each, 30% first-batch failure rate.

Approach

Build causal simulation framework: thermodynamics + mass transfer + reaction kinetics. Causal graph identifies key mechanisms: reactor geometry → mixing regime → local concentration gradients → selectivity. Simulate 200+ scale-up scenarios to predict failure modes before physical experiments.
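The causal chain named above can be written down as a small directed graph and queried for everything downstream of a design choice. This sketch only encodes the four nodes mentioned in the text; the variable names and the `downstream_effects` helper are illustrative assumptions, and a real scale-up graph would have many more nodes and branches.

```python
from typing import Dict, List

# Each key causes the variables in its list (edges taken from the chain above).
CAUSAL_GRAPH: Dict[str, List[str]] = {
    "reactor_geometry": ["mixing_regime"],
    "mixing_regime": ["local_concentration_gradients"],
    "local_concentration_gradients": ["selectivity"],
    "selectivity": [],
}


def downstream_effects(graph: Dict[str, List[str]], cause: str) -> List[str]:
    """Return every variable reachable from `cause`, in depth-first order."""
    seen: List[str] = []
    stack = list(reversed(graph.get(cause, [])))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(reversed(graph.get(node, [])))
    return seen
```

Asking for `downstream_effects(CAUSAL_GRAPH, "reactor_geometry")` lists the variables a simulation has to re-evaluate when the reactor changes from 5L to 500L.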

Result

First pilot batch achieved 88% yield (typical 60-70%). Only 3 batches needed for optimization vs 20, saving $850k and 5 months. Causal insights revealed critical agitation speeds to prevent hotspots.
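The savings figure follows directly from the batch counts and the per-batch cost quoted in the challenge. The short calculation below reproduces it, taking 20 batches as the baseline (the text says "20+", so this is the conservative end).

```python
COST_PER_BATCH_USD = 50_000   # from the challenge statement
TRADITIONAL_BATCHES = 20      # lower bound of "20+"
CAUSAL_BATCHES = 3            # batches actually needed

traditional_cost = TRADITIONAL_BATCHES * COST_PER_BATCH_USD   # 1,000,000
causal_cost = CAUSAL_BATCHES * COST_PER_BATCH_USD             # 150,000
savings = traditional_cost - causal_cost                      # 850,000
```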

Why Chemistry/Pharma Needs Causal AI + Simulation
  • Virtual Experiments: $2.6B per drug—simulate thousands of scenarios at zero marginal cost before lab work
  • Physics-Informed Learning: Combine mechanistic models with data—causal structure ensures physical plausibility and extrapolation
  • Regulatory Compliance: FDA/NMPA require mechanistic understanding—causal simulation provides both prediction and explanation

Technical Capabilities
  • Causal Discovery & Inference: Learn and reason over causal graphs using PC algorithm, NOTEARS, DoWhy
  • Hybrid Simulation: Physics-Informed Neural Networks (PINNs) with causal constraints for mechanistic modeling
  • Counterfactual Reasoning: Answer 'what-if'—virtually test process changes before expensive experiments
  • Digital Twin Integration: Real-time causal inference updates simulation parameters from live process data

Deep Dive: Understanding Causality in Practice

From 'Correlation' to 'Causality': How AI Can Truly Understand the 'Why' of Chemical Reactions

The core of chemical research lies in exploring causal relationships—understanding how changes in functional groups lead to differences in activity, and how adjustments to reaction conditions affect final yield and selectivity. However, traditional chemical AI models largely remain at the level of correlation learning. They can find patterns but cannot answer: Is B truly caused by A?

I. The Nature of Causality: From 'Observation' to 'Intervention'

1. Correlation: Based on 'Observation'

Question: 'In the existing data, is reactant feature X correlated with high yield?' Limitation: The association may be mere co-occurrence. For instance, suppose every reaction that used an expensive Pd catalyst was also recorded in a neatly kept lab notebook. A model might then incorrectly learn that 'neat lab notes' cause high yield, missing the true driver: the Pd catalyst. This is a spurious correlation.
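The lab-notebook example is easy to reproduce with simulated data. In the sketch below, the catalyst drives both the yield and the neatness of the notes, while the notes have no effect on yield at all. All probabilities and yield levels are made-up values for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
rows = []
for _ in range(5000):
    pd_catalyst = rng.random() < 0.5
    # Pd reactions are usually written up neatly; others usually are not.
    neat_notes = rng.random() < (0.9 if pd_catalyst else 0.1)
    # Yield depends on the catalyst only, never on the notes.
    yield_pct = (80.0 if pd_catalyst else 50.0) + rng.gauss(0.0, 5.0)
    rows.append((pd_catalyst, neat_notes, yield_pct))

overall = pearson([float(neat) for _, neat, _ in rows],
                  [y for _, _, y in rows])

# Hold the catalyst fixed and the association should vanish.
within = {}
for used_pd in (True, False):
    subset = [(neat, y) for cat, neat, y in rows if cat == used_pd]
    within[used_pd] = pearson([float(neat) for neat, _ in subset],
                              [y for _, y in subset])
```

Across all reactions, neat notes look strongly predictive of yield; within each catalyst group the correlation is close to zero. A purely correlational model only ever sees the first number.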

2. Causality: Based on 'Intervention'

Question: 'If I actively intervene and perform an operation do(X) on a reactant (e.g., force a functional group addition), how will the yield change causally?' Essence: The core of causality lies in the 'do' operator. It disregards the world's original state and asks what the outcome would be in a new world that has been intervened upon and altered. This is precisely how chemists think when designing experiments: 'If I make this change, what result will I get?'
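The difference between observing X and doing X can be shown with a tiny structural causal model. Here a hidden factor Z (say, a high-purity reagent lot) makes chemists more likely to apply change X and also raises the yield on its own. The structure and every coefficient are assumptions for the example; the true effect of X is set to +10 yield points.

```python
import random


def sample(rng, do_x=None):
    """Draw one reaction; pass do_x to force X regardless of Z."""
    z = 1 if rng.random() < 0.5 else 0
    if do_x is None:
        # Observational world: Z influences whether X is applied.
        x = 1 if rng.random() < (0.8 if z else 0.2) else 0
    else:
        # Interventional world: do(X) cuts the arrow from Z to X.
        x = do_x
    y = 50.0 + 10.0 * x + 20.0 * z + rng.gauss(0.0, 2.0)
    return x, y


def mean(values):
    return sum(values) / len(values)


rng = random.Random(1)  # fixed seed so the run is repeatable
observed = [sample(rng) for _ in range(20000)]
observational_diff = (mean([y for x, y in observed if x == 1])
                      - mean([y for x, y in observed if x == 0]))

interventional_diff = (mean([sample(rng, do_x=1)[1] for _ in range(20000)])
                       - mean([sample(rng, do_x=0)[1] for _ in range(20000)]))
```

The observational difference comes out near 22 because it mixes the effect of X with the effect of Z; the interventional difference recovers the built-in +10. The only change between the two runs is that `do_x` overrides how X is assigned.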

II. Core Applications of Causal AI in Chemistry

1. Counterfactual Reasoning & Molecular Editing: Answering 'What If'

Scenario: You have a lead compound with decent but not ideal activity. Traditional AI might suggest structures 'similar' to highly active ones but cannot quantify the effect of modifications. Causal AI can perform 'counterfactual' predictions: 'If we replace -NO₂ (electron-withdrawing) with -OCH₃ (electron-donating), by how much will binding affinity increase?' Significance: Elevates from 'suggesting similar compounds' to 'precisely guiding molecular structure optimization', greatly accelerating rational design in drug discovery and materials development.
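A counterfactual query of this kind follows three steps: infer what is specific to the observed compound (abduction), swap the substituent (action), and recompute the outcome (prediction). The sketch below applies them to a deliberately simple linear model. The intercept, the slope `RHO`, and the observed affinity are hypothetical; the sigma values are the commonly tabulated Hammett para constants and should be treated as approximate.

```python
# Approximate Hammett sigma-para constants for the two substituents.
SIGMA_PARA = {"NO2": 0.78, "OCH3": -0.27}

# Hypothetical structural equation: affinity = INTERCEPT + RHO * sigma + noise
INTERCEPT = 6.5   # assumed
RHO = -1.0        # assumed: electron-donating groups help binding in this toy


def predicted_affinity(sigma: float, noise: float = 0.0) -> float:
    return INTERCEPT + RHO * sigma + noise


def counterfactual_affinity(observed_group: str,
                            observed_affinity: float,
                            new_group: str) -> float:
    # 1. Abduction: recover the compound-specific noise term from the observation.
    noise = observed_affinity - predicted_affinity(SIGMA_PARA[observed_group])
    # 2. Action: replace the substituent.
    new_sigma = SIGMA_PARA[new_group]
    # 3. Prediction: same compound-specific term, new substituent.
    return predicted_affinity(new_sigma, noise)


observed = 6.0  # hypothetical measured affinity of the -NO2 compound
counterfactual = counterfactual_affinity("NO2", observed, "OCH3")
gain = counterfactual - observed  # about 1.05 under these assumptions
```

Carrying the noise term over is what separates a counterfactual for this particular compound from a generic prediction for any -OCH₃ analogue.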

2. Causal Discovery & Mechanistic Elucidation: Automatically Discovering Causal Chains from Data

Scenario: Understanding which factors drive selectivity in complex reactions. Traditional AI offers a list of features correlated with high selectivity (potentially long and including irrelevant factors). Causal AI uses causal discovery algorithms to build causal graphs, identifying that 'Fukui function value at site A (nucleophilicity)' is a direct cause of 'product isomer ratio', while solvent polarity, though correlated, is not a dominant factor in this reaction. Significance: Helps chemists validate or even discover new reaction mechanisms from data, elevating AI from a predictive tool to a scientific discovery partner.
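Constraint-based discovery methods such as the PC algorithm are built on conditional independence tests. The sketch below shows one such test, a partial correlation, on simulated data in which the Fukui value drives both the isomer ratio and the choice of solvent. It is a single building block rather than a full causal discovery run, and the data-generating coefficients are assumptions.

```python
import random


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after a least-squares linear fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(xs, ys)]


def partial_corr(a, b, given):
    """Correlation between a and b after removing the linear effect of `given`."""
    return pearson(residuals(a, given), residuals(b, given))


rng = random.Random(2)  # fixed seed so the run is repeatable
# Standardized toy variables; the Fukui value is the only true cause of the ratio.
fukui = [rng.gauss(0.0, 1.0) for _ in range(4000)]
polarity = [0.8 * f + rng.gauss(0.0, 0.6) for f in fukui]
isomer_ratio = [1.5 * f + rng.gauss(0.0, 0.5) for f in fukui]

marginal = pearson(polarity, isomer_ratio)
polarity_given_fukui = partial_corr(polarity, isomer_ratio, fukui)
fukui_given_polarity = partial_corr(fukui, isomer_ratio, polarity)
```

Solvent polarity correlates strongly with the isomer ratio on its own, but the link disappears once the Fukui value is held fixed, while the Fukui link survives conditioning on polarity. A discovery algorithm would keep the Fukui edge and drop the polarity edge on that evidence.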

3. Overcoming Data Bias for Robust Condition Recommendations

Scenario: Recommending the best catalyst for a new reaction. Traditional AI tends to suggest the most common catalysts in training data, which may not be optimal (data bias). Causal AI can estimate the true causal effect of each catalyst by modeling how catalysts were chosen in the historical data and adjusting for that selection bias. Even if an efficient catalyst is rare in historical data, the causal model can identify its superior efficacy and confidently recommend it. Significance: Recommendation systems become more reliable, daring to make exploratory suggestions for novel, out-of-distribution reactions, breaking the limits of historical human experience.
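A small worked example shows how data bias can hide a better catalyst and how adjustment recovers it. The records are invented: a rare catalyst was tried mostly on hard substrates, so its raw average looks poor. The adjustment assumes substrate difficulty is the only confounder, which real data would rarely guarantee.

```python
# (catalyst, substrate difficulty, number of runs, mean yield %); invented data
RECORDS = [
    ("common_cat", "easy", 80, 85.0),
    ("common_cat", "hard", 20, 45.0),
    ("rare_cat",   "easy",  2, 92.0),
    ("rare_cat",   "hard", 18, 60.0),
]


def naive_mean(catalyst: str) -> float:
    """Raw average yield, ignoring which substrates the catalyst was used on."""
    rows = [(n, y) for c, _, n, y in RECORDS if c == catalyst]
    return sum(n * y for n, y in rows) / sum(n for n, _ in rows)


def adjusted_mean(catalyst: str) -> float:
    """Backdoor adjustment: weight each stratum by its share of all runs."""
    total_runs = sum(n for _, _, n, _ in RECORDS)
    result = 0.0
    for stratum in ("easy", "hard"):
        stratum_share = sum(n for _, s, n, _ in RECORDS if s == stratum) / total_runs
        stratum_yield = next(y for c, s, _, y in RECORDS
                             if c == catalyst and s == stratum)
        result += stratum_share * stratum_yield
    return result
```

On the raw averages the common catalyst wins (77.0 vs 63.2). After adjustment the ranking flips (about 72.3 vs 81.9), because the rare catalyst is better on both easy and hard substrates and was simply given the harder jobs.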

III. Implications and Summary

  • From Black Box to White Box: Causal models provide explanations for 'why a prediction is made' through counterfactual analysis or causal graphs, enhancing the trustworthiness of results.
  • From Prediction to Decision-Making: The core objective is no longer just to predict outcomes but to support decisions, guiding what we should do to achieve desired results.
  • From Data-Driven to Knowledge-Data Fusion: It integrates qualitative chemical mechanistic understanding (e.g., electronic effects, steric hindrance) into a mathematical framework, making AI reasoning more aligned with chemists' logic.

Ultimately, the goal of Causal AI is to transform artificial intelligence from an 'assistant' that merely reports statistical patterns into a 'research collaborator' capable of causal reasoning and proposing testable hypotheses. This is not just about more accurate predictions, but about deeper scientific discovery.

AI Club — Academic & Industry Bridge

A collaborative learning community connecting theoretical depth with practical applications. We organize systematic study groups spanning AI, mathematics, system design, and domain expertise—from first principles to production deployment.

Rigorous Curriculum

Structured 8-16 week courses blending theory, implementation, and real-world applications across AI, math, and engineering domains.

Collaborative Learning

Bring together researchers, practitioners, and domain experts to learn from each other through joint exploration in cohort-based study groups.

Active Communication

Regular discussions, knowledge sharing sessions, and peer-to-peer exchanges that foster deep understanding and build lasting professional networks.

Industry Translation

Focus on connecting academic research with industrial problems—understanding how theory solves real-world challenges.

Study Group Courses

12 weeks

AI for Molecular Discovery

A comprehensive 12-week course covering AI applications in drug discovery, from cheminformatics and graph neural networks to generative models and retrosynthesis.

Multi-week

From Queries to Clusters

Master database architecture, performance, and more—from SQL foundations to distributed systems.

Multi-week

Full-Stack JavaScript

From browser to backend mastery. Explore the language, frameworks, and architectures that power the modern web.

Multi-week

From Algorithms to Pipelines

Master data structures, algorithms, and data engineering in Python—from theory to production deployment with Airflow and Dask.

Multi-week

System Design & Software Architecture

From whiteboard ideas to scalable real-world software systems. Explore architecture, patterns, and deployment strategies.

Multi-week

Quality by Design (QbD) Mastery

Integrating FDA regulatory perspectives, statistical foundations, and practical applications for robust biopharma and manufacturing innovation.

16 weeks

AI & Math for Discovery

Building mathematical foundations for scientific AI—linear algebra, calculus, optimization, and probability.

16 weeks

Deep Learning: Graphs to Generalization

A 16-week deep learning course covering theory, implementation, and generalization—from neural networks to GNNs and transformers.

16 weeks

Causal AI & Simulation

A 16-week course on causal inference, graphical models, and simulation—from theory to practical applications in chemistry and pharma.

16 weeks

Statistics & ML (No Black Box)

A 16-week study of statistics and classical machine learning focused on interpretability, causality, and mathematical elegance—no deep learning.

15 weeks

Mathematical Unity

A 15-week journey exploring deep connections in pure mathematics, from number theory and infinity to category theory and modern frontiers.

Multi-week

From Silicon to Syntax

Explore how hardware shapes programming languages like C++ and Python, from CPU architecture to compilers and concurrency.

8 weeks

Chaos Theory: A Mathematical Journey

An 8-week reading path exploring how chaos emerges from deterministic systems, from smooth functions to strange attractors.

Interested in joining our study groups or exploring collaboration opportunities?


Core Philosophy

Lessons learned from building enterprise-scale AI systems in regulated industries

Platform Thinking Over Project Thinking

Identify common capabilities and build reusable services. New requirements should combine existing services, not rebuild from scratch.

Data Quality is AI's Ceiling

The most sophisticated model cannot compensate for poor data. Invest in data standardization, validation, and lineage tracking from day one.

Security & Compliance Cannot Be Retrofitted

Building security measures later costs 10x more than early investment. Design for regulatory requirements and data protection from the start.

Observability is Not Optional

You cannot improve what you cannot measure. Logs, metrics, and tracing must be built-in, not added later. Problem diagnosis time drops from hours to minutes.

Let's Collaborate

Open, pragmatic technical discussions about AI infrastructure, causal inference, and applications in chemistry/pharma

This is about exchange, not a sales pitch. I welcome open, practical technical discussions.

Technical Discussion
Share challenges & approaches
Collaboration Framework
Explore potential synergies
Knowledge Exchange
Connect research & industry

Discussion Topics by Domain

Enterprise AI platform architecture & multi-tenant SaaS design
Causal AI for reaction optimization & process scale-up
Data isolation, security, and regulatory compliance
MLOps pipelines for chemistry/pharma applications
Academic-industry research collaboration
Knowledge transfer & methodology workshops
Current Collaborations
SyntellyPharma AI Research
AI Club — Academic & Industry Bridge