Part 2 of 4: Applying Polya’s Problem-Solving Framework to Data Science, ML, and AI
A Comprehensive Guide for Practitioners
This article is actively evolving. If you spot gaps, disagree with a take, or have better patterns, I’d love to hear it. Suggestions and critiques are welcome.
Introduction: Standing at the Threshold
Picture yourself in this moment. You’re sitting at your desk, perhaps with a cup of coffee growing cold beside your keyboard. You’ve spent the last few days—maybe weeks—immersed in your data. You’ve run df.describe()
more times than you can count. You’ve created distribution plots, correlation matrices, and t-SNE visualizations. You’ve identified the outliers, understood the class imbalance, and discovered that one column that’s leaking future information. You’ve documented your findings, noting that 15% of your labels might be noisy, that your features have wildly different scales, and that production latency can’t exceed 50 milliseconds.
You understand the problem. You really do.
But now comes something different. Now comes the moment where all that careful analysis must transform into action. Where understanding gives way to creation. Where you move from observer to architect.
This is the moment of devising a plan.
George Pólya, the mathematician who crystallized the art of problem-solving into teachable principles, understood that this stage represents the most profoundly human aspect of our work. It’s where intuition dances with logic, where your past projects—the successes and the failures—suddenly illuminate possibilities for this new challenge. It’s where you become an artist, sketching pathways through the vast, often intimidating landscape of model architectures, training strategies, and deployment constraints.
Let me walk with you through this process. Not as a distant lecturer, but as a colleague who’s been in your shoes, who’s felt that mix of excitement and uncertainty that comes with starting a new ML project.
The Architecture of Strategic Thinking: Building Your Bridge
Close your eyes for a moment and imagine this: You’re standing at the edge of a canyon. On this side—your side—you have everything you’ve gathered. Your training dataset of 50,000 samples, your baseline logistic regression that achieves 72% accuracy, your production constraints (50ms latency, 10MB model size), your stakeholder requirements (must be interpretable for regulatory compliance).
On the other side of that canyon lies your destination. The production model that achieves 90% accuracy, that makes predictions in 30 milliseconds, that your compliance team can audit, that scales to handle peak traffic. That’s where you need to be.
The plan you’re about to devise? It’s your bridge across this canyon.
Now, here’s what makes this fascinating—and what Pólya understood intuitively: there’s no single “right” bridge design. Sometimes the most direct path works beautifully. You recognize the problem as straightforward image classification, you grab a pre-trained ResNet, fine-tune it on your data, and you’re done. Direct. Efficient. Perfect.
But other times—and these are often the most interesting projects—the direct path won’t work. Maybe your data is too different from ImageNet. Maybe you need to solve multiple related tasks simultaneously. Maybe you need to invent something new. In those cases, your bridge might take a creative detour, arcing high through an auxiliary task that helps you learn better representations. Or perhaps you’ll build it incrementally, starting with a simple baseline and iteratively adding complexity as you understand what works.
Here’s what’s crucial to understand: Pólya doesn’t prescribe which bridge to build. Instead, he offers you a series of questions—probing, insightful questions that awaken your problem-solving instincts. These aren’t checklist items to tick off mechanically. Think of them as a conversation with a wise mentor who knows exactly what to ask to unlock your thinking.
Let’s dive into these questions together, and I’ll show you how they apply to the messy, real world of machine learning.
The First Question: “Have You Seen This Before?”
Let me tell you a story. A few years ago, I was working on a project to detect fraudulent insurance claims. New problem, new domain, new dataset. I could have started from scratch, treating it as entirely novel. But then Pólya’s first question echoed in my mind:
“Have you seen it before?”
And suddenly, my brain started making connections. Wait—fraud detection isn’t new to me. I’ve built spam filters. I’ve worked on credit card fraud. I once helped detect fake product reviews. These problems share DNA. They’re all about finding needles in haystacks, about detecting patterns that adversarial actors try to hide, about dealing with extreme class imbalance where fraudulent cases are maybe 1-2% of your data.
This is pattern recognition, and it’s one of the most powerful tools in your arsenal as an ML practitioner. Your brain—after years of projects, papers, and late-night debugging sessions—has built an incredible library of experiences. Every model you’ve trained, every architecture you’ve studied, every failure you’ve debugged has been cataloged somewhere in your neural networks.
But here’s the nuance that separates good practitioners from great ones: you’re not looking for exact matches. You’re looking for family resemblances, for structural similarities hidden beneath surface differences.
Let me show you what I mean. Imagine you’re working on a new time-series forecasting problem—predicting server load for autoscaling. You might think: “I’ve never done this before.” But wait. Have you worked with sequential data? Language models process sequences. Video understanding processes temporal information. Even if you’ve never predicted server load, if you’ve fine-tuned a BERT model, you understand the fundamental architecture of processing sequential information with attention mechanisms. That’s a pattern match.
Or consider this: You’re building a medical image segmentation model to identify tumors in CT scans. “I’ve never worked in medical imaging,” you might think. But have you done semantic segmentation in autonomous driving? Have you worked on document layout analysis? These problems share the same core structure: dense, pixel-level prediction where spatial relationships matter. The domain knowledge differs, but the architectural intuition transfers.
Let me walk you through how to search your experience systematically:
# PSEUDOCODE: How to Search Your ML Experience

def search_for_patterns(new_problem):
    """
    This is what happens in your brain when you encounter a new ML problem.
    Let's make it explicit.
    """
    # FIRST PASS: Surface-level similarities
    # Don't overthink this—just quick pattern matching
    print("What type of data am I working with?")
    data_type = new_problem.identify_data_type()
    # -> "text", "images", "tabular", "time-series", "graph"

    print("What am I trying to predict?")
    task_type = new_problem.identify_task()
    # -> "classification", "regression", "generation", "ranking"

    print("What are my constraints?")
    constraints = new_problem.list_constraints()
    # -> {"latency": "50ms", "interpretability": "required",
    #     "data_size": "10k samples"}

    # SECOND PASS: Abstract the essential structure
    # This is where it gets interesting
    print("\nWhat's the CORE challenge here?")
    structure = extract_pattern(new_problem)

    # For example, your problem might reduce to:
    # "Sequence-to-sequence with variable-length inputs"
    # "Imbalanced binary classification with adversarial actors"
    # "Multi-modal fusion for generation tasks"
    # "Few-shot learning in a new domain"

    # THIRD PASS: Search your memory at different levels
    relevant_memories = []

    # Start specific
    print("\nHave I solved this EXACT problem before?")
    exact_matches = recall_projects(task=task_type,
                                    data=data_type,
                                    domain=new_problem.domain)

    if exact_matches:
        print(f"Yes! I can adapt my approach from {exact_matches}")
        return exact_matches

    # Go broader - same architecture family
    print("\nHave I used similar architectures?")
    architectural_matches = recall_projects(
        architecture_family=structure.architecture_hint
    )
    # e.g., "I've used transformers for NLP, maybe they work here too"

    # Go even broader - same fundamental challenge
    print("\nHave I faced similar CHALLENGES?")
    challenge_matches = recall_projects(
        challenges=structure.key_challenges
    )
    # e.g., "I've dealt with class imbalance in spam detection"

    # FOURTH PASS: Cross-domain analogies
    # This is where creativity happens
    print("\nWhat problems in OTHER domains share this structure?")
    # Real examples of this:
    # - Music recommendation ← Natural language processing
    #   (playlists as sentences, songs as words)
    # - Protein folding ← Language modeling
    #   (amino acid sequences have grammar-like rules)
    # - Traffic flow ← Fluid dynamics
    #   (both are about flow through networks)
    analogous_domains = find_structural_parallels(structure)

    return {
        'direct_matches': exact_matches,
        'architectural_inspiration': architectural_matches,
        'challenge_patterns': challenge_matches,
        'creative_analogies': analogous_domains
    }
Let me give you a concrete example of how this plays out. When DeepMind was working on AlphaFold for protein folding, they could have treated it as an entirely new problem. After all, predicting 3D protein structure from amino acid sequences is a unique biological challenge. But they recognized patterns:
- From language modeling: Proteins have sequence structure, like sentences
- From computer vision: The contact map prediction is like image segmentation
- From attention mechanisms: Long-range dependencies matter (like in transformers)
- From graph networks: Spatial relationships between amino acids form a graph
By recognizing these patterns, they could build on years of ML progress rather than starting from scratch. Their plan emerged from pattern recognition across multiple domains.
Now, let me be honest with you about something: This pattern recognition isn’t instantaneous. When you’re early in your ML career, your library of patterns is still being built. That’s okay. That’s expected. Every project you complete, every paper you deeply understand, every architecture you implement from scratch—these are all deposits in your pattern bank.
But here’s the secret that accelerates this process: Deliberately reflect on your projects. After you finish a model, don’t just move on. Ask yourself: - What was the essential structure of this problem? - What made it hard? - What architectural choices were crucial? - What would I do differently next time? - What other problems share this structure?
Write this down. Seriously. Keep a personal wiki, a notion page, a markdown file—whatever works for you. Future you will thank present you when you can search “problems with extreme class imbalance” and find three previous approaches that worked.
The Creative Leap: Introducing Auxiliary Elements
We’ve been building on what exists—pattern recognition, related problems, transferred methods. But now we come to something different. Something that requires genuine creativity. Pólya calls these auxiliary elements, and they represent some of the most exciting moments in machine learning.
Let me explain what I mean with a story. Imagine you’re working on image classification, and you’ve hit a wall. Your model plateaus at 85% accuracy. You’ve tried deeper networks, more data augmentation, better optimizers—nothing budges the needle. The gap between your current state and your goal seems unbridgeable with conventional approaches.
This is when you might introduce an auxiliary element—something not present in the original problem formulation, something creative that acts as a catalyst, enabling reactions that wouldn’t occur naturally.
In the case of that image classification problem, someone had a creative insight: What if we don’t just train the model to classify, but also to predict the rotation angle of randomly rotated images? This auxiliary task (rotation prediction) isn’t part of your original goal (you don’t care about rotation angles), but it forces the network to learn better representations. It works as a catalyst. Your main task performance jumps to 89%.
This is an auxiliary element in action.
Let me walk you through different types of auxiliary elements you might introduce:
Auxiliary Tasks (Multi-Task Learning)
This is probably the most common type. You add additional prediction tasks that aren’t your ultimate goal but help you learn better representations.
Real scenario: You’re building a model for medical image analysis to detect lung cancer. Your main task is binary classification: cancer or no cancer. But you introduce auxiliary tasks: - Predict the patient’s age (forces the model to learn biological indicators) - Segment out the lung regions (forces spatial understanding) - Predict whether the patient is a smoker (forces learning of texture patterns)
None of these auxiliary tasks is your goal. But together, they shape your model’s learned representations, making it better at the main task.
Here’s the fascinating part: There’s no algorithm for choosing auxiliary tasks. This requires intuition, domain knowledge, creativity. You might ask yourself: - What else could this model predict from the same data? - What intermediate understanding would help the main task? - What related information does the data contain?
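To make the multi-task setup above concrete, here is a minimal PyTorch-style sketch of the lung-cancer example: one shared encoder, a main cancer head, and two of the auxiliary heads (the segmentation head is omitted to keep it short). The encoder, the feature size, and the 0.3 auxiliary weight are placeholders, not recommendations.

# SKETCH: shared encoder with auxiliary heads (hypothetical shapes and loss weights)
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNet(nn.Module):
    def __init__(self, encoder, feat_dim=512):
        super().__init__()
        self.encoder = encoder                       # any backbone mapping an image to feat_dim features
        self.cancer_head = nn.Linear(feat_dim, 1)    # main task: cancer vs. no cancer
        self.age_head = nn.Linear(feat_dim, 1)       # auxiliary: patient age (regression)
        self.smoker_head = nn.Linear(feat_dim, 1)    # auxiliary: smoker vs. non-smoker

    def forward(self, x):
        h = self.encoder(x)
        return self.cancer_head(h), self.age_head(h), self.smoker_head(h)

def multitask_loss(outputs, targets, aux_weight=0.3):
    cancer_logit, age_pred, smoker_logit = outputs
    main = F.binary_cross_entropy_with_logits(cancer_logit, targets["cancer"])
    aux = F.mse_loss(age_pred, targets["age"]) + \
          F.binary_cross_entropy_with_logits(smoker_logit, targets["smoker"])
    # the auxiliary terms shape the shared representation without dominating the main objective
    return main + aux_weight * aux

If an auxiliary head starts hurting the main metric, lowering its weight (or dropping it) is the first thing to try.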
Auxiliary Losses (Regularization Through Learning)
Sometimes your auxiliary element is a loss function that guides learning in useful directions.
Real scenario: You’re training a generative model, but the generated outputs look unrealistic. You add an auxiliary loss: a discriminator network that tries to distinguish real from generated examples. This is exactly what GANs do—the discriminator is an auxiliary element that provides a training signal (the adversarial loss) that helps the generator learn.
Or imagine you’re training an embedding model for product recommendations. Your main loss is based on click-through data. But you add an auxiliary loss: embeddings of similar products should be close together. This auxiliary loss acts as a regularizer, shaping your embedding space in useful ways.
Auxiliary Representations (Intermediate Structures)
Sometimes you introduce an intermediate representation that bridges the gap between input and output.
Real scenario: You’re building a text-to-speech system. Going directly from text to audio waveforms is incredibly difficult. But you introduce an auxiliary element: mel-spectrograms. Your system now has two stages—text to mel-spectrogram, then mel-spectrogram to audio. The mel-spectrogram is an auxiliary representation. It’s not your goal (users don’t want spectrograms), but it makes the problem tractable.
This is exactly what successful TTS systems like Tacotron do. The auxiliary element (spectrogram representation) transforms an impossible problem into two manageable ones.
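As a small illustration of how cheap this intermediate representation is to produce, here is a sketch using the librosa library; the file name and the parameter values (80 mel bands, hop length, and so on) are placeholders.

# SKETCH: a mel-spectrogram as an auxiliary intermediate representation (librosa)
import librosa

audio, sr = librosa.load("example_utterance.wav", sr=22050)   # hypothetical audio file
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)   # log-compress the power spectrogram

# Shape is (80, num_frames): the training target for stage 1 (text -> spectrogram)
# and the input for stage 2 (spectrogram -> waveform).
print(log_mel.shape)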
Let me share the thought process for introducing auxiliary elements:
# PSEUDOCODE: Strategic Thinking for Auxiliary Elements

def should_i_add_auxiliary_element(current_problem, current_progress):
    """
    You're stuck. You've tried the obvious things. Time to think creatively.
    """
    # First, diagnose WHY you're stuck
    print("Why is my model not improving?")

    if current_progress.shows("underfitting"):
        print("Model isn't learning good representations")
        print("Consider: Auxiliary tasks to force richer features")
        candidates = [
            "Add self-supervised pretraining task",
            "Multi-task learning with related tasks",
            "Add auxiliary classification heads at multiple layers"
        ]

    elif current_progress.shows("unstable_training"):
        print("Training dynamics are problematic")
        print("Consider: Auxiliary constraints to stabilize")
        candidates = [
            "Add auxiliary reconstruction loss",
            "Add contrastive loss for embedding space",
            "Add adversarial loss for robustness"
        ]

    elif current_progress.shows("gap_too_large"):
        print("Direct mapping from input to output is too hard")
        print("Consider: Intermediate representations")
        candidates = [
            "Add latent space representation",
            "Break into pipeline with intermediate outputs",
            "Add attention maps as intermediate supervision"
        ]

    # Second, evaluate each candidate
    for candidate in candidates:
        print(f"\nEvaluating: {candidate}")

        # Will this help the main task?
        if not likely_to_improve_main_task(candidate, current_problem):
            continue

        # Can I implement it with available data?
        if not have_data_for(candidate):
            continue

        # Will the added complexity be worth it?
        if complexity_increase(candidate) > expected_benefit(candidate):
            continue

        print(f"Worth trying: {candidate}")
        return candidate

    return None
Let me give you a powerful example of auxiliary elements in action: BERT.
Think about what BERT does. The creators wanted a model that understands language well enough for many downstream tasks (classification, question answering, etc.). But how do you train such a model? They introduced two auxiliary tasks:
- Masked Language Modeling: Randomly mask some words, predict them
- Next Sentence Prediction: Given two sentences, predict if they’re consecutive
These tasks aren’t anyone’s end goal. Nobody deploys BERT just to predict masked words. But these auxiliary tasks force BERT to learn deep linguistic understanding. Once trained on these auxiliary tasks, BERT becomes an incredibly powerful starting point for dozens of actual tasks.
The genius isn’t in the architecture (transformers already existed). The genius is in recognizing that these specific auxiliary tasks would teach the model what we want it to know. That’s creative insight.
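If you want to see how little machinery the masked-language-modeling task needs, here is a rough sketch of the masking recipe. The 15% rate and the 80/10/10 split follow the BERT paper; treat the rest (the toy vocabulary, the whitespace tokenization) as illustrative only.

# SKETCH: BERT-style masking for the masked language modeling auxiliary task
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                        # the model must recover this token
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token             # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)   # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

masked, targets = mask_tokens("the cat sat on the mat".split(),
                              vocab=["the", "cat", "sat", "on", "mat"])
print(masked)
print(targets)   # non-None positions are the ones the model is trained to predict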
A word of caution: Auxiliary elements can also backfire. I’ve seen teams add auxiliary tasks that actively hurt performance because they pulled the model in conflicting directions. I’ve seen auxiliary losses that made training unstable. The art is in choosing elements that genuinely help.
Here’s my advice: When introducing auxiliary elements, start simple. Add one auxiliary task. Does it help? Great, keep it. Does it hurt? Remove it or adjust it. Build your intuition through experimentation. Over time, you’ll develop a sense for which auxiliary elements are likely to help in which situations.
The Transformation Question: “Could You Restate the Problem?”
Let’s talk about one of the most powerful moves in your ML toolkit—one that can transform an impossible problem into a tractable one: problem reformulation.
I want you to imagine holding a Rubik’s cube. If you only know how to turn the front face, you’re severely limited. But once you realize you can rotate the entire cube to bring any face to the front, suddenly you have many more moves available. Problem reformulation is like that—it’s rotating the problem to expose different faces, each of which might be easier to solve.
Let me show you what I mean with a problem you might actually face. You’re tasked with predicting customer lifetime value (CLV) for a subscription business. The request seems straightforward: “Build a model that predicts how much revenue each customer will generate.”
But here’s the thing: This single business goal can be framed as completely different ML problems, each enabling different techniques and revealing different insights.
Let me walk you through each formulation as if we’re sitting together, sketching out approaches on a whiteboard:
Formulation 1: Classification
Your first instinct might be: “Let’s bucket customers. Low value (under $100), medium ($100-$1000), high (over $1000). We’ll build a classifier.”
Pros: This is simple. It’s interpretable. You can use standard classification techniques. Your stakeholders understand it—“This customer is likely to be high value.”
Cons: But think about it—you’ve just thrown away information. Is a $99 customer really that different from a $101 customer? Your buckets are arbitrary. And you’ve lost the ability to estimate actual dollar amounts. When finance asks “What’s the expected revenue from this cohort?” you can’t give a precise answer.
Would I start here? Maybe, if I’m exploring. But I’d know it’s limited.
Formulation 2: Regression
“Okay,” you think, “let’s predict the exact dollar amount. Regression problem. Done.”
You frame it as: Given customer features at signup, predict total lifetime revenue (a continuous value from $0 to… well, your highest-paying customer is at $50,000).
Pros: Now you have precise predictions. You can sum up expected revenue. You can rank customers accurately. This is what everyone wanted, right?
Cons: But then you start training and you see the problem. Your distribution is heavily right-skewed. Most customers are in the $100-$500 range, but you have a long tail of high-value customers. Your model, trying to minimize mean squared error, makes lots of errors on that long tail. You try log-transforms, you try robust loss functions, but nothing quite works cleanly.
And there’s another problem: you’re predicting cumulative future revenue, but you have no sense of time. A customer who generates $1000 over one year is very different from one who generates $1000 over five years, but your model treats them the same.
Workable? Yes. Ideal? Maybe not.
Formulation 3: Survival Analysis
Now you step back and think differently. “Wait,” you say, “what if I frame this as two separate things: HOW LONG will the customer stay (time until churn), and WHAT’S THE RATE of their spending?”
This is survival analysis. You’re modeling: 1. The hazard function: probability of churn at each time point 2. Expected revenue per time period while active
Then CLV = (revenue per period) × (expected lifetime)
Pros: This is more natural! It explicitly models the time dimension. It handles censored data gracefully (customers who haven’t churned yet). It lets you answer questions like “What’s the probability this customer is still active after 2 years?” The model structure matches the actual process.
Cons: It’s more complex to implement. You need to understand survival analysis (Cox models, Kaplan-Meier curves). It requires more careful data preparation. Your stakeholders might need education on what hazard ratios mean.
But for many subscription businesses, this is actually the “right” framing. It matches the underlying reality.
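If you want to prototype this framing before committing, the lifelines library covers the basics. A minimal, cohort-level sketch, assuming a customers table with hypothetical tenure_months, churned, and monthly_revenue columns:

# SKETCH: CLV via survival analysis (lifelines); file and column names are hypothetical
import pandas as pd
from lifelines import KaplanMeierFitter

df = pd.read_csv("customers.csv")   # needs tenure_months, churned (0/1), monthly_revenue

kmf = KaplanMeierFitter()
kmf.fit(durations=df["tenure_months"], event_observed=df["churned"])

expected_lifetime = kmf.median_survival_time_                     # months until half the cohort has churned
clv_estimate = df["monthly_revenue"].mean() * expected_lifetime   # CLV = spend rate × expected lifetime
print(f"Cohort-level CLV estimate: ${clv_estimate:,.0f}")

A Cox proportional hazards model (also in lifelines) gives per-customer survival curves instead of a single cohort number, which is closer to what you would actually deploy.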
Formulation 4: Reinforcement Learning
Now let’s get really creative. What if you reframe the entire problem?
“Actually,” you say, “I don’t just want to predict customer value. I want to maximize it through interventions. What if I frame this as: learn a policy that decides what actions to take (send discount, send email, do nothing) to maximize lifetime revenue?”
This is a reinforcement learning formulation: - State: Customer behavior, engagement metrics, recent activity - Actions: Various interventions you can take - Reward: Incremental revenue generated - Goal: Learn policy π that maximizes expected cumulative reward
Pros: This is action-oriented. You’re not just predicting, you’re optimizing. It directly aligns with the business goal (maximize revenue). It can discover non-obvious intervention strategies. It learns from online feedback.
Cons: It requires an experimentation infrastructure—you need to try actions and observe results. It’s sample inefficient—you need lots of data. It’s complex to implement and debug. You need to be careful about exploration vs. exploitation. There are ethical considerations around treating customers as experimental subjects.
Would I jump straight to RL? Probably not. But for a mature business with existing A/B testing infrastructure and lots of data, this framing might unlock significant value.
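To give a feel for the action-oriented framing, here is a deliberately tiny sketch, closer to a multi-armed bandit than full RL, with simulated rewards standing in for real customer responses. The action names and reward numbers are made up.

# SKETCH: epsilon-greedy policy over customer interventions (simulated rewards)
import random

actions = ["do_nothing", "send_email", "send_discount"]
value_estimate = {a: 0.0 for a in actions}   # running average reward per action
counts = {a: 0 for a in actions}
epsilon = 0.1

def observed_reward(action):
    # stand-in for incremental revenue measured after taking the action
    base = {"do_nothing": 1.0, "send_email": 1.3, "send_discount": 1.1}
    return random.gauss(base[action], 0.5)

for step in range(10_000):
    if random.random() < epsilon:
        action = random.choice(actions)                          # explore
    else:
        action = max(actions, key=lambda a: value_estimate[a])   # exploit
    reward = observed_reward(action)
    counts[action] += 1
    value_estimate[action] += (reward - value_estimate[action]) / counts[action]

print(value_estimate)   # converges toward the simulated action values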
Now, here’s the crucial lesson: These are all the same business problem, but completely different ML problems. Each formulation: - Enables different techniques - Requires different data structures - Makes different assumptions - Reveals different insights - Has different pros and cons
The formulation you choose shapes everything that follows. It’s not just a technical decision—it’s a strategic one.
Let me share how to think about this systematically:
When to consider reformulation:
When the obvious framing isn’t working: You’ve tried the straightforward approach and you’re stuck. Time to reframe.
When you have domain knowledge suggesting a different structure: If you understand the underlying process (like churn dynamics in subscription businesses), let that guide your formulation.
When your constraints force a different view: If you need to optimize actions (not just predict), that pushes you toward RL or causal inference framings.
When you discover the data doesn’t fit your framing: Your regression assumptions are violated? Maybe it’s not a regression problem.
Here are some common transformations to have in your toolkit:
Classification ↔︎ Regression: Sometimes predicting probabilities and thresholding works better than direct classification. Sometimes discretizing regression outputs makes the problem more stable.
Supervised → Self-Supervised: Can’t get enough labels? Maybe your problem can be reframed with automatic labels. Rotation prediction, colorization, masked language modeling—these are all self-supervised framings of problems that originally seemed to require labeled data.
Instance-Level → Set-Level: Struggling with individual predictions? Maybe Multiple Instance Learning is better. Example: Instead of classifying individual frames in a video, classify the entire video and let the model figure out which frames matter.
Time-Domain → Frequency-Domain: Stuck on time-series patterns? Apply an FFT and work in frequency space. Sometimes patterns invisible in the time domain are obvious in the frequency domain (see the short numpy sketch after this list).
Discriminative ↔︎ Generative: Can’t directly model P(y|x)? Sometimes modeling P(x|y) and P(y), then applying Bayes’ rule, works better. This is how Naive Bayes classifiers work.
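Here is the time-to-frequency move from the list above in a few lines of numpy: a weekly cycle buried in noisy daily data becomes a single obvious spike in the spectrum.

# SKETCH: moving a noisy daily series into the frequency domain with an FFT
import numpy as np

days = np.arange(365)
signal = np.sin(2 * np.pi * days / 7) + np.random.normal(0, 0.8, size=days.size)  # weekly cycle + noise

spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(days.size, d=1.0)           # cycles per day

dominant = freqs[spectrum.argmax()]
print(f"Dominant period: {1 / dominant:.1f} days")  # ~7 days, hard to see by eye in the raw series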
Here’s a practical exercise: Take a problem you’re working on right now. Write down three completely different ways to frame it as an ML problem. Don’t just think it—actually write them down. For each framing, note: - What techniques does this enable? - What assumptions does it make? - What are the pros and cons? - What would the data need to look like? - How would you evaluate success?
You might be surprised. Often, the formulation you started with isn’t the best one. But you won’t discover alternatives unless you actively look for them.
Returning Home: The Power of First Principles
Let me tell you about a time I watched a team waste three months. They were building a recommendation system—sophisticated, beautiful, using the latest transformer architectures. They were so deep in implementation details: attention heads, positional encodings, learning rate schedules. The model was getting more and more complex.
And then someone new joined the team. In the first meeting, they asked a simple question: “What are we actually trying to optimize here?”
Silence. Then various answers: “User engagement.” “Click-through rate.” “Revenue.” “Time on platform.”
These are all different objectives. And they’d been optimizing for CTR while stakeholders wanted revenue.
This is what Pólya means by “go back to definitions.” When you’re lost in a maze of technical details, when complexity has accumulated to the point where you can’t see clearly anymore, when your model has 50 hyperparameters and you’ve lost track of what each one does—that’s when you need to return home. Return to first principles.
Let me walk you through what this means in practice:
Define What You’re Actually Solving
Strip away all the ML jargon. What is the fundamental thing your model needs to do?
Not “multi-class classification with cross-entropy loss.” But: “Given a customer service inquiry, route it to the right department.”
Not “sequence-to-sequence generation with attention.” But: “Given a bug description, suggest relevant code files for the engineer to check.”
Not “unsupervised clustering with k-means.” But: “Group these customer behaviors so we can design targeted interventions.”
When you state it this way, clearly and simply, you can ask better questions: - What’s the actual impact of being right vs. wrong? - What’s the cost of different types of errors? - What does “good enough” actually mean? - What do we need the model to learn to accomplish this?
Question Your Metrics
Here’s an uncomfortable truth: the metric you’re optimizing is often wrong for the actual problem.
You’re using accuracy because it’s standard. But look at your confusion matrix—false positives and false negatives have wildly different costs in your application. Accuracy treats them equally. Should you be using a cost-sensitive metric instead?
You’re using BLEU score for your text generation model. But BLEU was designed for machine translation. Does it actually measure what matters for your use case? Maybe human evaluators rate outputs very differently than BLEU does.
You’re using AUC-ROC because your classes are imbalanced. But in production, you need to make decisions at a specific threshold. Shouldn’t you be optimizing for precision at that threshold instead?
Going back to definitions means asking: “What does good performance actually mean?” Define it from first principles, not from convention.
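One concrete habit that helps: report the metric at the threshold you will actually deploy with, right next to the headline AUC. A small sketch with scikit-learn, using simulated scores and a placeholder threshold of 0.6:

# SKETCH: evaluate at the deployment threshold, not only with threshold-free metrics
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score

# y_true and y_prob would come from your validation set; simulated here
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(y_true * 0.3 + rng.random(1000) * 0.7, 0, 1)

deploy_threshold = 0.6                        # placeholder: the operating point production will use
y_pred = (y_prob >= deploy_threshold).astype(int)

print("AUC-ROC:", roc_auc_score(y_true, y_prob))
print("Precision @ 0.6:", precision_score(y_true, y_pred))
print("Recall    @ 0.6:", recall_score(y_true, y_pred))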
Understand Your Model’s Capacity Requirements
Sometimes returning to first principles means asking: “What’s the minimum the model needs to know?”
I’ve seen teams train massive neural networks for problems that were, at their core, memorization of a few hundred rules. The model learns those rules, sure, but do you really need millions of parameters for that? Could a decision tree do the same job more efficiently?
Other times, teams use simple models for problems that fundamentally require complex pattern recognition. A logistic regression can’t learn hierarchical feature interactions no matter how much data you give it. You need the capacity of deeper models.
Returning to definitions means understanding: “What’s the inherent complexity of this problem?”
Let me give you a framework for thinking about this:
Simple problems (low complexity, clear rules): - Could be solved by a human with a checklist - Examples: Filtering spam emails based on keywords, flagging duplicate records - Don’t need deep learning—often decision trees or even rule-based systems work better
Medium complexity (pattern recognition, but not too deep): - Requires learning combinations of features but patterns are relatively straightforward - Examples: Credit scoring, customer churn prediction with engineered features - Gradient boosted trees, random forests, or shallow neural nets work well
High complexity (hierarchical patterns, rich structure): - Requires learning features from raw data, or very complex feature interactions - Examples: Image classification, natural language understanding, speech recognition - Deep learning shines here
Very high complexity (reasoning, planning, long-term dependencies): - Requires sequential decision-making or complex reasoning chains - Examples: Playing Go, theorem proving, long-form text generation - Might need specialized architectures, RL, or hybrid approaches
Going back to definitions means honestly assessing which category your problem falls into. Don’t use a cannon to kill a fly. Don’t bring a knife to a gunfight.
Revisit Your Assumptions
Every ML approach makes assumptions. Going back to definitions means making these assumptions explicit and checking if they hold.
You’re using linear regression. Implicit assumptions: - The relationship is linear - Errors are normally distributed - Features are independent (or you’ve handled multicollinearity) - No significant outliers
Do these hold for your data? If not, you’re building on a shaky foundation.
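Checking the first two assumptions takes only a few lines. A sketch with synthetic data standing in for yours; swap in your own X and y:

# SKETCH: quick checks on linear-regression assumptions (synthetic data as a stand-in)
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)   # roughly linear, Gaussian noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

print("Residual skew:", stats.skew(residuals))   # far from 0 suggests non-normal errors
_, p_value = stats.shapiro(residuals)            # normality test (fine for n up to ~5000)
print("Shapiro-Wilk p-value:", p_value)

corr = np.corrcoef(X, rowvar=False)              # crude multicollinearity check
print("Max off-diagonal feature correlation:", np.max(np.abs(corr - np.eye(X.shape[1]))))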
You’re using collaborative filtering for recommendations. Implicit assumption: - Users who agreed in the past will agree in the future - The rating matrix is static - You have enough historical data
What if user preferences drift over time? What about cold-start users?
Going back to definitions means questioning: “What am I taking for granted?”
Let me share a real example. A team was building a fraud detection model using historical transaction data. They were getting good cross-validation scores but terrible production performance. When they went back to first principles, they realized their implicit assumption: “fraudulent patterns don’t change over time.”
But of course they do! Fraudsters adapt. The patterns in last year’s data weren’t predictive of this year’s fraud. They needed to completely rethink their approach—shorter training windows, online learning, anomaly detection instead of supervised classification.
The problem wasn’t their modeling skill. It was a violation of a fundamental assumption they hadn’t made explicit.
Here’s a practice I recommend: When you’re stuck, have a “first principles” meeting.
Gather your team (or just sit down with yourself and a whiteboard). Ask:
- What are we really trying to do? (In plain language)
- What does success look like? (Concretely, with numbers)
- What does our model need to know? (What patterns, what relationships)
- What assumptions are we making? (List them explicitly)
- What are the simplest approaches? (Before we got clever, what would work?)
- What’s the irreducible difficulty? (What makes this problem hard, fundamentally)
You’ll be surprised how often this exercise reveals that you’ve been solving the wrong problem, or using the wrong metric, or making unjustified assumptions.
Going back to definitions isn’t admitting defeat. It’s resetting your understanding so you can move forward more effectively.
Working Backwards: Starting From The End
Let me paint you a scenario. You’re in a meeting with the product team. They’re excited: “We need a real-time recommendation system for the homepage. It needs to serve personalized recommendations to 10,000 requests per second, with 99th percentile latency under 50 milliseconds. Oh, and it needs to be explainable because we might need to show users why we recommended something.”
You nod, taking notes. Then you go back to your desk and think: “Okay, let’s start exploring architectures. Maybe a deep neural network with user and item embeddings, then a—”
Stop.
That’s forward thinking. You’re starting from what you have (data, techniques you know) and trying to reach what you need (the production system). Sometimes that works. But often, there’s a better way: working backwards.
Working backwards means starting from the goal—that production system serving 10K QPS at 50ms latency with explainability—and reasoning backward to determine what you need at each step.
Let me show you how this plays out:
The Backwards Planning Process
Step 1: Start with the final constraint
“I need 50ms p99 latency at 10K QPS.”
Step 2: Ask “What does this imply?”
Well, if I have 50ms total budget for latency, and I need to do feature lookup, model inference, and post-processing…
That means model inference can take at most… let’s say 20ms (leaving room for everything else).
Step 3: Ask again “What does this imply?”
20ms for inference. That rules out: - Large transformer models (would take 100+ms) - Ensemble of multiple heavy models (too slow) - Complex feature engineering at inference time (no time)
So I need a lightweight model architecture.
Step 4: Keep going backwards
If I need a lightweight model but still need good performance, what does that imply?
I need to do the heavy lifting offline: - Pre-compute embeddings - Use approximate nearest neighbor search for candidate generation - Only use the lightweight model for final re-ranking
Step 5: One more step back
If I’m doing candidate generation with ANN search, what does that imply?
I need: - A good embedding space (so similar items are close together) - Efficient indexing (FAISS or similar) - The embedding model can be heavy (it runs offline)
Step 6: And finally…
If I need a good embedding space, what does that imply for training?
I should use: - Contrastive learning or metric learning objectives - Large batch sizes (to get good negatives) - Training data focused on user-item interactions
Now I can work forward with this plan: 1. Train an embedding model (can be complex, runs offline) 2. Use embeddings to build ANN index 3. Train lightweight ranking model (must be fast) 4. Deploy as two-stage: ANN candidate generation → neural re-ranking
This entire plan emerged from working backwards from the constraint.
Let me show you the code structure for this thinking:
# PSEUDOCODE: Working Backwards Planning

def plan_backwards(production_requirements):
    """
    Start from the goal and work backwards to determine what you need
    """
    # The goal
    print("Goal: Real-time recommendations")
    print("Constraints:", production_requirements)
    # {"latency_p99": "50ms", "qps": 10000, "explainability": "required"}

    current_constraints = production_requirements
    architecture_requirements = []

    # Work backwards through the implications
    while True:
        print(f"\nCurrent constraints: {current_constraints}")
        print("What does this imply?")

        # Latency constraint
        if "latency" in current_constraints:
            print("→ Model must be lightweight")
            print("→ Heavy computation must be offline")
            print("→ Need caching strategy")
            architecture_requirements.extend([
                "lightweight_model (< 5M parameters)",
                "offline_embedding_computation",
                "redis_cache for frequent users"
            ])

            print("\nWhat does lightweight model imply?")
            print("→ Can't directly use BERT/large transformers")
            print("→ Need efficient architecture (distilled model, or shallow neural net)")
            architecture_requirements.append(
                "two_stage: fast_retrieval + small_ranker"
            )

        # Explainability constraint
        if "explainability" in current_constraints:
            print("→ Need attention weights or feature importance")
            print("→ Rules out pure black-box models")
            architecture_requirements.extend([
                "attention_mechanism (can show which items influenced rec)",
                "feature_attribution (SHAP values for scoring model)"
            ])

        # QPS constraint
        if "qps" in current_constraints:
            print("→ Need to serve from cache for hot items")
            print("→ Need batch inference")
            print("→ Need load balancing")
            architecture_requirements.extend([
                "batch_predictor (combine requests)",
                "multi_instance_deployment",
                "cache_popular_user_recommendations"
            ])

        # Now work backwards from architecture requirements
        print("\n\nArchitecture requirements:", architecture_requirements)

        # What does two-stage retrieval+ranking imply?
        if "two_stage" in architecture_requirements:
            print("\nTwo-stage approach requires:")
            print("1. Fast retrieval:")
            print("   → Need embeddings")
            print("   → Need ANN index (FAISS)")
            print("2. Lightweight ranker:")
            print("   → Simple features only")
            print("   → Small neural net or GBDT")
            data_requirements = [
                "user_item_interaction_data (for embeddings)",
                "positive_negative_pairs (for contrastive learning)",
                "features_computable_in_realtime (for ranker)"
            ]

        # What does embedding training imply?
        print("\n\nEmbedding training requires:")
        print("→ Metric learning objective (contrastive/triplet loss)")
        print("→ Large batch sizes (for hard negative mining)")
        print("→ GPU training (can take hours, runs offline)")
        training_requirements = [
            "contrastive_learning_framework",
            "large_batch_training (batch_size > 1024)",
            "negative_sampling_strategy"
        ]

        break  # Simplified for example

    # Now we have a complete plan!
    return {
        'architecture': architecture_requirements,
        'data_needs': data_requirements,
        'training': training_requirements,
        'deployment': ["two_tier_serving", "caching_layer", "load_balancer"]
    }
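To make the “fast retrieval” half of that plan less abstract, here is roughly what the offline index build and the online candidate lookup might look like with FAISS. The embedding dimension, the random vectors, and the choice of an exact index are all placeholders.

# SKETCH: ANN candidate generation with FAISS (random embeddings as placeholders)
import numpy as np
import faiss

dim = 64
item_embeddings = np.random.rand(100_000, dim).astype("float32")
faiss.normalize_L2(item_embeddings)          # normalize so inner product behaves like cosine similarity

index = faiss.IndexFlatIP(dim)               # exact search; swap for an IVF/HNSW index at larger scale
index.add(item_embeddings)                   # built offline, then loaded by the serving layer

user_embedding = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(user_embedding)
scores, candidate_ids = index.search(user_embedding, 200)   # ~200 candidates for the light re-ranker
print(candidate_ids[0][:10])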
Real Example: AlphaGo’s Backwards Plan
Let me share how this played out in one of the most famous ML systems: AlphaGo.
Goal: Beat the world champion at Go.
Work backwards:
“To beat the world champion, I need superhuman move selection.”
What does that imply? → “I need to evaluate positions better than humans can.”
What does that imply? → “I need both: (1) policy to suggest good moves, and (2) value function to evaluate positions.”
What does that imply? → “I need to learn from both expert games AND self-play.”
Why self-play? → “Because human games only show me human-level play. To exceed that, I need to play against myself and discover superhuman strategies.”
What does self-play imply? → “I need Monte Carlo Tree Search (MCTS) to generate high-quality training data.”
What does MCTS imply? → “I need a way to simulate games quickly to explore the tree.”
What does learning from expert games imply? → “I need a large database of professional games.”
Now work forward with this plan: 1. Start: Collect database of expert games 2. Train initial policy network by supervised learning on expert moves 3. Use that policy in MCTS to generate self-play games 4. Train value network on self-play outcomes 5. Improve policy through reinforcement learning on self-play 6. Iterate: better policy → better self-play → better training → better policy
This plan emerged from working backwards from “beat the world champion” through all the implications.
When Working Backwards Really Shines
Working backwards is especially powerful when:
1. You have hard constraints
Production requirements (latency, throughput, memory), regulatory requirements (explainability, fairness), business requirements (must work on mobile devices). Start from these constraints and work backwards to determine what’s feasible.
2. The goal is clear but the path isn’t
You know exactly what you need to achieve but don’t know how to get there. Working backwards helps you decompose the problem.
3. You’re building a complete system
Not just a model, but a production ML system with data pipelines, training infrastructure, serving, monitoring. Working backwards helps you see all the pieces you’ll need.
4. Resources are limited
You have 3 months and 2 engineers. Working backwards from the deadline helps you scope appropriately: “If we only have 3 months, we can’t build a custom training infrastructure, so we need to use existing tools, which means…”
Practical Exercise
Take a project you’re working on. Write down the final state you need to achieve. Be specific:
- Performance metrics
- Latency requirements
- Scale (QPS, data volume)
- Other constraints (interpretability, fairness, etc.)
Now work backwards. For each requirement, ask “What does this imply?” Keep a chain of reasoning. You might discover: - Assumptions you’re making that might not hold - Components you’ll need that you hadn’t thought about - Architectural choices that are forced by your constraints - Things that seemed necessary but actually aren’t
This backwards planning often reveals a simpler, more direct path than forward planning would have.
Getting Stuck: The Fertile Ground of Frustration
Let me be completely honest with you: You’re going to get stuck. Not occasionally. Regularly. On every challenging project.
You’ll hit moments where: - Every architecture you try overfits terribly - Your model learns spurious correlations no matter what you do - Performance plateaus far below where it needs to be - The production constraints seem physically impossible to meet - You’ve tried everything you can think of and nothing works
I want to tell you something important: These moments aren’t failures. They’re signals.
They’re your problem telling you: “The way you’re thinking about this isn’t quite right. You need a different lens, a different angle, a new perspective.”
Some of my best ML solutions came after weeks of being stuck. The breakthrough came not from trying harder with the same approach, but from changing how I thought about the problem.
Let me walk you through strategies for getting unstuck, organized by what kind of stuck you are.
Type 1: “My Model Won’t Learn At All”
You’ve set up your training pipeline. You hit run. The loss stays flat or barely moves. Your validation accuracy is no better than random guessing.
Your first instinct: Must be a bug! Check the code!
And you’re probably right: This usually IS a bug. But let me give you a systematic debugging approach:
# DEBUGGING CHECKLIST: Model Won't Learn

# Step 1: Can your model overfit a single batch?
def test_overfitting_capacity():
    """
    Take 10 examples. Train until loss is near zero.
    If this fails, you have a fundamental problem.
    """
    tiny_batch = dataset[:10]
    model = YourModel()

    for epoch in range(1000):
        loss = train_step(model, tiny_batch)
        print(f"Epoch {epoch}: Loss {loss}")

        if loss < 0.01:
            print("✓ Model CAN learn (has sufficient capacity)")
            return True

    print("✗ Model CANNOT learn even tiny batch")
    print("Possible issues:")
    print("- Wrong loss function for task")
    print("- Architecture bugs (dead neurons, dimension mismatches)")
    print("- Learning rate too low")
    print("- Gradient flow problems")
    return False
If your model can’t even overfit 10 examples, you have a bug or fundamental architecture problem. Fix that before anything else.
If it CAN overfit a tiny batch but won’t train on full data:
Check your data: - Are labels correct? (Print some examples manually) - Is there a label-feature mismatch? (e.g., predicting tomorrow’s price but features include tomorrow’s price) - Are you preprocessing wrong? (e.g., normalizing test data with train stats)
Check your architecture: - Are gradients flowing? (Add gradient logging; see the sketch below) - Is anything saturating? (sigmoid/tanh outputs at extremes?) - Are skip connections working if you have them?
Check your training setup: - Learning rate too high (causing divergence) or too low (no learning)? - Batch size inappropriate for problem? - Are you training the right parameters? (Check param.requires_grad)
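For the gradient-flow check mentioned above, here is a minimal PyTorch sketch you can call right after loss.backward(); it only assumes you already have a model with named parameters.

# SKETCH: log gradient norms per layer after loss.backward() to spot dead or exploding paths
def log_gradient_flow(model):
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.grad is None:
            print(f"{name:60s}  NO GRADIENT (detached or unused?)")
        else:
            norm = param.grad.norm().item()
            flag = "  <-- suspicious" if norm < 1e-7 or norm > 1e3 else ""
            print(f"{name:60s}  grad norm = {norm:.2e}{flag}")

# usage, inside your training loop:
#   loss.backward()
#   log_gradient_flow(model)
#   optimizer.step()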
Type 2: “My Model Overfits Immediately”
Your training loss goes down nicely. Your validation loss goes down for one epoch, then shoots up. Classic overfitting, but severe.
First, establish baselines. Let me walk you through this systematically:
def establish_baselines():
    """
    Before fighting overfitting, understand what's reasonable
    """
    # Baseline 1: Random predictor
    random_performance = evaluate_random_predictions()
    print(f"Random: {random_performance}")

    # Baseline 2: Most common class (for classification)
    majority_performance = predict_majority_class()
    print(f"Majority class: {majority_performance}")

    # Baseline 3: Simple model (logistic regression / random forest)
    simple_model_performance = train_simple_model()
    print(f"Simple model: {simple_model_performance}")

    print("\nYour complex model must beat these!")
If your complex model doesn’t beat a simple logistic regression, you’re probably overfitting because you don’t have enough data for model complexity.
Strategies when stuck on overfitting:
- Reduce model complexity first:
- Fewer layers, fewer parameters
- You might have a 50-layer network when you need 5 layers
- Get more data (if possible):
- Data augmentation
- Synthetic data generation
- Collect more real data
- Add regularization (but do it systematically; a sketch follows this list):
- Start with dropout (0.5 is often reasonable)
- Try weight decay
- Try early stopping
- Try batch normalization
- Check for data leakage:
- This is often the culprit for mysterious overfitting
- Is future information leaking into your features?
- Are train and test split properly?
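Here is the regularization sketch promised above: the three cheapest interventions in PyTorch, namely dropout in the model, weight decay in the optimizer, and a simple early-stopping counter. The layer sizes and hyperparameter values are placeholders, and train_one_epoch / evaluate stand in for whatever training and validation steps you already have.

# SKETCH: dropout + weight decay + early stopping (hyperparameter values are placeholders)
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Dropout(p=0.5),                 # dropout between layers
    nn.Linear(64, 1),
)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)   # weight decay as regularizer

best_val_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    train_one_epoch(model, optimizer)          # your existing training step
    val_loss = evaluate(model)                 # your existing validation step
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:             # early stopping
            print(f"Stopping at epoch {epoch}: no improvement for {patience} epochs")
            break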
Type 3: “Performance Has Plateaued”
You’ve gotten to 75% accuracy. You need 85%. You’ve tried deeper networks, more training, different optimizers. Nothing budges it past 76%.
This is where you need to change your approach fundamentally.
First, do error analysis. This systematic approach is crucial, so let me show you how to implement it:
import random

def deep_error_analysis(model, dataset):
    """
    Understand WHERE and WHY your model fails
    """
    errors = []
    for example in dataset:
        prediction = model(example)
        if prediction != example.label:
            errors.append({
                'example': example,
                'prediction': prediction,
                'true_label': example.label,
                'confidence': prediction.confidence
            })

    # Analyze error patterns
    print("Error analysis:")

    # 1. Which classes are confused most?
    print_confusion_patterns(errors)

    # 2. Are errors high-confidence (wrong but sure) or low-confidence?
    print_confidence_distribution(errors)

    # 3. Do errors share characteristics?
    print_error_feature_analysis(errors)

    # 4. Manually look at random errors
    print("\n=== Random Error Examples ===")
    for error in random.sample(errors, 20):
        print(f"\nTrue: {error['true_label']}")
        print(f"Predicted: {error['prediction']}")
        print(f"Example: {error['example']}")
        print("Why did model fail here?")
        input("Press enter for next...")
This error analysis often reveals the issue: - “Oh, the model confuses class A and B because we don’t have a feature that distinguishes them” - “The model fails on rare edge cases—we need more data for those” - “The model is learning a spurious pattern in the background, not the actual object”
Based on error analysis, you might:
- Add features: The model needs information it doesn’t have
- Get more diverse data: Errors cluster in underrepresented regions
- Change the architecture: The model can’t represent the patterns you need
- Reframe the problem: Maybe this shouldn’t be classification at all
- Accept the limitation: Maybe 76% is the best possible given your data
Type 4: “My Architecture/Approach Fundamentally Can’t Work”
This is the hardest kind of stuck, because the problem isn’t in the details—it’s in the approach.
Signs you’re in this situation: - Simple baselines work better than your complex model - The approach works in papers but not on your data - Every variation you try fails in the same way - Domain experts say “that shouldn’t work for this problem”
When this happens, you need to:
- Go back to first principles (we covered this)
- What are you really trying to do?
- What does the model fundamentally need to know?
- Are you giving it that information?
- Look at related problems (we covered this too)
- How do people solve similar problems?
- Is there a different framing that works better?
- Simplify drastically:
- Solve a much easier version first
- Remove 90% of the complexity
- Get SOMETHING working, even if limited
Let me give you an example. A team was trying to predict equipment failures in a factory. They framed it as time-series forecasting: predict the exact time until failure. Stuck for months—predictions were no better than random.
Then they simplified: “Forget predicting time. Can we just detect WHEN the equipment starts behaving abnormally?”
Reframed as anomaly detection, they had a working system in two weeks. It wasn’t what they originally planned, but it solved the actual business problem.
The Practice of Getting Unstuck
Here’s what I do when I’m stuck:
1. Take a walk: Seriously. Away from the computer. Your brain needs to switch from focused mode to diffuse mode. Solutions often come when you’re not actively trying.
2. Explain the problem to someone else: Even if they don’t understand ML. The act of explaining often reveals the issue. (Rubber duck debugging!)
3. Sleep on it: The overnight break gives your brain time to process. I’ve solved more problems in morning showers than in late-night coding sessions.
4. Pair program: Get a colleague to look at your problem fresh. They’ll ask questions you haven’t thought of.
5. Document what you’ve tried: Write down every approach, why it failed. This prevents repeating failed attempts and might reveal patterns.
6. Set it aside and work on something else: Sometimes you’re too close to the problem. Working on other things can bring new perspectives.
7. Read widely: Papers, blog posts, tweets. Sometimes the solution is in an adjacent domain you haven’t considered.
The key insight: Being stuck is temporary. How you respond to being stuck determines whether you break through or stay stuck.
The Recursive Nature of ML Planning: Plans Within Plans
Let me share something that might feel overwhelming at first but is actually liberating once you understand it: Every ML plan contains smaller planning problems.
When you devise a plan like “Build a production recommendation system,” you’re not creating one plan—you’re creating a hierarchy of plans, each requiring its own problem-solving process.
It’s like planning a road trip. Your high-level plan is “Drive from San Francisco to New York.” But that breaks into smaller plans: “Drive from San Francisco to Reno” (first day), which breaks into even smaller plans: “Stop for gas in Sacramento,” which breaks into micro-plans: “Which gas station? Which route?” Each level requires its own decisions.
ML planning works the same way, but we need to be more systematic about it.
The Hierarchy of Planning
Let me walk you through a real example. You’re tasked with building a content moderation system for a social media platform. Your high-level plan might be:
Level 0 (Top Level): Production Content Moderation System - Detect harmful content in user posts - Achieve 95% precision (very few false positives) - Achieve 90% recall (catch most harmful content) - Latency < 100ms - Handle 50K posts per second
This is your top-level goal. Now you need to devise a plan. Let’s say you decide:
Level 1 (System Architecture Plan): 1. Multi-task transformer model for text 2. CNN for image/video moderation 3. Ensemble combining both for multi-modal posts 4. Human-in-the-loop for borderline cases 5. Active learning to improve over time 6. A/B testing infrastructure
Good! But now each of these components is itself a problem requiring a plan. Let’s drill into just one: “Multi-task transformer model for text.”
Level 2 (Model Architecture Plan): - Base: Pre-trained BERT (or RoBERTa or similar) - Task-specific heads for each policy violation type: - Hate speech detection - Misinformation detection - Spam detection - Harassment detection - Shared encoder for efficiency - Separate calibration for each task
Okay, but how do you train this? That’s another planning problem:
Level 3 (Training Strategy Plan): - Start with frozen BERT, train only classification heads (1 epoch) - Unfreeze and fine-tune full model (5 epochs) - Use weighted loss to handle label imbalance - Augment training data with back-translation - Implement curriculum learning (start with clear examples, gradually add hard ones)
And how do you handle the label imbalance? Another plan:
Level 4 (Class Imbalance Plan): - Oversample minority classes - Use focal loss to focus on hard examples - Class weights in loss function - Synthetic example generation using GPT-2 - Hard negative mining (examples model gets wrong)
You see how this works? Each level of planning creates sub-problems that need their own plans. And this isn’t a sign of poor planning—this is necessary decomposition.
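To show how concrete the leaves of this hierarchy get, here is a sketch of a single item from the Level 4 plan, a binary focal loss. The alpha and gamma values are the commonly cited defaults from the original focal loss paper, not a recommendation for this problem.

# SKETCH: binary focal loss for the class-imbalance leaf of the plan (default alpha/gamma)
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # up-weight the rare positive class
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()       # (1 - p_t)^gamma focuses on hard examples

# usage with dummy tensors:
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
print(binary_focal_loss(logits, targets))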
Managing the Recursion: Practical Strategies
This recursion can feel overwhelming. “How deep do I need to plan? When do I stop?” Here are strategies I use:
1. Plan depth matches uncertainty
For components you understand well, plan shallowly: - “Use standard data augmentation” (you know how to do this, don’t need detailed planning)
For components with uncertainty, plan deeply: - “Handle multi-lingual content” (you haven’t done this before, need detailed plan)
2. Plan until you hit actionable tasks
Keep recursing until you reach tasks where you know the next concrete action: - ❌ “Improve model performance” (not actionable—how?) - ✓ “Run hyperparameter sweep over learning rates [1e-5, 1e-4, 1e-3]” (actionable!)
3. Plan top-down, execute bottom-up
Make your high-level plan first. Then recursively expand the uncertain parts. But when implementing, start from the bottom: - First: Get data loading working - Then: Get baseline model training - Then: Add complexity - Finally: Deploy complete system
This way you always have something working at each stage.
4. Document the hierarchy
I literally draw this out as a tree: the production goal at the root, the major components one level down, and their sub-plans branching below that.
This visual hierarchy helps me: - See the full scope - Identify what I’ve planned and what’s still vague - Communicate with team members - Track progress
The Wisdom of Deferred Planning
Here’s a counterintuitive insight: You don’t need to plan everything upfront.
In fact, planning too deeply too early is often wasteful. Why? Because your early experiments will teach you things that change your plan.
The pattern I follow: 1. Plan the first phase in detail (enough to start) 2. Plan subsequent phases at high level only (rough sketch) 3. Refine the plan as you learn (based on what works)
Example: For that content moderation system, I might plan in detail: - Data pipeline setup - Baseline model (simple BERT fine-tuning) - Evaluation framework
But I’d only sketch: - Multi-task learning (might not be needed if single-task works) - Ensemble strategy (depends on baseline results) - Active learning (Phase 2, depends on Phase 1 results)
Why? Because maybe my baseline achieves 93% and I realize I don’t need the complexity of multi-task learning. Or maybe I discover that hate speech and misinformation share a lot of features, suggesting multi-task learning is crucial. I don’t know until I try the baseline.
This is “agile planning” for ML. Plan enough to make progress, but stay flexible to adapt as you learn.
Knowing When to Replan
Plans need to change. How do you know when?
Strong signals to replan:
- Your assumptions were wrong (data is different than expected)
- Constraints changed (now you need 10ms latency, not 100ms)
- Baseline experiments show completely different results than expected
- New technology/paper makes your approach obsolete
- Team/timeline/resources changed significantly

Weak signals (don’t replan yet):
- One experiment didn’t work (try a few variations first)
- Someone suggests a different approach (evaluate if it’s actually better)
- A paper claims better results (might not transfer to your setting)
The art is knowing when to persist with your plan (not every failure means the plan is wrong) vs. when to pivot (when evidence clearly shows a different path is better).
Practice: Draw Your Planning Hierarchy
Take a project you’re working on. Draw it as a hierarchy:
- Top level: The production goal
- Second level: Major components
- Third level: Sub-components of each
- Keep going until you hit “I know how to implement this”

For each node, note:
- Is this planned in detail or just sketched?
- What’s the main uncertainty here?
- What’s the risk if this part fails?
- What’s the next action to make progress?

This exercise often reveals:
- Parts you thought were planned but are actually vague
- Unnecessary complexity you can cut
- Critical paths where you need more planning
- Opportunities for parallelization
The hierarchy is your map. Use it to navigate your project.
The Aesthetics of ML Solutions: Recognizing Beauty in Plans
Let me tell you about two moments from my career. Both were successful projects—they achieved their metrics, they deployed to production, they delivered business value. But they felt completely different.
Project A: The model was an ensemble of seven different architectures. Training required carefully orchestrated steps in a specific order. The feature pipeline had 47 different transformations. In production, we needed three different models for different latency tiers. The documentation was 50 pages. When something broke, debugging took hours because the system was so complex.
Project B: The model was a single neural network with skip connections. Training was straightforward—one script, one config file. Features were carefully chosen but minimal—just 20 of them. Production deployment was simple. The core logic fit on one page. When issues arose, we could diagnose them quickly because the system was transparent.
Both worked. But only Project B felt beautiful.
This is what Pólya means when he talks about the aesthetics of solutions. Not all plans are created equal. Beyond mere correctness (does it meet the metrics?), plans possess qualities that experienced practitioners learn to recognize and value.
Let me help you develop this sense of aesthetic judgment.
Elegance: Achieving More With Less
Elegance isn’t about simplicity for its own sake. It’s about appropriate complexity—using exactly what’s needed and nothing more.
Inelegant solution:
- Five-model ensemble
- 200 engineered features (most weakly predictive)
- Complex stacking with meta-learning
- Different approaches for different edge cases
- Works, but barely, and fragile

Elegant solution:
- Single well-designed model
- 20 carefully chosen features (each strongly predictive)
- Clean architecture that handles edge cases naturally
- Works robustly across scenarios
Let me give you a real example: ResNet vs. earlier deep network approaches.
Before ResNet, people struggled to train very deep networks. The approaches were inelegant:
- Try very careful initialization schemes
- Add batch normalization everywhere
- Try different activation functions
- Carefully tune learning rates for each layer
- Still couldn’t reliably train 50+ layer networks

ResNet came along with one elegant idea: skip connections. Let the network learn residuals (differences) instead of full mappings. Suddenly:
- 152-layer networks train easily
- Performance scales with depth
- Works across tasks without special tuning
- The insight is simple and beautiful
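That single idea fits in a few lines. Here is a minimal residual block in PyTorch, simplified from the original paper (fixed channel count, no downsampling path), just to show the shape of the trick:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: the layers learn F(x), the output is F(x) + x."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        # Skip connection: if the best mapping is close to identity,
        # the block only has to learn a small correction.
        return self.relu(residual + x)
```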
The elegant solution isn’t just more effective—it reveals something fundamental about the problem. It makes you think: “Of course! That’s how it should be done.”
How to recognize elegance in your own work:
Ask yourself:
- Could I explain this approach to someone in five minutes?
- If I removed any component, would it break?
- Does each part have a clear purpose?
- Would I be proud to show this in a paper?
If you’re saying “well, we tried X but it didn’t quite work, so we added Y as a workaround, and then we needed Z to handle edge cases…”—that’s a sign of inelegance. Elegant solutions feel inevitable, not cobbled together.
Robustness: Continuing to Work When Things Change
Here’s a test: How many things need to go exactly right for your solution to work?
Fragile solution:
- Only works if data is perfectly clean
- Fails if hyperparameters aren’t exactly tuned
- Breaks if data distribution shifts slightly
- Requires manual intervention when edge cases appear

Robust solution:
- Handles noisy data gracefully
- Performance degrades smoothly with hyperparameter changes
- Adapts to distribution shift (or degrades predictably)
- Deals with edge cases automatically
Let me show you what this means practically:
Example: Fraud Detection Model
Fragile approach: Train on last year’s fraud patterns. High accuracy on test set. Deploy to production. Performance drops by 20% within two months because fraudsters adapted. Model needs complete retraining.
Robust approach:
- Use an anomaly detection component (catches novel fraud patterns)
- Online learning updates (adapts to drift)
- Ensemble with a rule-based component (stable baseline)
- Active learning flags uncertain cases for human review
- Model degrades gracefully rather than catastrophically
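To make the “safety net plus human review” idea concrete, here is an illustrative decision function. The model interface, rule engine, and thresholds are hypothetical stand-ins to be replaced and tuned for your own system, not a real API:

```python
def decide(transaction, model, rules, low=0.3, high=0.8):
    """Combine a learned model with a rule-based safety net and a human-review
    band for uncertain cases. `model`, `rules`, and both thresholds are
    placeholder assumptions, tuned on validation data in practice."""
    if rules.is_known_fraud_pattern(transaction):          # stable rule-based baseline
        return "block"

    score = model.predict_fraud_probability(transaction)   # learned fraud score in [0, 1]
    if score >= high:
        return "block"
    if score <= low:
        return "approve"
    return "human_review"   # the uncertain band feeds the active-learning queue
```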
The robust solution isn’t just better—it requires less maintenance, fails more predictably, and scales better over time.
How to build robustness into your plans:
- Plan for data drift: Your training data is a snapshot. Production data will differ.
  - Use techniques that generalize (regularization, augmentation)
  - Build in monitoring for drift detection
  - Design for adaptation (online learning, periodic retraining)
- Avoid brittle dependencies:
  - If your model REQUIRES feature X to be perfectly accurate, you’re brittle
  - Better: the model gracefully handles missing or noisy features
- Have fallback strategies:
  - Ensemble with a simple baseline
  - Rule-based system as a safety net
  - Human-in-the-loop for uncertain cases
- Test in adversarial conditions (see the sketch after this list):
  - What if 20% of features are missing?
  - What if the class distribution shifts?
  - What if input data is deliberately adversarial?
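A minimal sketch of the missing-features stress test, assuming a fitted model with a scikit-learn-style `predict`, a NumPy float feature matrix `X_val`, labels `y_val`, and a metric function of your choosing; the 20% drop rate matches the question above:

```python
import numpy as np

def stress_test_missing_features(model, X_val, y_val, metric, drop_frac=0.2, seed=0):
    """Blank out a random fraction of feature values and measure how much the
    validation metric degrades. Assumes the pipeline can tolerate NaNs
    (for example via imputation)."""
    rng = np.random.default_rng(seed)
    X_corrupted = X_val.copy()
    mask = rng.random(X_corrupted.shape) < drop_frac
    X_corrupted[mask] = np.nan

    clean_score = metric(y_val, model.predict(X_val))
    degraded_score = metric(y_val, model.predict(X_corrupted))
    print(f"Metric on clean features: {clean_score:.3f}")
    print(f"Metric with {drop_frac:.0%} features missing: {degraded_score:.3f}")
```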
Clarity: Understanding What’s Happening
Can you explain why your model makes each prediction? Can you debug when it fails? Can your teammates understand and maintain it?
Opaque solution:
- “We trained this ensemble and it works, but we don’t really know why”
- Black box that’s impossible to debug
- Only one person understands the codebase
- When it fails, you’re guessing at fixes

Clear solution:
- Each component has a clear purpose
- You can trace prediction logic
- Failure modes are understandable
- The codebase is documented and modular
Example from my experience:
I once inherited a text classification model. The code was 3,000 lines in one file. The feature engineering involved 50+ transformations, some in pandas, some in SQL, some in custom Python. Training required running five separate scripts in sequence. Nobody could explain what half the features did.
When we needed to fix a bug, it took a week just to understand the system.
I refactored it:
- Clear pipeline: data → preprocessing → feature_engineering → model → postprocessing
- Each step in its own module with docstrings
- Configuration file for all hyperparameters
- Comprehensive tests
- Documentation explaining each feature and why it helps
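As a rough illustration of that structure, a top-level entry point might look like the sketch below. The module and function names are hypothetical; the point is that each stage is its own module and every hyperparameter lives in one config file:

```python
import yaml  # PyYAML; any config format would do

# Hypothetical stage modules: data, preprocessing, feature_engineering, model, postprocessing
from pipeline import data, preprocessing, feature_engineering, model, postprocessing

def run(config_path="config.yaml"):
    with open(config_path) as f:
        config = yaml.safe_load(f)          # every hyperparameter lives here

    raw = data.load(config["data"])
    clean = preprocessing.run(raw, config["preprocessing"])
    features = feature_engineering.build(clean, config["features"])
    trained = model.train(features, config["model"])
    return postprocessing.run(trained, config["postprocessing"])
```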
The model’s performance was identical. But now any team member could work on it. Bugs took hours to fix, not weeks.
How to achieve clarity:
Write code like you’ll forget it: Because you will. In six months, you’ll thank yourself for clear variable names and comments.
Modularize: Separate data loading, preprocessing, model architecture, training, evaluation. Each should be independently understandable.
Document decisions: Not just what you did, but why. “We use BERT because…” “We chose these features because…”
Make effects visible: Log intermediate outputs. Visualize learned representations. Show attention weights. Make the model’s reasoning transparent.
Inevitability: The Feeling of “Of Course”
The best solutions have a quality that’s hard to describe but you know it when you see it: inevitability. When you understand the solution, you think “Of course! How else would you do it?”
This doesn’t mean the solution was easy to find. The most inevitable-seeming solutions often required great creative leaps. But once you see them, they feel natural.
Examples of inevitable solutions:
Attention mechanisms for sequence-to-sequence: Of course! The model should be able to look at different parts of the input for each output. How else would you handle variable-length alignment?
Batch normalization: Of course! Normalize activations at each layer to keep gradients stable. So obvious in hindsight.
Skip connections in ResNet: Of course! Let the model learn the difference rather than the full transformation. That’s naturally easier.
Self-supervised learning: Of course! The data contains its own supervision signals. Predict masked words, predict rotation, predict next frames—the labels are free.
These solutions feel inevitable because they align with the fundamental structure of the problem. They’re not fighting against the problem—they’re working with it.
How do you know if your solution has this quality?
- When you explain it to colleagues, do they nod and say “that makes sense” or do they look confused?
- Does the solution generalize naturally to related problems?
- Could you have predicted this solution would work before implementing it?
- Does the solution reveal something about the problem itself?
Developing Your Aesthetic Sense
This aesthetic judgment—recognizing elegance, robustness, clarity, inevitability—develops over time. Here’s how to cultivate it:
1. Study great work: Read papers for elegant solutions. How did they think about the problem? What makes their approach beautiful?
2. Refactor your own work: After you get something working, ask: “Could this be simpler? Clearer? More robust?”
3. Compare approaches: When multiple solutions work, analyze why. What are the qualitative differences? Which one feels better?
4. Seek feedback: Show your work to experienced practitioners. Ask not just “does it work?” but “is this a good solution?”
5. Reflect on failures: Often, failed approaches were inelegant. What was wrong with them? What does that teach you?
Over time, you’ll develop an intuition. When you’re planning, you’ll start to feel whether your approach is elegant or cobbled together. That feeling—that aesthetic sense—is valuable. Trust it. If your plan feels messy and fragile, it probably is. Keep searching for the elegant approach.
Conclusion: The Living Plan and Your Journey Forward
We’ve walked a long way together. From standing at the edge of that canyon between your current state and your goal, through pattern recognition and related problems, through creative auxiliary elements and problem transformations, through working backwards and getting unstuck, through recursive planning and aesthetic judgment.
But here’s what I need you to understand as we close: A plan is not a rigid prescription. It’s a living strategy.
The plan you devise today will change tomorrow. You’ll run your first experiments and learn something that shifts your approach. You’ll hit a wall and discover a better path around it. You’ll see a new paper that opens possibilities you hadn’t considered. Your stakeholders will change requirements. Your data will reveal patterns you didn’t expect.
This is normal. This is good. The best ML practitioners hold their plans lightly.
They’re confident enough in their strategic thinking to commit to a direction, but humble enough to adapt when reality reveals complexities not anticipated during planning. They know the difference between “this approach needs more iteration” and “this approach is fundamentally wrong, time to pivot.”
Let me leave you with a mental model that’s served me well:
Think of planning like navigation.
You’re sailing from San Francisco to Hawaii. You plot a course—that’s your plan. But you don’t just set your heading and ignore everything else for the next week. You constantly monitor:
- Are we on course? (Metrics tracking)
- Is the wind changing? (Data distribution shift)
- Are there storms ahead? (Anticipated challenges)
- Did we discover a better route? (New insights)
You make constant small adjustments to your course. That’s execution with adaptation. But you don’t change your destination—Hawaii is still the goal. Unless you discover compelling reasons to change the goal itself (stakeholder requirements change).
Your ML plan is the same: Strategic direction that guides you, but adapts based on what you learn.
Your Planning Toolkit: A Summary
Let me give you a checklist to return to when you’re devising plans:
Pattern Recognition:
- Have I seen this problem before?
- What similar problems do I know?
- What patterns in my experience match this?
- What architectures have worked for similar tasks?

Related Problems:
- What problems share the same architecture?
- What problems share the same challenges?
- What problems are in the same domain?
- Can I transfer results, methods, or insights?

Auxiliary Elements:
- What additional tasks could help learning?
- What intermediate representations might help?
- What auxiliary losses could guide training?
- What creative additions might unlock progress?

Problem Transformation:
- How else could I frame this problem?
- What assumptions am I making that I could change?
- What different ML task types could solve this?
- What different representations could I use?

First Principles:
- What am I really trying to do? (Plain language)
- What does success actually mean?
- What does my model need to know?
- What assumptions am I making?
- What’s the simplest approach?

Working Backwards:
- What are my hard constraints?
- What do these constraints imply?
- What components do I need?
- What does each component require?
- What’s the data and compute budget?

When Stuck:
- Can my model overfit a tiny batch? (Capacity check)
- Where and why is it failing? (Error analysis)
- Am I solving the right problem? (Problem check)
- Have I tried simpler approaches? (Baseline check)
- Do I need to reframe entirely? (Strategic pivot)

Aesthetic Judgment:
- Is this elegant? (Appropriate complexity)
- Is this robust? (Handles uncertainty)
- Is this clear? (Understandable and debuggable)
- Does this feel inevitable? (Aligned with problem structure)
Practice: Your Personal Reflection
Before you go, I want you to do something. Take the project you’re currently working on (or about to start). Write down:
- Where you are: Current state, what you have
- Where you need to be: Production goal, constraints, metrics
- Your current plan: How you’re planning to bridge the gap
- Your confidence: Which parts are clear? Which parts are uncertain?
- Your next action: What’s the concrete next step?
Then ask yourself:
- Have I considered the patterns I know?
- Have I looked at related problems?
- Could I reframe this differently?
- Am I working backwards from constraints?
- Is my plan elegant, robust, and clear?
This reflection—this conscious application of Pólya’s questions to your actual work—is how these heuristics become second nature.
The Journey Continues
Remember: The plan is the bridge between data and deployment. Pólya’s questions are the architectural principles that ensure your bridge is not just functional but optimal—not just sufficient but elegant.
We’ve focused on devising the plan, but there’s more to the journey. In the next article, we’ll explore how to carry out your plan—how to execute with discipline, how to debug when things go wrong, how to iterate efficiently, and how to know when your beautiful plan needs revision in the face of empirical reality.
But for now, you have the tools to devise strong plans. You understand that planning is creative problem-solving—recognizing patterns, finding connections, transforming representations, working backwards, and building with aesthetic judgment.
The planning phase is your opportunity to think deeply before acting. It’s your chance to leverage your experience, to avoid known pitfalls, to choose elegant approaches over brute-force ones. Take your time with it. A day spent devising a good plan can save weeks of wasted effort.
Go forth and plan well. And remember: every senior ML practitioner was once where you are, learning to devise plans, getting stuck, breaking through, developing judgment. This is the path. You’re on it. Keep walking.
The next time you face a new ML problem, don’t immediately jump to “let me try some models.” Pause. Ask Pólya’s questions. Sketch your bridge. Think about patterns and related problems. Consider transformations and auxiliary elements. Work backwards from constraints. Judge your plan aesthetically.
That’s how you become not just a practitioner, but a master of your craft.
This article is Part 2 of a four-part series on applying Polya’s problem-solving framework to data science and machine learning. Continue with [Part 3: Carrying Out the Plan], which addresses implementation challenges and best practices, and [Part 4: Looking Back], which covers evaluation, iteration, and continuous improvement.
The principles and frameworks presented here have been developed through analysis of hundreds of ML projects across industries. While each project is unique, the patterns of success and failure repeat with remarkable consistency. By learning from these patterns and applying Polya’s timeless wisdom, we can dramatically improve our chances of building ML systems that deliver real value.