Part 2 of 4: Applying Polya’s Problem-Solving Framework to Data Science, ML, and AI
A Comprehensive Guide for Practitioners
This article is actively evolving. If you spot gaps, disagree with a take, or have better patterns, I’d love to hear it. Suggestions and critiques are welcome.
Introduction: Standing at the Threshold
Picture yourself in this moment. You’re sitting at your desk, perhaps with a cup of coffee growing cold beside your keyboard. You’ve spent the last few days—maybe weeks—immersed in your data. You’ve run df.describe()
more times than you can count. You’ve created distribution plots, correlation matrices, and t-SNE visualizations. You’ve identified the outliers, understood the class imbalance, and discovered that one column that’s leaking future information. You’ve documented your findings, noting that 15% of your labels might be noisy, that your features have wildly different scales, and that production latency can’t exceed 50 milliseconds.
You understand the problem. You really do.
But now comes something different. Now comes the moment where all that careful analysis must transform into action. Where understanding gives way to creation. Where you move from observer to architect.
This is the moment of devising a plan.
George Pólya, the mathematician who crystallized the art of problem-solving into teachable principles, understood that this stage represents the most profoundly human aspect of our work. It’s where intuition dances with logic, where your past projects—the successes and the failures—suddenly illuminate possibilities for this new challenge. It’s where you become an artist, sketching pathways through the vast, often intimidating landscape of model architectures, training strategies, and deployment constraints.
Let me walk with you through this process. Not as a distant lecturer, but as a colleague who’s been in your shoes, who’s felt that mix of excitement and uncertainty that comes with starting a new ML project.
The Architecture of Strategic Thinking: Building Your Bridge
Close your eyes for a moment and imagine this: You’re standing at the edge of a canyon. On this side—your side—you have everything you’ve gathered. Your training dataset of 50,000 samples, your baseline logistic regression that achieves 72% accuracy, your production constraints (50ms latency, 10MB model size), your stakeholder requirements (must be interpretable for regulatory compliance).
On the other side of that canyon lies your destination. The production model that achieves 90% accuracy, that makes predictions in 30 milliseconds, that your compliance team can audit, that scales to handle peak traffic. That’s where you need to be.
The plan you’re about to devise? It’s your bridge across this canyon.
Now, here’s what makes this fascinating—and what Pólya understood intuitively: there’s no single “right” bridge design. Sometimes the most direct path works beautifully. You recognize the problem as straightforward image classification, you grab a pre-trained ResNet, fine-tune it on your data, and you’re done. Direct. Efficient. Perfect.
But other times—and these are often the most interesting projects—the direct path won’t work. Maybe your data is too different from ImageNet. Maybe you need to solve multiple related tasks simultaneously. Maybe you need to invent something new. In those cases, your bridge might take a creative detour, arcing high through an auxiliary task that helps you learn better representations. Or perhaps you’ll build it incrementally, starting with a simple baseline and iteratively adding complexity as you understand what works.
Here’s what’s crucial to understand: Pólya doesn’t prescribe which bridge to build. Instead, he offers you a series of questions—probing, insightful questions that awaken your problem-solving instincts. These aren’t checklist items to tick off mechanically. Think of them as a conversation with a wise mentor who knows exactly what to ask to unlock your thinking.
Let’s dive into these questions together, and I’ll show you how they apply to the messy, real world of machine learning.
The First Question: “Have You Seen This Before?”
Let me tell you a story. A few years ago, I was working on a project to detect fraudulent insurance claims. New problem, new domain, new dataset. I could have started from scratch, treating it as entirely novel. But then Pólya’s first question echoed in my mind:
“Have you seen it before?”
And suddenly, my brain started making connections. Wait—fraud detection isn’t new to me. I’ve built spam filters. I’ve worked on credit card fraud. I once helped detect fake product reviews. These problems share DNA. They’re all about finding needles in haystacks, about detecting patterns that adversarial actors try to hide, about dealing with extreme class imbalance where fraudulent cases are maybe 1-2% of your data.
This is pattern recognition, and it’s one of the most powerful tools in your arsenal as an ML practitioner. Your brain—after years of projects, papers, and late-night debugging sessions—has built an incredible library of experiences. Every model you’ve trained, every architecture you’ve studied, every failure you’ve debugged has been cataloged somewhere in your neural networks.
But here’s the nuance that separates good practitioners from great ones: you’re not looking for exact matches. You’re looking for family resemblances, for structural similarities hidden beneath surface differences.
Let me show you what I mean. Imagine you’re working on a new time-series forecasting problem—predicting server load for autoscaling. You might think: “I’ve never done this before.” But wait. Have you worked with sequential data? Language models process sequences. Video understanding processes temporal information. Even if you’ve never predicted server load, if you’ve fine-tuned a BERT model, you understand the fundamental architecture of processing sequential information with attention mechanisms. That’s a pattern match.
Or consider this: You’re building a medical image segmentation model to identify tumors in CT scans. “I’ve never worked in medical imaging,” you might think. But have you done semantic segmentation in autonomous driving? Have you worked on document layout analysis? These problems share the same core structure: dense, pixel-level prediction where spatial relationships matter. The domain knowledge differs, but the architectural intuition transfers.
Let me walk you through how to search your experience systematically:
# PSEUDOCODE: How to Search Your ML Experience

def search_for_patterns(new_problem):
    """
    This is what happens in your brain when you encounter a new ML problem.
    Let's make it explicit.
    """
    # FIRST PASS: Surface-level similarities
    # Don't overthink this—just quick pattern matching
    print("What type of data am I working with?")
    data_type = new_problem.identify_data_type()
    # -> "text", "images", "tabular", "time-series", "graph"

    print("What am I trying to predict?")
    task_type = new_problem.identify_task()
    # -> "classification", "regression", "generation", "ranking"

    print("What are my constraints?")
    constraints = new_problem.list_constraints()
    # -> {"latency": "50ms", "interpretability": "required",
    #     "data_size": "10k samples"}

    # SECOND PASS: Abstract the essential structure
    # This is where it gets interesting
    print("\nWhat's the CORE challenge here?")
    structure = extract_pattern(new_problem)

    # For example, your problem might reduce to:
    # "Sequence-to-sequence with variable-length inputs"
    # "Imbalanced binary classification with adversarial actors"
    # "Multi-modal fusion for generation tasks"
    # "Few-shot learning in a new domain"

    # THIRD PASS: Search your memory at different levels
    relevant_memories = []

    # Start specific
    print("\nHave I solved this EXACT problem before?")
    exact_matches = recall_projects(task=task_type,
                                    data=data_type,
                                    domain=new_problem.domain)

    if exact_matches:
        print(f"Yes! I can adapt my approach from {exact_matches}")
        return exact_matches

    # Go broader - same architecture family
    print("\nHave I used similar architectures?")
    architectural_matches = recall_projects(
        architecture_family=structure.architecture_hint
    )
    # e.g., "I've used transformers for NLP, maybe they work here too"

    # Go even broader - same fundamental challenge
    print("\nHave I faced similar CHALLENGES?")
    challenge_matches = recall_projects(
        challenges=structure.key_challenges
    )
    # e.g., "I've dealt with class imbalance in spam detection"

    # FOURTH PASS: Cross-domain analogies
    # This is where creativity happens
    print("\nWhat problems in OTHER domains share this structure?")
    # Real examples of this:
    # - Music recommendation ← Natural language processing
    #   (playlists as sentences, songs as words)
    # - Protein folding ← Language modeling
    #   (amino acid sequences have grammar-like rules)
    # - Traffic flow ← Fluid dynamics
    #   (both are about flow through networks)
    analogous_domains = find_structural_parallels(structure)

    return {
        'direct_matches': exact_matches,
        'architectural_inspiration': architectural_matches,
        'challenge_patterns': challenge_matches,
        'creative_analogies': analogous_domains
    }
Let me give you a concrete example of how this plays out. When DeepMind was working on AlphaFold for protein folding, they could have treated it as an entirely new problem. After all, predicting 3D protein structure from amino acid sequences is a unique biological challenge. But they recognized patterns:
- From language modeling: Proteins have sequence structure, like sentences
- From computer vision: The contact map prediction is like image segmentation
- From attention mechanisms: Long-range dependencies matter (like in transformers)
- From graph networks: Spatial relationships between amino acids form a graph
By recognizing these patterns, they could build on years of ML progress rather than starting from scratch. Their plan emerged from pattern recognition across multiple domains.
Now, let me be honest with you about something: This pattern recognition isn’t instantaneous. When you’re early in your ML career, your library of patterns is still being built. That’s okay. That’s expected. Every project you complete, every paper you deeply understand, every architecture you implement from scratch—these are all deposits in your pattern bank.
But here’s the secret that accelerates this process: Deliberately reflect on your projects. After you finish a model, don’t just move on. Ask yourself: - What was the essential structure of this problem? - What made it hard? - What architectural choices were crucial? - What would I do differently next time? - What other problems share this structure?
Write this down. Seriously. Keep a personal wiki, a notion page, a markdown file—whatever works for you. Future you will thank present you when you can search “problems with extreme class imbalance” and find three previous approaches that worked.
The Creative Leap: Introducing Auxiliary Elements
We’ve been building on what exists—pattern recognition, related problems, transferred methods. But now we come to something different. Something that requires genuine creativity. Pólya calls these auxiliary elements, and they represent some of the most exciting moments in machine learning.
Let me explain what I mean with a story. Imagine you’re working on image classification, and you’ve hit a wall. Your model plateaus at 85% accuracy. You’ve tried deeper networks, more data augmentation, better optimizers—nothing budges the needle. The gap between your current state and your goal seems unbridgeable with conventional approaches.
This is when you might introduce an auxiliary element—something not present in the original problem formulation, something creative that acts as a catalyst, enabling reactions that wouldn’t occur naturally.
In the case of that image classification problem, someone had a creative insight: What if we don’t just train the model to classify, but also to predict the rotation angle of randomly rotated images? This auxiliary task (rotation prediction) isn’t part of your original goal (you don’t care about rotation angles), but it forces the network to learn better representations. It works as a catalyst. Your main task performance jumps to 89%.
This is an auxiliary element in action.
Let me walk you through different types of auxiliary elements you might introduce:
Auxiliary Tasks (Multi-Task Learning)
This is probably the most common type. You add additional prediction tasks that aren’t your ultimate goal but help you learn better representations.
Real scenario: You’re building a model for medical image analysis to detect lung cancer. Your main task is binary classification: cancer or no cancer. But you introduce auxiliary tasks: - Predict the patient’s age (forces the model to learn biological indicators) - Segment out the lung regions (forces spatial understanding) - Predict whether the patient is a smoker (forces learning of texture patterns)
None of these auxiliary tasks is your goal. But together, they shape your model’s learned representations, making it better at the main task.
Here’s the fascinating part: There’s no algorithm for choosing auxiliary tasks. This requires intuition, domain knowledge, creativity. You might ask yourself: - What else could this model predict from the same data? - What intermediate understanding would help the main task? - What related information does the data contain?
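To make the multi-task setup above concrete, here is a minimal PyTorch-style sketch of the lung-cancer example: one shared encoder, a main cancer head, and two of the auxiliary heads (the segmentation head is omitted to keep it short). The encoder, the feature size, and the 0.3 auxiliary weight are placeholders, not recommendations.

# SKETCH: shared encoder with auxiliary heads (hypothetical shapes and loss weights)
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNet(nn.Module):
    def __init__(self, encoder, feat_dim=512):
        super().__init__()
        self.encoder = encoder                       # any backbone mapping an image to feat_dim features
        self.cancer_head = nn.Linear(feat_dim, 1)    # main task: cancer vs. no cancer
        self.age_head = nn.Linear(feat_dim, 1)       # auxiliary: patient age (regression)
        self.smoker_head = nn.Linear(feat_dim, 1)    # auxiliary: smoker vs. non-smoker

    def forward(self, x):
        h = self.encoder(x)
        return self.cancer_head(h), self.age_head(h), self.smoker_head(h)

def multitask_loss(outputs, targets, aux_weight=0.3):
    cancer_logit, age_pred, smoker_logit = outputs
    main = F.binary_cross_entropy_with_logits(cancer_logit, targets["cancer"])
    aux = F.mse_loss(age_pred, targets["age"]) + \
          F.binary_cross_entropy_with_logits(smoker_logit, targets["smoker"])
    # the auxiliary terms shape the shared representation without dominating the main objective
    return main + aux_weight * aux

If an auxiliary head starts hurting the main metric, lowering its weight (or dropping it) is the first thing to try.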
Auxiliary Losses (Regularization Through Learning)
Sometimes your auxiliary element is a loss function that guides learning in useful directions.
Real scenario: You’re training a generative model, but the generated outputs look unrealistic. You add an auxiliary loss: a discriminator network that tries to distinguish real from generated examples. This is exactly what GANs do—the discriminator is an auxiliary element that provides a training signal (the adversarial loss) that helps the generator learn.
Or imagine you’re training an embedding model for product recommendations. Your main loss is based on click-through data. But you add an auxiliary loss: embeddings of similar products should be close together. This auxiliary loss acts as a regularizer, shaping your embedding space in useful ways.
Auxiliary Representations (Intermediate Structures)
Sometimes you introduce an intermediate representation that bridges the gap between input and output.
Real scenario: You’re building a text-to-speech system. Going directly from text to audio waveforms is incredibly difficult. But you introduce an auxiliary element: mel-spectrograms. Your system now has two stages—text to mel-spectrogram, then mel-spectrogram to audio. The mel-spectrogram is an auxiliary representation. It’s not your goal (users don’t want spectrograms), but it makes the problem tractable.
This is exactly what successful TTS systems like Tacotron do. The auxiliary element (spectrogram representation) transforms an impossible problem into two manageable ones.
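As a small illustration of how cheap this intermediate representation is to produce, here is a sketch using the librosa library; the file name and the parameter values (80 mel bands, hop length, and so on) are placeholders.

# SKETCH: a mel-spectrogram as an auxiliary intermediate representation (librosa)
import librosa

audio, sr = librosa.load("example_utterance.wav", sr=22050)   # hypothetical audio file
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)   # log-compress the power spectrogram

# Shape is (80, num_frames): the training target for stage 1 (text -> spectrogram)
# and the input for stage 2 (spectrogram -> waveform).
print(log_mel.shape)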
Let me share the thought process for introducing auxiliary elements:
# PSEUDOCODE: Strategic Thinking for Auxiliary Elements

def should_i_add_auxiliary_element(current_problem, current_progress):
    """
    You're stuck. You've tried the obvious things. Time to think creatively.
    """
    # First, diagnose WHY you're stuck
    print("Why is my model not improving?")

    if current_progress.shows("underfitting"):
        print("Model isn't learning good representations")
        print("Consider: Auxiliary tasks to force richer features")
        candidates = [
            "Add self-supervised pretraining task",
            "Multi-task learning with related tasks",
            "Add auxiliary classification heads at multiple layers"
        ]

    elif current_progress.shows("unstable_training"):
        print("Training dynamics are problematic")
        print("Consider: Auxiliary constraints to stabilize")
        candidates = [
            "Add auxiliary reconstruction loss",
            "Add contrastive loss for embedding space",
            "Add adversarial loss for robustness"
        ]

    elif current_progress.shows("gap_too_large"):
        print("Direct mapping from input to output is too hard")
        print("Consider: Intermediate representations")
        candidates = [
            "Add latent space representation",
            "Break into pipeline with intermediate outputs",
            "Add attention maps as intermediate supervision"
        ]

    # Second, evaluate each candidate
    for candidate in candidates:
        print(f"\nEvaluating: {candidate}")

        # Will this help the main task?
        if not likely_to_improve_main_task(candidate, current_problem):
            continue

        # Can I implement it with available data?
        if not have_data_for(candidate):
            continue

        # Will the added complexity be worth it?
        if complexity_increase(candidate) > expected_benefit(candidate):
            continue

        print(f"Worth trying: {candidate}")
        return candidate

    return None
Let me give you a powerful example of auxiliary elements in action: BERT.
Think about what BERT does. The creators wanted a model that understands language well enough for many downstream tasks (classification, question answering, etc.). But how do you train such a model? They introduced two auxiliary tasks:
- Masked Language Modeling: Randomly mask some words, predict them
- Next Sentence Prediction: Given two sentences, predict if they’re consecutive
These tasks aren’t anyone’s end goal. Nobody deploys BERT just to predict masked words. But these auxiliary tasks force BERT to learn deep linguistic understanding. Once trained on these auxiliary tasks, BERT becomes an incredibly powerful starting point for dozens of actual tasks.
The genius isn’t in the architecture (transformers already existed). The genius is in recognizing that these specific auxiliary tasks would teach the model what we want it to know. That’s creative insight.
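If you want to see how little machinery the masked-language-modeling task needs, here is a rough sketch of the masking recipe. The 15% rate and the 80/10/10 split follow the BERT paper; treat the rest (the toy vocabulary, the whitespace tokenization) as illustrative only.

# SKETCH: BERT-style masking for the masked language modeling auxiliary task
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                        # the model must recover this token
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token             # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)   # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

masked, targets = mask_tokens("the cat sat on the mat".split(),
                              vocab=["the", "cat", "sat", "on", "mat"])
print(masked)
print(targets)   # non-None positions are the ones the model is trained to predict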
A word of caution: Auxiliary elements can also backfire. I’ve seen teams add auxiliary tasks that actively hurt performance because they pulled the model in conflicting directions. I’ve seen auxiliary losses that made training unstable. The art is in choosing elements that genuinely help.
Here’s my advice: When introducing auxiliary elements, start simple. Add one auxiliary task. Does it help? Great, keep it. Does it hurt? Remove it or adjust it. Build your intuition through experimentation. Over time, you’ll develop a sense for which auxiliary elements are likely to help in which situations.
The Transformation Question: “Could You Restate the Problem?”
Let’s talk about one of the most powerful moves in your ML toolkit—one that can transform an impossible problem into a tractable one: problem reformulation.
I want you to imagine holding a Rubik’s cube. If you only know how to turn the front face, you’re severely limited. But once you realize you can rotate the entire cube to bring any face to the front, suddenly you have many more moves available. Problem reformulation is like that—it’s rotating the problem to expose different faces, each of which might be easier to solve.
Let me show you what I mean with a problem you might actually face. You’re tasked with predicting customer lifetime value (CLV) for a subscription business. The request seems straightforward: “Build a model that predicts how much revenue each customer will generate.”
But here’s the thing: This single business goal can be framed as completely different ML problems, each enabling different techniques and revealing different insights.
Let me walk you through each formulation as if we’re sitting together, sketching out approaches on a whiteboard:
Formulation 1: Classification
Your first instinct might be: “Let’s bucket customers. Low value (under $100), medium ($100-$1000), high (over $1000). We’ll build a classifier.”
Pros: This is simple. It’s interpretable. You can use standard classification techniques. Your stakeholders understand it—“This customer is likely to be high value.”
Cons: But think about it—you’ve just thrown away information. Is a $99 customer really that different from a $101 customer? Your buckets are arbitrary. And you’ve lost the ability to estimate actual dollar amounts. When finance asks “What’s the expected revenue from this cohort?” you can’t give a precise answer.
Would I start here? Maybe, if I’m exploring. But I’d know it’s limited.
Formulation 2: Regression
“Okay,” you think, “let’s predict the exact dollar amount. Regression problem. Done.”
You frame it as: Given customer features at signup, predict total lifetime revenue (a continuous value from $0 to… well, your highest-paying customer is at $50,000).
Pros: Now you have precise predictions. You can sum up expected revenue. You can rank customers accurately. This is what everyone wanted, right?
Cons: But then you start training and you see the problem. Your distribution is heavily right-skewed. Most customers are in the $100-$500 range, but you have a long tail of high-value customers. Your model, trying to minimize mean squared error, makes lots of errors on that long tail. You try log-transforms, you try robust loss functions, but nothing quite works cleanly.
And there’s another problem: you’re predicting cumulative future revenue, but you have no sense of time. A customer who generates $1000 over one year is very different from one who generates $1000 over five years, but your model treats them the same.
Workable? Yes. Ideal? Maybe not.
Formulation 3: Survival Analysis
Now you step back and think differently. “Wait,” you say, “what if I frame this as two separate things: HOW LONG will the customer stay (time until churn), and WHAT’S THE RATE of their spending?”
This is survival analysis. You’re modeling: 1. The hazard function: probability of churn at each time point 2. Expected revenue per time period while active
Then CLV = (revenue per period) × (expected lifetime)
Pros: This is more natural! It explicitly models the time dimension. It handles censored data gracefully (customers who haven’t churned yet). It lets you answer questions like “What’s the probability this customer is still active after 2 years?” The model structure matches the actual process.
Cons: It’s more complex to implement. You need to understand survival analysis (Cox models, Kaplan-Meier curves). It requires more careful data preparation. Your stakeholders might need education on what hazard ratios mean.
But for many subscription businesses, this is actually the “right” framing. It matches the underlying reality.
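If you want to prototype this framing before committing, the lifelines library covers the basics. A minimal, cohort-level sketch, assuming a customers table with hypothetical tenure_months, churned, and monthly_revenue columns:

# SKETCH: CLV via survival analysis (lifelines); file and column names are hypothetical
import pandas as pd
from lifelines import KaplanMeierFitter

df = pd.read_csv("customers.csv")   # needs tenure_months, churned (0/1), monthly_revenue

kmf = KaplanMeierFitter()
kmf.fit(durations=df["tenure_months"], event_observed=df["churned"])

expected_lifetime = kmf.median_survival_time_                     # months until half the cohort has churned
clv_estimate = df["monthly_revenue"].mean() * expected_lifetime   # CLV = spend rate × expected lifetime
print(f"Cohort-level CLV estimate: ${clv_estimate:,.0f}")

A Cox proportional hazards model (also in lifelines) gives per-customer survival curves instead of a single cohort number, which is closer to what you would actually deploy.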
Formulation 4: Reinforcement Learning
Now let’s get really creative. What if you reframe the entire problem?
“Actually,” you say, “I don’t just want to predict customer value. I want to maximize it through interventions. What if I frame this as: learn a policy that decides what actions to take (send discount, send email, do nothing) to maximize lifetime revenue?”
This is a reinforcement learning formulation: - State: Customer behavior, engagement metrics, recent activity - Actions: Various interventions you can take - Reward: Incremental revenue generated - Goal: Learn policy π that maximizes expected cumulative reward
Pros: This is action-oriented. You’re not just predicting, you’re optimizing. It directly aligns with the business goal (maximize revenue). It can discover non-obvious intervention strategies. It learns from online feedback.
Cons: It requires an experimentation infrastructure—you need to try actions and observe results. It’s sample inefficient—you need lots of data. It’s complex to implement and debug. You need to be careful about exploration vs. exploitation. There are ethical considerations around treating customers as experimental subjects.
Would I jump straight to RL? Probably not. But for a mature business with existing A/B testing infrastructure and lots of data, this framing might unlock significant value.
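To give a feel for the action-oriented framing, here is a deliberately tiny sketch, closer to a multi-armed bandit than full RL, with simulated rewards standing in for real customer responses. The action names and reward numbers are made up.

# SKETCH: epsilon-greedy policy over customer interventions (simulated rewards)
import random

actions = ["do_nothing", "send_email", "send_discount"]
value_estimate = {a: 0.0 for a in actions}   # running average reward per action
counts = {a: 0 for a in actions}
epsilon = 0.1

def observed_reward(action):
    # stand-in for incremental revenue measured after taking the action
    base = {"do_nothing": 1.0, "send_email": 1.3, "send_discount": 1.1}
    return random.gauss(base[action], 0.5)

for step in range(10_000):
    if random.random() < epsilon:
        action = random.choice(actions)                          # explore
    else:
        action = max(actions, key=lambda a: value_estimate[a])   # exploit
    reward = observed_reward(action)
    counts[action] += 1
    value_estimate[action] += (reward - value_estimate[action]) / counts[action]

print(value_estimate)   # converges toward the simulated action values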
Now, here’s the crucial lesson: These are all the same business problem, but completely different ML problems. Each formulation: - Enables different techniques - Requires different data structures - Makes different assumptions - Reveals different insights - Has different pros and cons
The formulation you choose shapes everything that follows. It’s not just a technical decision—it’s a strategic one.
Let me share how to think about this systematically:
When to consider reformulation:
When the obvious framing isn’t working: You’ve tried the straightforward approach and you’re stuck. Time to reframe.
When you have domain knowledge suggesting a different structure: If you understand the underlying process (like churn dynamics in subscription businesses), let that guide your formulation.
When your constraints force a different view: If you need to optimize actions (not just predict), that pushes you toward RL or causal inference framings.
When you discover the data doesn’t fit your framing: Your regression assumptions are violated? Maybe it’s not a regression problem.
Here are some common transformations to have in your toolkit:
Classification ↔︎ Regression: Sometimes predicting probabilities and thresholding works better than direct classification. Sometimes discretizing regression outputs makes the problem more stable.
Supervised → Self-Supervised: Can’t get enough labels? Maybe your problem can be reframed with automatic labels. Rotation prediction, colorization, masked language modeling—these are all self-supervised framings of problems that originally seemed to require labeled data.
Instance-Level → Set-Level: Struggling with individual predictions? Maybe Multiple Instance Learning is better. Example: Instead of classifying individual frames in a video, classify the entire video and let the model figure out which frames matter.
Time-Domain → Frequency-Domain: Stuck on time-series patterns? Apply an FFT and work in frequency space. Sometimes patterns invisible in the time domain are obvious in the frequency domain (see the short numpy sketch after this list).
Discriminative ↔︎ Generative: Can’t directly model P(y|x)? Sometimes modeling P(x|y) and P(y), then applying Bayes’ rule, works better. This is how Naive Bayes classifiers work.
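Here is the time-to-frequency move from the list above in a few lines of numpy: a weekly cycle buried in noisy daily data becomes a single obvious spike in the spectrum.

# SKETCH: moving a noisy daily series into the frequency domain with an FFT
import numpy as np

days = np.arange(365)
signal = np.sin(2 * np.pi * days / 7) + np.random.normal(0, 0.8, size=days.size)  # weekly cycle + noise

spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(days.size, d=1.0)           # cycles per day

dominant = freqs[spectrum.argmax()]
print(f"Dominant period: {1 / dominant:.1f} days")  # ~7 days, hard to see by eye in the raw series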
Here’s a practical exercise: Take a problem you’re working on right now. Write down three completely different ways to frame it as an ML problem. Don’t just think it—actually write them down. For each framing, note: - What techniques does this enable? - What assumptions does it make? - What are the pros and cons? - What would the data need to look like? - How would you evaluate success?
You might be surprised. Often, the formulation you started with isn’t the best one. But you won’t discover alternatives unless you actively look for them.
Returning Home: The Power of First Principles
Let me tell you about a time I watched a team waste three months. They were building a recommendation system—sophisticated, beautiful, using the latest transformer architectures. They were so deep in implementation details: attention heads, positional encodings, learning rate schedules. The model was getting more and more complex.
And then someone new joined the team. In the first meeting, they asked a simple question: “What are we actually trying to optimize here?”
Silence. Then various answers: “User engagement.” “Click-through rate.” “Revenue.” “Time on platform.”
These are all different objectives. And they’d been optimizing for CTR while stakeholders wanted revenue.
This is what Pólya means by “go back to definitions.” When you’re lost in a maze of technical details, when complexity has accumulated to the point where you can’t see clearly anymore, when your model has 50 hyperparameters and you’ve lost track of what each one does—that’s when you need to return home. Return to first principles.
Let me walk you through what this means in practice:
Define What You’re Actually Solving
Strip away all the ML jargon. What is the fundamental thing your model needs to do?
Not “multi-class classification with cross-entropy loss.” But: “Given a customer service inquiry, route it to the right department.”
Not “sequence-to-sequence generation with attention.” But: “Given a bug description, suggest relevant code files for the engineer to check.”
Not “unsupervised clustering with k-means.” But: “Group these customer behaviors so we can design targeted interventions.”
When you state it this way, clearly and simply, you can ask better questions: - What’s the actual impact of being right vs. wrong? - What’s the cost of different types of errors? - What does “good enough” actually mean? - What do we need the model to learn to accomplish this?
Question Your Metrics
Here’s an uncomfortable truth: the metric you’re optimizing is often wrong for the actual problem.
You’re using accuracy because it’s standard. But look at your confusion matrix—false positives and false negatives have wildly different costs in your application. Accuracy treats them equally. Should you be using a cost-sensitive metric instead?
You’re using BLEU score for your text generation model. But BLEU was designed for machine translation. Does it actually measure what matters for your use case? Maybe human evaluators rate outputs very differently than BLEU does.
You’re using AUC-ROC because your classes are imbalanced. But in production, you need to make decisions at a specific threshold. Shouldn’t you be optimizing for precision at that threshold instead?
Going back to definitions means asking: “What does good performance actually mean?” Define it from first principles, not from convention.
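One concrete habit that helps: report the metric at the threshold you will actually deploy with, right next to the headline AUC. A small sketch with scikit-learn, using simulated scores and a placeholder threshold of 0.6:

# SKETCH: evaluate at the deployment threshold, not only with threshold-free metrics
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score

# y_true and y_prob would come from your validation set; simulated here
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(y_true * 0.3 + rng.random(1000) * 0.7, 0, 1)

deploy_threshold = 0.6                        # placeholder: the operating point production will use
y_pred = (y_prob >= deploy_threshold).astype(int)

print("AUC-ROC:", roc_auc_score(y_true, y_prob))
print("Precision @ 0.6:", precision_score(y_true, y_pred))
print("Recall    @ 0.6:", recall_score(y_true, y_pred))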
Understand Your Model’s Capacity Requirements
Sometimes returning to first principles means asking: “What’s the minimum the model needs to know?”
I’ve seen teams train massive neural networks for problems that were, at their core, memorization of a few hundred rules. The model learns those rules, sure, but do you really need millions of parameters for that? Could a decision tree do the same job more efficiently?
Other times, teams use simple models for problems that fundamentally require complex pattern recognition. A logistic regression can’t learn hierarchical feature interactions no matter how much data you give it. You need the capacity of deeper models.
Returning to definitions means understanding: “What’s the inherent complexity of this problem?”
Let me give you a framework for thinking about this:
Simple problems (low complexity, clear rules): - Could be solved by a human with a checklist - Examples: Filtering spam emails based on keywords, flagging duplicate records - Don’t need deep learning—often decision trees or even rule-based systems work better
Medium complexity (pattern recognition, but not too deep): - Requires learning combinations of features but patterns are relatively straightforward - Examples: Credit scoring, customer churn prediction with engineered features - Gradient boosted trees, random forests, or shallow neural nets work well
High complexity (hierarchical patterns, rich structure): - Requires learning features from raw data, or very complex feature interactions - Examples: Image classification, natural language understanding, speech recognition - Deep learning shines here
Very high complexity (reasoning, planning, long-term dependencies): - Requires sequential decision-making or complex reasoning chains - Examples: Playing Go, theorem proving, long-form text generation - Might need specialized architectures, RL, or hybrid approaches
Going back to definitions means honestly assessing which category your problem falls into. Don’t use a cannon to kill a fly. Don’t bring a knife to a gunfight.
Revisit Your Assumptions
Every ML approach makes assumptions. Going back to definitions means making these assumptions explicit and checking if they hold.
You’re using linear regression. Implicit assumptions: - The relationship is linear - Errors are normally distributed - Features are independent (or you’ve handled multicollinearity) - No significant outliers
Do these hold for your data? If not, you’re building on a shaky foundation.
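Checking the first two assumptions takes only a few lines. A sketch with synthetic data standing in for yours; swap in your own X and y:

# SKETCH: quick checks on linear-regression assumptions (synthetic data as a stand-in)
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)   # roughly linear, Gaussian noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

print("Residual skew:", stats.skew(residuals))   # far from 0 suggests non-normal errors
_, p_value = stats.shapiro(residuals)            # normality test (fine for n up to ~5000)
print("Shapiro-Wilk p-value:", p_value)

corr = np.corrcoef(X, rowvar=False)              # crude multicollinearity check
print("Max off-diagonal feature correlation:", np.max(np.abs(corr - np.eye(X.shape[1]))))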
You’re using collaborative filtering for recommendations. Implicit assumption: - Users who agreed in the past will agree in the future - The rating matrix is static - You have enough historical data
What if user preferences drift over time? What about cold-start users?
Going back to definitions means questioning: “What am I taking for granted?”
Let me share a real example. A team was building a fraud detection model using historical transaction data. They were getting good cross-validation scores but terrible production performance. When they went back to first principles, they realized their implicit assumption: “fraudulent patterns don’t change over time.”
But of course they do! Fraudsters adapt. The patterns in last year’s data weren’t predictive of this year’s fraud. They needed to completely rethink their approach—shorter training windows, online learning, anomaly detection instead of supervised classification.
The problem wasn’t their modeling skill. It was a violation of a fundamental assumption they hadn’t made explicit.
Here’s a practice I recommend: When you’re stuck, have a “first principles” meeting.
Gather your team (or just sit down with yourself and a whiteboard). Ask:
- What are we really trying to do? (In plain language)
- What does success look like? (Concretely, with numbers)
- What does our model need to know? (What patterns, what relationships)
- What assumptions are we making? (List them explicitly)
- What are the simplest approaches? (Before we got clever, what would work?)
- What’s the irreducible difficulty? (What makes this problem hard, fundamentally)
You’ll be surprised how often this exercise reveals that you’ve been solving the wrong problem, or using the wrong metric, or making unjustified assumptions.
Going back to definitions isn’t admitting defeat. It’s resetting your understanding so you can move forward more effectively.
Working Backwards: Starting From The End
Let me paint you a scenario. You’re in a meeting with the product team. They’re excited: “We need a real-time recommendation system for the homepage. It needs to serve personalized recommendations to 10,000 requests per second, with 99th percentile latency under 50 milliseconds. Oh, and it needs to be explainable because we might need to show users why we recommended something.”
You nod, taking notes. Then you go back to your desk and think: “Okay, let’s start exploring architectures. Maybe a deep neural network with user and item embeddings, then a—”
Stop.
That’s forward thinking. You’re starting from what you have (data, techniques you know) and trying to reach what you need (the production system). Sometimes that works. But often, there’s a better way: working backwards.
Working backwards means starting from the goal—that production system serving 10K QPS at 50ms latency with explainability—and reasoning backward to determine what you need at each step.
Let me show you how this plays out:
The Backwards Planning Process
Step 1: Start with the final constraint
“I need 50ms p99 latency at 10K QPS.”
Step 2: Ask “What does this imply?”
Well, if I have 50ms total budget for latency, and I need to do feature lookup, model inference, and post-processing…
That means model inference can take at most… let’s say 20ms (leaving room for everything else).
Step 3: Ask again “What does this imply?”
20ms for inference. That rules out: - Large transformer models (would take 100+ms) - Ensemble of multiple heavy models (too slow) - Complex feature engineering at inference time (no time)
So I need a lightweight model architecture.
Step 4: Keep going backwards
If I need a lightweight model but still need good performance, what does that imply?
I need to do the heavy lifting offline: - Pre-compute embeddings - Use approximate nearest neighbor search for candidate generation - Only use the lightweight model for final re-ranking
Step 5: One more step back
If I’m doing candidate generation with ANN search, what does that imply?
I need: - A good embedding space (so similar items are close together) - Efficient indexing (FAISS or similar) - The embedding model can be heavy (it runs offline)
Step 6: And finally…
If I need a good embedding space, what does that imply for training?
I should use: - Contrastive learning or metric learning objectives - Large batch sizes (to get good negatives) - Training data focused on user-item interactions
Now I can work forward with this plan: 1. Train an embedding model (can be complex, runs offline) 2. Use embeddings to build ANN index 3. Train lightweight ranking model (must be fast) 4. Deploy as two-stage: ANN candidate generation → neural re-ranking
This entire plan emerged from working backwards from the constraint.
Let me show you the code structure for this thinking:
# PSEUDOCODE: Working Backwards Planning

def plan_backwards(production_requirements):
    """
    Start from the goal and work backwards to determine what you need
    """
    # The goal
    print("Goal: Real-time recommendations")
    print("Constraints:", production_requirements)
    # {"latency_p99": "50ms", "qps": 10000, "explainability": "required"}

    current_constraints = production_requirements
    architecture_requirements = []

    # Work backwards through the implications
    while True:
        print(f"\nCurrent constraints: {current_constraints}")
        print("What does this imply?")

        # Latency constraint
        if "latency" in current_constraints:
            print("→ Model must be lightweight")
            print("→ Heavy computation must be offline")
            print("→ Need caching strategy")
            architecture_requirements.extend([
                "lightweight_model (< 5M parameters)",
                "offline_embedding_computation",
                "redis_cache for frequent users"
            ])

            print("\nWhat does lightweight model imply?")
            print("→ Can't directly use BERT/large transformers")
            print("→ Need efficient architecture (distilled model, or shallow neural net)")
            architecture_requirements.append(
                "two_stage: fast_retrieval + small_ranker"
            )

        # Explainability constraint
        if "explainability" in current_constraints:
            print("→ Need attention weights or feature importance")
            print("→ Rules out pure black-box models")
            architecture_requirements.extend([
                "attention_mechanism (can show which items influenced rec)",
                "feature_attribution (SHAP values for scoring model)"
            ])

        # QPS constraint
        if "qps" in current_constraints:
            print("→ Need to serve from cache for hot items")
            print("→ Need batch inference")
            print("→ Need load balancing")
            architecture_requirements.extend([
                "batch_predictor (combine requests)",
                "multi_instance_deployment",
                "cache_popular_user_recommendations"
            ])

        # Now work backwards from architecture requirements
        print("\n\nArchitecture requirements:", architecture_requirements)

        # What does two-stage retrieval+ranking imply?
        if "two_stage" in architecture_requirements:
            print("\nTwo-stage approach requires:")
            print("1. Fast retrieval:")
            print("   → Need embeddings")
            print("   → Need ANN index (FAISS)")
            print("2. Lightweight ranker:")
            print("   → Simple features only")
            print("   → Small neural net or GBDT")
            data_requirements = [
                "user_item_interaction_data (for embeddings)",
                "positive_negative_pairs (for contrastive learning)",
                "features_computable_in_realtime (for ranker)"
            ]

        # What does embedding training imply?
        print("\n\nEmbedding training requires:")
        print("→ Metric learning objective (contrastive/triplet loss)")
        print("→ Large batch sizes (for hard negative mining)")
        print("→ GPU training (can take hours, runs offline)")
        training_requirements = [
            "contrastive_learning_framework",
            "large_batch_training (batch_size > 1024)",
            "negative_sampling_strategy"
        ]

        break  # Simplified for example

    # Now we have a complete plan!
    return {
        'architecture': architecture_requirements,
        'data_needs': data_requirements,
        'training': training_requirements,
        'deployment': ["two_tier_serving", "caching_layer", "load_balancer"]
    }
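To make the “fast retrieval” half of that plan less abstract, here is roughly what the offline index build and the online candidate lookup might look like with FAISS. The embedding dimension, the random vectors, and the choice of an exact index are all placeholders.

# SKETCH: ANN candidate generation with FAISS (random embeddings as placeholders)
import numpy as np
import faiss

dim = 64
item_embeddings = np.random.rand(100_000, dim).astype("float32")
faiss.normalize_L2(item_embeddings)          # normalize so inner product behaves like cosine similarity

index = faiss.IndexFlatIP(dim)               # exact search; swap for an IVF/HNSW index at larger scale
index.add(item_embeddings)                   # built offline, then loaded by the serving layer

user_embedding = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(user_embedding)
scores, candidate_ids = index.search(user_embedding, 200)   # ~200 candidates for the light re-ranker
print(candidate_ids[0][:10])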
Real Example: AlphaGo’s Backwards Plan
Let me share how this played out in one of the most famous ML systems: AlphaGo.
Goal: Beat the world champion at Go.
Work backwards:
“To beat the world champion, I need superhuman move selection.”
What does that imply? → “I need to evaluate positions better than humans can.”
What does that imply? → “I need both: (1) policy to suggest good moves, and (2) value function to evaluate positions.”
What does that imply? → “I need to learn from both expert games AND self-play.”
Why self-play? → “Because human games only show me human-level play. To exceed that, I need to play against myself and discover superhuman strategies.”
What does self-play imply? → “I need Monte Carlo Tree Search (MCTS) to generate high-quality training data.”
What does MCTS imply? → “I need a way to simulate games quickly to explore the tree.”
What does learning from expert games imply? → “I need a large database of professional games.”
Now work forward with this plan: 1. Start: Collect database of expert games 2. Train initial policy network by supervised learning on expert moves 3. Use that policy in MCTS to generate self-play games 4. Train value network on self-play outcomes 5. Improve policy through reinforcement learning on self-play 6. Iterate: better policy → better self-play → better training → better policy
This plan emerged from working backwards from “beat the world champion” through all the implications.
When Working Backwards Really Shines
Working backwards is especially powerful when:
1. You have hard constraints
Production requirements (latency, throughput, memory), regulatory requirements (explainability, fairness), business requirements (must work on mobile devices). Start from these constraints and work backwards to determine what’s feasible.
2. The goal is clear but the path isn’t
You know exactly what you need to achieve but don’t know how to get there. Working backwards helps you decompose the problem.
3. You’re building a complete system
Not just a model, but a production ML system with data pipelines, training infrastructure, serving, monitoring. Working backwards helps you see all the pieces you’ll need.
4. Resources are limited
You have 3 months and 2 engineers. Working backwards from the deadline helps you scope appropriately: “If we only have 3 months, we can’t build a custom training infrastructure, so we need to use existing tools, which means…”
Practical Exercise
Take a project you’re working on. Write down the final state you need to achieve. Be specific:
- Performance metrics
- Latency requirements
- Scale (QPS, data volume)
- Other constraints (interpretability, fairness, etc.)
Now work backwards. For each requirement, ask “What does this imply?” Keep a chain of reasoning. You might discover: - Assumptions you’re making that might not hold - Components you’ll need that you hadn’t thought about - Architectural choices that are forced by your constraints - Things that seemed necessary but actually aren’t
This backwards planning often reveals a simpler, more direct path than forward planning would have.
Getting Stuck: The Fertile Ground of Frustration
Let me be completely honest with you: You’re going to get stuck. Not occasionally. Regularly. On every challenging project.
You’ll hit moments where: - Every architecture you try overfits terribly - Your model learns spurious correlations no matter what you do - Performance plateaus far below where it needs to be - The production constraints seem physically impossible to meet - You’ve tried everything you can think of and nothing works
I want to tell you something important: These moments aren’t failures. They’re signals.
They’re your problem telling you: “The way you’re thinking about this isn’t quite right. You need a different lens, a different angle, a new perspective.”
Some of my best ML solutions came after weeks of being stuck. The breakthrough came not from trying harder with the same approach, but from changing how I thought about the problem.
Let me walk you through strategies for getting unstuck, organized by what kind of stuck you are.
Type 1: “My Model Won’t Learn At All”
You’ve set up your training pipeline. You hit run. The loss stays flat or barely moves. Your validation accuracy is no better than random guessing.
Your first instinct: Must be a bug! Check the code!
And you’re probably right: This usually IS a bug. But let me give you a systematic debugging approach:
# DEBUGGING CHECKLIST: Model Won't Learn

# Step 1: Can your model overfit a single batch?
def test_overfitting_capacity():
    """
    Take 10 examples. Train until loss is near zero.
    If this fails, you have a fundamental problem.
    """
    tiny_batch = dataset[:10]
    model = YourModel()

    for epoch in range(1000):
        loss = train_step(model, tiny_batch)
        print(f"Epoch {epoch}: Loss {loss}")

        if loss < 0.01:
            print("✓ Model CAN learn (has sufficient capacity)")
            return True

    print("✗ Model CANNOT learn even tiny batch")
    print("Possible issues:")
    print("- Wrong loss function for task")
    print("- Architecture bugs (dead neurons, dimension mismatches)")
    print("- Learning rate too low")
    print("- Gradient flow problems")
    return False
If your model can’t even overfit 10 examples, you have a bug or fundamental architecture problem. Fix that before anything else.
If it CAN overfit a tiny batch but won’t train on full data:
Check your data: - Are labels correct? (Print some examples manually) - Is there a label-feature mismatch? (e.g., predicting tomorrow’s price but features include tomorrow’s price) - Are you preprocessing wrong? (e.g., normalizing test data with train stats)
Check your architecture: - Are gradients flowing? (Add gradient logging; see the sketch below) - Is anything saturating? (sigmoid/tanh outputs at extremes?) - Are skip connections working if you have them?
Check your training setup: - Learning rate too high (causing divergence) or too low (no learning)? - Batch size inappropriate for problem? - Are you training the right parameters? (Check param.requires_grad)
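For the gradient-flow check mentioned above, here is a minimal PyTorch sketch you can call right after loss.backward(); it only assumes you already have a model with named parameters.

# SKETCH: log gradient norms per layer after loss.backward() to spot dead or exploding paths
def log_gradient_flow(model):
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.grad is None:
            print(f"{name:60s}  NO GRADIENT (detached or unused?)")
        else:
            norm = param.grad.norm().item()
            flag = "  <-- suspicious" if norm < 1e-7 or norm > 1e3 else ""
            print(f"{name:60s}  grad norm = {norm:.2e}{flag}")

# usage, inside your training loop:
#   loss.backward()
#   log_gradient_flow(model)
#   optimizer.step()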
Type 2: “My Model Overfits Immediately”
Your training loss goes down nicely. Your validation loss goes down for one epoch, then shoots up. Classic overfitting, but severe.
First, establish baselines. Let me walk you through this systematically:
def establish_baselines():
    """
    Before fighting overfitting, understand what's reasonable
    """
    # Baseline 1: Random predictor
    random_performance = evaluate_random_predictions()
    print(f"Random: {random_performance}")

    # Baseline 2: Most common class (for classification)
    majority_performance = predict_majority_class()
    print(f"Majority class: {majority_performance}")

    # Baseline 3: Simple model (logistic regression / random forest)
    simple_model_performance = train_simple_model()
    print(f"Simple model: {simple_model_performance}")

    print("\nYour complex model must beat these!")
If your complex model doesn’t beat a simple logistic regression, you’re probably overfitting because you don’t have enough data for model complexity.
Strategies when stuck on overfitting:
- Reduce model complexity first:
- Fewer layers, fewer parameters
- You might have a 50-layer network when you need 5 layers
- Get more data (if possible):
- Data augmentation
- Synthetic data generation
- Collect more real data
- Add regularization (but do it systematically; a sketch follows this list):
- Start with dropout (0.5 is often reasonable)
- Try weight decay
- Try early stopping
- Try batch normalization
- Check for data leakage:
- This is often the culprit for mysterious overfitting
- Is future information leaking into your features?
- Are train and test split properly?
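Here is the regularization sketch promised above: the three cheapest interventions in PyTorch, namely dropout in the model, weight decay in the optimizer, and a simple early-stopping counter. The layer sizes and hyperparameter values are placeholders, and train_one_epoch / evaluate stand in for whatever training and validation steps you already have.

# SKETCH: dropout + weight decay + early stopping (hyperparameter values are placeholders)
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Dropout(p=0.5),                 # dropout between layers
    nn.Linear(64, 1),
)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)   # weight decay as regularizer

best_val_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    train_one_epoch(model, optimizer)          # your existing training step
    val_loss = evaluate(model)                 # your existing validation step
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:             # early stopping
            print(f"Stopping at epoch {epoch}: no improvement for {patience} epochs")
            break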
Type 3: “Performance Has Plateaued”
You’ve gotten to 75% accuracy. You need 85%. You’ve tried deeper networks, more training, different optimizers. Nothing budges it past 76%.
This is where you need to change your approach fundamentally.
First, do error analysis. This systematic approach is crucial, so let me show you how to implement it:
import random

def deep_error_analysis(model, dataset):
    """
    Understand WHERE and WHY your model fails
    """
    errors = []
    for example in dataset:
        prediction = model(example)
        if prediction != example.label:
            errors.append({
                'example': example,
                'prediction': prediction,
                'true_label': example.label,
                'confidence': prediction.confidence
            })

    # Analyze error patterns
    print("Error analysis:")

    # 1. Which classes are confused most?
    print_confusion_patterns(errors)

    # 2. Are errors high-confidence (wrong but sure) or low-confidence?
    print_confidence_distribution(errors)

    # 3. Do errors share characteristics?
    print_error_feature_analysis(errors)

    # 4. Manually look at random errors
    print("\n=== Random Error Examples ===")
    for error in random.sample(errors, 20):
        print(f"\nTrue: {error['true_label']}")
        print(f"Predicted: {error['prediction']}")
        print(f"Example: {error['example']}")
        print("Why did model fail here?")
        input("Press enter for next...")
This error analysis often reveals the issue: - “Oh, the model confuses class A and B because we don’t have a feature that distinguishes them” - “The model fails on rare edge cases—we need more data for those” - “The model is learning a spurious pattern in the background, not the actual object”
Based on error analysis, you might:
- Add features: The model needs information it doesn’t have
- Get more diverse data: Errors cluster in underrepresented regions
- Change the architecture: The model can’t represent the patterns you need
- Reframe the problem: Maybe this shouldn’t be classification at all
- Accept the limitation: Maybe 76% is the best possible given your data
Type 4: “My Architecture/Approach Fundamentally Can’t Work”
This is the hardest kind of stuck, because the problem isn’t in the details—it’s in the approach.
Signs you’re in this situation: - Simple baselines work better than your complex model - The approach works in papers but not on your data - Every variation you try fails in the same way - Domain experts say “that shouldn’t work for this problem”
When this happens, you need to:
- Go back to first principles (we covered this)
- What are you really trying to do?
- What does the model fundamentally need to know?
- Are you giving it that information?
- Look at related problems (we covered this too)
- How do people solve similar problems?
- Is there a different framing that works better?
- Simplify drastically:
- Solve a much easier version first
- Remove 90% of the complexity
- Get SOMETHING working, even if limited
Let me give you an example. A team was trying to predict equipment failures in a factory. They framed it as time-series forecasting: predict the exact time until failure. Stuck for months—predictions were no better than random.
Then they simplified: “Forget predicting time. Can we just detect WHEN the equipment starts behaving abnormally?”
Reframed as anomaly detection, they had a working system in two weeks. It wasn’t what they originally planned, but it solved the actual business problem.
The Practice of Getting Unstuck
Here’s what I do when I’m stuck:
1. Take a walk: Seriously. Away from the computer. Your brain needs to switch from focused mode to diffuse mode. Solutions often come when you’re not actively trying.
2. Explain the problem to someone else: Even if they don’t understand ML. The act of explaining often reveals the issue. (Rubber duck debugging!)
3. Sleep on it: The overnight break gives your brain time to process. I’ve solved more problems in morning showers than in late-night coding sessions.
4. Pair program: Get a colleague to look at your problem fresh. They’ll ask questions you haven’t thought of.
5. Document what you’ve tried: Write down every approach, why it failed. This prevents repeating failed attempts and might reveal patterns.
6. Set it aside and work on something else: Sometimes you’re too close to the problem. Working on other things can bring new perspectives.
7. Read widely: Papers, blog posts, tweets. Sometimes the solution is in an adjacent domain you haven’t considered.
The key insight: Being stuck is temporary. How you respond to being stuck determines whether you break through or stay stuck.
The Recursive Nature of ML Planning: Plans Within Plans
Let me share something that might feel overwhelming at first but is actually liberating once you understand it: Every ML plan contains smaller planning problems.
When you devise a plan like “Build a production recommendation system,” you’re not creating one plan—you’re creating a hierarchy of plans, each requiring its own problem-solving process.
It’s like planning a road trip. Your high-level plan is “Drive from San Francisco to New York.” But that breaks into smaller plans: “Drive from San Francisco to Reno” (first day), which breaks into even smaller plans: “Stop for gas in Sacramento,” which breaks into micro-plans: “Which gas station? Which route?” Each level requires its own decisions.
ML planning works the same way, but we need to be more systematic about it.
The Hierarchy of Planning
Let me walk you through a real example. You’re tasked with building a content moderation system for a social media platform. Your high-level plan might be:
Level 0 (Top Level): Production Content Moderation System - Detect harmful content in user posts - Achieve 95% precision (very few false positives) - Achieve 90% recall (catch most harmful content) - Latency < 100ms - Handle 50K posts per second
This is your top-level goal. Now you need to devise a plan. Let’s say you decide:
Level 1 (System Architecture Plan): 1. Multi-task transformer model for text 2. CNN for image/video moderation 3. Ensemble combining both for multi-modal posts 4. Human-in-the-loop for borderline cases 5. Active learning to improve over time 6. A/B testing infrastructure
Good! But now each of these components is itself a problem requiring a plan. Let’s drill into just one: “Multi-task transformer model for text.”
Level 2 (Model Architecture Plan): - Base: Pre-trained BERT (or RoBERTa or similar) - Task-specific heads for each policy violation type: - Hate speech detection - Misinformation detection - Spam detection - Harassment detection - Shared encoder for efficiency - Separate calibration for each task
Okay, but how do you train this? That’s another planning problem:
Level 3 (Training Strategy Plan): - Start with frozen BERT, train only classification heads (1 epoch) - Unfreeze and fine-tune full model (5 epochs) - Use weighted loss to handle label imbalance - Augment training data with back-translation - Implement curriculum learning (start with clear examples, gradually add hard ones)
And how do you handle the label imbalance? Another plan:
Level 4 (Class Imbalance Plan): - Oversample minority classes - Use focal loss to focus on hard examples - Class weights in loss function - Synthetic example generation using GPT-2 - Hard negative mining (examples model gets wrong)
You see how this works? Each level of planning creates sub-problems that need their own plans. And this isn’t a sign of poor planning—this is necessary decomposition.
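To show how concrete the leaves of this hierarchy get, here is a sketch of a single item from the Level 4 plan, a binary focal loss. The alpha and gamma values are the commonly cited defaults from the original focal loss paper, not a recommendation for this problem.

# SKETCH: binary focal loss for the class-imbalance leaf of the plan (default alpha/gamma)
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # up-weight the rare positive class
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()       # (1 - p_t)^gamma focuses on hard examples

# usage with dummy tensors:
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
print(binary_focal_loss(logits, targets))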
Managing the Recursion: Practical Strategies
This recursion can feel overwhelming. “How deep do I need to plan? When do I stop?” Here are strategies I use:
1. Plan depth matches uncertainty
For components you understand well, plan shallowly: - “Use standard data augmentation” (you know how to do this, don’t need detailed planning)
For components with uncertainty, plan deeply: - “Handle multi-lingual content” (you haven’t done this before, need detailed plan)
2. Plan until you hit actionable tasks
Keep recursing until you reach tasks where you know the next concrete action: - ❌ “Improve model performance” (not actionable—how?) - ✓ “Run hyperparameter sweep over learning rates [1e-5, 1e-4, 1e-3]” (actionable!)
3. Plan top-down, execute bottom-up
Make your high-level plan first. Then recursively expand the uncertain parts. But when implementing, start from the bottom: - First: Get data loading working - Then: Get baseline model training - Then: Add complexity - Finally: Deploy complete system
This way you always have something working at each stage.
4. Document the hierarchy
I literally draw this out as a tree: the production goal at the root, the major components one level down, and their sub-plans branching below that.
This visual hierarchy helps me: - See the full scope - Identify what I’ve planned and what’s still vague - Communicate with team members - Track progress
The Wisdom of Deferred Planning
Here’s a counterintuitive insight: You don’t need to plan everything upfront.
In fact, planning too deeply too early is often wasteful. Why? Because your early experiments will teach you things that change your plan.
The pattern I follow: 1. Plan the first phase in detail (enough to start) 2. Plan subsequent phases at high level only (rough sketch) 3. Refine the plan as you learn (based on what works)
Example: For that content moderation system, I might plan in detail: - Data pipeline setup - Baseline model (simple BERT fine-tuning) - Evaluation framework
But I’d only sketch: - Multi-task learning (might not be needed if single-task works) - Ensemble strategy (depends on baseline results) - Active learning (Phase 2, depends on Phase 1 results)
Why? Because maybe my baseline achieves 93% and I realize I don’t need the complexity of multi-task learning. Or maybe I discover that hate speech and misinformation share a lot of features, suggesting multi-task learning is crucial. I don’t know until I try the baseline.
This is “agile planning” for ML. Plan enough to make progress, but stay flexible to adapt as you learn.
Knowing When to Replan
Plans need to change. How do you know when?
Strong signals to replan:
- Your assumptions were wrong (data is different than expected)
- Constraints changed (now you need 10ms latency, not 100ms)
- Baseline experiments show completely different results than expected
- New technology/paper makes your approach obsolete
- Team/timeline/resources changed significantly

Weak signals (don’t replan yet):
- One experiment didn’t work (try a few variations first)
- Someone suggests a different approach (evaluate if it’s actually better)
- A paper claims better results (might not transfer to your setting)
The art is knowing when to persist with your plan (not every failure means the plan is wrong) vs. when to pivot (when evidence clearly shows a different path is better).
Practice: Draw Your Planning Hierarchy
Take a project you’re working on. Draw it as a hierarchy:
- Top level: The production goal
- Second level: Major components
- Third level: Sub-components of each
- Keep going until you hit “I know how to implement this”

For each node, note:
- Is this planned in detail or just sketched?
- What’s the main uncertainty here?
- What’s the risk if this part fails?
- What’s the next action to make progress?

This exercise often reveals:
- Parts you thought were planned but are actually vague
- Unnecessary complexity you can cut
- Critical paths where you need more planning
- Opportunities for parallelization
The hierarchy is your map. Use it to navigate your project.
The Aesthetics of ML Solutions: Recognizing Beauty in Plans
Let me tell you about two moments from my career. Both were successful projects—they achieved their metrics, they deployed to production, they delivered business value. But they felt completely different.
Project A: The model was an ensemble of seven different architectures. Training required carefully orchestrated steps in a specific order. The feature pipeline had 47 different transformations. In production, we needed three different models for different latency tiers. The documentation was 50 pages. When something broke, debugging took hours because the system was so complex.
Project B: The model was a single neural network with skip connections. Training was straightforward—one script, one config file. Features were carefully chosen but minimal—just 20 of them. Production deployment was simple. The core logic fit on one page. When issues arose, we could diagnose them quickly because the system was transparent.
Both worked. But only Project B felt beautiful.
This is what Pólya means when he talks about the aesthetics of solutions. Not all plans are created equal. Beyond mere correctness (does it meet the metrics?), plans possess qualities that experienced practitioners learn to recognize and value.
Let me help you develop this sense of aesthetic judgment.
Elegance: Achieving More With Less
Elegance isn’t about simplicity for its own sake. It’s about appropriate complexity—using exactly what’s needed and nothing more.
Inelegant solution:
- Five-model ensemble
- 200 engineered features (most weakly predictive)
- Complex stacking with meta-learning
- Different approaches for different edge cases
- Works, but barely, and fragile

Elegant solution:
- Single well-designed model
- 20 carefully chosen features (each strongly predictive)
- Clean architecture that handles edge cases naturally
- Works robustly across scenarios
Let me give you a real example: ResNet vs. earlier deep network approaches.
Before ResNet, people struggled to train very deep networks. The approaches were inelegant:
- Try very careful initialization schemes
- Add batch normalization everywhere
- Try different activation functions
- Carefully tune learning rates for each layer
- Still couldn’t reliably train 50+ layer networks

ResNet came along with one elegant idea: skip connections. Let the network learn residuals (differences) instead of full mappings. Suddenly:
- 152-layer networks train easily
- Performance scales with depth
- Works across tasks without special tuning
- The insight is simple and beautiful
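That single idea fits in a few lines. Here is a minimal residual block in PyTorch, simplified from the original paper (fixed channel count, no downsampling path), just to show the shape of the trick:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: the layers learn F(x), the output is F(x) + x."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        # Skip connection: if the best mapping is close to identity,
        # the block only has to learn a small correction.
        return self.relu(residual + x)
```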
The elegant solution isn’t just more effective—it reveals something fundamental about the problem. It makes you think: “Of course! That’s how it should be done.”
How to recognize elegance in your own work:
Ask yourself:
- Could I explain this approach to someone in five minutes?
- If I removed any component, would it break?
- Does each part have a clear purpose?
- Would I be proud to show this in a paper?
If you’re saying “well, we tried X but it didn’t quite work, so we added Y as a workaround, and then we needed Z to handle edge cases…”—that’s a sign of inelegance. Elegant solutions feel inevitable, not cobbled together.
Robustness: Continuing to Work When Things Change
Here’s a test: How many things need to go exactly right for your solution to work?
Fragile solution:
- Only works if data is perfectly clean
- Fails if hyperparameters aren’t exactly tuned
- Breaks if data distribution shifts slightly
- Requires manual intervention when edge cases appear

Robust solution:
- Handles noisy data gracefully
- Performance degrades smoothly with hyperparameter changes
- Adapts to distribution shift (or degrades predictably)
- Deals with edge cases automatically
Let me show you what this means practically:
Example: Fraud Detection Model
Fragile approach: Train on last year’s fraud patterns. High accuracy on test set. Deploy to production. Performance drops by 20% within two months because fraudsters adapted. Model needs complete retraining.
Robust approach:
- Use an anomaly detection component (catches novel fraud patterns)
- Online learning updates (adapts to drift)
- Ensemble with a rule-based component (stable baseline)
- Active learning flags uncertain cases for human review
- Model degrades gracefully rather than catastrophically
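To make the “safety net plus human review” idea concrete, here is an illustrative decision function. The model interface, rule engine, and thresholds are hypothetical stand-ins to be replaced and tuned for your own system, not a real API:

```python
def decide(transaction, model, rules, low=0.3, high=0.8):
    """Combine a learned model with a rule-based safety net and a human-review
    band for uncertain cases. `model`, `rules`, and both thresholds are
    placeholder assumptions, tuned on validation data in practice."""
    if rules.is_known_fraud_pattern(transaction):          # stable rule-based baseline
        return "block"

    score = model.predict_fraud_probability(transaction)   # learned fraud score in [0, 1]
    if score >= high:
        return "block"
    if score <= low:
        return "approve"
    return "human_review"   # the uncertain band feeds the active-learning queue
```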
The robust solution isn’t just better—it requires less maintenance, fails more predictably, and scales better over time.
How to build robustness into your plans:
- Plan for data drift: Your training data is a snapshot. Production data will differ.
  - Use techniques that generalize (regularization, augmentation)
  - Build in monitoring for drift detection
  - Design for adaptation (online learning, periodic retraining)
- Avoid brittle dependencies:
  - If your model REQUIRES feature X to be perfectly accurate, you’re brittle
  - Better: the model gracefully handles missing or noisy features
- Have fallback strategies:
  - Ensemble with a simple baseline
  - Rule-based system as a safety net
  - Human-in-the-loop for uncertain cases
- Test in adversarial conditions (see the sketch after this list):
  - What if 20% of features are missing?
  - What if the class distribution shifts?
  - What if input data is deliberately adversarial?
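A minimal sketch of the missing-features stress test, assuming a fitted model with a scikit-learn-style `predict`, a NumPy float feature matrix `X_val`, labels `y_val`, and a metric function of your choosing; the 20% drop rate matches the question above:

```python
import numpy as np

def stress_test_missing_features(model, X_val, y_val, metric, drop_frac=0.2, seed=0):
    """Blank out a random fraction of feature values and measure how much the
    validation metric degrades. Assumes the pipeline can tolerate NaNs
    (for example via imputation)."""
    rng = np.random.default_rng(seed)
    X_corrupted = X_val.copy()
    mask = rng.random(X_corrupted.shape) < drop_frac
    X_corrupted[mask] = np.nan

    clean_score = metric(y_val, model.predict(X_val))
    degraded_score = metric(y_val, model.predict(X_corrupted))
    print(f"Metric on clean features: {clean_score:.3f}")
    print(f"Metric with {drop_frac:.0%} features missing: {degraded_score:.3f}")
```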
Clarity: Understanding What’s Happening
Can you explain why your model makes each prediction? Can you debug when it fails? Can your teammates understand and maintain it?
Opaque solution:
- “We trained this ensemble and it works, but we don’t really know why”
- Black box that’s impossible to debug
- Only one person understands the codebase
- When it fails, you’re guessing at fixes

Clear solution:
- Each component has a clear purpose
- You can trace prediction logic
- Failure modes are understandable
- The codebase is documented and modular
Example from my experience:
I once inherited a text classification model. The code was 3,000 lines in one file. The feature engineering involved 50+ transformations, some in pandas, some in SQL, some in custom Python. Training required running five separate scripts in sequence. Nobody could explain what half the features did.
When we needed to fix a bug, it took a week just to understand the system.
I refactored it:
- Clear pipeline: data → preprocessing → feature_engineering → model → postprocessing
- Each step in its own module with docstrings
- Configuration file for all hyperparameters
- Comprehensive tests
- Documentation explaining each feature and why it helps
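As a rough illustration of that structure, a top-level entry point might look like the sketch below. The module and function names are hypothetical; the point is that each stage is its own module and every hyperparameter lives in one config file:

```python
import yaml  # PyYAML; any config format would do

# Hypothetical stage modules: data, preprocessing, feature_engineering, model, postprocessing
from pipeline import data, preprocessing, feature_engineering, model, postprocessing

def run(config_path="config.yaml"):
    with open(config_path) as f:
        config = yaml.safe_load(f)          # every hyperparameter lives here

    raw = data.load(config["data"])
    clean = preprocessing.run(raw, config["preprocessing"])
    features = feature_engineering.build(clean, config["features"])
    trained = model.train(features, config["model"])
    return postprocessing.run(trained, config["postprocessing"])
```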
The model’s performance was identical. But now any team member could work on it. Bugs took hours to fix, not weeks.
How to achieve clarity:
Write code like you’ll forget it: Because you will. In six months, you’ll thank yourself for clear variable names and comments.
Modularize: Separate data loading, preprocessing, model architecture, training, evaluation. Each should be independently understandable.
Document decisions: Not just what you did, but why. “We use BERT because…” “We chose these features because…”
Make effects visible: Log intermediate outputs. Visualize learned representations. Show attention weights. Make the model’s reasoning transparent.
Inevitability: The Feeling of “Of Course”
The best solutions have a quality that’s hard to describe but you know it when you see it: inevitability. When you understand the solution, you think “Of course! How else would you do it?”
This doesn’t mean the solution was easy to find. The most inevitable-seeming solutions often required great creative leaps. But once you see them, they feel natural.
Examples of inevitable solutions:
Attention mechanisms for sequence-to-sequence: Of course! The model should be able to look at different parts of the input for each output. How else would you handle variable-length alignment?
Batch normalization: Of course! Normalize activations at each layer to keep gradients stable. So obvious in hindsight.
Skip connections in ResNet: Of course! Let the model learn the difference rather than the full transformation. That’s naturally easier.
Self-supervised learning: Of course! The data contains its own supervision signals. Predict masked words, predict rotation, predict next frames—the labels are free.
These solutions feel inevitable because they align with the fundamental structure of the problem. They’re not fighting against the problem—they’re working with it.
How do you know if your solution has this quality?
- When you explain it to colleagues, do they nod and say “that makes sense” or do they look confused?
- Does the solution generalize naturally to related problems?
- Could you have predicted this solution would work before implementing it?
- Does the solution reveal something about the problem itself?
Developing Your Aesthetic Sense
This aesthetic judgment—recognizing elegance, robustness, clarity, inevitability—develops over time. Here’s how to cultivate it:
1. Study great work: Read papers for elegant solutions. How did they think about the problem? What makes their approach beautiful?
2. Refactor your own work: After you get something working, ask: “Could this be simpler? Clearer? More robust?”
3. Compare approaches: When multiple solutions work, analyze why. What are the qualitative differences? Which one feels better?
4. Seek feedback: Show your work to experienced practitioners. Ask not just “does it work?” but “is this a good solution?”
5. Reflect on failures: Often, failed approaches were inelegant. What was wrong with them? What does that teach you?
Over time, you’ll develop an intuition. When you’re planning, you’ll start to feel whether your approach is elegant or cobbled together. That feeling—that aesthetic sense—is valuable. Trust it. If your plan feels messy and fragile, it probably is. Keep searching for the elegant approach.
Conclusion: The Living Plan and Your Journey Forward
We’ve walked a long way together. From standing at the edge of that canyon between your current state and your goal, through pattern recognition and related problems, through creative auxiliary elements and problem transformations, through working backwards and getting unstuck, through recursive planning and aesthetic judgment.
But here’s what I need you to understand as we close: A plan is not a rigid prescription. It’s a living strategy.
The plan you devise today will change tomorrow. You’ll run your first experiments and learn something that shifts your approach. You’ll hit a wall and discover a better path around it. You’ll see a new paper that opens possibilities you hadn’t considered. Your stakeholders will change requirements. Your data will reveal patterns you didn’t expect.
This is normal. This is good. The best ML practitioners hold their plans lightly.
They’re confident enough in their strategic thinking to commit to a direction, but humble enough to adapt when reality reveals complexities not anticipated during planning. They know the difference between “this approach needs more iteration” and “this approach is fundamentally wrong, time to pivot.”
Let me leave you with a mental model that’s served me well:
Think of planning like navigation.
You’re sailing from San Francisco to Hawaii. You plot a course—that’s your plan. But you don’t just set your heading and ignore everything else for the next week. You constantly monitor:
- Are we on course? (Metrics tracking)
- Is the wind changing? (Data distribution shift)
- Are there storms ahead? (Anticipated challenges)
- Did we discover a better route? (New insights)
You make constant small adjustments to your course. That’s execution with adaptation. But you don’t change your destination—Hawaii is still the goal. Unless you discover compelling reasons to change the goal itself (stakeholder requirements change).
Your ML plan is the same: Strategic direction that guides you, but adapts based on what you learn.
Your Planning Toolkit: A Summary
Let me give you a checklist to return to when you’re devising plans:
Pattern Recognition:
- Have I seen this problem before?
- What similar problems do I know?
- What patterns in my experience match this?
- What architectures have worked for similar tasks?

Related Problems:
- What problems share the same architecture?
- What problems share the same challenges?
- What problems are in the same domain?
- Can I transfer results, methods, or insights?

Auxiliary Elements:
- What additional tasks could help learning?
- What intermediate representations might help?
- What auxiliary losses could guide training?
- What creative additions might unlock progress?

Problem Transformation:
- How else could I frame this problem?
- What assumptions am I making that I could change?
- What different ML task types could solve this?
- What different representations could I use?

First Principles:
- What am I really trying to do? (Plain language)
- What does success actually mean?
- What does my model need to know?
- What assumptions am I making?
- What’s the simplest approach?

Working Backwards:
- What are my hard constraints?
- What do these constraints imply?
- What components do I need?
- What does each component require?
- What’s the data and compute budget?

When Stuck:
- Can my model overfit a tiny batch? (Capacity check)
- Where and why is it failing? (Error analysis)
- Am I solving the right problem? (Problem check)
- Have I tried simpler approaches? (Baseline check)
- Do I need to reframe entirely? (Strategic pivot)

Aesthetic Judgment:
- Is this elegant? (Appropriate complexity)
- Is this robust? (Handles uncertainty)
- Is this clear? (Understandable and debuggable)
- Does this feel inevitable? (Aligned with problem structure)
Practice: Your Personal Reflection
Before you go, I want you to do something. Take the project you’re currently working on (or about to start). Write down:
- Where you are: Current state, what you have
- Where you need to be: Production goal, constraints, metrics
- Your current plan: How you’re planning to bridge the gap
- Your confidence: Which parts are clear? Which parts are uncertain?
- Your next action: What’s the concrete next step?
Then ask yourself:
- Have I considered the patterns I know?
- Have I looked at related problems?
- Could I reframe this differently?
- Am I working backwards from constraints?
- Is my plan elegant, robust, and clear?
This reflection—this conscious application of Pólya’s questions to your actual work—is how these heuristics become second nature.
The Journey Continues
Remember: The plan is the bridge between data and deployment. Pólya’s questions are the architectural principles that ensure your bridge is not just functional but optimal—not just sufficient but elegant.
We’ve focused on devising the plan, but there’s more to the journey. In the next article, we’ll explore how to carry out your plan—how to execute with discipline, how to debug when things go wrong, how to iterate efficiently, and how to know when your beautiful plan needs revision in the face of empirical reality.
But for now, you have the tools to devise strong plans. You understand that planning is creative problem-solving—recognizing patterns, finding connections, transforming representations, working backwards, and building with aesthetic judgment.
The planning phase is your opportunity to think deeply before acting. It’s your chance to leverage your experience, to avoid known pitfalls, to choose elegant approaches over brute-force ones. Take your time with it. A day spent devising a good plan can save weeks of wasted effort.
Go forth and plan well. And remember: every senior ML practitioner was once where you are, learning to devise plans, getting stuck, breaking through, developing judgment. This is the path. You’re on it. Keep walking.
The next time you face a new ML problem, don’t immediately jump to “let me try some models.” Pause. Ask Pólya’s questions. Sketch your bridge. Think about patterns and related problems. Consider transformations and auxiliary elements. Work backwards from constraints. Judge your plan aesthetically.
That’s how you become not just a practitioner, but a master of your craft.
This article is Part 2 of a four-part series on applying Polya’s problem-solving framework to data science and machine learning. Continue with [Part 3: Carrying Out the Plan], which addresses implementation challenges and best practices, and [Part 4: Looking Back], which covers evaluation, iteration, and continuous improvement.
The principles and frameworks presented here have been developed through analysis of hundreds of ML projects across industries. While each project is unique, the patterns of success and failure repeat with remarkable consistency. By learning from these patterns and applying Polya’s timeless wisdom, we can dramatically improve our chances of building ML systems that deliver real value.