<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom"
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Imad Dabbura</title>
<link>https://imaddabbura.github.io/posts.html</link>
<atom:link href="https://imaddabbura.github.io/posts.xml" rel="self" type="application/rss+xml"/>
<description>Deep science. Built from scratch. Shared openly.</description>
<image>
<url>https://imaddabbura.github.io/images/profile-pic.png</url>
<title>Imad Dabbura</title>
<link>https://imaddabbura.github.io/posts.html</link>
<height>152</height>
<width>144</width>
</image>
<generator>quarto-1.4.553</generator>
<lastBuildDate>Sun, 21 Sep 2025 05:00:00 GMT</lastBuildDate>
<item>
  <title>Make ML Systems Ship Again</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/mlsys/improving-mlsys-theory-of-constraint.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge evergreen">evergreen</span></div>
<p><a href="images/network-anomaly-detection-toc.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://imaddabbura.github.io/posts/mlsys/images/network-anomaly-detection-toc.jpg" class="img-fluid"></a></p>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>You burn six months “optimizing.” Swap in transformers. Squeeze another +0.5% accuracy. Rewrite the feature pipeline. Add a shiny GPU cluster. And still: alert fatigue, missed incidents, and latency that kills real-time response.</p>
<p>That’s optimization theater.</p>
<p>This pattern shows up everywhere in production ML. Fraud teams add transaction features that never reduce false positives. Recommendation engines get fancier models that don’t move click-through rates. Forecasting pipelines gain complexity without improving planning accuracy. Parts get optimized. Systems don’t.</p>
<p>This post gives you a systematic method to break out of the cycle. It’s based on the Theory of Constraints — originally developed for manufacturing, but a natural fit for ML systems. We’ll use a network anomaly detection system as our running example, but the playbook works for any ML system in production.</p>
<section id="roadmap" class="level3">
<h3 class="anchored" data-anchor-id="roadmap">Roadmap</h3>
<table class="table">
<colgroup>
<col style="width: 19%">
<col style="width: 46%">
<col style="width: 34%">
</colgroup>
<thead>
<tr class="header">
<th>Section</th>
<th>What You’ll Learn / Do</th>
<th>Why It Matters</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>The Theory of Constraints</strong></td>
<td>The core idea and why single-bottleneck focus works</td>
<td>Gives you the mental model that makes the steps principled, not arbitrary</td>
</tr>
<tr class="even">
<td><strong>1. Goal &amp; Constraint</strong></td>
<td>Set SLOs, then build a constraint ledger to find the bottleneck</td>
<td>Defines success and focuses effort on the one thing that governs throughput</td>
</tr>
<tr class="odd">
<td><strong>2. Understand Why It’s Stuck</strong></td>
<td>Root-cause analysis</td>
<td>Prevents solving symptoms</td>
</tr>
<tr class="even">
<td><strong>3. See the Hidden Tradeoff</strong></td>
<td>Map the conflict</td>
<td>Reveals why simple fixes haven’t worked</td>
</tr>
<tr class="odd">
<td><strong>4. Break the Tradeoff</strong></td>
<td>Challenge assumptions, then innovate</td>
<td>Achieves step-function improvement</td>
</tr>
<tr class="even">
<td><strong>5. Prove It Works</strong></td>
<td>Minimum Viable Experiment</td>
<td>Validates before full investment</td>
</tr>
</tbody>
</table>
</section>
</section>
<section id="the-theory-of-constraints-in-5-minutes" class="level2">
<h2 class="anchored" data-anchor-id="the-theory-of-constraints-in-5-minutes">The Theory of Constraints in 5 Minutes</h2>
<p>The traditional approach to improving ML systems is based on a seemingly logical but flawed assumption: if you improve each component, the whole system improves. It doesn’t. <em>The sum of all local improvements doesn’t give you a system improvement.</em></p>
<p>The breakthrough insight, from Eli Goldratt’s <em>The Goal</em> (1984) and made operational by Alan Barnard’s pairing method, is simple: every system has exactly one constraint at any given moment — the single resource or stage you don’t have enough of. That constraint sets the ceiling for the entire system. Improving anything else delivers diminishing-to-zero returns.</p>
<p>A factory line can only produce as fast as its slowest machine. If the paint booth takes 10 minutes per car while everything else takes 2 minutes, buying faster welding robots changes nothing. You have to speed up the paint booth — or the line will forever produce one car every 10 minutes.</p>
<p>In a serial pipeline — which is what most ML systems are — this is even starker than it sounds. Throughput equals the throughput of the slowest stage. If your feature extraction handles 30K records/sec and everything else handles 100K, the system does 30K. Making inference 10x faster? Still 30K. Doubling ingest capacity? Still 30K. Only improving the bottleneck stage moves the number. Every other “optimization” is buying faster welding robots while the paint booth sets the pace.</p>
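<p>The arithmetic is worth making concrete. A minimal sketch, using the example capacities from the text (stage names and numbers are illustrative, not measurements):</p>

```python
# A serial pipeline runs at the pace of its slowest stage.
stages = {
    "ingest": 100_000,       # records/sec
    "features": 30_000,      # the paint booth
    "inference": 100_000,
    "alerting": 100_000,
}

def system_throughput(stages: dict) -> int:
    """System throughput = throughput of the slowest stage."""
    return min(stages.values())

baseline = system_throughput(stages)

# "Optimizing" a non-bottleneck stage changes nothing:
stages["inference"] *= 10
assert system_throughput(stages) == baseline

# Only improving the constraint moves the number:
stages["features"] *= 3
assert system_throughput(stages) == 90_000
```

<p>Run the same check against your own stage capacities before committing to any optimization: if the number you plan to improve isn’t the minimum, the system number won’t move.</p>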
<p><a href="images/amdahl-law-optimization.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://imaddabbura.github.io/posts/mlsys/images/amdahl-law-optimization.svg" class="img-fluid"></a></p>
<p>Barnard turns this into a strict chain of focused pairings. Each pairing links a WHAT (what you need) to a HOW (how to get it), maintaining a one-to-one relationship that keeps focus razor-sharp:</p>
<ol type="1">
<li><strong>Goal → Constraint</strong>: WHAT do I want? More of the Goal. HOW? By getting more of the Constraint — the single resource I don’t have enough of.</li>
<li><strong>Constraint → Problem</strong>: WHAT limits the Constraint? The one Problem causing at least 50% of the gap.</li>
<li><strong>Problem → Conflict</strong>: WHY hasn’t the Problem been solved? Because it’s an unresolved Conflict between two necessary-but-competing approaches.</li>
<li><strong>Conflict → Innovation</strong>: HOW do I resolve it? With an Innovation that captures the Pros of <em>both</em> the current approach and the new idea. The aim is all the Pros — but some tradeoffs may remain. The key is they’re deliberate and tolerable, not the paralyzing either/or you started with.</li>
<li><strong>Innovation → Experiment</strong>: HOW do I know it works? With a Minimally Viable Experiment — before building anything.</li>
</ol>
<p>The five how-to steps below translate these pairings into ML-systems language. Step 1 defines the Goal (SLOs) and finds the Constraint (bottleneck). Step 2 uncovers the Problem (root cause). Step 3 maps the Conflict (hidden tradeoff). Step 4 designs the Innovation. Step 5 runs the Experiment.</p>
<p>So why does this matter for ML specifically? Because ML pipelines are textbook flow systems: ingest → features → inference → action. They have measurable stages with capacity limits. And they accumulate complexity over time — teams add features, models, and infrastructure without ever removing anything. This makes them natural candidates for constraint-based thinking. But ML teams rarely think this way, because they’re trained to optimize <em>models</em>, not <em>systems</em>.</p>
<p><a href="images/ml-pipeline-weak-constraint.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-3"><img src="https://imaddabbura.github.io/posts/mlsys/images/ml-pipeline-weak-constraint.svg" class="img-fluid"></a></p>
</section>
<section id="step-1-define-your-goal-and-find-the-bottleneck" class="level2">
<h2 class="anchored" data-anchor-id="step-1-define-your-goal-and-find-the-bottleneck">Step 1: Define Your Goal and Find the Bottleneck</h2>
<p>Before you can find the bottleneck, you need to define what success actually means — in numbers, not aspirations. And before you can fix the bottleneck, you need to know which stage is actually holding the system back. This step covers both.</p>
<section id="set-your-slos" class="level3">
<h3 class="anchored" data-anchor-id="set-your-slos">Set Your SLOs</h3>
<p>“Detect anomalies,” “reduce fraud,” and “improve recommendations” aren’t goals. They’re wishes. Without measurable targets, every team member optimizes for a different thing, and you can’t tell whether you’re constrained by latency, precision, coverage, or something else entirely.</p>
<p>The fix is Service Level Objectives — specific, measurable thresholds tied to business outcomes:</p>
<table class="table">
<colgroup>
<col style="width: 34%">
<col style="width: 43%">
<col style="width: 21%">
</colgroup>
<thead>
<tr class="header">
<th>SLO Dimension</th>
<th>What It Measures</th>
<th>Fill In</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Time-to-Decision (TTD)</strong></td>
<td>How fast the system produces an actionable output</td>
<td>p95 ≤ ___</td>
</tr>
<tr class="even">
<td><strong>Decision Budget</strong></td>
<td>How many outputs a human can realistically handle</td>
<td>≤ ___ per day</td>
</tr>
<tr class="odd">
<td><strong>Outcome-Weighted Performance</strong></td>
<td>Accuracy weighted by business impact, not volume</td>
<td>≥ ___%</td>
</tr>
<tr class="even">
<td><strong>Coverage</strong></td>
<td>Fraction of relevant events actually processed</td>
<td>≥ ___%</td>
</tr>
<tr class="odd">
<td><strong>Data Loss</strong></td>
<td>Events dropped or degraded in transit</td>
<td>≤ ___%</td>
</tr>
</tbody>
</table>
<p>These five dimensions force hard conversations. A model with 99% accuracy but 30-minute detection latency fails the TTD target. A model with perfect precision but 500 daily alerts fails the decision budget. The SLOs define the feasible region — and crucially, reveal <em>what’s blocking you</em> from reaching it.</p>
<p>Here’s how our network anomaly detection system instantiated these:</p>
<ul>
<li><strong>TTD</strong>: p95 ≤ 5 minutes from event to alert</li>
<li><strong>Alert Budget</strong>: ≤ 10 analyst-actionable alerts/day</li>
<li><strong>Incident-Weighted Recall</strong>: ≥ 90%</li>
</ul>
<p>The same template applies to other domains. A fraud detection team might set TTD ≤ 200ms with ≤ 50 manual reviews/day. A recommendation system might target TTD ≤ 100ms with CTR-weighted precision ≥ X%.</p>
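<p>The SLO check itself can be mechanical. A sketch for the anomaly-detection targets above — field names and thresholds mirror the text; adapt them to your domain:</p>

```python
from dataclasses import dataclass

@dataclass
class SLOReport:
    ttd_p95_minutes: float          # time-to-decision, p95
    alerts_per_day: int             # decision budget
    incident_weighted_recall: float

def violations(r: SLOReport) -> list:
    """Each failed SLO dimension points at a different class of bottleneck."""
    failed = []
    if r.ttd_p95_minutes > 5:
        failed.append("TTD: bottleneck is in the latency path")
    if r.alerts_per_day > 10:
        failed.append("Alert budget: bottleneck is precision or triage capacity")
    if r.incident_weighted_recall < 0.90:
        failed.append("Recall: bottleneck is coverage or model quality")
    return failed

# A model that looks great on its own scorecard can still fail the system:
report = SLOReport(ttd_p95_minutes=30.0, alerts_per_day=500,
                   incident_weighted_recall=0.99)
```

<p>Here <code>violations(report)</code> flags TTD and the alert budget — exactly the two failures the vanity-metrics pitfall below describes.</p>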
<p><a href="images/slo-hierarchy.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-4"><img src="https://imaddabbura.github.io/posts/mlsys/images/slo-hierarchy.svg" class="img-fluid"></a></p>
<p>When defining SLOs, involve the people who <em>use</em> the system’s outputs — not just the team that builds it. Security analysts, operations teams, business stakeholders. When they disagree (and they will — security wants recall, ops wants fewer alerts), the SLOs make the tradeoff explicit rather than hiding it inside model thresholds.</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Pitfall: Vanity Metrics Over Business Outcomes
</div>
</div>
<div class="callout-body-container callout-body">
<p>Teams optimize metrics that sound impressive but don’t connect to business value. “99.9% precision” means nothing if you’re missing 90% of incidents. “Processing 1M events/second” is irrelevant if decisions take 30 minutes. In our case, we celebrated achieving 99% detection rate on port scans — which the SOC ignored anyway — while missing lateral movement using legitimate credentials. Define SLOs tied to outcomes, not to model scorecards.</p>
</div>
</div>
<p>SLOs don’t just measure success — they <em>reveal</em> what’s blocking it. If you can’t meet your TTD target, the bottleneck is somewhere in your latency path. If you can’t meet your alert budget, the bottleneck is in precision or triage capacity. Now let’s find exactly where.</p>
</section>
<section id="find-the-bottleneck" class="level3">
<h3 class="anchored" data-anchor-id="find-the-bottleneck">Find the Bottleneck</h3>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Mental Model
</div>
</div>
<div class="callout-body-container callout-body">
<p>Your ML pipeline is a series of stages, each with a capacity ceiling. The stage with the lowest effective capacity is your constraint — it sets the ceiling for the entire system. Everything upstream queues up; everything downstream sits idle. Barnard’s memorable shortcut: <em>“Check what you’re waiting for. Where’s the backlog?”</em></p>
</div>
</div>
<p>With SLOs defined, you can systematically measure where the system breaks down. Build a <strong>constraint ledger</strong> — a table measuring capacity, utilization, latency, queue depth, and top failure mode at each pipeline stage:</p>
<table class="table">
<colgroup>
<col style="width: 8%">
<col style="width: 22%">
<col style="width: 15%">
<col style="width: 15%">
<col style="width: 15%">
<col style="width: 21%">
</colgroup>
<thead>
<tr class="header">
<th>Stage</th>
<th style="text-align: right;">Capacity (rec/s)</th>
<th style="text-align: right;">Utilization</th>
<th style="text-align: right;">p95 Latency</th>
<th style="text-align: right;">Queue Depth</th>
<th>Top Failure Mode</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Ingest</td>
<td style="text-align: right;">100K</td>
<td style="text-align: right;">60%</td>
<td style="text-align: right;">2ms</td>
<td style="text-align: right;">0</td>
<td>burst loss</td>
</tr>
<tr class="even">
<td>Feature-Tier1</td>
<td style="text-align: right;">100K</td>
<td style="text-align: right;">65%</td>
<td style="text-align: right;">5ms</td>
<td style="text-align: right;">0</td>
<td>cache miss</td>
</tr>
<tr class="odd">
<td><strong>Feature-Tier2</strong></td>
<td style="text-align: right;"><strong>30K</strong></td>
<td style="text-align: right;"><strong>95%</strong></td>
<td style="text-align: right;"><strong>50ms</strong></td>
<td style="text-align: right;"><strong>1.2K</strong></td>
<td><strong>window skew</strong></td>
</tr>
<tr class="even">
<td>Feature-Tier3</td>
<td style="text-align: right;">10K</td>
<td style="text-align: right;">20%</td>
<td style="text-align: right;">200ms</td>
<td style="text-align: right;">0</td>
<td>cold start</td>
</tr>
<tr class="odd">
<td>Inference</td>
<td style="text-align: right;">50K</td>
<td style="text-align: right;">40%</td>
<td style="text-align: right;">10ms</td>
<td style="text-align: right;">0</td>
<td>batch sizing</td>
</tr>
<tr class="even">
<td>Alerting</td>
<td style="text-align: right;">1K</td>
<td style="text-align: right;">10%</td>
<td style="text-align: right;">100ms</td>
<td style="text-align: right;">0</td>
<td>dedup thrash</td>
</tr>
</tbody>
</table>
<p>The diagnostic pattern is simple: <strong>high utilization + growing queue = bottleneck</strong>. Feature-Tier2 jumps out — 95% utilization with a queue of 1.2K while other stages sit at 10–65%. During peak periods, the system is forced to either sample traffic (missing attacks), queue records (violating TTD), or drop features (hurting accuracy). The model never sees complete feature representations because feature extraction can’t keep pace.</p>
<p>Build this table for your own system. The constraint is almost always obvious once you measure.</p>
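<p>The diagnostic pattern is easy to encode. A sketch over the example ledger above (numbers taken from the table; the 85% utilization threshold is an assumption, tune it for your system):</p>

```python
ledger = [
    # (stage, capacity rec/s, utilization, queue depth)
    ("Ingest",        100_000, 0.60,    0),
    ("Feature-Tier1", 100_000, 0.65,    0),
    ("Feature-Tier2",  30_000, 0.95, 1200),
    ("Feature-Tier3",  10_000, 0.20,    0),   # lower nominal capacity,
    ("Inference",      50_000, 0.40,    0),   # but only sees escalated traffic
    ("Alerting",        1_000, 0.10,    0),
]

def find_constraint(ledger, util_threshold=0.85):
    """High utilization + growing queue = bottleneck."""
    return [stage for stage, _, util, queue in ledger
            if util >= util_threshold and queue > 0]

assert find_constraint(ledger) == ["Feature-Tier2"]
```

<p>Note that raw capacity alone misleads: Feature-Tier3 and Alerting have lower nominal capacity than Feature-Tier2, but neither is saturated or queueing, so neither is the constraint.</p>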
<blockquote class="blockquote">
<p><strong>Capacity conversion</strong>: 10 Gbps network traffic ≈ 100K flows/sec.&nbsp;1M daily e-commerce orders ≈ 12/sec average, 50/sec peak. 10K IoT sensors at 1Hz ≈ 10K records/sec.</p>
</blockquote>
</section>
<section id="validate-before-you-invest" class="level3">
<h3 class="anchored" data-anchor-id="validate-before-you-invest">Validate Before You Invest</h3>
<p>Before building anything, run a 24-hour experiment: temporarily throw 3x resources at your suspected bottleneck. If system-level metrics improve dramatically, you’ve found the right constraint. If not, look elsewhere. This experiment costs a day; building the wrong solution costs months.</p>
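<p>The decision rule behind the experiment is simple to state precisely. A sketch — the 50% improvement threshold is an assumption; the point is that a real constraint yields a roughly capacity-proportional jump, not a marginal one:</p>

```python
def constraint_confirmed(before: dict, after: dict,
                         min_improvement: float = 0.5) -> bool:
    """If 3x resources at the suspect stage barely moves system
    throughput, the constraint is elsewhere."""
    gain = (after["throughput"] - before["throughput"]) / before["throughput"]
    return gain >= min_improvement

# Our Feature-Tier2 test: 30K -> 90K records/sec, a near-linear gain.
assert constraint_confirmed({"throughput": 30_000}, {"throughput": 90_000})

# A marginal gain means you guessed wrong — look elsewhere:
assert not constraint_confirmed({"throughput": 30_000}, {"throughput": 33_000})
```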
<p>We provisioned 3x compute for Feature-Tier2, enabling 90K records/sec.&nbsp;The results were dramatic: detection time dropped, false positives decreased (the model makes better decisions with complete feature sets), and we nearly met our SLOs. No other improvement — not model accuracy, not infrastructure, not threshold tuning — would have achieved this.</p>
<p><a href="images/before-after-dashboard.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-5"><img src="https://imaddabbura.github.io/posts/mlsys/images/before-after-dashboard.svg" class="img-fluid"></a></p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Pitfall: Premature Model Optimization
</div>
</div>
<div class="callout-body-container callout-body">
<p>Teams spend months improving model accuracy while system-level metrics stagnate. The pipeline logic explains why: if the constraint isn’t in the model, then making the model infinitely better has zero impact on system throughput. We spent three months experimenting with transformer architectures for 2% accuracy improvement — while 70% of traffic was never analyzed due to feature extraction bottlenecks. The transformer detected sophisticated attacks brilliantly, on the 30% of traffic it actually saw. Always validate the constraint before optimizing.</p>
</div>
</div>
<p><strong>Key takeaway:</strong> The constraint is the only thing worth optimizing right now. Everything else is rearranging deck chairs.</p>
</section>
</section>
<section id="step-2-understand-why-its-stuck" class="level2">
<h2 class="anchored" data-anchor-id="step-2-understand-why-its-stuck">Step 2: Understand Why It’s Stuck</h2>
<p>You’ve found the bottleneck. Now resist the urge to fix the surface symptom. “Feature extraction is slow” is a temperature reading, not a diagnosis. You need the underlying cause — because the cause determines the cure.</p>
<section id="five-whys-with-evidence" class="level3">
<h3 class="anchored" data-anchor-id="five-whys-with-evidence">Five Whys — With Evidence</h3>
<p>The Five Whys technique is simple: ask “why” repeatedly until you reach a root cause, but <em>validate each answer with evidence</em> before proceeding to the next. Unvalidated whys lead to plausible-sounding but wrong root causes.</p>
<p><a href="images/five-whys.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-6"><img src="https://imaddabbura.github.io/posts/mlsys/images/five-whys.svg" class="img-fluid"></a></p>
<p>Here’s how this played out for our Feature-Tier2 bottleneck:</p>
<ol type="1">
<li><strong>Why</strong> is Feature-Tier2 at 95% utilization? → It computes 47 features per record. <em>(Validated: profiling shows 89% of computation in 12% of features)</em></li>
<li><strong>Why</strong> so many features? → Designed for offline research with unlimited compute. <em>(Validated: 31 features contribute &lt;0.1% to decisions)</em></li>
<li><strong>Why</strong> no production constraints in the design? → Development was disconnected from deployment. <em>(Validated: git history shows features added without removal)</em></li>
<li><strong>Why</strong> disconnected? → ML team and platform team operate in silos. <em>(Validated: team interviews confirm no shared requirements)</em></li>
<li><strong>Why</strong> silos? → No ownership of end-to-end system performance.</li>
</ol>
<p>Notice where we ended up: the root cause isn’t technical — it’s organizational.</p>
<p><a href="images/iceberg-visible-constraint.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-7"><img src="https://imaddabbura.github.io/posts/mlsys/images/iceberg-visible-constraint.svg" class="img-fluid"></a></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Common Root Causes in ML Systems — Check Which Applies
</div>
</div>
<div class="callout-body-container callout-body">
<ul>
<li><strong>Feature Explosion</strong>: Teams extract every conceivable signal because “it might help.” Features grow monotonically — each has an advocate, none has a removal date. Most provide redundant information.</li>
<li><strong>Multi-granularity Overhead</strong>: Computing signals at every timescale (seconds, minutes, hours, days) when most decisions only need one. Common in anomaly detection, fraud, and demand forecasting.</li>
<li><strong>Stale Reference Data</strong>: Maintaining expensive rolling statistics (baselines, embeddings, aggregates) for thousands of entities, even though most change negligibly between updates. The recomputation cost dwarfs the information gained.</li>
</ul>
</div>
</div>
<p>If your Five Whys keep ending at technical causes, go one more level. The technical problem often has an organizational parent — siloed teams, misaligned incentives, no end-to-end ownership. These patterns aren’t unique to our system. Fraud detection, recommendations, and forecasting all exhibit the same failure modes.</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Pitfall: Feature Creep Without Cost Analysis
</div>
</div>
<div class="callout-body-container callout-body">
<p>Feature counts grow monotonically because each has an advocate who remembers when it caught something. Our system grew from 50 to 247 features over two years. Analysis showed 180 contributed &lt;0.1% to decisions but consumed 60% of computation. Track a <strong>feature value score</strong> — importance divided by computational cost — and require cost-benefit analysis for new features.</p>
</div>
</div>
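<p>The feature value score from the pitfall above takes a few lines to operationalize. A sketch — feature names, importances, and costs here are invented for illustration:</p>

```python
def value_score(importance: float, cost_ms: float) -> float:
    """Model importance per unit of compute cost."""
    return importance / cost_ms

features = {
    # name: (model importance, compute cost in ms/record)
    "bytes_per_flow":     (0.30, 0.05),
    "7d_rolling_entropy": (0.001, 12.0),   # expensive, near-zero value
    "dst_port_rarity":    (0.15, 0.10),
}

ranked = sorted(features, key=lambda f: value_score(*features[f]),
                reverse=True)
# Removal candidates sit at the bottom of the ranking:
assert ranked[-1] == "7d_rolling_entropy"
```

<p>Reviewing this ranking on a schedule — and requiring a score estimate before any new feature ships — is what keeps feature counts from growing monotonically.</p>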
<p><strong>Key takeaway:</strong> Root causes are usually organizational, not algorithmic. If you fix the technical symptom without fixing the organizational cause, the symptom will return.</p>
</section>
</section>
<section id="step-3-see-the-hidden-tradeoff" class="level2">
<h2 class="anchored" data-anchor-id="step-3-see-the-hidden-tradeoff">Step 3: See the Hidden Tradeoff</h2>
<p>You know the root cause. So why hasn’t anyone fixed it? Almost always, it’s because the problem is an unresolved conflict — and people are stuck choosing between two approaches that both seem necessary.</p>
<p>Barnard puts it precisely: <em>any problem can be defined as an unresolved conflict.</em> In our case, the Feature-Tier2 bottleneck persists because of a fundamental tension:</p>
<p><a href="images/feature-extraction-conflict.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-8"><img src="https://imaddabbura.github.io/posts/mlsys/images/feature-extraction-conflict.svg" class="img-fluid"></a></p>
<p>We need <strong>rich feature analysis</strong> for accurate detection of sophisticated attacks. We <em>also</em> need <strong>efficient processing</strong> for real-time response and cost control. These seem to contradict each other, so the team oscillates — add features after a missed attack, remove features after a performance degradation. Two years later, they’re exactly where they started.</p>
<section id="why-teams-get-stuck" class="level3">
<h3 class="anchored" data-anchor-id="why-teams-get-stuck">Why Teams Get Stuck</h3>
<p>Barnard identifies two failure modes that keep teams trapped in these oscillations:</p>
<ul>
<li><strong>Getting stuck / procrastinating</strong>: Exaggerated fears — fear of losing what the current approach does well, or fear of the effort and risk required to change. (“If we remove features, we’ll miss attacks.”)</li>
<li><strong>Overreacting / jumping to conclusions</strong>: Exaggerated frustration with the current approach’s downsides, or exaggerated expectations of a new solution. (“Let’s just throw out all the expensive features and rely on the model.”)</li>
</ul>
<p>Most ML teams alternate between these two modes without recognizing the pattern.</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Pitfall: Alert Budget Myopia
</div>
</div>
<div class="callout-body-container callout-body">
<p>A textbook case of oscillation: facing missed incidents, teams lower thresholds (overreacting). This floods analysts with alerts, who start ignoring them, leading to <em>more</em> missed incidents — which triggers another round of threshold lowering. Little’s Law makes the math concrete: L = λW — if analysts can investigate 50 alerts/day and each takes 45 minutes, that’s the hard capacity ceiling. No threshold change can overcome it. This is the precision/coverage conflict manifesting as a vicious cycle.</p>
</div>
</div>
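<p>Little’s Law makes the ceiling computable before any threshold debate starts. A sketch — analyst count and shift length are assumptions for illustration:</p>

```python
def alert_capacity_per_day(analysts: int, shift_hours: float,
                           minutes_per_alert: float) -> int:
    """Little's Law (L = lambda * W) rearranged: the arrival rate the
    team can sustain is bounded by available time / time per alert."""
    available_minutes = analysts * shift_hours * 60
    return int(available_minutes / minutes_per_alert)

# 5 analysts on 8-hour shifts, 45 minutes per investigation:
ceiling = alert_capacity_per_day(5, 8, 45)
# Alerts beyond this ceiling are not investigated faster --
# they are ignored, which is how the vicious cycle starts.
```

<p>Here <code>ceiling</code> comes out to 53 alerts/day. Any alerting configuration that emits more than that is choosing, implicitly, which alerts get ignored.</p>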
<p><a href="images/alert-fatigue.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-9"><img src="https://imaddabbura.github.io/posts/mlsys/images/alert-fatigue.svg" class="img-fluid"></a></p>
</section>
<section id="map-your-conflict" class="level3">
<h3 class="anchored" data-anchor-id="map-your-conflict">Map Your Conflict</h3>
<p>The breakthrough comes from asking: <em>what assumptions make this conflict seem unresolvable?</em> To find them, map the conflict explicitly:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode md code-with-copy"><code class="sourceCode markdown"><span id="cb1-1">We need <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">rich feature analysis</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> to achieve <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">accurate detection</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span>.</span>
<span id="cb1-2">We need <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">efficient processing</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> to achieve <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">real-time response</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span>.</span>
<span id="cb1-3">These conflict because we assume:</span>
<span id="cb1-4"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  1. </span>All records need the same analysis depth</span>
<span id="cb1-5"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  2. </span>Features must be computed synchronously</span>
<span id="cb1-6"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  3. </span>One model handles all decisions</span>
<span id="cb1-7"></span>
<span id="cb1-8">Challenge each:</span>
<span id="cb1-9"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  - </span>Is assumption 1 always true? No — routine DNS queries</span>
<span id="cb1-10">    don't need the same scrutiny as connections to unknown IPs.</span>
<span id="cb1-11"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  - </span>Is assumption 2 always true? No — historical comparisons</span>
<span id="cb1-12">    could be asynchronous.</span>
<span id="cb1-13"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  - </span>Is assumption 3 always true? No — different attack types</span>
<span id="cb1-14">    could use specialized models.</span></code></pre></div>
<p>This template works for any ML conflict. A fraud detection team might write: “We need comprehensive transaction analysis AND sub-200ms decisions. Hidden assumption: every transaction needs the same analysis depth.” A recommendation team: “We need deep personalization AND instant page load. Hidden assumption: personalization must happen at request time.”</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Common ML Conflicts
</div>
</div>
<div class="callout-body-container callout-body">
<ul>
<li><strong>Accuracy vs Latency</strong>: Complex models are more accurate but slower</li>
<li><strong>Precision vs Coverage</strong>: Tight thresholds reduce false positives but miss edge cases</li>
<li><strong>Real-time vs Historical Context</strong>: Immediate response vs rich contextual analysis</li>
<li><strong>Generic vs Specific Models</strong>: Broad coverage vs environment-specific accuracy</li>
</ul>
</div>
</div>
<p><strong>Key takeaway:</strong> The tradeoff that’s blocking you is almost never fundamental. It persists because of hidden assumptions. Find the assumption. Challenge it. The conflict evaporates.</p>
</section>
</section>
<section id="step-4-break-the-tradeoff" class="level2">
<h2 class="anchored" data-anchor-id="step-4-break-the-tradeoff">Step 4: Break the Tradeoff</h2>
<p>You’ve identified the assumptions propping up the conflict. Now comes the payoff: designing a solution that captures the Pros of both the current approach and the alternative.</p>
<p>The goal is to get as many Pros from both sides as possible. Sometimes you genuinely get all of them. More often, some tradeoffs remain — added complexity, operational overhead, calibration effort. The difference from compromise is that these residual cons are <em>deliberate and manageable</em>, not the paralyzing either/or that kept the team stuck. You’re not splitting the difference. You’re changing the game so the remaining tradeoffs feel trivial compared to where you started.</p>
<section id="the-thinking-process" class="level3">
<h3 class="anchored" data-anchor-id="the-thinking-process">The Thinking Process</h3>
<p>After mapping your conflict and challenging assumptions (Step 3), work through each challenged assumption systematically:</p>
<ol type="1">
<li><p><strong>Sketch the system without the assumption.</strong> If you challenged “all records need the same analysis depth,” draw the pipeline where they don’t. What would variable-depth processing look like? What decides the depth?</p></li>
<li><p><strong>Look for the four reusable patterns.</strong> Most ML system innovations are combinations of these:</p>
<ul>
<li><strong>Cascade filtering</strong> — cheap check first, expensive check only when needed. Applicable whenever most inputs are routine. (Fraud: score transactions with simple rules before running the full model. Recs: serve cached recommendations before running personalization.)</li>
<li><strong>Async enrichment</strong> — decide now, enrich later. Useful whenever decision speed and decision quality have different time horizons. (Generate an alert with basic info immediately; add forensic context over the next 30 seconds.)</li>
<li><strong>Confidence-based routing</strong> — let the model decide how much compute each input deserves. Turns a fixed-cost pipeline into an adaptive one. (High-confidence benign traffic exits at Tier-1; uncertain traffic escalates.)</li>
<li><strong>Feature caching</strong> — never compute the same thing twice across pipeline stages. Obvious but rarely implemented. (Features from early triage stages are reused in deep analysis — we achieved 84% cache hit rates.)</li>
</ul></li>
<li><p><strong>Check for async opportunities</strong> — what’s being computed <em>before</em> the decision that could move to <em>after</em>?</p></li>
<li><p><strong>Check for caching opportunities</strong> — what’s being computed repeatedly across stages, records, or time windows?</p></li>
</ol>
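<p>To make patterns 1 and 3 concrete, cascade filtering plus confidence-based routing compose into a few lines. This is a minimal sketch, not our production system: the stub models, the <code>packet_rate</code> rule, and the thresholds are all illustrative.</p>

```python
# Cascade filtering with confidence-based routing: run tiers in order,
# exit as soon as one tier is confident enough.
def route(record, tiers):
    """Each tier is (model, threshold); a model returns (prediction, confidence)."""
    for model, threshold in tiers:
        prediction, confidence = model(record)
        if confidence >= threshold:
            return prediction  # cheap, confident exit
    return prediction          # the final tier always decides

# Stubs stand in for trained models.
tiers = [
    # Tier-1: cheap triage on a single feature (illustrative rule)
    (lambda r: ("benign", 0.99) if r["packet_rate"] < 100 else ("benign", 0.40), 0.95),
    # Tier-2: fast analysis (illustrative stub)
    (lambda r: ("suspicious", 0.80), 0.75),
]

print(route({"packet_rate": 50}, tiers))   # exits at Tier-1
print(route({"packet_rate": 500}, tiers))  # escalates to Tier-2
```

<p>In a real pipeline each stub would be a trained model, and the thresholds would come from calibration on a holdout set rather than being hand-picked.</p>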
</section>
<section id="a-worked-example-progressive-analysis" class="level3">
<h3 class="anchored" data-anchor-id="a-worked-example-progressive-analysis">A Worked Example: Progressive Analysis</h3>
<p>In our case, the most load-bearing assumption was: <em>“All records need the same analysis depth.”</em> Once you challenge it, the architecture follows from the patterns above — cascade filtering with confidence-based routing between tiers:</p>
<table class="table">
<colgroup>
<col style="width: 13%">
<col style="width: 22%">
<col style="width: 15%">
<col style="width: 28%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>Tier</th>
<th>Features</th>
<th>Model</th>
<th>Traffic Seen</th>
<th>Latency</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Tier-1</strong>: Wire-speed Triage</td>
<td>5 cheap features</td>
<td>Logistic regression</td>
<td>100% (68% exits)</td>
<td>~3ms</td>
</tr>
<tr class="even">
<td><strong>Tier-2</strong>: Fast Analysis</td>
<td>25 features</td>
<td>Moderate</td>
<td>~32%</td>
<td>~15ms</td>
</tr>
<tr class="odd">
<td><strong>Tier-3</strong>: Deep Analysis</td>
<td>100 features</td>
<td>Complex</td>
<td>~4%</td>
<td>~100ms</td>
</tr>
<tr class="even">
<td><strong>Forensic</strong>: Full Analysis</td>
<td>All features</td>
<td>Exhaustive</td>
<td>&lt;1%</td>
<td>~500ms</td>
</tr>
</tbody>
</table>
<p><a href="images/progressive-analysis-architecture.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-10"><img src="https://imaddabbura.github.io/posts/mlsys/images/progressive-analysis-architecture.svg" class="img-fluid"></a></p>
<p>Each stage outputs a prediction <em>and</em> a confidence score. High-confidence benign traffic exits immediately. Low confidence escalates. When stages experience backlog, confidence thresholds adjust dynamically — low-risk records defer to async processing during congestion, ensuring high-risk traffic always gets full analysis. And alerts are generated immediately with basic info, then progressively enriched over 30 seconds with connection context, historical patterns, and full forensics.</p>
<p><a href="images/feature-dependency.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-11"><img src="https://imaddabbura.github.io/posts/mlsys/images/feature-dependency.svg" class="img-fluid"></a></p>
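<p>The backlog-aware thresholds described above can be sketched in a few lines. This is an illustrative model of the idea, assuming each stage can report its queue depth; the linear scaling rule and the numbers are made up for the example.</p>

```python
# Scale a stage's exit-confidence threshold down as its queue fills, so more
# low-risk records exit early (or defer to async) under congestion.
def effective_threshold(base, queue_depth, capacity, floor=0.80):
    load = min(queue_depth / capacity, 1.0)   # 0.0 = idle, 1.0 = saturated
    return max(base - (base - floor) * load, floor)

print(round(effective_threshold(0.95, queue_depth=0, capacity=1000), 2))     # no congestion
print(round(effective_threshold(0.95, queue_depth=1000, capacity=1000), 2))  # full backlog
```

<p>The floor matters: it caps how permissive a congested stage can become, so high-risk traffic never loses its full analysis.</p>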
</section>
<section id="fix-the-organization-too" class="level3">
<h3 class="anchored" data-anchor-id="fix-the-organization-too">Fix the Organization Too</h3>
<p>Remember: Step 2 told us the root cause was organizational — siloed teams, no end-to-end ownership, features added without production constraints. The progressive architecture only sticks if the organizational structure changes with it. We restructured so ML and platform teams share SLOs, and adding features now requires cross-team cost-benefit approval. Without this, the feature explosion that caused the original bottleneck would have returned within a year.</p>
<p><strong>Key takeaway:</strong> The innovation doesn’t have to be novel to the field. It has to be novel to <em>your</em> system. Progressive analysis is a known pattern — applying it to our specific bottleneck was the breakthrough. But the technical fix and the organizational fix are a package deal.</p>
</section>
</section>
<section id="step-5-prove-it-works" class="level2">
<h2 class="anchored" data-anchor-id="step-5-prove-it-works">Step 5: Prove It Works</h2>
<p>You’ve designed a solution on paper. Before you spend three months building it, spend two weeks proving the riskiest assumption.</p>
<p>An important distinction: a Minimally Viable Experiment (MVE) comes <em>before</em> a Minimally Viable Product (MVP). An MVP builds the smallest usable product. An MVE is smaller — it tests whether the core assumption behind the innovation is even valid. Don’t build anything until you’ve validated the assumption.</p>
<section id="identify-the-riskiest-assumption" class="level3">
<h3 class="anchored" data-anchor-id="identify-the-riskiest-assumption">Identify the Riskiest Assumption</h3>
<p>Ask: <em>what’s the single assumption that, if wrong, kills the entire approach?</em> For our progressive architecture, it was: “Can Tier-1 triage accurately identify benign traffic without missing attacks?” If lightweight features can’t reliably separate benign from suspicious, the whole cascade fails.</p>
<p>Design the smallest test that answers this question. We trained a logistic regression on 5 cheap features and tested on realistic data with known attacks:</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">from sklearn.linear_model import LogisticRegression

tier1_features = [
    'src_reputation_score',  # Pre-computed reputation
    'dst_port',              # Destination port number
    'protocol',              # TCP/UDP/ICMP
    'packet_rate',           # Packets per second
    'byte_rate'              # Bytes per second
]

# X_train / y_train_benign: training slice with a binary benign-vs-not label
tier1_model = LogisticRegression(C=1.0)
tier1_model.fit(X_train[tier1_features], y_train_benign)</code></pre></div>
<p><a href="images/mve-experiment-timeline.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-12"><img src="https://imaddabbura.github.io/posts/mlsys/images/mve-experiment-timeline.svg" class="img-fluid"></a></p>
</section>
<section id="results" class="level3">
<h3 class="anchored" data-anchor-id="results">Results</h3>
<table class="table">
<thead>
<tr class="header">
<th>Metric</th>
<th>Result</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Triage rate</td>
<td>68% identified as benign at Tier-1</td>
</tr>
<tr class="even">
<td>False negative rate</td>
<td>0% — no attacks missed</td>
</tr>
<tr class="odd">
<td>Throughput</td>
<td>95K records/sec</td>
</tr>
<tr class="even">
<td>p95 latency</td>
<td>3ms per record</td>
</tr>
<tr class="odd">
<td>Cache hit rate</td>
<td>84% across stages</td>
</tr>
</tbody>
</table>
</section>
<section id="what-the-iterations-taught-us" class="level3">
<h3 class="anchored" data-anchor-id="what-the-iterations-taught-us">What the Iterations Taught Us</h3>
<p>The MVE revealed things we couldn’t have predicted from design alone:</p>
<ol type="1">
<li><p><strong>Confidence calibration</strong>: Initial triage was too conservative — 32% of traffic passed to Tier-2 unnecessarily. The model lacked confidence on legitimate-but-unusual ports. Retraining with expanded examples achieved a 71% triage rate without missing attacks.</p></li>
<li><p><strong>Dynamic resource allocation</strong>: Fixed compute allocation caused bottlenecks when traffic patterns shifted. We let overloaded stages borrow compute from idle ones, which smoothed throughput across load profiles.</p></li>
<li><p><strong>Feature pruning</strong>: 15 Tier-3 features never influenced decisions in production. Removing them increased throughput 30% without affecting detection. Track a feature value score — importance / computational_cost — and prune ruthlessly.</p></li>
</ol>
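<p>The feature value score from the third lesson can be made concrete. A minimal sketch, with features and numbers made up for the example: in practice, importance might come from permutation importance or SHAP, and cost from per-feature profiling.</p>

```python
# value score = importance / computational cost; prune low-value features.
features = {
    "src_reputation_score": {"importance": 0.30, "cost_ms": 0.1},
    "dns_entropy":          {"importance": 0.02, "cost_ms": 4.0},
    "tls_ja3_match":        {"importance": 0.15, "cost_ms": 0.5},
}

def value_score(f):
    return f["importance"] / f["cost_ms"]

# Keep only features whose value score clears a chosen threshold.
kept = [name for name, f in features.items() if value_score(f) >= 0.1]
print(kept)  # dns_entropy scores 0.005 and is pruned
```

<p>The threshold is a policy decision, not a constant; what matters is that the score makes the importance-versus-cost tradeoff explicit and auditable.</p>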
</section>
<section id="production-rollout-checklist" class="level3">
<h3 class="anchored" data-anchor-id="production-rollout-checklist">Production Rollout Checklist</h3>
<ul>
<li><strong>Shadow mode</strong>: Run progressive pipeline parallel to existing system. Compare decisions, measure divergence. Success: no P1 incidents missed for one week.</li>
<li><strong>Canary (10%)</strong>: Route 10% of traffic through progressive pipeline. A/B test alert quality with analysts. Success: SLOs maintained, analyst preference ≥ baseline.</li>
<li><strong>Gradual expansion</strong>: 10% → 25% → 50% → 75%, holding each level for 48 hours. Automated rollback on any SLO violation.</li>
<li><strong>Full production</strong>: 100% with old system as instant fallback. Document runbooks, train operations team. Success: one week at 100% with all SLOs met.</li>
</ul>
</section>
<section id="gono-go-criteria" class="level3">
<h3 class="anchored" data-anchor-id="gono-go-criteria">Go/No-Go Criteria</h3>
<p>After the MVE, the decision is straightforward: does the riskiest assumption hold? If yes — the cascade correctly separates benign from suspicious — proceed to shadow mode. If the assumption fails, you haven’t wasted months; you’ve spent two weeks learning that you need a different innovation. Go back to Step 4 and challenge a different assumption.</p>
<p><strong>Key takeaway:</strong> A two-week MVE teaches more than two years of production experience. Test your riskiest assumption first.</p>
</section>
</section>
<section id="the-cycle-continues" class="level2">
<h2 class="anchored" data-anchor-id="the-cycle-continues">The Cycle Continues</h2>
<p>Here’s the part that surprises people: solving one constraint doesn’t fix the system forever. It reveals the <em>next</em> constraint. And that’s a feature, not a bug — because you always know exactly what to work on.</p>
<p>With Feature-Tier2 no longer the bottleneck, a new one emerged in our system: alert investigation. SOC analysts averaged 45 minutes per Tier-3 alert. This limited how many sophisticated attacks could be properly investigated. Applying the framework again:</p>
<ul>
<li><strong>Goal → Constraint</strong>: Reduce investigation time to 15 minutes while maintaining decision quality</li>
<li><strong>Constraint → Problem</strong>: Analysts manually correlate across multiple tools and data sources</li>
<li><strong>Problem → Conflict</strong>: Automated enrichment vs human judgment</li>
<li><strong>Conflict → Innovation</strong>: AI-assisted investigation that augments rather than replaces analysts</li>
<li><strong>Innovation → Experiment</strong>: Test on historical alerts with analyst feedback</li>
</ul>
<p>Each cycle makes the system more capable. Here’s where our NTA system ended up after one full pass through the framework:</p>
<table class="table">
<thead>
<tr class="header">
<th>Metric</th>
<th>Before</th>
<th>After</th>
<th>Change</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Detection time (p95)</td>
<td>47 min</td>
<td>3.2 min</td>
<td>15x faster</td>
</tr>
<tr class="even">
<td>Daily analyst alerts</td>
<td>847</td>
<td>11</td>
<td>98.7% reduction</td>
</tr>
<tr class="odd">
<td>Incidents missed/month</td>
<td>23</td>
<td>0</td>
<td>Eliminated</td>
</tr>
<tr class="even">
<td>Traffic coverage</td>
<td>30%</td>
<td>98%</td>
<td>Full visibility</td>
</tr>
<tr class="odd">
<td>Feature-Tier2 utilization</td>
<td>95%</td>
<td>42%</td>
<td>Headroom restored</td>
</tr>
</tbody>
</table>
<p>The constraint moved — from feature extraction to investigation to response automation — and each move represents the next opportunity for breakthrough improvement.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart LR
    A["&lt;b&gt;Goal &amp; Constraint&lt;/b&gt;&lt;br/&gt;SLOs + ledger"] --&gt; B["&lt;b&gt;Understand Why&lt;/b&gt;&lt;br/&gt;Root cause"]
    B --&gt; C["&lt;b&gt;Map Conflict&lt;/b&gt;&lt;br/&gt;Challenge assumptions"]
    C --&gt; D["&lt;b&gt;Innovate&lt;/b&gt;&lt;br/&gt;Best of both sides"]
    D --&gt; E["&lt;b&gt;Experiment&lt;/b&gt;&lt;br/&gt;MVE → rollout"]
    E --&gt; |"Constraint moves"| A

    style A fill:#e8f5e9,stroke:#333
    style B fill:#e3f2fd,stroke:#333
    style C fill:#fff3e0,stroke:#333
    style D fill:#fce4ec,stroke:#333
    style E fill:#f3e5f5,stroke:#333
</pre>
</div>
<p></p><figcaption> The Theory of Constraints is a cycle, not a line</figcaption> </figure><p></p>
</div>
</div>
</div>
<p>The Theory of Constraints isn’t another optimization technique. It’s an operating system for continuous improvement. The constraint keeps moving, but so do you.</p>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Most ML teams are stuck in optimization theater — tuning components that don’t govern system performance. The Theory of Constraints gives you a way out: find the one bottleneck that sets the ceiling, understand why it’s stuck, and design an innovation that breaks the tradeoff instead of compromising on it.</p>
<p>The method is five steps, but the discipline is one idea: <em>at any moment, only one thing limits your system.</em> Find it. Fix it. Then find the next one.</p>
<p>If you take one thing from this post, make it this: before your next “optimization” sprint, build the constraint ledger. Measure every stage. Find the row with high utilization and a growing queue. That’s where your effort belongs — and nowhere else.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>Goldratt, E. M. (1984). “The Goal: A Process of Ongoing Improvement.” North River Press.</li>
<li>Sculley, D., et al.&nbsp;(2015). “Hidden Technical Debt in Machine Learning Systems.” NIPS 2015.</li>
<li>Paleyes, A., et al.&nbsp;(2022). “Challenges in Deploying Machine Learning: A Survey of Case Studies.” ACM Computing Surveys.</li>
<li>Kleppmann, M. (2017). “Designing Data-Intensive Applications.” O’Reilly Media.</li>
<li>Polyzotis, N., et al.&nbsp;(2018). “Data Lifecycle Challenges in Production Machine Learning.” SIGMOD Record.</li>
<li>Chandola, V., et al.&nbsp;(2009). “Anomaly Detection: A Survey.” ACM Computing Surveys.</li>
<li>Ahmed, M., et al.&nbsp;(2016). “A Survey of Network Anomaly Detection Techniques.” Journal of Network and Computer Applications.</li>
<li>Sommer, R., &amp; Paxson, V. (2010). “Outside the Closed World: On Using Machine Learning for Network Intrusion Detection.” IEEE S&amp;P.</li>
</ul>


</section>

]]></description>
  <category>ML Systems</category>
  <guid>https://imaddabbura.github.io/posts/mlsys/improving-mlsys-theory-of-constraint.html</guid>
  <pubDate>Sun, 21 Sep 2025 05:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/mlsys/images/ml-toc.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Hard-Learned Lessons in Shipping Software (AI/ML) Projects</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/product-management/shipping-software-projects.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge growing">growing</span></div>
<section id="why-ml-projects-fail-to-ship" class="level2">
<h2 class="anchored" data-anchor-id="why-ml-projects-fail-to-ship">Why ML Projects Fail to Ship</h2>
<p>Some ML projects I’ve worked on shipped six months late. Others shipped and quietly died in production. A few never shipped at all — and those are the ones I keep coming back to. I’ve been through this as an individual contributor and as the person leading the team. For a long time I blamed the usual suspects: unclear requirements, technical debt, underestimating complexity. The real cause, I eventually realized, was more structural — and it looked the same from both seats.</p>
<p>A web feature has a clear definition of done: the button appears, the form submits, the data is saved. An ML feature doesn’t. “Improve recommendation accuracy” could mean another week of training runs, another architecture experiment, another round of feature engineering — indefinitely. Unlike traditional software, where the solution space is bounded by the spec, ML projects have an effectively unbounded search space. Every model can be made larger, every feature set more complete, every training run longer.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart TD
    subgraph trad ["Traditional Software"]
        direction LR
        t1["Define Feature"] --&gt; t2["Build"] --&gt; t3["Ship ✓"]
    end
    subgraph ml ["ML Without Constraints"]
        direction LR
        m1["Vague Goal"] --&gt; m2["Experiment #1"]
        m2 --&gt;|"+0.5% accuracy"| m3["Experiment #2"]
        m3 --&gt;|"+0.2% accuracy"| m4["Experiment #3"]
        m4 -.-&gt;|"one more try..."| m2
    end
</pre>
</div>
<p></p><figcaption> Traditional software has a bounded end state. ML projects without defined constraints loop indefinitely — each experiment looks like progress.</figcaption> </figure><p></p>
</div>
</div>
</div>
<p>This produces a predictable failure mode: a project that <em>looks</em> like it’s making progress — models training, experiments running, metrics moving — but never ships.</p>
<p>The root cause, I’ve come to believe, is a category error. ML projects sit uncomfortably between research and engineering. Research is unbounded by design — you keep going until you understand something. Engineering is bounded by design — you keep going until it ships. The teams that deliver consistently have made a deliberate choice about which one they’re doing. The ones that don’t, haven’t — and so they run a research process inside an engineering context, indefinitely.</p>
<p>What follows is what I’ve learned — sometimes from my own failures, sometimes from watching teams I was leading repeat patterns I’d already lived through — about how to actually make that choice.</p>
</section>
<section id="define-the-target-before-writing-a-line-of-code" class="level2">
<h2 class="anchored" data-anchor-id="define-the-target-before-writing-a-line-of-code">Define the Target Before Writing a Line of Code</h2>
<p>The most consistent mistake I’ve made — and watched others make — is starting before the goal is properly defined. It doesn’t feel like a mistake at the time. There’s energy, there’s a general direction, there’s a team ready to move. But you can’t constrain an undefined goal. The first structural requirement for shipping is a precise definition of version one — not the roadmap, not the vision, but version one.</p>
<p>Write it as a falsifiable criterion: <em>“A model that identifies churn risk 14 days in advance with precision ≥ 70% on the holdout set, deployable via the existing prediction service.”</em> That’s a definition. “Improve churn prediction” is not — it’s a direction, and directions don’t ship.</p>
<p>Before writing code, three questions force the definition:</p>
<p><strong>Who is this for, specifically?</strong> A customer-facing recommendation system for mobile users has different input distributions, latency constraints, and acceptable error modes than an internal analyst tool. “Users in general” means nobody in particular, and a system designed for nobody in particular gets spec’d indefinitely.</p>
<p><strong>What does version one accomplish — and what does it explicitly not do?</strong> The second half matters as much as the first. Scope creep in ML is insidious because experiments feel like progress: an additional feature, a new architecture variant, a cleaned edge case — each looks like forward motion. The out-of-scope list is what makes the in-scope list real.</p>
<p><strong>What are the success criteria, written down and falsifiable?</strong> Precision ≥ 0.70. Latency ≤ 100ms at p99. Deployable on the existing serving infrastructure. Criteria that can be verified make it possible to call the project done. Criteria that can’t — “good enough,” “production-ready,” “performs well” — guarantee the project never ends.</p>
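<p>Falsifiable criteria can literally be code. A hypothetical release gate, with names and thresholds purely illustrative, might look like:</p>

```python
# Version one ships only when every written-down criterion passes.
def meets_v1_criteria(metrics):
    criteria = {
        "precision":      lambda v: v >= 0.70,  # on the holdout set
        "p99_latency_ms": lambda v: v <= 100,   # at serving time
        "deployable":     lambda v: v is True,  # existing serving infrastructure
    }
    return all(check(metrics[name]) for name, check in criteria.items())

print(meets_v1_criteria({"precision": 0.73, "p99_latency_ms": 82, "deployable": True}))  # True
```

<p>The point is not the code but the property: anyone on the team can run the gate and get the same answer, which is what makes "done" callable.</p>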
<p>Working backwards from these answers also produces the project structure. Once you know what version one must accomplish, you can enumerate prerequisite questions: <em>What training data do we need? What does the evaluation harness look like? How does it plug into production?</em> Each answer either gets scheduled or gets cut. Vague goals don’t allow this decomposition — they keep the surface area perpetually open.</p>
</section>
<section id="make-time-the-constraint-not-scope" class="level2">
<h2 class="anchored" data-anchor-id="make-time-the-constraint-not-scope">Make Time the Constraint, Not Scope</h2>
<p>The natural instinct is to treat scope as fixed and deadline as flexible. This is exactly backwards.</p>
<p>Scope in an ML project is not fixed — it’s infinitely expandable. There’s always a better architecture to try, a cleaner way to handle edge cases, a feature that might help. Teams treat scope as the constraint because it feels <em>owned</em>: the team wrote the requirements, agreed on them, and changing them feels like abandoning a commitment. Deadlines, by contrast, feel externally imposed and therefore more negotiable — something to push when the work “isn’t ready yet.”</p>
<p>I’ve been in this meeting many times — sometimes as the engineer watching the deadline move, more often as the person responsible for it. The deadline slips, then slips again, then becomes a standing item on the weekly status call.</p>
<p>Flip the constraint. Treat the deadline as fixed and scope as the variable. This changes the question from <em>“when will we be done with everything we planned?”</em> to <em>“what’s the most important thing we can ship by this date?”</em> The second question forces real prioritization. The model that trains in four hours ships; the one that takes 24 hours doesn’t. The feature built on existing infrastructure stays; the one that requires a new data pipeline gets cut.</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Deadlines as a Design Tool
</div>
</div>
<div class="callout-body-container callout-body">
<p>A deadline doesn’t dictate quality — it dictates scope. The discipline is specifically about protecting the deadline from <em>scope expansion</em>, not accelerating the work. When a new requirement surfaces mid-project, the question isn’t <em>“can we fit it in?”</em> — it’s <em>“what does it displace?”</em></p>
</div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Version One Is Supposed to Be Small
</div>
</div>
<div class="callout-body-container callout-body">
<p>Version one of most production models is smaller, faster, and more constrained than anything the team initially imagined. Good. The goal of version one isn’t to build the best possible system — it’s to establish the deployment path, validate production integration, and generate real usage data. The best possible system comes later, built on what version one teaches you.</p>
</div>
</div>
</section>
<section id="decompose-until-done-is-unambiguous" class="level2">
<h2 class="anchored" data-anchor-id="decompose-until-done-is-unambiguous">Decompose Until Done Is Unambiguous</h2>
<p>Once you have a target and a deadline, break the project into tasks — not work items, not epics, tasks. The distinction matters: a task has an unambiguous definition of done. A project doesn’t.</p>
<p>“Train a production NLP model” is a project. Tasks are:</p>
<ul>
<li><em>“Label 500 training examples from the January logs”</em> — done or not done.</li>
<li><em>“Achieve F1 ≥ 0.82 on the validation split”</em> — done or not done.</li>
<li><em>“Write the endpoint that accepts raw text and returns a classification”</em> — done or not done.</li>
</ul>
<p>If you can’t tell whether a piece of work is finished without discussion, break it down further.</p>
<p>With tasks in hand, ruthlessly prioritize against the version one criteria. Not everything is equally important, and pretending otherwise is how projects stall:</p>
<table class="table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Category</th>
<th>Description</th>
<th>Rule</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Must-have</strong></td>
<td>System cannot ship without this</td>
<td>Do first, never cut</td>
</tr>
<tr class="even">
<td><strong>Should-have</strong></td>
<td>Meaningfully improves the product</td>
<td>Include if time allows</td>
</tr>
<tr class="odd">
<td><strong>Nice-to-have</strong></td>
<td>Incremental gain, no blocker</td>
<td>Version two</td>
</tr>
<tr class="even">
<td><strong>Gold-plating</strong></td>
<td>No clear user benefit</td>
<td>Cut immediately</td>
</tr>
</tbody>
</table>
<p>The failure mode is treating “should-haves” as “must-haves.” It happens because, deep down, the team doesn’t believe version two is coming. If this feels like the only shot, every improvement feels essential. But that’s exactly backwards: version two only exists because version one shipped. Holding version one hostage to version two’s requirements is how you guarantee neither does.</p>
</section>
<section id="validate-the-core-assumption-before-building-the-system" class="level2">
<h2 class="anchored" data-anchor-id="validate-the-core-assumption-before-building-the-system">Validate the Core Assumption Before Building the System</h2>
<p>Every ML project rests on a single load-bearing assumption: <em>“Does a model trained on this data actually produce useful predictions for this problem?”</em> Everything else — the serving infrastructure, the feature pipeline, the retraining loop, the monitoring dashboard — only matters if the answer is yes.</p>
<p>I’ve fallen into this trap early in my career, and led teams into it later. The pattern is always the same: stand up a feature store, design a training pipeline, architect a serving layer — then train the model and discover the data doesn’t support the prediction task, or the signal is too weak, or the problem is better solved without ML at all. Months of infrastructure work, none of it applicable to the revised approach. The infrastructure trap is just as easy to fall into when you’re the one setting the direction as when you’re the one doing the building.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart TD
    subgraph wrong ["❌ Common Mistake"]
        direction LR
        w1["Feature Store"] --&gt; w2["Training Pipeline"] --&gt; w3["Model Registry"] --&gt; w4["Model"] --&gt; w5["Works?"]
    end
    subgraph right ["✓ Correct Order"]
        direction LR
        r1["Validate\nApproach"] --&gt; r2["Establish\nDeploy Path"] --&gt; r3["Build\nInfrastructure"] --&gt; r4["Ship ✓"]
    end
</pre>
</div>
<p></p><figcaption> The infrastructure trap: building the full system before validating the approach. The correct order validates cheaply first, then invests in infrastructure.</figcaption> </figure><p></p>
</div>
</div>
</div>
<p>The minimum viable experiment: train a simple baseline on a slice of the data, evaluate it against a manually-labeled holdout, and show the results to at least one person who’d actually use the output. Logistic regression, a small neural net, a fine-tuned pretrained model — whatever takes days, not months. If the results are promising, the infrastructure investment is justified. If not, you’ve learned the most important thing about the project for the cost of two weeks.</p>
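<p>The shape of such a baseline, sketched with synthetic data standing in for the real slice (every name and number here is illustrative):</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for "a slice of the data": 5 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=0
)
baseline = LogisticRegression().fit(X_train, y_train)
precision = precision_score(y_holdout, baseline.predict(X_holdout))

# If a days-not-months baseline clears the bar, infrastructure is justified.
print(f"holdout precision: {precision:.2f}")
```

<p>Swap in the real data slice and the real success metric; the structure — simple model, held-out evaluation, explicit bar — is the whole experiment.</p>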
<p>This also determines tool choice during validation. <code>scikit-learn</code>, <code>PyTorch</code>, pre-trained transformers from HuggingFace — these represent thousands of engineering hours and are battle-tested at scale. Custom architectures and bespoke training loops are justified when profiling data shows standard tools can’t meet your requirements. That data doesn’t exist before validation. Building custom infrastructure before validating the approach is the fastest way to spend six months on something nobody uses.</p>
</section>
<section id="ship-then-compound" class="level2">
<h2 class="anchored" data-anchor-id="ship-then-compound">Ship, Then Compound</h2>
<p>Once version one meets the criteria, ship it — even if it’s not perfect.</p>
<p>Every model I’ve shipped has surprised me in production. Not because the evaluation was wrong, but because it was measuring the wrong things. Your holdout set measures what you measured. Real users do things you didn’t anticipate — edge cases you didn’t label, inputs from distributions you didn’t sample, and above all, they surface which errors actually matter. A model that’s 92% accurate on the evaluation set might be systematically wrong on the 8% of inputs that are disproportionately important to users. You won’t know that until the model is deployed.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart LR
    A["Ship\nImperfect v1"] --&gt; B["Real\nUsage Data"]
    B --&gt; C["Discover\nActual Failures"]
    C --&gt; D["Targeted\nFixes"]
    D --&gt; E["Ship\nBetter v2"]
    E --&gt; B
</pre>
</div>
<p></p><figcaption> The iteration flywheel: each shipped version surfaces real failures that targeted improvements address, compounding over time.</figcaption> </figure><p></p>
</div>
</div>
</div>
<p>Ship versions that meet the bar, not versions that approach some imagined ceiling. Version one will be wrong in ways you didn’t anticipate — I’ve never shipped one that wasn’t, and I’ve never led a team that did. One of the harder things about leading engineers through this is convincing them that shipping something imperfect isn’t a compromise — it’s the whole point. The failures you discover in production are the ones that matter. Find them early, when fixing them is fast, not late, when the system is load-bearing and everything is entangled.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Each Version Enables the Next
</div>
</div>
<div class="callout-body-container callout-body">
<p>Real usage reveals the failures that matter — not the ones you hypothesized in the design doc, but the ones users actually encounter. Their feedback tells you which improvements are worth making. Infrastructure built for version one scales to version two. The teams that ship consistently aren’t the ones with better planning processes — they’re the ones who’ve completed more cycles of this loop.</p>
</div>
</div>
</section>
<section id="key-takeaways" class="level2">
<h2 class="anchored" data-anchor-id="key-takeaways">Key Takeaways</h2>
<ol type="1">
<li><p><strong>Decide whether you’re doing research or engineering before you start.</strong> ML projects that don’t make this distinction run research processes in engineering contexts — indefinitely.</p></li>
<li><p><strong>Define version one as a falsifiable criterion.</strong> Precision ≥ X. Latency ≤ Y. Deployable on Z. Criteria that can’t be verified guarantee the project never ends.</p></li>
<li><p><strong>Treat deadline as fixed, scope as variable.</strong> The question is always: <em>“What’s the most important thing we can ship by this date?”</em></p></li>
<li><p><strong>Decompose until done is unambiguous.</strong> If you can’t tell whether a task is finished without discussion, it’s not a task — it’s a project.</p></li>
<li><p><strong>Validate the core assumption before building infrastructure.</strong> Does the model work on this data? Answer that first, with the simplest possible tools. Everything else comes after.</p></li>
<li><p><strong>Ship the imperfect version.</strong> Offline evaluation measures what you measured. Real usage reveals what you missed. Each shipped version enables the next.</p></li>
</ol>


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Machine Learning</category>
  <category>Deep Learning</category>
  <category>Software Engineering</category>
  <guid>https://imaddabbura.github.io/posts/product-management/shipping-software-projects.html</guid>
  <pubDate>Sun, 05 Jan 2025 06:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/product-management/shipping-software-projects.png" medium="image" type="image/png" height="84" width="144"/>
</item>
<item>
  <title>From Forgetting to Fluency: How to Learn Smarter, Not Harder</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/personal-growth/notes-on-learning.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge growing">growing</span></div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>In today’s fast-paced world, the ability to learn effectively and retain information has become more crucial than ever. Whether you’re a student preparing for exams, a professional mastering new skills, or simply someone seeking personal growth, finding ways to optimize learning and improve memory recall can make a significant difference in achieving your goals.</p>
<p>Fortunately, research in neuroscience and cognitive psychology has shed light on strategies that enhance how we absorb and retain knowledge. There are numerous methods to make learning not only more efficient but also more enjoyable. This article explores a variety of evidence-based approaches to optimize learning and strengthen recall and hopefully you’ll find actionable insights to elevate your learning journey.</p>
</section>
<section id="learning-techniques" class="level2">
<h2 class="anchored" data-anchor-id="learning-techniques">Learning Techniques</h2>
<ul>
<li>Effective instruction should match the content, not the learner’s preferred style. For example, cooking instruction should use hands-on practice even if the student is a visual learner.</li>
<li>Learning means that a change has been made to long-term memory.</li>
<li>Human memory is not as precise or reliable as computer memory. It behaves as <code>read-and-update</code>: recalling information strengthens and modifies what is fetched, especially if it was learned recently.</li>
<li>Information is stored in interconnected neural pathways. When we try to access a target piece of information, we activate a pathway of neurons, which spreads activation to other connected pathways that may be unrelated to the target. This spreading activation leaves related pathways primed for hours. As a result:
<ul>
<li>Spreading activation causes related but imprecise information to be conflated with the target information, which makes recall unreliable.</li>
<li>Because pathways stay primed for hours, stepping away to work on something else, go for a walk, or take a shower can help with problem solving: the two seemingly unrelated areas connect in the middle.</li>
</ul></li>
<li>There are two types of memory:
<ul>
<li><strong>Long-term memory</strong>, where information is stored permanently and capacity is effectively unlimited. It is analogous to disk storage.</li>
<li><strong>Working (short-term) memory</strong> is used to solve problems. It is analogous to CPU’s registers. The bigger the working memory, the faster we can learn. It is roughly fixed at birth.</li>
</ul></li>
<li><strong>Chunking</strong> is when we relate information together as one piece. The more we combine information as one piece (chunk), the easier it is to reason about and solve problems related to it. This is due to the fact that we can store pointers to such chunks in the working memory and access such chunks in long-term memory if needed. Therefore, it is critical to decompose difficult tasks into smaller pieces (chunks) when learning, which later will be chunked together as we practice.</li>
<li>The difference between experts and beginners is that experts remember and recognize patterns to help them solve problems. However, beginners read code line by line to understand what it is doing or how to approach solving problems. Therefore, to achieve proficiency in programming, you need to read/write and work with a lot of code to be exposed to more patterns as well as programming using different programming paradigms/languages.</li>
<li>To understand a concept, we may need to go from the abstract to a diverse set of concrete examples and back to the abstract. This helps with chunking: we treat all the concrete examples as different views of the abstract concept. Once we understand the concrete examples, we can connect them back to the abstract concept.</li>
<li><strong>Spacing</strong> and <strong>repetition</strong> are key to learning. We learn problem-solving concepts best by spacing practice across multiple sessions, multiple days, and ideally multiple weeks. Practice helps us connect the text of a problem to the underlying concept and apply that concept to solve the problem.</li>
<li>Concentrating for more than <strong>90 minutes</strong> is hard because of shifts in the brain’s neurochemical balance. It is best to rest, sleep, or walk after those 90 minutes so the information gets consolidated; don’t switch to other tasks, talk to others, or browse the internet.</li>
<li>Even though we can look information up on the internet, it is worth understanding the concepts we deal with frequently so the brain can form connections that support deeper understanding. It is also much better to try to recall information from long-term memory than to search for it, especially if we are not yet experts.</li>
<li>Problem-solving is not a generic skill; it is domain-specific. A good chess player may not be a good problem-solver in programming or other domains. To get better at programming problem-solving, practice solving programming problems.</li>
<li>There is no clear predictor of programming ability other than experience.</li>
<li>A growth mindset, learning to overcome setbacks and failures, and practice are what you need to succeed in your career. Keep evaluating your learning strategies to get the best outcomes.</li>
</ul>
</section>
<section id="recommendations" class="level2">
<h2 class="anchored" data-anchor-id="recommendations">Recommendations</h2>
<ul>
<li>For recruiting:
<ul>
<li>There are no good proxies for programming ability; look at candidates’ previous work or test them on authentic programming tasks.</li>
<li>At least among young developers, years of experience may not be a very reliable measure of ability.</li>
</ul></li>
<li>For learning and training:
<ul>
<li>Reading a lot of code will help you become a more efficient programmer.</li>
<li>Experts are not always the best at training beginners.</li>
<li>Learning takes time, including time between learning sessions. Intense cramming is not effective, but spaced repetition is.</li>
<li>Similarly, spending time away from a problem can help to solve it.</li>
<li>Just because you can find it through an Internet search or generative AI tool does not mean learning has become obsolete.</li>
<li>Use examples to go between abstract concepts and concrete learnable facts.</li>
<li>Seeking to succeed (rather than avoid failure) and believing that ability is changeable are important factors in resilience and learning.</li>
</ul></li>
</ul>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>As we’ve explored various approaches to optimize learning and enhance recall, it’s clear that the journey to effective learning is both dynamic and multifaceted. By incorporating evidence-based strategies such as spaced repetition and active engagement, individuals can significantly improve their ability to absorb and retain information.</p>
<p>Embracing a growth mindset and being open to adapting your strategies will further empower you to navigate the complexities of acquiring knowledge. In conclusion, by implementing these innovative approaches, you can transform your learning experience into a more productive and rewarding endeavor. Start applying these insights today and watch as your ability to learn and recall information flourishes!</p>
</section>
<section id="further-reading" class="level2">
<h2 class="anchored" data-anchor-id="further-reading">Further Reading</h2>
<ul>
<li><em>Why Don’t Students Like School?</em> by Daniel T. Willingham provides a short and readable explanation of many of the principles of memory and how the brain works.</li>
<li><em>The Programmer’s Brain</em> by Felienne Hermans relates these concepts to programming and describes how techniques for learning and revision that are used at school can still apply to professional development.</li>
<li><em>How Learning Happens: Seminal Works in Educational Psychology and What They Mean in Practice</em> by Paul A. Kirschner and Carl Hendrick provides a tour through influential papers, explaining them in plain language and the implications and linkages between them.</li>
<li><a href="https://cacm.acm.org/magazines/2024/1/278891-10-things-software-developers-should-learn-about-learning/fulltext"><em>10 Things Software Developers Should Learn about Learning</em></a> by Neil C. C. Brown, Felienne F. J. Hermans, and Lauren E. Margulieux.</li>
</ul>


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Problem Solving</category>
  <guid>https://imaddabbura.github.io/posts/personal-growth/notes-on-learning.html</guid>
  <pubDate>Fri, 13 Sep 2024 05:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/personal-growth/learning.svg" medium="image" type="image/svg+xml"/>
</item>
<item>
  <title>Why Your Final Layer Shouldn’t Have Softmax</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/dl/why-not-softmax.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge evergreen">evergreen</span></div>
<section id="a-common-mistake-thats-hard-to-see" class="level2">
<h2 class="anchored" data-anchor-id="a-common-mistake-thats-hard-to-see">A Common Mistake That’s Hard to See</h2>
<p>If you’ve built a classifier in PyTorch, you’ve probably seen <code>nn.Softmax</code> and <code>nn.CrossEntropyLoss</code> in the same codebase. You may have even used them together — softmax at the end of the model, cross-entropy as the loss. The code runs, the loss decreases, the model converges. Everything looks fine.</p>
<p>But something is wrong. <code>nn.CrossEntropyLoss</code> already applies softmax internally. Adding softmax as the model’s final layer means it is applied twice, once by the model and once inside the loss, and backpropagation then computes the gradients of the wrong function. The model still learns, just more slowly, less stably, and to a worse optimum.</p>
<p>This post unpacks <em>why</em> — starting with what softmax actually does, then working through the numerical stability mechanism that motivates keeping raw logits, and finishing with a clear picture of when softmax belongs and when it doesn’t.</p>
</section>
<section id="what-softmax-does" class="level2">
<h2 class="anchored" data-anchor-id="what-softmax-does">What Softmax Does</h2>
<p>The softmax function takes a vector of raw scores — <strong>logits</strong> — and squashes them into a probability distribution:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bsoftmax%7D(z_i)%20=%20%5Cfrac%7Be%5E%7Bz_i%7D%7D%7B%5Csum_j%20e%5E%7Bz_j%7D%7D"></p>
<p>The outputs are in <img src="https://latex.codecogs.com/png.latex?%5B0,%201%5D"> and sum to 1. For a ten-class classifier, softmax turns a vector like <img src="https://latex.codecogs.com/png.latex?%5B2.1,%5C%20-0.3,%5C%200.8,%5C%20%5Cldots%5D"> into a proper probability distribution over the ten classes. This seems like exactly the right thing to do before computing a loss that expects probabilities.</p>
<p>The problem isn’t what softmax does. It’s <em>where</em> you do it — and whether the operation downstream already does it better.</p>
</section>
<section id="the-log-sum-exp-trick" class="level2">
<h2 class="anchored" data-anchor-id="the-log-sum-exp-trick">The Log-Sum-Exp Trick</h2>
<p>To understand why <code>CrossEntropyLoss</code> wants raw logits, we need to look at what it computes. Cross-entropy loss for the true class <img src="https://latex.codecogs.com/png.latex?y"> is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%20=%20-%5Clog%5Cleft(%5Cfrac%7Be%5E%7Bz_y%7D%7D%7B%5Csum_j%20e%5E%7Bz_j%7D%7D%5Cright)%20=%20-z_y%20+%20%5Clog%5Csum_j%20e%5E%7Bz_j%7D"></p>
<p>The second term — <img src="https://latex.codecogs.com/png.latex?%5Clog%5Csum_j%20e%5E%7Bz_j%7D"> — is the <strong>log-sum-exp (LSE)</strong>, and it’s numerically dangerous. If any logit is large, the exponent overflows to <code>inf</code> before the log can bring it back down:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb1-2">z <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.tensor([<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1001.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1002.0</span>])</span>
<span id="cb1-3">torch.softmax(z, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># → tensor([nan, nan, nan])</span></span></code></pre></div>
<p>The standard fix is the <strong>log-sum-exp trick</strong>: subtract the maximum logit before exponentiating.</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Clog%5Csum_j%20e%5E%7Bz_j%7D%20=%20c%20+%20%5Clog%5Csum_j%20e%5E%7Bz_j%20-%20c%7D,%20%5Cquad%20c%20=%20%5Cmax_j%20z_j"></p>
<p>Subtracting <img src="https://latex.codecogs.com/png.latex?c%20=%20%5Cmax_j%20z_j"> keeps all terms in <img src="https://latex.codecogs.com/png.latex?%5Be%5E%7B-%5Cinfty%7D,%5C%201%5D"> — never overflowing, never underflowing. The mathematical result is identical; the numerical result is stable.</p>
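<p>A minimal sketch makes the difference concrete, using the same overflow-prone logits as above. The shifted computation agrees with PyTorch’s built-in <code>torch.logsumexp</code>:</p>

```python
import torch

z = torch.tensor([1000.0, 1001.0, 1002.0])

# Naive log-sum-exp: exp(1000) overflows float32 to inf, so the result is inf.
naive = torch.log(torch.exp(z).sum())

# Shifted version: subtract the max first, so every exponent is <= 0.
c = z.max()
stable = c + torch.log(torch.exp(z - c).sum())

print(naive)   # tensor(inf)
print(stable)  # tensor(1002.4076), matches torch.logsumexp(z, dim=0)
```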
<p>This is exactly what <code>nn.CrossEntropyLoss</code> does internally. It doesn’t apply softmax and then compute cross-entropy — it <strong>fuses both operations</strong> into one numerically stable pass using the LSE trick. Passing raw logits is what makes this possible.</p>
<p>If you apply softmax first, the loss function receives <img src="https://latex.codecogs.com/png.latex?p_i%20=%20e%5E%7Bz_i%7D/%5Csum%20e%5E%7Bz_j%7D"> instead of raw logits and then applies its own log-softmax to those values — effectively computing <img src="https://latex.codecogs.com/png.latex?%5Clog(%5Ctext%7Bsoftmax%7D(%5Ctext%7Bsoftmax%7D(z)))">. The numbers are wrong and the gradients are wrong.</p>
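<p>A quick sanity check, here sketched with random logits, shows the mismatch directly: feeding pre-softmaxed values to <code>nn.CrossEntropyLoss</code> yields a different loss (and therefore different gradients) than feeding raw logits:</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 10)           # batch of 4, 10 classes
targets = torch.tensor([0, 3, 7, 2])

loss_fn = nn.CrossEntropyLoss()
correct = loss_fn(logits, targets)                         # raw logits: correct
double = loss_fn(torch.softmax(logits, dim=-1), targets)   # softmax applied twice

# The two losses differ: the second optimizes the wrong objective.
```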
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart LR
    subgraph bad ["❌ Double Application"]
        direction LR
        z1["Logits z"] --&gt; s1["nn.Softmax"] --&gt; p1["Probs p"] --&gt; ce1["CrossEntropyLoss\nlog-softmax(p)"]
    end
    subgraph good ["✅ Correct"]
        direction LR
        z2["Logits z"] --&gt; ce2["CrossEntropyLoss\nlog-softmax(z) — fused, stable"]
    end
</pre>
</div>
<p></p><figcaption> Left: pre-applying softmax breaks the fused computation, producing gradients of the wrong function. Right: raw logits let CrossEntropyLoss apply the log-sum-exp trick internally.</figcaption> </figure><p></p>
</div>
</div>
</div>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Practical Rule
</div>
</div>
<div class="callout-body-container callout-body">
<p><code>nn.CrossEntropyLoss</code> (PyTorch) and <code>tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)</code> (TensorFlow) both expect <strong>raw logits</strong>. The loss handles the stable, fused computation internally. Don’t apply softmax to the final layer of a classifier.</p>
</div>
</div>
</section>
<section id="multi-label-classification-the-wrong-prior" class="level2">
<h2 class="anchored" data-anchor-id="multi-label-classification-the-wrong-prior">Multi-Label Classification: The Wrong Prior</h2>
<p>Softmax enforces <strong>competition</strong> between classes: increasing one class’s probability necessarily decreases the others. This is the correct structure for single-label tasks — exactly one class is true — and entirely the wrong structure for multi-label tasks, where multiple classes can be true simultaneously.</p>
<p>Consider a document classifier that assigns topics like “machine learning,” “software engineering,” and “career advice.” A document can belong to all three. Softmax forces these to compete: pushing “machine learning” up automatically pushes the others down. The model is fighting its own output structure.</p>
<p>There’s a deeper problem. Because softmax outputs always sum to 1, <strong>the model is structurally forced to predict high confidence for exactly one class</strong> — regardless of the input. If an image contains no objects from the training categories, softmax still redistributes its probability mass across the classes and picks a winner. If an image contains three objects, softmax still collapses to one. It has no way to say “multiple things are present” or “nothing relevant is here” — the sum-to-one constraint makes both answers impossible.</p>
<p>For multi-label classification, the correct output is <strong>sigmoid</strong>, applied independently per class:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Csigma(z_i)%20=%20%5Cfrac%7B1%7D%7B1%20+%20e%5E%7B-z_i%7D%7D"></p>
<p>Each output is an independent probability in <img src="https://latex.codecogs.com/png.latex?%5B0,%201%5D"> with no constraint that they sum to 1. Use <code>nn.BCEWithLogitsLoss</code> — which applies sigmoid internally with the same kind of numerical stability fusion — rather than sigmoid in the model followed by <code>nn.BCELoss</code>.</p>
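<p>As a minimal sketch of the multi-label setup (the three-topic labels here are illustrative), the model emits raw logits, the loss fuses the sigmoid internally, and thresholding happens per class at inference:</p>

```python
import torch
import torch.nn as nn

# Hypothetical 3-topic document: "ML" and "career advice" true, "SWE" false.
logits = torch.tensor([[2.0, -1.0, 0.5]])    # raw logits, no sigmoid in the model
targets = torch.tensor([[1.0, 0.0, 1.0]])

loss_fn = nn.BCEWithLogitsLoss()             # fuses sigmoid + BCE stably
loss = loss_fn(logits, targets)

# At inference, apply sigmoid per class and threshold each label independently.
probs = torch.sigmoid(logits)
predicted = probs > 0.5                      # [[True, False, True]]
```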
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Choosing the Right Output Layer
</div>
</div>
<div class="callout-body-container callout-body">
<table class="table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Task</th>
<th>Loss function</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Single-label classification</td>
<td><code>nn.CrossEntropyLoss</code></td>
<td>Expects raw logits; applies log-softmax internally</td>
</tr>
<tr class="even">
<td>Multi-label classification</td>
<td><code>nn.BCEWithLogitsLoss</code></td>
<td>Expects raw logits; applies sigmoid internally</td>
</tr>
<tr class="odd">
<td>Binary classification</td>
<td><code>nn.BCEWithLogitsLoss</code></td>
<td>Same as above</td>
</tr>
<tr class="even">
<td>Probabilities at inference</td>
<td>Apply <code>softmax</code> <em>after</em> training</td>
<td>Not during training</td>
</tr>
</tbody>
</table>
</div>
</div>
</section>
<section id="softmax-and-overconfidence" class="level2">
<h2 class="anchored" data-anchor-id="softmax-and-overconfidence">Softmax and Overconfidence</h2>
<p>Softmax is sensitive to the <strong>scale</strong> of the logits, not just their relative ordering. Logits <img src="https://latex.codecogs.com/png.latex?%5B3,%5C%201,%5C%200%5D"> and <img src="https://latex.codecogs.com/png.latex?%5B300,%5C%20100,%5C%200%5D"> produce the same ranking but very different softmax outputs — the scaled version concentrates nearly all probability mass on the top class. As training progresses, logit magnitudes tend to grow, and softmax increasingly exaggerates these differences.</p>
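<p>The scale sensitivity is easy to see directly; the two logit vectors below have the same ranking but differ in scale by 100x:</p>

```python
import torch

a = torch.tensor([3.0, 1.0, 0.0])
b = torch.tensor([300.0, 100.0, 0.0])   # same ranking, 100x the scale

pa = torch.softmax(a, dim=0)   # ~[0.84, 0.11, 0.04]: moderate confidence
pb = torch.softmax(b, dim=0)   # ~[1.00, 0.00, 0.00]: near-certain top class
```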
<p>The result is systematic overconfidence: a model that outputs near-100% probability on examples it gets wrong. The <a href="https://arxiv.org/abs/1706.04599">Guo et al.&nbsp;2017 calibration paper</a> showed this is a consistent property of modern neural networks, not a training artifact.</p>
<p>The standard fix is <strong>temperature scaling</strong>: divide logits by a learned scalar <img src="https://latex.codecogs.com/png.latex?T%20%3E%201"> before applying softmax at inference time.</p>
<p><img src="https://latex.codecogs.com/png.latex?p_i%20=%20%5Ctext%7Bsoftmax%7D(z_i%20/%20T)"></p>
<p><img src="https://latex.codecogs.com/png.latex?T%20%3E%201"> flattens the distribution (less confident); <img src="https://latex.codecogs.com/png.latex?T%20%3C%201"> sharpens it. <img src="https://latex.codecogs.com/png.latex?T"> is fit on a held-out validation set after training finishes. Crucially, this only works if the model was trained on raw logits — the scale information that temperature scaling adjusts is preserved through training and only consumed at inference.</p>
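<p>Below is a minimal sketch of fitting the temperature on held-out data by minimizing negative log-likelihood. The synthetic setup is an assumption for illustration; Guo et al.&nbsp;fit with L-BFGS, but the optimizer choice is incidental:</p>

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=300, lr=0.05):
    """Fit a single scalar temperature by minimizing validation NLL."""
    log_t = torch.zeros(1, requires_grad=True)   # parameterize T = exp(log_t) > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(val_logits / log_t.exp(), val_labels).backward()
        opt.step()
    return log_t.exp().item()

# Synthetic check: labels are drawn from softmax(z), but the "model" reports
# 5 * z, i.e. it is overconfident by a factor of 5. T should land near 5.
torch.manual_seed(0)
z = torch.randn(2000, 10)
labels = torch.distributions.Categorical(logits=z).sample()
T = fit_temperature(5.0 * z, labels)
```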
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Calibration Connection
</div>
</div>
<div class="callout-body-container callout-body">
<p>Post-hoc calibration methods (temperature scaling, Platt scaling, isotonic regression) all operate on the raw logit magnitudes that accumulate through training. If your output layer applies softmax during training, the scale information is destroyed before calibration is attempted — the calibration methods have nothing useful to fit.</p>
</div>
</div>
</section>
<section id="when-softmax-belongs" class="level2">
<h2 class="anchored" data-anchor-id="when-softmax-belongs">When Softmax Belongs</h2>
<p>Removing softmax from the final classification layer doesn’t mean it’s always wrong — it means the structure it imposes (mutual exclusivity, sum-to-one) has to match what the computation actually needs.</p>
<p><strong>Attention mechanisms.</strong> The scaled dot-product attention in Transformers applies softmax to produce a distribution over positions. This is exactly right: each query should distribute its weight across keys, and the competition structure is intentional. There’s no fused loss downstream computing log-softmax again.</p>
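<p>For contrast, here is a minimal sketch of scaled dot-product attention, where the softmax is exactly right: each query’s weights over key positions sum to 1 by design, and no downstream loss re-applies it:</p>

```python
import math
import torch

def attention(q, k, v):
    """Scaled dot-product attention: softmax over key positions per query."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)   # each query's weights sum to 1
    return weights @ v, weights

torch.manual_seed(0)
q = torch.randn(2, 4, 8)   # (batch, queries, dim)
k = torch.randn(2, 6, 8)   # (batch, keys, dim)
v = torch.randn(2, 6, 8)
out, w = attention(q, k, v)  # out: (2, 4, 8); w rows sum to 1
```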
<p><strong>Contrastive learning.</strong> Methods like CLIP apply softmax across the batch as part of the contrastive loss. The within-batch competition is the learning signal.</p>
<p><strong>Inference-time probabilities.</strong> If downstream code requires calibrated probabilities — confidence thresholds, ensemble averaging, displaying to users — apply softmax to the final logits after the forward pass, outside the model:</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb2-2">    logits <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(x)</span>
<span id="cb2-3">    probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.softmax(logits, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span></code></pre></div>
<p>The pattern: softmax belongs when the distribution semantics genuinely fit the computation, and when nothing downstream is already computing a fused version of it.</p>
</section>
<section id="key-takeaways" class="level2">
<h2 class="anchored" data-anchor-id="key-takeaways">Key Takeaways</h2>
<ol type="1">
<li><p><strong>Don’t apply softmax in your model’s final layer for classification.</strong> <code>nn.CrossEntropyLoss</code> expects raw logits and applies a fused, numerically stable log-softmax internally using the log-sum-exp trick. Pre-applying softmax computes gradients of the wrong function.</p></li>
<li><p><strong>The numerical instability is real and silent.</strong> Large logits overflow naive softmax — you get <code>nan</code> losses and corrupted gradients, often without a clear error. The fused implementation avoids this entirely.</p></li>
<li><p><strong>Multi-label tasks need sigmoid, not softmax.</strong> Softmax enforces mutual exclusivity. For tasks where multiple labels are simultaneously valid, use <code>nn.BCEWithLogitsLoss</code> with raw logits.</p></li>
<li><p><strong>Overconfidence is a logit scale problem.</strong> Softmax exaggerates differences as magnitudes grow through training. Temperature scaling is the standard fix — but only if raw logit scale is preserved through training.</p></li>
<li><p><strong>Softmax has legitimate uses.</strong> Attention weights, contrastive losses, and inference-time probability outputs are correct applications. The question is always whether competition semantics fit the problem, and whether a fused stable implementation already handles the math downstream.</p></li>
</ol>
</section>
<section id="resources" class="level2">
<h2 class="anchored" data-anchor-id="resources">Resources</h2>
<ol type="1">
<li><a href="https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html"><strong>PyTorch Documentation — CrossEntropyLoss</strong></a> — Documents why raw logits are expected and how log-softmax is fused internally.</li>
<li><a href="https://arxiv.org/abs/1706.04599"><strong>On Calibration of Modern Neural Networks</strong></a> — Guo et al.&nbsp;on systematic softmax overconfidence and temperature scaling as the practical fix.</li>
<li><a href="https://www.deeplearningbook.org/contents/mlp.html"><strong>Deep Learning Book — Chapter 6</strong></a> — Goodfellow et al.&nbsp;on output units and loss function design for classification.</li>
</ol>


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Deep Learning</category>
  <guid>https://imaddabbura.github.io/posts/dl/why-not-softmax.html</guid>
  <pubDate>Sun, 09 Jun 2024 05:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/dl/images/softmax-img.png" medium="image" type="image/png" height="79" width="144"/>
</item>
<item>
  <title>Cutting the Fat: A Practical Guide to Neural Network Pruning</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/efficient-ml/pruning-dl-models.html</link>
  <description><![CDATA[ 





<p>Neural network pruning is a critical optimization technique used to enhance the efficiency of deep learning models by systematically removing unnecessary parameters, such as weights or neurons, while maintaining model performance. This technique is particularly important because memory access and movement are extremely expensive operations in terms of both latency and energy consumption.</p>
<section id="why-do-we-need-pruning" class="level2">
<h2 class="anchored" data-anchor-id="why-do-we-need-pruning">Why Do We Need Pruning?</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="pruning-optimization.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://imaddabbura.github.io/posts/efficient-ml/pruning-optimization.png" class="quarto-figure quarto-figure-center figure-img" width="800" height="500"></a></p>
</figure>
</div>
<p>The primary objective of pruning can be formalized as minimizing a loss function <img src="https://latex.codecogs.com/png.latex?L(W_P)">, where <img src="https://latex.codecogs.com/png.latex?W"> represents the original weights and <img src="https://latex.codecogs.com/png.latex?W_P"> the pruned weights, subject to the constraint that the number of non-zero weights (<img src="https://latex.codecogs.com/png.latex?%7C%7CW_P%7C%7C_0">) stays below a threshold <img src="https://latex.codecogs.com/png.latex?N">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Carg%5Cmin_%7BW_P%7D%20L(x,%20W_P)"></p>
<p>subject to <img src="https://latex.codecogs.com/png.latex?%7C%7CW_P%7C%7C_0%20%3C%20N"></p>
<p>This optimization leads to sparse weight matrices, which can significantly reduce:</p>
<ul>
<li>Model size</li>
<li>Memory footprint</li>
<li>Computational complexity</li>
<li>Energy consumption</li>
</ul>
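<p>As an illustration of the simplest instance of this objective, magnitude-based (L1) pruning zeroes the smallest-magnitude weights; PyTorch ships a utility for this in <code>torch.nn.utils.prune</code>. A minimal sketch on a throwaway layer:</p>

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)   # throwaway layer for illustration

# Magnitude (L1) pruning: zero out the 50% of weights with smallest |w|.
prune.l1_unstructured(layer, name="weight", amount=0.5)

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")   # → sparsity: 50%
```

<p>The mask can be made permanent with <code>prune.remove(layer, "weight")</code>, which folds the zeros into the weight tensor itself.</p>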
</section>
<section id="types-of-pruning" class="level2">
<h2 class="anchored" data-anchor-id="types-of-pruning">Types of Pruning</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="pruning-weight-vs-activation.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://imaddabbura.github.io/posts/efficient-ml/pruning-weight-vs-activation.png" class="quarto-figure quarto-figure-center figure-img" width="800" height="500"></a></p>
</figure>
</div>
<ol type="1">
<li><strong>Weight Pruning</strong>:</li>
</ol>
<ul>
<li>Focuses on removing connections between neurons</li>
<li>Reduces model size and computational complexity</li>
<li>Based on weight importance metrics</li>
<li>May require fine-tuning after pruning</li>
<li>Results in sparse weight matrices <a href="https://www.researchgate.net/publication/318471114_Activation_Pruning_of_Deep_Convolutional_Neural_Networks#:~:text=Activation%20Pruning%20of">[1]</a></li>
</ul>
<ol start="2" type="1">
<li><strong>Activation Pruning</strong>:</li>
</ol>
<ul>
<li>Removes entire neurons or channels</li>
<li>Reduces computational cost more directly</li>
<li>Based on activation importance</li>
<li>Can even reduce misclassification error relative to the unpruned network <a href="https://www.researchgate.net/publication/318471114_Activation_Pruning_of_Deep_Convolutional_Neural_Networks#:~:text=Activation%20Pruning%20of">[1]</a></li>
</ul>
</section>
<section id="pruning-approaches" class="level2">
<h2 class="anchored" data-anchor-id="pruning-approaches">Pruning Approaches</h2>
<section id="only-pruning" class="level3">
<h3 class="anchored" data-anchor-id="only-pruning">1. Only Pruning</h3>
<ul>
<li>Simplest approach: directly remove weights without additional steps</li>
<li>Often results in significant accuracy drop</li>
<li>Not recommended for production systems</li>
</ul>
</section>
<section id="pruning-fine-tuning" class="level3">
<h3 class="anchored" data-anchor-id="pruning-fine-tuning">2. Pruning + Fine-tuning</h3>
<ul>
<li>Prune weights followed by model fine-tuning</li>
<li>Helps recover accuracy lost during pruning</li>
<li>More effective than pruning alone <a href="https://openreview.net/pdf?id=Cb54AMqHQFP#:~:text=In%20this%20paper%2C%20we,is%20fine%2Dtuning%2C%20which%20aims">[2]</a></li>
</ul>
</section>
<section id="iterative-pruning-fine-tuning" class="level3">
<h3 class="anchored" data-anchor-id="iterative-pruning-fine-tuning">3. Iterative Pruning + Fine-tuning</h3>
<ul>
<li>Gradually removes weights over multiple steps</li>
<li>Each step is less aggressive than the previous</li>
<li>Includes fine-tuning between pruning steps</li>
<li>Achieves best accuracy with high pruning ratios (90%+ range)</li>
<li>More computationally expensive but yields better results <a href="https://arxiv.org/html/2409.19727v1#:~:text=allows%20the%20network%20to,a%20large%20portion%20of">[3]</a></li>
</ul>
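<p>One common way to make each pruning step less aggressive than the last is a polynomial (here cubic) sparsity ramp, in the spirit of gradual-pruning schedules; the function name is illustrative:</p>

```python
def gradual_sparsity_schedule(final_sparsity, n_steps):
    """Cubic ramp from 0 toward final_sparsity: early steps prune a lot,
    later steps prune progressively less, leaving room for fine-tuning."""
    return [
        final_sparsity * (1.0 - (1.0 - (t + 1) / n_steps) ** 3)
        for t in range(n_steps)
    ]

# Per-step sparsity targets for reaching 90% sparsity in 3 prune/fine-tune rounds
targets = gradual_sparsity_schedule(0.9, n_steps=3)
```

Between consecutive targets the model is fine-tuned, which is what lets iterative pruning reach the 90%+ ratios mentioned above without collapsing accuracy.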
</section>
</section>
<section id="pruning-granularities" class="level2">
<h2 class="anchored" data-anchor-id="pruning-granularities">Pruning Granularities</h2>
<section id="fine-grained-unstructured-pruning" class="level3">
<h3 class="anchored" data-anchor-id="fine-grained-unstructured-pruning">1. Fine-grained (Unstructured) Pruning</h3>
<p><strong>Pros:</strong></p>
<ul>
<li>Maximum flexibility in weight selection</li>
<li>Highest possible compression ratio</li>
<li>Minimal accuracy loss <a href="https://www.researchgate.net/publication/381997604_Research_on_pruning_optimization_techniques_for_neural_networks#:~:text=ratio.%20Fine%2Dgrained%20pruning%20has,because%20it%20is%20unstructured">[4]</a></li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Irregular weight indices</li>
<li>Difficult to accelerate on hardware</li>
<li>Requires specialized implementations <a href="https://www.researchgate.net/publication/381997604_Research_on_pruning_optimization_techniques_for_neural_networks#:~:text=the%20pruned%20model%20cannot,usually%20requires%20additional%20hardware">[5]</a></li>
</ul>
</section>
<section id="coarse-grained-structured-pruning" class="level3">
<h3 class="anchored" data-anchor-id="coarse-grained-structured-pruning">2. Coarse-grained (Structured) Pruning</h3>
<p><strong>Pros:</strong></p>
<ul>
<li>Hardware-friendly</li>
<li>Easier to implement</li>
<li>Maintains dense matrix operations</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Less flexible than fine-grained pruning</li>
<li>Limited to row/column pruning</li>
<li>May result in lower compression ratios</li>
</ul>
</section>
<section id="pattern-based-pruning-nm-sparsity" class="level3">
<h3 class="anchored" data-anchor-id="pattern-based-pruning-nm-sparsity">3. Pattern-based Pruning (N:M Sparsity)</h3>
<ul>
<li>For every M contiguous elements, N elements must be pruned</li>
<li>Common pattern is 2:4 sparsity (50%)</li>
<li>Uses compressed matrix format:
<ul>
<li>One matrix for non-zero values</li>
<li>One bitmask matrix encoding the positions of the non-zero values</li>
</ul></li>
<li>Some hardware architectures support this scheme natively</li>
</ul>
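<p>A minimal sketch of enforcing the pattern (helper name hypothetical): in every block of M contiguous weights, zero out the N smallest-magnitude entries, so 2:4 yields exactly 50% sparsity per block:</p>

```python
def enforce_n_m_sparsity(weights, n=2, m=4):
    """In every block of m contiguous weights, zero out the n smallest-
    magnitude entries (2:4 gives the 50% pattern some GPUs accelerate)."""
    out = list(weights)
    for start in range(0, len(out), m):
        block = list(range(start, min(start + m, len(out))))
        # indices of the n smallest-magnitude entries in this block
        for i in sorted(block, key=lambda j: abs(out[j]))[:n]:
            out[i] = 0.0
    return out

sparse = enforce_n_m_sparsity([1.0, -3.0, 0.5, 2.0, -0.1, 4.0, -2.0, 0.3])
```

Because every block has the same number of survivors, the non-zero values and their positions compress into the fixed-size value/bitmask matrices described above.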
</section>
<section id="channel-based-pruning" class="level3">
<h3 class="anchored" data-anchor-id="channel-based-pruning">4. Channel-based Pruning</h3>
<p><strong>Pros:</strong></p>
<ul>
<li>Most regular structure</li>
<li>Highest potential speedup</li>
<li>Straightforward implementation <a href="https://www.sciencedirect.com/science/article/abs/pii/S0952197625009200">[6]</a></li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Least flexible</li>
<li>Lower compression ratio</li>
<li>Can have uniform or varying sparsity across layers</li>
</ul>
</section>
</section>
<section id="pruning-criteria" class="level2">
<h2 class="anchored" data-anchor-id="pruning-criteria">Pruning Criteria</h2>
<section id="magnitude-based-pruning" class="level3">
<h3 class="anchored" data-anchor-id="magnitude-based-pruning">1. Magnitude-based Pruning</h3>
<ul>
<li>Removes weights with smallest magnitude</li>
<li>Uses L1 or L2 norm for measurement</li>
<li>Can be applied row-wise for improved regularity</li>
</ul>
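<p>The row-wise variant mentioned above can be sketched as follows (illustrative helper): score each row, e.g. an output neuron's incoming weights, by its L1 norm, and flag the lowest-scoring rows for removal, which keeps the sparsity pattern regular:</p>

```python
def rows_to_prune_by_l1(weight_matrix, k):
    """Score each row by its L1 norm and return the indices of the
    k rows with the smallest norms (candidates for removal)."""
    norms = [sum(abs(w) for w in row) for row in weight_matrix]
    return sorted(range(len(norms)), key=lambda r: norms[r])[:k]

W = [[0.1, -0.2, 0.0],   # L1 norm = 0.3
     [1.0, 2.0, -1.5],   # L1 norm = 4.5
     [0.0, 0.3, -0.1]]   # L1 norm = 0.4
pruned_rows = rows_to_prune_by_l1(W, k=1)
```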
</section>
<section id="scaling-based-pruning" class="level3">
<h3 class="anchored" data-anchor-id="scaling-based-pruning">2. Scaling-based Pruning</h3>
<ul>
<li>Associates learnable scaling factors with output channels</li>
<li>Prunes channels with small scaling factors</li>
<li>More adaptive than simple magnitude-based methods <a href="https://arxiv.org/pdf/1912.04845#:~:text=The%20strategy%20that%20has,results%20by%20pruning%20weights">[7]</a></li>
</ul>
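<p>A minimal sketch of this criterion, assuming the scaling factors are something like BatchNorm gammas and using a hypothetical threshold: channels whose learned scale has drifted near zero contribute little and are marked for pruning:</p>

```python
def channels_below_scale_threshold(scales, threshold=0.01):
    """Scaling-based criterion: channels whose learnable scale factor
    (e.g. a BatchNorm gamma) is near zero are marked for pruning."""
    return [i for i, s in enumerate(scales) if abs(s) < threshold]

# Four channels' learned scales; two have collapsed toward zero
to_prune = channels_below_scale_threshold([0.8, 0.003, 1.2, 0.0005], threshold=0.01)
```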
</section>
<section id="percentage-of-zero-based-pruning" class="level3">
<h3 class="anchored" data-anchor-id="percentage-of-zero-based-pruning">3. Percentage-of-Zero-Based Pruning</h3>
<ul>
<li>Focuses on activation patterns</li>
<li>Removes channels with highest percentage of zeros</li>
<li>Requires analysis of activation patterns during inference</li>
<li>Dynamic approach compared to static weight pruning</li>
</ul>
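<p>An illustrative sketch of this criterion (function name hypothetical): record each channel's activations over a calibration set and rank channels by their fraction of exact zeros, pruning the most-often-zero channels first:</p>

```python
def rank_channels_by_zero_fraction(activations):
    """Percentage-of-zeros criterion: channels whose observed activations
    are most often exactly zero come first in the pruning order."""
    zero_frac = [
        sum(1 for a in acts if a == 0.0) / len(acts) for acts in activations
    ]
    return sorted(range(len(zero_frac)), key=lambda c: zero_frac[c], reverse=True)

# Three channels' post-ReLU activations collected during inference
ranking = rank_channels_by_zero_fraction([[0.0, 1.2, 0.0, 2.1],
                                          [1.0, 2.0, 3.0, 4.0],
                                          [0.0, 0.0, 0.0, 1.5]])
```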
</section>
<section id="regression-based-pruning" class="level3">
<h3 class="anchored" data-anchor-id="regression-based-pruning">4. Regression-based Pruning</h3>
<ul>
<li>Minimizes reconstruction error of layer outputs</li>
<li>Avoids full backpropagation</li>
<li>Particularly effective for Large Language Models</li>
<li>More sophisticated approach with better accuracy retention</li>
</ul>
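<p>The idea can be sketched for a single linear layer (all names here are illustrative): compare the layer's outputs before and after zeroing a weight on calibration inputs, and drop the weight whose removal increases the reconstruction error the least, no backpropagation required:</p>

```python
def layer_output_error(X, w, w_pruned):
    """Sum of squared differences between the layer outputs X @ w and
    X @ w_pruned over a batch of calibration inputs X."""
    err = 0.0
    for row in X:
        y = sum(x * wi for x, wi in zip(row, w))
        y_p = sum(x * wi for x, wi in zip(row, w_pruned))
        err += (y - y_p) ** 2
    return err

def weight_to_drop(X, w):
    """Regression-based criterion: drop the single weight whose removal
    increases the output reconstruction error the least."""
    def error_if_dropped(i):
        w_p = [0.0 if j == i else wj for j, wj in enumerate(w)]
        return layer_output_error(X, w, w_p)
    return min(range(len(w)), key=error_if_dropped)

# Dropping the small weight barely changes the outputs, so it goes first
idx = weight_to_drop(X=[[1.0, 0.0], [0.0, 1.0]], w=[3.0, 0.1])
```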
</section>
</section>
<section id="important-considerations" class="level2">
<h2 class="anchored" data-anchor-id="important-considerations">Important Considerations</h2>
<ol type="1">
<li><p><strong>Large vs.&nbsp;Small Models</strong>: It’s generally better to prune a large model than train a smaller model from scratch. Over-parameterization helps avoid local minima by providing more dimensions to escape saddle points.</p></li>
<li><p><strong>Hardware Considerations</strong>: The choice of pruning granularity should consider the target hardware architecture. Structured pruning may be preferred for standard hardware, while specialized hardware might better handle unstructured pruning.</p></li>
<li><p><strong>Layer-wise Pruning</strong>: Different layers may have different levels of redundancy, making uniform pruning across all layers suboptimal. Adaptive approaches that consider layer-specific characteristics often yield better results.</p></li>
</ol>
<p>Understanding these pruning techniques enables machine learning engineers and data scientists to make informed decisions when optimizing deep learning models for specific applications and hardware constraints.</p>


</section>

 ]]></description>
  <category>Deep Learning</category>
  <guid>https://imaddabbura.github.io/posts/efficient-ml/pruning-dl-models.html</guid>
  <pubDate>Fri, 03 May 2024 05:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/efficient-ml/pruning.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Building GPT(2/3) from Scratch: Turning Theory into a Working Transformer</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/nlp/GPT2-From-Scratch.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge evergreen">evergreen</span></div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>There’s an old saying in engineering: “You don’t really understand something until you can build it.” This has never been more true than in the era of LLMs. While we’ve previously explored the foundational concepts in my post on the Transformer architecture <a href="https://imaddabbura.github.io/posts/nlp/Transformer-Architecture-Explained.html">explained here</a>, true understanding comes from implementation. That’s why today, we’re building a GPT-style model (the 124M variant) from scratch in PyTorch.</p>
<p>This project has a different focus than my last “from scratch” endeavor, where I built an <a href="https://github.com/ImadDabbura/tiny-pytorch">entire deep learning framework</a> to grasp the low-level mechanics of autograd and tensor ops. Here, we’ll leverage PyTorch’s battle-tested primitives to focus on what makes GPT special: multi-head attention, positional encodings, and the specific architectural decisions that enable language understanding.</p>
<p>This hands-on process reveals challenges you can’t appreciate from diagrams alone. You’ll watch your GPU memory overflow, see training grind to a halt from inefficient data loading, and learn firsthand why techniques like mixed-precision training, gradient accumulation, and activation checkpointing are necessities, not just optimizations. It’s in facing these hurdles that you truly appreciate the engineering craft required to build and scale transformers efficiently.</p>
</section>
<section id="gpts" class="level2">
<h2 class="anchored" data-anchor-id="gpts">GPTs</h2>
<p>GPT (Generative Pre-trained Transformer) models, developed by OpenAI, represent a breakthrough in natural language processing. GPT-2, released in 2019, demonstrated that a transformer-based model trained on vast amounts of text could generate remarkably coherent and contextually relevant content. GPT-3, its successor, scaled this approach to 175 billion parameters, showcasing emergent capabilities like few-shot learning and complex reasoning. Both models share the same fundamental architecture: stacked transformer decoder blocks that predict the next token in a sequence, trained on the simple objective of minimizing prediction error across massive text corpora. The 124M parameter version we’ll be building captures the essential architecture while remaining computationally tractable for individual developers—though even at this “small” scale, you’ll quickly discover why the ML community spends so much time optimizing both training efficiency and model performance.</p>
<p>By the end of this journey, you won’t just know how transformers work—you’ll have built the critical components with your own hands, optimized the training loop, and watched your model evolve from random noise to coherent text generation. Let’s begin.</p>
</section>
<section id="implementation" class="level2">
<h2 class="anchored" data-anchor-id="implementation">Implementation</h2>
<p>Throughout this implementation, every piece of code will be thoroughly annotated with explanations of not just what we’re doing, but why we’re doing it. More importantly, we’ll apply a few optimizations that make a big difference in computational efficiency:</p>
<ul>
<li><p><strong>TensorFloat32 (TF32)</strong>: NVIDIA’s 19-bit precision format that keeps FP32’s 8-bit exponent range but truncates the mantissa from 23 bits to 10, providing up to 8x matrix-multiplication throughput on A100 GPUs while maintaining model quality. We’ll see how a single line of code can dramatically accelerate matrix multiplications.</p></li>
<li><p><strong>BFloat16 with Autocast</strong>: Mixed precision training using brain floating-point format, which maintains the same exponent range as FP32 but reduces mantissa precision. Combined with automatic mixed precision (AMP), this cuts memory usage in half and speeds up training significantly.</p></li>
<li><p><strong>torch.compile</strong>: PyTorch 2.0’s just-in-time compilation that fuses operations and generates optimized kernels. We’ll explore how graph compilation can provide 10-30% speedups with minimal code changes.</p></li>
<li><p><strong>Flash Attention and Online Softmax</strong>: An algorithmic improvement that computes attention without materializing the full attention matrix, reducing memory complexity from O(n²) to O(n).</p></li>
<li><p><strong>Fused AdamW</strong>: A single-kernel implementation of the AdamW optimizer that reduces memory reads/writes by computing all parameter updates in one pass, providing up to 2x optimizer step speedup.</p></li>
<li><p><strong>Annealed Learning Rate</strong>: Starting with a warmup phase followed by cosine decay, we’ll implement the learning rate schedule that has become standard for training transformers, understanding why stable training requires careful lr management.</p></li>
<li><p><strong>Weight Decay Only on Matrices</strong>: A subtle but crucial detail—applying weight decay only to weight matrices in Linear and Embedding layers while excluding biases and layer normalization parameters, which improves model performance.</p></li>
<li><p><strong>Distributed Data Parallelism (DDP)</strong>: Scaling training across multiple GPUs using PyTorch’s DDP, including gradient synchronization, proper data loading, and the intricacies of maintaining consistent model states across devices.</p></li>
</ul>
<p>Finally, since the GPT-2 paper omits certain architectural details and hyperparameter specifications, we’ll refer to the GPT-3 paper to fill these gaps—fortunately, the core architecture remains consistent between the two models, making the GPT-3 paper a reliable source for these missing implementation details.</p>
<div id="660e14ce-2270-4cf4-9078-36759f24db29" class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## | code-fold: true</span></span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> inspect</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> math</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb1-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> time</span>
<span id="cb1-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> dataclasses <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> dataclass</span>
<span id="cb1-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> functools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> partial, wraps</span>
<span id="cb1-8"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> typing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Callable, Iterable</span>
<span id="cb1-9"></span>
<span id="cb1-10"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> tiktoken</span>
<span id="cb1-11"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb1-12"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch.distributed <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> dist</span>
<span id="cb1-13"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch.nn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nn</span>
<span id="cb1-14"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch.nn.functional <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> F</span>
<span id="cb1-15"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch.optim <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> opt</span>
<span id="cb1-16"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torch.distributed <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> destroy_process_group, init_process_group</span>
<span id="cb1-17"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torch.nn.parallel <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> DistributedDataParallel <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> DDP</span></code></pre></div>
</div>
<div id="f585c521" class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## | code-fold: true</span></span>
<span id="cb2-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> listify(obj):</span>
<span id="cb2-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> obj <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb2-4">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> []</span>
<span id="cb2-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(obj, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb2-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> [obj]</span>
<span id="cb2-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(obj, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>):</span>
<span id="cb2-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> obj</span>
<span id="cb2-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(obj, Iterable):</span>
<span id="cb2-10">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(obj)</span>
<span id="cb2-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb2-12">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> [obj]</span></code></pre></div>
</div>
<div id="4fd6e093-b414-4fb9-a165-04ae9e312c03" class="cell">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## | code-fold: true</span></span>
<span id="cb3-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> annealer(func: Callable):</span>
<span id="cb3-3">    wraps(func)</span>
<span id="cb3-4"></span>
<span id="cb3-5">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> annealer_wrapper(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>kwargs):</span>
<span id="cb3-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> partial(func, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>kwargs)</span>
<span id="cb3-7"></span>
<span id="cb3-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> annealer_wrapper</span>
<span id="cb3-9"></span>
<span id="cb3-10"></span>
<span id="cb3-11"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@annealer</span></span>
<span id="cb3-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> lin_sched(start, end, pos):</span>
<span id="cb3-13">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Linear scheduler."""</span></span>
<span id="cb3-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (end <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> pos</span>
<span id="cb3-15"></span>
<span id="cb3-16"></span>
<span id="cb3-17"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@annealer</span></span>
<span id="cb3-18"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> cos_sched(start, end, pos):</span>
<span id="cb3-19">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Cosine scheduler."""</span></span>
<span id="cb3-20">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> math.cos(math.pi <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> pos))) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (end <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb3-21"></span>
<span id="cb3-22"></span>
<span id="cb3-23"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> combine_scheds(pcts, scheds):</span>
<span id="cb3-24">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb3-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Combine multiple schedulers, each run for a given percentage of the</span></span>
<span id="cb3-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    training process.</span></span>
<span id="cb3-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb3-28">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(pcts) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(scheds), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Each scheduler should have its `pct`."</span></span>
<span id="cb3-29">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(pcts) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sum of the `pcts` should be equal to 1."</span></span>
<span id="cb3-30">    pcts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.tensor([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> listify(pcts))</span>
<span id="cb3-31">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> (pcts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"All percentages should be non-negative."</span></span>
<span id="cb3-32">    pcts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.cumsum(pcts, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb3-33"></span>
<span id="cb3-34">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _inner(pos):</span>
<span id="cb3-35">        idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (pos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> pcts).nonzero().<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>()</span>
<span id="cb3-36">        actual_pos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (pos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> pcts[idx]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (pcts[idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> pcts[idx])</span>
<span id="cb3-37">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> scheds[idx](actual_pos)</span>
<span id="cb3-38"></span>
<span id="cb3-39">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> _inner</span></code></pre></div>
</div>
<div id="96f98324-58a9-49f1-9ad1-789c1a6b20e4" class="cell">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@dataclass</span></span>
<span id="cb4-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> GPTConfig:</span>
<span id="cb4-3">    block_sz: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1024</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Sequence length</span></span>
<span id="cb4-4">    vocab_sz: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb4-5">        <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50257</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Originally 50000 BPE merges + 256 byte tokens + 1 for &lt;|endoftext|&gt; token</span></span>
<span id="cb4-6">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## which will delimits different documents. This token's index is 50256</span></span>
<span id="cb4-7">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## However, we found that using 50257 as the vocab size is not a multiple of 64 and we</span></span>
<span id="cb4-8">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## could improve efficiency and performance (through better occupancy) if we round up</span></span>
<span id="cb4-9">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## to the closest multiple of 64, which is 5304.</span></span>
<span id="cb4-10">    )</span>
<span id="cb4-11">    n_layer: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Number of layers</span></span>
<span id="cb4-12">    n_embd: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">768</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Embedding dimension</span></span>
<span id="cb4-13">    n_head: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Number of attention heads</span></span>
<span id="cb4-14">    lr: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3e-4</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Good for big models</span></span>
<span id="cb4-15">    batch_sz: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span></span>
<span id="cb4-16">    dropout: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span></span>
<span id="cb4-17">    bias: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">bool</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span></code></pre></div>
</div>
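<p>The vocab-size rounding described in the config comment can be sketched in a couple of lines (my own helper, not part of the post's code):</p>

```python
def round_up_to_multiple(n: int, m: int) -> int:
    """Round n up to the nearest multiple of m."""
    return ((n + m - 1) // m) * m

# GPT-2's raw vocab size: 50000 BPE merges + 256 byte tokens + 1 special token
raw_vocab = 50257
padded_vocab = round_up_to_multiple(raw_vocab, 64)
print(padded_vocab)  # 50304
```

<p>Padding the vocab adds a few embedding rows that are never used by the tokenizer, but it gives the matmul kernels nicely divisible dimensions.</p>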
<div id="f41875c6" class="cell">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> MLP(nn.Module):</span>
<span id="cb5-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config: GPTConfig):</span>
<span id="cb5-3">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Point-wise feed-forward network that applies non-linearity</span></span>
<span id="cb5-4">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## on every token separately. THERE IS NO INTERACTION BETWEEN TOKENS</span></span>
<span id="cb5-5">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## This is where almost all the capacity and non-linearities of the </span></span>
<span id="cb5-6">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## model come from, especially when we project it to 4 x n_embd</span></span>
<span id="cb5-7">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb5-8">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.c_fc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.n_embd, config.n_embd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb5-9">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Found to work better than ReLU (GELU avoids ReLU's zero-gradient region)</span></span>
<span id="cb5-10">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.gelu <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.GELU(approximate<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tanh"</span>)</span>
<span id="cb5-11">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.c_proj <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.n_embd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, config.n_embd)</span>
<span id="cb5-12">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dropout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Dropout(config.dropout)</span>
<span id="cb5-13">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.c_proj.NANOGPT_SCALE_INIT <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb5-14"></span>
<span id="cb5-15">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb5-16">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dropout(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.c_proj(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.gelu(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.c_fc(x))))</span></code></pre></div>
</div>
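<p>As a back-of-the-envelope check on the MLP above, its parameter count follows directly from the 4x expansion (plain arithmetic using the config's <code>n_embd = 768</code>, biases included):</p>

```python
n_embd = 768            # embedding dimension from GPTConfig
hidden = 4 * n_embd     # the MLP expands each token to 4 x n_embd

# c_fc maps n_embd to hidden, c_proj maps hidden back; each has a bias
c_fc_params = n_embd * hidden + hidden
c_proj_params = hidden * n_embd + n_embd
total = c_fc_params + c_proj_params
print(total)  # 4722432, i.e. roughly 4.7M parameters per MLP
```

<p>Multiplied by 12 layers, the MLPs alone account for a large share of the model's non-embedding parameters.</p>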
<div id="a14dee4c" class="cell">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> CausalSelfAttention(nn.Module):</span>
<span id="cb6-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config: GPTConfig):</span>
<span id="cb6-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb6-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_head <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> config.n_head</span>
<span id="cb6-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_embd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> config.n_embd</span>
<span id="cb6-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.c_attn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.n_embd, config.n_embd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>config.bias)</span>
<span id="cb6-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.c_proj <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.n_embd, config.n_embd, bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>config.bias)</span>
<span id="cb6-8">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.attn_dropout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Dropout(config.dropout)</span>
<span id="cb6-9">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.resid_dropout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Dropout(config.dropout)</span>
<span id="cb6-10">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dropout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> config.dropout</span>
<span id="cb6-11">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.c_proj.NANOGPT_SCALE_INIT <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb6-12">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## </span><span class="al" style="color: #AD0000;
background-color: null;
font-style: inherit;">NOTE</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: Mask is not needed when we use PyTorch's Flash attention</span></span>
<span id="cb6-13">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## self.register_buffer(</span></span>
<span id="cb6-14">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##     "mask",</span></span>
<span id="cb6-15">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##     torch.tril(torch.ones(config.block_sz, config.block_sz)).view(</span></span>
<span id="cb6-16">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##         config.block_sz, config.block_sz</span></span>
<span id="cb6-17">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##     ),</span></span>
<span id="cb6-18">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## )</span></span>
<span id="cb6-19"></span>
<span id="cb6-20">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb6-21">        B, T, C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x.shape</span>
<span id="cb6-22">        qkv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.c_attn(x)</span>
<span id="cb6-23">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## q/k/v is B x T x n_embd each</span></span>
<span id="cb6-24">        q, k, v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.split(qkv, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_embd, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb6-25">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Reshape q/k/v to B x n_head x T x (n_embd / n_head)</span></span>
<span id="cb6-26">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## So each head would be learning different kinds of</span></span>
<span id="cb6-27">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## relationships</span></span>
<span id="cb6-28">        q <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> q.view(B, T, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_head, C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_head).transpose(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb6-29">        k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> k.view(B, T, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_head, C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_head).transpose(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb6-30">        v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> v.view(B, T, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_head, C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_head).transpose(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb6-31">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## attn is B x n_head x T x T</span></span>
<span id="cb6-32">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## attn = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.shape[-1]))</span></span>
<span id="cb6-33">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## ## Mask out future tokens</span></span>
<span id="cb6-34">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## attn = attn.masked_fill(self.mask[:T, :T] == 0, float("-inf"))</span></span>
<span id="cb6-35">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## attn = self.attn_dropout(F.softmax(attn, dim=-1))</span></span>
<span id="cb6-36">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## ## y is B x n_head x T x (n_embd / n_head)</span></span>
<span id="cb6-37">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## y = attn @ v</span></span>
<span id="cb6-38">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Uses Flash attention, which never materializes the attention matrix for</span></span>
<span id="cb6-39">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## each head and is aware of the memory hierarchy and tries to reduce</span></span>
<span id="cb6-40">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## read/writes with more FLOPs -&gt; Speed up since we're memory bound</span></span>
<span id="cb6-41">        y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> F.scaled_dot_product_attention(q, k, v, is_causal<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb6-42">        y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y.transpose(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>).contiguous().view(B, T, C)</span>
<span id="cb6-43">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.resid_dropout(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.c_proj(y))</span></code></pre></div>
</div>
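<p>To see why the Flash-attention path matters, here are rough numbers for the score matrix the commented-out naive path would materialize (my own estimate, assuming fp32 activations and the batch/config values above):</p>

```python
B, T, n_head, n_embd = 4, 1024, 12, 768  # batch_sz, block_sz, and config values
head_dim = n_embd // n_head              # each head attends in a 64-dim subspace

# Naive attention materializes a (B, n_head, T, T) score matrix
bytes_per_float = 4                      # assuming fp32
attn_bytes = B * n_head * T * T * bytes_per_float
print(head_dim, attn_bytes // 2**20)     # 64 192  (192 MiB per attention layer)
```

<p>Flash attention computes the same result in tiles that stay in on-chip SRAM, trading extra FLOPs for far fewer reads/writes of that matrix to HBM.</p>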
<div id="85779fd0" class="cell">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> Block(nn.Module):</span>
<span id="cb7-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config: GPTConfig):</span>
<span id="cb7-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb7-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.ln_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.LayerNorm(config.n_embd)</span>
<span id="cb7-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.ln_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.LayerNorm(config.n_embd)</span>
<span id="cb7-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.mlp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MLP(config)</span>
<span id="cb7-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.attn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> CausalSelfAttention(config)</span>
<span id="cb7-8"></span>
<span id="cb7-9">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb7-10">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Use pre-layer normalization, which deviates from the</span></span>
<span id="cb7-11">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## original transformer paper, which uses post-layer normalization.</span></span>
<span id="cb7-12">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## This should help stabilize training</span></span>
<span id="cb7-13">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.attn(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.ln_1(x))</span>
<span id="cb7-14">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.mlp(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.ln_2(x))</span>
<span id="cb7-15">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> x</span></code></pre></div>
</div>
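<p>Each <code>Block</code> adds two residual contributions (attention and MLP), so across <code>n_layer</code> layers the residual stream accumulates <code>2 * n_layer</code> roughly independent additions and its std grows like <code>sqrt(2 * n_layer)</code>. A quick pure-Python simulation of that growth (my own sketch, not the post's code):</p>

```python
import math
import random

random.seed(0)
n_layer, std0, trials = 12, 0.02, 20_000

# Sum 2 * n_layer independent N(0, std0^2) contributions, mimicking the
# attn + mlp residual additions, and measure the std of the sum
samples = [sum(random.gauss(0.0, std0) for _ in range(2 * n_layer))
           for _ in range(trials)]
mean = sum(samples) / trials
emp_std = math.sqrt(sum((s - mean) ** 2 for s in samples) / trials)

print(round(std0 * math.sqrt(2 * n_layer), 3))  # 0.098 (predicted growth)
# emp_std lands close to that prediction
```

<p>This sqrt growth is exactly what the scaled initialization in <code>_init_weights</code> compensates for by multiplying std by <code>(2 * n_layer) ** -0.5</code>.</p>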
<div id="20a4b2f4" class="cell">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> GPT2(nn.Module):</span>
<span id="cb8-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config: GPTConfig):</span>
<span id="cb8-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb8-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.config <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> config</span>
<span id="cb8-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.transformer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.ModuleDict(</span>
<span id="cb8-6">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(</span>
<span id="cb8-7">                wte<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>nn.Embedding(config.vocab_sz, config.n_embd),</span>
<span id="cb8-8">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Attention operation is a permutation equivariant, this means that</span></span>
<span id="cb8-9">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## if we permute the input then the corresponding output will be</span></span>
<span id="cb8-10">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## permuted in exactly the same way. In other words, attention mechanism</span></span>
<span id="cb8-11">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## is not aware of the relative ordering of the tokens. Therefore, we</span></span>
<span id="cb8-12">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## need some way to encode the positions of the tokens in each sequence.</span></span>
<span id="cb8-13">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## This is where positional encoding comes into play.</span></span>
<span id="cb8-14">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Here we use a simple positional encoding that is a simple</span></span>
<span id="cb8-15">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## embedding of the position of the token in the sequence.</span></span>
<span id="cb8-16">                wpe<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>nn.Embedding(config.block_sz, config.n_embd),</span>
<span id="cb8-17">                h<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>nn.ModuleList(</span>
<span id="cb8-18">                    [Block(config) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(config.n_layer)]</span>
<span id="cb8-19">                ),</span>
<span id="cb8-20">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Final layer norm after all transformer layers</span></span>
<span id="cb8-21">                ln_f<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>nn.LayerNorm(config.n_embd),</span>
<span id="cb8-22">            )</span>
<span id="cb8-23">        )</span>
<span id="cb8-24">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.lm_head <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.n_embd, config.vocab_sz, bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb8-25"></span>
<span id="cb8-26">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Weight sharing between the token embedding layer and</span></span>
<span id="cb8-27">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## last linear layer (LM head classifier). The rationale is</span></span>
<span id="cb8-28">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## that tokens that are semantically similar to each other in</span></span>
<span id="cb8-29">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## the embedding space should have similar probabilities in the</span></span>
<span id="cb8-30">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## softmax of the LM head layer</span></span>
<span id="cb8-31">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Also, these are among the biggest weight matrices in the model.</span></span>
<span id="cb8-32">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## This means, for a model like GPT-2, we save ~30% of the parameters</span></span>
<span id="cb8-33">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## by sharing the weight matrices (50257 * 768) / 124M = ~31%</span></span>
<span id="cb8-34">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.transformer.wte.weight <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.lm_head.weight</span>
<span id="cb8-35">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._init_weights)</span>
<span id="cb8-36"></span>
<span id="cb8-37">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _init_weights(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, module):</span>
<span id="cb8-38">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## The following initialization comes from gpt2 src code</span></span>
<span id="cb8-39">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## </span><span class="al" style="color: #AD0000;
background-color: null;
font-style: inherit;">NOTE</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: Because token embedding and classifier weights are shared,</span></span>
<span id="cb8-40">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## our initialization logic will initialize the weight matrix twice</span></span>
<span id="cb8-41">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## but that shouldn't be an issue since both passes use the</span></span>
<span id="cb8-42">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## same std and mean</span></span>
<span id="cb8-43">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(module, nn.Linear):</span>
<span id="cb8-44">            std <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.02</span></span>
<span id="cb8-45">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## We're changing std because the residual path affects std</span></span>
<span id="cb8-46">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## by increasing it on every layer so we need to adjust</span></span>
<span id="cb8-47">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## it so we still have the same std = 0.02</span></span>
<span id="cb8-48">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">hasattr</span>(module, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"NANOGPT_SCALE_INIT"</span>):</span>
<span id="cb8-49">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## `2` here because every layer has two blocks:</span></span>
<span id="cb8-50">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##   - Attention block</span></span>
<span id="cb8-51">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##   - MLP block</span></span>
<span id="cb8-52">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## `N` is the number of layers in the model (n_layer)</span></span>
<span id="cb8-53">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Since they are independent, the variance of the sum of the two</span></span>
<span id="cb8-54">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## blocks is the sum of the variances</span></span>
<span id="cb8-55">                std <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.config.n_layer) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span></span>
<span id="cb8-56">            nn.init.normal_(module.weight, std<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>std)</span>
<span id="cb8-57">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> module.bias <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb8-58">                nn.init.zeros_(module.bias)</span>
<span id="cb8-59">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(module, nn.Embedding):</span>
<span id="cb8-60">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## We're initializing the token and positional embeddings</span></span>
<span id="cb8-61">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## with the same std, whereas the paper initialized the positional</span></span>
<span id="cb8-62">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## embedding with std = 0.01</span></span>
<span id="cb8-63">            nn.init.normal_(module.weight, std<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.02</span>)</span>
<span id="cb8-64"></span>
<span id="cb8-65">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x, targets<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>):</span>
<span id="cb8-66">        T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x.shape[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb8-67">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> (</span>
<span id="cb8-68">            T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.config.block_sz</span>
<span id="cb8-69">        ), <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Sequence length must be &lt;= </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>config<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>block_sz<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, got </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>T<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb8-70">        pos_emb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.transformer.wpe(</span>
<span id="cb8-71">            torch.arange(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, T, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">long</span>, device<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>x.device)</span>
<span id="cb8-72">        )</span>
<span id="cb8-73">        tok_emb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.transformer.wte(x)</span>
<span id="cb8-74">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pos_emb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> tok_emb</span>
<span id="cb8-75">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> block <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.transformer.h:</span>
<span id="cb8-76">            x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block(x)</span>
<span id="cb8-77">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.transformer.ln_f(x)</span>
<span id="cb8-78">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## logits is B x T x vocab_sz</span></span>
<span id="cb8-79">        logits <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.lm_head(x)</span>
<span id="cb8-80">        loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb8-81">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> targets <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb8-82">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## F.cross_entropy expects class logits (not probabilities), so we flatten to (B*T, vocab_sz)</span></span>
<span id="cb8-83">            loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> F.cross_entropy(</span>
<span id="cb8-84">                logits.view(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.config.vocab_sz), targets.view(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-85">            )</span>
<span id="cb8-86">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> logits, loss</span>
<span id="cb8-87"></span>
<span id="cb8-88">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> configure_optimizer(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, weight_decay, lr, device):</span>
<span id="cb8-89">        params_dict <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb8-90">            pn: p <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> pn, p <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.named_parameters() <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> p.requires_grad</span>
<span id="cb8-91">        }</span>
<span id="cb8-92">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## We're not applying weight decay to bias and layer norm parameters</span></span>
<span id="cb8-93">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## or any other 1D parameters. Therefore, we ONLY apply weight decay</span></span>
<span id="cb8-94">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## to the weight matrices in Embedding and Linear layers</span></span>
<span id="cb8-95">        decay_params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [p <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> p <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> params_dict.values() <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> p.ndim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]</span>
<span id="cb8-96">        nondecay_params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [p <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> p <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> params_dict.values() <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> p.ndim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]</span>
<span id="cb8-97">        params_groups <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb8-98">            {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"params"</span>: decay_params, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"weight_decay"</span>: weight_decay},</span>
<span id="cb8-99">            {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"params"</span>: nondecay_params, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"weight_decay"</span>: <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>},</span>
<span id="cb8-100">        ]</span>
<span id="cb8-101">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Fused AdamW is available for PyTorch 2.0+</span></span>
<span id="cb8-102">        fused_available <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"fused"</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> inspect.signature(opt.AdamW).parameters</span>
<span id="cb8-103">        use_fused <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> fused_available <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cuda"</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> device</span>
<span id="cb8-104">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> opt.AdamW(</span>
<span id="cb8-105">            params_groups, lr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>lr, betas<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.95</span>), eps<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-8</span>, fused<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>use_fused</span>
<span id="cb8-106">        )</span>
<span id="cb8-107"></span>
<span id="cb8-108">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@torch.no_grad</span></span>
<span id="cb8-109">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> generate(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, idxs: torch.tensor, max_tokens: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>):</span>
<span id="cb8-110">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(max_tokens):</span>
<span id="cb8-111">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Crop the context to the last block_sz tokens (at most we can have</span></span>
<span id="cb8-112">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## block_sz tokens since we're using a fixed block_sz for the</span></span>
<span id="cb8-113">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## positional embedding)</span></span>
<span id="cb8-114">            idxs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> idxs[:, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.config.block_sz :]</span>
<span id="cb8-115">            logits, _ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>(idxs)</span>
<span id="cb8-116">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Get probs for last token to predict next token</span></span>
<span id="cb8-117">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## This would be B x vocab_sz</span></span>
<span id="cb8-118">            logits <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> logits[:, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, :]</span>
<span id="cb8-119">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Apply softmax to get probabilities</span></span>
<span id="cb8-120">            probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> F.softmax(logits, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-121">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Keep the top 50 probs -&gt; we never sample tokens with</span></span>
<span id="cb8-122">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## very small probs (the long tail) -&gt; B x 50</span></span>
<span id="cb8-123">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## probs/idxs are sorted in descending order</span></span>
<span id="cb8-124">            topk_probs, topk_idxs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.topk(probs, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-125">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Sample 1 token from the top 50 tokens -&gt; idx is B x 1</span></span>
<span id="cb8-126">            idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.multinomial(topk_probs, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-127">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Map back to vocab ids since `multinomial` returns indices</span></span>
<span id="cb8-128">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## into the given array</span></span>
<span id="cb8-129">            idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.gather(topk_idxs, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, idx)</span>
<span id="cb8-130">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## </span><span class="al" style="color: #AD0000;
background-color: null;
font-style: inherit;">TODO</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: We should check for end_of_text token and break out of</span></span>
<span id="cb8-131">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## the loop (stop generation) even if we have not reached max_tokens</span></span>
<span id="cb8-132">            idxs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.cat([idxs, idx], dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-133">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> idxs</span></code></pre></div>
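<p>The init comments above claim that each residual addition inflates the std of the residual stream, which is why branches tagged with <code>NANOGPT_SCALE_INIT</code> get their std multiplied by <code>(2 * n_layer) ** -0.5</code>. A quick standalone sanity check of that arithmetic (my own sketch, using NumPy rather than the model code; all names are illustrative):</p>

```python
import numpy as np

# Illustrative check (not from the post): each residual addition contributes
# independent variance, so after 2 * n_layer additions the stream's std grows
# by sqrt(2 * n_layer) unless projection weights are scaled by (2*n_layer)**-0.5.
rng = np.random.default_rng(0)
n_layer, d, n = 12, 256, 2048

def stream_std(scale: float) -> float:
    x = np.zeros((n, d))
    for _ in range(2 * n_layer):  # two residual adds per layer: attention + MLP
        w = rng.standard_normal((d, d)) * 0.02 * scale  # std = 0.02 * scale
        x = x + rng.standard_normal((n, d)) @ w         # independent branch output
    return float(x.std())

unscaled = stream_std(1.0)                  # grows ~ 0.02*sqrt(d)*sqrt(2*n_layer) ≈ 1.57
scaled = stream_std((2 * n_layer) ** -0.5)  # stays ~ 0.02*sqrt(d) ≈ 0.32
```

<p>With scaling, the final std matches a single branch's contribution instead of growing with depth, which is exactly the behavior the comments describe.</p>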
</div>
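<p>The <code>topk</code> / <code>multinomial</code> / <code>gather</code> dance inside <code>generate</code> is easy to get wrong, so here is a minimal isolated sketch of that sampling step (my own example with made-up shapes, not part of the model):</p>

```python
import torch
import torch.nn.functional as F

# Isolated sketch of the top-k sampling step: `topk` keeps the 50 most likely
# tokens, `multinomial` samples a *position* within that top-k array, and
# `gather` maps the position back to an actual vocabulary id.
torch.manual_seed(0)
logits = torch.randn(2, 100)                # pretend B=2, vocab_sz=100
probs = F.softmax(logits, dim=-1)
topk_probs, topk_idxs = torch.topk(probs, 50, dim=-1)  # each row sorted descending
pos = torch.multinomial(topk_probs, 1)      # index into the top-50 array, B x 1
token = torch.gather(topk_idxs, -1, pos)    # actual vocabulary id, B x 1
```

<p>Note that <code>multinomial</code> renormalizes the top-k probabilities internally, so mass outside the top 50 is simply redistributed rather than wasted.</p>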
<div id="b10a62ed" class="cell">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> DataLoaderLight:</span>
<span id="cb9-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(</span>
<span id="cb9-3">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>,</span>
<span id="cb9-4">        file_path: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>,</span>
<span id="cb9-5">        batch_sz: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>,</span>
<span id="cb9-6">        block_sz: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>,</span>
<span id="cb9-7">        process_rank: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,</span>
<span id="cb9-8">        number_processes: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb9-9">    ) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb9-10">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.batch_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_sz</span>
<span id="cb9-11">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.block_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_sz</span>
<span id="cb9-12">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.process_rank <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> process_rank</span>
<span id="cb9-13">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.number_processes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> number_processes</span>
<span id="cb9-14">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(file_path, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"r"</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> f:</span>
<span id="cb9-15">            text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> f.read()</span>
<span id="cb9-16">        encoder <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tiktoken.get_encoding(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt2"</span>)</span>
<span id="cb9-17">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.tensor(encoder.encode(text), dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">long</span>)</span>
<span id="cb9-18">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## We can truncate the tokens to be a multiple of batch_sz x block_sz</span></span>
<span id="cb9-19">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## x number_processes. This is useful for multi-node training and mimics</span></span>
<span id="cb9-20">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## the behavior of DataLoader's `drop_last` parameter.</span></span>
<span id="cb9-21">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.tokens[: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.tokens) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> (</span>
<span id="cb9-22">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.batch_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.block_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.number_processes</span>
<span id="cb9-23">        )</span>
<span id="cb9-24">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.batch_sz</span>
<span id="cb9-25">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.block_sz</span>
<span id="cb9-26">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.number_processes]</span>
<span id="cb9-27">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Loaded </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.tokens)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> tokens"</span>)</span>
<span id="cb9-28">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"1 epoch = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> batches"</span>)</span>
<span id="cb9-29">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.current_pos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> batch_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> block_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> process_rank</span>
<span id="cb9-30"></span>
<span id="cb9-31">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__len__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb9-32">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.tokens) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> (<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.batch_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.block_sz)</span>
<span id="cb9-33"></span>
<span id="cb9-34">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> next_batch(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb9-35">        buf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.tokens[</span>
<span id="cb9-36">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.current_pos : <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.current_pos</span>
<span id="cb9-37">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.batch_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.block_sz</span>
<span id="cb9-38">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb9-39">        ]</span>
<span id="cb9-40">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> buf[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].view(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.batch_sz, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.block_sz)</span>
<span id="cb9-41">        y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> buf[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:].view(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.batch_sz, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.block_sz)</span>
<span id="cb9-42">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Each process will process batch_sz x block_sz tokens in each</span></span>
<span id="cb9-43">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## iteration -&gt; with number_processes processes, total tokens processed</span></span>
<span id="cb9-44">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## in each iteration is batch_sz x block_sz x number_processes. In the</span></span>
<span id="cb9-45">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## case of one process, total tokens would be batch_sz x block_sz</span></span>
<span id="cb9-46">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.current_pos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> (</span>
<span id="cb9-47">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.batch_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.block_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.number_processes</span>
<span id="cb9-48">        )</span>
<span id="cb9-49">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Similar to DataLoader's `drop_last` parameter, we drop the last</span></span>
<span id="cb9-50">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## batch if it's not a multiple of batch_sz x block_sz x number_processes</span></span>
<span id="cb9-51">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## if self.current_pos + (</span></span>
<span id="cb9-52">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##     self.batch_sz * self.block_sz * self.number_processes</span></span>
<span id="cb9-53">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## ) + self.number_processes &gt; len(self):</span></span>
<span id="cb9-54">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.current_pos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.tokens):</span>
<span id="cb9-55">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.current_pos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb9-56">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> x, y</span></code></pre></div>
</div>
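A quick sanity check of the striding scheme above: each process starts at offset `batch_sz * block_sz * process_rank` and advances by `batch_sz * block_sz * number_processes`, so the slices read by different processes interleave without overlap. The `positions` helper below is hypothetical, written only to illustrate the arithmetic (it ignores the `+1` target-token overlap and the exact wrap condition in `next_batch`).

```python
## Hypothetical helper mimicking DataLoaderLight's position arithmetic
def positions(process_rank, number_processes, batch_sz, block_sz, n_tokens, steps):
    stride = batch_sz * block_sz          # tokens consumed per batch per process
    pos = stride * process_rank           # each rank starts at its own offset
    out = []
    for _ in range(steps):
        out.append(pos)
        pos += stride * number_processes  # skip over the other ranks' slices
        if pos >= n_tokens:               # wrap around like the loader does
            pos = 0
    return out

## Two processes, batch_sz=2, block_sz=4 -> each batch covers 8 tokens
p0 = positions(0, 2, 2, 4, n_tokens=64, steps=4)
p1 = positions(1, 2, 2, 4, n_tokens=64, steps=4)
print(p0)  # [0, 16, 32, 48]
print(p1)  # [8, 24, 40, 56]
```

The two ranks read disjoint, interleaved windows of the token stream, which is exactly why `current_pos` must advance by the per-iteration token count of *all* processes, not just one.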
<div id="7913deb5" class="cell">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">###########</span></span>
<span id="cb10-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Distributed Data Parallel</span></span>
<span id="cb10-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">###########</span></span>
<span id="cb10-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Distributed Data Parallel lets us run the same model (replica) on different GPUs,</span></span>
<span id="cb10-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## where each GPU would work on a different slice of data. After we do the backward</span></span>
<span id="cb10-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## pass, we average the gradients across all processes (GPUs) and synchronize all</span></span>
<span id="cb10-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## parameters across all devices. We use the all-reduce op to do this and communicate the</span></span>
<span id="cb10-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## updates with all processes.</span></span>
<span id="cb10-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Each process runs the same code from top to bottom, unaware that</span></span>
<span id="cb10-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## other processes are running the same thing on other devices</span></span>
<span id="cb10-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#</span></span>
<span id="cb10-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## torchrun command sets the following environment variables:</span></span>
<span id="cb10-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## RANK: Id of the process in the process group. It is an int in [0, WORLD_SIZE)</span></span>
<span id="cb10-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## LOCAL_RANK: In the case of multi-nodes, LOCAL_RANK is the id of</span></span>
<span id="cb10-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##             the process in the same node. Example: If we have a node</span></span>
<span id="cb10-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##             with 4 GPUs, the first process will have LOCAL_RANK=0</span></span>
<span id="cb10-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##             but RANK of this process may not be 0 if we are running</span></span>
<span id="cb10-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##             on multiple nodes.</span></span>
<span id="cb10-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##             This is useful when we have multiple nodes and we want to</span></span>
<span id="cb10-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##             run the processes on different GPUs in the same node.</span></span>
<span id="cb10-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##             In this case, we can set the LOCAL_RANK to the GPU id in the</span></span>
<span id="cb10-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##             node.</span></span>
<span id="cb10-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## WORLD_SIZE: Total number of processes</span></span>
<span id="cb10-24">ddp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(os.getenv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"RANK"</span>, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Check if it is a ddp run</span></span>
<span id="cb10-25"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> ddp:</span>
<span id="cb10-26">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## DDP requires CUDA so we need to set the device for each process</span></span>
<span id="cb10-27">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## so only one process can run per device</span></span>
<span id="cb10-28">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> torch.cuda.is_available(), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"DDP requires CUDA"</span></span>
<span id="cb10-29">    init_process_group(backend<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"nccl"</span>)</span>
<span id="cb10-30">    ddp_rank <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(os.getenv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"RANK"</span>))</span>
<span id="cb10-31">    ddp_local_rank <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(os.getenv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"LOCAL_RANK"</span>))</span>
<span id="cb10-32">    ddp_world_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(os.getenv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"WORLD_SIZE"</span>))</span>
<span id="cb10-33">    device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"cuda:</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ddp_local_rank<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb10-34">    torch.cuda.set_device(device)</span>
<span id="cb10-35">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Master process will do more things such as checkpointing and logging</span></span>
<span id="cb10-36">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## while other processes would assist in the computations.</span></span>
<span id="cb10-37">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## It always has RANK=0</span></span>
<span id="cb10-38">    master_process <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ddp_rank <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb10-39"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb10-40">    ddp_rank <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb10-41">    ddp_local_rank <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb10-42">    ddp_world_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb10-43">    master_process <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb10-44">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> torch.cuda.is_available():</span>
<span id="cb10-45">        device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cuda"</span></span>
<span id="cb10-46">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">elif</span> torch.backends.mps.is_built():  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Apple Silicon</span></span>
<span id="cb10-47">        device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mps"</span></span>
<span id="cb10-48">        torch.mps.manual_seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1337</span>)</span>
<span id="cb10-49">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb10-50">        device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cpu"</span></span>
<span id="cb10-51"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(device)</span>
<span id="cb10-52">torch.manual_seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1337</span>)</span>
<span id="cb10-53"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> torch.cuda.is_available():</span>
<span id="cb10-54">    torch.cuda.manual_seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1337</span>)</span>
<span id="cb10-55"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##########</span></span>
<span id="cb10-56"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Initialize model and optimizer</span></span>
<span id="cb10-57"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##########</span></span>
<span id="cb10-58"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## GPU hardware is organized in powers of 2 (e.g., tiling ops),</span></span>
<span id="cb10-59"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## so try to always have matrix dimensions be powers of 2 to improve use of:</span></span>
<span id="cb10-60"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## •    Tensor Cores</span></span>
<span id="cb10-61"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## •    Memory coalescing</span></span>
<span id="cb10-62"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## •    Shared memory bank alignment</span></span>
<span id="cb10-63"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## •    Warp scheduling</span></span>
<span id="cb10-64"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Here we change the vocab_sz by rounding it up to the closest</span></span>
<span id="cb10-65"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## number divisible by a high power of 2 (50304 = 128 x 393). This will</span></span>
<span id="cb10-66"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## increase space overhead but speed up computations</span></span>
<span id="cb10-67">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GPT2(GPTConfig(vocab_sz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50304</span>)).to(device)</span>
<span id="cb10-68"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Speed up model by building static graph that analyzes all ops</span></span>
<span id="cb10-69"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## and optimizes them such as fusing some of them to avoid unnecessary</span></span>
<span id="cb10-70"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## trips to memory</span></span>
<span id="cb10-71"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## model = torch.compile(model)</span></span>
<span id="cb10-72"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> ddp:</span>
<span id="cb10-73">    model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DDP(model, device_ids<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[ddp_local_rank])</span>
<span id="cb10-74">raw_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.module <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> ddp <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span> model</span>
<span id="cb10-75">max_lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3e-4</span></span>
<span id="cb10-76">min_lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> max_lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span></span>
<span id="cb10-77">warmup_steps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span>
<span id="cb10-78">max_steps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb10-79">sched <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> combine_scheds(</span>
<span id="cb10-80">    [warmup_steps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> max_steps, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> (warmup_steps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> max_steps)],</span>
<span id="cb10-81">    [lin_sched(min_lr, max_lr), cos_sched(max_lr, min_lr)],</span>
<span id="cb10-82">)</span>
<span id="cb10-83">optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> raw_model.configure_optimizer(</span>
<span id="cb10-84">    weight_decay<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, lr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>max_lr, device<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>device</span>
<span id="cb10-85">)</span>
<span id="cb10-86"></span>
<span id="cb10-87"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##########</span></span>
<span id="cb10-88"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Run training loop</span></span>
<span id="cb10-89"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#########</span></span>
<span id="cb10-90"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## </span><span class="al" style="color: #AD0000;
background-color: null;
font-style: inherit;">NOTE</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: In order to run 0.5M (from GPT3 paper) tokens per fwd/bwd iteration,</span></span>
<span id="cb10-91"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## we need to use gradient accumulation because we can't fit it in almost</span></span>
<span id="cb10-92"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## any commodity GPU -&gt; we only do backward after we loop through ~0.5M tokens.</span></span>
<span id="cb10-93">total_batch_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">19</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## closest number to 0.5M</span></span>
<span id="cb10-94"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> total_batch_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> (GPTConfig.batch_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> GPTConfig.block_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> ddp_world_size) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"total batch size must be divisible by micro batch_sz x block_sz x ddp_world_size"</span></span>
<span id="cb10-95">grad_accum_steps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> total_batch_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> (</span>
<span id="cb10-96">    GPTConfig.batch_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> GPTConfig.block_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> ddp_world_size</span>
<span id="cb10-97">)</span>
<span id="cb10-98"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> master_process:</span>
<span id="cb10-99">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Total desired batch size: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>total_batch_sz<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb10-100">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Calculated gradient accumulation steps: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>grad_accum_steps<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb10-101"></span>
<span id="cb10-102">train_dl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoaderLight(</span>
<span id="cb10-103">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tinyshakespeare.txt"</span>,</span>
<span id="cb10-104">    batch_sz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>GPTConfig.batch_sz,</span>
<span id="cb10-105">    block_sz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>GPTConfig.block_sz,</span>
<span id="cb10-106">    process_rank<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ddp_rank,</span>
<span id="cb10-107">    number_processes<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ddp_world_size</span>
<span id="cb10-108">)</span>
<span id="cb10-109"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## PyTorch will use TensorFloat32 (TF32) for matmuls if available, else FP32.</span></span>
<span id="cb10-110"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## The weights are still stored in full FP32; only the matmul operands</span></span>
<span id="cb10-111"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## are rounded to TF32 precision (10 mantissa bits instead of 23)</span></span>
<span id="cb10-112"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## while the operations execute</span></span>
<span id="cb10-113">torch.set_float32_matmul_precision(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"high"</span>)</span>
<span id="cb10-114"></span>
<span id="cb10-115"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> step <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(max_steps):</span>
<span id="cb10-116">    start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb10-117">    x, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_dl.next_batch()</span>
<span id="cb10-118">    x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x.to(device)</span>
<span id="cb10-119">    y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y.to(device)</span>
<span id="cb10-120">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## code.interact(local=locals())</span></span>
<span id="cb10-121">    optimizer.zero_grad()</span>
<span id="cb10-122">    loss_accum <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span></span>
<span id="cb10-123">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> macro_step <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(grad_accum_steps):</span>
<span id="cb10-124">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cuda"</span>:</span>
<span id="cb10-125">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Tensors that will be greatly affected by less precision such</span></span>
<span id="cb10-126">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## as loss, layernorm would still be in FP32 while others such</span></span>
<span id="cb10-127">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## as attention weights would be in BF16</span></span>
<span id="cb10-128">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">with</span> torch.autocast(device_type<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>device, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.bfloat16):</span>
<span id="cb10-129">                logits, loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(x, y)</span>
<span id="cb10-130">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb10-131">            logits, loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(x, y)</span>
<span id="cb10-132">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Accumulating gradients alone sums the objective, but we want</span></span>
<span id="cb10-133">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## the mean, so we scale each loss by 1/grad_accum_steps</span></span>
<span id="cb10-134">        loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/=</span> grad_accum_steps</span>
<span id="cb10-135">        loss_accum <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> loss.detach()</span>
<span id="cb10-136">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## To avoid syncing gradients between processes after every macro</span></span>
<span id="cb10-137">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## step, we disable the sync and only allow it once all gradient</span></span>
<span id="cb10-138">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## accumulation steps have finished in each process, i.e. on the</span></span>
<span id="cb10-139">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## last macro step</span></span>
<span id="cb10-140">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> ddp:</span>
<span id="cb10-141">            model.require_backward_grad_sync <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb10-142">                macro_step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> grad_accum_steps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb10-143">            )</span>
<span id="cb10-144">        loss.backward()</span>
<span id="cb10-145">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Each process has its own loss_accum tensor; to report a single</span></span>
<span id="cb10-146">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## value, we average loss_accum across all processes with an</span></span>
<span id="cb10-147">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## all-reduce</span></span>
<span id="cb10-148">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> ddp:</span>
<span id="cb10-149">        dist.all_reduce(loss_accum, op<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>dist.ReduceOp.AVG)</span>
<span id="cb10-150">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Clips gradients to a global norm. This avoids huge updates when</span></span>
<span id="cb10-151">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## some (bad) batch(es) produce a very high loss, which would lead</span></span>
<span id="cb10-152">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## to very high gradients</span></span>
<span id="cb10-153">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## </span><span class="al" style="color: #AD0000;
background-color: null;
font-style: inherit;">NOTE</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: At the beginning of training it is normal to have high norms</span></span>
<span id="cb10-154">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## since the model is initialized randomly</span></span>
<span id="cb10-155">    norm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.utils.clip_grad_norm_(model.parameters(), <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>)</span>
<span id="cb10-156"></span>
<span id="cb10-157">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## </span><span class="al" style="color: #AD0000;
background-color: null;
font-style: inherit;">TODO</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: Use ParamScheduler from `cmn_ai`</span></span>
<span id="cb10-158">    lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sched(step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> max_steps)</span>
<span id="cb10-159">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> pg <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> optimizer.param_groups:</span>
<span id="cb10-160">        pg[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lr"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lr</span>
<span id="cb10-161">    optimizer.step()</span>
<span id="cb10-162">    end <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb10-163">    elapsed_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> end <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start</span>
<span id="cb10-164">    token_per_sec <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb10-165">        GPTConfig.batch_sz</span>
<span id="cb10-166">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> GPTConfig.block_sz</span>
<span id="cb10-167">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> grad_accum_steps</span>
<span id="cb10-168">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> ddp_world_size</span>
<span id="cb10-169">    ) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (elapsed_time)</span>
<span id="cb10-170">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(</span>
<span id="cb10-171">        <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"step </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>step<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, loss: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>loss<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>item()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, lr </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>lr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4e}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, norm: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>norm<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, time: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>elapsed_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s, tok/sec: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>token_per_sec<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb10-172">    )</span>
<span id="cb10-173"></span>
<span id="cb10-174"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> ddp:</span>
<span id="cb10-175">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Cleans up the distributed process group</span></span>
<span id="cb10-176">    destroy_process_group()</span></code></pre></div>
</div>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>We’ve come a long way in this journey—from implementing the core transformer architecture with multi-head attention and positional encodings, to building an efficient training pipeline complete with modern optimizations like flash attention, mixed precision training, and distributed parallelism. We’ve debugged exploding gradients, optimized memory usage, and watched our model evolve from producing random gibberish to generating coherent text. Along the way, we’ve gained deep insights into why each component exists and how they work together to create these remarkable language models.</p>
<p>I hope this deep dive has been as illuminating for you as it has been for me. Writing this implementation forced me to confront gaps in my own understanding and solidified concepts that previously felt abstract. There’s something uniquely satisfying about seeing your hand-built transformer successfully predict its first coherent sentence—a moment where theory truly becomes understanding.</p>
<p>If you’ve made it this far, thank you for joining me on this journey. I’d love to hear about your experiences implementing transformers, any bugs you’ve encountered, optimizations you’ve discovered, or questions this post might have raised. Feel free to reach out with feedback, corrections, or insights—the best part of sharing these implementations is learning from the community’s collective wisdom. Happy building!</p>
</section>
<section id="resources" class="level2">
<h2 class="anchored" data-anchor-id="resources">Resources</h2>
<ul>
<li><a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">GPT2: Language Models are Unsupervised Multitask Learners</a></li>
<li><a href="http://arxiv.org/abs/2005.14165">GPT3: Language Models are Few-Shot Learners</a></li>
<li><a href="https://arxiv.org/abs/1706.03762">Attention is All You Need</a></li>
<li><a href="https://arxiv.org/abs/1608.05859v3">Using the Output Embedding to Improve Language Models</a></li>
<li><a href="https://arxiv.org/abs/2205.14135">FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness</a></li>
</ul>


</section>

]]></description>
  <category>NLP</category>
  <guid>https://imaddabbura.github.io/posts/nlp/GPT2-From-Scratch.html</guid>
  <pubDate>Wed, 10 Apr 2024 05:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/nlp/images/gpt2.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Byte Pair Encoding from Scratch</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/nlp/BPE-Tokenizer.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge evergreen">evergreen</span></div>
<section id="why-tokenization-matters" class="level2">
<h2 class="anchored" data-anchor-id="why-tokenization-matters">Why Tokenization Matters</h2>
<p>When you type “unhappiness” into ChatGPT, the model doesn’t see the word “unhappiness.” It sees something like <code>["un", "happ", "iness"]</code> — three <strong>tokens</strong> that were chosen by an algorithm months before the model was even trained. That algorithm decided, based on statistics from a massive training corpus, that these three pieces are the right granularity. Not individual characters (too many tokens, too little meaning per token). Not whole words (too many unique words, no way to handle words never seen in training). Subwords — the sweet spot.</p>
<p>This isn’t a minor preprocessing detail. Tokenization defines <strong>what the model can see</strong>. Consider three strategies on the same sentence:</p>
<table class="table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Strategy</th>
<th>“The cat sat unhappily” becomes</th>
<th>Tokens</th>
<th>Vocab Size</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Character-level</strong></td>
<td><code>["T","h","e"," ","c","a","t"," ","s","a","t"," ","u","n","h","a","p","p","i","l","y"]</code></td>
<td>21</td>
<td>~256</td>
</tr>
<tr class="even">
<td><strong>Word-level</strong></td>
<td><code>["The", "cat", "sat", "unhappily"]</code></td>
<td>4</td>
<td>100,000+</td>
</tr>
<tr class="odd">
<td><strong>Subword (BPE)</strong></td>
<td><code>["The", " cat", " sat", " un", "happ", "ily"]</code></td>
<td>6</td>
<td>~50,000</td>
</tr>
</tbody>
</table>
<p>With characters, a fixed context window of 2048 tokens covers ~400 words. With subwords, the same window covers ~1500 words — nearly 4× more context for the model to reason over. Word-level is compact but brittle: “unhappily” might never appear in training data, making it an <code>&lt;UNK&gt;</code> token the model is completely blind to. But “un”, “happ”, and “ily” almost certainly do appear — the model can compose meaning from pieces it knows.</p>
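<p>The context-window numbers above come from simple division (a sketch; the per-word averages are assumptions for typical English text, not measurements):</p>

```python
## Rough context-window arithmetic. Averages below are assumptions:
## ~5.1 characters per word including the space, ~1.35 BPE tokens per word.
context_window = 2048
chars_per_word = 5.1
tokens_per_word_bpe = 1.35

print(f"character-level: ~{context_window / chars_per_word:.0f} words")
print(f"BPE subwords:    ~{context_window / tokens_per_word_bpe:.0f} words")
```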
<p>The algorithm that learns these splits is <strong>Byte Pair Encoding (BPE)</strong> — originally a data compression technique (<a href="https://www.derczynski.com/papers/archive/BPE_Gage.pdf">Gage, 1994</a>), adapted for NLP by <a href="https://arxiv.org/abs/1508.07909">Sennrich et al.&nbsp;(2016)</a>, and now used in GPT-2, GPT-3/4, LLaMA, and most modern language models. In this post, we’ll understand how it works, implement it from scratch, and see how GPT-2 refined the basic algorithm for production.</p>
</section>
<section id="how-bpe-works" class="level2">
<h2 class="anchored" data-anchor-id="how-bpe-works">How BPE Works</h2>
<p>The core insight is simple: <strong>if two symbols frequently appear next to each other, they probably belong together.</strong> Merge them into a single token, then look for the next most frequent pair, and repeat. It’s exactly how you’d compress a text file — find repeated patterns and replace them with shorter symbols. Frequent patterns get absorbed into single tokens; rare patterns stay as smaller pieces.</p>
<p>Think of it like learning abbreviations. If you keep writing “machine learning” in your notes, you’d eventually start writing “ML.” Then if “ML model” keeps appearing, maybe you’d abbreviate that too. BPE does the same thing, but systematically and bottom-up — starting from the smallest units (bytes) and building up to subwords.</p>
<section id="seeing-it-in-action" class="level3">
<h3 class="anchored" data-anchor-id="seeing-it-in-action">Seeing It in Action</h3>
<p>Before formalizing the algorithm, let’s watch it work on a real example. Consider a tiny training corpus containing the words <code>"low lower lowest"</code>:</p>
<table class="table">
<thead>
<tr class="header">
<th>Step</th>
<th>Token Sequence</th>
<th>Most Frequent Pair</th>
<th>Frequency</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Start</td>
<td><code>l o w _ l o w e r _ l o w e s t</code></td>
<td>—</td>
<td>—</td>
</tr>
<tr class="even">
<td>Merge 1</td>
<td><code>lo w _ lo w e r _ lo w e s t</code></td>
<td><code>(l, o)</code> → <code>lo</code></td>
<td>3×</td>
</tr>
<tr class="odd">
<td>Merge 2</td>
<td><code>low _ low e r _ low e s t</code></td>
<td><code>(lo, w)</code> → <code>low</code></td>
<td>3×</td>
</tr>
<tr class="even">
<td>Merge 3</td>
<td><code>low _ lowe r _ lowe s t</code></td>
<td><code>(low, e)</code> → <code>lowe</code></td>
<td>2×</td>
</tr>
</tbody>
</table>
<p>BPE discovered that <code>l</code> and <code>o</code> always appear together, then that <code>lo</code> and <code>w</code> always appear together, building up <code>low</code> as a token — effectively learning the word stem. Then it found <code>lowe</code> as a shared prefix of “lower” and “lowest.” Without any linguistic rules, purely from frequency, BPE learned morphological structure.</p>
<p>Notice what happened in merge 2: the algorithm merged <code>lo</code> with <code>w</code>, where <code>lo</code> itself was created in merge 1. BPE builds tokens <strong>hierarchically</strong> — later merges compose earlier ones. We can visualize this as a tree:</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph BT
    l["l (byte)"] --&gt; lo["lo (merge 1)"]
    o["o (byte)"] --&gt; lo
    lo --&gt; low["low (merge 2)"]
    w["w (byte)"] --&gt; low
    low --&gt; lowe["lowe (merge 3)"]
    e["e (byte)"] --&gt; lowe

    style l fill:#f0f0f0,stroke:#999
    style o fill:#f0f0f0,stroke:#999
    style w fill:#f0f0f0,stroke:#999
    style e fill:#f0f0f0,stroke:#999
    style lo fill:#d4e6f1,stroke:#2980b9
    style low fill:#aed6f1,stroke:#2471a3
    style lowe fill:#85c1e9,stroke:#1a5276
</pre>
</div>
<p></p><figcaption> BPE merges build tokens bottom-up. Each merge composes two existing tokens into a new one, forming a hierarchy from bytes to subwords.</figcaption> </figure><p></p>
</div>
</div>
</div>
<p>Each level in the tree depends on the levels below it. This is why merge order matters during encoding — you can’t build <code>low</code> until <code>lo</code> exists.</p>
</section>
<section id="the-algorithm" class="level3">
<h3 class="anchored" data-anchor-id="the-algorithm">The Algorithm</h3>
<p>With the intuition in place, here’s the formal procedure:</p>
<ol type="1">
<li><p><strong>Initialize</strong>: Start with a base vocabulary of all 256 byte values (0–255). Every string can be represented as bytes, so this guarantees full coverage — no <code>&lt;UNK&gt;</code> tokens, ever.</p></li>
<li><p><strong>Count pairs</strong>: Scan the corpus and count every adjacent pair of tokens.</p></li>
<li><p><strong>Merge the most frequent pair</strong>: Create a new token for it and replace all occurrences in the corpus.</p></li>
<li><p><strong>Repeat</strong> steps 2–3 until you’ve done <code>vocab_size - 256</code> merges.</p></li>
</ol>
<p>The output is a <strong>merge table</strong>: an ordered list of pair → token mappings. This table <em>is</em> the tokenizer.</p>
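<p>The loop above is short enough to sketch end to end. Below is a minimal, standalone version (a sketch of the algorithm just described; the helper names are illustrative, and ties in pair frequency are broken by whichever pair <code>max</code> happens to see first):</p>

```python
from collections import Counter

def get_stats(ids):
    """Count adjacent token pairs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_sz):
    """Learn `vocab_sz - 256` merges from raw text."""
    ids = list(text.encode("utf-8"))
    merges = {}                      # (left, right) -> new id, in learned order
    for new_id in range(256, vocab_sz):
        stats = get_stats(ids)
        if not stats:
            break                    # nothing left to merge
        pair = max(stats, key=stats.get)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges, ids

## Reproduce the first two merges from the example corpus
merges, ids = train_bpe("low lower lowest", vocab_sz=256 + 2)
print(merges)   # → {(108, 111): 256, (256, 119): 257}, i.e. l+o then lo+w
```

<p>Note that on this toy corpus the <em>third</em> merge is a tie (<code>" low"</code> and <code>"lowe"</code> both appear twice), so which pair wins depends on tie-breaking; the table above shows the <code>(low, e)</code> choice.</p>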
</section>
<section id="training-vs.-encoding-a-subtle-difference" class="level3">
<h3 class="anchored" data-anchor-id="training-vs.-encoding-a-subtle-difference">Training vs.&nbsp;Encoding: A Subtle Difference</h3>
<p>There’s an important asymmetry between how BPE <em>learns</em> merges (training) and how it <em>applies</em> them to new text (encoding).</p>
<p>During <strong>training</strong>, we always merge the globally most <em>frequent</em> pair — that’s how we decide which merges to learn. But during <strong>encoding</strong>, we apply merges in the <em>order they were learned</em>, not by their frequency in the new text.</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Encoding Replays Merges — It Doesn’t Re-learn Them
</div>
</div>
<div class="callout-body-container callout-body">
<p>A common misconception is that encoding finds the most frequent pair in the new text and merges it. It doesn’t. Encoding applies the <em>training-time</em> merges in their original order. The token <code>low</code> only exists after <code>lo</code> has been created (merge 1). If we tried to merge <code>(lo, w)</code> before creating <code>lo</code>, we’d never find the pair. In the implementation, this shows up as <code>min(stats, key=lambda p: self.merges.get(p, float("inf")))</code> — picking the pair with the <em>lowest merge index</em>, not the highest frequency.</p>
</div>
</div>
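<p>That <code>min</code> trick is easier to see in a small standalone sketch (assuming a <code>merges</code> dict mapping pair → merge index, as built during training; this is an illustration, not the post's full implementation):</p>

```python
def encode(text, merges):
    """Replay learned merges on new text, in training order."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Pick the pair learned EARLIEST, not the most frequent one here
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                      # no learned merge applies anymore
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

## Merge table from the "low lower lowest" example: l+o, lo+w, low+e
merges = {(108, 111): 256, (256, 119): 257, (257, 101): 258}
print(encode("lowest", merges))   # → [258, 115, 116]  ("lowe", "s", "t")
```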
</section>
<section id="why-bytes-not-characters" class="level3">
<h3 class="anchored" data-anchor-id="why-bytes-not-characters">Why Bytes, Not Characters?</h3>
<p>Starting from bytes (0–255) rather than Unicode code points is a practical decision. Unicode has over 150,000 code points — that’s an impractically large base vocabulary. By working at the byte level, we start with just 256 symbols and can represent <em>any</em> string in <em>any</em> language or script. BPE merges then learn to compose bytes into characters, characters into subwords, and subwords into common words — all driven by frequency in the training data.</p>
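<p>The byte-level view is easy to inspect directly: UTF-8 reduces any string to values 0–255, with ASCII taking one byte per character and other scripts two or more.</p>

```python
## UTF-8 turns any string into bytes in 0-255; ASCII is one byte per
## character, while other scripts take several bytes per character.
print(list("low".encode("utf-8")))    # → [108, 111, 119]          (1 byte/char)
print(list("café".encode("utf-8")))   # → [99, 97, 102, 195, 169]  ('é' = 2 bytes)
print(list("猫".encode("utf-8")))      # → [231, 140, 171]          (1 char = 3 bytes)
```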
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Multilingual Tax
</div>
</div>
<div class="callout-body-container callout-body">
<p>Languages underrepresented in training data pay a tokenization tax. English “hello” might be one token, but the same greeting in a low-resource language could take 3–4 tokens because the byte sequences were never frequent enough to merge. This means the model burns more of its context window on the same content — a well-documented source of multilingual inefficiency (<a href="https://arxiv.org/abs/2311.09071">Petrov et al., 2023</a>). It also means the model takes more compute per word for these languages, making inference more expensive.</p>
</div>
</div>
</section>
<section id="vocabulary-size-a-key-hyperparameter" class="level3">
<h3 class="anchored" data-anchor-id="vocabulary-size-a-key-hyperparameter">Vocabulary Size: A Key Hyperparameter</h3>
<p>How many merges should we do? This is the vocabulary size, and it’s a meaningful trade-off:</p>
<table class="table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Vocab Size</th>
<th>Tokens per Text</th>
<th>Embedding Table</th>
<th>Characteristics</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Small</strong> (~1k)</td>
<td>Many — close to character-level</td>
<td>Tiny</td>
<td>Better generalization on rare words, but sequences are long and training is slow</td>
</tr>
<tr class="even">
<td><strong>Medium</strong> (~32k–50k)</td>
<td>Moderate — good compression</td>
<td>Manageable</td>
<td>The sweet spot for most models (GPT-2: 50k, LLaMA: 32k)</td>
</tr>
<tr class="odd">
<td><strong>Large</strong> (~100k+)</td>
<td>Few — common phrases become single tokens</td>
<td>Very large</td>
<td>Risk of overfitting to training distribution; rare tokens get poorly trained embeddings</td>
</tr>
</tbody>
</table>
<p>Larger vocabularies mean each token carries more information, so sequences are shorter and the model sees more context per forward pass. But each token also needs an embedding vector, so the embedding table grows linearly. And tokens that appear rarely in training will have poorly learned embeddings — they’ve simply not been seen enough times.</p>
<p>Most modern models settle in the 32k–100k range. GPT-2 uses ~50k tokens. LLaMA uses 32k. GPT-4 reportedly uses ~100k. The right size depends on the training data, the target languages, and the compute budget.</p>
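<p>The embedding-table cost in that trade-off is simple arithmetic: parameters = vocabulary size × embedding dimension. A quick check using GPT-2 small's published sizes (50,257 tokens, 768-dimensional embeddings); the other vocabulary sizes are illustrative:</p>

```python
## Embedding parameters grow linearly with vocabulary size.
d_model = 768   # GPT-2 small's embedding dimension
for vocab, name in [(1_000, "tiny vocab"), (50_257, "GPT-2"), (100_000, "large vocab")]:
    params = vocab * d_model
    print(f"{name:11s}: {params / 1e6:5.1f}M embedding parameters")
```

<p>For GPT-2 small that works out to ~38.6M parameters for the embedding table alone, a sizeable fraction of its ~124M total.</p>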
</section>
</section>
<section id="implementation" class="level2">
<h2 class="anchored" data-anchor-id="implementation">Implementation</h2>
<p>Let’s turn the algorithm into code. The <code>BPETokenizer</code> class below has four core methods, each mapping directly to a step we’ve discussed:</p>
<ul>
<li><strong><code>train</code></strong>: The learning loop — encode the corpus to bytes, then greedily merge the most frequent pair <code>vocab_size - 256</code> times. Each merge is recorded in <code>self.merges</code> as a <code>(pair) → index</code> mapping. This ordered dictionary <em>is</em> the tokenizer.</li>
<li><strong><code>encode</code></strong>: The encoding step — convert new text to bytes, then apply merges in <em>learned order</em> (earliest first, using the <code>min</code> trick we discussed). This is where training-order matters: we pick the pair with the smallest merge index, not the most frequent.</li>
<li><strong><code>decode</code></strong>: The inverse — look up each token ID in the vocabulary to get its byte sequence, concatenate, and decode back to a string.</li>
<li><strong><code>_get_stats</code> / <code>_merge</code></strong>: Helpers that count adjacent pairs and replace a specific pair with its merged token throughout a sequence.</li>
</ul>
<p>One implementation detail: <code>_build_vocab</code> relies on Python 3.7+ dictionary insertion order. Since merges are inserted chronologically, iterating <code>self.merges</code> replays them in order — each merged token is the byte-concatenation of its two parents, which must already exist in the vocabulary.</p>
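<p>Under that assumption, the rebuild can be sketched as follows (the function body here is illustrative, not copied from the class; only the insertion-order reliance comes from the text):</p>

```python
def build_vocab(merges):
    """Rebuild the id -> bytes mapping by replaying merges in insertion order."""
    vocab = {i: bytes([i]) for i in range(256)}     # base: the 256 raw bytes
    for (left, right), idx in merges.items():       # chronological (dicts keep order)
        vocab[idx] = vocab[left] + vocab[right]     # both parents already exist
    return vocab

## Merge table from the "low lower lowest" example: l+o, lo+w, low+e
merges = {(108, 111): 256, (256, 119): 257, (257, 101): 258}
vocab = build_vocab(merges)
print(vocab[258])   # → b'lowe'
```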
<div id="1517b8b6-f809-44a3-a368-2ecf7073c0d5" class="cell" data-execution_count="23">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## | code-fold: true</span></span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> typing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Iterable</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> requests</span></code></pre></div>
</div>
<div id="a672ce35-a11c-4452-92fa-09b54198aa31" class="cell" data-execution_count="24">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> BPETokenizer:</span>
<span id="cb3-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Byte-pair encoder."""</span></span>
<span id="cb3-3"></span>
<span id="cb3-4">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, vocab_sz: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>):</span>
<span id="cb3-5">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb3-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Args:</span></span>
<span id="cb3-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">            vocab_sz (int): Vocabulary size.</span></span>
<span id="cb3-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        """</span></span>
<span id="cb3-9">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.vocab_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> vocab_sz</span>
<span id="cb3-10">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.vocab <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb3-11">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.merges <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb3-12"></span>
<span id="cb3-13">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> train(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, text: Iterable[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>]):</span>
<span id="cb3-14">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Train Byte-pair encoder."""</span></span>
<span id="cb3-15">        ids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(text.encode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"utf-8"</span>))</span>
<span id="cb3-16">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> idx <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.vocab_sz):</span>
<span id="cb3-17">            stats <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._get_stats(ids)</span>
<span id="cb3-18">            pair <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(stats, key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>stats.get)</span>
<span id="cb3-19">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.merges[pair] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> idx</span>
<span id="cb3-20">            ids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._merge(ids, pair, idx)</span>
<span id="cb3-21">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.vocab <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._build_vocab(ids)</span>
<span id="cb3-22"></span>
<span id="cb3-23">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> encode(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, text):</span>
<span id="cb3-24">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Encode string to bytes using vocabulary built during training."""</span></span>
<span id="cb3-25">        ids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(text.encode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"utf-8"</span>))</span>
<span id="cb3-26"></span>
<span id="cb3-27">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## If ids has fewer than two elements, there is nothing left to merge</span></span>
<span id="cb3-28">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">while</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(ids) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>:</span>
<span id="cb3-29">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## stats is used only for getting pairs next to each other</span></span>
<span id="cb3-30">            stats <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._get_stats(ids)</span>
<span id="cb3-31">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Because vocab (and merges) were built bottom-up, we must apply</span></span>
<span id="cb3-32">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## merges starting from the lowest merge index, since later merges</span></span>
<span id="cb3-33">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## depend on pairs merged earlier</span></span>
<span id="cb3-34">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Pairs never seen in training map to inf below and are skipped</span></span>
<span id="cb3-35">            pair <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>(stats, key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">lambda</span> p: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.merges.get(p, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"inf"</span>)))</span>
<span id="cb3-36">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> pair <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.merges:</span>
<span id="cb3-37">                <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">break</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## No more pairs to merge</span></span>
<span id="cb3-38">            idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.merges[pair]</span>
<span id="cb3-39">            ids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._merge(ids, pair, idx)</span>
<span id="cb3-40">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> ids</span>
<span id="cb3-41"></span>
<span id="cb3-42">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> decode(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, tokens: Iterable[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>]):</span>
<span id="cb3-43">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Decode tokens into string using the vocabulary built during training."""</span></span>
<span id="cb3-44">        tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">b""</span>.join(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.vocab[idx] <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> idx <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> tokens)</span>
<span id="cb3-45">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Replace invalid UTF-8 byte sequences with the Unicode replacement</span></span>
<span id="cb3-46">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## character; otherwise decoding would raise UnicodeDecodeError</span></span>
<span id="cb3-47">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> tokens.decode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"utf-8"</span>, errors<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"replace"</span>)</span>
<span id="cb3-48"></span>
<span id="cb3-49">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _get_stats(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, ids: Iterable[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>]):</span>
<span id="cb3-50">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Get pair counts."""</span></span>
<span id="cb3-51">        counts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb3-52">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> pair <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(ids, ids[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:]):</span>
<span id="cb3-53">            counts[pair] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> counts.get(pair, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb3-54">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> counts</span>
<span id="cb3-55"></span>
<span id="cb3-56">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _merge(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, ids: Iterable[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>], pair: Iterable[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>], idx: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>):</span>
<span id="cb3-57">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Merge pairs that match `pair` with new index `idx`."""</span></span>
<span id="cb3-58">        newids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb3-59">        i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb3-60">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">while</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(ids):</span>
<span id="cb3-61">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(ids) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> pair[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> ids[i] <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> pair[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> ids[i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]:</span>
<span id="cb3-62">                newids.append(idx)</span>
<span id="cb3-63">                i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb3-64">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb3-65">                newids.append(ids[i])</span>
<span id="cb3-66">                i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb3-67">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> newids</span>
<span id="cb3-68"></span>
<span id="cb3-69">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> _build_vocab(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, ids: Iterable[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>]):</span>
<span id="cb3-70">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Build vocabulary from 0-255 bytes and merges."""</span></span>
<span id="cb3-71">        vocab <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {idx: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">bytes</span>([idx]) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> idx <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>)}</span>
<span id="cb3-72">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## We rely on dict items being returned in insertion order,</span></span>
<span id="cb3-73">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## which is guaranteed in Python 3.7+</span></span>
<span id="cb3-74">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> (p0, p1), idx <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.merges.items():</span>
<span id="cb3-75">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## This would be a concatenation of the bytes</span></span>
<span id="cb3-76">            vocab[idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> vocab[p0] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> vocab[p1]</span>
<span id="cb3-77">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> vocab</span></code></pre></div>
</div>
<div id="a26b9c11-cf65-45ca-a03a-7f2acef56f13" class="cell" data-execution_count="25">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> requests.get(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://docs.python.org/3/library/stdtypes.html#bytes.decode"</span>).text</span></code></pre></div>
</div>
<div id="4368c372-793e-4ca7-be14-c8e62b9c9ca9" class="cell" data-execution_count="26">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">tokenizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> BPETokenizer(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>)</span></code></pre></div>
</div>
<div id="6a4e7562-ca4c-4e81-907c-70479b2448ce" class="cell" data-execution_count="27">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">tokenizer.train(text)</span></code></pre></div>
</div>
<div id="3ba82c99-8c22-4c9e-be22-717b6be33ab3" class="cell" data-execution_count="28">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">tokenizer.decode(tokenizer.encode(text)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> text</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="28">
<pre><code>True</code></pre>
</div>
</div>
</section>
<section id="from-vanilla-bpe-to-gpt-2s-tokenizer" class="level2">
<h2 class="anchored" data-anchor-id="from-vanilla-bpe-to-gpt-2s-tokenizer">From Vanilla BPE to GPT-2’s Tokenizer</h2>
<p>The implementation above is vanilla byte-level BPE — it works, but it has a practical problem. Because merges are purely frequency-driven, the algorithm doesn’t respect word boundaries. The word “play” might appear in the corpus as “play.”, “play!”, “play,”, and “play” — and BPE will learn separate tokens for each variant, wasting vocabulary slots on what is essentially the same word with different punctuation.</p>
<p>GPT-2 (<a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">Radford et al., 2019</a>) introduced a key refinement: <strong>pre-tokenization with a regex pattern</strong> that splits text into chunks <em>before</em> BPE runs. The regex prevents merges from crossing certain boundaries — letters can’t merge with digits, punctuation stays separate from words, and spaces attach to the <em>beginning</em> of words rather than the end.</p>
<p>The GPT-2 regex pattern:</p>
<pre><code>'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+</code></pre>
<p>This ensures that:</p>
<ul>
<li><strong>Contractions</strong> are split cleanly: “don’t” → <code>["don", "'t"]</code></li>
<li><strong>Spaces attach to the next word</strong>: “ hello” stays together, preserving word boundaries</li>
<li><strong>Punctuation stays isolated</strong>: “play!” → <code>["play", "!"]</code> instead of learning “play!” as one token</li>
<li><strong>Digits don’t merge with letters</strong>: “h3llo” → <code>["h", "3", "llo"]</code></li>
</ul>
<p>BPE then runs <em>within</em> each chunk independently. The result: a much cleaner vocabulary where tokens correspond to linguistically meaningful units rather than artifacts of adjacent punctuation.</p>
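<p>A minimal sketch of this pre-tokenization step. Python’s standard <code>re</code> module does not support <code>\p{L}</code>/<code>\p{N}</code> (the actual GPT-2 tokenizer uses the third-party <code>regex</code> package), so the letter and digit classes below are ASCII approximations:</p>

```python
import re

# ASCII approximation of the GPT-2 pre-tokenization pattern.
# [A-Za-z] and [0-9] stand in for \p{L} and \p{N}.
GPT2_SPLIT_ASCII = re.compile(
    r"'(?:[sdmt]|ll|ve|re)"   # contractions: 's 'd 'm 't 'll 've 're
    r"| ?[A-Za-z]+"           # optional leading space + letters
    r"| ?[0-9]+"              # optional leading space + digits
    r"| ?[^\sA-Za-z0-9]+"     # optional leading space + punctuation
    r"|\s+(?!\S)"             # whitespace not followed by a non-space
    r"|\s+"                   # remaining whitespace
)

print(GPT2_SPLIT_ASCII.findall("don't play! h3llo"))
# → ["don", "'t", " play", "!", " h", "3", "llo"]
```

Note how the chunks reproduce the bullet-point examples above: the contraction splits off, spaces attach to the following word, punctuation stays isolated, and digits separate from letters.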
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Try It Yourself
</div>
</div>
<div class="callout-body-container callout-body">
<p>Use <a href="https://tiktokenizer.vercel.app">Tiktokenizer</a> to see how GPT-2 and GPT-4 tokenize arbitrary text. Try pasting the same sentence in English and another language — you’ll immediately see the multilingual tokenization tax in action: the non-English version will use significantly more tokens for the same meaning.</p>
</div>
</div>
<p>This pre-tokenization pattern has been refined in later models. GPT-4 uses a <a href="https://github.com/openai/tiktoken">more sophisticated pattern</a> that handles apostrophes, numbers, and whitespace more carefully, and also limits the length of digit sequences to avoid learning overly specific number tokens. The core idea remains the same: constrain where merges can happen to produce a more useful vocabulary.</p>
</section>
<section id="references-resources" class="level2">
<h2 class="anchored" data-anchor-id="references-resources">References &amp; Resources</h2>
<ul>
<li><strong>Gage, P.</strong> (1994). <a href="https://www.derczynski.com/papers/archive/BPE_Gage.pdf">A New Algorithm for Data Compression</a>. <em>The C Users Journal</em>. The original BPE paper — a compression algorithm that found new life in NLP.</li>
<li><strong>Sennrich, R. et al.</strong> (2016). <a href="https://arxiv.org/abs/1508.07909">Neural Machine Translation of Rare Words with Subword Units</a>. <em>ACL 2016</em>. The paper that adapted BPE for NLP tokenization.</li>
<li><strong>Radford, A. et al.</strong> (2019). <a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">Language Models are Unsupervised Multitask Learners</a>. The GPT-2 paper that introduced byte-level BPE with regex pre-tokenization.</li>
<li><strong>Karpathy, A.</strong> (2024). <a href="https://www.youtube.com/watch?v=zduSFxRajkE">Let’s build the GPT Tokenizer</a>. Excellent video walkthrough of building a BPE tokenizer from scratch.</li>
<li><a href="https://www.reedbeta.com/blog/programmers-intro-to-unicode/">A Programmer’s Introduction to Unicode</a> — why bytes vs.&nbsp;code points matters.</li>
<li><a href="https://utf8everywhere.org/">UTF-8 Everywhere</a> — the case for UTF-8 as the universal encoding.</li>
<li><a href="https://tiktokenizer.vercel.app">Tiktokenizer</a> — interactive web app to visualize how different tokenizers split text.</li>
</ul>


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>NLP</category>
  <guid>https://imaddabbura.github.io/posts/nlp/BPE-Tokenizer.html</guid>
  <pubDate>Wed, 10 Apr 2024 05:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/nlp/images/bpe-tokenizer.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>The RAG Optimization Playbook</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/nlp/improving-rag.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge evergreen">evergreen</span></div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>RAG-based applications have so many components and moving parts that it can seem impossible to optimize them or even know where to start. Add to that how fast the field changes, which makes it hard to keep up. So I’ve gathered, over time, a few ideas for improving RAG-based applications from reading research papers and from implementations I’ve deployed in the past.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>The list will keep changing as I learn and implement new things.</p>
</div>
</div>
</section>
<section id="ideas" class="level2">
<h2 class="anchored" data-anchor-id="ideas">Ideas</h2>
<ul>
<li>Metadata filtering is key for good RAG apps</li>
<li>For an MVP, it is a good idea to combine bi-encoders with full-text search such as TF-IDF or BM25</li>
<li>ColBERT reranker is great and less sensitive to chunking</li>
<li>Keep at least 50 characters of overlap between chunks when splitting so context isn’t cut off</li>
<li>If you have data and compute, always fine-tune both encoders if you can</li>
<li>Use <code>sentence-transformers</code> to fine-tune embedding models
<ul>
<li>We typically use a triplet loss where each query has positive and negative examples. We want the negatives to be hard negatives -&gt; very close to the positives, so the model learns to differentiate between them</li>
</ul></li>
<li>Large/new LLMs are not necessarily good embedding models and may not be worth the latency. Models with ~1B parameters are enough for most cases</li>
<li>Challenges with embedding models:
<ul>
<li>May not transfer to your domain</li>
<li>Fixed vocabulary set when the model was trained</li>
<li>Because a chunk/doc is represented by one vector that combines all of its tokens, the output vector may dilute the meaning, especially for long texts -&gt; be careful about your chunking strategy</li>
</ul></li>
<li>Always start with a baseline such as BM25 (Best Match 25)</li>
<li>Build your own gold dataset and check its correlation with synthetic dataset generated from LLMs</li>
<li>Chunking beyond 256 tokens hurts high-precision search: it dilutes the vector representation, since many embedding models (e.g., BERT-based encoders) were not trained on long contexts</li>
<li>User feedback on the app is key to guiding where we should focus our efforts to improve it
<ul>
<li>Satisfaction ratings such as “How did we do today” or “Did we answer your question”</li>
</ul></li>
<li>Monitor <code>cosine</code> similarity between query and retrieved-doc embeddings, along with the reranking scores returned by the reranker (e.g., Cohere)</li>
<li>Cluster questions into topics using tools such as LDA or BERTopic, and focus on the largest topics (by count) that have the lowest mean cosine similarity and feedback scores</li>
<li>We have two kinds of topics:
<ul>
<li>Content topics: Topics for which we don’t have an inventory of documents</li>
<li>Capability topics: Topics the reader will never be able to answer unless we capture them in our docs and doc metadata and include them in the prompt. For example, “Who last updated the pricing document” is asking about the last modified date/person</li>
</ul></li>
<li>Build a classifier to classify questions in real time for better observability and a faster reaction to sudden changes in usage</li>
<li>Generate synthetic data (questions) for topics we’re not doing great job at and evaluate new improvements on the generated questions
<ul>
<li>This can be done by giving a decent LLM a random chunk from docs belonging to the topics we’re trying to improve and asking it to generate questions</li>
</ul></li>
<li>We can use an LLM to extract metadata about docs/objects</li>
<li>LanceDB is a good vector database for small-scale workloads</li>
<li>BM25 (full-text search) outperforms similarity search when questions are just looking up file names; otherwise its performance is comparable to a similarity-search baseline.
<ul>
<li>It is always helpful to include BM25</li>
</ul></li>
<li>We can do citation through prompting and attaching IDs to chunks</li>
<li>Fine-tuning the embedding model is key for domain-specific RAGs
<ul>
<li>With the recent increase in context window sizes, a chunk size of 800 with 30% overlap is recommended</li>
</ul></li>
</ul>
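<p>On combining bi-encoder and BM25 results: the two ranked lists still need to be merged somehow. A simple, score-free option is reciprocal rank fusion (RRF). Below is a minimal pure-Python sketch; the document IDs are invented for illustration:</p>

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant used in the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the two retrievers:
bm25_hits = ["doc3", "doc1", "doc7"]     # full-text search ranking
vector_hits = ["doc1", "doc5", "doc3"]   # bi-encoder similarity ranking

print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# → ['doc1', 'doc3', 'doc5', 'doc7']
```

Docs that appear high in both lists (here <code>doc1</code> and <code>doc3</code>) bubble to the top without having to normalize BM25 scores against cosine similarities.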
</section>
<section id="resources" class="level2">
<h2 class="anchored" data-anchor-id="resources">Resources</h2>
<ul>
<li><a href="https://arxiv.org/abs/2309.10621">Large language models can accurately predict searcher preferences</a></li>
<li><a href="https://arxiv.org/abs/2406.06519">UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor</a></li>
</ul>


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>NLP</category>
  <guid>https://imaddabbura.github.io/posts/nlp/improving-rag.html</guid>
  <pubDate>Tue, 05 Mar 2024 06:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/nlp/images/rag.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Inside Python’s Modules and Packages: The Machinery Behind import</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/python/Modules-And-Packages.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge growing">growing</span></div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>Python’s simplicity and versatility are largely attributed to its extensive ecosystem of modules and packages. These essential components enable developers to write clean, reusable, and efficient code, whether for simple scripts or complex applications.</p>
<p>This article aims to deepen our understanding of Python’s modules and packages and the machinery involved, helping us become more effective Python programmers. We will explore their structure and functionality, covering everything from the basics of importing to creating custom packages and managing dependencies. By unpacking the underlying machinery of how modules/packages get imported and what they really are, we’ll gain insights that will enhance our coding practices and project organization.</p>
<ul>
<li>Python has only one type of module object, regardless of the language the module was implemented in (C/Python/…)</li>
<li>A package provides a naming hierarchy to organize modules (analogous to a directory in a Unix file system):
<ul>
<li>All packages are modules but not all modules are packages</li>
<li>A package is a module that has a <code>__path__</code> attribute</li>
<li>A package can include subpackages (sub-directories)</li>
</ul></li>
<li>There are two types of packages:
<ul>
<li><strong>Regular packages</strong> are directories that have <code>__init__.py</code>. When importing a package/subpackage -&gt; implicitly executes all <code>__init__.py</code> files on the path and binds objects to names in the package’s namespace
<ul>
<li>When import machinery is looking for the package, it stops once it finds it</li>
</ul></li>
<li><strong>Namespace packages</strong> are directories that don’t have <code>__init__.py</code>
<ul>
<li>When the import machinery is looking for the package, it does not stop at the first match: a regular package may still exist on some other path in <code>sys.path</code>, so it keeps a record of all namespace packages found during the search. If it finds a regular package with that name -&gt; discard all the namespace packages and import the regular package. If it doesn’t -&gt; combine the paths of all the namespace packages it found into <code>namespace_path</code>, so that importing a subpackage or module checks all the paths in <code>namespace_path</code> (which is a list)</li>
<li>There can be multiple packages of the same name (under different directories) -&gt; They are all combined together and <code>namespace_path</code> holds the paths of all of them. Therefore, the same package name can refer to completely different modules in different directories</li>
<li>Python first scans the whole sys.path before deciding that the package is a namespace -&gt; If any name is found with <code>__init__.py</code> in it, it will give this priority and don’t continue.</li>
</ul></li>
</ul></li>
<li>When importing subpackages such as <code>foo.bar.baz</code>, Python first imports <code>foo</code>, then <code>foo.bar</code>, then <code>foo.bar.baz</code>
<ul>
<li>Each of these will be cached in <code>sys.modules</code></li>
</ul></li>
<li><code>__init__.py</code> makes a directory a regular Python package
<ul>
<li>We can use it to import useful objects from different modules/subpackages so they are available to users at the package top level</li>
<li>When importing an object, it has <code>__module__</code> attribute which determines the global environment for the object</li>
<li>We can define <code>__all__</code> in <code>__init__</code> as concatenation of all <code>__all__</code> in modules
<ul>
<li>Example: <code>__all__ = foo.__all__ + bar.__all__</code>, but we need to first import <code>foo</code> and <code>bar</code> (or something from them) so they are defined, e.g. <code>from .foo import *</code> or <code>from .foo import y</code></li>
</ul></li>
<li><code>__init__.py</code> can also be used to initialize state or monkeypatch other modules</li>
</ul></li>
<li>Each package has <code>__path__</code> attribute that helps when searching for subpackages. This will be given to path finder when loading subpackages
<ul>
<li>It is a list, similar to <code>sys.path</code>, so we can modify it, though that is not recommended</li>
<li>Example (using a package such as <code>email</code>; plain modules like <code>math</code> have no <code>__path__</code>): <code>import email; email.__path__.append("~/Documents")</code></li>
</ul></li>
<li>Relative imports are preferred to absolute imports inside packages, to avoid breakage if the package is renamed</li>
<li>Packages get loaded once even if we import them multiple times</li>
<li>We can in theory alias or replace a package in the cache like this:
<ul>
<li><code>sys.modules[new_name] = module_object</code></li>
</ul></li>
<li>If we use <code>python -m package.module</code>, it executes the module as the main program and <strong>relative imports</strong> work. Otherwise, relative imports won’t work.
<ul>
<li><code>m</code> stands for module</li>
</ul></li>
<li>The <code>__main__</code> module is a special case relative to Python’s import system. The <code>__main__</code> module is directly initialized at interpreter startup, much like <code>sys</code> and <code>builtins</code>. The manner in which <code>__main__</code> is initialized depends on the flags and other options with which the interpreter is invoked</li>
<li><code>__main__.py</code> designates main for a package/subpackage and also allows package directory to be executable -&gt; explicitly marks the entry point. Examples:
<ul>
<li><code>python package</code> would look for <code>__main__.py</code> to execute it</li>
<li><code>python -m package.subpackage</code> would look for <code>__main__.py</code> inside package/subpackage to execute</li>
<li><code>__package__</code> is set so the relative imports still work</li>
<li>A lot of programming tools use this to their benefit: <code>python -m profile script.py</code> or <code>python -m pdb script.py</code></li>
<li>NOTE THAT <code>__init__.py</code> files on the path will still be executed</li>
</ul></li>
<li>Depending on how <code>__main__</code> is initialized, <code>__main__.__spec__</code> gets set appropriately or to None.
<ul>
<li>When Python is started with the -m option, <code>__spec__</code> is set to the module spec of the corresponding module or package. <code>__spec__</code> is also populated when the <code>__main__</code> module is loaded as part of executing a directory, zipfile or other sys.path entry.</li>
<li>Otherwise, it will be set to None</li>
</ul></li>
</ul>
<p><img src="https://imaddabbura.github.io/posts/python/images/executable-submodules.png" width="400px"></p>
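<p>To make the namespace-package merging above concrete, here is a minimal, self-contained sketch; the package name <code>nspkg</code> and the temporary directory layout are made up for illustration:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import os, sys, tempfile

## Two sys.path entries, each holding an "nspkg" directory WITHOUT __init__.py
root = tempfile.mkdtemp()
for part in ("a", "b"):
    d = os.path.join(root, part, "nspkg")
    os.makedirs(d)
    with open(os.path.join(d, "mod_%s.py" % part), "w") as f:
        f.write("value = %r\n" % part)
    sys.path.insert(0, os.path.join(root, part))

import nspkg                      ## no regular package found: namespace package
print(len(list(nspkg.__path__)))  ## both directories are merged
from nspkg import mod_a, mod_b    ## submodules live in different directories
print(mod_a.value, mod_b.value)</code></pre></div>
<p>Both directories end up in the combined <code>__path__</code>, so <code>mod_a</code> and <code>mod_b</code> are imported from different <code>sys.path</code> entries under a single package name.</p>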
</section>
<section id="sys.path" class="level2">
<h2 class="anchored" data-anchor-id="sys.path">Sys.path</h2>
<ul>
<li><code>importlib</code> has a rich API for interacting with the import system. It is preferred over <code>__import__()</code></li>
<li><code>__import__</code> only does module search and creation, without the name binding</li>
<li><code>import</code> does everything: module search, creation, and name binding. It calls <code>__import__</code> under the hood</li>
<li><code>.egg</code> files are just directories or .zip files with extra metadata for package managers</li>
<li><code>sys.path</code> is the last place Python looks when searching for a module/package we try to import. It traverses the list from start to end
<ul>
<li>It can contain directories, .zip files, and .egg files</li>
<li>First match wins</li>
<li>If the module can’t be found there, it cannot be imported</li>
</ul></li>
<li><code>sys.prefix</code> is where Python’s library is installed (<strong>os.py is the landmark</strong>) and <code>sys.exec_prefix</code> is where compiled binaries are stored (<strong>lib-dynload is the landmark</strong>)
<ul>
<li>With virtual environments, each environment has its own <code>sys.prefix</code></li>
<li><code>sys.path</code> is constructed from <code>sys.prefix</code>, <code>PYTHONPATH</code>, and <code>site.py</code>. Setting <code>PYTHONHOME</code> overrides <code>sys.prefix</code> and <code>sys.exec_prefix</code></li>
<li>Python looks for its libraries starting from the interpreter’s location and keeps going up toward the root of the file system, using <code>os.py</code> as a landmark</li>
<li><code>python -S</code> skips site.py</li>
<li><code>python -vv</code> to see what python tries to do with every statement</li>
<li>Setting PYTHONPATH to some directories will insert them into the beginning of sys.path. Example:
<ul>
<li><code>env PYTHONPATH="/Users/imad/Documents" python</code> runs Python with Documents inserted at the beginning of <code>sys.path</code></li>
</ul></li>
<li><code>site.py</code> appends the path to third-party libraries. This is where installed packages get stored. Example: <code>/usr/local/lib/python3.4/site-packages</code></li>
</ul></li>
<li>Python now has built-in virtual environments, created using the <code>venv</code> module
<ul>
<li><code>python -m venv env_name</code> will create a new environment called <em>env_name</em></li>
<li>This environment will include a few directories such as include, lib, site-packages, bin, and a pyvenv.cfg file</li>
<li>This new environment has no third-party libraries or any system-wide libraries such as those in /usr/local</li>
<li>All third-party libraries will be installed in the site-packages directory</li>
<li>The Python binary refers to the original Python installation from which the environment was created</li>
<li>We can use <code>source path_to_env_name/bin/activate</code> to activate the environment and <code>deactivate</code> to deactivate it. Finally, <code>rm -r path_to_env_name</code> removes it (or <code>poetry env remove</code> if we created it with <strong>poetry</strong>)</li>
</ul></li>
<li>Files with <code>.pth</code> extension in site-packages directory get added to the sys.path. We can list directories in those files that will be added to sys.path for any new instance of Python
<ul>
<li>Package managers and other third-party packages use this kind of hack to add paths to the sys.path</li>
</ul></li>
<li>The <code>sitecustomize</code> and <code>usercustomize</code> modules can also be used to add entries to <code>sys.path</code></li>
<li>The current working directory (or the script’s directory) is placed at the beginning of the path</li>
</ul>
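<p>A quick way to inspect the locations described above from a running interpreter (the output is environment-specific):</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import sys, site

print(sys.path[:3])            ## searched start to end; first match wins
print(sys.prefix)              ## base installation (per-venv when one is active)
print(sys.exec_prefix)         ## where compiled components live
print(site.getsitepackages())  ## where third-party packages are installed</code></pre></div>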
</section>
<section id="modules" class="level2">
<h2 class="anchored" data-anchor-id="modules">Modules</h2>
<ul>
<li>Modules are just objects of type ModuleType. They act like a dictionary holding references to the objects they contain: <code>module.__dict__</code>
<ul>
<li>When importing a module, it executes the module from top to bottom before returning to the caller</li>
<li>Module can be namespace, py file, execution environment for statements or container of global variables</li>
<li>We can set/delete attributes. <code>module.x = 10</code> is the same as <code>module.__dict__['x'] = 10</code></li>
<li>The dictionary has preset attributes such as <code>__path__</code>, <code>__loader__</code> …</li>
<li>Main attributes:
<ul>
<li><code>__name__</code> : ## Module name</li>
<li><code>__file__</code> : ## Associated source file (if any)</li>
<li><code>__doc__</code> : ## Doc string</li>
<li><code>__path__</code> : ## Package path. It is used to look for package subcomponents</li>
<li><code>__package__</code> : ## The module’s <code>__package__</code> attribute must be set. Its value must be a string, but it can be the same value as its <code>__name__</code>. When the module is a package, its <code>__package__</code> value should be set to its <code>__name__</code>. When the module is not a package, <code>__package__</code> should be set to the empty string for top-level modules, or for submodules, to the parent package’s name.</li>
<li><code>__spec__</code> : ## Module spec</li>
</ul></li>
</ul></li>
<li>The main difference between plain modules and packages is that packages have <code>__path__</code> defined and <code>__package__</code> set to their own <code>__name__</code></li>
<li><code>sys.modules</code> serves as a cache for all imported modules/packages
<ul>
<li>It is a dictionary so we can delete/set keys</li>
<li>If we delete a module, it will force Python to import it when we reimport it</li>
<li>If we set a module’s key to None, importing it results in an <code>ImportError</code></li>
</ul></li>
<li>Even if we import only one object from a module/package, the whole module/package is cached in <code>sys.modules</code>, but it is not bound in the global namespace</li>
<li>The module created during loading and passed to exec_module() may not be the one returned at the end of the import
<ul>
<li>This can happen if the imported module sets <code>sys.modules[__name__]</code> to some other module</li>
</ul></li>
<li>The module’s attributes are set after creation and before execution</li>
<li>Execution of the module is what populates the module’s <code>__dict__</code> (namespace of the module). This is done by the loader</li>
<li>When a submodule is loaded using any mechanism, a binding is placed in the parent module’s namespace to the submodule object. For example, if we have a package called spam that has a submodule foo and it imports any of its objects like <code>from .foo import x</code>, after importing spam, spam will have an attribute foo which is bound to the submodule -&gt; We can now use <code>spam.foo</code></li>
<li>Relative imports use leading dots. A single leading dot indicates a relative import, starting with the current package. Two or more leading dots indicate a relative import to the parent(s) of the current package, one level per dot after the first.
<ul>
<li>Relative imports can only use this form of import: <code>from &lt;&gt; import &lt;&gt;</code></li>
<li>It can’t use <code>import .&lt;&gt;</code> because this is not a valid expression</li>
</ul></li>
<li>Absolute imports have to start from the top level package and go downward to refer to the module:
<ul>
<li><code>from package.subpackage import module</code></li>
<li>Not recommended, because renaming the package means changing every import statement; relative imports are more robust and don’t depend on the package name</li>
</ul></li>
<li>Process when importing a module/package (after locating it):
<ol type="1">
<li>First checks if it is cached. If not, continue</li>
<li>It creates a ModuleType object with that name</li>
<li>Cache the module in sys.modules</li>
<li>Executes the source code inside the module (the source file, module name plus .py, is assigned to <code>__file__</code>)
<ul>
<li>In the case of a package/subpackage, <code>__file__</code> is its <code>__init__.py</code> file</li>
<li>It also executes all the <code>__init__.py</code> files on the path</li>
</ul></li>
<li>Binds a name to the module object</li>
</ol></li>
</ul>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sys, types</span>
<span id="cb1-2"></span>
<span id="cb1-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> import_module(modname):</span>
<span id="cb1-4">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Check if it is in the cache first</span></span>
<span id="cb1-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> modname <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> sys.modules:</span>
<span id="cb1-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sys.modules[modname]</span>
<span id="cb1-7">    </span>
<span id="cb1-8">    sourcepath <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> modname <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'.py'</span></span>
<span id="cb1-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(sourcepath, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'r'</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> f:</span>
<span id="cb1-10">        sourcecode <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> f.read()</span>
<span id="cb1-11">    mod <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> types.ModuleType(modname)</span>
<span id="cb1-12">    mod.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__file__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sourcepath</span>
<span id="cb1-13">    </span>
<span id="cb1-14">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Cache the module</span></span>
<span id="cb1-15">    sys.modules[modname] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mod</span>
<span id="cb1-16">    </span>
<span id="cb1-17">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Convert it to Python ByteCode</span></span>
<span id="cb1-18">    code <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">compile</span>(sourcecode, sourcepath, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'exec'</span>)</span>
<span id="cb1-19">    </span>
<span id="cb1-20">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Execute the code in the module from top to bottom</span></span>
<span id="cb1-21">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## And update the state (globals) in the module's dictionary</span></span>
<span id="cb1-22">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">exec</span>(code, mod.__dict__)</span>
<span id="cb1-23">    </span>
<span id="cb1-24">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## We return the cached one in case there is some patching inside the module</span></span>
<span id="cb1-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sys.modules[modname]</span></code></pre></div>
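<p>The caching behavior described above can be checked directly; the name <code>blocked_mod</code> is a hypothetical module used only to trigger the error:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import sys
import json, math

assert "json" in sys.modules          ## cached on first import
assert hasattr(json, "__path__")      ## json is a package
assert not hasattr(math, "__path__")  ## math is a plain module

del sys.modules["json"]               ## forces a fresh import next time
import json                           ## re-executes the module

sys.modules["blocked_mod"] = None     ## a None entry blocks the import
try:
    import blocked_mod
except ImportError as exc:
    print("import blocked:", exc)</code></pre></div>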
</section>
<section id="module-compilation" class="level2">
<h2 class="anchored" data-anchor-id="module-compilation">Module Compilation</h2>
<p><img src="https://imaddabbura.github.io/posts/python/images/module-compilation.png" width="400px"></p>
<ul>
<li>Python puts a lock on a module while importing it, so multiple threads don’t import the same module at the same time</li>
<li><code>__import__</code> is the machinery behind <code>import</code> statement</li>
<li>We can use <code>importlib.import_module(module)</code> which is the same thing as <code>__import__</code>
<ul>
<li><code>importlib.import_module('spam')</code> is the same as <code>import spam</code></li>
<li><code>importlib.import_module('.spam', __package__)</code> is the same as <code>from . import spam</code></li>
<li>We can track all imports as follows:</li>
</ul></li>
</ul>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> builtins</span>
<span id="cb2-2"></span>
<span id="cb2-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> imp_mod(modname, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args, imp<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">__import__</span>):</span>
<span id="cb2-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Importing </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>modname<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb2-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> imp(modname, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args)</span>
<span id="cb2-6"></span>
<span id="cb2-7">builtins.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">__import__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> imp_mod</span></code></pre></div>
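<p>And a quick check that <code>importlib.import_module</code> behaves like the <code>import</code> statement (the relative-form comment uses a hypothetical sibling module):</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import importlib

math = importlib.import_module("math")   ## same as: import math
assert math.sqrt(9) == 3.0

## Inside a package, the relative form would be (sibling is hypothetical):
## sibling = importlib.import_module(".sibling", __package__)</code></pre></div>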
<ul>
<li><strong>Module Reloading</strong>:
<ul>
<li>It is not a good idea to reload a module because it creates zombies. Basically Python doesn’t try to clean up the dictionary from the old module, but instead exec() the new state of the module using the old <code>module.__dict__</code>. This means stuff from previous load may still exist and we end up having weird cases. This is how Python reloads a module:</li>
</ul>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">code <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(module.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__file__</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'rb'</span>).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">read</span>()</span>
<span id="cb3-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">exec</span>(code, module.__dict__)</span></code></pre></div>
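<p>The zombie effect is easy to reproduce; the module name <code>zombie_demo</code> and its temporary source file are made up for illustration:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import importlib, os, sys, tempfile

sys.dont_write_bytecode = True        ## keep the demo free of stale .pyc files
d = tempfile.mkdtemp()
src = os.path.join(d, "zombie_demo.py")
with open(src, "w") as f:
    f.write("x = 1\ny = 2\n")
sys.path.insert(0, d)

import zombie_demo
with open(src, "w") as f:
    f.write("x = 10\n")               ## the new source no longer defines y

importlib.reload(zombie_demo)
print(zombie_demo.x)                  ## updated to 10
print(zombie_demo.y)                  ## zombie: 2 survives in the old __dict__</code></pre></div>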
<ul>
<li>Also, submodules that are loaded in the module/package don’t get reloaded; they keep their old version. Example: if the module has <code>import pandas as pd</code>, reloading the module doesn’t reload pandas.</li>
<li>Also, if instances were created from the old version of the module and we then reload it, new instances of the same class will refer to different code than the old ones. Even though they appear to come from the same class, old and new instances will have different types</li>
</ul></li>
<li><code>sys.path</code> is only a small part of the import machinery</li>
<li>Imports are actually controlled by <code>sys.meta_path</code>
<ul>
<li>It is a list of importers</li>
</ul>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">[_frozen_importlib.BuiltinImporter,</span>
<span id="cb4-2">_frozen_importlib.FrozenImporter,</span>
<span id="cb4-3">_frozen_importlib_external.PathFinder,</span>
<span id="cb4-4"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>six._SixMetaPathImporter at <span class="bn" style="color: #AD0000;
background-color: null;
font-style: inherit;">0x10c8769b0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>,</span>
<span id="cb4-5"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>pkg_resources.extern.VendorImporter at <span class="bn" style="color: #AD0000;
background-color: null;
font-style: inherit;">0x10dbf9300</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>]</span></code></pre></div>
<ul>
<li>Python’s default sys.meta_path has three meta path finders: one that knows how to import built-in modules, one that knows how to import frozen modules, and one that knows how to import modules from an import path</li>
<li>For every import statement, Python walks sys.meta_path from start to end, asking each finder whether it knows how to import the module</li>
</ul></li>
</ul>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sys</span>
<span id="cb5-2"></span>
<span id="cb5-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> find_spec(modname):</span>
<span id="cb5-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> imp <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> sys.meta_path:</span>
<span id="cb5-5">        spec <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> imp.find_spec(modname)</span>
<span id="cb5-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> spec:</span>
<span id="cb5-7">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> spec</span>
<span id="cb5-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span></code></pre></div>
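<p>The same search is exposed through <code>importlib.util.find_spec</code>, which returns <code>None</code> for names no finder recognizes:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import importlib.util

spec = importlib.util.find_spec("json")
print(spec.name, spec.origin)                       ## full name, source file
assert spec.submodule_search_locations is not None  ## json is a package

## An unknown top-level name simply yields None
assert importlib.util.find_spec("no_such_module_xyz") is None</code></pre></div>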
<ul>
<li><p>ModuleSpec of a module is its metadata that the loader uses to load it. We can also use <code>importlib.util.find_spec()</code> to get the module spec of any loaded package. If the package/module is not found -&gt; returns None. Example of pandas module spec:</p>
<div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">  ModuleSpec(name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pandas'</span>, loader<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=&lt;</span>_frozen_importlib_external.SourceFileLoader <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">object</span> at <span class="bn" style="color: #AD0000;
background-color: null;
font-style: inherit;">0x10e609f90</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>, origin<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas/__init__.py'</span>, submodule_search_locations<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas'</span>])</span></code></pre></div>
<ul>
<li>Module Spec main info:
<ul>
<li>spec.name : ## Full module name</li>
<li>spec.parent : ## Enclosing package</li>
<li>spec.submodule_search_locations : ## Package <strong>path</strong></li>
<li>spec.has_location : ## Has external location</li>
<li>spec.origin : ## Source file location</li>
<li>spec.cached : ## Cached location</li>
<li>spec.loader : ## Loader object</li>
</ul></li>
<li>We can use the <code>loader</code> from the module spec to get the source code without importing it. Loaders are what actually create the imported module:</li>
</ul>
<div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">module <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spec.loader.create_module(spec)</span>
<span id="cb7-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> module:</span>
<span id="cb7-3">    module <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> types.ModuleType(spec.name)</span>
<span id="cb7-4">    module.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__file__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spec.origin</span>
<span id="cb7-5">    module.__loader__ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spec.loader</span>
<span id="cb7-6">    module.__package__ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spec.parent</span>
<span id="cb7-7">    module.__path__ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spec.submodule_search_locations</span>
<span id="cb7-8">    module.__spec__ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spec</span></code></pre></div>
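<p>The supported shortcut for the manual sketch above is <code>importlib.util.module_from_spec</code>, which creates the module without executing it:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import importlib.util, sys

spec = importlib.util.find_spec("json")
module = importlib.util.module_from_spec(spec)  ## created, not yet executed
sys.modules[spec.name] = module                 ## cache before executing
spec.loader.exec_module(module)                 ## populates module.__dict__
print(module.dumps({"ok": True}))</code></pre></div>
<p>Caching before <code>exec_module</code> matters: imports inside the module can then see the partially initialized module, just as the real machinery allows.</p>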
<ul>
<li>We can create a module from a spec with <code>importlib.util.module_from_spec</code>. This DOES NOT LOAD THE MODULE, it only creates it. To load it, the module must be cached with <code>sys.modules[spec.name] = module</code> and then executed with <code>spec.loader.exec_module(module)</code>. <code>exec_module</code> is what populates the module’s <code>__dict__</code>.</li>
</ul></li>
<li><p>We can execute modules lazily on first access. Implementation example:</p></li>
</ul>
<div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sys, types</span>
<span id="cb8-2"></span>
<span id="cb8-3"></span>
<span id="cb8-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> _Module(types.ModuleType):</span>
<span id="cb8-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">pass</span></span>
<span id="cb8-6"></span>
<span id="cb8-7"></span>
<span id="cb8-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> _LazyModule(_Module):</span>
<span id="cb8-9"></span>
<span id="cb8-10">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, spec):</span>
<span id="cb8-11">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(spec.name) </span>
<span id="cb8-12">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__file__</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spec.origin</span>
<span id="cb8-13">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.__package__ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spec.parent </span>
<span id="cb8-14">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.__loader__ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spec.loader</span>
<span id="cb8-15">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.__path__ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spec.submodule_search_locations </span>
<span id="cb8-16">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.__spec__ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spec</span>
<span id="cb8-17"></span>
<span id="cb8-18">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__getattr__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, name):</span>
<span id="cb8-19">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.__class__ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _Module</span>
<span id="cb8-20">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.__spec__.loader.exec_module(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>)</span>
<span id="cb8-21">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">assert</span> sys.modules[<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span></span>
<span id="cb8-22">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">getattr</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, name)</span></code></pre></div>
<div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> importlib.util, sys</span>
<span id="cb9-2"></span>
<span id="cb9-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> lazy_import(name):</span>
<span id="cb9-4">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## If already loaded, return the module</span></span>
<span id="cb9-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> name <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> sys.modules:</span>
<span id="cb9-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> sys.modules[name]</span>
<span id="cb9-7">    </span>
<span id="cb9-8">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Not loaded. Find the spec</span></span>
<span id="cb9-9">    spec <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> importlib.util.find_spec(name)</span>
<span id="cb9-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> spec:</span>
<span id="cb9-11">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">ImportError</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'No module </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>name<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>r<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb9-12">    </span>
<span id="cb9-13">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Check for compatibility</span></span>
<span id="cb9-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">hasattr</span>(spec.loader, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'exec_module'</span>):</span>
<span id="cb9-15">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">ImportError</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Not supported'</span>)</span>
<span id="cb9-16"></span>
<span id="cb9-17">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Perform the lazy import</span></span>
<span id="cb9-18">    module <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sys.modules[name] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _LazyModule(spec)</span>
<span id="cb9-19">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> module</span></code></pre></div>
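<p>The standard library offers a supported version of this pattern: <code>importlib.util.LazyLoader</code> wraps a real loader so that module execution is deferred until the first attribute access. A minimal sketch (using <code>json</code> purely as a demo module):</p>

```python
import importlib.util
import sys

def lazy_import(name):
    """Return a module whose body is not executed until first attribute access."""
    if name in sys.modules:
        return sys.modules[name]
    spec = importlib.util.find_spec(name)
    if spec is None:
        raise ImportError(f'No module {name!r}')
    # LazyLoader wraps the real loader and postpones exec_module()
    spec.loader = importlib.util.LazyLoader(spec.loader)
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    spec.loader.exec_module(module)  # registers the lazy module; body still not run
    return module

j = lazy_import("json")
print(j.dumps({"a": 1}))  # attribute access here triggers the actual import
```

<p>This achieves the same effect as the hand-rolled <code>_LazyModule</code> above, without subclassing <code>ModuleType</code> ourselves.</p>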
<ul>
<li>Therefore, module creation and loading have been decoupled in recent versions of Python</li>
<li>We can insert a finder into <code>sys.meta_path</code> that changes the behavior of imports
<ul>
<li>If it is placed at the beginning, it supersedes all other finders, and we can do crazy things</li>
</ul>
<div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sys</span>
<span id="cb10-2"></span>
<span id="cb10-3"></span>
<span id="cb10-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> Watcher(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">object</span>):</span>
<span id="cb10-5"></span>
<span id="cb10-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@classmethod</span></span>
<span id="cb10-7">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> find_spec(cls, name, path, target<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>):</span>
<span id="cb10-8">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Importing'</span>, name, path, target)</span>
<span id="cb10-9">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb10-10"></span>
<span id="cb10-11"></span>
<span id="cb10-12">sys.meta_path.insert(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, Watcher)</span></code></pre></div></li>
<li>We can also use this idea to add logic such as auto-installing missing packages with pip. We insert the installer at the end of <code>sys.meta_path</code> so it only runs once every other finder has failed</li>
</ul>
<div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sys</span>
<span id="cb11-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> subprocess</span>
<span id="cb11-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> importlib.util</span>
<span id="cb11-4"></span>
<span id="cb11-5"></span>
<span id="cb11-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> AutoInstall(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">object</span>):</span>
<span id="cb11-7">    _loaded <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>()</span>
<span id="cb11-8"></span>
<span id="cb11-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@classmethod</span></span>
<span id="cb11-10">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> find_spec(cls, name, path, target<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>):</span>
<span id="cb11-11">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> path <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> name <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> cls._loaded: </span>
<span id="cb11-12">            cls._loaded.add(name)</span>
<span id="cb11-13">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Installing"</span>, name)</span>
<span id="cb11-14">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">try</span>:</span>
<span id="cb11-15">                out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> subprocess.check_output(</span>
<span id="cb11-16">                          [sys.executable, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'-m'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pip'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'install'</span>, name])</span>
<span id="cb11-17">                <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> importlib.util.find_spec(name) </span>
<span id="cb11-18">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">except</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">Exception</span> <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> e:</span>
<span id="cb11-19">                <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Failed"</span>)</span>
<span id="cb11-20">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb11-21">sys.meta_path.append(AutoInstall)</span></code></pre></div>
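<p>Before installing anything, it is worth seeing a meta-path finder in action in isolation. The self-contained sketch below (class and attribute names are illustrative) records every import attempt and then defers to the normal machinery by returning <code>None</code>:</p>

```python
import sys

class ImportLogger:
    """Meta-path finder that records import attempts without handling them."""
    seen = []

    @classmethod
    def find_spec(cls, name, path=None, target=None):
        cls.seen.append(name)
        return None  # defer to the remaining finders on sys.meta_path

sys.meta_path.insert(0, ImportLogger)
try:
    sys.modules.pop("colorsys", None)  # force a real import, not a cache hit
    import colorsys
finally:
    sys.meta_path.remove(ImportLogger)  # always undo global state changes

print("colorsys" in ImportLogger.seen)  # True
```

<p>Note the cache-bust: if the module is already in <code>sys.modules</code>, the import statement never consults <code>sys.meta_path</code> at all.</p>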
<ul>
<li><p>We can also import packages that are not installed on the system from other sources, such as Redis</p></li>
<li><p><code>sys.path_hooks</code> is a list of callables that build the finder responsible for locating and loading modules/packages from a given kind of path entry</p>
<ul>
<li>Each entry in <code>sys.path</code> is tested against the list of <strong>path hooks</strong> to associate a module finder with that path entry</li>
<li>Path finders are used to locate modules and return a module spec along with a loader</li>
<li>Path finders get cached in <code>sys.path_importer_cache</code></li>
</ul></li>
<li><p>Finders (both meta-path finders and path-entry finders) expose <code>find_spec()</code>, which returns the <strong>spec</strong> of a module if they know how to find it. Otherwise, they return <code>None</code></p></li>
<li><p>What happens during import:</p></li>
</ul>
<div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1">modname <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'somemodulename'</span></span>
<span id="cb12-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> entry <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> sys.path:</span>
<span id="cb12-3">    finder <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sys.path_importer_cache[entry]</span>
<span id="cb12-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> finder:</span>
<span id="cb12-5">        spec <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> finder.find_spec(modname)</span>
<span id="cb12-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> spec:</span>
<span id="cb12-7">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">break</span></span>
<span id="cb12-8"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb12-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">ImportError</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'No such module'</span>)</span>
<span id="cb12-10">...</span>
<span id="cb12-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Load module from the spec</span></span>
<span id="cb12-12">...</span></code></pre></div>
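<p>The pseudocode above can be exercised for real through <code>importlib.machinery.PathFinder</code>, the <code>sys.meta_path</code> entry that implements path-based finding (it walks <code>sys.path</code> and consults <code>sys.path_importer_cache</code> internally):</p>

```python
from importlib.machinery import PathFinder

# PathFinder searches sys.path for us and returns a ModuleSpec on success
spec = PathFinder.find_spec("json")
print(spec.name, spec.origin)

# ...and returns None when no path entry knows the module
print(PathFinder.find_spec("no_such_module_xyz"))  # None
```
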
</section>
<section id="experiments" class="level2">
<h2 class="anchored" data-anchor-id="experiments">Experiments</h2>
<div id="4efe21f4-70e0-4876-a98d-51fbed950551" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1">sys.path.append(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/Users/imad/Desktop/"</span>)</span></code></pre></div>
</div>
<div id="3a689c5c-b33a-44c8-a65d-4f109f396781" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pck.mod <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> X</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>pck.mod</code></pre>
</div>
</div>
<div id="8c327a3a-78f7-444f-aee2-df36dee0bb7d" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1">X</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="4">
<pre><code>100</code></pre>
</div>
</div>
<div id="420680d8-7c17-4db5-8b4c-dd234f31d620" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pck.test <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> X</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>pck.test</code></pre>
</div>
</div>
<div id="099cd685-593f-474e-b20d-936c57883a19" class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1">sys.modules[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pck"</span>].__path__</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="6">
<pre><code>_NamespacePath(['/Users/imad/Documents/python-materials/modules-and-packages/pck', '/Users/imad/Documents/python-materials/modules-and-packages/pck', '/Users/imad/Desktop/pck'])</code></pre>
</div>
</div>
<div id="26ae3e7c-313a-4388-bc49-9fc9e2d7a162" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1">foo.__package__, foo.__path__</span></code></pre></div>
<div class="cell-output cell-output-error">
<pre><code>AttributeError: module 'package.foo' has no attribute '__path__'</code></pre>
</div>
</div>
<div id="f204330c-1eff-4da4-a9fd-167e7a74c177" class="cell" data-execution_count="18">
<div class="sourceCode cell-code" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb24-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">globals</span>()[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"foo"</span>]</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="18">
<pre><code>&lt;module 'package.foo' from '/Users/imad/Documents/python-materials/modules-and-packages/package/foo.py'&gt;</code></pre>
</div>
</div>
<div id="596a075a-f58c-42d7-9d54-f1c0bbcef01b" class="cell" data-execution_count="21">
<div class="sourceCode cell-code" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb26-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> f():</span>
<span id="cb26-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">pass</span></span></code></pre></div>
</div>
<div id="ce368771-aeb3-420b-847d-ef6418735c45" class="cell" data-execution_count="23">
<div class="sourceCode cell-code" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb27-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> read_csv</span></code></pre></div>
</div>
<div id="38e03cfd-d29b-4564-bc74-e5b0d9bcaa2c" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb28" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb28-1">sys.path_hooks</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="2">
<pre><code>[zipimport.zipimporter,
 &lt;function _frozen_importlib_external.FileFinder.path_hook.&lt;locals&gt;.path_hook_for_FileFinder(path)&gt;]</code></pre>
</div>
</div>
<div id="86da7a73-13f4-4c85-83c6-7c209370a0f0" class="cell" data-execution_count="17">
<div class="sourceCode cell-code" id="cb30" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb30-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(sys.path_importer_cache.keys())[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>]</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="17">
<pre><code>['/Users/imad/anaconda3/envs/python-exp/lib/python310.zip',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/encodings',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/importlib',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/lib-dynload',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/PyYAML-6.0-py3.10-macosx-10.9-x86_64.egg',
 '/Users/imad/Documents/python-materials/modules-and-packages',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/ipykernel',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/json']</code></pre>
</div>
</div>
<div id="e4ece638-93da-4b93-a594-52d3984b93c1" class="cell" data-execution_count="10">
<div class="sourceCode cell-code" id="cb32" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb32-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> importlib.util <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> find_spec</span></code></pre></div>
</div>
<div id="9e70eb1d-aee1-4c99-8222-0f1a114b78ba" class="cell" data-execution_count="14">
<div class="sourceCode cell-code" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb33-1">m <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> find_spec(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mod"</span>)</span>
<span id="cb33-2">m.loader.get_source(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mod"</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="14">
<pre><code>'y = 200\nprint(y)\n\nclass A:\n    print("A")\n'</code></pre>
</div>
</div>
<div id="83c9c028-470a-49b3-a6cc-d9f967425469" class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb35-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> sys</span>
<span id="cb35-2">sys.meta_path</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="8">
<pre><code>[_frozen_importlib.BuiltinImporter,
 _frozen_importlib.FrozenImporter,
 _frozen_importlib_external.PathFinder,
 &lt;six._SixMetaPathImporter at 0x10c8769b0&gt;,
 &lt;pkg_resources.extern.VendorImporter at 0x10dbf9300&gt;]</code></pre>
</div>
</div>
<div id="89c67452-6262-4f27-8ec2-df6628331245" class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb37-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> mod</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>200
A</code></pre>
</div>
</div>
<div id="f02003c4-f556-4db3-92d3-949103c9f7a5" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb39" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb39-1">a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mod.A()</span></code></pre></div>
</div>
<div id="bde71e6e-5ca3-49e0-8c90-6d268ca6ec55" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb40" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb40-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> importlib <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">reload</span></span></code></pre></div>
</div>
<div id="35796286-737b-4a22-a13c-d02fece9e169" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb41" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb41-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">reload</span>(mod)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>200
A</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="5">
<pre><code>&lt;module 'mod' from '/Users/imad/Documents/python-materials/modules-and-packages/mod.py'&gt;</code></pre>
</div>
</div>
<div id="41431941-be17-499a-b2af-699511a131c8" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb44" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb44-1">b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> mod.A()</span></code></pre></div>
</div>
<div id="de332317-b541-45cf-9da2-991928a725eb" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb45" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb45-1">a.__class__, b.__class__, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(a) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(b)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="4">
<pre><code>(mod.A, mod.A, True)</code></pre>
</div>
</div>
<div id="345de28d-d954-4a18-b2c4-580bf0a3b6a5" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb47" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb47-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> importlib.util <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> find_spec</span></code></pre></div>
</div>
<div id="291d72dd-a4eb-45ca-83d5-dddc6856ddc3" class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb48" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb48-1">find_spec(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sys"</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="7">
<pre><code>ModuleSpec(name='sys', loader=&lt;class '_frozen_importlib.BuiltinImporter'&gt;, origin='built-in')</code></pre>
</div>
</div>
<div id="08e6b24b-7546-4545-a621-73a9a63fa2ec" class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb50" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb50-1">find_spec(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pandas"</span>)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="6">
<pre><code>ModuleSpec(name='pandas', loader=&lt;_frozen_importlib_external.SourceFileLoader object at 0x10e609f90&gt;, origin='/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas/__init__.py', submodule_search_locations=['/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas'])</code></pre>
</div>
</div>
<div id="4c91a469-0147-495b-8e41-92232fbede57" class="cell" data-execution_count="21">
<div class="sourceCode cell-code" id="cb52" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb52-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> importlib</span></code></pre></div>
</div>
<div id="22a22d5e-b7bd-431d-99b4-cdcea27de573" class="cell">
<div class="sourceCode cell-code" id="cb53" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb53-1">pd <span class="op" style="color: #5E5E5E; background-color: null; font-style: inherit;">=</span> importlib.import_module(<span class="st" style="color: #20794D; background-color: null; font-style: inherit;">"pandas"</span>)</span></code></pre></div>
</div>
<div id="76d1435d-6edd-4834-967c-98800b78b12c" class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb54" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb54-1">pd.__path__, pd.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__name__</span>, pd.__package__, pd.<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">__file__</span>, pd.__doc__</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="11">
<pre><code>(['/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas'],
 'pandas',
 'pandas',
 '/Users/imad/anaconda3/envs/python-exp/lib/python3.10/site-packages/pandas/__init__.py',
 '\npandas - a powerful data analysis and manipulation library for Python\n=====================================================================\n\n**pandas** is a Python package providing fast, flexible, and expressive data\nstructures designed to make working with "relational" or "labeled" data both\neasy and intuitive. It aims to be the fundamental high-level building block for\ndoing practical, **real world** data analysis in Python. Additionally, it has\nthe broader goal of becoming **the most powerful and flexible open source data\nanalysis / manipulation tool available in any language**. It is already well on\nits way toward this goal.\n\nMain Features\n-------------\nHere are just a few of the things that pandas does well:\n\n  - Easy handling of missing data in floating point as well as non-floating\n    point data.\n  - Size mutability: columns can be inserted and deleted from DataFrame and\n    higher dimensional objects\n  - Automatic and explicit data alignment: objects can be explicitly aligned\n    to a set of labels, or the user can simply ignore the labels and let\n    `Series`, `DataFrame`, etc. 
automatically align the data for you in\n    computations.\n  - Powerful, flexible group by functionality to perform split-apply-combine\n    operations on data sets, for both aggregating and transforming data.\n  - Make it easy to convert ragged, differently-indexed data in other Python\n    and NumPy data structures into DataFrame objects.\n  - Intelligent label-based slicing, fancy indexing, and subsetting of large\n    data sets.\n  - Intuitive merging and joining data sets.\n  - Flexible reshaping and pivoting of data sets.\n  - Hierarchical labeling of axes (possible to have multiple labels per tick).\n  - Robust IO tools for loading data from flat files (CSV and delimited),\n    Excel files, databases, and saving/loading data from the ultrafast HDF5\n    format.\n  - Time series-specific functionality: date range generation and frequency\n    conversion, moving window statistics, date shifting and lagging.\n')</code></pre>
</div>
</div>


</section>

]]></description>
  <category>Software Engineering</category>
  <guid>https://imaddabbura.github.io/posts/python/Modules-And-Packages.html</guid>
  <pubDate>Fri, 09 Feb 2024 06:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/python/images/modules-packages-image.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Automatic Differentiation Demystified</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/mlsys/automatic-differentiation.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge evergreen">evergreen</span></div>
<section id="the-derivative-engine-behind-every-loss.backward" class="level2">
<h2 class="anchored" data-anchor-id="the-derivative-engine-behind-every-loss.backward">The Derivative Engine Behind Every <code>loss.backward()</code></h2>
<p>Every time you train a neural network, something computes exact derivatives through millions of operations automatically. You call <code>loss.backward()</code> and gradients appear — but <em>how</em>? And why does training a 7B-parameter LLM consume 5x more GPU memory than running inference on it?</p>
<p>The answer to both questions is <strong>Automatic Differentiation (AD)</strong>: a family of techniques for computing exact derivatives through arbitrary code, efficiently. Understanding it changes how you reason about memory budgets, gradient flow failures, and why certain training tricks (gradient checkpointing, mixed precision) exist at all.</p>
<p>There are two fundamentally different approaches — <strong>forward mode</strong> and <strong>reverse mode</strong> — and the choice between them explains why deep learning frameworks are built the way they are.</p>
</section>
<section id="why-not-just-use-calculus-or-finite-differences" class="level2">
<h2 class="anchored" data-anchor-id="why-not-just-use-calculus-or-finite-differences">Why Not Just Use Calculus or Finite Differences?</h2>
<p>Before getting to AD, it helps to understand what it replaced.</p>
<p><strong>Numerical differentiation</strong> approximates the derivative using finite differences: <img src="https://latex.codecogs.com/png.latex?f'(x)%20%5Capprox%20%5Cfrac%7Bf(x+h)%20-%20f(x)%7D%7Bh%7D"> for some small <img src="https://latex.codecogs.com/png.latex?h">. It’s dead simple but has two fatal flaws: it requires one extra forward pass <em>per parameter</em> (catastrophic for millions of parameters), and floating-point subtraction of nearly-equal numbers amplifies numerical error badly.</p>
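<p>A quick sketch of the finite-difference approach (my own illustration, with a hypothetical <code>f</code> — not code from this post). Note that each probed input costs one extra evaluation of <code>f</code>, and the result is close to, but never exactly, the true derivative:</p>

```python
def f(x):
    return x * x  # true derivative: 2x

def finite_diff(f, x, h=1e-7):
    """Forward-difference approximation of f'(x) — one extra f-evaluation per input."""
    return (f(x + h) - f(x)) / h

approx = finite_diff(f, 3.0)  # close to the true value 6.0, but only approximate
```

For a model with millions of parameters, this one-evaluation-per-parameter cost is exactly the "one extra forward pass per parameter" problem described above.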
<p><strong>Symbolic differentiation</strong> (what a computer algebra system does) applies calculus rules to produce a closed-form derivative expression. It’s exact, but the resulting expressions grow exponentially with computation depth — a 100-layer network would produce a gradient expression no machine could reasonably evaluate.</p>
<p>AD is neither. It applies the chain rule mechanically at each elementary operation, accumulating intermediate values rather than symbolic expressions. The result is exact (to floating-point precision) and efficient — no expression explosion, no extra passes per parameter.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Three Ways to Differentiate Code
</div>
</div>
<div class="callout-body-container callout-body">
<table class="table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Method</th>
<th>Accuracy</th>
<th>Cost</th>
<th>Practical for ML?</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Numerical (finite diff)</td>
<td>Approximate</td>
<td>1 extra pass per input</td>
<td>❌ Too slow</td>
</tr>
<tr class="even">
<td>Symbolic</td>
<td>Exact</td>
<td>Expression explosion</td>
<td>❌ Intractable</td>
</tr>
<tr class="odd">
<td>AD — forward mode</td>
<td>Exact</td>
<td>1 pass per input</td>
<td>⚠️ Only if few inputs</td>
</tr>
<tr class="even">
<td>AD — reverse mode</td>
<td>Exact</td>
<td>1 pass per output</td>
<td>✅ Standard choice</td>
</tr>
</tbody>
</table>
</div>
</div>
</section>
<section id="forward-mode-ad-sensitivity-flowing-downstream" class="level2">
<h2 class="anchored" data-anchor-id="forward-mode-ad-sensitivity-flowing-downstream">Forward Mode AD: Sensitivity Flowing Downstream</h2>
<p>Forward mode AD propagates <strong>derivatives alongside values</strong> as computation flows from inputs to outputs. At each operation, it tracks not just the result but how sensitive that result is to a chosen input.</p>
<p>The elegant implementation uses <strong>dual numbers</strong>: instead of a scalar <img src="https://latex.codecogs.com/png.latex?x">, carry a pair <img src="https://latex.codecogs.com/png.latex?(x,%5C%20%5Cdot%7Bx%7D)"> where <img src="https://latex.codecogs.com/png.latex?%5Cdot%7Bx%7D"> represents the derivative of <img src="https://latex.codecogs.com/png.latex?x"> with respect to some chosen input <img src="https://latex.codecogs.com/png.latex?x_i">. Operations on dual numbers automatically propagate the derivative via the chain rule — you never write it explicitly:</p>
<p><img src="https://latex.codecogs.com/png.latex?f(a%20+%20b%5Cvarepsilon)%20%5Capprox%20f(a)%20+%20f'(a)%5Ccdot%20b%5Cvarepsilon%20%5Cqquad%20(%5Cvarepsilon%5E2%20=%200)"></p>
<p>The <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon"> coefficient carries the derivative forward through every arithmetic operation.</p>
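<p>A minimal dual-number sketch (my own illustration, not this post's code) that reproduces the computation L = x₁·x₂ + x₃ with the seed on x₁ — operator overloading applies the chain rule without it ever being written down:</p>

```python
class Dual:
    """A (value, derivative) pair; the rule ε² = 0 is baked into __mul__."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        # sum rule: derivatives add
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule: (a + ȧε)(b + ḃε) = ab + (aḃ + ȧb)ε
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

# Seed selects x₁: its derivative component is 1, all others 0
x1, x2, x3 = Dual(2.0, 1.0), Dual(3.0), Dual(5.0)
L = x1 * x2 + x3
# L.val == 11.0 and L.dot == 3.0, i.e. ∂L/∂x₁ = x₂
```

Getting ∂L/∂x₂ as well would require a second pass with the seed moved to x₂ — the limitation discussed next.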
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart LR
    x1["x₁\n(x₁, ẋ₁=1)"] --&gt; mul["×"]
    x2["x₂\n(x₂, ẋ₂=0)"] --&gt; mul
    mul --&gt;|"(x₁x₂, x₂·1)"| add["+"]
    x3["x₃\n(x₃, ẋ₃=0)"] --&gt; add
    add --&gt;|"(x₁x₂+x₃, x₂)"| L["L\n∂L/∂x₁ = x₂"]
</pre>
</div>
<p></p><figcaption> Forward mode propagates (value, derivative) pairs from inputs to output. The derivative component tracks sensitivity w.r.t. one chosen input. Here, the seed is set for x₁, so x₂’s dot is 0.</figcaption> </figure><p></p>
</div>
</div>
</div>
<p>The critical limitation: the initial <strong>seed vector</strong> — the <img src="https://latex.codecogs.com/png.latex?(0,%5Cldots,1,%5Cldots,0)"> that selects which input you’re differentiating with respect to — means one forward pass gives you the sensitivity with respect to <em>one</em> input. Getting gradients for all <img src="https://latex.codecogs.com/png.latex?n"> inputs requires <img src="https://latex.codecogs.com/png.latex?n"> passes.</p>
<p>For a 7B-parameter LLM, that’s 7 billion passes to compute a single gradient update. Forward mode is not the answer for ML.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
When Forward Mode Wins
</div>
</div>
<div class="callout-body-container callout-body">
<p>Forward mode is efficient when <strong>outputs greatly outnumber inputs</strong> — the opposite of ML. It shines in scientific computing: a simulation with 3 input parameters and 10,000 output metrics needs only 3 forward passes, not 10,000. In ML the ratio is reversed: millions of inputs (parameters), one output (scalar loss). Reverse mode exists to handle exactly this case.</p>
</div>
</div>
</section>
<section id="reverse-mode-ad-tracing-blame-upstream" class="level2">
<h2 class="anchored" data-anchor-id="reverse-mode-ad-tracing-blame-upstream">Reverse Mode AD: Tracing Blame Upstream</h2>
<p>Reverse mode flips the direction. Instead of asking “how does changing this input affect the output?”, it asks “how much did each node contribute to this output?”</p>
<p>The key insight: for a scalar output (a loss function), <strong>one backward pass distributes gradient credit back to every node in the graph simultaneously</strong>. One pass. All gradients.</p>
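<p>The mechanism can be sketched in a few lines of Python (a micrograd-style toy of my own, not any framework's actual implementation). Each operation records its parents and the local derivative toward each; <code>backward()</code> then distributes credit from the output back through the graph:</p>

```python
class Var:
    """Scalar with toy reverse-mode autodiff."""
    def __init__(self, val, parents=()):
        # parents: (parent_node, local_derivative) pairs recorded at build time
        self.val, self.grad, self._parents = val, 0.0, parents

    def __mul__(self, other):
        # d(a*b)/da = b,  d(a*b)/db = a  -- note the backward pass needs the
        # forward values (other.val, self.val): this is the stored-activation cost
        return Var(self.val * other.val,
                   parents=((self, other.val), (other, self.val)))

    def __add__(self, other):
        return Var(self.val + other.val, parents=((self, 1.0), (other, 1.0)))

    def backward(self):
        # Simple traversal; correct for this tree-shaped graph. Real frameworks
        # topologically sort so shared nodes accumulate all credit first.
        self.grad = 1.0
        stack = [self]
        while stack:
            node = stack.pop()
            for parent, local_grad in node._parents:
                parent.grad += local_grad * node.grad
                stack.append(parent)

w, x, b = Var(3.0), Var(2.0), Var(1.0)
L = w * x + b
L.backward()
# one backward pass fills w.grad, x.grad, and b.grad simultaneously
```

<p>This mirrors the figure below: <code>L = w·x + b</code>, and a single call to <code>backward()</code> yields ∂L/∂w = x, ∂L/∂x = w, ∂L/∂b = 1 at once.</p>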
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart TD
    subgraph fwd ["① Forward Pass — compute and store"]
        direction LR
        x["x"] --&gt; mul["mul"] --&gt; add["add"] --&gt; L["L (scalar)"]
        w["w"] --&gt; mul
        b["b"] --&gt; add
    end
    subgraph bwd ["② Backward Pass — propagate gradients"]
        direction RL
        dL["∂L/∂L = 1"] --&gt; dadd["∂L/∂add"] --&gt; dmul["∂L/∂mul"]
        dmul --&gt; dx["∂L/∂x"]
        dmul --&gt; dw["∂L/∂w"]
        dadd --&gt; db["∂L/∂b"]
    end
    fwd --&gt; bwd
</pre>
</div>
<p></p><figcaption> Reverse mode runs two phases: a forward pass that computes and stores all intermediate values, then a backward pass that propagates ∂L/∂· back to every node.</figcaption> </figure><p></p>
</div>
</div>
</div>
<section id="the-unavoidable-memory-cost" class="level3">
<h3 class="anchored" data-anchor-id="the-unavoidable-memory-cost">The Unavoidable Memory Cost</h3>
<p>Here’s the catch. To compute gradients during the backward pass, each operation needs its <strong>inputs from the forward pass</strong>. For a <code>mul</code> node computing <img src="https://latex.codecogs.com/png.latex?z%20=%20w%20%5Ccdot%20x">, the backward step needs both <img src="https://latex.codecogs.com/png.latex?w"> and <img src="https://latex.codecogs.com/png.latex?x"> to distribute credit:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20w%7D%20=%20x%20%5Ccdot%20%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20z%7D,%20%5Cqquad%20%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20x%7D%20=%20w%20%5Ccdot%20%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20z%7D"></p>
<p>So the framework must <strong>keep every intermediate tensor alive</strong> until the backward pass consumes it. The consequence:</p>
<ul>
<li><strong>Inference</strong>: each layer’s activations can be discarded once the next layer is computed → memory is roughly <img src="https://latex.codecogs.com/png.latex?O(1)"> in depth</li>
<li><strong>Training</strong>: all activations must survive until their gradient is computed → memory is <img src="https://latex.codecogs.com/png.latex?O(N)"> in depth</li>
</ul>
<p>This is why training a transformer consumes so much more memory than running inference on it. At large batch sizes, forward activations alone can dwarf the parameter memory.</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Why Your GPU OOMs During Training But Not Inference
</div>
</div>
<div class="callout-body-container callout-body">
<p>During inference, each layer’s output overwrites the previous buffer — memory stays roughly constant regardless of model depth. During training, every layer’s output must survive until the backward pass reaches it. A 24-layer transformer holds 24 layers of activations simultaneously. Scale batch size by 4x and activation memory scales 4x too — parameters don’t budge, activations do. This is the first thing to check when you hit an OOM that doesn’t happen at inference time.</p>
</div>
</div>
</section>
<section id="gradient-checkpointing-buying-memory-back-with-compute" class="level3">
<h3 class="anchored" data-anchor-id="gradient-checkpointing-buying-memory-back-with-compute">Gradient Checkpointing: Buying Memory Back with Compute</h3>
<p>The standard solution to activation memory pressure is <strong>gradient checkpointing</strong> (also called activation recomputation): don’t store all activations during the forward pass. Store only at segment boundaries — <strong>checkpoints</strong> — and recompute intermediate activations on-the-fly during the backward pass when they’re needed.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart LR
    subgraph s1 ["Segment 1"]
        L1["Layer 1"] --&gt; L2["Layer 2"] --&gt; L3["Layer 3"]
    end
    subgraph s2 ["Segment 2"]
        L4["Layer 4"] --&gt; L5["Layer 5"] --&gt; L6["Layer 6"]
    end
    s1 --&gt;|"✓ checkpoint"| s2
    style L1 fill:#e8f5e9
    style L3 fill:#e8f5e9
    style L4 fill:#e8f5e9
    style L6 fill:#e8f5e9
</pre>
</div>
<p></p><figcaption> Checkpointing stores activations only at segment boundaries (green). During backward, each segment re-runs its forward pass to recover the discarded intermediates.</figcaption> </figure><p></p>
</div>
</div>
</div>
<table class="table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Strategy</th>
<th>Activation memory</th>
<th>Compute overhead</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>No checkpointing</td>
<td><img src="https://latex.codecogs.com/png.latex?O(N)"> layers</td>
<td>None</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?%5Csqrt%7BN%7D"> checkpoints</td>
<td><img src="https://latex.codecogs.com/png.latex?O(%5Csqrt%7BN%7D)"> layers</td>
<td>~1 extra forward pass</td>
</tr>
<tr class="odd">
<td>Recompute everything</td>
<td><img src="https://latex.codecogs.com/png.latex?O(1)"></td>
<td>Up to <img src="https://latex.codecogs.com/png.latex?N"> extra forward passes</td>
</tr>
</tbody>
</table>
<p>The sweet spot for most LLM training is <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7BN%7D"> checkpoints — roughly one extra forward pass in exchange for a meaningful memory reduction. This is what <code>torch.utils.checkpoint.checkpoint_sequential</code> implements.</p>
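<p>The trade-off can be made concrete with a pure-Python sketch (my own illustration of the recomputation idea, not PyTorch's implementation) that counts layer evaluations and stored activations:</p>

```python
def forward_backward_with_checkpointing(layers, x, segment_size, counter):
    """Sketch of activation recomputation.

    Forward: apply every layer but keep only segment-boundary activations.
    Backward: re-run each segment's forward to recover the discarded
    intermediates that the gradient computation would need.
    """
    def apply(layer, v):
        counter["layer_evals"] += 1
        return layer(v)

    # Forward pass: store checkpoints only
    checkpoints, v = [x], x
    for i, layer in enumerate(layers):
        v = apply(layer, v)
        if (i + 1) % segment_size == 0:
            checkpoints.append(v)

    # Backward pass: recompute each segment's intermediates on demand
    for s in reversed(range(len(checkpoints) - 1)):
        u = checkpoints[s]
        for layer in layers[s * segment_size:(s + 1) * segment_size]:
            u = apply(layer, u)  # recomputation; gradient math would go here
    return v, len(checkpoints)

layers = [lambda t: t + 1] * 9          # 9 dummy "layers"
counter = {"layer_evals": 0}
out, stored = forward_backward_with_checkpointing(layers, 0, 3, counter)
# 9 layers, 3 checkpoints: 18 evaluations total (one extra forward pass)
# while holding only 4 activations alive instead of 9
```

With 9 layers split into √9 = 3 segments, total compute is 2N layer evaluations — the "roughly one extra forward pass" in the table — while stored activations drop from N to about √N.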
</section>
</section>
<section id="the-trade-off-stated-clearly" class="level2">
<h2 class="anchored" data-anchor-id="the-trade-off-stated-clearly">The Trade-off, Stated Clearly</h2>
<table class="table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th>Forward Mode</th>
<th>Reverse Mode</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Passes needed</strong></td>
<td>1 per input variable</td>
<td>1 per output variable</td>
</tr>
<tr class="even">
<td><strong>Best for</strong></td>
<td>Few inputs, many outputs</td>
<td>Many inputs, few outputs (ML)</td>
</tr>
<tr class="odd">
<td><strong>Memory overhead</strong></td>
<td>Low — no stored intermediates</td>
<td>High — all intermediates stored</td>
</tr>
<tr class="even">
<td><strong>What frameworks use</strong></td>
<td>Occasionally for Jacobians</td>
<td>Always for gradient-based training</td>
</tr>
</tbody>
</table>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Jacobian Perspective
</div>
</div>
<div class="callout-body-container callout-body">
<p>Forward mode naturally computes a <strong>Jacobian-vector product (JVP)</strong> — the full Jacobian multiplied by a chosen input direction. Reverse mode naturally computes a <strong>vector-Jacobian product (VJP)</strong> — a chosen output direction multiplied by the full Jacobian. For a scalar loss, the VJP with direction <img src="https://latex.codecogs.com/png.latex?%5B1%5D"> gives you the complete gradient vector in one pass. This is the mathematical reason reverse mode dominates ML training.</p>
</div>
</div>
</section>
<section id="what-breaks-in-practice" class="level2">
<h2 class="anchored" data-anchor-id="what-breaks-in-practice">What Breaks in Practice</h2>
<p><strong>Gradient flow failures.</strong> In reverse mode, gradients are products of local Jacobians chained across all layers. If any factor is consistently small (saturating activations, poor initialization) or large (unbounded weights), the gradient signal degrades before reaching early layers. This is the vanishing/exploding gradient problem — it’s not specific to RNNs, it’s a structural property of deep reverse-mode computation.</p>
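<p>The product-of-Jacobians structure makes the failure mode easy to see numerically. The sigmoid's derivative is at most 0.25, so (as a back-of-envelope illustration, not code from this post) a chain of 100 saturating sigmoid layers scales the upstream gradient by at most 0.25¹⁰⁰:</p>

```python
# Upper bound on gradient magnitude through 100 sigmoid layers:
# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) <= 0.25
scale = 1.0
for _ in range(100):
    scale *= 0.25
# scale is on the order of 1e-61: the signal vanishes long before layer 1
```

The same chaining with factors above 1 produces the exploding variant — which is why initialization schemes and activation choices target per-layer Jacobians with norms near 1.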
<p><strong>Silent NaN propagation.</strong> A NaN anywhere in the forward pass propagates silently through the computation graph. During backward, every gradient flowing through the affected node becomes NaN, and the weight update corrupts the entire model. Use <code>torch.autograd.set_detect_anomaly(True)</code> to get a traceback pointing to the originating operation — invaluable for tracking these down.</p>
<p><strong>In-place operations on tensors with gradients.</strong> In-place ops (e.g., <code>x += 1</code>) can modify a tensor that the backward pass expects to find unchanged. PyTorch raises a runtime error when it detects this, but the error message can be confusing. The fix is simple: avoid in-place ops on any tensor that requires gradients, or clone before modifying.</p>
</section>
<section id="key-takeaways" class="level2">
<h2 class="anchored" data-anchor-id="key-takeaways">Key Takeaways</h2>
<ol type="1">
<li><p><strong>AD is not numerical or symbolic differentiation.</strong> It applies the chain rule exactly at each elementary operation — no approximation, no expression explosion.</p></li>
<li><p><strong>Forward mode needs one pass per input; reverse mode needs one pass per output.</strong> For ML — scalar loss, millions of parameters — reverse mode wins unconditionally.</p></li>
<li><p><strong>The cost of reverse mode is memory.</strong> Every intermediate tensor from the forward pass must stay alive for the backward pass. This is the root cause of training using far more memory than inference.</p></li>
<li><p><strong>Gradient checkpointing trades compute for memory.</strong> Store only at segment boundaries, recompute the rest during backward. Expect roughly one extra forward pass overhead for a meaningful memory reduction.</p></li>
<li><p><strong>Most gradient problems are reverse-mode problems.</strong> Vanishing/exploding gradients, NaN propagation, and in-place op errors all stem from how reverse-mode AD chains local Jacobians through the computation graph. Understanding the mechanism is the fastest path to diagnosing them.</p></li>
</ol>


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>ML Systems</category>
  <guid>https://imaddabbura.github.io/posts/mlsys/automatic-differentiation.html</guid>
  <pubDate>Sat, 03 Feb 2024 06:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/mlsys/images/automatic-differentiation-image.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Git from the Inside Out</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/swe/Advanced-Git.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge evergreen">evergreen</span></div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p><strong>Git</strong> is a distributed version control system that thinks of and stores its data as a series of snapshots (not deltas). Each commit is a snapshot of the state of the project at the time of the commit. For files that haven’t changed, <em>Git</em> doesn’t store the file again but keeps a pointer to the identical file it already stored. It also lets us perform almost all operations locally.</p>
<p>Everything in <em>Git</em> is checksummed with the <code>SHA-1</code> hash before it is stored in its object store. A <code>SHA-1</code> hash is 40 hexadecimal characters. All objects are referred to by their checksums because <em>Git</em> is a content-addressable filesystem. This means Git notices any change to the files it tracks by comparing the checksum of the stored version with that of the current version.</p>
<p>All actions in <em>Git</em> only add data to the object store (the Git database). Therefore, it is almost impossible to permanently lose committed work, especially if we regularly push our Git database to another repository such as <em>GitHub</em>.</p>
<p>Git has three states:</p>
<ul>
<li><strong>Modified</strong>: the file has changed but is not yet committed.</li>
<li><strong>Staged</strong>: the changed file is marked to go into the next commit snapshot. The staging area is a single file, typically called the “index”, which stores information about what will go into the next commit snapshot. When we run <code>git add file</code>, Git does the following:
<ul>
<li>Computes the <code>SHA-1</code> checksum of the file’s contents</li>
<li>Compresses the contents of the file and stores it in the <code>.git</code> directory under <code>objects</code>, where the first two characters of the checksum name the subdirectory and the remaining 38 characters name the file</li>
<li>Adds the checksum to the index file (staging area)</li>
</ul></li>
<li><strong>Committed</strong>: the data (snapshot) is stored in the database. The snapshot is represented as a tree for the root directory of the Git project. When we run <code>git commit</code>, Git does the following:
<ul>
<li>Computes the checksum of each subdirectory, working up to the root directory.</li>
<li>Stores them as tree objects in the Git repository</li>
<li>Finally, Git creates a commit object and stores it in the Git repository with the following metadata:
<ul>
<li>Date</li>
<li>Author name</li>
<li>Committer name</li>
<li>Commit message</li>
<li>Parent commit(s). The first commit has no parent; subsequent commits have one parent, or multiple parents in the case of merges</li>
<li>Pointer to the root project tree</li>
</ul></li>
</ul></li>
</ul>
<p>The <code>.git</code> directory, located at the root of the project, holds all the metadata for the Git project, including the database (object store).</p>
<p>Files can be in two states:</p>
<ul>
<li><strong>Untracked</strong>: files that Git doesn’t know about. They are neither in any snapshot nor in the staging area, so they don’t have modified/unmodified states.</li>
<li><strong>Tracked</strong>: files that were in the last snapshot or are in the staging area. They can be in any of the states mentioned above.</li>
</ul>
</section>
<section id="git-object-model" class="level2">
<h2 class="anchored" data-anchor-id="git-object-model">Git Object Model</h2>
<ul>
<li>Git stores everything in the <strong>.git</strong> directory, so deleting this directory deletes the entire history, which cannot be recovered.</li>
<li>Git stores all of its representations in the <strong>objects</strong> directory. An object can be a blob, a tree, or a commit.</li>
<li>Git uses <strong>sha1sum</strong> to compute the hash value of each object, which is 40 hexadecimal characters (160 bits).
<ul>
<li>Git uses the first two characters of the hash as the directory name and the remaining 38 characters as the file name of the object.</li>
<li>Git stores objects based on their hash values (content-addressable storage).</li>
<li>Git compresses the contents using zlib.</li>
</ul>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode c code-with-copy"><code class="sourceCode c"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// a file is a bunch of bytes</span></span>
<span id="cb1-2">type object <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> blob <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> tree <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> commit</span>
<span id="cb1-3">objects <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> map<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>sha1sum<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>object<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">),</span> object<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span></code></pre></div>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> store(obj):</span>
<span id="cb2-2">  <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sha1sum(obj)</span>
<span id="cb2-3">  objects[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> obj</span>
<span id="cb2-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span></span>
<span id="cb2-5"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> a directory contains named files <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">and</span> directories</span>
<span id="cb2-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span> tree <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">map</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>string, tree <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb2-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> load(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span>):</span>
<span id="cb2-8">  <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> objects[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span>]</span></code></pre></div></li>
</ul>
<div id="cell-5" class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>ls <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>al ..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.git</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>total 48
drwxr-xr-x  15 imad  staff   480 Nov  5 09:08 .
drwxr-xr-x@ 12 imad  staff   384 Nov  5 06:56 ..
-rw-r--r--   1 imad  staff    15 Mar 17  2020 COMMIT_EDITMSG
-rw-r--r--   1 imad  staff    23 Feb 12  2020 HEAD
drwxr-xr-x   2 imad  staff    64 Feb 12  2020 branches
-rw-r--r--   1 imad  staff   455 Mar 17  2020 config
-rw-r--r--   1 imad  staff    73 Feb 12  2020 description
drwxr-xr-x  13 imad  staff   416 Feb 12  2020 hooks
-rw-r--r--   1 imad  staff  3913 Nov  5 09:08 index
drwxr-xr-x   3 imad  staff    96 Feb 12  2020 info
drwxr-xr-x   4 imad  staff   128 Feb 12  2020 logs
drwxr-xr-x   3 imad  staff    96 Mar 17  2020 modules
drwxr-xr-x  94 imad  staff  3008 Mar 17  2020 objects
-rw-r--r--   1 imad  staff   114 Feb 12  2020 packed-refs
drwxr-xr-x   5 imad  staff   160 Feb 12  2020 refs</code></pre>
</div>
</div>
<div id="cell-6" class="cell" data-execution_count="27">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>ls <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>a ..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>objects</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>.    0c   1a   26   37   43   50   62   72   8b   9e   b1   c5   d1   e0   fc
..   0f   1d   29   38   45   52   63   73   8e   a2   b2   c7   d2   e1   ff
00   12   1f   2a   39   49   58   64   75   91   a5   b3   ca   d3   eb   info
03   13   20   2b   3f   4a   5a   67   77   96   a9   b4   cc   d8   ee   pack
07   17   21   2c   40   4b   5f   68   7f   98   aa   b6   ce   dc   f0
08   19   22   33   42   4d   61   6f   8a   9d   b0   b8   d0   dd   f8</code></pre>
</div>
</div>
<div id="cell-7" class="cell" data-execution_count="21">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>ls <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Ral ..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>objects<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>ee</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>total 8
drwxr-xr-x   3 imad  staff    96 Mar  4  2020 .
drwxr-xr-x  94 imad  staff  3008 Mar 17  2020 ..
-r--r--r--   1 imad  staff   166 Mar  4  2020 5941ab3c125a3a669370d96cd5cb8496f8acde</code></pre>
</div>
</div>
<div id="cell-8" class="cell" data-execution_count="26">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>git cat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>p ee5941ab3c125a3a669370d96cd5cb8496f8acde</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>100644 blob b6e47617de110dea7ca47e087ff1347cc2646eda    .gitignore
100644 blob 261eeb9e9f8b2b4b0d119366dda99c6fd7d35c64    LICENSE
100644 blob 58da9de606d62625c379fea5ca020d19d958fb18    README.md
040000 tree 6fea6c3802fd1cf83bf19bfc2302da6b79638ab5    missing-cs-semester</code></pre>
</div>
</div>
<section id="blobs" class="level3">
<h3 class="anchored" data-anchor-id="blobs">Blobs</h3>
<ul>
<li><code>blobs</code> are binary large objects that store only the contents of a file, not its name (an array of bytes).</li>
</ul>
<div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode c code-with-copy"><code class="sourceCode c"><span id="cb11-1">type blob <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> array<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>byte<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span></code></pre></div>
<ul>
<li>The object type (“blob”), the length of the content, a separator character, and the actual content are passed to sha1sum to get the hash value.</li>
<li>Since Git does not store the name of the file or any of its metadata, if two files have the same content, Git stores that content only once.</li>
</ul>
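<p>The hashing scheme above can be sketched in a few lines of Python (a minimal illustration of the object format, not Git’s actual implementation):</p>

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    # Git hashes a blob as: sha1("blob <size-in-bytes>\0" + content)
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# Matches what `git hash-object` computes for the same content:
print(git_blob_hash(b"test\n"))  # 9daeafb9864cf43055ae93beb0afd6c7d144bfa4
```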
<div id="cell-11" class="cell" data-execution_count="30">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>git cat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>p <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">58</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">da9de606d62625c379fea5ca020d19d958fb18</span></span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code># software-engineering
Materials for software engineering.</code></pre>
</div>
</div>
<div id="cell-12" class="cell" data-execution_count="31">
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>wc ..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>README.md</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>       2       6      59 ../../README.md</code></pre>
</div>
</div>
<div id="cell-13" class="cell" data-execution_count="68">
<div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%%</span>bash</span>
<span id="cb16-2">cat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>(echo <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>e <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blob 60</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\0</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>) ..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>README.md</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>blob 60
# software-engineering
Materials for software engineering.</code></pre>
</div>
</div>
<div id="cell-14" class="cell" data-execution_count="75">
<div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># # Since todo.md and todo2.md are identical, Git saves ONLY one copy</span></span>
<span id="cb18-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 100644 blob b3dfa8b0b7c73f2c7156dfc69c737d05f2f900c3    file.txt</span></span>
<span id="cb18-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 100644 blob c1ee9d5404109b66f21fa193da635aa8c4f04c47    todo.md</span></span>
<span id="cb18-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 100644 blob c1ee9d5404109b66f21fa193da635aa8c4f04c47    todo2.md</span></span></code></pre></div>
</div>
<div id="cell-15" class="cell" data-execution_count="66">
<div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%%</span>bash</span>
<span id="cb19-2">cat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>(echo <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>e <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blob 58</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\0</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>) ..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>README.md <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> shasum</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>5ee782528aa7bc3d388c33339962f6fce514b39e  -</code></pre>
</div>
</div>
</section>
<section id="trees" class="level3">
<h3 class="anchored" data-anchor-id="trees">Trees</h3>
<ul>
<li>A tree is a recursive data structure that holds a list of pointers to other trees and blobs; in this context, a tree is a directory. The root tree corresponds to the top-level directory, the one that has <code>.git</code> as a subdirectory. Each line in a tree object contains a pointer (the object’s hash) to one such object (tree or blob), along with its mode, object type, and a name for the file or directory.</li>
</ul>
<div class="sourceCode" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode c code-with-copy"><code class="sourceCode c"><span id="cb21-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// a directory contains named files and directories</span></span>
<span id="cb21-2">type tree <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> map<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>sha1sum<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>tree <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> file<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">),</span> tree <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> file<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;;</span></span></code></pre></div>
<ul>
<li>It maps strings (hash values) to objects. Because an empty directory has no entries to map, Git does not track it, not even as an untracked change, until we add a file or a directory to it. To track an empty directory, the common workaround is to add a placeholder file such as <code>.gitkeep</code>.</li>
<li>A tree’s hash is computed over its list of entries (modes, names, and child hashes), not over the contents of the files themselves.</li>
<li>Tree objects themselves do not have names, much like blobs. Parent trees associate names with their subtrees, and the root tree of a repository in fact has no name at all. This has two fun characteristics:
<ul>
<li>The repo doesn’t care what you call it. You can rename your local directory that contains your repository to anything you’d like. Git is blissfully unaware of the name of the directory that contains the .git repo directory.</li>
<li>We can rename subtrees as much as we want, and only the parent tree needs to be updated. The subtree object itself and everything below it remain untouched.</li>
</ul></li>
<li>Trees summary:
<ul>
<li>Trees list out the contents of a directory (blobs and subtrees)</li>
<li>For each object, the mode, permissions, type, hash, and name are listed</li>
<li>Tree objects must contain at least one blob or tree; otherwise, it won’t be tracked</li>
<li>Trees can be nested to any depth</li>
<li>Trees, like blobs, don’t store names; names are stored in parent trees. Renaming a subtree therefore only changes an entry in its parent tree, and since the root tree has no parent, renaming the repository’s root directory has no effect on Git</li>
<li>Trees are named and stored in the objects directory by hashing their contents (the list of objects described above)</li>
</ul></li>
</ul>
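<p>Tree hashing follows the same pattern as blobs: serialize the entries, prepend a typed header, and hash. The sketch below illustrates the idea (entry ordering is simplified here; Git’s actual sort treats directory names specially):</p>

```python
import hashlib

def git_tree_hash(entries) -> str:
    # entries: list of (mode, name, sha1_hex) tuples, one per blob/subtree.
    # Serialize each entry as "<mode> <name>\0" followed by the raw 20-byte hash.
    body = b""
    for mode, name, sha_hex in sorted(entries, key=lambda e: e[1]):
        body += f"{mode} {name}\0".encode() + bytes.fromhex(sha_hex)
    # Hashed like every Git object: sha1("tree <size>\0" + body)
    return hashlib.sha1(f"tree {len(body)}\0".encode() + body).hexdigest()

# The empty tree hashes to Git's well-known constant:
print(git_tree_hash([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Note how the entries contain only the children’s hashes, which is why renaming a file changes the parent tree’s hash but not the blob’s.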
<div id="cell-18" class="cell" data-execution_count="78">
<div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># master is a branch that points to a commit which also points to a tree</span></span>
<span id="cb22-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>git ls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>tree master</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>100644 blob ced9612a7d927cdb23d0ba2de47679504b0c9fc3    Command-Line-Environment.ipynb
100644 blob 68fe60e479d316b7f40f194a8d3400e7f5c8af60    Data-Wrangling.ipynb
100644 blob dcee12ed0ae6096339dc40a70c5ff67a2afdec31    Debugging-And-Profiling.ipynb
100644 blob 40582a6d5d23028acc251c54f6e124ce9f2ec5ba    Petpouri.ipynb
100644 blob 6405b8d2b26e17020b12f98639148615d6c9baea    Plan.ipynb
100644 blob f0aca55804924d43eb3d687ebc9a780f3b8baff3    Security-And-Cryptography.ipynb
100644 blob 035f52d11f6126cac575b899f6ff0011060aeddd    Shell-Scripting.ipynb
100644 blob a2235c5c9a32920e7f15c3bf63afde41e60c4e52    Version-Control(Git).ipynb
100644 blob 395d086c29d15560f0b3eee28c0489afe1b6de8e    Vim-Tutor-Summaries.ipynb
100644 blob 9d756b15f398735d1bd414fd97afa5f709db06f5    basic.png
100644 blob 50daf1bb695821f251fbc44880413ca17ca8a8b6    commit_history.png
100644 blob 43d8b8f1031c6c81f0153e0616be7576de6401f9    pycallgraph.png
100644 blob 5a0fa1a6cb3918e9c2d316433edfd5ccd275bd59    vim-tutorial.md</code></pre>
</div>
</div>
</section>
<section id="commits" class="level3">
<h3 class="anchored" data-anchor-id="commits">Commits</h3>
<ul>
<li><code>commits</code> contain pointers to parent commit(s), a message, the author, the committer, and the current tree. A commit is therefore a file like any other object.</li>
</ul>
<div class="sourceCode" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode c code-with-copy"><code class="sourceCode c"><span id="cb24-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// a commit has parents, metadata, and the top-level tree</span></span>
<span id="cb24-2">type commit <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">struct</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb24-3">    parent<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> array<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>commit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;;</span></span>
<span id="cb24-4">    author<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> string</span>
<span id="cb24-5">    message<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> string</span>
<span id="cb24-6">    snapshot<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> tree</span>
<span id="cb24-7"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div>
<ul>
<li>It’s worth noting that the commit object only contains a single reference to the top-level tree; Git doesn’t store diffs. When diffing between two commits, it compares the trees of the two commits, computing the diff on demand.</li>
</ul>
<p>Git stores full snapshots, not deltas between commits. However, blobs and trees that have not changed since the previous commit are not stored again: an unchanged file hashes to the same value, so its address in the object store is the same and the new tree simply keeps the same pointer.</p>
<p><img src="https://imaddabbura.github.io/posts/swe/images/git-objects-simple.png" height="400px" width="300px" align="center"></p>
<p><strong>References</strong> are nothing but pointers to commits. They are stored as files under the <code>.git/refs</code> directory, where each file contains the hash value of some commit. Since it is a hassle to always refer to objects by their 40-character hexadecimal strings, we can use references instead. Contrary to objects, references are mutable. For example, <code>master</code> always refers to the latest commit on the main branch, and <code>HEAD</code> refers to where we currently are in the history: when creating a new snapshot, Git makes the commit <code>HEAD</code> points to the parent of the new commit, and then updates <code>HEAD</code>.</p>
<div class="sourceCode" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb25-1">references <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">map</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>string, commit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;;</span></span>
<span id="cb25-2"></span>
<span id="cb25-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> update_reference(name, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span>):</span>
<span id="cb25-4">    references[name] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span></span>
<span id="cb25-5"></span>
<span id="cb25-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> read_reference(name):</span>
<span id="cb25-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> references[name]</span>
<span id="cb25-8"></span>
<span id="cb25-9"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> load_reference(name_or_id):</span>
<span id="cb25-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> name_or_id <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> references:</span>
<span id="cb25-11">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> load(references[name_or_id])</span>
<span id="cb25-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span>:</span>
<span id="cb25-13">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> load(name_or_id)</span></code></pre></div>
<div id="cell-24" class="cell" data-execution_count="88">
<div class="sourceCode cell-code" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb26-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>tree <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>L <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> ..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.git</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>../../.git
├── COMMIT_EDITMSG
├── HEAD
├── branches
├── config
├── description
├── hooks
├── index
├── info
├── logs
├── modules
├── objects
├── packed-refs
└── refs

7 directories, 6 files</code></pre>
</div>
</div>
<div id="cell-25" class="cell" data-execution_count="89">
<div class="sourceCode cell-code" id="cb28" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb28-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>tree <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>L <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> ..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>refs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span></span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>../../.git/refs/
├── heads
├── remotes
└── tags

3 directories, 0 files</code></pre>
</div>
</div>
<div id="cell-26" class="cell" data-execution_count="90">
<div class="sourceCode cell-code" id="cb30" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb30-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>tree <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>L <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> ..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>refs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>heads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span></span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>../../.git/refs/heads/
└── master

0 directories, 1 file</code></pre>
</div>
</div>
<div id="cell-27" class="cell" data-execution_count="91">
<div class="sourceCode cell-code" id="cb32" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb32-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span>cat ..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>refs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>heads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>master</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>e1d95ecb4c02e0c30a8635a37631c523d2041299</code></pre>
</div>
</div>
<div id="cell-28" class="cell" data-execution_count="93">
<div class="sourceCode cell-code" id="cb34" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb34-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>git cat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>t e1d95ecb4c02e0c30a8635a37631c523d2041299</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>commit</code></pre>
</div>
</div>
<div id="cell-29" class="cell" data-execution_count="94">
<div class="sourceCode cell-code" id="cb36" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb36-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>git cat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>p e1d95ecb4c02e0c30a8635a37631c523d2041299</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>tree 75407c234b245d258c809de234e030f57dd98148
parent 429137bbf1334dcea2719458bcc3a323cd829ecd
author Imad &lt;imad.dabbura@hotmail.com&gt; 1584462760 -0500
committer Imad &lt;imad.dabbura@hotmail.com&gt; 1584462760 -0500

Review all nbs</code></pre>
</div>
</div>
<p><code>HEAD</code>, unlike the other objects we’ve discussed, is a singleton, meaning that there is only ever one HEAD. It identifies the currently checked out object. Typically, this is a branch (with that branch pointing to a commit), but it is possible to check out a commit directly, in which case HEAD would be pointing at that commit.</p>
<p>HEAD is a file just like our branch objects. It lives at the root of the .git directory and its contents are similarly simple.</p>
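<p>Resolving <code>HEAD</code> can be sketched as a couple of file reads (a simplified illustration; real Git handles more cases, such as packed refs):</p>

```python
from pathlib import Path

def resolve_head(git_dir: str) -> str:
    # HEAD is either symbolic ("ref: refs/heads/master") or,
    # in detached-HEAD state, a raw commit hash.
    head = (Path(git_dir) / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # Follow the symbolic ref to the branch file holding the hash
        return (Path(git_dir) / head[len("ref: "):]).read_text().strip()
    return head  # detached HEAD: the file holds the commit hash itself
```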
<div id="cell-31" class="cell" data-execution_count="114">
<div class="sourceCode cell-code" id="cb38" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb38-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span>cat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>Users<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>imad<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>Desktop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>repo<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>HEAD</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>ref: refs/heads/master</code></pre>
</div>
</div>
<div id="cell-32" class="cell" data-execution_count="115">
<div class="sourceCode cell-code" id="cb40" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb40-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%%</span>bash</span>
<span id="cb40-2">cd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~/</span>Desktop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>repo<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span></span>
<span id="cb40-3">git graph2</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>* 6e9c688    (HEAD -&gt; master, tag: v.0.1, test) Renaming (Imad)
* ff9e667    Add test1 dir (Imad)
* e503b6e    Add test dir (Imad)
* 8610c78    Add copied file (Imad)
* ed738c9    (feature) Rebased all commits (Imad)
* 2050b90    Add host to file (Imad)
* 91eacf5    Add host (Imad)
* ed27259    patch commit (Imad)
* ff2d260    third commit (Imad)
* 6dd0c14    Change second commit (Imad)
* c2b7166    first commit (Imad)</code></pre>
</div>
</div>
<div id="cell-33" class="cell" data-execution_count="116">
<div class="sourceCode cell-code" id="cb42" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb42-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%%</span>bash</span>
<span id="cb42-2">cd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~/</span>Desktop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>repo<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span></span>
<span id="cb42-3">git checkout <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8610</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">c78</span></span></code></pre></div>
<div class="cell-output cell-output-stderr">
<pre><code>Note: checking out '8610c78'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b &lt;new-branch-name&gt;

HEAD is now at 8610c78 Add copied file</code></pre>
</div>
</div>
<div id="cell-34" class="cell" data-execution_count="118">
<div class="sourceCode cell-code" id="cb44" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb44-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span>cat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>Users<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>imad<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>Desktop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>repo<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>HEAD</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>8610c78113fe423b20a9f84d485b49af5ad089b0</code></pre>
</div>
</div>
<div id="cell-35" class="cell" data-execution_count="117">
<div class="sourceCode cell-code" id="cb46" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb46-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%%</span>bash</span>
<span id="cb46-2">cd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~/</span>Desktop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>repo<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span></span>
<span id="cb46-3">git graph2</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>* 6e9c688    (tag: v.0.1, test, master) Renaming (Imad)
* ff9e667    Add test1 dir (Imad)
* e503b6e    Add test dir (Imad)
* 8610c78    (HEAD) Add copied file (Imad)
* ed738c9    (feature) Rebased all commits (Imad)
* 2050b90    Add host to file (Imad)
* 91eacf5    Add host (Imad)
* ed27259    patch commit (Imad)
* ff2d260    third commit (Imad)
* 6dd0c14    Change second commit (Imad)
* c2b7166    first commit (Imad)</code></pre>
</div>
</div>
</section>
<section id="summary" class="level3">
<h3 class="anchored" data-anchor-id="summary">Summary</h3>
<ul>
<li><strong>Objects</strong>: blobs, trees, and commits</li>
<li><strong>Refs</strong>: branches, tags, and remote branches</li>
<li><strong>HEAD</strong>: The single pointer to rule them all</li>
</ul>
<p><img src="https://imaddabbura.github.io/posts/swe/images/git-objects.png" width="500"></p>
</section>
</section>
<section id="branches" class="level2">
<h2 class="anchored" data-anchor-id="branches">Branches</h2>
<ul>
<li><code>heads</code>, aka branches (the directory is a collection of the <code>HEAD</code>s of each branch in the repo), are nothing but pointers to commits. They are very simple objects: each contains only the hash value of the commit it points to. Creating a branch is therefore just creating a file in <code>refs/heads</code>, named after the branch, that contains the hash of the branch’s tip commit. Initially, this file holds the commit that <code>HEAD</code> pointed to on the branch you were on when you created the new branch.</li>
<li>When you switch branches, Git resets your working directory to look like it did the last time you committed on that branch. It adds, removes, and modifies files automatically to make sure your working copy is what the branch looked like on your last commit.</li>
<li>Merging:
<ul>
<li>If we are merging a feature branch into master and the feature branch is directly ahead of master, that is, master’s tip is reachable by following the feature branch’s commit history, Git does a fast-forward merge: it simply moves the master pointer forward.</li>
<li>Otherwise, if the tip of master isn’t a direct ancestor of the feature branch, Git does a three-way merge using three commits:
<ul>
<li>Common ancestor commit</li>
<li>The last commits of the master and feature branches</li>
<li>Git then creates a new snapshot with a new commit object (the <strong>merge commit</strong>) that points to two parents: the last commit of master and the last commit of the feature branch</li>
</ul></li>
<li>If we have a merge conflict, we can either abort the merge or resolve the conflict ourselves. Once we resolve the conflicts in all files, we stage those files and then commit the changes; this becomes the merge commit. We can also use a mergetool such as <code>vimdiff</code> to resolve conflicts.</li>
</ul></li>
<li><code>git branch -v</code> shows the last commit of every branch</li>
<li><code>git branch --merged</code> shows all branches that have been merged into the current branch; <code>git branch --merged master</code> shows all branches that have been merged into master.</li>
<li><code>git branch --no-merged</code> does the opposite.</li>
<li>We can’t delete a branch that has work not yet merged into master. We can force-delete it with the <code>-D</code> flag.</li>
<li>We can rename a branch, but we should do it both locally and on the remote server. It is recommended to avoid renaming the master branch because it would break integrations, scripts, etc., and requires a lot more work.
<ul>
<li>Locally:
<ul>
<li><code>git branch --move oldname newname</code></li>
</ul></li>
<li>Remote:
<ul>
<li><code>git push --set-upstream origin newname</code></li>
<li><code>git push origin -d oldname</code></li>
</ul></li>
</ul></li>
<li><code>heads</code> are for local branches.</li>
</ul>
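<p>The two merge behaviors above can be reproduced in a throwaway repo. A minimal sketch (paths, names, and messages are illustrative; assumes a reasonably recent Git that supports <code>git init -b</code>):</p>

```shell
#!/bin/sh
# Fast-forward vs. three-way merge, demonstrated in a scratch repo.
set -e
rm -rf /tmp/merge-demo && mkdir -p /tmp/merge-demo && cd /tmp/merge-demo
git init -q -b master
git config user.email demo@example.com && git config user.name Demo
echo base > file.txt && git add file.txt && git commit -qm "base"

# Case 1: feature is directly ahead of master -> fast-forward.
git checkout -qb feature
echo feature >> file.txt && git commit -qam "feature work"
git checkout -q master
git merge --ff-only feature            # just moves the master pointer forward

# Case 2: branches have diverged -> three-way merge with a merge commit.
git checkout -qb topic
echo topic > topic.txt && git add topic.txt && git commit -qm "topic work"
git checkout -q master
echo more > other.txt && git add other.txt && git commit -qm "master moves on"
git merge -q --no-edit topic           # creates a new commit with two parents
git rev-list --count --merges HEAD     # prints 1: only the three-way merge made one
```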
</section>
<section id="remote-branches" class="level2">
<h2 class="anchored" data-anchor-id="remote-branches">Remote Branches</h2>
<p><strong>Remote Branches</strong> are the same as local branches. They are again files that point to commits.</p>
<ul>
<li>We can have multiple remotes, each with its own branches. <code>origin</code> is the default name, typically used for the upstream (we can choose another name when cloning a repo, e.g. <code>git clone URL -o anothername</code>). <code>git remote -v</code> lists all the remotes for the repository. We can add a remote with <code>git remote add remote_name remote_url</code>.</li>
<li>All the remote branches under <code>remotes/origin/</code> will be updated <strong>ONLY</strong> when communicating with the remote server. Such branches act more as bookmarks and can’t be changed by any Git commands to point to different commits directly.</li>
<li>Local branch is called <strong>Tracking Branch</strong> if it tracks a remote branch (called <strong>Upstream Branch</strong>)
<ul>
<li><code>git checkout branchname</code> creates a tracking branch for <code>remotename/branchname</code> if <code>branchname</code> doesn’t exist locally and exactly matches the name of one upstream branch.</li>
<li>We can have local branches track branches from different remotes: <code>git checkout --track remotename/remotebranch</code> creates a local branch named remotebranch that tracks remotebranch on the remotename server. We can pick a different local name with <code>git checkout -b localbranchname remotename/remotebranch</code>.</li>
<li>If we already have a local branch, we can use <code>git branch --set-upstream-to=remotename/remotebranch</code> to make current branch track remotebranch on remotename server</li>
<li>If I am on a tracking branch and run <code>git pull</code>, it knows which server to fetch from and which branch to merge in</li>
</ul></li>
<li><code>git fetch</code> downloads the changes on all branches from the remote to the local repository without merging them. We do the merge ourselves, e.g. <code>git merge origin/branchname</code>.</li>
<li><code>git pull</code> downloads and merges the changes from the remote branch into the local branch.</li>
<li><code>git remote show remote_name</code> will show everything in details about the <code>remote_name</code> such as URL, local/remote branches, etc.</li>
<li>We can rename/delete remotes with <code>git remote rename/remove remote_name</code>. Deleting a remote deletes all configuration and settings related to it; renaming one also renames its remote-tracking branches.</li>
<li>Remote references are read-only, which means we will never update them using <code>git commit</code> but <em>Git</em> manages them as bookmarks.</li>
<li>By default, <em>Git</em> fetches all branch references (heads) from the remote into <code>refs/remotes</code>. We can override this on the command line with an explicit refspec: <code>git fetch remote_name remote_branch:refs/remotes/remote_name/branch_name</code></li>
<li>Pushing local branch to remote can be done in different forms:
<ul>
<li><code>git push origin branchname</code></li>
<li><code>git push origin localbranchname:remotebranchname</code> which lets us have a different name on the remote server for our local branch</li>
</ul></li>
<li>We can delete remote branch <code>git push origin --delete branchname</code></li>
</ul>
<div id="cell-42" class="cell" data-execution_count="108">
<div class="sourceCode cell-code" id="cb48" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb48-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>tree <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>C ..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>refs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>remotes<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span></span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>../../.git/refs/remotes/
└── origin
    ├── HEAD
    └── master

1 directory, 2 files</code></pre>
</div>
</div>
<div id="cell-43" class="cell" data-execution_count="109">
<div class="sourceCode cell-code" id="cb50" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb50-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span>cat ..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>..<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>refs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>remotes<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>origin<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/*</span></span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>ref: refs/remotes/origin/master
e1d95ecb4c02e0c30a8635a37631c523d2041299</code></pre>
</div>
</div>
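<p>Remote-tracking refs and tracking branches can be seen end to end with a local bare repository standing in for the server. A minimal sketch (all paths and names are illustrative):</p>

```shell
#!/bin/sh
# A local bare repo plays the "server"; cloning it sets up a tracking branch.
set -e
rm -rf /tmp/remote-demo && mkdir -p /tmp/remote-demo && cd /tmp/remote-demo
git init -q -b master upstream-src && cd upstream-src
git config user.email demo@example.com && git config user.name Demo
echo hello > readme.txt && git add readme.txt && git commit -qm "initial"
cd ..
git clone -q --bare upstream-src upstream.git   # the "remote server"
git clone -q upstream.git clone && cd clone

git remote -v                                   # origin -> /tmp/remote-demo/upstream.git
git branch -vv                                  # master tracks [origin/master]
git rev-parse --abbrev-ref "master@{upstream}"  # prints: origin/master
```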
</section>
<section id="tags" class="level2">
<h2 class="anchored" data-anchor-id="tags">Tags</h2>
<ul>
<li><strong>Tags</strong> are like branches: they too point to a commit and are stored under the <code>.git/refs</code> directory, in the <code>tags</code> subdirectory. A lightweight tag is just a file containing the hash of the commit it points to.</li>
<li>Tags can be created simply with <code>git tag version_no</code>. We can also create more complex tags with an annotation, a PGP signature, and other metadata. In that case, a tag <em>object</em> is stored in the object database (<code>.git/objects</code>), and the file under <code>refs/tags</code> contains the hash of that tag object (which in turn contains the hash of the tagged commit).
<ul>
<li>Annotated tags, however, are stored as full objects in the Git database. They’re checksummed; contain the tagger name, email, and date; have a tagging message; and can be signed and verified with GNU Privacy Guard (GPG). It’s generally recommended that you create annotated tags so you can have all this information.</li>
</ul></li>
<li>We can also tag previous commits by specifying their hash abbreviation: <code>git tag -a v1.0 ca21323</code></li>
<li><code>git push</code> doesn’t transfer tags to the remote server; we have to push tags explicitly: <code>git push origin v1.0</code> (or <code>git push origin --tags</code> for all of them)</li>
<li><code>git tag</code> to list all tags</li>
<li><code>git tag -l pattern</code> to list tags that match a specific pattern</li>
<li>We can check out a tag to inspect the files from that version: <code>git checkout tagname</code> (this leaves us in a detached-HEAD state). Any changes committed there wouldn’t belong to any branch and would be unreachable except by exact commit hash. Therefore, to fix issues, create a new branch from the tag and commit the changes there.</li>
<li>The <strong>difference</strong> between tags and branches is that branches evolve over time; however, tags point to fixed commit in repo’s history.</li>
<li>We can delete tags:
<ul>
<li>locally: <code>git tag -d tagname</code></li>
<li>remote: <code>git push remote_name -d tagname</code> OR <code>git push remote_name :refs/tags/tagname</code></li>
</ul></li>
</ul>
<div id="cell-46" class="cell" data-execution_count="96">
<div class="sourceCode cell-code" id="cb52" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb52-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>tree <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>Users<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>imad<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>Desktop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>repo<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>refs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span></span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>/Users/imad/Desktop/git-repo/.git/refs/
├── heads
│&nbsp;&nbsp; ├── feature
│&nbsp;&nbsp; ├── master
│&nbsp;&nbsp; └── test
└── tags
    └── v.0.1

2 directories, 4 files</code></pre>
</div>
</div>
<div id="cell-47" class="cell" data-execution_count="97">
<div class="sourceCode cell-code" id="cb54" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb54-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>cat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>Users<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>imad<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>Desktop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>repo<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>.git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>refs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>tags<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/*</span></span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>6e9c6886a2180fdfde291a130aa9e10a52bac679</code></pre>
</div>
</div>
<div id="cell-48" class="cell" data-execution_count="100">
<div class="sourceCode cell-code" id="cb56" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb56-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%%</span>bash</span>
<span id="cb56-2">cd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~/</span>Desktop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>git<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>repo<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span></span>
<span id="cb56-3">git graph2</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>* 6e9c688    (HEAD -&gt; master, tag: v.0.1, test) Renaming (Imad)
* ff9e667    Add test1 dir (Imad)
* e503b6e    Add test dir (Imad)
* 8610c78    Add copied file (Imad)
* ed738c9    (feature) Rebased all commits (Imad)
* 2050b90    Add host to file (Imad)
* 91eacf5    Add host (Imad)
* ed27259    patch commit (Imad)
* ff2d260    third commit (Imad)
* 6dd0c14    Change second commit (Imad)
* c2b7166    first commit (Imad)</code></pre>
</div>
</div>
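<p>The lightweight/annotated distinction is easy to verify with <code>git cat-file</code>. A minimal sketch (tag names and messages are illustrative):</p>

```shell
#!/bin/sh
# Lightweight tags point straight at a commit; annotated tags are tag objects.
set -e
rm -rf /tmp/tag-demo && mkdir -p /tmp/tag-demo && cd /tmp/tag-demo
git init -q -b master
git config user.email demo@example.com && git config user.name Demo
echo v1 > app.txt && git add app.txt && git commit -qm "release work"

git tag v1.0-light                          # lightweight: ref holds the commit hash
git tag -a v1.0 -m "First stable release"   # annotated: ref holds a tag object hash

git cat-file -t v1.0-light   # prints: commit
git cat-file -t v1.0         # prints: tag
git tag -l 'v1.0*'           # pattern listing: v1.0 and v1.0-light
```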
</section>
<section id="cloning-repository" class="level2">
<h2 class="anchored" data-anchor-id="cloning-repository">Cloning Repository</h2>
<p><code>git clone https://github.com/UserName/RepoName</code> would do the following:</p>
<ul>
<li>Create a directory called <code>RepoName</code></li>
<li>Create a directory called <code>.git</code> inside <code>RepoName</code></li>
<li>Pull down all versions for every file for the history of the project</li>
<li>Check out the latest version</li>
</ul>
<p>As a result, if a huge file was committed early on but deleted years ago, cloning still pulls it down even though it is never needed again. Therefore, if a project has a long history, we may not need to clone all of it: a shallow clone restricts history to the last <code>N</code> commits (<code>--depth N</code>) or to commits after a date (<code>--shallow-since</code>).</p>
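<p>A shallow clone can be sketched against a local repository (the <code>file://</code> URL matters because plain local-path clones ignore <code>--depth</code>; paths are illustrative):</p>

```shell
#!/bin/sh
# Shallow clone: only the most recent commit is fetched.
set -e
rm -rf /tmp/shallow-demo && mkdir -p /tmp/shallow-demo && cd /tmp/shallow-demo
git init -q -b master src && cd src
git config user.email demo@example.com && git config user.name Demo
echo 0 > n.txt && git add n.txt && git commit -qm "commit 0"
for i in 1 2 3 4; do echo "$i" > n.txt && git commit -qam "commit $i"; done
cd ..
git clone -q --depth 1 file:///tmp/shallow-demo/src shallow && cd shallow
git rev-list --count HEAD   # prints 1, although src has 5 commits
```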
</section>
<section id="ignoring-files" class="level2">
<h2 class="anchored" data-anchor-id="ignoring-files">Ignoring Files</h2>
<p><code>.gitignore</code> hosts all patterns that Git should ignore and not track. It is typically located at the root directory of the project and applies recursively to all subdirectories; however, we can have a <code>.gitignore</code> in a subdirectory that applies only to that subdirectory.</p>
<p>The rules for the patterns you can put in the .gitignore file are as follows:</p>
<ul>
<li>Blank lines or lines starting with <code>#</code> are ignored.</li>
<li>Standard glob patterns work, and will be applied recursively throughout the entire working tree. Example:
<ul>
<li><code>*.log</code> ignores all files that end with <code>.log</code>, recursively</li>
<li><code>doc/*.txt</code> ignores all <code>.txt</code> files under <code>doc</code></li>
<li><code>doc/**/*.pdf</code> ignores all PDF files in the <code>doc</code> directory and all its subdirectories</li>
</ul></li>
<li>We can start patterns with a forward slash (/) to avoid recursivity. Example: <code>/TODO</code> ignores TODO in the current directory.</li>
<li>You can end patterns with a forward slash (/) to specify a directory. Example: <code>build/</code> ignores all files under <code>build</code> in all directories.</li>
<li>You can negate a pattern by starting it with an exclamation point (!). Example: <code>!test.log</code> tracks <code>test.log</code></li>
</ul>
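<p>The patterns above can be checked without committing anything, using <code>git check-ignore</code> (the file names below are illustrative; the files don’t even need to exist):</p>

```shell
#!/bin/sh
# Verifying .gitignore patterns with git check-ignore.
set -e
rm -rf /tmp/ignore-demo && mkdir -p /tmp/ignore-demo && cd /tmp/ignore-demo
git init -q -b master
cat > .gitignore <<'EOF'
*.log
doc/**/*.pdf
/TODO
build/
!important.log
EOF
git check-ignore -v app.log            # matched by *.log
git check-ignore -v doc/a/b/spec.pdf   # matched by doc/**/*.pdf
git check-ignore -v build/out.o        # matched by build/
git check-ignore -q important.log && echo ignored || echo "tracked (negated)"
```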
</section>
<section id="general" class="level2">
<h2 class="anchored" data-anchor-id="general">General</h2>
<ul>
<li><p><code>git rm</code> removes a file from the working tree and stages the removal. If we staged a file by mistake, or don’t want Git to track it, <code>git rm --cached filename</code> removes it from the staging area while keeping it on the hard drive.</p></li>
<li><p><code>git mv</code> both renames the file and stages the rename</p></li>
<li><p><code>git reflog</code> records all actions taken in a repository (even intermediate steps such as creating branches, clones, pulls, etc.), not just commits. It is local to your copy of the repository; others with the same repository have their own reflog, and it starts empty after a fresh clone. It is therefore more like <code>shell</code> history: we can always get back to an earlier state.</p>
<ul>
<li><code>git reflog show branchname</code> shows all the actions that updated that reference.</li>
</ul></li>
<li><p><code>git diff</code> to see the changes in the working tree compared to the index</p></li>
<li><p><code>git diff --cached</code> to see the staged changes compared to the last commit</p></li>
<li><p><code>git difftool</code> shows the changes in external tools such as <em>vimdiff</em></p></li>
<li><p><code>git add -i</code> for interactive staging to control which files to stage and which parts of the files (patches) to stage all interactively. This is very helpful if we have done a lot of work on many files without staging anything. We can <code>add/checkout/restore/stash</code> patches (parts of files) by adding <code>--patch</code> or <code>-p</code> flag to their corresponding git command.</p></li>
<li><p><code>git commit -m "test"</code>: a commit involves at least a change to one blob. This leads to the creation of a new tree reflecting the current state of the code. Git then creates a commit object that points to the new tree. Finally, it updates the current branch to point to the newly created commit.</p></li>
<li><p><code>git merge --ff-only branch-name</code>: this kind of merge creates no objects; it just updates the current branch to point to a different commit.</p></li>
<li><p><code>git merge branch-name</code>: in contrast to the fast-forward merge, Git creates a new tree by doing its best to combine the two divergent branches. It then creates a new commit that points to the new tree and has two parents: the latest commit from each branch. This is what <em>merging via pull requests</em> does on GitHub. It may not be preferable because the code actually changes during the merge when Git combines the branches, and we may end up with conflicts.</p></li>
</ul>
</section>
<section id="stashing" class="level2">
<h2 class="anchored" data-anchor-id="stashing">Stashing</h2>
<p>It is very helpful when we have staged work and/or modified tracked files and want to jump to a different branch to work on something else. By default, Git stashes only modified and staged tracked files, not untracked files; add <code>-u</code> to include untracked files as well.</p>
<ul>
<li>We can run <code>git stash</code> to stash the work</li>
<li><code>git stash list</code> to list all the stashes</li>
<li><code>git stash apply</code> to apply last stash OR <code>git stash apply stashname</code>. This keeps the stash on the stack</li>
<li><code>git stash drop</code> to remove a stash</li>
<li><code>git stash pop</code> to apply and remove last stash in one command</li>
</ul>
<p>We can apply stashes from one branch on another branches.</p>
<p>To avoid issues/merge conflicts when applying stashes, it may be helpful to apply the stash on a fresh branch: <code>git stash branch newbranchname</code>. This creates a new branch starting from the commit you were on when you stashed, checks it out, applies the stash, and then drops the stash.</p>
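<p>A minimal stash round-trip (file names and the stash message are illustrative):</p>

```shell
#!/bin/sh
# Stash uncommitted work, verify the tree is clean, then restore it.
set -e
rm -rf /tmp/stash-demo && mkdir -p /tmp/stash-demo && cd /tmp/stash-demo
git init -q -b master
git config user.email demo@example.com && git config user.name Demo
echo one > notes.txt && git add notes.txt && git commit -qm "initial"

echo two >> notes.txt              # a change we are not ready to commit
git stash push -q -m "wip notes"
git status --porcelain             # prints nothing: the tree is clean again
git stash list                     # stash@{0}: On master: wip notes
git stash pop -q                   # apply and drop in one step
grep two notes.txt                 # the change is back
```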
</section>
<section id="managing-history" class="level2">
<h2 class="anchored" data-anchor-id="managing-history">Managing History</h2>
<ul>
<li><code>git log</code> has all the information we need to the repo’s history.
<ul>
<li><code>git log --all --decorate --graph --oneline</code> is great to get an overview and see the divergence of branches</li>
<li><code>git log -n 5</code> (or <code>git log -5</code>) limits the log to the most recent 5 commits</li>
<li><code>git log --oneline file</code> is useful to get an overview of the log of one file</li>
<li><code>git log --pretty=format:'%C(yellow)%h%C(reset) - %an [%C(green)%ar%C(reset)] %s'</code> to change the format of the log output</li>
<li><code>git log -E -i --grep regexp</code> searches the commit messages with an extended, case-insensitive regexp</li>
<li><code>git log -S term</code> will search for changes related to that term in the code base (addition/deletion). Check this <a href="https://thoughtbot.com/blog/code-sleuthing-with-git">post</a></li>
<li><code>git log -G regexp</code> searches for changes matching the regexp in the code base; unlike <code>-S</code>, it matches patterns rather than a literal string.</li>
</ul></li>
<li><code>git show commit</code> will show everything that happened with that commit including <strong>diff</strong></li>
<li><code>git blame file</code> is useful to know who changed what in the file and when, especially when tracing who introduced a bug or some logic to the codebase. Use <code>-L</code> to restrict to specific lines. Use <code>-C</code> to detect whether blocks of lines were copied from other files in the same commit.</li>
</ul>
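<p>The pickaxe search (<code>-S</code>) can be illustrated with a tiny history (file content and messages below are made up):</p>

```shell
#!/bin/sh
# git log -S finds the commit(s) where a term's number of occurrences changed.
set -e
rm -rf /tmp/log-demo && mkdir -p /tmp/log-demo && cd /tmp/log-demo
git init -q -b master
git config user.email demo@example.com && git config user.name Demo
echo "def main(): pass" > app.py && git add app.py && git commit -qm "add main"
echo "def helper(): pass" >> app.py && git commit -qam "add helper"
echo "# docs" >> app.py && git commit -qam "add docs"

git log --oneline -S helper   # shows only the commit that introduced 'helper'
```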
</section>
<section id="bisect" class="level2">
<h2 class="anchored" data-anchor-id="bisect">Bisect</h2>
<p>Git bisect is useful for tracing when a bug was introduced, i.e. finding the commit that introduced it, especially if that commit is far back in the history. It does a binary search between a commit you believe was good (no bug) and the current commit (or any commit known to have the bug). Below is a typical workflow:</p>
<ul>
<li><code>git bisect start</code> to start the binary search</li>
<li><code>git bisect bad</code> which means current <code>HEAD</code> is the bad commit which would be last commit in the range of commits of the binary search</li>
<li><code>git bisect good commit</code> which tells Git that the provided commit didn’t have the bug and would be the first commit in the range of commits of the binary search</li>
<li>We can combine the steps above into one command: <code>git bisect start badcommit goodcommit</code></li>
<li>From here, we interactively run either <code>git bisect good</code> to tell Git the given commit is good so it does binary search from next commit to the last commit OR <code>git bisect bad</code> to tell Git that the given commit is bad and the next binary search stops at the commit before it. We keep doing this until we arrive at the commit that introduced the bug.</li>
</ul>
<p>We can also use a script that runs tests for us to check whether a commit is good or bad and automate the whole process:</p>
<ul>
<li><code>git bisect start badcommit goodcommit</code></li>
<li><code>git bisect run test-script.sh</code> OR <code>git bisect run make rule</code> OR <code>git bisect run pytest</code>. For each commit, <code>git bisect</code> runs the script or command on the checked out commit. If it returns 0 -&gt; good; otherwise, bad.</li>
</ul>
</section>
<section id="submodules" class="level2">
<h2 class="anchored" data-anchor-id="submodules">Submodules</h2>
<p>Git submodules are Git repositories nested inside a Git repository, which lets us track them while keeping their commit histories separate. Each submodule lives in its own directory inside the project repository.</p>
<ul>
<li>We can add submodule by <code>git submodule add URL</code>. This will create a directory with the name of the Git repository (we can have different names using <code>git submodule add URL name</code>). If we run <code>git status</code>, we see that Git added the directory as special type of file as well as add a file named <code>.gitmodules</code> that has the <em>path</em> and the <em>URL</em> for each submodule. We need to commit those two files to include them in our main project history.</li>
<li>If we clone a project that has submodules, we can either pass <code>--recurse-submodules</code> to initialize and pull the contents of all submodules, OR go into each submodule directory and run <code>git submodule update --init</code> (add <code>--recursive</code> if there are any nested submodules).</li>
<li>To pull in changes made to submodules, run <code>git submodule update --remote submodule_name</code></li>
<li><code>git diff --submodule</code> to get a nice diff for submodules</li>
</ul>
</section>
<section id="hooks" class="level2">
<h2 class="anchored" data-anchor-id="hooks">Hooks</h2>
<p>Git <a href="https://thoughtbot.com/blog/use-git-hooks-to-automate-annoying-tasks">hooks</a> are scripts that run either on the client side, for operations such as committing/merging, or on the server side, for network operations such as receiving pushed commits. All hooks are stored in the <code>.git/hooks</code> directory. Git prepopulates any new repository with example hooks that end in <code>.sample</code>; to use one, remove the extension. We can write hooks in many languages such as Python, but they have to be executable and must have no extension. Also, client-side hooks aren’t copied when the repository is cloned.</p>
<p>Below are the most common client-side hooks:</p>
<ul>
<li><code>pre-commit</code>: Runs before we type the commit message and abort if the return code is not zero. This can be used to run tests, check code style, check for documentation or whitespaces, etc.</li>
<li><code>prepare-commit-msg</code>: Runs before the commit message editor but after the default message is created.</li>
<li><code>commit-msg</code>: Typically used to check if a commit message conforms to some predefined patterns.</li>
<li><code>post-commit</code>: Runs after the commit process is completed.</li>
</ul>
<p>There are other client-side hooks such as <code>pre-rebase</code>, <code>pre-merge</code>, <code>post-merge</code>, etc.</p>
<p>Below are the most common server-side hooks:</p>
<ul>
<li><code>pre-receive</code>: Runs when handling a push from client. It can be used to check for things such as rejecting non-fast-forwards or access control.</li>
<li><code>update</code>: Similar to <code>pre-receive</code> but runs once for each branch the pusher is trying to update.</li>
<li><code>post-receive</code>: Runs after the entire push process is completed. It can be used to notify users or update services.</li>
</ul>
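<p>A tiny <code>pre-commit</code> hook along these lines (the "no WIP markers" policy is a made-up example):</p>

```shell
#!/bin/sh
# A pre-commit hook that rejects commits whose staged additions contain "WIP".
set -e
rm -rf /tmp/hook-demo && mkdir -p /tmp/hook-demo && cd /tmp/hook-demo
git init -q -b master
git config user.email demo@example.com && git config user.name Demo
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
if git diff --cached | grep -q '^+.*WIP'; then
  echo "pre-commit: refusing to commit WIP markers" >&2
  exit 1    # a nonzero exit aborts the commit
fi
EOF
chmod +x .git/hooks/pre-commit

echo "WIP: fix later" > code.txt && git add code.txt
git commit -qm "try" && echo committed || echo blocked    # prints: blocked

echo "done" > code.txt && git add code.txt
git commit -qm "clean commit" && echo committed           # prints: committed
```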
</section>
<section id="resetting" class="level2">
<h2 class="anchored" data-anchor-id="resetting">Resetting</h2>
<p><code>git reset HEAD|commit</code> command allows us to:</p>
<ul>
<li>Move what the branch <code>HEAD</code> points to (stops here if <code>--soft</code>; everything stays staged).</li>
<li>Make the index look like <code>HEAD</code> (stops here unless <code>--hard</code>; this is the default, <code>--mixed</code>)</li>
<li>Make the working directory look like the index</li>
</ul>
<p>If we provide a path such as <code>git reset filepath</code>, it is a shorthand for <code>git reset --mixed HEAD filepath</code> and does the following:</p>
<ul>
<li>Move what the branch <code>HEAD</code> points to (skipped)</li>
<li>Make the index look like <code>HEAD</code>; i.e.&nbsp;has the effect of unstaging the file</li>
</ul>
<p>If we run <code>git reset commit -- filepath</code>, it acts as if we reverted the file’s content to what was in that commit and then ran <code>git add</code> on it, without touching the working directory. <code>HEAD</code> and the working directory keep their current version of the file while the index holds the old one. Running <code>git commit</code> therefore records the file as it was in that commit.</p>
<p><code>git checkout</code> without paths is similar to <code>git reset</code> with two differences:</p>
<ul>
<li><code>reset</code> moves the branch <code>HEAD</code> points to while <code>checkout</code> moves <code>HEAD</code> itself. For example, <code>git checkout branch</code> would change what <code>HEAD</code> is pointing to while <code>git reset commit</code> would change what branch points to.</li>
<li><code>checkout</code> is working-directory safe where it tries to do a trivial merge but <code>reset --hard</code> will overwrite working-directory.</li>
</ul>
<p><code>git checkout filepath</code> is the equivalent of <code>git reset --hard filepath</code> (if <code>reset --hard</code> accepted paths): it overwrites the file in the working directory.</p>
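<p>The three stopping points map directly to <code>--soft</code>, the default <code>--mixed</code>, and <code>--hard</code>. A minimal sketch (file names are illustrative):</p>

```shell
#!/bin/sh
# Where each reset mode stops: branch pointer, index, working directory.
set -e
rm -rf /tmp/reset-demo && mkdir -p /tmp/reset-demo && cd /tmp/reset-demo
git init -q -b master
git config user.email demo@example.com && git config user.name Demo
echo a > f.txt && git add f.txt && git commit -qm "c1"
echo b >> f.txt && git commit -qam "c2"

git reset -q --soft HEAD~1   # branch moves back; change from c2 is still staged
git status --porcelain       # prints: M  f.txt
git reset -q                 # --mixed (default): index matches HEAD, change unstaged
git status --porcelain       # prints:  M f.txt
git reset -q --hard          # working directory matches HEAD too
git status --porcelain       # prints nothing
cat f.txt                    # prints: a
```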
</section>
<section id="inspecting-commit-ranges" class="level2">
<h2 class="anchored" data-anchor-id="inspecting-commit-ranges">Inspecting Commit Ranges</h2>
<ul>
<li><code>^</code> refers to the parent. <code>HEAD^</code> means the parent of the last commit on the current branch; <code>HEAD^2</code> means its <em>second</em> parent (only merge commits have one).</li>
<li><code>~</code> also refers to the first parent, so <code>HEAD~</code> and <code>HEAD^</code> are identical. They differ when followed by a number: <code>HEAD~2</code> means the first parent’s first parent (the grandparent), while <code>HEAD^2</code> selects the second parent of a merge commit.</li>
<li><code>HEAD~5</code> is equivalent to <code>HEAD^^^^^</code>.</li>
<li>Double dots (<code>..</code>): If we want to see the commits that are reachable from target branch (commit) but not the source branch (commit), we use <code>git log sourcecommit..targetcommit</code>.</li>
<li>Triple dots (<code>...</code>): If we want to see the commits that are reachable from either of the branches (commits) but not from both, we use <code>git log sourcecommit...targetcommit</code>. This returns the commits unique to each side, excluding the ones they have in common.</li>
<li>Multiple points: We can exclude commits reachable from a ref by prefixing it with <code>^</code>. For example, <code>git log refA refB ^refC</code> shows commits reachable from refA or refB but not from refC. Therefore:
<ul>
<li><code>git log refA..refB</code> is equivalent to <code>git log refB ^refA</code></li>
</ul></li>
</ul>
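<p>The dot notations can be checked on a small diverged history (branch names are illustrative; the counts follow from the setup):</p>

```shell
#!/bin/sh
# Double-dot, triple-dot, and ^exclusion on two diverged branches.
set -e
rm -rf /tmp/range-demo && mkdir -p /tmp/range-demo && cd /tmp/range-demo
git init -q -b master
git config user.email demo@example.com && git config user.name Demo
echo base > f.txt && git add f.txt && git commit -qm "base"
git checkout -qb feature
echo f1 > f1.txt && git add f1.txt && git commit -qm "feature 1"
echo f2 > f2.txt && git add f2.txt && git commit -qm "feature 2"
git checkout -q master
echo m1 > m1.txt && git add m1.txt && git commit -qm "master 1"

git rev-list --count master..feature    # prints 2: commits only on feature
git rev-list --count master...feature   # prints 3: commits unique to either side
git rev-list --count feature ^master    # prints 2: same as master..feature
```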
</section>
<section id="grep" class="level2">
<h2 class="anchored" data-anchor-id="grep">Grep</h2>
<p>Git <code>grep</code> allows us to search for a pattern in working directory, index, and committed tree. We can also search in older versions of the code such as using old tags/commits, which <code>grep/ack</code> tools can’t.</p>
<p>The most useful combination of flags is <code>git grep -n -p --break --heading pattern optional_path optional_commit</code>.</p>
</section>
<section id="undoing" class="level2">
<h2 class="anchored" data-anchor-id="undoing">Undoing</h2>
<p><strong>Commits are immutable. This means that even though we can fix some things about a commit, we can’t change the commit itself; the original will still be in the history.</strong> Therefore, anything that is committed in Git can almost always be recovered. Even commits that were on deleted branches or that were replaced with an <code>--amend</code> commit can be recovered. However, anything you lose that was never committed is likely never to be seen again.</p>
<ul>
<li><code>git commit --amend</code> will open an editor to write a new commit message to the already committed changes.
<ul>
<li><code>git commit --amend -m "message"</code> is a shorthand</li>
<li><code>git commit --amend --no-edit</code> will add new files to the last commit; in case we forgot to add some files to that belong to the same commit</li>
</ul></li>
<li><code>git reset HEAD file</code> OR <code>git restore --staged file</code> will undo the staging of the file. This is helpful if we staged a file and then we need to change some things before committing.</li>
<li><code>git checkout -- file</code> OR <code>git restore file</code> discards all the changes made to a file in the working directory. The discarded changes can never be recovered.</li>
<li><code>git reset --soft HEAD~2</code> removes the last two commits from the branch history and points HEAD at their grandparent. <code>--soft</code> keeps the changes in the working directory and index, so running <code>git commit</code> commits the combined changes with the grandparent as parent (<strong>Squashing Commits</strong>).</li>
<li>To cancel the commit while writing the message, exit vim with <code>:cquit</code>, which exits with an error; Git sees the error and doesn’t create the commit.</li>
</ul>
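<p>A minimal <code>--amend</code> round-trip, folding a forgotten file into the last commit (file names are illustrative):</p>

```shell
#!/bin/sh
# Amend the last commit to include a file we forgot to stage.
set -e
rm -rf /tmp/amend-demo && mkdir -p /tmp/amend-demo && cd /tmp/amend-demo
git init -q -b master
git config user.email demo@example.com && git config user.name Demo
echo code > app.txt && git add app.txt && git commit -qm "add feature"

echo docs > docs.txt && git add docs.txt   # the forgotten file
git commit -q --amend --no-edit            # same message, new commit object
git rev-list --count HEAD                  # prints 1: history still has one commit
git show --name-only --format= HEAD        # lists app.txt and docs.txt
```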
</section>
<section id="rebasing-history" class="level2">
<h2 class="anchored" data-anchor-id="rebasing-history">Rebasing History</h2>
<ul>
<li><code>git add file</code> or <code>git add --all</code> or <code>git add directory</code>. This will add all changes made to a specific file/directory.</li>
<li><code>git add --patch</code> Allows us to cherry pick the changes that we want to stage. This is useful if we want to split the changes we made to a specific file into different commits. When we run the command, we will interactively choose what we want to stage using shortcuts.</li>
<li><code>git diff/log HEAD~2..HEAD</code> gives us the diff/log for the range between two commits in history. We can use either commit hashes or references such as <code>HEAD</code>/<code>master</code>.</li>
<li><code>git reset --hard HEAD~1</code> will make HEAD point to its parent and remove the last commit from the log history. Note that the last commit is not completely removed; we can still see it with <code>git reflog</code>.</li>
<li><code>git cherry-pick origin/master..master</code> will replay the commits within this range onto another branch. This is useful when we commit to the wrong branch and want to reapply those commits on another branch: we check out the correct branch and run the command above. To remove the commits from the branch we first committed to, we can use <code>git reset --hard</code> (even though the removed commits are still in our history).</li>
<li><code>git rebase master</code> We want to take the work we’ve done on our feature branch, and reapply it as if it were done on top of the additional commits in our master branch. When performing the rebase, Git finds the commits unique to our branch and computes the diff of the changes they introduced, then moves to the target branch, master in this case, and one by one applies the diffs, creating new commits that reuse the commit messages from our branch. Once done, it updates our branch to point at the newest of these commits created by reapplying the diffs.</li>
<li>While we would never revise published history, specifically the master branch, we almost always revise our commits on feature branches before merging them in. We value a clean history, and the majority of the time, the commits in a feature branch contain many rounds of refactoring and PR reviews which we don’t want in the permanent history. Instead, we want the most direct and concise form of the history that fully captures the change we settled on in our feature branch after completing any refactoring or updates. Running <code>git rebase -i master</code> allows us to do just that.
<ul>
<li>We can remove, reorder, squash, edit, and split commits using interactive rebase.</li>
<li>Git applies and rewrites the changed commits and all the commits that follow them.</li>
<li>It is highly recommended not to change history that has already been pushed to the remote server, unless we’re working on a feature branch and are cleaning up history before merging and closing the pull request.</li>
<li>Reordering commits is done by simply reordering the commit lines shown in the editor.</li>
<li>Be careful: commits are listed in reverse order compared to <code>git log</code>. This means the most recent commit will be last.</li>
</ul></li>
</ul>
</section>
<section id="packfiles" class="level2">
<h2 class="anchored" data-anchor-id="packfiles">Packfiles</h2>
<p>Packfiles are how <em>Git</em> combines objects into a single file to save space: instead of storing every version of a file in full, Git stores the original version plus deltas, and a pack index file stores offsets that point to each object in the packfile. <em>Git</em> runs this automatically when we have too many loose objects, when we run the <code>git gc</code> command, or when we push to a remote server.</p>
</section>
<section id="github-and-remotes" class="level2">
<h2 class="anchored" data-anchor-id="github-and-remotes">Github and Remotes</h2>
<ul>
<li><strong>Hub</strong> and Github CLI tool <strong>gh</strong> make it easy to interact with Github from the command line and integrate well with Git. Useful commands are <code>compare</code>, <code>browse</code>, and <code>pull-request</code>.</li>
<li>To share the code on a given branch using a URL that always points to the same code, we can press <code>y</code> while viewing a file on Github to replace the branch name in the URL with its commit hash, which will always point to the same version of the code even if we make changes to the branch. We can also select lines in the file that will be highlighted when we open the URL.</li>
<li>If we are creating a new branch locally and want to have an upstream version for that branch:
<ul>
<li><code>git branch -u origin/new-branch-name</code> will set the upstream tracking branch for the current branch so we can easily push (the remote branch must already exist)</li>
<li><code>git push -u origin new-branch-name</code> will create the new branch while pushing to Github</li>
<li>If we want the upstream branch to have a different name than the local branch name: <code>git push -u origin local-branch-name:upstream-branch-name</code></li>
</ul></li>
<li>If we want to delete a branch:
<ul>
<li>Locally: <code>git branch -d branch-name</code></li>
<li>Upstream: <code>git push origin --delete branch-name</code></li>
</ul></li>
<li>We can force-push a local branch to another existing upstream branch. This is risky and rarely needed: <code>git push --force origin local-branch:upstream-branch</code></li>
</ul>
</section>
<section id="typical-workflow" class="level2">
<h2 class="anchored" data-anchor-id="typical-workflow">Typical Workflow</h2>
<ul>
<li>Always start by creating a new branch for new features. Almost always strive not to commit directly to the master branch, even for small changes. The workflow is:</li>
</ul>
<blockquote class="blockquote">
<p>create new branch <strong>-&gt;</strong> make small changes <strong>-&gt;</strong> create pull request <strong>-&gt;</strong> pass code reviews and other stuff like CI/CD <strong>-&gt;</strong> Rebase master into feature branch <strong>-&gt;</strong> Interactive rebase to squash all commits from feature branch into one commit message <strong>-&gt;</strong> Fast forward merge with master <strong>-&gt;</strong> push master <strong>-&gt;</strong> delete feature branch locally and on upstream.</p>
</blockquote>
<ul>
<li>Commit small changes often instead of waiting to commit one large change. Large commits make it harder to figure out what changed and more difficult for code reviewers to understand. We can always refine commits with interactive rebase.</li>
<li><strong>Pull Requests:</strong>
<ul>
<li>We first need to push the feature branch into Github using <code>git push -u origin feature-branch</code></li>
<li>We then have two choices to open PRs: either through the Github UI or through command-line tools like <code>hub</code> and <code>gh</code>. The advantage of the Github UI is that it lets you review the code one more time in the compare view before submitting.</li>
<li>Provide as much context and useful detail as possible when drafting your PR description. Answering the following questions is a great start:
<ul>
<li>Why is this change needed?</li>
<li>Were other solutions considered?</li>
<li>Were any assumptions made?</li>
</ul></li>
<li>For work that can’t be broken down into small changes, we can use Github task lists to show all the items that still need work and the planned approach, so that reviewers know not to do in-depth code reviews yet. Every time we push changes, we check off the items that are done.</li>
<li>Code reviews resources:
<ul>
<li><a href="http://confreaks.tv/videos/railsconf2015-implementing-a-strong-code-review-culture">Derek Prior’s talk on Code Review Culture</a></li>
<li><a href="https://github.com/thoughtbot/guides/tree/master/code-review">thoughtbot guide to code review</a></li>
</ul></li>
<li>After getting feedback from the team on the code review, as well as the CI results, we can incorporate the changes the team recommended. Then push the new commits to the feature branch, and they will automatically be included in the PR.</li>
<li>We prefer a clean history built using fast-forward merges. In order to ensure this, before merging our PR we always pull master and rebase our feature branch onto master to ensure that our commits are ahead of master. One nice helper for this is the mup alias which checks out master, pulls, then checks back out our feature branch: <code>mup = !git checkout master &amp;&amp; git pull &amp;&amp; git checkout -</code>. Finally, <code>git rebase master</code>. If we’ve done any rebase, we need to force push changes to remote <code>git push -f</code></li>
<li>Once we’re ahead of master, we can perform an interactive rebase to revise our commits and craft our history. In particular, we can use this time to squash down cleanup and WIP commits, ensuring that each commit we keep is useful and has a solid commit message.</li>
<li>This is the time to ensure that we’ve captured as much context as possible in our commit message to describe the “why” of the change. Two great resources on this topic are:
<ul>
<li><a href="https://robots.thoughtbot.com/5-useful-tips-for-a-better-commit-message">Five Rules for A Good Git Commit Message</a></li>
<li><a href="http://rakeroutes.com/blog/deliberate-git">Stephen Ball’s Deliberate Git talk</a></li>
</ul></li>
<li>If we’ve performed any form of rebase, then we’ll have created new commits and will want to push those up to GitHub in order to get everything in sync. To do this we can force push (<code>git push -f</code>) our branch.</li>
<li>Final steps:
<ul>
<li>If we’ve force pushed after rebasing as described above, we should be all set, but it never hurts to run one last <code>git push</code> to confirm that our local and remote feature branches are in sync.</li>
<li>Fast-forward merge: <code>git checkout master</code> &amp;&amp; <code>git merge - --ff-only</code> (where <code>-</code> refers to the previously checked-out feature branch)</li>
<li>Push master: now that master contains our feature branch’s commits, we can push it up to GitHub with <code>git push</code>. As a reminder, with a fast-forward merge we are simply moving our master branch pointer to point at our feature branch’s tip commit, not actually creating any new commits. This is one of the main benefits of using fast-forward merges: all commits are created and can be reviewed on our feature branch before merging into master. With “Big Green Button on GitHub” merges and other non-fast-forward merges, the merge commit is created directly on master based on Git’s merging algorithm.</li>
<li>Delete local branch: <code>git branch -d decks-ordering</code></li>
<li>Delete remote branch: <code>git push origin --delete &lt;branchName&gt;</code>. We can also delete the branch via the GitHub PR page, and then git pull on master, letting the fetch prune setting automatically clean up our local reference to the remote branch.</li>
<li>Pull request auto-closing: assuming we’ve performed the steps outlined above, GitHub will have automatically closed the PR based on the fact that master now contains our branch’s commits.</li>
</ul></li>
</ul></li>
</ul>
</section>
<section id="configuration" class="level2">
<h2 class="anchored" data-anchor-id="configuration">Configuration</h2>
<p>Git looks for configurations in the following places:</p>
<ul>
<li>First, <code>/etc/gitconfig</code>. Any time we use <code>git config --system</code>, it reads/writes this file</li>
<li>Second, <code>~/.gitconfig</code> for each user. Any time we use <code>git config --global</code>, it reads/writes this file</li>
<li>Finally, the <code>config</code> file inside the repository’s <code>.git</code> directory. Any time we use <code>git config --local</code>, it reads/writes this file</li>
</ul>
<p>The <code>gitconfig</code> file is read automatically before any Git command is run. That turns out to be very handy, as it means we never have to reload or worry about out-of-sync configuration. Additionally, Git writes to it automatically when we run commands like <code>git config --global alias.ga</code>.</p>
<ul>
<li>The config file is split into sections such as <em>color, alias, core, push, etc.</em> For example:</li>
</ul>
<div class="sourceCode" id="cb58" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb58-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">[push]</span></span>
<span id="cb58-2">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">default</span> = upstream</span></code></pre></div>
<p>is the same as <code>git config --global push.default upstream</code>.</p>
<p>Few useful configurations:</p>
<ul>
<li><code>push.default upstream</code> this instructs Git how to respond when you run git push with no arguments. With the upstream configuration, it will push the configured upstream tracking branch (set up with <code>git push -u</code>).</li>
<li><code>merge.ff only</code> this configuration tells Git to reject merges that are non-fast-forward. With fast-forward merges, no new commits are created; instead, the merging branch (typically master) is simply moved to point at the commits of the target branch (typically our feature branch).</li>
<li><code>fetch.prune true</code> this instructs Git to clear local references to remote branches which have been deleted when you pull.</li>
</ul>
<p>By default, we can only execute one git command when aliasing. To execute more than one command, we can start the command with <code>!</code> and then we can execute multiple shell commands using pipes, &amp;&amp;, and ||. For example, <code>!git checkout master &amp;&amp; git pull &amp;&amp; git checkout -</code>.</p>
<p><strong>Git subcommands</strong> allow us to write scripts in any language we want, not necessarily bash, and have Git execute them. The subcommand script has to be:</p>
<ul>
<li>On our <code>$PATH</code></li>
<li>Marked as executable</li>
<li>Named with the prefix <code>git-</code> followed by the name of the command, e.g., <code>git-subcommand-name</code>. Actually, all git commands are files that meet these criteria, such as <code>git-add</code>. Below is an example of a subcommand:</li>
</ul>
<div class="sourceCode" id="cb59" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb59-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#!/bin/bash</span></span>
<span id="cb59-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#</span></span>
<span id="cb59-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Small wrapper around git commit. Bare 'cm' will enter normal git commit</span></span>
<span id="cb59-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># editor, but with args it will do a direct `commit -m`</span></span>
<span id="cb59-5"></span>
<span id="cb59-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">[[</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$#</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&gt;</span> 0 <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">]];</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">then</span></span>
<span id="cb59-7">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> commit <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-m</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$@</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb59-8"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span></span>
<span id="cb59-9">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> commit <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-v</span></span>
<span id="cb59-10"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">fi</span></span></code></pre></div>
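fi</span>">
<p>Since subcommands can be written in any language, here is the same wrapper sketched in Python. The <code>git-cm</code> name and the helper function are my own illustration, not part of the course; the argument handling is factored into a function so the mapping can be checked without actually committing:</p>

```python
#!/usr/bin/env python3
# Hypothetical `git-cm` subcommand: `git cm fix typo` runs `git commit -m "fix typo"`,
# while a bare `git cm` opens the normal commit editor, mirroring the bash wrapper above.
import subprocess
import sys


def build_commit_argv(args):
    """Map subcommand arguments to the git command to execute."""
    if args:
        # Join all words into a single commit message.
        return ["git", "commit", "-m", " ".join(args)]
    return ["git", "commit", "-v"]


def main():
    return subprocess.call(build_commit_argv(sys.argv[1:]))

# To install: save as an executable file named `git-cm` somewhere on $PATH
# and end the file with `sys.exit(main())`.
```
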
</section>
<section id="resources" class="level2">
<h2 class="anchored" data-anchor-id="resources">Resources</h2>
<ul>
<li><a href="http://gitready.com/">Git Ready</a>: Practical how-to pages on topics like “get a file from a specific revision.”</li>
<li><a href="https://progit.org/">Pro Git</a>: A great in-depth resource I find myself continually coming back to.</li>
<li><a href="https://github.com/pluralsight/git-internals-pdf">Git Internals</a>: A deep dive into the Git object model, with more detail and nuance than we could cover in this course’s video on the topic</li>
<li><a href="https://github.com/thoughtbot/guides/tree/main/">Thoughtbot Guides</a></li>
<li><a href="https://cli.github.com/">Github CLI</a></li>
<li>Add this to Vim <code>autocmd Filetype gitcommit setlocal spell textwidth=72</code></li>
<li><a href="https://github.com/tpope/vim-fugitive">Fugitive Plugin</a>
<ul>
<li><a href="http://vimcasts.org/blog/2011/05/the-fugitive-series/">five part Fugitive series on Vimcasts</a></li>
</ul></li>
<li><a href="https://github.com/christoomey/vim-conflicted">Conflicted</a> Optimizing Fugitive for merge and rebase conflicts</li>
</ul>


</section>

]]></description>
  <category>Software Engineering</category>
  <guid>https://imaddabbura.github.io/posts/swe/Advanced-Git.html</guid>
  <pubDate>Fri, 22 Dec 2023 06:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/swe/images/git.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>I Built My Own PyTorch (Tiny Version) — Here’s Everything I Learned</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/mlsys/dl-systems.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge evergreen">evergreen</span></div>
<section id="why-build-a-deep-learning-framework-from-scratch" class="level2">
<h2 class="anchored" data-anchor-id="why-build-a-deep-learning-framework-from-scratch">Why Build a Deep Learning Framework from Scratch?</h2>
<p>Every deep learning practitioner eventually runs <code>loss.backward()</code> and watches gradients flow. But what <em>actually</em> happens inside that call? Where do the intermediate tensors live? Why does your GPU run out of memory on a model that “should” fit? And why does reshaping a tensor sometimes silently copy gigabytes of data?</p>
<p>I built <a href="https://github.com/ImadDabbura/tiny-pytorch"><code>tiny_pytorch</code></a> to answer these questions for myself. Along the way, I encountered nearly every foundational design decision that real frameworks like PyTorch, TensorFlow, and Caffe had to make — and learned <em>why</em> they made them.</p>
<p>This post distills everything I learned into a coherent narrative. We’ll start from the framework-level design philosophy, work our way down to how bytes are laid out in memory, and then zoom back out to distributed training across multiple GPUs. The goal is <strong>intuition</strong>: mental models you can carry with you when debugging real systems.</p>
<section id="roadmap" class="level3">
<h3 class="anchored" data-anchor-id="roadmap">Roadmap</h3>
<p>Here’s what we’ll cover and why it matters:</p>
<table class="table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Section</th>
<th>What You’ll Learn</th>
<th>Why It Matters</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Framework Design</strong></td>
<td>Static vs.&nbsp;dynamic graphs, and the Caffe → TF → PyTorch arc</td>
<td>Understand trade-offs you inherit from your framework</td>
</tr>
<tr class="even">
<td><strong>Automatic Differentiation</strong></td>
<td>Forward vs.&nbsp;reverse mode AD, what gets saved</td>
<td>Know <em>why</em> backward passes consume so much memory</td>
</tr>
<tr class="odd">
<td><strong>Memory Layout</strong></td>
<td>Shapes, strides, views, and when copies happen</td>
<td>Stop guessing about tensor memory behavior</td>
</tr>
<tr class="even">
<td><strong>Hardware Acceleration</strong></td>
<td>Alignment, parallelism, BLAS, im2col</td>
<td>Understand the layer between your code and silicon</td>
</tr>
<tr class="odd">
<td><strong>Initialization &amp; Normalization</strong></td>
<td>Why init persists, and how norms fix training</td>
<td>Debug training instabilities at their root</td>
</tr>
<tr class="even">
<td><strong>Regularization</strong></td>
<td>Implicit vs.&nbsp;explicit, dropout mechanics</td>
<td>Apply regularization correctly (L2 ≠ weight decay!)</td>
</tr>
<tr class="odd">
<td><strong>Scaling Up</strong></td>
<td>Checkpointing, data/model/pipeline parallelism</td>
<td>Train models that don’t fit in memory</td>
</tr>
<tr class="even">
<td><strong>Neural Network Architectures</strong></td>
<td>CNN, RNN, LSTM, Transformer, GAN design choices</td>
<td>See architectures through a <em>systems</em> lens</td>
</tr>
</tbody>
</table>
<hr>
</section>
</section>
<section id="the-evolution-of-dl-frameworks" class="level2">
<h2 class="anchored" data-anchor-id="the-evolution-of-dl-frameworks">The Evolution of DL Frameworks</h2>
<p>Before writing a single line of code, it helps to understand the three philosophies that shaped modern deep learning frameworks. Each solved a real problem — and introduced new ones.</p>
<section id="caffe-layers-all-the-way-down" class="level3">
<h3 class="anchored" data-anchor-id="caffe-layers-all-the-way-down">Caffe: Layers All the Way Down</h3>
<p>Caffe (C++ only) was beautifully simple. You defined your computation as a stack of <strong>layers</strong>, each implementing a <code>forward()</code> and <code>backward()</code> method. The backward pass was a direct implementation of the backpropagation algorithm from Hinton’s seminal work — each layer knew how to compute its own gradients, and updates happened in-place.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Mental Model
</div>
</div>
<div class="callout-body-container callout-body">
<p>Think of Caffe layers like a stack of Lego bricks. Each brick knows its own shape (forward) and how to “unstick” itself (backward). Simple, intuitive, but rigid — you can’t easily build non-linear architectures.</p>
</div>
</div>
</section>
<section id="tensorflow-1.x-the-static-graph" class="level3">
<h3 class="anchored" data-anchor-id="tensorflow-1.x-the-static-graph">TensorFlow 1.x: The Static Graph</h3>
<p>TensorFlow introduced a powerful idea: <strong>construct a static computation graph first</strong>, then execute it. This separation of <em>definition</em> and <em>execution</em> unlocked serious optimizations — the compiler could fuse operations, reuse memory, and skip unnecessary computations at run-time.</p>
<p>The cost? Debugging was painful. You couldn’t just print a tensor mid-computation. The graph had its own “programming language” that felt alien to Python developers. Experimentation slowed down because every change required rebuilding the graph.</p>
</section>
<section id="pytorch-define-by-run" class="level3">
<h3 class="anchored" data-anchor-id="pytorch-define-by-run">PyTorch: Define by Run</h3>
<p>PyTorch flipped the script with <strong>dynamic computation graphs</strong> — the graph is built on-the-fly as you execute operations. This is called <em>define by run</em>. You can mix Python control flow (if/else, loops) directly with tensor operations, set breakpoints anywhere, and inspect intermediate values trivially.</p>
<p>The trade-off? Dynamic graphs are typically harder to optimize ahead of time. You lose the global view that static compilation provides. Modern PyTorch addresses this with <code>torch.compile()</code> and JIT compilation, getting closer to static-graph performance while keeping the dynamic-graph developer experience.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Trade-off Triangle
</div>
</div>
<div class="callout-body-container callout-body">
<p>Every DL framework navigates three competing goals: <strong>ease of debugging</strong>, <strong>optimization potential</strong>, and <strong>flexibility</strong>. Caffe optimized for simplicity, TensorFlow for optimization, and PyTorch for flexibility. No framework gets all three for free.</p>
</div>
</div>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart LR
    A["&lt;b&gt;Caffe&lt;/b&gt;&lt;br/&gt;Layers with forward/backward&lt;br/&gt;In-place updates&lt;br/&gt;C++ only"] --&gt; B["&lt;b&gt;TensorFlow 1.x&lt;/b&gt;&lt;br/&gt;Static graph&lt;br/&gt;Compile-then-run&lt;br/&gt;Hard to debug"]
    B --&gt; C["&lt;b&gt;PyTorch&lt;/b&gt;&lt;br/&gt;Dynamic graph&lt;br/&gt;Define-by-run&lt;br/&gt;Python-native"]
    C --&gt; D["&lt;b&gt;Modern PyTorch&lt;/b&gt;&lt;br/&gt;torch.compile / JIT&lt;br/&gt;Best of both worlds"]

    style A fill:#f9f,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#fbb,stroke:#333
    style D fill:#bfb,stroke:#333
</pre>
</div>
<p></p><figcaption> The evolution of DL framework design philosophies</figcaption> </figure><p></p>
</div>
</div>
</div>
<p><strong>Key takeaway:</strong> Framework design is fundamentally about <em>when</em> the computation graph is known. Know it early (static) and you can optimize aggressively. Know it late (dynamic) and you can iterate fast. Modern systems try to give you both.</p>
<hr>
</section>
</section>
<section id="automatic-differentiation-the-engine-room" class="level2">
<h2 class="anchored" data-anchor-id="automatic-differentiation-the-engine-room">Automatic Differentiation: The Engine Room</h2>
<p>Automatic differentiation (AD) is the core engine of every deep learning framework. It’s what makes <code>loss.backward()</code> work. But there are two fundamentally different approaches, and understanding <em>why</em> we use one over the other is essential.</p>
<section id="forward-mode-ad" class="level3">
<h3 class="anchored" data-anchor-id="forward-mode-ad">Forward Mode AD</h3>
<p>In forward mode, we walk from <strong>inputs to outputs</strong>. At each node, we compute the partial derivative of that node with respect to a <em>single</em> input variable. This means:</p>
<ul>
<li>For <strong>each input variable</strong>, we need a <em>full forward pass</em> through the graph.</li>
<li>If we have <img src="https://latex.codecogs.com/png.latex?n"> inputs, we need <img src="https://latex.codecogs.com/png.latex?n"> forward AD passes.</li>
</ul>
<p>For a typical deep learning loss function — a scalar output with millions of input parameters — this is catastrophically inefficient. We’d need millions of passes just to get one gradient update.</p>
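<p>A quick way to see forward mode in action is with dual numbers, where every value carries its derivative with respect to one seeded input. The <code>Dual</code> class below is my own minimal illustration, not tiny_pytorch code:</p>

```python
class Dual:
    """A value paired with its derivative w.r.t. one chosen input (forward-mode AD)."""
    def __init__(self, val, dot):
        self.val = val   # primal value
        self.dot = dot   # derivative of this value w.r.t. the seeded input

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)


def f(x, y):
    return x * x + x * y  # f(x, y) = x^2 + xy

# Seed x with dot=1 to get df/dx; y is held constant for this pass (dot=0).
out = f(Dual(3.0, 1.0), Dual(2.0, 0.0))
# df/dx = 2x + y = 8 at (3, 2). Getting df/dy would require a SECOND full pass.
```

One pass per input variable: this is exactly why forward mode scales with the number of inputs, not outputs.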
</section>
<section id="reverse-mode-ad-backpropagation" class="level3">
<h3 class="anchored" data-anchor-id="reverse-mode-ad-backpropagation">Reverse Mode AD (Backpropagation)</h3>
<p>Reverse mode flips the direction. We walk from the <strong>output back to inputs</strong>, computing the gradient of the scalar output with respect to <em>all</em> input nodes in a <strong>single backward pass</strong>. This is why it’s the standard for deep learning: one output, millions of inputs, one pass.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart TD
    subgraph forward["Forward Mode (one pass per input)"]
        direction LR
        x1f["x₁"] --&gt; |"∂a/∂x₁"| af["a"] --&gt; |"∂b/∂x₁"| bf["b"] --&gt; |"∂L/∂x₁"| Lf["L"]
    end

    subgraph reverse["Reverse Mode (one pass for ALL inputs)"]
        direction RL
        Lr["L"] --&gt; |"∂L/∂b"| br["b"] --&gt; |"∂L/∂a"| ar["a"] --&gt; |"∂L/∂x₁&lt;br/&gt;∂L/∂x₂&lt;br/&gt;∂L/∂x₃"| xr["x₁, x₂, x₃"]
    end
</pre>
</div>
<p></p><figcaption> Forward vs.&nbsp;reverse mode AD — reverse mode computes all gradients in a single backward pass</figcaption> </figure><p></p>
</div>
</div>
</div>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Memory Cost of Reverse Mode
</div>
</div>
<div class="callout-body-container callout-body">
<p>Reverse mode has a catch: to compute gradients during the backward pass, we need the <strong>intermediate values from the forward pass</strong>. For each operation, we must store the input tensors and remember which operation created them. This is why training uses far more memory than inference — all those “saved tensors” accumulate on the graph.</p>
</div>
</div>
<p>Here’s what the autograd system actually tracks:</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart LR
    x["Input x&lt;br/&gt;&lt;i&gt;leaf tensor&lt;/i&gt;"] --&gt; mul["Mul"]
    w["Weight W&lt;br/&gt;&lt;i&gt;leaf tensor&lt;/i&gt;"] --&gt; mul
    mul --&gt; |"z = W·x&lt;br/&gt;&lt;b&gt;saved: W, x&lt;/b&gt;"| act["ReLU"]
    act --&gt; |"a = relu(z)&lt;br/&gt;&lt;b&gt;saved: z&lt;/b&gt;"| loss_fn["MSELoss"]
    y["Target y"] --&gt; loss_fn
    loss_fn --&gt; |"L = loss(a, y)&lt;br/&gt;&lt;b&gt;saved: a, y&lt;/b&gt;"| L["Scalar Loss L"]

    L -.-&gt; |"backward()"| loss_fn
    loss_fn -.-&gt; act
    act -.-&gt; mul
    mul -.-&gt; x
    mul -.-&gt; w

    style x fill:#e8f5e9
    style w fill:#e8f5e9
    style L fill:#ffcdd2
</pre>
</div>
<p></p><figcaption> What the autograd engine saves during a forward pass — every intermediate result and its creator must be retained for backward</figcaption> </figure><p></p>
</div>
</div>
</div>
<p>The dashed arrows show the backward pass, which retraces the forward graph in reverse. At each node, the saved tensors are consumed to compute local gradients.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Gradients as Directional Information
</div>
</div>
<div class="callout-body-container callout-body">
<p>The gradient at each node tells you: <em>“In which direction would changing this value increase the loss most steeply?”</em> It points toward steepest <strong>ascent</strong> — the direction of maximum loss increase. To decrease the loss, we move in the <strong>negative</strong> gradient direction. This is why gradient descent subtracts the gradient from the parameters: <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cleftarrow%20%5Ctheta%20-%20%5Calpha%20%5Cnabla_%5Ctheta%20L">.</p>
</div>
</div>
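<p>The update rule is easy to verify numerically. Here is a tiny sketch of my own (not framework code) minimizing a toy loss:</p>

```python
# Minimize L(theta) = (theta - 3)**2, whose gradient is 2 * (theta - 3).
theta, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (theta - 3)     # gradient points toward steepest ASCENT of L
    theta = theta - lr * grad  # so we step in the NEGATIVE gradient direction
# theta converges to 3.0, the minimizer of L
```
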
<p>One powerful consequence: the backward pass itself builds a computation graph for the gradients. This means you can compute <strong>gradients of gradients</strong> simply by adding more operations — which is exactly what second-order methods and some meta-learning approaches do.</p>
<p><strong>Key takeaway:</strong> Reverse mode AD gives us all gradients in one pass, but the price is memory — every intermediate tensor from the forward pass must be kept alive until it’s consumed by the backward pass.</p>
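<p>To make the saved-tensor bookkeeping concrete, here is a minimal scalar reverse-mode sketch of my own (illustrative only, not the tiny_pytorch API): each operation records its inputs, and <code>backward()</code> replays the graph in reverse topological order.</p>

```python
class Scalar:
    """Minimal reverse-mode node: value, gradient, saved inputs, and a local rule."""
    def __init__(self, val, parents=()):
        self.val = val
        self.grad = 0.0
        self.parents = parents        # the "saved tensors" of the op that made this
        self.backward_rule = None

    def __add__(self, other):
        out = Scalar(self.val + other.val, (self, other))
        def rule():
            self.grad += out.grad     # d(a+b)/da = 1
            other.grad += out.grad    # d(a+b)/db = 1
        out.backward_rule = rule
        return out

    def __mul__(self, other):
        out = Scalar(self.val * other.val, (self, other))
        def rule():
            self.grad += other.val * out.grad  # d(ab)/da = b (needs saved b!)
            other.grad += self.val * out.grad  # d(ab)/db = a (needs saved a!)
        out.backward_rule = rule
        return out

    def backward(self):
        # Topologically sort the graph, then apply rules from output back to inputs.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for p in node.parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0               # dL/dL = 1
        for node in reversed(order):
            if node.backward_rule:
                node.backward_rule()


w, x, b = Scalar(2.0), Scalar(3.0), Scalar(1.0)
loss = w * x + b                      # L = wx + b
loss.backward()
# One backward pass fills in ALL gradients: dL/dw = x = 3, dL/dx = w = 2, dL/db = 1
```

Notice that the multiplication rule reads the saved operand values: drop them after the forward pass and the backward pass cannot run. That is the memory cost in miniature.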
<hr>
</section>
</section>
<section id="memory-layout-shapes-strides-and-the-viewcopy-divide" class="level2">
<h2 class="anchored" data-anchor-id="memory-layout-shapes-strides-and-the-viewcopy-divide">Memory Layout: Shapes, Strides, and the View/Copy Divide</h2>
<p>This is where the rubber meets the road. Understanding how tensors are stored in memory explains a surprising number of performance issues and subtle bugs.</p>
<section id="the-flat-array-reality" class="level3">
<h3 class="anchored" data-anchor-id="the-flat-array-reality">The Flat Array Reality</h3>
<p>Whether you’re on CPU or GPU, the hardware gives you a <strong>flat, contiguous block of memory</strong>. There are no “dimensions” at the hardware level — just consecutive slots. To create the <em>illusion</em> of an N-dimensional array, we need three pieces of metadata:</p>
<ul>
<li><strong>Shape</strong>: The logical dimensions (e.g., <code>[3, 4]</code> for a 3×4 matrix)</li>
<li><strong>Stride</strong>: How many elements to skip in the flat array to move one step along each dimension</li>
<li><strong>Offset</strong>: Where the data starts within the flat array</li>
</ul>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Row-Major vs.&nbsp;Column-Major via Strides
</div>
</div>
<div class="callout-body-container callout-body">
<p>For a 2D array <code>A</code> with shape <code>[R, C]</code>:</p>
<ul>
<li><strong>Row-major</strong> (C/NumPy/PyTorch default): <code>stride = [C, 1]</code> — rows are contiguous</li>
<li><strong>Column-major</strong> (Fortran/BLAS): <code>stride = [1, R]</code> — columns are contiguous</li>
</ul>
<p>Most BLAS libraries (the workhorses of linear algebra) are implemented in Fortran and expect column-major layout. This is why you sometimes see frameworks internally transposing data before calling into BLAS routines.</p>
</div>
</div>
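<p>You can check the stride arithmetic above directly in NumPy (one caveat: NumPy reports strides in <em>bytes</em>, so divide by the item size to get element strides):</p>

```python
import numpy as np

# Row-major vs. column-major strides for a 3x4 array (R=3, C=4).
A = np.zeros((3, 4))                  # C order (row-major) is the default
F = np.asfortranarray(A)              # Fortran order (column-major) copy

elem = A.itemsize                     # NumPy strides are in bytes
assert tuple(s // elem for s in A.strides) == (4, 1)   # stride = [C, 1]
assert tuple(s // elem for s in F.strides) == (1, 3)   # stride = [1, R]
```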
</section>
<section id="views-same-memory-different-perspective" class="level3">
<h3 class="anchored" data-anchor-id="views-same-memory-different-perspective">Views: Same Memory, Different Perspective</h3>
<p>The stride mechanism enables something powerful: multiple tensor objects can <strong>share the same underlying memory</strong> with different shapes, strides, and offsets. These are called <em>views</em>. Slicing, transposing, and broadcasting always create views, never copies; reshape creates a view only when the memory layout allows it:</p>
<table class="table">
<thead>
<tr class="header">
<th>Operation</th>
<th>What Changes</th>
<th>Memory Cost</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Slice</strong></td>
<td>Offset + shape + stride</td>
<td>Zero (view)</td>
</tr>
<tr class="even">
<td><strong>Transpose</strong></td>
<td>Strides are swapped, shape changes</td>
<td>Zero (view)</td>
</tr>
<tr class="odd">
<td><strong>Broadcast</strong></td>
<td>Stride set to 0 along new dims</td>
<td>Zero (view)</td>
</tr>
<tr class="even">
<td><strong>Reshape/View</strong></td>
<td>Shape + stride (if compatible)</td>
<td>Zero <em>or</em> copy</td>
</tr>
</tbody>
</table>
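<p>The view behavior in the table is easy to verify in NumPy, which uses the same shape/stride machinery:</p>

```python
import numpy as np

A = np.arange(12, dtype=np.float32).reshape(3, 4)

s = A[0:2, 1:3]                       # slice: new offset/shape/stride
t = A.T                               # transpose: strides swapped
b = np.broadcast_to(A[0], (5, 4))     # broadcast: stride 0 on the new dim

# All three are views — no element was copied.
assert all(np.shares_memory(A, v) for v in (s, t, b))

A[0, 1] = 99.0                        # writes through the base...
assert s[0, 0] == 99.0                # ...are visible in the slice
```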
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
When Reshape Becomes a Copy
</div>
</div>
<div class="callout-body-container callout-body">
<p><code>reshape</code> / <code>view</code> can create a view <em>only</em> when the new shape is compatible with existing strides (i.e., the data is already contiguous in the right order). If the tensor has been transposed or sliced in a way that makes the data non-contiguous, <code>reshape</code> must <strong>copy</strong> the data into a new contiguous block. This can silently allocate gigabytes of memory.</p>
<p><strong>How to detect it:</strong> In PyTorch, call <code>tensor.is_contiguous()</code> before reshaping. If it returns <code>False</code>, the reshape will trigger a copy. Use <code>tensor.contiguous()</code> explicitly to make the copy intentional and visible.</p>
</div>
</div>
</section>
<section id="the-contiguity-problem" class="level3">
<h3 class="anchored" data-anchor-id="the-contiguity-problem">The Contiguity Problem</h3>
<p>After operations like slicing or transposing, the logical tensor and the physical memory layout can diverge. The tensor is no longer <em>compact</em> — meaning the offset isn’t 0 or the strides don’t correspond to row-major order.</p>
<p>This matters because many operations (especially matrix multiplication) require contiguous data for efficient memory access. The framework typically handles this by checking compactness before an operation and creating a contiguous copy if needed. But this implicit copy is a hidden performance cost.</p>
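<p>A NumPy sketch of the same contiguity check (the PyTorch equivalents are <code>is_contiguous()</code> and <code>contiguous()</code>): after a transpose, flattening can no longer be expressed with strides, so reshape silently copies.</p>

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

v = A.reshape(-1)                     # contiguous: reshape stays a view
assert np.shares_memory(A, v)

t = A.T                               # transposed: no longer C-contiguous
assert not t.flags['C_CONTIGUOUS']

c = t.reshape(-1)                     # data order changed: forces a copy
assert not np.shares_memory(A, c)
```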
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart TD
    flat["Flat memory: [a b c d e f g h i j k l]"] --&gt; orig["Tensor A&lt;br/&gt;shape=[3,4], stride=[4,1], offset=0"]
    flat --&gt; slice["Slice A[0:2, 1:3]&lt;br/&gt;shape=[2,2], stride=[4,1], offset=1&lt;br/&gt;&lt;b&gt;VIEW (shared memory)&lt;/b&gt;"]
    flat --&gt; trans["A.T&lt;br/&gt;shape=[4,3], stride=[1,4], offset=0&lt;br/&gt;&lt;b&gt;VIEW (shared memory)&lt;/b&gt;"]

    trans --&gt; |"reshape(-1) on&lt;br/&gt;non-contiguous tensor"| copy["New flat memory&lt;br/&gt;&lt;b&gt;COPY (new allocation)&lt;/b&gt;"]

    style flat fill:#fff3e0
    style slice fill:#e8f5e9
    style trans fill:#e8f5e9
    style copy fill:#ffcdd2
</pre>
</div>
<p></p><figcaption> View operations share memory; some operations force a copy when data is non-contiguous</figcaption> </figure><p></p>
</div>
</div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Rule of Thumb
</div>
</div>
<div class="callout-body-container callout-body">
<p>If you chain <code>transpose</code> + <code>reshape</code>, you’re almost certainly triggering a copy. If you’re in a hot loop or a custom kernel, this matters. Profile with <code>torch.cuda.memory_allocated()</code> to catch surprise allocations.</p>
</div>
</div>
<p><strong>Key takeaway:</strong> Tensors are flat arrays dressed up with metadata. Operations that only change metadata (slice, transpose, broadcast) are free. Operations that need physically contiguous data may silently copy. Know which is which.</p>
<hr>
</section>
</section>
<section id="broadcasting-and-its-gradient-implications" class="level2">
<h2 class="anchored" data-anchor-id="broadcasting-and-its-gradient-implications">Broadcasting and Its Gradient Implications</h2>
<p>Broadcasting is one of the most convenient features in numerical computing — and one of the most misunderstood when it comes to gradients.</p>
<section id="the-forward-pass-implicit-repetition" class="level3">
<h3 class="anchored" data-anchor-id="the-forward-pass-implicit-repetition">The Forward Pass: Implicit Repetition</h3>
<p>When you add a bias vector <code>b</code> of shape <code>[1, C]</code> to an activation matrix <code>A</code> of shape <code>[N, C]</code>, broadcasting logically <em>repeats</em> <code>b</code> along the batch dimension <code>N</code> times. But crucially, <strong>no data is copied</strong>. The framework simply sets the stride to 0 along the broadcast dimension, so the same values are read repeatedly.</p>
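<p>You can see the stride-0 trick directly — a broadcast tensor reports a zero stride along the repeated dimension and still shares memory with its base:</p>

```python
import numpy as np

b = np.array([[0.5, -0.3]])           # bias, shape [1, C]
B = np.broadcast_to(b, (4, 2))        # logically repeated over N=4 rows

assert B.strides[0] == 0              # stride 0 on the batch dim: no copy
assert np.shares_memory(b, B)         # the same two floats, read repeatedly
```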
</section>
<section id="the-backward-pass-sum-reduce" class="level3">
<h3 class="anchored" data-anchor-id="the-backward-pass-sum-reduce">The Backward Pass: Sum-Reduce</h3>
<p>Here’s the subtle part. During the backward pass, if a value was broadcast (repeated) across a dimension, the gradients must be <strong>summed along that dimension</strong>. Why? Because the same parameter contributed to multiple outputs — its total influence is the sum of all its partial effects.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart LR
    subgraph fwd["Forward: broadcast adds"]
        direction TB
        A_fwd["A: shape [N, C]"] --&gt; plus["+ (broadcast)"]
        b_fwd["b: shape [1, C]&lt;br/&gt;(stride 0 on dim 0)"] --&gt; plus
        plus --&gt; out_fwd["Output: shape [N, C]"]
    end

    subgraph bwd["Backward: sum-reduce"]
        direction TB
        grad_out["∂L/∂Output: shape [N, C]"] --&gt; sum_op["sum(dim=0)"]
        sum_op --&gt; grad_b["∂L/∂b: shape [1, C]"]
        grad_out --&gt; grad_A["∂L/∂A: shape [N, C]&lt;br/&gt;(passed through directly)"]
    end

    fwd --&gt; |"backward()"| bwd
</pre>
</div>
<p></p><figcaption> Broadcasting repeats values in the forward pass; gradients must sum-reduce along broadcast dimensions in the backward pass</figcaption> </figure><p></p>
</div>
</div>
</div>
<p><strong>Worked example:</strong></p>
<p>Suppose <code>A</code> has shape <code>[3, 2]</code> and <code>b</code> has shape <code>[1, 2]</code> with values <code>[0.5, -0.3]</code>. After broadcasting, every row of <code>A</code> gets the same bias added. If the upstream gradient <code>∂L/∂Output</code> is:</p>
<pre><code>[[1.0, 2.0],
 [0.5, 1.5],
 [0.3, 0.7]]</code></pre>
<p>Then <code>∂L/∂b = sum along dim 0 = [1.8, 4.2]</code>, because <code>b</code> influenced all three rows.</p>
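<p>The worked example above, checked numerically:</p>

```python
import numpy as np

grad_out = np.array([[1.0, 2.0],
                     [0.5, 1.5],
                     [0.3, 0.7]])          # upstream dL/dOutput, shape [3, 2]

grad_b = grad_out.sum(axis=0, keepdims=True)  # sum-reduce the broadcast dim
grad_A = grad_out                             # passes through unchanged

assert np.allclose(grad_b, [[1.8, 4.2]])
```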
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
General Rule
</div>
</div>
<div class="callout-body-container callout-body">
<p>For any operation in autograd: <strong>the gradient of a broadcast is a reduction, and the gradient of a reduction is a broadcast.</strong> This duality shows up everywhere — in loss functions, in normalization layers, and in attention mechanisms.</p>
</div>
</div>
<p><strong>Key takeaway:</strong> Broadcasting doesn’t copy data (strides handle it), but gradients must sum-reduce along every dimension that was broadcast. Forgetting this is a common source of shape mismatch bugs in custom autograd functions.</p>
<hr>
</section>
</section>
<section id="hardware-acceleration-from-strides-to-silicon" class="level2">
<h2 class="anchored" data-anchor-id="hardware-acceleration-from-strides-to-silicon">Hardware Acceleration: From Strides to Silicon</h2>
<p>Understanding the hardware layer helps you write code that runs fast <em>by default</em> instead of fighting the machine.</p>
<section id="memory-alignment" class="level3">
<h3 class="anchored" data-anchor-id="memory-alignment">Memory Alignment</h3>
<p>Hardware loads data into caches in fixed-size chunks called <strong>cache lines</strong> (typically 64 bytes). If your data is aligned to cache line boundaries, a single load brings in exactly what you need. If it’s misaligned, you need <em>two</em> loads for data that spans a boundary — doubling the memory traffic for that access.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Practical Impact
</div>
</div>
<div class="callout-body-container callout-body">
<p>Memory alignment mostly matters for custom kernels and low-level code. High-level frameworks handle this for you. But if you’re writing CUDA kernels or using <code>ctypes</code> to interface with C libraries, ensure your allocations are aligned.</p>
</div>
</div>
</section>
<section id="parallelization-with-openmp" class="level3">
<h3 class="anchored" data-anchor-id="parallelization-with-openmp">Parallelization with OpenMP</h3>
<p>On CPU, the simplest form of parallelism is loop parallelization. Tools like <strong>OpenMP</strong> let you annotate a loop with <code>#pragma omp parallel for</code>, and the runtime splits iterations across CPU cores automatically.</p>
<p>This is the basis for CPU-accelerated tensor operations. Each core processes a different slice of the tensor, and the results are combined. The bottleneck shifts from compute to <strong>memory bandwidth</strong> — reading and writing large tensors becomes the limiting factor, not arithmetic.</p>
</section>
<section id="the-im2col-trick-convolution-as-matrix-multiplication" class="level3">
<h3 class="anchored" data-anchor-id="the-im2col-trick-convolution-as-matrix-multiplication">The im2col Trick: Convolution as Matrix Multiplication</h3>
<p>Convolution is the most compute-intensive operation in CNNs. The <strong>im2col</strong> (image-to-column) trick converts convolution into matrix multiplication, which lets us use heavily optimized BLAS routines.</p>
<p>The process for a batch of images (<code>N × H × W × Cᵢₙ</code>) with filters (<code>K × K × Cᵢₙ × Cₒᵤₜ</code>):</p>
<ol type="1">
<li>Create a 6D strided view: <code>N × H_out × W_out × K × K × Cᵢₙ</code></li>
<li>Reshape to a 2D im2col matrix: <code>(N·H_out·W_out) × (K·K·Cᵢₙ)</code></li>
<li>Reshape weights to 2D: <code>(K·K·Cᵢₙ) × Cₒᵤₜ</code></li>
<li>Matrix multiply: <code>im2col @ weights</code></li>
<li>Reshape result: <code>N × H_out × W_out × Cₒᵤₜ</code></li>
</ol>
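<p>A minimal NumPy sketch of those five steps, for the simplest case (stride-1, no padding). Step 1 is zero-copy thanks to <code>sliding_window_view</code>; step 2 is where the unavoidable copy happens.</p>

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_im2col(x, w):
    """Valid stride-1 convolution of x: [N,H,W,Cin] with w: [K,K,Cin,Cout]."""
    N, H, W, Cin = x.shape
    K, _, _, Cout = w.shape
    Ho, Wo = H - K + 1, W - K + 1
    # 1) 6D strided view over K x K patches — zero-copy
    patches = sliding_window_view(x, (K, K), axis=(1, 2))  # N,Ho,Wo,Cin,K,K
    patches = patches.transpose(0, 1, 2, 4, 5, 3)          # N,Ho,Wo,K,K,Cin
    # 2) flatten patches into the im2col matrix — this reshape must COPY,
    #    because overlapping patches are not contiguous in memory
    cols = patches.reshape(N * Ho * Wo, K * K * Cin)
    # 3-4) reshape weights and hit BLAS with a single GEMM
    out = cols @ w.reshape(K * K * Cin, Cout)
    # 5) restore the spatial layout
    return out.reshape(N, Ho, Wo, Cout)
```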
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
im2col Memory Overhead
</div>
</div>
<div class="callout-body-container callout-body">
<p>The im2col matrix is typically <strong>much larger</strong> than the original image tensor because filter patches overlap. Each input pixel appears in multiple rows of the im2col matrix. The reshape from the 6D strided view to 2D <em>cannot</em> be done as a view (the data isn’t contiguous in the right order), so it triggers a <strong>full copy</strong>. This is a significant memory cost — for large images with many channels, the im2col matrix can be several times the size of the input.</p>
<p><strong>When it helps:</strong> When your BLAS library is highly optimized (which it usually is). The speedup from using GEMM far outweighs the memory copy cost.</p>
<p><strong>When it hurts:</strong> When you’re memory-constrained. Alternative approaches like FFT-based convolution or Winograd transforms can reduce memory usage at the cost of implementation complexity.</p>
</div>
</div>
<p><strong>Key takeaway:</strong> The gap between “logical operations on tensors” and “what the hardware actually does” is large. Frameworks bridge it with tricks like im2col, cache-aware memory layout, and loop parallelization. When performance matters, understanding this layer is essential.</p>
<hr>
</section>
</section>
<section id="weight-initialization-the-effects-that-persist" class="level2">
<h2 class="anchored" data-anchor-id="weight-initialization-the-effects-that-persist">Weight Initialization: The Effects That Persist</h2>
<p>Weight initialization might seem like a minor detail — just pick some random numbers and start training. But the evidence tells a more nuanced story.</p>
<section id="why-initialization-matters-more-than-you-think" class="level3">
<h3 class="anchored" data-anchor-id="why-initialization-matters-more-than-you-think">Why Initialization Matters More Than You Think</h3>
<p>Two observations that changed how I think about initialization:</p>
<ol type="1">
<li><p><strong>The effect of initialization persists throughout training.</strong> Bad initialization affects the relative norms of activations and gradients <em>at every step</em>. If you don’t initialize appropriately (e.g., using a standard deviation of <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B%5Cfrac%7B2%7D%7Bn%7D%7D">, where <code>n</code> is the layer’s fan-in — known as He initialization, for ReLU networks), the L2-norm of activations or gradients will drift — leading to vanishing signals or exploding values.</p></li>
<li><p><strong>Weights don’t move far from their initial values.</strong> This is surprising. If you plot the variance of weights before and after training for each layer, you’ll see remarkably similar values. The weights shift in certain directions, but relative to their initial magnitude, the change is small — especially for deep networks.</p></li>
</ol>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Implication
</div>
</div>
<div class="callout-body-container callout-body">
<p>Together, these observations mean initialization isn’t just “where you start” — it effectively defines the <em>neighborhood</em> of weight space you’ll explore during training. Proper initialization puts you in a good neighborhood. Bad initialization puts you somewhere the optimizer can’t easily escape.</p>
</div>
</div>
</section>
<section id="how-to-diagnose-initialization-problems" class="level3">
<h3 class="anchored" data-anchor-id="how-to-diagnose-initialization-problems">How to Diagnose Initialization Problems</h3>
<p><strong>Monitor two metrics across layers over all training iterations:</strong></p>
<ul>
<li><strong>Norm of weights</strong> per layer</li>
<li><strong>Norm of gradients</strong> per layer</li>
</ul>
<p>If the weight norms explode or collapse across layers, or if gradient norms vary by orders of magnitude between early and late layers, your initialization is likely wrong. Proper initialization keeps these norms roughly stable across layers.</p>
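<p>A toy experiment makes the drift visible. This sketch pushes a random batch through a stack of ReLU layers and records activation norms under He initialization versus an arbitrarily small standard deviation (the shapes and depth are illustrative, not from the text):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def activation_norms(std, depth=30, n=256, batch=64):
    """Record the activation L2-norm after each of `depth` ReLU layers."""
    h = rng.standard_normal((batch, n))
    norms = []
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * std
        h = np.maximum(h @ W, 0.0)           # linear layer + ReLU
        norms.append(np.linalg.norm(h))
    return norms

he = activation_norms(np.sqrt(2.0 / 256))    # He init: norms stay stable
bad = activation_norms(0.01)                 # too small: signal vanishes
```

<p>With He initialization the norms stay within the same order of magnitude across all 30 layers; with std 0.01 they collapse toward zero within a handful of layers.</p>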
<p><strong>Key takeaway:</strong> Proper weight initialization speeds up training and leads to lower final error rates. It defines the effective search region for your optimizer, and its influence doesn’t fade — it persists throughout training.</p>
<hr>
</section>
</section>
<section id="normalization-fixing-what-initialization-cant" class="level2">
<h2 class="anchored" data-anchor-id="normalization-fixing-what-initialization-cant">Normalization: Fixing What Initialization Can’t</h2>
<p>If we know that activation norms can drift during training (due to imperfect initialization or the dynamics of optimization itself), why not just <em>force</em> them to be well-behaved? That’s the idea behind normalization layers.</p>
<section id="batch-normalization" class="level3">
<h3 class="anchored" data-anchor-id="batch-normalization">Batch Normalization</h3>
<p>Batch Normalization normalizes activations <strong>across the batch dimension</strong> for each feature independently. For a given feature, it computes the mean and variance across all examples in the batch, then normalizes to zero mean and unit variance.</p>
<p><strong>When it helps:</strong></p>
<ul>
<li>Dramatically speeds up training by maintaining stable activation norms</li>
<li>Preserves the discriminative information <em>between features</em> within each layer (because normalization is per-feature, not per-example)</li>
</ul>
<p><strong>When it hurts:</strong></p>
<ul>
<li>Creates <strong>dependency between samples</strong> in a batch — each example’s normalized activation depends on the other examples in the batch</li>
<li><strong>Unstable with small batches</strong> — statistics become noisy, and with a batch of 1, the variance is undefined</li>
<li><strong>Doesn’t work well with RNNs</strong> — the hidden state has temporal dependencies across time steps, and computing batch statistics independently at each time step ignores this structure</li>
</ul>
</section>
<section id="layer-normalization" class="level3">
<h3 class="anchored" data-anchor-id="layer-normalization">Layer Normalization</h3>
<p>Layer Normalization normalizes <strong>across all features</strong> for each sample independently. No dependency on other samples in the batch.</p>
<p><strong>When it helps:</strong></p>
<ul>
<li>Works with <strong>any batch size</strong>, including batch size 1</li>
<li><strong>Perfect for RNNs and Transformers</strong> — it normalizes across the embedding dimension for each token in each example, respecting temporal structure</li>
<li>This is why it’s the standard in Transformer architectures</li>
</ul>
<p><strong>When it hurts:</strong></p>
<ul>
<li>For fully connected networks, forcing zero mean and unit variance <em>across features</em> can destroy the relative magnitude differences between activations for different examples. These magnitude differences can be an important discriminative signal.</li>
<li>This makes it harder to drive loss low on tasks where inter-example feature magnitude differences matter</li>
</ul>
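<p>The whole BatchNorm-vs-LayerNorm distinction is just a choice of axis. A minimal sketch (omitting the learnable scale/shift parameters both layers also carry):</p>

```python
import numpy as np

x = np.random.default_rng(0).standard_normal((8, 16))   # [batch, features]
eps = 1e-5

# BatchNorm: per-feature statistics across the batch (axis 0)
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# LayerNorm: per-sample statistics across the features (axis 1)
mu = x.mean(axis=1, keepdims=True)
var = x.var(axis=1, keepdims=True)
ln = (x - mu) / np.sqrt(var + eps)

assert np.allclose(bn.mean(axis=0), 0.0, atol=1e-6)   # each feature centered
assert np.allclose(ln.mean(axis=1), 0.0, atol=1e-6)   # each sample centered
```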
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Choosing Between Them
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>Use BatchNorm</strong> for CNNs with reasonably large batches (≥32). <strong>Use LayerNorm</strong> for Transformers, RNNs, and any setting where batch size is small or variable.</p>
</div>
</div>
<p><strong>Key takeaway:</strong> Normalization layers fix the activation drift that initialization can only partially prevent. BatchNorm and LayerNorm make different trade-offs about <em>what to normalize over</em>, and the right choice depends on your architecture and batch size.</p>
<hr>
</section>
</section>
<section id="regularization-controlling-complexity" class="level2">
<h2 class="anchored" data-anchor-id="regularization-controlling-complexity">Regularization: Controlling Complexity</h2>
<p>Regularization prevents models from memorizing the training data, forcing them to learn patterns that generalize to unseen examples.</p>
<section id="implicit-regularization" class="level3">
<h3 class="anchored" data-anchor-id="implicit-regularization">Implicit Regularization</h3>
<p>Before you add <em>any</em> explicit regularization, your training procedure already constrains the model. <strong>SGD with a particular initialization</strong> only explores a subset of all possible neural networks. The initialization defines the starting point, and the optimizer’s dynamics (step size, momentum, batch sampling) determine the trajectory through weight space.</p>
<p>This is called <em>implicit regularization</em>, and it’s powerful. The fact that SGD-trained networks generalize well — even when they have enough capacity to memorize the training set — is partly due to these implicit biases of the optimization procedure.</p>
</section>
<section id="explicit-regularization" class="level3">
<h3 class="anchored" data-anchor-id="explicit-regularization">Explicit Regularization</h3>
<p>Explicit regularization directly limits the functions the model can learn:</p>
<p><strong>L2 Regularization</strong> adds a penalty proportional to the squared magnitude of the weights. The premise: smoother functions (which don’t change dramatically for small input changes) tend to have smaller weights. By penalizing large weights, we encourage smoother, simpler functions.</p>
<p><strong>Dropout</strong> randomly zeroes out activations with probability <img src="https://latex.codecogs.com/png.latex?p"> during training. A useful mental model: dropout is a <em>stochastic approximation</em> of each layer’s activations, similar to how SGD approximates the full gradient with a mini-batch sample. To keep the expected value consistent, the surviving activations are scaled by <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B1-p%7D"> during training (the standard “inverted dropout” formulation) — or, equivalently, activations are multiplied by <img src="https://latex.codecogs.com/png.latex?1-p"> at inference.</p>
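<p>A minimal inverted-dropout sketch — because survivors are rescaled at train time, inference needs no adjustment at all:</p>

```python
import numpy as np

def dropout(a, p, rng, train=True):
    """Inverted dropout: zero with probability p, scale survivors by
    1/(1-p) at train time so that E[output] == a (inference is a no-op)."""
    if not train:
        return a
    mask = (rng.random(a.shape) >= p).astype(a.dtype) / (1.0 - p)
    return a * mask

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
out = dropout(a, p=0.5, rng=rng)    # roughly half zeros, survivors doubled
```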
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
L2 Regularization ≠ Weight Decay (for Adam!)
</div>
</div>
<div class="callout-body-container callout-body">
<p>For vanilla SGD, L2 regularization and weight decay are mathematically equivalent. But for adaptive optimizers like <strong>Adam</strong>, they are <em>not</em> the same.</p>
<p>Why? Adam computes first and second moments of the gradients. If you add the L2 penalty to the gradient (L2 regularization), the penalty gets scaled by Adam’s adaptive learning rate, making it <strong>less effective</strong> than intended. Weight decay, which adds the penalty directly to the parameter update step <em>without</em> modifying the gradient, avoids this issue.</p>
<p>This distinction — first identified in the “Decoupled Weight Decay” paper (AdamW) — is why AdamW is preferred over Adam + L2 regularization in practice.</p>
</div>
</div>
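<p>The difference is easiest to see on a single scalar parameter. This sketch is a stripped-down Adam step (bias correction omitted, not the full optimizer): <code>l2</code> folds the penalty into the gradient, while <code>wd</code> applies decoupled, AdamW-style decay directly to the parameter.</p>

```python
import math

def adam_step(theta, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, wd=0.0):
    """One simplified single-scalar Adam step (no bias correction)."""
    g = g + l2 * theta                    # L2 penalty enters the moments...
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    theta = theta - lr * m / (math.sqrt(v) + eps)  # ...and gets rescaled by 1/sqrt(v)
    theta = theta - lr * wd * theta       # decoupled decay: untouched by 1/sqrt(v)
    return theta, m, v

# With zero data gradient, the L2-in-the-gradient shrinkage is nearly
# independent of |theta| (Adam normalizes it away); decoupled decay shrinks
# theta in proportion to its magnitude, as intended.
t_l2, _, _ = adam_step(10.0, 0.0, 0.0, 0.0, l2=0.01)
t_wd, _, _ = adam_step(10.0, 0.0, 0.0, 0.0, wd=0.01)
```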
<p><strong>Key takeaway:</strong> Regularization operates at two levels: the implicit biases of SGD and initialization, and explicit penalties like L2/weight decay and dropout. For Adam-family optimizers, always use weight decay (AdamW), not L2 regularization.</p>
<hr>
</section>
</section>
<section id="scaling-up-when-one-gpu-isnt-enough" class="level2">
<h2 class="anchored" data-anchor-id="scaling-up-when-one-gpu-isnt-enough">Scaling Up: When One GPU Isn’t Enough</h2>
<p>Large datasets demand large models, and large models push hardware to its limits. Here’s how the systems community addresses this.</p>
<section id="the-memory-bottleneck" class="level3">
<h3 class="anchored" data-anchor-id="the-memory-bottleneck">The Memory Bottleneck</h3>
<p>The memory hierarchy tells the story:</p>
<ul>
<li><strong>Shared memory per core (GPU):</strong> ~64 KB — fast, tiny</li>
<li><strong>Global GPU memory:</strong> 10–80 GB depending on the device — this is the typical bottleneck</li>
<li><strong>CPU RAM:</strong> 64–512 GB — large but slow to access from GPU</li>
</ul>
<p>Most large models can’t fit entirely in GPU global memory during training, because we need to store: model parameters, optimizer state (2x or 3x model size for Adam), activations (saved for backward), and gradients.</p>
</section>
<section id="memory-saving-techniques" class="level3">
<h3 class="anchored" data-anchor-id="memory-saving-techniques">Memory-Saving Techniques</h3>
<section id="inference-buffer-reuse" class="level4">
<h4 class="anchored" data-anchor-id="inference-buffer-reuse">Inference: Buffer Reuse</h4>
<p>During inference, we don’t need to keep activations for backward. We can reuse a small set of buffers (2 or 3) across layers, writing each layer’s output into a buffer that a previous layer no longer needs. This reduces memory from <code>O(N)</code> to <code>O(1)</code> in the number of layers.</p>
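<p>A two-buffer ping-pong sketch (Python references stand in for preallocated device buffers that a real runtime would overwrite in place):</p>

```python
def run_inference(layers, x):
    """Ping-pong between two buffers: activation memory is O(1) in depth."""
    bufs = [x, None]
    cur = 0
    for f in layers:
        bufs[1 - cur] = f(bufs[cur])   # write into the buffer we no longer need
        cur = 1 - cur
    return bufs[cur]

doubled = run_inference([lambda h: 2 * h] * 5, 1)   # five doubling "layers"
```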
</section>
<section id="training-activation-checkpointing" class="level4">
<h4 class="anchored" data-anchor-id="training-activation-checkpointing">Training: Activation Checkpointing</h4>
<p>During training, we normally keep <em>all</em> activations for the backward pass. Checkpointing trades memory for compute:</p>
<ol type="1">
<li>Divide the network into <strong>segments</strong> of roughly <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7BN%7D"> layers</li>
<li>Only store activations at <strong>segment boundaries</strong> (checkpoints)</li>
<li>During the backward pass, <strong>recompute</strong> the forward pass within each segment to recover the needed activations</li>
</ol>
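<p>The three steps above can be sketched with toy scalar “layers” — the forward pass saves only boundary activations, and the backward-pass helper replays the segment to recover anything in between:</p>

```python
import math

def forward_with_checkpoints(layers, x, every):
    """Forward pass that saves activations only at segment boundaries."""
    saved, h = {0: x}, x
    for i, f in enumerate(layers, start=1):
        h = f(h)
        if i % every == 0:
            saved[i] = h
    return h, saved

def recover_activation(layers, saved, i, every):
    """Backward-pass helper: replay forward from the nearest checkpoint."""
    start = (i // every) * every
    h = saved[start]
    for f in layers[start:i]:
        h = f(h)
    return h

layers = [lambda h, k=k: 2 * h + k for k in range(9)]   # 9 toy layers
every = int(math.isqrt(len(layers)))                    # ~sqrt(N) segments
out, saved = forward_with_checkpoints(layers, 1.0, every)
```

<p>PyTorch packages the same trade-off as <code>torch.utils.checkpoint</code>; the sketch just makes the save/recompute bookkeeping explicit.</p>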
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart LR
    subgraph seg1["Segment 1"]
        L1["Layer 1"] --&gt; L2["Layer 2"] --&gt; L3["Layer 3"]
    end
    subgraph seg2["Segment 2"]
        L4["Layer 4"] --&gt; L5["Layer 5"] --&gt; L6["Layer 6"]
    end
    subgraph seg3["Segment 3"]
        L7["Layer 7"] --&gt; L8["Layer 8"] --&gt; L9["Layer 9"]
    end

    seg1 --&gt; |"✓ checkpoint"| seg2
    seg2 --&gt; |"✓ checkpoint"| seg3

    style L1 fill:#e8f5e9,stroke:#333
    style L3 fill:#e8f5e9,stroke:#333
    style L4 fill:#e8f5e9,stroke:#333
    style L6 fill:#e8f5e9,stroke:#333
    style L7 fill:#e8f5e9,stroke:#333
    style L9 fill:#e8f5e9,stroke:#333
</pre>
</div>
<p></p><figcaption> Activation checkpointing: store only segment boundaries, recompute the rest during backward</figcaption> </figure><p></p>
</div>
</div>
</div>
<table class="table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Approach</th>
<th>Memory</th>
<th>Compute Overhead</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>No checkpointing</td>
<td><code>O(N)</code> activations</td>
<td>None</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?%5Csqrt%7BN%7D"> checkpoints</td>
<td><code>O(√N)</code> activations</td>
<td>~1 extra forward pass</td>
</tr>
<tr class="odd">
<td>Aggressive checkpointing</td>
<td><code>O(1)</code> activations</td>
<td>Up to <code>N</code> extra forward passes</td>
</tr>
</tbody>
</table>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Smart Checkpoint Placement
</div>
</div>
<div class="callout-body-container callout-body">
<p>Choose checkpoints at layers with <strong>cheap recomputation</strong>. ReLU activations are trivial to recompute (just check sign). Convolution or attention layers are expensive. Checkpoint <em>after</em> cheap layers to minimize the recomputation cost.</p>
</div>
</div>
</section>
</section>
<section id="distributed-training-data-and-model-parallelism" class="level3">
<h3 class="anchored" data-anchor-id="distributed-training-data-and-model-parallelism">Distributed Training: Data and Model Parallelism</h3>
<p>When one GPU isn’t enough, we spread the work across multiple devices. There are two fundamental strategies:</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart TD
    DT["Distributed Training"] --&gt; DP["&lt;b&gt;Data Parallelism&lt;/b&gt;&lt;br/&gt;Same model, different data"]
    DT --&gt; MP["&lt;b&gt;Model Parallelism&lt;/b&gt;&lt;br/&gt;Different parts of model"]

    DP --&gt; PS["Parameter Server&lt;br/&gt;Central coordinator"]
    DP --&gt; AR["AllReduce&lt;br/&gt;Peer-to-peer"]

    MP --&gt; TP["Tensor Parallelism&lt;br/&gt;Split layers across devices"]
    MP --&gt; PP["Pipeline Parallelism&lt;br/&gt;Different layers on different devices"]

    style DT fill:#fff3e0
    style DP fill:#e3f2fd
    style MP fill:#fce4ec
</pre>
</div>
<p></p><figcaption> Taxonomy of distributed training approaches</figcaption> </figure><p></p>
</div>
</div>
</div>
<section id="data-parallelism" class="level4">
<h4 class="anchored" data-anchor-id="data-parallelism">Data Parallelism</h4>
<p>Every worker runs a <strong>full replica of the model</strong> on a different micro-batch. Since gradients are additive (they’re independent across examples), we just need to sum them across workers before performing the weight update.</p>
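<p>The additivity claim can be checked on a toy model. This sketch simulates workers sequentially (real systems run them on separate devices and sum with an all-reduce), using a 1-D least-squares model <code>y_hat = w * x</code> chosen purely for illustration:</p>

```python
def data_parallel_step(w, micro_batches, lr=0.1):
    """One data-parallel SGD step: per-worker gradients on separate
    micro-batches, summed (the all-reduce) before a single weight update.
    Gradients are additive across examples, so the sum equals the
    full-batch gradient exactly."""
    worker_grads = [sum(2.0 * (w * x - y) * x for x, y in batch)
                    for batch in micro_batches]     # one replica per worker
    total = sum(worker_grads)                       # all-reduce (sum)
    n = sum(len(b) for b in micro_batches)
    return w - lr * total / n

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_two_workers = data_parallel_step(0.0, [data[:2], data[2:]])
w_one_worker = data_parallel_step(0.0, [data])      # identical result
```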
<p>Two coordination strategies:</p>
<p><strong>Parameter Server:</strong> A central server collects gradients from all workers, sums them, performs the update, and broadcasts the new weights. Workers can start sending gradients as soon as they’re computed (layer by layer), overlapping communication with computation.</p>
<ul>
<li><strong>Bottleneck:</strong> The parameter server becomes a communication bottleneck as the number of workers grows. All traffic flows through one node.</li>
</ul>
<p><strong>AllReduce:</strong> A peer-to-peer approach where all workers collectively sum their gradients and each receives the result. No central bottleneck — communication scales more gracefully. Algorithms like Ring-AllReduce distribute the bandwidth load evenly.</p>
<ul>
<li><strong>Bottleneck:</strong> Total communication volume still grows with model size. Network bandwidth between nodes becomes the limiting factor.</li>
</ul>
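<p>To make the AllReduce idea concrete, here is a minimal Python simulation of Ring-AllReduce. This is a sketch, not any library’s API: plain lists stand in for the network, and per-step sends are buffered to mimic the simultaneous exchanges of a real ring.</p>

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate Ring-AllReduce over a list of per-worker gradient vectors.

    Each vector is split into n chunks (n = number of workers).  In the
    reduce-scatter phase every worker ends up owning the complete sum of
    one chunk; the all-gather phase then circulates those finished chunks
    until every worker holds the full summed vector.
    """
    n = len(grads)
    chunks = [list(np.array_split(np.asarray(g, dtype=float).copy(), n))
              for g in grads]

    # Reduce-scatter: at each step, worker w sends chunk (w - step) mod n
    # to its right neighbor, which accumulates it.  Sends are buffered
    # first so a step behaves as one simultaneous exchange.
    for step in range(n - 1):
        sends = [(w, (w - step) % n, chunks[w][(w - step) % n].copy())
                 for w in range(n)]
        for w, c, data in sends:
            chunks[(w + 1) % n][c] += data

    # All-gather: worker w now owns the finished chunk (w + 1) mod n;
    # n - 1 more steps pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n, chunks[w][(w + 1 - step) % n].copy())
                 for w in range(n)]
        for w, c, data in sends:
            chunks[(w + 1) % n][c] = data

    return [np.concatenate(c) for c in chunks]

# Three workers, each holding the gradient of the same 6-parameter model:
grads = [np.arange(6.0) + w for w in range(3)]
reduced = ring_allreduce(grads)
print(reduced[0])   # every worker ends with the elementwise sum
```

<p>Note that each worker sends and receives only <code>1/n</code> of the gradient per step, which is why the bandwidth load stays evenly distributed as workers are added.</p>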
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
When Communication Dominates
</div>
</div>
<div class="callout-body-container callout-body">
<p>Communication overhead dominates training time when:</p>
<ul>
<li><strong>Model is large</strong> relative to batch computation time (small compute-to-communication ratio)</li>
<li><strong>Network bandwidth is low</strong> (especially across nodes vs.&nbsp;within a node with NVLink)</li>
<li><strong>Gradient compression</strong> isn’t used</li>
</ul>
<p>Rule of thumb: if your per-step compute time is less than 3x the gradient synchronization time, communication is your bottleneck. Scale batch size or use gradient compression/accumulation to amortize the cost.</p>
</div>
</div>
</section>
<section id="model-parallelism-pipeline-parallelism" class="level4">
<h4 class="anchored" data-anchor-id="model-parallelism-pipeline-parallelism">Model Parallelism (Pipeline Parallelism)</h4>
<p>When the model itself doesn’t fit on one device, we split the computation graph across devices. Each device handles a different set of layers, and they <strong>pipeline</strong> the computation: while device 2 processes micro-batch 1, device 1 can start on micro-batch 2.</p>
<p>Communication happens at layer boundaries via <code>send</code>/<code>recv</code> operations. The challenge is minimizing <strong>pipeline bubbles</strong> — idle time when a device is waiting for input from the previous stage.</p>
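<p>The cost of those bubbles is easy to quantify for an idealized GPipe-style schedule (assumed here: uniform stage times, one pass per micro-batch per stage):</p>

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of an idealized GPipe-style pipeline: each device
    occupies micro_batches + stages - 1 time slots, of which
    stages - 1 are bubble (idle) slots."""
    return (stages - 1) / (micro_batches + stages - 1)

# More micro-batches amortize the same bubble:
print(bubble_fraction(4, 4))    # ~0.43: almost half the device time is idle
print(bubble_fraction(4, 32))   # ~0.09: the standard fix is more micro-batches
```

<p>This is why pipeline-parallel setups push the number of micro-batches well above the number of stages: the bubble is fixed, so more micro-batches shrink its relative cost.</p>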
<p><strong>Key takeaway:</strong> Scaling from one GPU to many introduces a new bottleneck: communication. Data parallelism is simpler and scales well when the model fits on one device. Model/pipeline parallelism is necessary when it doesn’t, but introduces pipeline bubbles and more complex communication patterns.</p>
<hr>
</section>
</section>
</section>
<section id="neural-network-architectures-through-a-systems-lens" class="level2">
<h2 class="anchored" data-anchor-id="neural-network-architectures-through-a-systems-lens">Neural Network Architectures Through a Systems Lens</h2>
<p>The remaining sections cover architectures not as algorithmic curiosities, but as <em>systems design decisions</em> — what problem does each one solve, and what trade-off does it introduce?</p>
<section id="convolutional-neural-networks-cnns" class="level3">
<h3 class="anchored" data-anchor-id="convolutional-neural-networks-cnns">Convolutional Neural Networks (CNNs)</h3>
<p>CNNs exploit three structural priors about spatial data:</p>
<table class="table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Property</th>
<th>What It Means</th>
<th>Systems Benefit</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Parameter sharing</strong></td>
<td>Same filter everywhere in the image</td>
<td>Massive reduction in parameters</td>
</tr>
<tr class="even">
<td><strong>Sparse connectivity</strong></td>
<td>Each output depends only on a local receptive field</td>
<td>Few computations per output pixel</td>
</tr>
<tr class="odd">
<td><strong>Translation equivariance</strong></td>
<td>Shifting input shifts output the same way</td>
<td>No need to learn position-specific detectors</td>
</tr>
</tbody>
</table>
<p><strong>Dilation</strong> increases the receptive field without increasing parameters — each filter element is spread out by a dilation factor, giving access to a larger spatial area. This is particularly useful for temporal problems where context matters.</p>
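<p>The receptive-field arithmetic is worth sketching (the helper below is illustrative): each layer adds <code>(kernel_size - 1) * dilation</code> input positions, so doubling the dilation per layer grows the receptive field exponentially with depth at constant parameter count.</p>

```python
def receptive_field(kernel_size: int, dilations) -> int:
    """Receptive field of stacked 1-D convolutions, one layer per entry
    in `dilations`: each layer adds (kernel_size - 1) * dilation inputs."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Four kernel-3 layers with doubling dilation (WaveNet-style) vs. undilated:
print(receptive_field(3, [1, 2, 4, 8]))   # 31 input positions seen
print(receptive_field(3, [1, 1, 1, 1]))   # only 9, with the same parameters
```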
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Convolution as Matrix Multiplication
</div>
</div>
<div class="callout-body-container callout-body">
<p>We can express convolution as a matrix multiplication where the weight matrix has a specific sparsity pattern (filled with actual weights and zeros reflecting the filter structure). We don’t actually construct this matrix — it would be enormous — but this view explains why the backward pass of a convolution is a convolution with a flipped filter: multiplying by the transpose of the convolution matrix is equivalent to convolving with the spatially flipped kernel.</p>
</div>
</div>
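<p>The equivalence is easy to verify numerically on a tiny 1-D example. A sketch using NumPy: deep-learning “convolution” is implemented as cross-correlation, hence <code>np.correlate</code> for the forward pass, while the transpose reproduces a full convolution with the flipped kernel.</p>

```python
import numpy as np

def conv_matrix(w, n):
    """Dense matrix implementing a valid 1-D cross-correlation
    (DL-style 'convolution') of an n-vector with kernel w."""
    k = len(w)
    m = n - k + 1
    A = np.zeros((m, n))
    for i in range(m):
        A[i, i:i + k] = w   # each row is the kernel, shifted by one
    return A

x = np.array([1., 2., 3., 4., 5.])
w = np.array([1., 0., -1.])
A = conv_matrix(w, len(x))

# Forward pass: the matrix form agrees with direct correlation.
assert np.allclose(A @ x, np.correlate(x, w, mode="valid"))

# Backward pass: multiplying an upstream gradient by A.T is a *full*
# convolution with w, i.e. correlation with the spatially flipped kernel.
g = np.array([1., -1., 2.])
assert np.allclose(A.T @ g, np.convolve(g, w, mode="full"))
```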
</section>
<section id="recurrent-neural-networks-rnns" class="level3">
<h3 class="anchored" data-anchor-id="recurrent-neural-networks-rnns">Recurrent Neural Networks (RNNs)</h3>
<p>RNNs address temporal dependencies by maintaining a <strong>hidden state</strong> that gets updated at each time step as a function of the current input and the previous hidden state. In theory, the last hidden state captures the entire input history.</p>
<p>In practice, the hidden state is a bottleneck. The entire past is <em>compacted</em> into a single vector, and information from early time steps (<img src="https://latex.codecogs.com/png.latex?x_1">) gets diluted compared to recent ones (<img src="https://latex.codecogs.com/png.latex?x_t">).</p>
<p><strong>Backpropagation Through Time (BPTT):</strong> Because weights are shared across time steps, gradients must flow through the entire unrolled sequence — which means repeated multiplication by the same recurrent weight matrix. If its spectral radius (largest eigenvalue magnitude) is less than 1, gradients <strong>vanish</strong> exponentially with sequence length. Greater than 1, they <strong>explode</strong>.</p>
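<p>We can watch this happen numerically. The helper below is illustrative only; it uses a symmetric weight matrix so the spectral radius equals the operator norm, making the decay and growth rates exact.</p>

```python
import numpy as np

def grad_norm_through_time(spectral_radius, steps=50, dim=8, seed=0):
    """Norm of a gradient after `steps` backward multiplications by the
    same recurrent weight matrix, rescaled to the given spectral radius."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((dim, dim))
    W = (A + A.T) / 2                 # symmetric: spectral radius = 2-norm
    W *= spectral_radius / np.abs(np.linalg.eigvalsh(W)).max()
    g = np.ones(dim)                  # stand-in for dL/dh at the final step
    for _ in range(steps):
        g = W.T @ g                   # one BPTT step through shared weights
    return float(np.linalg.norm(g))

print(grad_norm_through_time(0.9))    # shrinks on the order of 0.9**50
print(grad_norm_through_time(1.1))    # grows on the order of 1.1**50
```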
</section>
<section id="lstm-gating-the-information-flow" class="level3">
<h3 class="anchored" data-anchor-id="lstm-gating-the-information-flow">LSTM: Gating the Information Flow</h3>
<p>LSTMs address vanishing gradients by separating the hidden state into two components:</p>
<ul>
<li><strong>Cell state</strong>: A “highway” for long-range information flow</li>
<li><strong>Hidden state</strong>: The working memory exposed to the next layer</li>
</ul>
<p>Four gates (learned transformations) control information flow at each step:</p>
<ol type="1">
<li><strong>Forget gate</strong>: What information from the cell state to discard</li>
<li><strong>Input gate</strong>: What new information to add to the cell state</li>
<li><strong>Cell update</strong>: The candidate new information</li>
<li><strong>Output gate</strong>: What to expose as the hidden state</li>
</ol>
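<p>The four gates above fit in a few lines of NumPy. This is a minimal sketch with random weights; the gate names <code>f</code>/<code>i</code>/<code>g</code>/<code>o</code> are labels chosen here, not a framework convention.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, params):
    """One LSTM time step.  `params` maps gate name -> (W, U, b):
    f = forget, i = input, g = candidate cell update, o = output."""
    a = {name: W @ x + U @ h + b for name, (W, U, b) in params.items()}
    f, i, o = sigmoid(a["f"]), sigmoid(a["i"]), sigmoid(a["o"])
    g = np.tanh(a["g"])
    c_new = f * c + i * g             # cell state: gated "highway" update
    h_new = o * np.tanh(c_new)        # hidden state: gated working memory
    return h_new, c_new

rng = np.random.default_rng(0)
dim = 4
params = {name: (0.1 * rng.standard_normal((dim, dim)),
                 0.1 * rng.standard_normal((dim, dim)),
                 np.zeros(dim))
          for name in "figo"}

h = c = np.zeros(dim)
for t in range(3):                    # roll the cell over a short sequence
    h, c = lstm_step(rng.standard_normal(dim), h, c, params)
```

<p>Note the structure of <code>c_new</code>: when the forget gate saturates near 1 and the input gate near 0, the cell state passes through unchanged — the additive “highway” that lets gradients survive many steps.</p>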
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
LSTMs Don’t Fully Solve Long-Range Dependencies
</div>
</div>
<div class="callout-body-container callout-body">
<p>Despite the gating mechanism, both RNNs and LSTMs struggle with information far in the past. Recent tokens have a much more direct connection to the current hidden state. The cell state highway helps, but it’s not a complete solution for very long sequences. This is the fundamental motivation for attention mechanisms.</p>
</div>
</div>
</section>
<section id="transformers-global-receptive-field-via-attention" class="level3">
<h3 class="anchored" data-anchor-id="transformers-global-receptive-field-via-attention">Transformers: Global Receptive Field via Attention</h3>
<p>Transformers replace recurrence with <strong>attention</strong>, which gives every position direct access to every other position — a global receptive field.</p>
<p>However, the attention mechanism is inherently <strong>permutation-equivariant</strong>: permuting the input tokens merely permutes the outputs in the same way. There’s no notion of “first” or “last.” This is why <strong>positional encodings</strong> are essential — they inject order information that attention alone cannot capture.</p>
<p>For <strong>autoregressive tasks</strong> (language modeling, text generation), a causal mask restricts each position to attend only to current and previous positions, preserving the left-to-right generation constraint.</p>
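<p>Both properties are easy to demonstrate. The sketch below uses identity Q/K/V projections to isolate the mixing behavior (real layers learn these projections); the function and names are illustrative.</p>

```python
import numpy as np

def attention(X, mask=None):
    """Single-head self-attention with identity Q/K/V projections."""
    scores = X @ X.T / np.sqrt(X.shape[1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # blocked positions ~ -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))    # 4 tokens, embedding dim 3
perm = np.array([2, 0, 3, 1])

# Permutation equivariance: shuffling the tokens just shuffles the outputs,
# so without positional encodings the model cannot tell token order.
assert np.allclose(attention(X)[perm], attention(X[perm]))

# A causal mask (lower triangle) lets position t see only positions <= t;
# position 0 can attend only to itself, so its output is exactly token 0.
causal = np.tril(np.ones((4, 4), dtype=bool))
assert np.allclose(attention(X, mask=causal)[0], X[0])
```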
</section>
<section id="gans-adversarial-generation" class="level3">
<h3 class="anchored" data-anchor-id="gans-adversarial-generation">GANs: Adversarial Generation</h3>
<p>GANs learn to generate data by pitting two networks against each other:</p>
<ul>
<li><strong>Generator</strong>: Takes a random noise vector and tries to produce realistic images. Its objective is to <em>maximize</em> the discriminator’s error — make the discriminator believe the fake images are real.</li>
<li><strong>Discriminator</strong>: Receives both real and generated images and tries to classify them correctly. It <em>minimizes</em> its classification loss.</li>
</ul>
<p>The discriminator acts as a learned loss function that guides the generator toward producing increasingly realistic outputs. The “adversarial” part is the opposing objectives: the generator improves precisely by exploiting whatever distributional differences the discriminator can still detect, including ones imperceptible to humans.</p>
<p><strong>Conv2dTranspose (Deconvolution):</strong> The generator typically needs to upsample from a small latent vector to a full-resolution image. Transposed convolution reverses the spatial dimension change of convolution — taking a small spatial input and producing a larger spatial output.</p>
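<p>The shape arithmetic makes the “reversal” concrete: the transposed convolution’s output-size formula is the inverse of the convolution’s. A sketch (helper names are mine):</p>

```python
def conv_out(n, k, s=1, p=0):
    """Spatial output size of a convolution."""
    return (n + 2 * p - k) // s + 1

def conv_transpose_out(n, k, s=1, p=0):
    """Spatial output size of a transposed convolution — the inverse
    shape mapping of conv_out."""
    return (n - 1) * s - 2 * p + k

# A DCGAN-style generator upsamples 4 -> 8 -> 16 -> 32 with k=4, s=2, p=1:
sizes = [4]
for _ in range(3):
    sizes.append(conv_transpose_out(sizes[-1], k=4, s=2, p=1))
print(sizes)  # → [4, 8, 16, 32]

# And the corresponding convolution maps the shapes back down:
assert conv_out(32, k=4, s=2, p=1) == 16
```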
<p><strong>Key takeaway:</strong> Each architecture encodes different assumptions about data structure. CNNs assume spatial locality. RNNs assume temporal ordering. Transformers assume that global relationships matter and let attention learn what to focus on. GANs assume that the best loss function is a learned one.</p>
<hr>
</section>
</section>
<section id="model-deployment-considerations" class="level2">
<h2 class="anchored" data-anchor-id="model-deployment-considerations">Model Deployment Considerations</h2>
<p>Training a model is only half the battle. Deploying it introduces a different set of constraints:</p>
<ul>
<li><strong>Application environment restrictions</strong>: Model size limits, no Python runtime available (embedded/mobile)</li>
<li><strong>Hardware acceleration</strong>: Leveraging mobile GPUs, NPUs, or specialized CPU instructions (AVX, NEON)</li>
<li><strong>Integration</strong>: Fitting into existing application architectures and serving infrastructure</li>
</ul>
<p>These constraints often drive post-training optimizations like quantization, pruning, distillation, and conversion to inference-specific formats (ONNX, TensorRT, Core ML).</p>
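<p>As a flavor of what quantization does, here is a sketch of symmetric per-tensor int8 quantization — illustrative only, not any framework’s implementation:</p>

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: a single scale maps
    [-max|w|, max|w|] onto [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)

# 4x smaller storage (int8 vs float32); the reconstruction error is
# bounded by half a quantization step.
max_err = float(np.abs(dequantize(q, scale) - w).max())
assert max_err <= scale / 2 + 1e-6
```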
<hr>
</section>
<section id="tying-it-all-together" class="level2">
<h2 class="anchored" data-anchor-id="tying-it-all-together">Tying It All Together</h2>
<p>If you’ve made it this far, you’ve traced the full stack of a deep learning system:</p>
<ol type="1">
<li><strong>Framework design</strong> determines your development experience and optimization ceiling</li>
<li><strong>Autograd</strong> gives you gradients but demands memory for saved tensors</li>
<li><strong>Memory layout</strong> (strides, views, contiguity) determines whether operations are free or expensive</li>
<li><strong>Hardware acceleration</strong> turns logical operations into physical memory accesses and arithmetic</li>
<li><strong>Initialization and normalization</strong> keep training stable from start to finish</li>
<li><strong>Regularization</strong> prevents overfitting at both implicit and explicit levels</li>
<li><strong>Scaling</strong> trades communication overhead for the ability to train larger models</li>
<li><strong>Architecture choices</strong> encode structural assumptions about your data</li>
</ol>
<p>These layers interact. Autograd’s saved tensors create memory pressure, which motivates checkpointing, which trades memory for recomputation. Initialization determines activation norms, which normalization layers can stabilize, which affects gradient flow, which determines whether training converges. Strides determine memory access patterns, which determine kernel performance, which determines whether you’re compute-bound or memory-bound.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Systems Thinking Payoff
</div>
</div>
<div class="callout-body-container callout-body">
<p>The next time training is slow, memory is exploding, or loss isn’t decreasing — you’ll have a mental model of the full stack to reason about where the problem might be. That’s the real value of building a framework from scratch.</p>
</div>
</div>


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>ML Systems</category>
  <guid>https://imaddabbura.github.io/posts/mlsys/dl-systems.html</guid>
  <pubDate>Wed, 20 Dec 2023 06:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/mlsys/images/dl-system-image.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Breaking Text Apart (The Smart Way)</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/nlp/Tokenization-Strategies.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge evergreen">evergreen</span></div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>Tokenization sits at the foundation of every NLP system — and it’s where more bugs, performance failures, and cross-lingual headaches originate than most practitioners expect.</p>
<p>The core problem: neural networks can’t consume raw text. They need numbers. Tokenization is the bridge — converting a string into a sequence of integer IDs that the model can embed and process. But <em>how</em> you make that conversion has enormous downstream consequences: for vocabulary size, sequence length, out-of-vocabulary handling, and multilingual generalization.</p>
<p>There are three fundamental strategies, sitting on a spectrum from fine-grained to coarse:</p>
<ul>
<li><strong>Character tokenization</strong>: split at every character — maximum granularity, minimum vocabulary</li>
<li><strong>Word tokenization</strong>: split at word boundaries — minimum granularity, maximum vocabulary</li>
<li><strong>Subword tokenization</strong>: split rules learned from corpus statistics — the practical sweet spot used by every modern LLM</li>
</ul>
<p>We’ll work through each in turn with concrete code, then zoom in on the two subword algorithms that dominate modern NLP: <strong>WordPiece</strong> (BERT, DistilBERT) and <strong>BPE via SentencePiece</strong> (XLM-R, LLaMA, GPT-family models).</p>
</section>
<section id="tokenization-process" class="level2">
<h2 class="anchored" data-anchor-id="tokenization-process">Tokenization Process</h2>
<p><a href="images/tokenization-pipeline.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://imaddabbura.github.io/posts/nlp/images/tokenization-pipeline.png" class="img-fluid"></a></p>
<p>The tokenization pipeline has four stages, each with a distinct job:</p>
<ul>
<li><p><strong>Normalization</strong>: Clean the raw text before any splitting. Common operations include Unicode normalization (collapsing different byte representations of the same character), lowercasing, and accent stripping. Critically, what gets normalized here is permanent — the model never sees the original form.</p></li>
<li><p><strong>Pretokenization</strong>: Split the normalized text into coarse units, typically words or word-like chunks. For English and German, splitting on whitespace and punctuation works well. For languages like Japanese or Chinese — which have no whitespace — language-specific rules or character-level splits are used instead.</p></li>
<li><p><strong>Tokenizer model</strong>: Apply the learned subword splitting algorithm (WordPiece, BPE, Unigram, etc.) to each pretokenized chunk. This is the only <em>trained</em> stage — everything else is rule-based. The vocabulary and merge rules come from the pretraining corpus.</p></li>
<li><p><strong>Postprocessing</strong>: Wrap the token sequence with any model-specific special tokens. BERT prepends <code>[CLS]</code> and inserts <code>[SEP]</code> between sequences. XLM-R uses <code>&lt;s&gt;</code> and <code>&lt;/s&gt;</code>. These tokens have specific learned representations and must be consistent between pretraining and fine-tuning.</p></li>
</ul>
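<p>The four stages can be sketched as plain functions. Everything here is a toy stand-in — in particular the <code>model</code> stage uses greedy longest-match against a hand-written vocabulary rather than a trained one, and the <code>##</code> convention is borrowed from WordPiece:</p>

```python
import unicodedata

def normalize(text):
    """Stage 1: Unicode-normalize, lowercase, strip accents."""
    text = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def pretokenize(text):
    """Stage 2: coarse whitespace split (fine for English)."""
    return text.split()

def model(word, vocab):
    """Stage 3: toy subword model — greedy longest-match against a
    fixed vocabulary, '##' marking continuation pieces."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:                       # no piece matched: bail out to [UNK]
            return ["[UNK]"]
    return pieces

def postprocess(tokens):
    """Stage 4: add BERT-style special tokens."""
    return ["[CLS]", *tokens, "[SEP]"]

vocab = {"token", "##ization", "rocks"}
text = "Tokenization rocks"
tokens = [p for w in pretokenize(normalize(text)) for p in model(w, vocab)]
print(postprocess(tokens))
# → ['[CLS]', 'token', '##ization', 'rocks', '[SEP]']
```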
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Pipeline Is Framework-Agnostic
</div>
</div>
<div class="callout-body-container callout-body">
<p>This four-stage structure underpins Hugging Face <code>tokenizers</code>, SentencePiece, and most production tokenizer implementations. Most unexpected token outputs trace back to either normalization (e.g., surprise lowercasing or accent stripping) or postprocessing (missing or double-added special tokens).</p>
</div>
</div>
</section>
<section id="tokenization-strategies" class="level2">
<h2 class="anchored" data-anchor-id="tokenization-strategies">Tokenization Strategies</h2>
<p>There are three core tokenization schemes. Before diving in, here’s a preview of the trade-offs that motivate the progression from characters to subwords:</p>
<table class="table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>Strategy</th>
<th>Vocab size</th>
<th>Sequence length</th>
<th>OOV handling</th>
<th>Multilingual</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Character</td>
<td>Tiny (~100s)</td>
<td>Very long</td>
<td>✅ None</td>
<td>✅ Natural</td>
</tr>
<tr class="even">
<td>Word</td>
<td>Huge (millions)</td>
<td>Short</td>
<td>❌ UNK collapse</td>
<td>⚠️ Poor</td>
</tr>
<tr class="odd">
<td>Subword</td>
<td>Medium (10K–100K)</td>
<td>Medium</td>
<td>✅ Decompose</td>
<td>✅ Good</td>
</tr>
</tbody>
</table>
<p>The pattern is clear: characters and words are opposite extremes, each with a disqualifying flaw. Subword tokenization is the engineered middle ground — and why every modern LLM uses it.</p>
<section id="character-tokenization" class="level3">
<h3 class="anchored" data-anchor-id="character-tokenization">Character Tokenization</h3>
<p>Character tokenization is the simplest possible approach: split the input string into individual characters and treat each one as a token. No learned vocabulary, no language-specific rules — just <code>list(text)</code>. It’s the floor of the granularity spectrum.</p>
<div id="92ba1e07-4dff-4f3a-9b31-0daf88028379" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1">text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I love NLP!"</span></span>
<span id="cb1-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(text)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="3">
<pre><code>['I', ' ', 'l', 'o', 'v', 'e', ' ', 'N', 'L', 'P', '!']</code></pre>
</div>
</div>
<p>From here, it is easy to convert each character into an integer that can be fed to the model. This step is called <em>numericalization</em>. We can numericalize the above text by first building the vocabulary and then mapping each character to its corresponding index:</p>
<div id="88aa86ee-95da-4173-82e0-20dd5f5c3fc6" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">vocab <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {char: idx <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> idx, char <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sorted</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(text)))}</span>
<span id="cb3-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(vocab)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>{' ': 0, '!': 1, 'I': 2, 'L': 3, 'N': 4, 'P': 5, 'e': 6, 'l': 7, 'o': 8, 'v': 9}</code></pre>
</div>
</div>
<p>Now we can simply map each token (character in this case) to its own corresponding index:</p>
<div id="cabb3953-a7e5-4bab-9ae6-5f3f3c319e55" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">[vocab[char] <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> char <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> text]</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="5">
<pre><code>[2, 0, 7, 8, 9, 6, 0, 4, 3, 5, 1]</code></pre>
</div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Why Character Tokenization Is Appealing
</div>
</div>
<div class="callout-body-container callout-body">
<ul>
<li><strong>No out-of-vocabulary problem</strong>: every possible input — misspellings, code, emojis, neologisms — is representable from the same small fixed alphabet</li>
<li><strong>Tiny vocabulary</strong>: ~100 characters for English. The embedding matrix and output projection stay small, which reduces parameter count and memory</li>
</ul>
</div>
</div>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Why Character Tokenization Fails in Practice
</div>
</div>
<div class="callout-body-container callout-body">
<ul>
<li><strong>Sequences become extremely long</strong>: “I love NLP!” becomes 11 tokens. A typical 512-word document becomes several thousand characters. For Transformers with quadratic attention cost, this is prohibitively expensive</li>
<li><strong>No free linguistic priors</strong>: the model has no prior knowledge that <code>l</code>, <code>o</code>, <code>v</code>, <code>e</code> together constitute a meaningful unit. Recovering word-level and phrase-level structure from raw characters requires far more data, compute, and model depth than most tasks justify</li>
<li><strong>Context window exhaustion</strong>: with fixed-length context windows, very long character sequences mean the model can attend to only a small slice of a document at a time, losing long-range dependencies that often carry the most important signal</li>
</ul>
</div>
</div>
</section>
<section id="word-tokenization" class="level3">
<h3 class="anchored" data-anchor-id="word-tokenization">Word Tokenization</h3>
<p>Word tokenization takes the opposite approach: split on whitespace (and often punctuation) and treat each word as an atomic token. Sequences stay short and tokens carry recognizable meaning — but the vocabulary problem quickly becomes unmanageable at scale.</p>
<div id="1b6e7b4b-99d2-4ce4-a3d7-3176cdf65250" class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">text.split()</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="6">
<pre><code>['I', 'love', 'NLP!']</code></pre>
</div>
</div>
<div id="c5daf75d-a025-42d4-a21a-35b3b96391b3" class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">vocab <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {char: idx <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> idx, char <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sorted</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(text.split())))}</span>
<span id="cb9-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(vocab)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>{'I': 0, 'NLP!': 1, 'love': 2}</code></pre>
</div>
</div>
<div id="7af9c548-5537-4a74-aee4-904ddfd45184" class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1">[vocab[word] <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> word <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> text.split()]</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="9">
<pre><code>[0, 2, 1]</code></pre>
</div>
</div>
<p>Most production word tokenizers go beyond whitespace splitting and include language-specific heuristics — for example, separating contractions like “doesn’t” into “does” and “n’t”, or splitting punctuation from adjacent words. These rules improve coverage but don’t solve the fundamental vocabulary size and OOV problems.</p>
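<p>A sketch of what such heuristics look like — two illustrative rules (contractions, then punctuation), nowhere near the coverage of a real rule-based tokenizer:</p>

```python
import re

def word_tokenize(text):
    """Toy rule-based word tokenizer: split off common English
    contractions, then separate punctuation from adjacent words."""
    text = re.sub(r"n't\b", " n't", text)                 # doesn't -> does n't
    text = re.sub(r"'(s|re|ve|ll|d|m)\b", r" '\1", text)  # NLP's  -> NLP 's
    return re.findall(r"n't|'\w+|\w+|[^\w\s]", text)

print(word_tokenize("She doesn't love NLP!"))
# → ['She', 'does', "n't", 'love', 'NLP', '!']
```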
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Why Word Tokenization Seems Appealing
</div>
</div>
<div class="callout-body-container callout-body">
<ul>
<li><strong>Short sequences</strong>: “I love NLP!” is 3 tokens. The model attends to far more context within the same fixed-length window</li>
<li><strong>Tokens carry meaning directly</strong>: each token maps to a recognizable linguistic unit, giving the model useful priors without learning from scratch</li>
</ul>
</div>
</div>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Why Word Tokenization Breaks Down
</div>
</div>
<div class="callout-body-container callout-body">
<ul>
<li><strong>Vocabulary explosion</strong>: a large corpus contains millions of distinct word forms — declensions, conjugations, misspellings, punctuation variants, domain-specific terms. An embedding table with 1M entries at dimension 512 requires ~500M parameters for the embedding layer alone. Truncating to the top-N words forces everything else to <code>[UNK]</code>, which silently destroys information — the model has no way to recover what word was there</li>
<li><strong>Under-trained embeddings</strong>: rare words appear too infrequently to accumulate meaningful gradient signal. They occupy slots in the vocabulary without learning useful representations — wasted capacity</li>
<li><strong>Language boundary failures</strong>: languages without clear word boundaries (Japanese, Chinese, Thai) have no natural whitespace to split on. Word tokenization either silently fails or requires expensive language-specific preprocessing at training and inference time</li>
</ul>
</div>
</div>
</section>
<section id="subword-tokenization" class="level3">
<h3 class="anchored" data-anchor-id="subword-tokenization">Subword Tokenization</h3>
<p>Subword tokenization is the engineered middle ground between the two extremes. The core insight: <strong>most words in any language are built from a small set of recurring morphemes</strong> — prefixes, roots, suffixes. “tokenization”, “tokenizer”, “tokenized” all share the root “token”. Word tokenization throws that structure away by treating each form as an unrelated atomic entry. Character tokenization preserves the raw signal but forces the model to discover linguistic structure from scratch, without any priors.</p>
<p>Subword algorithms exploit this structure directly. They learn a vocabulary of high-frequency subword units from a large pretraining corpus. Common words like “love” stay as single tokens. Rare or novel words get decomposed into familiar pieces: “tokenization” → <code>["token", "##ization"]</code> in WordPiece, or <code>["▁token", "ization"]</code> in SentencePiece. The model has seen “token” thousands of times and has a rich representation for it — that representation is now available even when encountering “detokenization” for the first time.</p>
<p>This also handles misspellings and out-of-domain terms gracefully. “GPT-4o” doesn’t need to be in the vocabulary — it gets decomposed into known subwords rather than collapsing to <code>[UNK]</code>.</p>
<p>Two algorithms dominate modern NLP: <strong>WordPiece</strong> (BERT, DistilBERT) and <strong>BPE via SentencePiece</strong> (XLM-R, LLaMA, GPT-family models). Both learn subword vocabularies from corpus statistics, but they use different objectives and produce different tokenization behavior — differences that matter when debugging cross-lingual failures or unexpected token splits.</p>
<section id="wordpiece" class="level4">
<h4 class="anchored" data-anchor-id="wordpiece">WordPiece</h4>
<p><a href="https://arxiv.org/abs/1609.08144v2">WordPiece</a> is the subword algorithm behind BERT and DistilBERT. Like BPE, it starts with a character-level vocabulary and iteratively merges pairs — but the key difference is in <em>how</em> it chooses which pair to merge next.</p>
<p>BPE picks the most frequent pair. WordPiece picks the pair that <strong>maximizes the likelihood of the training corpus</strong> when merged. Concretely, for a candidate pair <img src="https://latex.codecogs.com/png.latex?(u,%20v)">, it evaluates:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bscore%7D(u,%20v)%20=%20%5Cfrac%7B%5Ctext%7Bcount%7D(uv)%7D%7B%5Ctext%7Bcount%7D(u)%20%5Ctimes%20%5Ctext%7Bcount%7D(v)%7D"></p>
<p>This is a pointwise mutual information criterion: it rewards pairs that appear together more than their individual frequencies would predict. Merging “##iz” with “##ation” scores high not just because the bigram is frequent, but because seeing “##iz” almost always predicts “##ation” — the merge buys maximum information.</p>
<p>The training process:</p>
<ol type="1">
<li>Initialize the vocabulary with all characters in the corpus, prepending <code>##</code> to all characters that don’t start a word</li>
<li>Score every adjacent pair using the PMI formula above</li>
<li>Merge the highest-scoring pair and add it to the vocabulary</li>
<li>Repeat until the vocabulary reaches the target size (BERT uses 30,000)</li>
</ol>
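<p>To make the selection criterion concrete, here is a toy sketch of the pair-scoring step (the counts are invented for illustration; real WordPiece training operates over a full corpus):</p>

```python
from collections import Counter

# Invented counts for two candidate pairs and their constituent units.
pair_counts = Counter({("##iz", "##ation"): 80, ("t", "##he"): 500})
unit_counts = Counter({"##iz": 90, "##ation": 120, "t": 4000, "##he": 900})

def score(pair):
    """PMI-style WordPiece score: count(uv) / (count(u) * count(v))."""
    u, v = pair
    return pair_counts[pair] / (unit_counts[u] * unit_counts[v])

best = max(pair_counts, key=score)
# ("##iz", "##ation") wins even though ("t", "##he") is far more frequent:
# 80 / (90 * 120) ≈ 0.0074  vs  500 / (4000 * 900) ≈ 0.00014
print(best)
```

Note how the normalization by individual unit counts penalizes pairs whose parts are simply common everywhere, which is exactly what distinguishes WordPiece's criterion from BPE's raw frequency.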
<p>The <code>##</code> prefix is the signature of WordPiece. It marks continuation subwords — pieces that are <em>not</em> at the start of a word boundary. So <code>["nl", "##p"]</code> means: “nl” starts a word, “##p” continues it. Reconstructing the original word means stripping <code>##</code> and concatenating.</p>
<div id="99b30f6a-2857-4850-9b6d-7ccfbd2ec75b" class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> transformers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> DistilBertTokenizer</span>
<span id="cb13-2"></span>
<span id="cb13-3">tokenizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DistilBertTokenizer.from_pretrained(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"distilbert-base-uncased"</span>)</span>
<span id="cb13-4">encoded_text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenizer(text)</span>
<span id="cb13-5">encoded_text</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="8">
<pre><code>{'input_ids': [101, 1045, 2293, 17953, 2361, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}</code></pre>
</div>
</div>
<div id="f0dbfc78-fdec-485e-84ac-795cb9ea3be3" class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1">tokenizer.convert_ids_to_tokens(encoded_text[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"input_ids"</span>])</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="9">
<pre><code>['[CLS]', 'i', 'love', 'nl', '##p', '!', '[SEP]']</code></pre>
</div>
</div>
<p>Reading the DistilBERT output token by token:</p>
<ul>
<li><code>[CLS]</code> — a special classification token prepended to every sequence. Its final hidden state is used as the aggregate sequence representation for classification tasks</li>
<li><code>i</code> — “I” was lowercased (DistilBERT uses the <em>uncased</em> checkpoint, <code>distilbert-base-uncased</code>)</li>
<li><code>love</code> — a common English word; gets its own token</li>
<li><code>nl</code> — the first subword of “NLP”. “NLP” is rare enough in BERT’s training corpus that it was never merged into a single token</li>
<li><code>##p</code> — continues from “nl”. The <code>##</code> prefix signals “this piece is not at a word boundary — attach it to the previous token”</li>
<li><code>!</code> — punctuation gets its own token</li>
<li><code>[SEP]</code> — marks the end of a sequence (or the boundary between two sequences in sentence-pair tasks)</li>
</ul>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Decoding the <code>##</code> Prefix
</div>
</div>
<div class="callout-body-container callout-body">
<p>When you see <code>##</code> in WordPiece output, it means: strip the <code>##</code> and concatenate directly to the previous token. <code>["nl", "##p"]</code> → <code>"nlp"</code>. <code>["un", "##believ", "##able"]</code> → <code>"unbelievable"</code>. The <code>##</code> is how WordPiece encodes which subwords are word-internal vs.&nbsp;word-initial — critical for reconstructing the original string.</p>
</div>
</div>
<div id="211d4098-2821-4fd3-9a0c-c404f4ac3ec9" class="cell" data-execution_count="12">
<div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1">tokenizer.convert_tokens_to_string(</span>
<span id="cb17-2">    tokenizer.convert_ids_to_tokens(encoded_text[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"input_ids"</span>])</span>
<span id="cb17-3">)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="12">
<pre><code>'[CLS] i love nlp ! [SEP]'</code></pre>
</div>
</div>
</section>
<section id="sentencepiece" class="level4">
<h4 class="anchored" data-anchor-id="sentencepiece">SentencePiece</h4>
<p><a href="https://arxiv.org/abs/1808.06226">SentencePiece</a> is a language-agnostic tokenization library that implements both BPE and unigram language model algorithms. Two properties make it the dominant choice for multilingual models.</p>
<p><strong>First: it treats the input as a raw Unicode character stream</strong> — no language-specific pretokenization required. It never assumes whitespace marks word boundaries, which means it works equally well on English, Chinese, Japanese, Arabic, and any language mixture. This is why XLM-R, mT5, and LLaMA all use SentencePiece.</p>
<p><strong>Second: it uses <code>▁</code> (U+2581, lower one-eighth block) to encode the start of a new word.</strong> Rather than marking continuation pieces like WordPiece does with <code>##</code>, SentencePiece marks word-<em>starts</em>. A <code>▁</code> at the beginning of a token means “there was a space before this character in the original text.” Absence of <code>▁</code> means “this token is a continuation.”</p>
<p>The BPE algorithm it implements:</p>
<ol type="1">
<li>Initialize the vocabulary with individual Unicode characters plus an end-of-word marker</li>
<li>Count all adjacent character pairs across the corpus</li>
<li>Merge the most frequent pair into a new subword unit</li>
<li>Repeat until the vocabulary reaches the target size</li>
</ol>
<p>Unlike WordPiece’s PMI-based selection, BPE uses raw frequency. It’s simpler but produces similar results in practice — both algorithms converge on vocabularies dominated by common morphemes.</p>
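<p>The merge loop above fits in a few lines. A toy sketch (illustrative only — real implementations such as SentencePiece are heavily optimized and work on raw character streams; the word frequencies here are invented):</p>

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: words is {word: frequency}; returns the learned merges."""
    # Represent each word as a tuple of symbols, with an end-of-word marker.
    corpus = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # raw frequency, unlike WordPiece's PMI
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            merged[tuple(out)] = freq
        corpus = merged
    return merges

print(bpe_train({"low": 5, "lower": 2, "lowest": 3}, 3))
```

On this tiny corpus the first merges build up the shared stem "low" — common morphemes get merged first because they dominate the pair counts.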
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
BPE vs.&nbsp;Unigram in SentencePiece
</div>
</div>
<div class="callout-body-container callout-body">
<p>SentencePiece supports two algorithms. BPE builds the vocabulary bottom-up by merging. Unigram starts with a large candidate vocabulary and prunes it by removing tokens that minimally reduce the likelihood of the training corpus — a top-down approach. Unigram is used by XLNet and some multilingual models; BPE is more common. Both are interchangeable in the SentencePiece API.</p>
</div>
</div>
<div id="12596fb8-0a5a-439c-aa79-5bc0ab181f9c" class="cell" data-execution_count="13">
<div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> transformers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> XLMRobertaTokenizer</span>
<span id="cb19-2"></span>
<span id="cb19-3">tokenizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> XLMRobertaTokenizer.from_pretrained(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"xlm-roberta-base"</span>)</span>
<span id="cb19-4">encoded_text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenizer(text)</span>
<span id="cb19-5">encoded_text</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="13">
<pre><code>{'input_ids': [0, 87, 5161, 541, 37352, 38, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}</code></pre>
</div>
</div>
<div id="d1d33cbc-139d-49fd-9b80-d1815ff7d60e" class="cell" data-execution_count="14">
<div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1">tokenizer.convert_ids_to_tokens(encoded_text[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"input_ids"</span>])</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="14">
<pre><code>['&lt;s&gt;', '▁I', '▁love', '▁N', 'LP', '!', '&lt;/s&gt;']</code></pre>
</div>
</div>
<p>Reading the XLM-R output token by token:</p>
<ul>
<li><code>&lt;s&gt;</code> — sequence start token (XLM-R’s equivalent of <code>[CLS]</code>)</li>
<li><code>▁I</code> — the <code>▁</code> prefix means “there was a space before this character.” Since “I” starts the sentence (treated as if preceded by whitespace), it gets <code>▁</code></li>
<li><code>▁love</code> — common word, single token; <code>▁</code> marks it as word-initial</li>
<li><code>▁N</code> — “NLP” is split; <code>▁N</code> is the word-initial piece</li>
<li><code>LP</code> — continues from <code>▁N</code>, no <code>▁</code> prefix (it’s a word-internal continuation)</li>
<li><code>!</code> — punctuation token</li>
<li><code>&lt;/s&gt;</code> — sequence end token (XLM-R’s equivalent of <code>[SEP]</code>)</li>
</ul>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
WordPiece <code>##</code> vs.&nbsp;SentencePiece <code>▁</code> — Two Sides of the Same Coin
</div>
</div>
<div class="callout-body-container callout-body">
<p>These two prefixes encode word boundary information in opposite ways:</p>
<table class="table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Tokenizer</th>
<th>Marker</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>WordPiece (BERT)</td>
<td><code>##token</code></td>
<td>This piece continues the previous word</td>
</tr>
<tr class="even">
<td>SentencePiece (XLM-R, LLaMA)</td>
<td><code>▁token</code></td>
<td>A space preceded this character — new word starts here</td>
</tr>
</tbody>
</table>
<p>Both fully encode the original whitespace and allow perfect string reconstruction. The difference is convention, not capability. But you need to know which convention a tokenizer uses when writing postprocessing code to detokenize outputs.</p>
</div>
</div>
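<p>As a sketch of what such postprocessing code looks like for both conventions (a hypothetical helper, not part of any library — real tokenizers also handle punctuation spacing, which this ignores):</p>

```python
def detokenize(tokens, marker):
    """Reassemble a string from subword tokens.

    marker="##": WordPiece — strip the prefix and glue to the previous token.
    marker="▁":  SentencePiece — the prefix means a space precedes the token.
    """
    if marker == "##":
        out = []
        for tok in tokens:
            if tok.startswith("##"):
                out.append(tok[2:])       # continuation: concatenate directly
            else:
                out.append(" " + tok)     # word-initial: insert a space
        return "".join(out).strip()
    elif marker == "▁":
        # Every ▁ becomes a space; everything else concatenates as-is.
        return "".join(tokens).replace("▁", " ").strip()
    raise ValueError(f"unknown marker: {marker}")

print(detokenize(["un", "##believ", "##able"], "##"))        # 'unbelievable'
print(detokenize(["▁I", "▁love", "▁N", "LP", "!"], "▁"))     # 'I love NLP!'
```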
<div id="75c7acd6-1fc1-4b12-9e38-a39ca5ceeca8" class="cell" data-execution_count="15">
<div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1">tokenizer.convert_tokens_to_string(</span>
<span id="cb23-2">    tokenizer.convert_ids_to_tokens(encoded_text[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"input_ids"</span>])</span>
<span id="cb23-3">)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="15">
<pre><code>'&lt;s&gt; I love NLP!&lt;/s&gt;'</code></pre>
</div>
</div>
</section>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>The three tokenization strategies form a clear hierarchy in practice:</p>
<ul>
<li><p><strong>Character tokenization</strong> is essentially unused in production NLP. Sequence lengths become prohibitively long for Transformer attention, and the model must learn linguistic structure entirely from scratch. It survives in niche applications: character-level language models, certain byte-level models (GPT-2 uses byte-level BPE as a starting point), and as a fallback for extremely small vocabularies.</p></li>
<li><p><strong>Word tokenization</strong> appears in legacy systems and simple bag-of-words pipelines, but fails at scale. Vocabulary explosion, <code>[UNK]</code> collapse, and multilingual brittleness make it unsuitable for anything pretrained on broad corpora.</p></li>
<li><p><strong>Subword tokenization</strong> is the universal standard for pretrained language models. WordPiece and SentencePiece BPE both solve the core trade-offs: bounded vocabulary, graceful OOV handling, multilingual coverage, and sequences short enough for Transformer attention.</p></li>
</ul>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Always Use the Tokenizer the Model Was Trained With
</div>
</div>
<div class="callout-body-container callout-body">
<p>When fine-tuning a pretrained model, you must use the <strong>exact same tokenizer</strong> — not just the same algorithm, but the same vocabulary file. The model’s embedding matrix maps token ID 1045 to a learned vector for the word “i” (in DistilBERT). Swap in a different tokenizer and ID 1045 now refers to something else entirely. The embeddings become noise, the model is unrecoverable, and fine-tuning won’t fix it. This applies to vocabulary size, normalization rules, and special token placements — all of it must match pretraining exactly.</p>
</div>
</div>
<p>Most practical work doesn’t require building tokenizers from scratch — Hugging Face <code>tokenizers</code> and SentencePiece handle it. What matters operationally is understanding the output: recognizing <code>##</code> vs <code>▁</code> markers, knowing which special tokens a model expects and in what order, and catching normalization surprises (casing, accent stripping) before they cause silent failures downstream.</p>
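<p>As a sketch of the kind of normalization surprise to watch for: uncased BERT-style preprocessing lowercases and strips accents, so distinct source words can silently collapse together. A pure-Python approximation using Unicode NFD decomposition (a simplification — real tokenizers also handle CJK characters, control characters, etc.):</p>

```python
import unicodedata

def bert_uncased_normalize(text):
    """Approximate the normalization of uncased BERT-style tokenizers:
    lowercase, then drop combining accent marks after NFD decomposition."""
    text = text.lower()
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" = nonspacing combining marks (accents, diaereses, ...)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(bert_uncased_normalize("Résumé"))  # 'resume' — now identical to the English word
```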


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>NLP</category>
  <guid>https://imaddabbura.github.io/posts/nlp/Tokenization-Strategies.html</guid>
  <pubDate>Sat, 14 Jan 2023 06:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/nlp/images/tokenization.png" medium="image" type="image/png" height="65" width="144"/>
</item>
<item>
  <title>C Program Startup</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/c/program-startup-notes.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge growing">growing</span></div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>In this post, I will try to write down the steps of C program execution on x86. I used to believe that all C programs start execution at <code>main</code> (at least that was my understanding from various books and courses) until my best friend, the <code>gdb</code> debugger, showed me the symbol for <code>_start</code>. That made me curious enough to get to the bottom of it. Below are the notes I took along the way.</p>
</section>
<section id="execution-steps" class="level2">
<h2 class="anchored" data-anchor-id="execution-steps">Execution Steps</h2>
<ol type="1">
<li>The linker injects <code>_start</code>, which is invoked in the process of loading.
<ul>
<li>It is written in assembly language</li>
<li>Always placed at the beginning of the <code>.text</code> section -&gt; Always guaranteed to run before anything else</li>
<li>It sets up some registers and arguments and then calls <code>__libc_start_main</code></li>
</ul></li>
<li><code>__libc_start_main</code> is a C function that:
<ul>
<li>function prototype:</li>
</ul>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode c code-with-copy"><code class="sourceCode c"><span id="cb1-1">__libc_start_main <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(*</span>main<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">char</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">char</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**),</span></span>
<span id="cb1-2">                   <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> argc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-3">                   <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">char</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>argv<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb1-4">                   <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span>  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(*</span>init<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">char</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">char</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**),</span></span>
<span id="cb1-5">                   <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(*</span>fini<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">),</span></span>
<span id="cb1-6">                   <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(*</span>rtld_fini<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">),</span></span>
<span id="cb1-7">                   <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>stack_end</span>
<span id="cb1-8">                  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span></span></code></pre></div>
<ul>
<li>Define the <code>environ</code> global variable using <code>ps_strings</code>: <code>environ = ps_strings-&gt;ps_envstr</code>
<ul>
<li>Below are some details about <code>ps_strings</code> structure:</li>
</ul></li>
</ul>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode c code-with-copy"><code class="sourceCode c"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/*</span></span>
<span id="cb2-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> * The following structure is found at the top of the user stack of each</span></span>
<span id="cb2-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> * user process. The ps program uses it to locate argv and environment</span></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> * strings. Programs that wish ps to display other information may modify</span></span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> * it; normally ps_argvstr points to argv[0], and ps_nargvstr is the same</span></span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> * as the program's argc. The fields ps_envstr and ps_nenvstr are the</span></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> * equivalent for the environment.</span></span>
<span id="cb2-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> */</span></span>
<span id="cb2-9"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">struct</span> ps_strings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-10">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">char</span>    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>ps_argvstr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span>       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/* first of 0 or more argument strings */</span></span>
<span id="cb2-11">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span>       ps_nargvstr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span>      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/* the number of argument strings */</span></span>
<span id="cb2-12">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">char</span>    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>ps_envstr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span>        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/* first of 0 or more environment strings */</span></span>
<span id="cb2-13">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span>       ps_nenvstr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span>       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/* the number of environment strings */</span></span>
<span id="cb2-14"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">};</span></span></code></pre></div>
<ul>
<li><code>envp</code> is typically computed as <code>char **envp = &amp;argv[argc + 1]</code> in <code>__libc_init_first</code></li>
<li>It also registers cleanup and exit handlers</li>
<li>It defines <code>init</code> &amp; <code>fini</code>, which implement the function prologue and epilogue, i.e.&nbsp;what happens when entering a function and when returning from it. They also align the stack to a multiple of 16 bytes, which is more efficient and cache friendly. They are written in assembly language</li>
<li>It sets %rbp to zero because <code>main</code> would be the outermost frame</li>
<li>Finally it calls:</li>
</ul>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode c code-with-copy"><code class="sourceCode c"><span id="cb3-1">    exit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>main<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>ps_strings<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span>ps_nargvstr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> ps_strings<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span>ps_argvstr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> environ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">));</span></span></code></pre></div>
<ul>
<li>After the NULL terminator of <code>envp</code>, there is the ELF auxiliary vector, which the loader uses to provide information to the process such as the user id, page size, etc.</li>
<li>Therefore, <code>__libc_start_main</code> in general does the following:
<ul>
<li>Set up argv and envp</li>
<li>Initialize the thread local storage by calling <code>__pthread_initialize_minimal</code> (which only calls <code>__libc_setup_tls</code>). <code>__libc_setup_tls</code> will initialize Thread Control Block and Dynamic Thread Vector.</li>
<li>Set up the thread stack guard</li>
<li>Register the destructor (i.e.&nbsp;the rtld_fini argument passed to <code>__libc_start_main</code>) of the dynamic linker (by calling <code>__cxa_atexit</code>) if there is any</li>
<li>Initialize Glibc itself by calling <code>__libc_init_first</code></li>
<li>Register <code>__libc_csu_fini</code> (i.e.&nbsp;the fini argument passed to <code>__libc_start_main</code>) using <code>__cxa_atexit</code></li>
<li>Call <code>__libc_csu_init</code> (i.e.&nbsp;the init argument passed to <code>__libc_start_main</code>). <code>__libc_csu_init</code> executes initializers in the following order:
<ul>
<li>Function pointers in .preinit_array section</li>
<li>Functions marked as <code>__attribute__ ((constructor))</code>, via <code>_init</code></li>
<li>Function pointers in <code>.init_array</code> section</li>
</ul></li>
<li>Set up data structures needed for thread unwinding/cancellation</li>
<li>Call main of user’s program.</li>
<li>Call <code>exit</code>
<ul>
<li>In reverse order, functions registered via <code>atexit</code> or <code>on_exit</code></li>
<li>Function pointers in <code>.fini_array</code> section, via <code>__libc_csu_fini</code></li>
<li>Functions marked as <code>__attribute__ ((destructor))</code>, via <code>__libc_csu_fini</code> (which calls <code>_fini</code> after Step 2)</li>
<li>stdio cleanup functions</li>
<li>The <code>.fini_array</code> section must also contain function pointers with the same prototype as a destructor, i.e.&nbsp;taking no arguments and returning void. If the program exits normally, this teardown is driven by the <code>exit</code> function (Glibc source file <code>stdlib/exit.c</code>)</li>
</ul></li>
</ul></li>
</ul></li>
</ol>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>So starting a program calls <code>execve</code>, which runs the loader, which at some point passes control to <code>_start</code>; <code>_start</code> calls <code>__libc_start_main</code>, which calls <code>__libc_csu_init</code>, which calls <code>_init</code>.</p>


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Software Engineering</category>
  <guid>https://imaddabbura.github.io/posts/c/program-startup-notes.html</guid>
  <pubDate>Fri, 21 Oct 2022 05:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/c/linux-prog-startup.png" medium="image" type="image/png" height="123" width="144"/>
</item>
<item>
  <title>The Transformer Architecture: A Deep Dive</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/nlp/Transformer-Architecture-Explained.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge evergreen">evergreen</span></div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>If you’ve called <code>from transformers import BertModel</code> or prompted GPT-4, you’ve used a Transformer. But <em>what actually happens</em> when it processes text? Why does attention use three separate projections — Q, K, and V? Why does the decoder need a causal mask?</p>
<p>The Transformer displaced a decade of sequence modelling research not because it was more complex, but because it was more general: the same architecture, with minimal changes, now handles text, images, protein structures, and audio. Understanding <em>why</em> it generalises is what separates someone who can use these models from someone who can reason about them.</p>
<p>This post builds one from scratch — understanding the motivation behind every design choice before any code. By the end, you will be able to:</p>
<ol type="1">
<li>Explain <em>why</em> each component exists, not just what it does</li>
<li>Trace a forward pass through the full encoder-decoder architecture, step by step</li>
<li>Understand the three architecture variants (encoder-only, decoder-only, encoder-decoder) and when to use each</li>
<li>Read modern Transformer papers and recognise the improvements they describe</li>
</ol>
<p>We start with the problem that motivated the Transformer (sequential bottlenecks in RNNs), build the attention mechanism from scratch, implement each component in PyTorch with annotated shapes, and assemble all three architecture variants — using the architecture diagram below as our map throughout.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/transformer-arch.png" class="lightbox" data-glightbox="description: .lightbox-desc-1" data-gallery="quarto-lightbox-gallery-1" title="Figure 1: The encoder-decoder Transformer (Vaswani et al., 2017) — the architecture we’ll build in this post. Left stack (Encoder): reads the full source sequence; every token attends to every other token with no masking. Right stack (Decoder): generates the target sequence one token at a time; each layer has three sublayers: ① masked self-attention (tokens attend only to past positions), ② cross-attention (Q from decoder, K and V from the encoder output — the arrow connecting the two stacks), and ③ a feed-forward network. “N×” means the layer block repeats N times (typically 6–12). Add &amp; Norm is a residual connection followed by LayerNorm. The Linear + Softmax at the top projects the decoder’s final representation to a probability distribution over the vocabulary. Every component labelled here has its own section below. (source)"><img src="https://imaddabbura.github.io/posts/nlp/images/transformer-arch.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="600" alt="Figure 1: The encoder-decoder Transformer (Vaswani et al., 2017) — the architecture we’ll build in this post. Left stack (Encoder): reads the full source sequence; every token attends to every other token with no masking. Right stack (Decoder): generates the target sequence one token at a time; each layer has three sublayers: ① masked self-attention (tokens attend only to past positions), ② cross-attention (Q from decoder, K and V from the encoder output — the arrow connecting the two stacks), and ③ a feed-forward network. “N×” means the layer block repeats N times (typically 6–12). Add &amp; Norm is a residual connection followed by LayerNorm. The Linear + Softmax at the top projects the decoder’s final representation to a probability distribution over the vocabulary. Every component labelled here has its own section below. 
(source)"></a></p>
</figure>
</div>
<figcaption><strong>Figure 1:</strong> The encoder-decoder Transformer (Vaswani et al., 2017) — the architecture we’ll build in this post. <strong>Left stack (Encoder):</strong> reads the full source sequence; every token attends to every other token with no masking. <strong>Right stack (Decoder):</strong> generates the target sequence one token at a time; each layer has three sublayers: ① masked self-attention (tokens attend only to past positions), ② <strong>cross-attention</strong> (Q from decoder, K and V from the encoder output — the arrow connecting the two stacks), and ③ a feed-forward network. <strong>“N×”</strong> means the layer block repeats N times (typically 6–12). <strong>Add &amp; Norm</strong> is a residual connection followed by LayerNorm. The <strong>Linear + Softmax</strong> at the top projects the decoder’s final representation to a probability distribution over the vocabulary. Every component labelled here has its own section below. (<a href="https://arxiv.org/abs/1706.03762">source</a>)</figcaption>
</figure>
</div>
</section>
<section id="the-problem-why-not-rnns" class="level2">
<h2 class="anchored" data-anchor-id="the-problem-why-not-rnns">1. The Problem: Why Not RNNs?</h2>
<p>To understand <em>why</em> the Transformer is designed the way it is, you first need to understand what it replaced — and what was fundamentally broken about it.</p>
<section id="the-rnn-mental-model" class="level3">
<h3 class="anchored" data-anchor-id="the-rnn-mental-model">1.1 The RNN Mental Model</h3>
<p>A Recurrent Neural Network processes a sequence one token at a time. After seeing each token <img src="https://latex.codecogs.com/png.latex?w_t">, it updates a fixed-size <strong>hidden state</strong> <img src="https://latex.codecogs.com/png.latex?h_t"> that is supposed to summarize everything the model has seen so far:</p>
<p><img src="https://latex.codecogs.com/png.latex?h_t%20=%20f(h_%7Bt-1%7D,%5C,%20w_t)"></p>
<p>The hidden state is then passed to the next step. Think of it as a single notepad that a reader carries through a book, rewriting one paragraph of notes after each page. By the time they reach page 500, that notepad contains almost nothing from page 1 — there simply wasn’t room to preserve it through 499 rewrites.</p>
<p>This is not a metaphor for an occasional edge case; it is the fundamental architectural constraint. The RNN must compress all prior context into a fixed-size vector, and that compression is lossy by design.</p>
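A minimal sketch of this update rule (toy dimensions, random weights, and a plain tanh cell rather than an LSTM) makes the constraint concrete: the "notepad" never grows, no matter how long the sequence is.

```python
import torch

torch.manual_seed(0)
hidden_dim, embed_dim, T = 16, 8, 500            # toy sizes
W_h = torch.randn(hidden_dim, hidden_dim) * 0.1  # recurrent weights
W_x = torch.randn(hidden_dim, embed_dim) * 0.1   # input weights

def rnn_step(h_prev, w_t):
    # h_t = f(h_{t-1}, w_t): the entire past is squeezed into h_prev
    return torch.tanh(h_prev @ W_h.T + w_t @ W_x.T)

h = torch.zeros(hidden_dim)  # the "notepad": fixed size from step 0
for t in range(T):           # one lossy rewrite per token
    h = rnn_step(h, torch.randn(embed_dim))

print(h.shape)  # torch.Size([16]): same capacity after 500 tokens as after 1
```

Whatever page 1 contributed to `h` has passed through 499 saturating rewrites by the end.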
</section>
<section id="the-long-range-dependency-problem" class="level3">
<h3 class="anchored" data-anchor-id="the-long-range-dependency-problem">1.2 The Long-Range Dependency Problem</h3>
<p>Language is full of dependencies that span many tokens. Consider:</p>
<blockquote class="blockquote">
<p><em>“The trophy didn’t fit in the suitcase because <strong>it</strong> was too large.”</em></p>
</blockquote>
<p>To resolve what “it” refers to, a model must connect a pronoun near the end of the sentence back to a noun near the beginning. In an RNN, that connection must survive through every intermediate hidden state update. Each update potentially overwrites or dilutes the earlier information. The longer the sequence, the worse this gets.</p>
</section>
<section id="the-vanishing-gradient-problem" class="level3">
<h3 class="anchored" data-anchor-id="the-vanishing-gradient-problem">1.3 The Vanishing Gradient Problem</h3>
<p>The training-time failure mirrors the inference-time failure. When we backpropagate through an RNN, the gradient of the loss with respect to early hidden states is a product of Jacobians — one per time step:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20h_0%7D%20=%20%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20h_T%7D%20%5Cprod_%7Bt=1%7D%5E%7BT%7D%20%5Cfrac%7B%5Cpartial%20h_t%7D%7B%5Cpartial%20h_%7Bt-1%7D%7D"></p>
<p>If the entries of those Jacobians are consistently less than 1 (common with bounded activations like tanh), the product shrinks exponentially with <img src="https://latex.codecogs.com/png.latex?T">. Gradients from the loss signal barely reach the early time steps, so the model cannot learn from long-range dependencies even if it wanted to.</p>
<p>LSTMs and GRUs mitigate this with gating mechanisms, but they don’t eliminate it — they just slow the decay. (For a full treatment of LSTMs and their gating solution, see the <a href="../../posts/nlp/LSTM-Annotated-Implementation.html">Inside LSTMs</a> post.)</p>
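A back-of-the-envelope illustration of the decay (treating each Jacobian as a single scalar contraction factor, which is a deliberate simplification):

```python
# Each backprop step multiplies the gradient by one Jacobian; with bounded
# activations like tanh, per-step contraction factors below 1 are typical.
factor = 0.9
for T in (10, 50, 200):
    grad_scale = factor ** T  # product of T per-step factors
    print(f"T={T}: gradient scaled by {grad_scale:.2e}")
```

At T = 200 the scale is below 1e-9: the loss signal effectively never reaches the early time steps.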
</section>
<section id="the-sequential-processing-bottleneck" class="level3">
<h3 class="anchored" data-anchor-id="the-sequential-processing-bottleneck">1.4 The Sequential Processing Bottleneck</h3>
<p>RNNs are inherently sequential: you cannot compute <img src="https://latex.codecogs.com/png.latex?h_t"> until you have <img src="https://latex.codecogs.com/png.latex?h_%7Bt-1%7D">. This makes it impossible to parallelize across the time dimension. For a sequence of length <img src="https://latex.codecogs.com/png.latex?T">, the forward pass requires <img src="https://latex.codecogs.com/png.latex?T"> sequential steps regardless of how many GPUs you have.</p>
<p>Modern GPUs are massively parallel processors — they shine on matrix multiplications that can be batched across thousands of operations simultaneously. RNNs waste almost all of that capacity.</p>
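A toy contrast of the two computation patterns (illustrative shapes only): the recurrent loop below has a strict step-to-step dependency, while the attention-style score matrix is a single batched matmul over all positions at once.

```python
import torch

torch.manual_seed(0)
T, d = 512, 64
xs = torch.randn(T, d)
W = torch.randn(d, d) * 0.05

# Recurrent: T dependent steps; step t cannot start before step t-1 ends.
h = torch.zeros(d)
for x in xs:
    h = torch.tanh(h @ W + x)

# Attention-style: one matmul scores all T x T token pairs simultaneously,
# exactly the kind of work GPUs are built for.
scores = xs @ xs.T
print(h.shape, scores.shape)  # torch.Size([64]) torch.Size([512, 512])
```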
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Three Failure Modes
</div>
</div>
<div class="callout-body-container callout-body">
<p>RNNs fail in three compounding ways: (1) the hidden state <strong>bottleneck</strong> loses information over long sequences; (2) <strong>vanishing gradients</strong> prevent learning long-range relationships from the training signal; (3) <strong>sequential computation</strong> prevents parallelization, making training slow regardless of hardware. The Transformer addresses all three — not with patches, but by replacing sequential recurrence with a fundamentally different mechanism.</p>
</div>
</div>
</section>
</section>
<section id="the-big-idea-attention-as-direct-communication" class="level2">
<h2 class="anchored" data-anchor-id="the-big-idea-attention-as-direct-communication">2. The Big Idea: Attention as Direct Communication</h2>
<p>The central insight of the Transformer is deceptively simple: throw out sequential processing entirely and let every token communicate directly with every other token, in a single parallel operation.</p>
<section id="from-sequential-relay-to-direct-access" class="level3">
<h3 class="anchored" data-anchor-id="from-sequential-relay-to-direct-access">2.1 From Sequential Relay to Direct Access</h3>
<p>With an RNN, every relationship between tokens must be mediated through the hidden state — information travels through a long chain before it reaches its destination. With attention, every token asks every other token directly: <em>“How relevant are you to me?”</em> The answer shapes what information each token receives.</p>
<p>This is a fundamentally different computational paradigm: instead of routing information through a bottleneck, we create a <strong>direct, differentiable communication channel</strong> between all pairs of tokens simultaneously. The attention matrix for a sequence of length <img src="https://latex.codecogs.com/png.latex?T"> is <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20T"> — every pair gets its own weight.</p>
</section>
<section id="the-library-analogy-query-key-value" class="level3">
<h3 class="anchored" data-anchor-id="the-library-analogy-query-key-value">2.2 The Library Analogy: Query, Key, Value</h3>
<p>The attention mechanism is most naturally understood as a <strong>soft database lookup</strong>.</p>
<p>Imagine walking into a library. You have a <strong>query</strong> in mind — say, you’re looking for books about long-range dependencies in sequences. Every book in the library has a <strong>key</strong> on its spine: a short descriptor of what’s inside. You compare your query against every key, computing a relevance score for each book. Then you retrieve the <strong>values</strong> — the actual content — weighted by those relevance scores. The most relevant books contribute the most to what you walk away knowing.</p>
<p>This is exactly what the Transformer’s attention mechanism does at every layer, for every token:</p>
<ul>
<li><strong>Query (<img src="https://latex.codecogs.com/png.latex?Q">)</strong>: what this token is looking for</li>
<li><strong>Key (<img src="https://latex.codecogs.com/png.latex?K">)</strong>: what this token offers to match against</li>
<li><strong>Value (<img src="https://latex.codecogs.com/png.latex?V">)</strong>: what this token actually communicates if attended to</li>
</ul>
<p>The attended output for each token is a weighted mixture of all value vectors, where the weights are determined by the similarity between that token’s query and all other tokens’ keys.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Key Insight
</div>
</div>
<div class="callout-body-container callout-body">
<p>Attention is not a neural network layer in the traditional sense — it is a <strong>soft, differentiable database lookup</strong>. It is differentiable because the retrieval weights are produced by a smooth function (softmax), so gradients flow through the lookup operation during backpropagation. The queries, keys, and values are all learned — the model learns <em>what to look for</em>, <em>what to advertise</em>, and <em>what to say</em>.</p>
</div>
</div>
<p>We’ll return to this library analogy throughout — it explains why Q, K, and V need to be separate projections, and what the attention weights actually represent numerically.</p>
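The whole soft lookup can be sketched in a few lines (toy shapes, with random Q/K/V standing in for the learned projections covered later):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d_k = 5, 16                       # toy sequence length and head dim
Q = torch.randn(T, d_k)              # what each token is looking for
K = torch.randn(T, d_k)              # what each token advertises
V = torch.randn(T, d_k)              # what each token communicates

scores = Q @ K.T / d_k ** 0.5        # every query against every key
weights = F.softmax(scores, dim=-1)  # soft retrieval weights; rows sum to 1
out = weights @ V                    # weighted mixture of value vectors

print(weights.shape)  # torch.Size([5, 5]): one weight per token pair
print(out.shape)      # torch.Size([5, 16])
```

Because softmax is smooth, gradients flow through `weights` during backprop, which is what makes the lookup learnable.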
</section>
<section id="why-this-architecture-generalizes-beyond-language" class="level3">
<h3 class="anchored" data-anchor-id="why-this-architecture-generalizes-beyond-language">2.3 Why This Architecture Generalizes Beyond Language</h3>
<p>Here is the deeper insight that explains why Vision Transformers, AlphaFold2, audio Transformers, and point cloud Transformers all use the same architecture as BERT and GPT — often with almost no modification.</p>
<p><strong>The Transformer has almost no structural inductive bias.</strong> CNNs assume that nearby pixels are related — they bake in locality and translation equivariance as a prior. RNNs assume sequential order — they process left-to-right by construction. The Transformer assumes <em>nothing</em> about the structure of its input beyond what the positional encoding tells it. Every pair of positions is treated symmetrically by the attention mechanism until the training data says otherwise.</p>
<p>This is simultaneously the weakness and the superpower:</p>
<ul>
<li><strong>Weakness</strong>: Without structural priors, the model needs more data to learn relationships that CNNs or RNNs would pick up for free. A CNN learns “adjacent pixels tend to be related” from very few examples; a Transformer must discover this from data.</li>
<li><strong>Superpower</strong>: Any domain with a set of elements you want to relate to each other can be modeled by a Transformer. Images? Treat patches as tokens, inject 2D positional encodings (ViT). Proteins? Treat amino acids as tokens, use pairwise distances as positional information (AlphaFold2). Audio? Treat spectrogram frames as tokens. Graphs? Treat nodes as tokens.</li>
</ul>
<p>The key insight: <strong>positional encoding is the only thing that changes across domains.</strong> The attention mechanism, FFN, LayerNorm, and residual connections are entirely domain-agnostic. Swap the positional encoding and the same architecture processes any structured data. This is why the Transformer became the universal architecture — not because it is uniquely suited to language, but because it is uniquely <em>generic</em>.</p>
</section>
</section>
<section id="tokenization-from-text-to-numbers" class="level2">
<h2 class="anchored" data-anchor-id="tokenization-from-text-to-numbers">3. Tokenization: From Text to Numbers</h2>
<p>Before anything else, raw text must be converted into numbers that the model can process. This conversion — <strong>tokenization</strong> — splits text into a vocabulary of subword units and maps each unit to an integer ID. The Transformer receives a <code>B × T</code> matrix of integers as input, where <code>B</code> is the batch size and <code>T</code> is the sequence length.</p>
<p>There are three families of tokenization strategies — character-level, word-level, and subword — each with distinct tradeoffs in vocabulary size, sequence length, and out-of-vocabulary handling. Modern language models universally use <strong>subword tokenization</strong> (BPE or WordPiece), which offers a vocabulary of tens of thousands of tokens while gracefully handling rare and novel words by decomposing them into known pieces.</p>
<p>This post focuses on the Transformer architecture that consumes tokenized sequences, not on tokenization itself. For a deep dive into how tokenization works:</p>
<ul>
<li><a href="../../posts/nlp/Tokenization-Strategies.html"><strong>Breaking Text Apart (The Smart Way)</strong></a> — covers all three strategies, the four-stage tokenization pipeline (normalization, pretokenization, subword model, postprocessing), WordPiece (BERT), and SentencePiece (LLaMA, XLM-R)</li>
<li><a href="../../posts/nlp/BPE-Tokenizer.html"><strong>Byte Pair Encoding from Scratch</strong></a> — builds a BPE tokenizer from scratch, explains the training vs.&nbsp;encoding asymmetry, vocabulary size tradeoffs, and GPT-2’s regex pre-tokenization refinement</li>
</ul>
</section>
<section id="embedding-layer" class="level2">
<h2 class="anchored" data-anchor-id="embedding-layer">4. Embedding Layer</h2>
<p>The embedding layer is the first thing the model does with the token IDs it receives. It has two jobs: turn integers into meaningful vectors, and inject positional information so the model knows where each token sits in the sequence.</p>
<section id="token-embeddings" class="level3">
<h3 class="anchored" data-anchor-id="token-embeddings">4.1 Token Embeddings</h3>
<p>An integer ID has no geometric structure. The number 42 is not “close to” 41 in any meaningful sense for language — the token at position 42 in the vocabulary might be completely unrelated to token 41. Neural networks need continuous-valued vectors they can do math on: compute dot products, measure distances, apply linear transformations.</p>
<p>A token embedding is a <strong>lookup table</strong>: a matrix of shape <code>vocab_sz × embed_dim</code> where each row is a learnable vector associated with one token. When the model sees token ID <img src="https://latex.codecogs.com/png.latex?i">, it looks up row <img src="https://latex.codecogs.com/png.latex?i"> and uses that vector downstream.</p>
<p>What makes embeddings powerful is that training forces semantically similar tokens into nearby regions of this vector space. After training on enough text, the embedding for “king” minus the embedding for “man” plus the embedding for “woman” lands close to “queen” — not because we encoded this relationship by hand, but because the training signal shaped the space that way.</p>
<blockquote class="blockquote">
<p><em>An embedding turns a name tag into a GPS coordinate — suddenly you can measure distance, find neighbors, and do arithmetic.</em></p>
</blockquote>
<p><strong>Shape:</strong> <code>B × T</code> (integer IDs) → <code>B × T × embed_dim</code> (float vectors)</p>
<p><strong>Weight tying.</strong> In most language models, the embedding matrix is <em>reused</em> as the output projection at the end of the network — the final linear layer that maps from <code>d_model</code> back to <code>vocab_sz</code> uses the same weights, transposed. This is called <strong>weight tying</strong>. The core intuition: if two tokens have similar embeddings (i.e., they are semantically close), they should also receive similar probabilities when the model generates the next-token distribution. Since the LM head scores each candidate token by taking the dot product of the model’s output vector with that token’s embedding row, tokens whose embedding vectors are close to the output vector will score similarly — producing nearby probabilities. Weight tying enforces this consistency directly: the same geometry that groups similar tokens together in the input space also determines their relative scores in the output distribution. As a bonus, it halves the parameter count for the vocabulary components (often 30–100K tokens × 768 dims = a significant share of total parameters) and keeps input and output representations aligned throughout training.</p>
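A minimal sketch of both ideas, using toy sizes rather than a real vocabulary: the lookup is an `nn.Embedding`, and tying is literally assigning the same weight matrix to the output head.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_sz, embed_dim = 100, 32                # toy sizes

tok_emb = nn.Embedding(vocab_sz, embed_dim)  # vocab_sz x embed_dim lookup table
lm_head = nn.Linear(embed_dim, vocab_sz, bias=False)
lm_head.weight = tok_emb.weight              # weight tying: one shared matrix

ids = torch.tensor([[3, 7, 7, 42]])          # B x T integer IDs
x = tok_emb(ids)                             # B x T x embed_dim float vectors
logits = lm_head(x)                          # B x T x vocab_sz token scores

print(x.shape, logits.shape)
print(lm_head.weight.data_ptr() == tok_emb.weight.data_ptr())  # True: shared storage
```

Each output logit is the dot product of the model's vector with one embedding row, so tokens with nearby embeddings necessarily receive nearby scores.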
</section>
<section id="positional-encodings" class="level3">
<h3 class="anchored" data-anchor-id="positional-encodings">4.2 Positional Encodings</h3>
<section id="why-position-carries-meaning" class="level4">
<h4 class="anchored" data-anchor-id="why-position-carries-meaning">Why position carries meaning</h4>
<p>Word order is one of the primary mechanisms through which human languages encode meaning. Consider how much information is carried purely by where a word sits in a sentence:</p>
<p><strong>Order determines who does what to whom.</strong> “The dog bit the man” and “The man bit the dog” contain identical tokens. The meaning is completely reversed. Without positional information, a model sees the same set of embeddings for both — it cannot distinguish them.</p>
<p><strong>Agreement and dependency span long distances.</strong> In <em>“The cats that live in the house <strong>are</strong> noisy”</em>, the verb “are” must agree with “cats” (plural), not “house” (singular). Correctly resolving this requires knowing that “cats” appears before the relative clause, and “house” is inside it — a structural relationship determined entirely by position.</p>
<p><strong>Negation scope is positional.</strong> <em>“I <strong>never</strong> said she stole the money”</em> and <em>“I said she <strong>never</strong> stole the money”</em> contain the same words. The position of “never” determines the scope of negation — whether the speaker denies making the claim or denies the theft itself.</p>
<p><strong>Modifier attachment is determined by proximity.</strong> <em>“I photographed the man with a telescope”</em> is ambiguous in isolation. In context, positional proximity to either “man” or “photographed” is the primary cue for whether the telescope was used for photographing or was held by the man.</p>
<p>In short: token embeddings capture <em>what</em> each word means in isolation; positional encodings capture <em>where</em> each word sits, which encodes its grammatical role, its relationships to surrounding words, and its structural function in the sentence.</p>
</section>
<section id="the-permutation-equivariance-problem" class="level4">
<h4 class="anchored" data-anchor-id="the-permutation-equivariance-problem">The permutation equivariance problem</h4>
<p>Here is the technical issue: attention is <strong>permutation equivariant</strong>. If you reorder the input tokens, the output tokens reorder identically — the attention mechanism has no internal sense of sequence order. From the model’s perspective, “the cat sat on the mat” and “the mat sat on the cat” produce the same set of output vectors (just shuffled). Position is invisible.</p>
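A toy check of this equivariance (using Q = K = V = x for brevity): shuffling the inputs just shuffles the outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d = 6, 8
x = torch.randn(T, d)  # token embeddings with NO positional information

def self_attn(x):
    # toy self-attention with Q = K = V = x
    w = F.softmax(x @ x.T / d ** 0.5, dim=-1)
    return w @ x

perm = torch.randperm(T)
attend_then_shuffle = self_attn(x)[perm]  # attend first, reorder after
shuffle_then_attend = self_attn(x[perm])  # reorder first, attend after
print(torch.allclose(attend_then_shuffle, shuffle_then_attend, atol=1e-6))  # True
```

The two orders of operations agree exactly (up to float error): attention by itself cannot tell which token came first.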
<p>To fix this, we add <strong>positional encodings</strong> to the token embeddings before feeding them into the Transformer. The result: two otherwise identical tokens at different positions get different combined representations, making order visible to every downstream layer.</p>
<p>There are three main strategies:</p>
</section>
<section id="strategy-1-sinusoidal-encodings-original-paper" class="level4">
<h4 class="anchored" data-anchor-id="strategy-1-sinusoidal-encodings-original-paper">Strategy 1: Sinusoidal Encodings (Original Paper)</h4>
<p>The original Transformer paper uses fixed, non-learned positional encodings based on sine and cosine functions at different frequencies:</p>
<p><img src="https://latex.codecogs.com/png.latex?PE_%7B(pos,%5C,%202i)%7D%20=%20%5Csin%5C!%5Cleft(%5Cfrac%7Bpos%7D%7B10000%5E%7B2i/d%7D%7D%5Cright)"> <img src="https://latex.codecogs.com/png.latex?PE_%7B(pos,%5C,%202i+1)%7D%20=%20%5Ccos%5C!%5Cleft(%5Cfrac%7Bpos%7D%7B10000%5E%7B2i/d%7D%7D%5Cright)"></p>
<p><strong>Why low dimensions oscillate fast and high dimensions oscillate slowly</strong> comes directly from the formula. The denominator <img src="https://latex.codecogs.com/png.latex?10000%5E%7B2i/d%7D"> is the key — it grows exponentially with the dimension index <img src="https://latex.codecogs.com/png.latex?i">. Dividing by a larger number slows the wave down:</p>
<table class="table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Dimension <img src="https://latex.codecogs.com/png.latex?i"></th>
<th>Denominator <img src="https://latex.codecogs.com/png.latex?10000%5E%7B2i/d%7D"></th>
<th>Wave period (positions for one full cycle)</th>
<th>What it encodes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?i%20=%200"></td>
<td><img src="https://latex.codecogs.com/png.latex?1"></td>
<td>~6 positions</td>
<td>Very fine — distinguishes adjacent tokens</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?i%20=%20d/8"></td>
<td><img src="https://latex.codecogs.com/png.latex?10000%5E%7B1/4%7D%20=%2010"></td>
<td>~63 positions</td>
<td>Phrase-level distance</td>
</tr>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?i%20=%20d/4"></td>
<td><img src="https://latex.codecogs.com/png.latex?10000%5E%7B1/2%7D%20=%20100"></td>
<td>~630 positions</td>
<td>Sentence-level distance</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?i%20=%20d/2"></td>
<td><img src="https://latex.codecogs.com/png.latex?10%7B,%7D000"></td>
<td>~62,800 positions</td>
<td>Barely changes — encodes very coarse, document-level position</td>
</tr>
</tbody>
</table>
<p>Think of it as an odometer: the rightmost digit (low dimension) flips every meter, the leftmost digit (high dimension) barely moves over a typical journey. Each digit alone is ambiguous — the rightmost digit of “7” could be position 7, 17, 27, or 107. But all digits together uniquely identify every position.</p>
<p>This multi-scale design is intentional. <strong>Low dimensions</strong> give the model a fine-grained signal that changes every few positions — useful for detecting whether two tokens are immediate neighbors. <strong>High dimensions</strong> give a coarse signal that changes only over long distances — useful for detecting whether two tokens are in the same half of the document. Together, the full vector is unique for every position from 0 to the maximum.</p>
<p>The advantage of sinusoidal encodings is that they can generalize to sequence lengths longer than those seen during training — the functions extend naturally to any position.</p>
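The formulas above can be computed directly. This sketch follows the paper's definition, with the denominator <code>10000^(2i/d)</code> indexed over even dimensions:

```python
import torch

def sinusoidal_pe(max_len: int, d: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    dim = torch.arange(0, d, 2, dtype=torch.float32)               # 0, 2, ..., d-2
    denom = 10000 ** (dim / d)               # grows exponentially with dimension
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(pos / denom)     # even dims: sine
    pe[:, 1::2] = torch.cos(pos / denom)     # odd dims: cosine
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)   # torch.Size([128, 64])
print(pe[:4, 0])  # low dimension: oscillates every position
print(pe[:4, -2]) # high dimension: nearly constant over 4 positions
```

Because the function is defined for any `pos`, the same code produces encodings for positions never seen in training.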
</section>
<section id="strategy-2-learned-absolute-encodings" class="level4">
<h4 class="anchored" data-anchor-id="strategy-2-learned-absolute-encodings">Strategy 2: Learned Absolute Encodings</h4>
<p>Instead of fixing the positional encoding by formula, we can make it a learned parameter — another <code>nn.Embedding</code> table of shape <code>max_seq_len × embed_dim</code>. Each position from 0 to <code>max_seq_len-1</code> gets its own learnable row, updated via backpropagation just like token embeddings.</p>
<p>This is what BERT and GPT use. The model learns whatever positional fingerprints work best for its task. The downside: positions beyond <code>max_seq_len</code> have no row in the table at all, so the model cannot meaningfully process sequences longer than those it was trained on.</p>
</section>
<section id="strategy-3-rotary-positional-encoding-rope" class="level4">
<h4 class="anchored" data-anchor-id="strategy-3-rotary-positional-encoding-rope">Strategy 3: Rotary Positional Encoding (RoPE)</h4>
<p>RoPE, introduced by Su et al.&nbsp;(2021) and used in LLaMA, Mistral, and GPT-NeoX, takes a fundamentally different approach: instead of <em>adding</em> a fixed vector to the embeddings, it <em>rotates</em> the query and key vectors by an angle proportional to their absolute position before computing the attention dot product.</p>
<p>The key property: when you rotate <img src="https://latex.codecogs.com/png.latex?Q"> at position <img src="https://latex.codecogs.com/png.latex?m"> and <img src="https://latex.codecogs.com/png.latex?K"> at position <img src="https://latex.codecogs.com/png.latex?n">, their dot product depends on the two content vectors and only the <em>relative distance</em> <img src="https://latex.codecogs.com/png.latex?m%20-%20n">; the absolute positions cancel out:</p>
<p><img src="https://latex.codecogs.com/png.latex?Q_m%20%5Ccdot%20K_n%20=%20g(x_m,%5C,%20x_n,%5C,%20m%20-%20n)"></p>
<p>This is highly desirable. Relative position — how far apart two tokens are — is often more informative than absolute position. Whether “cat” is token 5 or token 50 in the sentence matters less than how far it sits from the verb it modifies. Syntactic dependencies (subject → verb, adjective → noun) tend to hold over short distances regardless of where the sentence begins. RoPE bakes this directly into the attention computation at every layer, without requiring separate positional embedding vectors.</p>
<p>RoPE also generalizes better to longer sequences than the model was trained on, making it the dominant choice in modern open-source LLMs.</p>
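The relative-distance property is easy to verify in two dimensions (a single RoPE frequency pair, with a made-up base angle `theta` for illustration): two query/key pairs at different absolute positions but the same offset produce the same dot product.

```python
import torch

def rotate(v, pos, theta=0.3):
    # Rotate a 2-D vector by pos * theta: one RoPE frequency pair.
    # theta is an arbitrary illustrative base angle.
    angle = torch.tensor(pos * theta)
    c, s = torch.cos(angle), torch.sin(angle)
    R = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
    return R @ v

q = torch.tensor([1.0, 2.0])   # toy content vectors
k = torch.tensor([0.5, -1.0])

a = rotate(q, 7) @ rotate(k, 4)        # positions (7, 4): offset 3
b = rotate(q, 104) @ rotate(k, 101)    # positions (104, 101): offset 3
print(torch.allclose(a, b, atol=1e-5)) # True: only m - n matters
```

This works because R(mθ)ᵀR(nθ) = R((n−m)θ): composing the two rotations inside the dot product leaves only the offset.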
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Implementation Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>The code below uses learned absolute positional embeddings — the simplest approach and standard for BERT-style encoder models. The embedding layer adds the token embedding and positional embedding, normalizes with LayerNorm, and applies dropout.</p>
</div>
</div>
<div id="9a092f68-2275-4456-acfa-39f02a2ffe24" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> dataclasses <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> dataclass</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch.nn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nn</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch.nn.functional <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> F</span></code></pre></div>
</details>
</div>
<div id="5a6d2afd-e157-421f-bfe6-d4dc0f73af5f" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@dataclass</span></span>
<span id="cb2-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> TransformerConfig:</span>
<span id="cb2-3">    vocab_sz: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb2-4">    block_sz: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span></span>
<span id="cb2-5">    hidden_dropout_prob: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span></span>
<span id="cb2-6">    num_attention_heads: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span></span>
<span id="cb2-7">    num_hidden_layers: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span></span>
<span id="cb2-8">    embed_dim: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">768</span></span>
<span id="cb2-9">    num_classes: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb2-10">    layer_norm_eps: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-12</span></span>
<span id="cb2-11">    intermediate_sz: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># set to 4 * embed_dim in __post_init__</span></span>
<span id="cb2-12"></span>
<span id="cb2-13">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> __post_init__(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb2-14">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.intermediate_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb2-15">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.intermediate_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.embed_dim</span>
<span id="cb2-16"></span>
<span id="cb2-17">config <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> TransformerConfig()</span></code></pre></div>
</details>
</div>
<div id="51562c60-9883-4e62-b7c0-6dd0091ef203" class="cell">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> Embeddings(nn.Module):</span>
<span id="cb3-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config):</span>
<span id="cb3-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb3-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.token_embedding <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Embedding(config.vocab_sz, config.embed_dim)</span>
<span id="cb3-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.position_embedding <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Embedding(config.block_sz, config.embed_dim)</span>
<span id="cb3-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.LayerNorm(config.embed_dim, eps<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>config.layer_norm_eps)</span>
<span id="cb3-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dropout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Dropout(p<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>)</span>
<span id="cb3-8"></span>
<span id="cb3-9">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb3-10">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## x:              B x T  (integer token IDs)</span></span>
<span id="cb3-11">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## token_emb:      B x T x embed_dim</span></span>
<span id="cb3-12">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## position_emb:   T x embed_dim  (broadcast over batch)</span></span>
<span id="cb3-13">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## output:         B x T x embed_dim</span></span>
<span id="cb3-14">        seq_len <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb3-15">        positions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.arange(seq_len, device<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>x.device)</span>
<span id="cb3-16">        embeddings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.token_embedding(x) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.position_embedding(positions)</span>
<span id="cb3-17">        embeddings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm(embeddings)</span>
<span id="cb3-18">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dropout(embeddings)</span></code></pre></div>
</div>
<div id="aa000001-pe00-heat-map0-000000000001" class="cell" data-execution_count="1">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb4-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb4-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib</span>
<span id="cb4-4"></span>
<span id="cb4-5">matplotlib.rcParams[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"figure.dpi"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">150</span></span>
<span id="cb4-6"></span>
<span id="cb4-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> sinusoidal_pe(seq_len, d_model):</span>
<span id="cb4-8">    pe  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros((seq_len, d_model))</span>
<span id="cb4-9">    pos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.arange(seq_len)[:, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>]</span>
<span id="cb4-10">    i   <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.arange(d_model)[<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, :]</span>
<span id="cb4-11">    div <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> d_model)</span>
<span id="cb4-12">    pe[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>::<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.sin(pos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> div[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>::<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb4-13">    pe[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>::<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.cos(pos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> div[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>::<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb4-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> pe</span>
<span id="cb4-15"></span>
<span id="cb4-16">pe <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sinusoidal_pe(seq_len<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span>, d_model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>)</span>
<span id="cb4-17"></span>
<span id="cb4-18">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.5</span>))</span>
<span id="cb4-19">img <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax.imshow(pe.T, cmap<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"RdBu_r"</span>, aspect<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"auto"</span>, vmin<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, vmax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb4-20">ax.set_xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Position in sequence"</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)</span>
<span id="cb4-21">ax.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Embedding dimension"</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)</span>
<span id="cb4-22">ax.set_title(</span>
<span id="cb4-23">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sinusoidal Positional Encoding  —  each column is a unique position fingerprint"</span>,</span>
<span id="cb4-24">    fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span>, pad<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span></span>
<span id="cb4-25">)</span>
<span id="cb4-26">plt.colorbar(img, ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ax, fraction<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.015</span>, pad<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.02</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Encoding value"</span>)</span>
<span id="cb4-27">plt.tight_layout()</span>
<span id="cb4-28">plt.savefig(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"positional-encoding-heatmap.png"</span>, dpi<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">150</span>, bbox_inches<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tight"</span>)</span>
<span id="cb4-29">plt.show()</span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="Transformer-Architecture-Explained_files/figure-html/cell-5-output-1.png" class="lightbox" data-glightbox="description: .lightbox-desc-2" data-gallery="quarto-lightbox-gallery-2" title="Figure 2: Sinusoidal positional encoding. Each column is a unique fingerprint for one position. Low-frequency components (bottom rows) vary slowly — encoding coarse, sentence-level position. High-frequency components (top rows) vary quickly — encoding fine, word-level position. Together, every position from 0 to max_seq_len gets a unique vector."><img src="https://imaddabbura.github.io/posts/nlp/Transformer-Architecture-Explained_files/figure-html/cell-5-output-1.png" class="img-fluid figure-img" alt="Figure 2: Sinusoidal positional encoding. Each column is a unique fingerprint for one position. Low-frequency components (bottom rows) vary slowly — encoding coarse, sentence-level position. High-frequency components (top rows) vary quickly — encoding fine, word-level position. Together, every position from 0 to max_seq_len gets a unique vector."></a></p>
<figcaption><strong>Figure 2:</strong> Sinusoidal positional encoding. Each column is a unique fingerprint for one position. Low-frequency components (bottom rows) vary slowly — encoding coarse, sentence-level position. High-frequency components (top rows) vary quickly — encoding fine, word-level position. Together, every position from 0 to max_seq_len gets a unique vector.</figcaption>
</figure>
</div>
</div>
</div>
</section>
</section>
</section>
<section id="scaled-dot-product-attention" class="level2">
<h2 class="anchored" data-anchor-id="scaled-dot-product-attention">5. Scaled Dot-Product Attention</h2>
<p>Attention is the core computation that makes everything else in the Transformer work. Everything up to this point — embeddings, positional encodings — has been preprocessing. <em>This</em> is the operation that enables direct token-to-token communication.</p>
<p>This section builds it up step by step: from the Q/K/V projections, through the dot-product similarity, scaling, softmax, and masking. By the end, the library analogy from Section 2 will have a precise mathematical form.</p>
<section id="the-three-projections-query-key-value" class="level3">
<h3 class="anchored" data-anchor-id="the-three-projections-query-key-value">5.1 The Three Projections: Query, Key, Value</h3>
<p>Given an input sequence <img src="https://latex.codecogs.com/png.latex?x"> of shape <code>B × T × embed_dim</code>, we produce three separate linear projections:</p>
<p><img src="https://latex.codecogs.com/png.latex?Q%20=%20xW_Q,%20%5Cquad%20K%20=%20xW_K,%20%5Cquad%20V%20=%20xW_V"></p>
<p>Each weight matrix (<img src="https://latex.codecogs.com/png.latex?W_Q">, <img src="https://latex.codecogs.com/png.latex?W_K">, <img src="https://latex.codecogs.com/png.latex?W_V">) has shape <code>embed_dim × head_dim</code>. These are learned parameters — different projection matrices produce different “perspectives” on the same input.</p>
<p><strong>Why three separate projections instead of one?</strong> Because what a token <em>wants</em> (its query), what it <em>offers to match against</em> (its key), and what it <em>actually communicates</em> (its value) are three genuinely different things. Consider how a search engine works: your search query text (Q) is compared against the indexed keywords of a web page (K), but what you actually receive when you click is the full page content (V) — which may be organized completely differently from the index terms. Separating these three roles gives the model the flexibility to learn very different relationships for each.</p>
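<p>As a minimal sketch (hypothetical names, not the post’s exact module code), the three projections are just three independent linear layers applied to the same input:</p>

```python
import torch
import torch.nn as nn

embed_dim, head_dim = 768, 64

# Three independent learned projections of the same input tensor.
W_q = nn.Linear(embed_dim, head_dim, bias=False)
W_k = nn.Linear(embed_dim, head_dim, bias=False)
W_v = nn.Linear(embed_dim, head_dim, bias=False)

x = torch.randn(2, 10, embed_dim)   # B x T x embed_dim
q, k, v = W_q(x), W_k(x), W_v(x)    # each: B x T x head_dim
print(q.shape, k.shape, v.shape)
```

Because the three weight matrices are learned separately, the same token can end up with very different query, key, and value vectors.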
</section>
<section id="computing-attention-weights-a-worked-example" class="level3">
<h3 class="anchored" data-anchor-id="computing-attention-weights-a-worked-example">5.2 Computing Attention Weights: A Worked Example</h3>
<p>With Q, K, V in hand, the attention weights are computed as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bweights%7D%20=%20%5Ctext%7Bsoftmax%7D%5C!%5Cleft(%5Cfrac%7BQK%5ET%7D%7B%5Csqrt%7Bd_k%7D%7D%5Cright)"></p>
<p>Let’s trace through this with a concrete 3-token example. Suppose our sequence is [“the”, “cat”, “sat”], with <code>head_dim = 4</code>. After the Q and K projections, imagine we have:</p>
<p><img src="https://latex.codecogs.com/png.latex?Q%20=%20%5Cbegin%7Bbmatrix%7D%201%20&amp;%201%20&amp;%200%20&amp;%200%20%5C%5C%200%20&amp;%200%20&amp;%201%20&amp;%201%20%5C%5C%201%20&amp;%201%20&amp;%201%20&amp;%200%20%5Cend%7Bbmatrix%7D,%5Cquad%20K%20=%20%5Cbegin%7Bbmatrix%7D%201%20&amp;%200%20&amp;%200%20&amp;%201%20%5C%5C%200%20&amp;%201%20&amp;%201%20&amp;%200%20%5C%5C%201%20&amp;%201%20&amp;%200%20&amp;%200%20%5Cend%7Bbmatrix%7D"></p>
<p><strong>Step 1 — Dot products <img src="https://latex.codecogs.com/png.latex?QK%5ET"></strong> (shape <code>3 × 3</code>): Every token’s query is dotted with every token’s key. The <img src="https://latex.codecogs.com/png.latex?(i,j)"> entry measures how much token <img src="https://latex.codecogs.com/png.latex?i"> “wants” to attend to token <img src="https://latex.codecogs.com/png.latex?j">.</p>
<p><img src="https://latex.codecogs.com/png.latex?QK%5ET%20=%20%5Cbegin%7Bbmatrix%7D%201%20&amp;%201%20&amp;%202%20%5C%5C%201%20&amp;%201%20&amp;%200%20%5C%5C%201%20&amp;%202%20&amp;%202%20%5Cend%7Bbmatrix%7D"></p>
<p>Why dot products? Geometrically, the dot product of two vectors is large when they point in similar directions (small angle) and small when they are orthogonal. If a query and key are aligned — the token is “looking for” exactly what the other token “offers” — the dot product is high, and that token will receive a large attention weight.</p>
<p><strong>Step 2 — Scale by <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B%5Csqrt%7Bd_k%7D%7D%20=%20%5Cfrac%7B1%7D%7B2%7D">:</strong></p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7BQK%5ET%7D%7B%5Csqrt%7Bd_k%7D%7D%20=%20%5Cbegin%7Bbmatrix%7D%200.5%20&amp;%200.5%20&amp;%201.0%20%5C%5C%200.5%20&amp;%200.5%20&amp;%200.0%20%5C%5C%200.5%20&amp;%201.0%20&amp;%201.0%20%5Cend%7Bbmatrix%7D"></p>
<p><strong>Step 3 — Softmax row-wise</strong> (each row sums to 1):</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bweights%7D%20=%20%5Cbegin%7Bbmatrix%7D%200.27%20&amp;%200.27%20&amp;%200.45%20%5C%5C%200.38%20&amp;%200.38%20&amp;%200.23%20%5C%5C%200.23%20&amp;%200.38%20&amp;%200.38%20%5Cend%7Bbmatrix%7D"></p>
<p>Row 1 (token “the”): attends most strongly to “sat” (0.45). Row 2 (“cat”): attends more to “the” and itself (0.38 each) than to “sat” (0.23). Row 3 (“sat”): attends most to “cat” and itself (0.38 each).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/attention-weights-matrix.png" class="lightbox" data-glightbox="description: .lightbox-desc-3" data-gallery="quarto-lightbox-gallery-3" title="Figure 3: Attention weights for the [‘the’, ‘cat’, ‘sat’] example. Rows are query tokens; columns are keys. Each row is a probability distribution — how much each token attends to every other token."><img src="https://imaddabbura.github.io/posts/nlp/images/attention-weights-matrix.png" class="quarto-figure quarto-figure-center figure-img" width="400" height="400" alt="Figure 3: Attention weights for the [‘the’, ‘cat’, ‘sat’] example. Rows are query tokens; columns are keys. Each row is a probability distribution — how much each token attends to every other token."></a></p>
</figure>
</div>
<figcaption><strong>Figure 3:</strong> Attention weights for the [‘the’, ‘cat’, ‘sat’] example. Rows are query tokens; columns are keys. Each row is a probability distribution — how much each token attends to every other token.</figcaption>
</figure>
</div>
<p><strong>Step 4 — Multiply by V</strong> (shape <code>3 × head_dim</code>): The output for each token is a weighted combination of all value vectors, with weights from the softmax step. Token “the” will receive a mix of all three value vectors, weighted 27%/27%/46%. The output is a <strong>contextualized representation</strong> — the same token in a different sentence would produce different weights and therefore a different output vector.</p>
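<p>Independent of the specific numbers in the example, the four steps can be sketched in a few lines of numpy (a framework-agnostic sketch, not the post’s PyTorch implementation):</p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # Steps 1-2: dot products, scaled
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # Step 3: row-wise softmax
    return weights @ V, weights                      # Step 4: weighted mix of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(w.round(2))   # each row is a probability distribution over the 3 tokens
print(out.shape)    # (3, 4): one contextualized vector per token
```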
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Departure from Static Embeddings
</div>
</div>
<div class="callout-body-container callout-body">
<p>Notice what just happened: the token “the” — which starts with a fixed embedding vector identical in every sentence — now has a representation shaped by the presence of “cat” and “sat.” Run the same token through a different sentence (“the table broke”), and it emerges from attention with a different output vector.</p>
<p>This is the fundamental departure from static word embeddings like word2vec or GloVe: those give every token a single, context-free vector that never changes. A Transformer gives every token a <em>contextual</em> representation — numerically different depending on what surrounds it. “bank” in “river bank” and “bank” in “bank account” start with the same embedding but diverge after attention. This is why Transformer-based representations are so dramatically better at tasks requiring word sense disambiguation, coreference resolution, and syntactic parsing.</p>
</div>
</div>
</section>
<section id="why-scale-by-sqrtd_k" class="level3">
<h3 class="anchored" data-anchor-id="why-scale-by-sqrtd_k">5.3 Why Scale by <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7Bd_k%7D">?</h3>
<p>This is one of those design choices that looks arbitrary until you understand the numerical reason behind it.</p>
<p>Q and K are both initialized as approximately unit-variance random vectors. When you compute their dot product across <code>d_k</code> dimensions, the result has <strong>variance equal to <img src="https://latex.codecogs.com/png.latex?d_k"></strong> (sum of <code>d_k</code> independent unit-variance terms). For a typical <code>head_dim</code> of 64, the raw dot products have standard deviation 8. For <code>head_dim = 768</code>, the standard deviation is about 28.</p>
<p>Large-magnitude inputs to softmax cause a saturation problem. When one logit is much larger than the others, softmax approaches a one-hot distribution — almost all weight goes to one token, and gradients for every other position become negligibly small. The model can only learn from the one token it attends to, and ignores all the rest.</p>
<p>Dividing by <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7Bd_k%7D"> rescales the dot products back to approximately unit variance, regardless of <code>head_dim</code>. Softmax then produces a diffuse distribution — not too concentrated, not too uniform — and gradients flow to all positions during training.</p>
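<p>The variance claim is easy to verify numerically (a quick sketch):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n = 100_000

# Unit-variance random queries and keys.
q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))

raw = (q * k).sum(axis=1)        # raw dot products across d_k dimensions
scaled = raw / np.sqrt(d_k)      # after the 1/sqrt(d_k) rescaling

print(raw.var())     # close to d_k = 64
print(scaled.var())  # close to 1
```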
<blockquote class="blockquote">
<p><em>Without scaling, attention becomes a dictatorship: one token captures all the weight and the rest are ignored. Scaling preserves the democracy: every token can contribute to the output.</em></p>
</blockquote>
</section>
<section id="softmax-competition-not-independence" class="level3">
<h3 class="anchored" data-anchor-id="softmax-competition-not-independence">5.4 Softmax: Competition, Not Independence</h3>
<p>Why use softmax and not sigmoid (or any other normalization)?</p>
<p>Sigmoid applied to each attention logit independently would allow a token to “attend highly to everyone” at the same time, with no trade-off. But attention should be selective: attending more to one token means attending less to others.</p>
<p>Softmax is a <strong>competitive normalization</strong> — its outputs sum to 1, so the weights form a probability distribution over the context window. Increasing attention to one token necessarily decreases attention to all others. This forces the model to make decisions about what is relevant rather than attending indiscriminately to everything.</p>
<p><strong>The exponential creates sparsity, not just competition.</strong> Softmax uses <img src="https://latex.codecogs.com/png.latex?e%5Ex">, not a simple linear normalization like dividing by the sum. The exponential amplifies differences: if one logit is 2 points higher than another, it receives <img src="https://latex.codecogs.com/png.latex?e%5E2%20%5Capprox%207%5Ctimes"> more weight — not just 2× more. In practice this means attention patterns are often <em>peaky</em>: a small number of tokens receive the vast majority of the weight, and the rest are nearly zero. This emergent sparsity is what makes attention heads interpretable — a head that attends sharply to the syntactic subject has learned a crisp, readable pattern, not a diffuse smear. It also means that a single highly-relevant token can dominate the output almost entirely, which is the mechanism behind induction heads and other sharp attention circuits found in mechanistic interpretability research.</p>
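<p>Both points can be demonstrated in a few lines: softmax’s competitive normalization versus sigmoid’s independence, and the roughly 7× amplification of a 2-point logit gap (a small sketch):</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

logits = np.array([3.0, 1.0, 1.0])

w = softmax(logits)
print(w.round(3))      # peaky: the first token dominates
print(w[0] / w[1])     # e^2 ~ 7.39: a 2-point gap gives ~7x the weight

# Sigmoid, applied independently, lets every logit be "high" at once:
sig = 1 / (1 + np.exp(-logits))
print(sig.sum())       # not constrained to sum to 1 -- no competition
```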
</section>
<section id="causal-masking-decoder-only" class="level3">
<h3 class="anchored" data-anchor-id="causal-masking-decoder-only">5.5 Causal Masking (Decoder Only)</h3>
<p>In a language model, the task is to predict the next token from all previous tokens. If the model can see token <img src="https://latex.codecogs.com/png.latex?t+1"> while predicting token <img src="https://latex.codecogs.com/png.latex?t">, that is data leakage — the model would just copy the future token rather than learning to predict it.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/causal-mask-matrix.png" class="lightbox" data-glightbox="description: .lightbox-desc-4" data-gallery="quarto-lightbox-gallery-4" title="Figure 4: Causal mask for a 5-token sequence. Green cells (✓) are positions the query token is allowed to attend to; red cells (−∞) are masked out and become 0 after softmax. Each row is a query token; each column is a key token."><img src="https://imaddabbura.github.io/posts/nlp/images/causal-mask-matrix.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Figure 4: Causal mask for a 5-token sequence. Green cells (✓) are positions the query token is allowed to attend to; red cells (−∞) are masked out and become 0 after softmax. Each row is a query token; each column is a key token."></a></p>
</figure>
</div>
<figcaption><strong>Figure 4:</strong> Causal mask for a 5-token sequence. Green cells (✓) are positions the query token is allowed to attend to; red cells (−∞) are masked out and become 0 after softmax. Each row is a query token; each column is a key token.</figcaption>
</figure>
</div>
<p>After softmax, <img src="https://latex.codecogs.com/png.latex?-%5Cinfty"> becomes exactly 0. Token 1 can only attend to itself. Token 3 can attend to tokens 1, 2, and 3 but not 4. The mask enforces a strict information asymmetry: <strong>you can read anything in the past, but nothing in the future</strong>.</p>
<p>This is implemented by registering a lower-triangular buffer in the <code>AttentionHead</code> and calling <code>masked_fill</code> before softmax.</p>
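<p>That implementation pattern looks roughly like this (a sketch of the buffer-plus-<code>masked_fill</code> idiom, not the post’s exact <code>AttentionHead</code>):</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalScores(nn.Module):
    """Applies a causal mask to raw attention scores before softmax."""
    def __init__(self, block_sz):
        super().__init__()
        # Lower-triangular buffer: moves with the module (device/dtype)
        # but is not a trainable parameter.
        self.register_buffer("mask", torch.tril(torch.ones(block_sz, block_sz)))

    def forward(self, scores):                 # scores: B x T x T
        T = scores.shape[-1]
        # Future positions (upper triangle) get -inf, which softmax maps to 0.
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        return F.softmax(scores, dim=-1)

m = CausalScores(block_sz=8)
weights = m(torch.randn(1, 5, 5))
print(weights[0].round(decimals=2))   # strictly lower-triangular weights
```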
</section>
<section id="self-attention-vs.-cross-attention" class="level3">
<h3 class="anchored" data-anchor-id="self-attention-vs.-cross-attention">5.6 Self-Attention vs.&nbsp;Cross-Attention</h3>
<p><strong>Self-attention</strong>: Q, K, and V all come from the same input sequence <img src="https://latex.codecogs.com/png.latex?x">. Every token attends to every other token within the same sequence. This is what the encoder uses (bidirectional) and the decoder uses for its first sublayer (causal).</p>
<p><strong>Cross-attention</strong>: Q comes from one sequence (the decoder’s hidden state), while K and V come from a different sequence (the encoder’s output). The decoder “reads” the encoder’s representation of the source sequence. This is the mechanism that connects the two halves of an encoder-decoder model.</p>
<p>The generalization is worth stating explicitly: <strong>any two sequences can be related through cross-attention</strong>, simply by using one as the source of Q and the other as the source of K and V. This is the same operation that connects modalities in vision-language models (text queries attend to image patch keys/values), that lets perceiver architectures compress long inputs (a small set of learned query vectors attends to a large input), and that underlies virtually all multi-modal conditioning. Cross-attention is not a feature of encoder-decoder models — it is a universal conditioning primitive.</p>
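<p>A shape-level sketch makes the asymmetry concrete: only Q carries the decoder’s length, and the output has one row per query regardless of how long the encoder’s sequence is (numpy sketch; all names hypothetical):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
T_dec, T_enc = 3, 50               # decoder queries vs. encoder keys/values

Q = rng.normal(size=(T_dec, d))    # from the decoder's hidden state
K = rng.normal(size=(T_enc, d))    # from the encoder's output
V = rng.normal(size=(T_enc, d))    # also from the encoder's output

scores = Q @ K.T / np.sqrt(d)                        # T_dec x T_enc
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)          # each decoder token's
out = weights @ V                                    # read over the source
print(out.shape)   # (3, 8): one output vector per decoder query
```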
<p>The full attention equation:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BAttention%7D(Q,%20K,%20V)%20=%20%5Ctext%7Bsoftmax%7D%5C!%5Cleft(%5Cfrac%7BQK%5ET%7D%7B%5Csqrt%7Bd_k%7D%7D%5Cright)V"></p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/scaled-dot-product-attention.png" class="lightbox" data-glightbox="description: .lightbox-desc-5" data-gallery="quarto-lightbox-gallery-5" title="Figure 5: Scaled Dot-Product Attention (source)"><img src="https://imaddabbura.github.io/posts/nlp/images/scaled-dot-product-attention.png" class="quarto-figure quarto-figure-center figure-img" height="400" alt="Figure 5: Scaled Dot-Product Attention (source)"></a></p>
<figcaption><strong>Figure 5:</strong> Scaled Dot-Product Attention (<a href="https://arxiv.org/abs/1706.03762">source</a>)</figcaption>
</figure>
</div>
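<p>The equation above can be checked end-to-end in a few lines. The sketch below (NumPy, single unbatched head, illustrative shapes only) deliberately uses different query and key lengths to emphasize that the same operation serves both self- and cross-attention:</p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single (unbatched) head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # T_q x T_k similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # each row sums to 1
    return w @ V                                  # T_q x d_v

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 query positions (e.g. decoder states)
K = rng.normal(size=(5, 8))   # 5 key positions (e.g. encoder outputs)
V = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8): one output per query, whatever the key count
```

<p>Self-attention is the special case where <code>Q</code>, <code>K</code>, and <code>V</code> come from the same sequence; cross-attention simply draws <code>Q</code> from one sequence and <code>K</code>, <code>V</code> from another.</p>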
</section>
<section id="the-quadratic-cost-attentions-fundamental-bottleneck" class="level3">
<h3 class="anchored" data-anchor-id="the-quadratic-cost-attentions-fundamental-bottleneck">5.7 The Quadratic Cost: Attention’s Fundamental Bottleneck</h3>
<p>Computing attention requires forming the full <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20T"> weight matrix — every token’s query dotted against every token’s key. This is <img src="https://latex.codecogs.com/png.latex?O(T%5E2%20%5Ccdot%20d_k)"> time and <img src="https://latex.codecogs.com/png.latex?O(T%5E2)"> memory. For most sentences this is fine. For long documents, it becomes the dominant constraint:</p>
<table class="table">
<thead>
<tr class="header">
<th>Sequence length</th>
<th>Attention matrix</th>
<th>Memory (fp16, 1 head)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>512 tokens</td>
<td>512 × 512 = 262K</td>
<td>~0.5 MB</td>
</tr>
<tr class="even">
<td>4,096 tokens</td>
<td>4K × 4K = 16.8M</td>
<td>~32 MB</td>
</tr>
<tr class="odd">
<td>128K tokens</td>
<td>128K × 128K = 16.4B</td>
<td>~31 GB</td>
</tr>
</tbody>
</table>
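<p>The memory figures in the table are straightforward to reproduce. A quick sanity check (fp16, one head, counting only the attention matrix itself):</p>

```python
BYTES_FP16 = 2  # bytes per fp16 entry

def attn_matrix_mib(T):
    """Memory for one T x T attention matrix, in MiB."""
    return T * T * BYTES_FP16 / 2**20

for T in (512, 4_096, 128_000):
    print(f"{T:>7} tokens: {T * T:>14,} entries, {attn_matrix_mib(T):>10,.1f} MiB")
```

<p>These are per-head numbers; multiplying by the number of heads (and, during training, by the layers whose activations are kept for backpropagation) makes the long-context rows far worse in practice.</p>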
<p>This quadratic growth is why early BERT was capped at 512 tokens, why getting GPT-3 to handle long documents required tricks, and why an entire subfield of <strong>efficient attention</strong> exists — sliding-window attention (Longformer), linear attention, sparse attention (BigBird), and state-space models like Mamba are all attempts to approximate or restructure the <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20T"> computation to grow linearly with sequence length.</p>
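<p>To see why restructuring helps, it is enough to count query-key pairs. A rough comparison for sliding-window attention (illustrative window of 256, using an upper bound on scored pairs per query):</p>

```python
def full_attention_pairs(T):
    return T * T                      # every query scores every key

def sliding_window_pairs(T, w):
    return T * (2 * w + 1)            # upper bound: at most 2w+1 keys per query

for T in (512, 4_096, 128_000):
    full, windowed = full_attention_pairs(T), sliding_window_pairs(T, w=256)
    print(f"T={T:>7}: full {full:>14,}  window {windowed:>12,}  (~{full / windowed:,.0f}x reduction)")
```

<p>The windowed count grows linearly in <code>T</code>, which is exactly the point of these methods; what they trade away is direct interaction between distant positions, which designs like Longformer recover through a handful of global tokens and layer stacking.</p>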
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
FlashAttention Changes the Hardware Utilization, Not the Complexity
</div>
</div>
<div class="callout-body-container callout-body">
<p>FlashAttention (Dao et al., 2022) is often described as “making attention faster.” What it actually does: reorders the computation to tile through the attention matrix in blocks that fit in GPU SRAM (fast memory), avoiding slow round-trips to HBM (GPU global memory). The FLOPs are identical to standard attention; the memory bandwidth cost drops dramatically — 2–4× wall-clock speedup with numerically identical outputs. It also reduces peak memory from <img src="https://latex.codecogs.com/png.latex?O(T%5E2)"> to <img src="https://latex.codecogs.com/png.latex?O(T)"> by never materializing the full attention matrix. This is why FlashAttention is the standard in every modern training stack, but it does not fix the fundamental quadratic scaling problem for very long contexts.</p>
</div>
</div>
<div id="0540e571-f6de-49d2-9535-e91755a7a78f" class="cell">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> AttentionHead(nn.Module):</span>
<span id="cb5-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config, head_dim, is_decoder<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb5-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb5-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.q <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.embed_dim, head_dim, bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb5-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.embed_dim, head_dim, bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb5-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.embed_dim, head_dim, bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb5-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.is_decoder <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> is_decoder</span>
<span id="cb5-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.is_decoder:</span>
<span id="cb5-9">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.register_buffer(</span>
<span id="cb5-10">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mask"</span>, torch.tril(torch.ones(config.block_sz, config.block_sz))</span>
<span id="cb5-11">            )</span>
<span id="cb5-12"></span>
<span id="cb5-13">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, query, key, value):</span>
<span id="cb5-14">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## query: B x T_q x embed_dim  (source of queries)</span></span>
<span id="cb5-15">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## key:   B x T_k x embed_dim  (source of keys)</span></span>
<span id="cb5-16">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## value: B x T_k x embed_dim  (source of values)</span></span>
<span id="cb5-17">        q <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.q(query)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T_q x head_dim</span></span>
<span id="cb5-18">        k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.k(key)    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T_k x head_dim</span></span>
<span id="cb5-19">        v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.v(value)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T_k x head_dim</span></span>
<span id="cb5-20">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## w: B x T_q x T_k  — pairwise similarity between every query and every key</span></span>
<span id="cb5-21">        w <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> q <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> k.transpose(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (k.shape[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb5-22">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.is_decoder:</span>
<span id="cb5-23">            T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> w.shape[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb5-24">            w <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> w.masked_fill(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.mask[:T, :T] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"inf"</span>))</span>
<span id="cb5-25">        w <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> F.softmax(w, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb5-26">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## output: B x T_q x head_dim</span></span>
<span id="cb5-27">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> w <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> v</span></code></pre></div>
</div>
</section>
</section>
<section id="multi-head-attention" class="level2">
<h2 class="anchored" data-anchor-id="multi-head-attention">6. Multi-Head Attention</h2>
<section id="why-multiple-heads" class="level3">
<h3 class="anchored" data-anchor-id="why-multiple-heads">6.1 Why Multiple Heads?</h3>
<p>A single attention head learns one type of relationship between tokens. For example, it might learn to focus on the syntactic subject of a sentence whenever any token is processed — a subject-finding head. But language has many simultaneous relationship types that are all relevant at once:</p>
<ul>
<li><em>Syntactic</em>: subject-verb agreement, noun-adjective agreement</li>
<li><em>Semantic</em>: coreference (“it” → “the trophy”), negation scope</li>
<li><em>Structural</em>: attending to nearby tokens for local context</li>
<li><em>Task-specific</em>: attending to sentiment-bearing words for classification</li>
</ul>
<p>Multiple heads allow the model to learn all of these in parallel. Each head has its own independent weight matrices <img src="https://latex.codecogs.com/png.latex?W_Q%5Eh">, <img src="https://latex.codecogs.com/png.latex?W_K%5Eh">, <img src="https://latex.codecogs.com/png.latex?W_V%5Eh"> that project the same input <img src="https://latex.codecogs.com/png.latex?x"> into a different lower-dimensional subspace:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bhead%5C_dim%7D%20=%20%5Cfrac%7B%5Ctext%7Bembed%5C_dim%7D%7D%7B%5Ctext%7Bnum%5C_heads%7D%7D"></p>
<p>This subspace separation is the mechanism that makes specialization both possible and stable. Head 3’s attention weights are determined by <img src="https://latex.codecogs.com/png.latex?W_Q%5E3%20%5Ccdot%20W_K%5E3"> inner products, which have nothing to do with what <img src="https://latex.codecogs.com/png.latex?W_Q%5E7%20%5Ccdot%20W_K%5E7"> computes for head 7. Because each head projects the input through its own independently learned matrices into a separate low-dimensional subspace (independent, though not necessarily orthogonal), heads need not interfere with each other: a coreference head and a subject-finding head can coexist without one corrupting the other.</p>
<p>Empirical findings from BERTology (Clark et al., 2019) confirm that this specialization emerges after training: some heads consistently track syntactic dependencies across the entire network; others attend primarily to adjacent tokens, effectively implementing a local sliding window; some heads in BERT-style models attend heavily to the <code>[SEP]</code> token — a kind of “no-op” head that routes excess attention somewhere harmless when no strong relationship exists.</p>
<p>Importantly, this specialization is <strong>not designed in</strong>. It arises entirely from the training signal. The architecture only provides the capacity for parallel, independent subspace projections; training discovers what each subspace should track.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/multi-head-attention.png" class="lightbox" data-glightbox="description: .lightbox-desc-6" data-gallery="quarto-lightbox-gallery-6" title="Figure 6: Multi-Head Attention with several attention layers running in parallel (source)"><img src="https://imaddabbura.github.io/posts/nlp/images/multi-head-attention.png" class="quarto-figure quarto-figure-center figure-img" height="400" alt="Figure 6: Multi-Head Attention with several attention layers running in parallel (source)"></a></p>
<figcaption><strong>Figure 6:</strong> Multi-Head Attention with several attention layers running in parallel (<a href="https://arxiv.org/abs/1706.03762">source</a>)</figcaption>
</figure>
</div>
</section>
<section id="implementation-parallel-heads-final-projection" class="level3">
<h3 class="anchored" data-anchor-id="implementation-parallel-heads-final-projection">6.2 Implementation: Parallel Heads, Final Projection</h3>
<p>Each head produces an output of shape <code>B × T × head_dim</code>. All heads run entirely in parallel — there is <strong>no communication between heads</strong> during the forward pass. The outputs of all heads are concatenated along the last dimension: <code>num_heads × head_dim = embed_dim</code>. The concatenated tensor then passes through a final linear projection <img src="https://latex.codecogs.com/png.latex?W_O"> of shape <code>embed_dim × embed_dim</code>.</p>
<p><strong>Why the final projection?</strong> The heads operated in isolation — each found something different in its own subspace. The <img src="https://latex.codecogs.com/png.latex?W_O"> projection is the first opportunity for the model to mix information <em>across</em> heads: to combine what the coreference head found with what the subject-finding head found into a single coherent output vector. But <img src="https://latex.codecogs.com/png.latex?W_O"> does more than concatenate — it <em>filters and compresses</em>. The 12 concatenated head outputs may contain redundant information, conflicting signals, or noise from heads that found nothing relevant. <img src="https://latex.codecogs.com/png.latex?W_O"> is a learned projection that selects which cross-head combinations to amplify and which to suppress. Think of it as the editor who takes 12 reporters’ raw notes and synthesises them into a single coherent paragraph — not every detail makes it through.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Pedagogical vs.&nbsp;Efficient Implementation
</div>
</div>
<div class="callout-body-container callout-body">
<p>The implementation below uses a Python loop over heads for clarity. In practice, all heads are computed in a single batched matrix multiply by reshaping the input to <code>B × T × num_heads × head_dim</code> and transposing — this is the approach used in production (and in <code>torch.nn.MultiheadAttention</code>). The pedagogical loop is equivalent but slower.</p>
</div>
</div>
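<p>For concreteness, here is a minimal NumPy sketch of that batched approach: one reshape-and-transpose so all heads share a single batched matrix multiply. Shapes and names are illustrative, self-attention only, with no mask, biases, or dropout, and the per-head projections folded into single <code>embed_dim × embed_dim</code> matrices:</p>

```python
import numpy as np

def multi_head_attention_batched(x, Wq, Wk, Wv, Wo, num_heads):
    """All heads computed in one batched matmul (self-attention, no mask)."""
    B, T, E = x.shape
    D = E // num_heads                                 # head_dim
    def project(W):                                    # B x T x E -> B x H x T x D
        return (x @ W).reshape(B, T, num_heads, D).transpose(0, 2, 1, 3)
    q, k, v = project(Wq), project(Wk), project(Wv)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(D)  # B x H x T x T
    scores -= scores.max(axis=-1, keepdims=True)       # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    out = w @ v                                        # B x H x T x D
    out = out.transpose(0, 2, 1, 3).reshape(B, T, E)   # concatenate heads
    return out @ Wo                                    # mix across heads

rng = np.random.default_rng(0)
B, T, E, H = 2, 5, 16, 4
x = rng.normal(size=(B, T, E))
Wq, Wk, Wv, Wo = (rng.normal(size=(E, E)) for _ in range(4))
print(multi_head_attention_batched(x, Wq, Wk, Wv, Wo, H).shape)  # (2, 5, 16)
```

<p>The loop-based implementation below computes the same function head by head; production frameworks go further and fuse the Q, K, and V projections into a single matrix.</p>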
<p>However, there is still a problem: multi-head attention is a weighted <em>averaging</em> operation — it is linear in V. Stacking attention layers with nothing in between gives the network no way to transform an individual token’s representation non-linearly. The network needs a pointwise nonlinearity. That is the feed-forward network’s job.</p>
<div id="cf91afcc-af53-4bd9-84c6-bd3dd69bd49f" class="cell">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> MultiHeadAttention(nn.Module):</span>
<span id="cb6-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config, is_decoder<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb6-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb6-4">        head_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> config.embed_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> config.num_attention_heads</span>
<span id="cb6-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.heads <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.ModuleList(</span>
<span id="cb6-6">            [</span>
<span id="cb6-7">                AttentionHead(config, head_dim, is_decoder)</span>
<span id="cb6-8">                <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(config.num_attention_heads)</span>
<span id="cb6-9">            ]</span>
<span id="cb6-10">        )</span>
<span id="cb6-11">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Final projection mixes information across heads: embed_dim -&gt; embed_dim</span></span>
<span id="cb6-12">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.output_proj <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.embed_dim, config.embed_dim)</span>
<span id="cb6-13"></span>
<span id="cb6-14">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, query, key, value):</span>
<span id="cb6-15">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## query: B x T_q x embed_dim</span></span>
<span id="cb6-16">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## key:   B x T_k x embed_dim</span></span>
<span id="cb6-17">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## value: B x T_k x embed_dim</span></span>
<span id="cb6-18">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Each head produces B x T_q x head_dim; cat gives B x T_q x embed_dim</span></span>
<span id="cb6-19">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.cat([head(query, key, value) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> head <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.heads], dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb6-20">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.output_proj(x)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T_q x embed_dim</span></span></code></pre></div>
</div>
</section>
</section>
<section id="feed-forward-network" class="level2">
<h2 class="anchored" data-anchor-id="feed-forward-network">7. Feed-Forward Network</h2>
<section id="why-is-it-needed" class="level3">
<h3 class="anchored" data-anchor-id="why-is-it-needed">7.1 Why Is It Needed?</h3>
<p>Attention is a weighted averaging operation. It is <strong>linear in V</strong>: the output for each position is a linear combination of value vectors, where the combination weights come from the attention scores. The softmax makes those weights depend non-linearly on the input, but nothing ever transforms an individual token’s representation non-linearly. Stacking attention layers with no pointwise nonlinearity in between composes linear maps and data-dependent averages, which sharply limits what additional depth can express.</p>
<p>This is the same reason we use activation functions between layers in any neural network: without them, depth buys us nothing.</p>
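<p>The collapse argument for purely linear layers can be checked directly, along with how a single nonlinearity breaks it (a NumPy sketch, using the tanh approximation of GELU):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
x = rng.normal(size=(3, 8))

# Two stacked linear layers are exactly equivalent to one linear layer:
assert np.allclose((x @ W1) @ W2, x @ (W1 @ W2))

# A nonlinearity in between breaks the collapse (tanh approximation of GELU):
gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
assert not np.allclose(gelu(x @ W1) @ W2, x @ (W1 @ W2))
print("linear stack collapses; GELU stack does not")
```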
<p>The feed-forward network (FFN) adds the essential nonlinearity. It processes each token’s representation <strong>independently</strong> after the attention layer. There is no mixing of tokens in the FFN — that is attention’s job. The clean separation of concerns is intentional:</p>
<ul>
<li><strong>Attention</strong>: mixes information across positions (who talks to whom)</li>
<li><strong>FFN</strong>: transforms each position’s representation non-linearly (what to say)</li>
</ul>
<p><strong>The FFN as a knowledge store.</strong> Research by Geva et al.&nbsp;(2021) provides a compelling interpretation: FFN layers function as associative memories. The first linear layer acts as a set of keys that pattern-match against the input; the second linear layer acts as the corresponding values that are retrieved and output. Most of a Transformer’s factual knowledge — the associations between entities, relations, and attributes — is hypothesized to live in FFN weights, not in the attention matrices.</p>
<blockquote class="blockquote">
<p><em>Attention is the routing system. The FFN is the knowledge store.</em></p>
</blockquote>
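<p>A toy version of this key-value reading (NumPy, with a ReLU gate standing in for the model’s actual activation, and randomly generated “memories” purely for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, n_memories = 8, 32

# Rows of the first FFN matrix act as keys; rows of the second as values.
keys = rng.normal(size=(n_memories, embed_dim))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = rng.normal(size=(n_memories, embed_dim))

def ffn_as_memory(x):
    gate = np.maximum(x @ keys.T, 0.0)  # how strongly x matches each key
    return gate @ values                # weighted sum of retrieved values

x = 2.0 * keys[3]                       # an input aligned with key 3
gate = np.maximum(x @ keys.T, 0.0)
print(gate.argmax())                    # 3: memory 3 fires hardest
```

<p>An input that pattern-matches a key retrieves (mostly) that key’s value; in a trained model the keys respond to interpretable input patterns and the values push the residual stream toward associated output tokens.</p>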
</section>
<section id="architecture-details" class="level3">
<h3 class="anchored" data-anchor-id="architecture-details">7.2 Architecture Details</h3>
<p>The FFN has a characteristic structure: expand, activate, contract.</p>
<ol type="1">
<li><strong>Expand</strong>: Linear projection from <code>embed_dim</code> → <code>4 × embed_dim</code>. The 4x factor is empirical — found to work well across a range of model sizes. The expanded intermediate dimension is where most of the model’s representational capacity lives, and it is the dimension that is typically scaled up when making larger models.</li>
<li><strong>Activate</strong>: GELU (Gaussian Error Linear Unit) nonlinearity. Unlike ReLU, GELU applies a smooth, probabilistic gate proportional to the Gaussian CDF. Empirically, GELU consistently outperforms ReLU in Transformer training. Modern models (LLaMA, PaLM) use SwiGLU — a gated variant — which further improves performance.</li>
<li><strong>Contract</strong>: Linear projection from <code>4 × embed_dim</code> → <code>embed_dim</code>, restoring the original dimension for the residual connection.</li>
</ol>
<p><strong>Why position-wise?</strong> The FFN applies the same learned transformation to every position independently and in parallel. Within a layer, the weight matrices are shared across all positions: every token passes through the same expand-and-contract transformation, while each layer has its own set of weights. This is sometimes called a “position-wise” or “point-wise” feed-forward layer.</p>
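<p>The position-wise property can be demonstrated directly: running the FFN on a whole sequence and running it one position at a time give identical results (a NumPy sketch with illustrative dimensions, biases omitted):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
T, embed_dim, hidden = 6, 8, 32
W1 = rng.normal(size=(embed_dim, hidden))
W2 = rng.normal(size=(hidden, embed_dim))
gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def ffn(x):
    return gelu(x @ W1) @ W2  # expand, activate, contract

seq = rng.normal(size=(T, embed_dim))

# Whole sequence at once vs. one position at a time: identical outputs.
whole = ffn(seq)
per_position = np.stack([ffn(seq[t]) for t in range(T)])
assert np.allclose(whole, per_position)

# Permuting positions just permutes outputs: no cross-position interaction.
perm = rng.permutation(T)
assert np.allclose(ffn(seq[perm]), whole[perm])
print("position-wise: verified")
```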
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>Most of the parameters live here.</strong> Each attention layer has four weight matrices (Q, K, V, O), each of size <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D">, totalling <img src="https://latex.codecogs.com/png.latex?4d_%7B%5Ctext%7Bmodel%7D%7D%5E2"> parameters. The FFN has two matrices of size <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Ctimes%204d_%7B%5Ctext%7Bmodel%7D%7D">, totalling <img src="https://latex.codecogs.com/png.latex?8d_%7B%5Ctext%7Bmodel%7D%7D%5E2"> parameters — <strong>twice as many as attention</strong>. Across a full model, the FFN accounts for roughly two-thirds of all trainable parameters. When people talk about “scaling” a Transformer, they mostly mean growing <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D"> and <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bff%7D%7D">, which expands this majority share.</p>
</div>
</div>
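<p>The parameter arithmetic in the note is easy to verify (bias terms omitted; <code>d = 768</code> chosen as a familiar example size):</p>

```python
def attention_params(d_model):
    return 4 * d_model * d_model                # W_Q, W_K, W_V, W_O

def ffn_params(d_model, expansion=4):
    return 2 * d_model * (expansion * d_model)  # expand + contract

d = 768  # e.g. BERT-base hidden size
attn, ffn = attention_params(d), ffn_params(d)
print(ffn // attn)                   # 2: the FFN holds twice attention's parameters
print(round(ffn / (attn + ffn), 3))  # 0.667: about two-thirds of the block
```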
<div id="c7ceda3b-1880-495c-8a32-04005afe4260" class="cell">
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> FeedForwardNN(nn.Module):</span>
<span id="cb7-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config):</span>
<span id="cb7-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb7-4">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Expand to 4x hidden dim, then contract back — most capacity lives here</span></span>
<span id="cb7-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.l1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.embed_dim, config.intermediate_sz)</span>
<span id="cb7-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.l2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.intermediate_sz, config.embed_dim)</span>
<span id="cb7-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dropout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Dropout(config.hidden_dropout_prob)</span>
<span id="cb7-8"></span>
<span id="cb7-9">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb7-10">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## x:        B x T x embed_dim</span></span>
<span id="cb7-11">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## after l1: B x T x intermediate_sz  (expand)</span></span>
<span id="cb7-12">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## after l2: B x T x embed_dim        (contract)</span></span>
<span id="cb7-13">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dropout(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.l2(F.gelu(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.l1(x))))</span></code></pre></div>
</div>
</section>
</section>
<section id="layer-normalization" class="level2">
<h2 class="anchored" data-anchor-id="layer-normalization">8. Layer Normalization</h2>
<section id="why-normalize-at-all" class="level3">
<h3 class="anchored" data-anchor-id="why-normalize-at-all">8.1 Why Normalize at All?</h3>
<p>Deep networks have a training stability problem: as signals propagate through many layers, the distribution of activations tends to shift and grow — a phenomenon called <strong>internal covariate shift</strong>. Layers that receive wildly varying input distributions must constantly adjust their weights just to track the shifting scale, not to learn meaningful transformations. This wastes capacity and slows training.</p>
<blockquote class="blockquote">
<p><em>Think of it as keeping the working range of each layer consistent. Without normalization, earlier layers can produce outputs 100x larger than what later layers expect — the later layers waste capacity on a bookkeeping problem rather than learning anything about language.</em></p>
</blockquote>
<p>Normalization is the engineering fix: explicitly constrain activation distributions to zero mean and unit variance at key points in the network, keeping signals in a regime where gradients are well-behaved throughout training.</p>
</section>
<section id="batch-normalization-vs.-layer-normalization" class="level3">
<h3 class="anchored" data-anchor-id="batch-normalization-vs.-layer-normalization">8.2 Batch Normalization vs.&nbsp;Layer Normalization</h3>
<p>Batch Normalization (Ioffe &amp; Szegedy, 2015) normalizes each feature across the batch dimension. This works well for CNNs on images but has two critical failure modes for sequence models:</p>
<ol type="1">
<li><strong>Small batches</strong>: with batch size 1, the batch mean and variance are undefined (or estimated from a single sample). Transformers are often trained with small batch sizes per GPU.</li>
<li><strong>Variable-length sequences</strong>: different positions in a batch may have very different activation statistics. Normalizing across a mixed batch conflates these.</li>
</ol>
<p>Layer Normalization (Ba et al., 2016) normalizes across the <strong>feature dimension</strong> instead of the batch dimension:</p>
<p><img src="https://latex.codecogs.com/png.latex?y%20=%20%5Cfrac%7Bx%20-%20%5Cmathbb%7BE%7D%5Bx%5D%7D%7B%5Csqrt%7B%5Ctext%7BVar%7D%5Bx%5D%20+%20%5Cepsilon%7D%7D%20%5Ccdot%20%5Cgamma%20+%20%5Cbeta"></p>
<p>The mean and variance are computed independently for each example, over all features of that example. This makes LayerNorm completely independent of batch size — it works identically whether batch size is 1 or 1000.</p>
<table class="table">
<thead>
<tr class="header">
<th></th>
<th>Batch Norm</th>
<th>Layer Norm</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Normalizes over</td>
<td>Batch dimension</td>
<td>Feature dimension</td>
</tr>
<tr class="even">
<td>Running statistics for inference</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr class="odd">
<td>Breaks for batch_size = 1</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr class="even">
<td>Variable-length sequences</td>
<td>Awkward</td>
<td>Natural</td>
</tr>
<tr class="odd">
<td>Common in</td>
<td>CNNs, image models</td>
<td>Transformers, RNNs</td>
</tr>
</tbody>
</table>
<p><strong>The learnable parameters <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta"></strong>: After normalization, every layer’s output would have zero mean and unit variance — too rigid. The learned scale (<img src="https://latex.codecogs.com/png.latex?%5Cgamma">) and shift (<img src="https://latex.codecogs.com/png.latex?%5Cbeta">) let each layer restore whatever distribution works best for its downstream computation. Without them, normalization would over-constrain the model.</p>
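<p>As a sanity check on the formula, here is a framework-free sketch of the per-example computation (plain Python with toy values; real models use <code>nn.LayerNorm</code>):</p>

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Statistics come from this one example's own features -- no batch involved,
    # which is why the result is identical at batch size 1 or 1000.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
mean_out = sum(out) / len(out)            # ~0 after normalization
var_out = sum((v - mean_out) ** 2 for v in out) / len(out)  # ~1
```

With the defaults <code>gamma=1, beta=0</code> this is pure normalization; the learned scale and shift then restore whatever distribution the next layer prefers.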
</section>
<section id="pre-norm-vs.-post-norm-a-critical-implementation-choice" class="level3">
<h3 class="anchored" data-anchor-id="pre-norm-vs.-post-norm-a-critical-implementation-choice">8.3 Pre-Norm vs.&nbsp;Post-Norm: A Critical Implementation Choice</h3>
<p>The original Transformer paper placed LayerNorm <em>after</em> the residual addition (Post-LayerNorm). GPT-2 and virtually every modern large model places it <em>before</em> (Pre-LayerNorm). This seemingly minor change has significant consequences for training stability.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    subgraph PostLN["Post-LN (original paper)"]
        A1[x] --&gt; B1[Sublayer]
        A1 --&gt; C1[+]
        B1 --&gt; C1
        C1 --&gt; D1[LayerNorm]
        D1 --&gt; E1[output]
    end
    subgraph PreLN["Pre-LN (GPT-2, modern default)"]
        A2[x] --&gt; B2[LayerNorm]
        B2 --&gt; C2[Sublayer]
        A2 --&gt; D2[+]
        C2 --&gt; D2
        D2 --&gt; E2[output]
    end
</pre>
</div>
<p></p><figcaption> <strong>Figure 7:</strong> Post-LayerNorm (left) vs Pre-LayerNorm (right). Modern models use Pre-LN.</figcaption> </figure><p></p>
</div>
</div>
</div>
<table class="table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th>Post-LN: <code>LN(x + sublayer(x))</code></th>
<th>Pre-LN: <code>x + sublayer(LN(x))</code></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Gradient path</td>
<td>Normalization sits outside the residual — gradients must pass through it</td>
<td>Normalization is inside — clean gradient highway through the residual</td>
</tr>
<tr class="even">
<td>Training stability</td>
<td>Sensitive; requires careful learning rate warm-up; can diverge</td>
<td>More stable; trains without warm-up</td>
</tr>
<tr class="odd">
<td>Final performance</td>
<td>Marginally better with enough tuning</td>
<td>Slightly lower ceiling, but much easier to train</td>
</tr>
</tbody>
</table>
<p>Modern practice defaults to Pre-LN: training stability at scale is worth more than marginal final performance differences. If you are building a new model, use Pre-LN.</p>
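<p>The two arrangements reduce to a one-line difference. A toy scalar sketch (both <code>ln</code> and <code>sublayer</code> here are hypothetical placeholders, not real layers):</p>

```python
def ln(x):          # placeholder "LayerNorm" acting on a scalar
    return x * 0.5

def sublayer(x):    # placeholder attention/FFN sublayer
    return x + 1.0

def post_ln(x):     # original paper: normalize AFTER the residual add
    return ln(x + sublayer(x))

def pre_ln(x):      # GPT-2 onward: normalize only the sublayer input;
    return x + sublayer(ln(x))  # the residual path itself stays an identity

# In pre_ln, the raw x reaches the output untouched by normalization,
# which is exactly the "clean gradient highway" from the table above.
```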
</section>
</section>
<section id="skip-residual-connections" class="level2">
<h2 class="anchored" data-anchor-id="skip-residual-connections">9. Skip (Residual) Connections</h2>
<section id="the-residual-stream-mental-model" class="level3">
<h3 class="anchored" data-anchor-id="the-residual-stream-mental-model">9.1 The Residual Stream Mental Model</h3>
<p>Think of a Transformer as a <strong>residual stream</strong> — a river of information that flows from the input through all the layers to the output. Each layer (attention + FFN) reads from the stream and writes a correction back to it via addition:</p>
<p><img src="https://latex.codecogs.com/png.latex?x_%7B%5Ctext%7Bout%7D%7D%20=%20x_%7B%5Ctext%7Bin%7D%7D%20+%20%5Ctext%7Bsublayer%7D(x_%7B%5Ctext%7Bin%7D%7D)"></p>
<p>No single layer “owns” the representation. Each layer adds its contribution to a shared river. The residual stream at any point contains the sum of everything all previous layers have written.</p>
<p>This framing, developed in mechanistic interpretability research, makes it immediately clear why attention heads can specialize: each head writes its contribution to the stream independently and additively. Heads don’t compete or overwrite one another; the stream simply accumulates everything they write.</p>
</section>
<section id="why-residual-connections-work" class="level3">
<h3 class="anchored" data-anchor-id="why-residual-connections-work">9.2 Why Residual Connections Work</h3>
<p><strong>Gradient highways.</strong> When backpropagating through <img src="https://latex.codecogs.com/png.latex?y%20=%20x%20+%20F(x)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20x%7D%20=%20%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20y%7D%20%5Ccdot%20%5Cleft(1%20+%20%5Cfrac%7B%5Cpartial%20F%7D%7B%5Cpartial%20x%7D%5Cright)"></p>
<p>The term <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20y%7D"> reaches <img src="https://latex.codecogs.com/png.latex?x"> directly, through the identity path — regardless of what <img src="https://latex.codecogs.com/png.latex?F(x)"> does. Even if <img src="https://latex.codecogs.com/png.latex?F"> has saturated activations or near-zero gradients, the loss signal still flows back to earlier layers. This is why ResNets with skip connections can be trained to hundreds of layers while the same architecture without them fails beyond a dozen.</p>
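<p>This claim can be checked numerically. In the sketch below (plain Python, central finite differences), the sublayer is a deeply saturated <code>tanh</code> whose own gradient is essentially zero, yet the residual form still passes gradient through at full strength:</p>

```python
import math

def F(x):
    return math.tanh(x)  # saturates hard for large |x|

def grad(f, x, h=1e-6):
    # central finite-difference estimate of df/dx
    return (f(x + h) - f(x - h)) / (2 * h)

x = 10.0                             # deep in tanh's saturated regime
g_sublayer = grad(F, x)              # ~0: the sublayer alone blocks gradient
g_residual = grad(lambda v: v + F(v), x)  # ~1: identity path keeps it alive
```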
<p><strong>Loss landscape smoothing.</strong> Li et al.&nbsp;(2018) visualized the loss surfaces of deep networks with and without skip connections. Without them: chaotic, sharp, with many high-curvature local minima that trap gradient descent. With them: smooth, nearly convex, and far easier to navigate.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/loss-landscape-with-skip-connections.png" class="lightbox" data-glightbox="description: .lightbox-desc-7" data-gallery="quarto-lightbox-gallery-7" title="Figure 8: Loss surfaces of ResNet-56 with/without skip connections (source)"><img src="https://imaddabbura.github.io/posts/nlp/images/loss-landscape-with-skip-connections.png" class="img-fluid figure-img" alt="Figure 8: Loss surfaces of ResNet-56 with/without skip connections (source)"></a></p>
<figcaption><strong>Figure 8:</strong> Loss surfaces of ResNet-56 with/without skip connections (<a href="https://arxiv.org/abs/1712.09913">source</a>)</figcaption>
</figure>
</div>
<p><strong>The forgetting argument.</strong> Without skip connections, each layer must preserve all useful information from its input in its output — if the layer wants to pass something unchanged, it must learn to do so explicitly. With skip connections, the <strong>default is identity</strong> — the layer only needs to learn what to <em>add</em>, not what to keep. This dramatically reduces the effective depth that the gradient must overcome.</p>
<p>However, training deep networks reliably requires one more ingredient beyond gradient highways — preventing the network from memorizing noise. That is dropout’s job.</p>
</section>
</section>
<section id="dropout" class="level2">
<h2 class="anchored" data-anchor-id="dropout">10. Dropout</h2>
<p>Dropout (Srivastava et al., 2014) randomly zeros a fraction <code>p</code> of activations during training. Each training step uses a different random mask, forcing the model not to rely on any particular activation path. This breaks up <strong>co-adaptation</strong>, where groups of units learn to function only in each other’s presence.</p>
<p>The regularization effect comes from two mechanisms:</p>
<ol type="1">
<li><strong>Network size reduction</strong>: Dropping units creates a smaller effective network per step. A smaller network has fewer parameters to overfit.</li>
<li><strong>Implicit ensembling</strong>: Each step trains a different subnetwork. At inference, the full network approximates an average over all these subnetworks, acting as an inexpensive bagging-style ensemble.</li>
</ol>
<p>In Transformers, dropout is applied after the embedding layer (after adding token + positional embeddings), after each attention sublayer, and after each FFN sublayer.</p>
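<p>A minimal sketch of <em>inverted</em> dropout (plain Python), the standard formulation used by <code>nn.Dropout</code>: survivors are scaled by <code>1/(1-p)</code> at training time so the expected activation matches inference, where dropout is a no-op:</p>

```python
import random

def dropout(x, p, training):
    if not training or p == 0.0:
        return list(x)  # inference: identity, no rescaling needed
    keep = 1.0 - p
    # zero each activation with probability p; rescale survivors by 1/keep
    return [v / keep if random.random() < keep else 0.0 for v in x]

random.seed(0)
x = [1.0] * 100_000
y = dropout(x, p=0.1, training=True)
mean_y = sum(y) / len(y)  # expected value stays ~1.0, matching inference
```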
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Modern Large Models Often Skip Dropout
</div>
</div>
<div class="callout-body-container callout-body">
<p>LLaMA, Mistral, and other recent large models use no dropout at all. At sufficient scale with enough data, the regularization effect of dropout is less necessary, and it slows training. Dropout remains important for smaller models trained on limited data, and for fine-tuning where overfitting is a risk.</p>
</div>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/dropout.png" class="lightbox" data-glightbox="description: .lightbox-desc-8" data-gallery="quarto-lightbox-gallery-8" title="Figure 9: Left: standard neural net. Right: thinned net after applying dropout — crossed units are dropped. (source)"><img src="https://imaddabbura.github.io/posts/nlp/images/dropout.png" class="img-fluid figure-img" alt="Figure 9: Left: standard neural net. Right: thinned net after applying dropout — crossed units are dropped. (source)"></a></p>
<figcaption><strong>Figure 9:</strong> Left: standard neural net. Right: thinned net after applying dropout — crossed units are dropped. (<a href="https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf">source</a>)</figcaption>
</figure>
</div>
<p>With all the individual components understood — attention, FFN, LayerNorm, skip connections, dropout — it’s time to see how they snap together into a complete layer.</p>
</section>
<section id="assembling-the-encoder-layer" class="level2">
<h2 class="anchored" data-anchor-id="assembling-the-encoder-layer">11. Assembling the Encoder Layer</h2>
<p>Now that we have all the building blocks, let us see how they snap together into a single encoder layer — the repeated unit that makes up the encoder stack.</p>
<p>An encoder layer applies two sublayers in sequence, each wrapped in a residual connection and LayerNorm. Tracing the shapes at every step (using Pre-LN convention):</p>
<table class="table">
<colgroup>
<col style="width: 15%">
<col style="width: 28%">
<col style="width: 18%">
<col style="width: 36%">
</colgroup>
<thead>
<tr class="header">
<th>Step</th>
<th>Operation</th>
<th>Shape</th>
<th>What it does</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td>Input Embeddings</td>
<td><code>(B, T, d_model)</code></td>
<td>Token IDs → dense vectors</td>
</tr>
<tr class="even">
<td>2</td>
<td>+ Positional Encoding</td>
<td><code>(B, T, d_model)</code></td>
<td>Inject position information</td>
</tr>
<tr class="odd">
<td>3</td>
<td>Self-Attention (×h heads)</td>
<td><code>(B, T, d_model)</code></td>
<td>Each token attends to all others</td>
</tr>
<tr class="even">
<td>4</td>
<td>Add &amp; Norm</td>
<td><code>(B, T, d_model)</code></td>
<td>Residual connection + layer norm</td>
</tr>
<tr class="odd">
<td>5</td>
<td>Feed-Forward</td>
<td><code>(B, T, d_ff)</code> → <code>(B, T, d_model)</code></td>
<td>Non-linear transformation</td>
</tr>
<tr class="even">
<td>6</td>
<td>Add &amp; Norm</td>
<td><code>(B, T, d_model)</code></td>
<td>Residual connection + layer norm</td>
</tr>
<tr class="odd">
<td>7</td>
<td>[Repeat × N layers]</td>
<td><code>(B, T, d_model)</code></td>
<td>Stack N encoder layers</td>
</tr>
<tr class="even">
<td>8</td>
<td>Encoder Output</td>
<td><code>(B, T, d_model)</code></td>
<td>Rich contextual representations</td>
</tr>
</tbody>
</table>
<p><code>B = batch size, T = sequence length, d_model = model dimension</code></p>
<p><strong>Why every row says <code>d_model</code>.</strong> Residual connections require that the sublayer output has exactly the same shape as its input — otherwise you cannot add them together. This is a hard architectural constraint: every sublayer (attention, FFN, LayerNorm) must consume and produce tensors of shape <code>(B, T, d_model)</code>. It is the reason <code>head_dim = d_model / num_heads</code> (the concatenation of all heads must restore <code>d_model</code>), and why the FFN contracts back from <code>d_ff</code> → <code>d_model</code> at the end. The entire Transformer is shaped around this single number.</p>
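<p>Concretely, with the original base configuration (values shown are illustrative):</p>

```python
d_model, num_heads = 512, 8        # Transformer-base sizes
assert d_model % num_heads == 0    # hard architectural requirement
head_dim = d_model // num_heads    # 64

# Concatenating all heads must restore d_model exactly, or the
# residual add (B, T, d_model) + (B, T, d_model) cannot happen.
concat_dim = head_dim * num_heads
```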
<p>Every token’s representation enters with shape <code>d_model</code>. After the attention sublayer, it has been updated by attending to all other tokens — information has been mixed across positions. After the FFN sublayer, each position’s representation has been transformed nonlinearly — independently from all other positions.</p>
<blockquote class="blockquote">
<p><em>An encoder layer does two things: (1) let tokens talk to each other via attention, then (2) let each token digest what it heard via the FFN.</em></p>
</blockquote>
<p>A full encoder stacks <img src="https://latex.codecogs.com/png.latex?N"> of these layers (typically 6–24). Each layer refines the representations further — early layers tend to capture surface-level patterns, later layers capture increasingly abstract semantic relationships.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Post-LN in the Code
</div>
</div>
<div class="callout-body-container callout-body">
<p>The implementation below uses the Post-LayerNorm arrangement from the original paper: <code>LN(x + sublayer(x))</code>. The Pre-LN alternative is shown in comments. For new models, prefer Pre-LN.</p>
</div>
</div>
<div id="b25d08fb-7099-40a1-9175-f53442e43b89" class="cell">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> EncoderLayer(nn.Module):</span>
<span id="cb8-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config):</span>
<span id="cb8-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb8-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.attn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MultiHeadAttention(config)</span>
<span id="cb8-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.ff <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> FeedForwardNN(config)</span>
<span id="cb8-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.LayerNorm(config.embed_dim)</span>
<span id="cb8-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.LayerNorm(config.embed_dim)</span>
<span id="cb8-8"></span>
<span id="cb8-9">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb8-10">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## x: B x T x embed_dim  (input and output shape are identical)</span></span>
<span id="cb8-11">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##</span></span>
<span id="cb8-12">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Post-LayerNorm arrangement (original Transformer paper):</span></span>
<span id="cb8-13">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_1(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.attn(x, x, x))  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## bidirectional self-attention</span></span>
<span id="cb8-14">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_2(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.ff(x))</span>
<span id="cb8-15">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##</span></span>
<span id="cb8-16">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Pre-LayerNorm alternative (GPT-2+, more stable — recommended for new models):</span></span>
<span id="cb8-17">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## x = x + self.attn(self.layer_norm_1(x), self.layer_norm_1(x), self.layer_norm_1(x))</span></span>
<span id="cb8-18">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## x = x + self.ff(self.layer_norm_2(x))</span></span>
<span id="cb8-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> x  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T x embed_dim</span></span></code></pre></div>
</div>
<div id="af565905-5b13-4773-b44f-d0a2e78a3831" class="cell">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> TransformerEncoder(nn.Module):</span>
<span id="cb9-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb9-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb9-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.embeddings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Embeddings(config)</span>
<span id="cb9-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.encoder_blocks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Sequential(</span>
<span id="cb9-6">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>[EncoderLayer(config) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(config.num_hidden_layers)]</span>
<span id="cb9-7">        )</span>
<span id="cb9-8"></span>
<span id="cb9-9">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb9-10">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## x:    B x T  (integer token IDs)</span></span>
<span id="cb9-11">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.embeddings(x)        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T x embed_dim</span></span>
<span id="cb9-12">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.encoder_blocks(x)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T x embed_dim</span></span></code></pre></div>
</div>
</section>
<section id="assembling-the-decoder-layer" class="level2">
<h2 class="anchored" data-anchor-id="assembling-the-decoder-layer">12. Assembling the Decoder Layer</h2>
<p>The decoder layer differs from the encoder in one critical way: it adds a <strong>cross-attention sublayer</strong> between the masked self-attention and the FFN. This is the mechanism that lets the decoder read the encoder’s output.</p>
<p>A decoder layer applies three sublayers:</p>
<table class="table">
<colgroup>
<col style="width: 15%">
<col style="width: 28%">
<col style="width: 18%">
<col style="width: 36%">
</colgroup>
<thead>
<tr class="header">
<th>Step</th>
<th>Operation</th>
<th>Shape</th>
<th>What it does</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td>Target Embeddings</td>
<td><code>(B, T_tgt, d_model)</code></td>
<td>Target token IDs → dense vectors</td>
</tr>
<tr class="even">
<td>2</td>
<td>+ Positional Encoding</td>
<td><code>(B, T_tgt, d_model)</code></td>
<td>Inject position information</td>
</tr>
<tr class="odd">
<td>3</td>
<td>Masked Self-Attention (×h)</td>
<td><code>(B, T_tgt, d_model)</code></td>
<td>Attends only to past and current positions (causal mask)</td>
</tr>
<tr class="even">
<td>4</td>
<td>Add &amp; Norm</td>
<td><code>(B, T_tgt, d_model)</code></td>
<td>Residual connection + layer norm</td>
</tr>
<tr class="odd">
<td>5</td>
<td>Cross-Attention (×h)</td>
<td><code>(B, T_tgt, d_model)</code></td>
<td>Q from decoder; K, V from encoder output</td>
</tr>
<tr class="even">
<td>6</td>
<td>Add &amp; Norm</td>
<td><code>(B, T_tgt, d_model)</code></td>
<td>Residual connection + layer norm</td>
</tr>
<tr class="odd">
<td>7</td>
<td>Feed-Forward</td>
<td><code>(B, T_tgt, d_ff)</code> → <code>(B, T_tgt, d_model)</code></td>
<td>Non-linear transformation</td>
</tr>
<tr class="even">
<td>8</td>
<td>Add &amp; Norm</td>
<td><code>(B, T_tgt, d_model)</code></td>
<td>Residual connection + layer norm</td>
</tr>
<tr class="odd">
<td>9</td>
<td>[Repeat × N layers]</td>
<td><code>(B, T_tgt, d_model)</code></td>
<td>Stack N decoder layers</td>
</tr>
<tr class="even">
<td>10</td>
<td>Linear + Softmax</td>
<td><code>(B, T_tgt, vocab_size)</code></td>
<td>Project to vocabulary probabilities</td>
</tr>
</tbody>
</table>
<p><code>B = batch size, T_tgt = target sequence length</code></p>
<p><strong>Sublayer 1 — Masked self-attention</strong>: Decoder tokens attend to each other, but only to past and current positions (causal mask). This builds a contextualized representation of the target sequence generated so far.</p>
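<p>The causal mask itself is just a lower-triangular boolean matrix over target positions (plain-Python sketch; real implementations apply <code>torch.tril</code>-style masking to the attention scores):</p>

```python
T = 4  # target sequence length

# mask[i][j] is True where position i may attend to position j (j <= i)
mask = [[j <= i for j in range(T)] for i in range(T)]

# Position 0 sees only itself; the last position sees the whole prefix.
visible = [sum(row) for row in mask]  # -> [1, 2, 3, 4]
```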
<p><strong>Sublayer 2 — Cross-attention</strong>: The decoder’s hidden state becomes the query. The encoder’s final output provides the keys and values. Every decoder position can attend to all encoder positions — this is how the decoder “reads” the full source sequence at every generation step.</p>
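<p>A miniature, framework-free version of scaled dot-product cross-attention makes the asymmetry explicit (toy sizes, hypothetical values): the output has one row per decoder position, but each row is a softmax-weighted mix of encoder values:</p>

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

# Toy sizes: 2 decoder positions, 3 encoder positions, feature dim 2.
Q = [[1.0, 0.0], [0.0, 1.0]]              # queries: decoder hidden states
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # keys:   encoder output
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # values: encoder output

d_k = len(K[0])
out = []
for q in Q:  # every decoder position attends over ALL encoder positions
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
    w = softmax(scores)
    out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
```

Note there is no causal mask here: the source sequence is fully known, so restricting which encoder positions the decoder may read would only discard information.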
<p><strong>Sublayer 3 — FFN</strong>: Same position-wise transformation as in the encoder.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note on the Code Below
</div>
</div>
<div class="callout-body-container callout-body">
<p>The DecoderLayer shown uses only masked self-attention (no cross-attention sublayer). It is therefore suited for the decoder-only (GPT-style) architecture. Cross-attention is addressed in the Encoder-Decoder section.</p>
</div>
</div>
<div id="d684880a-3132-487a-ab51-fc3b20978e5f" class="cell">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> DecoderLayer(nn.Module):</span>
<span id="cb10-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config):</span>
<span id="cb10-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb10-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.attn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MultiHeadAttention(config, is_decoder<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb10-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.ff <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> FeedForwardNN(config)</span>
<span id="cb10-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.LayerNorm(config.embed_dim)</span>
<span id="cb10-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.LayerNorm(config.embed_dim)</span>
<span id="cb10-8"></span>
<span id="cb10-9">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb10-10">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## x: B x T x embed_dim</span></span>
<span id="cb10-11">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Masked self-attention: each token only attends to past and current positions</span></span>
<span id="cb10-12">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_1(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.attn(x, x, x))</span>
<span id="cb10-13">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_2(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.ff(x))</span>
<span id="cb10-14">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> x  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T x embed_dim</span></span></code></pre></div>
</div>
<div id="89f920e8-82b9-4975-841d-94613c5bfa7d" class="cell">
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> TransformerDecoder(nn.Module):</span>
<span id="cb11-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb11-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb11-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.embeddings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Embeddings(config)</span>
<span id="cb11-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.decoder_blocks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Sequential(</span>
<span id="cb11-6">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>[DecoderLayer(config) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(config.num_hidden_layers)]</span>
<span id="cb11-7">        )</span>
<span id="cb11-8"></span>
<span id="cb11-9">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb11-10">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## x:    B x T  (integer token IDs)</span></span>
<span id="cb11-11">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.embeddings(x)         <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T x embed_dim</span></span>
<span id="cb11-12">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.decoder_blocks(x)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T x embed_dim</span></span></code></pre></div>
</div>
</section>
<section id="architecture-variants" class="level2">
<h2 class="anchored" data-anchor-id="architecture-variants">13. Architecture Variants</h2>
<p>The same building blocks support three distinct architectures, differing only in which sublayers are present and whether attention is masked. Here is the full comparison before diving into each:</p>
<table class="table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th>Encoder-Only</th>
<th>Decoder-Only</th>
<th>Encoder-Decoder</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Attention masking</strong></td>
<td>Bidirectional</td>
<td>Causal</td>
<td>Causal in decoder; bidirectional in encoder</td>
</tr>
<tr class="even">
<td><strong>Cross-attention</strong></td>
<td>No</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr class="odd">
<td><strong>Input → Output</strong></td>
<td>Text → hidden states</td>
<td>Text → next token</td>
<td>Source text → target text</td>
</tr>
<tr class="even">
<td><strong>Canonical task</strong></td>
<td>Classification, NER, embeddings</td>
<td>Text generation, LM</td>
<td>Translation, summarization</td>
</tr>
<tr class="odd">
<td><strong>Examples</strong></td>
<td>BERT, RoBERTa, DistilBERT</td>
<td>GPT, LLaMA, Mistral</td>
<td>T5, BART, mT5</td>
</tr>
</tbody>
</table>
<section id="encoder-only-architecture" class="level3">
<h3 class="anchored" data-anchor-id="encoder-only-architecture">13.1 Encoder-Only Architecture</h3>
<p>Encoder-only models use bidirectional self-attention — every token attends to every other token with no masking. This means the representation of each token is conditioned on the full context: tokens to the left <em>and</em> the right. Bidirectional context makes encoder-only models excellent at understanding tasks: text classification, named entity recognition, extractive question answering, and computing sentence embeddings.</p>
<p><strong>Why bidirectional?</strong> Classification does not require generating new tokens — it requires understanding the full input. A model that sees the entire sentence simultaneously can build richer representations than one forced to read left-to-right.</p>
<p><strong>How is it trained?</strong> BERT-style models are trained with <strong>Masked Language Modeling (MLM)</strong>: 15% of tokens are randomly masked (<code>[MASK]</code>), and the model must predict the original token at each masked position. Because the model can see all tokens to the left <em>and</em> right of the mask, this forces it to build bidirectional representations.</p>
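<p>The masking step can be sketched framework-agnostically in NumPy. The <code>-100</code> sentinel follows the common PyTorch convention for an ignored label index, and <code>mask_id</code> stands in for the tokenizer's <code>[MASK]</code> token ID; both are assumptions of this sketch, not details from the text above:</p>

```python
import numpy as np

def mlm_mask(token_ids, mask_id, mask_prob=0.15, rng=None):
    """Corrupt ~15% of positions for Masked Language Modeling.

    Returns (corrupted_input, labels). labels is -100 (ignored by the
    loss) everywhere except masked positions, where it holds the
    original token the model must recover.
    """
    rng = rng or np.random.default_rng(0)
    ids = np.asarray(token_ids)
    masked = rng.random(ids.shape) < mask_prob       # choose ~15% of positions
    labels = np.where(masked, ids, -100)             # targets only where masked
    corrupted = np.where(masked, mask_id, ids)       # replace inputs with [MASK]
    return corrupted, labels
```

<p>The full BERT recipe additionally replaces some selected positions with random tokens or leaves them unchanged rather than always inserting <code>[MASK]</code>; the sketch keeps only the core idea.</p>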
<p><strong>Classification head.</strong> A special <code>[CLS]</code> token is prepended to every sequence before the encoder. The encoder’s output at the <code>[CLS]</code> position — <code>encoder_output[:, 0, :]</code> — serves as an aggregate representation of the full sequence. This vector is passed through a linear classification head to produce logits.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Why [CLS] and Not Mean Pooling?
</div>
</div>
<div class="callout-body-container callout-body">
<p>BERT uses <code>[CLS]</code> because it is trained to aggregate sequence-level information during pretraining (next sentence prediction task). In practice, mean pooling over all token representations often performs equally well or better for downstream tasks. Modern models trained without NSP use mean pooling as the default.</p>
</div>
</div>
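<p>Both pooling choices can be sketched directly over an encoder output of shape <code>B x T x embed_dim</code>; the attention mask in <code>mean_pool</code> keeps padding positions out of the average. A minimal NumPy sketch:</p>

```python
import numpy as np

def cls_pool(hidden):
    """Take the representation at position 0, the [CLS] slot (B x T x d -> B x d)."""
    return hidden[:, 0, :]

def mean_pool(hidden, attention_mask):
    """Average token representations, excluding padding.

    hidden:         B x T x d
    attention_mask: B x T, 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden.dtype)   # B x T x 1
    return (hidden * mask).sum(axis=1) / mask.sum(axis=1)
```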
<div id="df92c42f-8650-4ef6-9865-686d183d61cc" class="cell">
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> TransformerForSequenceClassification(nn.Module):</span>
<span id="cb12-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config):</span>
<span id="cb12-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb12-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.encoder <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> TransformerEncoder(config)</span>
<span id="cb12-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dropout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Dropout(config.hidden_dropout_prob)</span>
<span id="cb12-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.classifier <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.embed_dim, config.num_classes)</span>
<span id="cb12-7"></span>
<span id="cb12-8">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb12-9">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## x:              B x T  (integer token IDs)</span></span>
<span id="cb12-10">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## encoder output: B x T x embed_dim</span></span>
<span id="cb12-11">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## [CLS] vector:   B x embed_dim  (position 0 aggregates sequence meaning)</span></span>
<span id="cb12-12">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## logits:         B x num_classes</span></span>
<span id="cb12-13">        cls_output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.encoder(x)[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, :]</span>
<span id="cb12-14">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.classifier(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dropout(cls_output))</span></code></pre></div>
</div>
</section>
<section id="decoder-only-architecture" class="level3">
<h3 class="anchored" data-anchor-id="decoder-only-architecture">13.2 Decoder-Only Architecture</h3>
<p>Decoder-only models use causal self-attention — each token can only attend to itself and previous tokens. This is the natural architecture for <strong>language modeling</strong>: predicting the next token given all previous tokens.</p>
<p><strong>Why causal?</strong> Generating text requires predicting one token at a time. If the model could see future tokens while predicting token <img src="https://latex.codecogs.com/png.latex?t">, it would simply copy them. The causal mask enforces the constraint that prediction at position <img src="https://latex.codecogs.com/png.latex?t"> uses only information from positions <img src="https://latex.codecogs.com/png.latex?0,%201,%20%5Cldots,%20t">.</p>
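<p>The constraint is typically enforced by setting the upper triangle of the <code>T x T</code> score matrix to negative infinity before the softmax, which zeroes the corresponding attention weights. A NumPy sketch of that masking:</p>

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw attention scores (T x T), then softmax.

    scores[i, j] is the dot-product score of query i against key j.
    Positions j > i are set to -inf so, after softmax, each token
    attends only to itself and earlier positions.
    """
    T = scores.shape[-1]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above diagonal
    masked = np.where(future, -np.inf, scores)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

<p>With all-zero scores, row <code>i</code> of the result is uniform over positions <code>0..i</code> and exactly zero on every future position.</p>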
<p><strong>The training objective: Causal Language Modeling (CLM).</strong> Decoder-only models are trained by next-token prediction: given a sequence of tokens, predict the next one at every position simultaneously. The loss is the average cross-entropy over all positions. Because the causal mask prevents each position from seeing future tokens, a single forward pass generates <img src="https://latex.codecogs.com/png.latex?T"> training examples from one sequence — every position is simultaneously a training target. This is why CLM scales so efficiently: a 2048-token document yields 2048 gradient signals per forward pass. The training objective directly shapes what the model learns: because it must predict the next token from all preceding context, the model is forced to compress everything useful about the past into each position’s representation — which is why later layers hold increasingly abstract, predictive features.</p>
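<p>Concretely, the loss is computed by shifting: the logits at position <code>t</code> are scored against the token at position <code>t + 1</code>, so each position except the last is a supervised prediction. A NumPy sketch for a single sequence:</p>

```python
import numpy as np

def clm_loss(logits, token_ids):
    """Average next-token cross-entropy for one sequence.

    logits:    T x vocab_sz, one next-token distribution per position
    token_ids: length-T integer array
    Position t's logits are scored against token t+1; the final
    position has no target inside the sequence and is dropped.
    """
    preds, targets = logits[:-1], token_ids[1:]
    m = preds.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    log_probs = preds - m - np.log(np.exp(preds - m).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```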
<p><strong>Autoregressive generation.</strong> At inference, the decoder generates text by repeating:</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    A["Input tokens
[BOS, t₁, t₂]"] --&gt; B["Decoder
(causal attention)"]
    B --&gt; C["LM head
(linear + softmax)"]
    C --&gt; D["Next token
t₃"]
    D --&gt; A
</pre>
</div>
<p></p><figcaption> <strong>Figure 10:</strong> Autoregressive generation loop in decoder-only models.</figcaption> </figure><p></p>
</div>
</div>
</div>
<ol type="1">
<li>Feed current token sequence through the decoder</li>
<li>Take the output at the last position → pass through the LM head (linear projection to <code>vocab_sz</code>, then softmax)</li>
<li>Sample the next token from the resulting distribution</li>
<li>Append the sampled token to the sequence and repeat</li>
</ol>
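<p>The four steps above can be sketched as a short loop. Here <code>model</code> stands in for any callable mapping a token-ID sequence to per-position logits (the <code>GPT</code> module above has this shape), and the toy model is purely illustrative:</p>

```python
import numpy as np

def generate_greedy(model, prompt_ids, max_new_tokens, eos_id=None):
    """Autoregressive decoding loop: feed the growing sequence, take the
    logits at the last position, pick the argmax token, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(np.array(ids))        # T x vocab_sz
        next_id = int(logits[-1].argmax())   # greedy: highest-probability token
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

# Toy stand-in "model": always predicts token (last_token + 1) mod vocab_sz.
def toy_model(ids, vocab_sz=5):
    logits = np.zeros((len(ids), vocab_sz))
    logits[np.arange(len(ids)), (ids + 1) % vocab_sz] = 1.0
    return logits

generate_greedy(toy_model, [0], max_new_tokens=3)  # → [0, 1, 2, 3]
```

<p>Swapping the <code>argmax</code> for a draw from the distribution gives the sampling-based strategies described next.</p>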
<p><strong>Sampling strategies</strong> control how token <img src="https://latex.codecogs.com/png.latex?t+1"> is chosen from the distribution:</p>
<ul>
<li><strong>Greedy</strong>: always pick the highest-probability token. Fast but repetitive.</li>
<li><strong>Top-k</strong>: sample from the top-<img src="https://latex.codecogs.com/png.latex?k"> tokens by probability. Controls diversity.</li>
<li><strong>Top-p (nucleus)</strong>: sample from the smallest set of tokens whose cumulative probability exceeds <img src="https://latex.codecogs.com/png.latex?p">. Adaptive — uses fewer options when one token is dominant.</li>
<li><strong>Temperature</strong>: divide all logits by temperature <img src="https://latex.codecogs.com/png.latex?%5Ctau"> before softmax. <img src="https://latex.codecogs.com/png.latex?%5Ctau%20%3C%201"> sharpens the distribution (more confident); <img src="https://latex.codecogs.com/png.latex?%5Ctau%20%3E%201"> flattens it (more random).</li>
</ul>
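<p>These strategies compose: temperature first rescales the logits, then top-k or top-p truncates the candidate set before sampling. A NumPy sketch of the filtering step (the resulting distribution would then be drawn from with <code>np.random.choice</code> or similar):</p>

```python
import numpy as np

def filter_logits(logits, temperature=1.0, top_k=None, top_p=None):
    """Turn next-token logits (vocab_sz,) into a sampling distribution."""
    scaled = logits / temperature                 # tau < 1 sharpens, tau > 1 flattens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    if top_k is not None:
        # Keep only the k most probable tokens (ties keep all equal entries).
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        # Smallest set of tokens whose cumulative probability reaches p.
        order = np.argsort(probs)[::-1]
        csum = np.cumsum(probs[order])
        keep = np.zeros(probs.shape, dtype=bool)
        keep[order[: np.searchsorted(csum, top_p) + 1]] = True
        probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()
```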
<p><strong>KV caching: why inference is efficient.</strong> The loop as described implies re-computing attention over the full growing sequence at every step — which would scale as <img src="https://latex.codecogs.com/png.latex?O(T%5E2)"> for a <img src="https://latex.codecogs.com/png.latex?T">-token generation. Production systems avoid this with a <strong>KV cache</strong>: the K and V tensors for all past positions are stored after their first computation and reused on every subsequent step. Only the new token’s Q needs to be computed; it attends to the cached K/V from all prior positions. Each generation step then costs <img src="https://latex.codecogs.com/png.latex?O(T%20%5Ccdot%20d)"> instead of <img src="https://latex.codecogs.com/png.latex?O(T%5E2%20%5Ccdot%20d)">.</p>
<p>The KV cache is a first-class engineering constraint in LLM deployment. For a model with <img src="https://latex.codecogs.com/png.latex?L"> layers, <img src="https://latex.codecogs.com/png.latex?H"> heads, head dimension <img src="https://latex.codecogs.com/png.latex?d_k">, and current sequence length <img src="https://latex.codecogs.com/png.latex?T">, the cache requires <img src="https://latex.codecogs.com/png.latex?2%20%5Ccdot%20L%20%5Ccdot%20H%20%5Ccdot%20d_k%20%5Ccdot%20T"> values — for a 70B-scale model with full multi-head attention (80 layers, 64 heads, head dimension 128) at 4K context, that is over 5 billion values, roughly 10 GB in fp16. This is precisely why <strong>Grouped Query Attention (GQA)</strong> exists: by sharing a single K/V head across multiple Q heads, the cache shrinks by a factor of <code>num_heads / num_kv_heads</code> — often 8x. Every major modern model (LLaMA 2/3, Mistral, Gemma) uses GQA for exactly this reason.</p>
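<p>The arithmetic is easy to check directly. The sketch below evaluates the formula above; the 7B-scale configuration (32 layers, 32 heads, head dimension 128, 4K context) is an illustrative assumption, not a published measurement:</p>

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Per-sequence KV-cache size: 2 (K and V) * L * H_kv * d_k * T values,
    times bytes per value (2 for fp16)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Full multi-head attention at 7B scale: 32 KV heads, fp16, 4K context.
full_mha = kv_cache_bytes(32, 32, 128, 4096)   # 2_147_483_648 bytes = 2 GiB
# Same model with GQA sharing K/V across groups (e.g. 8 KV heads): 4x smaller.
gqa = kv_cache_bytes(32, 8, 128, 4096)
```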
<div id="8d2737a7-c025-4578-882e-0352341d7e95" class="cell">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> GPT(nn.Module):</span>
<span id="cb13-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config):</span>
<span id="cb13-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb13-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.decoder <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> TransformerDecoder(config)</span>
<span id="cb13-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dropout <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Dropout(config.hidden_dropout_prob)</span>
<span id="cb13-6">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Project from embed_dim to vocab_sz to get next-token logits</span></span>
<span id="cb13-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.lm_head <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.embed_dim, config.vocab_sz, bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb13-8"></span>
<span id="cb13-9">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x):</span>
<span id="cb13-10">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## x:       B x T  (integer token IDs)</span></span>
<span id="cb13-11">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## decoded: B x T x embed_dim</span></span>
<span id="cb13-12">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## logits:  B x T x vocab_sz  (next-token distribution at every position)</span></span>
<span id="cb13-13">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dropout(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.decoder(x))</span>
<span id="cb13-14">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.lm_head(x)</span></code></pre></div>
</div>
<div id="b3e1f2a0-seq2-seq0-0000-000000000001" class="cell">
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> CrossAttentionDecoderLayer(nn.Module):</span>
<span id="cb14-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Decoder layer with three sublayers:</span></span>
<span id="cb14-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    (1) masked causal self-attention, (2) cross-attention to encoder, (3) FFN.</span></span>
<span id="cb14-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb14-5">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config):</span>
<span id="cb14-6">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb14-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.self_attn    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MultiHeadAttention(config, is_decoder<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb14-8">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.cross_attn   <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MultiHeadAttention(config, is_decoder<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb14-9">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.ff           <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> FeedForwardNN(config)</span>
<span id="cb14-10">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.LayerNorm(config.embed_dim)</span>
<span id="cb14-11">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.LayerNorm(config.embed_dim)</span>
<span id="cb14-12">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.LayerNorm(config.embed_dim)</span>
<span id="cb14-13"></span>
<span id="cb14-14">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x, encoder_output):</span>
<span id="cb14-15">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## x:              B x T_dec x embed_dim</span></span>
<span id="cb14-16">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## encoder_output: B x T_enc x embed_dim</span></span>
<span id="cb14-17">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##</span></span>
<span id="cb14-18">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## 1. Masked self-attention — decoder tokens attend to each other causally</span></span>
<span id="cb14-19">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_1(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.self_attn(x, x, x))</span>
<span id="cb14-20">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## 2. Cross-attention — Q from decoder, K and V from encoder</span></span>
<span id="cb14-21">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">##    Every decoder position can attend to all encoder positions</span></span>
<span id="cb14-22">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_2(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.cross_attn(x, encoder_output, encoder_output))</span>
<span id="cb14-23">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## 3. Position-wise FFN</span></span>
<span id="cb14-24">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.layer_norm_3(x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.ff(x))</span>
<span id="cb14-25">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> x  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T_dec x embed_dim</span></span>
<span id="cb14-26"></span>
<span id="cb14-27"></span>
<span id="cb14-28"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> Seq2SeqTransformer(nn.Module):</span>
<span id="cb14-29">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Encoder-decoder Transformer for sequence-to-sequence tasks</span></span>
<span id="cb14-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    such as machine translation and summarization.</span></span>
<span id="cb14-31"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb14-32">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, config):</span>
<span id="cb14-33">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb14-34">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.encoder_embeddings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Embeddings(config)</span>
<span id="cb14-35">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.decoder_embeddings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Embeddings(config)</span>
<span id="cb14-36">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.encoder_blocks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.ModuleList(</span>
<span id="cb14-37">            [EncoderLayer(config) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(config.num_hidden_layers)]</span>
<span id="cb14-38">        )</span>
<span id="cb14-39">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.decoder_blocks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.ModuleList(</span>
<span id="cb14-40">            [CrossAttentionDecoderLayer(config) <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(config.num_hidden_layers)]</span>
<span id="cb14-41">        )</span>
<span id="cb14-42">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.lm_head <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(config.embed_dim, config.vocab_sz, bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb14-43"></span>
<span id="cb14-44">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> encode(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, src):</span>
<span id="cb14-45">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## src: B x T_enc  →  B x T_enc x embed_dim</span></span>
<span id="cb14-46">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.encoder_embeddings(src)</span>
<span id="cb14-47">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> block <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.encoder_blocks:</span>
<span id="cb14-48">            x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block(x)</span>
<span id="cb14-49">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> x  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T_enc x embed_dim</span></span>
<span id="cb14-50"></span>
<span id="cb14-51">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> decode(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, tgt, encoder_output):</span>
<span id="cb14-52">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## tgt:            B x T_dec</span></span>
<span id="cb14-53">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## encoder_output: B x T_enc x embed_dim</span></span>
<span id="cb14-54">        x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.decoder_embeddings(tgt)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T_dec x embed_dim</span></span>
<span id="cb14-55">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> block <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.decoder_blocks:</span>
<span id="cb14-56">            x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block(x, encoder_output)</span>
<span id="cb14-57">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> x  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T_dec x embed_dim</span></span>
<span id="cb14-58"></span>
<span id="cb14-59">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, src, tgt):</span>
<span id="cb14-60">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## src: B x T_enc  (source token IDs, e.g. English)</span></span>
<span id="cb14-61">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## tgt: B x T_dec  (target token IDs, e.g. German — teacher-forced during training)</span></span>
<span id="cb14-62">        encoder_output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.encode(src)                    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T_enc x embed_dim</span></span>
<span id="cb14-63">        decoder_output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.decode(tgt, encoder_output)    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T_dec x embed_dim</span></span>
<span id="cb14-64">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.lm_head(decoder_output)                  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x T_dec x vocab_sz</span></span></code></pre></div>
</div>
</section>
<section id="encoder-decoder-architecture" class="level3">
<h3 class="anchored" data-anchor-id="encoder-decoder-architecture">13.3 Encoder-Decoder Architecture</h3>
<p>The encoder-decoder (or “sequence-to-sequence”) architecture is the original Transformer from Vaswani et al.&nbsp;(2017). It is designed for tasks where both input and output are text sequences — particularly tasks where the input and output are structurally different, like machine translation or summarization.</p>
<p><strong>The two-phase interpretation:</strong></p>
<ul>
<li><strong>Encoder</strong>: reads the full source sequence with bidirectional attention and produces a rich, contextualized representation. Think of this as “understanding the source.”</li>
<li><strong>Decoder</strong>: generates the target sequence token by token, conditioned on the encoder’s representation at every step. Think of this as “generating the target given the understanding.”</li>
</ul>
<p><strong>How cross-attention implements conditioning.</strong> At every decoder step, the cross-attention sublayer takes:</p>
<ul>
<li>Queries from the decoder’s current hidden state: <em>“What do I need from the source?”</em></li>
<li>Keys and Values from the encoder’s final output: <em>“Here is everything in the source.”</em></li>
</ul>
<p>Every decoder position attends to all encoder positions simultaneously. The model learns which parts of the source to focus on when generating each target token — the alignment between source and target.</p>
<p>Note that unlike self-attention (where the <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20T"> weight matrix is square), cross-attention produces a <strong>rectangular</strong> weight matrix of shape <img src="https://latex.codecogs.com/png.latex?T_%7B%5Ctext%7Btgt%7D%7D%20%5Ctimes%20T_%7B%5Ctext%7Bsrc%7D%7D">: one row per decoder query position, one column per encoder key position. During early generation when only a few target tokens exist, this matrix might be <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%2050"> — three decoder positions each attending over fifty source positions. The asymmetry is intentional: the decoder decides what to ask (Q), the encoder provides the full library of keys and values (K, V), and the weight matrix records what each decoder step borrows from each source position.</p>
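<p>The rectangular shape is easy to verify directly. Below is a minimal single-head sketch in PyTorch; the learned Q/K/V projections are omitted for clarity, so the tensors are illustrative stand-ins for the decoder hidden states and encoder output:</p>

```python
import torch
import torch.nn.functional as F

B, T_dec, T_enc, embed_dim = 1, 3, 50, 768

# Stand-ins for the decoder hidden states and the encoder output
decoder_hidden = torch.randn(B, T_dec, embed_dim)
encoder_output = torch.randn(B, T_enc, embed_dim)

# Cross-attention: Q from the decoder, K and V from the encoder
q, k, v = decoder_hidden, encoder_output, encoder_output

scores = q @ k.transpose(-2, -1) / embed_dim**0.5  # B x T_dec x T_enc -- rectangular
weights = F.softmax(scores, dim=-1)                # each decoder row sums to 1
out = weights @ v                                  # B x T_dec x embed_dim

print(weights.shape)  # torch.Size([1, 3, 50])
print(out.shape)      # torch.Size([1, 3, 768])
```

<p>The <code>1 × 3 × 50</code> weight tensor is exactly the “three decoder positions, each attending over fifty source positions” case described above.</p>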
<p><strong>When encoder-decoder vs.&nbsp;decoder-only?</strong> Encoder-decoder models are preferred when source and target are structurally different (e.g., English → German, document → summary). For tasks where both input and output are similar in format (e.g., open-domain conversation, code completion), decoder-only models have largely taken over — they are simpler to train and scale, and can handle both input and output within a single sequence by formatting the task as a text completion problem.</p>
<p>Notable encoder-decoder models: <strong>T5</strong> (Text-to-Text Transfer Transformer), <strong>BART</strong>, <strong>mT5</strong>, <strong>NLLB</strong>.</p>
</section>
</section>
<section id="end-to-end-forward-pass-walkthrough" class="level2">
<h2 class="anchored" data-anchor-id="end-to-end-forward-pass-walkthrough">14. End-to-End Forward Pass Walkthrough</h2>
<p>Let’s trace a complete forward pass through an encoder-decoder Transformer to see how all the pieces compose. We’ll use a small example: translating the English sentence “The cat sat” into German.</p>
<p><strong>Setup:</strong> batch size <img src="https://latex.codecogs.com/png.latex?B%20=%201">, source length <img src="https://latex.codecogs.com/png.latex?T_%7Benc%7D%20=%203">, <code>embed_dim = 768</code>, <code>num_heads = 12</code>, <code>head_dim = 64</code>.</p>
<hr>
<p><strong>Step 1 — Tokenize the source.</strong></p>
<p>“The cat sat” → subword tokenizer → <code>[2, 47, 193]</code> (integer IDs)</p>
<p>Shape: <code>1 × 3</code> (integers)</p>
<hr>
<p><strong>Step 2 — Token embedding lookup.</strong></p>
<p>Each integer is mapped to a 768-dimensional vector via the embedding table.</p>
<p>Shape: <code>1 × 3</code> → <code>1 × 3 × 768</code></p>
<hr>
<p><strong>Step 3 — Add positional encodings.</strong></p>
<p>A positional encoding vector is added to each token’s embedding. The result encodes both <em>what</em> the token is (token embedding) and <em>where</em> it sits (positional encoding).</p>
<p>Shape: <code>1 × 3 × 768</code> (unchanged)</p>
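<p>Steps 1–3 can be sketched in a few lines of PyTorch. The vocabulary size, the token IDs, and the learned positional table below are illustrative assumptions, not values from any particular model:</p>

```python
import torch
import torch.nn as nn

B, T_enc, embed_dim, vocab_sz, max_len = 1, 3, 768, 32000, 512

token_ids = torch.tensor([[2, 47, 193]])     # Step 1: B x T_enc integer IDs
tok_emb = nn.Embedding(vocab_sz, embed_dim)  # token embedding table
pos_emb = nn.Embedding(max_len, embed_dim)   # learned positional table (one option)

x = tok_emb(token_ids)                       # Step 2: B x T_enc x embed_dim
x = x + pos_emb(torch.arange(T_enc))         # Step 3: shape unchanged
print(x.shape)  # torch.Size([1, 3, 768])
```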
<hr>
<p><strong>Step 4 — N encoder layers.</strong></p>
<p>Each encoder layer applies two sublayers:</p>
<ul>
<li><strong>Multi-head self-attention</strong>: All 3 tokens attend to all 3 tokens. The <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> attention weight matrix (12 heads, each with its own <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> weights) is computed, and each token’s representation is updated as a weighted mix of all token values.</li>
<li><strong>FFN</strong>: Each token’s updated representation passes through the 2-layer FFN independently.</li>
</ul>
<p>Shape at every encoder layer: <code>1 × 3 × 768</code> (unchanged throughout)</p>
<p>After <img src="https://latex.codecogs.com/png.latex?N"> encoder layers, each of the 3 token positions holds a deeply <strong>contextualized representation</strong> — the meaning of “cat” is now informed by the presence of “The” and “sat” in context.</p>
<p><strong>Encoder output:</strong> <code>1 × 3 × 768</code> — this is what the decoder will attend to.</p>
<hr>
<p><strong>Step 5 — Decoder receives the start token.</strong></p>
<p>Decoder input starts with a start-of-sequence token <code>[BOS]</code>.</p>
<p>Shape: <code>1 × 1</code> → (after embedding) <code>1 × 1 × 768</code></p>
<hr>
<p><strong>Step 6 — N decoder layers.</strong></p>
<p>Each decoder layer applies three sublayers:</p>
<ol type="1">
<li><p><strong>Masked self-attention</strong>: Only 1 token so far, so the <img src="https://latex.codecogs.com/png.latex?1%20%5Ctimes%201"> causal attention matrix is trivially “attend to self.” Shape: <code>1 × 1 × 768</code>.</p></li>
<li><p><strong>Cross-attention</strong>: Q comes from the decoder hidden state (<code>1 × 1 × 768</code>). K and V come from the encoder output (<code>1 × 3 × 768</code>). Attention weights have shape <code>1 × 1 × 3</code> — the single decoder position attends to all 3 encoder positions. Output: <code>1 × 1 × 768</code>.</p></li>
<li><p><strong>FFN</strong>: <code>1 × 1 × 768</code> processed position-wise.</p></li>
</ol>
<hr>
<p><strong>Step 7 — LM head.</strong></p>
<p>The decoder output at the final position (<code>1 × 1 × 768</code>) is projected to <code>vocab_sz</code> via a linear layer, then softmax gives a probability distribution over the vocabulary.</p>
<p>Shape: <code>1 × 1 × 768</code> → <code>1 × 1 × vocab_sz</code> → sample token → e.g., <code>"Die"</code> (German “The”)</p>
<hr>
<p><strong>Step 8 — Autoregressive loop.</strong></p>
<p>Append <code>"Die"</code> to the decoder input. Repeat Steps 6–7 with decoder input <code>[BOS, "Die"]</code> to generate the next token. Continue until <code>[EOS]</code> is sampled or the maximum length is reached.</p>
<hr>
<p>The key insight from this walkthrough: <strong>the encoder runs once</strong> for the full source sequence. The <strong>decoder runs once per generated token</strong>, attending to the full encoder output (which never changes) at every step via cross-attention.</p>
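<p>Steps 5–8 amount to a generation loop around the <code>encode</code>/<code>decode</code>/<code>lm_head</code> interface shown earlier. A minimal sketch with greedy sampling (the <code>bos_id</code>/<code>eos_id</code> arguments and greedy pick are illustrative choices; real systems often use beam search or temperature sampling):</p>

```python
import torch

@torch.no_grad()
def greedy_translate(model, src, bos_id, eos_id, max_len=50):
    # Run the encoder ONCE for the whole source sequence
    encoder_output = model.encode(src)              # B x T_enc x embed_dim
    tgt = torch.full((src.size(0), 1), bos_id)      # start with [BOS]: B x 1
    for _ in range(max_len):
        dec = model.decode(tgt, encoder_output)     # B x T_dec x embed_dim
        logits = model.lm_head(dec[:, -1])          # last position only: B x vocab_sz
        next_tok = logits.argmax(-1, keepdim=True)  # greedy pick: B x 1
        tgt = torch.cat([tgt, next_tok], dim=1)     # append and repeat
        if (next_tok == eos_id).all():              # stop once every row emits [EOS]
            break
    return tgt
```

<p>Note that <code>encoder_output</code> is computed once and reused on every iteration, while the decoder is re-run per generated token.</p>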
</section>
<section id="what-transformers-actually-learn" class="level2">
<h2 class="anchored" data-anchor-id="what-transformers-actually-learn">15. What Transformers Actually Learn</h2>
<p>Understanding the architecture is one thing; understanding what trained Transformers actually compute is another. Here is a brief map of empirical findings.</p>
<section id="attention-head-specialization" class="level3">
<h3 class="anchored" data-anchor-id="attention-head-specialization">15.1 Attention Head Specialization</h3>
<p>Clark et al.&nbsp;(2019) systematically analyzed BERT’s attention patterns across all layers and heads and found striking specialization:</p>
<ul>
<li><strong>Syntactic dependency heads</strong>: Certain heads consistently attend from a token to its syntactic governor (the word it depends on), recovering dependency parse relationships with high accuracy — without ever being trained on parse labels.</li>
<li><strong>Positional heads</strong>: Some heads attend predominantly to adjacent tokens (the previous or next token), implementing local sliding-window attention.</li>
<li><strong><code>[SEP]</code> heads</strong>: Many heads in middle layers attend heavily to <code>[SEP]</code> tokens. The interpretation: when no strong relationship exists, these heads use <code>[SEP]</code> as a “garbage collector” — routing excess attention somewhere harmless.</li>
</ul>
<p>This specialization is <strong>emergent</strong>, not designed. It arises purely from the training signal on downstream tasks.</p>
</section>
<section id="ffn-layers-as-factual-memories" class="level3">
<h3 class="anchored" data-anchor-id="ffn-layers-as-factual-memories">15.2 FFN Layers as Factual Memories</h3>
<p>Geva et al.&nbsp;(2021) showed that FFN sublayers act as key-value memories. The first linear layer’s weight rows act as “keys” that activate on specific input patterns; the second linear layer’s corresponding columns act as “values” that are retrieved and output.</p>
<p>This framing explains where factual knowledge lives in a language model. When a model correctly completes “The Eiffel Tower is located in ___”, the relevant association (Eiffel Tower → Paris) is likely stored as a key-value pair in the FFN weights of one or more layers — not in the attention matrices.</p>
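<p>The key-value view is concrete enough to write down. In the toy sketch below (random weights, ReLU standing in for the usual activation), the FFN output is literally a weighted sum of the second matrix’s rows — the “values” — weighted by how strongly the input activated each “key”:</p>

```python
import torch

embed_dim, ffn_dim = 4, 3  # toy sizes

W1 = torch.randn(embed_dim, ffn_dim)  # column i acts as "key" i
W2 = torch.randn(ffn_dim, embed_dim)  # row i is the "value" paired with key i

x = torch.randn(embed_dim)            # one token's representation
coeffs = torch.relu(x @ W1)           # how strongly each key matches: (ffn_dim,)
output = coeffs @ W2                  # FFN output: (embed_dim,)

# The same computation written as explicit memory retrieval:
retrieved = sum(c * W2[i] for i, c in enumerate(coeffs))
print(torch.allclose(output, retrieved, atol=1e-5))  # True
```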
</section>
<section id="layer-depth-and-abstraction" class="level3">
<h3 class="anchored" data-anchor-id="layer-depth-and-abstraction">15.3 Layer Depth and Abstraction</h3>
<p>Probing classifiers — small models trained to predict linguistic properties from internal representations — consistently find that:</p>
<ul>
<li><strong>Early layers</strong> (1–4): Surface-level features — part-of-speech tags, token identity, local syntax.</li>
<li><strong>Middle layers</strong> (5–12): Syntactic structure, phrase-level groupings, coreference.</li>
<li><strong>Later layers</strong>: Task-specific, abstract semantic features.</li>
</ul>
<p>The architecture explains <em>why</em> this gradient exists. Early layers receive representations that have undergone very little contextualization — essentially just the token and positional embeddings. They can only access local, surface-level patterns. Later layers, on the other hand, are reading from a residual stream that has already accumulated many rounds of attention and FFN processing. Each layer builds on the contextualized representations produced by all previous layers, enabling increasingly abstract structures to emerge. The depth gradient is not a design choice — it is a direct consequence of how information accumulates through residual connections.</p>
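<p>For intuition, a probing classifier is nothing more than a frozen-representation experiment. The sketch below uses random tensors as stand-ins for a model’s layer activations and part-of-speech labels — the point is the shape of the experiment, not the data:</p>

```python
import torch
import torch.nn as nn

num_tokens, embed_dim, num_tags = 1000, 768, 17  # 17 ~ universal POS tag count

# Stand-ins: frozen activations from some layer, plus gold POS labels
layer_activations = torch.randn(num_tokens, embed_dim)
pos_tags = torch.randint(0, num_tags, (num_tokens,))

# The probe is deliberately tiny -- a single linear layer. If it predicts the
# property well from frozen activations, that information is present there.
probe = nn.Linear(embed_dim, num_tags)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(10):  # a few steps, just to show the loop
    loss = nn.functional.cross_entropy(probe(layer_activations), pos_tags)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

<p>Running the same probe against every layer’s activations, and comparing accuracies, is what produces the depth-vs-abstraction gradient described above.</p>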
</section>
</section>
<section id="modern-improvements" class="level2">
<h2 class="anchored" data-anchor-id="modern-improvements">16. Modern Improvements</h2>
<p>The original Transformer (2017) has been refined substantially. Here are the key improvements that appear in modern LLMs, with brief explanations of why each was adopted:</p>
<table class="table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Improvement</th>
<th>What changes</th>
<th>Why</th>
<th>Used in</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Pre-LayerNorm</strong></td>
<td>LN moves inside the residual branch</td>
<td>Training stability at scale; no warm-up required</td>
<td>GPT-2, LLaMA, Mistral</td>
</tr>
<tr class="even">
<td><strong>Rotary Position Embedding (RoPE)</strong></td>
<td>Replaces absolute pos. embeddings with rotation of Q and K</td>
<td>Better length generalization; relative position naturally encoded at every layer</td>
<td>LLaMA, Mistral, GPT-NeoX, Qwen</td>
</tr>
<tr class="odd">
<td><strong>Grouped Query Attention (GQA)</strong></td>
<td>Multiple Q heads share a single K and V head</td>
<td>Reduces KV cache memory at inference without meaningful accuracy loss</td>
<td>LLaMA 2/3, Mistral</td>
</tr>
<tr class="even">
<td><strong>SwiGLU activation</strong></td>
<td>Replaces GELU in FFN with a gated linear unit: <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BSwiGLU%7D(x)%20=%20%5Ctext%7BSwish%7D(xW_1)%20%5Codot%20xW_2"></td>
<td>Consistently higher benchmark performance at equivalent parameter counts</td>
<td>LLaMA, PaLM, Gemma</td>
</tr>
<tr class="odd">
<td><strong>FlashAttention</strong></td>
<td>Reorders attention computation to minimize memory bandwidth</td>
<td><img src="https://latex.codecogs.com/png.latex?O(N)"> memory instead of <img src="https://latex.codecogs.com/png.latex?O(N%5E2)">; 2–4x faster; exact attention, not an approximation</td>
<td>Used in most modern training stacks</td>
</tr>
<tr class="even">
<td><strong>RMSNorm</strong></td>
<td>Replaces LayerNorm with root-mean-square normalization (no mean subtraction)</td>
<td>Simpler, ~10% faster, equivalent quality</td>
<td>LLaMA, Mistral, Gemma</td>
</tr>
</tbody>
</table>
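<p>Two of these improvements are small enough to implement directly. A minimal sketch of RMSNorm and a SwiGLU FFN (hidden sizes are illustrative; LLaMA-style models choose the FFN width by a specific rule):</p>

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale by root-mean-square only -- no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLUFFN(nn.Module):
    """Gated FFN: Swish(x W1) gates the parallel projection x W2."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate branch
        self.w2 = nn.Linear(dim, hidden_dim, bias=False)  # value branch
        self.w3 = nn.Linear(hidden_dim, dim, bias=False)  # down-projection

    def forward(self, x):
        return self.w3(nn.functional.silu(self.w1(x)) * self.w2(x))

x = torch.randn(1, 3, 768)
y = SwiGLUFFN(768, 2048)(RMSNorm(768)(x))
print(y.shape)  # torch.Size([1, 3, 768])
```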
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">17. Conclusion</h2>
<p>In this post, we built the Transformer architecture from scratch — starting from the failure modes of RNNs, building the attention mechanism step by step, implementing each component in PyTorch with annotated shapes, and assembling the encoder-only, decoder-only, and encoder-decoder variants. We also traced a complete end-to-end forward pass and surveyed what trained Transformers empirically learn.</p>
<p>The architecture’s dominance across language, vision, speech, and biology stems from a coherent set of design choices — each solving a specific problem with a specific mechanism.</p>
<section id="key-takeaways" class="level3">
<h3 class="anchored" data-anchor-id="key-takeaways">Key Takeaways</h3>
<ol type="1">
<li><p><strong>Attention replaces sequential recurrence with parallel direct communication.</strong> Every token attends to every other token in a single matrix operation. No hidden state bottleneck, no sequential dependency, no vanishing gradient through time — the fundamental failures of RNNs are eliminated at the architectural level, not patched over.</p></li>
<li><p><strong>Q, K, V separation is intentional, not arbitrary.</strong> What a token <em>wants</em> (query), what it <em>offers</em> (key), and what it <em>says</em> (value) are three genuinely different roles. Separating them — as in the library lookup analogy — gives the model the flexibility to learn very different relationships for each. A single projection would conflate all three.</p></li>
<li><p><strong>Multi-head attention gives the model multiple simultaneous perspectives.</strong> Each head operates in its own lower-dimensional subspace and learns to track different relationship types: one head for syntax, one for coreference, one for local context. This specialization is emergent — it arises from the training signal, not from any explicit design constraint.</p></li>
<li><p><strong>The FFN is the knowledge store; attention is the routing system.</strong> Attention decides which tokens talk to which and mixes their representations. The FFN then transforms each token’s representation independently and nonlinearly — this is where factual associations are stored. Without the FFN, stacked attention layers collapse to a single linear transformation.</p></li>
<li><p><strong>Skip connections and LayerNorm make depth trainable.</strong> Residual connections create gradient highways that bypass each sublayer entirely, making it possible to train networks dozens of layers deep. Pre-LayerNorm (inside the residual branch) stabilizes training at scale without requiring learning rate warm-up.</p></li>
<li><p><strong>Architecture determines what tokens can see; everything else is shared.</strong> The only fundamental difference between an encoder and a decoder is the causal mask. The same attention mechanism, FFN, LayerNorm, and residual structure underlies all three variants — encoder-only, decoder-only, and encoder-decoder — differing only in which tokens each position is allowed to attend to.</p></li>
</ol>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Core Architecture Is Stable
</div>
</div>
<div class="callout-body-container callout-body">
<p>Despite years of improvements — RoPE, GQA, SwiGLU, FlashAttention, RMSNorm — the fundamental architecture described in this post has not changed since 2017. The overall structure (attention + FFN + residual + norm, stacked <img src="https://latex.codecogs.com/png.latex?N"> times) is the same in GPT-4, LLaMA 3, and Gemini as it was in the original “Attention Is All You Need.” If you understand this post, you understand the backbone of essentially all modern AI.</p>
</div>
</div>
<p><strong>What to explore next:</strong></p>
<ul>
<li><a href="../../posts/nlp/GPT2-From-Scratch.html"><strong>Building GPT-2 from Scratch</strong></a> — takes the decoder-only architecture from this post and implements a full GPT-2 training run, including mixed precision, Flash Attention, and distributed training</li>
<li><a href="../../posts/nlp/BPE-Tokenizer.html"><strong>BPE Tokenizer from Scratch</strong></a> — implements the tokenizer that sits upstream of everything in this post</li>
<li><a href="../../posts/nlp/Tokenization-Strategies.html"><strong>Tokenization Strategies</strong></a> — compares character, word, and subword tokenization with code examples and real model outputs</li>
</ul>
</section>
</section>
<section id="references-resources" class="level2">
<h2 class="anchored" data-anchor-id="references-resources">References &amp; Resources</h2>
<ul>
<li>Vaswani et al.&nbsp;(2017). <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>. — The original Transformer paper.</li>
<li>Devlin et al.&nbsp;(2018). <a href="https://arxiv.org/abs/1810.04805">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</a>.</li>
<li>Ba et al.&nbsp;(2016). <a href="https://arxiv.org/abs/1607.06450">Layer Normalization</a>. — The original LayerNorm paper.</li>
<li>He et al.&nbsp;(2016). <a href="https://arxiv.org/abs/1512.03385">Deep Residual Learning for Image Recognition</a>. — Skip connections and loss landscape analysis.</li>
<li>Clark et al.&nbsp;(2019). <a href="https://arxiv.org/abs/1906.04341">What Does BERT Look At? An Analysis of BERT’s Attention</a>. — BERTology: what different attention heads learn.</li>
<li>Geva et al.&nbsp;(2021). <a href="https://arxiv.org/abs/2012.14913">Transformer Feed-Forward Layers Are Key-Value Memories</a>. — FFN layers as associative memory stores.</li>
<li>Su et al.&nbsp;(2021). <a href="https://arxiv.org/abs/2104.09864">RoFormer: Enhanced Transformer with Rotary Position Embedding</a>. — The RoPE paper.</li>
<li>Dao et al.&nbsp;(2022). <a href="https://arxiv.org/abs/2205.14135">FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness</a>.</li>
<li>Srivastava et al.&nbsp;(2014). <a href="https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf">Dropout: A Simple Way to Prevent Neural Networks from Overfitting</a>.</li>
<li><a href="https://nlp.seas.harvard.edu/2018/04/03/attention.html">The Annotated Transformer</a> — line-by-line walkthrough of the original paper’s code.</li>
<li><a href="https://github.com/karpathy/nanoGPT">Andrej Karpathy’s NanoGPT</a> — minimal, readable GPT implementation.</li>
<li><a href="https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/">Lilian Weng’s The Transformer Family v2.0</a> — comprehensive survey of Transformer variants.</li>
</ul>



</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>NLP</category>
  <guid>https://imaddabbura.github.io/posts/nlp/Transformer-Architecture-Explained.html</guid>
  <pubDate>Mon, 14 Feb 2022 06:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/nlp/images/transformer-arch.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Inside LSTMs: Implementing and Optimizing Sequential Models from First Principles</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/nlp/LSTM-Annotated-Implementation.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge evergreen">evergreen</span></div>
<section id="why-implement-an-lstm-from-scratch" class="level2">
<h2 class="anchored" data-anchor-id="why-implement-an-lstm-from-scratch">Why Implement an LSTM from Scratch?</h2>
<p>If you’ve used <code>nn.LSTM</code> in PyTorch, you’ve seen it work. But <em>how</em> does it decide what to remember and what to forget? Why does it need four gates instead of one? And why is it so much better than a vanilla RNN at handling long sequences?</p>
<p>The best way to answer these questions is to build one yourself. In this post, we’ll start with the problem that motivated LSTMs (vanishing gradients), build up the intuition for how they solve it, then implement both <code>LSTMCell</code> and a multi-layer <code>LSTM</code> from scratch in PyTorch — verifying each against the official implementation down to floating-point precision.</p>
</section>
<section id="the-vanishing-gradient-problem" class="level2">
<h2 class="anchored" data-anchor-id="the-vanishing-gradient-problem">The Vanishing Gradient Problem</h2>
<p><strong>Long Short-Term Memory (LSTM)</strong> is a recurrent neural network architecture introduced by <a href="https://www.bioinf.jku.at/publications/older/2604.pdf">Hochreiter and Schmidhuber (1997)</a> to solve the <strong>vanishing gradient problem</strong> — the central failure mode of vanilla RNNs on long sequences.</p>
<p>To understand why LSTMs exist, we first need to understand what goes wrong. In a vanilla RNN, the hidden state is <em>completely overwritten</em> at every time step:</p>
<p><img src="https://latex.codecogs.com/png.latex?h_t%20=%20%5Ctanh(W_%7Bhh%7D%20%5Ccdot%20h_%7Bt-1%7D%20+%20W_%7Bxh%7D%20%5Ccdot%20x_t%20+%20b)"></p>
<p>During backpropagation, the gradient of the loss with respect to an early hidden state <img src="https://latex.codecogs.com/png.latex?h_1"> must pass through the <img src="https://latex.codecogs.com/png.latex?%5Ctanh"> nonlinearity and the weight matrix <img src="https://latex.codecogs.com/png.latex?W_%7Bhh%7D"> at <em>every single time step</em> between <img src="https://latex.codecogs.com/png.latex?h_T"> and <img src="https://latex.codecogs.com/png.latex?h_1">. If the sequence has 100 tokens, the gradient is multiplied by <img src="https://latex.codecogs.com/png.latex?W_%7Bhh%7D"> roughly 100 times. If the dominant eigenvalue of <img src="https://latex.codecogs.com/png.latex?W_%7Bhh%7D"> is even slightly less than 1 — say 0.9 — the gradient shrinks by a factor of <img src="https://latex.codecogs.com/png.latex?0.9%5E%7B100%7D%20%5Capprox%200.00003">. The signal from early tokens effectively disappears.</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Fundamental Issue
</div>
</div>
<div class="callout-body-container callout-body">
<p>The problem isn’t just mathematical — it has a concrete consequence: <strong>vanilla RNNs can’t learn long-range dependencies</strong>. If the answer to a question depends on a word 50 tokens earlier in the sentence, the gradient signal connecting them is essentially zero. The model can’t learn that relationship, no matter how long you train.</p>
</div>
</div>
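<p>The shrinkage argument is easy to check numerically. A toy sketch: scale a random matrix so its spectral radius is roughly 0.9, then watch what 100 repeated multiplications do to a vector (a stand-in for the backward-flowing gradient):</p>

```python
import torch

torch.manual_seed(0)
hidden = 64
W = torch.randn(hidden, hidden)
W = W * (0.9 / torch.linalg.eigvals(W).abs().max())  # spectral radius ~= 0.9

g = torch.ones(hidden)  # stand-in for a gradient flowing backward in time
norms = []
for _ in range(100):
    g = W @ g
    norms.append(g.norm().item())

# After 100 steps the norm has collapsed by orders of magnitude,
# mirroring the 0.9^100 ~= 0.00003 factor discussed above
print(norms[0], norms[-1])
```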
</section>
<section id="how-lstms-fix-it" class="level2">
<h2 class="anchored" data-anchor-id="how-lstms-fix-it">How LSTMs Fix It</h2>
<p>The LSTM introduces a <strong>cell state</strong> <img src="https://latex.codecogs.com/png.latex?c_t"> — a separate memory channel that runs parallel to the hidden state. The critical difference is in <em>how</em> it gets updated:</p>
<table class="table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th>Vanilla RNN</th>
<th>LSTM Cell State</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Update rule</strong></td>
<td><img src="https://latex.codecogs.com/png.latex?h_t%20=%20%5Ctanh(W%20%5Ccdot%20h_%7Bt-1%7D%20+%20%5Cldots)"></td>
<td><img src="https://latex.codecogs.com/png.latex?c_t%20=%20f_t%20%5Codot%20c_%7Bt-1%7D%20+%20i_t%20%5Codot%20g_t"></td>
</tr>
<tr class="even">
<td><strong>Mechanism</strong></td>
<td>Complete <em>replacement</em> through nonlinearity</td>
<td>Selective <em>modification</em> via additive gating</td>
</tr>
<tr class="odd">
<td><strong>Gradient flow</strong></td>
<td>Must pass through <img src="https://latex.codecogs.com/png.latex?%5Ctanh"> and <img src="https://latex.codecogs.com/png.latex?W"> at every step</td>
<td>Can flow <em>directly</em> through the forget gate <img src="https://latex.codecogs.com/png.latex?f_t"></td>
</tr>
<tr class="even">
<td><strong>Long-range memory</strong></td>
<td>Exponential decay</td>
<td>Controlled retention</td>
</tr>
</tbody>
</table>
<p>The cell state update is <strong>additive</strong>: when the forget gate <img src="https://latex.codecogs.com/png.latex?f_t"> is close to 1 and the input gate <img src="https://latex.codecogs.com/png.latex?i_t"> is close to 0, the cell state passes through <em>unchanged</em>: <img src="https://latex.codecogs.com/png.latex?c_t%20%5Capprox%20c_%7Bt-1%7D">. Gradients flow backward through time with minimal decay — no weight matrix or nonlinearity in the way.</p>
<p>If this looks familiar, it should — it’s the same principle behind <strong>residual connections</strong> in ResNets. In a ResNet, each layer computes <img src="https://latex.codecogs.com/png.latex?y%20=%20F(x)%20+%20x">: the input passes through unchanged, and the layer only learns the <em>residual</em>. The LSTM cell state works the same way, but across <strong>time instead of depth</strong>: the previous cell state passes through (scaled by <img src="https://latex.codecogs.com/png.latex?f_t">), and the network adds a residual update (<img src="https://latex.codecogs.com/png.latex?i_t%20%5Codot%20g_t">). Both create a gradient highway. ResNets made it possible to train 100+ layer networks; the LSTM cell state makes it possible to learn dependencies across 100+ time steps. Same insight, different axis.</p>
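<p>The full update fits in a few lines. This sketch follows PyTorch’s gate ordering (input, forget, cell, output) and packs all four gate computations into one matrix multiply; the parameter names are illustrative:</p>

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step. x_t: (B, input_sz); h_prev, c_prev: (B, hidden_sz);
    W_x: (input_sz, 4*hidden_sz); W_h: (hidden_sz, 4*hidden_sz); b: (4*hidden_sz,)."""
    gates = x_t @ W_x + h_prev @ W_h + b             # all four gates at once
    i, f, g, o = gates.chunk(4, dim=-1)              # PyTorch's i, f, g, o ordering
    i, f, o = i.sigmoid(), f.sigmoid(), o.sigmoid()  # gates squashed into (0, 1)
    g = g.tanh()                                     # candidate values in (-1, 1)

    c_t = f * c_prev + i * g   # ADDITIVE update: the gradient highway
    h_t = o * c_t.tanh()       # exposed output, read selectively from c_t
    return h_t, c_t
```

<p>Setting <code>f ≈ 1</code> and <code>i ≈ 0</code> in the last two lines reproduces <code>c_t ≈ c_prev</code> — the unchanged pass-through described above.</p>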
<p align="center">
<img src="https://imaddabbura.github.io/posts/nlp/images/lstm-cell.jpeg" style="width: 500px;"><br>

</p><center>
<u><b><font color="00b7e4">Figure 1:</font></b></u> The LSTM cell. The horizontal line at the top is the cell state — the “highway” through time. The four yellow boxes (<img src="https://latex.codecogs.com/png.latex?%5Csigma,%20%5Csigma,%20%5Ctanh,%20%5Csigma">) are the forget, input, cell, and output gates respectively. The cell state is updated additively (the ⊕ node), while the gates use element-wise multiplication (⊗) to control information flow.
</center>

<p></p>
<section id="why-two-states" class="level3">
<h3 class="anchored" data-anchor-id="why-two-states">Why Two States?</h3>
<p>A vanilla RNN has a single hidden state that must do <em>everything</em>: store long-term memory, carry short-term context, and produce the output that downstream layers consume. That’s too many jobs for one vector — optimizing the hidden state for the current prediction destroys the long-term information stored in it.</p>
<p>LSTMs split this into two specialized roles:</p>
<p><strong>Cell state (<img src="https://latex.codecogs.com/png.latex?c_t">): the long-term internal memory.</strong> The cell state is the LSTM’s private memory — never directly exposed to the rest of the network. Its job is to <em>retain information across long distances</em> without interference. Because it’s updated additively, gradients can flow through it across hundreds of time steps. Think of it as a notebook that the LSTM writes to and reads from, but never shows to anyone directly.</p>
<p><strong>Hidden state (<img src="https://latex.codecogs.com/png.latex?h_t">): the short-term working output.</strong> The hidden state is what the LSTM <em>exposes</em> to the outside world — the input to the next layer, the softmax, or whatever comes next. It’s computed by selectively reading from the cell state via the output gate: <img src="https://latex.codecogs.com/png.latex?h_t%20=%20o_t%20%5Codot%20%5Ctanh(c_t)">. The output gate decides: <em>“Given everything I know and the current context, what’s relevant right now?”</em></p>
<p>This separation is crucial. The cell state can hold information like “the subject is plural” or “we’re inside a quotation” for as long as needed, without being distorted by the demands of predicting intermediate tokens. When it <em>is</em> needed, the output gate reads it out at exactly the right moment.</p>
<table class="table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th>Cell State (<img src="https://latex.codecogs.com/png.latex?c_t">)</th>
<th>Hidden State (<img src="https://latex.codecogs.com/png.latex?h_t">)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Role</strong></td>
<td>Long-term memory</td>
<td>Short-term working output</td>
</tr>
<tr class="even">
<td><strong>Visible to</strong></td>
<td>Only the LSTM itself (internal)</td>
<td>Next layer, softmax, classifier (external)</td>
</tr>
<tr class="odd">
<td><strong>Updated by</strong></td>
<td>Forget gate (erase) + input gate (write)</td>
<td>Output gate reading from cell state</td>
</tr>
<tr class="even">
<td><strong>Gradient flow</strong></td>
<td>Additive — gradients pass through cleanly</td>
<td>Through tanh and output gate — more lossy</td>
</tr>
<tr class="odd">
<td><strong>Analogy</strong></td>
<td>A notebook you write in privately</td>
<td>The answer you speak aloud when asked</td>
</tr>
</tbody>
</table>
</section>
<section id="a-concrete-example" class="level3">
<h3 class="anchored" data-anchor-id="a-concrete-example">A Concrete Example</h3>
<p>Consider: <em>“The cat, which sat on the mat in the living room near the window overlooking the garden, <strong>was</strong> sleeping.”</em> The verb “was” must agree with “cat” (singular), not “garden” or “window” — a dependency spanning ~15 tokens. A vanilla RNN’s gradient signal from “was” back to “cat” would be multiplied by <img src="https://latex.codecogs.com/png.latex?W_%7Bhh%7D"> fifteen times — likely vanishing. An LSTM can keep “cat = singular noun” in its cell state with the forget gate near 1, preserving the information until it’s needed at “was.”</p>
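<p>This gradient collapse is easy to simulate. In the sketch below (the weight scale and forget-gate value are illustrative, not trained), a gradient vector pushed backward through fifteen multiplicative steps shrinks toward zero, while the same vector carried along an additive, gated path barely changes:</p>

```python
import torch

torch.manual_seed(0)
hidden_sz = 100

# A recurrent weight matrix with deliberately small random entries.
W_hh = 0.02 * torch.randn(hidden_sz, hidden_sz)

# Multiplicative path (vanilla RNN): the gradient from "was" back to "cat"
# is hit by W_hh^T once per intervening token.
grad = torch.ones(hidden_sz)
for _ in range(15):
    grad = W_hh.T @ grad
mult_norm = grad.norm().item()

# Additive path (LSTM cell state): with the forget gate held near 1,
# each step only scales the gradient by f_t.
grad = torch.ones(hidden_sz)
for _ in range(15):
    grad = 0.99 * grad
add_norm = grad.norm().item()

print(f"multiplicative path: {mult_norm:.2e}")  # shrinks toward zero
print(f"additive path:       {add_norm:.2e}")   # stays near its initial norm
```

<p>The exact numbers depend on the weight scale, but the qualitative gap is the point: repeated multiplication compounds, addition does not.</p>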
<p>One important constraint: RNNs and LSTMs are <strong>sequential models</strong> — the output at time <img src="https://latex.codecogs.com/png.latex?t"> depends on the hidden state from <img src="https://latex.codecogs.com/png.latex?t-1">. We cannot parallelize across time steps; we must iterate one token at a time. This is the limitation that the Transformer (<a href="https://arxiv.org/abs/1706.03762">Vaswani et al., 2017</a>) later addressed with self-attention.</p>
</section>
</section>
<section id="inside-the-lstm-cell" class="level2">
<h2 class="anchored" data-anchor-id="inside-the-lstm-cell">Inside the LSTM Cell</h2>
<p>An <code>LSTMCell</code> computes four gates, then uses them to update the cell and hidden states. Each gate has the same dimension as the hidden state:</p>
<img src="https://latex.codecogs.com/png.latex?%5Cbegin%7Barray%7D%7Bll%7D%20%5C%5C%0Ai_t%20=%20%5Csigma(W_%7Bii%7D%20x_t%20+%20b_%7Bii%7D%20+%20W_%7Bhi%7D%20h_%7Bt-1%7D%20+%20b_%7Bhi%7D)%20%5C%5C%0Af_t%20=%20%5Csigma(W_%7Bif%7D%20x_t%20+%20b_%7Bif%7D%20+%20W_%7Bhf%7D%20h_%7Bt-1%7D%20+%20b_%7Bhf%7D)%20%5C%5C%0Ag_t%20=%20%5Ctanh(W_%7Big%7D%20x_t%20+%20b_%7Big%7D%20+%20W_%7Bhg%7D%20h_%7Bt-1%7D%20+%20b_%7Bhg%7D)%20%5C%5C%0Ao_t%20=%20%5Csigma(W_%7Bio%7D%20x_t%20+%20b_%7Bio%7D%20+%20W_%7Bho%7D%20h_%7Bt-1%7D%20+%20b_%7Bho%7D)%20%5C%5C%0Ac_t%20=%20f_t%20%5Codot%20c_%7Bt-1%7D%20+%20i_t%20%5Codot%20g_t%20%5C%5C%0Ah_t%20=%20o_t%20%5Codot%20%5Ctanh(c_t)%20%5C%5C%0A%5Cend%7Barray%7D">
<table class="table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Gate</th>
<th>Name</th>
<th>Activation</th>
<th>What It Does</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?i_t"></td>
<td><strong>Input gate</strong></td>
<td>Sigmoid (0–1)</td>
<td>How much of the <em>new</em> candidate values to write into the cell</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?f_t"></td>
<td><strong>Forget gate</strong></td>
<td>Sigmoid (0–1)</td>
<td>How much of the <em>old</em> cell state to keep (1 = remember everything, 0 = forget everything)</td>
</tr>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?g_t"></td>
<td><strong>Cell gate</strong></td>
<td>Tanh (-1 to 1)</td>
<td>The candidate new values to potentially add to the cell state</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?o_t"></td>
<td><strong>Output gate</strong></td>
<td>Sigmoid (0–1)</td>
<td>How much of the cell state to expose as the hidden state output</td>
</tr>
</tbody>
</table>
<p>Notice the activation functions: three gates use <strong>sigmoid</strong>, but the cell gate uses <strong>tanh</strong>. This isn’t arbitrary — it reflects their different roles. The sigmoid gates (<img src="https://latex.codecogs.com/png.latex?i_t,%20f_t,%20o_t">) answer <em>“how much?”</em> questions: how much to write, how much to keep, how much to expose. Sigmoid squashes values to (0, 1), making each gate a dimmer switch that scales its input between “fully off” and “fully on.” The cell gate <img src="https://latex.codecogs.com/png.latex?g_t"> answers a different question: <em>“what values?”</em> It proposes candidate content to write into the cell state. Tanh maps to (-1, 1), which is critical — it allows the cell state to both <strong>increase and decrease</strong>. If <img src="https://latex.codecogs.com/png.latex?g_t"> used sigmoid (0, 1), the additive update <img src="https://latex.codecogs.com/png.latex?i_t%20%5Codot%20g_t"> could only ever push the cell state upward, and it would grow without bound. Tanh lets the network write negative corrections, keeping the cell state centered and bounded.</p>
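<p>A small simulation makes the asymmetry concrete. Holding the gates at fixed, hypothetical values (forget gate fully open, input gate at 0.5) and feeding random pre-activations, a sigmoid candidate can only ratchet the cell state upward, while tanh writes largely cancel out:</p>

```python
import torch

torch.manual_seed(0)

# Hypothetical pre-activations for the candidate gate over 200 steps.
pre_act = torch.randn(200)

# Hold the forget gate fully open (f = 1) and the input gate at 0.5,
# isolating the effect of the candidate's activation function.
f, i = 1.0, 0.5

c_sigmoid = 0.0
c_tanh = 0.0
for z in pre_act:
    c_sigmoid = f * c_sigmoid + i * torch.sigmoid(z).item()  # every write is positive
    c_tanh = f * c_tanh + i * torch.tanh(z).item()           # writes can cancel out

print(f"cell state, sigmoid candidate: {c_sigmoid:+.1f}")  # drifts steadily upward
print(f"cell state, tanh candidate:    {c_tanh:+.1f}")     # hovers near zero
```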
<section id="independent-gates-four-operating-modes" class="level3">
<h3 class="anchored" data-anchor-id="independent-gates-four-operating-modes">Independent Gates: Four Operating Modes</h3>
<p>A critical design choice is that the input gate and forget gate are <strong>completely independent</strong> — computed from separate weight matrices and biases, with nothing constraining them to sum to 1. The network is free to set both high, both low, or any combination.</p>
<p>Contrast this with the GRU (Gated Recurrent Unit), where the equivalent gates <em>are</em> complementary: a single update gate <img src="https://latex.codecogs.com/png.latex?z_t"> weights new content by <img src="https://latex.codecogs.com/png.latex?z_t"> and old content by <img src="https://latex.codecogs.com/png.latex?(1%20-%20z_t)">, forcing a trade-off. The GRU is more parameter-efficient, but less expressive — it can only interpolate between “keep old” and “write new.”</p>
<p>The LSTM’s independence gives it four distinct operating modes:</p>
<table class="table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>Forget <img src="https://latex.codecogs.com/png.latex?f_t"></th>
<th>Input <img src="https://latex.codecogs.com/png.latex?i_t"></th>
<th>Mode</th>
<th>Effect</th>
<th>When It’s Useful</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?%5Capprox%201"></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Capprox%201"></td>
<td><strong>Accumulate</strong></td>
<td>Keep old state <em>and</em> write new info</td>
<td>Building up a running representation (e.g., accumulating features of a described entity)</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?%5Capprox%200"></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Capprox%201"></td>
<td><strong>Replace</strong></td>
<td>Flush old state, write new info</td>
<td>Topic change, sentence boundary — start fresh with new content</td>
</tr>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?%5Capprox%201"></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Capprox%200"></td>
<td><strong>Preserve</strong></td>
<td>Keep old state, ignore current input</td>
<td>Carrying information across irrelevant tokens (e.g., remembering subject across a parenthetical)</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?%5Capprox%200"></td>
<td><img src="https://latex.codecogs.com/png.latex?%5Capprox%200"></td>
<td><strong>Reset</strong></td>
<td>Forget old state <em>and</em> ignore input</td>
<td>Clearing a dimension that’s no longer needed</td>
</tr>
</tbody>
</table>
<p>The GRU can express only the Replace and Preserve rows of this table, since its single update gate forces forgetting and writing to trade off against each other. This is why LSTMs tend to outperform GRUs on tasks requiring long-range memory: the accumulate mode lets information persist indefinitely while still absorbing new inputs, and the reset mode provides a clean mechanism for freeing capacity.</p>
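<p>All four modes fall directly out of the cell update <code>c_t = f * c_prev + i * g</code>. A one-dimensional sketch with hand-picked gate values:</p>

```python
import torch

# One cell-state dimension, shown across the four gate regimes.
c_prev = torch.tensor(0.8)   # existing memory
g = torch.tensor(-0.5)       # candidate value proposed from the current input

modes = {
    "accumulate": (1.0, 1.0),  # keep old AND write new
    "replace":    (0.0, 1.0),  # flush old, write new
    "preserve":   (1.0, 0.0),  # keep old, ignore input
    "reset":      (0.0, 0.0),  # clear everything
}

results = {}
for name, (f, i) in modes.items():
    # The LSTM cell update: c_t = f * c_{t-1} + i * g
    results[name] = (f * c_prev + i * g).item()

for name, c_t in results.items():
    print(f"{name:>10}: c_t = {c_t:+.2f}")
```

<p>Note that “accumulate” is the one combination a complementary-gate design cannot produce: the old value survives <em>and</em> the new value is added.</p>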
</section>
<section id="gates-as-learned-pattern-detectors" class="level3">
<h3 class="anchored" data-anchor-id="gates-as-learned-pattern-detectors">Gates as Learned Pattern Detectors</h3>
<p>It’s tempting to think of gates as simple switches, but each gate is a <strong>learned pattern detector</strong>. Just as a CNN filter activates when an input patch matches its learned visual pattern, a gate’s weight matrix learns to activate on specific <em>contextual patterns</em>: it produces a high output (close to 1 after the sigmoid) when the combination of <img src="https://latex.codecogs.com/png.latex?x_t"> and <img src="https://latex.codecogs.com/png.latex?h_%7Bt-1%7D"> matches its learned pattern. The difference is the domain: CNN filters detect <em>spatial</em> patterns in pixel neighborhoods, while gate weights detect <em>contextual</em> patterns across the current token and the sequence history.</p>
<p>Consider the forget gate: <img src="https://latex.codecogs.com/png.latex?f_t%20=%20%5Csigma(W_%7Bif%7D%20%5Ccdot%20x_t%20+%20W_%7Bhf%7D%20%5Ccdot%20h_%7Bt-1%7D%20+%20b_f)">. After training, specific rows of these weight matrices become specialized detectors:</p>
<ul>
<li>Some rows might detect <strong>“end of clause”</strong> patterns (a period, “but”) — signaling that old context should be flushed</li>
<li>Other rows might detect <strong>“continuation”</strong> patterns (a comma, “which”) — signaling that existing context should be preserved</li>
<li>Rows in the input gate might detect <strong>“salient new information”</strong> patterns (a named entity, a negation word) — signaling that this input should be written into memory</li>
</ul>
<p>This happens <strong>per dimension</strong> of the hidden state. The gate output is a vector, not a scalar — dimension 42 of the forget gate might be close to 0 (forget) while dimension 73 is close to 1 (keep), because each dimension stores different information and each gate dimension detects different patterns.</p>
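<p>A tiny illustration with hand-picked values: a forget-gate <em>vector</em> keeps some dimensions nearly intact and wipes others in a single elementwise update:</p>

```python
import torch

# A 4-dimensional cell state, gated by a forget-gate *vector* (not a scalar):
# each dimension is kept or dropped independently.
c_prev = torch.tensor([0.9, -0.4, 0.7, 0.2])
f_t = torch.tensor([0.98, 0.02, 0.95, 0.01])  # per-dimension keep/forget decisions

# Ignoring the write term (i_t ~= 0), the update is purely elementwise:
c_t = f_t * c_prev
print(c_t)  # dims 0 and 2 survive almost intact; dims 1 and 3 are wiped
```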
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Single-Matrix Trick
</div>
</div>
<div class="callout-body-container callout-body">
<p>Even though we describe four separate gates, in practice we compute them all in <strong>one matrix multiplication</strong> by concatenating the four weight matrices into a single <code>4 * hidden_size</code> matrix. We then split the result into four chunks. This is much faster because it replaces four small matmuls with one large one — better utilizing GPU parallelism and memory bandwidth.</p>
</div>
</div>
</section>
</section>
<section id="implementation" class="level2">
<h2 class="anchored" data-anchor-id="implementation">Implementation</h2>
<p>With the conceptual foundation in place, let’s turn these equations into code. We’ll build two modules — <code>LSTMCell</code> (one time step) and <code>LSTM</code> (full sequences with multiple layers) — verifying each against PyTorch’s official implementation.</p>
<section id="lstmcell" class="level3">
<h3 class="anchored" data-anchor-id="lstmcell"><code>LSTMCell</code></h3>
<p>We implement two versions: a verbose one that makes every operation explicit (separate weight matrices for each gate), and a compact one using <code>nn.Linear</code> with the single-matrix trick. Both produce identical results — the compact version is what you’d use in practice.</p>
<div id="4f325555-fdab-47c3-94a6-38a4a2e6dd11" class="cell" data-execution_count="1">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch.nn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> nn</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch.nn.functional <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> F</span></code></pre></div>
</details>
</div>
<div id="e7aa6509-79db-4a54-9d47-e8c656fa28d7" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Long version</span></span>
<span id="cb2-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> LSTMCellNew(nn.Module):</span>
<span id="cb2-3">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, input_sz, hidden_sz, bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>):</span>
<span id="cb2-4">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb2-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.weight_ih <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Parameter(torch.randn((input_sz, hidden_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)))</span>
<span id="cb2-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.weight_hh <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Parameter(torch.randn((hidden_sz, hidden_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)))</span>
<span id="cb2-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.bias_ih <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Parameter(torch.zeros(hidden_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>))</span>
<span id="cb2-8">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.bias_hh <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Parameter(torch.zeros(hidden_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>))</span>
<span id="cb2-9"></span>
<span id="cb2-10">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x, h, c):</span>
<span id="cb2-11">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## B x hidden_sz</span></span>
<span id="cb2-12">        out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.weight_ih <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> h <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.weight_hh <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.bias_ih <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.bias_hh</span>
<span id="cb2-13">        i, f, g, o <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.chunk(out, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb2-14">        i, f, o <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)</span>
<span id="cb2-15">        g <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.tanh(g)</span>
<span id="cb2-16">        c_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> f <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> g</span>
<span id="cb2-17">        h_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> o <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> torch.tanh(c_t)</span>
<span id="cb2-18">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> h_t, c_t</span></code></pre></div>
</div>
<div id="d5630f35-d6a3-435a-8a8d-0ae5c2b87601" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Short version utilizing linear layer module</span></span>
<span id="cb3-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> LSTMCellNew(nn.Module):</span>
<span id="cb3-3">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, input_sz, hidden_sz, bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>):</span>
<span id="cb3-4">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb3-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.ih <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(input_sz, hidden_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>bias)</span>
<span id="cb3-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.hh <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.Linear(hidden_sz, hidden_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>bias)</span>
<span id="cb3-7"></span>
<span id="cb3-8">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x, h, c):</span>
<span id="cb3-9">        out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.ih(x) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.hh(h)</span>
<span id="cb3-10">        i, f, g, o <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.chunk(out, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb3-11">        i, f, o <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)</span>
<span id="cb3-12">        g <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.tanh(g)</span>
<span id="cb3-13">        c_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> f <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> g</span>
<span id="cb3-14">        h_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> o <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> torch.tanh(c_t)</span>
<span id="cb3-15">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> h_t, c_t</span></code></pre></div>
</div>
<div id="2c9e3a64-ba74-4c22-84ad-77d212fe2f31" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">batch_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span></span>
<span id="cb4-2">seq_len <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span></span>
<span id="cb4-3">input_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb4-4">hidden_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb4-5">num_layers <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span></code></pre></div>
</div>
<div id="18f4dcb4-49b0-4f90-bf15-961483ed0471" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.randn(seq_len, batch_sz, input_sz, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.float32)</span>
<span id="cb5-2">c_0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.randn(num_layers, batch_sz, hidden_sz, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.float32)</span>
<span id="cb5-3">h_0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.randn(num_layers, batch_sz, hidden_sz, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.float32)</span></code></pre></div>
</div>
<div id="13915d7b-7746-480b-8bfe-85d2000c21a1" class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">pytorch_cell <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.LSTMCell(input_sz, hidden_sz, bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb6-2">(</span>
<span id="cb6-3">    pytorch_cell.weight_hh.shape,</span>
<span id="cb6-4">    pytorch_cell.weight_ih.shape,</span>
<span id="cb6-5">    pytorch_cell.bias_ih.shape,</span>
<span id="cb6-6">    pytorch_cell.bias_hh.shape,</span>
<span id="cb6-7">)</span></code></pre></div>
<div class="cell-output cell-output-display" data-execution_count="6">
<pre><code>(torch.Size([400, 100]),
 torch.Size([400, 20]),
 torch.Size([400]),
 torch.Size([400]))</code></pre>
</div>
</div>
<div id="36dbbe52-f7c4-4dff-9361-6757dffbd917" class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## h: B x hidden_sz</span></span>
<span id="cb8-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## c: B x hidden_sz</span></span>
<span id="cb8-3">pytorch_h, pytorch_c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pytorch_cell(X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], (h_0[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], c_0[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]))</span></code></pre></div>
</div>
<div id="9f7dd706-59f6-4162-b00c-8c3610c573d3" class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">cell <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LSTMCellNew(input_sz, hidden_sz)</span>
<span id="cb9-2"></span>
<span id="cb9-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## To make sure pytorch and our implementation both</span></span>
<span id="cb9-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## have the same weights so we can compare them</span></span>
<span id="cb9-5">cell.ih.weight.data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pytorch_cell.weight_ih.data</span>
<span id="cb9-6">cell.hh.weight.data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pytorch_cell.weight_hh.data</span>
<span id="cb9-7">cell.ih.bias.data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pytorch_cell.bias_ih.data</span>
<span id="cb9-8">cell.hh.bias.data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pytorch_cell.bias_hh.data</span></code></pre></div>
</div>
<div id="1a03238d-994b-427e-92bd-7fb1abff8b62" class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1">h_t, c_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> cell(X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], h_0[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], c_0[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span></code></pre></div>
</div>
<div id="0bc1356c-0d3e-4e79-a174-0b717407882c" class="cell" data-execution_count="10">
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(</span>
<span id="cb11-2">    np.linalg.norm(pytorch_h.detach().numpy() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> h_t.detach().numpy()),</span>
<span id="cb11-3">    np.linalg.norm(pytorch_c.detach().numpy() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> c_t.detach().numpy()),</span>
<span id="cb11-4">)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>0.0 0.0</code></pre>
</div>
</div>
</section>
<section id="from-cell-to-sequence-the-full-lstm" class="level3">
<h3 class="anchored" data-anchor-id="from-cell-to-sequence-the-full-lstm">From Cell to Sequence: The Full <code>LSTM</code></h3>
<p>With <code>LSTMCell</code> verified, let’s build the full <code>LSTM</code> module that handles entire sequences and optionally stacks multiple layers.</p>
<p>There are several important design decisions in a production LSTM implementation:</p>
<p><strong>Memory layout: sequence-first (<code>T × B × D</code>).</strong> We use the sequence length as the first dimension instead of batch-first. Why? We iterate over time steps in the inner loop, and we want each <code>x[t]</code> to be a contiguous slice of memory. If batch were first, each time step’s data would be non-contiguous, requiring a copy on every iteration.</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
The Contiguity Trap
</div>
</div>
<div class="callout-body-container callout-body">
<p>If you pass batch-first tensors (<code>B × T × D</code>) to an LSTM that expects sequence-first, it will still “work” — but each time step access triggers an implicit copy because the memory isn’t contiguous along the time dimension. This can silently slow down training. PyTorch’s <code>nn.LSTM</code> has a <code>batch_first</code> flag that handles the transpose for you, but internally it still processes sequence-first.</p>
</div>
</div>
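<p>The contiguity claim is easy to check directly. NumPy arrays share the row-major memory model of PyTorch tensors, so a minimal sketch (the sizes here are made up for illustration):</p>

```python
import numpy as np

T, B, D = 16, 8, 32  # illustrative sizes

# Sequence-first layout: each time step x[t] is one contiguous block.
x_seq_first = np.zeros((T, B, D), dtype=np.float32)
step_seq = x_seq_first[0]                  # a view, no copy needed
assert step_seq.flags['C_CONTIGUOUS']

# Batch-first layout: a time-step slice strides across the batch
# dimension, so it is NOT contiguous and must be copied before use.
x_batch_first = np.zeros((B, T, D), dtype=np.float32)
step_batch = x_batch_first[:, 0]           # a strided view
assert not step_batch.flags['C_CONTIGUOUS']
```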
<p><strong>Truncated Backpropagation Through Time (TBPTT).</strong> Since weights are shared across all time steps within a layer, backpropagating through very long sequences causes severe vanishing/exploding gradients <em>and</em> extreme memory usage (all intermediate activations must be stored). The standard solution: <strong>detach</strong> the hidden and cell states from the computation graph after each batch. Gradients can flow within a batch’s time steps but not across batch boundaries.</p>
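<p>A minimal sketch of the detach pattern, assuming a toy <code>nn.LSTMCell</code> and random data rather than the modules built in this post:</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=4, hidden_size=8)
h = torch.zeros(2, 8)   # B x H initial hidden state
c = torch.zeros(2, 8)   # B x H initial cell state

for batch in range(3):                    # a stream of consecutive batches
    x = torch.randn(5, 2, 4)              # one T x B x D chunk
    for t in range(x.shape[0]):
        h, c = cell(x[t], (h, c))
    h.sum().backward()                    # gradients stop at the last detach
    # Truncated BPTT: keep the values, cut the graph at the batch boundary.
    h, c = h.detach(), c.detach()

assert not h.requires_grad                # carried state holds no graph
```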
<p><strong>Multi-layer stacking.</strong> We can stack LSTMs by feeding the hidden state output of layer <img src="https://latex.codecogs.com/png.latex?l"> as the input to layer <img src="https://latex.codecogs.com/png.latex?l+1">. Each layer has its own <code>LSTMCell</code> with independent weights. The first layer’s cell takes input of size <code>input_sz</code>; all subsequent layers take input of size <code>hidden_sz</code>. This increases model capacity — deeper layers can learn more abstract representations.</p>
<p><strong>Layer iteration order.</strong> With multiple layers, there are two valid iteration orders: (1) iterate all time steps for layer 0, then all time steps for layer 1, etc., or (2) at each time step, iterate through all layers before moving to the next time step. Our implementation uses option (1), which is simpler and matches PyTorch’s behavior.</p>
<p><strong>Handling variable-length sequences.</strong> Not all sequences have the same length. Two approaches:</p>
<ol type="1">
<li><strong>Padding</strong>: pad shorter sequences to the longest length with zeros (pre- or post-padding). Simple but wasteful — the model does unnecessary computation on padding tokens.</li>
<li><strong>Packed sequences</strong>: combine all sequences together with index metadata marking boundaries. More efficient but more complex to implement. PyTorch provides <code>pack_padded_sequence</code> and <code>pad_packed_sequence</code> utilities for this.</li>
</ol>
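<p>The second approach can be sketched with PyTorch's packing utilities on three toy sequences of different lengths (the data here is random, for illustration only):</p>

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Three sequences of lengths 4, 2, 1, post-padded with zeros to T=4.
lengths = torch.tensor([4, 2, 1])
padded = torch.zeros(4, 3, 5)             # T x B x D, sequence-first
for b, L in enumerate(lengths):
    padded[:L, b] = torch.randn(L, 5)

packed = pack_padded_sequence(padded, lengths, enforce_sorted=True)
# Only sum(lengths) = 7 real steps are stored; padding is skipped entirely.
assert packed.data.shape[0] == int(lengths.sum())

# Round-trip back to a padded tensor.
unpacked, out_lengths = pad_packed_sequence(packed)
assert unpacked.shape == padded.shape
assert torch.equal(out_lengths, lengths)
```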
<div id="d7691b4e-235e-434c-b22f-a5130c6ad864" class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> LSTMNew(nn.Module):</span>
<span id="cb13-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, input_sz, hidden_sz, num_layers<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):</span>
<span id="cb13-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb13-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.num_layers <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> num_layers</span>
<span id="cb13-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.hidden_sz <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> hidden_sz</span>
<span id="cb13-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.cells <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.ModuleList(</span>
<span id="cb13-7">            [</span>
<span id="cb13-8">                LSTMCellNew(input_sz, hidden_sz)</span>
<span id="cb13-9">                <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb13-10">                <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">else</span> LSTMCellNew(hidden_sz, hidden_sz)</span>
<span id="cb13-11">                <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.num_layers)</span>
<span id="cb13-12">            ]</span>
<span id="cb13-13">        )</span>
<span id="cb13-14"></span>
<span id="cb13-15">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, x, h_t, c_t):</span>
<span id="cb13-16">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## x  :      T     x B x hidden_sz</span></span>
<span id="cb13-17">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## h_t: num_layers x B x hidden_sz</span></span>
<span id="cb13-18">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## c_t: num_layers x B x hidden_sz</span></span>
<span id="cb13-19">        T, B, _ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x.shape</span>
<span id="cb13-20">        H <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.zeros(T, B, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.hidden_sz)</span>
<span id="cb13-21">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i, cell <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.cells):</span>
<span id="cb13-22">            h, c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> h_t[i], c_t[i]</span>
<span id="cb13-23">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb13-24">                x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> H</span>
<span id="cb13-25">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(T):</span>
<span id="cb13-26">                h, c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> cell(x[t], h, c)</span>
<span id="cb13-27">                H[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> h</span>
<span id="cb13-28">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## last hidden state for each layer</span></span>
<span id="cb13-29">            h_t[i], c_t[i] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> h, c</span>
<span id="cb13-30">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## Truncated BPTT</span></span>
<span id="cb13-31">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> H, (h_t.detach(), c_t.detach())</span></code></pre></div>
</div>
<div id="d2c608d9-11f0-49a6-bcbf-e8a7387cb205" class="cell" data-execution_count="12">
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1">pytorch_lstm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nn.LSTM(input_sz, hidden_sz, num_layers<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>num_layers)</span>
<span id="cb14-2">pytorch_H, (pytorch_h, pytorch_c) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pytorch_lstm(X, (h_0, c_0))</span></code></pre></div>
</div>
<div id="0966232f-132c-4e2f-b97c-8fa2fab3ba56" class="cell" data-execution_count="13">
<div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1">lstm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LSTMNew(input_sz, hidden_sz, num_layers<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>num_layers)</span>
<span id="cb15-2"></span>
<span id="cb15-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(num_layers):</span>
<span id="cb15-4">    lstm.cells[i].ih.weight.data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">getattr</span>(pytorch_lstm, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"weight_ih_l</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>).data</span>
<span id="cb15-5">    lstm.cells[i].hh.weight.data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">getattr</span>(pytorch_lstm, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"weight_hh_l</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>).data</span>
<span id="cb15-6">    lstm.cells[i].ih.bias.data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">getattr</span>(pytorch_lstm, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"bias_ih_l</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>).data</span>
<span id="cb15-7">    lstm.cells[i].hh.bias.data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">getattr</span>(pytorch_lstm, <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"bias_hh_l</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>).data</span>
<span id="cb15-8"></span>
<span id="cb15-9">H, (h_t, c_t) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lstm(X, h_0, c_0)</span></code></pre></div>
</div>
<div id="652fc3bb-7371-489b-85c4-8cd2e97fd60f" class="cell" data-execution_count="14">
<div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(</span>
<span id="cb16-2">    np.linalg.norm(pytorch_H.detach().numpy() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> H.detach().numpy()),</span>
<span id="cb16-3">    np.linalg.norm(pytorch_h.detach().numpy() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> h_t.detach().numpy()),</span>
<span id="cb16-4">    np.linalg.norm(pytorch_c.detach().numpy() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> c_t.detach().numpy()),</span>
<span id="cb16-5">)</span></code></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>0.0 0.0 0.0</code></pre>
</div>
</div>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>LSTMs were the dominant architecture for sequence modeling in NLP for years — powering machine translation, text classification, language modeling, and speech recognition before Transformers took over. In this post, we implemented both <code>LSTMCell</code> and a multi-layer <code>LSTM</code> from scratch, verified them against PyTorch’s official implementation, and discussed the performance decisions that go into a production implementation.</p>
<section id="key-takeaways" class="level3">
<h3 class="anchored" data-anchor-id="key-takeaways">Key Takeaways</h3>
<ol type="1">
<li><p><strong>LSTMs solve vanishing gradients through additive cell state updates.</strong> The forget gate can stay close to 1, allowing gradients to flow through many time steps without exponential decay. This is fundamentally different from vanilla RNNs, where the hidden state is completely overwritten at each step.</p></li>
<li><p><strong>Four gates, one matrix multiplication.</strong> The input, forget, cell, and output gates are computed together in a single fused operation, then split — a practical optimization that significantly improves throughput by better utilizing hardware parallelism.</p></li>
<li><p><strong>Sequential processing is the fundamental bottleneck.</strong> The output at time <img src="https://latex.codecogs.com/png.latex?t"> depends on the hidden state from <img src="https://latex.codecogs.com/png.latex?t-1">, making parallelization across time steps impossible. This is the limitation that motivated the Transformer’s self-attention mechanism.</p></li>
<li><p><strong>Truncated BPTT is essential for long sequences.</strong> Detaching hidden states between batches prevents gradient computation from spanning the entire sequence, reducing both memory usage and gradient instability.</p></li>
<li><p><strong>Memory layout matters.</strong> Using sequence-first tensors (<code>T × B × D</code>) ensures contiguous memory access at each time step, avoiding hidden performance penalties from implicit copies.</p></li>
</ol>
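<p>Takeaway 2 can be sketched in a few lines of NumPy (the shapes are illustrative, and the gate order follows PyTorch's convention of input, forget, cell, output):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, H = 2, 3, 4                        # batch, input size, hidden size

x = rng.standard_normal((B, D))
W = rng.standard_normal((4 * H, D))      # all four gate weights stacked row-wise
b = rng.standard_normal(4 * H)

fused = x @ W.T + b                      # ONE matmul for all gates: B x 4H
i, f, g, o = np.split(fused, 4, axis=1)  # slice out the four B x H gate blocks

# Equivalent to four separate matmuls against W's row blocks:
assert np.allclose(i, x @ W[:H].T + b[:H])
assert all(m.shape == (B, H) for m in (i, f, g, o))
```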
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
LSTMs in the Transformer Era
</div>
</div>
<div class="callout-body-container callout-body">
<p>While Transformers have largely replaced LSTMs for most NLP tasks, understanding LSTMs remains valuable. They’re still used in streaming/online settings where you process one token at a time, in resource-constrained environments where the <img src="https://latex.codecogs.com/png.latex?O(n%5E2)"> attention cost is prohibitive, and as components in hybrid architectures. More importantly, the concepts — gating, cell states, truncated BPTT — appear in many modern architectures in different forms.</p>
</div>
</div>
</section>
</section>
<section id="references-resources" class="level2">
<h2 class="anchored" data-anchor-id="references-resources">References &amp; Resources</h2>
<ul>
<li><strong>Hochreiter, S. &amp; Schmidhuber, J.</strong> (1997). <a href="https://www.bioinf.jku.at/publications/older/2604.pdf">Long Short-Term Memory</a>. <em>Neural Computation</em>, 9(8), 1735–1780. The original LSTM paper.</li>
<li><strong>Olah, C.</strong> (2015). <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTM Networks</a>. The classic visual explainer — the single best resource for building intuition about LSTM gates.</li>
<li><strong>Greff, K. et al.</strong> (2017). <a href="https://arxiv.org/abs/1503.04069">LSTM: A Search Space Odyssey</a>. <em>IEEE TNNLS</em>. Comprehensive study of LSTM variants — concludes that the forget gate and output activation are the most critical components.</li>
<li><strong>Vaswani, A. et al.</strong> (2017). <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>. <em>NeurIPS 2017</em>. The Transformer architecture that largely replaced LSTMs for NLP tasks.</li>
<li><strong>Merity, S. et al.</strong> (2018). <a href="https://arxiv.org/abs/1708.02182">Regularizing and Optimizing LSTM Language Models</a>. <em>ICLR 2018</em>. AWD-LSTM — pushed LSTM language models to their limits with careful regularization.</li>
<li><strong>PyTorch Documentation</strong>. <a href="https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html">nn.LSTM</a> and <a href="https://pytorch.org/docs/stable/generated/torch.nn.LSTMCell.html">nn.LSTMCell</a>. Official reference for the implementation we verified against.</li>
</ul>


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>NLP</category>
  <guid>https://imaddabbura.github.io/posts/nlp/LSTM-Annotated-Implementation.html</guid>
  <pubDate>Tue, 10 Mar 2020 05:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/nlp/images/lstm-cell.jpeg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Anomaly Detection</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/anomaly-detection/Anomaly-Detection.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge growing">growing</span></div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>Anomaly detection is the identification of examples or events that don’t conform to an expected pattern or to the majority of examples. Roughly speaking, it’s the process of identifying an example that is not <em>normal (an outlier)</em> given the distribution of the data. An <strong>outlier</strong> is an example that deviates so much from the other examples that it arouses suspicion it was generated by a different data-generating process. Such outliers have a very low probability of belonging to the same data-generating process; they lie on the far left or right tail of its probability density function.</p>
<p>The algorithm works as follows:</p>
<ol type="1">
<li><p>Fit a <em>Gaussian Probability Density Function (PDF)</em> for each feature in the training dataset:</p>
<ul>
<li><p>Calculate the mean and the variance of each feature: <img src="https://latex.codecogs.com/png.latex?%5Cmu_j%20=%20%5Cfrac%7B1%7D%7Bm%7D%5Csum_%7Bi%20=%201%7D%5Emx_j%5Ei%5C%5C%7B%7D"> <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2_j%20=%20%5Cfrac%7B1%7D%7Bm%7D%5Csum_%7Bi%20=%201%7D%5Em(x_j%5Ei%20-%20%5Cmu_j)%5E2%5C%5C%7B%7D"> where <img src="https://latex.codecogs.com/png.latex?%5Cmu"> is the mean and <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2"> is the variance that controls the shape of the density function.</p></li>
<li><p>Compute the density function for each feature using the following formula:<br>
<img src="https://latex.codecogs.com/png.latex?p(x;%20%5Cmu,%20%5Csigma%5E2)%20=%20%5Cfrac%7B1%7D%7B%5Csqrt%7B2%5Cpi%7D%5Csigma%7De%5E%7B-%5Cfrac%7B(x%20-%20%5Cmu)%5E2%7D%7B2%5Csigma%5E2%7D%7D%5C%5C%7B%7D"> Since the mean and the variance are sensitive to outliers, we fit the model on a training dataset that contains only normal examples when estimating the mean vector and the covariance matrix.</p></li>
</ul></li>
<li><p>Compute the Gaussian density of each example by taking the product of all features’ density functions.</p></li>
<li><p>If <img src="https://latex.codecogs.com/png.latex?p(x)%20%3C%20%5Cepsilon">, flag the example as an anomaly; otherwise, treat it as normal. Epsilon controls how sensitive the detection algorithm is: if <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> is large <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> many examples are flagged as anomalous, which increases the <em>false positives</em>; if <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> is small <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> only a very small portion of the examples is flagged as anomalous, which increases the <em>false negatives</em>.</p></li>
<li><p>Use <em>cross-validation</em> to tune the hyperparameter <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> for the best value of the performance metric. The F1 score is commonly used: <img src="https://latex.codecogs.com/png.latex?F_1%20=%202%20%5Cfrac%7Bprecision%20*%20recall%7D%7Bprecision%20+%20recall%7D%5C%5C%7B%7D"> where <img src="https://latex.codecogs.com/png.latex?precision%20=%20%5Cfrac%7Btp%7D%7Btp%20+%20fp%7D%5C%5C%7B%7D"> <img src="https://latex.codecogs.com/png.latex?recall%20=%20%5Cfrac%7Btp%7D%7Btp%20+%20fn%7D%5C%5C%7B%7D"> <em>tp: true positives, fp: false positives, fn: false negatives</em>.</p></li>
</ol>
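<p>The recipe above can be sketched in a few lines of NumPy (the data and threshold are toy values, not the dataset used below):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=1.0, size=(300, 2))  # normal examples only

# Step 1: per-feature mean and variance (MLE estimates, so ddof=0)
mu = X_train.mean(axis=0)
sigma2 = X_train.var(axis=0)

def p(X, mu, sigma2):
    """Product of per-feature univariate Gaussian densities."""
    densities = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

# Step 3: flag examples whose density falls below epsilon
eps = 1e-4
normal_pt = np.array([[5.0, 5.0]])
outlier_pt = np.array([[12.0, -3.0]])
assert p(normal_pt, mu, sigma2)[0] >= eps     # near the mean: not flagged
assert p(outlier_pt, mu, sigma2)[0] < eps     # far from the mean: flagged
```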
<p>We have two kinds of anomaly detection algorithms:</p>
<ol type="1">
<li><strong>Univariate Gaussian Density Function</strong> <img src="https://latex.codecogs.com/png.latex?p(x)%20=%20%5Cprod_%7Bj%20=%201%7D%5E%7Bn%7Dp(x_j;%20%5Cmu_j,%20%5Csigma_j%5E2)%5C%5C%7B%7D"> <img src="https://latex.codecogs.com/png.latex?%20=%20p(x_1;%20%5Cmu_1,%20%5Csigma_1%5E2)*p(x_2;%20%5Cmu_2,%20%5Csigma_2%5E2)*%20...%20*%20p(x_n;%20%5Cmu_n,%20%5Csigma_n%5E2)%5C%5C%7B%7D">
<ul>
<li>It assumes that all features are independent; therefore, the covariance between every pair of features is zero.</li>
<li>It’s computationally faster and more efficient.</li>
<li>Use it when we have a very large number of features.</li>
<li>Make sure to manually add features that capture unusual combinations of feature values, such as <img src="https://latex.codecogs.com/png.latex?x_3%20=%20%5Cfrac%20%7Bx_2%7D%7Bx_1%7D">. Otherwise, the algorithm may fail to detect anomalies whose values look normal for each feature separately but are unusual when all features are considered together, such as a high value for feature 2 combined with a low value for feature 1.</li>
</ul></li>
</ol>
<ol start="2" type="1">
<li><strong>Multivariate Gaussian Density Function</strong> <img src="https://latex.codecogs.com/png.latex?p(x;%20%5Cmu,%20%5CSigma)%20=%20%5Cfrac%7B1%7D%7B(2%5Cpi)%5E%7B(n%20/%202)%7D(%5Cdet%5CSigma)%5E%7B1%20/%202%7D%7De%5E%7B%5Cfrac%7B-1%7D%7B2%7D(x%20-%20%5Cmu)%5ET%5CSigma%5E%7B-1%7D(x%20-%20%5Cmu)%7D%5C%5C%7B%7D"> where <img src="https://latex.codecogs.com/png.latex?%5CSigma"> is the n x n covariance matrix: <img src="https://latex.codecogs.com/png.latex?%5CSigma%20=%20%5Cbegin%7Bbmatrix%7D%0A%5Csigma_1%5E2&amp;%5Csigma_%7B12%7D&amp;%5Ccdots&amp;%5Csigma_%7B1n%7D%5C%5C%0A%5Csigma_%7B21%7D&amp;%5Csigma_2%5E2&amp;%5Ccdots&amp;%5Csigma_%7B2n%7D%5C%5C%0A%5Cvdots%20&amp;%20%5Cvdots%20&amp;%20%5Cddots%20&amp;%20%5Cvdots%20%5C%5C%0A%5Csigma_%7Bn1%7D%20&amp;%20%5Csigma_%7Bn2%7D%20&amp;%20%5Ccdots%20&amp;%20%5Csigma_n%5E2%0A%5Cend%7Bbmatrix%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Csigma_%7B12%7D%20=%20%5Csigma_%7B21%7D"> is the covariance between features 1 and 2. Therefore, the covariance matrix is <em>symmetric positive (semi-)definite</em>.
<ul>
<li>Computationally expensive</li>
<li>Use it when the number of examples is <img src="https://latex.codecogs.com/png.latex?%5Cgeq"> 10 times the number of features, i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?m%20%5Cgeq%2010n"></li>
<li>If some features are linearly dependent, or the number of examples is less than the number of features, the covariance matrix won’t be invertible</li>
<li>There is no need to add extra features to capture unusual combinations of feature values, because the covariances between all pairs of features already capture that</li>
<li>The univariate density function can be derived from the multivariate density function by making the covariance matrix diagonal; therefore, <img src="https://latex.codecogs.com/png.latex?%5Csigma_%7Bij%7D%20=%200"> for all <img src="https://latex.codecogs.com/png.latex?i%20%5Cneq%20j"></li>
</ul></li>
</ol>
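<p>A sketch of the multivariate density using <code>pinv</code> and <code>det</code>, as imported later in this post (the toy data and covariance here are made up):</p>

```python
import numpy as np
from numpy.linalg import pinv, det

rng = np.random.default_rng(1)
# Toy 2-feature data with positively correlated features.
X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 1.2], [1.2, 1.0]], size=500)

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)   # MLE covariance estimate, n x n

def multivariate_p(X, mu, Sigma):
    n = mu.shape[0]
    diff = X - mu
    norm = (2 * np.pi) ** (n / 2) * det(Sigma) ** 0.5
    # einsum computes the quadratic form (x-mu)^T Sigma^-1 (x-mu) per row
    quad = np.einsum('ij,jk,ik->i', diff, pinv(Sigma), diff)
    return np.exp(-0.5 * quad) / norm

# A point aligned with the correlation looks normal; one against it does not,
# even though each coordinate is unremarkable on its own.
assert multivariate_p(np.array([[1.0, 0.8]]), mu, Sigma)[0] > \
       multivariate_p(np.array([[1.0, -0.8]]), mu, Sigma)[0]
```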
<p>There are some assumptions made implicitly here:</p>
<ul>
<li>For each feature, the <img src="https://latex.codecogs.com/png.latex?X_i">’s are IID (independent and identically distributed).</li>
<li>By the Central Limit Theorem (CLT), the distribution of a sum of IID random variables is approximately normal. This justifies fitting a normal distribution parameterized by <img src="https://latex.codecogs.com/png.latex?%5Cmu"> and <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2">.</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cmu"> and <img src="https://latex.codecogs.com/png.latex?%5CSigma"> are estimated using the maximum-likelihood method.</li>
</ul>
<p>After fitting the multivariate PDF under the above assumptions, we use it to estimate the probability that each example in the validation/test set was generated by it. If that probability is smaller than <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">, we conclude that the example was generated by a different multivariate PDF and, therefore, classify it as an <em>anomaly</em> (outlier).</p>
<p>In this exercise, we’ll implement an anomaly detection algorithm to detect anomalous behavior in server computers. The features measure the throughput (mb/s) and latency (ms) of each server’s response. While the servers were operating, <img src="https://latex.codecogs.com/png.latex?m%20=%20307"> examples of their behavior were captured. We suspect that the vast majority of them are normal (non-anomalous) examples of the servers operating normally.</p>
<p>Let’s first load and plot the data:</p>
<div id="cell-4" class="cell" data-code_folding="[0]" data-execution_count="1">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> numpy.linalg <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pinv, det</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb1-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy.io <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> loadmat, whosmat</span>
<span id="cb1-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> scipy.optimize <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> opt</span>
<span id="cb1-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> seaborn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> sns</span>
<span id="cb1-8"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> warnings <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> filterwarnings</span>
<span id="cb1-9"></span>
<span id="cb1-10"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span>matplotlib inline</span>
<span id="cb1-11">sns.set_context(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'notebook'</span>)</span>
<span id="cb1-12">plt.style.use(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'fivethirtyeight'</span>)</span>
<span id="cb1-13">filterwarnings(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'ignore'</span>)</span></code></pre></div>
</details>
</div>
</section>
<section id="functions" class="level2">
<h2 class="anchored" data-anchor-id="functions">Functions</h2>
<div id="cell-6" class="cell" data-code_folding="[1,46]" data-execution_count="2">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
# Compute guassian">
font-style: inherit;"># Compute gaussian distribution fn</span></span>
<span id="cb2-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> gaussian_estimate(X_train, X_val, gaussian_type<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'univariate'</span>):</span>
<span id="cb2-3">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">'''</span></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    parameters</span></span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    ----------</span></span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    X_train: array-like</span></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        training features matrix m x n that has only normal examples.</span></span>
<span id="cb2-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    X_val: array-like</span></span>
<span id="cb2-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        cross validation features matrix that has anomalous and normal</span></span>
<span id="cb2-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        examples.</span></span>
<span id="cb2-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    gussian_type: str</span></span>
<span id="cb2-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        univariate or multivariate.</span></span>
<span id="cb2-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb2-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Returns</span></span>
<span id="cb2-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    -------</span></span>
<span id="cb2-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    pdf: array-like</span></span>
<span id="cb2-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        multivariate pdf vector of n x 1</span></span>
<span id="cb2-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    '''</span></span>
<span id="cb2-19">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># number of training examples and features</span></span>
<span id="cb2-20">    m, n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_train.shape</span>
<span id="cb2-21">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># number of cv examples</span></span>
<span id="cb2-22">    mval <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_val.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb2-23"></span>
<span id="cb2-24">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compute mean and covariance matrix</span></span>
<span id="cb2-25">    mu <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_train.mean(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb2-26">    cov <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (m)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (X_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> mu).T.dot(X_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> mu)</span>
<span id="cb2-27"></span>
<span id="cb2-28">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># convert the covariance matrix to diagonal if it's a univariate</span></span>
<span id="cb2-29">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> gaussian_type <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'univariate'</span>:</span>
<span id="cb2-30">        z <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros_like(cov)</span>
<span id="cb2-31">        np.fill_diagonal(z, np.diagonal(cov))</span>
<span id="cb2-32">        cov <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> z</span>
<span id="cb2-33"></span>
<span id="cb2-34">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compute determinant and inverse of covariance matrix</span></span>
<span id="cb2-35">    cov_det <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> det(cov)</span>
<span id="cb2-36">    cov_inv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pinv(cov)</span>
<span id="cb2-37"></span>
<span id="cb2-38">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compute pdf vector</span></span>
<span id="cb2-39">    pdf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> np.pi) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> (<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (cov_det <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> (<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*\</span></span>
<span id="cb2-40">        np.exp(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(np.multiply((X_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> mu).dot(cov_inv),</span>
<span id="cb2-41">                                         (X_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> mu)), axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb2-42"></span>
<span id="cb2-43">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> pdf</span>
<span id="cb2-44"></span>
<span id="cb2-45"></span>
<span id="cb2-46"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Hyperparameter tuning of epsilon using cv dataset</span></span>
<span id="cb2-47"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> select_threshold(y_val, p_val):</span>
<span id="cb2-48">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">'''</span></span>
<span id="cb2-49"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    parameters</span></span>
<span id="cb2-50"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    ----------</span></span>
<span id="cb2-51"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    y_val: array-like</span></span>
<span id="cb2-52"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        label whether a validation example is normal (0) or anomaly (1).</span></span>
<span id="cb2-53"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    p_val: array-like</span></span>
<span id="cb2-54"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        pdf for validated examples.</span></span>
<span id="cb2-55"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb2-56"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Returns</span></span>
<span id="cb2-57"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    -------</span></span>
<span id="cb2-58"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    eplsion : float</span></span>
<span id="cb2-59"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        best epsilon value tuned on validation data.</span></span>
<span id="cb2-60"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    F1_score : float</span></span>
<span id="cb2-61"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        F1 score using epsilon tuned on validation data.</span></span>
<span id="cb2-62"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    '''</span></span>
<span id="cb2-63">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># initialize epsilon and F1 score values</span></span>
<span id="cb2-64">    best_epsilon <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb2-65">    best_F1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb2-66"></span>
<span id="cb2-67">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compute stepsize for each iteration</span></span>
<span id="cb2-68">    epsilon_stepsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (p_val.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> p_val.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>()) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb2-69"></span>
<span id="cb2-70">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> epsilon <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> np.arange(p_val.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>(), p_val.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(), epsilon_stepsize):</span>
<span id="cb2-71">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get predictions vector</span></span>
<span id="cb2-72">        pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ((p_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> epsilon) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).reshape(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb2-73"></span>
<span id="cb2-74">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compute true positives, false positives, false negatives</span></span>
<span id="cb2-75">        tp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>((pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> (y_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb2-76">        fp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>((pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> (y_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>))</span>
<span id="cb2-77">        fn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>((pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> (y_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb2-78"></span>
<span id="cb2-79">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compute precision and recall</span></span>
<span id="cb2-80">        precision_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (tp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> fp)</span>
<span id="cb2-81">        recall_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (tp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> fn)</span>
<span id="cb2-82"></span>
<span id="cb2-83">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compute F1 score</span></span>
<span id="cb2-84">        F1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> ((precision_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> recall_) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (precision_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> recall_))</span>
<span id="cb2-85">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># if F1 score &gt; best_F1, set best_F1 = F1</span></span>
<span id="cb2-86">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> F1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> best_F1:</span>
<span id="cb2-87">            best_F1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> F1</span>
<span id="cb2-88">            best_epsilon <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> epsilon</span>
<span id="cb2-89"></span>
<span id="cb2-90">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> best_epsilon, best_F1</span></code></pre></div>
</details>
</div>
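<p>As a quick sanity check (not part of the original notebook), the density that <code>gaussian_estimate</code> computes should agree with SciPy’s <code>multivariate_normal</code> when both are fit with the same mean and biased (1/m) covariance. The sketch below uses synthetic data and re-derives the pdf formula inline:</p>

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Synthetic "normal" training examples (hypothetical, not the post's dataset)
X_train = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(500, 2))

# Mean and biased (1/m) covariance, matching the convention in gaussian_estimate
mu = X_train.mean(axis=0)
cov = (X_train - mu).T @ (X_train - mu) / X_train.shape[0]

# Manual multivariate Gaussian pdf, term by term as in the function above
n = X_train.shape[1]
p_manual = (
    (2 * np.pi) ** (-n / 2)
    * np.linalg.det(cov) ** (-0.5)
    * np.exp(-0.5 * np.sum((X_train - mu) @ np.linalg.pinv(cov) * (X_train - mu), axis=1))
)

# SciPy's frozen distribution computes the same quantity
p_scipy = multivariate_normal(mean=mu, cov=cov).pdf(X_train)

print(np.allclose(p_manual, p_scipy))  # True
```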
<div id="cell-7" class="cell" data-code_folding="[0]" data-tags="[]" data-execution_count="6">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load data</span></span>
<span id="cb3-2">data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> loadmat(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'../data/servers_anomaly_detection.mat'</span>)</span>
<span id="cb3-3"></span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Training data</span></span>
<span id="cb3-5">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'X'</span>]</span>
<span id="cb3-6"></span>
<span id="cb3-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Cross validation data</span></span>
<span id="cb3-8">X_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Xval'</span>]</span>
<span id="cb3-9">y_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'yval'</span>]</span>
<span id="cb3-10"></span>
<span id="cb3-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot data</span></span>
<span id="cb3-12">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>))</span>
<span id="cb3-13">plt.scatter(X[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], s <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'blue'</span>)</span>
<span id="cb3-14">plt.axis([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>])</span>
<span id="cb3-15">plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Latency (ms)'</span>)</span>
<span id="cb3-16">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Throughput (mb/s)'</span>)</span>
<span id="cb3-17">plt.gca().set_aspect(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'equal'</span>)</span>
<span id="cb3-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># plt.title('Scatter plot of the first dataset');</span></span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Anomaly-Detection_files/figure-html/cell-4-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://imaddabbura.github.io/posts/anomaly-detection/Anomaly-Detection_files/figure-html/cell-4-output-1.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
</div>
<div id="cell-8" class="cell" data-execution_count="7">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># plt.subplots(1, 2, 1)</span></span>
<span id="cb4-2">sns.kdeplot(X[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb4-3">sns.kdeplot(X[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Anomaly-Detection_files/figure-html/cell-5-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://imaddabbura.github.io/posts/anomaly-detection/Anomaly-Detection_files/figure-html/cell-5-output-1.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
</div>
<p>Now we’ll estimate the Gaussian distribution for both the training and cross-validation sets. Note that we fit the mean and covariance on the training set, which contains ONLY normal examples, and then use the cross-validation set, which contains both normal and anomalous examples, to find the best epsilon.</p>
<div id="cell-10" class="cell" data-code_folding="[0]" data-execution_count="10">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fit Gaussian distribution on both training and CV examples</span></span>
<span id="cb5-2">ptrain <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gaussian_estimate(X, X)</span>
<span id="cb5-3">pval <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gaussian_estimate(X, X_val, gaussian_type<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'multivariate'</span>)</span>
<span id="cb5-4"></span>
<span id="cb5-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Tune epsilon</span></span>
<span id="cb5-6">epsilon, F1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> select_threshold(y_val, pval)</span>
<span id="cb5-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'The best epsilon tuned using CV that yielded the best '</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb5-8">      <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'F1-score </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>F1<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> is: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epsilon<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">.'</span>)</span></code></pre></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>The best epsilon tuned using CV that yielded the best F1-score 0.875 is: 9.065769728392737e-05.</code></pre>
</div>
</div>
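<p>The threshold sweep can also be illustrated on synthetic scores. Everything here is hypothetical (score ranges and sample sizes are made up for the sketch); the point is that sweeping epsilon over the observed pdf range and keeping the best F1 recovers a separating threshold when one exists:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical pdf values: normal points score high, anomalies score low
p_val = np.concatenate([rng.uniform(0.4, 1.0, 95), rng.uniform(0.0, 0.05, 5)])
y_val = np.concatenate([np.zeros(95), np.ones(5)])

best_epsilon, best_f1 = 0.0, 0.0
for epsilon in np.linspace(p_val.min(), p_val.max(), 1000):
    # Flag an example as anomalous when its density falls below epsilon
    pred = (p_val < epsilon).astype(int)
    tp = np.sum((pred == 1) & (y_val == 1))
    fp = np.sum((pred == 1) & (y_val == 0))
    fn = np.sum((pred == 0) & (y_val == 1))
    if tp == 0:
        continue  # skip degenerate thresholds (the notebook suppresses these warnings instead)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    if f1 > best_f1:
        best_f1, best_epsilon = f1, epsilon

print(best_f1)  # 1.0 -- a threshold between the two score ranges separates them perfectly
```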
<p>We’ll use the epsilon tuned on the cross-validation set to see which training examples the algorithm flags as anomalous. Below is a scatter plot of the training data, where red points mark the anomalous examples.</p>
<div id="cell-12" class="cell" data-code_folding="[0]" data-tags="[]" data-execution_count="11">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get the index of the outlier</span></span>
<span id="cb7-2">outliers <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.where(ptrain <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> epsilon)</span>
<span id="cb7-3"></span>
<span id="cb7-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot data</span></span>
<span id="cb7-5">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb7-6">plt.scatter(X[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], s<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'blue'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Normal Examples'</span>)</span>
<span id="cb7-7">plt.scatter(X[outliers[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X[outliers[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], s<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span>, c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'red'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Anomalous Examples'</span>)</span>
<span id="cb7-8">plt.axis([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>])</span>
<span id="cb7-9">plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Latency (ms)'</span>)</span>
<span id="cb7-10">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Throughput (mb/s)'</span>)</span>
<span id="cb7-11">plt.legend(loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'upper right'</span>)</span>
<span id="cb7-12">plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Scatter plot of the training dataset'</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Anomaly-Detection_files/figure-html/cell-7-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-3"><img src="https://imaddabbura.github.io/posts/anomaly-detection/Anomaly-Detection_files/figure-html/cell-7-output-1.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
</div>
<p>Finally, we’ll fit a Gaussian distribution on a training dataset that has 1000 examples and 11 features. Note that in both examples we used the <em>Multivariate</em>, not the <em>Univariate</em>, Gaussian distribution.</p>
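<p>The univariate model is exactly the multivariate one restricted to a diagonal covariance matrix, which is what the <code>fill_diagonal</code> step in <code>gaussian_estimate</code> exploits. A minimal check of that identity on synthetic data (not from the post’s datasets):</p>

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))

mu = X.mean(axis=0)
var = X.var(axis=0)  # biased (1/m) variance, matching the convention above

# Univariate model: product of independent 1-D Gaussians, one per feature
p_uni = np.prod(norm(loc=mu, scale=np.sqrt(var)).pdf(X), axis=1)

# Multivariate model with a diagonal covariance matrix
p_diag = multivariate_normal(mean=mu, cov=np.diag(var)).pdf(X)

print(np.allclose(p_uni, p_diag))  # True
```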
<div id="cell-14" class="cell" data-code_folding="[]" data-execution_count="12">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load data</span></span>
<span id="cb8-2">data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> loadmat(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'../data/ex8data2.mat'</span>)</span>
<span id="cb8-3"></span>
<span id="cb8-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Training data</span></span>
<span id="cb8-5">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'X'</span>]</span>
<span id="cb8-6"></span>
<span id="cb8-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Cross validation data</span></span>
<span id="cb8-8">Xval <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Xval'</span>]</span>
<span id="cb8-9">yval <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'yval'</span>]</span>
<span id="cb8-10"></span>
<span id="cb8-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fit Gaussian distribution on both training and CV examples</span></span>
<span id="cb8-12">ptrain <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gaussian_estimate(X, X, gaussian_type<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'multivariate'</span>)</span>
<span id="cb8-13">pval <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gaussian_estimate(X, Xval, gaussian_type<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'multivariate'</span>)</span>
<span id="cb8-14"></span>
<span id="cb8-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Tune epsilon</span></span>
<span id="cb8-16">epsilon, F1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> select_threshold(yval, pval)</span>
<span id="cb8-17"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'The best epsilon tuned using CV that yielded the best '</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb8-18">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">f'F1-score </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{F1:.3f}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> is: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{epsilon}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">.'</span>)</span></code></pre></div>
</details>
<div class="cell-output cell-output-error">
<pre><code>FileNotFoundError: [Errno 2] No such file or directory: '../data/ex8data2.mat'</code></pre>
</div>
</div>
<p>Using the best epsilon value we found above, we can then classify any example as an anomaly if <img src="https://latex.codecogs.com/png.latex?p(x)%20%3C%20%5Cepsilon">; otherwise, it’s normal.</p>
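<p>As a minimal sketch of this thresholding step (with made-up densities and a made-up <code>epsilon</code>, not the values computed above):</p>

```python
import numpy as np

# Hypothetical densities p(x) for five examples and a hypothetical threshold
p = np.array([0.08, 1e-6, 0.03, 5e-7, 0.2])
epsilon = 1e-5

# An example is flagged as anomalous when p(x) < epsilon
is_anomaly = p < epsilon
print(is_anomaly)  # -> [False  True False  True False]
```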
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<ul>
<li>The implementation of the variance/covariance in the detection algorithm has <img src="https://latex.codecogs.com/png.latex?m"> in the denominator, not <img src="https://latex.codecogs.com/png.latex?(m%20-%201)">, because with large datasets the difference is negligible. Note, however, that the unbiased estimator of the variance has <img src="https://latex.codecogs.com/png.latex?(m%20-%201)"> in the denominator, not <img src="https://latex.codecogs.com/png.latex?m">.</li>
<li>Anomaly detection vs.&nbsp;supervised learning:
<ul>
<li>Use anomaly detection when you have a large number of negative examples and a very small number of positive examples. A supervised learning algorithm wouldn’t have enough positive examples to learn from, especially if future anomalies look nothing like the training anomalies.</li>
<li>Use supervised learning algorithms such as logistic regression if you have enough positive examples to make learning feasible; in that case, supervised learning would probably outperform anomaly detection algorithms.</li>
</ul></li>
<li>The univariate PDF performs well most of the time compared to the multivariate PDF, and it scales much better.</li>
</ul>
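<p>To make the first point concrete, here is a small sketch (using NumPy on made-up data, not the post’s dataset) of the two variance estimators:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # m = 1000 examples

var_m = x.var(ddof=0)    # divides by m (what the detection algorithm uses)
var_m1 = x.var(ddof=1)   # divides by (m - 1) (the unbiased estimator)

# With m = 1000 the two differ only by a factor of m / (m - 1)
print(var_m1 / var_m)  # -> 1000 / 999, i.e. about 1.001
```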


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Machine Learning</category>
  <guid>https://imaddabbura.github.io/posts/anomaly-detection/Anomaly-Detection.html</guid>
  <pubDate>Wed, 11 Sep 2019 05:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/anomaly-detection/feature.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Gradient Descent Algorithm and Its Variants</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/optimization/gradient-descent.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge evergreen">evergreen</span></div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p><strong>Optimization</strong> refers to the task of minimizing/maximizing an objective function <img src="https://latex.codecogs.com/png.latex?f(x)"> parameterized by <img src="https://latex.codecogs.com/png.latex?x">. In machine/deep learning terminology, it’s the task of minimizing the cost/loss function <img src="https://latex.codecogs.com/png.latex?J(w)"> parameterized by the model’s parameters <img src="https://latex.codecogs.com/png.latex?w%20%5Cin%20%5Cmathbb%7BR%7D%5Ed">. Optimization algorithms (in the case of minimization) have one of the following goals:</p>
<ul>
<li>Find the global minimum of the objective function. This is feasible if the objective function is convex, i.e.&nbsp;any local minimum is a global minimum.</li>
<li>Find the lowest possible value of the objective function within its neighborhood. That’s usually the case when the objective function is not convex, as in most deep learning problems.</li>
</ul>
<p>There are three kinds of optimization algorithms:</p>
<ul>
<li>Optimization algorithms that are not iterative and simply solve for one point.</li>
<li>Optimization algorithms that are iterative in nature and converge to an acceptable solution regardless of parameter initialization, such as gradient descent applied to logistic regression.</li>
<li>Optimization algorithms that are iterative in nature and applied to problems with non-convex cost functions, such as neural networks. Here, parameter initialization plays a critical role in speeding up convergence and achieving lower error rates.</li>
</ul>
<p><strong>Gradient Descent</strong> is the most common optimization algorithm in <em>machine learning</em> and <em>deep learning</em>. It is a first-order optimization algorithm, which means it only takes the first derivative into account when updating the parameters. On each iteration, we update the parameters in the opposite direction of the gradient of the objective function <img src="https://latex.codecogs.com/png.latex?J(w)"> w.r.t. the parameters, since the gradient gives the direction of steepest ascent. The size of the step we take on each iteration toward the local minimum is determined by the learning rate <img src="https://latex.codecogs.com/png.latex?%5Calpha">. Therefore, we follow the direction of the slope downhill until we reach a local minimum.</p>
<p>In this notebook, we’ll cover gradient descent algorithm and its variants: <em>Batch Gradient Descent, Mini-batch Gradient Descent, and Stochastic Gradient Descent</em>.</p>
<p>Let’s first see how gradient descent and its associated steps works on logistic regression before going into the details of its variants. For the sake of simplicity, let’s assume that the logistic regression model has only two parameters: weight <img src="https://latex.codecogs.com/png.latex?w"> and bias <img src="https://latex.codecogs.com/png.latex?b">.</p>
<ol type="1">
<li>Initialize weight <img src="https://latex.codecogs.com/png.latex?w"> and bias <img src="https://latex.codecogs.com/png.latex?b"> to any random numbers.</li>
<li>Pick a value for the learning rate <img src="https://latex.codecogs.com/png.latex?%5Calpha">. The learning rate determines how big the step would be on each iteration.</li>
</ol>
<ul>
<li>If <img src="https://latex.codecogs.com/png.latex?%5Calpha"> is very small, it will take a long time to converge and become computationally expensive.</li>
<li>If <img src="https://latex.codecogs.com/png.latex?%5Calpha"> is large, it may fail to converge and overshoot the minimum.</li>
</ul>
<p>Therefore, plot the cost function for different values of <img src="https://latex.codecogs.com/png.latex?%5Calpha"> and pick the largest value that still converges — the one right before the first value that diverges — so that we have a fast learning algorithm that converges (see figure 1).</p>
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><a href="images/learning_rate.PNG" class="lightbox" data-gallery="quarto-lightbox-gallery-1" data-glightbox="description: .lightbox-desc-1" title="Figure 1: Gradient descent with different learning rates Source"><img src="https://imaddabbura.github.io/posts/optimization/images/learning_rate.PNG" class="quarto-figure quarto-figure-left figure-img" width="600" height="400" alt="Figure 1: Gradient descent with different learning rates Source"></a></p>
</figure>
</div>
<figcaption><strong>Figure 1</strong>: Gradient descent with different learning rates <a href="http://cs231n.github.io/neural-networks-3/">Source</a></figcaption>
</figure>
</div>
<ul>
<li>The most commonly used rates are : <em>0.001, 0.003, 0.01, 0.03, 0.1, 0.3</em>.</li>
</ul>
<ol start="3" type="1">
<li>Make sure to scale the data if the features are on very different scales. If we don’t scale the data, the level curves (contours) will be narrower and taller, which means it will take a longer time to converge (see figure 2).</li>
</ol>
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><a href="images/normalized-vs-unnormalized.PNG" class="lightbox" data-gallery="quarto-lightbox-gallery-2" data-glightbox="description: .lightbox-desc-2" title="Figure 2: Gradient descent: normalized versus unnormalized level curves"><img src="https://imaddabbura.github.io/posts/optimization/images/normalized-vs-unnormalized.PNG" class="quarto-figure quarto-figure-left figure-img" width="800" height="300" alt="Figure 2: Gradient descent: normalized versus unnormalized level curves"></a></p>
</figure>
</div>
<figcaption><strong>Figure 2</strong>: Gradient descent: normalized versus unnormalized level curves</figcaption>
</figure>
</div>
<p>Scale the data to have <img src="https://latex.codecogs.com/png.latex?%5Cmu%20=%200"> and <img src="https://latex.codecogs.com/png.latex?%5Csigma%20=%201">. Below is the formula for scaling each example: <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bx_i%20-%20%5Cmu%7D%7B%5Csigma%7D%5Ctag%7B1%7D"></p>
<ol start="4" type="1">
<li>On each iteration, take the partial derivative of the cost function <img src="https://latex.codecogs.com/png.latex?J(w)"> w.r.t. each parameter (gradient): <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20w%7DJ(w)%20=%20%5Cnabla_w%20J%5Ctag%7B2%7D"> <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20b%7DJ(w)%20=%20%5Cnabla_b%20J%5Ctag%7B3%7D"> The update equations are: <img src="https://latex.codecogs.com/png.latex?w%20=%20w%20-%20%5Calpha%20%5Cnabla_w%20J%5Ctag%7B4%7D"> <img src="https://latex.codecogs.com/png.latex?b%20=%20b%20-%20%5Calpha%20%5Cnabla_b%20J%5Ctag%7B5%7D"></li>
</ol>
<ul>
<li>For the sake of illustration, assume we don’t have a bias term. If the slope at the current value of <img src="https://latex.codecogs.com/png.latex?w"> is positive, we are to the right of the optimal <img src="https://latex.codecogs.com/png.latex?w%5E*">; the update will be negative and move us closer to the optimal <img src="https://latex.codecogs.com/png.latex?w%5E*">. If the slope is negative, the update will be positive and increase the current value of <img src="https://latex.codecogs.com/png.latex?w"> toward the optimal <img src="https://latex.codecogs.com/png.latex?w%5E*"> (see figure 3):</li>
</ul>
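<p>The scaling formula and the update equations above can be sketched as follows (the gradient values here are assumed, purely for illustration):</p>

```python
import numpy as np

def standardize(X):
    """Scale each feature to mean 0 and standard deviation 1 (equation 1)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_scaled = standardize(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # means ~0, stds ~1

# One update step (equations 4 and 5) with made-up gradients of J
w, b, alpha = np.array([0.5, -1.0]), 0.1, 0.01
grad_w, grad_b = np.array([0.2, -0.4]), 0.3  # assumed values of the gradients
w = w - alpha * grad_w
b = b - alpha * grad_b
print(w, b)  # -> w is approximately [0.498, -0.996], b approximately 0.097
```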
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><a href="images/gradients.PNG" class="lightbox" data-gallery="quarto-lightbox-gallery-3" data-glightbox="description: .lightbox-desc-3" title="Figure 3: Gradient descent. An illustration of how gradient descent algorithm uses the first derivative of the loss function to follow downhill it’s minimum."><img src="https://imaddabbura.github.io/posts/optimization/images/gradients.PNG" class="quarto-figure quarto-figure-left figure-img" width="600" height="400" alt="Figure 3: Gradient descent. An illustration of how gradient descent algorithm uses the first derivative of the loss function to follow downhill it’s minimum."></a></p>
</figure>
</div>
<figcaption><strong>Figure 3</strong>: Gradient descent. An illustration of how gradient descent algorithm uses the first derivative of the loss function to follow downhill it’s minimum.</figcaption>
</figure>
</div>
<ul>
<li>Continue the process until the cost function converges. That is, until the error curve becomes flat and doesn’t change.</li>
<li>In addition, on each iteration, the step would be in the direction that gives the maximum change since it’s perpendicular to level curves at each step.</li>
</ul>
<p>Now let’s discuss the three variants of gradient descent algorithm. The main difference between them is the amount of data we use when computing the gradients for each learning step. The trade-off between them is the accuracy of the gradient versus the time complexity to perform each parameter’s update (learning step).</p>
</section>
<section id="batch-gradient-descent" class="level2">
<h2 class="anchored" data-anchor-id="batch-gradient-descent">Batch Gradient Descent</h2>
<p>Batch Gradient Descent sums over all examples on each iteration when performing the parameter updates. Therefore, for each update, we have to go over the entire dataset: <img src="https://latex.codecogs.com/png.latex?w%20=%20w%20-%20%5Calpha%20%5Cnabla_w%20J(w)%5Ctag%7B6%7D"></p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(num_epochs):</span>
<span id="cb1-2">    grad <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> compute_gradient(data, params)</span>
<span id="cb1-3">    params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> learning_rate <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> grad</span></code></pre></div>
<p>The main advantages:</p>
<ul>
<li>We can use a fixed learning rate during training without worrying about learning rate decay.</li>
<li>It has a straight trajectory toward the minimum and, in theory, is guaranteed to converge to the global minimum if the loss function is convex, and to a local minimum otherwise.</li>
<li>It gives an unbiased estimate of the gradients. The more examples, the lower the standard error.</li>
</ul>
<p>The main disadvantages:</p>
<ul>
<li>Even though we can use a vectorized implementation, it may still be slow to go over all examples, especially with large datasets.</li>
<li>Each learning step happens only after going over all examples, some of which may be redundant and contribute little to the update.</li>
</ul>
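<p>The batch loop shown above can be made runnable on a tiny linear-regression problem; the <code>compute_gradient</code> here is an assumed mean-squared-error gradient and the data is synthetic, purely for illustration:</p>

```python
import numpy as np

def compute_gradient(X, y, w):
    """Gradient of the mean-squared-error cost, averaged over ALL examples."""
    m = len(y)
    return (X.T @ (X @ w - y)) / m

# Tiny synthetic dataset following y = 2x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = np.zeros(1)
learning_rate = 0.05
for _ in range(500):  # every epoch uses the full dataset
    w = w - learning_rate * compute_gradient(X, y, w)

print(w)  # converges to approximately [2.]
```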
</section>
<section id="mini-batch-gradient-descent" class="level2">
<h2 class="anchored" data-anchor-id="mini-batch-gradient-descent">Mini-Batch Gradient Descent</h2>
<p>Instead of going over all examples, Mini-batch Gradient Descent sums over a smaller number of examples based on the batch size. Therefore, learning happens on each mini-batch of <img src="https://latex.codecogs.com/png.latex?b"> examples:</p>
<p><img src="https://latex.codecogs.com/png.latex?w%20=%20w%20-%20%5Calpha%20%5Cnabla_w%20J(x%5E%7B%5C%7Bi:i%20+%20b%5C%7D%7D,%20y%5E%7B%5C%7Bi:%20i%20+%20b%5C%7D%7D;%20w)%5Ctag%7B7%7D%5C%5C%7B%7D"></p>
<ul>
<li>Shuffle the training dataset to avoid pre-existing order of examples.</li>
<li>Partition the training dataset into mini-batches based on the batch size. If the training set size is not divisible by the batch size, the remaining examples form their own (smaller) batch.</li>
</ul>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(num_epochs):</span>
<span id="cb2-2">    np.random.shuffle(data)</span>
<span id="cb2-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> batch <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> random_minibatches(data, batch_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span>):</span>
<span id="cb2-4">        grad <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> compute_gradient(batch, params)</span>
<span id="cb2-5">        params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> learning_rate <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> grad</span></code></pre></div>
<p>The batch size is a hyperparameter we can tune. It is usually chosen as a power of 2, such as 32, 64, 128, 256, 512, etc. The reason is that some hardware, such as GPUs, achieves better runtime with batch sizes that are powers of 2.</p>
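<p>The <code>random_minibatches</code> helper used in the pseudocode above isn’t defined in this post; one assumed implementation could look like this:</p>

```python
import numpy as np

def random_minibatches(data, batch_size=32):
    """Yield shuffled mini-batches; the last batch may be smaller when the
    dataset size is not divisible by the batch size."""
    indices = np.random.permutation(len(data))
    for start in range(0, len(data), batch_size):
        yield data[indices[start:start + batch_size]]

data = np.arange(100).reshape(50, 2)  # 50 examples, 2 features each
sizes = [len(batch) for batch in random_minibatches(data, batch_size=32)]
print(sizes)  # -> [32, 18]
```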
<p>The main advantages:</p>
<ul>
<li>Faster than the Batch version because it goes through far fewer examples per update than Batch (which uses all examples).</li>
<li>Randomly selecting examples helps avoid redundant or very similar examples that contribute little to the learning.</li>
<li>With batch size &lt; size of the training set, it adds noise to the learning process that helps improve generalization error.</li>
<li>Even though more examples yield a gradient estimate with lower standard error, the return is less than linear relative to the computational burden we incur.</li>
</ul>
<p>The main disadvantages:</p>
<ul>
<li>It won’t converge exactly. On each iteration, the learning step may go back and forth due to the noise, so it wanders around the minimum region but never settles at the minimum.</li>
<li>Due to the noise, the learning steps have more oscillations (see figure 4) and require adding learning-rate decay to decrease the learning rate as we get closer to the minimum.</li>
</ul>
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><a href="images/batch-vs-minibatch.PNG" class="lightbox" data-gallery="quarto-lightbox-gallery-4" data-glightbox="description: .lightbox-desc-4" title="Figure 4: Gradient descent: batch versus mini-batch loss function"><img src="https://imaddabbura.github.io/posts/optimization/images/batch-vs-minibatch.PNG" class="quarto-figure quarto-figure-left figure-img" width="800" height="300" alt="Figure 4: Gradient descent: batch versus mini-batch loss function"></a></p>
</figure>
</div>
<figcaption><strong>Figure 4</strong>: Gradient descent: batch versus mini-batch loss function</figcaption>
</figure>
</div>
<p>With large training datasets, we usually don’t need more than 2-10 passes over all training examples (epochs). Note: with batch size <img src="https://latex.codecogs.com/png.latex?b%20=%20m">, we recover Batch Gradient Descent.</p>
</section>
<section id="stochastic-gradient-descent" class="level2">
<h2 class="anchored" data-anchor-id="stochastic-gradient-descent">Stochastic Gradient Descent</h2>
<p>Instead of going through all examples, Stochastic Gradient Descent (SGD) performs the parameters update on each example <img src="https://latex.codecogs.com/png.latex?(x%5Ei,%20y%5Ei)">. Therefore, learning happens on every example:</p>
<p><img src="https://latex.codecogs.com/png.latex?w%20=%20w%20-%20%5Calpha%20%5Cnabla_w%20J(x%5Ei,%20y%5Ei;%20w)%5Ctag%7B7%7D"></p>
<ul>
<li>Shuffle the training dataset to avoid pre-existing order of examples.</li>
<li>Treat each of the <img src="https://latex.codecogs.com/png.latex?m"> examples as its own mini-batch.</li>
</ul>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(num_epochs):</span>
<span id="cb3-2">    np.random.shuffle(data)</span>
<span id="cb3-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> example <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> data:</span>
<span id="cb3-4">        grad <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> compute_gradient(example, params)</span>
<span id="cb3-5">        params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> learning_rate <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> grad</span></code></pre></div>
<p>It shares most of the advantages and disadvantages of the mini-batch version. Below are the ones that are specific to SGD:</p>
<ul>
<li>It adds even more noise to the learning process than mini-batch, which helps improve generalization error. However, this also increases the run time.</li>
<li>We can’t exploit vectorization over a single example, so it becomes very slow. Also, the variance of the gradient estimate is large since we use only one example per learning step.</li>
</ul>
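<p>The variance point can be illustrated with a quick simulation (on assumed synthetic data): the spread of the gradient estimate shrinks as the batch size grows:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=10_000)
y = 3 * X + rng.normal(scale=0.1, size=10_000)

def grad_estimate(batch_size, w=0.0):
    """MSE gradient w.r.t. w estimated from one random batch."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    return np.mean((w * X[idx] - y[idx]) * X[idx])

stds = []
for b in (1, 32, 10_000):
    grads = [grad_estimate(b) for _ in range(200)]
    stds.append(np.std(grads))
print(stds)  # the standard deviation shrinks as the batch size grows
```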
<p>Below is a graph that shows the gradient descent’s variants and their direction towards the minimum:</p>
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><a href="images/batch-vs-minibatch-vs-stochastic.PNG" class="lightbox" data-gallery="quarto-lightbox-gallery-5" data-glightbox="description: .lightbox-desc-5" title="Figure 5: Gradient descent variants’ trajectory towards minimum"><img src="https://imaddabbura.github.io/posts/optimization/images/batch-vs-minibatch-vs-stochastic.PNG" class="quarto-figure quarto-figure-left figure-img" width="600" height="300" alt="Figure 5: Gradient descent variants’ trajectory towards minimum"></a></p>
</figure>
</div>
<figcaption><strong>Figure 5</strong>: Gradient descent variants’ trajectory towards minimum</figcaption>
</figure>
</div>
<p>As the figure above shows, SGD’s trajectory is very noisy compared to mini-batch’s.</p>
</section>
<section id="challenges" class="level2">
<h2 class="anchored" data-anchor-id="challenges">Challenges</h2>
<p>Below are some challenges regarding gradient descent algorithm in general as well as its variants - mainly batch and mini-batch:</p>
<ul>
<li>Gradient descent is a first-order optimization algorithm, which means it doesn’t take the second derivatives of the cost function into account. However, the curvature of the function affects the size of each learning step. The gradient measures the steepness of the curve, while the second derivative measures its curvature. Therefore, if:</li>
<li>Second derivative = 0 <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> the curvature is linear. Therefore, the step size = the learning rate <img src="https://latex.codecogs.com/png.latex?%5Calpha">.</li>
<li>Second derivative &gt; 0 <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> the curvature is going upward. Therefore, the step size &lt; the learning rate <img src="https://latex.codecogs.com/png.latex?%5Calpha"> and may lead to divergence.</li>
<li>Second derivative &lt; 0 <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> the curvature is going downward. Therefore, the step size &gt; the learning rate <img src="https://latex.codecogs.com/png.latex?%5Calpha">.</li>
</ul>
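<p>A toy quadratic (an assumed example, not from the original post) shows how curvature changes the actual decrease relative to what the first derivative alone predicts:</p>

```python
# For f(w) = 0.5 * c * w**2: gradient g = c * w, second derivative = c.
def true_decrease(c, w, alpha):
    """Actual decrease in f after one gradient step w - alpha * g."""
    g = c * w
    w_new = w - alpha * g
    return 0.5 * c * w**2 - 0.5 * c * w_new**2

def first_order_prediction(c, w, alpha):
    """Decrease predicted by the gradient alone: alpha * g**2."""
    g = c * w
    return alpha * g**2

w, alpha = 1.0, 0.1
for c in (1.0, 10.0, 25.0):  # increasing curvature
    print(c, first_order_prediction(c, w, alpha), true_decrease(c, w, alpha))
# With c = 25 the true "decrease" is negative: the step overshoots and diverges.
```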
<p>As a result, a direction that looks promising to the gradient may not actually be so, and may slow the learning process or even cause divergence.</p>
<ul>
<li>If the Hessian matrix has a poor condition number, i.e.&nbsp;the direction of most curvature has much more curvature than the direction of least curvature, the cost function will be very sensitive in some directions and insensitive in others. This makes things harder for the gradient, because a direction that looks promising to the gradient may not lead to big changes in the cost function (see figure 6).</li>
</ul>
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><a href="images/curvature.PNG" class="lightbox" data-gallery="quarto-lightbox-gallery-6" data-glightbox="description: .lightbox-desc-6" title="Figure 6: Gradient descent fails to exploit the curvature information contained in the Hessian matrix. Source"><img src="https://imaddabbura.github.io/posts/optimization/images/curvature.PNG" class="quarto-figure quarto-figure-left figure-img" height="400" alt="Figure 6: Gradient descent fails to exploit the curvature information contained in the Hessian matrix. Source"></a></p>
</figure>
</div>
<figcaption><strong>Figure 6</strong>: Gradient descent fails to exploit the curvature information contained in the Hessian matrix. <a href="http://www.deeplearningbook.org/contents/numerical.html">Source</a></figcaption>
</figure>
</div>
<ul>
<li>The norm of the gradient <img src="https://latex.codecogs.com/png.latex?g%5ETg"> is supposed to decrease slowly with each learning step because the curve gets flatter and its steepness decreases. However, we often see the norm of the gradient increasing because of the curvature of the curve. Nonetheless, even though the gradients’ norm is increasing, we’re able to achieve very low error rates (see figure 7).</li>
</ul>
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><a href="images/gradient_norm.PNG" class="lightbox" data-gallery="quarto-lightbox-gallery-7" data-glightbox="description: .lightbox-desc-7" title="Figure 7: Gradient norm. Source"><img src="https://imaddabbura.github.io/posts/optimization/images/gradient_norm.PNG" class="quarto-figure quarto-figure-left figure-img" width="600" height="300" alt="Figure 7: Gradient norm. Source"></a></p>
</figure>
</div>
<figcaption><strong>Figure 7</strong>: Gradient norm. <a href="http://www.deeplearningbook.org/contents/optimization.html">Source</a></figcaption>
</figure>
</div>
<ul>
<li>In low dimensions, local minima are common; however, in high dimensions, saddle points are more common. A saddle point is where the function curves up in some directions and down in others. In other words, a saddle point looks like a minimum from one direction and a maximum from another (see figure 8). This happens when at least one eigenvalue of the Hessian matrix is negative and the rest are positive.</li>
</ul>
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><a href="images/saddle.PNG" class="lightbox" data-gallery="quarto-lightbox-gallery-8" data-glightbox="description: .lightbox-desc-8" title="Figure 8: Saddle point"><img src="https://imaddabbura.github.io/posts/optimization/images/saddle.PNG" class="quarto-figure quarto-figure-left figure-img" width="600" height="300" alt="Figure 8: Saddle point"></a></p>
</figure>
</div>
<figcaption><strong>Figure 8</strong>: Saddle point</figcaption>
</figure>
</div>
<ul>
<li>As discussed previously, choosing a proper learning rate is hard. Also, for mini-batch gradient descent, we have to adjust the learning rate during the training process to make sure it converges to the local minimum and not wander around it. Figuring out the decay rate of the learning rate is also hard and changes with different datasets.</li>
<li>All parameters share the same learning rate; however, we may want to perform larger updates for parameters whose directional derivatives are more in line with the trajectory toward the minimum than others.</li>
</ul>



</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Deep Learning</category>
  <guid>https://imaddabbura.github.io/posts/optimization/gradient-descent.html</guid>
  <pubDate>Mon, 18 Feb 2019 06:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/optimization/images/gradient_cover.PNG" medium="image"/>
</item>
<item>
  <title>Conda Essentials</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/swe/conda-essentials.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge growing">growing</span></div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p><strong>Conda</strong> is an open source package management system that works on all platforms. It is a tool that helps manage packages and environments for different programming languages. Developing a high-level understanding of how Conda works has helped me at so many levels, especially when it comes to managing environments and making my work more reproducible. Below are the notes that I wrote down during my journey of learning Conda, and I always refer back to them:</p>
</section>
<section id="general" class="level2">
<h2 class="anchored" data-anchor-id="general">General</h2>
<ul>
<li>Conda packages are files and executables that can in principle contain images, data, notebooks, etc.</li>
<li>Conda is mainly used in the Python ecosystem; however, it can be used with other languages such as R, Julia, Scala, etc.</li>
<li>When installing a package using Conda, it installs its dependencies with it. Also, Conda is able to figure out the platform you’re using without the need to specify the platform when installing packages.</li>
<li>When installing a package, Conda:
<ul>
<li>Checks the platform.</li>
<li>Checks the Python version.</li>
<li>Installs the latest version of the package that is compatible with Python.</li>
<li>If it has dependencies, installs the latest versions of the dependencies that are also compatible with each other.</li>
</ul></li>
<li>Under semantic versioning, software is labeled with a three-part version identifier of the form <code>MAJOR.MINOR.PATCH</code>; the label components are non-negative integers separated by periods. Assuming all software starts at version 0.0.0, the <code>MAJOR</code> version number is increased when significant new functionality is introduced (often with corresponding API changes). Increases in the <code>MINOR</code> version number generally reflect improvements (e.g., new features) that avoid backward-incompatible API changes. For instance, adding an optional argument to a function API (in a way that allows old code to run unchanged) is a change worthy of increasing the <code>MINOR</code> version number. An increment to the <code>PATCH</code> version number is appropriate mostly for bug fixes that preserve the same <code>MAJOR</code> and <code>MINOR</code> revision numbers. Software patches do not typically introduce new features or change APIs at all (except sometimes to address security issues).</li>
<li>We can specify <code>MAJOR</code>, <code>MAJOR.MINOR</code>, or <code>MAJOR.MINOR.PATCH</code> when installing any package.</li>
<li>We can use logical operators to install versions of a package. Examples:
<ul>
<li><code>conda install 'python=3.6|3.7'</code>.</li>
<li><code>conda install 'python=3.6|3.7*'</code> .</li>
<li><code>conda install 'python&gt;=3.6, &lt;=3.7'</code>.</li>
</ul></li>
</ul>
</section>
<section id="common-commands" class="level2">
<h2 class="anchored" data-anchor-id="common-commands">Common Commands</h2>
<ul>
<li>To update a package, <code>conda update pckg</code>.</li>
<li>To uninstall a package, <code>conda remove pckg</code>.</li>
<li>To search which versions of a specific package are available, use <code>conda search pckg</code>.</li>
<li><code>conda list</code> will list all installed packages.</li>
<li><code>conda list -n env-name</code> will list all packages in the environment env-name.</li>
<li><code>conda list pckg</code> will give information about pckg.</li>
<li>When installing a pckg without including a channel, it defaults to the main channel that is maintained by Anaconda Inc.</li>
<li>There are other channels where people can upload their packages, and we can point to those channels when installing packages such as fastai. We use <code>conda install -c fastai fastai</code>. Here the channel is fastai and the pckg is also fastai.</li>
<li><code>conda search -c conda-forge -c fastai --override-channels --platform osx-64 fastai</code> means:
<ul>
<li>Search for fastai in two channels: conda-forge, fastai.</li>
<li><code>--override-channels</code> means do not search the default main channel.</li>
<li><code>--platform</code> specifies which platform.</li>
</ul></li>
<li>Sometimes we don’t know the channel of a pckg; we can use <code>anaconda search pckg</code>, which returns all the channels that host the pckg and their versions.</li>
<li>conda-forge, which is led by the community, is almost as good as the main channel and has a lot more packages.</li>
<li>There is no system that rates channels, so be careful when installing packages from any channel.</li>
<li>We can list all packages in a channel; for example, <code>conda search -c conda-forge --override-channels</code> lists all packages in the conda-forge channel.</li>
</ul>
</section>
<section id="environments" class="level2">
<h2 class="anchored" data-anchor-id="environments">Environments</h2>
<ul>
<li>Environments are a good practice of documenting data science/software development work.</li>
<li>Environments are nothing more than a directory that contains all the packages, so that when trying to import them, they are imported from this directory only. We can use <code>conda env list</code> to see all the available environments on our machine.</li>
<li>To get the packages from a specific environment by name, use <code>conda list -n env-name</code>. Otherwise, we get the packages from the current environment.</li>
<li>To activate an environment, use <code>conda activate env-name</code>. To deactivate, <code>conda deactivate</code>.</li>
<li>Environments usually don’t take a lot of space.</li>
<li>We can remove environments using <code>conda env remove -n env-name</code>.</li>
<li>To create an environment, use <code>conda create -n env-name</code>. We can also add additional package names to install after creation such as <code>conda create -n env-name python=3.6* numpy&gt;=1.1</code>.</li>
<li>To export an environment, use <code>conda env export -n env-name</code>. This will print the output to the terminal. We can also export to a file; for that, use <code>conda env export -n env-name -f env-name.yml</code>. The ‘.yml’ extension is strongly encouraged. Doing this ensures that all the packages used can be installed by others exactly.</li>
<li>We can also create an environment from a .yml file using <code>conda env create -f env-name.yml</code>. Note also that if we only use <code>conda env create</code>, it will look for a file that has the .yml extension and the same name as env-name in the current local directory. Moreover, we can write the .yml file ourselves, without doing the export, and only specify what is important in our environments.</li>
</ul>
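<p>As an illustration of the last point, a hand-written environment file might look like the following (the name, channels, and version pins are made up for this example):</p>

```yaml
name: env-name
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.6
  - numpy>=1.1
  - pandas
```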


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Software Engineering</category>
  <guid>https://imaddabbura.github.io/posts/swe/conda-essentials.html</guid>
  <pubDate>Mon, 18 Feb 2019 06:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/swe/images/conda-essentials.png" medium="image" type="image/png" height="58" width="144"/>
</item>
<item>
  <title>K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks</title>
  <dc:creator>Imad Dabbura</dc:creator>
  <link>https://imaddabbura.github.io/posts/ml/Kmeans-Clustering.html</link>
  <description><![CDATA[ 





<div class="status-badge-container" style="margin-bottom: 1rem;"><span class="status-badge growing">growing</span></div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p><strong>Clustering</strong> is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean distance or correlation-based distance. The decision of which similarity measure to use is application-specific.</p>
<p>Clustering analysis can be done on the basis of features, where we try to find subgroups of samples based on features, or on the basis of samples, where we try to find subgroups of features based on samples. We’ll cover here clustering based on features. Clustering is used in market segmentation, where we try to find customers that are similar to each other in terms of behaviors or attributes; image segmentation/compression, where we try to group similar regions together; document clustering based on topics; etc.</p>
<p>Unlike supervised learning, clustering is considered an unsupervised learning method since we don’t have the ground truth to compare the output of the clustering algorithm to the true labels to evaluate its performance. We only want to try to investigate the structure of the data by grouping the data points into distinct subgroups.</p>
<p>In this post, we will cover only <strong>Kmeans</strong>, which is considered one of the most widely used clustering algorithms due to its simplicity.</p>
</section>
<section id="kmeans-algorithm" class="level2">
<h2 class="anchored" data-anchor-id="kmeans-algorithm">Kmeans Algorithm</h2>
<p><strong>Kmeans</strong> algorithm is an iterative algorithm that tries to partition the dataset into <img src="https://latex.codecogs.com/png.latex?K"> pre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to <strong>only one group</strong>. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far) as possible. It assigns data points to a cluster such that the sum of the squared distance between the data points and the cluster’s centroid (arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.</p>
<p>The way kmeans algorithm works is as follows:</p>
<ol type="1">
<li>Specify number of clusters <img src="https://latex.codecogs.com/png.latex?K">.</li>
<li>Initialize centroids by first shuffling the dataset and then randomly selecting <img src="https://latex.codecogs.com/png.latex?K"> data points for the centroids without replacement.</li>
<li>Keep iterating until there is no change to the centroids, i.e., the assignment of data points to clusters isn’t changing.
<ul>
<li>Compute the sum of the squared distance between data points and all centroids.</li>
<li>Assign each data point to the closest cluster (centroid).</li>
<li>Compute the centroids for the clusters by taking the average of all data points that belong to each cluster.</li>
</ul></li>
</ol>
<p>The approach kmeans follows to solve the problem is called <strong>Expectation-Maximization</strong>. The E-step is assigning the data points to the closest cluster. The M-step is computing the centroid of each cluster. Below is a breakdown of how we can solve it mathematically (feel free to skip it).</p>
<p>The objective function is: <img src="https://latex.codecogs.com/png.latex?J%20=%20%5Csum_%7Bi%20=%201%7D%5E%7Bm%7D%5Csum_%7Bk%20=%201%7D%5E%7BK%7Dw_%7Bik%7D%5C%7Cx%5Ei%20-%20%5Cmu_k%5C%7C%5E2%5C%5C%7B%7D"> where <img src="https://latex.codecogs.com/png.latex?w_%7Bik%7D%20=%201"> for data point <img src="https://latex.codecogs.com/png.latex?x%5Ei"> if it belongs to cluster <img src="https://latex.codecogs.com/png.latex?k">; otherwise, <img src="https://latex.codecogs.com/png.latex?w_%7Bik%7D%20=%200">. Also, <img src="https://latex.codecogs.com/png.latex?%5Cmu_k"> is the centroid of <img src="https://latex.codecogs.com/png.latex?x%5Ei">’s cluster.</p>
<p>It’s a minimization problem of two parts. We first minimize J w.r.t. <img src="https://latex.codecogs.com/png.latex?w_%7Bik%7D"> and treat <img src="https://latex.codecogs.com/png.latex?%5Cmu_k"> fixed. Then we minimize J w.r.t. <img src="https://latex.codecogs.com/png.latex?%5Cmu_k"> and treat <img src="https://latex.codecogs.com/png.latex?w_%7Bik%7D"> fixed. Technically speaking, we differentiate J w.r.t. <img src="https://latex.codecogs.com/png.latex?w_%7Bik%7D"> first and update cluster assignments (<em>E-step</em>). Then we differentiate J w.r.t. <img src="https://latex.codecogs.com/png.latex?%5Cmu_%7Bk%7D"> and recompute the centroids after the cluster assignments from previous step (<em>M-step</em>). Therefore, E-step is: <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20J%7D%7B%5Cpartial%20w_%7Bik%7D%7D%20=%20%5Csum_%7Bi%20=%201%7D%5E%7Bm%7D%5Csum_%7Bk%20=%201%7D%5E%7BK%7D%5C%7Cx%5Ei%20-%20%5Cmu_k%5C%7C%5E2%5C%5C%7B%7D"> <img src="https://latex.codecogs.com/png.latex?%0A%5CRightarrow%0A%5Cbegin%7Bequation%7D%0A%20%20w_%7Bik%7D%20=%20%5Cbegin%7Bcases%7D%0A%20%20%20%201%20&amp;%20%5Ctext%7Bif%20$k%20=%20arg%20min_j%5C%20%5C%7Cx%5Ei%20-%20%5Cmu_j%5C%7C%5E2$%7D%5C%5C%0A%20%20%20%200%20&amp;%20%5Ctext%7Botherwise%7D.%0A%20%20%5Cend%7Bcases%7D%0A%5Cend%7Bequation%7D%5Ctag%7B1%7D%5C%5C%7B%7D%0A"> In other words, assign the data point <img src="https://latex.codecogs.com/png.latex?x%5Ei"> to the closest cluster judged by its sum of squared distance from cluster’s centroid.</p>
<p>And M-step is: <img src="https://latex.codecogs.com/png.latex?%5C%20%5Cfrac%7B%5Cpartial%20J%7D%7B%5Cpartial%20%5Cmu_k%7D%20=%202%5Csum_%7Bi%20=%201%7D%5E%7Bm%7Dw_%7Bik%7D(x%5Ei%20-%20%5Cmu_k)%20=%200%5C%5C%7B%7D"> <img src="https://latex.codecogs.com/png.latex?%5CRightarrow%20%5Cmu_k%20=%20%5Cfrac%7B%5Csum_%7Bi%20=%201%7D%5E%7Bm%7Dw_%7Bik%7Dx%5Ei%7D%7B%5Csum_%7Bi%20=%201%7D%5E%7Bm%7Dw_%7Bik%7D%7D%5Ctag%7B2%7D%5C%5C%7B%7D"> Which translates to recomputing the centroid of each cluster to reflect the new assignments.</p>
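<p>One E-step/M-step iteration of equations (1) and (2) takes only a few lines of NumPy. The sketch below uses random data and initializes the centroids from the first <code>K</code> points purely for brevity:</p>

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(10, 2)        # 10 data points, 2 features
K = 3
centroids = X[:K].copy()    # illustrative initialization

# E-step: assign each point to its nearest centroid (equation 1).
sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = sq_dist.argmin(axis=1)

# M-step: recompute each centroid as the mean of its assigned points (equation 2).
centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
print(labels.shape, centroids.shape)  # (10,) (3, 2)
```

The full algorithm simply repeats these two steps until the assignments stop changing, as shown in the implementation later in the post.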
<p>Few things to note here:</p>
<ul>
<li>Since clustering algorithms including kmeans use distance-based measurements to determine the similarity between data points, it’s recommended to standardize the data to have a mean of zero and a standard deviation of one since almost always the features in any dataset would have different units of measurements such as age vs income.</li>
<li>Given kmeans’ iterative nature and the random initialization of centroids at the start of the algorithm, different initializations may lead to different clusters, since the algorithm may <em>get stuck in a local optimum and not converge to the global optimum</em>. Therefore, it’s recommended to run the algorithm with different centroid initializations and pick the results of the run that yielded the lowest sum of squared distances.</li>
<li>No change in the assignment of examples is equivalent to no change in within-cluster variation: <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bm_k%7D%5Csum_%7Bi%20=%201%7D%5E%7Bm_k%7D%5C%7Cx%5Ei%20-%20%5Cmu_%7Bc%5Ek%7D%5C%7C%5E2"></li>
</ul>
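<p>The first note can be illustrated directly: with features on very different scales (e.g., age vs. income), the larger-scale feature dominates squared Euclidean distances. A minimal NumPy sketch of standardization (the synthetic values are invented for illustration; <code>sklearn</code>’s <code>StandardScaler</code> performs the same transformation):</p>

```python
import numpy as np

rng = np.random.RandomState(0)
age = rng.normal(40, 10, 200)           # years
income = rng.normal(50000, 15000, 200)  # dollars
X = np.column_stack([age, income])

# Standardize each feature: zero mean, unit standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Before scaling, income's spread swamps age's in any distance computation;
# after scaling, both features contribute comparably.
print(X.std(axis=0))         # roughly [10, 15000]
print(X_scaled.std(axis=0))  # ~[1. 1.]
```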
</section>
<section id="implementation" class="level2">
<h2 class="anchored" data-anchor-id="implementation">Implementation</h2>
<p>We’ll use a simple implementation of kmeans here just to illustrate some concepts. Then we will use the <code>sklearn</code> implementation, which is more efficient and takes care of many things for us.</p>
<div id="cell-7" class="cell" data-code_folding="[]" data-execution_count="1">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> numpy.linalg <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> norm</span>
<span id="cb1-3"></span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">class</span> Kmeans:</span>
<span id="cb1-6">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">'''Implementing Kmeans algorithm.'''</span></span>
<span id="cb1-7"></span>
<span id="cb1-8">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, n_clusters, max_iter<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>):</span>
<span id="cb1-9">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_clusters <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> n_clusters</span>
<span id="cb1-10">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.max_iter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> max_iter</span>
<span id="cb1-11">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.random_state <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> random_state</span>
<span id="cb1-12"></span>
<span id="cb1-13">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> initializ_centroids(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, X):</span>
<span id="cb1-14">        rng <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.RandomState(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.random_state)</span>
<span id="cb1-15">        random_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> rng.permutation(X.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb1-16">        centroids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X[random_idx[:<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_clusters]]</span>
<span id="cb1-17">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> centroids</span>
<span id="cb1-18"></span>
<span id="cb1-19">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> compute_centroids(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, X, labels):</span>
<span id="cb1-20">        centroids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros((<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_clusters, X.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]))</span>
<span id="cb1-21">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> k <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_clusters):</span>
<span id="cb1-22">            centroids[k, :] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.mean(X[labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> k, :], axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb1-23">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> centroids</span>
<span id="cb1-24"></span>
<span id="cb1-25">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> compute_distance(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, X, centroids):</span>
<span id="cb1-26">        distance <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros((X.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_clusters))</span>
<span id="cb1-27">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> k <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_clusters):</span>
<span id="cb1-28">            row_norm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> norm(X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> centroids[k, :], axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-29">            distance[:, k] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.square(row_norm)</span>
<span id="cb1-30">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> distance</span>
<span id="cb1-31"></span>
<span id="cb1-32">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> find_closest_cluster(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, distance):</span>
<span id="cb1-33">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> np.argmin(distance, axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-34"></span>
<span id="cb1-35">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> compute_sse(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, X, labels, centroids):</span>
<span id="cb1-36">        distance <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros(X.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb1-37">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> k <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n_clusters):</span>
<span id="cb1-38">            distance[labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> k] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> norm(X[labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> k] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> centroids[k], axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-39">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(np.square(distance))</span>
<span id="cb1-40">    </span>
<span id="cb1-41">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> fit(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, X):</span>
<span id="cb1-42">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.centroids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.initializ_centroids(X)</span>
<span id="cb1-43">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.max_iter):</span>
<span id="cb1-44">            old_centroids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.centroids</span>
<span id="cb1-45">            distance <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.compute_distance(X, old_centroids)</span>
<span id="cb1-46">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.find_closest_cluster(distance)</span>
<span id="cb1-47">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.centroids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.compute_centroids(X, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.labels)</span>
<span id="cb1-48">            <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">if</span> np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(old_centroids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.centroids):</span>
<span id="cb1-49">                <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">break</span></span>
<span id="cb1-50">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.error <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.compute_sse(X, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.labels, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.centroids)</span>
<span id="cb1-51">    </span>
<span id="cb1-52">    <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">def</span> predict(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, X):</span>
<span id="cb1-53">        distance <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.compute_distance(X, self.centroids)</span>
<span id="cb1-54">        <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.find_closest_cluster(distance)</span></code></pre></div>
</details>
</div>
</section>
<section id="applications" class="level2">
<h2 class="anchored" data-anchor-id="applications">Applications</h2>
<p>The kmeans algorithm is very popular and used in a variety of applications such as market segmentation, document clustering, image segmentation, and image compression. The goal when we undertake a cluster analysis is usually one of the following:</p>
<ol type="1">
<li>Get a meaningful intuition of the structure of the data we’re dealing with.</li>
<li>Cluster-then-predict, where different models are built for different subgroups, if we believe there is a wide variation in the behaviors of different subgroups. An example is clustering patients into different subgroups and building a model for each subgroup to predict the risk of having a heart attack.</li>
</ol>
<p>In this post, we’ll apply clustering to two cases:</p>
<ul>
<li>Geyser eruptions segmentation (2-D dataset).</li>
<li>Image compression.</li>
</ul>
<section id="kmeans-on-geysers-eruptions-segmentation" class="level3">
<h3 class="anchored" data-anchor-id="kmeans-on-geysers-eruptions-segmentation">Kmeans on Geyser’s Eruptions Segmentation</h3>
<p>We’ll first apply the kmeans algorithm to a 2-D dataset and see how it works. The dataset has 272 observations and 2 features. The data covers the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. We will try to find <img src="https://latex.codecogs.com/png.latex?K"> subgroups within the data points and group them accordingly. Below is the description of the features:</p>
<ul>
<li>eruptions (float): Eruption time in minutes.</li>
<li>waiting (int): Waiting time to next eruption.</li>
</ul>
<p>Let’s plot the data first:</p>
<div id="cell-12" class="cell" data-code_folding="[]" data-execution_count="2">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Modules</span></span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> matplotlib.image <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> imread</span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb2-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> seaborn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> sns</span>
<span id="cb2-5a"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb2-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> (make_blobs,</span>
<span id="cb2-7">                                                make_circles,</span>
<span id="cb2-8">                                                make_moons)</span>
<span id="cb2-9"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.cluster <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> KMeans, SpectralClustering</span>
<span id="cb2-10"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.preprocessing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> StandardScaler</span>
<span id="cb2-11"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.metrics <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> silhouette_samples, silhouette_score</span>
<span id="cb2-12"></span>
<span id="cb2-13"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span>matplotlib inline</span>
<span id="cb2-14">sns.set_context(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'notebook'</span>)</span>
<span id="cb2-15">plt.style.use(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'fivethirtyeight'</span>)</span>
<span id="cb2-16"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> warnings <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> filterwarnings</span>
<span id="cb2-17">filterwarnings(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'ignore'</span>)</span></code></pre></div>
</details>
</div>
<div id="cell-13" class="cell" data-execution_count="3">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Import the data</span></span>
<span id="cb3-2">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'../data/old_faithful.csv'</span>)</span>
<span id="cb3-3"></span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot the data</span></span>
<span id="cb3-5">plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb3-6">plt.scatter(df.iloc[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], df.iloc[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb3-7">plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Eruption time in mins'</span>)</span>
<span id="cb3-8">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Waiting time to next eruption'</span>)</span>
<span id="cb3-9">plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Visualization of raw data'</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Kmeans-Clustering_files/figure-html/cell-4-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://imaddabbura.github.io/posts/ml/Kmeans-Clustering_files/figure-html/cell-4-output-1.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
</div>
<p>We’ll use this data because it’s easy to plot and to visually spot the clusters, since it’s a 2-dimensional dataset. It’s obvious that we have 2 clusters. Let’s standardize the data first and run the kmeans algorithm on the standardized data with <img src="https://latex.codecogs.com/png.latex?K%20=%202">.</p>
<div id="cell-15" class="cell" data-code_folding="[]" data-execution_count="4">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Standardize the data</span></span>
<span id="cb4-2">X_std <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> StandardScaler().fit_transform(df)</span>
<span id="cb4-3"></span>
<span id="cb4-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Run local implementation of kmeans</span></span>
<span id="cb4-5">km <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Kmeans(n_clusters<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, max_iter<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb4-6">km.fit(X_std)</span>
<span id="cb4-7">centroids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> km.centroids</span>
<span id="cb4-8"></span>
<span id="cb4-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot the clustered data</span></span>
<span id="cb4-10">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb4-11">plt.scatter(X_std[km.labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X_std[km.labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],</span>
<span id="cb4-12">            c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'green'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cluster 1'</span>)</span>
<span id="cb4-13">plt.scatter(X_std[km.labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X_std[km.labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],</span>
<span id="cb4-14">            c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'blue'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cluster 2'</span>)</span>
<span id="cb4-15">plt.scatter(centroids[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], centroids[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'*'</span>, s<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>,</span>
<span id="cb4-16">            c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'r'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'centroid'</span>)</span>
<span id="cb4-17">plt.legend()</span>
<span id="cb4-18">plt.xlim([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb4-19">plt.ylim([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb4-20">plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Eruption time in mins'</span>)</span>
<span id="cb4-21">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Waiting time to next eruption'</span>)</span>
<span id="cb4-22">plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Visualization of clustered data'</span>, fontweight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bold'</span>)</span>
<span id="cb4-23">ax.set_aspect(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'equal'</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Kmeans-Clustering_files/figure-html/cell-5-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://imaddabbura.github.io/posts/ml/Kmeans-Clustering_files/figure-html/cell-5-output-1.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
</div>
<p>The above graph shows the scatter plot of the data colored by the cluster they belong to. In this example, we chose <img src="https://latex.codecogs.com/png.latex?K%20=%202">. The <strong>‘*’</strong> symbol marks the centroid of each cluster. We can think of these 2 clusters as the geyser exhibiting different kinds of behavior under different scenarios.</p>
<p>Next, we’ll show that different initializations of the centroids may yield different results. I’ll use 9 different <code>random_state</code> values to change the initialization of the centroids and plot the results. The title of each plot is the sum of squared distance for that initialization.</p>
<p>As a side note, this dataset is considered very easy and converges in fewer than 10 iterations. Therefore, to see the effect of random initialization on convergence, I’ll use only 3 iterations to illustrate the concept. In real-world applications, however, datasets are rarely this clean and well behaved!</p>
<div id="cell-17" class="cell" data-execution_count="5">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">n_iter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span></span>
<span id="cb5-2">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span>))</span>
<span id="cb5-3">ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.ravel(ax)</span>
<span id="cb5-4">centers <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb5-5"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n_iter):</span>
<span id="cb5-6">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Run local implementation of kmeans</span></span>
<span id="cb5-7">    km <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Kmeans(n_clusters<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb5-8">                max_iter<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,</span>
<span id="cb5-9">                random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>np.random.randint(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb5-10">    km.fit(X_std)</span>
<span id="cb5-11">    centroids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> km.centroids</span>
<span id="cb5-12">    centers.append(centroids)</span>
<span id="cb5-13">    ax[i].scatter(X_std[km.labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X_std[km.labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],</span>
<span id="cb5-14">                  c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'green'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cluster 1'</span>)</span>
<span id="cb5-15">    ax[i].scatter(X_std[km.labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X_std[km.labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],</span>
<span id="cb5-16">                  c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'blue'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cluster 2'</span>)</span>
<span id="cb5-17">    ax[i].scatter(centroids[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], centroids[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],</span>
<span id="cb5-18">                  c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'r'</span>, marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'*'</span>, s<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'centroid'</span>)</span>
<span id="cb5-19">    ax[i].set_xlim([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb5-20">    ax[i].set_ylim([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb5-21">    ax[i].legend(loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lower right'</span>)</span>
<span id="cb5-22">    ax[i].set_title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>km<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>error<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb5-23">    ax[i].set_aspect(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'equal'</span>)</span>
<span id="cb5-24">plt.tight_layout()<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Kmeans-Clustering_files/figure-html/cell-6-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-3"><img src="https://imaddabbura.github.io/posts/ml/Kmeans-Clustering_files/figure-html/cell-6-output-1.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
</div>
<p>As the graph above shows, we ended up with only two distinct clusterings across the different initializations. We would pick the one with the lowest sum of squared distance.</p>
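<p>This “keep the lowest SSE” rule can be sketched with <code>sklearn</code>’s <code>KMeans</code>, whose <code>inertia_</code> attribute is exactly this sum of squared distances (shown on synthetic blobs rather than the geyser data):</p>

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=2, random_state=42)

# One kmeans run per seed, each with a single random initialization
runs = []
for seed in range(9):
    km = KMeans(n_clusters=2, init="random", n_init=1,
                max_iter=3, random_state=seed).fit(X)
    runs.append((km.inertia_, seed))

# Keep the initialization with the lowest sum of squared distances
best_sse, best_seed = min(runs)
```

<p>Passing a larger <code>n_init</code> to a single <code>KMeans</code> call performs the same selection internally.</p>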
</section>
<section id="image-compression" class="level3">
<h3 class="anchored" data-anchor-id="image-compression">Image Compression</h3>
<p>In this part, we’ll use kmeans to compress an image. The image we’ll be working with is 396 x 396 x 3, so each pixel location holds 3 8-bit integers specifying the red, green, and blue intensity values. Our goal is to reduce the number of colors to 30 and represent (compress) the photo using only those 30 colors. To pick which colors to use, we’ll run the kmeans algorithm on the image and treat every pixel as a data point. That means reshaping the image from height x width x channels to (height * width) x channels, i.e., we would have 396 x 396 = 156,816 data points in 3-dimensional space, which are the RGB intensities. Doing so allows us to represent each pixel by one of the 30 centroids, shrinking the image by roughly a factor of 5. The original image takes 396 x 396 x 24 = 3,763,584 bits, whereas the compressed image takes 30 x 24 + 396 x 396 x 5 = 784,800 bits. The difference comes from using the centroids as a lookup table for pixel colors: each pixel location then needs only a 5-bit index (the minimum needed to address 30 colors, since 2^5 = 32) instead of 24 bits of RGB.</p>
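<p>The size arithmetic can be checked in a few lines; note that indexing <code>k</code> colors needs ceil(log2(k)) bits per pixel, which is 5 bits for 30 colors:</p>

```python
from math import ceil, log2

h = w = 396          # image height and width
k = 30               # number of colors (centroids)
channels, bits_per_channel = 3, 8

original_bits = h * w * channels * bits_per_channel   # 24 bits per pixel
bits_per_index = ceil(log2(k))                        # bits to address k colors
palette_bits = k * channels * bits_per_channel        # the k-color lookup table
compressed_bits = palette_bits + h * w * bits_per_index
ratio = original_bits / compressed_bits               # roughly 5x smaller
```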
<p>From now on, we will be using the <code>sklearn</code> implementation of kmeans. A few things to note here:</p>
<ul>
<li><code>n_init</code> is the number of times kmeans is run with different centroid initializations. The best result (lowest SSE) is reported.</li>
<li><code>tol</code> is the convergence tolerance: the run stops once the change between iterations falls below it.</li>
<li>The default <code>init</code> is <strong>k-means++</strong>, which is supposed to yield better results than purely random initialization of the centroids.</li>
</ul>
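<p>A minimal sketch of these parameters in use (the values shown are <code>sklearn</code> defaults at the time of writing, made explicit here):</p>

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

km = KMeans(
    n_clusters=3,
    init="k-means++",  # spread-out initial centroids
    n_init=10,         # 10 initializations; the best (lowest SSE) run is kept
    tol=1e-4,          # convergence tolerance on centroid updates
    random_state=0,
).fit(X)

labels = km.labels_   # cluster index for each point
sse = km.inertia_     # sum of squared distances to centroids
```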
<div id="cell-21" class="cell" data-code_folding="[]" data-execution_count="6">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Read the image</span></span>
<span id="cb6-2">img <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> imread(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'images/my_image.jpg'</span>)</span>
<span id="cb6-3">img_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> img.shape</span>
<span id="cb6-4"></span>
<span id="cb6-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Reshape it to be 2-dimension</span></span>
<span id="cb6-6">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> img.reshape(img_size[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> img_size[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], img_size[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb6-7"></span>
<span id="cb6-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Run the Kmeans algorithm</span></span>
<span id="cb6-9">km <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KMeans(n_clusters<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>)</span>
<span id="cb6-10">km.fit(X)</span>
<span id="cb6-11"></span>
<span id="cb6-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use the centroids to compress the image</span></span>
<span id="cb6-13">X_compressed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> km.cluster_centers_[km.labels_]</span>
<span id="cb6-14">X_compressed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.clip(X_compressed.astype(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'uint8'</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">255</span>)</span>
<span id="cb6-15"></span>
<span id="cb6-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Reshape X_recovered to have the same dimension as the original image 128 * 128 * 3</span></span>
<span id="cb6-17">X_compressed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_compressed.reshape(img_size[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], img_size[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], img_size[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb6-18"></span>
<span id="cb6-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot the original and the compressed image next to each other</span></span>
<span id="cb6-20">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>))</span>
<span id="cb6-21">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].imshow(img)</span>
<span id="cb6-22">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Original Image'</span>)</span>
<span id="cb6-23">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].imshow(X_compressed)</span>
<span id="cb6-24">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Compressed Image with 30 colors'</span>)</span>
<span id="cb6-25"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> ax <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> fig.axes:</span>
<span id="cb6-26">    ax.axis(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'off'</span>)</span>
<span id="cb6-27">plt.tight_layout()<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Kmeans-Clustering_files/figure-html/cell-7-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-4"><img src="https://imaddabbura.github.io/posts/ml/Kmeans-Clustering_files/figure-html/cell-7-output-1.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
</div>
<p>We can see the comparison between the original image and the compressed one. The compressed image looks close to the original, which means we were able to retain most of the characteristics of the original image. With a smaller number of clusters we would get a higher compression rate at the expense of image quality. As a side note, this image compression method is called <em>lossy data compression</em> because we can’t reconstruct the original image exactly from the compressed one.</p>
</section>
</section>
<section id="evaluation-methods" class="level2">
<h2 class="anchored" data-anchor-id="evaluation-methods">Evaluation Methods</h2>
<p>Contrary to supervised learning, where we have the ground truth to evaluate the model’s performance, clustering analysis doesn’t have a solid evaluation metric for comparing the outcomes of different clustering algorithms. Moreover, since kmeans requires <img src="https://latex.codecogs.com/png.latex?k"> as an input and doesn’t learn it from the data, there is no single right answer for the number of clusters in any given problem. Sometimes domain knowledge and intuition help, but usually that is not the case. In the cluster-then-predict methodology, we can evaluate how well the downstream models perform for different values of <img src="https://latex.codecogs.com/png.latex?K">, since the clusters feed into the downstream modeling.</p>
<p>In this post we’ll cover two metrics that may give us some intuition about <img src="https://latex.codecogs.com/png.latex?k">:</p>
<ul>
<li>Elbow method</li>
<li>Silhouette analysis</li>
</ul>
<section id="elbow-method" class="level3">
<h3 class="anchored" data-anchor-id="elbow-method">Elbow Method</h3>
<p>The <strong>elbow</strong> method gives us an idea of a good <img src="https://latex.codecogs.com/png.latex?k"> based on the sum of squared distances (SSE) between data points and their assigned clusters’ centroids. We pick <img src="https://latex.codecogs.com/png.latex?k"> at the spot where the SSE starts to flatten out, forming an elbow. We’ll use the geyser dataset, evaluate the SSE for different values of <img src="https://latex.codecogs.com/png.latex?k">, and see where the curve forms an elbow and flattens out.</p>
<div id="cell-27" class="cell" data-execution_count="7">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Run the Kmeans algorithm and get the index of data points clusters</span></span>
<span id="cb7-2">sse <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb7-3">list_k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>))</span>
<span id="cb7-4"></span>
<span id="cb7-5"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> k <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> list_k:</span>
<span id="cb7-6">    km <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KMeans(n_clusters<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>k)</span>
<span id="cb7-7">    km.fit(X_std)</span>
<span id="cb7-8">    sse.append(km.inertia_)</span>
<span id="cb7-9"></span>
<span id="cb7-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot sse against k</span></span>
<span id="cb7-11">plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb7-12">plt.plot(list_k, sse, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'-o'</span>)</span>
<span id="cb7-13">plt.xlabel(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'Number of clusters $k$'</span>)</span>
<span id="cb7-14">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Sum of squared distance'</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Kmeans-Clustering_files/figure-html/cell-8-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-5"><img src="https://imaddabbura.github.io/posts/ml/Kmeans-Clustering_files/figure-html/cell-8-output-1.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
</div>
<p>The graph above shows that <img src="https://latex.codecogs.com/png.latex?k%20=%202"> is not a bad choice. Sometimes it’s still hard to pick a good number of clusters because the curve is monotonically decreasing and may not show an elbow or any obvious point where it starts flattening out.</p>
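<p>When no elbow is visually obvious, one crude heuristic is to pick the <img src="https://latex.codecogs.com/png.latex?k"> at the sharpest bend, i.e.&nbsp;the largest second difference of the SSE curve. Here is a minimal sketch on synthetic blobs (not the geyser data); both the heuristic and the <code>make_blobs</code> setup are illustrative assumptions, not part of the original analysis:</p>

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
list_k = list(range(1, 10))
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in list_k]

# Second difference of SSE: the largest value marks the sharpest bend ("elbow")
curvature = np.diff(sse, 2)                 # length len(sse) - 2
elbow_k = list_k[int(np.argmax(curvature)) + 1]
print(elbow_k)
```

This is only a rule of thumb; it can be fooled by noisy SSE curves, so it should complement, not replace, looking at the plot.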
</section>
<section id="silhouette-analysis" class="level3">
<h3 class="anchored" data-anchor-id="silhouette-analysis">Silhouette Analysis</h3>
<p><strong>Silhouette analysis</strong> can be used to determine the degree of separation between clusters. For each sample:</p>
<ul>
<li>Compute the average distance from all data points in the same cluster (<img src="https://latex.codecogs.com/png.latex?a%5Ei">).</li>
<li>Compute the average distance from all data points in the nearest cluster the sample is not a part of (<img src="https://latex.codecogs.com/png.latex?b%5Ei">).</li>
<li>Compute the coefficient: <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bb%5Ei%20-%20a%5Ei%7D%7Bmax(a%5Ei,%20b%5Ei)%7D"> The coefficient can take values in the interval [-1, 1].
<ul>
<li>If it is 0 –&gt; the sample is very close to the neighboring clusters.</li>
<li>If it is 1 –&gt; the sample is far away from the neighboring clusters.</li>
<li>If it is -1 –&gt; the sample is assigned to the wrong cluster.</li>
</ul></li>
</ul>
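<p>The three steps above map directly onto NumPy. Here is a minimal sketch of the per-sample coefficient on a made-up toy dataset, checked against scikit-learn’s <code>silhouette_samples</code>:</p>

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny illustrative dataset: two obvious pairs of points
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_coef(i, X, labels):
    d = np.linalg.norm(X - X[i], axis=1)           # distances from sample i
    same = labels == labels[i]
    a = d[same & (np.arange(len(X)) != i)].mean()  # mean intra-cluster distance
    b = min(d[labels == c].mean()                  # nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

ours = np.array([silhouette_coef(i, X, labels) for i in range(len(X))])
assert np.allclose(ours, silhouette_samples(X, labels))
```

(For single-member clusters scikit-learn defines the coefficient as 0; this sketch skips that edge case.)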
<p>Therefore, we want the coefficients to be as big as possible, and close to 1, to have good clusters. We’ll use the geyser dataset again because it’s cheaper to run the silhouette analysis on, and it is actually obvious that there are most likely only two groups of data points.</p>
<div id="cell-31" class="cell" data-execution_count="8">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i, k <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>]):</span>
<span id="cb8-2">    fig, (ax1, ax2) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb8-3">    fig.set_size_inches(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>)</span>
<span id="cb8-4">    </span>
<span id="cb8-5">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Run the Kmeans algorithm</span></span>
<span id="cb8-6">    km <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KMeans(n_clusters<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>k)</span>
<span id="cb8-7">    labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> km.fit_predict(X_std)</span>
<span id="cb8-8">    centroids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> km.cluster_centers_</span>
<span id="cb8-9"></span>
<span id="cb8-10">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get silhouette samples</span></span>
<span id="cb8-11">    silhouette_vals <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> silhouette_samples(X_std, labels)</span>
<span id="cb8-12"></span>
<span id="cb8-13">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Silhouette plot</span></span>
<span id="cb8-14">    y_ticks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb8-15">    y_lower, y_upper <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb8-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i, cluster <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(np.unique(labels)):</span>
<span id="cb8-17">        cluster_silhouette_vals <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> silhouette_vals[labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> cluster]</span>
<span id="cb8-18">        cluster_silhouette_vals.sort()</span>
<span id="cb8-19">        y_upper <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(cluster_silhouette_vals)</span>
<span id="cb8-20">        ax1.barh(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(y_lower, y_upper), cluster_silhouette_vals, edgecolor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'none'</span>, height<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-21">        ax1.text(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03</span>, (y_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> y_upper) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>(i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb8-22">        y_lower <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(cluster_silhouette_vals)</span>
<span id="cb8-23"></span>
<span id="cb8-24">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get the average silhouette score and plot it</span></span>
<span id="cb8-25">    avg_score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.mean(silhouette_vals)</span>
<span id="cb8-26">    ax1.axvline(avg_score, linestyle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'--'</span>, linewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'green'</span>)</span>
<span id="cb8-27">    ax1.set_yticks([])</span>
<span id="cb8-28">    ax1.set_xlim([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb8-29">    ax1.set_xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Silhouette coefficient values'</span>)</span>
<span id="cb8-30">    ax1.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Cluster labels'</span>)</span>
<span id="cb8-31">    ax1.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Silhouette plot for the various clusters'</span>, y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.02</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb8-32">    </span>
<span id="cb8-33">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Scatter plot of data colored with labels</span></span>
<span id="cb8-34">    ax2.scatter(X_std[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X_std[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>labels)</span>
<span id="cb8-35">    ax2.scatter(centroids[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], centroids[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'*'</span>, c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'r'</span>, s<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">250</span>)</span>
<span id="cb8-36">    ax2.set_xlim([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb8-37">    ax2.set_ylim([<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb8-38">    ax2.set_xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Eruption time in mins'</span>)</span>
<span id="cb8-39">    ax2.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Waiting time to next eruption'</span>)</span>
<span id="cb8-40">    ax2.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Visualization of clustered data'</span>, y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.02</span>)</span>
<span id="cb8-41">    ax2.set_aspect(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'equal'</span>)</span>
<span id="cb8-42">    plt.tight_layout()</span>
<span id="cb8-43">    plt.suptitle(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Silhouette analysis using k = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>k<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>,</span>
<span id="cb8-44">                 fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span>, fontweight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'semibold'</span>, y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.05</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Kmeans-Clustering_files/figure-html/cell-9-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-6"><img src="https://imaddabbura.github.io/posts/ml/Kmeans-Clustering_files/figure-html/cell-9-output-1.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Kmeans-Clustering_files/figure-html/cell-9-output-2.png" class="lightbox" data-gallery="quarto-lightbox-gallery-7"><img src="https://imaddabbura.github.io/posts/ml/Kmeans-Clustering_files/figure-html/cell-9-output-2.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Kmeans-Clustering_files/figure-html/cell-9-output-3.png" class="lightbox" data-gallery="quarto-lightbox-gallery-8"><img src="https://imaddabbura.github.io/posts/ml/Kmeans-Clustering_files/figure-html/cell-9-output-3.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
</div>
<p>As the above plots show, <code>n_clusters=2</code> has the best average silhouette score, around 0.75, and the fact that all clusters sit above the average shows that it is actually a good choice. Also, the thickness of each silhouette gives an indication of how big the corresponding cluster is. The plot shows that cluster 1 has almost double the samples of cluster 2. However, as we increase <code>n_clusters</code> to 3 and 4, the average silhouette score decreases dramatically, to around 0.48 and 0.39 respectively, and the silhouette thicknesses start to show wide fluctuations. The bottom line: a good <code>n_clusters</code> has an average silhouette score well above 0.5, with every cluster scoring higher than the average.</p>
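<p>This rule of thumb can be automated by picking the <code>n_clusters</code> with the highest mean silhouette score. A minimal sketch on synthetic, well-separated blobs (an illustrative setup, not the geyser data):</p>

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two well-separated blobs, so the right answer is known in advance
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6]],
                  cluster_std=0.8, random_state=0)

# Pick the k whose clustering has the highest mean silhouette score
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)   # -> 2 for these well-separated blobs
```

Note that a high average score alone is not enough; as the plots above show, you also want every cluster's silhouette to clear the average.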
</section>
</section>
<section id="drawbacks" class="level2">
<h2 class="anchored" data-anchor-id="drawbacks">Drawbacks</h2>
<p>The kmeans algorithm is good at capturing the structure of the data if the clusters have a spherical-like shape; it always tries to construct a nice spherical shape around each centroid. This means that the minute the clusters have complicated geometric shapes, kmeans does a poor job of clustering the data. We’ll illustrate three cases where kmeans does not perform well.</p>
<p>First, the kmeans algorithm doesn’t let data points that are far away from each other share the same cluster, even when they obviously belong together. Below is an example of data points lying on two horizontal lines, which illustrates how kmeans groups half of the data points of each line together.</p>
<div id="cell-35" class="cell" data-code_folding="[]" data-execution_count="9">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create horizontal data</span></span>
<span id="cb9-2">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.tile(np.linspace(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb9-3">y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.repeat(np.array([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>]), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>)</span>
<span id="cb9-4">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.c_[X, y]</span>
<span id="cb9-5"></span>
<span id="cb9-6">km <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KMeans(n_clusters<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb9-7">km.fit(df)</span>
<span id="cb9-8">labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> km.predict(df)</span>
<span id="cb9-9">centroids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> km.cluster_centers_</span>
<span id="cb9-10"></span>
<span id="cb9-11">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb9-12">plt.scatter(X, y, c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>labels)</span>
<span id="cb9-13">plt.xlim([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>])</span>
<span id="cb9-14">plt.ylim([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>])</span>
<span id="cb9-15">plt.text(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'A'</span>, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'red'</span>)</span>
<span id="cb9-16">plt.text(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'B'</span>, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'red'</span>)</span>
<span id="cb9-17">plt.text(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.8</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.1</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'C'</span>, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'red'</span>)</span>
<span id="cb9-18">ax.set_aspect(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'equal'</span>)</span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Kmeans-Clustering_files/figure-html/cell-10-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-9"><img src="https://imaddabbura.github.io/posts/ml/Kmeans-Clustering_files/figure-html/cell-10-output-1.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
</div>
<p>Kmeans considers point ‘B’ closer to point ‘A’ than to point ‘C’ because the clusters have a non-spherical shape. Therefore, points ‘A’ and ‘B’ end up in the same cluster while point ‘C’ lands in a different one. (Note that the <strong>single linkage</strong> hierarchical clustering method gets this right because it doesn’t separate similar points.)</p>
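<p>To see that behavior concretely, here is a sketch that reruns the two-lines example with scikit-learn’s <code>AgglomerativeClustering</code>; single linkage is chosen here for illustration, since it merges by nearest neighbors and therefore follows each line:</p>

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two horizontal lines of points, as in the example above
X = np.tile(np.linspace(1, 5, 20), 2)
y = np.repeat([2.0, 4.0], 20)
df = np.c_[X, y]

# Single linkage chains nearest neighbors, so each line becomes one cluster
labels = AgglomerativeClustering(n_clusters=2, linkage='single').fit_predict(df)

# Each horizontal line ends up entirely in its own cluster
print(len(set(labels[:20])), len(set(labels[20:])))   # -> 1 1
```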
<p>Second, we’ll generate data from multivariate normal distributions with different means and covariances, so we have 3 groups of data where each group was generated from a different distribution. One group will have far more data points than the other two combined. Next, we’ll run kmeans on the data with <img src="https://latex.codecogs.com/png.latex?K%20=%203"> and see whether it clusters the data correctly. To make the comparison easier, I’ll first plot the data colored by the distribution it came from, then plot the same data colored by the cluster it was assigned to.</p>
<div id="cell-38" class="cell" data-code_folding="[]" data-execution_count="10">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create data from three different multivariate distributions</span></span>
<span id="cb10-2">X_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.multivariate_normal(mean<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], cov<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]], size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">75</span>)</span>
<span id="cb10-3">X_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.multivariate_normal(mean<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>], cov<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]], size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">250</span>)</span>
<span id="cb10-4">X_3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.multivariate_normal(mean<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>], cov<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]], size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>)</span>
<span id="cb10-5">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.concatenate([X_1, X_2, X_3])</span>
<span id="cb10-6"></span>
<span id="cb10-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Run kmeans</span></span>
<span id="cb10-8">km <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KMeans(n_clusters<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb10-9">km.fit(df)</span>
<span id="cb10-10">labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> km.predict(df)</span>
<span id="cb10-11">centroids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> km.cluster_centers_</span>
<span id="cb10-12"></span>
<span id="cb10-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot the data</span></span>
<span id="cb10-14">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>))</span>
<span id="cb10-15">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].scatter(X_1[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X_1[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb10-16">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].scatter(X_2[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X_2[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb10-17">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].scatter(X_3[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X_3[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb10-18">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].set_aspect(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'equal'</span>)</span>
<span id="cb10-19">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].scatter(df[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], df[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>labels)</span>
<span id="cb10-20">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].scatter(centroids[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], centroids[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'o'</span>,</span>
<span id="cb10-21">                c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>, alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, s<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>, edgecolor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'k'</span>)</span>
<span id="cb10-22"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i, c <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(centroids):</span>
<span id="cb10-23">    ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].scatter(c[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], c[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'$</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%d</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">$'</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> i, s<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, edgecolor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'r'</span>)</span>
<span id="cb10-24">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].set_aspect(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'equal'</span>)</span>
<span id="cb10-25">plt.tight_layout()</span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Kmeans-Clustering_files/figure-html/cell-11-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-10"><img src="https://imaddabbura.github.io/posts/ml/Kmeans-Clustering_files/figure-html/cell-11-output-1.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
</div>
<p>It looks like kmeans couldn’t recover the clusters correctly. Because it minimizes the total within-cluster variation, it effectively gives more weight to bigger clusters than to smaller ones. In other words, data points in a small cluster may end up far from their centroid so that the algorithm can better fit the larger cluster.</p>
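<p>To make this size effect concrete, below is a minimal sketch (not part of the original experiments; the cluster sizes and variable names are illustrative assumptions) that fits kmeans to one large and one small Gaussian cluster and prints each cluster’s contribution to the total within-cluster variation (inertia):</p>

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# One large cluster (1000 points) and one small cluster (50 points)
big = rng.normal(loc=[0, 0], scale=1.0, size=(1000, 2))
small = rng.normal(loc=[4, 0], scale=1.0, size=(50, 2))
X = np.vstack([big, small])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Per-cluster contribution to the objective kmeans minimizes
for k in range(km.n_clusters):
    pts = X[labels == k]
    contrib = ((pts - km.cluster_centers_[k]) ** 2).sum()
    print(f"cluster {k}: {len(pts)} points, inertia contribution {contrib:.1f}")
```

<p>The larger cluster dominates the objective, which is why kmeans is willing to misplace points from the smaller cluster.</p>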
<p>Last, we’ll generate data with complicated geometric shapes, such as moons and concentric circles, and test kmeans on both datasets.</p>
<div id="cell-40" class="cell" data-code_folding="[]" data-execution_count="11">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Circles</span></span>
<span id="cb11-2">X1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_circles(factor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, noise<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>, n_samples<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1500</span>)</span>
<span id="cb11-3"></span>
<span id="cb11-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Moons</span></span>
<span id="cb11-5">X2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_moons(n_samples<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1500</span>, noise<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>)</span>
<span id="cb11-6"></span>
<span id="cb11-7">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb11-8"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i, X <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>([X1, X2]):</span>
<span id="cb11-9">    fig.set_size_inches(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>)</span>
<span id="cb11-10">    km <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KMeans(n_clusters<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb11-11">    km.fit(X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb11-12">    labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> km.predict(X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb11-13">    centroids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> km.cluster_centers_</span>
<span id="cb11-14"></span>
<span id="cb11-15">    ax[i].scatter(X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>labels)</span>
<span id="cb11-16">    ax[i].scatter(centroids[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], centroids[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'*'</span>, s<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">400</span>, c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'r'</span>)</span>
<span id="cb11-17">    ax[i].scatter(centroids[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], centroids[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'+'</span>, s<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>, c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'green'</span>)</span>
<span id="cb11-18">plt.suptitle(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Simulated data'</span>, y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.05</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22</span>, fontweight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'semibold'</span>)</span>
<span id="cb11-19">plt.tight_layout()</span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Kmeans-Clustering_files/figure-html/cell-12-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-11"><img src="https://imaddabbura.github.io/posts/ml/Kmeans-Clustering_files/figure-html/cell-12-output-1.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
</div>
<p>As expected, kmeans couldn’t recover the correct clusters for either dataset. However, we can handle these kinds of datasets using kernel methods: transform the data into a higher-dimensional representation that makes it linearly separable (the same idea used in SVMs). Algorithms such as <code>SpectralClustering</code> work very well in these scenarios, as shown below:</p>
<div id="cell-42" class="cell" data-execution_count="12">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Circles</span></span>
<span id="cb12-2">X1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_circles(factor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, noise<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>, n_samples<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1500</span>)</span>
<span id="cb12-3"></span>
<span id="cb12-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Moons</span></span>
<span id="cb12-5">X2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_moons(n_samples<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1500</span>, noise<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>)</span>
<span id="cb12-6"></span>
<span id="cb12-7">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb12-8"><span class="cf" style="color: #003B4F;
background-color: null;
font-style: inherit;">for</span> i, X <span class="kw" style="color: #003B4F;
background-color: null;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>([X1, X2]):</span>
<span id="cb12-9">    fig.set_size_inches(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>)</span>
<span id="cb12-10">    sp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SpectralClustering(n_clusters<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, affinity<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'nearest_neighbors'</span>)</span>
<span id="cb12-11">    sp.fit(X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb12-12">    labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sp.labels_</span>
<span id="cb12-13">    ax[i].scatter(X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>][:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>labels)</span>
<span id="cb12-14">plt.suptitle(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Simulated data'</span>, y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.05</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22</span>, fontweight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'semibold'</span>)</span>
<span id="cb12-15">plt.tight_layout()</span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="Kmeans-Clustering_files/figure-html/cell-13-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-12"><img src="https://imaddabbura.github.io/posts/ml/Kmeans-Clustering_files/figure-html/cell-13-output-1.png" class="img-fluid figure-img"></a></p>
</figure>
</div>
</div>
</div>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Kmeans clustering is one of the most popular clustering algorithms and is usually the first thing practitioners apply to a clustering task to get an idea of the structure of the dataset. The goal of kmeans is to group data points into distinct, non-overlapping subgroups. It does a very good job when the clusters are roughly spherical, but it suffers as cluster geometry deviates from that shape. Moreover, it doesn’t learn the number of clusters from the data and requires it to be pre-defined. To be a good practitioner, it helps to know the assumptions behind each algorithm so that you understand its strengths and weaknesses and can decide when to use it and under what circumstances. In this post, we covered kmeans’ strengths and weaknesses along with some related evaluation methods.</p>
<p>Below are the main takeaways:</p>
<ul>
<li>Scale/standardize the data before applying the kmeans algorithm.</li>
<li>The elbow method for selecting the number of clusters doesn’t always work, because the error function is monotonically decreasing in <img src="https://latex.codecogs.com/png.latex?k">.</li>
<li>Kmeans gives more weight to the bigger clusters.</li>
<li>Kmeans assumes spherical clusters (with radius equal to the distance between the centroid and the furthest data point) and doesn’t work well when clusters have other shapes, such as elliptical ones.</li>
<li>When clusters overlap, kmeans has no intrinsic measure of uncertainty for the examples in the overlapping region, so it cannot express how confident it is about which cluster each such point belongs to.</li>
<li>Kmeans may still cluster the data even when there is no real cluster structure, such as data drawn from a <em>uniform distribution</em>.</li>
</ul>
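<p>As a quick illustration of the first takeaway, here is a minimal sketch (an illustrative example, not from the post; the feature scales are assumptions) that standardizes features of very different magnitudes before running kmeans, using a scikit-learn pipeline so the scaling is always applied consistently:</p>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
# Two features on very different scales: without scaling, Euclidean
# distance would be dominated almost entirely by the second feature
X = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])

# Standardizing first puts both features on equal footing
pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
labels = pipe.fit_predict(X)
```

<p>Bundling the scaler and the model in one pipeline also prevents a common leak: fitting the scaler on data the model never sees at prediction time.</p>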


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Machine Learning</category>
  <guid>https://imaddabbura.github.io/posts/ml/Kmeans-Clustering.html</guid>
  <pubDate>Tue, 11 Sep 2018 05:00:00 GMT</pubDate>
  <media:content url="https://imaddabbura.github.io/posts/ml/images/kmeans-clustering.png" medium="image" type="image/png" height="80" width="144"/>
</item>
</channel>
</rss>
