My mentor recently sent me a 1958 essay by Nobel laureate Walther Bothe, written as advice to young physicists. I read it and realized that it still applies remarkably well to modern AI research.
It also reminded me how many of these basic disciplines I still fail to practice consistently.
What follows is an adapted version for AI research, with the original structure and spirit preserved.
Ideas
Some people, at the right moment, have one idea that becomes the foundation of their entire scientific life. But the more common pattern is different.
Out of 100 ideas you have, 90 will turn out to be infeasible, mistimed, trivial, wrong, or simply not worth pursuing. Among the remaining 10, there is usually one that is the most promising, the most practical, and the easiest to actually execute.
The hard part is spending the time to find that one. Once you find it, stop hedging and throw your full force behind it.
In my own experiments, I find that simply increasing the number of inference steps, even when the model was trained with only 16, can substantially improve performance.
Config: TRM-MLP-EMA on Sudoku1k. The 16-step version reached 84% instead of 87%, but the scaling trend was already quite clear.
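To make the pattern concrete, here is a toy stand-in, not TRM itself: an iterative refiner whose error keeps shrinking when it is unrolled for more steps at inference than the step count it was nominally "trained" with. Everything here (the update rule, the numbers) is illustrative only.

```python
# Illustrative only: a toy iterative refiner standing in for a loop model
# like TRM. The point is the pattern, not the architecture: a model that
# runs K refinement steps during training can often be unrolled for more
# steps at inference, and quality keeps improving.

def refine(x0, target, steps):
    """Move an estimate toward a fixed point; more steps -> smaller error."""
    x = x0
    for _ in range(steps):
        x = x + 0.25 * (target - x)   # one refinement step
    return abs(target - x)            # residual error

err_16 = refine(0.0, 1.0, steps=16)   # "training-time" step count
err_64 = refine(0.0, 1.0, steps=64)   # unrolled further at inference
assert err_64 < err_16                # more steps, lower error
```

The contraction factor here is fixed, so the improvement is guaranteed; for a real loop model you would verify the same trend empirically on held-out data.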
In AI, this also means resisting the temptation to keep ten parallel half-projects alive forever. Keep a parking lot for the other ideas, but do not confuse keeping notes with making progress.
The idea that matters is the one that survives contact with implementation, data, evaluation, and ablation.
Simplicity and Flexibility
Even if the golden age of tin cans and sealing wax is gone, you should still make your research setup as simple and as flexible as possible.
The ideal setup is one you can change quickly. In AI, that means code you can refactor without fear, pipelines you can modify and rerun without friction, and an evaluation harness that does not collapse the moment you touch it.
Personally, I would not rely on Git commits alone to track the exact code used in each experiment. I prefer logging a snapshot of the code to W&B for every run I launch. wandb.run.log_code() is genuinely useful here.
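If you want the same idea without a tracking service, here is a minimal stdlib sketch; the function name and folder layout are mine, not a standard API. It copies the exact source files used by a run into a per-run folder at launch time.

```python
import shutil
from pathlib import Path

def snapshot_code(src_dir: str, runs_dir: str, run_id: str) -> Path:
    """Copy the *.py files used for a run into a per-run folder, so the
    exact code behind every experiment survives even as Git moves on."""
    dest = Path(runs_dir) / run_id / "code"
    dest.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).rglob("*.py"):
        target = dest / path.relative_to(src_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)    # preserves timestamps too
    return dest
```

Call it once at the top of your launch script, before training starts, so the snapshot reflects exactly what ran.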
Start from the center of the room. Build the core loop first: train, evaluate, log. Then add peripherals. The first time you do a new experiment, run it once from start to finish in a rough form. Yes, this often creates trouble. Make it a rule anyway.
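The core loop above can be sketched in a few lines. Everything in this sketch is a placeholder (a one-weight linear model, squared error, an in-memory history standing in for a logger); the shape of the loop is the point.

```python
# A hedged sketch of "the center of the room": the smallest loop that
# trains, evaluates, and logs end to end. Model, data, and metric are
# all placeholders.

def train_step(weight, example):
    x, y = example
    pred = weight * x
    grad = 2 * (pred - y) * x          # d/dw of squared error
    return weight - 0.1 * grad         # one SGD step

def evaluate(weight, dataset):
    return sum((weight * x - y) ** 2 for x, y in dataset) / len(dataset)

def run(dataset, epochs=20):
    weight, history = 0.0, []
    for _ in range(epochs):
        for example in dataset:
            weight = train_step(weight, example)
        history.append(evaluate(weight, dataset))  # "log" every epoch
    return weight, history

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]        # y = 2x
weight, history = run(data)
assert history[-1] < history[0]                    # loss went down
```

Only once this skeleton runs end to end is it worth adding peripherals: checkpointing, config files, sweeps, dashboards.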
The reward is immediate knowledge of where the difficulties are, and where your errors and biases enter. The biggest problems often appear exactly where you did not expect them.
You are human. You will not foresee everything. But after each attempt, you should think carefully before the next one. Changing the method often helps. In AI, that can mean changing the simplest thing that tests a hypothesis: a different baseline, a different data slice, a different metric, a simpler model, a controlled perturbation, or a sanity check that looks almost silly.
The goal is to learn what really works rather than to appear sophisticated.
You will often find that the original plan and objective can no longer be maintained. Before you abandon the experiment, you must understand why it fails, and then have the courage to interrupt it. If a direction is wrong, continuing is not perseverance. It is inertia.
It is fascinating that FP16 can reduce training-inference mismatch in RL fine-tuning.
Out of curiosity, I tried the same precision swap on the Tiny Recursive Model (TRM), which iterates hidden states to reason over inputs. The outcome was different enough that it forced me to debug the setup more carefully.
FP16 also works well with loop models. The model was not failing at all; the real issue was a bug on my side: I had forgotten to enable gradient scaling for FP16. It only looked like a mysterious failure because I had not yet understood it.
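For anyone who has not hit this before, here is what gradient scaling buys you, sketched without any framework. FP16's smallest normal magnitude is about 6e-5, so gradients below that underflow to zero unless the loss is scaled up before the backward pass. In PyTorch this is what torch.cuda.amp.GradScaler handles; this toy version only shows the mechanics, with a crude stand-in for the FP16 round.

```python
# Mechanics of FP16 gradient scaling, framework-free. Assumption: the
# "backward pass" produces a gradient of loss_grad * 1e-5, which sits
# below FP16's smallest normal magnitude (~6e-5) and underflows to zero
# unless the loss is scaled first.

FP16_TINY = 6e-5  # approx. smallest normal positive float16

def fp16_cast(x):
    """Crude stand-in for an FP16 round: flush tiny magnitudes to zero."""
    return 0.0 if abs(x) < FP16_TINY else x

def backward(loss_grad, scale=1.0):
    """Pretend backward pass, stored in 'FP16'."""
    return fp16_cast(loss_grad * scale * 1e-5)

grad_unscaled = backward(1.0)                         # underflows -> 0.0
grad_scaled = backward(1.0, scale=2.0**16) / 2.0**16  # unscale in FP32
assert grad_unscaled == 0.0
assert grad_scaled != 0.0                             # gradient survived
```

The real GradScaler also checks for inf/nan after unscaling and skips the optimizer step when the scale was too aggressive, but the underflow above is the failure mode I had actually hit.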
If something is broken, fix it.
In the original essay, quietly shelving a broken instrument was called a physicist’s mortal sin. In AI, the equivalent is letting a flaky training script, a leaking evaluation, a silent data bug, or a corrupted checkpoint system rot in the corner while pretending it does not matter.
It always matters.
The Economy of Scientific Work
Every day you should plan a schedule. Often you will not follow it, and then you should find the reason. Without a schedule, work dissolves into motion without results.
A schedule forces you toward a concrete goal and protects you from being consumed by side details that feel productive but do not move the core question.
You should allow some leisure, but leisure must not mean mental idleness. The mind should keep working, even outside formal work hours. Many good moves happen when you are not actively typing, but are still thinking.
More than once, starting from an unreasonable, even incorrect premise has led to an unexpected result. There is an old military principle that doing the wrong thing can be better than doing nothing. The point is not to be careless, but to buy information.
A stupid experiment, if it costs little time, can be worth doing once, especially now that we have tools like claude-code. It can reveal a hidden confounder, a missing control, or a false assumption that would otherwise waste weeks.
Read the literature regularly, but do not freeze in fear that you might accidentally repeat something. Even if two researchers work on the same topic, the work is rarely identical. At minimum, the starting point differs, the framing differs, the comparisons differ, and the failures differ.
The path you take can still be valuable, and it will shape what you are able to see next. More concretely, trying to reproduce a paper and actually working through its codebase often produces both insight and new ideas.
I know it is a bit late, but when I saw the figure in the Kimi Linear report, I got curious about what would happen if I benchmarked KDA against full attention and sliding-window attention.
So I reproduced task (b), Multiple Query Associative Recall (MQAR), and used it as a concrete way to compare those mechanisms.
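For readers who have not seen it, the task is easy to generate synthetically. This is my reading of MQAR, not the reference implementation: the sequence first presents key-value pairs, then re-queries a subset of the keys, and the model must recall each queried key's value.

```python
import random

# A hedged sketch of MQAR-style data generation (my reading of the task,
# not the reference code): present key-value pairs, then query a subset
# of the keys; the targets are the corresponding values.

def make_mqar_example(num_pairs, num_queries, vocab, seed=0):
    rng = random.Random(seed)
    keys = rng.sample(range(vocab), num_pairs)        # distinct keys
    values = [rng.randrange(vocab) for _ in keys]
    kv = dict(zip(keys, values))
    queries = rng.sample(keys, num_queries)
    # Input: k1 v1 k2 v2 ... followed by the queried keys.
    prompt = [tok for k, v in zip(keys, values) for tok in (k, v)] + queries
    targets = [kv[q] for q in queries]                # values to recall
    return prompt, targets

prompt, targets = make_mqar_example(num_pairs=8, num_queries=4, vocab=64)
assert len(prompt) == 8 * 2 + 4 and len(targets) == 4
```

What makes it a useful probe is that exact recall over many pairs is trivial for full attention but stresses fixed-state mechanisms, which is exactly the axis along which KDA and sliding-window attention differ.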
For research, I especially love the moment when several things I thought were unrelated suddenly turn out to be connected.
The Work Notebook
This is where many sins occur. Avoid writing experimental results on loose scraps, random chat logs, scattered text files, or temporary notes that evaporate. If it happens anyway, transfer them immediately into a proper notebook, for example Notion or Obsidian, or rewrite them cleanly. The real lesson is immediacy. Do not trust memory.
In general, record everything at once. Keep measurements and outputs on one side. Keep reasoning, quick calculations, setup sketches, decisions, and conclusions on the other. In a digital workflow, this means adopting a consistent structure: raw results and links to runs in one place, narrative reasoning and decision logs in another.
Number pages. Do not tear them out. Do not delete data because it is embarrassing. When an experiment must be paused, interrupted, or abandoned, write down a convincing reason, even if it is just in the W&B run description.
Preserve notebooks carefully. Something you believed was wrong may, months or years later, turn out to be the key that makes a new result make sense.
Writing Papers
Start writing as early as possible. At the latest, begin as soon as the measurements are finished. Do not wait until the setup is gone.
In AI, that means: do not wait until the code has drifted, the environment has changed, the dataset version is lost, the cluster configuration has been forgotten, and the exact run that mattered is no longer reproducible.
I wrote my latest conference submission in 14 days, and I think it is a total mess. Writing is no easier than the math or coding itself. In fact, I have found that even advanced AI tools still cannot supply the logic and organization for you.
Taking a vacation after the last run and coming back weeks later to summarize the project is not just a bad habit. It is an organizational error. First, notebooks often contain unclear points that you can only resolve while the work is still fresh in your mind. Second, you often discover gaps in the argument only when you start writing. Many of those gaps can be repaired with a small extra experiment, but only if the setup still exists.
Think about style. Aim for simple clarity. Short sentences. Straight claims. First decide what you actually found, and what you actually conclude.
After finishing, let the draft sit for a week or two. Then reread it as someone who is only mildly interested and slightly skeptical. That reader will find what you could not see while you were still inside the work.
In general, if you cannot be fully absorbed by the work, not only during work hours but also outside them, that is a bad sign. Serious research has a way of following you home. The goal is sustained curiosity that does not switch off when the clock says it should.