Examples of evaluation-driven updates

Macroeconomic Effects of Climate Change (Bilal & Kanzig)

Evaluation package

2024 evaluated version

2025 updated version

`NotebookLM` reports

We manually confirmed the majority of the LLM's claims.

Above graphic created via NotebookLM

The revisions between the November 2024 and the September 2025 versions of the working paper reflect a substantial strengthening of the evidence, primarily through the incorporation of a longer historical dataset, leading to an upward revision of the overall economic damage estimates.

...

Reflection of Evaluation Suggestions

The revisions appear to address several key concerns raised by the evaluators, particularly regarding methodological robustness and context:

Addressing Sample Size and Robustness (Evaluator 2)

Sample Size and Inference: Evaluator 2 expressed concern over the small sample size ($N<60$) behind the main results and asked for simulation evidence that inference is valid. The September 2025 version tackles this directly by introducing the BU dataset, which spans 160 years. The stability of the estimates across the shorter (PWT) and longer (BU) samples strongly corroborates the robustness of the central finding. The authors also note that they check the robustness of their inference with bootstrap techniques in the appendix.

Longer Horizon Effects: Evaluator 2 noted that impacts did not seem to have fully reverted after 10 years. The revised paper addresses this directly by converting its estimates to the damage implied by a permanent 1°C rise in temperature, yielding the much larger long-run figure of a 22–34% GDP reduction. The authors explicitly state that impacts do not fully mean-revert even after 10 years.

Lags and Geoscience: Evaluator 1 requested increasing the number of lags, especially given the longer cycles noted in geoscience. The newer version notes that in the longer BU sample, they use 4 to 8 lags to capture the complexity of the historical series.

Addressing Comparability and Context (Evaluators 1 & 2)

Non-Linear Effects and Comparability: Evaluator 2 asked that the local temperature model allow for non-linear or permanent growth effects, for a fair comparison with previous literature. While the authors maintain their main conservative assumption of level effects (not permanent growth effects), they clarify that their local temperature estimates generate a 3% long-run GDP reduction, which is consistent with the “level effects” benchmark of the conventional literature. They explicitly report that imposing non-linear specifications on their shocks did not materially change the core estimates.

Out-of-Sample Extrapolation: Evaluator 1 suggested acknowledging how far out of sample the 1°C calibration is, given that shocks are typically around 0.3°C. The 2025 version notes that scaling up to a 1°C increase requires assuming linearity, which explicitly acknowledges this methodological limitation. The authors also suggest that the consistency of their linear estimates across shock sizes supports the approach.

Addressing Specific Policy Metrics

SCC Uncertainty: Evaluator 1 requested tightening the confidence interval on the SCC or providing a higher lower bound. While the point estimate increased from $1,367 to over $1,500/ton, the authors state they remain conservative by using the 10th percentile for climate sensitivity, thereby addressing the credibility concern around maximum estimates, even if the absolute range remains wide.

GPT 5.2 Thinking mode query

Did the Unjournal evaluations actually change the paper?

Short answer: yes, quite a bit — and in some places very clearly so.

I compared the Nov 2024 version of “Macroeconomic Effects of Climate Change” with the Sept 2025 revision, and then mapped the changes against the two Unjournal evaluations. Of course, authors also respond to seminars, referee reports, and their own ongoing work — so causality is never certain. But in several cases, the match between evaluation comment → paper change is unusually tight.

Below is a concise summary, followed by a table with probabilities.

Where the evidence is strongest

1. Permanent-temperature damages are now front and center

One of the clearest changes is that the revised paper explicitly converts short-run temperature shocks into damages from a permanent 1°C increase, reporting GDP losses of 22–34% and clearly flagging the extrapolation involved.

This looks like a direct response to an Unjournal evaluator asking the authors to clarify long-run interpretation and not stop at finite impulse-response horizons.

Evaluator: “Show results more than 10 years out, or provide robust SCC values that allow for longer run effects.”

Revised paper: “We convert… by taking the ratio of cumulative impulse responses… A permanent 1°C rise leads to a 22–34% reduction…”

This is exactly the kind of “make the implicit mapping explicit” improvement that evaluations aim for.
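
As a rough illustration of the “ratio of cumulative impulse responses” conversion quoted above, the Python sketch below divides the cumulative GDP response by the cumulative temperature response over the same horizon. The function name `permanent_effect_per_degree` and all numbers are hypothetical placeholders chosen for easy arithmetic; they are not taken from the paper, and the authors' actual procedure may differ in its details.

```python
def permanent_effect_per_degree(gdp_irf, temp_irf):
    """Ratio of cumulative impulse responses.

    gdp_irf:  GDP response (in percent) at each horizon after a temperature shock.
    temp_irf: temperature response (in degrees C) at the same horizons.
    Returns the GDP effect per degree of sustained warming implied by
    dividing the two cumulative responses.
    """
    if len(gdp_irf) != len(temp_irf):
        raise ValueError("impulse responses must cover the same horizons")
    cumulative_temp = sum(temp_irf)
    if cumulative_temp == 0:
        raise ValueError("cumulative temperature response is zero")
    return sum(gdp_irf) / cumulative_temp


# HYPOTHETICAL values for illustration only (not estimates from the paper).
temp_irf = [1.0, 0.5, 0.25, 0.25]   # the shock fades, so it is not permanent
gdp_irf = [-2.0, -4.0, -6.0, -8.0]  # GDP keeps falling after the shock

effect = permanent_effect_per_degree(gdp_irf, temp_irf)
print(f"Implied GDP effect per permanent 1 degree C: {effect:.1f}%")
```

The point of the ratio is that a transitory shock delivers less than one degree-year of warming per year, so the raw impulse response understates what a permanent increase of the same size would imply.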

2. Extreme-event definitions now get explicit robustness checks

Another near-textbook case: the revised paper adds robustness checks showing that results are stable across alternative definitions of ‘extreme’ temperature events.

That addition tracks almost one-for-one with an evaluator’s concern:

Evaluator: “Where do these thresholds come from? Are the results robust to alternative choices?”

Revised paper: “Robustness to alternative thresholds… yields very similar results.”

This is about as close as you get to a smoking gun for evaluation-driven change.

3. A much stronger long-run data strategy

The authors now clearly separate and emphasize results from:

a broad post-1960 PWT sample, and

a long historical BU sample (back to the 19th century).

That’s a costly change — not cosmetic — and it directly addresses concerns raised in the evaluations about finite samples, inference, and credibility.

Where the evidence is moderate

4. Domestic cost of carbon moves into the abstract

The revised abstract now highlights a U.S. domestic cost of carbon above $200/ton, something the evaluators explicitly flagged as a major policy implication worth elevating.

This looks like a “you’re underselling your own result” correction prompted by the evaluations.

5. Headline framing shifts (“six times larger” → “order of magnitude”)

The headline language changes, and interestingly, it now mirrors wording used by one evaluator. This could reflect feedback, or simply convergence on a clearer phrase — hard to know — but the overlap is notable.

Where causality is weaker or ambiguous

6. Welfare losses and SCC point estimates rise

The abstract moves from 25% → over 30% welfare loss and $1,367 → $1,500+ SCC. That’s directionally consistent with evaluators urging clarity about how large the damages really are — but it’s more likely driven by updated estimation and new samples than by a specific “do this” suggestion.

7. Scenario language gets stronger, but not fully reworked

Evaluators asked for broader comparisons across pathways and IAMs. The revised paper strengthens its “this holds under any plausible scenario” claims, but doesn’t fully implement the suggested cross-IAM comparisons. Partial uptake at best.

Summary table: how likely was UJ feedback the cause?

| Change in Sept 2025 version | Evidence of UJ influence | Probability |
| --- | --- | --- |
| Permanent 1°C damage conversion foregrounded | Direct conceptual match to evaluator request | 0.75 |
| Extreme-event threshold robustness added | Near point-by-point response | 0.90 |
| Long historical (BU) sample added | Directly addresses inference critique | 0.80 |
| Domestic SCC highlighted in abstract | Matches evaluator emphasis | 0.65 |
| Headline framing (“order of magnitude”) | Wording overlap with evaluation | 0.55 |
| Higher welfare & SCC point estimates | Plausible but indirect | 0.35 |
| Stronger scenario language | Partial alignment only | 0.45 |
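
One simple way to read the table's probabilities together is to sum them, which gives the expected number of listed changes attributable to Unjournal feedback if each value is taken at face value. The sketch below does that arithmetic in Python. The dictionary keys are shorthand labels introduced here, and the aggregation is an illustration rather than part of the original analysis.

```python
# Probabilities copied from the summary table; the keys are shorthand
# labels assumed for this sketch.
uj_influence = {
    "permanent_1C_conversion": 0.75,
    "extreme_event_threshold_robustness": 0.90,
    "long_historical_BU_sample": 0.80,
    "domestic_SCC_in_abstract": 0.65,
    "order_of_magnitude_framing": 0.55,
    "higher_welfare_and_SCC_estimates": 0.35,
    "stronger_scenario_language": 0.45,
}

# Expected count of UJ-driven changes: the sum of the individual
# probabilities (linearity of expectation, no independence needed).
expected_changes = sum(uj_influence.values())

# Changes judged more likely than not to reflect UJ feedback.
likely = [name for name, p in uj_influence.items() if p > 0.5]

print(f"Expected UJ-driven changes: {expected_changes:.2f} of {len(uj_influence)}")
print("More likely than not:", ", ".join(likely))
```

On these figures, roughly 4.45 of the 7 listed changes would be attributed to the evaluations in expectation, with five of them individually above even odds.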
