Prior Prompt Engineering for Reinforcement Fine-Tuning

1. Introduction to Prior Prompt Engineering (pPE) in Reinforcement Fine-Tuning (RFT)

1.1. Defining pPE and its role in RFT

Prior Prompt Engineering (pPE) refers to the strategic design and application of instructional prompts that are prepended to queries during the training phase of Reinforcement Fine-Tuning (RFT) for Language Models (LMs) . The primary role of pPE is to guide the LM to internalize and exhibit desired behaviors, such as step-by-step reasoning, planning, or knowledge recall, in response to inputs, as incentivized by reward signals within the RFT framework . Unlike inference-time prompt engineering (iPE), which aims to elicit specific behaviors from a pre-trained model during its deployment, pPE focuses on shaping the model’s fundamental response patterns by integrating these instructions directly into the RFT process . This integration allows the model to learn these behaviors as intrinsic aspects of its generation strategy, rather than as ad-hoc responses to external cues at inference time. The “prior” in pPE signifies that these prompts are used before the model generates its output during training, effectively setting a behavioral precedent that the RFT process then reinforces. The study highlights that while existing RFT research has extensively explored algorithms, reward shaping, and data curation, the systematic design and impact of these prior prompts have remained relatively underexplored .

1.2. Motivation: Moving beyond traditional RFT focus areas

The motivation behind investigating Prior Prompt Engineering (pPE) stems from the observation that much of the existing research in Reinforcement Fine-Tuning (RFT) has concentrated on optimizing algorithmic components, shaping reward functions, and curating training datasets . While these areas are crucial for advancing RFT, the specific design of the instructional part of the prompt used during training—the prior prompt—has received comparatively less systematic attention. The authors identify this as a critical gap, noting that the prior prompt is instrumental in eliciting desired behaviors, such as structured reasoning, from language models . The paper posits that by focusing on pPE, it’s possible to unlock new avenues for enhancing LM performance and instilling more nuanced and effective problem-solving approaches. This shift in focus moves beyond simply rewarding correct answers to actively guiding how the model arrives at those answers. The research aims to demonstrate that pPE is not merely a minor implementation detail but a powerful and understudied axis for RFT, capable of significantly influencing the learned behaviors and overall capabilities of the resulting models .

1.3. Core idea: Translating inference-time strategies to training-time prompts

The core idea of this research is to translate successful strategies from inference-time prompt engineering (iPE) into corresponding prior prompt engineering (pPE) approaches for use during Reinforcement Fine-Tuning (RFT) . Inference-time prompting techniques, such as Chain-of-Thought (CoT), Plan-and-Solve, and Program-of-Thought, have demonstrated significant success in eliciting complex reasoning and problem-solving behaviors from large language models by providing explicit instructions or examples within the input prompt at the time of query . This study investigates whether these iPE strategies, when adapted as prior prompts (i.e., instructions consistently prepended during the RFT training process), can guide LMs to internalize these desirable behaviors. The hypothesis is that by consistently exposing the model to these behavioral cues during training and reinforcing the resulting behaviors through reward signals, the model will learn to adopt these strategies intrinsically . For instance, a CoT-style iPE prompt, which asks the model to “think step by step,” would be translated into a pPE approach where the training prompt always instructs the model to engage in such reasoning . This translation aims to move from merely eliciting a behavior temporarily at inference to embedding it as a fundamental part of the model’s operational repertoire.

2. Investigated pPE Strategies

The research investigates five distinct Prior Prompt Engineering (pPE) strategies, each translated from a representative inference-time prompt engineering (iPE) technique. These strategies are designed to elicit different problem-solving behaviors from the language model during Reinforcement Fine-Tuning (RFT). The study aims to determine whether these pPE approaches can guide the LM to internalize these distinct behaviors and how these internalized behaviors impact performance and response styles . The selection of these five strategies is based on their demonstrated effectiveness in iPE and their diversity in eliciting varied generation patterns. The expectation is that each pPE strategy will not only lead to differences in performance but also instill specific behavioral characteristics in the resulting RFT models, observable in their training dynamics, average response lengths, and the nature of their generated outputs . The specific pPE tags (e.g., <think>, <plan>) are integrated into the prior prompt templates and are also used to define the expected output format for the reward function, ensuring that the model is trained to produce responses consistent with the targeted behavior .

The table below summarizes the five pPE strategies investigated in the study.

| pPE Strategy | iPE Inspiration | Core Instruction for Model | Expected Internalized Behavior |
| --- | --- | --- | --- |
| Reasoning (Think) | Chain-of-Thought (CoT) | “The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.” | Explicit, step-by-step reasoning before the final answer. |
| Planning (Plan) | Plan-and-Solve (PS) | “The assistant first plans about the reasoning process in the mind and then provides the user with the answer.” | Generation of a high-level plan or strategy before solution execution. |
| Code-based Reasoning (Code) | Program-of-Thought (PoT) | “The assistant first writes required code used to solve the problem and then provides the user with the answer.” | Formulation of solutions using executable code or code-like structured logic. |
| Knowledge Recall (Knowledge) | Generated Knowledge Prompting | “The assistant first recalls relevant knowledge used to solve the problem and then provides the user with the answer.” | Explicit retrieval and articulation of relevant facts, definitions, or formulas. |
| Null-example Utilization (Examples) | Null-shot Prompting | “The assistant first lists relevant examples used to solve the problem and then provides the user with the answer.” | Generation or consideration of illustrative examples or counterexamples. |

Table 1: Summary of Investigated Prior Prompt Engineering (pPE) Strategies, their Inference-Time Prompt Engineering (iPE) inspirations, core instructions, and expected internalized behaviors.
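
To make these templates concrete, the following is a minimal Python sketch of how a prior prompt might be assembled for each strategy during training. Only the quoted instruction fragments come from Table 1; the helper name, the surrounding wording, and the handling of the problem placeholder are illustrative assumptions rather than the paper’s exact template.

```python
# Minimal sketch of a prior prompt builder (illustrative; not the paper's exact template).
# The behavior clauses mirror the instructions in Table 1; everything else is an assumption.
PPE_INSTRUCTIONS = {
    "think": "thinks about the reasoning process in the mind",
    "plan": "plans about the reasoning process in the mind",
    "code": "writes required code used to solve the problem",
    "knowledge": "recalls relevant knowledge used to solve the problem",
    "examples": "lists relevant examples used to solve the problem",
}

def build_prior_prompt(problem: str, tag: str) -> str:
    """Prepend a pPE instruction asking for <tag>...</tag> followed by <answer>...</answer>."""
    behavior = PPE_INSTRUCTIONS[tag]
    return (
        f"The assistant first {behavior} and then provides the user with the answer. "
        f"The {tag} and answer are enclosed within <{tag}> </{tag}> and "
        f"<answer> </answer> tags, respectively.\n\n"
        f"Problem: {problem}\n"
        "State the final answer within \\boxed{} inside the <answer> tag."
    )

print(build_prior_prompt("What is 17 * 24?", "think"))
```

During RFT, the same template is prepended to every training query, and the matching tag pair is enforced through the format reward described in Section 3.3.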

2.1. Reasoning (Think Prompt)

The “Reasoning” pPE strategy, instantiated through the <think> tag, is designed to encourage the language model to engage in explicit, step-by-step reasoning before arriving at a final answer. This approach is directly inspired by Chain-of-Thought (CoT) prompting, a widely recognized iPE technique where models generate intermediate reasoning steps . The prior prompt used for this strategy explicitly instructs the model: “The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively” . This instruction is followed by a placeholder for the problem and a reminder to state the final answer mathematically within \boxed{} tags inside the <answer> tag. The primary goal of this pPE approach is to cultivate a model behavior that systematically breaks down complex problems into smaller, manageable steps, articulates these steps, and then synthesizes them to reach a conclusion. By internalizing this CoT-like behavior during RFT, the model is expected to improve its problem-solving capabilities, particularly on tasks that require multi-step reasoning, such as mathematical word problems .

2.2. Planning (Plan Prompt)

The “Planning” pPE strategy, implemented using the <plan> tag, aims to guide the language model to formulate an explicit plan or a high-level strategy before attempting to solve the given problem. This approach is based on the plan-and-solve prompting technique, an iPE method where the model first outlines a strategy . The prior prompt for this strategy instructs the model: “The assistant first plans about the reasoning process in the mind and then provides the user with the answer. The plan and answer are enclosed within <plan> </plan> and <answer> </answer> tags, respectively” . Similar to the “Reasoning” prompt, it also includes instructions for the problem context and the format for the final answer. The key distinction of this strategy is its focus on high-level strategy formulation rather than detailed, step-by-step reasoning at the initial stage. The expectation is that by prompting the model to generate a plan, it will learn to outline the major steps or sub-goals required to tackle a problem, potentially leading to more structured and efficient problem-solving . This could be particularly beneficial for complex tasks where a clear roadmap can prevent the model from getting lost in intermediate computations.

2.3. Code-based Reasoning (Code Prompt)

The “Code-based Reasoning” pPE strategy, utilizing the <code> tag, encourages the language model to leverage programming code as a tool for reasoning and problem-solving. This strategy draws inspiration from Program-of-Thought (PoT) prompting, where models are prompted to generate, and sometimes execute, code to arrive at a solution. The specific prior prompt for this approach states: “The assistant first writes required code used to solve the problem and then provides the user with the answer. The code and answer are enclosed within <code> </code> and <answer> </answer> tags, respectively”. The prompt structure remains consistent, with instructions for the problem and answer formatting. The objective here is to instill a behavior where the model identifies situations in which algorithmic or computational thinking, expressible through code (e.g., Python), can be advantageous. This is particularly relevant for mathematical problems involving repetitive calculations, complex data manipulations, or problems that can be solved elegantly using established algorithms. By internalizing this code-based reasoning style during RFT, the model is expected to become more proficient at translating problem statements into executable code snippets.
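
For illustration, a completion in the style this strategy is meant to internalize might look like the following. This is a hypothetical example constructed for this summary, not an actual model output reported in the paper.

```python
# Hypothetical example of the response format targeted by the <code> pPE strategy.
# The model would emit this as plain text: a <code> block followed by an <answer> block.
example_completion = """\
<code>
def count_multiples(limit: int, k: int) -> int:
    return sum(1 for n in range(1, limit + 1) if n % k == 0)

print(count_multiples(100, 7))  # 14 multiples of 7 between 1 and 100
</code>
<answer>\\boxed{14}</answer>
"""
print(example_completion)
```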

2.4. Knowledge Recall (Knowledge Prompt)

The “Knowledge Recall” pPE strategy, implemented with the <knowledge> tag, is designed to prompt the language model to explicitly retrieve and state relevant factual knowledge, definitions, formulas, or concepts before proceeding with the problem-solving process. This approach is inspired by generated knowledge prompting techniques, where models are encouraged to recall pertinent information to aid in reasoning . The prior prompt for this strategy instructs: “The assistant first recalls relevant knowledge used to solve the problem and then provides the user with the answer. The knowledge and answer are enclosed within <knowledge> </knowledge> and <answer> </answer> tags, respectively” . The standard problem and answer formatting instructions follow. The core aim of this pPE strategy is to make the model’s reliance on internalized knowledge more explicit and deliberate. By prompting the model to first articulate the relevant knowledge pieces, it is encouraged to activate the correct contextual information needed to address the problem . This can be particularly useful for tasks that heavily depend on specific domain knowledge, such as recalling mathematical theorems, physical laws, or definitions.

2.5. Null-example Utilization (Examples Prompt)

The “Null-example Utilization” pPE strategy, which employs the <examples> tag, encourages the language model to generate or list relevant illustrative examples or counterexamples before attempting to solve the main problem. This strategy is based on null-shot prompting techniques, where models are guided to provide examples that help clarify the problem space or constraints, exploiting the model’s inductive biases without providing actual demonstrations . The prior prompt for this approach states: “The assistant first lists relevant examples used to solve the problem and then provides the user with the answer. The examples and answer are enclosed within <examples> </examples> and <answer> </answer> tags, respectively” . The structure includes the usual problem context and answer format instructions. The primary goal of this pPE strategy is to foster a problem-solving behavior where the model actively considers concrete instances or scenarios related to the problem. Generating examples can help in understanding the problem’s boundaries, identifying patterns, or testing hypotheses . By internalizing this behavior through RFT, the model is expected to develop a more nuanced understanding of problems by grounding its reasoning in concrete cases. Notably, the abstract indicates that the null-example pPE approach achieved the largest average performance gain in their experiments .

3. Experimental Setup

The experimental setup is designed to systematically evaluate the impact of different Prior Prompt Engineering (pPE) strategies on language model performance and behavior during Reinforcement Fine-Tuning (RFT) . The core of the setup involves training multiple instances of a base language model using a standard RFT pipeline, where the only varying factor across experiments is the pPE approach and its associated output format expectation for the reward function. This controlled variation allows for a direct comparison of how different instructional cues, translated from inference-time prompt engineering (iPE) strategies, influence the model’s learning process and final capabilities. The evaluation encompasses both quantitative assessments on established benchmarks and qualitative analyses of model behaviors and training dynamics . The study utilizes a specific base model, a consistent RFT algorithm, a defined training dataset, and a structured reward function to ensure comparability across the different pPE conditions.

3.1. Model used: Qwen2.5-7B

For the main experiments conducted in this study, the Qwen2.5-7B model was employed as the base language model. Qwen2.5-7B is a 7-billion-parameter language model developed by the Qwen team at Alibaba Cloud. The choice of this specific model provides a strong, contemporary foundation for investigating the effects of different pPE strategies. All five pPE variants (Reasoning/Think, Planning/Plan, Code-based Reasoning/Code, Knowledge Recall/Knowledge, and Null-example Utilization/Examples) were trained from this same Qwen2.5-7B base model, ensuring that any observed differences in performance or behavior can be attributed to the pPE approach rather than to variations in the underlying model architecture or pre-training. The study also uses Qwen2.5-7B, prompted at inference time with each corresponding iPE approach, as a comparison baseline to assess the benefits of internalizing these behaviors through RFT versus eliciting them at inference. This consistent use of Qwen2.5-7B across all experimental conditions and baselines allows for a clear and direct evaluation of the pPE methodologies.
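
As a point of reference, a minimal sketch of the iPE-prompted baseline might look like the following, using the Hugging Face transformers library. The checkpoint identifier “Qwen/Qwen2.5-7B”, the generation settings, and the build_prior_prompt helper (sketched in Section 2) are assumptions about the setup, not the paper’s released configuration.

```python
# Sketch of the iPE-prompted baseline: the base model receives the instruction only at
# inference time, with no RFT. Assumes the Hugging Face checkpoint "Qwen/Qwen2.5-7B".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = build_prior_prompt("What is 17 * 24?", "think")  # helper sketched earlier
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)
```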

3.2. Benchmarks for evaluation

The evaluation of models trained with different pPE strategies, as well as baseline iPE-prompted models, involves a comprehensive suite of quantitative benchmarks. These benchmarks are selected to assess performance across a range of capabilities, including mathematical reasoning, coding, and general question answering. The use of diverse benchmarks aims to assess both the in-domain and out-of-domain generalization capabilities of the pPE-trained models. Performance is reported as average accuracy across these benchmarks. This selection allows the researchers to determine not only whether pPE improves performance but also whether certain pPE strategies are particularly well-suited to specific types of tasks. The details of these benchmarks are further elaborated in Appendix D.4.1 of the paper.

The table below lists the benchmarks used for evaluating the pPE-trained models.

| Benchmark Category | Benchmark Name(s) | Description | Relevance to pPE Strategies |
| --- | --- | --- | --- |
| Mathematical | AIME2024 (AIME) | Problems from the American Invitational Mathematics Examination, known for difficulty and deep insight. | Tests deep mathematical reasoning, planning, and potentially code-based solutions. |
| Mathematical | AMC12 ’22–’23 (AMC) | Problems from the American Mathematics Competition, testing a broad range of high school math concepts. | Assesses general mathematical problem-solving and reasoning. |
| Mathematical | MATH-500 (MATH) | 500 problems from the MATH dataset by Hendrycks et al. (2021), evaluating pre-university level math. | Measures performance on standard mathematical reasoning tasks. |
| General | GPQA-Diamond (GPQA) | Graduate-Level Google-Proof Q&A, Diamond subset (most difficult), testing advanced reasoning and research. | Assesses generalization to complex, non-mathematical reasoning and knowledge application. |
| General | HumanEval+ (HE+) | Evaluates functional correctness of code generated by LMs from docstrings (Python). | Crucial for evaluating the “Code-based Reasoning” pPE strategy and coding proficiency. |

Table 2: Benchmarks used for evaluating models trained with different Prior Prompt Engineering (pPE) strategies.

3.2.1. Mathematical benchmarks (e.g., AIME2024, AMC12, MATH-500)

To assess mathematical reasoning capabilities, the study employs several challenging mathematical benchmarks. These include AIME2024 (AIME), which features problems from the American Invitational Mathematics Examination, known for its difficulty and requirement for deep mathematical insight. Another benchmark is AMC12 ’22–’23 (AMC), comprising problems from the American Mathematics Competition, which tests a broad range of high school mathematics concepts. Additionally, the MATH-500 (MATH) benchmark, derived from the larger MATH dataset introduced by Hendrycks et al. (2021), is used. MATH-500 consists of 500 problems and is designed to evaluate the problem-solving abilities of models on pre-university level mathematics. These benchmarks collectively provide a rigorous testbed for the mathematical reasoning skills cultivated by the different pPE strategies, especially since the RFT training dataset (STILLv3) is also math-focused. Performance on these benchmarks indicates how well the internalized behaviors translate to solving complex mathematical problems.
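
For these math benchmarks, scoring reduces to pulling the final \boxed{} expression out of the model’s <answer> block and checking it against the reference. The sketch below uses a naive exact string match; the paper relies on a dedicated checker (math-verify) for robust mathematical equivalence, so this is a simplified stand-in.

```python
import re

def extract_boxed(response: str) -> str | None:
    """Return the last \\boxed{...} expression in a response (simple answers only; no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def is_correct(response: str, reference: str) -> bool:
    """Naive exact-match grading; the paper uses math-verify for equivalence checking instead."""
    predicted = extract_boxed(response)
    return predicted is not None and predicted == reference.strip()

print(is_correct("<think>17 * 24 = 408</think><answer>\\boxed{408}</answer>", "408"))  # True
```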

3.2.2. General benchmarks (e.g., GPQA-Diamond, HumanEval+)

Beyond mathematical reasoning, the study evaluates model performance on general benchmarks covering coding and question answering to test for broader generalization. For coding capabilities, HumanEval+ (HE+) is used, which includes both base and extra sets of problems designed to assess the functional correctness of code generated by LMs in response to docstrings . This benchmark is crucial for evaluating the effectiveness of the “Code-based Reasoning” pPE strategy. For general question answering, particularly focusing on graduate-level reasoning, the GPQA-Diamond (GPQA) benchmark is utilized . GPQA is designed to be resistant to shallow pattern matching or memorization, requiring deep understanding and complex reasoning across various domains. The inclusion of GPQA-Diamond allows the researchers to assess whether the behaviors internalized through math-focused RFT with different pPE strategies can transfer to challenging, non-mathematical reasoning tasks. The performance on these general benchmarks provides insights into the versatility and robustness of the pPE-trained models .

3.3. Reward design: Accuracy and Format rewards

The reward function used during Reinforcement Fine-Tuning (RFT) is a critical component designed to reinforce the desired behaviors elicited by the prior prompts. In this study, the reward function comprises two equally weighted components: an accuracy reward and a format reward. This balanced approach incentivizes the model to produce responses that are not only factually correct but also well-structured and adherent to the specified output format. The total reward is the sum of the two components, yielding a final value in the range [0, 1]. This design choice, following the approach introduced by Xie et al. (2025), is intended to ensure that the model learns to generate outputs that are both accurate and usable.

The accuracy reward assesses whether the model produces the correct final answer to a given problem. This typically involves verifying the final answer extracted from the model’s output (e.g., from \boxed{} tags) using tools like math-verify for mathematical problems, and awarding a score of 0.5 for a correct answer and 0.0 for an incorrect one. The format reward assesses whether the model’s output follows an expected, specific format, dynamically updated to match the pPE approach being used. Specifically, the expected format is <x> [model’s reasoning/planning/code/knowledge/examples related to the pPE strategy] </x> followed by <answer> [final answer] </answer>, where x is one of think, plan, code, knowledge, or examples. A format reward of 0.5 is granted if the response contains exactly one pair of the expected XML-like tags with a valid basic structure, even if other elements are present; otherwise, it is 0.0. This dual-component reward ensures that the model is incentivized not only to get the right answer but also to articulate its process in the manner dictated by the pPE strategy. Additional details on the reward function are available in Appendix D.2 of the paper.
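
Put together, the description above corresponds to a reward of the form r = r_format + r_accuracy, with each component worth 0.5. The sketch below is one way to read that description, not the authors’ released implementation; in particular, the regex-based check for “exactly one pair of tags” and the pluggable grader are assumptions.

```python
import re
from typing import Callable

def format_reward(response: str, tag: str) -> float:
    """0.5 if the response contains exactly one <tag>...</tag> block and exactly one <answer>...</answer> block."""
    behavior_blocks = re.findall(rf"<{tag}>.*?</{tag}>", response, flags=re.DOTALL)
    answer_blocks = re.findall(r"<answer>.*?</answer>", response, flags=re.DOTALL)
    return 0.5 if len(behavior_blocks) == 1 and len(answer_blocks) == 1 else 0.0

def accuracy_reward(response: str, reference: str,
                    grader: Callable[[str, str], bool]) -> float:
    """0.5 if the graded final answer matches the reference (the paper uses math-verify here)."""
    return 0.5 if grader(response, reference) else 0.0

def total_reward(response: str, reference: str, tag: str,
                 grader: Callable[[str, str], bool]) -> float:
    """Combined reward in [0, 1]: equally weighted format and accuracy components."""
    return format_reward(response, tag) + accuracy_reward(response, reference, grader)

# Example, reusing the naive is_correct grader sketched in Section 3.2.1:
# total_reward("<plan>outline</plan><answer>\\boxed{42}</answer>", "42", "plan", is_correct)  # -> 1.0
```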

4. Key Findings and Results

4.1. Performance of pPE-trained models vs. iPE-prompted counterparts

A central finding of the research is that Language Models (LMs) trained using Prior Prompt Engineering (pPE) strategies during Reinforcement Fine-Tuning (RFT) consistently outperformed their counterparts that relied solely on inference-time Prompt Engineering (iPE) . The study compared Qwen2.5-7B models fine-tuned with various pPE approaches against a baseline where the base model was prompted using standard iPE techniques at the time of inference. The results indicated a clear advantage for the pPE-trained models across the evaluated benchmarks, which included mathematical reasoning tasks (like AIME2024) and general problem-solving benchmarks (such as GPQA-Diamond and HumanEval+) . This suggests that internalizing the desired behaviors—such as step-by-step reasoning, planning, code generation, knowledge recall, or null-example utilization—during the RFT process leads to more effective problem-solving than merely prompting for these behaviors at inference time. The performance gain highlights the efficacy of pPE in shaping the model’s fundamental approach to tasks, making it more inherently aligned with the desired problem-solving methodologies.

4.2. Superior performance of Null-example pPE

Among the five Prior Prompt Engineering (pPE) strategies investigated, the Null-example Utilization approach (referred to as <examples> pPE) demonstrated particularly noteworthy results. This strategy, which encourages the model to generate or reference illustrative examples before answering, achieved the largest average performance gain across the evaluated benchmarks . Furthermore, the Null-example pPE approach led to the highest improvement on two challenging benchmarks: AIME2024 (a mathematical reasoning test) and GPQA-Diamond (a graduate-level question-answering dataset) . This outcome was somewhat unexpected, as null-shot prompting, the iPE technique inspiring this pPE, involves prompting the model to utilize non-existent in-context examples . The superior performance of the Null-example pPE suggests that training the model to actively engage in generating or considering relevant examples as part of its problem-solving process can significantly enhance its reasoning and generalization abilities. This finding positions Null-example pPE as a highly effective method for RFT, surpassing even the commonly used reasoning-based pPE approaches in terms of overall performance improvement .

4.3. Impact of different pPE strategies on model behavior and style

The research provides evidence that different Prior Prompt Engineering (pPE) strategies instill distinct behavioral styles in the resulting Language Models (LMs) after Reinforcement Fine-Tuning (RFT) . By adapting a behavior-classification framework, the authors were able to demonstrate that the choice of pPE approach—be it reasoning, planning, code-based reasoning, knowledge recall, or null-example utilization—leads to observable differences in how the models approach and solve tasks . This finding is significant because it suggests that pPE can be used not only to improve overall performance but also to cultivate specific “thinking” modes or problem-solving methodologies within LMs. For instance, a model trained with a <code> pPE might inherently favor algorithmic solutions, while one trained with a <think> (reasoning) pPE might excel at breaking down problems into logical steps. The paper highlights that these distinct behavioral styles are a direct consequence of the internalized prompting strategy, effectively shaping the model’s operational characteristics. This positions pPE as a powerful yet understudied axis for RFT, offering a pathway to develop more specialized and versatile AI systems with diverse cognitive profiles .
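
The behavior-classification framework the authors adapt is not reproduced here; as a rough, hedged illustration of what such an analysis can involve, the sketch below profiles the tagged portion of a response by its length and by counts of simple surface markers. The marker lists are made up for this summary and are not the paper’s taxonomy.

```python
import re

# Crude proxy for behavior-style analysis; marker lists are illustrative, not the paper's.
STYLE_MARKERS = {
    "reasoning": ["therefore", "so ", "step"],
    "planning": ["first", "then", "next", "finally"],
    "code": ["def ", "return", "for ", "print("],
    "knowledge": ["theorem", "formula", "by definition"],
    "examples": ["for example", "e.g.", "consider"],
}

def style_profile(response: str, tag: str) -> dict[str, float]:
    """Length of the <tag> block plus per-style marker counts for one response."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    body = match.group(1).lower() if match else ""
    profile = {"length_tokens": float(len(body.split()))}
    for style, markers in STYLE_MARKERS.items():
        profile[style] = float(sum(body.count(m) for m in markers))
    return profile

print(style_profile("<plan>First outline the cases, then check each one.</plan><answer>\\boxed{3}</answer>", "plan"))
```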

4.4. Domain generalization capabilities

The study observed that the performance improvements achieved through Prior Prompt Engineering (pPE) and Reinforcement Fine-Tuning (RFT) often extend beyond the specific domain of the training data. Although the RFT process in these experiments primarily used mathematical problems for training, the resulting performance gains were also evident in other domains, as demonstrated by improvements on benchmarks like GPQA (Graduate-Level Google-Proof Q&A) and HumanEval+ (a code generation benchmark). This suggests a degree of robustness in the RFT approach, indicating that RFT may function more as a mechanism for discovering generally useful generation patterns and problem-solving strategies rather than merely infusing the model with new, domain-specific knowledge. The authors hypothesize that broader performance generalization is likely to emerge with the use of more diverse training data, potentially incorporating code or logic problems alongside mathematical ones. This finding is important because it implies that the behavioral patterns internalized through pPE can transfer across different types of tasks, enhancing the model’s overall versatility and adaptability.

5. Implications and Significance

5.1. pPE as a powerful, understudied axis for RFT

The findings of this research strongly position Prior Prompt Engineering (pPE) as a powerful, yet previously understudied, dimension within Reinforcement Fine-Tuning (RFT) for Language Models (LMs) . While existing RFT research has predominantly concentrated on algorithmic innovations, reward function design, and data curation, the systematic engineering of the prior prompt—the instruction that elicits specific behaviors during training—has received comparatively less attention . This paper demonstrates that varying the pPE strategy can lead to significant differences in model performance and, crucially, instill distinct behavioral styles in the resulting LMs . The success of pPE, particularly strategies like Null-example Utilization, in surpassing traditional inference-time prompting and even common reasoning-based pPE, highlights its potential as a highly effective tool for model optimization . By showing that how a model is prompted during training (i.e., the nature of the pPE) can fundamentally shape its internalized problem-solving approaches, the study opens up a new avenue for research and development in RFT.

5.2. Potential for cultivating diverse AI “thinking” modes

One of the significant implications of this work is the potential for Prior Prompt Engineering (pPE) to cultivate diverse AI “thinking” modes or cognitive styles within Language Models . The research demonstrates that different pPE strategies—such as those encouraging reasoning, planning, code-based problem-solving, knowledge recall, or the use of illustrative examples—result in models that exhibit distinct behavioral characteristics . This suggests that pPE can be used not just to improve task performance in a general sense, but to intentionally shape the underlying processes by which an LM approaches and solves problems. For example, a model trained with a <code> pPE might develop a more algorithmic and structured “thought process,” while one trained with a <plan> pPE might become more adept at strategic decomposition of tasks. This ability to instill specific problem-solving heuristics or reasoning patterns offers a pathway to creating a suite of specialized LMs, each optimized for different types of tasks or exhibiting different cognitive strengths. Such diversification could lead to more robust and versatile AI systems.

5.3. Robustness of RFT in discovering useful generation patterns

The research highlights the robustness of Reinforcement Fine-Tuning (RFT) as a mechanism for Language Models (LMs) to discover useful generation patterns, particularly when guided by effective Prior Prompt Engineering (pPE) . An important observation is that performance improvements achieved through RFT, using mathematical problems for training, often extended to other domains, as evidenced by enhanced performance on benchmarks like GPQA and HumanEval+ . This suggests that RFT, in conjunction with pPE, is not merely about memorizing training data or acquiring narrow, domain-specific knowledge. Instead, it appears to facilitate the learning of more generalizable problem-solving strategies and response generation patterns that are applicable across different types of tasks. The authors posit that RFT may function more as a discovery process for these beneficial patterns, which are then internalized by the model . This inherent capability of RFT to uncover and reinforce effective behaviors, especially when steered by well-designed pPE, underscores its value in developing LMs that are not only high-performing on specific benchmarks but also exhibit a degree of adaptability and transferable skill.

6. Conclusion

6.1. Summary of contributions

The paper “Prior Prompt Engineering for Reinforcement Fine-Tuning” makes several key contributions to the field of Language Model (LM) optimization. Firstly, it systematically investigates the underexplored area of Prior Prompt Engineering (pPE) within the Reinforcement Fine-Tuning (RFT) paradigm, demonstrating its significant impact on model performance and behavior . Secondly, the study translates five representative inference-time prompt engineering (iPE) strategies—reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization—into corresponding pPE approaches and empirically evaluates their effectiveness using the Qwen2.5-7B model . A major finding is that all pPE-trained models surpassed their iPE-prompted counterparts, with the null-example pPE approach achieving the largest average performance gain and the highest improvements on challenging benchmarks like AIME2024 and GPQA-Diamond . Thirdly, the research shows that different pPE strategies instill distinct behavioral styles in the resulting models, indicating that pPE can be used to cultivate diverse AI “thinking” modes . Finally, the work positions pPE as a powerful and previously understudied axis for RFT, offering a new avenue for enhancing LM capabilities and robustness, with observed performance generalization across different domains .
