NLPExplorer

GEM - 2025

Total Papers:- 68

Total Papers across all years:- 164

Total Citations :- 0

PersonaTwin: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins

Sihan Chen | John P. Lalor | Yi Yang | Ahmed Abbasi |

Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics

Zena Al Khalili | Nick Howell | Dietrich Klakow |

Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs

Minsuh Joo | Hyunsoo Cho |

The 2025 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results

Anya Belz | Craig Thomson | Javier González Corbelle | Malo Ruelle |

Metric assessment protocol in the context of answer fluctuation on MCQ tasks

Clustering Zero-Shot Uncertainty Estimations to Assess LLM Response Accuracy for Yes/No Q&A

Christopher T. Franck | Amy Vennos | W. Graham Mueller | Daniel Dakota |

ReproHum #0729-04: Human Evaluation Reproduction Report for “MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes”

Simeon Junker |

Are Bias Evaluation Methods Biased ?

Lina Berrayana | Sean Rooney | Luis Garcés-Erice | Ioana Giurgiu |

ARGENT: Automatic Reference-free Evaluation for Open-Ended Text Generation without Source Inputs

Xinyue Zhang | Agathe Zecevic | Sebastian Zeki | Angus Roberts |

ReproHum #0067-01: A Reproduction of the Evaluation of Cross-Lingual Summarization

Supryadi | Chuang Liu | Deyi Xiong |

Measure only what is measurable: towards conversation requirements for evaluating task-oriented dialogue systems

Selective Shot Learning for Code Explanation

Paheli Bhattacharya | Rishabh Gupta |

(Dis)improved?! How Simplified Language Affects Large Language Model Performance across Languages

HuGME: A benchmark system for evaluating Hungarian generative LLMs