NLPExplorer
GEM - 2025
Total Papers: 68
Total Papers across all years: 164
Total Citations: 0

SECQUE: A Benchmark for Evaluating Real-World Financial Analysis Capabilities
Noga BenYoash | Menachem Brief | Oded Ovadia | Gil Shenderovitz | Moshik Mishaeli | Rachel Lemberg | Eitam Sheetrit

Big Escape Benchmark: Evaluating Human-Like Reasoning in Language Models via Real-World Escape Room Challenges
Zinan Tang | QiYao Sun

ReproHum #0033-05: Human Evaluation of Factuality from A Multidisciplinary Perspective
Andra-Maria Florescu | Marius Micluța-Câmpeanu | Stefana Arina Tabusca | Liviu P Dinu

ReproHum #0031-01: Reproducing the Human Evaluation of Readability from “It is AI’s Turn to Ask Humans a Question”
Daniel Braun

Free-text Rationale Generation under Readability Level Control
Yi-Sheng Hsu | Nils Feldhus | Sherzod Hakimov

Spatial Representation of Large Language Models in 2D Scene
Wenya Wu | Weihong Deng

Finance Language Model Evaluation (FLaME)
Glenn Matlin | Mika Okamoto | Huzaifa Pardawala | Yang Yang | Sudheer Chava

(Towards) Scalable Reliable Automated Evaluation with Large Language Models
Bertil Braun | Martin Forell

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in Large Language Models
Konstantin Chernyshev | Vitaliy Polshkov | Vlad Stepanov | Alex Myasnikov | Ekaterina Artemova | Alexei Miasnikov | Sergei Tilga

HEDS 3.0: The Human Evaluation Data Sheet Version 3.0
Anya Belz | Craig Thomson

An Analysis of Datasets, Metrics and Models in Keyphrase Generation
Florian Boudin | Akiko Aizawa

Prompt, Translate, Fine-Tune, Re-Initialize, or Instruction-Tune? Adapting LLMs for In-Context Learning in Low-Resource Languages
Christopher Toukmaji | Jeffrey Flanigan

From Calculation to Adjudication: Examining LLM Judges on Mathematical Reasoning Tasks
Andreas Stephan | Dawei Zhu | Matthias Aßenmacher | Xiaoyu Shen | Benjamin Roth

Investigating the Robustness of Retrieval-Augmented Generation at the Query Level
Sezen Perçin | Xin Su | Qutub Sha Syed | Phillip Howard | Aleksei Kuvshinov | Leo Schwinn | Kay-Ulrich Scholl

Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish
Elif Ecem Umutlu | Ayse Aysu Cengiz | Ahmet Kaan Sever | Seyma Erdem | Burak Aytan | Busra Tufan | Abdullah Topraksoy | Esra Darıcı | Cagri Toraman

Conference Topic Distribution (chart; categories: Linguistic, Task, Approach, Language, Dataset)
Conference Citation Distribution: conference papers have no citations yet.