Benchmark 🧪

History

This page tracks the evolution of benchmark results over time, with the goal of ensuring continuous improvement in overall performance with each new version.

TIP

Want to see a detailed breakdown of the latest results? Go to Latest.

Stable Modules and Providers

Not all modules and providers are stable. Some are in beta or alpha stages, which means they might have breaking changes in future releases, and their benchmark results can vary significantly between versions.

This section focuses on the stable modules and providers:

  • Modules: DataExtractor, Comparator, and Scorer.
  • Providers: OpenAI, Anthropic, Google, and DeepSeek.

Overall history

| Version | claude-3-5-haiku-20241022 | claude-sonnet-4-20250514 | deepseek-chat | gemini-2.5-flash | gemini-2.5-pro | gpt-4.1-mini | gpt-5 | gpt-5-mini |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| v0.26.5 | - | - | 70 ~ 92 (79) | 64 ~ 93 (82.7) | - | 70 ~ 100 (87.7) | - | - |
| v0.26.6 | 64 ~ 91 (80.3) | - | 64 ~ 92 (79.7) | 64 ~ 94 (83) | - | 64 ~ 100 (85.7) | - | - |
| v0.27.1 | 70 ~ 100 (85) | - | 76 ~ 92 (86.3) | 76 ~ 93 (86.7) | - | 70 ~ 100 (87) | - | - |
| v0.28.0 | 64 ~ 85 (75) | - | 64 ~ 95 (81) | 70 ~ 94 (82.7) | - | 58 ~ 92 (78) | - | - |
| v0.28.1 | 64 ~ 85 (75) | - | 70 ~ 95 (83) | 64 ~ 94 (78) | - | 52 ~ 94 (71.7) | - | - |
| v0.28.2 | 58 ~ 84 (75.3) | - | 76 ~ 93 (84.3) | 52 ~ 91 (70.7) | - | 70 ~ 94 (82.7) | - | - |
| v0.29.0 | 70 ~ 84 (79) | - | 70 ~ 95 (83) | 70 ~ 94 (80) | - | 70 ~ 91 (81.7) | - | - |
| v0.29.1 | 64 ~ 86 (78) | - | 82 ~ 94 (86.7) | 58 ~ 92 (73) | - | 58 ~ 90 (77.3) | - | - |
| v0.30.3 | 74 ~ 96 (84) | 41 ~ 96 (71.3) | 83 ~ 96 (90) | 61 ~ 95 (83.1) | 54 ~ 95 (80.3) | - | 58 ~ 96 (79.7) | 51 ~ 96 (80.4) |
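
The overall figures appear to be derivable from the per-module tables on this page. For the rows checked, the overall minimum is the lowest module minimum, the maximum is the highest module maximum, and the value in parentheses is the mean of the module averages. The Python sketch below assumes each cell reads `min ~ max (average)`, which the page does not state explicitly. `parse_cell` and `combine` are hypothetical helper names used only for illustration.

```python
import re
from statistics import mean

# Assumed cell layout: "min ~ max (average)"; decimals may use "," or ".".
NUM = r"(\d+(?:[.,]\d+)?)"
CELL = re.compile(rf"^\s*{NUM}\s*~\s*{NUM}\s*\({NUM}\)\s*$")


def parse_cell(cell):
    """Return (min, max, average) as floats, or None for "-" (no result)."""
    if cell.strip() == "-":
        return None
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    low, high, avg = (float(g.replace(",", ".")) for g in match.groups())
    return low, high, avg


def combine(cells):
    """Combine per-module cells into one (min, max, average) triple."""
    parsed = [p for p in map(parse_cell, cells) if p is not None]
    if not parsed:
        return None
    low = min(p[0] for p in parsed)
    high = max(p[1] for p in parsed)
    avg = round(mean(p[2] for p in parsed), 1)
    return low, high, avg


# deepseek-chat at v0.26.5: DataExtractor, Comparator, Scorer cells.
print(combine(["92 ~ 92 (92)", "75 ~ 75 (75)", "70 ~ 70 (70)"]))
# (70.0, 92.0, 79.0)
```

The result matches the `70 ~ 92 (79)` entry for deepseek-chat at v0.26.5 in the overall table. Cells marked `-` are skipped rather than counted as zero, which is also an assumption.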

DataExtractor

DataExtractor history

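| Version | claude-3-5-haiku-20241022 | claude-sonnet-4-20250514 | deepseek-chat | gemini-2.5-flash | gemini-2.5-flash-lite | gemini-2.5-pro | gpt-4.1-mini | gpt-5 | gpt-5-mini | gpt-5-nano |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| v0.26.5 | - | - | 92 ~ 92 (92) | 93 ~ 93 (93) | - | - | 93 ~ 93 (93) | - | - | - |
| v0.26.6 | 86 ~ 86 (86) | - | 92 ~ 92 (92) | 94 ~ 94 (94) | - | - | 93 ~ 93 (93) | - | - | - |
| v0.27.1 | 85 ~ 85 (85) | - | 92 ~ 92 (92) | 93 ~ 93 (93) | - | - | 91 ~ 91 (91) | - | - | - |
| v0.28.0 | 85 ~ 85 (85) | - | 95 ~ 95 (95) | 94 ~ 94 (94) | - | - | 92 ~ 92 (92) | - | - | - |
| v0.28.1 | 85 ~ 85 (85) | - | 95 ~ 95 (95) | 94 ~ 94 (94) | - | - | 94 ~ 94 (94) | - | - | - |
| v0.28.2 | 84 ~ 84 (84) | - | 93 ~ 93 (93) | 91 ~ 91 (91) | - | - | 94 ~ 94 (94) | - | - | - |
| v0.29.0 | 83 ~ 83 (83) | - | 95 ~ 95 (95) | 94 ~ 94 (94) | - | - | 91 ~ 91 (91) | - | - | - |
| v0.29.1 | 86 ~ 86 (86) | - | 94 ~ 94 (94) | 92 ~ 92 (92) | - | - | 90 ~ 90 (90) | - | - | - |
| v0.30.3 | 84 ~ 85 (84.7) | 77 ~ 77 (77) | 93 ~ 94 (93.7) | 93 ~ 95 (94.3) | 73 ~ 76 (74.3) | 95 ~ 95 (95) | - | 85 ~ 85 (85) | 88 ~ 93 (91.3) | 78 ~ 80 (79) |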

Comparator

Comparator history

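| Version | claude-3-5-haiku-20241022 | claude-sonnet-4-20250514 | deepseek-chat | gemini-2.5-flash | gemini-2.5-flash-lite | gemini-2.5-pro | gpt-4.1-mini | gpt-5 | gpt-5-mini | gpt-5-nano |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| v0.26.5 | - | - | 75 ~ 75 (75) | 91 ~ 91 (91) | - | - | 100 ~ 100 (100) | - | - | - |
| v0.26.6 | 91 ~ 91 (91) | - | 83 ~ 83 (83) | 91 ~ 91 (91) | - | - | 100 ~ 100 (100) | - | - | - |
| v0.27.1 | 100 ~ 100 (100) | - | 91 ~ 91 (91) | 91 ~ 91 (91) | - | - | 100 ~ 100 (100) | - | - | - |
| v0.28.0 | 76 ~ 76 (76) | - | 84 ~ 84 (84) | 84 ~ 84 (84) | - | - | 84 ~ 84 (84) | - | - | - |
| v0.28.1 | 76 ~ 76 (76) | - | 84 ~ 84 (84) | 76 ~ 76 (76) | - | - | 69 ~ 69 (69) | - | - | - |
| v0.28.2 | 84 ~ 84 (84) | - | 84 ~ 84 (84) | 69 ~ 69 (69) | - | - | 84 ~ 84 (84) | - | - | - |
| v0.29.0 | 84 ~ 84 (84) | - | 84 ~ 84 (84) | 76 ~ 76 (76) | - | - | 84 ~ 84 (84) | - | - | - |
| v0.29.1 | 84 ~ 84 (84) | - | 84 ~ 84 (84) | 69 ~ 69 (69) | - | - | 84 ~ 84 (84) | - | - | - |
| v0.30.3 | 92 ~ 96 (93.3) | 96 ~ 96 (96) | 92 ~ 96 (93.3) | 92 ~ 92 (92) | 85 ~ 92 (89.7) | 92 ~ 92 (92) | - | 96 ~ 96 (96) | 92 ~ 96 (94.7) | 92 ~ 100 (96) |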

Scorer

Scorer history

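| Version | claude-3-5-haiku-20241022 | claude-sonnet-4-20250514 | deepseek-chat | gemini-2.5-flash | gemini-2.5-flash-lite | gemini-2.5-pro | gpt-4.1-mini | gpt-5 | gpt-5-mini | gpt-5-nano |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| v0.26.5 | - | - | 70 ~ 70 (70) | 64 ~ 64 (64) | - | - | 70 ~ 70 (70) | - | - | - |
| v0.26.6 | 64 ~ 64 (64) | - | 64 ~ 64 (64) | 64 ~ 64 (64) | - | - | 64 ~ 64 (64) | - | - | - |
| v0.27.1 | 70 ~ 70 (70) | - | 76 ~ 76 (76) | 76 ~ 76 (76) | - | - | 70 ~ 70 (70) | - | - | - |
| v0.28.0 | 64 ~ 64 (64) | - | 64 ~ 64 (64) | 70 ~ 70 (70) | - | - | 58 ~ 58 (58) | - | - | - |
| v0.28.1 | 64 ~ 64 (64) | - | 70 ~ 70 (70) | 64 ~ 64 (64) | - | - | 52 ~ 52 (52) | - | - | - |
| v0.28.2 | 58 ~ 58 (58) | - | 76 ~ 76 (76) | 52 ~ 52 (52) | - | - | 70 ~ 70 (70) | - | - | - |
| v0.29.0 | 70 ~ 70 (70) | - | 70 ~ 70 (70) | 70 ~ 70 (70) | - | - | 70 ~ 70 (70) | - | - | - |
| v0.29.1 | 64 ~ 64 (64) | - | 82 ~ 82 (82) | 58 ~ 58 (58) | - | - | 58 ~ 58 (58) | - | - | - |
| v0.30.3 | 74 ~ 74 (74) | 41 ~ 41 (41) | 83 ~ 83 (83) | 61 ~ 64 (63) | 70 ~ 77 (73.7) | 54 ~ 54 (54) | - | 58 ~ 58 (58) | 51 ~ 61 (55.3) | 41 ~ 48 (43.3) |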

Lister (unstable)

Lister history

| Version | claude-3-5-haiku-20241022 | claude-sonnet-4-20250514 | deepseek-chat | gemini-2.5-flash | gemini-2.5-flash-lite | gemini-2.5-pro | gpt-5 | gpt-5-mini | gpt-5-nano |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| v0.30.3 | 73.0 ~ 75.0 (74.3) | 70.0 ~ 70.0 (70.0) | 14.0 ~ 79.0 (56.0) | 61.0 ~ 63.0 (61.7) | 51.0 ~ 55.0 (52.7) | 52.0 ~ 52.0 (52.0) | 42.0 ~ 42.0 (42.0) | 45.0 ~ 51.0 (48.7) | 22.0 ~ 36.0 (29.0) |

Ranker (unstable)

Ranker history

| Version | deepseek-chat | gemini-2.5-flash | gemini-2.5-flash-lite | gpt-4.1-mini | gpt-5-mini | gpt-5-nano |
| --- | --- | --- | --- | --- | --- | --- |
| v0.26.5 | 50 ~ 50 (50) | 50 ~ 50 (50) | - | 100 ~ 100 (100) | - | 50 ~ 100 (66.7) |
| v0.26.6 | 100 ~ 100 (100) | 100 ~ 100 (100) | - | 50 ~ 50 (50) | - | 0 ~ 100 (62.5) |
| v0.27.1 | 50 ~ 50 (50) | 50 ~ 50 (50) | - | 50 ~ 50 (50) | - | 0 ~ 50 (37.5) |
| v0.28.2 | 100 ~ 100 (100) | 100 ~ 100 (100) | - | 50 ~ 50 (50) | - | 0 ~ 100 (62.5) |
| v0.29.0 | 100 ~ 100 (100) | 50 ~ 50 (50) | - | 50 ~ 50 (50) | - | 0 ~ 100 (50) |
| v0.29.1 | - | 50 ~ 50 (50) | - | 50 ~ 50 (50) | - | 0 ~ 50 (25) |
| v0.30.3 | 50 ~ 50 (50) | 50 ~ 100 (66.7) | 0 ~ 50 (33.3) | - | 50 ~ 100 (66.7) | 0 ~ 100 (36.1) |

P.S. Logs for versions prior to v0.26.5 are unavailable. I will, however, keep this page updated with every new version.