
Benchmark 🧪

What is it?

Simply put: benchmarking here means putting LLMs through real assessments, using precise questions and expecting clear, targeted answers. For developers, think of it as an end-to-end test without any mocks, just actual calls to the LLM provider. We run the same method with different inputs to verify that the outputs consistently meet expectations.

Importantly, the benchmark doesn't check every word of the response. Instead, it validates the overall result. For example, if we ask for a beauty score for a T-shirt and the model responds with 81, we check that the score falls within 70-90, not that it is exactly 81.
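
To illustrate, a benchmark assertion of this kind can be sketched in a few lines of Ruby (`rate_tshirt_beauty` is a hypothetical helper standing in for a real LLM call; it is not part of ActiveGenie):

```ruby
# Hypothetical helper that asks the LLM for a 0-100 beauty score.
score = rate_tshirt_beauty("black cotton T-shirt with a minimalist print")

# Assert the ballpark, not the exact value: any score within 70-90 passes,
# so the test stays stable across minor model or prompt changes.
raise "expected score in 70..90, got #{score}" unless (70..90).cover?(score)
```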

Why Benchmark LLM Features?

Treating benchmarks like end-to-end tests ensures everything works as expected, even as you tweak prompts, update models, adjust temperature, or refactor workflows. This way, you can confidently make changes without losing sight of the final goal.

Smart Recommendation

When you use a module without specifying a model or provider, ActiveGenie automatically selects the best performer based on the latest benchmarks. The selection process considers three key factors:

  • Precision: How accurate the model is on average.
  • Variance: How much the model's results fluctuate (lower is better for stability).
  • Cost: How much you pay per request (lower is better, especially for frequent use).

To balance these, we calculate a recommendation score for each model using the formula:

Recommendation Score = (Avg Precision / 100) - (Normalized Variance + Normalized Cost × 0.5)

Where:

  • Normalized Variance = (Max Precision - Min Precision) / 100
  • Normalized Cost = Cost / Max Cost
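
To make the formula concrete, here is a minimal Ruby sketch of the calculation (the method and argument names are ours, for illustration only):

```ruby
# Computes the recommendation score from the benchmark aggregates.
# max_cost is the highest per-run cost across all models (6.3 in the table below).
def recommendation_score(avg_precision:, min_precision:, max_precision:, cost:, max_cost:)
  normalized_variance = (max_precision - min_precision) / 100.0
  normalized_cost = cost / max_cost
  (avg_precision / 100.0) - (normalized_variance + normalized_cost * 0.5)
end

# deepseek-chat, using the values from the table below:
recommendation_score(avg_precision: 90.0, min_precision: 83.0,
                     max_precision: 96.0, cost: 0.7, max_cost: 6.3)
# => ~0.71, shown rounded as 0.7
```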

The model with the highest recommendation score is chosen for each module. The table below shows the overall best performers.

All modules

| Model | Min Precision | Max Precision | Avg Precision | Cost per run ($) | Recommendation Score |
| --- | --- | --- | --- | --- | --- |
| deepseek-chat | 83.0 | 96.0 | 90.0 | 0.7 | 0.7 |
| gemini-2.5-flash-lite | 70.0 | 92.0 | 79.2 | 0.3 | 0.6 |
| claude-3-5-haiku-20241022 | 74.0 | 96.0 | 84.0 | 3.5 | 0.4 |
| gemini-2.5-flash | 61.0 | 95.0 | 83.1 | 2.8 | 0.3 |
| gpt-5-mini | 51.0 | 96.0 | 80.4 | 1.8 | 0.2 |
| gpt-5 | 58.0 | 96.0 | 79.7 | 4.2 | 0.1 |
| gpt-5-nano | 41.0 | 100.0 | 72.8 | 0.6 | 0.1 |
| gemini-2.5-pro | 54.0 | 95.0 | 80.3 | 6.3 | -0.1 |
| claude-sonnet-4-20250514 | 41.0 | 96.0 | 71.3 | 4.1 | -0.1 |
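
In practice, leaning on the smart recommendation just means omitting the model and provider options. A minimal sketch, assuming a Scorer call shape like the one below (check your ActiveGenie version for the exact API):

```ruby
require 'active_genie'

# No :model or :provider given, so ActiveGenie falls back to the
# current best performer for this module based on these benchmarks.
result = ActiveGenie::Scorer.call(
  'black cotton T-shirt with a minimalist print',
  'Score the beauty of this T-shirt from 0 to 100'
)
```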

Module-by-Module Benchmarks

Extractor Module Benchmark Results


Smart recommendation:

  1. deepseek-chat
  2. gemini-2.5-flash
  3. gpt-5-mini

Summary Table (Tests shows passed/failed out of the total; Precision is the percentage of passing tests):

| Run | Provider | Model | Tests | Precision (%) | Duration (s) | Requests | Tokens | Avg. Duration (s) | Cost ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 17003761591 | anthropic | claude-3-5-haiku-20241022 | 101/18 (120) | 84 | 518.6 | 155 | 203751 | 4.30 | 0.8 |
| 17008403285 | anthropic | claude-3-5-haiku-20241022 | 103/16 (120) | 85 | 520.4 | 156 | 204635 | 4.34 | 0.81854 |
| 17008682819 | anthropic | claude-3-5-haiku-20241022 | 102/17 (120) | 85 | 515.8 | 155 | 203790 | 4.32 | 0.81516 |
| 17009564599 | anthropic | claude-sonnet-4-20250514 | 93/8 (120) | 77 | 624.7 | 138 | 178859 | 5.21 | 2.682885 |
| 17003761591 | deepseek | deepseek-chat | 113/7 (120) | 94 | 1402.4 | 162 | 145352 | 13.49 | 0.1598872 |
| 17008403285 | deepseek | deepseek-chat | 113/7 (120) | 94 | 1484.0 | 163 | 145826 | 12.37 | 0.1604086 |
| 17008682819 | deepseek | deepseek-chat | 112/8 (120) | 93 | 1619.3 | 164 | 146326 | 11.69 | 0.1609586 |
| 17003761591 | google | gemini-2.5-flash | 112/8 (120) | 93 | 406.7 | 164 | 212730 | 3.53 | 0.531825 |
| 17008403285 | google | gemini-2.5-flash | 114/6 (120) | 95 | 147.1 | 166 | 212434 | 3.50 | 0.531085 |
| 17008682819 | google | gemini-2.5-flash | 114/6 (120) | 95 | 419.8 | 166 | 215803 | 3.39 | 0.5 |
| 17003761591 | google | gemini-2.5-flash-lite | 88/19 (120) | 73 | 174.9 | 148 | 153896 | 1.40 | 0.0615584 |
| 17008403285 | google | gemini-2.5-flash-lite | 92/18 (120) | 76 | 423.7 | 151 | 155202 | 1.46 | 0.0620808 |
| 17008682819 | google | gemini-2.5-flash-lite | 89/19 (120) | 74 | 168.5 | 147 | 153459 | 1.23 | 0.0613836 |
| 17009564599 | google | gemini-2.5-pro | 115/5 (120) | 95 | 1769.8 | 163 | 261064 | 14.75 | 3.9 |
| 17003761591 | openai | gpt-5-nano | 94/26 (120) | 78 | 925.3 | 145 | 250510 | 7.59 | 0.100204 |
| 17008403285 | openai | gpt-5-nano | 95/25 (120) | 79 | 1419.4 | 150 | 257781 | 7.90 | 0.1031124 |
| 17008682819 | openai | gpt-5-nano | 96/24 (120) | 80 | 947.4 | 150 | 257617 | 7.71 | 0.1030468 |
| 17003761591 | openai | gpt-5-mini | 112/8 (120) | 93 | 1331.0 | 153 | 184194 | 12.97 | 0.368388 |
| 17008403285 | openai | gpt-5-mini | 106/14 (120) | 88 | 910.2 | 155 | 185264 | 11.09 | 0.370528 |
| 17008682819 | openai | gpt-5-mini | 112/8 (120) | 93 | 1556.4 | 157 | 191658 | 11.83 | 0.383316 |
| 17009564599 | openai | gpt-5 | 102/18 (120) | 85 | 2004.4 | 163 | 247734 | 16.70 | 2.47734 |

Comparator Module Benchmark Results


Smart recommendation:

  1. claude-sonnet-4-20250514
  2. deepseek-chat
  3. gpt-5

Summary Table:

| Run | Provider | Model | Tests | Precision (%) | Duration (s) | Requests | Tokens | Avg. Duration (s) | Cost ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 17003761591 | anthropic | claude-3-5-haiku-20241022 | 27/1 (28) | 96 | 206.8 | 28 | 41331 | 7.35 | 0.165324 |
| 17008403285 | anthropic | claude-3-5-haiku-20241022 | 26/2 (28) | 92 | 203.7 | 28 | 41198 | 7.28 | 0.164792 |
| 17008682819 | anthropic | claude-3-5-haiku-20241022 | 26/2 (28) | 92 | 205.8 | 28 | 41295 | 7.39 | 0.2 |
| 17009564599 | anthropic | claude-sonnet-4-20250514 | 27/1 (28) | 96 | 458.0 | 28 | 48338 | 16.36 | 0.72507 |
| 17003761591 | deepseek | deepseek-chat | 27/1 (28) | 96 | 486.1 | 28 | 30420 | 21.34 | 0.0 |
| 17008403285 | deepseek | deepseek-chat | 26/2 (28) | 92 | 522.9 | 28 | 30352 | 18.68 | 0.0333872 |
| 17008682819 | deepseek | deepseek-chat | 26/2 (28) | 92 | 597.5 | 28 | 30275 | 17.36 | 0.0333025 |
| 17003761591 | google | gemini-2.5-flash | 26/2 (28) | 92 | 249.7 | 28 | 71938 | 9.41 | 0.179845 |
| 17008403285 | google | gemini-2.5-flash | 26/2 (28) | 92 | 77.8 | 28 | 72537 | 9.37 | 0.1813425 |
| 17008682819 | google | gemini-2.5-flash | 26/2 (28) | 92 | 262.4 | 28 | 73913 | 8.92 | 0.1847825 |
| 17003761591 | google | gemini-2.5-flash-lite | 26/2 (28) | 92 | 178.1 | 28 | 35817 | 3.15 | 0.0143268 |
| 17008403285 | google | gemini-2.5-flash-lite | 24/3 (28) | 85 | 263.6 | 28 | 35783 | 6.36 | 0.0143132 |
| 17008682819 | google | gemini-2.5-flash-lite | 26/2 (28) | 92 | 88.2 | 28 | 35786 | 2.78 | 0.0 |
| 17009564599 | google | gemini-2.5-pro | 26/2 (28) | 92 | 579.4 | 28 | 78885 | 20.69 | 1.183275 |
| 17003761591 | openai | gpt-5-nano | 27/1 (28) | 96 | 530.4 | 28 | 124047 | 18.68 | 0.0496188 |
| 17008403285 | openai | gpt-5-nano | 28/0 (28) | 100 | 673.4 | 28 | 123731 | 19.00 | 0.0494924 |
| 17008682819 | openai | gpt-5-nano | 26/2 (28) | 92 | 531.9 | 28 | 124299 | 18.94 | 0.0497196 |
| 17008403285 | openai | gpt-5-mini | 27/1 (28) | 96 | 645.1 | 28 | 64983 | 23.04 | 0.1 |
| 17003761591 | openai | gpt-5-mini | 26/2 (28) | 92 | 523.1 | 28 | 63842 | 26.43 | 0.127684 |
| 17008682819 | openai | gpt-5-mini | 27/1 (28) | 96 | 740.0 | 28 | 68275 | 24.05 | 0.13655 |
| 17009564599 | openai | gpt-5 | 27/1 (28) | 96 | 1174.8 | 28 | 102010 | 41.96 | 1.0201 |

Scorer Module Benchmark Results


Smart recommendation:

  1. deepseek-chat
  2. claude-3-5-haiku-20241022
  3. gemini-2.5-flash-lite

Summary Table:

| Run | Provider | Model | Tests | Precision (%) | Duration (s) | Requests | Tokens | Avg. Duration (s) | Cost ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 17003761591 | anthropic | claude-3-5-haiku-20241022 | 23/8 (31) | 74 | 282.3 | 31 | 49377 | 8.94 | 0.198 |
| 17008403285 | anthropic | claude-3-5-haiku-20241022 | 23/8 (31) | 74 | 275.6 | 31 | 49205 | 8.89 | 0.197 |
| 17008682819 | anthropic | claude-3-5-haiku-20241022 | 23/8 (31) | 74 | 277.1 | 31 | 49630 | 9.11 | 0.199 |
| 17009564599 | anthropic | claude-sonnet-4-20250514 | 13/12 (31) | 41 | 347.3 | 25 | 43490 | 11.20 | 0.652 |
| 17003761591 | deepseek | deepseek-chat | 26/5 (31) | 83 | 368.8 | 31 | 30715 | 14.11 | 0.034 |
| 17008403285 | deepseek | deepseek-chat | 26/5 (31) | 83 | 386.3 | 31 | 30711 | 12.46 | 0.034 |
| 17008682819 | deepseek | deepseek-chat | 26/5 (31) | 83 | 437.3 | 31 | 30681 | 11.90 | 0.034 |
| 17003761591 | google | gemini-2.5-flash | 20/11 (31) | 64 | 70.4 | 31 | 89989 | 11.19 | 0.225 |
| 17008403285 | google | gemini-2.5-flash | 19/12 (31) | 61 | 322.6 | 31 | 87642 | 10.51 | 0.219 |
| 17008682819 | google | gemini-2.5-flash | 20/11 (31) | 64 | 67.7 | 31 | 90343 | 10.41 | 0.226 |
| 17003761591 | google | gemini-2.5-flash-lite | 22/9 (31) | 70 | 325.9 | 31 | 36490 | 2.29 | 0.015 |
| 17008403285 | google | gemini-2.5-flash-lite | 23/8 (31) | 74 | 71.0 | 31 | 36940 | 2.19 | 0.015 |
| 17008682819 | google | gemini-2.5-flash-lite | 24/7 (31) | 77 | 346.8 | 31 | 37843 | 2.27 | 0.015 |
| 17009564599 | google | gemini-2.5-pro | 17/14 (31) | 54 | 576.4 | 31 | 81963 | 18.59 | 1.229 |
| 17003761591 | openai | gpt-5-nano | 15/16 (31) | 48 | 303.1 | 31 | 79310 | 9.47 | 0.000 |
| 17008403285 | openai | gpt-5-nano | 13/18 (31) | 41 | 456.9 | 31 | 83972 | 10.39 | 0.034 |
| 17008682819 | openai | gpt-5-nano | 13/18 (31) | 41 | 322.0 | 31 | 82061 | 9.78 | 0.033 |
| 17003761591 | openai | gpt-5-mini | 17/14 (31) | 54 | 460.3 | 31 | 55719 | 17.01 | 0.111 |
| 17008403285 | openai | gpt-5-mini | 16/15 (31) | 51 | 293.6 | 31 | 54980 | 14.85 | 0.110 |
| 17008682819 | openai | gpt-5-mini | 19/12 (31) | 61 | 527.4 | 31 | 55412 | 14.74 | 0.111 |
| 17009564599 | openai | gpt-5 | 18/13 (31) | 58 | 677.8 | 31 | 70258 | 21.86 | 0.703 |

Lister Module Benchmark Results


Smart recommendation:

  1. claude-3-5-haiku-20241022
  2. claude-sonnet-4-20250514
  3. gemini-2.5-flash

Summary Table:

| Run | Provider | Model | Tests | Precision (%) | Duration (s) | Requests | Tokens | Avg. Duration (s) | Cost ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 17003761591 | anthropic | claude-3-5-haiku-20241022 | 51/17 (68) | 75 | 181.0 | 69 | 54801 | 2.70 | 0.219204 |
| 17008403285 | anthropic | claude-3-5-haiku-20241022 | 50/18 (68) | 73 | 184.3 | 69 | 54920 | 2.71 | 0.21968 |
| 17008682819 | anthropic | claude-3-5-haiku-20241022 | 51/17 (68) | 75 | 183.5 | 69 | 54800 | 2.66 | 0.2192 |
| 17009564599 | anthropic | claude-sonnet-4-20250514 | 48/20 (68) | 70 | 218.4 | 69 | 51781 | 3.21 | 0.776715 |
| 17003761591 | deepseek | deepseek-chat | 54/14 (68) | 79 | 98.8 | 69 | 276120 | 7.04 | 0.0303732 |
| 17008403285 | deepseek | deepseek-chat | 51/17 (68) | 75 | 434.1 | 69 | 27653 | 6.38 | 0.0304183 |
| 17008682819 | deepseek | deepseek-chat | 10/2 (68) | 14 | 478.8 | 12 | 4643 | 1.45 | 0.0051073 |
| 17003761591 | google | gemini-2.5-flash | 42/26 (68) | 61 | 280.5 | 69 | 71844 | 4.26 | 0.17961 |
| 17008403285 | google | gemini-2.5-flash | 42/26 (68) | 61 | 61.9 | 69 | 73328 | 4.30 | 0.18332 |
| 17008682819 | google | gemini-2.5-flash | 43/25 (68) | 63 | 292.7 | 69 | 74350 | 4.12 | 0.185875 |
| 17003761591 | google | gemini-2.5-flash-lite | 36/32 (68) | 52 | 62.7 | 69 | 31319 | 1.00 | 0.0125276 |
| 17008403285 | google | gemini-2.5-flash-lite | 35/33 (68) | 51 | 289.4 | 69 | 31363 | 0.92 | 0.0125452 |
| 17008682819 | google | gemini-2.5-flash-lite | 38/30 (68) | 55 | 68.3 | 69 | 31176 | 0.91 | 0.0124704 |
| 17009564599 | google | gemini-2.5-pro | 36/32 (68) | 52 | 930.5 | 69 | 108192 | 13.68 | 1.62288 |
| 17003761591 | openai | gpt-5-mini | 35/33 (68) | 51 | 522.9 | 69 | 56942 | 7.59 | 0.1 |
| 17008403285 | openai | gpt-5-mini | 34/34 (68) | 50 | 352.5 | 69 | 58710 | 7.62 | 0.11742 |
| 17008682819 | openai | gpt-5-mini | 31/37 (68) | 45 | 518.3 | 69 | 58817 | 7.69 | 0.117634 |
| 17003761591 | openai | gpt-5-nano | 15/53 (68) | 22 | 340.4 | 69 | 79627 | 4.88 | 0.0 |
| 17008403285 | openai | gpt-5-nano | 20/48 (68) | 29 | 516.1 | 69 | 820010 | 5.01 | 0.0 |
| 17008682819 | openai | gpt-5-nano | 25/43 (68) | 36 | 332.1 | 69 | 83441 | 5.18 | 0.0333764 |
| 17009564599 | openai | gpt-5 | 29/39 (68) | 42 | 972.0 | 69 | 93495 | 14.29 | 0.93495 |

Ranker Module Benchmark Results


Smart recommendation:

  1. gemini-2.5-flash
  2. gpt-5-mini

Summary Table:

| Run | Provider | Model | Tests | Precision (%) | Duration (s) | Requests | Tokens | Avg. Duration (s) | Cost ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 17003761591 | anthropic | claude-3-5-haiku-20241022 | 0/0 (2) | 0 | 30.3 | 24 | 34329 | 15.34 | 0.137316 |
| 17008403285 | anthropic | claude-3-5-haiku-20241022 | 0/0 (2) | 0 | 29.9 | 24 | 34438 | 14.94 | 0.137752 |
| 17008682819 | anthropic | claude-3-5-haiku-20241022 | 0/0 (2) | 0 | 30.7 | 23 | 32887 | 15.17 | 0.1 |
| 17003761591 | deepseek | deepseek-chat | 1/0 (2) | 50 | 446.6 | 205 | 289104 | 274.44 | 0.3180144 |
| 17008403285 | deepseek | deepseek-chat | 1/0 (2) | 50 | 475.8 | 201 | 279556 | 237.92 | 0.3 |
| 17008682819 | deepseek | deepseek-chat | 1/0 (2) | 50 | 548.9 | 202 | 288354 | 223.31 | 0.3171894 |
| 17003761591 | google | gemini-2.5-flash | 1/1 (2) | 50 | 381.6 | 138 | 599784 | 183.26 | 1.49946 |
| 17008403285 | google | gemini-2.5-flash | 1/1 (2) | 50 | 160.3 | 201 | 844904 | 244.49 | 2.11226 |
| 17008682819 | google | gemini-2.5-flash | 2/0 (2) | 100 | 489.0 | 166 | 688138 | 190.81 | 1.720345 |
| 17003761591 | google | gemini-2.5-flash-lite | 1/1 (2) | 50 | 149.8 | 314 | 584318 | 124.40 | 0.2337272 |
| 17008403285 | google | gemini-2.5-flash-lite | 0/0 (2) | 0 | 366.5 | 302 | 563459 | 74.92 | 0.2253836 |
| 17008682819 | google | gemini-2.5-flash-lite | 1/1 (2) | 50 | 248.8 | 323 | 596905 | 80.15 | 0.238762 |
| 17003761591 | openai | gpt-5-nano | 0/1 (2) | 0 | 897.6 | 269 | 1156009 | 341.14 | 0.5 |
| 17008403285 | openai | gpt-5-nano | 0/1 (2) | 0 | 1466.0 | 258 | 1112166 | 334.98 | 0.4448664 |
| 17008682819 | openai | gpt-5-nano | 0/2 (2) | 0 | 670.0 | 345 | 1569127 | 448.78 | 0.6276508 |
| 17003761591 | openai | gpt-5-mini | 1/1 (2) | 50 | 1335.2 | 389 | 1109705 | 793.78 | 2.21941 |
| 17008403285 | openai | gpt-5-mini | 1/1 (2) | 50 | 682.3 | 378 | 1083312 | 667.58 | 2.2 |
| 17008682819 | openai | gpt-5-mini | 2/0 (2) | 100 | 1587.6 | 571 | 1210776 | 732.98 | 2.421552 |

Benchmarking Stats

The latest benchmarking run involved:

  • Total requests: 10,086
  • Total tokens processed: 20,021,757
  • Total cost: ~$45
  • Estimated time (no parallelism): 14 hours 42 minutes
  • Unique tests: 249 (each run up to 3 times)
  • Models covered (9):
    • claude-3-5-haiku-20241022
    • claude-sonnet-4-20250514
    • deepseek-chat
    • gemini-2.5-flash
    • gemini-2.5-flash-lite
    • gemini-2.5-pro
    • gpt-5-nano
    • gpt-5-mini
    • gpt-5

Running Benchmarks

To run the benchmarks yourself:

```shell
bundle exec rake active_genie:benchmark
```

To benchmark a specific module:

```shell
bundle exec rake active_genie:benchmark[extractor]
bundle exec rake active_genie:benchmark[scorer]
bundle exec rake active_genie:benchmark[comparator]
bundle exec rake active_genie:benchmark[ranker]
```

Future Improvements

  • Expanded test coverage for edge cases
  • Multi-language support testing