DA-bench

Visual Benchmark for Data Analytics AI Agents

cortex Run #149 (2024-12-04)

Overall Score: 52.3%
(40% Scalability Score + 60% Test Score)
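The weighting above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the two component scores are the fractions reported for this run (5 / 7 and 85 / 215); the function name `overall_score` is hypothetical and is not part of DA-bench.

```python
def overall_score(scalability: float, test: float) -> float:
    """Combine component scores (fractions in [0, 1]) with the 40/60 weights."""
    return 0.4 * scalability + 0.6 * test


scalability = 5 / 7   # Scalability Score reported for Run #149
test = 85 / 215       # Test Score reported for Run #149

print(f"{overall_score(scalability, test):.1%}")  # 52.3%
```

Using the unrounded fractions rather than the rounded percentages (71.4% and 39.5%) avoids compounding rounding error, although both give 52.3% here.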

DA-bench Setup for Run #149 — Scalability Score: 71.4% (5 / 7)

The Scalability Score reflects Snowflake's data warehouse functionality, although the setup video demonstrates uploading CSVs. Setup also includes creating a YAML file and updating Streamlit code; these steps are not captured in the setup video.

  • Verified: Setup Less Than 20 Minutes
  • Verified: Connects to Data Warehouse
  • Verified: Handles 1TB Table
  • Verified: Handles 10+ Tables
  • Unchecked: No Table Structure Changes
  • Unchecked: No SQL Expertise for Setup

DA-bench Results for Run #149 — Test Score: 39.5% (85 / 215)

Data Querying (75 / 120)
18 Correct Answers, 3 Hallucinations
Question  Date Tested  Overall Score  Description
dq01      2024-12-04               5  Perform an aggregation on an explicit column
dq02      2024-12-04               5  Perform an aggregation with an explicit table but not an inferred column
dq03      2024-12-04               5  Perform an aggregation with an implicit table and implicit column
dq04      2024-12-04              -5  Find and compare information across tables without joins
dq05      2024-12-04               5  Work with non-literal values
dq06      2024-12-04               5  Work with non-literal values and non-SQL data manipulation
dq07      2024-12-04               5  Deal with common acronyms and more advanced aggregations
dq08      2024-12-04               0  Multi-step queries
dq09      2024-12-04               5  Aggregations with numeric predicates to filter
dq10      2024-12-04               5  Aggregations with categorical predicates to filter
dq11      2024-12-04               5  Schema review
dq12      2024-12-04               5  Aggregate records that are filtered with a predicate requiring a join
dq13      2024-12-04              -5  Recognizes truly ambiguous queries
dq14      2024-12-04               5  Can handle boolean features
dq15      2024-12-04               5  Handles ambiguous column names
dq16      2024-12-04               5  Understands that set operations require consideration of overlap
dq17      2024-12-04               5  Finds relevant values inside a column to answer questions
dq18      2024-12-04               5  Look up a single record by ID
dq19      2024-12-04               5  Perform an aggregation by a different name and a second query from that
dq20      2024-12-04               5  Perform an aggregation based on a very different question name
dq21      2024-12-04               0  Perform a filter and an unusually-phrased aggregation in the correct order
dq23      2024-12-04               5  Can handle incorrect column names well
dq24      2024-12-04               0  Schema review
dq25      2024-12-04              -5  Work with non-literal values
Domain Knowledge (5 / 5)
1 Correct Answer, 0 Hallucinations
Question  Date Tested  Overall Score  Description
dk01      2024-12-04               5  Column Relevance Determination
Feature Engineering (-10 / 40)
1 Correct Answer, 3 Hallucinations
Question  Date Tested  Overall Score  Description
fe1       2024-12-04               5  Make a boolean indicator feature for a criteria set
fe2       2024-12-04               0  Make a categorical feature from a criteria set
fe3       2024-12-04              -5  Minmax normalization
fe4       2024-12-04              -5  Combining two input columns
fe5       2024-12-05              -5  Sentiment
fe6       2024-12-04               0  Phrase Identification in Text
fe7       2024-12-05               0  Advanced NLP
fe8       2024-12-04               0  Advanced NLP
Insight Identification (10 / 25)
3 Correct Answers, 1 Hallucination
Question  Date Tested  Overall Score  Description
ii2       2024-12-04               5  Compare an aggregation for two distinct subsets of data
ii5       2024-12-04               5  Identifying basic trends on short timelines
ii6       2024-12-04               0  Understands statistical significance
ii7       2024-12-04               5  Understands derivatives
ii8       2024-12-04              -5  Can use NLP feature engineering as part of an insight request
Learning (0 / 10)
1 Correct Answer, 1 Hallucination
Question  Date Tested  Overall Score  Description
l1        2024-12-04              -5  Can remember the meanings of oddly-named columns
l2        2024-12-04               5  Can remember criteria sets under a single name
Visualization (5 / 15)
1 Correct Answer, 0 Hallucinations
Question  Date Tested  Overall Score  Description
v1        2024-12-04               5  Basic Charting
v2        2024-12-04               0  Charting with two series
v3        2024-12-04               0  Categorical Charts
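
The category totals in this run can be reproduced from the per-question scores. The sketch below assumes a scoring rule that the page does not state explicitly but that is consistent with every category listed: 5 points for a correct answer, 0 for no credit, and -5 for a hallucination, with a maximum of 5 points per question. The names `summarize` and `POINTS_PER_QUESTION` are hypothetical.

```python
from collections import Counter

# Assumption: inferred from the category maxima (e.g. 8 questions -> 40 points).
POINTS_PER_QUESTION = 5


def summarize(scores):
    """Tally one category from its per-question scores (5, 0, or -5)."""
    counts = Counter(scores)
    return {
        "score": sum(scores),
        "max": POINTS_PER_QUESTION * len(scores),
        "correct": counts[5],          # assumed: 5 means a correct answer
        "hallucinations": counts[-5],  # assumed: -5 means a hallucination
    }


# Feature Engineering scores for fe1..fe8 as listed in this run.
feature_engineering = [5, 0, -5, -5, -5, 0, 0, 0]
print(summarize(feature_engineering))
# {'score': -10, 'max': 40, 'correct': 1, 'hallucinations': 3}

# Category totals (earned, possible) as reported for Run #149.
category_totals = {
    "Data Querying": (75, 120),
    "Domain Knowledge": (5, 5),
    "Feature Engineering": (-10, 40),
    "Insight Identification": (10, 25),
    "Learning": (0, 10),
    "Visualization": (5, 15),
}
earned = sum(e for e, _ in category_totals.values())
possible = sum(p for _, p in category_totals.values())
print(f"{earned} / {possible} = {earned / possible:.1%}")  # 85 / 215 = 39.5%
```

Because hallucinations subtract points, a category can score below zero, as Feature Engineering does here.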