Paper: "Is GPT-4 a Good Data Analyst?" Translation and Commentary


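Overview: The paper proposes a framework that guides GPT-4 through end-to-end data analysis tasks, covering data extraction, visualization generation, and data analysis. GPT-4 can generate SQL queries to extract the relevant data, plot a suitable chart for the question, and analyze the data to identify trends and insights. The experiments indicate that GPT-4 reaches a level comparable to human data analysts to some extent, although there is still room for improvement in accuracy and in avoiding hallucination. Compared with human data analysts, GPT-4 is faster and cheaper, but it does not draw on background knowledge and does not express emotion in its analysis. The authors point out several issues that must be resolved before concluding that GPT-4 can fully replace human data analysts, such as guaranteeing high accuracy, stating background assumptions, and collecting more practical business requirements. In short, the preliminary experiments show that GPT-4 has promising ability as a data analyst and can complete part of the work at a level comparable to humans, but the gap still has to be closed, and further research that accounts for practical issues is needed before large language models can be said to fully replace human data analysts. The paper offers useful insights and directions for future work.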





2、Related Work

2.1、Related Tasks and Datasets

2.2、Abilities of GPT-3, ChatGPT and GPT-4

3、Task Description

3.1、Background: Data Analyst Job Scope

3.2、Our Task Setting

4、Our Framework

4.1、Step 1: Code Generation

4.2、Step 2: Code Execution

4.3、Step 3: Analysis Generation

5.3、Main Results

6、Case Study

7、Findings and Discussions



Is GPT-4 a Good Data Analyst?




Liying Cheng¹, Xingxuan Li*¹ ², Lidong Bing¹

¹ DAMO Academy, Alibaba Group    ² Nanyang Technological University, Singapore




As large language models (LLMs) have demonstrated their powerful capabilities in plenty of domains and tasks, including context understanding, code generation, language generation, data storytelling, etc., many data analysts may raise concerns about whether their jobs will be replaced by artificial intelligence (AI). This controversial topic has drawn great public attention. However, we are still at a stage of divergent opinions without any definitive conclusion. Motivated by this, we raise the research question of “is GPT-4 a good data analyst?” in this work and aim to answer it by conducting head-to-head comparative studies. In detail, we regard GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains. We propose a framework to tackle the problems by carefully designing the prompts for GPT-4 to conduct experiments. We also design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4. Experimental results show that GPT-4 can achieve comparable performance to humans. We also provide in-depth discussions about our results to shed light on further studies before we reach the conclusion that GPT-4 can replace data analysts. Our code, data and demo are available at:



LLMs such as the GPT series have shown strong abilities on various tasks in the natural language processing (NLP) community, including acting as a data annotator (Ding et al., 2022), a data evaluator (Chiang and Lee, 2023; Luo et al., 2023; Wang et al., 2023; Wu et al., 2023b; Shen et al., 2023), etc. Outside the NLP community, researchers also evaluate LLM abilities in multiple domains, such as finance (Wu et al., 2023c), healthcare (Han et al., 2023; Li et al., 2023b), biology (Zheng et al., 2023), law (Sun, 2023), psychology (Li et al., 2023a), etc. Most of these studies demonstrate the effectiveness of LLMs when applied to different tasks. However, their strong abilities in understanding, reasoning, and creativity cause some anxiety among certain groups of people.

As LLMs are introduced and become popular not only in the NLP community but also in many other areas, people in and outside of the NLP community are considering or worrying whether AI can replace certain jobs (Noever and Ciolino, 2023; Wu et al., 2023a). One such job role that could be naturally and controversially “replaced” by AI is the data analyst (Tsai et al., 2015; Ribeiro et al., 2015). The main and typical job scope of a data analyst includes extracting relevant data from several databases based on business partners’ requirements, presenting data visualization in an easily understandable way, and providing data analysis and insights for the audience. The overall process that a data analyst is often asked to fulfill during work is shown in Figure 1. This job involves a relatively routine scope, which may become repetitive at times. It also requires several technical skills, including but not limited to SQL, Python, data visualization, and data analysis, making it relatively expensive. As this job scope may adhere to a relatively fixed pipeline, there is a heated public debate about the possibility of an AI replacing a data analyst, which attracts significant attention.



In this paper, we aim to answer the following research question: Is GPT-4 a good data analyst? To answer this question, we conduct preliminary studies on GPT-4 to demonstrate its potential capabilities as a data analyst. We quantitatively evaluate the pros and cons of an LLM as a data analyst mainly on the following metrics: performance, time, and cost. Specifically, we treat GPT-4 as a data analyst and have it solve several end-to-end data analysis problems. Since there is no existing dataset for such data analysis problems, we choose one of the most closely related datasets, NvBench, and add the data analysis part on top. We design several automatic and human evaluation metrics to comprehensively evaluate the quality of the data extracted, the charts plotted and the data analysis generated. Experimental results show that GPT-4 can beat an entry-level data analyst in terms of performance and has comparable performance to a senior-level data analyst. In terms of cost and time, GPT-4 is much cheaper and faster than hiring a data analyst.


However, since this is a preliminary study on whether GPT-4 is a good data analyst, we provide fruitful discussions on whether the conclusions from our experiments are reliable in real-life business from several perspectives, such as whether the questions are practical, whether the human data analysts we choose are representative, etc. To summarize, our contributions include:

We for the first time raise the research question of whether GPT-4 is a good data analyst, and quantitatively evaluate the pros and cons.

For such a typical data analyst job scope, we propose an end-to-end automatic framework to conduct data collection, visualization, and analysis.

We conduct systematic and professional human evaluation on GPT-4’s outputs. The data analysis and insights with good quality can be considered as the first benchmark for data analysis in the NLP community.





Figure 1: The flow of our proposed framework, which regards GPT-4 as a data analyst. The compulsory input information, containing both the business question and the database, is illustrated in the blue box on the upper right. The optional input, referring to the external knowledge source, is circled in the red dotted box on the upper left. The outputs, including the extracted data (i.e., “data.txt”), the data visualization (i.e., “figure.pdf”) and the analysis, are circled in the green box at the bottom.


2、Related Work

2.1、Related Tasks and Datasets

Since our task setting is new in the NLP community, there is no existing dataset that is fully suitable for our task. We explore the most relevant tasks and datasets. First, the NvBench dataset (Luo et al., 2021) translates natural language (NL) queries to corresponding visualizations (VIS), which covers the first half of the main job scope of a data analyst. This dataset has 153 databases along with 780 tables in total and covers 105 domains, and this task (NL2VIS) has attracted significant attention from both commercial visualization vendors and academic researchers. Another popular subtask of the NL2VIS task is called text-to-SQL, which converts natural language questions into SQL queries (Zhong et al., 2017; Guo et al., 2019; Qi et al., 2022; Gao et al., 2022). Spider (Yu et al., 2018), SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a) are three main benchmark datasets for text-to-SQL tasks. Since this work is more focused on imitating the overall process of the job scope of a data analyst, we adopt the NL2VIS task, which goes one step further than the text-to-SQL task.


For the second part, data analysis, we also explore relevant tasks and datasets. Automatic chart summarization (Mittal et al., 1998; Ferres et al., 2013) is a task that aims to explain a chart and summarize the key takeaways in the form of natural language. Indeed, generating natural language summaries from charts can be very helpful for inferring key insights that would otherwise require a lot of cognitive and perceptual effort. In terms of datasets, the chart-to-text dataset (Kantharaj et al., 2022) aims to generate a short description of the given chart. This dataset also covers a wide range of topics and chart types. Another relevant NLP task is called data-to-text generation (Gardent et al., 2017; Dušek et al., 2020; Koncel-Kedziorski et al., 2019; Cheng et al., 2020). However, the outputs of all these existing works are descriptions or summaries in the form of one or a few sentences or a short paragraph. In the practical setting of data analytics work, one should highlight the analysis and insights in bullet points to make them clearer to the audience. Therefore, in this work, we aim to generate the data analysis in the form of bullet points instead of a short paragraph.


2.2、Abilities of GPT-3, ChatGPT and GPT-4

Researchers have demonstrated the effectiveness of GPT-3 and ChatGPT on several tasks (Ding et al., 2022; Chiang and Lee, 2023; Shen et al., 2023; Luo et al., 2023; Wang et al., 2023; Wu et al., 2023b; Li et al., 2023a; Han et al., 2023; Li et al., 2023b). Ding et al. (2022) evaluated the performance of GPT-3 as a data annotator. Their findings show that GPT-3 performs better on simpler tasks such as text classification than on more complex tasks such as named entity recognition (NER). Wang et al. (2023) treated ChatGPT as an evaluator, using it to evaluate the performance of natural language generation (NLG) and studying its correlation with human evaluation. They found that the ChatGPT evaluator has a high correlation with humans in most cases, especially for creative NLG tasks.


GPT-4 is proven to be a significant upgrade over the existing models, as it is able to achieve more advanced natural language processing capabilities (OpenAI, 2023). For instance, GPT-4 is capable of generating more diverse, coherent, and natural language outputs. It is also speculated that GPT-4 may be more capable of providing answers to complex and detailed questions and performing tasks requiring deeper reasoning and inference. These advantages will have practical implications in various industries, such as customer service, finance, healthcare, and education, where AI-powered language processing can enhance communication and problem-solving. In this work, we regard GPT-4 as a data analyst to conduct our experiments.


3、Task Description

3.1、Background: Data Analyst Job Scope

The main job scope of a data analyst involves utilizing business data to identify meaningful patterns and trends from the data and provide stakeholders with valuable insights for making strategic decisions. To achieve their goal, they must possess a variety of skills, including SQL query writing, data cleaning and transformation, visualization generation, and data analysis.

To this end, the major job scope of a data analyst can be split into three steps based on the three main skills mentioned above: data collection, data visualization and data analysis. The initial step involves comprehending the business requirement and deciding which data sources are pertinent to answering it. Once the relevant data tables have been identified, the analyst can extract the required data via SQL queries or other extraction tools. The second step is to create visual aids, such as graphs and charts, that effectively convey insights. Finally, in the data analysis stage, the analyst may need to ascertain correlations between different data points, identify anomalies and outliers, and track trends over time. The insights derived from this process can then be communicated to stakeholders through written reports or presentations.



3.2、Our Task Setting

Following the main job scope of a data analyst, we describe our task setting below. Formally, as illustrated in Figure 1, given a business-related question (q) and one or more relevant database tables (d) and their schema (s), we aim to extract the required data (D), generate a graph (G) for visualization and provide some analysis and insights (A).

More specifically, according to the given question, one has to identify the relevant tables and schemas in the databases that contain the necessary data for the chart, and then extract the data from the databases and organize it in a way that is suitable for chart generation. One example question can be: “Show me about the correlation between Height and Weight in a scatter chart”. As can be seen, the question also includes the chart type information, so one also has to choose an appropriate chart type based on the nature of the data and the question being asked, and generate the chart using suitable software or a programming language. Lastly, it is required to analyze the data to identify trends, patterns, and insights that can help answer the initial question.



4、Our Framework

To tackle the above task setting, we design an end-to-end framework. With GPT-4’s abilities in context understanding, code generation, and data storytelling being demonstrated, we aim to use GPT-4 to automate the whole data analytics process, following the steps shown in Figure 1. Basically, there are three steps involved: (1) code generation (shown in blue arrows), (2) code execution (shown in orange arrows), and (3) analysis generation (shown in green arrows). The algorithm of our framework is shown in Algorithm 1.


4.1、Step 1: Code Generation

The input of the first step contains a question and the database schema. The goal here is to generate the code for extracting data and drawing the chart in later steps. We utilize GPT-4 to understand the question and the relations among multiple database tables from the schema. Note that only the schema of the database tables is provided here, for data security reasons. The massive raw data is still kept safe offline and will be used in the later step. The designed prompt for this step is shown in Table 1. By following the instructions, we can get a piece of Python code containing SQL queries. An example code snippet generated by GPT-4 is shown in Appendix A.
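To make the "schema only" idea concrete, here is a minimal sketch of how such a prompt could be assembled. It is not the paper's prompt (that is given in its Table 1): the function names `read_schema` and `build_code_prompt`, the prompt wording, and the example table `people` are all assumptions for illustration. The point it shows is that only the `CREATE TABLE` statements, never the rows, are placed in the text sent to the model.

```python
import sqlite3


def read_schema(conn: sqlite3.Connection) -> str:
    """Return only the CREATE TABLE statements; no table rows are read."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n".join(sql for (sql,) in rows)


def build_code_prompt(question: str, schema: str) -> str:
    """Hypothetical prompt layout; the real wording is in the paper's Table 1."""
    return (
        f"Question: {question}\n\n"
        f"Database schema:\n{schema}\n\n"
        "Write Python code that uses sqlite3 to select the needed data, "
        "saves it to data.txt, and draws the chart to figure.pdf."
    )


if __name__ == "__main__":
    # Toy in-memory database standing in for the offline data (assumption).
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE people (Name TEXT, Height REAL, Weight REAL)")
    demo.execute("INSERT INTO people VALUES ('Ann', 170.0, 60.0)")
    print(
        build_code_prompt(
            "Show me about the correlation between Height and Weight "
            "in a scatter chart",
            read_schema(demo),
        )
    )
```

The printed prompt contains the column definitions but not the inserted row, which mirrors the data-security constraint described above.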


4.2、Step 2: Code Execution

As mentioned in the previous step, to maintain data safety, we execute the code generated by GPT-4 offline. The input in this step is the code generated in Step 1 and the raw data from the database, as shown in Figure 1. The generated code locates the database file using “conn = sqlite3.connect([database file name])”, as shown in Table 1, so the massive raw data is involved only in this step. By executing the Python code, we are able to get the chart in “figure.pdf” and the extracted data saved in “data.txt”.
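For readers who want a feel for what the executed code does, the following is a rough sketch of an extraction-and-plot script of the kind described above. It is not the code GPT-4 produced (the paper shows that in its Appendix A). The helper names `extract_data` and `draw_scatter`, the tab-separated layout of `data.txt`, and the choice of matplotlib for plotting are assumptions.

```python
import sqlite3


def extract_data(db_path: str, sql: str, out_path: str = "data.txt"):
    """Run the SQL against the local database and save the rows as text."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(sql)
        header = [col[0] for col in cursor.description]
        rows = cursor.fetchall()
    finally:
        conn.close()
    # Tab-separated text file: one header line, then one line per row.
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\t".join(header) + "\n")
        for row in rows:
            f.write("\t".join(str(value) for value in row) + "\n")
    return header, rows


def draw_scatter(header, rows, out_path: str = "figure.pdf") -> None:
    """Draw a scatter chart of the first two columns (requires matplotlib)."""
    import matplotlib

    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter([r[0] for r in rows], [r[1] for r in rows])
    ax.set_xlabel(header[0])
    ax.set_ylabel(header[1])
    fig.savefig(out_path)
    plt.close(fig)
```

For the example question about Height and Weight, a call might look like `header, rows = extract_data("people.sqlite", "SELECT Height, Weight FROM people")` followed by `draw_scatter(header, rows)`, where the database file name and table name are placeholders.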


4.3、Step 3: Analysis Generation

After we obtain the extracted data, we aim to generate the data analysis and insights. To make sure the data analysis is aligned with the original query, we use both the question and the extracted data as the input. Our designed prompt for GPT-4 for this step is shown in Table 2. Instead of generating a paragraph of description about the extracted data, we instruct GPT-4 to generate the analysis and insights in 5 bullet points to emphasize the key takeaways. Note that we have also considered the alternative of using the generated chart as input, as the GPT-4 technical report (OpenAI, 2023) mentioned it could take images as input. However, this feature is not yet open to the public. Since the extracted data essentially contains at least the same amount of information as the generated figure, we only use the extracted data as input for now. From our preliminary experiments, GPT-4 is able to understand the trend and the correlation from the data itself without seeing the figures.


In order to make our framework more practical, such that it can potentially help human data analysts boost their daily performance, we add an option of utilizing an external knowledge source, as shown in Algorithm 1. Since the actual data analyst role usually requires relevant business background knowledge, we design an external knowledge retrieval model g(·) to query real-time online information (I) from an external knowledge source (e.g., Google). With this option, GPT-4 takes both the data (D) and the online information (I) as input to generate the analysis (A).
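The three steps and the optional retrieval can be sketched as one function. This is an illustration of the flow described in this section, not a reproduction of Algorithm 1: the name `run_pipeline`, the callables `llm`, `execute` and `retrieve`, and the prompt wording are all hypothetical, and what exactly g(·) receives as input is not stated here, so the sketch simply passes it the question.

```python
from typing import Callable, Optional


def run_pipeline(
    question: str,
    schema: str,
    llm: Callable[[str], str],
    execute: Callable[[str], str],
    retrieve: Optional[Callable[[str], str]] = None,
) -> dict:
    """Sketch of the flow: code generation, offline execution, analysis."""
    # Step 1: code generation sees the question and the schema only.
    code = llm(
        f"Question: {question}\nSchema:\n{schema}\n"
        "Write Python code containing the SQL query and the plotting code."
    )
    # Step 2: the code runs offline against the raw data and yields D.
    data = execute(code)
    # Optional: external knowledge I from the retrieval model g (assumed input).
    info = retrieve(question) if retrieve is not None else ""
    # Step 3: analysis generation from the question, D, and optionally I.
    prompt = f"Question: {question}\nData:\n{data}\n"
    if info:
        prompt += f"Background information:\n{info}\n"
    prompt += "Give the analysis and insights in 5 bullet points."
    analysis = llm(prompt)
    return {"code": code, "data": data, "analysis": analysis}
```

Passing the model, the executor and the retriever in as plain callables keeps the sketch independent of any particular API client and makes the data boundary visible: the raw data only ever enters through `execute`.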




Since there is no exactly matched dataset, we choose the most relevant one, named the NvBench dataset. We randomly choose 100 questions from different domains with different chart types and different difficulty levels to conduct our main experiments. The chart types cover the bar chart, the stacked bar chart, the line chart, the scatter chart, the grouping scatter chart and the pie chart. The difficulty levels include: easy, medium, hard and extra hard. The domains include: sports, artists, transportation, apartment rentals, colleges, etc. On top of the existing NvBench dataset, we additionally use our framework to write insights drawn from data in 5 bullet points for each instance and evaluate the quality using our self-designed evaluation metrics.



To comprehensively investigate the performance, we carefully design several human evaluation metrics to evaluate the generated figures and analysis separately for each test instance.


Figure Evaluation: We define 3 evaluation metrics for figures:

information correctness: is the data and information shown in the figure correct?

chart type correctness: does the chart type match the requirement in the question?

aesthetics: is the figure aesthetic and clear without any format errors?

The information correctness and chart type correctness are scored from 0 to 1, while the aesthetics is on a scale of 0 to 3.






Analysis Evaluation: For each bullet point generated in the analysis and insight, we define 4 evaluation metrics as below:

correctness: does the analysis contain wrong data or information?

alignment: does the analysis align with the question?

complexity: how complex and in-depth is the analysis?

fluency: is the generated analysis fluent, grammatically sound and without unnecessary repetitions?

We grade correctness and alignment on a scale of 0 to 1, and complexity and fluency on a scale of 0 to 3.

To conduct human evaluation, 6 professional data annotators are hired from a data annotation company to annotate each figure and analysis bullet point following the evaluation metrics detailed above. The annotators are fully compensated for their work. Each data point is independently labeled by two different annotators.
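Since every item receives two independent labels, one natural way to summarize the ratings is a per-metric mean for each evaluator group plus the average of the two group means. The sketch below assumes that aggregation; the exact procedure the authors used is not spelled out here, and the function name `summarize_scores` and the input layout (one dictionary of metric scores per sample) are hypothetical.

```python
from statistics import mean


def summarize_scores(group1, group2):
    """Per-metric mean for each evaluator group and their average.

    Each argument is a list with one dict per sample, mapping a metric
    name (e.g. "correctness") to the score given by that group.
    """
    summary = {}
    for metric in group1[0]:
        g1 = mean(sample[metric] for sample in group1)
        g2 = mean(sample[metric] for sample in group2)
        summary[metric] = {
            "group1": g1,
            "group2": g2,
            "average": (g1 + g2) / 2,
        }
    return summary


if __name__ == "__main__":
    # Made-up ratings for two samples, purely to show the input shape.
    ratings_a = [{"correctness": 1, "complexity": 2},
                 {"correctness": 0, "complexity": 3}]
    ratings_b = [{"correctness": 1, "complexity": 2},
                 {"correctness": 1, "complexity": 2}]
    print(summarize_scores(ratings_a, ratings_b))
```

Keeping the two group means alongside the average makes disagreements between evaluator groups visible instead of hiding them in a single number.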



5.3、Main Results

GPT-4 Performance: Table 3 shows the performance of GPT-4 as a data analyst on 200 samples. We show the results of each individual evaluator group and the average scores between these two groups. For chart type correctness, both evaluator groups give almost full scores. This indicates that for simple and clear instructions such as “draw a bar chart” or “show a pie chart”, GPT-4 can easily understand the meaning and has background knowledge about what the chart type means, so it can plot the chart in the correct type accordingly. In terms of aesthetics, it gets 2.73 out of 3 on average, which shows that most of the generated figures are clear to the audience and free of format errors. However, for the information correctness of the plotted charts, the scores are less satisfactory. We manually check those charts and find that most of them are roughly correct despite some small errors. Our evaluation criteria are very strict: as long as any data point or any x-axis or y-axis label is wrong, the score is deducted. Nevertheless, there is room for further improvement.


For analysis evaluation, both alignment and fluency get full marks on average. This again verifies that generating fluent and grammatically correct sentences is definitely not a problem for GPT-4. We notice that the average correctness score for analysis is much higher than the information correctness score for figures. This is interesting because the analysis can be correct even when the generated figure is wrong. It also supports our earlier explanation for the information correctness scores of figures. As mentioned, since the generated figures are mostly consistent with the gold figures, some of the bullet points can be generated correctly. Only a few bullet points related to the erroneous parts of the figures are considered wrong. In terms of complexity, an average of 2.16 out of 3 is reasonable and satisfying.


Comparison between Human Data Analysts and GPT-4: To further answer our research question, we hire professional data analysts to do these tasks and compare them with GPT-4 comprehensively. We fully compensate them for their annotation. Table 4 shows the performance of several data analysts of different expert levels and from different backgrounds compared to GPT-4. Overall, GPT-4’s performance is comparable to that of human data analysts, while which side is superior varies across metrics and human data analysts.


The first block shows the 10-sample performance of a senior data analyst (i.e., Senior Data Analyst 1) who has more than 6 years of data analysis working experience in the finance industry. We can see from the table that GPT-4’s performance is comparable to the expert data analyst on most of the metrics. Though the correctness score of GPT-4 is lower than that of the human data analyst, its complexity score and alignment score are higher.

The second block shows another 8-sample performance comparison between GPT-4 and another senior data analyst (i.e., Senior Data Analyst 2) who has worked in the internet industry as a data analyst for over 5 years. Since the sample size is relatively smaller, the results show larger variance between human and AI data analysts. The human data analyst surpasses GPT-4 on information correctness and aesthetics of figures, and on correctness and complexity of insights, indicating that GPT-4 still has potential for improvement.

The third block compares another random 9-sample performance between GPT-4 and a junior data analyst who has less than 2 years of data analysis working experience in a consulting firm. GPT-4 not only performs better on the correctness of figures and analysis, but also tends to generate more complex analysis than the human data analyst.




Apart from the comparable performance between all data analysts and GPT-4, we can notice that the time spent by GPT-4 is much shorter than that of human data analysts. Table 5 shows the cost comparison from different sources. We obtain the median annual salary of data analysts in Singapore from and the average annual salary of data analysts in Singapore from Glassdoor. We assume there are around 21 working days per month and around 8 working hours per day, and calculate the cost per instance in USD based on the average time spent by data analysts from each level. For our annotation, we pay the data analysts based on the market rate accordingly. The cost of GPT-4 is approximately 0.71% of the cost of a junior data analyst and 0.45% of the cost of a senior data analyst.
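The salary-to-cost conversion described above is simple arithmetic, and a small sketch makes the assumptions explicit. The 21 working days per month and 8 hours per day come from the text; the function name, the 12-month year, and the example salary and time figures are illustrative assumptions, not numbers from the paper's Table 5.

```python
WORKING_DAYS_PER_MONTH = 21  # stated in the text
HOURS_PER_DAY = 8            # stated in the text
MONTHS_PER_YEAR = 12         # assumption: salary is spread over 12 months


def cost_per_instance(annual_salary_usd: float, minutes_per_instance: float) -> float:
    """Convert an annual salary into the cost of one analysis instance."""
    hours_per_year = MONTHS_PER_YEAR * WORKING_DAYS_PER_MONTH * HOURS_PER_DAY
    hourly_rate = annual_salary_usd / hours_per_year
    return hourly_rate * minutes_per_instance / 60


if __name__ == "__main__":
    # Hypothetical analyst: 60,480 USD per year, 30 minutes per instance.
    human_cost = cost_per_instance(60_480, 30)
    print(f"cost per instance: {human_cost:.2f} USD")
    # Hypothetical model cost per instance, to show the ratio calculation.
    model_cost = 0.05
    print(f"model cost as share of human cost: {model_cost / human_cost:.2%}")
```

With these made-up inputs the hourly rate works out to 30 USD, so a 30-minute instance costs 15 USD. The percentages quoted in the text are ratios of this kind, computed from the real salary and timing data.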


6、Case Study

In this section, we show a few cases done by GPT-4 and our data analysts.


In the first case, shown in Table 6, GPT-4 is able to generate Python code containing the correct SQL query to extract the required data, and to draw a proper and correct pie chart according to the given question. In terms of the analysis, GPT-4 is capable of understanding the data by conducting proper comparisons (e.g., “most successful”, “less successful”, “diverse range”). In addition, GPT-4 can provide some insights from the data, such as: “indicating their dominance in the competition”. These abilities of GPT-4, including context understanding, code generation and data storytelling, are also demonstrated in many other cases. Furthermore, in this case, GPT-4 can also make some reasonable guesses from the data and its background knowledge, such as: “potentially due to their design, performance, or other factors”.

The second case (Table 7) shows another question addressed by GPT-4. Again, GPT-4 is able to extract the correct data, draw the correct scatter plot and generate reasonable analysis. Although most of the bullet points are generated faithfully, if we read and check carefully, we can notice that the numbers for the average height and weight are wrong. Apart from the well-known hallucination issue, we suspect that GPT-4’s calculation ability is not strong, especially for complex calculations. We also notice this issue in several other cases. Although GPT-4 generates the analysis bullet points in a very confident tone, the calculation is sometimes inaccurate.




Table 8 shows an example done by Senior Analyst 2. We can notice that this expert human data analyst can also understand the requirement, write the code to draw the correct bar chart, and analyze the extracted data in bullet points. Apart from this, we can summarize three main differences from GPT-4. First, unlike GPT-4, the human data analyst can express the analysis with some personal thoughts and emotions. For example, the data analyst mentions “This is a bit surprising ...”. In real-life business, personal emotions are sometimes important. With the emotional phrases, the audience can easily understand whether the data is as expected or abnormal. Second, the human data analyst tends to apply some background knowledge. While GPT-4 usually focuses only on the extracted data itself, the human readily links the data with background knowledge. For example, as shown in Table 8, the data analyst mentions “... is commonly seen ...”, which is more natural in a data analyst’s actual job. Therefore, to mimic a human data analyst better, in our demo, we add an option of using the Google search API to extract real-time online information when generating data analysis. Third, when providing insights or suggestions, a human data analyst tends to be conservative. For instance, in the 5th bullet point, the human data analyst mentions “If there’s no data issue” before giving a suggestion. Unlike humans, GPT-4 will directly provide the suggestion in a confident tone without mentioning its assumptions.


7、Findings and Discussions

During our experiments, we notice several phenomena and reflect on them. In this section, we discuss our findings and hope researchers can address some of the issues in future work.

Generally speaking, GPT-4 can perform comparably to a data analyst in our preliminary experiments, while there are still several issues to be addressed before we can reach a conclusion that GPT-4 is a good data analyst. First, as illustrated in the case study section, GPT-4 still has hallucination problems, which is also mentioned in the GPT-4 technical report (OpenAI, 2023). Data analysis jobs not only require technical skills and analytics skills, but also require high accuracy to be guaranteed. Therefore, a professional data analyst always tries to avoid mistakes, including calculation mistakes and any type of hallucination problem. Second, before providing insightful suggestions, a professional data analyst is usually confident about all the assumptions. Instead of directly giving any suggestion or making any guess from the data, GPT-4 should be careful about all the assumptions and make the claims rigorous.



The questions we choose are randomly selected from the NvBench dataset. Although the questions indeed cover a lot of domains, databases, difficulty levels and chart types, they are still somewhat too specific according to human data analysts’ feedback. The questions usually contain information such as a specific correlation between two variables or a specific chart type. In a more practical setting, the original requirements are more general, which requires a data analyst to formulate such a specific question from the general business requirement, and to determine what kind of chart would represent the data better. Our next step is to collect more practical and general questions to further test the problem formulation ability of GPT-4.


The quantity of human evaluation and data analyst annotation is relatively small due to the budget limit. For human evaluation, we strictly select professional evaluators in order to obtain better ratings. They have to pass several rounds of our test annotations before starting the human evaluation. For the selection of data analysts, we are even more strict. We verify that they really have data analysis working experience, and make sure they have mastered the required technical skills before starting the data annotation. However, since hiring a human data analyst (especially a senior or expert one) is very expensive, we can only ask them to do a few samples.



The potential for large language models (LLMs) like GPT-4 to replace human data analysts has sparked a controversial discussion. However, there is no definitive conclusion on this topic yet. This study aims to answer the research question of whether GPT-4 can perform as a good data analyst by conducting several preliminary experiments. We design a framework to prompt GPT-4 to perform end-to-end data analysis with databases from various domains and compare its performance with several professional human data analysts using carefully designed task-specific evaluation metrics. The results and analysis show that GPT-4 can achieve comparable performance to humans, but further studies are needed before concluding that GPT-4 can replace data analysts.



We would like to thank our data annotators and data evaluators for their hard work. Especially, we would like to thank Mingjie Lyu for the fruitful discussion and feedback.


