Developers Rely Too Much on Generative AI: Over Half of Software Engineering Answers from ChatGPT Found to be Incorrect, Purdue Study Finds
Generative AI tools have become increasingly popular in the software development and programming communities. These tools promise to speed up developers' work by generating code and answering technical questions on demand. However, a recent study conducted by Purdue University has shed light on concerning findings about the accuracy and reliability of one such tool, ChatGPT.
The study examined 517 software engineering questions posted on Stack Overflow and analyzed the answers ChatGPT generated for them. The researchers evaluated the correctness, consistency, comprehensiveness, and conciseness of the generated answers. The results were far from satisfactory: 52% of the programming-related answers were found to be inaccurate, and the majority (77%) were deemed verbose, lacking the conciseness developers typically prefer.
One of the key issues highlighted in the study was how users interpret ChatGPT’s answers and how readily they accept them as legitimate. Despite the inaccuracies, participants still preferred ChatGPT’s answers 39.34% of the time because of their comprehensiveness and well-articulated language style. This willingness to rely on the tool’s answers without verifying their correctness raises concerns about the potential impact on software development.
The study also revealed that users often struggled to identify errors in ChatGPT’s answers, especially when the mistakes were not readily apparent. Even when the errors were glaring, two of the twelve participants still marked the answers as correct and said they preferred them. This perceived legitimacy of ChatGPT’s output is itself a cause for concern.
To its credit, ChatGPT does provide a generic warning that the information it produces may be inaccurate, but the study suggests that this warning is insufficient. The researchers recommend complementing the answers with a disclaimer that clearly communicates the level of incorrectness and uncertainty associated with them. This additional information would provide users with a better understanding of the reliability of the tool’s responses.
The adoption of generative AI tools in software development has been on the rise, with GitHub Copilot being a notable example. Developers see these tools as valuable assistants in their daily work. However, the Purdue University study highlights the need for developers to exercise caution and not rely blindly on generative AI tools without verifying the accuracy of their outputs.
It is crucial for the creators of such tools to prioritize correctness and find effective ways to communicate the degree of speculation and uncertainty in the answers generated by models like ChatGPT. Without that transparency, users may unknowingly incorporate inaccurate code or solutions into their projects, leading to problems down the line.
The study’s findings serve as a reminder that while generative AI tools can be immensely helpful, they should not be seen as infallible sources of information. Developers must maintain a critical eye, verify answers independently, and strive for a balance between leveraging the capabilities of these tools and maintaining a high standard of coding accuracy.
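What independent verification looks like will vary by team, but even a small test written against a generated snippet can catch the kinds of subtle errors the study describes. The sketch below is purely illustrative; the helper function and its test are hypothetical stand-ins for code pasted from a ChatGPT answer, not examples drawn from the study itself.

```python
# Hypothetical scenario: chunk_list() was pasted from a ChatGPT answer.
# Before adopting it, the developer writes a few checks of their own,
# including edge cases the answer may not have considered.

def chunk_list(items, size):
    """AI-suggested helper: split a list into chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def test_chunk_list():
    assert chunk_list([1, 2, 3, 4, 5], 2) == [[1, 2], [3, 4], [5]]
    assert chunk_list([], 3) == []            # edge case: empty input
    assert chunk_list([1, 2], 5) == [[1, 2]]  # edge case: size larger than the list


if __name__ == "__main__":
    test_chunk_list()
    print("Generated snippet passed the checks written before adoption.")
```

A test this small will not catch every flaw, but it shifts the developer from trusting the answer to trusting what the tests confirm, which is the posture the study's findings recommend.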
Overall, the Purdue University study raises important questions about the reliance on generative AI tools in software engineering. As the development community continues to explore and integrate these tools into their workflows, it is crucial to address the concerns surrounding accuracy, consistency, and communication in order to maximize their benefits and mitigate potential risks.