Testing AI, Students, 3

In a recent edition of Science, an article titled “We Need a Weizenbaum Test for AI” proposes that, rather than relying on the Turing test, which has by now outlived whatever usefulness it may have had, we should develop a test for AI tools and programs that evaluates the “public value of AI technologies…according to their real-world implications.”  Specifically, it proposes a six-question test:

  1. Who will benefit?
  2. Who will bear the costs?
  3. What will the technology mean for future generations?
  4. What will be the implications not just for economies and international security, but also for our sense of what it means to be human?
  5. Is the technology reversible?
  6. What limits should be imposed on its application?

As considerations, these are worthy of contemplation, and not only in the context of AI; however, I take issue with referring to them as a “test.”  The Merriam-Webster dictionary defines a test as “a critical examination, observation, or evaluation…the procedure of submitting a statement to such conditions or operations as will lead to its acceptance or rejection…a basis for evaluation.”  These six questions are far too open-ended, and the possible answers too varied and ambiguous, to submit a technology to any conditions or operations that would lead to its definitive acceptance or rejection.

Consider the role of testing in the context of academics or training.  Like most people, I did not enjoy tests as a student, whether in school or in various trainings, and I sought to probe whether they provide an irreplaceable benefit.  Surely, I considered, projects, practical applications, or something similar could take the place of strict tests, which introduce their own variables into the system of evaluation.  Perhaps a holistic assessment would be more beneficial and more reflective of real abilities.  It’s a seductive idea, but one that runs up against a severe practical constraint: objectivity.

Tests can provide something that no other evaluation mechanism can manage: objectivity.  They target specific properties under evaluation and rate their fulfillment against an objective standard.  This is not always executed as well as it should be, and there are confounding factors that can complicate the assessment, but no replacement mechanism serves as well in maintaining both objectivity and specificity.  That is not only the basis of tests in education; it is also why we structure rigorous, replicable tests to evaluate hypotheses in accordance with the scientific method.
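To make the contrast concrete, here is a minimal, hypothetical sketch (my own illustration, not from the Science article) of what an objective test looks like in software: the property under evaluation, the reference answers, and the pass/fail criterion are all fixed in advance, so any evaluator who runs it gets the same result.

```python
# A hypothetical illustration of an objective test: fixed inputs, fixed
# reference answers, and a deterministic accept/reject outcome.

def normalize_temperature(celsius: float) -> float:
    """Convert a Celsius reading to Fahrenheit (the system under test)."""
    return celsius * 9 / 5 + 32

def test_normalize_temperature() -> None:
    # Reference answers are defined before the test runs; the outcome is
    # acceptance or rejection, not a matter of the grader's judgment.
    assert normalize_temperature(0) == 32
    assert normalize_temperature(100) == 212
    assert abs(normalize_temperature(37) - 98.6) < 1e-9

if __name__ == "__main__":
    test_normalize_temperature()
    print("accepted")  # any failed assertion raises an error: rejected
```

The six Weizenbaum questions, by contrast, fix neither the reference answers nor the criterion for acceptance.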

The proposed Weizenbaum test is so lacking in structure, objectivity, rigor, and specificity that it does not even define what the grading scale or reference answers would be.  If we cannot even agree upon desired answers to these questions, they cannot begin to be useful in evaluating anything.  Again, this does not mean that they are not useful and valuable considerations, but that role should not be confused with that of a test.  Fundamentally, any answers to these questions will be subjective, both in determining what the desired answer might be and in evaluating the answer given for a particular system.  A test that produces a different result when executed by different people on the same system is not useful as a test.  Furthermore, the questions are largely normative and presuppose underlying moral judgments which are not examined or controlled.

I do not disagree that better tests to analyze new AI technologies would be useful; however, they must meet standards of rigor, replicability, objectivity, and specificity that render them applicable across diverse systems and contexts, and they should not require normative moral judgments.  The “Weizenbaum test” as it stands is more useful as a framework for an AI policy discussion than as a means of evaluating AI technologies.  That enters the realm of philosophy, not science.
