
AI-First Professional Military Education: Validating the Grade Chain Before the Kill Chain

05.13.2026 at 06:00am

Abstract: This article argues that reluctance to trust Artificial Intelligence (AI) for grading within Professional Military Education (PME) represents a critical failure to validate the core hypothesis of its own AI-first strategy. It posits that PME has a professional and moral obligation to serve as the proving ground for human-machine teaming, using the low-stakes “grade chain” to test and refine the AI agents and leader skills required for the high-consequence “kill chain.” Through the case study of an AI-grading assistant, this article demonstrates how PME can rigorously test AI-augmented decision-making, train leaders to critically interrogate AI recommendations, and ultimately ensure the Department of War’s transition to an AI-enabled force is a strategy, not a gamble.


The Department of War is preparing to trust AI agents to inform and accelerate a commander’s decisions in the kill chain, where lives are on the line. However, within Professional Military Education institutions, some might hesitate to trust that same class of technology to help inform and accelerate a faculty member’s decisions in the “grade chain,” the process of grading a student’s essay. This is more than a contradiction; it is a critical failure to perform our most fundamental duty: rigorously test the tools and concepts we will ask our soldiers to use in combat. PME has a moral and professional obligation to become the proving ground for the Department of War’s AI-first future, starting with the immediate need to validate the grade chain before we accept the profound risks of the kill chain.

The Artificial Intelligence Strategy for the Department of War is clear: we are to become an “AI-first” fighting force. This strategy is built on the fundamental hypothesis that AI will enhance human decision-making in war, the most complex and consequential environment. The strategy specifically mandates that military institutions “[unleash] AI agent development and experimentation for AI-enabled battle management and decision support, from campaign planning to kill chain execution.” This strategic approach hinges on forging leaders who can critically command AI, not just operate it. If we cannot test these assumptions about AI in the controlled environment of our classrooms, then our AI-first vision ceases to be a strategy; it becomes a reckless gamble with our soldiers’ lives, one whose core hypothesis will be tested for the first time not in a lab, but on the battlefield.

This internal military dilemma is happening against the backdrop of an existential crisis in all of higher education. An article in The Business Times, titled, “AI Companies are Eating Higher Education,” accurately diagnoses a world where universities, increasingly dependent on commercial AI tools, risk surrendering the “ultimate high ground: human intelligence itself.” The integrity of intellectual assessment is in question.

This challenge, however, presents an opportunity for PME institutions to conduct experiments that help solve this AI dilemma in higher education while also ensuring human-machine teaming can leverage AI agents and safeguard human intelligence.

The Doctrine Proving Ground: From the Grade Chain to the Kill Chain

PME can serve as a testbed for our military’s core AI hypothesis: whether AI can inform and accelerate command decisions on the battlefield when lives are at stake. To do this, PME should conduct experiments, as mandated by the Department of War’s AI Strategy, that connect the comparatively low-stakes process of grading to the high-stakes reality of the kill chain. The logic is based on a parallel process model:

Cognitive Step | The Kill Chain | The Grade Chain
--- | --- | ---
1. Target ID | Identify a potential object of interest. | Identify a student paper for evaluation.
2. Criteria Application | Assess against: Rules of Engagement, Law of Armed Conflict, Commander’s Intent. | Assess against: Grading rubric, exam prompts, learning objectives.
3. AI Agent Recommendation | AI agent classifies the target and provides a recommendation. | AI agent classifies the paper and provides a recommendation.
4. Human Judgment | The AI agent informs the human commander, who retains final authority. | The AI agent informs the human faculty member, who retains final authority.

This is not to suggest an equivalence in complexity or consequence. The battlefield is infinitely more chaotic and ambiguous than a classroom; that is precisely the point. The “grade chain” is a simplified, controlled, and measurable test case. If we cannot gather empirical evidence that AI improves a human’s decision-making in this simpler scenario, how can we, in good faith, expect it to do so in the infinitely more complex kill chain?

Failure here is not a failure of AI technology but a priceless early warning about our strategic assumptions. Success, on the other hand, provides the first data-driven evidence that the Department of War’s hypothesis is sound. Our initial experiments with Athena, our own AI-grading assistant developed in-house by soldier-developers, provide a powerful case study of this approach.

The primary goal is not to test these AI tools, but rather to train the “human-in-the-loop.” Before we ask a commander to interrogate an AI’s targeting recommendation under fire, faculty members must undergo hundreds of repetitions interrogating an AI’s grading recommendation in the classroom. This is where the human-in-the-loop develops the essential skills and trust required for human-machine teaming: spotting bias, questioning assumptions, identifying flaws in agent instructions, and summoning the moral courage to override the machine by exercising the inherent authority vested in human commanders and faculty alike.

The Athena Case Study: AI Agents and Informed Decision Making

We discovered many benefits during the process of creating and employing Athena, an AI agent that assists faculty members with grading. This case study illustrates how building AI agents can enhance critical thinking, as even the earliest iterations of Athena revealed something problematic with our grading process: some of our own rubrics and writing prompts were flawed. Students at the Command and General Staff College have used quality assurance surveys to report confusion during exams after reading assessment rubrics and writing prompts. Athena validated this observation when it pointed out conflicting directions contained within certain assessments’ writing prompts and rubrics. After identifying these conflicts, Athena assisted faculty by providing recommendations to improve both the writing prompts and rubrics to avoid confusion in the future. These observations informed faculty decisions to improve the exam rubrics and prompts for the following academic year.

There is not yet any doctrine for building and employing agents that inform decision making, so the Army is learning in stride. We discovered that grading, much like targeting, applies strict criteria but remains subjective.

To quantify the AI’s baseline performance against this subjective variability, we conducted a blind test in which Athena graded fifty exams previously graded by faculty. The results provided a powerful validation of the AI’s consistency. Athena’s grade was within five percentage points of the faculty member’s grade for 84 percent of the exams. The mean absolute difference across all papers was just 2.6 percentage points, and the standard deviation of the differences was only 3.7 percentage points. This demonstrated that a properly instructed AI agent could perform the grading task with a high degree of consistency and alignment to human graders. This baseline success gave us the confidence to explore more advanced methods to address the nuances of subjective assessment.
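For readers who want to run a similar comparison on their own graded exams, the three agreement statistics reported above can be computed along the following lines. This is a minimal sketch, not Athena's actual evaluation code: the function name, the four sample scores, and the choice of a sample (rather than population) standard deviation over signed differences are all assumptions made for illustration.

```python
from statistics import mean, stdev


def agreement_metrics(faculty, ai, tolerance=5.0):
    """Compare paired faculty and AI scores (0-100 scale).

    Hypothetical helper: the name and return keys are illustrative only.
    """
    if len(faculty) != len(ai) or len(faculty) < 2:
        raise ValueError("need two equally sized lists with at least two scores")
    # Signed difference per exam: positive means the AI scored higher.
    diffs = [a - f for f, a in zip(faculty, ai)]
    abs_diffs = [abs(d) for d in diffs]
    return {
        # Fraction of exams where the two grades differ by <= tolerance points.
        "share_within_tolerance": sum(d <= tolerance for d in abs_diffs) / len(diffs),
        # Average size of the disagreement, ignoring direction.
        "mean_abs_diff": mean(abs_diffs),
        # Sample standard deviation of the signed differences (an assumption).
        "stdev_of_diffs": stdev(diffs),
    }


# Made-up scores for four exams; these are NOT the article's data.
faculty_scores = [80.0, 90.0, 70.0, 85.0]
ai_scores = [82.0, 84.0, 70.0, 85.0]
metrics = agreement_metrics(faculty_scores, ai_scores)
print(metrics)
```

Reporting the share within tolerance alongside the mean and spread matters because a low average difference can hide a handful of exams where the AI and the faculty member disagree sharply.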

During this process, we also accounted for the variance human faculty members exhibit while grading exams. For example, a faculty member grading an exam in the morning after breakfast and a cup of coffee may score it differently than that same faculty member would at night, or after grading exams all day. This human variability was the first problem we sought to address. To counter it, we had Athena act not as a single grader, but as a simulated group of 50 faculty members with an even dispersion of the following grading archetypes:

  • Rubric Hawk: Grades exactly as the official rubric criteria allow
  • Big Idea Grader: Focuses on overall argument quality; more forgiving of minor errors
  • By the Book Doctrinal Expert: Scrutinizes doctrinal consistency; penalizes misinterpretations
  • Strict Grader: Holds high standards; tends to score lower
  • Lenient Grader: Gives maximum benefit of the doubt; tends to score higher

Now, instead of one assessment, we provided the faculty member with a ‘council of peers’ to inform their decision. Each archetype graded differently but within the established criteria, allowing the faculty member to consider a full spectrum of perspectives when making their own determination.
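The "even dispersion" of archetypes described above can be sketched as a simple roster builder. The code below is an illustration under stated assumptions, not Athena's implementation: `build_council`, `convene`, and `stub_grader` are hypothetical names, and the stub returns fixed offsets in place of any real model call.

```python
from collections import Counter
from typing import Callable

ARCHETYPES = [
    "Rubric Hawk",
    "Big Idea Grader",
    "By the Book Doctrinal Expert",
    "Strict Grader",
    "Lenient Grader",
]


def build_council(size: int = 50) -> list[str]:
    """Return `size` grader personas spread evenly across the archetypes."""
    if size % len(ARCHETYPES) != 0:
        raise ValueError("size must be a multiple of the number of archetypes")
    return [ARCHETYPES[i % len(ARCHETYPES)] for i in range(size)]


def convene(
    essay: str,
    grade_fn: Callable[[str, str], float],
    size: int = 50,
) -> list[tuple[str, float]]:
    """Collect one score per persona; grade_fn(archetype, essay) is caller-supplied."""
    return [(persona, grade_fn(persona, essay)) for persona in build_council(size)]


def stub_grader(archetype: str, essay: str) -> float:
    """Hypothetical stand-in for a model call; the offsets are invented."""
    offsets = {"Strict Grader": -4.0, "Lenient Grader": 4.0}
    return 85.0 + offsets.get(archetype, 0.0)


council = convene("sample essay text", stub_grader)
counts = Counter(persona for persona, _ in council)
print(counts)
```

Passing the grading function in as a parameter keeps the roster logic separate from whichever model sits behind it, so the same council can be re-run against a different agent without changing the dispersion.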

This approach became even more robust when we explored the impact of different large language models. We created clones of Athena, running one on OpenAI’s GPT-4.1 and another on the now-deprecated Claude Sonnet 4.5. Suddenly, faculty received input from one hundred simulated graders, fifty from each version of Athena. This taught us a valuable lesson for the kill chain: the underlying model of an AI agent can create subtle but significant differences in output. True decision support may require leveraging multiple, diverse agents and models to provide the most accurate assessment.

A skeptic might argue that this “council of agents” is a deliberative academic exercise, ill-suited for the speed of combat. This misses the point. The simulation of 100 faculty opinions is generated nearly instantaneously. The data can be immediately aggregated to provide a simple output for rapid decision-making such as, “the mean assessment is an 85% B.” But the true, profound lesson for a future commander lies not in the average, but in the divergence. When 95 of the simulated agents assess a paper as a ‘B’, but a dissenting minority see it as an ‘A’ or ‘C’, it provides an invaluable insight: a quantification of uncertainty. It forces the human leader to ask why there is disagreement and to examine the outliers. This is the critical skill we must build. This process trains the commander to look beyond a single AI recommendation in the kill chain and to intuitively ask, “What is the confidence score? Are there dissenting assessments from other agents, models, or intelligence feeds? What is causing the variance?” It inoculates the leader against the dangerous allure of AI-generated certainty and constantly reinforces the most critical lesson of all: their own human judgment is the final, indispensable authority.
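One way to picture the step from "the average" to "the divergence" is a summary that reports both at once. The sketch below is illustrative only: the model labels, the scores, and the 10-point letter bands are assumptions (the text implies only that an 85 is a B), and `summarize` is a hypothetical helper rather than part of Athena.

```python
from collections import Counter
from statistics import mean, pstdev


def letter(score: float) -> str:
    """Map a score to a letter using assumed 10-point bands."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    return "F"


def summarize(scores_by_model: dict[str, list[float]]) -> dict:
    """Aggregate council scores and quantify how much the graders disagree."""
    all_scores = [s for scores in scores_by_model.values() for s in scores]
    letters = Counter(letter(s) for s in all_scores)
    majority, majority_count = letters.most_common(1)[0]
    return {
        "mean": mean(all_scores),
        "spread": pstdev(all_scores),
        "majority_letter": majority,
        # Share of simulated graders who disagree with the majority letter.
        "dissent_share": 1 - majority_count / len(all_scores),
        "letters": dict(letters),
        # Per-model means expose differences driven by the underlying model.
        "model_means": {m: mean(v) for m, v in scores_by_model.items()},
    }


# Invented scores for two hypothetical models, fifty simulated graders each.
scores = {
    "model_a": [85.0] * 48 + [91.0, 78.0],
    "model_b": [84.0] * 47 + [92.0, 90.0, 79.0],
}
summary = summarize(scores)
print(summary["majority_letter"], summary["letters"], summary["dissent_share"])
```

In this made-up example, 95 of the 100 simulated graders land on a B, so the dissent share is about five percent. The handful of outliers, and any gap between the per-model means, are what the faculty member would examine before making the final call.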

Leveraging AI to Expand Assessments and Ensure Faculty Relevancy

Validating AI-assisted grading has a powerful second-order effect: it unlocks innovative new forms of assessment that might solve the crisis facing higher education and the threat to human intelligence and cognition. For the first time ever, we can assess a student’s raw cognitive process via their “work” when they use AI. Some might read this as advocating that students use AI to produce papers which an AI agent then grades, but that reading misses a critical point. We can now assess human intelligence and cognition in novel ways unavailable to us before the advent of AI. We can ask students to use AI to produce products and then assess not only the product but also the student’s conversation with the AI during its creation. Through that conversation, we can observe cognitive processes and identify the levels of learning the student achieved along the way. Just as math exams require students to “show their work,” we can now do the same for the essay-writing process. Additionally, AI agents can act as written oral boards. One such agent, Socrates, is already being used at the U.S. Command and General Staff School to conduct structured Socratic examinations of a student’s understanding of course material. In the military, we do not lament when a new technology is implemented by our adversary; we adapt. PME is poised to introduce novel forms of assessment that mitigate the underlying issues AI introduces to higher education.

This use of AI might cause some faculty members to question their relevancy. Yet this approach does not decrease the relevancy of faculty; on the contrary, it magnifies it. The faculty members within PME are some of the brightest minds in the military. Their relevance comes not from the administrative task of grading but from their ability to innovate, write, and mentor the next generation of leaders. By leveraging AI as a powerful assistant, the Department of War can free its most valuable intellectual capital to focus on solving the complex problems of future warfare. If higher education seeks to defend the ultimate high ground of human intelligence, then we must use AI agents to emancipate human intelligence from the mundane tasks that distract our faculty from making full use of their own.

A Call to Action: Validate the Grade Chain

What is proposed here is controversial, no doubt. However, we have received a mandate from our leadership to become an AI-first force. If AI-assisted grading is detrimental to our force, then AI agents in the kill chain may have profoundly worse implications. My purpose is not to antagonize but to illuminate a critical connection we can no longer ignore, and to call for experiments that validate the use of AI in a way that enhances the human decision-making process. These experiments also build upon recent academic research findings that indicate AI-assisted grading enhances assessment accuracy, improves grading consistency, and provides valuable feedback for students and faculty.

Given these developments, the Department of War should formally expand the mission of PME institutions, chartering them not only as hubs to develop lethal soldiers equipped to fight our next war, but as official “Test and Validation” centers with a direct feedback loop to inform policy and doctrine. Additionally, we can begin testing the use of AI agents to grade the cognitive process students reveal in their conversations with AI as they produce products. We should normalize the use of AI and “show your work” as a means of assessment. It is our duty to rigorously test the tools we will give our soldiers; we should not wait to test new processes on the battlefield, nor should we allow the misuse of AI to degrade our soldiers’ intellectual and cognitive abilities. Now that the parallel between the grade chain and the kill chain is clear, it would be a profound failure of our professional stewardship to choose not to act and validate these systems in the controlled environment of our classrooms.

The views here are those of the author and do not represent the opinions or positions of the Command and General Staff College, the U.S. Army, the Department of Defense, or any part of the U.S. government.

About The Author

  • Anthony A. Joyce

    Lieutenant Colonel Anthony A. Joyce, US Army, is an FA59 Strategist and Instructor at the US Army Command and General Staff College. Additionally, he co-founded an AI tech startup in 2023 and is an award-winning tabletop game designer who has designed games for companies including Netflix, Meta, and Wizards of the Coast. As a strategist, Lieutenant Colonel Joyce has served at all levels of government to include the Office of the Secretary of Defense (Policy), Headquarters Department of the Army, and the US House of Representatives as an Army Liaison. He has a robust academic record, including three master’s degrees: Georgetown University (Policy Management), University of Louisville (Higher Education Administration), and Virginia Tech (Political Science).
