AI and the Law

What is AI?

Setting current manifestations aside for the moment, we could say, more generally, that artificial intelligence is some form of human-made artifact that we can use to reason like intelligent humans do naturally. The artifact is what makes the intelligence artificial, distinguishing it from the naturally occurring kind. We make this same distinction for the entire "built environment": houses, trains, tools, languages, governments, books — every artifact of culture that humans have introduced that did not exist naturally before we arrived on the scene.

From this more general point of view, the first artificial intelligence to arrive on the scene was formal intelligence — systems of mathematics and logic that allow humans to reason symbolically, much better than the average human can naturally. By following the rules of these artificial systems, a select set of practitioners are able to produce reasoned outcomes that exceed the native abilities of any human by orders of magnitude. And by implementing these systems as computer programs, almost any human can now achieve these unnatural results with a few keystrokes.

Large Language Models

What we have not been able to effectively formalize is natural language understanding. Humans had always outperformed machines at it, until Generative AI came along with its Large Language Models (LLMs); now the LLMs can outperform us. What they are natively very good at — what they were natively trained to do — is to recognize the similarities in meaning (if any) between two chunks of text, however large, even when the two samples have no actual words in common. They know every synonym, every hypernym, every hyponym, every holonym, every ambiguity, every ellipsis, every idiom, every metaphor, every alternative way of phrasing the same text without changing its meaning.
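To make that kind of meaning comparison concrete, here is a minimal sketch using embedding vectors and cosine similarity. The sentence-transformers library, the model name, and the sample sentences are illustrative assumptions, not a claim about how any particular LLM or product is built.

```python
# Minimal sketch: two texts with almost no words in common can still be
# recognized as near-synonymous when compared as vectors in a shared
# "meaning space". Library and model are illustrative choices.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

a = "The lessee must remit payment no later than the first of the month."
b = "Rent is due from the tenant by each month's first day."
c = "The committee adjourned without reaching a quorum."

emb_a, emb_b, emb_c = model.encode([a, b, c])

print(util.cos_sim(emb_a, emb_b))  # high: same meaning, different words
print(util.cos_sim(emb_a, emb_c))  # low: unrelated meaning
```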

This form of generative AI has made a lot of news recently, both good and bad, about its application to the legal domain. The good is its uncanny ability to summarize large collections of natural language documents, such as legal briefs, case findings, contracts, and even the laws themselves. The bad is its unreliability in reaching logical conclusions from these documents, and its tendency to hallucinate fictitious court cases in support of a position. The root problem here is that the LLMs were not designed to rigorously reason about text, but to statistically emulate the collective “reasoning” of every native speaker who has ever committed something to writing that is available on the Internet. That they sometimes approach humans’ ability to reason, both logically and mathematically, is a very low bar because most humans are not natively very good at either type of reasoning. We particularly struggle with large collections of disjunctions and negations (the kinds of things you find in laws).
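As a small, invented illustration of the negation problem: the negation of “the filing is late or incomplete” is not “the filing is not late or not incomplete,” although it is routinely mis-read that way. A brute-force truth table makes the difference mechanical.

```python
from itertools import product

# "not (late or incomplete)" is commonly mis-read as
# "(not late) or (not incomplete)". Enumerating the four cases shows
# it actually means "(not late) and (not incomplete)".
for late, incomplete in product([True, False], repeat=2):
    original  = not (late or incomplete)
    mis_read  = (not late) or (not incomplete)
    de_morgan = (not late) and (not incomplete)
    print(late, incomplete, original == de_morgan, original == mis_read)
# Third column: always True. Fourth column: False whenever exactly one
# of the two conditions holds, which is where the mis-reading bites.
```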

All Generate, No Test

We can’t expect this situation to fundamentally change. Nor do we need it to. One of the earliest models for artificial intelligence was the generate and test method. It was thought that this might emulate the process that successful human reasoners employ. Producing good reasoned outcomes requires both imagination and precision. Precision hampers imagination by pruning wild ideas too soon. Imagination interferes with precision by jumping to conclusions too quickly based on vague analogies. Successful theorists tend to let inspiration and imagination run freely at first to generate lots of hypotheses. Then they switch to pruning mode where they use precision to test out these ideas.
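The pattern is easy to state in code. Below is a generic sketch of generate and test; the generator and tester are hypothetical placeholders, not components of any particular system.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")

def generate_and_test(
    generate: Callable[[], Iterable[T]],  # imaginative: proposes candidates freely
    test: Callable[[T], bool],            # precise: accepts or rejects each candidate
) -> Optional[T]:
    """Classic generate-and-test: return the first candidate that survives the test."""
    for candidate in generate():
        if test(candidate):
            return candidate
    return None

# In this essay's terms, generate() plays the role of imagination
# (e.g. proposing candidate arguments or formulas) and test() plays
# the role of precision (e.g. a deterministic logical verifier).
```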

The problem with generative LLMs, from a legal reasoning point of view, is that they are all generate and no test. They are natively trained to predict the next word that is most likely to follow an initial prompt text — based on the statistical distribution of all words in all texts. They then add this word to the end of the prompt and predict the next word. So the growing completion rests progressively more on the LLM’s own imagination. There is nothing to test this against. That is why the same non-deterministic process can produce both well-reasoned legal arguments and reasonable-sounding court cases that are entirely fictitious. When you consider how the generative process works, this result is not surprising for either the well-reasoned arguments or the hallucinated cases. The chatbot doesn't know the difference.
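A schematic sketch of that loop shows the point: sample a likely next token, append it, and repeat, with nothing outside the model ever consulted. The toy distribution below is an invented stand-in for the model, not a real one.

```python
import random

def next_token_distribution(text: str) -> dict[str, float]:
    """Toy stand-in for the LLM: in reality this is a neural network
    returning probabilities over a very large vocabulary."""
    return {" plaintiff": 0.4, " court": 0.35, " hereby": 0.25}

def complete(prompt: str, max_tokens: int = 200) -> str:
    text = prompt
    for _ in range(max_tokens):
        dist = next_token_distribution(text)
        # Sample the next token (non-deterministic) and append it.
        token = random.choices(list(dist), weights=list(dist.values()))[0]
        text += token
        # Note what is missing: no step ever checks the growing text
        # against anything other than the model's own predictions.
    return text
```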

When a human constructs a legal brief, he or she is operating in two different information spaces: 1) the space of informal, analogical argumentation — how premises (approximately) support conclusions, and 2) the real world, where one finds supporting case law by looking up actual cases. The human is acutely aware that the cases must be real to qualify. The AI is operating in a single (much larger) information space — the length and breadth of virtually all recorded human language communication. Given the task of completing a good legal brief for a position, its analogical argumentation emulates the best of human legal reasoning (it has seen it all). But the best briefs also cite the best case law, so it emulates the citing of these as well, based on fragments of actual cases it has been trained on. The most likely words to come next as it plows through its text generation are cases that would support the brief. It has no notion of actual cases.

Now, compare this with the limited reach of logic in deciding the application of rules via LogicMaps. The logical reasoning is in the relative ordering of the paths. But at each node, the decision defaults to non-logical natural language understanding. You could say that a shortcoming of the maps is that they are all test and no imagination. And indeed they are. Formal systems are essentially testers. We rely on them because they don't imagine anything. They are deterministic verifiers of conjectures. You provide the conjecture, and they tell you whether it follows from the rules. There is no practical need to use generative AI for logical or mathematical reasoning, or to hope that it will someday be able to do so reliably (except out of intellectual curiosity). We already have that problem solved by deterministic formal systems and calculators whose results we can rely on unconditionally. Where generative AI can help the legal profession is in generate/test partnerships with either formal systems or actual humans playing the test role.
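“You provide the conjecture, and they tell you whether it follows from the rules” can be written in a few lines. The sketch below uses sympy’s propositional tools purely for illustration; the rule and symbols are invented, and this is not a description of the LogicMap engine itself.

```python
from sympy import symbols
from sympy.logic.boolalg import And, Implies, Not
from sympy.logic.inference import satisfiable

resident, over_65, eligible = symbols("resident over_65 eligible")

# Invented rule: anyone who is a resident and over 65 is eligible.
rules = Implies(And(resident, over_65), eligible)
facts = And(resident, over_65)

def follows(premises, conjecture) -> bool:
    """Deterministic test: the conjecture follows iff
    premises-and-not-conjecture is unsatisfiable."""
    return not satisfiable(And(premises, Not(conjecture)))

print(follows(And(rules, facts), eligible))  # True
print(follows(rules, eligible))              # False: the facts are needed too
```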

The LogicMap Partnership

Automated LogicMaps are one such result of a partnership between generative and formal AI. The formal AI part is both precise and deterministic, once the rules are translated into logic. But without the generative part, we would need a human logician sitting next to every legal professional to do the initial translation. Not very practical. The generative part is, by contrast, both approximate and non-deterministic, so we want to minimize its discretion in doing the logical reasoning.
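In outline, the partnership is a two-stage pipeline: a generative stage that performs the one-time translation, and a formal stage that does all subsequent reasoning deterministically. The function names below are hypothetical placeholders, sketched only to show the division of labor, not the actual implementation.

```python
def translate_rule_to_logic(rule_text: str) -> str:
    """Generative stage (approximate, non-deterministic): ask an LLM to
    translate the legal text into a propositional-logic expression.
    Hypothetical placeholder; the real prompts and model are not shown."""
    raise NotImplementedError

def compile_logicmap(logic_expr: str):
    """Formal stage (precise, deterministic): compile the logic into a
    decision structure, e.g. a binary decision diagram, whose path
    ordering drives the resulting LogicMap."""
    raise NotImplementedError

def build_logicmap(rule_text: str):
    logic = translate_rule_to_logic(rule_text)  # minimize the LLM's discretion...
    return compile_logicmap(logic)              # ...and let the formal side do the reasoning
```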

Fortunately, when we generate LogicMaps the AI task begins with a concrete legal text (no opportunity for hallucination), and the goal of the task is translation, not reasoning. This plays to the AI’s strength. As soon as it encodes the text as a large vector in the very large similarity space of natural language meanings in general, it begins generating a decoding of that vector into the target language. Approximation is constrained in translation by the need to express “the same meaning” in the target language. So summarization is (generally) suppressed. Precision is enforced by the learned, concrete grammar of the target language it was trained on. One challenge in using symbolic logic as the target language is that it is what’s known as a low-resource language for the AI. There aren’t that many symbolic logic texts in its training set. Mitigating this is the fact that logic is a very simple language with a very simple grammar. And just as in natural languages, many different symbolic expressions mean the same thing logically. So we can (generally) tolerate the non-determinism. Three separate requests to translate the same legal text will often yield three different concrete translations, but as with natural language translations, they all denote the same thing.
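The claim that differently worded translations “all denote the same thing” is itself mechanically checkable. A sketch, again using sympy for illustration and an invented rule, where three superficially different formulas are verified to be logically equivalent:

```python
from sympy import symbols
from sympy.logic.boolalg import And, Or, Not, Implies, Equivalent
from sympy.logic.inference import satisfiable

notice, cure, terminate = symbols("notice cure terminate")

# Three translations of an invented rule: "the landlord may terminate
# only if notice was given and the breach was not cured."
t1 = Implies(terminate, And(notice, Not(cure)))
t2 = Implies(Not(And(notice, Not(cure))), Not(terminate))  # contrapositive
t3 = Or(Not(terminate), And(notice, Not(cure)))            # expanded material conditional

def same_meaning(p, q) -> bool:
    # Equivalent iff their non-equivalence is unsatisfiable.
    return not satisfiable(Not(Equivalent(p, q)))

print(same_meaning(t1, t2), same_meaning(t1, t3))  # True True
```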

The Progress of Generative AI

We began this project in December of 2023. At the time, we found that Anthropic’s Claude 2 was the best of the LLMs at translating natural language to logic. It often got it right, which was encouraging, but often enough it got it wrong. Worse still, the correct translations weren’t stable because of the non-deterministic nature of all generative LLMs. If we submitted the same prompt that had achieved a correct translation a second time, Claude was just as likely to get the translation comically wrong. From this, we inferred that AI was not up to the task. So we started with a system to translate a human logician’s predicate logic specifications of natural language rules into binary decision diagrams. This would require human logicians in the loop, so it could, at best, be a professional service.

When Claude 3.5 Sonnet came out in June of 2024, we revisited the automated translation decision, only to discover that AI was now (mostly) up to the task. The results were now consistently correct, and of much better quality than Claude 2’s. This started us on our present course of using LLMs as the translating front end, taking the human logician out of the loop, and making it possible to deliver the solution as a totally automated software service.

A challenge that remained was dealing with the non-determinism. When you use a traditional, deterministic software service through an API, extensive testing tells you what it always gets right and what it always gets wrong. So you put checks and repairs in your backend to fix the mistakes. After a point, you can depend on what it is going to do. With the non-determinism of generative AI, you can never be sure exactly what you are going to get next. So you need repairs not just for the things you know it occasionally gets wrong, but for possible mistakes you have not yet seen.
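In practice that means wrapping each model call in a validate-and-retry loop: deterministic checks and known repairs on our side, a bounded number of retries, and a further semantic check (described in what follows) for the mistakes we cannot detect ourselves. The helpers below are hypothetical placeholders, not the actual backend.

```python
def call_model(prompt: str) -> str:
    """Hypothetical LLM call; non-deterministic by nature."""
    raise NotImplementedError

def is_well_formed(logic: str) -> bool:
    """Deterministic syntax and shape checks we can run ourselves."""
    raise NotImplementedError

def repair_known_issues(logic: str) -> str:
    """Fixes for the mistakes we have already catalogued."""
    raise NotImplementedError

def translate_with_checks(prompt: str, max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        candidate = repair_known_issues(call_model(prompt))
        if is_well_formed(candidate):
            return candidate  # still subject to the semantic round-trip check below
    raise RuntimeError("no acceptable translation after retries")
```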

Our extensive testing of 3.5 Sonnet revealed that Claude makes occasional mistakes in the translation, as one would expect from generative AI. The good news was that when presented with these mistakes, Claude knew exactly what to do to fix them. So its understanding of propositional logic was sound; it was just subject to the usual occasional vagaries of non-deterministic generation. The possible mistakes fell into two categories: the known and the unknown.

Because Claude is trained on legal discourse in natural language, it tends to make some of the same mistakes that humans do (including legal professionals). One that it shares with humans is a tendency to treat rules that begin with the conclusion as necessary and sufficient, when they are often only sufficient. Because of this, we would always challenge Claude in the next message turn to verify any choice of a bi-conditional against the natural language phrase it corresponds to. This specific attention to the operator and the text phrase, after the initial translation, was sufficient to get Claude to weaken the bi-conditional to a one-way conditional where that was justified.
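The difference matters because a bi-conditional licenses reasoning in the reverse direction that the original rule never authorized. A small sympy sketch with an invented rule makes the gap explicit:

```python
from sympy import symbols
from sympy.logic.boolalg import Implies, Equivalent, Not, And
from sympy.logic.inference import satisfiable

qualifies, veteran = symbols("qualifies veteran")

# Invented rule: "An applicant qualifies if the applicant is a veteran."
sufficient_only = Implies(veteran, qualifies)     # the usually correct reading
over_strong     = Equivalent(veteran, qualifies)  # the common over-translation

def follows(premises, conclusion) -> bool:
    return not satisfiable(And(premises, Not(conclusion)))

# The over-strong reading lets you conclude "not a veteran, therefore
# does not qualify", which the rule as written never said.
print(follows(And(over_strong, Not(veteran)), Not(qualifies)))      # True
print(follows(And(sufficient_only, Not(veteran)), Not(qualifies)))  # False
```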

For random mistakes we have not yet seen, without a human present, we have no way to detect them in our own back-end if the translation is well-formed. We have to depend on Claude to tell us what needs fixing (and fix it). So every translation that passed all of our specific tests was sent back to Claude for one more round of general sanity checking. We first logically transformed the expressions it had given us, using various normal-form transformations that re-express the logic in a different form with the same deductive consequences, then translated these expressions back into natural language. These alternate texts were then sent back to Claude to compare with the original text. This forced Claude to consider whether these consequences are consistent with the original text. So rather than checking its original logic, it was checking the consequences of its original logic in textual form. This focused its attention back on natural language meanings. When there was a mismatch in meanings, it had a specific target for fixing the logic.
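A sketch of that round trip, using sympy’s normal-form utilities for illustration: transform the logic into an equivalent normal form, verify on our own side that nothing changed deductively, and then hand a verbalized version back to the model for comparison against the original passage. The renderer is a hypothetical placeholder, and the formula is invented.

```python
from sympy import symbols
from sympy.logic.boolalg import Implies, And, Not, Equivalent, to_cnf
from sympy.logic.inference import satisfiable

notice, cured, terminate = symbols("notice cured terminate")

original = Implies(terminate, And(notice, Not(cured)))

# Re-express the logic in conjunctive normal form: a different shape
# with identical deductive consequences.
normal_form = to_cnf(original, simplify=True)

# Sanity check on our side: the transformation preserved meaning.
assert not satisfiable(Not(Equivalent(original, normal_form)))

def render_to_text(expr) -> str:
    """Hypothetical renderer that verbalizes a formula so the model can
    compare it, as text, against the original legal passage."""
    raise NotImplementedError

# The verbalized normal form then goes back to the model alongside the
# original passage, with the question: do these say the same thing?
```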

The good news here was that Claude’s understanding of propositional logic was good enough to recognize a bad translation. The bad news was that, being all generate and no test, it didn’t know that it was making mistakes until subsequent rounds of the conversation. The subsequent rounds were essentially asking it to generate a test of its first generation. If generative LLMs have reached the point where they can subsequently recognize their own mistakes — at least in logic translation — why couldn’t they be designed to do these subsequent test rounds during the first iteration, combining generation and testing?

When OpenAI released the preview versions of its o1 models in September of 2024, we got our answer. They can be. The o1 models are trained with reinforcement learning to generate an internal Chain of Thought as an answer is being generated. This makes them self-reflective. Reifying the chain of thought allows them to iteratively evaluate their own intermediate work product, sometimes from several different perspectives, before the final response is delivered. So they can now intermix generate and test in a single transaction. A cure for blindness in generative AI! This comes at a cost in latency. The o1-mini version can spend anywhere from 5 to 15 seconds “thinking” about a logic translation before actually producing it. But the improvement in accuracy is dramatic. It scores near 100% in our tests so far. The few mistakes encountered are the same ones human professionals routinely make, so they are not due to random variation but to incorrect training data. Since the OpenAI API allows its models to be further fine-tuned to specific domains, we can increase the training set for logic translation by adding our own rules-to-logic examples to over-represent areas where human understanding is weakest: over-generalizing conditionals to bi-conditionals, and translating negations of conditionals.
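If that fine-tuning route is taken, each added example would pair a rule text with its intended logic in the chat-style JSONL records that OpenAI’s fine-tuning endpoints accept. A hedged sketch of building one such record, with an invented rule and formula:

```python
import json

# One training example in the general chat fine-tuning format:
# a rule text paired with the intended (one-way) conditional.
record = {
    "messages": [
        {"role": "system",
         "content": "Translate the legal rule into propositional logic."},
        {"role": "user",
         "content": "An applicant qualifies if the applicant is a veteran."},
        # Deliberately a one-way conditional, not a bi-conditional:
        # exactly the distinction humans tend to get wrong.
        {"role": "assistant",
         "content": "veteran -> qualifies"},
    ]
}

with open("rules_to_logic.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```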