TL;DR
Claude, Anthropic's Large Language Model AI, can build tools to help teach AI,
BUT
The tools that Claude builds are prototypes -- not high-quality finished products:
They look like "programmer art" and
They plateau at a certain level of complexity
Join me on a journey to convince Claude that a functional interactive tool that actually looks nice really would be a good idea...
AI Whack-a-Mole
We last left our hero with the beginnings of a functional (if aesthetically lacking) tool to visualize the inner workings of a decision tree.
Let's improve both the look and the function of this tool by painting the different rectangular areas of these plots to match the color of the class that they represent. Color will not only brighten up this drab off-white plot, but will better highlight the relationship between the straight black and gray lines on the one hand and the colored dots on the other.
Here's what Claude came back with:
Nothing has changed whatsoever!
If you've been following along on this journey, Claude's verbal excitement paired with its lack of execution will come as no surprise. I will not bore you with the interminable back and forth that followed. I point out Claude's lack of progress. Claude promises to fix the issue. Claude comes back with a new artifact that is deficient in some way. Sometimes nothing has changed, sometimes nothing is displayed at all, sometimes the text is printed but none of the graphics show up. This cycle repeated itself no fewer than 14 times.
About halfway through this process I asked Claude to just start over.
Claude then came back with an artifact that was only mostly broken.
On the upside, the artifact does display something. But all the lines are missing. In fact, the artifact also printed an error message that hinted at the cause:
If the buildDecisionTree() function is missing, then obviously no decision boundary lines will be produced either. Our conversation continued with me pointing out this error to Claude, and after several more iterations the AI actually managed to produce a functioning artifact with a colored background!
The astute will notice, though, that there is still something wrong — there are no confidence ellipses! Back when I'd asked Claude to start over, I was so ecstatic that the new artifact printed anything at all that I didn't notice the missing confidence ellipses. (Did you?)
From that point on, though I tried every technique I could think of (restating, simplifying the instructions, restarting Claude in new sessions, kindly pointing out runtime errors, ...), none of the additional 17 Claude artifacts I created made use of both colored decision areas and confidence ellipses. (At one point, Claude did bring back the confidence ellipses, but lost the background color. *sigh*)
Claude was like a too-small blanket — no matter how I tugged at the corners, I could never get Claude to cover the entire task. And make no mistake: confidence ellipses and colored backgrounds are just the tip of the iceberg among the features I want in a full decision tree education tool!
Some small wins
Although I was unable to re-introduce confidence ellipses into Claude's artifact, I did successfully add a couple of other small improvements. The first improvement is relatively minor. In the above artifact the "Total Sample Size" slider represents the total number of datapoints. I asked Claude to replace the "Total Sample Size" slider with a slider that represents the number of datapoints per class. Claude readily complied.
The second feature I managed to add is more substantial.
Leaf-Size Regularization
Decision trees have a bad habit of memorizing their training data. In real-world machine learning applications you don't actually care how well your model does at predicting the class label for items in the training dataset — you already know the correct label for the training data items! Instead you care about how well your model does at predicting the correct class label for new example data points that were never seen during training.
If you leave a decision tree to its own devices it will continue to subdivide the input space until every area only contains examples from the same class — even if that means creating a rectangle that contains just a single datapoint. You can see this in action in the above (colorful) artifact: the skinny horizontal green rectangle on the left only contains a single (training) datapoint. When the decision tree glues this additional skinny rectangle onto the main green area (the wide vertical green band) the resulting full green area is very oddly shaped. Indeed, if a new datapoint were to fall into the narrow green strip, it is far from certain that the correct class for that datapoint would be green (the new data point's true color might just as well be blue or red!)
One simple way to remedy this problem is to constrain the decision tree to not sub-divide an area if the resulting smaller areas would contain fewer than some threshold number of training datapoints.
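To make that concrete, here is a minimal sketch of how such a constraint might look inside a recursive tree builder. It borrows the buildDecisionTree() name that appeared in Claude's artifact, but the implementation and the minLeafSize parameter are my own illustration, not Claude's actual code.

```typescript
// Illustrative sketch only -- not Claude's actual artifact code.

interface Point { x: number; y: number; label: string; }

interface TreeNode {
  label?: string;                 // set on leaf nodes
  axis?: "x" | "y";               // set on split nodes
  threshold?: number;
  left?: TreeNode;
  right?: TreeNode;
}

// Most common class label among a set of points.
function majorityLabel(points: Point[]): string {
  const counts = new Map<string, number>();
  for (const p of points) counts.set(p.label, (counts.get(p.label) ?? 0) + 1);
  return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
}

// Fraction of points whose label differs from the majority label.
function impurity(points: Point[]): number {
  if (points.length === 0) return 0;
  const majority = majorityLabel(points);
  return points.filter(p => p.label !== majority).length / points.length;
}

// minLeafSize (assumed >= 1) is the regularization knob: any split that
// would leave fewer than minLeafSize training points on either side is
// simply rejected.
function buildDecisionTree(points: Point[], minLeafSize: number): TreeNode {
  // Stop if the region is already pure.
  if (impurity(points) === 0) return { label: majorityLabel(points) };

  let best: { axis: "x" | "y"; threshold: number; score: number } | null = null;
  for (const axis of ["x", "y"] as const) {
    for (const p of points) {
      const threshold = p[axis];
      const left = points.filter(q => q[axis] <= threshold);
      const right = points.filter(q => q[axis] > threshold);
      if (left.length < minLeafSize || right.length < minLeafSize) continue;
      const score =
        (left.length * impurity(left) + right.length * impurity(right)) /
        points.length;
      if (best === null || score < best.score) best = { axis, threshold, score };
    }
  }

  // No admissible split remains: stop growing and return a leaf,
  // even though the region may still mix classes.
  if (best === null) return { label: majorityLabel(points) };

  const { axis, threshold } = best;
  return {
    axis,
    threshold,
    left: buildDecisionTree(points.filter(q => q[axis] <= threshold), minLeafSize),
    right: buildDecisionTree(points.filter(q => q[axis] > threshold), minLeafSize),
  };
}
```

In the artifact itself, minLeafSize would presumably be wired to its own slider, right alongside the per-class sample size control.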
This constraint, combined with the per-class sample size change, led to the best artifact I was able to produce. Behold Claude's creation! (Go ahead, read that aloud.)
Conclusion
So, Can AI Teach AI?
Well, maybe it can make a prototype teaching aid, but at the moment, Claude (at least) can't seem to make it out of the prototype phase.
While I don't definitively know why Claude hits a ceiling in the complexity of the code it can write, I do have some hypotheses. First, input length. Large language models have a finite context window; any tokens that fall outside of this context window are simply ignored. Modern LLMs employ fancy tricks to increase the size of their context window (for example, an LLM might reference every nearby token, then only every other token up to a distance twice that far away, then every fourth token that lies even farther away, then every eighth token, and so on), but they still have length limitations. The artifacts Claude produced in this experiment are not that long, just a few hundred lines of code, but still, the further away a token is in the sequence, the more difficult it becomes for an LLM to focus on that token.
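As a rough illustration of that kind of dilated attention pattern (my own sketch, not the scheme used by any particular model), here is how the set of attended positions might thin out with distance:

```typescript
// Rough sketch of a dilated attention pattern: attend to every token within
// `window` positions, every 2nd token out to 2*window, every 4th token out
// to 4*window, and so on. Purely illustrative -- not any real model's scheme.
function dilatedAttentionPositions(pos: number, window: number): number[] {
  const attended: number[] = [];
  let stride = 1;
  let reach = window;
  for (let offset = 1; offset <= pos; offset++) {
    if (offset > reach) {   // crossed into the next band: sample more sparsely
      stride *= 2;
      reach *= 2;
    }
    if (offset % stride === 0) attended.push(pos - offset);
  }
  return attended;
}

// Token 100 with a local window of 8 sees all of positions 92-99, every
// other position down to 84, every 4th down to 68, and so on: the farther
// back a token sits, the more likely it is to be skipped entirely.
console.log(dilatedAttentionPositions(100, 8));
```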
Second, Claude treats code the same way it treats language — as a sequence of tokens that starts at the beginning and progresses to the end in one unbroken stream. But complex software doesn't execute line by line from the beginning of the codebase to the end. Instead, computer code of any complexity contains functions: self-contained blocks of code that typically perform one specific task. Functions can be defined in code in one order, but called in a completely different order.
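A trivial, hypothetical example of that ordering mismatch: the function that runs first can sit at the top of the file while the helpers it calls are defined much further down.

```typescript
// Hypothetical illustration: main() appears first but calls helper
// functions that are only defined later in the file.
function main(): void {
  const fit = trainModel([1, 2, 3]);   // defined below
  reportResult(fit);                   // also defined below
}

function trainModel(data: number[]): number {
  // Stand-in "model": just the mean of the training data.
  return data.reduce((sum, x) => sum + x, 0) / data.length;
}

function reportResult(model: number): void {
  console.log(`trained model parameter: ${model}`);
}

main();
```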
In my experiments, Claude would frequently call a function but then entirely forget to include that function's definition in the artifact. Oversights like this likely arise precisely because, at the point where the code calls a function, the actual function definition is far away (and possibly even outside the length limits).
For the moment, this means I have abandoned my attempt to use Claude to write a tool to teach AI. And instead I am now writing my own interactive tool to teach machine learning — details on that tool will appear in a future post...