Specifically, we wanted to see if the scale of the model, i.e. the number of parameters, impacted detection performance. The original Binoculars paper identified that the number of tokens in the input affected detection performance, so we investigated whether the same applied to code. The ROC curves indicate that for Python, the choice of model has little influence on classification performance, whereas for JavaScript, smaller models like DeepSeek Coder 1.3B perform better at differentiating code types.

Classification performance was worse than random chance for input lengths of 25 tokens, which suggested that for Binoculars to reliably classify code as human- or AI-written, there may be a minimum input token length requirement. However, from 200 tokens onward, the scores for AI-written code are generally lower than those for human-written code, with increasing differentiation as token lengths grow, meaning that at these longer token lengths, Binoculars is better at classifying code as either human- or AI-written.
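To make the evaluation concrete, below is a minimal sketch of how an ROC curve like the ones discussed can be computed from per-sample Binoculars scores using scikit-learn. The score distributions here are synthetic stand-ins for illustration, not our actual results; a real pipeline would compute one Binoculars score per code sample from the model pair.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)

# Synthetic stand-in scores (assumption for illustration only):
# Binoculars tends to assign lower scores to AI-written code, so the
# AI samples are drawn from a slightly lower-mean distribution.
human_scores = rng.normal(loc=1.0, scale=0.15, size=500)
ai_scores = rng.normal(loc=0.85, scale=0.15, size=500)

scores = np.concatenate([human_scores, ai_scores])
labels = np.concatenate([np.zeros(500), np.ones(500)])  # 1 = AI-written

# roc_curve expects higher scores for the positive class; since
# AI-written code has *lower* Binoculars scores, negate them first.
fpr, tpr, thresholds = roc_curve(labels, -scores)
print(f"AUC: {auc(fpr, tpr):.3f}")
```

Repeating this per model and per input-token bucket (e.g. 25 vs. 200+ tokens) is one way to reproduce the comparison described above: an AUC near 0.5 corresponds to the worse-than-chance behaviour seen at short inputs, while growing separation between the two score distributions at longer inputs pushes the AUC higher.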