Meta cheated on an AI benchmark, and it's kind of funny. As Kylie Robison reported at The Verge, the suspicions began when Meta released two new AI models based on Llama 4, its large language model, over the weekend. The new models are Scout, a smaller model designed for quick queries, and Maverick, which is meant to be an efficient competitor to more popular models like OpenAI's GPT-4o (harbinger of our Miyazaki apocalypse).
In announcing them on its blog, Meta did what every AI company does now: it dropped a whole bunch of dense technical data to boast about how much smarter and more efficient its AI is than that of companies better known for AI, like Google, OpenAI, and Anthropic. These release notes are always a mix of deep technical detail and benchmarks that are extremely useful to researchers and the most obsessive fans, but not to the rest of us. Meta's announcement was no different.
But many AI obsessives immediately noticed one eye-popping benchmark result buried in the post: Maverick scored 1417 Elo on LMArena, an open-source tool where users vote on which model produces the better output. A higher rating is better, and Maverick's 1417 put it at No. 2 on the LMArena leaderboard, just above Gemini 2.5 Pro. The result surprised the whole AI ecosystem.
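For context on what a number like 1417 means: LMArena aggregates pairwise human votes into an Elo-style rating. As a rough sketch only (this is the classic Elo update rule, not LMArena's actual implementation, which fits ratings statistically over all votes), each head-to-head vote nudges the two models' scores like this:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both models' new ratings after one vote (k is a tunable step size)."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# A 1417-rated model beating a 1400-rated one gains a little; the loser drops.
a, b = elo_update(1417.0, 1400.0, a_won=True)
```

The upshot: a rating only rises by consistently winning human preference votes, which is exactly why submitting a chattier, more charming variant moves the number.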
Then people started digging, and they quickly noticed the fine print: the model submitted to LMArena was an experimental build, different from the version shipped to users. The company had tuned this version to be chattier than usual. Effectively, it charmed the benchmark.
LMArena, it seems, was not pleased at being charmed. "Meta's interpretation of our policy did not match what we expect from model providers," it said in a statement on X. "Meta should have made it clearer that 'Llama-4-Maverick-03-26-Experimental' was a customized model to optimize for human preference. As a result of that, we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn't occur in the future."
I love LMArena's optimism here, because gaming the benchmarks is a rite of passage in consumer technology, and I doubt this trend will stop. I've been covering consumer tech for more than a decade, I once ran one of the industry's bigger benchmarking operations, and I've seen plenty of phones and laptops try all kinds of juicing tricks: fudging for better battery-life scores, sneaky settings on laptops for better performance numbers.
Now AI models are talking their way to juiced numbers. And the reason I doubt this will be the last carefully engineered result is that benchmarks are how these companies desperately try to distinguish their large language models from one another. If every model can help you write a sloppy English paper five minutes before class, you need some other way to set yours apart. "My model uses less energy and performs the task 2.46 percent faster" may not be the sexiest boast, but it matters. It's still 2.46 percent faster than everyone else's.
As AI continues to mature into products facing real consumers, we're going to see more of these benchmarks. I expect we'll see other differentiators, too: user interfaces will keep changing, and goofy storefronts like ChatGPT's GPT store will become more common. These companies need to prove their models are the best, and they won't be able to do that on benchmarks alone. Not when a chatty bot can game the system so easily.