Elon Musk’s New Grok 4 Takes on ‘Humanity’s Last Exam’ as the AI Race Heats Up

Elon Musk has released xAI’s Grok 4, calling it the “world’s smartest AI” and claiming it can ace Ph.D.-level exams and outpace rivals such as Google’s Gemini and OpenAI’s o3 on tough benchmarks

By Deni Ellis Béchard edited by Dean Visser

Elon Musk released the latest artificial intelligence model from his company xAI on Wednesday evening. In an hour-long public reveal session, he called the model, Grok 4, “the smartest AI in the world” and claimed it was capable of getting perfect SAT scores and near-perfect GRE results in every subject, from the humanities to the sciences.

During the online launch, Musk and members of his team described testing Grok 4 on a metric called Humanity’s Last Exam (HLE), a 2,500-question benchmark designed to evaluate an AI’s academic knowledge and reasoning skill. Created by nearly 1,000 human experts across more than 100 disciplines and released in January 2025, the test spans topics from the classics to quantum chemistry and mixes text with images. Grok 4 reportedly scored 25.4 percent on its own. But given access to tools (such as external aids for code execution or Web searches), it hit 38.6 percent. That jumped to 44.4 percent with a version called Grok 4 Heavy, which uses multiple AI agents to solve problems. The two next best-performing AI models are Google’s Gemini-Pro (which achieved 26.9 percent with the tools) and OpenAI’s o3 model (which got 24.9 percent, also with the tools). The results from xAI’s internal testing have yet to appear on the leaderboard for HLE, however, and it remains unclear whether this is because xAI has yet to submit the results or because those results are pending review. Manifold, a social prediction market platform where users bet play money (called “Mana”) on future events in politics, technology and other subjects, predicted a 1 percent chance, as of Friday morning, that Grok 4 would debut on HLE’s leaderboard with a 45 percent score or better on the exam within a month of its release. (Meanwhile xAI has claimed a score of only 44.4.)

During the launch, the xAI team also ran live demonstrations showing Grok 4 crunching baseball odds, figuring out which xAI employee has the “weirdest” profile picture on X and generating a simulated visualization of a black hole. Musk suggested that the system may discover entirely new technologies by later this year, and possibly “new physics” by the end of next year. Games and movies are on the horizon, too, with Musk predicting that Grok 4 will be able to make playable titles and watchable films by 2026. Grok 4 also has new audio capabilities, including a voice that sang during the launch, and Musk said new image generation and coding tools are soon to be released. The regular version of Grok 4 costs $30 a month; SuperGrok Heavy (the deluxe package with multiple agents and research tools) runs at $300.

Artificial Analysis, an independent benchmarking platform that ranks AI models, now lists Grok 4 as highest on its Artificial Analysis Intelligence Index, slightly ahead of Gemini 2.5 Pro and OpenAI’s o4-mini-high. And Grok 4 appears as the top-performing publicly available model on the leaderboards for the Abstraction and Reasoning Corpus, or ARC-AGI-1, and its second edition, ARC-AGI-2. Both benchmarks measure progress toward “humanlike” general intelligence. Greg Kamradt, president of ARC Prize Foundation, a nonprofit organization that maintains the two leaderboards, says that when the xAI team contacted the foundation with Grok 4’s results, the organization then independently tested Grok 4 on a dataset to which the xAI team did not have access and confirmed the results. “Before we report performance for any lab, it’s not verified unless we verify it,” Kamradt says. “We approved the [testing results] slide that [the xAI team] showed in the launch.”

According to xAI, Grok 4 also outstrips other AI systems on a number of additional benchmarks that suggest its strength in STEM subjects. Alex Olteanu, a senior data science editor at AI education platform DataCamp, has tested it. “Grok has been strong on math and programming in my tests, and I’ve been impressed by the quality of its chain-of-thought reasoning, which shows an ingenious and logically sound approach to problem-solving,” Olteanu says. “Its context window, however, isn’t very competitive, and it may struggle with large code bases like those you encounter in production. It also fell short when I asked it to analyze a 170-page PDF, likely due to its limited context window and weak multimodal abilities.” (Multimodal abilities refer to a model’s capacity to analyze more than one kind of data at the same time, such as a combination of text, images, audio and video.)

On a more nuanced front, issues with Grok 4 have surfaced since its release. Several posters on X (owned by Musk himself) as well as tech-industry news outlets have reported that when Grok 4 was asked questions about the Israeli-Palestinian conflict, abortion and U.S. immigration law, it often searched for Musk’s stance on these issues by referencing his X posts and articles written about him. And the release of Grok 4 comes after several controversies with Grok 3, the previous model, which issued outputs that included antisemitic comments, praise for Hitler and claims of “white genocide.” xAI publicly acknowledged these incidents, attributing them to unauthorized manipulations and stating that the company was implementing corrective measures.

At one point during the launch, Musk commented on how making an AI smarter than humans is frightening, though he said he believes the ultimate result will be good, probably. “I somewhat reconciled myself to the fact that, even if it wasn’t going to be good, I’d at least like to be alive to see it happen,” he said.
